Solution Factory.

Idea to vetted architecture.

An autonomous solution architect built on LangGraph. Hand it a plain-English description of a business process you want to automate. A panel of expert AI personas interrogates the problem, a corpus-grounded architect designs the solution, a second panel attacks that design until no blocking objection survives, and the output is a build-ready spec. Two human gates. Everything between them runs on its own.

Honest status: this is a local dev tool, built and verified end to end on real runs, not a deployed product. The pipeline through the approved build spec is functional. Autonomous implementation against a real repo (v2) is not built. The baseline run completed across two threads with three bugs fixed mid-run, so one clean single-thread run has not yet been demonstrated. On novel briefs the revision loop frequently hits its cap and escalates to the human gate by design.

A LangGraph StateGraph of 14 nodes: Send fan-out for the parallel critic panel, interrupt() for the human pause points, a Postgres checkpointer for durable cross-process pause and resume, and a deterministic zero-blocker consensus rule. The architect that designs is structurally forbidden from grading its own work.

What the Solution Factory Is.

The Solution Factory turns a sentence about something you want to automate into a defensible architecture. You hand it a plain-English description of a business process. A panel of expert AI personas interrogates the problem the way a sharp consultant would, surfacing the requirements you did not think to name. You approve a refined proposal. A corpus-grounded architect then designs the solution, citing a curated library of agentic-AI patterns rather than inventing from thin air. A second panel attacks that design in parallel and in isolation, each persona hunting for the flaw that sinks it in month three. The architect revises until no blocking objection survives.

It is a ground-up rebuild on LangGraph and LangChain. The earlier Claude Code version is retired. The pipeline is a single LangGraph StateGraph: the critic panel fans out with Send, the human gates are durable interrupt() pauses backed by a Postgres checkpointer, and the whole revision loop is governed by a five-line router that does pure arithmetic over the findings. Consensus is not a feeling. It is a countable condition: zero remaining open BLOCKER findings. A panel that returns no findings is treated as a failure, not a win.

It is general-purpose, not tied to FORGE or any one stack. The knowledge corpus is a resource it cites, not a constraint it obeys. v1 stops at an approved build spec, a planning roadmap with per-phase verification criteria that an engineer can build from. I have driven the same brief through it three times, with different architect modes and panels, and watched it land on three different terminal paths. Every one of them ended safely at a human gate with an honest, reconciled finding count, never a false green.

Execution engine: LangGraph StateGraph, 14 nodes

Human gates: 2 mandatory, 4 interrupt() points

Consensus rule: Zero open BLOCKER findings

Revision loop: Capped at 3, stall-detect, escalate

Baseline grounding: 21 of 21 citations resolvable

Baseline cost: $0.71 real vs $2.97 all-metered

Verified runs: 3 end to end, all reconcile clean

Current state: Local dev tool, v1. v2 not built.

One Graph. Two Gates.

The pipeline is a LangGraph StateGraph. Every phase writes a markdown artifact first, then emits structured rows into a Postgres ledger that is a rebuildable projection of the markdown. v1 stops at an approved build spec.

Pipeline diagram: LangGraph, two human gates. Every phase writes a markdown artifact plus ledger rows; v1 stops at the build spec.

PANEL & INTAKE

A vague idea becomes a real problem statement.

The setup node reads your one sentence and chooses which experts belong in the room, a security reviewer, a performance reviewer, a systems architect, and so on, picked to fit the problem. It opens a ledger row to track the run. A discussion panel then reads the brief and generates the questions a good consultant would ask before quoting work: what volume, what defines success, what happens to failures, how fast it must run. It puts roughly a dozen of these to you and pauses at the first interrupt().

You answer in free text. From your answers the system writes a refined, sharpened statement of the actual problem and its constraints. Nothing expensive has happened yet. The Socratic front end is a deliberate brake on a model that would otherwise start generating immediately, because the most expensive mistake in any build is solving the wrong problem precisely.

GATE 1 (human interrupt)

You read and approve the refined proposal before any design work starts
You can approve, reject (the run ends), or send it back for another round of questions
This gate sits before the expensive design work, so a misunderstanding costs a few intake calls, not a full siege

ARCHITECT DRAFT

A corpus-grounded architect designs the solution.

Once you approve, a single architect agent goes to work. It queries the curated agentic-knowledge corpus, a read-only vault of field-harvested atoms, pulling the patterns, failure modes, and reference designs relevant to your problem. It drafts an end-to-end schematic: components, data flow, infrastructure, cost notes, alternatives, and explicit gap flags where it is guessing. Every claim it can ground, it cites as a wikilink to a real atom. Where the corpus is silent, it says so and tags a web fallback.

Grounding is by-domain retrieval over the markdown corpus, not a vector store. The default path prefetches a fixed top-N of atoms, then designs. An opt-in agentic mode runs the architect as a create_react_agent that queries the corpus itself through tools, iteratively, then designs. On the verified runs the agentic architect made 30 to 32 live corpus tool calls and every citation in the draft resolved to a real atom. Critically, this architect never grades its own work.
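The resolvability check on citations can be sketched as follows. This is a minimal illustration, assuming citations use the common `[[atom]]` wikilink form (optionally with `|alias` or `#heading`) and that the corpus can be reduced to a set of atom names; `check_citations` and the atom names in the usage example are hypothetical, not taken from the project.

```python
import re
from typing import Iterable, List, Tuple

# Matches [[atom]], [[atom|alias]] and [[atom#heading]]; captures the atom name.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def check_citations(draft: str, atom_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split the wikilinks in a draft into resolved and unresolved citations."""
    known = set(atom_names)
    cited = [match.group(1).strip() for match in WIKILINK.finditer(draft)]
    resolved = [name for name in cited if name in known]
    unresolved = [name for name in cited if name not in known]
    return resolved, unresolved


# Hypothetical draft and corpus, for illustration only.
draft = (
    "Use [[retry-with-backoff]] on the webhook and "
    "[[idempotent-consumer|idempotency]] downstream; see [[made-up-pattern]]."
)
resolved, unresolved = check_citations(draft, {"retry-with-backoff", "idempotent-consumer"})
```

Unresolved names would then be logged rather than silently kept, which is the behavior the text describes.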

ADVERSARIAL PANEL

The panel fans out and attacks.

The panel assembled at the start now fans out with a LangGraph Send: each persona reviews the schematic in parallel and in isolation, seeing only the design and its own lens, never the architect reasoning that produced it. Each returns findings classified BLOCKER (cannot ship as drawn) or WARNING (a risk worth recording). A security persona might flag an unauthenticated webhook as a BLOCKER; a performance persona might warn the classifier will miss the latency you specified. A deferred reduce node barriers on all critics before merging their findings.

The separation is the single most important quality mechanism. A design is assumed guilty until a hostile panel fails to convict it. Critics receive only the schematic and their lens as a self-contained payload, never the architect state, so the review cannot collude with the design. The rigid taxonomy is what makes the next phase countable: consensus is defined as zero remaining open blockers, which is only measurable because every finding carries a severity.
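The isolation rule can be illustrated without LangGraph. The sketch below is a plain-Python analogue of the fan-out, not the project's code: a thread pool stands in for Send, and `CriticPayload`, `Finding`, `build_payloads`, `fan_out`, and `stub_critic` are hypothetical names. The point it shows is that each critic receives a payload built from the schematic and its lens only, and that merging waits for every critic to finish.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CriticPayload:
    """Everything a critic is allowed to see: the design and its own lens."""
    schematic: str
    lens: str


@dataclass(frozen=True)
class Finding:
    persona: str
    severity: str  # "BLOCKER" or "WARNING"
    summary: str


def build_payloads(state: Dict[str, str], panel: List[str]) -> List[CriticPayload]:
    # Copy only the schematic out of the shared state; architect notes stay behind.
    return [CriticPayload(schematic=state["schematic"], lens=lens) for lens in panel]


def fan_out(
    payloads: List[CriticPayload],
    critic: Callable[[CriticPayload], List[Finding]],
) -> List[Finding]:
    # Run every critic in parallel and wait for all of them before merging.
    with ThreadPoolExecutor() as pool:
        per_critic = list(pool.map(critic, payloads))
    return [finding for findings in per_critic for finding in findings]


def stub_critic(payload: CriticPayload) -> List[Finding]:
    # Stand-in for an LLM call; a real critic would reason over the schematic.
    if payload.lens == "security" and "webhook" in payload.schematic:
        return [Finding("security", "BLOCKER", "unauthenticated webhook")]
    return [Finding(payload.lens, "WARNING", "no blocking issue found")]


state = {"schematic": "webhook -> classifier -> queue", "architect_notes": "private reasoning"}
payloads = build_payloads(state, ["security", "performance"])
findings = fan_out(payloads, stub_critic)
open_blockers = sum(1 for f in findings if f.severity == "BLOCKER")
```

Because every finding carries a severity, `open_blockers` is a plain count, which is what makes the consensus rule measurable.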

BOUNDED REVISION

Revise, re-attack, or escalate. Pure arithmetic.

After every critique round a router decides with pure Python. Zero open blockers means consensus: the design survived. If blockers remain but the count is still shrinking and the loop is under the iteration cap, the reviser addresses every open finding and the panel re-attacks the new draft. A stall (the blocker count stopped strictly decreasing), the cap, or a panel that returned nothing routes to escalation. Zero findings from an adversarial panel is treated as a failure, a parser or dispatch bug, never as success.

The loop runs at most three real revision rounds. If it cannot converge, it pauses at a third interrupt() and tells you why: the reason, how many blockers remain, which iteration it died on. You choose to accept the residual risk, force one more round, or abandon. This is by design. Hard problems are supposed to surface a human, not get rubber-stamped. On genuinely novel briefs the panel frequently does not converge organically, so the human gate is load-bearing, not decorative.

The five-line governance model

No findings this round: escalate (the panel said nothing, treat as a bug)
Zero open blockers: consensus (the design survived)
Iteration cap hit: escalate (backstop)
Blockers not strictly decreasing: escalate (stall)
Otherwise: revise (shrinking, under cap)
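The five rules above can be sketched as a pure function. This is a minimal sketch that assumes the rules are checked in the order listed; the name `route`, its parameters, and the `MAX_REVISIONS` constant are illustrative and not taken from the project's source.

```python
from typing import List

MAX_REVISIONS = 3  # the revision cap described in the text


def route(findings_this_round: int, blocker_history: List[int], iteration: int) -> str:
    """Return 'consensus', 'revise', or 'escalate'.

    blocker_history holds the open-BLOCKER count after each critique round,
    most recent last. iteration counts the revision rounds already completed.
    """
    if findings_this_round == 0:
        return "escalate"   # silent panel: treat as a parser or dispatch bug
    if blocker_history[-1] == 0:
        return "consensus"  # the design survived
    if iteration >= MAX_REVISIONS:
        return "escalate"   # backstop
    if len(blocker_history) > 1 and blocker_history[-1] >= blocker_history[-2]:
        return "escalate"   # stall: blockers not strictly decreasing
    return "revise"         # shrinking and under the cap
```

The order matters: the zero-findings check comes first so that an empty panel response can never be mistaken for zero open blockers.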

CONSENSUS & BUILD SPEC

Zero blockers becomes an executable plan.

When zero BLOCKERs remain, organically or by your accept-risk call, the system writes the consensus document: the surviving schematic, the full findings table with every objection and how it was resolved, and the open-blocker curve across rounds. A deterministic reconciliation guard then asserts that the blocker count in the consensus document equals the finding rows in the ledger, catching any silent finding loss between the markdown source of truth and its projection.
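A reconciliation guard of this kind can be sketched in a few lines. The sketch assumes the findings table is pipe-delimited with columns `id | severity | status` and that ledger rows are available as dictionaries with a `severity` key; both layouts, and the names `count_blockers_in_markdown` and `reconcile`, are assumptions for illustration.

```python
from typing import Dict, List


def count_blockers_in_markdown(markdown: str) -> int:
    """Count BLOCKER rows in a pipe-delimited findings table."""
    count = 0
    for line in markdown.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Assumed column layout: id | severity | status
        if len(cells) >= 3 and cells[1] == "BLOCKER":
            count += 1
    return count


def reconcile(markdown: str, ledger_rows: List[Dict[str, str]]) -> int:
    """Raise if the markdown and the ledger disagree on the blocker count."""
    md_count = count_blockers_in_markdown(markdown)
    db_count = sum(1 for row in ledger_rows if row["severity"] == "BLOCKER")
    if md_count != db_count:
        raise ValueError(
            f"finding loss: markdown has {md_count} blockers, ledger has {db_count}"
        )
    return md_count


consensus_md = """
| id | severity | status   |
|----|----------|----------|
| F1 | BLOCKER  | resolved |
| F2 | WARNING  | open     |
| F3 | BLOCKER  | resolved |
"""
ledger = [
    {"id": "F1", "severity": "BLOCKER"},
    {"id": "F2", "severity": "WARNING"},
    {"id": "F3", "severity": "BLOCKER"},
]
blockers = reconcile(consensus_md, ledger)
```

Dropping a BLOCKER row from `ledger` makes `reconcile` raise, which is the silent-loss case the guard exists to catch.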

Consensus is translated into a planning roadmap: per-phase plan files, each with its own verification criteria. On the verified runs the build-spec tasks carried decision-ID provenance back to the consensus design, and the verification criteria were concrete and executable, specific inputs with asserted outputs, not generic boilerplate. This is the thing an engineer, or a future autonomous implementer, can actually build from.

GATE 2 (human interrupt). This is where v1 ends.

You read and approve a defensible, build-ready specification
Autonomous implementation against a real repo (v2) is not built; the deliverable is the approved spec, full stop
The spec is the last cheap place to catch a wrong turn before real code gets written

LangGraph for the Spine.

A ground-up rebuild on LangGraph and LangChain. The earlier Claude Code version is retired. This is the basis for any LangGraph or LangChain claim: substantial, load-bearing usage, not a token import.

Execution engine

LangGraph StateGraph, 14 nodes: Send fan-out for the parallel critic panel, interrupt() for the four human pause points (intake, GATE 1, escalation, GATE 2)
A Postgres checkpointer for durable cross-process pause and resume; RetryPolicy on the LLM nodes to ride headless-CLI flakiness
Keyed and sentinel-reset state reducers so concurrent critic writes merge correctly and a finding can flip open to resolved in place
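A keyed, sentinel-reset reducer can be sketched as a plain function. In LangGraph a reducer is attached to a state field with `Annotated[type, reducer]`; the function itself is ordinary Python. The names `RESET` and `merge_findings` and the shape of a finding are assumptions for illustration, not the project's actual definitions.

```python
from typing import Dict, Union

Finding = Dict[str, str]
FindingMap = Dict[str, Finding]

RESET = object()  # hypothetical sentinel meaning "clear the accumulated findings"


def merge_findings(current: FindingMap, update: Union[FindingMap, object]) -> FindingMap:
    """Merge an update into the findings map, keyed by finding id."""
    if update is RESET:
        return {}
    merged = dict(current)   # never mutate the previous state in place
    merged.update(update)    # same id overwrites, so open can become resolved
    return merged


state: FindingMap = {}
# Two critics write separately; their keys do not collide, so both survive.
state = merge_findings(state, {"F1": {"severity": "BLOCKER", "status": "open"}})
state = merge_findings(state, {"F2": {"severity": "WARNING", "status": "open"}})
# A later revision resolves F1 by writing the same key again.
state = merge_findings(state, {"F1": {"severity": "BLOCKER", "status": "resolved"}})
```

Keying by id is what lets a status change replace the earlier row instead of appending a duplicate, which a plain list-append reducer would do.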

Models (LangChain)

langchain-anthropic ChatAnthropic for the architect and reviser
create_react_agent for the opt-in agentic architect that queries the corpus through tools
Pydantic structured output: RouteDecision for routing, DesignManifest for gap flags and the cost envelope
LangSmith tracing with on-wire redaction: prompts and completions become length-and-hash fingerprints before serialization, so briefs and corpus never leave the machine
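A length-and-hash fingerprint can be sketched with the standard library. The choice of SHA-256, the field names, and the `fingerprint` and `redact` helpers are assumptions for illustration. The LangSmith client accepts `hide_inputs` and `hide_outputs` callables, which is one place a function like `redact` could be attached, though the text does not say how the project wires it in.

```python
import hashlib
from typing import Any, Dict


def fingerprint(text: str) -> Dict[str, Any]:
    """Replace a string with its length and a SHA-256 digest."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"length": len(text), "sha256": digest}


def redact(value: Any) -> Any:
    """Fingerprint every string, recursing into dicts and lists."""
    if isinstance(value, str):
        return fingerprint(value)
    if isinstance(value, dict):
        return {key: redact(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged


redacted = redact({"prompt": "secret brief", "messages": ["hello"], "n": 3})
```

The fingerprint still lets two traces be compared for equality and rough size without exposing the brief or the corpus text.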

Corpus grounding

A curated agentic-engineering corpus in a read-only vault, queried by_domain over the markdown
No vector store and no RAG index: the semantic tool does not scale to the corpus, so the architect grounds by_domain alone
Every citation is checked for resolvability against the real corpus; unresolvable ones are logged, not silently kept

Hybrid transport (the cost decision)

The high-cardinality critic panel runs on headless claude -p, flat-rate Claude Max, zero marginal cost
The few high-stakes architect and reviser calls run on the metered native Anthropic API via a separate key
Billing isolation invariant: the metered key is passed explicitly to the client and never written to the global env, so the critic subprocesses stay on the subscription
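The billing isolation invariant can be sketched as two small helpers. `MeteredClient` is a hypothetical stand-in for an API client that takes its key as a constructor argument, and `critic_env` and `make_metered_client` are illustrative names. The sketch assumes the key that must stay out of the subprocess environment is `ANTHROPIC_API_KEY`.

```python
import os
from typing import Dict, Mapping, Optional

METERED_KEY_VAR = "ANTHROPIC_API_KEY"  # assumed name of the variable to keep out


def critic_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Environment for critic subprocesses, with any API key stripped."""
    source = os.environ if base is None else base
    return {name: value for name, value in source.items() if name != METERED_KEY_VAR}


class MeteredClient:
    """Hypothetical stand-in for an API client that accepts an explicit key."""

    def __init__(self, api_key: str) -> None:
        self._api_key = api_key


def make_metered_client(api_key: str) -> MeteredClient:
    # The key travels as an argument only; os.environ is never written.
    return MeteredClient(api_key=api_key)
```

A critic subprocess would then be launched with `env=critic_env()`, so even a key that leaked into the parent environment would not reach it.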

Persistence

Markdown artifacts written first as the source of truth (the proposal, schematic, critique, consensus, and the planning spec)
A 9-table Postgres ledger as a rebuildable projection of the markdown, dual-written at phase boundaries and never the only copy
A separate Postgres checkpointer store for opaque durable run state; the two stores are never conflated

Surfaces

A durable CLI driver (run, list, show, resume) that can start a run in one terminal and resume an interrupt from another, days later
LangGraph Studio for browser-driven runs over the same graph, with a normalize entry shim

The Design Principles.

A pipeline is just plumbing. What makes one trustworthy is the judgment behind it. These are the decisions that shaped the Solution Factory.

The architect cannot grade its own work.

The agent that designs is structurally forbidden from reviewing the design. Critics receive only the schematic and their own lens as a self-contained payload, in parallel and in isolation, never the architect state or reasoning. A design is assumed guilty until a hostile panel fails to convict it. Even on the agentic path the architect only reads the corpus; the tool-calling is grounding, not collusion.

Consensus is a count, not a feeling.

Consensus has exactly one definition: zero remaining open BLOCKER findings. Not a vibe that the panel feels good. A deterministic count, enforced by a five-line router and verified by a reconciliation guard that asserts the consensus blocker count equals the ledger rows. A panel that returns no findings is an alarm, not an all-clear, treated as a parser or dispatch bug until a human says otherwise.

Ground every claim, flag every gap.

The architect cites real atoms a practitioner shipped, written as wikilinks, or it openly tags what it cannot ground with a web fallback and a gap flag. Provenance is recorded so you can audit where a design decision came from. Grounding is by-domain retrieval over a curated markdown corpus, not a vector database. On the verified runs every citation in the draft resolved to a real atom.

Spend tokens where they change the outcome.

A run is dominated by panel volume, up to roughly a dozen cheap critics across up to three rounds. Those run on the flat-rate Claude Max subscription, so the high-cardinality cost is effectively fixed. The few calls that actually steer the design run on the metered native API, where the model-layer leverage is highest. On the baseline this was $0.71 real versus $2.97 if every call were metered, about 76 percent saved. It pairs directly with FORGE model routing as a documented through-line: spend the expensive dollar only where quality compounds.
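The saving quoted above follows directly from the two baseline figures. A small check, using only the numbers stated in the text:

```python
real_cost = 0.71         # baseline run as billed (USD)
all_metered_cost = 2.97  # the same run if every call were metered (USD)

saved_fraction = 1 - real_cost / all_metered_cost
print(f"{saved_fraction:.0%} saved")  # prints "76% saved"
```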

Bound every loop.

The revision loop has a hard cap of three rounds, a strictly-decreasing stall check, and a zero-findings anomaly guard. The machine would rather stop and ask a human than loop forever or declare a false victory. On genuinely novel briefs it frequently caps and escalates by design, which means the human gate is load-bearing on hard problems, not decorative. Budget for being called in.

Markdown is the source of truth, the ledger is a projection.

Every artifact is written to disk first. Then structured rows are written to a 9-table Postgres ledger at each phase boundary. The ledger is rebuildable from the markdown and is never the only copy, so a failed database write never loses the real record. This mirrors FORGE's own discipline: structured data for retrieval, but the working medium stays where the agents natively operate.
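The write order and the rebuild path can be sketched as follows. This sketch uses SQLite from the standard library as a stand-in for Postgres and a single hypothetical `artifacts` table instead of the nine-table ledger; `write_phase` and `rebuild_ledger` are illustrative names.

```python
import sqlite3
from pathlib import Path

SCHEMA = "CREATE TABLE IF NOT EXISTS artifacts (phase TEXT PRIMARY KEY, path TEXT)"


def write_phase(artifact_dir: Path, db: sqlite3.Connection, phase: str, body: str) -> Path:
    """Write the markdown artifact first, then project a row into the ledger."""
    path = artifact_dir / f"{phase}.md"
    path.write_text(body, encoding="utf-8")  # source of truth lands on disk first
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT OR REPLACE INTO artifacts (phase, path) VALUES (?, ?)",
                (phase, str(path)),
            )
    except sqlite3.Error:
        pass  # the markdown survives; the ledger can be rebuilt later
    return path


def rebuild_ledger(artifact_dir: Path, db: sqlite3.Connection) -> int:
    """Recreate the ledger table from the markdown files on disk."""
    paths = sorted(artifact_dir.glob("*.md"))
    with db:
        db.execute(SCHEMA)
        db.execute("DELETE FROM artifacts")
        db.executemany(
            "INSERT INTO artifacts (phase, path) VALUES (?, ?)",
            [(p.stem, str(p)) for p in paths],
        )
    return len(paths)
```

If the insert fails, for example because the table does not exist yet, the file is already on disk and `rebuild_ledger` can restore the projection from it.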

Interrogate before you design.

The first phase asks you questions instead of assuming answers, because the most expensive failure is solving the wrong problem well. The Socratic front end is a deliberate brake on a model's tendency to start generating immediately. Catching a misframed problem at the first gate costs a conversation; catching it after the architecture is drawn costs the whole back half. Two human gates, and v1 stops at the approved spec, so the tool earns trust before it earns autonomy.

The Top of the Loop.

FORGE ships websites. The knowledge corpus accumulates the frontier of what is known about building agentic systems. The Solution Factory reasons over that corpus to turn knowledge into architecture.

The systems form a loop. FORGE ships websites and runs the daily work. A curated corpus accumulates the frontier of agentic-engineering patterns as typed, field-harvested atoms. The Solution Factory sits at the top: it reasons over that corpus to turn knowledge into a defensible, build-ready architecture, with the major objections already surfaced and resolved. It is general-purpose, so the same machinery designs whatever the problem and the evidence call for, infrastructure included.

The value is not a fixed answer. Across three real runs the same brief took three different paths: one converged organically, one hit the iteration cap, and one stalled. That non-determinism is inherent to LLM design plus LLM critique. The point is that every terminal path is safe: cap, stall, and organic convergence all end at a human gate with an honest, reconciled finding count, never a false green. Where you already know exactly what to build, this is overhead; it is a design tool, not a code generator. Where the cost of a bad design is high enough to justify an adversarial review and two human checkpoints, it earns its keep.

Explore the Other Systems.

FORGE

The autonomous pipeline. Five stages, one operator, built solo.

Founders Circle

A seven-agent strategy council that runs itself.

There's More.

The Solution Factory is one of the AI systems on my resume page. FORGE (the centerpiece) and the Founders Circle council are from-scratch builds with full deep-dives of their own.
