FORGE. The autonomous pipeline.

An autonomous multi-agent AI pipeline that finds small businesses, builds them custom websites, sends personalized outreach, and carries built-in payment and retention. Five stages. One human in the loop. Built solo on a Mac Mini.

End-to-end multi-agent orchestration, with durable workflows, model-tier routing, informational QA audits, and a human-in-the-loop approval gate.

What FORGE Is.

FORGE is an autonomous pipeline that runs the full lifecycle of a customer relationship: research, build, sell, retain. The name is an acronym of its five stages: Find, Outfit, Reach, Grow, Embark. Each stage is a Temporal workflow with its own activities, agents, and outputs.

The system runs on a Mac Mini M4 (16 GB RAM, always on). Temporal.io is the orchestration backbone, run as a local self-hosted dev server with SQLite persistence rather than on Temporal Cloud. Temporal handles retries, crash recovery, event history, and long-running workflows. If the machine crashes mid-build, the workflow resumes from the last step recorded in its event history. The entire codebase is a single TypeScript project. One Postgres database for business data. One Next.js dashboard. Four built creative agent templates, two of which run autonomously inside the build workflow.

FORGE has staged 54 production-ready websites at the time of writing. It runs whether I'm at the desk or not. Creative work routes to a flat-rate Claude Max subscription and analytical work to Gemini Flash, so the marginal cost of a build is a handful of metered API calls.

Pipeline stages: 5 (FIND, OUTFIT, REACH, GROW, EMBARK)
AI agents: 4 built (2 autonomous, 2 triggered)
Architectural iterations: Five over roughly five months
Orchestration: Temporal, durable by design
Production-ready sites: 54 staged and ready
Operator: One. Me.
Runs on: A single Mac Mini M4

The Case Study, Slide by Slide.

The full FORGE deck walks through the premise, the five stages, and the reliability engineering underneath.

FORGE, the flagship case study. A production-grade multi-agent AI pipeline.

Five Stages. End to End.

Each stage is a Temporal workflow with its own activities and durability guarantees. They hand off through Postgres and Temporal signals.

Pipeline diagram (end to end, Temporal signal handoffs): FIND (research engine) → OUTFIT (HATTORI, HANZO) → REACH (BARD) → GROW (SHERPA) → EMBARK (strategic loop). Each stage is its own Temporal workflow. One human gate: REACH approval.

FIND

The discovery and research engine.

Given a business name and city, FIND fans out across three categories of source in parallel. Scrapers pull what's public on the major directories. Research APIs pull regulated data and competitive signal. Cloudinary enhances any photos surfaced along the way. None of these depend on each other, so the work happens concurrently rather than serially.

Gemini Flash then synthesizes the raw data into a structured dossier: who they are, what they do, what their customers say, what their competitive landscape looks like, what their strengths and weaknesses are. The same dossier feeds a six-dimension scoring pass that ranks every prospect on opportunity. The FIND machinery is also callable as a standalone enrichment script outside the Temporal workflow, so an existing database row can be re-enriched without re-discovering anything.

A full FIND pass takes 5 to 10 minutes per prospect, not 20, because the work is parallel rather than sequential. The dossier is the single document every downstream agent reads.
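The parallel fanout can be sketched in a few lines. This is a minimal illustration, not FORGE's actual code: the `Source` shape, the `pull` method, and the choice to record a failed source rather than abort the whole pass are all assumptions made for the example.

```typescript
// Hypothetical shape for one data source (scraper, research API, image step).
type Source = {
  name: string;
  pull: (business: string, city: string) => Promise<unknown>;
};

// Tagged result so a failed source is recorded instead of thrown.
type Outcome =
  | { source: string; ok: true; data: unknown }
  | { source: string; ok: false; error: string };

// Start every source at once and wait for all of them to finish.
// Total wall time is roughly the slowest source, not the sum of all sources.
async function fanOut(
  sources: Source[],
  business: string,
  city: string,
): Promise<Outcome[]> {
  return Promise.all(
    sources.map(async (s): Promise<Outcome> => {
      try {
        return { source: s.name, ok: true, data: await s.pull(business, city) };
      } catch (err) {
        // Assumption: one failing source should not sink the whole dossier.
        return { source: s.name, ok: false, error: String(err) };
      }
    }),
  );
}
```

Because each source is wrapped individually, the synthesis step can see which inputs are missing and work with whatever did arrive.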

Data sources, run in parallel
  • Scrapers: Google Maps, Yelp, Thumbtack, existing site
  • Research APIs: Perplexity Sonar, Tavily (direct API)
  • Licensing: Oregon CCB, Secretary of State, bonding, insurance
  • Imagery: Cloudinary
Six scoring dimensions (weighted)
  • Website quality (0.25)
  • Social proof (0.20)
  • Contact confidence (0.15)
  • Market opportunity (0.15)
  • Business maturity (0.15)
  • Competitive position (0.10)
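The six weights above sum to 1.0, so the overall score is a plain weighted average. The sketch below assumes each dimension is scored from 0 to 100 and is already oriented so that a higher number means more opportunity; the scale, the key names, and the `opportunityScore` function are illustrative, not taken from FORGE.

```typescript
type Dimension =
  | "websiteQuality"
  | "socialProof"
  | "contactConfidence"
  | "marketOpportunity"
  | "businessMaturity"
  | "competitivePosition";

// Weights as listed in the document; they sum to 1.0.
const WEIGHTS: Record<Dimension, number> = {
  websiteQuality: 0.25,
  socialProof: 0.2,
  contactConfidence: 0.15,
  marketOpportunity: 0.15,
  businessMaturity: 0.15,
  competitivePosition: 0.1,
};

// Weighted sum of per-dimension scores, rounded to two decimals.
// Assumption: every dimension score is in the range 0..100.
function opportunityScore(scores: Record<Dimension, number>): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    total += WEIGHTS[dim] * scores[dim];
  }
  return Math.round(total * 100) / 100;
}
```

With these weights, website quality alone can move the overall score by up to a quarter of the range, which matches its position at the top of the list.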
FIND diagram (parallel fanout, 5 to 10 minutes). Input: business name + city. Parallel sources: Maps, Yelp, Thumbtack, existing site, Perplexity Sonar, Tavily (API), CCB and SoS, Cloudinary. Output: structured dossier + 6-dimension score.

OUTFIT

Where the website gets built.

OUTFIT runs two AI agents in sequence. HATTORI writes the full website copy package first (headlines, body, CTAs, services, about, testimonials) and hands it over as suggestions, not a fixed outline. HANZO builds the site second: it writes its own internal design plan, wireframes the layout, then writes a complete bespoke site in raw HTML and CSS from scratch. No framework. No build step. No template fill. Plain files that deploy to Vercel with zero configuration.

Before it finalizes, HANZO runs a structured self-QA checklist: head and meta tags, JSON-LD that parses, alt text on every image, nav anchor integrity, the form fields and slug token, plus an executable anti-slop grep. The limits are worth stating plainly. Only the grep actually executes; the model self-attests to the rest by reading its own HTML, since it has no browser, and it logs a "decisions I cannot trace to anything" list. After the build, a Playwright technical audit and a Gemini Vision visual QA run in parallel as informational second eyes for me, never as gates. The one true gate is my manual approval.
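Two of the checklist items, JSON-LD that parses and alt text on every image, are the kind of thing a small script could verify mechanically. The sketch below is an illustration of that idea, not part of FORGE as described: the function names are hypothetical, and the regex-based HTML scanning is an approximation that a real HTML parser would do more robustly.

```typescript
type CheckResult = { check: string; passed: boolean; detail: string };

// Every <script type="application/ld+json"> block must parse as JSON.
function checkJsonLd(html: string): CheckResult {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  let blocks = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    blocks += 1;
    try {
      JSON.parse(match[1] ?? "");
    } catch {
      return { check: "json-ld", passed: false, detail: `block ${blocks} does not parse` };
    }
  }
  return blocks > 0
    ? { check: "json-ld", passed: true, detail: `${blocks} block(s) parse` }
    : { check: "json-ld", passed: false, detail: "no JSON-LD block found" };
}

// Every <img> tag must carry a non-empty alt attribute.
function checkAltText(html: string): CheckResult {
  const tags: string[] = html.match(/<img\b[^>]*>/gi) ?? [];
  const missing = tags.filter((tag) => !/\salt\s*=\s*["'][^"']+["']/i.test(tag));
  return {
    check: "alt-text",
    passed: missing.length === 0,
    detail: `${missing.length} of ${tags.length} image(s) missing alt text`,
  };
}
```

Results like these would stay informational, in keeping with the rule that the only gate is manual approval.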

The copy-before-design sequence is deliberate. Copy dictates layout in web design. By having HATTORI write the full copy package first, HANZO knows exactly how much text it is working with and how the page should flow. Designing first and filling in copy later produces lorem ipsum rectangles and awkward fits.

REACH

Outreach, sales, and payment.

REACH has a fundamentally different rhythm than FIND and OUTFIT. Those stages finish in under an hour with no human intervention. REACH can take weeks and requires my approval at multiple points. Splitting it into its own Temporal workflow keeps the event histories clean and lifecycle management sane.

BARD drafts a personalized outreach message referencing the prospect, specific strengths, and the staged website built for them. The draft lands in my dashboard alongside the dossier and the site preview. I approve, edit, or reject. On approval, the pipeline checks the Do Not Contact list one more time and sends via iMessage. A Stripe payment link is included. When the Stripe webhook fires on payment, the pipeline receives a Temporal signal and transitions the prospect to GROW.
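The approve, edit, or reject decision and the final Do Not Contact check can be modeled as one pure function. This is a sketch under assumptions: the `Decision` and `SendOutcome` types, the `resolveOutreach` name, and keying the list by phone number are all hypothetical, and the real pipeline runs this logic inside a Temporal workflow rather than a standalone function.

```typescript
// The three things the operator can do with a draft.
type Decision =
  | { kind: "approve" }
  | { kind: "edit"; text: string }
  | { kind: "reject" };

type SendOutcome =
  | { action: "send"; text: string }
  | { action: "skip"; reason: string };

function resolveOutreach(
  draft: string,
  phone: string,
  decision: Decision,
  doNotContact: ReadonlySet<string>,
): SendOutcome {
  if (decision.kind === "reject") {
    return { action: "skip", reason: "rejected by operator" };
  }
  // Re-check the list at send time: approval may come days after drafting,
  // and the prospect may have been added to the list in between.
  if (doNotContact.has(phone)) {
    return { action: "skip", reason: "on Do Not Contact list" };
  }
  // An edit replaces the draft; a plain approval sends it unchanged.
  return { action: "send", text: decision.kind === "edit" ? decision.text : draft };
}
```

Placing the list check after the decision, rather than only at draft time, is what makes the "one more time" check meaningful for a stage that can span weeks.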

Every outreach send is human-approved. This is deliberate. The first impression with a potential client is too important to fully automate, especially when the system is still building a track record.

GROW

Retention and recurring value.

GROW starts when Stripe confirms payment and runs indefinitely. Day one handles the launch tasks. After that the workflow enters a monthly loop.

The workflow resets monthly using Temporal's Continue-As-New pattern, which prevents the event history from growing unbounded over the lifetime of a client. The onboarding workflow is wired end to end and verified in test mode on a signed Stripe test event; no real card has driven it yet, by design, since FORGE is pre-revenue. SHERPA, the client-success agent planned for this stage, is scaffolded and not yet built.
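The effect of Continue-As-New can be illustrated with a plain TypeScript model. This is not Temporal SDK code (in the TypeScript SDK the corresponding call is `continueAsNew` from `@temporalio/workflow`); it is a simulation in which each "run" handles one monthly cycle, then hands only a compact state object to a fresh run. The `GrowState` shape and function names are assumptions.

```typescript
// Only this compact state crosses from one run to the next.
type GrowState = { clientId: string; monthsCompleted: number };
type RunOutcome = { next: GrowState; historyLength: number };

const MONTHLY_TASKS = [
  "performance report",
  "SEO rankings check",
  "uptime check",
  "upsell suggestions",
];

// One run: do a month of work, record events, then stop and pass state on.
function runOneMonth(state: GrowState): RunOutcome {
  const history: string[] = []; // fresh, empty history for every run
  for (const task of MONTHLY_TASKS) {
    history.push(`${state.clientId}: ${task}`);
  }
  return {
    next: { clientId: state.clientId, monthsCompleted: state.monthsCompleted + 1 },
    historyLength: history.length,
  };
}

// Chain runs together and track the longest history any single run produced.
function simulate(start: GrowState, months: number): { final: GrowState; longestHistory: number } {
  let state = start;
  let longest = 0;
  for (let i = 0; i < months; i += 1) {
    const run = runOneMonth(state);
    longest = Math.max(longest, run.historyLength);
    state = run.next;
  }
  return { final: state, longestHistory: longest };
}
```

However many months a client stays, no single run's history grows beyond one month of events, which is the property the pattern is used for here.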

Most web design agencies lose clients because they disappear after the sale. GROW exists to justify the $200/month with ongoing value, not just the initial build.

Day one
  • Domain setup
  • Site goes live on the real domain
  • Welcome email drip
  • Initial SEO configuration
Every month after (planned, activities scaffolded)
  • Performance report
  • SEO rankings check
  • Uptime check
  • Upsell suggestions (additional pages, blog posts, Google Business optimization)
EMBARK

Self-improvement and expansion.

EMBARK is the strategic learning layer, where the pipeline learns about itself rather than about individual prospects.

Not yet built. Not needed until first revenue is consistent. Documented here because the architecture anticipates it: every QA result, every outreach response, and every conversion is already being logged to learnings tables that EMBARK will consume.

  • Weekly system reflections
  • Vertical analysis (should the next cohort be dentists, restaurants, etc.)
  • System health reporting

Deliberately Small.

Every piece of the stack earns its place. Temporal owns workflow state, so the application database carries no orchestration tables. FORGE is one TypeScript project, a 22-table Postgres schema, and four built agents.

Orchestration
  • Temporal.io (durable workflows, retries, event history, signals; local self-hosted dev server, SQLite persistence, not Temporal Cloud)
  • n8n (peripheral discovery feeds and notifications)
AI Agents
  • Claude Max via the headless claude -p CLI (HATTORI for copy and HANZO for design and build run autonomously; HANZO-lite for revisions and BARD for outreach are triggered; SHERPA for client success is scaffolded, not built)
  • Gemini Flash API (research synthesis, visual QA scoring, dossier generation)
Data
  • Postgres (business data, artifacts, send history, learnings; 22-table schema, no orchestration tables)
  • Workspaces on disk (per-prospect agent context, copied from templates)
Research & Enrichment
  • Perplexity Sonar
  • Tavily (direct API)
  • Apify (Google Maps, Yelp, Thumbtack, contact scrapers)
  • Cloudinary (image enhancement)
  • Oregon CCB and Secretary of State integrations
Deployment & QA
  • Vercel (static HTML/CSS deployment, zero-config)
  • Playwright (desktop + mobile screenshots)
  • Stripe (payment processing and webhooks)
  • iMessage via AppleScript (outreach delivery)

The Judgment Calls.

A pipeline is just plumbing. What makes one work is the engineering judgment behind it. These are the decisions that shaped FORGE.

01

Claude Max for creative, Gemini Flash for analytical.

Claude Max is a flat-rate subscription. Every creative call (copywriting, design, iteration, outreach drafting) costs zero at the margin. Gemini Flash is cheap per call and fast, making it ideal for the dozen-plus analytical calls per prospect (research synthesis, QA scoring, persona generation). The split is a hard rule: creative work never goes to Gemini and analytical work never goes to Claude, which prevents scope creep in both directions.
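A routing rule like this is easy to make exhaustive with the type system. The sketch below assumes a fixed set of task kinds drawn from the examples above; the `TaskKind` names, the tier labels, and `routeTask` are illustrative rather than FORGE's actual identifiers.

```typescript
type TaskKind =
  | "copywriting"
  | "design"
  | "iteration"
  | "outreach-draft"
  | "research-synthesis"
  | "qa-scoring"
  | "persona-generation";

type Tier = "claude-max" | "gemini-flash";

// Record<TaskKind, Tier> forces every task kind to map to exactly one tier:
// adding a new kind without a route is a compile-time error.
const ROUTES: Record<TaskKind, Tier> = {
  copywriting: "claude-max",
  design: "claude-max",
  iteration: "claude-max",
  "outreach-draft": "claude-max",
  "research-synthesis": "gemini-flash",
  "qa-scoring": "gemini-flash",
  "persona-generation": "gemini-flash",
};

function routeTask(kind: TaskKind): Tier {
  return ROUTES[kind];
}
```

Keeping the table as the single source of truth means no individual call site decides which model to use, which is one way to hold the rule against drift.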

02

Temporal for the spine, Postgres only for business data.

Earlier versions tried to put workflow state in Postgres: event ledgers, retry counters, scheduling tables. This required building custom state machines. Temporal provides all of it as a mature, battle-tested platform. FORGE uses Postgres only for business data that outlives any single workflow, so the application schema carries no orchestration tables.

03

Maximize creative freedom. Validate with judgment.

Earlier versions controlled creative output through rules: content tiers, layout templates, section visibility flags. All eliminated. AI agents produce better creative work when given strong context and quality judgment than when constrained by deterministic rules. Instead of rules, FORGE uses quality prompts (skill files), a builder self-QA checklist, and informational audits that surface weak builds for the human gate. Agents are free to make creative decisions; the human approval catches the bad ones.

04

I removed a QA gate that hurt quality.

An earlier version ran an automated QA and re-stylize loop: score the site, send it back, restyle, score again. I diagnosed that it was homogenizing output without improving it, every site drifting toward the same look, and removed it in a clean cutover. What replaced it is single-pass generation, a builder self-QA checklist, non-blocking informational audits, and one human gate. Knowing when a guardrail is making the work worse, and having the discipline to cut it, is a harder engineering call than adding one.

05

Workspaces are disposable. Templates are permanent.

Every prospect gets an isolated workspace directory containing all the context their agents need (dossier, skills, tools, input/output folders). Workspaces are generated from templates stored in the codebase. If I improve HATTORI's copywriting skill, that change lives in the template and applies to all future prospects. Already-generated workspaces keep their original version. Template changes never retroactively affect in-progress prospects.

06

No dead letters. No silent failures.

Earlier versions had a dead-letter queue where failed prospects went to die. FORGE has none. If a prospect can't be processed, it escalates to me. If outreach fails, it is logged and flagged. If a site can't pass QA, I review it manually. The reasoning: pre-revenue, every prospect matters. Silent failures mean lost revenue. Escalation means I can investigate, learn what went wrong, and improve the system.

The Patterns Travel.

FORGE is built for trades businesses, but the architecture isn't specific to them. The patterns that make it work generalize.

The Claude-Max-for-creative-and-Gemini-for-analytical split is a cost and quality discipline that applies to any pipeline producing customer-facing creative output at scale. The Temporal-as-spine pattern is the right shape for any multi-stage business workflow that needs to survive crashes and human-in-the-loop approval. The deterministic-post-processing pattern, catching looks-right-but-wrong output with concrete checks rather than a probabilistic QA loop, applies to any agentic system where silent failures are unacceptable.

The most directly transferable piece is FIND: a multi-source intelligence operation that turns a bare business listing into a research dossier rich enough to drive bespoke copy and visual design. Swap the Oregon CCB integration for a different vertical licensing dataset and the same machinery enriches dentists, restaurants, real estate agents, or any local-business category. The standalone enrichment script mode means it can also run as a B2B data product against pre-existing customer lists.

There's More.

FORGE is the centerpiece, but it's not the only system I've built. The resume supplement page has the full story, including the other AI systems I've designed and shipped.