Multi-Agent Canvas
Watchable multi-agent system: a Sonnet orchestrator delegates to parallel Haiku workers, two critics debate, a judge decides — all streamed live onto a graph canvas, every session reproducible as an event-sourced permalink.
Key engineering call
The mid-stream critic interrupt was cut. It looked spectacular but produced worse output, because the model was reacting to its own half-finished work. The debate now fires after Layout v1, against the full result: less visually exciting, materially better. The same rule applies throughout: no UX theater, and every deferral is surfaced honestly (steer_queued → steer_applied).
Sketch a layout, drop a brand reference, type a prompt. A Sonnet 4.6 orchestrator proposes a roster of Haiku 4.5 workers, fans out parallel drafts, runs a two-critic debate (performance + aesthetics) settled by a judge, and ships a finished landing page. Every token streams in real time onto a graph canvas; every inter-agent message is event-sourced and replayable from a permalink.
The thesis
This is a portfolio piece, not a product. The landing-page output is the vehicle — the real artifact is the engineering substrate underneath: a WebSocket multiplexer over concurrent Anthropic streams, an event-sourced replay model, prompt caching with a live cache-hit signal, an honest cancel + steer lifecycle, and a multi-agent debate that visibly branches and re-merges on the canvas.
The flow
Six stages from sketch input to auto-published layout. Every stage is live on the canvas — nodes appear, stream, collapse or glow depending on lifecycle state.
┌─────────────┐
│   Inputs    │   prompt · reference image · paint-canvas sketch
└──────┬──────┘
       ▼
┌─────────────┐   propose_roster (tool use, vision)
│ Orchestrator│   delegate(Designer, alts=3) ─┐
│ Sonnet 4.6  │   delegate(Mascot, alts=3)   ─┤ ← parallel
│ + thinking  │   delegate(Copywriter)       ─┤
└──────┬──────┘   delegate(Layout)           ─┘
       ▼
┌────────────────────────────┐
│ Workers (Haiku 4.5)        │
│ Designer×3 · Mascot×3      │
│ Copywriter · Layout        │
└──────┬─────────────────────┘
       ▼
┌─────────────┐       ┌──── Critic-Performance ─┐
│  Layout v1  │──────►│                         │── Judge ─► one ISSUE/FIX
└──────┬──────┘       └──── Critic-Aesthetics ──┘                    │
       │                                                             │
       └─────► (user approves) ────► Layout v2 ────► auto-publish ◄──┘
                                                          │
                                                          ▼
                                              /r/<id> permalink replay
Six engineering problems
Each has its own deep-dive in the repo under docs/hard-problems/. These are the things that don't fall out of a one-shot prompt.
WebSocket multiplexing over N concurrent Anthropic streams
One socket, monotonic per-session sequence IDs, per-agent demux on the client. Not one socket per worker — a single connection, many sub-streams.
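A minimal sketch of the single-socket idea, assuming a frame shape that is not taken from the repo: the `Frame` fields, `SessionMux`, and `ClientDemux` are hypothetical names. The server stamps one sequence counter per session, and the client routes frames into per-agent buffers while checking that nothing arrived out of order.

```typescript
// Hypothetical frame shape; field names are assumptions, not the project's wire format.
type Frame = {
  seq: number; // monotonic per session, shared by all agents
  agentId: string; // which sub-stream the frame belongs to
  type: "token" | "done";
  text: string;
};

// Server side: every outgoing frame gets the next sequence ID for the session.
class SessionMux {
  private nextSeq = 0;
  private readonly send: (frame: Frame) => void;

  constructor(send: (frame: Frame) => void) {
    this.send = send;
  }

  emit(agentId: string, type: Frame["type"], text = ""): void {
    this.send({ seq: this.nextSeq++, agentId, type, text });
  }
}

// Client side: demultiplex into per-agent buffers and reject gaps or reordering.
class ClientDemux {
  private lastSeq = -1;
  readonly buffers = new Map<string, string>();

  receive(frame: Frame): void {
    if (frame.seq !== this.lastSeq + 1) {
      throw new Error(`expected seq ${this.lastSeq + 1}, got ${frame.seq}`);
    }
    this.lastSeq = frame.seq;
    if (frame.type === "token") {
      const previous = this.buffers.get(frame.agentId) ?? "";
      this.buffers.set(frame.agentId, previous + frame.text);
    }
  }
}
```

In a real deployment the `send` callback would wrap the WebSocket's send call with a JSON-serialized frame; here it is left abstract so the sketch runs without a server.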
Cancel and steer mid-stream — without state corruption
Real AbortController cancel; steer is honest about deferring to the next turn (steer_queued vs. steer_applied as two visible events). No UX theater — the streaming protocol's limits are surfaced honestly.
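The two-event steer lifecycle can be sketched as a small turn controller. This is an illustration under assumptions, not the project's code: `TurnController` and its method names are hypothetical, and only the event names `steer_queued` and `steer_applied` come from the text above. A steer never touches the in-flight message; it is queued, and it is reported as applied only when the next turn starts.

```typescript
type SteerEvent =
  | { type: "steer_queued"; text: string }
  | { type: "steer_applied"; text: string };

class TurnController {
  private abort = new AbortController();
  private pending: string[] = [];
  readonly events: SteerEvent[] = [];

  // Start a turn: drain queued steers into the prompt and announce each one.
  startTurn(basePrompt: string): { prompt: string; signal: AbortSignal } {
    this.abort = new AbortController();
    const applied = this.pending;
    this.pending = [];
    for (const text of applied) {
      this.events.push({ type: "steer_applied", text });
    }
    // The signal is meant to be handed to the streaming request for this turn.
    return { prompt: [basePrompt, ...applied].join("\n"), signal: this.abort.signal };
  }

  // Steering mid-stream only queues the instruction; the deferral is visible as an event.
  steer(text: string): void {
    this.pending.push(text);
    this.events.push({ type: "steer_queued", text });
  }

  // Cancel aborts the current turn's signal; queued steers are left untouched.
  cancel(): void {
    this.abort.abort();
  }
}
```

Keeping the queue separate from the abort signal is what keeps the two operations from interfering: cancel ends a turn, steer shapes the next one.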
Speculative parallel drafts with a hard cost ceiling
Designer ×3, Mascot ×3 run in parallel via Promise.all. Per-session and monthly USD caps are checked before every model call; tight per-worker maxTokens budgets prevent single-call blowouts.
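A sketch of the cap-then-fan-out pattern, assuming a single USD cap and a model call that reports its own cost. `CostGuard`, `guardedCall`, and `speculativeDrafts` are hypothetical names, and the real system tracks both a per-session and a monthly cap.

```typescript
type Draft = { text: string; costUsd: number };
type ModelCall = (prompt: string, maxTokens: number) => Promise<Draft>;

class CostGuard {
  private spentUsd = 0;
  private readonly capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // Checked before a call starts; a call already in flight is never interrupted.
  assertCanSpend(): void {
    if (this.spentUsd >= this.capUsd) {
      throw new Error(`cost cap of $${this.capUsd} reached`);
    }
  }

  record(usd: number): void {
    this.spentUsd += usd;
  }
}

async function guardedCall(
  guard: CostGuard,
  call: ModelCall,
  prompt: string,
  maxTokens: number,
): Promise<Draft> {
  guard.assertCanSpend();
  const draft = await call(prompt, maxTokens); // maxTokens bounds the worst case per call
  guard.record(draft.costUsd);
  return draft;
}

// Fan out `alts` variants in parallel. All cap checks in one batch run before
// any of the batch's costs are recorded, so a batch can overshoot the cap.
async function speculativeDrafts(
  guard: CostGuard,
  call: ModelCall,
  prompt: string,
  alts: number,
  maxTokens: number,
): Promise<Draft[]> {
  const variants = Array.from({ length: alts }, (_, i) =>
    guardedCall(guard, call, `${prompt} (variant ${i + 1})`, maxTokens),
  );
  return Promise.all(variants);
}
```

The overshoot noted in the last comment is why the per-worker `maxTokens` budget matters: it bounds how far a single parallel batch can exceed the cap.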
Sketch → vision → constrained layout generation
A native HTML canvas as paint input (no Excalidraw/tldraw dependency) → base64 PNG → vision payload to the orchestrator → constrained layout prompt that honors the sketch.
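The sketch-to-vision step might look roughly like this. The image block follows the base64 image format of the Anthropic Messages API; the helper names and the wording of the constraint text are assumptions made for illustration.

```typescript
// Base64 image content block as accepted by the Anthropic Messages API.
type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: "image/png"; data: string };
};
type TextBlock = { type: "text"; text: string };
type UserMessage = { role: "user"; content: [ImageBlock, TextBlock] };

// A canvas PNG export is a data URL; the API wants only the base64 payload.
function dataUrlToBase64Png(dataUrl: string): string {
  const prefix = "data:image/png;base64,";
  if (!dataUrl.startsWith(prefix)) {
    throw new Error("expected a PNG data URL");
  }
  return dataUrl.slice(prefix.length);
}

// Hypothetical helper: pairs the sketch with a prompt that asks the model to honor it.
function buildSketchMessage(sketchDataUrl: string, userPrompt: string): UserMessage {
  const image: ImageBlock = {
    type: "image",
    source: {
      type: "base64",
      media_type: "image/png",
      data: dataUrlToBase64Png(sketchDataUrl),
    },
  };
  const text: TextBlock = {
    type: "text",
    text: `Follow the regions drawn in the attached sketch.\n${userPrompt}`,
  };
  return { role: "user", content: [image, text] };
}
```

In the browser, the data URL would come from `canvas.toDataURL("image/png")` on the paint canvas.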
Event-sourced replay over non-deterministic LLM output
Every WebSocket frame is appended to libSQL; permalink replay reads from the log — no re-inference. Permalinks are bit-identical visual replays, not a re-prompt.
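The replay model can be illustrated with an in-memory log standing in for the libSQL table. `EventLog`, `LoggedFrame`, and the `kind` values are hypothetical; the point is that replay is a pure fold over stored frames and never calls the model.

```typescript
type LoggedFrame = { seq: number; agentId: string; kind: string; payload: string };

// Append-only store; rows are kept serialized, as a table of JSON rows would hold them.
class EventLog {
  private readonly rows: string[] = [];

  append(frame: LoggedFrame): void {
    this.rows.push(JSON.stringify(frame));
  }

  // Equivalent of reading the table ordered by sequence ID.
  read(): LoggedFrame[] {
    return this.rows
      .map((row) => JSON.parse(row) as LoggedFrame)
      .sort((a, b) => a.seq - b.seq);
  }
}

type CanvasState = Map<string, string>;

function applyFrame(state: CanvasState, frame: LoggedFrame): void {
  switch (frame.kind) {
    case "token":
      state.set(frame.agentId, (state.get(frame.agentId) ?? "") + frame.payload);
      break;
    default:
      // Unknown kinds are skipped, so newly added event types do not break old logs.
      break;
  }
}

// Rebuild the canvas purely from the log: same frames in, same state out.
function replay(log: EventLog): CanvasState {
  const state: CanvasState = new Map();
  for (const frame of log.read()) {
    applyFrame(state, frame);
  }
  return state;
}
```

Because the live view and the permalink view can both run through the same `applyFrame`, the replay matches the recorded run regardless of how non-deterministic the original inference was.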
Prompt-cached orchestrator system prompt with a live HUD
cache_control: ephemeral on the system prompt; the cache-hit signal surfaces as a live HUD badge. Cold vs. warm run cost is measured and documented in the README.
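A sketch of both halves of the caching story: marking the system prompt as cacheable and deriving a badge from the response. The `cache_control` block and the `cache_read_input_tokens` / `cache_creation_input_tokens` usage fields are part of the Anthropic Messages API; the badge names and helper functions are assumptions.

```typescript
type SystemBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Mark the (long, stable) orchestrator system prompt as a cache breakpoint.
function cachedSystemPrompt(text: string): SystemBlock[] {
  return [{ type: "text", text, cache_control: { type: "ephemeral" } }];
}

// Subset of the usage object returned with a Messages API response.
type Usage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Hypothetical HUD states: "hit" = warm run, "write" = cold run that filled the cache.
type CacheBadge = "hit" | "write" | "none";

function cacheBadge(usage: Usage): CacheBadge {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return "hit";
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return "write";
  return "none";
}
```

The array returned by `cachedSystemPrompt` is intended for the `system` field of the request, and `cacheBadge` would run on the usage data of each completed orchestrator call.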
Stack — picked pragmatically
One choice per layer, justified where it matters: for example Sonnet over Opus for tool routing, and Bun over Node for native WebSocket support.
| Layer | Choice |
|---|---|
| Runtime | Bun 1.3 |
| API | Hono + native Bun WebSocket |
| Event store | libSQL (local file mode, Turso-compatible) |
| Anthropic SDK | messages.stream() with adaptive thinking |
| Orchestrator | claude-sonnet-4-6 (5× cheaper than Opus, on par for tool routing) |
| Workers | claude-haiku-4-5 |
| Frontend | Next.js 15 App Router + React 19 |
| Canvas | @xyflow/react + custom labeled-bead edges |
| State | Zustand with per-agent slices (no Context re-renders) |
| UI | Tailwind + Radix Dialog + Framer Motion + Sonner |
| Sketch input | native <canvas> (no Excalidraw/tldraw dependency) |
| Deploy | Docker + Caddy on a VPS |
Tradeoffs — called out honestly
The awkward decisions most demos hide. Made visible here.
Replay reproduces the recorded run, not a fresh inference
Permalinks are bit-identical visual replays from the event log — not a re-prompt of the model. The landing footer says so openly.
Steer applies one turn late
The Anthropic streaming protocol has no mid-message interrupt; pretending it did would be UX theater. The two-event steer_queued → steer_applied protocol makes the deferral visible.
The roster isn't fully emergent
The orchestrator is nudged by the system prompt to propose a Performance Critic + Aesthetics Critic + Judge debate — for canvas-shape reasons. A fully-emergent orchestrator was tried; it produced flatter rosters with weaker review.
Cost cap is checked between calls, not within one
Mitigation: tight per-worker maxTokens budgets. Documented in docs/hard-problems/03-speculative-drafts.md.
Mid-stream critic interrupt was cut
Real interrupts looked spectacular — but produced worse output (the model reacting to its own half-finished work). The debate fires after Layout v1, against the full result. Less visually exciting; materially better.
UI is German on purpose
System markers (ISSUE, FIX, HEADLINE, <!doctype html>, JSON keys) stay English — the parsers depend on them. Worker output is German.
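To illustrate why the markers stay English, here is a sketch of a verdict parser. The exact output format is an assumption: it supposes the judge writes one line starting with `ISSUE:` and one starting with `FIX:`, and `parseVerdict` is a hypothetical name.

```typescript
type Verdict = { issue: string; fix: string };

// Looks for the English markers at the start of a line; the text after them may be German.
function parseVerdict(output: string): Verdict | null {
  const issue = /^ISSUE:\s*(.+)$/m.exec(output);
  const fix = /^FIX:\s*(.+)$/m.exec(output);
  if (!issue || !fix) return null;
  return { issue: (issue[1] ?? "").trim(), fix: (fix[1] ?? "").trim() };
}
```

A verdict whose markers were translated along with the content (for example `PROBLEM:` instead of `ISSUE:`) would return `null` here, which is the failure the fixed English markers avoid.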
What's deliberately not built
- Multi-tenant — single Anthropic key, one global cost pool.
- Auth — the demo is public, cost caps are the only gate.
- Mobile canvas — desktop-only, mobile shows a fallback banner with the demo video.
- Branching from a permalink — the data model supports it, the UI doesn't.
- Versioned event-log schemas — additive types are graceful, renames would break old replays.
Deployment
Docker Compose plus Caddy on a VPS. Caddy auto-issues TLS; the API serves WebSocket on /ws, replay on /r/<id>, everything else routes to Next.js. Live at multi.prototyp.ms — source open at github.com/stackola/multi.