Multi-Agent Canvas

Watchable multi-agent system: a Sonnet orchestrator delegates to parallel Haiku workers, two critics debate, a judge decides — all streamed live onto a graph canvas, every session reproducible as an event-sourced permalink.

Key engineering call

The mid-stream critic interrupt was cut. It looked spectacular but produced worse output, because the model was reacting to its own half-finished work. The debate now fires after Layout v1, against the full result: less visually exciting, materially better. The same pattern holds throughout: no UX theater, and every deferral is surfaced honestly (steer_queued → steer_applied).

Sketch a layout, drop a brand reference, type a prompt — a Sonnet 4.6 orchestrator proposes a roster of Haiku 4.5 workers, fans out parallel drafts, runs a two-critic debate (performance + aesthetics) against a judge, and ships a finished landing page. Every token streams in real time onto a graph canvas; every inter-agent message is event-sourced and replayable from a permalink. Deliberately built as a portfolio piece: the landing-page output is the vehicle — the real artifact is the engineering substrate underneath (WebSocket multiplexing over concurrent Anthropic streams, event-sourced replay, prompt caching with a live cache-hit signal, an honest cancel + steer lifecycle, visible branch-and-re-merge topology on the canvas).

The thesis

This is a portfolio piece, not a product. The landing-page output is the vehicle — the real artifact is the engineering substrate underneath: a WebSocket multiplexer over concurrent Anthropic streams, an event-sourced replay model, prompt caching with a live cache-hit signal, an honest cancel + steer lifecycle, and a multi-agent debate that visibly branches and re-merges on the canvas.

The flow

Six stages from sketch input to auto-published layout. Every stage is live on the canvas — nodes appear, stream, collapse or glow depending on lifecycle state.

   ┌─────────────┐
   │   Inputs    │   prompt · reference image · paint-canvas sketch
   └──────┬──────┘
          ▼
   ┌─────────────┐    propose_roster (tool use, vision)
   │ Orchestrator│    delegate(Designer, alts=3)  ─┐
   │  Sonnet 4.6 │    delegate(Mascot,   alts=3)  ─┤  ← parallel
   │ + thinking  │    delegate(Copywriter)        ─┤
   └──────┬──────┘    delegate(Layout)            ─┘
          ▼
   ┌────────────────────────────┐
   │  Workers (Haiku 4.5)       │
   │  Designer×3 · Mascot×3     │
   │  Copywriter · Layout       │
   └──────┬─────────────────────┘
          ▼
   ┌─────────────┐       ┌──── Critic-Performance ┐
   │  Layout v1  │──────►│                         │── Judge ─► one ISSUE/FIX
   └──────┬──────┘       └──── Critic-Aesthetics ──┘                │
          │                                                          │
          └─────► (user approves) ────► Layout v2 ────► auto-publish ◄─┘
                                                            │
                                                            ▼
                                                   /r/<id> permalink replay

Six engineering problems

Each has its own deep-dive in the repo under docs/hard-problems/. These are the things that don't fall out of a one-shot prompt.

WebSocket multiplexing over N concurrent Anthropic streams

One socket, monotonic per-session sequence IDs, per-agent demux on the client. Not one socket per worker — a single connection, many sub-streams.
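
The single-socket idea can be sketched as follows. This is a minimal illustration, not the project's actual protocol: the Frame shape and the SessionMux and Demux names are assumptions made up for this example. The server side stamps each frame with a per-session sequence number, and the client side routes frames to per-agent buffers and ignores anything it has already seen.

```typescript
// Hypothetical envelope: every sub-stream shares this shape on the one socket.
type Frame = { seq: number; agentId: string; type: string; payload: string };

// Server side: stamps each outgoing frame with a monotonic per-session seq.
class SessionMux {
  private seq = 0;
  private readonly send: (raw: string) => void;

  constructor(send: (raw: string) => void) {
    this.send = send;
  }

  emit(agentId: string, type: string, payload: string): Frame {
    const frame: Frame = { seq: ++this.seq, agentId, type, payload };
    this.send(JSON.stringify(frame));
    return frame;
  }
}

// Client side: demultiplexes by agentId and drops duplicate or stale seqs.
class Demux {
  private lastSeq = 0;
  readonly textByAgent = new Map<string, string>();

  receive(raw: string): boolean {
    const frame = JSON.parse(raw) as Frame;
    if (frame.seq <= this.lastSeq) return false; // already seen, ignore
    this.lastSeq = frame.seq;
    if (frame.type === "token") {
      const prev = this.textByAgent.get(frame.agentId) ?? "";
      this.textByAgent.set(frame.agentId, prev + frame.payload);
    }
    return true;
  }
}
```

In a real deployment the send callback would wrap the WebSocket's send method; here it is left abstract so the sketch runs without a server.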

Cancel and steer mid-stream — without state corruption

Real AbortController cancel; steer is honest about deferring to the next turn (steer_queued vs. steer_applied as two visible events). No UX theater — the streaming protocol's limits are surfaced honestly.
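
A minimal sketch of the two behaviors, assuming a hypothetical AgentTurnController class (the name and method layout are not from the repo): cancel aborts through a real AbortController, while steer only queues text and reports steer_applied once the next turn is actually built.

```typescript
type LifecycleEvent =
  | { type: "steer_queued"; text: string }
  | { type: "steer_applied"; text: string }
  | { type: "cancelled" };

class AgentTurnController {
  private readonly abort = new AbortController();
  private pendingSteer: string[] = [];
  readonly events: LifecycleEvent[] = [];

  // Hand this signal to the streaming request so cancel() tears it down.
  get signal(): AbortSignal {
    return this.abort.signal;
  }

  // Steering never touches the running stream; it is queued and announced.
  steer(text: string): void {
    this.pendingSteer.push(text);
    this.events.push({ type: "steer_queued", text });
  }

  // Called while assembling the next turn: only now is the steer "applied".
  beginNextTurn(): string[] {
    const applied = this.pendingSteer;
    this.pendingSteer = [];
    for (const text of applied) {
      this.events.push({ type: "steer_applied", text });
    }
    return applied;
  }

  cancel(): void {
    if (this.abort.signal.aborted) return; // idempotent
    this.abort.abort();
    this.pendingSteer = []; // queued steers die with the run
    this.events.push({ type: "cancelled" });
  }
}
```

Keeping steer_queued and steer_applied as separate events means the UI can show the waiting state truthfully instead of implying the running stream changed course.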

Speculative parallel drafts with a hard cost ceiling

Designer ×3, Mascot ×3 run in parallel via Promise.all. Per-session and monthly USD caps are checked before every model call; tight per-worker maxTokens budgets prevent single-call blowouts.
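
As a rough sketch of the fan-out, assuming a hypothetical CostGuard class and a ModelCall function type that stands in for the real model request: each alternative checks the budget before its call starts, then all alternatives are awaited together with Promise.all.

```typescript
type Draft = { worker: string; text: string; costUsd: number };
// Stand-in for the real model request; injected so the sketch runs offline.
type ModelCall = (worker: string, maxTokens: number) => Promise<Draft>;

class CostGuard {
  private spentUsd = 0;
  private readonly capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // Checked before every call; throws once the cap has been reached.
  assertBudget(): void {
    if (this.spentUsd >= this.capUsd) throw new Error("cost cap reached");
  }

  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }

  get spent(): number {
    return this.spentUsd;
  }
}

async function fanOut(
  worker: string,
  alts: number,
  maxTokens: number,
  guard: CostGuard,
  call: ModelCall,
): Promise<Draft[]> {
  const pending = Array.from({ length: alts }, async (_, i) => {
    guard.assertBudget(); // runs before this alternative's call starts
    const draft = await call(`${worker}#${i + 1}`, maxTokens);
    guard.record(draft.costUsd);
    return draft;
  });
  return Promise.all(pending);
}
```

Note that all alternatives pass the check before any of them has recorded its cost, so one batch can exceed the cap. That is why the per-worker maxTokens budget matters as a second line of defense.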

Sketch → vision → constrained layout generation

A native HTML canvas as paint input (no Excalidraw/tldraw dependency) → base64 PNG → vision payload to the orchestrator → constrained layout prompt that honors the sketch.
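
The canvas-to-vision step might look roughly like this. The sketchToContent function and the wording of the constraint text are assumptions for illustration; the image block shape follows the Anthropic Messages API's base64 image format, and the input is the string a browser returns from canvas.toDataURL("image/png").

```typescript
type ImageBlock = {
  type: "image";
  source: { type: "base64"; media_type: "image/png"; data: string };
};
type TextBlock = { type: "text"; text: string };

// Converts a PNG data URL plus the user's prompt into message content blocks.
function sketchToContent(dataUrl: string, prompt: string): (ImageBlock | TextBlock)[] {
  const prefix = "data:image/png;base64,";
  if (!dataUrl.startsWith(prefix)) {
    throw new Error("expected a PNG data URL");
  }
  return [
    {
      type: "image",
      // The API wants the bare base64 payload, without the data URL prefix.
      source: { type: "base64", media_type: "image/png", data: dataUrl.slice(prefix.length) },
    },
    // Hypothetical constraint wording; the real prompt is not shown in this document.
    { type: "text", text: `Honor the regions drawn in the attached sketch. ${prompt}` },
  ];
}
```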

Event-sourced replay over non-deterministic LLM output

Every WebSocket frame is appended to libSQL; permalink replay reads from the log — no re-inference. Permalinks are bit-identical visual replays, not a re-prompt.
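
To illustrate why replay needs no re-inference, here is a small in-memory sketch. It swaps libSQL for a plain array and invents the LoggedFrame and CanvasState shapes; the point is only that state is a pure fold over the recorded frames, so the same log always yields the same canvas.

```typescript
type LoggedFrame = {
  seq: number;
  agentId: string;
  type: "token" | "done";
  payload: string;
};
type CanvasState = Record<string, { text: string; done: boolean }>;

// Pure reducer: same frames in, same state out. No model call anywhere.
function applyFrame(state: CanvasState, f: LoggedFrame): CanvasState {
  const node = state[f.agentId] ?? { text: "", done: false };
  const next =
    f.type === "token"
      ? { text: node.text + f.payload, done: node.done }
      : { text: node.text, done: true };
  return { ...state, [f.agentId]: next };
}

// In-memory stand-in for the append-only table behind a permalink.
class EventLog {
  private readonly frames: LoggedFrame[] = [];

  append(f: LoggedFrame): void {
    this.frames.push(f);
  }

  // Replays in seq order, regardless of the order frames were stored in.
  replay(): CanvasState {
    return [...this.frames]
      .sort((a, b) => a.seq - b.seq)
      .reduce(applyFrame, {} as CanvasState);
  }
}
```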

Prompt-cached orchestrator system prompt with a live HUD

cache_control: ephemeral on the system prompt; the cache-hit signal surfaces as a live HUD badge. Cold vs. warm run cost is measured and documented in the README.
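
One way the badge could be derived, sketched under assumptions: the cacheBadge and cachedShare helpers and the placeholder prompt text are invented here, while cache_control and the usage field names follow the Anthropic Messages API.

```typescript
// Usage fields as reported by the Anthropic Messages API.
type Usage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// System prompt block marked cacheable; the text is a placeholder.
const ORCHESTRATOR_SYSTEM = [
  {
    type: "text" as const,
    text: "<long orchestrator instructions>",
    cache_control: { type: "ephemeral" as const },
  },
];

type CacheBadge = "hit" | "write" | "none";

// "hit" when the prefix was read from cache, "write" when it was just stored.
function cacheBadge(u: Usage): CacheBadge {
  if ((u.cache_read_input_tokens ?? 0) > 0) return "hit";
  if ((u.cache_creation_input_tokens ?? 0) > 0) return "write";
  return "none";
}

// Fraction of all input tokens that were served from cache.
function cachedShare(u: Usage): number {
  const read = u.cache_read_input_tokens ?? 0;
  const total = u.input_tokens + read + (u.cache_creation_input_tokens ?? 0);
  return total === 0 ? 0 : read / total;
}
```

A cold run would typically report "write" on the first call and "hit" on later calls inside the cache lifetime, which is the transition a HUD badge can show.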

Stack — picked pragmatically

One choice per layer, justified where it matters: for example, Sonnet over Opus for tool routing, and Bun over Node for native WebSocket support.

Layer	Choice
Runtime	Bun 1.3
API	Hono + native Bun WebSocket
Event store	libSQL (local file mode, Turso-compatible)
Anthropic SDK	messages.stream() with adaptive thinking
Orchestrator	claude-sonnet-4-6 (5× cheaper than Opus, equivalent for tool routing)
Workers	claude-haiku-4-5
Frontend	Next.js 15 App Router + React 19
Canvas	@xyflow/react + custom labeled-bead edges
State	Zustand with per-agent slices (no Context re-renders)
UI	Tailwind + Radix Dialog + Framer Motion + Sonner
Sketch input	native <canvas> (no Excalidraw/tldraw dependency)
Deploy	Docker + Caddy on a VPS

Tradeoffs — called out honestly

The awkward decisions most demos hide. Made visible here.

Replay reproduces the recorded run, not a fresh inference

Permalinks are bit-identical visual replays from the event log — not a re-prompt of the model. The landing footer says so openly.

Steer applies one turn late

The Anthropic streaming protocol has no mid-message interrupt; pretending it did would be UX theater. The two-event steer_queued → steer_applied protocol makes the deferral visible.

The roster isn't fully emergent

The orchestrator is nudged by the system prompt to propose a Performance Critic + Aesthetics Critic + Judge debate — for canvas-shape reasons. A fully-emergent orchestrator was tried; it produced flatter rosters with weaker review.

Cost cap is checked between calls, not within one

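Because the cap is only checked before each call, a call that is already running can still push spend past it. Mitigation: tight per-worker maxTokens budgets, which bound how far a single call can overshoot. Documented in docs/hard-problems/03-speculative-drafts.md.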
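
The size of that overshoot can be bounded with simple arithmetic. The helper below is a hypothetical sketch: the function name is invented, the price is a caller-supplied placeholder rather than a real rate, and only output tokens are counted.

```typescript
// Upper bound on output-token spend that can land after the last cap check.
// Input-token cost is ignored; the price is a placeholder the caller supplies.
function worstCaseOvershootUsd(
  parallelCalls: number,
  maxTokens: number,
  usdPerMillionOutputTokens: number,
): number {
  return (parallelCalls * maxTokens * usdPerMillionOutputTokens) / 1_000_000;
}
```

Halving maxTokens halves the bound, which is why the budget is set per worker rather than globally.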

Mid-stream critic interrupt was cut

Real interrupts looked spectacular — but produced worse output (the model reacting to its own half-finished work). The debate fires after Layout v1, against the full result. Less visually exciting; materially better.

UI is German on purpose

System markers (ISSUE, FIX, HEADLINE, <!doctype html>, JSON keys) stay English — the parsers depend on them. Worker output is German.
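
To show why the markers have to stay fixed, here is a hypothetical parser for a judge verdict. The "ISSUE: ..." / "FIX: ..." line format and the parseVerdict name are assumptions, since the real wire format is not shown here; the marker keywords are matched literally while the content after them can be in any language.

```typescript
type Verdict = { issue: string; fix: string };

// Matches the English markers at line start; the content may be German.
function parseVerdict(text: string): Verdict | null {
  const issue = /^ISSUE:\s*(.+)$/m.exec(text);
  const fix = /^FIX:\s*(.+)$/m.exec(text);
  if (!issue || !fix) return null; // a translated marker would land here
  return {
    issue: (issue[1] ?? "").trim(),
    fix: (fix[1] ?? "").trim(),
  };
}
```

If a worker translated the marker itself (say, "PROBLEM:" instead of "ISSUE:"), the parser would return null, which is the failure mode that keeping system markers in English avoids.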

What's deliberately not built

Multi-tenant — single Anthropic key, one global cost pool.
Auth — the demo is public, cost caps are the only gate.
Mobile canvas — desktop-only, mobile shows a fallback banner with the demo video.
Branching from a permalink — the data model supports it, the UI doesn't.
Versioned event-log schemas — additive types are graceful, renames would break old replays.

Deployment

Docker Compose plus Caddy on a VPS. Caddy auto-issues TLS; the API serves WebSocket on /ws, replay on /r/<id>, everything else routes to Next.js. Live at multi.prototyp.ms — source open at github.com/stackola/multi.