* feat(mocks): replay-based mock CLIs for opencode/claude/codex/deepseek/qwen/grok
Drops in a `mocks/` top-level dir that pretends to be the real agent
CLIs by streaming pre-recorded sessions in each CLI's native stdout
protocol. Zero LLM tokens.
## Use cases
- **E2E tests** in `apps/daemon/tests/` — exercise the full chat-server
pipeline against a known trace, assert UI events / artifacts.
- **Self-validation during dev** — iterate on `claude-stream.ts` /
`json-event-stream.ts` parser changes without burning provider budget.
- **Regression harness** — replay the same trace before and after a
charter / parser change; diff the daemon events the UI surfaces.
- **Demo / onboarding** — show what a 17-tool claude editing session
looks like end-to-end, offline.
## How
- 6 bash wrappers (`mocks/bin/`) shadow the real CLIs when PATH-overlaid.
- `mocks/mock-agent.mjs` reads `mocks/recordings/<trace>.jsonl`, picks
one via env var (`SYNCLO_EXPLORE_MOCK_TRACE` / `_POOL` /
`_BY_PROMPT_HASH`), streams the trace in the requested format.
- Each format renderer matches the EXACT JSON shape the OD daemon
parser expects, verified line-by-line against
`apps/daemon/src/{json-event-stream,claude-stream}.ts`:
| CLI | streamFormat | parser source |
| ------------------------- | ------------------------- | ------------------------------------------ |
| `opencode` | `json-event-stream` | `handleOpenCodeEvent` |
| `codex` | `json-event-stream` | `handleCodexEvent` |
| `claude` | `claude-stream-json` | `createClaudeStreamHandler` |
| `deepseek` `qwen` `grok` | `plain` | `server.ts` (raw stdout) |
## Quick start
```bash
export PATH="$PWD/mocks/bin:$PATH"
export SYNCLO_EXPLORE_MOCK_TRACE=04097377 # 8-char prefix OK
export SYNCLO_EXPLORE_MOCK_NO_DELAY=1
echo "any prompt" | opencode run
echo "any prompt" | claude -p --output-format=stream-json
echo "any prompt" | codex exec
```
The mock binary announces the picked trace id on stderr:
`[mock-opencode] picked 04097377… via fixed`.
Recording selection (env, in priority order):
- `SYNCLO_EXPLORE_MOCK_TRACE=<id>` — fixed (prefix OK)
- `SYNCLO_EXPLORE_MOCK_BY_PROMPT_HASH=1` + stdin prompt — `sha256(prompt) % N`
- `SYNCLO_EXPLORE_MOCK_POOL=<tag>` — random within `agent:claude` /
`skill:agent-browser` / `outcome:failed` / etc.
- (default) uniform random
- `SYNCLO_EXPLORE_MOCK_SEED=<str>` — reproducible "random"
- `SYNCLO_EXPLORE_MOCK_NO_DELAY=1` — skip inter-event waits
## Dataset
179 anonymized Langfuse traces from this project's own production
telemetry:
- 9 agents: claude 57 · opencode 41 · codex 38 · gemini 25 ·
cursor-agent 11 · qwen 2 · copilot 2 · deepseek 2 · antigravity 1
- outcomes: succeeded 144 · failed 35
- skills: default 71 · ad-creative 50 · algorithmic-art 30 ·
agent-browser 22 · video-hyperframes 2 · plus magazine-web-ppt /
brainstorming / data-report / penpot-flutter-design-source 1 each
- 124 multi-turn (sessions with ≥2 turns)
- 18 produce `<artifact>` output
- ~4.5 MB on disk total
Anonymization: `/Users/<name>/` → `${HOME}/`,
`C:\Users\<name>\` → `%USERPROFILE%\`, project UUIDs →
stable `proj-001`, `proj-002`, …. Tool input/output payloads
preserved verbatim (templated UI, no cell-level PII).
## Smoke test
`bash mocks/scripts/smoke-test.sh` — 6 checks across all 6 agents.
All pass on this branch (verified locally):
```
✓ opencode first event = step_start
✓ codex first event = thread.started
✓ claude first event = system
✓ deepseek emitted plain text (144 chars on first line)
✓ qwen emitted plain text (144 chars on first line)
✓ grok emitted plain text (144 chars on first line)
All mock CLIs working. ✅
```
## Adding more recordings
The exporter that produced this set lives in
[nexu-io/agent-pr-explore](https://github.com/nexu-io/agent-pr-explore)
(see `cli/src/local/orchestrator/langfuse-import.ts` + the `local
langfuse-import` CLI command). Operators with the Langfuse keys can pull
more by tag / outcome / artifact / multi-turn filter, then run
`local recordings anonymize --out-dir ~/Documents/open-design/mocks/recordings`.
`mocks/README.md` has the full instructions.
## Out of scope (follow-ups)
- **ACP agents** (`devin`, `hermes`, `kilo`, `kimi`, `kiro`, `vibe`) need
a JSON-RPC server on stdio rather than a one-shot stream — separate
`format-acp.mjs` module not yet written.
- **Per-agent json-event-stream variants** (`cursor-agent`, `gemini`,
`qoder`, `copilot`, `pi`) currently fall back to the `plain` renderer;
their parsers are in `apps/daemon/src/json-event-stream.ts` and follow
the same template as `format-codex.mjs`.
## AGENTS.md updates
- Added `mocks/` to the top-level content directories listing
- Added a Validation strategy bullet pointing here for agent-stream /
parser changes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(mocks): add opencode-cli/kiro-cli/vibe-acp bin aliases and unref ACP timeout
- Add mocks/bin/opencode-cli, kiro-cli, vibe-acp wrappers for the primary
RuntimeAgentDef bin names OD resolves before any fallback. Without these,
a PATH-overlaid OD daemon run bypasses the mock entirely (opencode-cli,
kiro-cli) or cannot find the mock at all (vibe-acp, which has no fallback).
- Include opencode-cli, kiro-cli, vibe-acp in the smoke-test ACP/JSON loop
so coverage is verified end-to-end.
- Call .unref() on the 30s safety timeout in format-acp.mjs so a completed
ACP session exits promptly instead of waiting the full 30 seconds.
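The `.unref()` change can be illustrated with a minimal sketch. This is not the actual `format-acp.mjs` code: `armSafetyTimeout` and the exit callback are hypothetical names, and only the 30-second figure comes from the commit text.

```javascript
// Hypothetical stand-in for the 30s safety timeout described above.
const SAFETY_TIMEOUT_MS = 30_000;

function armSafetyTimeout(onTimeout) {
  const timer = setTimeout(onTimeout, SAFETY_TIMEOUT_MS);
  // unref() tells Node not to keep the event loop alive for this timer alone,
  // so the process can exit as soon as all other work has finished.
  timer.unref();
  return timer;
}

const safetyTimer = armSafetyTimeout(() => process.exit(1));
console.log(safetyTimer.hasRef()); // false
```

Without the `unref()` call the pending timer would hold the process open until it fired, which is the 30-second wait the fix removes.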
Generated-By: looper 0.9.2 (runner=fixer, agent=claude-code)
* feat(mocks): add vela (AMR) — login / models / ACP with strict set_model gate
Extends mocks/ to cover OD's own AMR runtime. `vela` is the bin name
`apps/daemon/src/runtimes/defs/amr.ts` specifies (`bin: 'vela'`,
`streamFormat: 'acp-json-rpc'`). It's richer than the generic ACP
agents — covers full login + models + chat-session lifecycle.
### What vela does (mirrored from apps/daemon/tests/fixtures/fake-vela.mjs)
1. `vela login` — writes ~/.amr/config.json with a fake profile (controlKey,
runtimeKey, user{email,name,plan}, profile-specific apiUrl/linkUrl).
The on-disk projection is what OD's daemon login route + AmrLoginPill
poller read; production goes through device-auth, the mock skips
straight to the file write.
2. `vela models` — prints the production-shaped public model catalog as
newline-separated `public_model_* vela` lines. Override via
FAKE_VELA_MODELS env.
3. `vela agent run --runtime opencode` — ACP JSON-RPC server with three
vela-specific protocol extensions:
a. `initialize` response carries `agentCapabilities`
(`promptCapabilities.embeddedContext`) + `models`
(`currentModelId` + `availableModels`).
b. `session/new` response carries the same `models` block.
c. **Strict set_model gate**: `session/prompt` is rejected with
JSON-RPC -32602 ("session/set_model must be called before
session/prompt") UNLESS `session/set_model` (or
`session/set_config_option`) has been called for the current
sessionId. Mirrors real vela 0.0.1 contract; catches regressions
in `attachAcpSession` that silently skip set_model.
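The strict gate in (c) might look roughly like the sketch below. It is an illustration under assumptions, not the contents of `format-vela.mjs`: `createVelaGate`, the empty `result` objects, and the `stopReason` placeholder are hypothetical, while the method names, the -32602 code, the error message, and `FAKE_VELA_REQUIRE_SET_MODEL` come from this commit message.

```javascript
// Sketch of a per-session set_model gate for a JSON-RPC request handler.
function createVelaGate(env = process.env) {
  // FAKE_VELA_REQUIRE_SET_MODEL='0' disables the strict gate (legacy mode).
  const requireSetModel = env.FAKE_VELA_REQUIRE_SET_MODEL !== '0';
  // Session ids for which set_model / set_config_option has been seen.
  const modelSelected = new Set();

  return function handle(msg) {
    const { id, method, params = {} } = msg;
    const sessionId = params.sessionId;

    if (method === 'session/set_model' || method === 'session/set_config_option') {
      modelSelected.add(sessionId);
      return { jsonrpc: '2.0', id, result: {} }; // result shape is assumed
    }
    if (method === 'session/prompt') {
      if (requireSetModel && !modelSelected.has(sessionId)) {
        return {
          jsonrpc: '2.0',
          id,
          error: {
            code: -32602,
            message: 'session/set_model must be called before session/prompt',
          },
        };
      }
      return { jsonrpc: '2.0', id, result: { stopReason: 'end_turn' } }; // placeholder
    }
    return { jsonrpc: '2.0', id, error: { code: -32601, message: `Method not found: ${method}` } };
  };
}
```

Tracking the gate per `sessionId` rather than per process is what lets a test open a second session and confirm that it is rejected until its own `session/set_model` arrives.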
### Error injection envs (in sync with fake-vela.mjs)
FAKE_VELA_SESSION_ID - sessionId returned by session/new
FAKE_VELA_TEXT - override assistant text
FAKE_VELA_THOUGHT - optional thought_chunk before text
FAKE_VELA_SESSION_NEW_ERROR - fail session/new
FAKE_VELA_SET_MODEL_ERROR - fail session/set_model
FAKE_VELA_PROMPT_ERROR - fail session/prompt
FAKE_VELA_REQUIRE_SET_MODEL='0' - disable the strict gate (legacy)
FAKE_VELA_LOGIN_USER_EMAIL - email written into config profile
FAKE_VELA_LOGIN_USER_PLAN - plan written into config profile
FAKE_VELA_LOGIN_DELAY_MS - sleep before write (test in-flight)
FAKE_VELA_LOGIN_FAIL - print + exit 1
FAKE_VELA_MODELS - override models stdout
VELA_PROFILE - profile slot (prod | test | local)
### Components
`mocks/lib/format-vela.mjs` (~205 LOC)
- Full ACP server with vela protocol extensions
- Strict set_model gate
- Error injection plumbing
`mocks/lib/vela-subcommands.mjs` (~90 LOC)
- runVelaLogin() — writes ~/.amr/config.json
- runVelaModels() — prints catalog
`mocks/bin/vela` — dispatcher wrapper. Forwards `vela <subcmd>` to
mock-agent.mjs which routes to login/models or falls through to ACP.
`mocks/mock-agent.mjs` — parseArgs now collects positionals so the vela
dispatcher can read subcommand from there; switch case added for vela.
`mocks/scripts/smoke-test.sh` — +4 assertions:
vela models prints ≥10 catalog lines
vela login writes ~/.amr/config.json with the requested email
vela agent run ACP roundtrip (initialize+models+set_model+stream+result)
vela strict set_model gate rejects prompt without prior set_model
### Verified locally
✓ vela models printed 15 catalog lines
✓ vela login wrote ~/.amr/config.json with profile.prod.user.email
✓ vela agent run ACP roundtrip (initialize+models, set_model accepted, prompt streamed)
✓ vela strict set_model gate rejects session/prompt without prior set_model
All 21 smoke checks pass (up from 17 with previous P3 ACP commit).
### AGENTS.md + README updates
AGENTS.md — mention `vela (AMR — vela CLI)` alongside ACP agents in
the directory listing entry.
mocks/README.md — protocol table row + dedicated vela section with
subcommand contract, strict gate explanation, env-injection cheat
sheet. Mock-tree listing updated.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(mocks): honor REPORT_FILE env when --report-file flag not given
Some harnesses spawn the mock without translating their report-path
contract into the mock's CLI flag. Notably, the orchestrator in
nexu-io/agent-pr-explore passes REPORT_FILE as env, per the existing
opencode/claude/codex agent launchers. In that case no report file was
written, so the harness's "agent exit 0 but produced no report" check
always fired and marked mock runs as failures even though the stdout
stream was complete.
Fix: in mock-agent.mjs parseArgs, fall through to process.env.REPORT_FILE
when --report-file wasn't provided on argv. Each format renderer already
accepts opts.reportFile and writes the recording's final assistant text
to it (`format-*.mjs` already had this — only the wiring was missing).
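A minimal sketch of the fallback, assuming a hand-rolled argv loop. `parseReportFile` is a hypothetical name, and the `--report-file=<path>` spelling is an assumption; the real `parseArgs` in `mock-agent.mjs` handles more flags than shown here.

```javascript
// Resolve the report path: an explicit flag wins, REPORT_FILE env is the fallback.
function parseReportFile(argv, env) {
  let reportFile;
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === '--report-file') {
      reportFile = argv[++i]; // value is the next argv entry
    } else if (arg.startsWith('--report-file=')) {
      reportFile = arg.slice('--report-file='.length);
    }
  }
  // Fall through to the environment only when the flag was absent.
  if (reportFile === undefined && env.REPORT_FILE) {
    reportFile = env.REPORT_FILE;
  }
  return reportFile;
}

console.log(parseReportFile(['--as', 'opencode'], { REPORT_FILE: '/tmp/plan.md' })); // /tmp/plan.md
```

Keeping the flag ahead of the environment means existing callers that already pass `--report-file` see no behavior change.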
Verified: synclo-explore run with `mock=true, mock_trace=04097377`
against the opencode wrapper now produces a plan.md with the recording's
17-tool claude editing session report. ~1.5s per run vs ~70s real opencode.
* mocks: move recordings to Cloudflare R2; PR→main→Action upload path
The 179-recording corpus (~4.5 MB raw, ~280 KB after compression) has
been moved off git into Cloudflare R2 at the bucket open-design-mocks
under recordings/v1/. The repo now ships:
- mocks/manifest.json — the canonical catalog (renamed from
recordings/index.json) with sha256 + storage hints; consumers
fetch this to discover what exists, then pull individual jsonl
files on demand
- mocks/scripts/fetch-recordings.sh — parallel, sha256-verified,
idempotent puller for the public r2.dev URL
- mocks/scripts/add-recording.sh — local maintainer helper that
validates a new .jsonl and copies it into recordings-staging/
(no R2 calls; no credentials needed)
- mocks/scripts/upload-to-r2.mjs — called only by the CI workflow
- mocks/scripts/lib/manifest-utils.mjs — shared sha256/meta/
rebuild-histograms logic, used by both add-recording (preview)
and upload-to-r2 (actual write) so the entry shape never drifts
- .github/workflows/sync-mocks-to-r2.yml — fires on push to main
when mocks/recordings-staging/ changes; uploads to R2, updates
manifest, commits cleanup back; serialized via concurrency group
Trust model: R2 write credentials (CLOUDFLARE_API_TOKEN,
CLOUDFLARE_ACCOUNT_ID) are repo secrets; nobody can push from a
laptop. Read stays public via the r2.dev URL.
Why not pnpm install integration: contributors who do not touch
agent code do not pay the fetch cost. Fetch happens on first
smoke-test run (auto-fallback) or when a mock spawn needs data.
Repo size: -4.55 MB net (delete 179 jsonl, +280 KB manifest +
scripts). Smoke test (21 checks) still green against the fetched
corpus.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mocks: scope R2 write token to a dedicated secret name
Use CLOUDFLARE_R2_MOCKS_TOKEN (instead of reusing the shared
CLOUDFLARE_API_TOKEN that landing-page-*.yml uses for Pages deploys)
so the R2 write capability can be scoped to just the
open-design-mocks bucket without bleeding extra capability into the
Pages workflows.
Also hardcode the powerformer CF account_id directly in the workflow
(account IDs are not secret and the shared CLOUDFLARE_ACCOUNT_ID
secret may point at a different account).
Workflow now fails fast with an actionable error message + dashboard
link if the secret is unset.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mocks: switch R2 sync to S3-compat API (wrangler getMemberships gate)
wrangler 4.x calls /memberships before any r2 action, requiring
user:read scope. R2 "Object Read & Write" tokens deliberately lack
that scope (defense in depth — a leaked token should not enumerate
account-level resources). The workflow now uses the aws CLI talking
straight to the R2 S3-compatible endpoint with SigV4, no membership
lookup.
Secret rotation: CLOUDFLARE_R2_MOCKS_TOKEN (Bearer) is replaced by
CLOUDFLARE_R2_MOCKS_AK / CLOUDFLARE_R2_MOCKS_SK (matching the
existing CLOUDFLARE_R2_RELEASES_AK/SK naming convention). End-to-end
tested locally: PUT recording → manifest rebuild → manifest PUT →
staging cleanup all green.
aws CLI is pre-installed on ubuntu-latest, so no install step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mocks: scrub synclo namespace; use OD_MOCKS_* env prefix throughout
These mocks were copy-pasted from synclo-explore, where they
originated, and inherited the SYNCLO_EXPLORE_MOCK_* env-var
convention. That brand-bleed is not appropriate in OD: rename the
public env surface to OD_MOCKS_* (matching OD-native prefixes like
OD_MOCKS_CACHE_DIR, OD_TRACE_R2_UPLOAD, OD_EXPECT_TIMEOUT_SECONDS).
Renames:
SYNCLO_EXPLORE_MOCK_TRACE → OD_MOCKS_TRACE
SYNCLO_EXPLORE_MOCK_BY_PROMPT_HASH → OD_MOCKS_BY_PROMPT_HASH
SYNCLO_EXPLORE_MOCK_POOL → OD_MOCKS_POOL
SYNCLO_EXPLORE_MOCK_SEED → OD_MOCKS_SEED
SYNCLO_EXPLORE_MOCK_NO_DELAY → OD_MOCKS_NO_DELAY
SYNCLO_EXPLORE_MOCK_RECORDINGS_DIR → OD_MOCKS_RECORDINGS_DIR
SYNCLO_EXPLORE_MOCK_SMOKE_TRACE → OD_MOCKS_SMOKE_TRACE
SYNCLO_OD_MOCKS_I_KNOW_WHAT_IM_DOING → OD_MOCKS_ALLOW_LOCAL_UPLOAD
Also drop the inline harvester usage from README. The harvester is an
external CLI in nexu-io/agent-pr-explore — its README is the right
place for langfuse-import flags, anonymization options, etc. OD only
documents its own staging→PR→Action workflow.
Smoke test (21 checks) still green; OD_MOCKS_TRACE end-to-end
verified to route correctly.
Consumers of the OLD env names (notably the orchestrator in
nexu-io/agent-pr-explore) need a matching rename. No back-compat
shim here — the explore side has zero external users today and a
one-line follow-up is cleaner than a permanent deprecation layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* AGENTS.md: align mock env names with mocks/ rename (SYNCLO_* → OD_MOCKS_*)
Missed in the prior commit (
mocks/ — replay-based mock CLIs for OD's supported agents
A drop-in replacement for the real agent CLIs (claude, opencode,
codex, gemini, cursor-agent, deepseek, qwen, grok, the
ACP family devin / hermes / kilo / kimi / kiro / vibe, and
the AMR vela CLI) that replays pre-recorded sessions in each CLI's
native protocol — stdout streaming for most, JSON-RPC over stdio for
ACP and AMR. Zero LLM tokens.
Used by:
- E2E tests in apps/daemon/tests/: run the full chat-server pipeline against a known agent trace, assert UI events / artifacts.
- Local self-tests during development: iterate on chat-routes.ts, claude-stream.ts, json-event-stream.ts parser changes without burning provider budget.
- Demo / onboarding: show what a 17-tool claude editing session looks like end-to-end, offline.
- Regression harness: replay the same trace before and after a charter / parser change; diff the events the daemon surfaces.
The recordings are anonymized exports from open-design's Langfuse project (179 traces across 9 agents and 5+ skills as of this commit).
tl;dr
# First-time setup — pull the recording corpus from R2 (~30s, 4.5MB):
bash mocks/scripts/fetch-recordings.sh
# Subsequent runs hit the local cache (sha256-verified, instant).
# Make the mock CLIs override the real ones for this shell:
export PATH="$PWD/mocks/bin:$PATH"
# Pick any recording to play back (8-char prefix OK):
export OD_MOCKS_TRACE=04097377
# Speed up replay (skip inter-event sleeps):
export OD_MOCKS_NO_DELAY=1
# Now anything that spawns opencode/claude/codex gets the recording:
echo "any prompt body" | opencode run
echo "any prompt" | claude -p --output-format=stream-json
echo "any prompt" | codex exec
The mock binaries are bash wrappers that exec
node mocks/mock-agent.mjs --as <agent>. Anything fed to stdin is
discarded by the renderer but used by the recording picker (see hash
mode below).
Recordings live on R2, not in this repo
The 179-recording corpus (~4.5 MB) is hosted on Cloudflare R2 at
open-design-mocks and fetched on demand — pnpm install does NOT
pull them, and the repo stays small. Recordings only land in
mocks/recordings/ when:
- You run bash mocks/scripts/fetch-recordings.sh directly, OR
- bash mocks/scripts/smoke-test.sh runs and the dir is empty (auto-fetch fallback), OR
- A mock binary spawn finds no data: it errors with a pointer at the fetch script (no silent failure).
This is by design: contributors who don't touch agent code don't pay
the fetch cost. CI jobs that DO touch agent code (apps/daemon/tests/
parser changes, etc.) run the fetch as a quick pre-step and cache
mocks/recordings/ between runs.
# Fetch everything (parallel, sha256-verified, idempotent):
bash mocks/scripts/fetch-recordings.sh
# Fetch a subset:
bash mocks/scripts/fetch-recordings.sh --agent claude # 57 claude traces
bash mocks/scripts/fetch-recordings.sh --outcome failed # 35 failed-path traces
bash mocks/scripts/fetch-recordings.sh --skill agent-browser
# Override cache location (e.g. share across multiple OD checkouts):
OD_MOCKS_CACHE_DIR=~/.cache/od-mocks bash mocks/scripts/fetch-recordings.sh
Manifest at mocks/manifest.json is the committed source of truth —
it lists every recording's trace_id, sha256, bytes, agent,
outcome, skills, multi_turn, plus histograms over the corpus.
Tooling reads this; you don't have to.
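For readers curious what "sha256-verified" amounts to, here is a minimal sketch of checking one downloaded recording against its manifest entry. `verifyRecording` is a hypothetical helper, and hex encoding of the `sha256` field is an assumption; only the `sha256` and `bytes` field names come from the manifest description above.

```javascript
const { createHash } = require('node:crypto');

// Returns true when the downloaded bytes match the manifest entry.
function verifyRecording(buffer, entry) {
  const digest = createHash('sha256').update(buffer).digest('hex');
  // Check the cheap size comparison first, then the content hash.
  return buffer.length === entry.bytes && digest === entry.sha256;
}

// Usage with an in-memory stand-in for a fetched .jsonl file:
const body = Buffer.from('{"type":"meta"}\n');
const entry = {
  bytes: body.length,
  sha256: createHash('sha256').update(body).digest('hex'),
};
console.log(verifyRecording(body, entry)); // true
```

A check like this is also what makes a fetch idempotent: a cached file that still verifies does not need to be downloaded again.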
Provenance per recording
Beyond identity (trace_id, sha256), each manifest entry carries
fixture-trust signals so consumers can decide whether the recording
is still meaningful as the real CLIs evolve:
| Field | Meaning |
|---|---|
| captured_at | ISO 8601 timestamp of the original session; populated for all 179 current entries |
| cli_version | The CLI version the trace was captured against (e.g. "claude-code 1.0.65"); populated only on traces the harvester writes it to, null otherwise |
| protocol_version | Stream-format version ("claude-stream-json/v1", "opencode/json-event-stream"); populated by harvester |
| anonymization_version | Which anonymizer pass scrubbed the recording; populated by harvester |
Apart from captured_at, these fields are mostly null on the existing
179 recordings; teaching the harvester in nexu-io/agent-pr-explore to
write them is the next step. Once a recording's cli_version falls
behind the actual CLI by more than one minor version, treat it as a
candidate for re-harvest.
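The re-harvest rule could be checked mechanically along these lines. This is a sketch under assumptions: `needsReharvest` is a hypothetical helper, the version is taken to be the first `major.minor.patch` triple in the string, and treating a newer major version as stale is my reading rather than something the rule states.

```javascript
// Extract the first major.minor.patch triple, e.g. from "claude-code 1.0.65".
function parseVersion(text) {
  const m = /(\d+)\.(\d+)\.(\d+)/.exec(text ?? '');
  return m ? { major: Number(m[1]), minor: Number(m[2]), patch: Number(m[3]) } : null;
}

// true  -> candidate for re-harvest
// false -> still within one minor version
// null  -> unknown (cli_version missing or unparseable)
function needsReharvest(recordedCliVersion, liveCliVersion) {
  const recorded = parseVersion(recordedCliVersion);
  const live = parseVersion(liveCliVersion);
  if (!recorded || !live) return null;
  if (live.major !== recorded.major) return live.major > recorded.major; // assumption
  return live.minor - recorded.minor > 1;
}

console.log(needsReharvest('claude-code 1.0.65', 'claude-code 1.2.0')); // true
```

Returning `null` for missing versions keeps the many entries whose cli_version is still null from being flagged as stale by default.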
Golden daemon-event snapshots
mocks/golden/<trace>.events.json holds the exact event sequence the
OD daemon emits when fed each (mock CLI → handler) pipeline. Diffed
on every pnpm --filter @open-design/daemon test run by
apps/daemon/tests/mocks-golden.test.ts.
A parser refactor that semantically changes events (drops a field,
renames sessionId, stops emitting turn_end) fails the diff loudly.
After an intentional parser change, regenerate:
MOCKS_GOLDEN_UPDATE=1 pnpm --filter @open-design/daemon test mocks-golden
git diff mocks/golden/ # eyeball the new shapes
git add mocks/golden/ && git commit -m "mocks: refresh goldens for <parser change>"
Per-spawn volatile fields (currently just claude's generated
sessionId) are stripped to "<normalized>" so the snapshot stays
stable. See mocks/golden/README.md for the coverage rationale.
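The normalization step might be sketched as follows. `normalizeEvent` and `VOLATILE_KEYS` are hypothetical names, and replacing the key at any nesting depth is an assumption; the source only states that claude's generated sessionId is replaced with "<normalized>".

```javascript
// Keys whose values change on every spawn and must not appear in snapshots.
const VOLATILE_KEYS = new Set(['sessionId']);

// Recursively copy an event, replacing volatile values with a fixed marker.
function normalizeEvent(value) {
  if (Array.isArray(value)) return value.map(normalizeEvent);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) => [
        key,
        VOLATILE_KEYS.has(key) ? '<normalized>' : normalizeEvent(inner),
      ]),
    );
  }
  return value; // primitives pass through unchanged
}

console.log(JSON.stringify(normalizeEvent({ type: 'system', sessionId: 'a1b2' })));
// {"type":"system","sessionId":"<normalized>"}
```

Keeping the key present, with only its value replaced, means a parser change that drops or renames sessionId still shows up in the golden diff.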
Real-CLI contract check
The mocks catch parser regressions against the recordings; they do
not catch the recordings themselves drifting away from the live
agent CLIs. For that, mocks/scripts/contract-check.sh spawns a real
CLI alongside the mock with a fixed prompt and prints a side-by-side
event-type distribution.
This is human-driven and costs real LLM tokens — run on a real-CLI
release or before a parser refactor, not on a cron. Full doc:
docs/MOCKS-CONTRACT-CHECK.md.
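To make "event-type distribution" concrete, the comparison can be sketched as below. This is not the contents of contract-check.sh: `eventTypeCounts` and `compareDistributions` are hypothetical helpers, and the sketch assumes both streams are newline-delimited JSON whose events carry a `type` field.

```javascript
// Count events per `type` in a newline-delimited JSON stream.
function eventTypeCounts(ndjson) {
  const counts = {};
  for (const line of ndjson.split('\n')) {
    if (line.trim() === '') continue;
    const { type = '(none)' } = JSON.parse(line);
    counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}

// Build one row per event type seen in either stream, sorted by type name.
function compareDistributions(realNdjson, mockNdjson) {
  const real = eventTypeCounts(realNdjson);
  const mock = eventTypeCounts(mockNdjson);
  const types = [...new Set([...Object.keys(real), ...Object.keys(mock)])].sort();
  return types.map((type) => ({ type, real: real[type] || 0, mock: mock[type] || 0 }));
}
```

A type that appears in only one column is the kind of drift the check is meant to surface; differing counts for the same type are expected, since the real CLI answers a live prompt.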
What gets emitted
Each renderer matches the EXACT event shapes the OD daemon expects, as
verified line-by-line against the parsers in apps/daemon/src/:
| CLI | OD streamFormat | Parser source |
|---|---|---|
| opencode | json-event-stream (opencode kind) | json-event-stream.ts:handleOpenCodeEvent |
| codex | json-event-stream (codex kind) | json-event-stream.ts:handleCodexEvent |
| claude | claude-stream-json | claude-stream.ts:createClaudeStreamHandler |
| gemini | json-event-stream (gemini kind) | json-event-stream.ts:handleGeminiEvent |
| cursor-agent | json-event-stream (cursor-agent kind) | json-event-stream.ts:handleCursorEvent |
| deepseek qwen grok | plain | server.ts (raw stdout = final assistant text) |
| devin hermes kilo kimi kiro vibe | acp-json-rpc | acp.ts:attachAcpSession |
| vela (AMR) | acp-json-rpc + login / models subcommands | runtimes/defs/amr.ts + apps/daemon/tests/fixtures/fake-vela.mjs (sibling stub) |
Note on gemini and cursor-agent: OD's parsers for these two agents do NOT recognize tool-call events, only init / assistant text / usage. The renderers therefore emit ONLY the final assistant text wrapped in the expected init/text/usage envelope. Tool calls present in the source recording are silently dropped (which matches the real CLI's UI behavior: these agents don't surface tools in OD's chat view).
Note on ACP agents (devin / hermes / kilo / kimi / kiro / vibe): These do NOT stream stdout; they speak JSON-RPC 2.0 over stdio. OD's daemon sends initialize → session/new → (optional session/set_model) → session/prompt; the mock responds in order, streams text via session/update notifications carrying agent_message_chunk parts, then responds to the prompt request with usage stats. Tool calls aren't part of the ACP protocol on this path (tools surface via MCP or other side channels), so they're dropped from playback.
Note on vela (AMR): vela is the bin OD's AMR runtime spawns. It extends the generic ACP shape with agentCapabilities + models blocks in initialize / session/new, plus a strict set_model gate: session/prompt is rejected with -32602 until session/set_model (or session/set_config_option) has been called for the current sessionId, mirroring the real vela 0.0.1 contract.
vela also has two non-ACP subcommands:
- vela login → writes ~/.amr/config.json with a fake profile so OD's daemon login route + AmrLoginPill poller see the same on-disk projection production produces.
- vela models → prints the production-shaped public_model_* vela catalog.
Error injection envs (kept in sync with apps/daemon/tests/fixtures/fake-vela.mjs): FAKE_VELA_SESSION_NEW_ERROR / FAKE_VELA_SET_MODEL_ERROR / FAKE_VELA_PROMPT_ERROR / FAKE_VELA_LOGIN_FAIL / FAKE_VELA_REQUIRE_SET_MODEL=0.
For the formats that carry tool events, each tool call from the recording is rendered with the original input arguments and tool output. The agent's assistant text is rendered as the final message.
Recording selection
Driven by env vars, in priority order:
| Env | Behavior |
|---|---|
| OD_MOCKS_TRACE=<id> | Always play this trace. 8-char prefix OK. |
| OD_MOCKS_BY_PROMPT_HASH=1 + stdin prompt | Deterministic by sha256(prompt) % len(all). Same prompt → same trace. Useful for "stable answer per question" tests. |
| OD_MOCKS_POOL=<tag> | Random within the tag pool. Examples: agent:claude, skill:agent-browser, outcome:failed. |
| OD_MOCKS_SEED=<str> | Makes "random" picks reproducible across runs. |
| OD_MOCKS_NO_DELAY=1 | Skip inter-event waits. |
| OD_MOCKS_RECORDINGS_DIR=<path> | Override the recordings dir. |
If none are set, a uniformly random recording is played each invocation.
The mock binary announces the picked trace id on stderr:
[mock-opencode] picked 04097377… via fixed
This line is invisible to OD's stdout parser but useful for "wait, why did my test get the FAQ-fix trace?" debugging.
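The priority order can be summarized in a short sketch. It is not the code in lib/recording-picker.mjs: `pickRecording` is a hypothetical function, the `tags` array on each entry and the `via` labels other than "fixed" are assumptions, the way the sha256 digest is reduced modulo the corpus size is one possible reading, and OD_MOCKS_SEED is left out (a caller would pass a seeded `random` instead).

```javascript
const { createHash } = require('node:crypto');

// entries: [{ trace_id, tags }]; env: plain object; prompt: text read from stdin.
function pickRecording(entries, env, prompt, random = Math.random) {
  // 1. Fixed trace id (prefix match).
  if (env.OD_MOCKS_TRACE) {
    const hit = entries.find((e) => e.trace_id.startsWith(env.OD_MOCKS_TRACE));
    if (!hit) throw new Error(`no recording matches ${env.OD_MOCKS_TRACE}`);
    return { entry: hit, via: 'fixed' };
  }
  // 2. Deterministic by prompt hash: sha256(prompt) % len(all).
  if (env.OD_MOCKS_BY_PROMPT_HASH === '1') {
    const digest = createHash('sha256').update(prompt).digest('hex');
    const index = Number(BigInt('0x' + digest) % BigInt(entries.length));
    return { entry: entries[index], via: 'hash' };
  }
  // 3. Random within a tag pool, else 4. uniform random over everything.
  let pool = entries;
  let via = 'random';
  if (env.OD_MOCKS_POOL) {
    pool = entries.filter((e) => e.tags.includes(env.OD_MOCKS_POOL));
    via = 'pool';
  }
  if (pool.length === 0) throw new Error('no recordings available for this selection');
  return { entry: pool[Math.floor(random() * pool.length)], via };
}
```

Passing `random` as a parameter keeps the sketch testable; a seeded generator substituted there would give the reproducible picks that OD_MOCKS_SEED promises.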
Recording catalog
The recordings live as one JSONL file per Langfuse trace under
recordings/. Each file starts with a meta event carrying:
{
  "type": "meta",
  "source": {"provider": "langfuse", "trace_id": "...", "project_id": "..."},
  "agent": "claude" | "codex" | "opencode" | "gemini" | "cursor-agent" | "qwen" | "copilot" | "deepseek" | "antigravity",
  "model": "...",
  "outcome": "succeeded" | "failed" | "errored" | "interrupted",
  "duration_ms": 33620,
  "tool_call_count": 17,
  "error_count": 0,
  "total_tokens": 12345,
  "tags": ["agent:claude", "skill:agent-browser", "open-design", ...],
  "user_input": "...",
  "session_id": "..."
}
Subsequent events are tool_call, tool_result, and report (the
final assistant text).
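Reading a recording back could look like the sketch below. `parseRecording` is a hypothetical helper that relies only on the `type` field named above; the payload fields of tool_call, tool_result, and report events are not specified here, so the sketch does not touch them.

```javascript
// Split a recording's JSONL text into its meta event and typed event lists.
function parseRecording(text) {
  const events = text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
  const [meta, ...rest] = events;
  if (!meta || meta.type !== 'meta') {
    throw new Error('recording must start with a meta event');
  }
  return {
    meta,
    toolCalls: rest.filter((e) => e.type === 'tool_call'),
    toolResults: rest.filter((e) => e.type === 'tool_result'),
    // If several report events exist, keep the last one as the final text.
    report: rest.filter((e) => e.type === 'report').pop(),
  };
}
```

Failing loudly when the first line is not a meta event catches truncated or misordered files before a renderer tries to replay them.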
Indexed metadata
mocks/manifest.json is a flat manifest with one entry per recording
plus histograms over all recordings, committed to the repo. It's also
mirrored to R2 alongside the .jsonl files so consumers can fetch the
current catalog without cloning. Query with jq:
# All multi-turn claude sessions (first 50 lines of output)
jq '.entries[] | select(.agent=="claude" and .multi_turn==true)' \
mocks/manifest.json | head -50
# Failed codex traces (negative-path tests)
jq '.entries[] | select(.agent=="codex" and .outcome=="failed") | .trace_id' \
mocks/manifest.json
# Agent-browser skill, sorted by tool count desc
jq '[.entries[] | select(.skills | index("agent-browser"))] | sort_by(-.tool_count)' \
mocks/manifest.json
Headline stats (current dataset)
| Dimension | Distribution |
|---|---|
| Agents | claude 57 · opencode 41 · codex 38 · gemini 25 · cursor-agent 11 · qwen/copilot/deepseek 2 each · antigravity 1 |
| Outcomes | succeeded 144 · failed 35 |
| Skills | default 71 · ad-creative 50 · algorithmic-art 30 · agent-browser 22 · video-hyperframes 2 · magazine-web-ppt / brainstorming / data-report / penpot-flutter 1 each |
| Multi-turn | 124 traces tied to a session with ≥2 turns |
| Artifact | 18 traces produce <artifact> output |
Anonymization
User-specific data has been scrubbed from every recording:
- /Users/<name>/…, /home/<name>/…, C:\Users\<name>\… → ${HOME}/… / %USERPROFILE%\…
- Project UUIDs → stable proj-001, proj-002, … per recording
- meta tag project:<uuid> rewritten too
The anonymizer is idempotent. Tool input/output payloads (HTML, code, etc.) are preserved verbatim — they're templated UI without cell-level PII; if a future audit finds otherwise, add specific scrubs in the harvester repo (see "Adding more recordings" below) and re-run.
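The path rewrite and its idempotence can be illustrated with a small sketch. This is not the harvester's anonymizer: `anonymizePaths` and its regular expressions are assumptions, it works on already-decoded strings (inside raw JSONL the backslashes would be JSON-escaped), and project-UUID renumbering is left out.

```javascript
// Replace user home prefixes with placeholders.
function anonymizePaths(text) {
  return text
    // /Users/<name>/ and /home/<name>/ → ${HOME}/
    .replace(/\/(?:Users|home)\/[^\/\s"]+\//g, () => '${HOME}/')
    // C:\Users\<name>\ → %USERPROFILE%\
    .replace(/C:\\Users\\[^\\\s"]+\\/g, () => '%USERPROFILE%\\');
}

const once = anonymizePaths('open /Users/alice/site/index.html');
console.log(once);                          // open ${HOME}/site/index.html
console.log(anonymizePaths(once) === once); // true
```

The second pass is a no-op because the placeholders no longer match either pattern, which is the property that makes re-running the anonymizer safe.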
Adding more recordings
Local maintainer flow — the .jsonl never enters the repo. Only the manifest delta (≈200 B per entry) gets committed.
Step 1 — produce an anonymized .jsonl
The harvester that produced the current 179-trace set lives in a
separate repo, nexu-io/agent-pr-explore. See its README
for how to authenticate against your trace store, filter by skill /
agent / outcome, and anonymize the result. Output is one
<trace-id>.jsonl file per recording.
Step 2 — one-shot upload + manifest update
# prereq, once: wrangler login (OAuth, no token to manage)
bash mocks/scripts/upload-recording.sh /path/to/<trace-id>.jsonl
The script validates the file, prints the manifest entry it will add,
uploads the .jsonl to R2, rewrites mocks/manifest.json locally, then
uploads the updated manifest to R2 too (so consumers see the new entry
without waiting for the next git push).
Step 3 — commit the manifest delta
git add mocks/manifest.json
git commit -m "mocks: add recording <trace-id>"
git push # or open a PR — your call
The only thing in the commit is a ~200-byte JSON edit listing the new
entry's trace_id, sha256, bytes, agent, outcome, skills,
etc. The .jsonl itself stays in R2.
Trust model
- R2 write is wrangler-OAuth gated. Maintainers do wrangler login once. The bucket is on the powerformer Cloudflare account (pinned in the script). No long-lived tokens in repo secrets, no Action to hijack, just account access.
- Repo stays small forever. No .jsonl files ever land in git; the manifest grows by ~200 B per recording.
- Read stays public. Anyone can fetch via the r2.dev URL; see Recordings live on R2, not in this repo.
Removing a recording
# 1. delete from R2
export CLOUDFLARE_ACCOUNT_ID=64ad4569ffd912432d6b86d5656484c4
wrangler r2 object delete open-design-mocks/recordings/v1/<trace-id>.jsonl --remote
# 2. drop the entry from manifest.json (edit by hand, or use `jq`)
# 3. re-upload manifest
wrangler r2 object put open-design-mocks/recordings/v1/manifest.json \
--file mocks/manifest.json --remote
# 4. git add mocks/manifest.json && git commit && git push
There's no automation for delete because (a) it's rare and (b) you
want a human to think about whether removing a recording would
invalidate any test fixtures that pin it via OD_MOCKS_TRACE=<id>.
Usage from OD's test code
From a test (Vitest / Jest)
import { spawn } from 'node:child_process';
import { delimiter, join } from 'node:path';
const MOCK_BIN = join(__dirname, '../../mocks/bin');
it('parses an opencode session with 4 tool calls into 4 UI events', async () => {
  const child = spawn('opencode', ['run'], {
    env: {
      ...process.env,
      // Prepend the mock dir so `opencode` resolves to the bash wrapper.
      PATH: `${MOCK_BIN}${delimiter}${process.env.PATH}`,
      OD_MOCKS_TRACE: '06a9324a', // 4-tool claude session
      OD_MOCKS_NO_DELAY: '1',
    },
    stdio: ['pipe', 'pipe', 'pipe'],
  });
  // The prompt is ignored by the renderer; stdin must still be closed.
  child.stdin.write('test prompt');
  child.stdin.end();
  // ... assert events parsed from child.stdout
});
From a manual playback
# See what claude's 17-tool "delete v2" session emits to OD:
export PATH=$(git rev-parse --show-toplevel)/mocks/bin:$PATH
export OD_MOCKS_TRACE=04097377
export OD_MOCKS_NO_DELAY=1
echo "anything" | claude -p --output-format=stream-json | jq .type | uniq -c
Files
mocks/
├── README.md ← you are here
├── mock-agent.mjs ← entry; routes --as <agent> to format renderer
├── lib/
│ ├── recording-picker.mjs ← env-driven trace selection
│ ├── format-opencode.mjs ← matches handleOpenCodeEvent
│ ├── format-codex.mjs ← matches handleCodexEvent
│ ├── format-claude.mjs ← matches createClaudeStreamHandler
│ ├── format-gemini.mjs ← matches handleGeminiEvent
│ ├── format-cursor-agent.mjs ← matches handleCursorEvent
│ ├── format-acp.mjs ← JSON-RPC server matching attachAcpSession
│ ├── format-vela.mjs ← AMR vela: ACP + models block + set_model gate
│ ├── vela-subcommands.mjs ← `vela login` + `vela models` handlers
│ └── format-plain.mjs ← raw stdout (deepseek/qwen/grok)
├── bin/
│ ├── opencode claude codex
│ ├── gemini cursor-agent
│ ├── deepseek qwen grok
│ ├── devin hermes kilo kimi kiro vibe
│ └── vela ← 15 bash wrappers, PATH-overlay
├── manifest.json ← committed: 179 entries' metadata + sha256 + provenance + R2 storage hints
├── golden/ ← committed: daemon-event regression snapshots
│ ├── README.md
│ └── *.events.json ← 3 representative traces (claude/codex/opencode)
├── scripts/
│ ├── smoke-test.sh ← 21 checks; auto-fetches recordings if empty
│ ├── fetch-recordings.sh ← pull from R2 (parallel, sha256-verified, idempotent)
│ ├── upload-recording.sh ← maintainer-local: validate + wrangler put + manifest update
│ ├── contract-check.sh ← real-CLI vs mock protocol drift check (manual)
│ └── lib/
│ └── manifest-utils.mjs ← shared sha256 / meta-parse / manifest-rebuild logic
└── recordings/ ← populated at runtime, gitignored .jsonl
└── .gitignore ← recordings come via fetch
No external dependencies. Pure node:fs/crypto/child_process. Works
under any Node ≥18.
Limitations
- copilot, qoder, pi (the niche copilot-stream-json / qoder-stream-json / pi-rpc formats) are recorded but not yet rendered as their native protocols; they fall back to the plain renderer for now. If you need them, add a format-<agent>.mjs following the same pattern as format-codex.mjs; the parsers are in apps/daemon/src/{copilot-stream,qoder-stream}.ts and the pi-rpc handler is inside apps/daemon/src/server.ts.
- The mock does not honor CLI flags that change semantics (--model, --permission-mode, --allowed-tools). They're silently ignored.
Provenance / safety
All recordings come from open-design's own Langfuse project (the
open-design project under the powerformer org). Users opted into
telemetry when they installed the desktop client. The anonymizer
removed user-identifying paths and project UUIDs before checking in.
If you find a recording that includes content that should be redacted, follow the Removing a recording flow above.