mirror of https://github.com/nexu-io/open-design.git synced 2026-06-01 03:14:35 +07:00

lefarcen a30b868a60 mocks: scrub synclo namespace; use OD_MOCKS_* env prefix throughout

These mocks were copy-pasted from synclo-explore, where they
originated, and inherited the SYNCLO_EXPLORE_MOCK_* env-var
convention. That brand-bleed is not appropriate in OD: rename the
public env surface to OD_MOCKS_* (matching OD-native prefixes like
OD_MOCKS_CACHE_DIR, OD_TRACE_R2_UPLOAD, OD_EXPECT_TIMEOUT_SECONDS).

Renames:
  SYNCLO_EXPLORE_MOCK_TRACE             → OD_MOCKS_TRACE
  SYNCLO_EXPLORE_MOCK_BY_PROMPT_HASH    → OD_MOCKS_BY_PROMPT_HASH
  SYNCLO_EXPLORE_MOCK_POOL              → OD_MOCKS_POOL
  SYNCLO_EXPLORE_MOCK_SEED              → OD_MOCKS_SEED
  SYNCLO_EXPLORE_MOCK_NO_DELAY          → OD_MOCKS_NO_DELAY
  SYNCLO_EXPLORE_MOCK_RECORDINGS_DIR    → OD_MOCKS_RECORDINGS_DIR
  SYNCLO_EXPLORE_MOCK_SMOKE_TRACE       → OD_MOCKS_SMOKE_TRACE
  SYNCLO_OD_MOCKS_I_KNOW_WHAT_IM_DOING  → OD_MOCKS_ALLOW_LOCAL_UPLOAD

Also drop the inline harvester usage from README. The harvester is an
external CLI in nexu-io/agent-pr-explore — its README is the right
place for langfuse-import flags, anonymization options, etc. OD only
documents its own staging→PR→Action workflow.

Smoke test (21 checks) still green; OD_MOCKS_TRACE end-to-end
verified to route correctly.

Consumers of the OLD env names (notably the orchestrator in
nexu-io/agent-pr-explore) need a matching rename. No back-compat
shim here — the explore side has zero external users today and a
one-line follow-up is cleaner than a permanent deprecation layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-29 13:20:35 +08:00

17 KiB

Raw Blame History

`mocks/` — replay-based mock CLIs for OD's supported agents

A drop-in replacement for the real agent CLIs (claude, opencode, codex, gemini, cursor-agent, deepseek, qwen, grok, the ACP family devin / hermes / kilo / kimi / kiro / vibe, and the AMR vela CLI) that replays pre-recorded sessions in each CLI's native protocol — stdout streaming for most, JSON-RPC over stdio for ACP and AMR. Zero LLM tokens.

Used by:

E2E tests in apps/daemon/tests/ — run the full chat-server pipeline against a known agent trace, assert UI events / artifacts.
Local self-tests during development — iterate on chat-routes.ts, claude-stream.ts, json-event-stream.ts parser changes without burning provider budget.
Demo / onboarding — show what a 17-tool claude editing session looks like end-to-end, offline.
Regression harness — replay the same trace before and after a charter / parser change; diff the events the daemon surfaces.

The recordings are anonymized exports from open-design's Langfuse project (179 traces across 9 agents and 5+ skills as of this commit).

tl;dr

# First-time setup — pull the recording corpus from R2 (~30s, 4.5MB):
bash mocks/scripts/fetch-recordings.sh
# Subsequent runs hit the local cache (sha256-verified, instant).

# Make the mock CLIs override the real ones for this shell:
export PATH="$PWD/mocks/bin:$PATH"

# Pick any recording to play back (8-char prefix OK):
export OD_MOCKS_TRACE=04097377

# Speed up replay (skip inter-event sleeps):
export OD_MOCKS_NO_DELAY=1

# Now anything that spawns opencode/claude/codex gets the recording:
echo "any prompt body" | opencode run
echo "any prompt"     | claude -p --output-format=stream-json
echo "any prompt"     | codex exec

The mock binaries are bash wrappers that exec node mocks/mock-agent.mjs --as <agent>. Anything fed to stdin is discarded by the renderer but used by the recording picker (see hash mode below).

Recordings live on R2, not in this repo

The 179-recording corpus (~4.5 MB) is hosted on Cloudflare R2 at open-design-mocks and fetched on demand — pnpm install does NOT pull them, and the repo stays small. Recordings only land in mocks/recordings/ when:

You run bash mocks/scripts/fetch-recordings.sh directly, OR
bash mocks/scripts/smoke-test.sh runs and the dir is empty (auto- fetch fallback), OR
A mock binary spawn finds no data — it errors with a pointer at the fetch script (no silent failure).

This is by design: contributors who don't touch agent code don't pay the fetch cost. CI jobs that DO touch agent code (apps/daemon/tests/ parser changes, etc.) run the fetch as a quick pre-step and cache mocks/recordings/ between runs.

# Fetch everything (parallel, sha256-verified, idempotent):
bash mocks/scripts/fetch-recordings.sh

# Fetch a subset:
bash mocks/scripts/fetch-recordings.sh --agent claude       # 57 claude traces
bash mocks/scripts/fetch-recordings.sh --outcome failed     # 35 failed-path traces
bash mocks/scripts/fetch-recordings.sh --skill agent-browser

# Override cache location (e.g. share across multiple OD checkouts):
OD_MOCKS_CACHE_DIR=~/.cache/od-mocks bash mocks/scripts/fetch-recordings.sh

Manifest at mocks/manifest.json is the committed source of truth — it lists every recording's trace_id, sha256, bytes, agent, outcome, skills, multi_turn, plus histograms over the corpus. Tooling reads this; you don't have to.

What gets emitted

Each renderer matches the EXACT event shapes the OD daemon expects, as verified line-by-line against the parsers in apps/daemon/src/:

CLI	OD streamFormat	Parser source
`opencode`	`json-event-stream` (opencode kind)	`json-event-stream.ts:handleOpenCodeEvent`
`codex`	`json-event-stream` (codex kind)	`json-event-stream.ts:handleCodexEvent`
`claude`	`claude-stream-json`	`claude-stream.ts:createClaudeStreamHandler`
`gemini`	`json-event-stream` (gemini kind)	`json-event-stream.ts:handleGeminiEvent`
`cursor-agent`	`json-event-stream` (cursor-agent kind)	`json-event-stream.ts:handleCursorEvent`
`deepseek` `qwen` `grok`	`plain`	`server.ts` (raw stdout = final assistant text)
`devin` `hermes` `kilo` `kimi` `kiro` `vibe`	`acp-json-rpc`	`acp.ts:attachAcpSession`
`vela` (AMR)	`acp-json-rpc` + `login` / `models` subcommands	`runtimes/defs/amr.ts` + `apps/daemon/tests/fixtures/fake-vela.mjs` (sibling stub)

Note on gemini and cursor-agent: OD's parsers for these two agents do NOT recognize tool-call events — only init / assistant text / usage. The renderers therefore emit ONLY the final assistant text wrapped in the expected init/text/usage envelope. Tool calls present in the source recording are silently dropped (which matches the real CLI's UI behavior — these agents don't surface tools in OD's chat view).

Note on ACP agents (devin / hermes / kilo / kimi / kiro / vibe): These do NOT stream stdout — they speak JSON-RPC v2 over stdio. OD's daemon sends initialize → session/new → (optional session/set_model) → session/prompt; the mock responds in order, streams text via session/update notifications carrying agent_message_chunk parts, then responds to the prompt request with usage stats. Tool calls aren't part of the ACP protocol on this path (tools surface via MCP or other side channels), so they're dropped from playback.

Note on vela (AMR): vela is the bin OD's AMR runtime spawns. It extends the generic ACP shape with agentCapabilities + models blocks in initialize / session/new, plus a strict set_model gate — session/prompt is rejected with -32602 until session/set_model (or session/set_config_option) has been called for the current sessionId, mirroring real vela 0.0.1 contract.

vela also has two non-ACP subcommands:

vela login → writes ~/.amr/config.json with a fake profile so OD's daemon login route + AmrLoginPill poller see the same on-disk projection production produces.

vela models → prints the production-shaped public_model_* vela catalog.

Error injection envs (kept in sync with apps/daemon/tests/fixtures/fake-vela.mjs): FAKE_VELA_SESSION_NEW_ERROR / FAKE_VELA_SET_MODEL_ERROR / FAKE_VELA_PROMPT_ERROR / FAKE_VELA_LOGIN_FAIL / FAKE_VELA_REQUIRE_SET_MODEL=0.

Each tool call from the recording is rendered with the original input arguments and tool output. The agents' assistant text is rendered as the final message.

Recording selection

Driven by env vars, in priority order:

Env	Behavior
`OD_MOCKS_TRACE=<id>`	Always play this trace. 8-char prefix OK.
`OD_MOCKS_BY_PROMPT_HASH=1` + stdin prompt	Deterministic by `sha256(prompt) % len(all)`. Same prompt → same trace. Useful for "stable answer per question" tests.
`OD_MOCKS_POOL=<tag>`	Random within the tag pool. Examples: `agent:claude`, `skill:agent-browser`, `outcome:failed`.
`OD_MOCKS_SEED=<str>`	Makes "random" picks reproducible across runs.
`OD_MOCKS_NO_DELAY=1`	Skip inter-event waits.
`OD_MOCKS_RECORDINGS_DIR=<path>`	Override the recordings dir.

If none are set, a uniformly random recording is played each invocation.

The mock binary announces the picked trace id on stderr:

[mock-opencode] picked 04097377… via fixed

This line is invisible to OD's stdout parser but useful for "wait, why did my test get the FAQ-fix trace?" debugging.

Recording catalog

The recordings live as one JSONL file per Langfuse trace under recordings/. Each file starts with a meta event carrying:

{
  "type": "meta",
  "source": {"provider": "langfuse", "trace_id": "...", "project_id": "..."},
  "agent": "claude" | "codex" | "opencode" | "gemini" | "cursor-agent" | "qwen" | "copilot" | "deepseek" | "antigravity",
  "model": "...",
  "outcome": "succeeded" | "failed" | "errored" | "interrupted",
  "duration_ms": 33620,
  "tool_call_count": 17,
  "error_count": 0,
  "total_tokens": 12345,
  "tags": ["agent:claude", "skill:agent-browser", "open-design", ...],
  "user_input": "...",
  "session_id": "..."
}

Subsequent events are tool_call, tool_result, and report (the final assistant text).

Indexed metadata

mocks/manifest.json is a flat manifest with one entry per recording plus histograms over all recordings, committed to the repo. It's also mirrored to R2 alongside the .jsonl files so consumers can fetch the current catalog without cloning. Query with jq:

# All multi-turn claude sessions about HTML editing
jq '.entries[] | select(.agent=="claude" and .multi_turn==true)' \
  mocks/manifest.json | head -50

# Failed codex traces (negative-path tests)
jq '.entries[] | select(.agent=="codex" and .outcome=="failed") | .trace_id' \
  mocks/manifest.json

# Agent-browser skill, sorted by tool count desc
jq '[.entries[] | select(.skills | index("agent-browser"))] | sort_by(-.tool_count)' \
  mocks/manifest.json

Headline stats (current dataset)

Dimension	Distribution
Agents	claude 57 · opencode 41 · codex 38 · gemini 25 · cursor-agent 11 · qwen/copilot/deepseek 2 each · antigravity 1
Outcomes	succeeded 144 · failed 35
Skills	default 71 · ad-creative 50 · algorithmic-art 30 · agent-browser 22 · video-hyperframes 2 · magazine-web-ppt / brainstorming / data-report / penpot-flutter 1 each
Multi-turn	124 traces tied to a session with ≥2 turns
Artifact	18 traces produce `<artifact>` output

Anonymization

User-specific data has been scrubbed from every recording:

/Users/<name>/…, /home/<name>/…, C:\Users\<name>\… → ${HOME}/… / %USERPROFILE%\…
Project UUIDs → stable proj-001, proj-002, … per recording
meta tag project:<uuid> rewritten too

The anonymizer is idempotent. Tool input/output payloads (HTML, code, etc.) are preserved verbatim — they're templated UI without cell-level PII; if a future audit finds otherwise, add specific scrubs in the harvester repo (see "Adding more recordings" below) and re-run.

Adding more recordings

The flow is staging in PR → auto-upload on merge. No one — not even a core maintainer — has R2 write credentials on their laptop; only the GitHub Action does. This means a stray mocks/scripts/... invocation can't corrupt prod data, and every new recording lands in a PR diff for review first.

Step 1 — produce an anonymized .jsonl

The harvester that produced the current 179-trace set lives in a separate repo, nexu-io/agent-pr-explore. See its README for how to authenticate against your trace store, filter by skill / agent / outcome, and anonymize the result.

The output is one <trace-id>.jsonl file per recording; copy that into a scratch dir of your choice, then continue with step 2.

Step 2 — stage in this repo, open a PR

cd ~/Documents/open-design
for f in /tmp/new-recordings/*.jsonl; do
  bash mocks/scripts/add-recording.sh "$f"
done

git checkout -b mocks/add-data-report
git add mocks/recordings-staging/
git commit -m "mocks: stage 30 data-report recordings"
gh pr create

add-recording.sh validates each .jsonl (meta event present, UUID filename) and prints the exact manifest entry the CI workflow will commit post-merge. The reviewer eyeballs the diff (~30 KB per recording, mostly tool I/O) for anonymization gaps.

Step 3 — merge → CI auto-syncs

On merge to main, .github/workflows/sync-mocks-to-r2.yml fires (paths filter mocks/recordings-staging/**), runs mocks/scripts/upload-to-r2.mjs, and:

Uploads each staged .jsonl to R2 (4-concurrent, sha256-verified)
Rebuilds mocks/manifest.json (entry insert, histograms, multi_turn)
Uploads the updated manifest to R2 too
Deletes the staged files and commits mocks: sync N recordings to R2 [skip ci] back to main

Concurrency group r2-mocks-upload serializes runs so two parallel PRs can't race on the manifest.

Why this trust model

R2 write secret never leaves CI. CLOUDFLARE_R2_MOCKS_AK + CLOUDFLARE_R2_MOCKS_SK are repo secrets with scope limited to the open-design-mocks bucket (deliberately separate from the shared CLOUDFLARE_API_TOKEN used by landing-page Pages deploys and from the CLOUDFLARE_R2_RELEASES_* pair, so the mocks write capability stays narrow). Local scripts have no path to upload.
No silent corruption. Reviewer sees every byte of new data in the PR diff before it reaches R2.
Read stays public. Anyone can fetch via the r2.dev URL; only appending data requires merging to main.

If you absolutely need to push from your laptop (e.g. backfilling an old trace the Action somehow lost), set OD_MOCKS_ALLOW_LOCAL_UPLOAD=1 and run upload-to-r2.mjs with your own wrangler login. Not recommended; consider opening a PR instead.

Usage from OD's test code

From a test (Vitest / Jest)

import { spawn } from 'node:child_process';
import { join } from 'node:path';

const MOCK_BIN = join(__dirname, '../../mocks/bin');

it('parses an opencode session with 4 tool calls into 4 UI events', async () => {
  const child = spawn('opencode', ['run'], {
    env: {
      ...process.env,
      PATH: `${MOCK_BIN}:${process.env.PATH}`,
      OD_MOCKS_TRACE: '06a9324a',   // 4-tool claude session
      OD_MOCKS_NO_DELAY: '1',
    },
    stdio: ['pipe', 'pipe', 'pipe'],
  });
  child.stdin.write('test prompt');
  child.stdin.end();
  // ... assert events parsed from child.stdout
});

From a manual playback

# See what claude's 17-tool "delete v2" session emits to OD:
export PATH=$(git rev-parse --show-toplevel)/mocks/bin:$PATH
export OD_MOCKS_TRACE=04097377
export OD_MOCKS_NO_DELAY=1
echo "anything" | claude -p --output-format=stream-json | jq .type | uniq -c

Files

mocks/
├── README.md                 ← you are here
├── mock-agent.mjs                ← entry; routes --as <agent> to format renderer
├── lib/
│   ├── recording-picker.mjs      ← env-driven trace selection
│   ├── format-opencode.mjs       ← matches handleOpenCodeEvent
│   ├── format-codex.mjs          ← matches handleCodexEvent
│   ├── format-claude.mjs         ← matches createClaudeStreamHandler
│   ├── format-gemini.mjs         ← matches handleGeminiEvent
│   ├── format-cursor-agent.mjs   ← matches handleCursorEvent
│   ├── format-acp.mjs            ← JSON-RPC server matching attachAcpSession
│   ├── format-vela.mjs           ← AMR vela: ACP + models block + set_model gate
│   ├── vela-subcommands.mjs      ← `vela login` + `vela models` handlers
│   └── format-plain.mjs          ← raw stdout (deepseek/qwen/grok)
├── bin/
│   ├── opencode  claude  codex
│   ├── gemini    cursor-agent
│   ├── deepseek  qwen    grok
│   ├── devin hermes kilo kimi kiro vibe
│   └── vela                       ← 15 bash wrappers, PATH-overlay
├── manifest.json                 ← committed: 179 entries' metadata + sha256 + R2 storage hints
├── scripts/
│   ├── smoke-test.sh             ← 21 checks; auto-fetches recordings if empty
│   ├── fetch-recordings.sh       ← pull from R2 (parallel, sha256-verified, idempotent)
│   ├── add-recording.sh          ← maintainer-local: validates + copies to staging dir (no R2 calls)
│   ├── upload-to-r2.mjs          ← called only by .github/workflows/sync-mocks-to-r2.yml
│   └── lib/
│       └── manifest-utils.mjs    ← shared sha256 / meta-parse / manifest-rebuild logic
├── recordings-staging/           ← drop new .jsonl here, PR, merge → Action uploads
│   └── .gitkeep
└── recordings/                   ← populated at runtime, gitignored .jsonl
    └── .gitignore                ← recordings come via fetch

No external dependencies. Pure node:fs/crypto/child_process. Works under any Node ≥18.

Limitations

copilot, qoder, pi (the niche copilot-stream-json / qoder-stream-json / pi-rpc formats) are recorded but not yet rendered as their native protocols — they fall back to the plain renderer for now. If you need them, add a format-<agent>.mjs following the same pattern as format-codex.mjs; the parsers are in apps/daemon/src/{copilot-stream,qoder-stream}.ts and the pi-rpc handler inside apps/daemon/src/server.ts.
The mock does not honor CLI flags that change semantics (--model, --permission-mode, --allowed-tools). They're silently ignored.

Provenance / safety

All recordings come from open-design's own Langfuse project (the open-design project under the powerformer org). Users opted into telemetry when they installed the desktop client. The anonymizer removed user-identifying paths and project UUIDs before checking in.

If you find a recording that includes content that should be redacted, open a PR removing it from mocks/recordings-staging/ (or, if already synced, file an issue — manifest regeneration after a delete needs to run against R2 manually and is not automated yet).

17 KiB Raw Blame History

mocks/ — replay-based mock CLIs for OD's supported agents