Commit graph

30 commits

Author SHA1 Message Date
lefarcen
da19ff3ca0
feat(mocks): replay-based mock CLIs for 14 of OD's supported agents (opencode/codex/claude/gemini/cursor-agent/deepseek/qwen/grok + ACP family devin/hermes/kilo/kimi/kiro/vibe) (#3241)
* feat(mocks): replay-based mock CLIs for opencode/claude/codex/deepseek/qwen/grok

Drops in a `mocks/` top-level dir that pretends to be the real agent
CLIs by streaming pre-recorded sessions in each CLI's native stdout
protocol. Zero LLM tokens.

## Use cases

- **E2E tests** in `apps/daemon/tests/` — exercise the full chat-server
  pipeline against a known trace, assert UI events / artifacts.
- **Self-validation during dev** — iterate on `claude-stream.ts` /
  `json-event-stream.ts` parser changes without burning provider budget.
- **Regression harness** — replay the same trace before and after a
  charter / parser change; diff the daemon events the UI surfaces.
- **Demo / onboarding** — show what a 17-tool claude editing session
  looks like end-to-end, offline.

## How

- 6 bash wrappers (`mocks/bin/`) shadow the real CLIs when PATH-overlaid.
- `mocks/mock-agent.mjs` reads `mocks/recordings/<trace>.jsonl`, picks
  one via env var (`SYNCLO_EXPLORE_MOCK_TRACE` / `_POOL` /
  `_BY_PROMPT_HASH`), streams the trace in the requested format.
- Each format renderer matches the EXACT JSON shape the OD daemon
  parser expects, verified line-by-line against
  `apps/daemon/src/{json-event-stream,claude-stream}.ts`:

  | CLI                       | streamFormat              | parser source                              |
  | ------------------------- | ------------------------- | ------------------------------------------ |
  | `opencode`                | `json-event-stream`       | `handleOpenCodeEvent`                      |
  | `codex`                   | `json-event-stream`       | `handleCodexEvent`                         |
  | `claude`                  | `claude-stream-json`      | `createClaudeStreamHandler`                |
  | `deepseek` `qwen` `grok`  | `plain`                   | `server.ts` (raw stdout)                   |

## Quick start

```bash
export PATH="$PWD/mocks/bin:$PATH"
export SYNCLO_EXPLORE_MOCK_TRACE=04097377   # 8-char prefix OK
export SYNCLO_EXPLORE_MOCK_NO_DELAY=1

echo "any prompt" | opencode run
echo "any prompt" | claude -p --output-format=stream-json
echo "any prompt" | codex exec
```

The mock binary announces the picked trace id on stderr:
`[mock-opencode] picked 04097377… via fixed`.

Recording selection (env, in priority order):
- `SYNCLO_EXPLORE_MOCK_TRACE=<id>` — fixed (prefix OK)
- `SYNCLO_EXPLORE_MOCK_BY_PROMPT_HASH=1` + stdin prompt — `sha256(prompt) % N`
- `SYNCLO_EXPLORE_MOCK_POOL=<tag>` — random within `agent:claude` /
  `skill:agent-browser` / `outcome:failed` / etc.
- (default) uniform random
- `SYNCLO_EXPLORE_MOCK_SEED=<str>` — reproducible "random"
- `SYNCLO_EXPLORE_MOCK_NO_DELAY=1` — skip inter-event waits

## Dataset

179 anonymized Langfuse traces from this project's own production
telemetry:

- 9 agents: claude 57 · opencode 41 · codex 38 · gemini 25 ·
  cursor-agent 11 · qwen 2 · copilot 2 · deepseek 2 · antigravity 1
- outcomes: succeeded 144 · failed 35
- skills: default 71 · ad-creative 50 · algorithmic-art 30 ·
  agent-browser 22 · video-hyperframes 2 · plus magazine-web-ppt /
  brainstorming / data-report / penpot-flutter-design-source 1 each
- 124 multi-turn (sessions with ≥2 turns)
- 18 produce `<artifact>` output
- ~4.5 MB on disk total

Anonymization: `/Users/<name>/` → `${HOME}/`,
`C:\Users\<name>\` → `%USERPROFILE%\`, project UUIDs →
stable `proj-001`, `proj-002`, …. Tool input/output payloads
preserved verbatim (templated UI, no cell-level PII).

## Smoke test

`bash mocks/scripts/smoke-test.sh` — 6 checks across all 6 agents.
All pass on this branch (verified locally):

```
  ✓ opencode first event = step_start
  ✓ codex first event = thread.started
  ✓ claude first event = system
  ✓ deepseek emitted plain text (144 chars on first line)
  ✓ qwen emitted plain text (144 chars on first line)
  ✓ grok emitted plain text (144 chars on first line)
All mock CLIs working. 
```

## Adding more recordings

The exporter that produced this set lives in
[nexu-io/agent-pr-explore](https://github.com/nexu-io/agent-pr-explore)
(see `cli/src/local/orchestrator/langfuse-import.ts` + the `local
langfuse-import` CLI command). Operators with the Langfuse keys can pull
more by tag / outcome / artifact / multi-turn filter, then run
`local recordings anonymize --out-dir ~/Documents/open-design/mocks/recordings`.
`mocks/README.md` has the full instructions.

## Out of scope (follow-ups)

- **ACP agents** (`devin`, `hermes`, `kilo`, `kimi`, `kiro`, `vibe`) need
  a JSON-RPC server on stdio rather than a one-shot stream — separate
  `format-acp.mjs` module not yet written.
- **Per-agent json-event-stream variants** (`cursor-agent`, `gemini`,
  `qoder`, `copilot`, `pi`) currently fall back to the `plain` renderer;
  their parsers are in `apps/daemon/src/json-event-stream.ts` and follow
  the same template as `format-codex.mjs`.

## AGENTS.md updates

- Added `mocks/` to the top-level content directories listing
- Added a Validation strategy bullet pointing here for agent-stream /
  parser changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mocks): add opencode-cli/kiro-cli/vibe-acp bin aliases and unref ACP timeout

- Add mocks/bin/opencode-cli, kiro-cli, vibe-acp wrappers for the primary
  RuntimeAgentDef bin names OD resolves before any fallback. Without these,
  a PATH-overlaid OD daemon run bypasses the mock entirely (opencode-cli,
  kiro-cli) or cannot find the mock at all (vibe-acp, which has no fallback).
- Include opencode-cli, kiro-cli, vibe-acp in the smoke-test ACP/JSON loop
  so coverage is verified end-to-end.
- Call .unref() on the 30s safety timeout in format-acp.mjs so a completed
  ACP session exits promptly instead of waiting the full 30 seconds.

Generated-By: looper 0.9.2 (runner=fixer, agent=claude-code)

* feat(mocks): add vela (AMR) — login / models / ACP with strict set_model gate

Extends mocks/ to cover OD's own AMR runtime. `vela` is the bin name
`apps/daemon/src/runtimes/defs/amr.ts` specifies (`bin: 'vela'`,
`streamFormat: 'acp-json-rpc'`). It's richer than the generic ACP
agents — covers full login + models + chat-session lifecycle.

### What vela does (mirrored from apps/daemon/tests/fixtures/fake-vela.mjs)

1. `vela login` — writes ~/.amr/config.json with a fake profile (controlKey,
   runtimeKey, user{email,name,plan}, profile-specific apiUrl/linkUrl).
   The on-disk projection is what OD's daemon login route + AmrLoginPill
   poller read; production goes through device-auth, the mock skips
   straight to the file write.

2. `vela models` — prints the production-shaped public model catalog as
   newline-separated `public_model_*    vela` lines. Override via
   FAKE_VELA_MODELS env.

3. `vela agent run --runtime opencode` — ACP JSON-RPC server with three
   vela-specific protocol extensions:

   a. `initialize` response carries `agentCapabilities`
      (`promptCapabilities.embeddedContext`) + `models`
      (`currentModelId` + `availableModels`).
   b. `session/new` response carries the same `models` block.
   c. **Strict set_model gate**: `session/prompt` is rejected with
      JSON-RPC -32602 ("session/set_model must be called before
      session/prompt") UNLESS `session/set_model` (or
      `session/set_config_option`) has been called for the current
      sessionId. Mirrors real vela 0.0.1 contract; catches regressions
      in `attachAcpSession` that silently skip set_model.

### Error injection envs (in sync with fake-vela.mjs)

  FAKE_VELA_SESSION_ID            - sessionId returned by session/new
  FAKE_VELA_TEXT                  - override assistant text
  FAKE_VELA_THOUGHT               - optional thought_chunk before text
  FAKE_VELA_SESSION_NEW_ERROR     - fail session/new
  FAKE_VELA_SET_MODEL_ERROR       - fail session/set_model
  FAKE_VELA_PROMPT_ERROR          - fail session/prompt
  FAKE_VELA_REQUIRE_SET_MODEL='0' - disable the strict gate (legacy)
  FAKE_VELA_LOGIN_USER_EMAIL      - email written into config profile
  FAKE_VELA_LOGIN_USER_PLAN       - plan written into config profile
  FAKE_VELA_LOGIN_DELAY_MS        - sleep before write (test in-flight)
  FAKE_VELA_LOGIN_FAIL            - print + exit 1
  FAKE_VELA_MODELS                - override models stdout
  VELA_PROFILE                    - profile slot (prod | test | local)

### Components

`mocks/lib/format-vela.mjs` (~205 LOC)
  - Full ACP server with vela protocol extensions
  - Strict set_model gate
  - Error injection plumbing

`mocks/lib/vela-subcommands.mjs` (~90 LOC)
  - runVelaLogin() — writes ~/.amr/config.json
  - runVelaModels() — prints catalog

`mocks/bin/vela` — dispatcher wrapper. Forwards `vela <subcmd>` to
mock-agent.mjs which routes to login/models or falls through to ACP.

`mocks/mock-agent.mjs` — parseArgs now collects positionals so the vela
dispatcher can read subcommand from there; switch case added for vela.

`mocks/scripts/smoke-test.sh` — +4 assertions:
  vela models prints ≥10 catalog lines
  vela login writes ~/.amr/config.json with the requested email
  vela agent run ACP roundtrip (initialize+models+set_model+stream+result)
  vela strict set_model gate rejects prompt without prior set_model

### Verified locally

  ✓ vela models printed 15 catalog lines
  ✓ vela login wrote ~/.amr/config.json with profile.prod.user.email
  ✓ vela agent run ACP roundtrip (initialize+models, set_model accepted, prompt streamed)
  ✓ vela strict set_model gate rejects session/prompt without prior set_model

All 21 smoke checks pass (up from 17 with previous P3 ACP commit).

### AGENTS.md + README updates

  AGENTS.md — mention `vela (AMR — vela CLI)` alongside ACP agents in
  the directory listing entry.
  mocks/README.md — protocol table row + dedicated vela section with
  subcommand contract, strict gate explanation, env-injection cheat
  sheet. Mock-tree listing updated.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mocks): honor REPORT_FILE env when --report-file flag not given

Harnesses that spawn the mock without translating their report-path
contract to the mock's CLI flag (notably nexu-io/agent-pr-explore's
orchestrator, which passes REPORT_FILE as env per the existing
opencode/claude/codex agent launchers) wouldn't get a report file
written, so the harness's "agent exit 0 but produced no report"
check would always fire and mark mock runs as failure even though the
stdout stream was complete.

Fix: in mock-agent.mjs parseArgs, fall through to process.env.REPORT_FILE
when --report-file wasn't provided on argv. Each format renderer already
accepts opts.reportFile and writes the recording's final assistant text
to it (`format-*.mjs` already had this — only the wiring was missing).

Verified: synclo-explore run with `mock=true, mock_trace=04097377`
against the opencode wrapper now produces a plan.md with the recording's
17-tool claude editing session report. ~1.5s per run vs ~70s real opencode.

* mocks: move recordings to Cloudflare R2; PR→main→Action upload path

The 179-recording corpus (~4.5 MB raw, ~280 KB after compression) has
been moved off git into Cloudflare R2 at the bucket open-design-mocks
under recordings/v1/. The repo now ships:

- mocks/manifest.json — the canonical catalog (renamed from
  recordings/index.json) with sha256 + storage hints; consumers
  fetch this to discover what exists, then pull individual jsonl
  files on demand
- mocks/scripts/fetch-recordings.sh — parallel, sha256-verified,
  idempotent puller for the public r2.dev URL
- mocks/scripts/add-recording.sh — local maintainer helper that
  validates a new .jsonl and copies it into recordings-staging/
  (no R2 calls; no credentials needed)
- mocks/scripts/upload-to-r2.mjs — called only by the CI workflow
- mocks/scripts/lib/manifest-utils.mjs — shared sha256/meta/
  rebuild-histograms logic, used by both add-recording (preview)
  and upload-to-r2 (actual write) so the entry shape never drifts
- .github/workflows/sync-mocks-to-r2.yml — fires on push to main
  when mocks/recordings-staging/ changes; uploads to R2, updates
  manifest, commits cleanup back; serialized via concurrency group

Trust model: R2 write credentials (CLOUDFLARE_API_TOKEN,
CLOUDFLARE_ACCOUNT_ID) are repo secrets; nobody can push from a
laptop. Read stays public via the r2.dev URL.

Why not pnpm install integration: contributors who do not touch
agent code do not pay the fetch cost. Fetch happens on first
smoke-test run (auto-fallback) or when a mock spawn needs data.

Repo size: -4.55 MB net (delete 179 jsonl, +280 KB manifest +
scripts). Smoke test (21 checks) still green against the fetched
corpus.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mocks: scope R2 write token to a dedicated secret name

Use CLOUDFLARE_R2_MOCKS_TOKEN (instead of reusing the shared
CLOUDFLARE_API_TOKEN that landing-page-*.yml uses for Pages deploys)
so the R2 write capability can be scoped to just the
open-design-mocks bucket without bleeding extra capability into the
Pages workflows.

Also hardcode the powerformer CF account_id directly in the workflow
(account IDs are not secret and the shared CLOUDFLARE_ACCOUNT_ID
secret may point at a different account).

Workflow now fails fast with an actionable error message + dashboard
link if the secret is unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mocks: switch R2 sync to S3-compat API (wrangler getMemberships gate)

wrangler 4.x calls /memberships before any r2 action, requiring
user:read scope. R2 "Object Read & Write" tokens deliberately lack
that scope (defense in depth — a leaked token should not enumerate
account-level resources). The workflow now uses the aws CLI talking
straight to the R2 S3-compatible endpoint with SigV4, no membership
lookup.

Secret rotation: CLOUDFLARE_R2_MOCKS_TOKEN (Bearer) is replaced by
CLOUDFLARE_R2_MOCKS_AK / CLOUDFLARE_R2_MOCKS_SK (matching the
existing CLOUDFLARE_R2_RELEASES_AK/SK naming convention). End-to-end
tested locally: PUT recording → manifest rebuild → manifest PUT →
staging cleanup all green.

aws CLI is pre-installed on ubuntu-latest, so no install step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mocks: scrub synclo namespace; use OD_MOCKS_* env prefix throughout

These mocks were copy-pasted from synclo-explore, where they
originated, and inherited the SYNCLO_EXPLORE_MOCK_* env-var
convention. That brand-bleed is not appropriate in OD: rename the
public env surface to OD_MOCKS_* (matching OD-native prefixes like
OD_MOCKS_CACHE_DIR, OD_TRACE_R2_UPLOAD, OD_EXPECT_TIMEOUT_SECONDS).

Renames:
  SYNCLO_EXPLORE_MOCK_TRACE             → OD_MOCKS_TRACE
  SYNCLO_EXPLORE_MOCK_BY_PROMPT_HASH    → OD_MOCKS_BY_PROMPT_HASH
  SYNCLO_EXPLORE_MOCK_POOL              → OD_MOCKS_POOL
  SYNCLO_EXPLORE_MOCK_SEED              → OD_MOCKS_SEED
  SYNCLO_EXPLORE_MOCK_NO_DELAY          → OD_MOCKS_NO_DELAY
  SYNCLO_EXPLORE_MOCK_RECORDINGS_DIR    → OD_MOCKS_RECORDINGS_DIR
  SYNCLO_EXPLORE_MOCK_SMOKE_TRACE       → OD_MOCKS_SMOKE_TRACE
  SYNCLO_OD_MOCKS_I_KNOW_WHAT_IM_DOING  → OD_MOCKS_ALLOW_LOCAL_UPLOAD

Also drop the inline harvester usage from README. The harvester is an
external CLI in nexu-io/agent-pr-explore — its README is the right
place for langfuse-import flags, anonymization options, etc. OD only
documents its own staging→PR→Action workflow.

Smoke test (21 checks) still green; OD_MOCKS_TRACE end-to-end
verified to route correctly.

Consumers of the OLD env names (notably the orchestrator in
nexu-io/agent-pr-explore) need a matching rename. No back-compat
shim here — the explore side has zero external users today and a
one-line follow-up is cleaner than a permanent deprecation layer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* AGENTS.md: align mock env names with mocks/ rename (SYNCLO_* → OD_MOCKS_*)

Missed in the prior commit (a30b868a) — only grepped mocks/ subdir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mocks: drop staging dir + GH Action; back to local-script upload

The staging-dir + Action design (added earlier in this PR) had a flaw
the user caught: new recordings briefly entered the repo on their way
through staging, leaving them in git history forever even after the
Action cleanup commit removed them from HEAD. That defeats the whole
point of moving recordings to R2.

Replace with the simpler local-maintainer flow:

  bash mocks/scripts/upload-recording.sh /path/to/<trace>.jsonl
  # → validates, wrangler r2 put, updates manifest.json, wrangler r2 put manifest
  git add mocks/manifest.json && git commit && git push
  # → only the ~200B manifest delta enters git

The wrangler-OAuth gate replaces the CI secret + Action duo. For a
solo / small maintainer team this collapses the trust chain down to
"do you have wrangler login to the powerformer account?" — no GH
secrets to rotate, no concurrency window to worry about, no
inevitable repo-history bloat.

Deletes:
- .github/workflows/sync-mocks-to-r2.yml
- mocks/scripts/upload-to-r2.mjs   (CI-only)
- mocks/scripts/add-recording.sh   (staging helper, now obsolete)
- mocks/recordings-staging/        (empty dir, never to be repopulated)

Adds:
- mocks/scripts/upload-recording.sh

Kept:
- mocks/scripts/fetch-recordings.sh
- mocks/scripts/lib/manifest-utils.mjs (still used by upload-recording.sh)
- mocks/manifest.json (committed; the only mocks artifact in git)

End-to-end tested locally: re-upload an existing recording is
idempotent, manifest math is stable, fetch + smoke test still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mocks: address review — guard allowlist + safe ~/.amr + loud OD_MOCKS_TRACE typo

Three concrete issues raised across recent Siri-Ray (Looper) review
threads on #3241:

1. scripts/guard.ts only allowlisted mocks/lib/ + mocks/mock-agent.mjs,
   leaving mocks/scripts/lib/manifest-utils.mjs outside the residual-
   JS guard. Result: Preflight fail on every push. Extend the allowlist
   to mocks/scripts/ — same precedent as the lib/ entry directly above.

2. mocks/scripts/smoke-test.sh moved the caller real ~/.amr to
   ~/.amr-smoke-backup, ran vela login (which writes a fake config),
   then rm -rf the .amr and restored the backup. Two failure modes:
   crash mid-run loses the user real config, and re-running before
   restore overwrites the backup with the fake login. Fix: sandbox
   vela login into a mktemp -d HOME via env (HOME=$amr_sandbox vela
   login). Never touches the real ~/.amr at all. trap cleans up.

3. mocks/lib/recording-picker.mjs silently fell through to
   prompt-hash → pool → random when OD_MOCKS_TRACE was set but did
   not match any recording (typo, prefix too short, corpus not
   fetched). Tests using a pinned trace would silently get a
   different trace, hiding regressions. Fix: throw an explicit error
   with the failing value + a pointer at fetch-recordings.sh.

Verified locally: pnpm guard prints "Residual JavaScript check
passed", smoke-test still 21/21, ~/.amr mtime unchanged after run,
typo on OD_MOCKS_TRACE now produces "mock-agent: OD_MOCKS_TRACE=...
set but no matching recording in <dir>" on stderr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fetch-recordings: detect empty filter result before line-counting

printf '%s\n' on an empty string emits a single empty line, so the
previous TOTAL=$(printf ... | grep -c "") math returned 1 on an
empty $ENTRIES_TSV — a typo like `--agent no-such-agent` printed
"Fetching up to 1 recordings", downloaded zero, and exited 0
("ready"). Check `-z $ENTRIES_TSV` first.

Reproduced + fix verified per the reviewer thread.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mocks: address mrcfps review — goldens + provenance + contract check

Three durability improvements suggested in the PR #3241 top-level
review:

## 1. Golden daemon-event snapshots (mocks/golden/*.events.json + apps/daemon/tests/mocks-golden.test.ts)

Smoke-test verified that mocks RUN; that catches crashes but not a
parser change that semantically reshapes the events the daemon emits.

Commit the daemon-event sequence for 3 representative traces:
- claude  314d6833 — median-complexity agent-browser session
- codex   dcdff3b3 — 14-tool refactor
- opencode 9a9522ec — 7-tool data-report

apps/daemon/tests/mocks-golden.test.ts spawns the mock, feeds stdout
through the real createClaudeStreamHandler / createJsonEventStreamHandler,
normalizes per-spawn volatile fields (only sessionId today, only on
claude), and deep-equals against the committed snapshot. A parser
regression fails the test loudly.

After an intentional parser change, regenerate:

  MOCKS_GOLDEN_UPDATE=1 pnpm --filter @open-design/daemon test mocks-golden
  git diff mocks/golden/
  # eyeball; commit if shapes match intent

## 2. Provenance fields on every manifest entry (mocks/scripts/lib/manifest-utils.mjs + mocks/manifest.json)

Augment inspectRecording() to write:

  captured_at         — ISO 8601 from existing meta.timestamp
  cli_version         — null until harvester writes it
  protocol_version    — null until harvester writes it
  anonymization_version — null until harvester writes it

captured_at is now populated for all 179 existing entries from the
meta event the harvester already emits. The harvester in
nexu-io/agent-pr-explore is the next step for cli_version /
protocol_version / anonymization_version — once those are
populated, consumers can detect when a recording is older than ~1
minor version behind the live CLI and flag for re-harvest.

No matrix of (cli_version × agent) recordings — that explodes
maintenance. Just metadata per recording so trust decay is visible.

## 3. Real-CLI contract check (mocks/scripts/contract-check.sh + docs/MOCKS-CONTRACT-CHECK.md)

Mocks catch parser regressions against recordings; they do NOT
catch recordings drifting away from the live agent CLI as that CLI
evolves. The contract check spawns the real CLI alongside the mock
with a fixed deterministic prompt + diffs top-level event-type
distributions.

Deliberately human-driven, not cron-scheduled:
- costs real LLM tokens per invocation
- requires real CLI auth
- maintainer reads the output, not a regex

Suggested triggers per doc: real-CLI release notes mentioning
"output format" / "stream" / "JSON" / "events"; before a parser
refactor; ad-hoc when something looks off.

## Coverage note

README updated to position mocks as "deterministic protocol/parser
coverage" (not "e2e replacement") per mrcfps framing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mocks-golden test): drop import of non-exported ParserKind

Use plain string (the type alias is `string` anyway) — Preflight
typecheck on a31fa71a failed:

  tests/mocks-golden.test.ts(29,8): error TS2459: Module
  "../src/json-event-stream.js" declares "ParserKind" locally, but
  it is not exported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* recording-picker: structured OD_MOCKS_POOL + hard-fail no-match

Siri-Ray review: \`OD_MOCKS_POOL=outcome:failed\` was documented as a
supported selection knob, but the matcher only checked tags and
\`meta.agent\` — so the negative-path pool found 0 candidates and
silently fell through to global random, validating against any
recording instead of a failed trace.

Fix:
- Parse \`<dim>:<value>\` shape and route each dim to the right meta
  field: \`outcome\` → \`meta.outcome\`, \`agent\` → \`meta.agent\`,
  \`skill\` → \`tags[]\`. Bare values still fall back to tag substring.
- If the env was set and matched nothing, throw with the failing
  value and a jq one-liner for inspection. Same loud-fail policy as
  OD_MOCKS_TRACE — silent fallback was the original bug.

Verified locally: outcome:failed, agent:codex, skill:agent-browser
all route correctly; outcome:nonsense throws the explicit error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* contract-check.sh: fix lost $PROMPT in mock invocation

Siri-Ray review on e576074a: the mock side wrapped its pipeline in
`bash -c "printf %s \"\$PROMPT\" | ..."` — but $PROMPT was a parent
shell variable, not exported, so the child bash expanded it to an
empty string. Result: the contract check sent the real prompt to the
real CLI and an empty string to the mock, defeating the
same-input invariant the whole script rests on. Also let the mock
randomly select a different trace whenever a maintainer happens to
have OD_MOCKS_BY_PROMPT_HASH=1 in their env.

Fix: drop the inner bash -c entirely; use a subshell that scopes the
PATH overlay and pipes printf into the PATH-resolved mock binary
directly. The subshell limits the PATH change without var-passing.

Verified locally: with prompt-A the mock picks trace 54ec02ee via
hash; prompt-B → 2667e851 via hash; empty prompt (old broken
behavior) → random — confirms the prompt is now actually reaching
the mock under PATH overlay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-29 07:17:20 +00:00
Marc Chan
d5659d82d4
chore(nix): streamline pnpm deps hash maintenance (#2919)
* chore(nix): streamline pnpm deps hash maintenance

Generated-By: looper 0.9.0 (runner=worker, agent=opencode)

* fix(ci): satisfy actionlint in nix hash autofix

Generated-By: looper 0.9.0 (runner=fixer, agent=opencode)

* fix(ci): allow nix hash autofix on fork PRs

Generated-By: looper 0.9.0 (runner=fixer, agent=opencode)

* fix(ci): follow up nix hash review

Generated-By: looper 0.9.0 (runner=fixer, agent=opencode)

* fix(ci): tolerate nix hash bot token failures

Generated-By: looper 0.9.0 (runner=fixer, agent=opencode)
2026-05-26 07:35:38 +00:00
Marc Chan
619087a6b4
refactor(web): split global CSS by ownership (#2609)
* refactor(web): split global CSS by ownership

* test(web): expand CSS imports in style checks

* fix(web): keep privacy consent banner above modals
2026-05-25 05:48:28 +00:00
PerishFire
34165ff189
chore: retire tools-pr (#2867) 2026-05-25 05:15:04 +00:00
Patrick A
ce2e7a0e66
fix(web): chat pane preserves scroll position when todo card grows (#2299)
* fix(web): chat pane preserves scroll position when todo card grows

The PinnedTodoSlot renders outside the chat-log scroll container. When
the todo card grows (new tasks added via TodoWrite), the scroll container's
clientHeight shrinks in the flex layout, drifting the user away from the
bottom. The existing ResizeObserver only observed children of the chat-log
div, so pinned-todo growth was invisible to followLatestIfPinned.

Fix: pass a containerRef to PinnedTodoSlot and observe that element in the
same ResizeObserver. syncPinnedTodo() is called on effect setup and from
the MutationObserver callback so observation stays current as the slot
appears and disappears across TodoWrite snapshots.

Red spec: apps/web/tests/components/chat-todo-autoscroll.test.tsx

* fixup! fix(web): chat pane preserves scroll position when todo card grows

Clarify test comment: the second test confirms followLatestIfPinned
snaps scroll to bottom when fired. The structural guarantee (pinned-todo
element is observed) is separately asserted in test 1, which is the
check that goes red on main without the fix.

* fix(web): correctness extend MutationObserver to pane ancestor for PinnedTodoSlot mount detection

The MutationObserver was only watching the .chat-log element. PinnedTodoSlot
(.chat-pinned-todo) is a sibling of .chat-log-wrap inside .pane, outside the
observed subtree. syncPinnedTodo inside the MutationObserver callback was
therefore dead code for mount/unmount transitions of the slot.

Add a second observation on paneEl (el.parentElement?.parentElement) with
childList-only so the MutationObserver fires when PinnedTodoSlot mounts or
unmounts and syncPinnedTodo can register/deregister the element with the
ResizeObserver.

* test(e2e): chat pane auto-scroll on todo card growth

Add Playwright spec that goes red on origin/main and green on this fix
branch. Scenario A asserts that a chat-log pinned to the bottom snaps
back after the PinnedTodoCard grows (the ResizeObserver-on-pinned-todo
path). Scenario B asserts that a deliberate scroll-up is not overridden.

Also allow OD_WORKSPACE_ROOT env override in next.config.ts so Turbopack
resolves node_modules correctly when the web app is booted from a worktree
whose node_modules symlinks resolve outside the default workspace root.

* docs(agents): note pinned-todo observer coverage in chat UI conventions

PinnedTodoSlot sits outside the .chat-log scroll container, so the
ResizeObserver and MutationObserver coverage that keeps auto-scroll
working when the todo card grows is non-obvious to future implementers.
Document the invariant in the Chat UI conventions section.

* fix(web): validate OD_WORKSPACE_ROOT, harden autoscroll test precondition

* fix(web): validate OD_WORKSPACE_ROOT existence, make autoscroll precondition unconditional

* fix(web): throw on invalid OD_WORKSPACE_ROOT instead of warn-and-fallback

* fix(web): require pnpm-workspace.yaml at OD_WORKSPACE_ROOT, drop dead test branch

Three follow-ups to nettee's review feedback:

1. apps/web/next.config.ts gains a pnpm-workspace.yaml existence check
   after the relative-path validation. Without it, an override like
   '<repo>/apps' or '<repo>/apps/web' passes the relative(resolved, WEB_ROOT)
   check but the resolved path is missing the sibling packages/* directory
   that apps/web imports from (for example @open-design/contracts). Next
   would later fail deep inside file tracing / Turbopack with a much
   harder-to-diagnose error. Now we throw at config load with a clear message.

2. e2e/ui/chat-todo-autoscroll.test.ts drops the redundant
   'if (scrollUpOccurred)' branch. The hard precondition above it already
   guarantees distanceAfterScroll > 80, so the if was dead code that read
   as a false-green path. The body now runs unconditionally.

3. Same test tightens the post-grow assertion. The previous
   toBeGreaterThan(60) would pass even if a regression dragged the log
   most of the way back to the bottom (e.g. before=150, after=61).
   Replaced with Math.abs(distanceAfterGrow - distanceAfterScroll) less than
   SCROLL_PRESERVATION_TOLERANCE_PX (20) — a delta check that actually
   verifies the comment's claim of 'within ~20px of where the user left it'.

* fix(web): canonicalize workspace root with realpathSync and tighten scenario B assertion

- Use realpathSync on both resolved and WEB_ROOT before the ancestor check so
  that symlinked paths (macOS /tmp vs /private/tmp, worktree checkouts) compare
  correctly instead of false-throwing on a physically valid override.
- Add isAbsolute(rel) guard for the Windows cross-drive case where path.relative()
  returns an absolute path instead of a ..-prefixed string.
- Scenario B: replace distance-to-bottom delta assertion with scrollTop preservation
  check. Growing the pinned todo naturally increases distance-to-bottom by ~extraPx
  (clientHeight shrinks while scrollTop is held fixed), so the old Math.abs(after -
  before) < 20 check would fail on correct behavior. asserting scrollTop directly
  catches the real regression: followLatestIfPinned incorrectly snapping a non-pinned
  user back to the bottom.
- Add hard precondition that clientHeight actually changed so the test fails fast
  if the layout stops exercising the non-pinned path.

* test(e2e,web): add clientHeight guard to scenario A and mount-wiring unit test

---------

Co-authored-by: Patrick A <eefynet@users.noreply.github.com>
Co-authored-by: Patrick A <259201958+eefynet@users.noreply.github.com>
2026-05-23 00:19:59 +08:00
Marc Chan
10192dcc52
fix(ci): catch nix hash drift before merge (#2530)
* fix(ci): catch nix hash drift before merge

* fix(nix): add pnpm hash refresh helper

* chore(nix): drop redundant hash alias

* fix(nix): raise update-hash output buffer

Generated-By: looper 0.8.1 (runner=fixer, agent=opencode)

* fix(nix): handle current pnpm deps hash

Generated-By: looper 0.8.1 (runner=fixer, agent=opencode)

* fix(nix): reject non-mismatch hash updates

Generated-By: looper 0.8.1 (runner=fixer, agent=opencode)
2026-05-21 16:08:13 +08:00
PerishFire
15d08d4158
feat: add windows packaged auto update flow (#2362) 2026-05-20 12:56:14 +08:00
Tom Huang
86ec951fb9
[codex] Add automation templates and proposal workflows (#2193)
* feat(web): introduce Automations tab with dual-track capability for routines

This commit adds a new Automations tab that consolidates routines, schedules, and live artifacts, allowing users to manage automations seamlessly. The tab features a modal for creating and editing automations, which supports various scheduling options (hourly, daily, weekdays, weekly) and project modes (create_each_run, reuse). The CLI is also updated to expose automation commands, ensuring consistency between the web UI and CLI interfaces.

Key changes include:
- New `NewAutomationModal` component for automation creation and editing.
- Updated `TasksView` to integrate the new Automations functionality.
- Enhanced styling for the Automations tab to improve user experience.

This implementation aligns with the dual-track capability exposure policy, ensuring all features are accessible via both the web UI and CLI.

* feat(daemon): enhance automation context handling and CLI commands

This commit introduces several improvements to the automation context management and updates the CLI commands accordingly. Key changes include:

- Added support for new context fields (`plugin`, `mcp`, `connector`) in automation commands.
- Updated the CLI to reflect new target options (`new-project`).
- Enhanced error messages for invalid target inputs.
- Introduced functions to handle context selection and normalization for routines, including the ability to parse and store context data in the database.
- Updated the database schema to include a new `context_json` field for routines.
- Improved the handling of context in routine routes and the web interface, ensuring that selected contexts are properly managed and displayed.

These changes aim to provide a more robust and flexible automation experience, aligning with the recent enhancements in the web UI.

* feat(web): enhance TasksView with automation run history and status indicators

This commit introduces several new features to the TasksView component, including:

- Added functionality to display automation run history for each routine, showing metadata such as status, timestamps, and project details.
- Implemented status indicators for routine runs, providing visual feedback on their current state (succeeded, failed, running, queued).
- Enhanced the UI to allow users to expand and view detailed run history, including the ability to open the corresponding project conversation.
- Updated styles to improve the presentation of automation statuses and history.

These changes aim to provide users with better insights into their automation routines and improve overall usability.

* feat(daemon): implement automation ingestion and proposal management

This commit introduces several new features related to automation ingestion and proposal management within the daemon. Key changes include:

- Added new modules for handling automation source packets and proposals, allowing for the storage, retrieval, and management of automation-related data.
- Implemented functions to list, create, and apply automation proposals, enhancing the automation workflow.
- Introduced new CLI commands for interacting with memory entries and automation sources, providing users with more control over their automation processes.
- Enhanced the server routes to support automation source and proposal APIs, enabling seamless integration with the existing system.

These changes aim to improve the overall automation experience, making it easier for users to manage and utilize automation proposals and ingestions effectively.
2026-05-19 16:35:28 +08:00
bluto447
6e1ebb97dd
docs: add Windows native setup notes to AGENTS.md (#324)
* docs: add Windows native setup notes to AGENTS.md

* docs: address review feedback on Windows native setup notes

- Reframe Node 22 FAQ around the actual user question (no orphaned
  citation); ground the answer in package.json#engines only
- Remove fabricated better-sqlite3 engines field claim; the
  lockfile entry has only resolution/integrity, no engines metadata
- Cross-reference historical Windows issues (#10, #96, #100, #203,
  #315) for fresh-installer context
- Remove tools-dev bullet that duplicated the Local lifecycle section
- Specify Visual Studio Build Tools 2022 or newer

* docs: pin Windows Node install to verified Node 24 (Looper follow-up)

`OpenJS.NodeJS.LTS` floats with whichever line is currently LTS — works
today (Node 24.15.0) but will silently advance to Node 26 in October
2026, conflicting with package.json#engines: node: "~24". WinGet has no
`OpenJS.NodeJS.24` package to pin against. Reword the install bullet to
(a) keep the WinGet command, (b) offer nodejs.org as alternative, (c)
require `node --version` verification, and (d) call out the October
2026 drift explicitly.
2026-05-19 13:23:00 +08:00
PerishFire
4424f08be0
[codex] Add packaged desktop auto-update (#1375)
* Add packaged desktop auto-update

* Handle counted beta nightly update versions

* Refresh desktop auto-update branch for main

* Serialize desktop updater operations

* Refresh auto-update branch for packaged paths
2026-05-19 11:20:05 +08:00
lefarcen
22a3b99a47 Merge origin/main into preview/v0.8.0
Sync 49 commits from main. Conflicts resolved:
- .github/workflows/ci.yml: kept v0.8.0 granular per-area gating, added main's
  linux specs + release-stable.yml + release-preview.yml triggers
- .github/workflows/release-preview.yml: kept v0.8.0's full workflow over main's placeholder
- apps/web/src/components/AssistantMessage.tsx: combined v0.8.0 file-ops
  summary with main's stripTodoToolGroups + suppressAskUserQuestionFallbackText
- apps/web/src/components/ChatPane.tsx: kept both new imports
- apps/web/src/index.css: kept both .msg-plugin-chip and .user-copy-btn blocks
- e2e/ui/*.test.ts: kept v0.8.0 openEntrySettingsDialog helper over main's
  inline dialog navigation (UI was redesigned in v0.8.0)
- nix/package-{daemon,web}.nix: kept v0.8.0 pnpmDepsHash; rerun nix build to refresh
2026-05-15 18:23:33 +08:00
Chris Seifert
9cf265e520
feat(claude): wire AskUserQuestion tool through chat + pin TodoWrite (#1743)
* feat(claude): wire AskUserQuestion tool through chat + pin TodoWrite

Claude calls `AskUserQuestion` for mid-conversation clarifications when
the natural answer is one of a small finite set of choices. Until now
the tool round trip hit two dead ends in headless mode: claude-code -p
cannot prompt the user, so it auto-errored the tool and retried 4x;
the model then hedged by also writing the same options as a markdown
bulleted list. The host had no way to feed a real `tool_result` back.

This change makes the AskUserQuestion round trip work end to end:

* Switch Claude to `--input-format stream-json`. The daemon wraps the
  prompt as a JSONL `user` message on stdin and keeps stdin OPEN, so
  later writes (a `tool_result` for the open AskUserQuestion) feed
  back into the same child instead of needing a fresh spawn.
* New `RuntimeAdapter.promptInputFormat()` ('text' default,
  'stream-json' for Claude) so the spawn loop keeps the old close-on-
  prompt behavior for every other agent.
* New `POST /api/runs/:id/tool-result` daemon endpoint and
  `submitChatRunToolResult` web helper. Body carries `toolUseId` and
  `content`; daemon writes a JSONL `user` message with the matching
  `tool_result` content block.
* Track outstanding host answers on the run (`pendingHostAnswers`)
  and close stdin on either a `usage` event or a synthesized
  `turn_end` event (extracted from `assistant.message.stop_reason`
  in `claude-stream`). Without the per-turn `turn_end` signal stdin
  would never close after the follow up turn finished and the run
  would hang until the inactivity watchdog killed it.
* System prompt: tell Claude to use AskUserQuestion for follow ups
  with 2-4 finite choices, and to STOP after the tool call instead
  of writing a markdown duplicate.

Web UI:

* New `AskUserQuestionCard` renders the tool input as labelled chip
  buttons (single or multi select) with a Submit button styled like
  the composer's Send. On submit the answer routes through
  `submitChatRunToolResult` (live tool_result path) and falls back
  to `onSubmitForm` (plain user message) only if the run has already
  terminated. Selected chips persist across page reloads by re
  parsing the stored `tool_result.content`.
* Hide markdown text that follows an AskUserQuestion in the same
  turn — defense in depth against the model emitting the duplicate.
* Collapse identical `AskUserQuestion` / `TodoWrite` retries inside
  any tool group to a single card. TodoWrite is a snapshot tool,
  so older calls are duplicates of state.
* Pinned TodoCard above the chat composer. The latest TodoWrite
  snapshot across the conversation renders once, expandable /
  collapsible header, count shows in-progress + completed (1/4),
  Done button dismisses when all tasks finish, soft fade gradient
  above so scrolling chat text dissolves into the panel instead of
  hard clipping under the card.
* Composer gains a top shadow that only appears when the pinned
  todo slot sits directly above it (dark mode strengthened).
* Accordion expand / collapse motion shared between TodoCard, the
  ToolGroupCard disclosure, and BashCard output via
  `grid-template-rows: 0fr -> 1fr` with `cubic-bezier(0.23, 1, 0.32, 1)`
  and asymmetric durations (200ms enter, 140ms exit) per Emil
  Kowalski's animation framework.
* Jump-to-latest button no longer unmounts on hide; slides up with
  scale 0.9 -> 1 + fade on show, slides down with scale + fade on
  hide. Always horizontally centered via `margin: 0 auto`.

i18n:

* `tool.askQuestion`, `tool.askQuestionSubmit`, `tool.askQuestionPending`,
  `tool.askQuestionAnswered`, `tool.todosExpand`, `tool.todosCollapse`,
  `tool.todosDone`, `tool.todosDismiss` added to all 18 locales.

Unblocker:

* Fix a pre-existing render loop in `ProjectView` when the user
  clicks "New conversation". `handleNewConversation` now navigates
  to the fresh conversation id synchronously after
  `setActiveConversationId` so the route-sync effect at L512 and
  the URL-sync effect at L851 do not ping pong (route mismatch
  triggered repeated reverts; React's nested-update guard fired).

* fix(claude): order turn_end after content blocks + cover chat switching

Two follow-up fixes to the AskUserQuestion + new-conversation work:

* `claude-stream.ts` emitted `turn_end` BEFORE iterating the assistant
  message's content blocks. When claude-code lacks
  `--include-partial-messages` (older builds), tool_use events surface
  only from that loop, so the daemon's stdin-close handler saw an
  empty `pendingHostAnswers` set and closed stdin before the
  AskUserQuestion tool_use was even registered. The result: the model
  retried, hit the same race, and gave up writing the questions in
  prose. Emit `turn_end` AFTER the content loop so tool_use ids land
  in `pendingHostAnswers` first.

* `server.ts` now ignores `turn_end` events with
  `stop_reason: 'tool_use'`. That stop reason means the model paused
  to wait for a tool execution (claude-code's internal tool runner
  for Bash / Edit / Read, or a host-answered tool like
  AskUserQuestion). Either way the conversation is still in flight —
  closing stdin there would kill the follow-up response. Only the
  natural turn-end stop reasons (`end_turn`, etc.) close stdin.

* `ProjectView.handleSelectConversation` now navigates to the picked
  conversation id synchronously, mirroring the fix already in
  handleNewConversation. The route-sync effect at L512 was reverting
  the active conversation on every switch, ping-ponging with the
  URL-sync effect at L851 until React's nested-update guard fired
  with "Maximum update depth exceeded". Same bug class as the
  pre-existing new-conversation render loop.

* docs(agents): capture AskUserQuestion runtime + chat UI conventions

Record the patterns this PR introduces so future contributors can find
them without spelunking server.ts:

* Agent runtime conventions — `RuntimeAgentDef.promptInputFormat`,
  `run.pendingHostAnswers` / `run.stdinOpen` lifecycle, `turn_end`
  ordering rule, `POST /api/runs/:id/tool-result` endpoint shape, the
  Claude only system prompt block that nudges AskUserQuestion, and the
  `suppressAskUserQuestionFallbackText` defense in depth.
* Chat UI conventions — URL-load vs srcDoc render mode dispatch with
  bridge disqualifiers, the dual iframe visibility swap pattern,
  `isOurIframe` plus the active-iframe re-check for signals that must
  only come from the visible iframe, pinned TodoCard via
  `PinnedTodoSlot`, count includes `in_progress`,
  `dedupeSnapshotToolRetries` for AskUserQuestion / TodoWrite stacks.
* i18n keys — 18 locale files, add the key to `types.ts` first.
* UI animation philosophy — `cubic-bezier(0.23, 1, 0.32, 1)` ease out,
  asymmetric 200/140ms enter/exit, accordion via `grid-template-rows`,
  no `transform: scale(0)`, keep mounted + toggle class for exit
  transitions instead of relying on React unmount.

* fix(claude): read promptInputFormat as field, close stdin on deferred answer

Two PR review follow-ups on the AskUserQuestion stream-json wiring.

* server.ts:4616 referenced `runtimeAdapter.promptInputFormat()` — but
  `runtimeAdapter` is not declared, imported, or assigned anywhere. The
  prior adapter abstraction was deleted in #1656; when the changes
  were folded back into the inline handler the format was moved onto
  `RuntimeAgentDef.promptInputFormat`, but this call site was missed.
  `server.ts` starts with `// @ts-nocheck` so typecheck never caught
  it — every chat run hit `ReferenceError: runtimeAdapter is not
  defined` the moment we wrote the prompt to a stdin-fed child, which
  is every agent with `promptViaStdin: true` (claude, codex, copilot,
  cursor-agent, gemini, opencode, pi, qoder). Read the format off the
  in-scope `def` and default missing values to `'text'`.

* `submitToolResultToRun` cleared the answered id from
  `pendingHostAnswers` but never closed stdin if a `turn_end` /
  `usage` event had already fired with the set non-empty (deferred
  by the event handler). The child then waited indefinitely for
  further input until the inactivity watchdog killed it, losing the
  model's follow-up response. Close stdin on the last-answer
  transition when stream-json stdin is still open.

Test: pin `promptInputFormat` for every `promptViaStdin: true` agent
so future regressions of the field-vs-method contract fail at
typecheck-adjacent test time instead of in production. The new test
asserts `typeof def.promptInputFormat` is a string (or undefined),
not a function — exactly the shape mistake the original line made.

* fix(web): keep AskUserQuestion multi-select chips selected after reload when labels contain commas

`handleSubmit` joined multi-select answers with `', '` while the
reload parser split them on `','`. The pair is asymmetric: a valid
model-generated option like `"Yes, including images"` round-tripped
as `["Yes", "including images"]`, so after a page reload the locked
question card showed the user's pick as unselected — even though the
`tool_result` content the daemon actually wrote into the run was
correct, and the model saw the right answer. Bounded to post-reload
visual state, but silently confusing.

Switch to a `- ` bullet list per option, one per line, with the
parser stripping the leading `- ` back off. Newlines never appear
inside a label so the round trip is exact. The outer pairs separator
stays `\n\n` because individual answer bodies still never contain
that double-newline.

* chore: drop accidental personal design-system file

`design-systems/foldar/DESIGN.md` was added to the AskUserQuestion
branch in 31ac531 by mistake — it's a personal brand spec that does
not belong in the upstream design-systems catalogue. Removing it
keeps the branch's surface area scoped to the feature.
2026-05-15 15:50:27 +08:00
PerishCode
43b1b94c8e Add preview release channel 2026-05-14 19:15:16 +08:00
lefarcen
6341b2677a
docs(pr): require user-perspective description and surface area (#1520)
* docs(pr): require user-perspective description and surface area

The previous template asked for Summary + Validation, which encouraged
code-perspective descriptions and let user-visible surface changes slip
past review unnoticed. Replace with:

- "Problem" — issue link + motivation
- "What users will see" — first-person user-visible effect
- "Surface area" — 9-item checklist (UI, shortcut, CLI/env, API/contract,
  extension point, i18n, top-level dependency, default behavior change, none)
- "Screenshots" — required when UI surface is checked, focused on the
  entry point users discover rather than the feature in isolation
- "Validation" — kept, retitled away from "Summary"

Authoritative rules added to AGENTS.md under a new "Pull request expectations"
section so external contributors' agents (Claude Code, Cursor, etc.) pick up
the requirement when reading the repo. CONTRIBUTING.md gets one pointer line
in "Commits & pull requests"; localized CONTRIBUTING variants (zh-CN, de, fr,
ja-JP, pt-BR) are left for follow-up translation PRs per the existing
docs-update workflow.

The existing "Fixes #" prompt is preserved verbatim — that template addition
from #1263 enforces PR-to-issue auto-linking and stays load-bearing.

* docs(pr): broaden dependency surface to dev deps as well

The "New top-level dependency" checkbox narrowed scope to runtime deps,
but CONTRIBUTING.md L239 says "No new top-level dependencies" without
limiting to runtime, and the AGENTS.md rule uses the same broad phrase.
A new devDependency (tool/test/build package) belongs in the same
bytes-vs-benefit explanation, so the checklist item should match.

* docs(pr): add Why and Bug fix verification; scope deps to root package.json

Three follow-up tweaks from review feedback:

1. Rename `## Problem` to `## Why` with a broader prompt that asks
   contributors to cover both their own use case (what made them write this
   PR today) and the pain being addressed. The old "Problem" framing only
   covered user-facing motivations and left no slot for the contributor's
   stake — a key signal for triaging external PRs.

2. New `## Bug fix verification` section between Screenshots and Validation,
   conditional on the PR being a bug fix. Surfaces the AGENTS.md
   "Bug follow-up workflow" red-spec requirement at PR-authoring time
   instead of leaving it implicit; asks for the test path and the
   red-on-main / green-on-branch confirmation.

3. Clarify the "New top-level dependency" checkbox to specify the **root**
   `package.json`. Without that word, contributors in a monorepo could read
   the check as applying to any workspace `package.json` (e.g. adding `react`
   to `apps/web/package.json` would be in scope) when CONTRIBUTING.md L239's
   "small on purpose" rule clearly meant root-level deps only.

AGENTS.md `## Pull request expectations` and CONTRIBUTING.md's pointer
line are updated to match the new section names and add the bug-fix
red-spec expectation.
2026-05-13 15:28:05 +08:00
PerishFire
8c0fb8dc01
feat(tools-pr): add maintainer PR-duty workspace (#1259)
* feat(tools-pr): add maintainer PR-duty workspace

Adds `tools/pr` as the maintainer-only control plane for PR-duty work on
this repo. Thin `gh` wrapper that encodes repo-specific knowledge:
review lanes, forbidden surfaces, lane-specific checklists, validation
command derivation from touched packages.

Subcommands:
- `list` — triage open queue by lane and review-state bucket.
- `view <num>` — agent-friendly review brief for a single PR.
- `classify [num]` — emit script-level tags for one PR or the whole
  open queue; full-queue JSON output lands under `.tmp/tools-pr/classify/`
  with rate-limit telemetry per run.
- `assignment` — assigner-perspective view of PR ownership, idle time,
  and blockers (derived from existing tags; no new judgments).

Tag dictionary (13 tags) covers: bot-only-approval, needs-rebase,
forbidden-surface, unlabeled, duplicate-title, non-ascii-slug,
maintainer-edits-disabled, org-member, unresolved-changes-requested,
stale-approval, and three awaiting-* timing tags. Each rule is
expressible as one factual sentence over `gh` data + repo paths — see
`tools/pr/AGENTS.md` for the full dictionary plus precision rules.

Templates in `tools/pr/templates/*.md` are aesthetic references for
recurring maintainer comments (duplicate-title ask, awaiting-author
nudge, agent-review brief shape). `templates/examples/` holds
frozen-in-time agent-review snapshots for three PR shapes.

Infrastructure:
- `gh()` wraps `execFile` with minimum-touch retry (2 attempts at 1s + 2s
  backoff) on transient 5xx / network errors. Persistent failures still
  surface — retry is anti-jitter, not an exponential-backoff resilience
  layer.
- Heavy chunks (`reviews`, `comments`, `commits`, assignment timelines)
  use cursor-paginated `gh api graphql` via `fetchPaginatedPrList` to
  stay under GitHub's GraphQL server-side timeout. Light chunks stay on
  `gh pr list --json`.
- `fetchOrgMembers` cached per process via `gh api orgs/<owner>/members
  --paginate`.

Wiring:
- Root `package.json` adds `pnpm tools-pr` to the allowed root entry
  points.
- `scripts/postinstall.mjs` builds `tools/pr` alongside other workspace
  packages.
- `scripts/guard.ts` allowlists `tools/pr/bin/tools-pr.mjs` and
  `tools/pr/esbuild.config.mjs`, and adds `pr/` to the `tools/` top-level
  layout allowlist.
- Root `AGENTS.md` and `tools/AGENTS.md` document the new command
  surface, root-command-boundary update, and per-tool ownership.

* docs(agents): brief tools-pr in root AGENTS.md, link to tools/pr/AGENTS.md

Adds a `PR-duty tooling` section to the root AGENTS.md summarising what
`pnpm tools-pr` is, listing the four common subcommands (list / view /
classify / assignment), and pointing readers to `tools/pr/AGENTS.md` for
the full tag dictionary, operational playbook, templates, and design
rules. The section keeps root-level guidance to high-level orientation
while details stay local to the tool's own AGENTS.md.

* fix(tools-pr): drop overly broad touches-root-package.json forbidden hit

`deriveForbidden` was flagging any change to root `package.json` as a
forbidden-surface hit, but AGENTS.md §Root command boundary only forbids
specific *lifecycle* aliases (pnpm dev / test / build / daemon / preview
/ start) — tools-control-plane entrypoints like `pnpm tools-pr` are
explicitly allowed. Distinguishing "forbidden alias" from "allowed
entry" requires reading the diff content, which is `pnpm guard`'s job
rather than a path-derived classify tag.

Dogfooded on this branch's own PR (#1259), which added the `pnpm
tools-pr` script and was incorrectly flagged. Removing the hit aligns
the `forbidden-surface` tag with what tools-pr can mechanically detect
from file paths alone (apps/nextjs/, packages/shared/).

* fix(tools-pr): paginate commits fetch, recognise ready-to-merge, escape title-index separator

Three review follow-ups on #1259, all factual fixes:

- `fetchOpenPrCommits` now uses `fetchPaginatedPrList` instead of a
  one-shot `pullRequests(first: $first)` query. GitHub GraphQL caps
  connection page size at 100, so the previous implementation would
  fail at runtime when callers passed `--limit > 100`. The paginated
  path makes the commits fetch consistent with the other heavy chunks
  (reviews, comments, assignment timelines) and removes the artificial
  ceiling entirely. The `limit` parameter is dropped from
  `fetchOpenPrCommits`; the CLI `--limit` continues to bound the
  `gh pr list --json` chunks.
- `deriveStatus` in `assignment.ts` now reads `facts.reviewDecision`
  and `facts.mergeStateStatus`. When the PR is `APPROVED` with merge
  state `CLEAN` or `UNSTABLE` and carries no blockers, status renders
  as `ready to merge` instead of falling through to `in review`. The
  assignment view loses its main triage signal without this — a clean
  human-approved PR rendered identical to a REVIEW_REQUIRED one.
- `tags.ts:tagDuplicateTitle` and `tags.ts:buildContext` both
  constructed the title-index key with a literal NUL byte between
  author and title, which made the file appear as binary in `git diff`
  / review tooling. Replaced the literal byte with a Unicode escape
  sequence in source; the runtime string value is identical, the
  source stays plain text and round-trips through review tooling
  cleanly.

* fix(tools-pr): raise default --limit to 1000 to cover the live open queue

mrcfps flagged that `tools-pr list` (and `classify --all`, `assignment`)
defaults to `--limit 100`, which silently drops every PR past the first
100 in the open queue. The repo currently sits at 104 open PRs, so the
out-of-the-box run was already omitting four PRs.

Raise the default to 1000 in `list.ts`, `classify.ts`, and `assignment.ts`,
and remove the now-pointless 200 ceiling — `gh pr list --limit N` paginates
internally, so a high cap is cheap. Users can still pass `--limit <small>`
for a truncated preview. CLI help text on the three subcommands updated to
match.

* fix(web): pass designTemplates to ProjectView render helper

#955 made `designTemplates` a required Prop on ProjectView, but the test
helper added in #1244 (`renderProjectView` in
`ProjectView.api-empty-response.test.tsx`) was never updated. The two
PRs landed on main without conflicting, leaving `apps/web` typecheck red
for every PR that rebases past b5eb8c16.

Pass `designTemplates={[] as SkillSummary[]}` alongside the existing
`skills={[] as SkillSummary[]}` so the helper compiles. The component
already treats the array shape (empty included) as a no-op fallback in
the empty-response paths the test exercises.

* fix(tools-pr): correct author signal + merge inline review comments

Two correctness gaps in the awaiting-* signal pipeline surfaced during
review of the new tools-pr commands:

1. `authorSignalAt` iterated every PR commit unconditionally. On
   `maintainerCanModify=true` PRs a maintainer's follow-up push would
   advance the author timestamp, masking a stalled author response.
   Filter commits to those whose `authorLogin` matches `facts.author`,
   mirroring the same filter already applied to comments.

2. `fetchOpenPrComments` (and `fetchView`) only fetched
   `pullRequest.comments` / `gh pr view --json comments`, which is the
   issue-conversation thread. Inline review-thread replies — where
   authors and reviewers actually exchange most fix-up replies — live in
   `reviewThreads.comments` / REST `pulls/{n}/comments`. Missing them let
   `humanReviewerSignalAt` / `authorSignalAt` and the `view` brief point
   at the wrong side after someone replied inline. Extend the list-mode
   GraphQL to also sweep `reviewThreads(last: 20).comments(first: 20)`,
   and add a parallel REST inline-comments fetch in `fetchView` that
   merges into `GhView.comments`.
2026-05-11 19:17:21 +08:00
Tom Huang
b5eb8c1647
feat: generic skills + split skills/design-templates + finalize-design API (#955)
* feat: general-purpose skills with @-mention composition and user import

Lift skills from "one mode-bound skill per project" to a generic capability
the user can compose per turn:

- Daemon: scan multiple skill roots (user-skills under runtime data, then
  the bundled `skills/`); user-imported skills can shadow built-ins by id.
- New `POST /api/skills/import` and `DELETE /api/skills/:id` endpoints,
  with CONFLICT/BAD_REQUEST/NOT_FOUND error codes and built-in delete
  protection.
- ChatRequest gains `skillIds: string[]`; the chat run concatenates each
  picked skill's body (and merges craftRequires) into the system prompt
  for that turn only — the project's persistent `skillId` is untouched.
- Web composer: `@` popover now lists skills alongside project files;
  picks render as removable chips above the textarea and ride along with
  the request as `skillIds`.
- Settings → Library: import form (name/description/triggers/body),
  per-card delete for user skills, "user" origin badge.

* chore(web): drop welcome pet teaser + add ds→prompt-template mapping util

- SettingsDialog: remove the inline pet adoption teaser from the welcome
  panel so the first-run modal stays focused on configuration.
- New `inferPromptTemplateCategoriesForDs(ds)` helper that maps a design
  system's authored metadata to prompt-template gallery categories.
  Imported by the design-system gallery wiring on a sibling branch; no
  callers in this branch yet.

* feat: split skills/design-templates and add finalize-design API

Phase 0 of the skills/design-templates refactor (specs/current/
skills-and-design-templates.md):

- Move ~104 rendering catalogue entries from skills/ to design-templates/
  and keep skills/ for the small set of functional skills that *do work*
  on user input (utilities, briefs, packagers).
- Add design-templates/AGENTS.md and skills/AGENTS.md describing the
  contract, and a brand-agnostic craft/ surface for opt-in craft rules.
- Daemon: add DESIGN_TEMPLATES_DIR / USER_DESIGN_TEMPLATES_DIR roots and
  an /api/design-templates surface mirroring /api/skills. Asset/example
  routes still span both registries so existing srcdoc URLs keep
  resolving across the rename.
- Web: split LibrarySection into SkillsSection + DesignSystemsSection,
  rename the EntryView "Examples" tab to "Templates", and update locales
  + the New-project picker accordingly.

Adds the finalize-design endpoint:

- New apps/daemon/src/finalize-design.ts and packages/contracts/src/api/
  finalize.ts — one-shot synthesis of a project's transcript + active
  design system + current artifact into <projectDir>/DESIGN.md via the
  Anthropic Messages API. Per-project .finalize.lock mirrors the
  transcript-export hygiene from PR #493; provider credentials are not
  persisted by the daemon.

Other supporting changes:

- README + AGENTS.md updates to document the new directory split and
  craft/ surface, plus i18n strings across 13 locales.
- Test refactors and new coverage (finalize-design, runs, sidecar
  server, plus refreshed daemon integration tests).
- .gitignore: scope the *.exe ignore to /OpenDesign.exe so legitimate
  vendor binaries are no longer hidden.

* fix(merge): move clinical-case-report to design-templates/

Origin/main added the clinical-case-report skill under skills/ before
the skills/design-templates split landed. Its od.mode is prototype, so
per specs/current/skills-and-design-templates.md it is a design template
and belongs alongside the other rendering catalogue entries — not under
the slimmed-down functional skills/ root. Moving it keeps the EntryView
Templates tab consistent with origin/main's intent.

* feat(skills): curated design/creative catalogue + collapsible Settings rows

Seed ~100 curated design/creative skill stubs under skills/ sourced from
awesome-claude-skills (ComposioHQ) and awesome-agent-skills (VoltAgent).
Each stub carries an od.category tag so the new filter pill row in
Settings -> Skills can group them. The seed script
(scripts/seed-curated-design-skills.ts, pnpm seed:curated-design-skills)
is idempotent: it only creates folders that don't already exist, so
hand-edited stubs are never overwritten.

- Daemon: parse and surface od.category on SkillInfo with a strict slug
  normaliser; mirror the field on SkillSummary in @open-design/contracts.
  Category is purely a UI hint — system-prompt composition is unchanged.
- Web: rewrite SkillsSection from a left-list / right-detail grid into a
  vertical stack of collapsible rows mirroring the External MCP panel
  (header always visible with name + mode/source/category pills + per-row
  enable toggle; SKILL.md preview, file tree and inline edit form expand
  on demand). Add a Category filter row above the list. Reorder Settings
  nav so Skills + External MCP sit above the Composio/MCP cluster. Update
  composer placeholder/hint across 17 locales to advertise '@ files or
  skills · / for commands'.
- Docs: extend skills/AGENTS.md with the curated catalogue rules
  (idempotency, category vocabulary, no upstream vendoring).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(skills): teach localized-content + system-prompt tests about the skills/design-templates split

mrcfps blocking review on PR #955: the skills/design-templates split
(b5993385) moved ~110 SKILL.md entries out of `skills/` and into
`design-templates/`, but two repo-level tests still hard-coded the
single-root layout, so CI gates went red on the merged branch:

- `e2e/tests/localized-content.test.ts` only scanned `<repo>/skills`
  while the locale `skillCopy` map keeps id-keyed entries spanning
  both roots (ExamplesTab/Templates uses one lookup regardless of
  origin). Teach the helper to read both `skills/` and
  `design-templates/`, deduplicating ids so the union matches the
  localized claim.
- `apps/daemon/tests/prompts/system.test.ts` read
  `skills/live-artifact/SKILL.md`, which now lives under
  `design-templates/live-artifact/`. Update the absolute path so
  composeSystemPrompt's coverage of the live-artifact preamble is
  exercised again.

Also enroll the curated design/creative catalogue (PR #955, ~91
stubs sourced from awesome-claude-skills / awesome-agent-skills) in
the DE / FR / RU `_SKILL_IDS_WITH_EN_FALLBACK` lists. The stubs are
English-only by design (frontmatter advertises an upstream URL); the
fallback list is exactly the place to acknowledge "we know this id
exists, English copy is fine here" so the localized-content coverage
gate passes without forcing a translation task per locale.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(skills): always quote frontmatter name so importUserSkill round-trips numeric / boolean ids

mrcfps PR #955 review: `buildSkillMarkdown` emitted `name:
${escapeYamlString(name)}` without quotes, so YAML coerced names
like `123`, `true`, `false`, or `null` into non-string scalars on
re-parse. listSkills() then read `data.name` as a number/boolean
and the import flow's follow-up `findSkillById(skills, result.id)`
missed it, falling into `/api/skills/import`'s "imported skill
could not be re-read" 500 path for those ids.

Switch the emitter to a quoted scalar (`name: "..."`) — the
double-escape already in `escapeYamlString` makes the quoted form
safe — and add a round-trip test covering `123`, `true`, `false`,
`null`, and `0` to lock in the contract.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): drop staged-skill chips when the matching @<id> token leaves the draft

mrcfps PR #955 review: `submit()` always forwarded every id in
`stagedSkills`, but that state was only mutated on picker click and
chip removal. Hand-deleting an `@<id>` token from the textarea left
the chip staged, so the request still carried `skillIds: [<id>]` and
the daemon composed a skill the prompt no longer referenced.

Sync the chips with the draft inside `handleChange()` by pruning
`stagedSkills` whenever the new value no longer contains the
`@<id>` token (using the same whitespace boundary as
`removeStagedSkill`'s strip regex). Comment explains why this
prune does not run for `staged` file attachments — users frequently
add files via the upload button without leaving an `@<path>` token,
so a symmetric prune there would erase legitimate uploads.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(daemon): stage @-composed skills' side files alongside the active skill

codex PR #955 review: composing a per-turn `@`-picked skill into the
system prompt appended its body (with the `withSkillRootPreamble`
guidance pointing at relative paths under `<cwd>/.od-skills/<folder>/`)
but never staged the actual folder. `startChatRun` only copied
`activeSkillDir`, so when the project's primary skill was different
(or absent) the composed skill's references/, examples/, and scripts/
files lived only at their absolute repo path — agents that honour
the cwd-relative form (or that don't get `--add-dir`, e.g. Codex with
allowlisted gpt-image projects) couldn't reach them.

Thread the composed skills' dirs out of `composeDaemonSystemPrompt`
as `extraSkillDirs` and stage each one through the same
`stageActiveSkill` API used for the primary skill. Dedupe by folder
basename so a project whose primary skill is also `@`-composed isn't
copied twice. Each preamble already advertises its own folder, so the
prompt and the staged tree stay aligned without further changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): respect the Library disable toggle in the project @-mention picker

codex PR #955 review: only `EntryView` received `enabledSkills`
(filtered against `config.disabledSkills`); active projects still
got `skills={skills}` raw, so a skill the user disabled in Settings
kept appearing in the project's `@`-mention popover and could ride
along to the daemon via `skillIds`. That broke the Library toggle
for any project opened on the post-split branch.

Compute a functional-skills-only enabled subset
(`enabledFunctionalSkills`) and pass it into `<ProjectView>` instead.
Templates stay separate — design-templates are filtered through their
own `enabledDesignTemplates` memo for the Templates gallery — so
ProjectView's chat composer still only sees skills, never templates,
matching the pre-split prop surface.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(e2e): mock /api/design-templates for example-use-prompt flow

The Templates tab in EntryView fetches from /api/design-templates after
the skills/design-templates split (specs/current/skills-and-design-templates.md).
The example-use-prompt Playwright scenario only mocked /api/skills, so the
gallery card never appeared and the test timed out waiting on
example-card-warm-utility-example. Serve the same fixture summary on both
endpoints so the templates gallery renders the card the test clicks.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(tools-pack): create design-templates fixture for resources test

The packaging resources copy now bundles the new design-templates tree
alongside skills (see resources.ts BUNDLED_RESOURCE_TREES). The
copyBundledResourceTrees fixture only created skills, design-systems,
craft, etc., so the recursive copy crashed with ENOENT on
design-templates before it could check the prompt-templates assertion.
Add the missing fixture directory so the test exercises the same set
of resource trees the packaged build does.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(skills): clone built-in side files into the shadow on first edit

mrcfps PR #955 review: editing a built-in skill wrote a USER_SKILLS_DIR
shadow folder that contained only a new SKILL.md. The next listSkills()
pass surfaced the shadow as the active dir, but every side-file resolver
(/api/skills/:id/files, /example, /assets/*, the system-prompt preamble,
and the per-turn cwd staging) reads through skill.dir. With nothing but
SKILL.md in the shadow, the bundled assets/, references/, scripts/, and
examples/ disappeared the moment the user hit save — a built-in like
last30days or live-artifact would break immediately after edit instead
of just having its body overridden.

Teach updateUserSkill() to take a `sourceDir` and clone every entry
except SKILL.md / dotfiles into the shadow on the very first edit. The
shadow stays self-contained, so all the resolvers keep working without
fallback bookkeeping. Subsequent edits detect the existing shadow and
skip the clone, so user tweaks under the side tree survive a re-save.

Wire `sourceDir: skill.dir` from server.ts's PUT /api/skills/:id handler
and add two regression tests:
- 'clones built-in side files into the shadow on the first edit' walks
  the file tree after save and asserts assets/template.html, references/
  notes.md, and scripts/helper.sh all round-trip from the built-in.
- 'preserves user-edited side files on subsequent edits' edits the
  staged assets/template.html, re-saves, and confirms the user content
  is still there.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(e2e): rename home tab from Examples to Templates

The Examples tab was renamed to Templates in EntryView (b5993385's
skills/design-templates split — entry.tabExamples became entry.tabTemplates
and the tab value moved from 'examples' to 'templates'), but
entry-chrome-flows still asserted the old label and testId. Update both.

* fix(skills+web): preserve template body in API mode and dir-based skill delete

Two follow-ups from PR #955 review:

1. ProjectView only received `enabledFunctionalSkills`, but
   `composedSystemPrompt()` still resolved `project.skillId` through that
   prop and `fetchSkill()`. Projects created from the new
   `/api/design-templates` surface keep a template id in `project.skillId`,
   so opening one in API mode dropped the template body from the system
   prompt and the upstream request ran without the project's primary
   template instructions. Now ProjectView takes a separate
   `designTemplates` prop (the unfiltered template list, so a
   later-disabled template still loads for projects already created from
   it) and `composedSystemPrompt()` plus the metadata / `isDeck` lookups
   fall back to that list, with `fetchDesignTemplate()` as the body-fetch
   fallback to `fetchSkill()`. The chat composer's `@`-picker keeps
   receiving only the enabled functional skills.

2. `DELETE /api/skills/:id` used `deleteUserSkill(USER_SKILLS_DIR, skill.id)`
   which re-slugified the frontmatter id and removed
   `<userSkillsDir>/<slug>/`. That matched the import shape but missed the
   install shape — `installFromTarget` writes the folder at
   `sanitizeRepoName(url)` (GitHub) or `path.basename(realpath)` (local
   symlink), neither of which is guaranteed to equal the slugified
   frontmatter `name`. A duplicate `app.delete('/api/skills/:id', ...)`
   handler at the install routes never fired because Express resolved the
   earlier registration first, leaving the install/uninstall path without
   working teardown. The handler now removes `skill.dir` (the absolute
   path listSkills already discovered) under a USER_SKILLS_DIR safety
   check, using `lstat` + `unlinkSync` so symlinked local installs unlink
   cleanly without recursing into the user's source tree. The dead
   duplicate handler is removed; `deleteUserSkill` is dropped from the
   server.ts import set (still exported and unit-tested in skills.ts).
   Regression coverage in `apps/daemon/tests/skills-delete-route.test.ts`
   pins both shapes plus the symlink-preserves-source case.

* test(daemon): point hyperframes system-prompt test at design-templates

The merge with main brought in a hyperframes system-prompt test that
reads `skills/hyperframes/SKILL.md`, but this branch's split moved
`hyperframes` into `design-templates/` (same migration as `live-artifact`
already handled above in this file). CI was failing with ENOENT on the
old path.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 17:48:34 +08:00
PerishFire
f2db5a749c
chore: enforce PR→issue linking discipline (#1263)
PRs that omit Fixes #N break the release-time reverse lookup
(issue → closing PR → merge sha → first containing tag), since the
auto-link only fires on the explicit closing keywords. We've been doing
this by hand on recent fixes; codify it so future PRs don't drift.

- Add .github/pull_request_template.md with a Fixes # placeholder so
  the link surface is in front of the author by default.
- Add a corresponding bullet to the Bug follow-up workflow in the root
  AGENTS.md so the discipline lives next to the methodology that
  produces issue-linked work.
2026-05-11 17:24:24 +08:00
PerishFire
a797e079b1
fix(desktop): exit fullscreen before hiding window on macOS close (#1249)
* fix(desktop): exit fullscreen before hiding window on macOS close (#1215)

When a preview is in 演示 → 全屏 mode, the macOS close handler called
window.hide() directly, leaving the OS fullscreen Space orphaned as a
black screen — the window vanished but the Space stayed up.

Extract hideWindowExitingFullscreen as the named invariant ("hide,
but first leave fullscreen so the OS Space tears down with the window")
and route the darwin close handler through it. The hide is deferred
until 'leave-full-screen' fires so we don't race the OS Space teardown.

Bootstraps Vitest on apps/desktop with a single test under
tests/main/hide-window-exiting-fullscreen.test.ts that exercises the
helper through a structural mock — the bug shape is pure logic, no real
Electron window required. Spec was red against a hide-only helper and
green after the leave-full-screen sequencing.

* docs(agents): codify bug follow-up workflow

Distill the spec-first / cheapest-layer / scope-discipline /
invariant-shaped-fix / baseline-diff playbook used recently on #135 and
#1215 into a top-level subsection of root AGENTS.md, framed as a default
action shape with explicit room for case-by-case judgment rather than a
hard rule. Includes a single pointer back to the worked example spec.

* docs(agents): require staged human verification for visible bugs

Add the human-verification gate as a sixth bullet in the Bug follow-up
workflow. UI / platform-native / animation symptoms can pass green specs
and still ship the visible regression — proven by #1215, where the
desktop unit test green-lighted the helper logic but only a side-by-side
buggy-vs-fix run on a real macOS Space proved the black-screen actually
went away.

Reinforces the production-API-only seed constraint while we're there:
source-level backdoors prove a fake flow, not the real one, so they
invalidate the verification.

* fix(desktop): defer hide across the fullscreen-enter transition (#1215)

mrcfps observed on PR #1249 that the close handler only catches windows
already in fullscreen — Electron's enter-full-screen event is async on
macOS, so isFullScreen() can still read false during the OS Space
transition triggered by requestFullscreen(). A close in that window
took the plain hide() path and stranded the same black Space the fix
was meant to eliminate.

Track in-flight fullscreen entry from webContents.enter-html-full-screen
(set) and BrowserWindow.{enter,leave}-full-screen (clear), and surface
it through WindowFullscreenSurface.isEnteringFullscreen. The helper now
parks on enter-full-screen until the OS confirms the Space, then runs
the existing exit-then-hide path.

Adds a regression test ("waits out a fullscreen-enter transition before
exiting and hiding") that goes red against the previous helper.
2026-05-11 17:04:42 +08:00
Marc Chan
bdd49ebc47
docs: add repository-wide code review guidelines (#927)
* docs: add repository-wide code review guidelines

Introduces docs/code-review-guidelines.md as the operational review
standard layered on top of AGENTS.md, and adds a Code review guide
section to AGENTS.md that points reviewers at it.

The guide codifies the Product relevance test as a pre-implementation
gate, names the canonical list of forbidden surfaces, lists the
ownership areas in scope, and defines five review lanes: default
code/tests, contract and protocol changes, design-system additions,
skill additions, and craft additions. It also captures the secrets,
runtime data, performance, and maintainability checks that previously
lived only as oral conventions, and aligns the approval bar with the
validation rules in AGENTS.md.

AGENTS.md remains the source of truth when the two disagree; the new
doc is the operational guide on top of it.

* docs: tighten review guidelines for governance docs and bugfix discipline

- Reference scripts/guard.ts as source of truth for guard checks
- Add governance documentation as an explicit in-scope category
- Require reviewers to build a module/caller map before commenting
- Add bugfix-specific reproduction and regression-test checks
- Carve out documentation-only exception in the approval bar

* docs: align review guidelines with repository policy

Keep the review scope aligned with maintained workspace surfaces and preserve AGENTS.md as the authoritative validation bar.

Generated-By: looper 0.6.3 (runner=fixer, agent=opencode)
2026-05-08 19:18:39 +08:00
PerishFire
f1cdb2844a
test(e2e): gate beta packaged runtime (#637)
* test(e2e): gate beta mac packaged runtime

* test(e2e): separate ui automation layout

* test(e2e): move localized content coverage

* chore(release): prepare packaged 0.4.1 beta validation

* test(e2e): keep ui lane playwright-only

* fix(web): keep chat recoverable after conversation load failure

* fix(desktop): honor native mac quit
2026-05-06 17:44:29 +08:00
PerishFire
bbdd4e84b5
chore: enforce test directory conventions (#496)
* chore: enforce test directory conventions

Move package, app, and tool tests out of src and add guard enforcement so source directories stay source-only.

* ci: use guard and package-scoped tests

Run the new repository guard in CI and keep test execution aligned with package-scoped commands after removing root aliases.

* ci: align stable release guard check

Use the new repository guard in stable release verification after replacing the residual-JS-only script.

* chore: tighten test layout enforcement

Enforce sibling tests directories, typecheck moved test suites with dedicated configs, and refresh remaining guidance that pointed at src-based tests.

* chore: clarify no-emit test tsconfigs

Explicitly disable declaration-only emit in test tsconfigs so review tooling sees they are no-emit typecheck configs.
2026-05-05 15:34:22 +08:00
Chris Tam
71bf5634e8
feat(daemon): allow OD_MEDIA_CONFIG_DIR to relocate media-config.json (#411)
* feat(daemon): allow OD_MEDIA_CONFIG_DIR to relocate media-config.json

`media-config.ts:configFile()` always wrote to `<projectRoot>/.od/`,
which fails when the daemon runs from a read-only install (Nix store,
immutable image, etc.) — `PUT /api/media/config` then 500s with
ENOENT/EROFS on the mkdir.

Add an opt-in `OD_MEDIA_CONFIG_DIR` env var that, when set, redirects
the file to `<OD_MEDIA_CONFIG_DIR>/media-config.json`. Existing
workspace-local installs are unaffected — the default still resolves
to `<projectRoot>/.od/media-config.json`. The override is intended to
be pointed at the same writable directory as `OD_DATA_DIR`
(`~/.od`, `$XDG_CONFIG_HOME/open-design`, etc.) by the supervisor that
launches the daemon.

Adds a test pinning the override path so the fallback semantics don't
silently regress.

* fix(daemon): align OD_MEDIA_CONFIG_DIR with OD_DATA_DIR semantics + cover write path

Address PR review:

* The previous diff trimmed the env value but skipped path expansion,
  so `~/.od` would land at literal `./~/.od/media-config.json` and
  relatives anchored against process.cwd instead of <projectRoot>.
  Mirror server.ts:resolveDataDir() — `~/` expands to homedir, relatives
  resolve against projectRoot. Pinned with three new test cases
  (absolute, `~/`, relative-against-projectRoot).

* The packaged daemon (apps/packaged/src/sidecars.ts) doesn't pass
  OD_MEDIA_CONFIG_DIR through to the child and its allowlist won't
  forward it, so packaged installs would still fail. Add a fallback
  precedence: OD_MEDIA_CONFIG_DIR > OD_DATA_DIR > <projectRoot>/.od.
  Packaged installs (which already set OD_DATA_DIR) and the Home
  Manager / NixOS modules now route media-config there for free, no
  separate plumbing required. Workspace-local installs are untouched.

* Add a write-path test that points OD_MEDIA_CONFIG_DIR at a not-yet-
  existing nested directory, calls writeConfig, and asserts both file
  creation and round-trip read. This reproduces the original PUT
  /api/media/config failure mode.

* Document the precedence chain in AGENTS.md (single source of truth
  for module-level concerns) and the storage row of README.md, plus
  expand the file's header comment with a migration note for users
  who already had a custom OD_DATA_DIR alongside a workspace
  media-config.json.
2026-05-04 10:55:10 +08:00
iulian
02638af353
Add linux x64 AppImage to tools-pack and release workflows (#369)
* feat(tools-pack): extend config types for linux platform

* feat(tools-pack): add linux resource files (icon, .desktop template)

* feat(tools-pack): export linuxResources paths

* feat(tools-pack): scaffold linux.ts module

* chore(tools-pack): add vitest devdep for linux lane unit tests

* feat(tools-pack): add buildDockerArgs helper for containerized linux builds

* chore: update pnpm lockfile after adding vitest dep

* feat(tools-pack): add renderDesktopTemplate helper

* fix(tools-pack): use @@ICON_PATH@@ token in linux .desktop template

Reviewer flagged the third .replace() in renderDesktopTemplate as dead code
because the template hardcoded Icon=open-design-@@NAMESPACE@@ instead of
using @@ICON_PATH@@. Switch the template to @@ICON_PATH@@ so install logic
controls the icon stem name independent of namespace, and move the
sanitizeNamespace assertion out of the renderDesktopTemplate describe block
into its own describe.

* feat(tools-pack): add matchesAppImageProcess helper

* test(tools-pack): cover matchesAppImageProcess missing APPIMAGE env case

Closes a coverage gap flagged by code review: a process whose executable
matches /tmp/.mount_*/AppRun but has no APPIMAGE env should be rejected.
The implementation already returned false for this case (undefined ===
installPath is false); this test pins the behavior explicitly.

* feat(tools-pack): implement packLinux native build path

* fix(tools-pack): packLinux extra resources, output pre-clear, publish never

Code review flagged three plan-level omissions in packLinux that mac.ts
handles correctly:

1. writeAssembledApp now writes packagedConfigPath (open-design-config.json)
   with namespace, nodeCommandRelative, and namespaceBaseRoot. Without it
   apps/packaged falls back to defaults at runtime and cannot find the
   namespace runtime tree.

2. writeLinuxBuilderConfig now bundles the resource tree and packaged
   config into the AppImage via extraResources. Without it the running
   app cannot find skills/, design-systems/, craft/, frames/, or the
   bundled bin/node.

3. runElectronBuilderLinux now pre-clears appBuilderOutputRoot and passes
   --publish never to electron-builder, preventing stale-artifact bleed
   between runs and accidental publish attempts in CI when env tokens
   are present.

Also aligns appId with mac/win (io.open-design.desktop) and drops a
no-op productNameSafe template-literal.

* feat(tools-pack): implement containerized linux build via Docker

* feat(tools-pack): register linux CLI commands

* fix(tools-pack): align linux electron-builder config with mac.ts

Smoke testing the AppImage revealed the daemon sidecar was missing from
the bundled app.asar:

  Cannot find module '@open-design/daemon/dist/sidecar/index.js'

Root cause: writeLinuxBuilderConfig was missing the 'files' field, so
electron-builder used defaults that excluded transitive workspace deps
from the asar. Plus several other mac.ts patterns that I dropped from
the plan: artifactName, executableName, extraMetadata.main/name/
productName/version, npmRebuild=false, nodeGypRebuild=false,
buildDependenciesFromSource=false, compression=maximum, top-level icon.

Switch asar:true → asar:false to match mac.ts (easier to debug missing
files; perf difference negligible for dev installs).

* feat(tools-pack): implement linux install

* feat(tools-pack): implement linux start with extract-and-run

The packaged sidecar's 35-second wait timeout is exceeded when the
AppImage runs from a FUSE-mounted SquashFS (Node module loads + daemon
init are slow through FUSE). Pass --appimage-extract-and-run as the
first arg so AppImage extracts to /tmp first; subsequent file reads
go through a real filesystem and daemon boot completes in time.

Wait for apps/packaged to write desktop-root.json (60s ceiling, generous
to cover AppImage extraction overhead), then fetch desktop status via
sidecar IPC, return the merged LinuxStartResult.

* fix(tools-pack): align linux start helper with mac.ts (log echo + write semantics)

Code review flagged two unjustified divergences from mac.ts in
startPackedLinuxApp:

1. Missing OD_DESKTOP_LOG_ECHO=0 in spawn extraEnv. Without it the
   packaged logger echoes to the spawned process's stdout, which goes
   nowhere (logFd: null). Added the suppression to match mac.ts.

2. The desktop log truncate writeFile() was wrapped in .catch(() =>
   undefined), silently swallowing fs errors that would later surface as
   confusing missing-log symptoms. Removed the .catch so errors
   propagate per mac.ts.

Also added an inline comment explaining the 60s waitForMarker timeout
(vs mac's tighter ceiling) so the rationale is preserved at the
call site.

* feat(tools-pack): implement linux stop with marker validation

* fix(tools-pack): align linux stop with mac.ts (graceful shutdown + reason strings)

Code review flagged divergences from mac.ts in stopPackedLinuxApp:

1. No graceful IPC SHUTDOWN attempt before SIGTERM/SIGKILL. Mac's
   pattern lets Electron renderers + sidecars flush state (SQLite WAL,
   logs) first. `gracefulRequested: true` was hardcoded, lying to
   callers about what actually happened. Now attempts SHUTDOWN with a
   1500ms timeout and reports the actual outcome.

2. The dead-PID-but-marker-exists branch returned reason 'ok' (the
   neutral placeholder from readDesktopRootIdentityMarker), which says
   nothing useful. Override to 'marker-pid-not-running' to match mac.ts.

3. After a clean stop, remove the desktop-root.json marker so a
   subsequent start has a fresh slate (mac.ts does this too).

* fix(tools-pack): clear stale desktop-root.json before linux start

Smoke-testing the install/start/stop loop revealed waitForMarker
returns instantly when a stale marker from a previous run still exists
on disk (e.g., the previous AppImage was killed without going through
'tools-pack linux stop'). The start function then reports success
without actually waiting for the new spawn to write its own marker.

Defensively remove the marker file before spawning. mac.ts removes it
in stop, so a clean stop->start sequence has nothing to remove here.
This only matters for crash-recovery.

* fix(tools-pack): linux stop validates extract-and-run AppImages

Smoke testing exposed a gap from Task 7: matchesAppImageProcess only
recognized FUSE-mode (/tmp/.mount_*/AppRun) but Task 13 launches with
--appimage-extract-and-run, which puts the live executable at
/tmp/appimage_extracted_<hex>/<binary>. Stop's cmdOk validation
returned false, marker validation failed, the running app was
classified as 'unmanaged' and stop refused to kill it.

Fix:
1. matchesAppImageProcess accepts both runner patterns. Extract-and-run
   regex matches /^\/tmp\/appimage_extracted_[^/]+\/[^/]+$/.

2. stopPackedLinuxApp now passes paths.installAppImagePath (or the
   built fallback) as the canonical install path, not marker.appPath
   (which apps/packaged unhelpfully writes as '/' on Linux).

3. linux.test.ts gains 2 new tests covering the extract-and-run mode
   (both positive and the wrong-APPIMAGE-env negative case).

* fix(tools-pack): resolve linux paths in stop (typecheck regression from previous commit)

* feat(tools-pack): implement linux logs

* feat(tools-pack): implement linux uninstall

* feat(tools-pack): implement linux cleanup

* docs(tools-pack): document linux lane in READMEs and AGENTS files

* ci(release): add linux x64 AppImage to release-beta and release-stable

Mirrors the existing build_mac/build_win pattern with a build_linux job
in both release workflows. Builds via `tools-pack linux build
--containerized --to appimage` so the AppImage is linked against the
electronuserland/builder glibc 2.27 baseline (portable across distros)
rather than the ubuntu-latest glibc 2.39.

The linux asset is uploaded to the immutable version release tag
alongside mac/win. The beta channel-feed release (latest-mac.yml,
latest.yml) is intentionally not extended with latest-linux.yml because
tools/pack/src/linux.ts has no electron-builder publish block wired,
so the auto-update feed would point users at a feed that never updates.
AppImage auto-update is a separate follow-on.

Linux is unsigned (no signing path in tools-pack yet), so the beta
asset uses the .unsigned suffix matching the windows convention; the
stable asset uses no suffix, matching the stable windows convention.

* fix(tools-pack): propagate --dir/--portable into containerized linux build

The inner `pnpm tools-pack linux build` invocation in `buildDockerArgs`
only forwarded `--to` and `--namespace`. Callers passing `--dir` (e.g.
the new release workflows using `--dir $RUNNER_TEMP/tools-pack`) had
their flag silently dropped: the container defaulted to writing under
/project/.tmp/tools-pack while the host's `findBuiltAppImage` looked at
the caller's chosen `--dir`, producing "expected AppImage not found"
on any non-default tool-pack root. Callers passing `--portable` had
the same drop, baking build-machine runtime roots into shipped artifacts.

Fix:
- Mount `${config.roots.toolPackRoot}:/tools-pack` (new third volume,
  alongside the existing /project, /home/builder, and cache mounts).
- Forward `--dir /tools-pack` to the inner build so its output lands
  inside the mounted host dir.
- Forward `--portable` when `config.portable` is true.

The mount overlaps harmlessly with /project when toolPackRoot lives
under workspaceRoot (default case): Docker exposes the same host inode
at both paths. The existing .docker-home and .docker-cache/* mounts
continue to shadow the parent at their specific /home/builder paths.

Document the shell-interpolation safety invariant on the inner command:
config.namespace is sanitized at config-time, config.to is enum-validated,
config.portable is boolean -- none can carry shell metacharacters.

Tests: add coverage for the new /tools-pack mount, --dir forwarding,
and --portable propagation (both true and false branches).

Resolves the P1 review feedback from the Codex bot on PR #369.

* docs(tools-pack): polish linux README based on PR review

Addresses non-blocking P2/P3 review feedback on PR #369:

- AppImage launch mode: name the test distros (Ubuntu 24.04, Arch Linux)
  and frame the FUSE-vs-extract-and-run gap as an order-of-magnitude
  improvement instead of an unspecified slowdown.
- Optional system tools: add a libfuse2 paragraph distinguishing FUSE
  launch (needs libfuse2) from extract-and-run (does not), with the
  Ubuntu-24-vs-pre-24 package name caveat.
- New section "Format choice: why AppImage first" anchoring the
  AppImage-only decision against industry precedent (VS Code, Discord,
  Slack, Cursor, Obsidian) so the rationale survives without a reviewer.
- Out of scope: convert the dense one-liner into a bulleted list, mark
  AppImage signing as gated on GPG infra + verification flow design
  (no ETA), explain the latest-linux.yml gap, and remove the now-stale
  "release lane" entry since this PR adds it.

* fix(tools-pack): add --appimage-extract-and-run to installed .desktop launcher

The XDG .desktop file installed by `tools-pack linux install` invoked
the AppImage directly via `Exec=env OD_NAMESPACE=<ns> <exec> %U`. That
bypassed the extract-and-run flag that `tools-pack linux start` applies,
so menu launches and `od://` desktop activations could hit the FUSE
slow path that was already shown to make the daemon sidecar exceed
apps/packaged's 35-second startup timeout. CLI-spawned starts succeeded
while menu-launched starts could fail with the same artifact.

Add `--appimage-extract-and-run` to the template's `Exec=` line and
update the renderDesktopTemplate test expectation. New regression test
locks the flag into place so a future template edit can't silently
drop it.

Resolves a P1 review finding from mrcfps/Looper on PR #369.

* fix(tools-pack): treat signal-terminated container builds as failures

`runBuildInContainer` resolved the build promise on `code === null`,
which in Node's child-process `exit` event means the child was
terminated by a signal (SIGTERM, SIGKILL, OOM-killer, parent process
death). A killed Docker build could therefore make `packLinux` report
a containerized build as complete even though the artifact was
partial or missing.

Accept the `signal` argument on the exit handler. Resolve only when
`code === 0 && signal == null`. Otherwise reject with a message
naming either the non-zero code or the terminating signal so the
failure mode is visible in CI logs and `tools-pack linux build --json`
output.

Resolves a P1 review finding from mrcfps/Looper on PR #369.

* fix(tools-pack): tear down orphaned process tree on failed linux start

If `startPackedLinuxApp` spawned the AppImage but the post-spawn
readiness path then failed -- either because the 60s waitForMarker
ceiling elapsed without the daemon writing desktop-root.json, or
because fetchDesktopStatus threw -- the detached child was left
running. Because the marker is the only persistent identity source
used by `stopPackedLinuxApp`, future lifecycle commands could not
associate the orphan with the namespace, leaving stale Electron and
sidecar processes plus stale IPC sockets that would interfere with
subsequent starts.

Wrap the readiness wait + status fetch in try/catch. On failure,
collect the spawned child's process tree via listProcessSnapshots +
collectProcessTreePids and stopProcesses() it (the same path
stopPackedLinuxApp uses for its tree teardown), then rethrow the
original error. Cleanup errors are swallowed so the original failure
is preserved in the rejection.

Extract the tree-teardown helper as `teardownOrphanedStart` so the
intent is documented at the call site without inlining 4 imports of
implementation detail.

Resolves a P2 review finding from mrcfps/Looper on PR #369.

* fix(tools-pack): use `corepack pnpm` in containerized linux build

The inner command in `buildDockerArgs` started with `corepack enable`,
which writes pnpm/yarn/npm shims into the directory containing the
node binary. In `electronuserland/builder:base`, that directory is
owned by root, but the container runs as the host's non-root uid via
`--user` (so build artifacts come out owned by the caller, not root).
The `corepack enable` step therefore fails with EACCES before
`pnpm install` ever runs, blocking the new release `build_linux` job
from publishing the Linux AppImage.

Switch to `corepack pnpm install --frozen-lockfile && corepack pnpm
tools-pack linux build ...`, which resolves and runs the version of
pnpm pinned in package.json's `packageManager` field directly. No
shims, no global mutation, no root writes — corepack just dispatches
to the pinned binary as the unprivileged user.

Update the existing inner-command test to match the new corepack
invocation, and add a regression test that asserts the inner command
contains `corepack pnpm` and never `corepack enable` so a future edit
can't reintroduce the root-write requirement.

Resolves a P1 review finding from mrcfps/Looper on PR #369.

* fix(tools-pack): accept menu-launched processes in linux stop/uninstall

stopPackedLinuxApp validated the live process via matchesStampedProcess
against the process command line, requiring a SIDECAR_SOURCES.TOOLS_PACK
stamp. That worked for `tools-pack linux start` (which spawns with
createProcessStampArgs), but rejected menu launches: the installed
.desktop entry only sets OD_NAMESPACE and does not pass stamp args, so
apps/packaged falls back to a SIDECAR_SOURCES.PACKAGED stamp written
into desktop-root.json -- a perfectly valid identity, just not the one
the validator accepted.

Symptoms with the old behavior:
  - `tools-pack linux stop` reported `unmanaged` for menu-launched apps
    and refused to stop them.
  - `tools-pack linux uninstall` would happily remove the AppImage,
    .desktop entry, and icon while the packaged app was still running,
    breaking handles to the AppImage's mounted/extracted contents.

Switch the validator to read marker.stamp directly (the file content
written by apps/packaged itself, not the process command) and accept
either TOOLS_PACK or PACKAGED. The expected app/mode/namespace/ipc
fields are still required to match. Mirrors the dual-source acceptance
pattern in mac.ts:709-714.

The matchesAppImageProcess (cmdOk) and namespaceRoot checks are
preserved -- the marker still has to point at our AppImage at a path
in our namespace's runtime root.

Drop the now-unused matchesStampedProcess import.

Resolves a P1 review finding from mrcfps/Looper on PR #369.

* fix(tools-pack): per-platform --to help text in CLI

addBuildOptions is shared across mac/win/linux but its --to help text
hard-coded the mac targets (all|app|dmg|zip), so:
  - tools-pack linux --help advertised --to all|app|dmg|zip even
    though resolveToolPackBuildOutput accepts only all|appimage|dir,
    sending users at invalid targets and hiding the AppImage option.
  - tools-pack win --help had the same problem (advertised mac
    targets while accepting all|dir|nsis with default nsis).

Parameterize addBuildOptions(command, platform) and back it with a
TO_HELP_BY_PLATFORM table that mirrors the resolver's accepted targets
in config.ts. Update the three call sites.

Smoke verified by running --help for each platform:
  linux: all|appimage|dir (default: all)
  mac:   all|app|dmg|zip (default: all)
  win:   all|dir|nsis (default: nsis)

The misleading "--signed: build a signed/notarized mac artifact" line
on win/linux is left alone -- out of scope for this fix and not part
of the review feedback.

Resolves a P3 review finding from mrcfps/Looper on PR #369.

* fix(tools-pack): use OD_PACKAGED_NAMESPACE in installed .desktop launcher

The installed .desktop entry's Exec= line set OD_NAMESPACE=<ns>, but
apps/packaged/src/config.ts:9 reads namespace overrides from
OD_PACKAGED_NAMESPACE, not OD_NAMESPACE. The env assignment was a
silent no-op for menu launches: the packaged app fell back to whatever
namespace was baked into open-design-config.json at install time,
ignoring the namespace advertised in the .desktop file.

Practical effect: a .desktop launcher created for namespace "foo"
could end up running as the namespace baked into the AppImage's
shipped config (typically "default"), so installs created across
multiple namespaces could collide silently from menu launches. CLI
launches via `tools-pack linux start` were unaffected because they
pass the namespace through createSidecarLaunchEnv which targets the
correct env var.

Switch the template to OD_PACKAGED_NAMESPACE. Update the existing
renderDesktopTemplate test fixture/expectation, and add a regression
test that asserts the Exec= line uses OD_PACKAGED_NAMESPACE and never
the wrong OD_NAMESPACE name.

Resolves a P1 review finding from mrcfps/Looper on PR #369.

* fix(tools-pack): gate linux uninstall + cleanup on stop status

uninstallPackedLinuxApp called stopPackedLinuxApp first, then deleted
the AppImage / .desktop entry / icon unconditionally. cleanupPackedLinux
Namespace did the same with the output and runtime namespace roots.
Both ignored stop.status -- so when stop returned "partial" (some
processes survived SIGTERM->SIGKILL) or "unmanaged" (the running PID
failed marker validation), uninstall would yank the install files out
from under a still-running packaged app, breaking handles to the
mounted/extracted AppImage contents and leaving an orphan with stale
SQLite WAL files / log handles / IPC sockets.

Extract a small `isSafeToRemoveInstallFiles(stop)` helper that returns
true only for "stopped" or "not-running". Both uninstall and cleanup
short-circuit when it returns false:

  - uninstall reports "skipped-process-running" for each removal slot
    and "skipped" for the post-install hooks. Existing "ok" / "already-
    removed" / "ok"|"missing"|"failed" paths are unchanged.
  - cleanup leaves both removed* booleans false and adds a new
    `skipped: boolean` field set to true. Old consumers that only read
    the booleans see the same "nothing was removed" signal they would
    have seen for an already-clean namespace; new consumers can
    distinguish "nothing to remove" from "refused to remove."

LinuxUninstallResult.removed.{appImage,desktop,icon} now also accepts
"skipped-process-running"; LinuxUninstallResult.postUninstall.* now
also accepts "skipped". LinuxCleanupResult gains the `skipped` field.
Workspace typecheck clean -- the only consumer is the CLI's printJson,
which doesn't constrain the wire shape.

Resolves a P1 review finding from mrcfps/Looper on PR #369.
2026-05-04 00:49:00 +08:00
Tom Huang
1edab990bb
feat(craft): add brand-agnostic craft references + Refero-derived lint rules (#225)
* feat(craft): add brand-agnostic craft references and refero-derived lint rules

Introduce `craft/` as a third top-level content axis alongside `skills/`
and `design-systems/`, holding universal (brand-agnostic) craft rules
that apply on top of any DESIGN.md. Skills opt in via a new
`od.craft.requires` front-matter array; the daemon resolves the slug
list and injects the matching files between DESIGN.md and the skill
body in the system prompt.

Initial vendor (MIT, adapted from referodesign/refero_skill): typography
craft, color craft, anti-ai-slop. Pilot wired on saas-landing.

Extend the existing lint-artifact pass with two refero-derived rules:
- P0 ai-default-indigo — solid #6366f1 / #4f46e5 / #4338ca / #8b5cf6 as
  accent (not just gradients) is the most-reported AI tell.
- P1 all-caps-no-tracking — `text-transform: uppercase` rules without
  ≥0.06em letter-spacing.

The craft loader silently drops missing files so a skill can
forward-reference future sections (e.g. `motion`) without breaking.

* fix(daemon): skip :root token blocks in ai-default-indigo lint

The ai-default-indigo P0 check scanned the whole HTML for the raw
hex, so brands that intentionally encode indigo as `--accent: #6366f1`
in :root and consume it via var(--accent) downstream were flagged
as AI-default — a false positive that forced the agent to "fix"
valid output. Strip :root token-definition blocks (including
attribute-selector theme variants) before scanning, mirroring the
existing pattern used by the raw-hex P1 check. Hex still flagged
when it appears in component rules or inline styles.

* docs(craft): address PR #225 P3 review feedback

- craft/README.md: explain why missing craft sections are silently
  dropped (forward-compatibility) instead of surfacing a warning.
- craft/typography.md: ground the 0.06em ALL CAPS tracking floor in
  Bringhurst-derived typographic practice rather than presenting
  the threshold as unattributed.
- craft/color.md: cover the edge case where a brand's DESIGN.md
  intentionally encodes indigo as --accent — `var(--accent)` uses
  remain unflagged because the linter only inspects hardcoded hex.
- docs/skills-protocol.md: link the "missing files dropped silently"
  note back to craft/README.md for the canonical slug list and the
  rationale behind the choice.

* fix(craft): address PR #225 P0 review feedback

- tools/pack: copy `craft/` into the packaged resource root alongside
  `skills`, `design-systems`, and `frames`, so the `od.craft.requires`
  integration isn't a silent no-op when the daemon resolves
  `${OD_RESOURCE_ROOT}/craft` in packaged builds.
- packages/contracts: add `craftRequires?: string[]` to `SkillSummary`
  (and therefore `SkillDetail`) so the field that `listSkills()`
  already returns and `/api/skills(/:id)` already serializes via
  `...rest` is part of the documented web/daemon contract instead of
  leaking through as an untyped property.
- apps/daemon/lint-artifact: expand the indigo token-strip pass to
  cover selector lists containing `:root` (e.g. `:root, [data-theme="light"]`)
  and any rule whose body is custom-property-only (e.g. a
  `[data-theme="dark"] { --accent: ... }` theme variant). Real
  component rules with a hardcoded indigo are still preserved so the
  P0 finding still fires; tests cover the new selector-list and
  theme-variant cases.

* fix(craft): address PR #225 follow-up review feedback

- lint-artifact: scope the indigo token-strip to <style> blocks so the
  rule-shaped regex no longer captures leading `<style>` text into the
  selector (which broke `:root` recognition for token blocks that mix
  `color-scheme`/etc. with `--accent`). Run the strip on the extracted
  CSS instead, with a regression covering `:root { color-scheme: light;
  --accent: #6366f1 }`.
- lint-artifact: tighten the custom-property-only exemption to global
  theme-scope selectors (`:root`, `html`, `body`, bare attribute
  selectors like `[data-theme="dark"]`). Component-local rules such as
  `.cta { --cta-bg: #6366f1 }` are no longer exempted, so an agent
  cannot launder default indigo through a local var. Regression test
  added.
- craft/anti-ai-slop.md: stop claiming every rule below is enforced by
  the linter; only several are. The unenforced rules (standard
  Hero→Features→Pricing→FAQ→CTA flow, decorative blob/wave SVG
  backgrounds, perfect symmetry) are now flagged inline as
  "(guidance, not auto-checked)" so the contract with the lint surface
  stays honest.

* fix(daemon): tighten lint-artifact iteration and :root token gating

- all-caps-no-tracking: iterate every <style> block. The previous
  check called `exec` once on a non-global regex, so an artifact
  whose offending uppercase rule sat in a second <style> block
  (e.g. a reset block followed by a components block) slipped
  past. Switch to `matchAll` and break across both loops once a
  violation is found. Regression test covers a second-block
  uppercase rule.
- ai-default-indigo: stop unconditionally exempting any selector
  list containing `:root`. The exemption now requires both
  conditions to hold: every selector in the list is global theme
  scope AND the body is token-shaped (CSS custom properties or
  the `color-scheme` keyword). So `:root { background: #6366f1 }`
  and `:root, .cta { --cta-bg: #6366f1 }` no longer launder a
  hardcoded indigo through the strip pass. Regression tests cover
  both bypass shapes.

* fix(daemon): scope theme-attr exemption and strip CSS comments in token blocks

Address PR #225 review feedback on `ai-default-indigo`:

- The bare-attribute branch of `selectorListIsGlobalThemeScope` accepted
  any `[attr=...]` selector, so a custom-property-only rule on a
  component/state attribute (e.g. `[data-variant="primary"]`,
  `[aria-current="page"]`) was treated as a global theme block and
  stripped before the indigo scan — exactly the component-local indigo
  laundering this lint is meant to catch. Restrict the exemption to a
  small allowlist of known theme switches: `data-theme`,
  `data-color-scheme`, `data-mode`.
- `stripTokenBlocksFromCss` split rule bodies on `;` and matched each
  fragment from the start, so a token block whose body contained a
  normal CSS comment such as `:root { /* brand accent */ --accent:
  #6366f1; }` produced a fragment beginning with the comment, failed
  `isTokenShapedDeclaration`, and the rule was left in scope of the
  indigo scan — a false P0 on a legitimate token definition. Strip CSS
  comments before splitting/classifying declarations.

Add regression coverage: arbitrary component/state attribute selectors
still trip `ai-default-indigo`; `data-color-scheme` theme variants stay
exempted; `:root` token blocks with leading, trailing, and
between-declaration CSS comments are recognized.

* fix(daemon): strip CSS comments and recognize tokens nested in at-rules

The all-caps-no-tracking scan ran against raw `<style>` content, so a
commented-out rule like `/* .eyebrow { text-transform: uppercase; } */`
matched `upperRe` and emitted a P1 for CSS the browser ignores. Strip
CSS comments from the style body before structural matching.

`stripTokenBlocksFromCss` only matched flat `selector { body }` rules,
so a media-query-wrapped token block like
`@media (prefers-color-scheme: dark) { :root { --accent: #6366f1 } }`
had its outer `@media` rule treated as the selector/body pair and the
inner `:root` token block was never stripped, producing a P0 false
positive on legitimate responsive theme CSS. Tighten the body
alternation to `[^{}]*` so the regex matches innermost rules and
recognizes the inner `:root` block directly while preserving the
outer at-rule wrapper.

* fix(daemon): align ai-default-indigo list with documented cardinal sins

The lint's AI_DEFAULT_INDIGO subset omitted #3730a3 and #a855f7, which
craft/anti-ai-slop.md lists as P0-blocked solid accents. An artifact
could hard-code one of those documented colors as a button fill and
slip past the indigo scan unless it happened to be inside a gradient.

Bring the lint set to the exact list documented in the craft doc, and
tighten the doc's wording from "etc." to an explicit enumeration that
points at AI_DEFAULT_INDIGO so the prompt contract and daemon behavior
stay in sync. Add regression tests pinning each newly-included hex.

* fix(daemon): tighten theme-scope selector and scan inline ALL CAPS

The theme-scope exemption used to accept any attribute on `:root`,
`html`, or `body` (e.g. `:root[data-variant="primary"]`), letting an
agent launder default indigo through a component/state attribute and
slip past the `ai-default-indigo` lint. The prefixed branches now
require the attribute name to be one of GLOBAL_THEME_ATTRIBUTES,
matching the bare-attribute branch.

The `all-caps-no-tracking` rule only iterated `<style>` blocks, so
inline declarations like `<span style="text-transform: uppercase">`
produced no finding even though craft/typography.md treats the
≥0.06em tracking floor as having no exceptions. Added a second scan
over `style="..."` attributes that runs the same letter-spacing
check and dedupes against the existing `<style>`-block finding so
the agent gets a single corrective signal per artifact.

* fix(daemon): align uppercase tracking px floor with the 0.06em rule

The previous absolute fallback (>=1.5px) was stricter than the craft
rule it enforces. `font-size: 12px; letter-spacing: 1px` is 0.083em
— above the 0.06em floor — but 1.5px would reject it and trigger an
unnecessary correction loop on compliant small-label CSS.

Extract `hasAdequateUppercaseTracking`: read `font-size` from the same
rule body and compare px tracking against `fontSize * 0.06`; fall back
to a conservative >=1px floor when font-size is inherited (covers the
default 16px body where 1px ≈ 0.0625em). Apply the helper to both the
<style>-block scan and the inline-style scan, and add 12–14px label
tests in both branches.

* fix(daemon): treat rem letter-spacing as absolute, not per-element em

`rem` was previously folded into the same branch as `em` and accepted
at the 0.06 threshold. But `rem` is relative to the root font-size
(16px default), not the element's own font-size, so on a 48px heading
`letter-spacing: 0.06rem` resolves to 0.96px — about 0.02em of the
element, well below the 0.06em rule the lint enforces.

Convert rem to absolute px through the 16px root assumption and reuse
the same px-vs-element-font-size resolution: same-rule `font-size: <n>px`
gives an exact `n * 0.06` floor; otherwise the conservative >=1px
fallback applies. Add regression tests for 48px headings with 0.06rem
tracking (must flag) plus the 16px-element and rem-floor matches that
must keep passing, in both <style>-block and inline-style branches.

* fix(daemon): resolve var() refs in uppercase tracking lint

`hasAdequateUppercaseTracking` only matched literal numeric values,
so a tokenized rule like `letter-spacing: var(--caps-tracking)` —
exactly the pattern the craft prompt steers artifacts toward — was
falsely reported as `all-caps-no-tracking`. Extract `--name: value`
declarations from global theme scopes (`:root`, `html`, theme-attribute
selectors) once per artifact, then expand simple `var(--name)` (and
`var(--name, fallback)`) references in the inspected rule body before
applying the existing 0.06em / px-floor / rem-conversion logic.
References without a matching token and no fallback stay in place,
preserving the conservative "missing tracking" finding.

* fix(daemon): resolve rem and var() font-size in uppercase tracking lint

Previously the px-vs-element-font-size resolution only matched
`font-size: <n>px`. Any rem-based or tokenized display size fell
through to the lenient `>= 1px` body-text fallback, so an artifact
emitting `.display { font-size: 3rem; text-transform: uppercase;
letter-spacing: 1px; }` (a ~48px heading with a 2.88px floor) slipped
past the lint that this helper exists to enforce.

Resolve `rem` font-size via the same root-font assumption already used
for tracking, and treat any explicitly declared but unresolvable unit
(`em`, `%`, `calc(...)`, an unresolved `var(...)`) conservatively —
refuse the lenient fallback so the rule must use either an `em`
letter-spacing or a verifiable px/rem font-size.

`var()` font-size declarations resolve through the existing
`resolveCssVars` pass before the size scan runs, so the same fix
catches the tokenized-display-size pattern (`--display-size: 3rem`).

* fix(daemon): parse declarations to ignore custom-prop names in uppercase tracking lint

The hasAdequateUppercaseTracking and resolveFontSizePx helpers used substring regexes against the rule body, so a token-name declaration such as `--letter-spacing: 0.08em` or `--display-font-size: 48px` could satisfy the `letter-spacing` / `font-size` checks even though it has no rendered effect — letting actual ALL-CAPS-without-tracking rules slip past the P1 lint.

Parse the declaration list, compare exact property names, and skip declarations whose property starts with `--`. Adds regression tests covering token-name letter-spacing (style-block + inline) and a token-name font-size masking the bail-out branch.

* fix(daemon): scope indigo token exemption to --accent only

Previously stripTokenBlocksFromCss removed every custom-property-only
global theme block before the ai-default-indigo scan, which let a
laundered indigo token like `:root { --primary: #6366f1 }` consumed
via `var(--primary)` slip past the lint. The craft contract is that
the only escape hatch is encoding indigo as the design system's
`--accent` token; any other token name is still the LLM-default
color hidden behind an arbitrary name. Narrow the strip pass so a
non-`--accent` token whose value carries an AI-default indigo hex
keeps the rule in scope, and add regression tests for `--primary` /
`--button-bg` global tokens feeding a CTA, including the at-rule
and theme-attribute variants.

* fix(daemon): model CSS cascade in tracking lint and detect blue→cyan trust gradients

Address PR #225 review feedback (3 comments):

- `letter-spacing` / `font-size` selection now picks the LAST matching
  declaration in the rule body, modeling CSS source-order cascade.
  `.eyebrow { letter-spacing: 0.08em; letter-spacing: 0.02em }` renders
  the noncompliant 0.02em the browser actually shows; the previous
  first-match behaviour silently passed it.
- `extractCssTokens` now records every distinct value seen for a token
  across global theme scopes, and `hasAdequateUppercaseTracking`
  enumerates each combination so a default-theme value below the floor
  cannot be rescued by a scoped override that happened to be parsed
  later (`:root { --caps-tracking: 0.02em }` +
  `[data-theme="dark"] { --caps-tracking: 0.08em }` now fires).
- New `trust-gradient` P0 rule pairs blue/sky tokens against cyan
  tokens in `linear-gradient(...)` bodies so `blue→cyan` two-stop
  trust gradients (documented as a cardinal sin in
  `craft/anti-ai-slop.md`) are actually enforced — both the hex form
  (`linear-gradient(90deg, #3b82f6, #06b6d4)`) and the keyword form
  (`linear-gradient(90deg, blue, cyan)`).

Adds 11 regression tests covering each path (cascade override in
<style> and inline form, font-size cascade shifting the floor, both
orderings of the conflicting-token cascade, the don't-over-fire case
when every theme value clears the floor, hex / keyword / sky variants
of the trust gradient, and the don't-double-fire case when
purple-gradient already caught a mixed gradient).

* fix(daemon): apply per-scope cascade in extractCssTokens

When the same CSS custom property is declared more than once inside a
single rule body (e.g. `:root { --caps-tracking: 0.02em;
--caps-tracking: 0.08em }`), CSS source-order cascade collapses to the
last value; the earlier declaration never reaches any element.
`extractCssTokens` was treating intra-scope duplicates as simultaneous
theme alternatives, so `hasAdequateUppercaseTracking` enumerated the
stale 0.02em and emitted a spurious all-caps-no-tracking finding.

Collapse duplicate token declarations within a rule body to the last
value before merging into the cross-scope distinct-value map. Cross-scope
overrides (separate `:root` and `[data-theme]` rules) remain preserved
as distinct values so the conservative theme-cascade check still fires
when ANY applicable theme renders below the floor.

* fix(daemon): scope tracking lint to innermost rules and per-theme tokens

Restrict the upperRe body alternation to [^{}]* so the regex matches
innermost CSS rules and skips at-rule wrappers — an outer @media or
@supports could otherwise capture as a single rule whose selector was
the at-rule and whose body began with the inner selector token, masking
the same-rule font-size and letting noncompliant tracking on large
headings slip through the lenient inherited-size fallback.

Replace the by-name-distinct-values token map with per-scope token
records and a buildResolvedThemes pass that materializes one effective
map per theme. Paired token declarations now stay paired during
evaluation, so theme variants like :root + [data-theme=dark] no longer
generate cross-theme cartesian pairings (e.g. default-size + dark-track)
that emit false positives on legitimate light/dark themes.

---------

Co-authored-by: looper <looper@open-claude.dev>
2026-05-02 11:00:33 +08:00
PerishFire
a40d817d28
Add mac packaged runtime and beta release flow (#170)
* feat(pack): add mac packaged runtime control plane

* feat(pack): harden mac packaged runtime lifecycle

Keep packaged state namespace-scoped, make daemon paths explicit through sidecar launch env, and add conservative desktop identity/logging fallbacks for local mac package validation.

* feat(pack): add mac beta release flow

* fix(pack): generate mac update feed fallback

* fix(pack): write portable beta checksums

* fix(pack): make beta artifacts portable

* fix(pack): clean up mac install visuals

* fix(pack): address packaged runtime review feedback
2026-04-30 20:25:49 +08:00
nettee
9b57c22c38
docs: add git commit co-author policy (#131) 2026-04-30 15:08:22 +08:00
nettee
86c256ad56
Improve tools-dev web startup flow (#128) 2026-04-30 14:58:52 +08:00
PerishFire
c6d11018a0
Refresh desktop integration control plane (#123)
* feat(dev): add desktop tools-dev control plane

* refactor(sidecar): split Open Design contracts

Move Open Design-specific sidecar protocol definitions into @open-design/contracts so sidecar and platform can remain descriptor-driven primitives.

* refactor(daemon): organize package sources

Keep daemon app code, tests, and sidecar entrypoints in separate package directories so each layer can be built and verified independently.

* chore(repo): streamline maintenance entrypoints

Centralize agent guidance by directory and reduce root command chains while preserving the existing build scope.

* docs: translate agent guidance to English

* fix(sidecar): tolerate stale IPC sockets

Remove stale Unix socket files only after confirming no listener is active, so tools-dev can restart after unclean shutdowns.
2026-04-30 14:23:53 +08:00
nettee
56d08b8c5f
Add shared contracts and migrate project code to TypeScript (#118) 2026-04-30 13:01:15 +08:00
nettee
950c0bfd32
init AGENTS.md (#114) 2026-04-30 11:58:03 +08:00