Commit graph

727 commits

Author SHA1 Message Date
chaoxiaoche
bcc58af931
refactor(web): rename Execution mode and tighten settings dialog UI (#1568)
* refactor(web): rename Execution mode and tighten settings dialog UI

- Rename "Settings → Execution & model" to "Settings → Execution mode"
  across the web UI, i18n keys, docs, and e2e selectors.
- Redesign SettingsDialog: kicker + title row in the modal head, a
  flatMap-driven agent grid that renders the inline test-result row
  beside the selected card, compact unavailable cards with right-aligned
  install/docs links, and an install guide that only shows when the
  user has no working agent picked.
- Trim verbose subtitle / hint copy across chat model, CLI proxy,
  media providers, custom instructions, and memory sections.
- Add an `info` Icon variant for the redesigned settings hints.
- Update e2e selectors and docs that referenced the old menu label.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(web): polish Settings dialog — media providers, skills, MCP

Media providers
- Hide internal Stub fixture provider (settingsVisible: false)
- Split provider list into Available (integrated, editable) and Coming
  Soon (collapsed <details> drawer with name/hint/Docs link only)
- Drop right-side Integrated/Configured badges from every row; all rows
  in the main list are integrated by definition; inline grey "Saved"
  chip next to the provider name is the only status indicator now
- "Saved" badge moves inline to the right of the provider name and uses
  a neutral grey treatment (was a standalone green pill below the name)
- "Reload from daemon" button shows a 2s green "✓ Reloaded" flash on
  success instead of leaving a permanent paragraph under the header;
  errors remain sticky

Skills
- Replace three pill-row filter banks (Source, Type, Category) with a
  compact single-row toolbar: search + three inline <select> dropdowns
  side by side; active filter highlighted with a stronger border

MCP server
- Shorten section hint to one line
- Move WHAT YOUR AGENT CAN DO capabilities above the client dropdown
  (motivate before asking to act)
- Move "Build the daemon first" warning below the code block where it
  contextually explains why the command might fail, not as a top-level
  error before the user has done anything
- Downgrade "Restart your client" left-border from accent orange to
  border-strong grey — it is a next step, not a warning

External MCP
- Shorten section hint to one line

Misc CSS
- Add .sr-only utility for accessible off-screen live regions
- Add button.ghost.is-success-flash for transient success feedback
- Add .library-filter-selects / .library-filter-select for dropdown
  filter rows
- Add .media-provider-coming-soon-* for the roadmap drawer

Co-authored-by: Cursor <cursoragent@cursor.com>

* [codex] Add Cursor Agent auth diagnostics (#1538)

* Add Cursor Agent auth diagnostics

* Handle Cursor not logged in auth status

* Address Cursor auth review feedback

* Classify Cursor stdout auth failures

* test: expand Memory and Routines coverage (#1521)

* test: expand settings and packaged coverage

* test: extend memory settings coverage

* test: cover routine settings failure states

* test: cover routine operation failures

* test: fix daemon test typing on CI

* test: decouple packaged smoke from orbit bug

* test: avoid live memory LLM calls in route tests

* test: fix daemon fetch typing in CI

* fix: restore preview comment and inspect toggles

* test: align manual edit flow with current inspector UX

* test: align comment attachment flow with current preview comments UI

* fix: probe resolved Codex launch path during detection

* fix: remove duplicate board activation helper after rebase

* test: update ghost cli detection mock

* test: align FileViewer toolbar expectation

* ci: move full app tests to extended lane

* ci: run app tests by changed scope

* ci: cover shared app inputs in test scopes

* ci: avoid setup-node cache in windows packaged smoke

* test: align extended settings and manual edit flows

* refactor(web): settings dialog UX polish — layout, dedup, and interactions

- Remove duplicate section headers from all settings sections
  (Notifications, Appearance, Privacy, About, Design Systems, Skills,
  MCP server, Connectors, Media providers, Routines)
- Restructure Notifications cards: title + toggle on same row, hint below
- Restructure Skills toolbar: search + New skill button in row 1,
  filter dropdowns in row 2 with left-aligned labels
- Restructure Pet section: tabs and Wake button on same row
- MCP server: group capabilities and setup into separate cards,
  remove nested double border on client picker
- Connectors: show connect errors as toast instead of inline card text,
  position toast inside panel, hide single-provider tab
- Media providers: move Reload button to left-aligned small ghost button
- Memory: info icon shows path on hover, Path copied badge inline;
  Extraction history and MEMORY.md as standalone collapsible cards;
  group header hidden when only one type visible
- Pet grid cards: Adopt button hidden until hover, icon-only when adopted,
  description truncated to 2 lines, text fills full width via abs positioning
- Agent cards: selected state uses accent border only, no background change
- Add sun/moon icons to Appearance theme buttons (Light/Dark)
- Shorten several hint strings for clarity

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): resolve i18n review comments from PR #1568

- Update settings.title and settings.envConfigure to localized
  "Execution mode" in all 17 non-English locale files
- Add settings.memoryFlashPathCopied to all locales and use t()
  in MemorySection instead of hardcoded English "Path copied"
- Add settings.agentModelHead to all locales and use t() in
  SettingsDialog for "Model for:" agent model row header

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): update tests to match settings dialog redesign

- Add role prop to Toast (alert/status) so error toasts from
  ConnectorsBrowser are announced immediately by screen readers
- Clear connectErrorToast on successful connector retry
- Update SettingsDialog.execution tests:
  - Remove heading assertions for About and MCP server (headers
    were intentionally removed as duplicate nav labels)
  - Rewrite CLI env test to use codex-only fields (per-agent
    filtering means only selected agent's fields are shown)
  - Update Composio key hint text assertion to match shortened copy
  - Replace filter button click with select change for Type filter
  - Replace Configured/Unsupported/Integrated badge checks with
    updated assertions matching the new media provider UI
  - Replace disabled BFL row test with coming-soon section check
- Update SettingsDialog.media test: remove Fal.ai input assertions
  (non-integrated providers no longer have editable fields)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): unblock CI for #1568

Three small fixes to get Playwright back to green on the settings
dialog redesign:

1. `en.ts`: revert `settings.envConfigure` to "Configure execution mode".
   This PR collapsed both `settings.title` (header gear) and
   `settings.envConfigure` (entry-side foot pill) to the same string
   "Execution mode", so `getByRole('button', { name: 'Execution mode' })`
   resolved to two elements and tripped Playwright strict mode in the
   three Composio-flow tests (entry-configuration-flows.test.ts:174,
   228, 285). Restoring the distinct label also gives screen readers
   a clearer hint for the pill, which doubles as a status display.
   Non-English locales still alias the two keys; happy to follow up
   on those, but they don't gate the (English-only) Playwright suite.

2. entry-configuration-flows.test.ts:167 — `Connectors` heading is now
   rendered at `<h2>` in the modal-head (SettingsDialog.tsx:1545), with
   the inner `<h3>` removed by design (see comment around line 1448).
   Updated the assertion from `level: 3` to `level: 2`.

3. project-management-flows.test.ts:360 — same change for the `Pets`
   heading.

Verified locally with `pnpm --filter @open-design/web typecheck` and
`pnpm --filter @open-design/e2e typecheck`. The actual Playwright
specs need the dev server up; I didn't rerun them here, but the
locator changes are mechanical and match the new DOM.

* fix(web): use exact match for Execution mode button locator

Playwright's `getByRole(role, { name })` defaults to case-insensitive
substring matching, so `{ name: 'Execution mode' }` still resolved to
both the header gear (aria-label "Execution mode") and the entry-side
foot pill (aria-label "Configure execution mode", which contains
"execution mode").
Strict mode tripped in the three composio-flow tests at lines 202,
257, and 319.

Adding `exact: true` makes each call resolve to just the header gear,
which opens the same dialog the foot pill does — the test outcomes
are unchanged.
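
The difference between the default and `exact: true` matching can be modelled with a small standalone function. This is a simplified sketch of the documented rule (default: case-insensitive substring; exact: case-sensitive whole string), not Playwright's implementation; `matchesAccessibleName` is a hypothetical helper introduced only for illustration.

```typescript
// Simplified model of an accessible-name filter. Not Playwright code:
// `matchesAccessibleName` is a hypothetical helper.
function matchesAccessibleName(label: string, wanted: string, exact = false): boolean {
  const normalized = label.trim().replace(/\s+/g, " ");
  if (exact) {
    // exact: case-sensitive, whole-string comparison
    return normalized === wanted;
  }
  // default: case-insensitive substring search
  return normalized.toLowerCase().indexOf(wanted.toLowerCase()) !== -1;
}

// The two aria-labels named in the commit message.
const ariaLabels = ["Execution mode", "Configure execution mode"];

const loose = ariaLabels.filter((l) => matchesAccessibleName(l, "Execution mode"));
const strict = ariaLabels.filter((l) => matchesAccessibleName(l, "Execution mode", true));
```

Under this model the default lookup matches both labels, which is what trips strict mode, while the exact lookup matches only the header gear.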

---------

Co-authored-by: chaoxiaoche <chaoxiaoche@chaoxiaochedeMacBook-Pro.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Caprika <56862773+alchemistklk@users.noreply.github.com>
Co-authored-by: shangxinyu1 <shangxinyu@refly.ai>
Co-authored-by: lefarcen <935902669@qq.com>
2026-05-15 14:35:06 +08:00
Quang Do
a41d4f6126
fix(web): keep chat pinned during content growth (#1716)
2026-05-15 14:12:00 +08:00
Quang Do
3d0e708720
fix(daemon): treat media generate handoff as success (#1715)
2026-05-15 14:11:40 +08:00
Prantik Medhi
cd3acda6f6
fix(web): cap project title width (#1784)
2026-05-15 13:29:03 +08:00
Prantik Medhi
01e54700a2
fix(web): make file grouping by kind work (#1551)
* fix(web): group design files by kind

* fix(web): unblock CI for #1551

- FileViewer test (line 434): add missing `projectKind="prototype"`
  to match every other instance; this was the source of the typecheck
  failure blocking workspace validation.
- DesignFilesPanel "groups files by kind" test: assert against
  `.df-section-label` elements so the section header check is not
  ambiguous with the per-row kind cell text.
- DesignFilesPanel batch-delete test: derive the expected file names
  from the rendered row testids and use `arrayContaining` so the
  assertion no longer depends on the (now kind-default) row order.

* fix(web): satisfy strict-index typecheck in batch-delete test

`onDeleteFiles.mock.calls[0][0]` tripped `noUncheckedIndexedAccess`
("Object is possibly 'undefined'"). Drop the separate length probe and
assert the exact array instead — `selected` is a `Set`, `handleBatchDelete`
spreads it with `[...selected]`, and the test clicks rows[0]/rows[1] in
that order, so insertion order is deterministic and equals
`[firstName, secondName]`.
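
The ordering argument rests on the language guarantee that a `Set` iterates in insertion order. A minimal sketch with hypothetical file names (the real test derives the names from the rendered rows):

```typescript
// Set iteration follows insertion order, so spreading it is deterministic.
const selected = new Set<string>();
selected.add("first.html");  // hypothetical file names
selected.add("second.html");
selected.add("first.html");  // re-adding neither moves nor duplicates the entry

// Same spread the commit message attributes to handleBatchDelete.
const deleted = [...selected];
```

Because of that guarantee, asserting the exact array is as stable as asserting membership, and it removes the indexed access that the strict typecheck rejected.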

---------

Co-authored-by: lefarcen <935902669@qq.com>
2026-05-15 13:07:27 +08:00
Nagendhra Madishetti
ad275dbc02
fix(daemon): use danger-full-access codex sandbox on Windows to unblock PowerShell (#1745)
Codex CLI's workspace-write sandbox on Windows blocks every shell
invocation with 'powershell.exe ... rejected: blocked by policy', so
the agent cannot list files, navigate the workspace, or call any
shell-backed tool. Codex has no working OS-level sandbox on Windows
and falls back to a coarse policy that rejects shell unconditionally.

Switch to --sandbox danger-full-access on win32 only. macOS (Seatbelt)
and Linux (Landlock+seccomp) keep workspace-write because their sandbox
enforcement permits shell while restricting writes.

Tests anchor the workspace-write expectations explicitly to darwin and
linux via withPlatform(), and a new win32 case asserts the
danger-full-access flag and that the workspace-write-scoped network
config override is dropped.
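
The platform switch described above could look roughly like the sketch below. Only the flag name and the two sandbox values come from this commit message; `codexSandboxPlan`, `NodePlatform`, and the `keepNetworkOverride` field are hypothetical names, and the daemon's real argument builder may be shaped differently.

```typescript
type NodePlatform = "win32" | "darwin" | "linux";

interface SandboxPlan {
  args: string[];
  // Whether the workspace-write-scoped network override still applies.
  keepNetworkOverride: boolean;
}

// Hypothetical helper: picks the Codex sandbox mode per platform.
function codexSandboxPlan(platform: NodePlatform): SandboxPlan {
  if (platform === "win32") {
    // No working OS-level sandbox on Windows, so shell would be blocked.
    return { args: ["--sandbox", "danger-full-access"], keepNetworkOverride: false };
  }
  // macOS and Linux keep the restricted mode.
  return { args: ["--sandbox", "workspace-write"], keepNetworkOverride: true };
}
```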

Fixes #1721.

Co-authored-by: Nagendhra <nagendhra405@gmail.com>
2026-05-15 12:22:48 +08:00
Yuhao Chen
9c3a8ae3e4
fix(web): restore dark pagination select chevron (#1736)
2026-05-15 12:15:20 +08:00
Het Savani
d0ecb62a36
fix(runtime): improve DOM fallback target selection for comment picker (#1706)
2026-05-15 12:10:09 +08:00
Prantik Medhi
8aeedf368b
fix(web): localize accent controls in settings (#1565)
* fix(web): localize accent controls

* fix(web): localize accent default label

* fix(web): unblock CI for #1565

Add missing `projectKind="prototype"` to the FileViewer deck-render
test (line 434) so workspace typecheck stops failing on the
`Property 'projectKind' is missing` error. This mirrors every other
FileViewer render in the same file and is unrelated to the accent
localization changes in this PR — it's drift from a recent change
on main that made `projectKind` required.

---------

Co-authored-by: lefarcen <935902669@qq.com>
2026-05-15 12:09:28 +08:00
enaktes9-hub
24a70c7ab2
fix(web): ensure routine history 'Open project' button text is visible on hover (#1766)
The hover state used `color: var(--bg)` which could resolve to a color
that blends with the panel background, making the button label invisible.

Changed hover text color to `var(--bg-panel)` which is the panel
surface color — it guarantees contrast against the `var(--text)`
hover background in both light and dark themes.

Also added focus-visible outline and active state for better
affordance.

Fixes #1357

Co-authored-by: Hermes PR Agent <enaktes9-hub@users.noreply.github.com>
2026-05-15 11:58:20 +08:00
Yuhao Chen
b0963fd874
fix(web): allow downloads from preview iframes (#1732)
2026-05-15 11:55:29 +08:00
lefarcen
75498838a9
chore: align issue templates to preview/v0.8.0 naming (#1723)
Following the rename of the feature branch from preview/0.8.0 to preview/v0.8.0
(to match the release/v0.7.0 convention), update all issue-template
references so the label, filename, and deep-link URL stay consistent.

Changes:
- git mv preview-0.8.0-feedback.yml → preview-v0.8.0-feedback.yml
- update labels reference, title prefix, display name, body copy
- update version placeholder example to 0.8.0-preview.2 (current build)
- update cross-references in bug-report.yml and feature-request.yml
- update config.yml first contact_link URL + about text
2026-05-14 23:21:37 +08:00
lakatos
7c9e620291
fix(web): stop conversation route sync from remounting ChatPane in a loop (#1710)
PR #1508 added a routeConversationId -> activeConversationId sync effect
next to the existing activeConversationId -> URL sync, with no
arbitration between them. Creating or switching a conversation moves
activeConversationId ahead of the URL; the route-sync effect then sees
the stale routeConversationId and pulls activeConversationId back, while
the URL sync pushes it forward again. ChatPane is keyed on
activeConversationId, so the ping-pong remounts ChatPane and its
composer on every flip and the composer never settles.

Track the conversation id this view last pushed to the URL and have the
route-sync effect ignore a routeConversationId that merely echoes it.
Only a genuinely external navigation (deep-link, routine history row)
differs from the last synced id, so PR #1508's deep-link behaviour is
preserved while the self-inflicted remount loop is gone.
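
The arbitration can be modelled outside React as a tiny class. This is an illustrative sketch only: `ConversationRouteSync` and its method names are hypothetical, and the actual fix lives in the view's effects.

```typescript
// Hypothetical model of the echo guard between the two sync directions.
class ConversationRouteSync {
  private lastPushedId: string | null = null;

  // Called by the "active id -> URL" effect after it updates the URL.
  recordPush(activeId: string): void {
    this.lastPushedId = activeId;
  }

  // Called by the "URL -> active id" effect. Returns the id to adopt,
  // or null when the route merely echoes what this view pushed last.
  resolveRoute(routeId: string | null): string | null {
    if (routeId === null || routeId === this.lastPushedId) return null;
    return routeId;
  }
}

const sync = new ConversationRouteSync();
sync.recordPush("conv-a");                    // URL shows conv-a
const stale = sync.resolveRoute("conv-a");    // user switched to conv-b, URL still stale
sync.recordPush("conv-b");                    // URL sync catches up
const echo = sync.resolveRoute("conv-b");     // route now echoes our own push
const deepLink = sync.resolveRoute("conv-c"); // genuinely external navigation
```

Only the last call yields an id to adopt, so the stale and echoed routes can no longer pull the active conversation backwards.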
2026-05-14 21:06:26 +08:00
Nicholas-Xiong
118937d09b
fix: Change comment button label from 'Send to Claude' to 'Send to chat' (#1673)
Fixes #1390

Update the comment popover button label to accurately describe the action
and match product terminology.

**Before:**
- Button labeled 'Send to Claude'
- Suggests model-specific or brand-specific destination
- Inconsistent with visible chat-based workflow

**After:**
- Button labeled 'Send to chat'
- Clearly describes the actual destination
- Matches user mental model and product terminology
- Consistent with visible UI flow

**Changes:**
- Updated both comment popover instances (batch send and side panel send)
- Preserves 'Sending…' loading state text
2026-05-14 21:05:45 +08:00
Nagendhra Madishetti
98bc6d63e6
feat: Critique Theater wireup (activate the stack, M0 dark-launch by default) (#1338)
* feat(web): pure reducer for Critique Theater states (Phase 7.1)

Pure CritiqueState reducer driven by the contracts-level PanelEvent
(the same shape both the live SSE stream and the recorded transcript
emit), so a single reducer powers both the in-flight panel and the
rerun replay. Lifecycle covers run_started → running → (shipped /
degraded / interrupted / failed), with panelist_open / dim /
must_fix / close / round_end events building per-round
CritiquePanelistView entries as they arrive.

Defensive behaviour that surfaced while writing the spec tests:
- Terminal phases (shipped / degraded / interrupted / failed) are
  sticky against further lifecycle events for the same run, except
  for parser_warning which can land late and is recorded in a side
  channel without changing phase.
- A new run_started for a different runId at any time discards the
  prior state and reboots, so the UI can launch consecutive runs
  without an explicit reset action.
- Events whose runId does not match the active run return the same
  state reference, so React's useReducer doesn't re-render
  subscribers on stray traffic.
- Round bookkeeping keys by round number rather than "always last",
  so an out-of-order panelist_dim for round 1 arriving after a
  round 2 dim does not corrupt the round 2 bucket.
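
Three of these guards (sticky terminal phases, reboot on a new run, same-reference return for stray traffic) can be sketched with a heavily reduced reducer. The event and state shapes below are simplified stand-ins, not the real PanelEvent or CritiqueState, and round bookkeeping is left out.

```typescript
type Phase = "idle" | "running" | "shipped" | "degraded" | "interrupted" | "failed";

// Reduced event set; the real PanelEvent has more variants and fields.
type SketchEvent =
  | { type: "run_started"; runId: string }
  | { type: "ship"; runId: string }
  | { type: "failed"; runId: string }
  | { type: "parser_warning"; runId: string };

interface SketchState {
  phase: Phase;
  runId: string | null;
  warnings: number;
}

const initial: SketchState = { phase: "idle", runId: null, warnings: 0 };
const TERMINAL: Phase[] = ["shipped", "degraded", "interrupted", "failed"];

function reduce(state: SketchState, event: SketchEvent): SketchState {
  // A run_started for a different run always reboots the state.
  if (event.type === "run_started" && event.runId !== state.runId) {
    return { phase: "running", runId: event.runId, warnings: 0 };
  }
  // Stray traffic for another run: hand back the same reference.
  if (event.runId !== state.runId) return state;
  // Late parser warnings land in a side channel and never change phase.
  if (event.type === "parser_warning") {
    return { ...state, warnings: state.warnings + 1 };
  }
  // Terminal phases are sticky for the same run.
  if (TERMINAL.indexOf(state.phase) !== -1) return state;
  if (event.type === "ship") return { ...state, phase: "shipped" };
  if (event.type === "failed") return { ...state, phase: "failed" };
  return state;
}

const running = reduce(initial, { type: "run_started", runId: "run-1" });
const shipped = reduce(running, { type: "ship", runId: "run-1" });
const afterLateFail = reduce(shipped, { type: "failed", runId: "run-1" });
const afterStray = reduce(shipped, { type: "ship", runId: "run-2" });
const rebooted = reduce(shipped, { type: "run_started", runId: "run-2" });
```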

Test coverage: 18 cases covering each transition, the runId guard,
sticky-terminal behaviour, the out-of-order round invariant, and
the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire
SSE + replay into the same reducer.

* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2)

createCritiqueEventsConnection is a pure connection manager that
mirrors apps/web/src/providers/project-events.ts: opens an
EventSource at /api/projects/:id/events, listens for every name in
CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent
(stripping the critique. prefix and merging the data payload), and
hands it to the caller's onEvent. Reconnect uses exponential
backoff (1s → 30s) and resets on `ready`; malformed payloads drop
with a dev-mode warning rather than tearing the stream.
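
A reconnect schedule with these bounds might be sketched as follows. The 1s floor and 30s cap come from the message above; doubling per attempt is an assumption (the message only says "exponential"), and `ReconnectSchedule` is a hypothetical name.

```typescript
const BASE_MS = 1000;  // first retry delay
const MAX_MS = 30000;  // cap

// Doubling per attempt is assumed; the real growth factor may differ.
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_MS * Math.pow(2, attempt), MAX_MS);
}

class ReconnectSchedule {
  private attempt = 0;

  // Delay to wait before the next reconnect attempt.
  next(): number {
    const delay = backoffDelayMs(this.attempt);
    this.attempt += 1;
    return delay;
  }

  // Called when the server's `ready` event arrives.
  reset(): void {
    this.attempt = 0;
  }
}
```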

useCritiqueStream wraps the manager in a useReducer that owns the
CritiqueState. enabled=false or a null projectId tears down the
connection cleanly; switching projectId closes the old connection
and opens a fresh one. The returned dispatch lets local UI
synthesise actions (e.g. an Esc keypress firing a synthetic
interrupted while a kill request is in flight); production traffic
comes from the SSE stream.

Test coverage:
- sse.test.ts (10 cases, node env): subscription set covers every
  CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire
  shape back to PanelEvent; malformed JSON is swallowed and does
  not stop the stream; exponential backoff schedule and ready-reset
  semantics are pinned with a setTimeout seam; close() cancels
  pending reconnects and shuts the live source; no-op fallback
  when EventSource is unavailable.
- useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event,
  reducer driven by synthetic actions, no connection when disabled
  or projectId is null, clean close on unmount, projectId change
  reopens cleanly.

* feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3)

Fetches the per-run NDJSON transcript (one PanelEvent per line),
parses every line via the shared isPanelEvent predicate, and
dispatches into the same CritiqueState reducer the live SSE stream
uses. A single reducer means the UI rendering a replay can be
identical to the live panel, and a UI mounting both
useCritiqueStream and useCritiqueReplay in parallel does not have
to reconcile two state shapes.
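
Line-by-line transcript parsing of this kind can be sketched as below. `MinimalEvent` and `isMinimalEvent` are reduced stand-ins for PanelEvent and the shared isPanelEvent predicate, whose real checks are not shown in this log.

```typescript
// Reduced event shape for illustration only.
interface MinimalEvent {
  type: string;
  runId: string;
}

function isMinimalEvent(value: unknown): value is MinimalEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const runId = v["runId"];
  return typeof v["type"] === "string" && typeof runId === "string" && runId.length > 0;
}

// One JSON document per line; lines that fail to parse or validate are skipped.
function parseTranscript(ndjson: string): MinimalEvent[] {
  const events: MinimalEvent[] = [];
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (isMinimalEvent(parsed)) events.push(parsed);
    } catch {
      // malformed line: drop it and keep reading
    }
  }
  return events;
}

const parsedEvents = parseTranscript(
  '{"type":"run_started","runId":"run-1"}\n{not json}\n{"type":"ship","runId":"run-1"}\n',
);
```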

The `speed` knob is `paused | instant | live | { intervalMs: N }`.
- instant flushes every event synchronously, useful for opening a
  finished run already at its terminal state.
- intervalMs paces dispatches at a fixed cadence so the reviewer
  can watch the run unfold.
- paused parses the transcript but holds events back until the
  caller advances speed (consumers can drive a scrubber later).
- live is reserved for the future "playback at original cadence"
  feature, currently treated as instant; replay timestamps are not
  yet persisted with each event so honest pacing requires a
  follow-up Phase 7+ task.
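
The knob maps naturally onto a union type. The sketch below follows the behaviour described in this commit message (`live` treated as `instant`); `ReplaySpeed`, `PacingPlan`, and `planFor` are hypothetical names, not the hook's actual exports.

```typescript
type ReplaySpeed = "paused" | "instant" | "live" | { intervalMs: number };

type PacingPlan =
  | { kind: "hold" }                    // keep events back
  | { kind: "flush" }                   // dispatch everything synchronously
  | { kind: "tick"; everyMs: number };  // dispatch one event per timer tick

// Hypothetical helper that turns the knob into a pacing decision.
function planFor(speed: ReplaySpeed): PacingPlan {
  if (speed === "paused") return { kind: "hold" };
  // `live` is a placeholder until transcripts carry per-event timestamps.
  if (speed === "instant" || speed === "live") return { kind: "flush" };
  return { kind: "tick", everyMs: speed.intervalMs };
}
```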

gunzip seam handles `.ndjson.gz` transcripts via
DecompressionStream when present; the production fetch path picks
between text and arrayBuffer based on the URL extension. Both seams
are injectable so the unit tests don't need to spin up a real
network or a real gzip pipeline.

Test coverage (8 cases, jsdom env):
- Idle status before any URL is provided.
- speed=instant flushes the full transcript synchronously to
  shipped state.
- speed={intervalMs:N} paces with the setTimeout seam, reaching
  done after the last tick.
- speed=paused leaves status=playing with no dispatches.
- Empty transcript reports done with state still idle.
- Fetch rejection surfaces an error status with the message.
- Malformed NDJSON lines are skipped; valid events around them
  still land.
- .gz transcripts route through the gunzip seam.

Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream +
replay), all on one branch ready for review. Phases 8+ (Theater
components) consume these from this PR.

* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review)

Two P1 fixes from lefarcen's review on PR #1307:

SSE payload override

`sseToPanelEvent` previously spread `data` after the channel-derived
`type`, so a payload-provided `type` could override the channel and
route a `critique.run_started` frame into the reducer as a `ship`
action. Reversed the spread so the channel-derived `type` is
authoritative, and revalidated the resulting object through the
contracts-level `isPanelEvent` predicate before returning. Frames
that fail validation (missing runId, empty runId, unknown type) are
dropped, so a malformed or compromised SSE frame can no longer
dispatch a wrong-shape action into the reducer.
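
The bug and the fix both come down to object-spread ordering: later keys win. A minimal sketch follows; the function names are hypothetical, and the runId check is a stand-in for the real `isPanelEvent` revalidation.

```typescript
type Payload = Record<string, unknown>;

// Vulnerable ordering: keys in `data` overwrite the channel-derived type.
function toEventUnsafe(channelType: string, data: Payload): Payload {
  return { type: channelType, ...data };
}

// Fixed ordering: the channel-derived type is written last and wins.
function toEventSafe(channelType: string, data: Payload): Payload | null {
  const candidate: Payload = { ...data, type: channelType };
  // Stand-in for revalidation: require a non-empty runId.
  const runId = candidate["runId"];
  if (typeof runId !== "string" || runId.length === 0) return null;
  return candidate;
}

const hostile: Payload = { type: "ship", runId: "run-1" };
const unsafe = toEventUnsafe("run_started", hostile);
const safe = toEventSafe("run_started", hostile);
const dropped = toEventSafe("run_started", { runId: "" });
```

Here `unsafe` carries the payload's `ship` type, `safe` carries the channel's `run_started`, and the frame with an empty runId is dropped.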

Three new sse.test.ts cases pin the regression: hostile `type:'ship'`
in the payload still resolves to `run_started`, missing runId is
dropped, empty runId is dropped.

Replay pause/resume

`useCritiqueReplay` had one big effect keyed on `transcriptUrl`
only, so flipping `speed` from `paused` to `instant` never re-fired
and the held events sat undispatched. Split into a parse effect
(depends on URL, fetches and stores events in state) and a pace
effect (depends on parsed-events + speed, owns the cursor + timers).
The playback cursor lives in a ref that survives pause/resume
cycles, so flipping `paused` -> `instant` flushes from the current
position rather than restarting (which would double-dispatch
`run_started` and reset the reducer).
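
Why the cursor has to outlive a speed change can be shown with a plain class standing in for the ref. `ReplayCursor` is a hypothetical model, not the hook's code; in the hook the position lives in a React ref and timers drive `step`.

```typescript
// Plain-TypeScript model of the pace effect's cursor.
class ReplayCursor<T> {
  private position = 0;
  private readonly events: T[];
  private readonly dispatch: (event: T) => void;

  constructor(events: T[], dispatch: (event: T) => void) {
    this.events = events;
    this.dispatch = dispatch;
  }

  // Dispatch one event, as a paced tick would. Returns false when exhausted.
  step(): boolean {
    if (this.position >= this.events.length) return false;
    const next = this.events[this.position] as T;
    this.position += 1;
    this.dispatch(next);
    return true;
  }

  // Dispatch everything from the current position, as `instant` would.
  flush(): void {
    while (this.step()) {
      // keep stepping until the transcript is exhausted
    }
  }
}

const dispatched: string[] = [];
const cursor = new ReplayCursor(["run_started", "panelist_open", "ship"], (e) => dispatched.push(e));
cursor.step();  // paced playback delivers the first event
cursor.flush(); // speed flips to instant: resume from position 1, not 0
```

Because the position is kept across the speed change, `run_started` is delivered once and the remaining events drain exactly once.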

Two new useCritiqueReplay.test.tsx cases:
- paused-then-instant transitions from `playing` to `done` and
  reaches the shipped terminal phase
- intervalMs paced playback dispatches one event, pauses to drain
  the next scheduled timer, flips to instant, and confirms the
  remaining transcript drains exactly once (cursor was preserved)

Doc consistency

The earlier source comment in useCritiqueReplay.ts claimed `live`
"paces by recorded timestamps" while the impl used zero-delay
timers and the PR body said it behaves like `instant`. Aligned to
reality: `live` currently behaves like `{ intervalMs: 0 }` (events
drain on successive zero-delay timer callbacks via setTimeoutFn)
because transcripts
do not yet carry per-event timestamps. Honest timestamp-driven
pacing is queued as a Phase 7+ follow-up.

Validated: pnpm guard, pnpm --filter @open-design/web typecheck,
Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite
96 files / 888 tests.

* feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread)

* feat(web): Theater PanelistLane component (Phase 8.1)

* feat(web): Theater ScoreTicker component (Phase 8.2)

* feat(web): Theater RoundDivider component (Phase 8.3)

* feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4)

* feat(web): Theater TheaterDegraded chip (Phase 8.5)

* feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6)

* feat(web): Theater TheaterTranscript replay surface (Phase 8.7)

* feat(web): Theater TheaterStage top-level container (Phase 8.8)

* feat(web): Theater CSS using existing semantic tokens (no hex literals)

* feat(web): Theater public exports barrel

* fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314)

Addresses all 5 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen.

State-lifecycle fixes (3 x P2)
1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`).
   Host hooks dispatch it when their gating prop changes so a stale
   run from a prior project / transcript cannot bleed into the next
   context. Reset is idempotent on idle (returns the same reference).
2. `useCritiqueStream` dispatches `__reset__` at the top of its
   connection effect, so a workspace switch from project A (which
   streamed a critique) to project B clears the reducer before the
   new EventSource opens. enabled=false also clears.
3. `useCritiqueReplay` dispatches `__reset__` at the top of its
   parse effect, so transcriptUrl swaps (including swap-to-null after
   a replay reached `shipped`) lift the reducer back to idle before
   the new fetch starts.

SSE validation (1 x P2)
4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape`
   check after the cheap `isPanelEvent` predicate. A
   `critique.ship` frame missing `composite` / `round` / `status` /
   `artifactRef` is rejected before reaching the reducer, so
   TheaterCollapsed can no longer crash on `undefined.toFixed(1)`.
   Every variant's required fields are validated: run_started
   (protocolVersion, non-empty cast, maxRounds, threshold, scale),
   panelist_* (round, role, plus variant-specific shape), round_end
   (round, composite, mustFix, decision in {continue,ship}, reason),
   ship (round, composite, status, artifactRef.{projectId,artifactId},
   summary), degraded (reason, adapter), interrupted (bestRound,
   composite), failed (cause), parser_warning (kind, position).
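
   A per-variant check for the `ship` frame might look like the sketch below. The field names come from the list above; the concrete types (numbers for `round` and `composite`, strings for the rest) are assumptions, and `hasValidShipShape` is a hypothetical name rather than the actual `hasValidVariantShape` implementation.

```typescript
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Assumed types: round/composite numeric, status/summary/ids strings.
function hasValidShipShape(frame: unknown): boolean {
  if (!isRecord(frame)) return false;
  const ref = frame["artifactRef"];
  return (
    typeof frame["round"] === "number" &&
    typeof frame["composite"] === "number" &&
    typeof frame["status"] === "string" &&
    typeof frame["summary"] === "string" &&
    isRecord(ref) &&
    typeof ref["projectId"] === "string" &&
    typeof ref["artifactId"] === "string"
  );
}

const goodShip = {
  round: 2,
  composite: 8.4,
  status: "shipped",
  summary: "ok",
  artifactRef: { projectId: "proj-1", artifactId: "art-1" },
};
const missingComposite = { round: 2, status: "shipped", summary: "ok", artifactRef: goodShip.artifactRef };
```

   With a guard like this in front of the reducer, a frame without `composite` never reaches code that calls `.toFixed(1)` on it.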

Reducer correctness (1 x P2)
5. `panelist_open` now materializes the round + an empty panelist
   view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight
   the in-progress lane the instant the tag opens. Before this, a
   stream that emitted only `panelist_open` after `run_started` left
   `rounds = []` and the UI rendered no current round until a later
   `panelist_dim` arrived.

Polish (3 x P3)
6. Brand role tint swaps from `var(--magenta, var(--accent))` to
   `var(--purple, var(--accent))`. `--purple` is actually defined
   across the design systems; `--magenta` is not, so Brand was
   silently falling through to `--accent` and looking identical to
   Designer.
7. New i18n key `critiqueTheater.interruptedSummary` for the
   interrupted-collapse copy ("Interrupted at round N, best
   composite X.X"). Previously the interrupted branch reused
   `shippedSummary` and the UI read "Shipped at round..." for a run
   that specifically did not ship. Native value in en + zh-CN; other
   locales fall back via `...en` spread.
8. `TheaterDegraded` heading id comes from `useId()` instead of a
   hardcoded `theater-degraded-heading`, so two chips rendered on
   the same page (chat history with multiple completed runs) keep
   their aria-labelledby references unambiguous.

Tests (15 new cases)
- reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data.
- sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship.
- useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false.
- useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped.
- TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...".
- TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new)
- tests/i18n/locales.test.ts 5 of 5 across 18 locales

* feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1)

* feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2)

* fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315)

Addresses every blocker from codex, Siri-Ray, and lefarcen. The
three state-lifecycle and SSE-validation issues they also flagged
inherit fixes from PR #1314's review pass that this branch now sits
on top of after rebase.

Real daemon kill on Interrupt (P1)
- CritiqueTheaterMount now POSTs to
  /api/projects/:id/critique/:runId/interrupt alongside the
  optimistic local dispatch. Before this fix, clicking Interrupt
  only flipped the React state to interrupted while the daemon job
  kept running. The fetch is best-effort: a 404 (endpoint not wired
  yet, lands in Phase 15) is swallowed with a dev-mode console.warn
  so the UI still moves to the collapsed badge.
- New fetchInterrupt test seam lets RTL assert on the URL / method
  and simulate the "daemon not ready yet" path. Two tests pin both:
  the happy URL proj-42/critique/run-abc/interrupt POSTs, and a
  rejected fetch still flips the UI.
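
The best-effort request with an injectable seam could be sketched like this. The URL pattern comes from the bullets above; `interruptUrl`, `requestInterrupt`, and the `PostFn` seam are hypothetical names, and URL-encoding the ids is an assumption about the real code.

```typescript
// Builds the interrupt endpoint path; encoding the ids is an assumption.
function interruptUrl(projectId: string, runId: string): string {
  return `/api/projects/${encodeURIComponent(projectId)}/critique/${encodeURIComponent(runId)}/interrupt`;
}

// Injectable seam so tests can assert on the URL without a network.
type PostFn = (url: string, init: { method: "POST" }) => Promise<{ ok: boolean }>;

// Best-effort: failures are reported through `warn` and never rethrown,
// so the optimistic UI transition is not blocked.
async function requestInterrupt(
  projectId: string,
  runId: string,
  post: PostFn,
  warn: (message: string) => void,
): Promise<void> {
  try {
    const response = await post(interruptUrl(projectId, runId), { method: "POST" });
    if (!response.ok) warn("interrupt endpoint returned a non-OK status");
  } catch {
    warn("interrupt request failed");
  }
}
```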

interruptPending reset on new run (P2)
- A ref-backed effect compares the current runId against the last
  one we saw; when it changes, interruptPending is cleared. A user
  who interrupts run-1 and then triggers run-2 from the same mount
  now gets a fresh, enabled kill button instead of one stuck in
  "Interrupting…". Pinned by a new mount test.

Escape keybind scope (P2)
- InterruptButton now checks the keydown target. Escape inside an
  input, textarea, select, or contenteditable element is ignored
  (and any ancestor of those via closest() is treated the same
  way). Body-level focus still fires the keybind so the Theater
  area's affordance keeps working. Four new tests cover textarea,
  input, contenteditable, and the body-focus positive case.
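
The scope check reduces to one `closest()` call on the event target. In this sketch `ClosestCapable` is a structural stand-in for the part of `Element` that is needed, and the selector (in particular the bare `[contenteditable]` attribute match) is a simplifying assumption rather than the component's actual selector.

```typescript
// Structural stand-in for the one Element method this check needs.
interface ClosestCapable {
  closest(selector: string): unknown;
}

// Assumed selector; matches the element itself or any editable ancestor.
const EDITABLE_SELECTOR = "input, textarea, select, [contenteditable]";

// Hypothetical helper: should this keydown trigger the interrupt?
function shouldHandleEscape(key: string, target: ClosestCapable | null): boolean {
  if (key !== "Escape") return false;
  if (target === null) return true;
  return target.closest(EDITABLE_SELECTOR) === null;
}

// Fakes standing in for DOM nodes.
const bodyLike: ClosestCapable = { closest: () => null };
const textareaLike: ClosestCapable = { closest: () => ({}) };
```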

userFacingName i18n key (P2)
- The spec at specs/current/critique-theater.md:6 mandates a single
  critiqueTheater.userFacingName key so the "Design Jury" label can
  be renamed without touching code. Phase 8 introduced
  critiqueTheater.title by mistake; renamed across types.ts, en.ts,
  zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer
  TheaterStage.tsx. The locale alignment test stays green.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 14 files, 112 tests (was 101 before, +11 new for
  the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope;
  the rest were already in #1314's review fix).
- tests/i18n/locales.test.ts 5 of 5 across 18 locales.

* feat(daemon): adapter-degraded registry with TTL (Phase 10.1)

In-memory registry recording adapters that produced malformed or
oversize transcripts so the orchestrator can skip them for a TTL
window (default 24h) instead of cycling through known-bad providers
on every run.

Records carry reason (malformed_block | oversize_block |
missing_artifact), source label, and expiresAt. The test-only
clock seam lets the suite advance time deterministically and prove
that an expired entry stops counting as degraded without anyone
calling clearDegraded.

7/7 vitest cases green.
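
A rough sketch of how such a registry could look, assuming a factory
with an injectable clock. createDegradedRegistry, markDegraded, and
isDegraded are hypothetical names; only the reason tokens, expiresAt,
the 24h default, and clearDegraded come from the description above.

```typescript
type DegradedReason = 'malformed_block' | 'oversize_block' | 'missing_artifact';

interface DegradedRecord {
  reason: DegradedReason;
  source: string;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 24h default window

// `now` is the clock seam: tests pass a function they can advance by hand.
function createDegradedRegistry(
  now: () => number = () => Date.now(),
  ttlMs: number = DEFAULT_TTL_MS,
) {
  const records = new Map<string, DegradedRecord>();
  return {
    markDegraded(adapterId: string, reason: DegradedReason, source: string): void {
      records.set(adapterId, { reason, source, expiresAt: now() + ttlMs });
    },
    // Expiry is evaluated lazily on read, so an expired entry stops
    // counting as degraded without anyone calling clearDegraded.
    isDegraded(adapterId: string): boolean {
      const record = records.get(adapterId);
      return record !== undefined && now() < record.expiresAt;
    },
    clearDegraded(adapterId: string): void {
      records.delete(adapterId);
    },
  };
}
```

Checking expiry on read rather than with a timer keeps the registry
free of background work, which is one way to make the clock seam
sufficient for deterministic tests.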

* feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2)

Two test-only adapters that read the existing v1 transcript
fixtures (happy-3-rounds and malformed-unbalanced) and replay them
as either a full string or a 512-byte chunked stream. The chunked
form is what the conformance harness uses to prove the parser
holds together when the transcript arrives in arbitrary network
slices, not as one buffered blob.
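
The 512-byte chunking could be sketched as below. chunkBytes is a
hypothetical helper, not the fixture adapter's actual code, and it
assumes the transcript is already available as bytes.

```typescript
// Split a byte buffer into fixed-size slices; the last slice may be shorter.
function chunkBytes(bytes: Uint8Array, chunkSize: number = 512): Uint8Array[] {
  if (chunkSize <= 0 || Math.floor(chunkSize) !== chunkSize) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray clamps the end index, so the final chunk is never over-read.
    chunks.push(bytes.subarray(offset, offset + chunkSize));
  }
  return chunks;
}
```

Slicing at byte offsets can split a multi-byte UTF-8 sequence across
two chunks, which is part of what a chunked replay stresses in a
streaming parser.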

* feat(daemon): adapter conformance harness (Phase 10.3)

runAdapterConformance pulls a transcript through the same
parseCritiqueStream pipeline the orchestrator uses and classifies
the outcome as shipped, degraded, or failed. On a degraded
outcome it forwards the matched reason to the adapter-degraded
registry, so a single nightly conformance run is what populates
the skip list rather than the orchestrator learning each adapter
is broken at request time.

5/5 vitest cases green covering shipped, malformed degraded,
oversize degraded, no-ship failure, and the harness-thrown
failure path.

* test(e2e): Critique Theater Playwright suite (Phase 11)

Six tests, one viewport per visual case, deterministic SSE
fixtures stubbed via page.route(). Adds the suite to
test:ui:extended so the existing extended-UI lane picks it up.

Coverage:

  1. Happy path: a single mounted theater plays the full
     fixture (1 run_started, 5 panelists open / dim / must_fix /
     close, 1 round_end, 1 ship) and ends on the score badge.
  2. Interrupt mid-run: the panelist that is open at the time
     the interrupt button is clicked closes with an interrupted
     marker and the transcript freezes there.
  3. Visual regression at 375x720 mobile.
  4. Visual regression at 768x1024 tablet.
  5. Visual regression at 1280x800 desktop.
  6. A11y role tree: the theater region exposes a labelled
     landmark, each panelist lane is a group with an accessible
     name, the score is a status live region.

All SSE traffic is stubbed by page.route so the suite runs in CI
without a daemon. The toggle is seeded via localStorage by
bootAppWithCritiqueEnabled so the gate behaves as if Settings
flipped it on. typecheck clean; playwright --list reports 6.

* test(web): reducer p99 bench at 10k iterations (Phase 13.1)

Locks the documented 2ms budget for the Critique Theater reducer
on a representative SSE script (27 actions, one full happy run)
behind a regression gate. Asserts p99 stays under 4ms (2x the
documented budget) so CI runners with a noisy neighbour do not
flake while a real regression to 20ms or 200ms still trips.

The bench is a vitest case rather than a bare microbenchmark so
it runs in the same CI lane as every other web test and does not
need a parallel runner.
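
A sketch of the gate arithmetic, assuming per-iteration durations in
milliseconds have already been collected. percentile and withinGate
are hypothetical names, and nearest-rank is one of several percentile
definitions; the bench itself may use a different one.

```typescript
const BUDGET_MS = 2;            // documented reducer budget
const GATE_MS = 2 * BUDGET_MS;  // CI gate at 2x the budget

// Nearest-rank percentile: the smallest sample with at least p percent
// of all samples at or below it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile of an empty sample set');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index] as number;
}

// True when the p99 of the collected durations stays under the gate.
function withinGate(durationsMs: readonly number[]): boolean {
  return percentile(durationsMs, 99) < GATE_MS;
}
```

With 10k samples the p99 is the 9,900th smallest duration, so up to
100 slow outliers are tolerated before the gate trips.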

* test(web): critique surface coverage walker (Phase 13.2)

Walks the public critique surface (11 SSE event names, 5 panelist
roles, 6 lifecycle phases, 9 named i18n keys) and asserts each
named symbol appears in both the src corpus and the test corpus.
The walker is the gate that catches a rename in one half of the
codebase without a matching update in the other half: a future
PR that drops 'panelist_must_fix' from the reducer without also
removing its test reference fails this suite.

62 assertions, one per symbol per corpus.

* docs: Critique Theater user guide (Phase 14.1)

Seven sections aimed at end users (not contributors):

  1. What is Design Jury
  2. How it works (the five panelists, auto-converging rounds,
     the composite formula)
  3. Settings (the M1 toggle and what it does)
  4. Reading the score badge
  5. Replay surface
  6. Troubleshooting (degraded, interrupted, failed)
  7. FAQ

The composite formula is documented as
    designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2
because anyone trying to reverse-engineer the score is going to
search for those weights and the docs are the place they should
land first.
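
The formula can be written out as a small function. This is a sketch
under the weights quoted above; compositeScore, PanelistRole, and
COMPOSITE_WEIGHTS are hypothetical names, and the per-role scores are
assumed to share one scale.

```typescript
type PanelistRole = 'designer' | 'critic' | 'brand' | 'a11y' | 'copy';

// Weights as documented in the user guide; they sum to 1.
const COMPOSITE_WEIGHTS: Record<PanelistRole, number> = {
  designer: 0,
  critic: 0.4,
  brand: 0.2,
  a11y: 0.2,
  copy: 0.2,
};

// Weighted sum of the per-role scores.
function compositeScore(scores: Record<PanelistRole, number>): number {
  const roles = Object.keys(COMPOSITE_WEIGHTS) as PanelistRole[];
  return roles.reduce((sum, role) => sum + COMPOSITE_WEIGHTS[role] * scores[role], 0);
}
```

Because the designer weight is 0, the designer's score does not move
the composite under these weights.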

* docs(daemon): critique module AGENTS map (Phase 14.2)

Daemon-side wayfinder for the apps/daemon/src/critique directory.
Tables every file, what owns what invariant, and the 'when you
change anything here' guide so a future contributor does not
have to reverse-engineer the rollout resolver before adding a
new SSE event.

* docs(web): Theater module AGENTS map (Phase 14.3)

Web-side mirror of the daemon AGENTS map. Same file table, same
invariants section, same change-impact guide, sized to the
Theater component package.

* feat(daemon): rollout flag resolver (Phase 15.1)

Single decision point every caller consults to know whether the
orchestrator should wire the critique pipeline for a given run.
Priority:

  1. Skill-level policy (required forces on, opt-out forces off)
  2. Per-project override from the Settings toggle
  3. OD_CRITIQUE_ENABLED env override
  4. Rollout phase default
       M0 dark-launch      false
       M1 settings only    false (toggle is off until the user flips it)
       M2 per-skill        true if skill opted in
       M3 global default   true

OD_CRITIQUE_ROLLOUT_PHASE parser defaults to M0 on unknown input
so a fresh install never surprises a user with the feature on.

10/10 vitest cases green covering every cell of the matrix.
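
The priority order above can be sketched as a pure function. The
input shape (RolloutInputs) and the parseRolloutPhase signature are
assumptions for illustration, as is the tolerance for casing and
whitespace in the phase token; only the tier order and the phase
defaults come from the description.

```typescript
type SkillPolicy = 'required' | 'opt-in' | 'opt-out' | null;
type RolloutPhase = 'M0' | 'M1' | 'M2' | 'M3';

// Hypothetical input bag; null means "no opinion" for that tier.
interface RolloutInputs {
  skillPolicy: SkillPolicy;
  projectOverride: boolean | null;
  envOverride: boolean | null;
  phase: RolloutPhase;
}

// Unknown or missing input falls back to M0 so the feature stays off.
function parseRolloutPhase(raw: string | undefined): RolloutPhase {
  const token = (raw ?? '').trim().toUpperCase();
  return token === 'M1' || token === 'M2' || token === 'M3' ? token : 'M0';
}

function isCritiqueEnabled(inputs: RolloutInputs): boolean {
  // 1. Skill-level policy: required and opt-out are final.
  if (inputs.skillPolicy === 'required') return true;
  if (inputs.skillPolicy === 'opt-out') return false;
  // 2. Per-project override from the Settings toggle.
  if (inputs.projectOverride !== null) return inputs.projectOverride;
  // 3. Env override.
  if (inputs.envOverride !== null) return inputs.envOverride;
  // 4. Rollout phase default.
  if (inputs.phase === 'M3') return true;
  if (inputs.phase === 'M2') return inputs.skillPolicy === 'opt-in';
  return false; // M0 and M1 default to off
}
```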

* feat(web): Settings toggle hook for Critique Theater (Phase 15.2)

React hook that reads critiqueTheaterEnabled from the existing
open-design:config localStorage blob and stays in sync via:

  - the platform storage event (cross-tab)
  - an open-design:critique-theater-toggle CustomEvent (same-tab)

Same-tab event is the one that fires when the Settings panel saves
in the current window: the toggle and every mounted theater update
without a page reload.

setCritiqueTheaterEnabled(next) is the imperative setter the Settings
panel calls. It preserves the rest of the stored config (mode, apiKey,
etc.) and dispatches the same-tab event after the localStorage write.

The web hook reflects what the user toggled; the daemon-side
isCritiqueEnabled is the final routing authority (project override,
env, rollout phase). When they disagree, the daemon wins for backend
gating and the web reflects the toggle state.

6/6 vitest cases green covering first read, stored read, same-tab
event flip, config preservation, corrupted JSON tolerance, and
cross-tab storage event.

* test(web): Phase 15 toggle hook failure-mode coverage (PR #1320)

lefarcen P2 on PR #1320 flagged that the PR body claimed safe
behavior for disabled localStorage, non-object JSON, and missing
CustomEvent shim, but the suite only covered corrupt JSON plus
happy-path storage events. Added four failure-mode tests so the
swallowed errors are not silently traded for a throw in a future
refactor:

1. Returns false on a stored JSON value that parses to an array
   (non-object). Catches a regression where the guard treats
   anything truthy as a config blob.
2. Returns false on a stored JSON value of literal 'null'.
   typeof null === 'object' in JS, so the guard has to check null
   explicitly; this test pins that check.
3. Returns false when localStorage.getItem throws (private mode /
   disabled storage / SecurityError). The hook must swallow and
   return false so the rest of the app keeps rendering.
4. setCritiqueTheaterEnabled still dispatches the same-tab
   CustomEvent when localStorage.setItem throws (quota exceeded /
   disabled storage). The dispatch path is the in-session
   broadcast that keeps every mounted hook coherent even when
   persistence is unavailable; verified by mounting two probes
   and asserting both flip after the setter is called with a
   throwing setItem.

10/10 vitest cases green (6 existing + 4 new).
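
The guards these tests pin can be sketched as follows, assuming the
storage object is injected so the function runs without a browser.
StorageLike and this readToggle signature are hypothetical; the real
hook reads localStorage directly.

```typescript
// Hypothetical subset of the Web Storage interface.
interface StorageLike {
  getItem(key: string): string | null;
}

const CONFIG_KEY = 'open-design:config';

// Returns the stored toggle, or false for any unreadable or malformed state.
function readToggle(storage: StorageLike): boolean {
  try {
    const raw = storage.getItem(CONFIG_KEY);
    if (raw === null) return false;
    const parsed: unknown = JSON.parse(raw);
    // typeof null === 'object', and arrays are objects too, so both
    // need an explicit check before treating the value as a config blob.
    if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
      return false;
    }
    return (parsed as Record<string, unknown>)['critiqueTheaterEnabled'] === true;
  } catch {
    // Covers both a throwing getItem (disabled storage) and corrupt JSON.
    return false;
  }
}
```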

* fix(web): honor CustomEvent payload in toggle hook listener (PR #1320)

Both Siri-Ray (blocking) and lefarcen (P2 new) caught the same
real bug in the failure-mode test I added in affcdd27: the test
asserts the in-session UI flips when localStorage.setItem throws,
but the CustomEvent listener was ignoring the event's typed
detail and just calling readToggle(). Under a throwing setItem
the localStorage value is stale (or absent), so the listener
would see the OLD value and the test would fail (or worse, the
production claim 'in-session event keeps mounts coherent' was
hollow).

Fixed the hook, not the test: the listener now reads
event.detail.enabled when it is a boolean, falling back to
readToggle() only for malformed events or for cross-tab storage
events (which do not carry a typed payload). The setter already
dispatched the detail; the listener just was not consuming it.

Test changes:

  - The existing 'setItem throws' test now asserts the right
    behavior for the right reason. Updated the inline comment to
    say the listener reads from detail, not localStorage.
  - New test 'falls back to readToggle when the CustomEvent
    carries no usable detail' pins the fallback path: a
    malformed dispatcher (no detail, or detail.enabled not a
    boolean) degrades cleanly instead of throwing or being
    silently ignored.

11 / 11 vitest cases green (10 prior + 1 new fallback).
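
A sketch of the listener's decision, assuming the fallback reader is
passed in. resolveEnabledFromEvent is a hypothetical name and the
event is typed loosely so the sketch does not depend on DOM types.

```typescript
// Prefer the typed payload; fall back to storage only when the payload is unusable.
function resolveEnabledFromEvent(
  event: { detail?: unknown },
  readToggle: () => boolean,
): boolean {
  const detail = event.detail;
  if (typeof detail === 'object' && detail !== null) {
    const enabled = (detail as { enabled?: unknown }).enabled;
    if (typeof enabled === 'boolean') return enabled;
  }
  // Malformed or payload-free events (including cross-tab storage events)
  // degrade to whatever storage currently holds.
  return readToggle();
}
```

Reading the payload first is what keeps mounted hooks coherent when
setItem throws: storage is stale in that case, but the event still
carries the new value.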

* feat(daemon): route critique spawn-path eligibility through the rollout resolver

The wireup edit Phase 10 and Phase 15 carved out: today server.ts gates
the critique pipeline on critiqueCfg.enabled, which is just the
OD_CRITIQUE_ENABLED env var. After this commit it gates on
isCritiqueEnabled(...) from the Phase 15 resolver, so the full
priority matrix is live:

  1. Per-skill od.critique.policy veto (opt-out / required)
  2. Per-project override (M1 Settings toggle, written through the
     existing Phase 6 settings endpoint)
  3. OD_CRITIQUE_ENABLED env override (power-user lane / CI fixtures)
  4. OD_CRITIQUE_ROLLOUT_PHASE default
       M0 dark-launch      false
       M1 settings only    false
       M2 per-skill        only when skillPolicy === 'opt-in'
       M3 global default   true

Default behaviour on a fresh install is unchanged: the resolver
returns false at M0 without an env override or a project override,
so prod traffic falls through to the legacy single-pass path
exactly the way it did before.

Inputs threaded today: phase from OD_CRITIQUE_ROLLOUT_PHASE,
envOverride from OD_CRITIQUE_ENABLED. skillPolicy and projectOverride
are passed as null for the v1 cutover; the daemon-side handler that
round-trips critiqueTheaterEnabled on the project settings row and
the od.critique.policy frontmatter resolver land as the next two
commits in this branch.

The three call sites that used critiqueCfg.enabled (the brand-thread
guard, the skill-thread guard, the top-line critiqueShouldRun
compound) now read from a single locally-scoped critiqueEnabledForRun
boolean, so the eligibility check is computed exactly once per spawn
and the prompt composer + orchestrator stay in lockstep the way
the existing comment already promised.

Tests still green: daemon vitest 22 / 22 across rollout +
conformance + adapter-degraded. Daemon typecheck clean.

* feat(web): mount CritiqueTheaterMount in ProjectView

The web counterpart of the daemon wireup. ProjectView now renders
<CritiqueTheaterMount projectId={project.id} enabled={...} /> as a
sibling of <AppChromeHeader> inside the top-level <div className="app">.

The mount is the drop-in from the Phase 9 stack: it owns the SSE
subscription, the kill-request handshake, and the phase-aware swap
from the live <TheaterStage> to the collapsed badge once a run
settles. The mount returns null until the daemon emits a
critique.run_started for the active project, so the visual surface
is byte-for-byte unchanged for users who have not opted in.

Enabled wiring: useCritiqueTheaterEnabled() reads the M1 Settings
toggle from the existing open-design:config localStorage blob and
stays in sync with both the platform storage event (cross-tab) and
the same-tab open-design:critique-theater-toggle CustomEvent the
Phase 15 setter dispatches. The hook honors the event payload
directly so a private-mode browser that cannot persist the toggle
still updates the in-session UI correctly.

The daemon-side gate (isCritiqueEnabled in apps/daemon/src/server.ts)
remains the authority for whether a run is actually wired through
the critique pipeline. This hook only governs whether the web layer
renders the resulting SSE stream when the daemon emits one. The
two-layer gate is intentional: an integrator embedding the Theater
in a custom UI can flip the web visibility independent of the
daemon's routing decision, and a daemon-side env override flips
backend gating without touching the web's localStorage.

Tests still green: web Theater suite 181 / 181 across 16 files.
Web typecheck clean.

* feat(daemon): resolve od.critique.policy frontmatter at the spawn site

The next step in the wireup branch's ladder: replace the placeholder
`skillPolicy: null` with the actual value parsed from the active
skill's SKILL.md frontmatter.

Three small edits, one new field on a public type:

1. SkillInfo gains a `critiquePolicy: SkillCritiquePolicy` field
   carrying the parsed `od.critique.policy` token (required /
   opt-in / opt-out / null). The field is null when the skill has
   no opinion, which lets the lower-priority resolver tiers
   (projectOverride, envOverride, phase default) decide.

2. listSkills() populates the new field via a small
   `normalizeCritiquePolicy` helper that tolerates the YAML
   scalar's casing and trims whitespace. Unknown tokens collapse
   to null so a typo in SKILL.md cannot accidentally force the
   panel on or off; it just falls through. Derived example cards
   inherit the parent's policy.

3. server.ts captures `skill.critiquePolicy` into a hoisted
   `skillCritiquePolicy` variable inside the existing skill-load
   block, then threads it into the isCritiqueEnabled call as the
   skillPolicy input. The hoisting keeps the variable in scope at
   the resolver call site without restructuring the spawn handler.
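
The normalization in step 2 might look like the sketch below. The
function body is an assumption built from the description (tolerate
casing, trim whitespace, collapse unknown tokens to null), not the
code in listSkills().

```typescript
type SkillCritiquePolicy = 'required' | 'opt-in' | 'opt-out' | null;

// Accepts whatever the YAML parser produced and returns a known token or null.
function normalizeCritiquePolicy(raw: unknown): SkillCritiquePolicy {
  if (typeof raw !== 'string') return null;
  const token = raw.trim().toLowerCase();
  // Unknown tokens fall through to null so a typo cannot force the panel on or off.
  return token === 'required' || token === 'opt-in' || token === 'opt-out'
    ? token
    : null;
}
```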

After this commit, the priority matrix the rollout resolver was
designed for is live for its top tier. The previous commit wired
env + phase; this one wires skill. The projectOverride input
remains null pending the next commit that extends the Phase 6
settings endpoint.

Daemon vitest: 10 / 10 rollout cases pass against the new wiring.
Daemon typecheck: clean.

* feat(daemon): feed projectOverride into the rollout resolver from project metadata

Replaces the placeholder `projectOverride: null` in the spawn
handler with the actual value the Settings panel writes onto the
project's metadata blob: `critiqueTheaterEnabled?: boolean`.

The read is defensive at the boundary: the metadata object is
typed loosely (it round-trips through SQLite as a free-form JSON
blob), so the spawn handler narrows to `boolean` and falls
through to `null` for any other shape. A missing key, a malformed
value, or a project that has never visited Settings collapses to
`null`, which is exactly the resolver's "no opinion, fall
through to env / phase" signal.
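
The boundary narrowing could be sketched like this, assuming the
metadata arrives as an untyped value. readProjectOverride is a
hypothetical helper name; the spawn handler may inline the check.

```typescript
// Narrow a free-form metadata blob to the resolver's boolean | null input.
function readProjectOverride(metadata: unknown): boolean | null {
  if (typeof metadata !== 'object' || metadata === null) return null;
  const value = (metadata as Record<string, unknown>)['critiqueTheaterEnabled'];
  // Anything other than a real boolean means "no opinion".
  return typeof value === 'boolean' ? value : null;
}
```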

The `critique` frontmatter slot also gets typed on the
SkillFrontmatter shape so the `od.critique.policy` chain the
previous commit introduced no longer needs a bracket-access
cast. Same pattern as the existing `craft`, `preview`, and
`design_system` nested-record slots.

After this commit, every tier of the rollout resolver's priority
matrix is wired:

  1. skillPolicy   (from SKILL.md od.critique.policy)
  2. projectOverride (from project metadata critiqueTheaterEnabled)
  3. envOverride   (from OD_CRITIQUE_ENABLED)
  4. rollout phase (from OD_CRITIQUE_ROLLOUT_PHASE)

The write path for projectOverride still flows through the
existing project-update handler the Settings panel already uses
to persist project metadata; no new endpoint is needed. The
Settings UI button that calls setCritiqueTheaterEnabled and
posts the new field is the next commit on this branch.

Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases
still green against the new wiring.

* fix(daemon): forward critique events to project sinks + align composer gate (PR #1338)

Two codex review items addressed in one commit since they share the
same root cause (resolver-enabled run hits a transport / prompt
contract that was still env-gated):

P1 (transport mismatch). The daemon emits critique.* SSE frames
through critiqueBus -> design.runs.emit, which fans out on
/api/runs/:runId/events. The web CritiqueTheaterMount subscribes to
/api/projects/:projectId/events (it's project-scoped, not run-
scoped, because the mount lives at the project workspace and
follows the user across runs). Result: in production the mount
never sees a real frame and the e2e tests' stubbed routes hide the
mismatch.

Fixed by extending critiqueBus.emit to fan out to BOTH sinks: the
existing runs.emit transport, AND the per-project event-sinks map.
The project-events route emits via sse.send(payload.type, payload),
so we pack the SSE channel name onto payload.type and let the sink
push the right channel. The web sseToPanelEvent overwrites type
from the channel name on the way back into a PanelEvent, so the
round-trip stays correct.

P2 (prompt gate misalignment). composeSystemPrompt reads
cfg.enabled to decide whether to append the panel addendum, but
critiqueCfg.enabled is loaded from OD_CRITIQUE_ENABLED only. A run
the resolver enabled via phase / project / skill (env unset) would
have critiqueShouldRun = true while critiqueCfg.enabled remained
false. The panel prompt was dropped while the run still routed
through runOrchestrator, so the parser waited for tags that never
arrived and the run degraded.

Fixed by passing a derived config { ...critiqueCfg, enabled: true }
to the composer when critiqueShouldRun is true. The composer's own
gate now agrees with the resolver decision on every input the
spec defines.

Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases
still green against the new wiring.

* fix: address PerishCode P1 + P2 follow-ups on PR #1338

Two follow-up items PerishCode flagged on the activation PR.
Non-blocking but both are real:

1. Phase 11 e2e suite was wired into test:ui:extended but lands
   the user on '/' (home route) where ProjectView (and therefore
   CritiqueTheaterMount) is never rendered. With the suite as
   written, every assertion would time out the first time the
   lane runs in CI, contradicting the PR body's claim that the
   suite stays parked behind test.describe.fixme.

   The state diverged from my earlier Phase 11 work because the
   merge from main on commit 4ab719c6 brought in #1307's
   squash-merged version of the e2e file (the pre-fixme shape).

   Re-applied test.describe.fixme to the describe block plus
   removed ui/critique-theater.test.ts from the test:ui:extended
   script in e2e/package.json. Added a file-header docblock
   explaining what the follow-up commit needs to do: replace
   goto('/') with /projects/:id navigation similar to
   app-design-files.test.ts, split the SSE fixture into a live
   prefix and terminal suffix (Codex P2 on PR #1320), and commit
   the first PNG baselines.

2. bestRoundOf in CritiqueTheaterMount returned the LAST round
   with a numeric composite, not the round with the HIGHEST
   composite, while bestCompositeOf correctly returned the max.
   A run that closed round 1 at 8.5 and round 2 at 6.0 would
   dispatch interrupted { bestRound: 2, composite: 8.5 } on a
   user-clicked interrupt.

   Folded the two helpers into a single bestRoundAndComposite
   that walks state.rounds once and returns the matching pair so
   the two values cannot drift. The onInterrupt callback now
   destructures from one helper instead of two independent reads.
   Falls back to (state.activeRound, 0) when no round has closed
   with a composite yet.

Web typecheck: clean. CritiqueTheaterMount.test.tsx: 7 / 7 cases
still green against the new helper.
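
A sketch of the single-pass helper from item 2, assuming a simplified
round shape. RoundSummary and its field names are hypothetical, and
keeping the earliest round on a tie is an assumption the commit does
not state.

```typescript
// Hypothetical, simplified view of one entry in state.rounds.
interface RoundSummary {
  round: number;
  composite?: number | null;
}

interface BestRound {
  bestRound: number;
  composite: number;
}

// Walk the rounds once and return the round number together with its
// composite, so the two values always describe the same round.
function bestRoundAndComposite(
  rounds: readonly RoundSummary[],
  activeRound: number,
): BestRound {
  let best: BestRound | null = null;
  for (const entry of rounds) {
    if (typeof entry.composite !== 'number') continue; // round not closed yet
    if (best === null || entry.composite > best.composite) {
      best = { bestRound: entry.round, composite: entry.composite };
    }
  }
  // No round has closed with a composite yet.
  return best ?? { bestRound: activeRound, composite: 0 };
}
```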

* fix: wire M1 project override end-to-end + correct deferred-surface doc claims (PR #1338)

Three lefarcen P2s on the latest review pass, all real:

1. M1 project override was half-wired: the daemon read
   metadata.critiqueTheaterEnabled but the web setter only
   wrote localStorage. A user opt-in would render the Theater
   on the web (localStorage was set) while the daemon resolved
   projectOverride=null and skipped critique unless env / phase
   already permitted. Two halves talking past each other.

   Extended setCritiqueTheaterEnabled to accept an optional
   { projectId, fetchProjectSettings } options bag. When a
   projectId is supplied, the setter ALSO sends a
   PATCH /api/projects/:id with { metadata: { critiqueTheaterEnabled
   } } so the daemon's spawn-time resolver picks the same value up
   on the next generation. The existing project-routes endpoint
   already accepts arbitrary metadata patches, so no new endpoint
   is needed. The local write + the CustomEvent dispatch still
   fire before the PATCH, so a network failure does not unwind
   the in-session UI flip. Three new vitest cases pin the new
   path: PATCHes when projectId is provided, skips when it is
   not, swallows a rejected PATCH so the in-session UI still
   flips.

2. Rollout docs (docs/critique-theater.md section 3) claimed the
   Settings toggle persists into the daemon settings store, but
   the previous implementation only had a localStorage reader /
   writer plus a daemon read of project metadata, with no
   round-trip. Rewrote the section to lead with the four-tier
   resolver (skill policy / project override / env / phase),
   document that the setter now round-trips via the existing
   PATCH endpoint when given a projectId, and call out the
   Settings panel UI control as a deliberate follow-up.

3. Troubleshooting table pointed users at /api/metrics/critique
   (Phase 12, deferred) and 'od adapters clear-degraded <id>'
   (CLI wrapper that does not exist). Replaced the metrics
   reference with the local conformance harness command
   (pnpm --filter @open-design/daemon vitest run
   tests/critique-conformance.test.ts) that ships today, with a
   note that the Phase 12 dashboard surfaces this status as a
   series once that PR lands. Replaced the CLI command with the
   programmatic clearDegraded() helper that exists today and
   flagged the CLI wrapper as planned follow-up.

Web typecheck: clean. Toggle hook tests: 14 / 14 green (11
existing + 3 new for the round-trip path).

* test(web): multi-round interrupt regression for bestRoundAndComposite (PR #1338)

lefarcen P3 follow-up to the previous bestRoundAndComposite fix:
the existing CritiqueTheaterMount.test.tsx interrupt cases only
exercised a single-round state, so a future refactor back to two
independent helpers wouldn't be caught by the test suite even
though it'd reintroduce the round / composite drift bug.

Added a regression case that:

  1. Drives the reducer through two complete rounds with the
     full 5-role cast closing at distinct composites: round 1
     at 8.5, round 2 at 6.0 (the high-composite round is NOT the
     most recent one).
  2. Clicks Interrupt + waits for the daemon ack via the test
     seam fetcher returning 204.
  3. Asserts the collapsed badge displays "round 1" (the
     correct best-composite round), and queryByText for
     "round 2 ... 8.5" returns null (the buggy pairing
     would have produced that string).

The bestRoundAndComposite helper walks state.rounds in one pass
and returns the matching pair, so the round number and the
composite cannot drift apart. This test locks the fix in: a
refactor that splits the helpers back into independent walks
will be caught here.

8 / 8 vitest cases green on the file.

* fix(web): read-merge-write the project metadata in setCritiqueTheaterEnabled (PerishCode P2 on PR #1338)

The previous round-trip sent { metadata: { critiqueTheaterEnabled: next } }
as the entire PATCH body. The daemon's project-routes handler only
re-stamps three immutable fields (baseDir, importedFrom,
fromTrustedPicker) before calling updateProject(db, id, patch),
which then does a shallow { ...existing, ...patch } in apps/daemon/
src/db.ts. So patch.metadata replaces the row's metadata wholesale,
dropping kind, templateId, linkedDirs, and every other field the rest
of the app reads.

No in-tree caller passes projectId today (only vitest cases), so the
bug had not surfaced yet. But the surface is documented in
docs/critique-theater.md section 3 and the function's own JSDoc as
the M1 round-trip path, so it would have shipped as a latent footgun
for the next integrator: a Settings UI follow-up, or any third party
that wires the setter into a project-aware surface.

Fix: read-merge-write rather than a bare patch.

- GET /api/projects/:id to read the row's current metadata.
- Spread that metadata into the PATCH body and overlay
  critiqueTheaterEnabled: next on top, mirroring the partial-metadata
  pattern already used in ChatComposer.tsx for linkedDirs.
- PATCH the merged object.

Failure handling:
- GET fails: skip the PATCH entirely. We cannot construct a safe
  merged body without the current state, and a bare patch would
  wipe other metadata. The in-session CustomEvent fired earlier in
  the setter still keeps every mounted hook consistent; the next
  save retries the round-trip.
- PATCH fails: log in dev. The in-session UI is already correct via
  the CustomEvent.
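
The read-merge-write flow and its two failure modes can be sketched
as below, assuming the HTTP calls are hidden behind an injected
client. ProjectSettingsClient, persistToggle, and the outcome strings
are hypothetical and exist only for this illustration; the setter
itself uses a fetch seam.

```typescript
type Metadata = Record<string, unknown>;

// Hypothetical wrapper over GET and PATCH /api/projects/:id.
interface ProjectSettingsClient {
  getProject(projectId: string): Promise<{ metadata?: Metadata }>;
  patchProject(projectId: string, body: { metadata: Metadata }): Promise<void>;
}

type PersistOutcome = 'patched' | 'skipped' | 'patch_failed';

async function persistToggle(
  client: ProjectSettingsClient,
  projectId: string,
  next: boolean,
): Promise<PersistOutcome> {
  let current: Metadata;
  try {
    const project = await client.getProject(projectId);
    current = project.metadata ?? {};
  } catch {
    // Without the current metadata a safe merged body cannot be built,
    // so the PATCH is skipped rather than risking a wholesale overwrite.
    return 'skipped';
  }
  try {
    // Spread the existing metadata first, then overlay the toggle.
    await client.patchProject(projectId, {
      metadata: { ...current, critiqueTheaterEnabled: next },
    });
    return 'patched';
  } catch {
    // The in-session UI was already updated before this call.
    return 'patch_failed';
  }
}
```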

Tests (TDD, red-first):

- 'GETs the project then PATCHes with merged metadata when a
  projectId is supplied': stubs a GET that returns
  { kind: 'template', templateId: 'modern-blog', linkedDirs: [...] }
  and asserts the PATCH body equals the merge plus the toggle.
- 'PATCHes with just the toggle when the project has no prior
  metadata': stubs a GET that returns no metadata block.
- 'skips the PATCH (does not stomp metadata) when the prefetch GET
  fails': stubs a rejecting GET and asserts only the GET fires.
- 'swallows a rejected PATCH after a successful prefetch': stubs a
  successful GET and a rejecting PATCH; asserts the in-session UI
  still flips via the CustomEvent.

Doc updated on the setter's JSDoc to describe the new three-step
flow (localStorage, CustomEvent, read-merge-write PATCH) and the
two failure modes.

Verified:
- pnpm --filter @open-design/web typecheck clean.
- pnpm --filter @open-design/web test: 111 files / 1055 tests green
  (was 1052, +3 from the new merge-flow cases).

* fix(web): restore wait-for-daemon-ack pattern on Theater interrupt

Same regression as flagged on PR #1316 post-main-merge: the
optimistic local dispatch fired before the POST resolved, so a
daemon 404 / 409 still terminalized the UI and the real SSE
terminal event got ignored by the sticky interrupted phase.

Snapshot runId / bestRound / composite at click time, dispatch
interrupted only on res.ok, clear interruptPending on rejection or
non-2xx so the user can retry. Tests cover rejection + 404 leaving
the run on the live stage; the 204 path waits for the ack.

* fix(test): add projectKind prop to FileViewer deck render after v0.7.0 merge

* fix(daemon): address PerishCode P3 trio on PR #1338 (emit helper reuse + spawn-input coverage + restored docs)

---------

Co-authored-by: Nagendhra <nagendhra405@gmail.com>
2026-05-14 20:37:06 +08:00
sakshyasinha
c4a67a7b3e
Fix Kimi CLI icon contrast in light mode (#1667)
* fix(web): improve Kimi CLI icon contrast

* fix(web): render Kimi icon via theme-aware CSS mask

Move Kimi to the MONO_ICONS set so it renders through CSS mask
with currentColor adaptation, making it legible in both light and
dark themes instead of baking a single dark fill that fails on
dark backgrounds.

* fix(web): adjust Kimi icon secondary mark for dual-theme contrast

Keep Kimi as a baked two-tone asset: blue accent (#1783ff) for brand
identity, mid-tone gray (#666666) secondary mark for acceptable contrast
on both light and dark card surfaces. Revert from mask path to preserve
the blue branding.

* fix(web): correct corrupted Kimi SVG and strengthen asset validation test

Remove extraneous PR discussion text that was accidentally included
in the SVG file. Strengthen the test to validate the bundled asset
is valid SVG with the expected fills (blue accent + gray secondary
mark), catching asset corruption that would otherwise go undetected.
2026-05-14 20:32:52 +08:00
sukumarp2022
9218fd649e
feat(ui): add copy to clipboard functionality for user messages with … (#1669)
* feat(ui): add copy to clipboard functionality for user messages with localization support

* fix(web): use setTimeout instead of window.setTimeout for correct Timeout type

* docs: add copy prompt button screenshot for PR #1669

* docs: add copy button hover screenshot for PR #1669

* docs: add copy button copied state screenshot for PR #1669

* fix(ui): reset button border/background on copy prompt button

The .user-copy-btn inherited border and background from the base
button CSS, rendering as a bordered gray box instead of a clean icon
overlay. This was especially visible in the Electron desktop app.

Add border: none and background: none to the button, and a subtle
hover background for feedback.
2026-05-14 20:19:20 +08:00
lefarcen
4693ddb00d
chore: add issue templates (bug, feature, preview/0.8.0) + chooser config (#1708)
* chore: add issue template for preview/0.8.0 feedback

Adds a guided issue form so community testers of the preview/0.8.0
branch (Skills tab + Automations) can submit structured feedback.

The template auto-applies the preview/0.8.0 label, which lets
maintainers filter all preview-related reports in one view:
https://github.com/nexu-io/open-design/issues?q=is%3Aopen+label%3A%22preview%2F0.8.0%22

* chore: add generic bug-report issue template

Pairs with the preview/0.8.0 template added in the previous commit.
Until now the repo had no issue templates at all, which meant New
Issue opened a blank textarea by default.

The bug-report template:
- Pre-applies the 'bug' label
- Guides users through repro steps, version, platform, logs
- Includes a callout pointing preview/0.8.0 testers to the
  dedicated feedback template so the two flows stay separate

* chore: add feature-request template + chooser config

Rounds out the issue-template basics:
- feature-request.yml — 'what problem are you trying to solve' framing,
  willing-to-contribute dropdown so maintainers can route PRs
- config.yml — disables blank-issue entry, redirects Q&A / Ideas / Show-and-tell
  / general chat to Discussions, points preview/0.8.0 reporters at the
  dedicated template

After merge, the chooser at /issues/new/choose will be:
  Template     1. 🐛 Bug report
              2. 💡 Feature request
              3. 🧪 Preview 0.8.0 feedback
  Contact      → Preview 0.8.0 feedback (dup, easy-access)
              → Ask a question (Discussions Q&A)
              → Discuss an idea (Discussions Ideas)
              → Show what you've made (Discussions Show-and-tell)
              → General discussion (Discussions General)
2026-05-14 20:13:36 +08:00
정수현
63baff5222
fix(skills): repoint coreyhaines31 upstream URLs to marketingskills (#1659)
The upstream repo github.com/coreyhaines31/skills was renamed to
github.com/coreyhaines31/marketingskills, so the four curated
marketing-creative stubs (ad-creative, copywriting, marketing-psychology,
paywall-upgrade-cro) advertised a source URL that now 404s.

Update od.upstream and the body source/open links in all four SKILL.md
stubs, plus the matching entries in the seed script so re-seeding stays
consistent.
2026-05-14 20:10:14 +08:00
Nicholas-Xiong
0c5f03054e
fix: Add success toast feedback when saving artifact as template (#1671)
Fixes #1190

Display a visible success toast after saving an artifact as a template,
providing clear confirmation that the action completed successfully.

**Before:**
- No visible feedback after clicking Save
- Success message only shown in menu button text (not visible after modal closes)
- Users uncertain whether template was saved

**After:**
- Success toast appears after saving
- Toast displays for 2.2 seconds with template name
- Clear confirmation that the save action completed
- Matches the pattern used for comment saves

**Implementation:**
- Added templateSavedToast state (similar to commentSavedToast)
- Set toast message in handleSaveAsTemplate on success
- Render toast using existing Toast component
- Auto-dismiss after 2.2 seconds (consistent with other toasts)
2026-05-14 20:09:23 +08:00
ashleyashli
1e9bcbf20d
fix(contributor-bot): serialize runs to avoid state.json races and duplicate cards (#1707)
2026-05-14 20:01:13 +08:00
Yuhao Chen
397098f231
fix(web): clean up routines form controls (#1609)
2026-05-14 19:57:44 +08:00
PerishFire
3fa12f71be
Add release preview workflow placeholder (#1705)
2026-05-14 18:55:08 +08:00
Yuhao Chen
7633d7a9b0
fix(packaged): forward proxy env to sidecars (#1678)
2026-05-14 17:59:14 +08:00
Nagendhra Madishetti
40766ef1ba
test(web): Critique Theater Phase 13 (reducer p99 bench + surface coverage walker) (#1318)
* feat(web): pure reducer for Critique Theater states (Phase 7.1)

Pure CritiqueState reducer driven by the contracts-level PanelEvent
(the same shape both the live SSE stream and the recorded transcript
emit), so a single reducer powers both the in-flight panel and the
rerun replay. Lifecycle covers run_started → running → (shipped /
degraded / interrupted / failed), with panelist_open / dim /
must_fix / close / round_end events building per-round
CritiquePanelistView entries as they arrive.

Defensive behaviour that surfaced while writing the spec tests:
- Terminal phases (shipped / degraded / interrupted / failed) are
  sticky against further lifecycle events for the same run, except
  for parser_warning which can land late and is recorded in a side
  channel without changing phase.
- A new run_started for a different runId at any time discards the
  prior state and reboots, so the UI can launch consecutive runs
  without an explicit reset action.
- Events whose runId does not match the active run return the same
  state reference, so React's useReducer doesn't re-render
  subscribers on stray traffic.
- Round bookkeeping keys by round number rather than "always last",
  so an out-of-order panelist_dim for round 1 arriving after a
  round 2 dim does not corrupt the round 2 bucket.
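
A minimal sketch of the four guards above, not the real reducer. The
`Sketch*` names and the trimmed event shapes are hypothetical; the
real CritiqueState and PanelEvent carry more fields. Returning the
same reference for a repeated run_started with the active runId is
an assumption of this sketch.

```typescript
type SketchPhase = "idle" | "running" | "shipped" | "degraded" | "interrupted" | "failed";

// Hypothetical, trimmed-down event union (the contract has more variants).
type SketchEvent =
  | { type: "run_started"; runId: string }
  | { type: "panelist_dim"; runId: string; round: number; score: number }
  | { type: "ship"; runId: string }
  | { type: "parser_warning"; runId: string; kind: string };

interface SketchState {
  runId: string | null;
  phase: SketchPhase;
  scoresByRound: Record<number, number[]>; // keyed by round number, not "always last"
  warnings: string[]; // side channel for late parser warnings
}

const initialSketchState: SketchState = { runId: null, phase: "idle", scoresByRound: {}, warnings: [] };

const TERMINAL_PHASES: readonly SketchPhase[] = ["shipped", "degraded", "interrupted", "failed"];

function sketchReducer(state: SketchState, event: SketchEvent): SketchState {
  if (event.type === "run_started") {
    // Assumption: a repeated run_started for the active run is a no-op.
    if (event.runId === state.runId) return state;
    // A different runId discards the prior state and reboots.
    return { ...initialSketchState, runId: event.runId, phase: "running" };
  }
  // Stray traffic for another run: same reference, so no re-render.
  if (event.runId !== state.runId) return state;
  // parser_warning may land late; record it without changing phase.
  if (event.type === "parser_warning") {
    return { ...state, warnings: state.warnings.concat(event.kind) };
  }
  // Terminal phases are sticky against further lifecycle events.
  if (TERMINAL_PHASES.indexOf(state.phase) !== -1) return state;
  if (event.type === "ship") return { ...state, phase: "shipped" };
  // panelist_dim: append to the bucket for the event's own round.
  const bucket = state.scoresByRound[event.round] ?? [];
  return {
    ...state,
    scoresByRound: { ...state.scoresByRound, [event.round]: bucket.concat(event.score) },
  };
}
```

Because the guards return `state` itself rather than a copy, a
`useReducer` consumer can rely on reference equality to skip renders.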

Test coverage: 18 cases covering each transition, the runId guard,
sticky-terminal behaviour, the out-of-order round invariant, and
the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire
SSE + replay into the same reducer.

* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2)

createCritiqueEventsConnection is a pure connection manager that
mirrors apps/web/src/providers/project-events.ts: opens an
EventSource at /api/projects/:id/events, listens for every name in
CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent
(stripping the critique. prefix and merging the data payload), and
hands it to the caller's onEvent. Reconnect uses exponential
backoff (1s → 30s) and resets on `ready`; malformed payloads drop
with a dev-mode warning rather than tearing the stream.
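
The reconnect schedule can be sketched as below. The message only
fixes the 1s floor and the 30s ceiling; the doubling factor and the
`ReconnectBackoff` / `backoffDelayMs` names are assumptions made for
illustration.

```typescript
const BACKOFF_BASE_MS = 1_000; // first retry after 1s
const BACKOFF_MAX_MS = 30_000; // never wait longer than 30s

// Assumed doubling schedule: 1s, 2s, 4s, ... capped at 30s.
function backoffDelayMs(attempt: number): number {
  return Math.min(BACKOFF_BASE_MS * 2 ** attempt, BACKOFF_MAX_MS);
}

class ReconnectBackoff {
  private attempt = 0;

  // Delay to wait before the next reconnect attempt.
  nextDelayMs(): number {
    const delay = backoffDelayMs(this.attempt);
    this.attempt += 1;
    return delay;
  }

  // Called when the stream reports `ready`, so the next drop starts at 1s again.
  reset(): void {
    this.attempt = 0;
  }
}
```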

useCritiqueStream wraps the manager in a useReducer that owns the
CritiqueState. enabled=false or a null projectId tears down the
connection cleanly; switching projectId closes the old connection
and opens a fresh one. The returned dispatch lets local UI
synthesise actions (e.g. an Esc keypress firing a synthetic
interrupted while a kill request is in flight); production traffic
comes from the SSE stream.

Test coverage:
- sse.test.ts (10 cases, node env): subscription set covers every
  CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire
  shape back to PanelEvent; malformed JSON is swallowed and does
  not stop the stream; exponential backoff schedule and ready-reset
  semantics are pinned with a setTimeout seam; close() cancels
  pending reconnects and shuts the live source; no-op fallback
  when EventSource is unavailable.
- useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event,
  reducer driven by synthetic actions, no connection when disabled
  or projectId is null, clean close on unmount, projectId change
  reopens cleanly.

* feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3)

Fetches the per-run NDJSON transcript (one PanelEvent per line),
parses every line via the shared isPanelEvent predicate, and
dispatches into the same CritiqueState reducer the live SSE stream
uses. A single reducer means the UI rendering a replay can be
identical to the live panel, and a UI mounting both
useCritiqueStream and useCritiqueReplay in parallel does not have
to reconcile two state shapes.

speed knob is `paused | instant | live | { intervalMs: N }`.
- instant flushes every event synchronously, useful for opening a
  finished run already at its terminal state.
- intervalMs paces dispatches at a fixed cadence so the reviewer
  can watch the run unfold.
- paused parses the transcript but holds events back until the
  caller advances speed (consumers can drive a scrubber later).
- live is reserved for the future "playback at original cadence"
  feature, currently treated as instant; replay timestamps are not
  yet persisted with each event so honest pacing requires a
  follow-up Phase 7+ task.

gunzip seam handles `.ndjson.gz` transcripts via
DecompressionStream when present; the production fetch path picks
between text and arrayBuffer based on the URL extension. Both seams
are injectable so the unit tests don't need to spin up a real
network or a real gzip pipeline.

Test coverage (8 cases, jsdom env):
- Idle status before any URL is provided.
- speed=instant flushes the full transcript synchronously to
  shipped state.
- speed={intervalMs:N} paces with the setTimeout seam, reaching
  done after the last tick.
- speed=paused leaves status=playing with no dispatches.
- Empty transcript reports done with state still idle.
- Fetch rejection surfaces an error status with the message.
- Malformed NDJSON lines are skipped; valid events around them
  still land.
- .gz transcripts route through the gunzip seam.

Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream +
replay), all on one branch ready for review. Phases 8+ (Theater
components) consume these from this PR.

* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review)

Two P1 fixes from lefarcen's review on PR #1307:

SSE payload override

`sseToPanelEvent` previously spread `data` after the channel-derived
`type`, so a payload-provided `type` could override the channel and
route a `critique.run_started` frame into the reducer as a `ship`
action. Reversed the spread so the channel-derived `type` is
authoritative, and revalidated the resulting object through the
contracts-level `isPanelEvent` predicate before returning. Frames
that fail validation (missing runId, empty runId, unknown type) are
dropped, so a malformed or compromised SSE frame can no longer
dispatch a wrong-shape action into the reducer.
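
A sketch of the spread-order fix, under stated assumptions:
`decodeFrame` and `looksLikePanelEvent` are hypothetical stand-ins
for `sseToPanelEvent` and the contracts-level `isPanelEvent`, and the
known-type list is cut down to two entries.

```typescript
const SSE_PREFIX = "critique.";
const KNOWN_TYPES = ["run_started", "ship"]; // illustrative subset only

// Hypothetical header check: known type plus a non-empty string runId.
function looksLikePanelEvent(v: Record<string, unknown>): boolean {
  const eventType = v["type"];
  const runId = v["runId"];
  return (
    typeof eventType === "string" &&
    KNOWN_TYPES.indexOf(eventType) !== -1 &&
    typeof runId === "string" &&
    runId.length > 0
  );
}

function decodeFrame(channel: string, rawData: string): Record<string, unknown> | null {
  if (channel.indexOf(SSE_PREFIX) !== 0) return null;
  let data: unknown;
  try {
    data = JSON.parse(rawData);
  } catch {
    return null; // malformed JSON is dropped, not thrown
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) return null;
  // Payload first, channel-derived type last: the channel always wins.
  const candidate: Record<string, unknown> = {
    ...(data as Record<string, unknown>),
    type: channel.slice(SSE_PREFIX.length),
  };
  // Revalidate the merged object before handing it to the reducer.
  return looksLikePanelEvent(candidate) ? candidate : null;
}
```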

Three new sse.test.ts cases pin the regression: hostile `type:'ship'`
in the payload still resolves to `run_started`, missing runId is
dropped, empty runId is dropped.

Replay pause/resume

`useCritiqueReplay` had one big effect keyed on `transcriptUrl`
only, so flipping `speed` from `paused` to `instant` never re-fired
and the held events sat undispatched. Split into a parse effect
(depends on URL, fetches and stores events in state) and a pace
effect (depends on parsed-events + speed, owns the cursor + timers).
The playback cursor lives in a ref that survives pause/resume
cycles, so flipping `paused` -> `instant` flushes from the current
position rather than restarting (which would double-dispatch
`run_started` and reset the reducer).
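
The cursor idea, sketched outside React. `ReplayCursor` and
`advanceReplay` are hypothetical names; in the hook the index lives
in a ref and timers drive the paced case. Here `live` is collapsed
into `instant` and the timer is omitted.

```typescript
type ReplaySpeed = "paused" | "instant" | "live" | { intervalMs: number };

class ReplayCursor<E> {
  private index = 0; // survives pause/resume, so nothing is dispatched twice
  private readonly events: readonly E[];
  private readonly dispatch: (event: E) => void;

  constructor(events: readonly E[], dispatch: (event: E) => void) {
    this.events = events;
    this.dispatch = dispatch;
  }

  // Dispatch the next event; false once the transcript is exhausted.
  step(): boolean {
    if (this.index >= this.events.length) return false;
    this.dispatch(this.events[this.index] as E);
    this.index += 1;
    return true;
  }

  // Flush everything from the current position.
  drain(): void {
    while (this.step()) {
      /* keep stepping */
    }
  }
}

// One "tick" of the pace effect for a given speed.
function advanceReplay<E>(cursor: ReplayCursor<E>, speed: ReplaySpeed): void {
  if (speed === "paused") return; // hold events back
  if (speed === "instant" || speed === "live") {
    cursor.drain();
    return;
  }
  cursor.step(); // paced: one event per scheduled timer
}
```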

Two new useCritiqueReplay.test.tsx cases:
- paused-then-instant transitions from `playing` to `done` and
  reaches the shipped terminal phase
- intervalMs paced playback dispatches one event, pauses to drain
  the next scheduled timer, flips to instant, and confirms the
  remaining transcript drains exactly once (cursor was preserved)

Doc consistency

The earlier source comment in useCritiqueReplay.ts claimed `live`
"paces by recorded timestamps" while the impl used zero-delay
timers and the PR body said it behaves like `instant`. Aligned to
reality: `live` currently behaves like `{ intervalMs: 0 }` (events
drain on successive zero-delay timers via setTimeoutFn) because transcripts
do not yet carry per-event timestamps. Honest timestamp-driven
pacing is queued as a Phase 7+ follow-up.

Validated: pnpm guard, pnpm --filter @open-design/web typecheck,
Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite
96 files / 888 tests.

* feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread)

* feat(web): Theater PanelistLane component (Phase 8.1)

* feat(web): Theater ScoreTicker component (Phase 8.2)

* feat(web): Theater RoundDivider component (Phase 8.3)

* feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4)

* feat(web): Theater TheaterDegraded chip (Phase 8.5)

* feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6)

* feat(web): Theater TheaterTranscript replay surface (Phase 8.7)

* feat(web): Theater TheaterStage top-level container (Phase 8.8)

* feat(web): Theater CSS using existing semantic tokens (no hex literals)

* feat(web): Theater public exports barrel

* fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314)

Addresses all 4 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen.

State-lifecycle fixes (3 x P2)
1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`).
   Host hooks dispatch it when their gating prop changes so a stale
   run from a prior project / transcript cannot bleed into the next
   context. Reset is idempotent on idle (returns the same reference).
2. `useCritiqueStream` dispatches `__reset__` at the top of its
   connection effect, so a workspace switch from project A (which
   streamed a critique) to project B clears the reducer before the
   new EventSource opens. enabled=false also clears.
3. `useCritiqueReplay` dispatches `__reset__` at the top of its
   parse effect, so transcriptUrl swaps (including swap-to-null after
   a replay reached `shipped`) lift the reducer back to idle before
   the new fetch starts.

SSE validation (1 x P2)
4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape`
   check after the cheap `isPanelEvent` predicate. A
   `critique.ship` frame missing `composite` / `round` / `status` /
   `artifactRef` is rejected before reaching the reducer, so
   TheaterCollapsed can no longer crash on `undefined.toFixed(1)`.
   Every variant's required fields are validated: run_started
   (protocolVersion, non-empty cast, maxRounds, threshold, scale),
   panelist_* (round, role, plus variant-specific shape), round_end
   (round, composite, mustFix, decision in {continue,ship}, reason),
   ship (round, composite, status, artifactRef.{projectId,artifactId},
   summary), degraded (reason, adapter), interrupted (bestRound,
   composite), failed (cause), parser_warning (kind, position).

Reducer correctness (1 x P2)
5. `panelist_open` now materializes the round + an empty panelist
   view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight
   the in-progress lane the instant the tag opens. Before this, a
   stream that emitted only `panelist_open` after `run_started` left
   `rounds = []` and the UI rendered no current round until a later
   `panelist_dim` arrived.

Polish (3 x P3)
6. Brand role tint swaps from `var(--magenta, var(--accent))` to
   `var(--purple, var(--accent))`. `--purple` is actually defined
   across the design systems; `--magenta` is not, so Brand was
   silently falling through to `--accent` and looking identical to
   Designer.
7. New i18n key `critiqueTheater.interruptedSummary` for the
   interrupted-collapse copy ("Interrupted at round N, best
   composite X.X"). Previously the interrupted branch reused
   `shippedSummary` and the UI read "Shipped at round..." for a run
   that specifically did not ship. Native value in en + zh-CN; other
   locales fall back via `...en` spread.
8. `TheaterDegraded` heading id comes from `useId()` instead of a
   hardcoded `theater-degraded-heading`, so two chips rendered on
   the same page (chat history with multiple completed runs) keep
   their aria-labelledby references unambiguous.

Tests (15 new cases)
- reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data.
- sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship.
- useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false.
- useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped.
- TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...".
- TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new)
- tests/i18n/locales.test.ts 5 of 5 across 18 locales

* feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1)

* feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2)

* fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315)

Addresses every blocker from codex, Siri-Ray, and lefarcen. The
three state-lifecycle and SSE-validation issues they also flagged
inherit fixes from PR #1314's review pass that this branch now sits
on top of after rebase.

Real daemon kill on Interrupt (P1)
- CritiqueTheaterMount now POSTs to
  /api/projects/:id/critique/:runId/interrupt alongside the
  optimistic local dispatch. Before this fix, clicking Interrupt
  only flipped the React state to interrupted while the daemon job
  kept running. The fetch is best-effort: a 404 (endpoint not wired
  yet, lands in Phase 15) is swallowed with a dev-mode console.warn
  so the UI still moves to the collapsed badge.
- New fetchInterrupt test seam lets RTL assert on the URL / method
  and simulate the "daemon not ready yet" path. Two tests pin both:
  the happy URL proj-42/critique/run-abc/interrupt POSTs, and a
  rejected fetch still flips the UI.

interruptPending reset on new run (P2)
- A ref-backed effect compares the current runId against the last
  one we saw; when it changes, interruptPending is cleared. A user
  who interrupts run-1 and then triggers run-2 from the same mount
  now gets a fresh, enabled kill button instead of one stuck in
  "Interrupting…". Pinned by a new mount test.

Escape keybind scope (P2)
- InterruptButton now checks the keydown target. Escape inside an
  input, textarea, select, or contenteditable element is ignored
  (and any ancestor of those via closest() is treated the same
  way). Body-level focus still fires the keybind so the Theater
  area's affordance keeps working. Four new tests cover textarea,
  input, contenteditable, and the body-focus positive case.

userFacingName i18n key (P2)
- The spec at specs/current/critique-theater.md:6 mandates a single
  critiqueTheater.userFacingName key so the "Design Jury" label can
  be renamed without touching code. Phase 8 introduced
  critiqueTheater.title by mistake; renamed across types.ts, en.ts,
  zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer
  TheaterStage.tsx. The locale alignment test stays green.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 14 files, 112 tests (was 101 before, +11 new for
  the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope;
  the rest were already in #1314's review fix).
- tests/i18n/locales.test.ts 5 of 5 across 18 locales.

* feat(daemon): adapter-degraded registry with TTL (Phase 10.1)

In-memory registry recording adapters that produced malformed or
oversize transcripts so the orchestrator can skip them for a TTL
window (default 24h) instead of cycling through known-bad providers
on every run.

Records carry reason (malformed_block | oversize_block |
missing_artifact), source label, and expiresAt. The test-only
clock seam lets the suite advance time deterministically and prove
that an expired entry stops counting as degraded without anyone
calling clearDegraded.
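
A sketch of the TTL behaviour with an injectable clock. Only the
reason values, `expiresAt`, the 24h default, and `clearDegraded` come
from the message above; `DegradedRegistry`, `markDegraded`, and
`isDegraded` are assumed names.

```typescript
type DegradedReason = "malformed_block" | "oversize_block" | "missing_artifact";

interface DegradedRecord {
  reason: DegradedReason;
  source: string;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 24h

class DegradedRegistry {
  private readonly records: Record<string, DegradedRecord> = {};
  private readonly now: () => number;
  private readonly ttlMs: number;

  // `now` is the clock seam: tests pass a fake, production uses Date.now.
  constructor(now: () => number = () => Date.now(), ttlMs: number = DEFAULT_TTL_MS) {
    this.now = now;
    this.ttlMs = ttlMs;
  }

  markDegraded(adapterId: string, reason: DegradedReason, source: string): void {
    this.records[adapterId] = { reason, source, expiresAt: this.now() + this.ttlMs };
  }

  // An expired entry stops counting without anyone clearing it.
  isDegraded(adapterId: string): boolean {
    const rec = this.records[adapterId];
    return rec !== undefined && rec.expiresAt > this.now();
  }

  clearDegraded(adapterId: string): void {
    delete this.records[adapterId];
  }
}
```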

7/7 vitest cases green.

* feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2)

Two test-only adapters that read the existing v1 transcript
fixtures (happy-3-rounds and malformed-unbalanced) and replay them
as either a full string or a 512-byte chunked stream. The chunked
form is what the conformance harness uses to prove the parser
holds together when the transcript arrives in arbitrary network
slices, not as one buffered blob.
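
The chunked form could look roughly like this. `chunkBytes` is a
hypothetical helper, not the fixture's actual API; it returns views
over the same buffer rather than copies.

```typescript
// Split a transcript into fixed-size slices; the last slice may be shorter.
function chunkBytes(bytes: Uint8Array, chunkSize: number = 512): Uint8Array[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  const chunks: Uint8Array[] = [];
  for (let offset = 0; offset < bytes.length; offset += chunkSize) {
    // subarray clamps the end index, so no bounds check is needed.
    chunks.push(bytes.subarray(offset, offset + chunkSize));
  }
  return chunks;
}
```

Feeding the parser one slice at a time exercises tag boundaries that
fall mid-chunk, which a single buffered string never does.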

* feat(daemon): adapter conformance harness (Phase 10.3)

runAdapterConformance pulls a transcript through the same
parseCritiqueStream pipeline the orchestrator uses and classifies
the outcome as shipped, degraded, or failed. On a degraded
outcome it forwards the matched reason to the adapter-degraded
registry, so a single nightly conformance run is what populates
the skip list rather than the orchestrator learning each adapter
is broken at request time.

5/5 vitest cases green covering shipped, malformed degraded,
oversize degraded, no-ship failure, and the harness-thrown
failure path.

* test(e2e): Critique Theater Playwright suite (Phase 11)

Six tests, one viewport per visual case, deterministic SSE
fixtures stubbed via page.route(). Adds the suite to
test:ui:extended so the existing extended-UI lane picks it up.

Coverage:

  1. Happy path: a single mounted theater plays the full
     fixture (1 run_started, 5 panelists open / dim / must_fix /
     close, 1 round_end, 1 ship) and ends on the score badge.
  2. Interrupt mid-run: the panelist that is open at the time
     the interrupt button is clicked closes with an interrupted
     marker and the transcript freezes there.
  3. Visual regression at 375x720 mobile.
  4. Visual regression at 768x1024 tablet.
  5. Visual regression at 1280x800 desktop.
  6. A11y role tree: the theater region exposes a labelled
     landmark, each panelist lane is a group with an accessible
     name, the score is a status live region.

All SSE traffic is stubbed by page.route so the suite runs in CI
without a daemon. The toggle is seeded via localStorage by
bootAppWithCritiqueEnabled so the gate behaves as if Settings
flipped it on. typecheck clean; playwright --list reports 6.

* test(web): reducer p99 bench at 10k iterations (Phase 13.1)

Locks the documented 2ms budget for the Critique Theater reducer
on a representative SSE script (27 actions, one full happy run)
behind a regression gate. Asserts p99 stays under 4ms (2x the
documented budget) so CI runners with a noisy neighbour do not
flake while a real regression to 20ms or 200ms still trips.
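
A sketch of the gate arithmetic, assuming a nearest-rank percentile;
the real bench may compute p99 differently. `percentile` and
`checkP99Gate` are hypothetical names, and the timing loop that
produces the samples is left out.

```typescript
const DOCUMENTED_BUDGET_MS = 2;
const P99_CEILING_MS = 2 * DOCUMENTED_BUDGET_MS; // 4ms: headroom for noisy CI runners

// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

// Pass/fail plus a message that carries enough of the distribution for triage.
function checkP99Gate(samplesMs: readonly number[]): { ok: boolean; message: string } {
  const p50 = percentile(samplesMs, 50);
  const p99 = percentile(samplesMs, 99);
  const max = percentile(samplesMs, 100);
  return {
    ok: p99 < P99_CEILING_MS,
    message: `p50=${p50}ms p99=${p99}ms max=${max}ms ceiling=${P99_CEILING_MS}ms`,
  };
}
```

With 10k samples, up to 100 outliers can exceed the ceiling before
p99 moves, which is what keeps a single slow iteration from tripping
the gate.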

The bench is a vitest case rather than a bare microbenchmark so
it runs in the same CI lane as every other web test and does not
need a parallel runner.

* test(web): critique surface coverage walker (Phase 13.2)

Walks the public critique surface (11 SSE event names, 5 panelist
roles, 6 lifecycle phases, 9 named i18n keys) and asserts each
named symbol appears in both the src corpus and the test corpus.
The walker is the gate that catches a rename in one half of the
codebase without a matching update in the other half: a future
PR that drops 'panelist_must_fix' from the reducer without also
removing its test reference fails this suite.

62 assertions, one per symbol per corpus.

* fix(web): tighten Phase 13 gates from lefarcen review (PR #1318)

Address the actionable items from lefarcen's review of the two
Phase 13 CI gates. The two questions about longer-term DX (pre-
commit hook to auto-update the symbol table, AST-walker swap)
are documented as deferred follow-ups rather than landed here.

reducer-bench:
  - Describe renamed to 'reducer p99 regression gate (Phase 13.1)'
    so it reads as a gate, not a comparative benchmark.
  - Failure message now carries the full distribution
    (p50 / p90 / p99 / max + ceiling), so triage on a tripped gate
    can distinguish a real 20ms regression from a 4.001ms CI hiccup
    without re-running locally (lefarcen Q3).
  - Captured a baseline (p50=0.011ms p90=0.013ms p99=0.018ms
    max=0.244ms on a local Node 24 / Win11 run, 2026-05-11) inside
    the docblock so reviewers can see the actual reading sits ~222x
    below the 4ms ceiling (lefarcen Q1).
  - Replaced 'role as any' casts with PanelistRole-typed casts so
    the fixture is typecheck-strict.
  - Phase numbering corrected (13.2 → 13.1 to match the PR body).

critique-coverage:
  - Symbols now grouped under four describe blocks (SSE events /
    panelist roles / lifecycle phases / i18n keys) so a failure
    points at the category that drifted at a glance (lefarcen nit).
  - Docblock now explains the grep-over-AST trade-off (the bug
    class is structural at the string level, not at the AST level)
    and points at the future AST-walker work as a deferred follow-
    up (lefarcen Q2).
  - Docblock now walks a contributor through the four-step
    maintenance flow (add to contract → add caller → add test →
    add literal here), so the next person to add an SSE event or
    i18n key knows the gate exists and what to update (lefarcen
    Q4).
  - Phase strings switched from 'phase: <name>' to bare-quoted
    literals so the walker is robust against single vs double
    quotes and ':' vs '===' source-shape changes.
  - Dead try/catch around 'stack = [root]' removed (cannot throw).
  - Per-symbol failure messages name the symbol AND which corpus
    is missing it, so the gate is self-describing on the next
    CI red.
  - Phase numbering corrected (13.4 → 13.2 to match the PR body).

63 / 63 vitest cases green (1 bench + 62 coverage). Web
typecheck clean.

* fix(web): tighten coverage walker semantics from lefarcen P2/P3 (PR #1318)

Two follow-on findings on commit 338a185:

P2 — coverage gate weakened. The previous revision used one helper
`corpusReferences` for both SRC and TEST corpora, and that helper
accepted the unprefixed PanelEvent type form (`type: 'panelist_must_fix'`)
as a substitute for the prefixed SSE wire name (`critique.panelist_must_fix`).
The fallback is correct on the TEST side (reducer tests dispatch
PanelEvent literals) but it weakened the SRC side: production code
could drop the SSE channel name silently and the PanelEvent type
alias would keep the walker green.

Split into two helpers: `srcReferences` is strict (exact substring
match only, no fallback) and `testReferences` keeps the lenient
fallback for SSE events. The production-side assertions now route
through `srcReferences` so the wire name is load-bearing again.
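
The split can be sketched as follows. The function names come from
the message above, but the bodies are assumptions; in particular the
lenient fallback here accepts any quoted bare literal, which may be
looser or stricter than the real helper.

```typescript
// Strict: production code must contain the exact wire name.
function srcReferences(corpus: string, wireName: string): boolean {
  return corpus.indexOf(wireName) !== -1;
}

// Lenient: tests may use the unprefixed PanelEvent type literal instead.
function testReferences(corpus: string, wireName: string): boolean {
  if (corpus.indexOf(wireName) !== -1) return true;
  const bare = wireName.replace(/^critique\./, "");
  return corpus.indexOf(`'${bare}'`) !== -1 || corpus.indexOf(`"${bare}"`) !== -1;
}
```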

P3 — maintenance doc overclaimed. The previous revision said 'CI red
if you forget step 4' but the symbol arrays are partially hand-
maintained, so a contributor adding a NEW phase string or i18n key
without updating the array leaves CI green (the walker never knew
to look). Rewrote the failure-mode section to distinguish the two
cases:

  - Renaming an EXISTING symbol without updating the walker → CI red
    (existing assertion fails because the old name is gone).
  - Adding a NEW hand-maintained symbol without updating the walker
    → CI stays green (walker does not know to look for it).

Also clarified that `SSE_EVENTS` and `PANELIST_ROLE_STRINGS` are
auto-built from contracts so step 4 is one-line for `PHASE_STRINGS`
and `I18N_KEYS` only.

63 / 63 vitest cases still green.

* fix(web): close two P2 findings on PR #1318 (Siri-Ray + lefarcen)

P2 (coverage walker counted self as evidence). The walker walked
apps/web/tests, which contains apps/web/tests/components/Theater/
critique-coverage.test.ts itself. The hand-maintained PHASE_STRINGS
and I18N_KEYS literals inside that file would satisfy the test-side
coverage assertion against themselves, so a real Theater test that
covers a symbol could be deleted and the gate would still pass.

Excluded the walker file from TEST_FILES via path.resolve(__filename)
filter so the test corpus only contains independent evidence.

Once the walker stopped seeing itself, the gate correctly red-flagged
nine i18n keys that no INDEPENDENT test exercises:
critiqueTheater.userFacingName, roundLabel, composite, threshold,
interrupt, interrupted, degradedHeading, shippedSummary,
interruptedSummary. Component tests like TheaterCollapsed.test.tsx
exercise the rendered text but never mention the key STRING, so the
walker couldn't see them. Closed that gap by adding
apps/web/tests/components/Theater/critique-i18n-keys.test.ts: 9 cases,
one per watched key, asserting the dictionary entry exists as a
non-empty string. That's both real coverage (catches a stale dict)
and the independent evidence the walker requires.

P2 (interruptedSummary missing from de/ja/ko/zh-TW). The native
locale overrides were missing the key, so an interrupted run on a
German / Japanese / Korean / Traditional Chinese UI silently fell
back to the English string via the ...en spread. Added the key with
{round} and {composite} placeholders preserved, using PerishCode's
suggested copy from the earlier review thread.

Verified:
- pnpm --filter @open-design/web typecheck clean.
- pnpm exec vitest run tests/components/Theater tests/i18n:
  20 files / 190 tests green (critique-coverage 62 / 62,
  critique-i18n-keys 9 / 9 new, reducer-bench 1 / 1, locales 5 / 5).

* fix(web): drop the Dict cast in i18n key coverage test (lefarcen P1 / Siri-Ray on PR #1318)

The previous revision used `(en as Record<string, string>)[key]` to
read each watched key. Dict has no string index signature, so CI's
strict typecheck rejected the broad cast with TS2352 even though the
runtime assertion was fine.

Replaced with the typed pattern lefarcen suggested: type WATCHED_KEYS
as `readonly (keyof typeof en)[]` and read `en[key]` directly. That
removes the cast and also strengthens the test, because a renamed or
removed key now fails the type check immediately rather than at
runtime.
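
The typed pattern, sketched against a stand-in dictionary. `enSketch`
and its interruptedSummary copy are illustrative, not the real `en`
locale, and `emptyWatchedKeys` is a hypothetical helper.

```typescript
// Stand-in for the real `en` dictionary (illustrative values).
const enSketch = {
  "critiqueTheater.userFacingName": "Design Jury",
  "critiqueTheater.interruptedSummary": "Interrupted at round {round}, best composite {composite}",
};

// Each entry must be an actual key of enSketch, or the file fails to compile.
const WATCHED_KEYS_SKETCH: readonly (keyof typeof enSketch)[] = [
  "critiqueTheater.userFacingName",
  "critiqueTheater.interruptedSummary",
];

// Returns the watched keys whose dictionary entry is empty.
function emptyWatchedKeys(): string[] {
  const empty: string[] = [];
  for (const key of WATCHED_KEYS_SKETCH) {
    const value: string = enSketch[key]; // no cast: key is a known property name
    if (value.length === 0) empty.push(key);
  }
  return empty;
}
```

Renaming a key in the dictionary without touching the array turns
into a compile error on the array literal, before any test runs.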

Verified:
- pnpm --filter @open-design/web typecheck clean.
- pnpm --filter @open-design/web exec vitest run
  tests/components/Theater/critique-i18n-keys.test.ts: 9 / 9 green.

* fix(web): tighten isPanelEvent in contracts so enum + numeric fields are checked end-to-end (Siri-Ray round-3 P1 on PR #1314)

The variant validator on the web SSE path previously accepted any
`typeof === 'string'` for closed-enum fields (ship.status,
panelist_*.role, degraded.reason, failed.cause, parser_warning.kind,
run_started.cast[]) and any `typeof === 'number'` for numeric fields,
which let NaN / Infinity through. Downstream components index i18n
tables by enum value, so an unknown status or role would land
`SHIP_BADGE_KEY[final.status]` on undefined and crash the translator.

The replay parser had a separate gap: `useCritiqueReplay.parseTranscript`
called the cheap `isPanelEvent` header check directly, so a recorded
line like `{"type":"ship","runId":"r"}` reached the reducer with
composite, status, round, artifactRef, summary all undefined and
TheaterCollapsed then called `final.composite.toFixed(1)` on undefined.

Resolution: move all wire-side validation into the contract guard.

- Export const arrays for the closed enums:
  SHIP_STATUSES, DEGRADED_REASONS, FAILED_CAUSES, PARSER_WARNING_KINDS,
  ROUND_DECISIONS (PANELIST_ROLES already existed).
- Rewrite `isPanelEvent` in packages/contracts/src/critique.ts to be the
  single deep validator: header (known type + non-empty runId) plus
  every variant-specific required field plus closed-enum membership
  plus Number.isFinite on every numeric field. Documented as the wire
  source of truth.
- Drop the local `hasValidVariantShape` from web/sse.ts; sseToPanelEvent
  now relies entirely on the contract guard, and parseTranscript in
  useCritiqueReplay (which already uses isPanelEvent) gets the deeper
  validation for free.

Tests (TDD, red-first):

- packages/contracts/tests/critique.test.ts: 13 new cases pinning the
  strict guard directly (well-formed across every variant, every
  rejection path: unknown type, empty/non-string runId, unknown enum,
  non-finite numeric, missing variant field).
- apps/web/tests/components/Theater/state/sse.test.ts: 9 new cases for
  each closed-enum rejection on the wire path plus a positive sweep
  across every legal enum value across every variant.
- apps/web/tests/components/Theater/hooks/useCritiqueReplay.test.tsx:
  2 new cases for incomplete and unknown-enum transcript lines.

Verified:
- pnpm --filter @open-design/contracts test 4 files / 30 tests green.
- pnpm --filter @open-design/contracts build clean.
- pnpm --filter @open-design/web typecheck clean.
- pnpm --filter @open-design/web test 107 files / 976 tests green.

* fix(contracts): enforce numeric domains in isPanelEvent (lefarcen P2 on PR #1314 round 4)

The strict guard from PR #1314 round 3 enforced enum membership and
Number.isFinite, but accepted any finite number where the contract
intends a specific domain: scale: 0 (ScoreTicker divides by it),
negative thresholds, fractional rounds, negative mustFix, etc.
ScoreTicker.tsx writes `var(--scale, ${state.scale})` into inline
CSS and divides by it for tick width, so a guard-passing scale: 0
shipped Infinity into the rendered style. Negative composite /
score values reached downstream code that assumes >= 0.

Resolution: mirror the daemon-side Zod domain constraints in the
runtime guard.

Three new helpers in packages/contracts/src/critique.ts:

  - isPositiveInt(v): integer with v > 0. Used for round, maxRounds,
    scale, protocolVersion (all 1-indexed in the orchestrator).
  - isNonNegativeInt(v): integer with v >= 0. Used for mustFix,
    position, bestRound. bestRound: 0 is the valid sentinel for
    'interrupted before any round closed'.
  - isNonNegativeFinite(v): finite number with v >= 0. Used for
    composite, score, dimScore, threshold. Threshold may be
    fractional (e.g. 8.5 on a scale of 10).

Cross-field check inside run_started: threshold <= scale (the daemon
Zod schema enforces this with an epsilon refine, the wire guard
matches the same intent).
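
A sketch of the three helpers and the cross-field check. Helper
names match the message above; the bodies are assumptions, the
epsilon from the Zod refine is omitted, and `hasValidRunStartedNumbers`
is a hypothetical wrapper.

```typescript
// For number inputs, isFinite + floor equality is equivalent to an integer check.
function isPositiveInt(v: unknown): v is number {
  return typeof v === "number" && isFinite(v) && Math.floor(v) === v && v > 0;
}

function isNonNegativeInt(v: unknown): v is number {
  return typeof v === "number" && isFinite(v) && Math.floor(v) === v && v >= 0;
}

function isNonNegativeFinite(v: unknown): v is number {
  return typeof v === "number" && isFinite(v) && v >= 0;
}

interface RunStartedNumbers {
  protocolVersion: unknown;
  maxRounds: unknown;
  scale: unknown;
  threshold: unknown;
}

// Numeric part of the run_started check, including threshold <= scale.
function hasValidRunStartedNumbers(e: RunStartedNumbers): boolean {
  const { protocolVersion, maxRounds, scale, threshold } = e;
  return (
    isPositiveInt(protocolVersion) &&
    isPositiveInt(maxRounds) &&
    isPositiveInt(scale) &&
    isNonNegativeFinite(threshold) &&
    threshold <= scale
  );
}
```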

Tests (TDD, red-first) added in packages/contracts/tests/critique.test.ts:

  - 22 new rejection cases across every numeric field that
    previously slipped through: scale: 0, negative scale, fractional
    scale, maxRounds: 0, fractional maxRounds, protocolVersion: 0,
    fractional protocolVersion, negative threshold, threshold > scale,
    round: 0, fractional round, negative dimScore / score, negative /
    fractional mustFix, negative composite, ship round: 0, negative /
    fractional bestRound, negative interrupted composite, negative /
    fractional parser_warning position.
  - 3 positive boundary cases that must still pass: threshold == scale,
    fractional threshold within [0, scale], interrupted with
    bestRound: 0 (no round completed before interrupt), parser_warning
    with position: 0 (start of stream).

Verified:
- pnpm --filter @open-design/contracts build clean.
- pnpm --filter @open-design/contracts test: 4 files / 59 tests green
  (was 37 before the new domain cases).
- pnpm --filter @open-design/web typecheck clean.
- pnpm --filter @open-design/web test: 110 files / 1004 tests green;
  no regression on Theater suite, sse validator, replay parser, or
  assistant-feedback widget tests.

* fix(web): restore wait-for-daemon-ack pattern on Theater interrupt

Same regression as flagged on PR #1316 post-main-merge: the
optimistic local dispatch fired before the POST resolved, so a
daemon 404 / 409 still terminalized the UI and the real SSE
terminal event got ignored by the sticky interrupted phase.

Snapshot runId / bestRound / composite at click time, dispatch
interrupted only on res.ok, clear interruptPending on rejection or
non-2xx so the user can retry. Tests cover rejection + 404 leaving
the run on the live stage; the 204 path waits for the ack.
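
The wait-for-ack flow described above can be sketched roughly as
follows. All names here (`requestInterrupt`, `InterruptDeps`, `post`)
are hypothetical stand-ins for the component's real wiring, and leaving
the pending flag set on success is an assumption.

```typescript
// Hypothetical sketch of the wait-for-daemon-ack pattern; not the real component code.
type InterruptSnapshot = { runId: string; bestRound: number; composite: number };

interface InterruptDeps {
  post: (runId: string) => Promise<{ ok: boolean }>; // stand-in for the interrupt POST
  dispatchInterrupted: (snapshot: InterruptSnapshot) => void;
  setInterruptPending: (pending: boolean) => void;
}

async function requestInterrupt(
  snapshot: InterruptSnapshot, // captured at click time by the caller
  deps: InterruptDeps,
): Promise<boolean> {
  deps.setInterruptPending(true);
  try {
    const res = await deps.post(snapshot.runId);
    if (res.ok) {
      // Only a 2xx ack terminalizes the UI.
      deps.dispatchInterrupted(snapshot);
      return true;
    }
  } catch {
    // Network rejection: fall through to the retry path below.
  }
  // Rejection or non-2xx: clear the flag so the user can retry.
  deps.setInterruptPending(false);
  return false;
}
```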

* test(e2e): move critique-coverage walker from apps/web/tests to e2e/tests (Siri-Ray P2)

The walker is by definition a cross-app consistency check: it reads
the web reducer, the daemon critique module, the contracts package,
and the e2e UI suite. Hosting it under apps/web/tests/ violated the
repo boundary rule (root AGENTS.md): app packages must not import
another app's private src/ or tests/ as a shared helper, and
cross-app consistency checks belong in e2e/tests/. The web test
lane was effectively coupled to daemon and e2e file layout, so a
daemon-only refactor could break the web lane.

Moved the file to e2e/tests/critique-coverage.test.ts and switched
the contracts import to the import.meta.glob shape the e2e package
already uses (see localized-content.test.ts), so the e2e package
does not have to add @open-design/contracts as a workspace dep just
to load two const arrays. REPO_ROOT and SELF_PATH recalculated for
the new location.

Web test lane no longer depends on daemon, contracts, or e2e layout.
The e2e walker covers the same 62 assertions as before:

  e2e/tests/critique-coverage.test.ts  62 / 62 green

Web typecheck clean, e2e typecheck clean.

* fix(test): add projectKind prop to FileViewer deck render after v0.7.0 merge

---------

Co-authored-by: Nagendhra <nagendhra405@gmail.com>
2026-05-14 15:55:36 +08:00
Fl0rencess
53148d52c8
feat(media): add SenseAudio TTS provider (#1633)
* feat(media): add SenseAudio TTS provider

Add SenseAudio (https://docs.senseaudio.cn) as a new TTS provider
alongside ElevenLabs / MiniMax / FishAudio / Volcengine. Surfaced as
the `senseaudio-tts` catalogue id, mapped on the wire to
`senseaudio-tts-1.5-260319`, SenseAudio's flagship model with
emotion / polyphonic-character / formula read-aloud / clone / text-generated voice support.

Scope here is HTTP non-streaming (POST /v1/t2a_v2 with stream=false)
only; SSE and WebSocket transports are intentionally out of scope.

- Mirror provider + model entries in apps/daemon and apps/web
  registries (catalogue drift check stays green).
- ENV_KEYS gets `OD_SENSEAUDIO_API_KEY` / `SENSEAUDIO_API_KEY` so the
  alias scheme matches every other integrated provider.
- `renderSenseAudioTTS` in media.ts mirrors renderMinimaxTTS: Bearer
  auth, voice_setting / audio_setting body, hex-decoded audio under
  `data.audio`, base_resp envelope split from HTTP-level failures.
- NewProjectPanel's audio supportedProviders allowlist now includes
  `senseaudio` so the picker actually surfaces the new entry.
- Audio shape (mp3 / 32kHz / 128kbps / stereo) and default voice
  (`female_0033_b`) hard-coded for parity with the other TTS paths;
  MediaContext is unchanged.
- New apps/daemon/tests/media-senseaudio.test.ts (8 specs) covers
  defaults, custom voice, default base URL fall-back, env-key path,
  missing-key error, base_resp failures, missing audio, and HTTP
  non-2xx — patterned on media-elevenlabs.test.ts.

* docs(media): drop Chinese from SenseAudio provider comment

Translate the model-capabilities line in the SenseAudio block comment
(media.ts) into English. Keeping the source comments in a single
language matches the rest of the daemon and avoids reviewer churn over
mixed-locale prose.

* fix(web): unblock openai and volcengine speech models in audio picker

Per review on #1633, supportedModels()'s audio allowlist in
NewProjectPanel was still filtering out gpt-4o-mini-tts (openai) and
doubao-tts (volcengine) even though both are marked `integrated: true`
in the shared media-models catalogue. Add the two ids so the picker
matches the registry and the PR body's "alongside doubao-tts" claim
holds true.

* style(media): normalize speech hints to bare provider names

Strip the trailing descriptions on the speech catalogue hints so every
entry shows just the provider name (matching FishAudio / ElevenLabs /
SenseAudio): `gpt-4o-mini-tts` → "OpenAI", `minimax-tts` → "MiniMax",
`doubao-tts` → "Volcengine". Also move `gpt-4o-mini-tts` to the end of
the list so the OpenAI entry sits after the upstream-focused providers,
matching the recent picker grouping discussion on #1633.

Mirrored in both apps/daemon/src/media-models.ts and apps/web/src/media/
models.ts; catalogue drift check + daemon (1848) + web (1150) suites all
green.
2026-05-14 15:26:38 +08:00
nettee
6b3cc61714
Revert "Refactor agent runtime stream handling behind adapter (#1622)" (#1656)
This reverts commit 8cb9cdb593.
2026-05-14 15:23:19 +08:00
Weston Houghton
de4430cf4e
fix(web): route remaining crypto.randomUUID calls through utils/uuid (#849) (#1621)
`crypto.randomUUID` is undefined on non-secure contexts (plain HTTP +
non-localhost — the standard Docker / NAS / unRAID self-hosted setup
e.g. `http://192.168.1.x:7456`). PR #900 introduced
`apps/web/src/utils/uuid.ts` as a tiered v4 helper that degrades to
`crypto.getRandomValues` and ultimately `Math.random`, so the original
"Create button silently does nothing" symptom (#849, #394) went away.

PR #1428 added three unguarded `crypto.randomUUID()` calls in the new
PostHog analytics provider, and `apps/web/src/runtime/exports.ts`
carried a fourth from older PDF-export work. On non-secure contexts
these throw `TypeError: crypto.randomUUID is not a function` during
`<AnalyticsProvider>` rendering, taking the whole app shell down before
any UI mounts. PDF export also fails when the print-ready handshake
nonce is generated.

Route all four sites through the existing `randomUUID()` helper.
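
For readers unfamiliar with the tiering, a rough sketch of such a
helper follows. This is not the contents of
`apps/web/src/utils/uuid.ts`; `randomUuidSketch` and `CryptoLike` are
hypothetical names, and the injectable `source` parameter exists only
to make each tier easy to exercise.

```typescript
// Hypothetical sketch of a tiered v4 UUID helper; the real helper may differ.
interface CryptoLike {
  randomUUID?: () => string;
  getRandomValues?: (array: Uint8Array) => unknown;
}

function randomUuidSketch(
  source: CryptoLike | null =
    (globalThis as unknown as { crypto?: CryptoLike }).crypto ?? null,
): string {
  // Tier 1: native randomUUID (browsers expose it in secure contexts only).
  if (source && typeof source.randomUUID === "function") return source.randomUUID();

  const bytes = new Uint8Array(16);
  if (source && typeof source.getRandomValues === "function") {
    // Tier 2: cryptographically strong bytes.
    source.getRandomValues(bytes);
  } else {
    // Tier 3: Math.random, a last resort that is not cryptographically strong.
    for (let i = 0; i < bytes.length; i++) bytes[i] = Math.floor(Math.random() * 256);
  }
  bytes[6] = ((bytes[6] ?? 0) & 0x0f) | 0x40; // version 4
  bytes[8] = ((bytes[8] ?? 0) & 0x3f) | 0x80; // RFC 4122 variant
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}
```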
2026-05-14 14:59:14 +08:00
Nagendhra Madishetti
d5566d7627
feat(daemon): user-configurable model alias for the media dispatcher (#1277) (#1309)
* feat(daemon): user-configurable model alias / redirect for the media dispatcher (#1277)

Tilmirs's use case in #1277: their Doubao access has moved from
`doubao-seedream-3-0-t2i-250415` to `doubao-seedream-5-0`, but the
project's registered catalog still emits the old id. Every call
fails because the old name no longer resolves at Volcengine.
Until now the only workaround was patching the source on every
update.

This adds a user-configurable alias layer that swaps the catalog
id for whatever wire-name the provider expects, without changing
the catalog itself. Two storage layers (env wins over disk,
matching the rest of media-config):

1. **Environment variable** `OD_MEDIA_MODEL_ALIASES` carries a
   JSON map: `'{"doubao-seedream-3-0-t2i-250415":"doubao-seedream-5-0"}'`.
   Single var, portable across shells (Windows cmd.exe rejects
   hyphens in env-var names, so the per-id pattern lefarcen
   suggested wouldn't have worked on Windows). Malformed JSON is
   tolerated — falls through to the on-disk map rather than
   blowing up mid-generation.

2. **media-config.json** gains a top-level `aliases` field:
   ```json
   {
     "providers": { ... },
     "aliases": {
       "doubao-seedream-3-0-t2i-250415": "doubao-seedream-5-0"
     }
   }
   ```
   The Settings UI's existing PUT writes providers only, so the
   writeStored path now reads the existing aliases and preserves
   them on every write. Without that, a Settings save would
   silently wipe the user's aliases. The Settings UI surface for
   editing aliases is a separate follow-up; manual JSON edit and
   the env var are the v1 entry points.

The resolution happens inside `startMediaGeneration` after the
catalog lookup and surface validation have already accepted the
registered id, so users still get the "unknown model" error if
they request a catalog id that doesn't exist. The swap only
changes what the provider receives on the wire (volcengine,
openai, grok, fal, nanobanana etc. each pass `ctx.model`
straight into their request body).

Per-provider auto-output-name and the file-naming side use the
function-level `model` parameter (the catalog id), so a `.png`
named after `doubao-seedream-3-0-t2i-250415` keeps surfacing the
registered id the agent / CLI asked for, not the wire-level
alias. `providerNote` strings include the wire name so the user
can see what was actually sent.

Public API additions:
- `resolveModelAlias(projectRoot, modelId)` -> the wire name (or
  the original if no alias matches).
- `readAliasMap(projectRoot)` -> { effective, env, stored } for
  the future Settings UI's source-attribution display.
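
The precedence rules above (env over stored, malformed env JSON falling
through, bad entries dropped) could be sketched like this. The names
`coerceAliasMap`, `parseEnvAliases`, and `resolveAliasSketch` are
hypothetical; the real `resolveModelAlias(projectRoot, modelId)` reads
from disk and the environment itself, whereas this sketch takes both
inputs as arguments.

```typescript
// Hypothetical sketch of alias resolution; not the daemon's implementation.
type AliasMap = Record<string, string>;

// Keep only entries with a non-blank key and a non-empty string value.
function coerceAliasMap(raw: unknown): AliasMap {
  const out: AliasMap = {};
  if (raw === null || typeof raw !== "object" || Array.isArray(raw)) return out;
  for (const [key, value] of Object.entries(raw as Record<string, unknown>)) {
    if (key.trim() !== "" && typeof value === "string" && value.trim() !== "") {
      out[key] = value;
    }
  }
  return out;
}

// Malformed JSON yields an empty map, so resolution falls through to disk.
function parseEnvAliases(envValue: string | undefined): AliasMap {
  if (!envValue) return {};
  try {
    return coerceAliasMap(JSON.parse(envValue));
  } catch {
    return {};
  }
}

function resolveAliasSketch(
  modelId: string,
  envValue: string | undefined, // e.g. the value of OD_MEDIA_MODEL_ALIASES
  stored: unknown, // e.g. the `aliases` field of media-config.json
): string {
  // Later spread wins, so env entries override stored ones.
  const effective: AliasMap = { ...coerceAliasMap(stored), ...parseEnvAliases(envValue) };
  const hit = Object.prototype.hasOwnProperty.call(effective, modelId)
    ? effective[modelId]
    : undefined;
  return hit ?? modelId; // pass through when no alias matches
}
```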

Tests
- 8 new cases in tests/media-config.test.ts (suite goes 14 -> 22):
  pass-through, stored map, env map, env-over-stored precedence,
  malformed-env fall-through, coercion of bad entries (null /
  number / nested object / empty string / blank key), readAliasMap
  source attribution, and a writeConfig regression that pins
  alias preservation on a Settings-style provider PUT.

Validated
- pnpm guard clean
- pnpm --filter @open-design/daemon typecheck clean (both
  tsconfig.json and tsconfig.tests.json)
- Media test suite (media-config + media-tasks-routes +
  media-tasks-persistence + media-nanobanana): 33/33

Pre-existing daemon test failures on Windows (symlinks, CODEX_BIN
runtime resolution, MCP config, skills, server-paths) are
unrelated to this change and reproduce on a clean main checkout.

* fix(daemon): preserve catalog id for capability branches, surface aliases via /api/media/config (PR #1309 review)

Lefarcen + codex P2 on PR #1309: the alias swap overwrote
`ctx.model` globally, which silently disabled every renderer
branch that keys behaviour off the catalog id. A user aliasing
`dall-e-3 -> azure-dalle3-deployment` would have the wire name
swapped correctly but `body.response_format = 'b64_json'` and
`body.quality = 'hd'` would no longer be set, because the
`ctx.model.startsWith('dall-e-')` / `ctx.model === 'dall-e-3'`
checks now saw the alias. The same regression hit the
gpt-image-* size selection, the gpt-4o-mini-tts instructions
branch, and the openaiSizeFor() sizing function.

MediaContext now carries both fields:
- `model` — the registered catalog id (`dall-e-3`,
  `gpt-4o-mini-tts`, `doubao-seedream-3-0-t2i-250415`). All
  model-family capability branches read from here.
- `wireModel` — the post-alias wire name. Every `body.model = `,
  every URL template, and every `providerNote` string reads from
  here so the user sees what was actually sent and the provider
  gets the alias.
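
A minimal sketch of the split, assuming a trimmed-down context type.
`MediaContextSketch` and `buildOpenAiImageBody` are hypothetical names,
and which check gates which field is an assumption read off the
paragraph above.

```typescript
// Hypothetical sketch; the real MediaContext carries more fields.
interface MediaContextSketch {
  model: string; // registered catalog id, drives capability branches
  wireModel: string; // post-alias name, sent to the provider
  prompt: string;
}

function buildOpenAiImageBody(ctx: MediaContextSketch): Record<string, unknown> {
  // The provider receives the alias.
  const body: Record<string, unknown> = { model: ctx.wireModel, prompt: ctx.prompt };
  // Capability branches keep reading the catalog id, so an alias cannot disable them.
  if (ctx.model.startsWith("dall-e-")) body["response_format"] = "b64_json";
  if (ctx.model === "dall-e-3") body["quality"] = "hd";
  return body;
}
```

With `model: 'dall-e-3'` and `wireModel: 'azure-dalle3-deployment'`,
the body carries the alias as `model` while still setting both
capability fields.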

Renderers updated: openai image (body.model + providerNote +
openaiSizeFor keeps catalog), openai speech (body.model +
providerNote + gpt-4o-mini-tts instructions keeps catalog),
volcengine video (body + note), volcengine image (body + note +
openaiSizeFor keeps catalog), grok image (body + note), grok video
(body + note), nanobanana (`credentials.model || ctx.wireModel ||
default` chain), minimax TTS, fishaudio TTS. The MINIMAX/FISHAUDIO
hardcoded maps now sit BEHIND the user alias: explicit user alias
wins over the project's legacy rebranding table, then the table
wins over the catalog id fallback. Stub-fallback diagnostics (the
SVG placeholder + stub providerNote string) keep the catalog id
since those are debug surfaces, not provider calls.

Lefarcen P3: the PR description claimed readAliasMap was the
daemon-public API, but the /api/media/config route returned only
readMaskedConfig (which had no aliases field). readMaskedConfig
now returns `{ providers, aliases: { effective, env, stored } }`
so the future Settings UI PR can consume the source-attributed
map directly. The `aliases` field is always present (empty maps
when nothing is configured) so the UI has a stable shape to read.

Tests
- New `media-alias-capability.test.ts` (2 jsdom cases) drives
  generateMedia end-to-end with a stubbed fetch and asserts on
  the request body. Pins the regression: aliased dall-e-3 still
  sends `response_format: 'b64_json'` + `quality: 'hd'`; aliased
  gpt-4o-mini-tts still attaches the instructions field from the
  voice prop.
- `media-config.test.ts` grows by 2 cases (suite goes 22 -> 24):
  readMaskedConfig surfaces the alias map (both env and stored
  sources), and the empty-state shape for fresh installs.

Validated
- pnpm guard clean
- pnpm --filter @open-design/daemon typecheck clean (both
  tsconfig.json and tsconfig.tests.json)
- Media test suite (config + alias-capability + nanobanana +
  tasks-persistence + tasks-routes): 37/37

---------

Co-authored-by: Nagendhra <nagendhra405@gmail.com>
2026-05-14 14:58:39 +08:00
shangxinyu1
2976c76fc3
test: expand Memory and Routines coverage (#1521)
* test: expand settings and packaged coverage

* test: extend memory settings coverage

* test: cover routine settings failure states

* test: cover routine operation failures

* test: fix daemon test typing on CI

* test: decouple packaged smoke from orbit bug

* test: avoid live memory LLM calls in route tests

* test: fix daemon fetch typing in CI

* fix: restore preview comment and inspect toggles

* test: align manual edit flow with current inspector UX

* test: align comment attachment flow with current preview comments UI

* fix: probe resolved Codex launch path during detection

* fix: remove duplicate board activation helper after rebase

* test: update ghost cli detection mock

* test: align FileViewer toolbar expectation

* ci: move full app tests to extended lane

* ci: run app tests by changed scope

* ci: cover shared app inputs in test scopes

* ci: avoid setup-node cache in windows packaged smoke

* test: align extended settings and manual edit flows
2026-05-14 14:48:40 +08:00
Yuhao Chen
c942d99b14
fix(orbit): avoid sample identity leakage (#1608)
2026-05-14 14:48:29 +08:00
Nagendhra Madishetti
e508fa3fbd
test(e2e): Critique Theater Phase 11 activation (un-fixme suite, seeded-project nav, split SSE fixture) (#1483)
2026-05-14 14:27:39 +08:00
Nagendhra Madishetti
5cb0508790
fix(web): deep-link Routines history rows to their specific conversation (Fixes #1505) (#1508)
2026-05-14 14:27:34 +08:00
PerishFire
59ed000903
Fix Windows resource cache for Orbit templates (#1554)
2026-05-14 14:27:29 +08:00
lakatos
51d1c4e287
ci: skip upstream-only workflows on forks (#1586)
2026-05-14 14:27:23 +08:00
Priyanshu Kayarkar
8101e430cf
fix(ui) : radio button issue (#1599)
2026-05-14 14:27:18 +08:00
soulme
2a8ebff11a
feat(web): add collapsible comment side panel (#1607)
2026-05-14 14:27:09 +08:00
github-actions[bot]
a7bebd926f
Auto-generated metrics for run #25835844101 (#1615)
2026-05-14 14:27:03 +08:00
Nicholas-Xiong
63b90685da
fix: Improve project card metadata truncation with min-width: 0 (#1629)
2026-05-14 14:26:57 +08:00
Nicholas-Xiong
743669e01d
fix: Add dropdown chevron to routines project select field (#1630)
2026-05-14 14:26:50 +08:00
sukumarp2022
8b3e22850a
fix: replace Microsoft Copilot logo with GitHub Copilot logo (#1648)
2026-05-14 14:26:45 +08:00
Siri-Ray
d2738924fb
fix(web): freeze completed run durations across conversations (#1351)
* fix(web): freeze completed run durations across conversations

* fix(web): finalize stopped API runs

Generated-By: looper 0.6.0 (runner=fixer, agent=codex)

* fix(daemon): optimize conversation latest run lookup

Generated-By: looper 0.6.0 (runner=fixer, agent=codex)

* fix(web): scope streaming cleanup to conversation

Generated-By: looper 0.6.0 (runner=fixer, agent=codex)

* fix(web): capture streaming conversation cleanup

Generated-By: looper 0.6.0 (runner=fixer, agent=codex)

* fix(web): guard stale run ref cleanup

Generated-By: looper 0.6.0 (runner=fixer, agent=codex)
2026-05-14 14:25:37 +08:00
Marc Chan
055e55abd8
Add batch design system testing (#1515)
* feat: add batch design system testing

* fix: use daemon default agent for batch tests

* fix: honor batch project prompt flags

Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode)

* fix: persist batch run output

* fix: honor dry-run before daemon resolution

Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode)

* fix: persist batch assistant run ids

Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode)

* fix: cancel timed-out batch runs

Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode)
2026-05-14 14:19:32 +08:00
chaoxiaoche
e57e028222
feat(daemon): make design-system token channel default-on (PR-D) (#1544)
* feat(daemon): make design-system token channel default-on (PR-D)

Flip `OD_DESIGN_TOKEN_CHANNEL` from default-off to default-on. Every
chat that picks a brand with `tokens.css` + `components.html` siblings
(today: `default`, `kami`) now gets the structured token contract
appended to the system prompt automatically. `OD_DESIGN_TOKEN_CHANNEL=0`
keeps the DESIGN.md-only path as a kill switch.

Adds `scripts/check-design-system-flag-parity.ts`, registered in
`pnpm guard`. The guard walks every brand and asserts:

- 147 prose-only brands produce byte-identical prompts under flag-off
  vs flag-on (PR-D's "no-op for legacy brands" promise)
- 2 structured brands diverge as expected (catches a future regression
  that silently dropped the structured blocks)

Smoke evidence on #1385 (PR-C):
- `default` — 10/10 brand tokens used byte-for-byte in treatment vs
  0/10 invented colors in control
- `kami` — treatment recovers brand name (`Kami · 纸`), the two-tier
  surface (`--bg` parchment + `--surface` ivory), the CN font stack
  override, and the `components.html` card pattern; control invented
  "Replica" as a brand name

Co-authored-by: Cursor <cursoragent@cursor.com>

* review: address @nettee + @lefarcen feedback on parity guard

Two blocking findings from #1544 review:

1. @nettee — guard's inventory walk silently passed on unreadable
   filesystem state. `fileExists` swallowed every `stat` error and the
   bare `readdir` catch returned `[]` for any failure. A renamed
   `design-systems/` tree, a permission-denied DESIGN.md, or a
   directory at the brand path would have left `pnpm guard` happy
   after checking 0 brands — exactly the silent misconfiguration this
   guard exists to catch. Both error paths now treat only ENOENT /
   ENOTDIR as absence and rethrow everything else, mirroring the
   `readFileOptional` fix already applied to PR-C's
   `apps/daemon/src/design-systems.ts`.

2. @nettee — guard exercised `composeSystemPrompt` directly, bypassing
   the `process.env.OD_DESIGN_TOKEN_CHANNEL !== '0'` gate in server.ts
   that PR-D actually flipped. A regression that restored `=== '1'`,
   typo'd the env name, or stopped reading assets when the var is
   unset would still leave the guard green. Extracted the predicate
   into `isDesignTokenChannelEnabled(env)` next to
   `readDesignSystemAssets` and added 6 unit tests pinning every value
   that matters: unset / `'1'` / `'true'` / empty / `'0'` /
   whitespace-padded. server.ts now calls the predicate. Any
   regression on the env-flag semantics fails
   `tests/design-system-assets.test.ts` independently of the
   composer-level coverage.
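
The first finding's fix (treat only ENOENT / ENOTDIR as absence, rethrow
the rest) can be sketched as below. `isAbsenceError` is a hypothetical
name, and `stat` is injected so the sketch needs no imports; in the
guard script it would be a real filesystem call.

```typescript
// Hypothetical sketch of the "absence vs real failure" split.
function isAbsenceError(err: unknown): boolean {
  if (err === null || typeof err !== "object") return false;
  const code = (err as { code?: unknown }).code;
  return code === "ENOENT" || code === "ENOTDIR";
}

// `stat` stands in for a filesystem stat call that throws on failure.
function fileExists(path: string, stat: (p: string) => unknown): boolean {
  try {
    stat(path);
    return true;
  } catch (err) {
    if (isAbsenceError(err)) return false;
    throw err; // EACCES, EIO, and friends must fail the guard loudly
  }
}
```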

Verified: pnpm guard (13/13), tsc -p scripts/tsconfig.json (clean),
@open-design/daemon typecheck (clean), 32/32 prompt + asset tests.

Co-authored-by: Cursor <cursoragent@cursor.com>

* review: pin server-layer asset resolution end-to-end (lefarcen P2)

Round-2 review feedback from @lefarcen on #1544: the predicate suite
in tests/design-system-assets.test.ts pinned the env-flag boolean but
did NOT exercise the server prompt-assembly path that PR-D actually
flipped — the seam where the daemon decides whether to read tokens.css
/ components.html from disk and hand them to composeSystemPrompt. A
regression that, say, restored an inline `=== '1'` gate or stopped
calling isDesignTokenChannelEnabled() from server.ts would still leave
the predicate test green.

Extracted that whole seam into `resolveDesignSystemAssets(id,
builtInRoot, userInstalledRoot, env)` on apps/daemon/src/design-systems.ts.
The function combines:
  1. the env-flag gate (kill switch on `OD_DESIGN_TOKEN_CHANNEL=0`)
  2. the built-in → user-installed root fallback chain (per-file)
  3. the DesignSystemAssets result shape consumed by composeSystemPrompt

server.ts at the prompt-assembly site is now a thin caller of this
function. The previous 13-line inline block (env check + per-file
fallback) collapses to one call, so the whole asset-resolution path
now has a single testable seam.

7 new tests in tests/design-system-assets.test.ts run the full pipeline
end-to-end against real disk fixtures:
  - env unset (default-on): returns built-in assets
  - env=`'0'` (kill switch): returns undefined even with files on disk
  - env=`'1'` (legacy opt-in): still works
  - mixed builtin/user-installed: per-file fallback merges correctly
  - both halves built-in: skips user-installed roundtrip verbatim
  - prose-only brand (no files): undefined / undefined
  - nonexistent brand directory: undefined / undefined

Verified: pnpm guard (13/13), tsc -p scripts/tsconfig.json (clean),
@open-design/daemon typecheck (clean), 39/39 prompt + asset tests
(was 32; +7 new server-layer-resolution tests).

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(test): add missing projectKind to FileViewer deck preview test

The deck preview test added in #1556 (086be271) renders <FileViewer/>
without `projectKind`, which became a required prop in #1509. CI on
main is currently red on this; pick up the trivial fix here so PR-D
can land cleanly.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: chaoxiaoche <chaoxiaoche@chaoxiaochedeMacBook-Pro.local>
Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-14 14:14:19 +08:00
nettee
8cb9cdb593
Refactor agent runtime stream handling behind adapter (#1622)
2026-05-14 14:12:24 +08:00
sukumarp2022
852a005b32
feat(web): add export as image screenshot to share menu (#1569)
Add an option to export the current preview viewport as a PNG image.

- Add requestPreviewSnapshot() utility in exports.ts (reuses the existing
  srcdoc snapshot bridge via postMessage)
- Add exportAsImage() and dataUrlToBlob() helpers for Blob download
- Add Export as image menu item in the HTML viewer share menu, gated
  behind srcdoc mode (bridge only present in srcdoc, not URL-load mode)
- Refactor PreviewDrawOverlay to delegate to the shared
  requestPreviewSnapshot() instead of duplicating the snapshot logic
- Add fileViewer.exportImage i18n key across all 19 locale files
- Add 7 unit tests covering snapshot request, timeout, error handling,
  and download filename sanitization

Fixes #1500
2026-05-14 11:07:28 +08:00
Nagendhra Madishetti
ff569fa50c
feat(daemon): Critique Theater Phase 16 (M-phase rollout ratchet + /api/critique/conformance) (#1499)
* feat(web): pure reducer for Critique Theater states (Phase 7.1)

Pure CritiqueState reducer driven by the contracts-level PanelEvent
(the same shape both the live SSE stream and the recorded transcript
emit), so a single reducer powers both the in-flight panel and the
rerun replay. Lifecycle covers run_started → running → (shipped /
degraded / interrupted / failed), with panelist_open / dim /
must_fix / close / round_end events building per-round
CritiquePanelistView entries as they arrive.

Defensive behaviour that surfaced while writing the spec tests:
- Terminal phases (shipped / degraded / interrupted / failed) are
  sticky against further lifecycle events for the same run, except
  for parser_warning which can land late and is recorded in a side
  channel without changing phase.
- A new run_started for a different runId at any time discards the
  prior state and reboots, so the UI can launch consecutive runs
  without an explicit reset action.
- Events whose runId does not match the active run return the same
  state reference, so React's useReducer doesn't re-render
  subscribers on stray traffic.
- Round bookkeeping keys by round number rather than "always last",
  so an out-of-order panelist_dim for round 1 arriving after a
  round 2 dim does not corrupt the round 2 bucket.

Test coverage: 18 cases covering each transition, the runId guard,
sticky-terminal behaviour, the out-of-order round invariant, and
the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire
SSE + replay into the same reducer.
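
A heavily reduced sketch of the defensive rules listed above, covering
only the runId guard, sticky terminals, and the parser_warning side
channel. The types and `reduceTheater` are hypothetical simplifications
of the real CritiqueState reducer, which also tracks rounds and
panelists.

```typescript
// Hypothetical, simplified reducer; not the real CritiqueState shape.
type Phase = "idle" | "running" | "shipped" | "interrupted";

interface TheaterState {
  phase: Phase;
  runId: string | null;
  warnings: number; // side channel for late parser_warning events
}

type TheaterEvent =
  | { type: "run_started"; runId: string }
  | { type: "ship"; runId: string }
  | { type: "interrupted"; runId: string }
  | { type: "parser_warning"; runId: string };

const IDLE: TheaterState = { phase: "idle", runId: null, warnings: 0 };

function reduceTheater(state: TheaterState, event: TheaterEvent): TheaterState {
  // A run_started for a different runId reboots the state at any time.
  if (event.type === "run_started" && event.runId !== state.runId) {
    return { phase: "running", runId: event.runId, warnings: 0 };
  }
  // Stray traffic for another run: same reference, so no re-render.
  if (event.runId !== state.runId) return state;
  // parser_warning may land late; record it without changing phase.
  if (event.type === "parser_warning") {
    return { ...state, warnings: state.warnings + 1 };
  }
  // Terminal phases are sticky against further lifecycle events.
  if (state.phase !== "running") return state;
  if (event.type === "ship") return { ...state, phase: "shipped" };
  if (event.type === "interrupted") return { ...state, phase: "interrupted" };
  return state; // duplicate run_started for the active run
}
```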

* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2)

createCritiqueEventsConnection is a pure connection manager that
mirrors apps/web/src/providers/project-events.ts: opens an
EventSource at /api/projects/:id/events, listens for every name in
CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent
(stripping the critique. prefix and merging the data payload), and
hands it to the caller's onEvent. Reconnect uses exponential
backoff (1s → 30s) and resets on `ready`; malformed payloads drop
with a dev-mode warning rather than tearing the stream.
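
The reconnect schedule can be sketched as a small counter. The 1s
floor and 30s cap come from the description above; the doubling factor
and the `ReconnectBackoff` name are assumptions.

```typescript
// Hypothetical sketch; the doubling factor is assumed, only 1s and 30s are from the log.
const INITIAL_DELAY_MS = 1000;
const MAX_DELAY_MS = 30000;

class ReconnectBackoff {
  private attempt = 0;

  // Delay to wait before the next reconnect attempt.
  nextDelay(): number {
    const delay = Math.min(INITIAL_DELAY_MS * 2 ** this.attempt, MAX_DELAY_MS);
    this.attempt += 1;
    return delay;
  }

  // Called when the stream reports `ready`, so the next failure starts at 1s again.
  reset(): void {
    this.attempt = 0;
  }
}
```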

useCritiqueStream wraps the manager in a useReducer that owns the
CritiqueState. enabled=false or a null projectId tears down the
connection cleanly; switching projectId closes the old connection
and opens a fresh one. The returned dispatch lets local UI
synthesise actions (e.g. an Esc keypress firing a synthetic
interrupted while a kill request is in flight); production traffic
comes from the SSE stream.

Test coverage:
- sse.test.ts (10 cases, node env): subscription set covers every
  CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire
  shape back to PanelEvent; malformed JSON is swallowed and does
  not stop the stream; exponential backoff schedule and ready-reset
  semantics are pinned with a setTimeout seam; close() cancels
  pending reconnects and shuts the live source; no-op fallback
  when EventSource is unavailable.
- useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event,
  reducer driven by synthetic actions, no connection when disabled
  or projectId is null, clean close on unmount, projectId change
  reopens cleanly.

* feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3)

Fetches the per-run NDJSON transcript (one PanelEvent per line),
parses every line via the shared isPanelEvent predicate, and
dispatches into the same CritiqueState reducer the live SSE stream
uses. A single reducer means the UI rendering a replay can be
identical to the live panel, and a UI mounting both
useCritiqueStream and useCritiqueReplay in parallel does not have
to reconcile two state shapes.

The `speed` knob is `paused | instant | live | { intervalMs: N }`.
- instant flushes every event synchronously, useful for opening a
  finished run already at its terminal state.
- intervalMs paces dispatches at a fixed cadence so the reviewer
  can watch the run unfold.
- paused parses the transcript but holds events back until the
  caller advances speed (consumers can drive a scrubber later).
- live is reserved for the future "playback at original cadence"
  feature, currently treated as instant; replay timestamps are not
  yet persisted with each event so honest pacing requires a
  follow-up Phase 7+ task.

A gunzip seam handles `.ndjson.gz` transcripts via
DecompressionStream when present; the production fetch path picks
between text and arrayBuffer based on the URL extension. Both seams
are injectable so the unit tests don't need to spin up a real
network or a real gzip pipeline.

Test coverage (8 cases, jsdom env):
- Idle status before any URL is provided.
- speed=instant flushes the full transcript synchronously to
  shipped state.
- speed={intervalMs:N} paces with the setTimeout seam, reaching
  done after the last tick.
- speed=paused leaves status=playing with no dispatches.
- Empty transcript reports done with state still idle.
- Fetch rejection surfaces an error status with the message.
- Malformed NDJSON lines are skipped; valid events around them
  still land.
- .gz transcripts route through the gunzip seam.

Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream +
replay), all on one branch ready for review. Phases 8+ (Theater
components) consume these from this PR.

* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review)

Two P1 fixes from lefarcen's review on PR #1307:

SSE payload override

`sseToPanelEvent` previously spread `data` after the channel-derived
`type`, so a payload-provided `type` could override the channel and
route a `critique.run_started` frame into the reducer as a `ship`
action. Reversed the spread so the channel-derived `type` is
authoritative, and revalidated the resulting object through the
contracts-level `isPanelEvent` predicate before returning. Frames
that fail validation (missing runId, empty runId, unknown type) are
dropped, so a malformed or compromised SSE frame can no longer
dispatch a wrong-shape action into the reducer.

Three new sse.test.ts cases pin the regression: hostile `type:'ship'`
in the payload still resolves to `run_started`, missing runId is
dropped, empty runId is dropped.
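
The spread-order fix can be illustrated with a reduced decoder.
`decodeCritiqueFrame` and `KNOWN_TYPES` are hypothetical; the real
`sseToPanelEvent` validates through `isPanelEvent` and the full
CRITIQUE_SSE_EVENT_NAMES list rather than the illustrative subset used
here.

```typescript
// Hypothetical sketch of channel-authoritative decoding; not the real sseToPanelEvent.
const CHANNEL_PREFIX = "critique.";

// Illustrative subset of event names only.
const KNOWN_TYPES = new Set([
  "run_started",
  "round_end",
  "ship",
  "degraded",
  "interrupted",
  "failed",
  "parser_warning",
]);

function decodeCritiqueFrame(
  channel: string,
  rawData: string,
): Record<string, unknown> | null {
  if (!channel.startsWith(CHANNEL_PREFIX)) return null;
  const type = channel.slice(CHANNEL_PREFIX.length);
  if (!KNOWN_TYPES.has(type)) return null;

  let data: unknown;
  try {
    data = JSON.parse(rawData);
  } catch {
    return null; // malformed payloads are dropped, not thrown
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) return null;

  // Payload first, channel-derived type last, so the channel always wins.
  const event: Record<string, unknown> = { ...(data as Record<string, unknown>), type };
  const runId = event["runId"];
  if (typeof runId !== "string" || runId === "") return null;
  return event;
}
```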

Replay pause/resume

`useCritiqueReplay` had one big effect keyed on `transcriptUrl`
only, so flipping `speed` from `paused` to `instant` never re-fired
and the held events sat undispatched. Split into a parse effect
(depends on URL, fetches and stores events in state) and a pace
effect (depends on parsed-events + speed, owns the cursor + timers).
The playback cursor lives in a ref that survives pause/resume
cycles, so flipping `paused` -> `instant` flushes from the current
position rather than restarting (which would double-dispatch
`run_started` and reset the reducer).

Two new useCritiqueReplay.test.tsx cases:
- paused-then-instant transitions from `playing` to `done` and
  reaches the shipped terminal phase
- intervalMs paced playback dispatches one event, pauses to drain
  the next scheduled timer, flips to instant, and confirms the
  remaining transcript drains exactly once (cursor was preserved)

Doc consistency

The earlier source comment in useCritiqueReplay.ts claimed `live`
"paces by recorded timestamps" while the impl used zero-delay
timers and the PR body said it behaves like `instant`. Aligned to
reality: `live` currently behaves like `{ intervalMs: 0 }` (events
drain on successive zero-delay timer ticks via setTimeoutFn) because transcripts
do not yet carry per-event timestamps. Honest timestamp-driven
pacing is queued as a Phase 7+ follow-up.

Validated: pnpm guard, pnpm --filter @open-design/web typecheck,
Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite
96 files / 888 tests.

* feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread)

* feat(web): Theater PanelistLane component (Phase 8.1)

* feat(web): Theater ScoreTicker component (Phase 8.2)

* feat(web): Theater RoundDivider component (Phase 8.3)

* feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4)

* feat(web): Theater TheaterDegraded chip (Phase 8.5)

* feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6)

* feat(web): Theater TheaterTranscript replay surface (Phase 8.7)

* feat(web): Theater TheaterStage top-level container (Phase 8.8)

* feat(web): Theater CSS using existing semantic tokens (no hex literals)

* feat(web): Theater public exports barrel

* fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314)

Addresses all 5 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen.

State-lifecycle fixes (3 x P2)
1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`).
   Host hooks dispatch it when their gating prop changes so a stale
   run from a prior project / transcript cannot bleed into the next
   context. Reset is idempotent on idle (returns the same reference).
2. `useCritiqueStream` dispatches `__reset__` at the top of its
   connection effect, so a workspace switch from project A (which
   streamed a critique) to project B clears the reducer before the
   new EventSource opens. enabled=false also clears.
3. `useCritiqueReplay` dispatches `__reset__` at the top of its
   parse effect, so transcriptUrl swaps (including swap-to-null after
   a replay reached `shipped`) lift the reducer back to idle before
   the new fetch starts.
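
A reduced sketch of the reset contract, assuming a much smaller state
shape than the real reducer. `TheaterState`, `initialState`, and
`reduce` are hypothetical; only the `__reset__` action name and the
"same reference on idle" rule come from this commit.

```typescript
// Simplified state: the real reducer tracks rounds, panelists, etc.
type Phase = "idle" | "running" | "shipped";

interface TheaterState {
  phase: Phase;
  runId: string | null;
}

type TheaterAction =
  | { type: "__reset__" }
  | { type: "run_started"; runId: string }
  | { type: "ship" };

const initialState: TheaterState = { phase: "idle", runId: null };

function reduce(state: TheaterState, action: TheaterAction): TheaterState {
  switch (action.type) {
    case "__reset__":
      // Idempotent on idle: hand back the same reference so React
      // consumers do not re-render for a no-op reset.
      return state.phase === "idle" ? state : initialState;
    case "run_started":
      return { phase: "running", runId: action.runId };
    case "ship":
      return state.phase === "running" ? { ...state, phase: "shipped" } : state;
  }
}
```

A host hook would dispatch `{ type: "__reset__" }` at the top of its
effect whenever its gating prop (projectId, transcriptUrl, enabled)
changes.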

SSE validation (1 x P2)
4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape`
   check after the cheap `isPanelEvent` predicate. A
   `critique.ship` frame missing `composite` / `round` / `status` /
   `artifactRef` is rejected before reaching the reducer, so
   TheaterCollapsed can no longer crash on `undefined.toFixed(1)`.
   Every variant's required fields are validated: run_started
   (protocolVersion, non-empty cast, maxRounds, threshold, scale),
   panelist_* (round, role, plus variant-specific shape), round_end
   (round, composite, mustFix, decision in {continue,ship}, reason),
   ship (round, composite, status, artifactRef.{projectId,artifactId},
   summary), degraded (reason, adapter), interrupted (bestRound,
   composite), failed (cause), parser_warning (kind, position).
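
For illustration, the ship variant's check could look roughly like
this. The field names are the ones listed above; the concrete types
(numeric round / composite, string status / summary / ids) and the
helper names are assumptions, not the shipped implementation.

```typescript
// Hypothetical per-variant guard for a `critique.ship` payload.
type UnknownRecord = Record<string, unknown>;

function isRecord(value: unknown): value is UnknownRecord {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isFiniteNumber(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value);
}

function hasValidShipShape(payload: unknown): boolean {
  if (!isRecord(payload)) return false;
  const ref = payload["artifactRef"];
  return (
    isFiniteNumber(payload["round"]) &&
    // A numeric composite is what keeps `.toFixed(1)` safe downstream.
    isFiniteNumber(payload["composite"]) &&
    typeof payload["status"] === "string" &&
    typeof payload["summary"] === "string" &&
    isRecord(ref) &&
    typeof ref["projectId"] === "string" &&
    typeof ref["artifactId"] === "string"
  );
}
```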

Reducer correctness (1 x P2)
5. `panelist_open` now materializes the round + an empty panelist
   view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight
   the in-progress lane the instant the tag opens. Before this, a
   stream that emitted only `panelist_open` after `run_started` left
   `rounds = []` and the UI rendered no current round until a later
   `panelist_dim` arrived.

Polish (3 x P3)
6. Brand role tint swaps from `var(--magenta, var(--accent))` to
   `var(--purple, var(--accent))`. `--purple` is actually defined
   across the design systems; `--magenta` is not, so Brand was
   silently falling through to `--accent` and looking identical to
   Designer.
7. New i18n key `critiqueTheater.interruptedSummary` for the
   interrupted-collapse copy ("Interrupted at round N, best
   composite X.X"). Previously the interrupted branch reused
   `shippedSummary` and the UI read "Shipped at round..." for a run
   that specifically did not ship. Native value in en + zh-CN; other
   locales fall back via `...en` spread.
8. `TheaterDegraded` heading id comes from `useId()` instead of a
   hardcoded `theater-degraded-heading`, so two chips rendered on
   the same page (chat history with multiple completed runs) keep
   their aria-labelledby references unambiguous.

Tests (15 new cases)
- reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data.
- sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship.
- useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false.
- useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped.
- TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...".
- TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new)
- tests/i18n/locales.test.ts 5 of 5 across 18 locales

* feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1)

* feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2)

* fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315)

Addresses every blocker from codex, Siri-Ray, and lefarcen. The
three state-lifecycle and SSE-validation issues they also flagged
inherit fixes from PR #1314's review pass that this branch now sits
on top of after rebase.

Real daemon kill on Interrupt (P1)
- CritiqueTheaterMount now POSTs to
  /api/projects/:id/critique/:runId/interrupt alongside the
  optimistic local dispatch. Before this fix, clicking Interrupt
  only flipped the React state to interrupted while the daemon job
  kept running. The fetch is best-effort: a 404 (endpoint not wired
  yet, lands in Phase 15) is swallowed with a dev-mode console.warn
  so the UI still moves to the collapsed badge.
- New fetchInterrupt test seam lets RTL assert on the URL / method
  and simulate the "daemon not ready yet" path. Two tests pin both:
  the happy URL proj-42/critique/run-abc/interrupt POSTs, and a
  rejected fetch still flips the UI.

interruptPending reset on new run (P2)
- A ref-backed effect compares the current runId against the last
  one we saw; when it changes, interruptPending is cleared. A user
  who interrupts run-1 and then triggers run-2 from the same mount
  now gets a fresh, enabled kill button instead of one stuck in
  "Interrupting…". Pinned by a new mount test.

Escape keybind scope (P2)
- InterruptButton now checks the keydown target. Escape inside an
  input, textarea, select, or contenteditable element is ignored
  (and a target nested inside one of those, found via closest(), is
  treated the same way). Body-level focus still fires the keybind so the Theater
  area's affordance keeps working. Four new tests cover textarea,
  input, contenteditable, and the body-focus positive case.

userFacingName i18n key (P2)
- The spec at specs/current/critique-theater.md:6 mandates a single
  critiqueTheater.userFacingName key so the "Design Jury" label can
  be renamed without touching code. Phase 8 introduced
  critiqueTheater.title by mistake; renamed across types.ts, en.ts,
  zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer
  TheaterStage.tsx. The locale alignment test stays green.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 14 files, 112 tests (was 101 before, +11 new for
  the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope;
  the rest were already in #1314's review fix).
- tests/i18n/locales.test.ts 5 of 5 across 18 locales.

* feat(daemon): adapter-degraded registry with TTL (Phase 10.1)

In-memory registry recording adapters that produced malformed or
oversize transcripts so the orchestrator can skip them for a TTL
window (default 24h) instead of cycling through known-bad providers
on every run.

Records carry reason (malformed_block | oversize_block |
missing_artifact), source label, and expiresAt. The test-only
clock seam lets the suite advance time deterministically and prove
that an expired entry stops counting as degraded without anyone
calling clearDegraded.
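
A minimal sketch of such a registry with an injectable clock.
`createDegradedRegistry`, `markDegraded`, and `isDegraded` are
hypothetical names; the reasons, the 24h default, and clearDegraded
are taken from the description above.

```typescript
type DegradedReason = "malformed_block" | "oversize_block" | "missing_artifact";

interface DegradedRecord {
  reason: DegradedReason;
  source: string;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000;

// `now` is the test-only clock seam; production would use Date.now.
function createDegradedRegistry(
  now: () => number = () => Date.now(),
  ttlMs: number = DEFAULT_TTL_MS,
) {
  const records = new Map<string, DegradedRecord>();
  return {
    markDegraded(adapterId: string, reason: DegradedReason, source: string): void {
      records.set(adapterId, { reason, source, expiresAt: now() + ttlMs });
    },
    // Expiry is checked lazily on read, so nobody has to call clearDegraded.
    isDegraded(adapterId: string): boolean {
      const record = records.get(adapterId);
      return record !== undefined && now() < record.expiresAt;
    },
    clearDegraded(adapterId: string): void {
      records.delete(adapterId);
    },
  };
}
```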

7/7 vitest cases green.

* feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2)

Two test-only adapters that read the existing v1 transcript
fixtures (happy-3-rounds and malformed-unbalanced) and replay them
as either a full string or a 512-byte chunked stream. The chunked
form is what the conformance harness uses to prove the parser
holds together when the transcript arrives in arbitrary network
slices, not as one buffered blob.
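
The chunked form can be approximated as below. This is a sketch under
a simplifying assumption: it slices by UTF-16 code units, which equals
bytes only for ASCII-only fixtures. `chunkTranscript` is a
hypothetical helper name.

```typescript
// Split a transcript into fixed-size slices to mimic network chunks.
// Assumption: sizes are counted in string code units, not real bytes.
function chunkTranscript(transcript: string, chunkSize = 512): string[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const chunks: string[] = [];
  for (let offset = 0; offset < transcript.length; offset += chunkSize) {
    chunks.push(transcript.slice(offset, offset + chunkSize));
  }
  return chunks;
}
```

Feeding the chunks to the parser one at a time and comparing against
the single-string result is what shows the parser does not depend on
where the slice boundaries fall.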

* feat(daemon): adapter conformance harness (Phase 10.3)

runAdapterConformance pulls a transcript through the same
parseCritiqueStream pipeline the orchestrator uses and classifies
the outcome as shipped, degraded, or failed. On a degraded
outcome it forwards the matched reason to the adapter-degraded
registry, so a single nightly conformance run is what populates
the skip list rather than the orchestrator learning each adapter
is broken at request time.

5/5 vitest cases green covering shipped, malformed degraded,
oversize degraded, no-ship failure, and the harness-thrown
failure path.

* test(e2e): Critique Theater Playwright suite (Phase 11)

Six tests, one viewport per visual case, deterministic SSE
fixtures stubbed via page.route(). Adds the suite to
test:ui:extended so the existing extended-UI lane picks it up.

Coverage:

  1. Happy path: a single mounted theater plays the full
     fixture (1 run_started, 5 panelists open / dim / must_fix /
     close, 1 round_end, 1 ship) and ends on the score badge.
  2. Interrupt mid-run: the panelist that is open at the time
     the interrupt button is clicked closes with an interrupted
     marker and the transcript freezes there.
  3. Visual regression at 375x720 mobile.
  4. Visual regression at 768x1024 tablet.
  5. Visual regression at 1280x800 desktop.
  6. A11y role tree: the theater region exposes a labelled
     landmark, each panelist lane is a group with an accessible
     name, the score is a status live region.

All SSE traffic is stubbed by page.route so the suite runs in CI
without a daemon. The toggle is seeded via localStorage by
bootAppWithCritiqueEnabled so the gate behaves as if Settings
flipped it on. typecheck clean; playwright --list reports 6.

* test(web): reducer p99 bench at 10k iterations (Phase 13.1)

Locks the documented 2ms budget for the Critique Theater reducer
on a representative SSE script (27 actions, one full happy run)
behind a regression gate. Asserts p99 stays under 4ms (2x the
documented budget) so CI runners with a noisy neighbour do not
flake while a real regression to 20ms or 200ms still trips.
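
The gate arithmetic, sketched with a nearest-rank percentile. The
helper names are hypothetical and the real bench may compute p99
differently; the 2ms budget and 2x headroom are the values above.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of
// all samples at or below it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) {
    throw new RangeError("percentile needs at least one sample");
  }
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1] as number;
}

// True when p99 stays under budget * headroom (2ms * 2 = 4ms here).
function p99WithinGate(
  samplesMs: readonly number[],
  budgetMs = 2,
  headroom = 2,
): boolean {
  return percentile(samplesMs, 99) < budgetMs * headroom;
}
```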

The bench is a vitest case rather than a bare microbenchmark so
it runs in the same CI lane as every other web test and does not
need a parallel runner.

* test(web): critique surface coverage walker (Phase 13.2)

Walks the public critique surface (11 SSE event names, 5 panelist
roles, 6 lifecycle phases, 9 named i18n keys) and asserts each
named symbol appears in both the src corpus and the test corpus.
The walker is the gate that catches a rename in one half of the
codebase without a matching update in the other half: a future
PR that drops 'panelist_must_fix' from the reducer without also
removing its test reference fails this suite.

62 assertions, one per symbol per corpus.

* docs: Critique Theater user guide (Phase 14.1)

Seven sections aimed at end users (not contributors):

  1. What is Design Jury
  2. How it works (the five panelists, auto-converging rounds,
     the composite formula)
  3. Settings (the M1 toggle and what it does)
  4. Reading the score badge
  5. Replay surface
  6. Troubleshooting (degraded, interrupted, failed)
  7. FAQ

The composite formula is documented as
    designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2
because anyone trying to reverse-engineer the score is going to
search for those weights and the docs are the place they should
land first.
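
As a worked sketch of that formula (the weights are the documented
ones; `composite`, `WEIGHTS`, and the `Role` type are illustrative
names, and any rounding the real scorer applies is not modelled):

```typescript
type Role = "designer" | "critic" | "brand" | "a11y" | "copy";

// Weights as documented in the user guide; they sum to 1.0.
const WEIGHTS: Record<Role, number> = {
  designer: 0,
  critic: 0.4,
  brand: 0.2,
  a11y: 0.2,
  copy: 0.2,
};

const ROLES: readonly Role[] = ["designer", "critic", "brand", "a11y", "copy"];

// Weighted sum of the per-panelist scores.
function composite(scores: Record<Role, number>): number {
  return ROLES.reduce((sum, role) => sum + WEIGHTS[role] * scores[role], 0);
}
```

Because the designer weight is 0, the designer's own score does not
move the composite.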

* docs(daemon): critique module AGENTS map (Phase 14.2)

Daemon-side wayfinder for the apps/daemon/src/critique directory.
Tables every file, what owns what invariant, and the 'when you
change anything here' guide so a future contributor does not
have to reverse-engineer the rollout resolver before adding a
new SSE event.

* docs(web): Theater module AGENTS map (Phase 14.3)

Web-side mirror of the daemon AGENTS map. Same file table, same
invariants section, same change-impact guide, sized to the
Theater component package.

* feat(daemon): rollout flag resolver (Phase 15.1)

Single decision point every caller consults to know whether the
orchestrator should wire the critique pipeline for a given run.
Priority:

  1. Skill-level policy (required forces on, opt-out forces off)
  2. Per-project override from the Settings toggle
  3. OD_CRITIQUE_ENABLED env override
  4. Rollout phase default
       M0 dark-launch      false
       M1 settings only    false (toggle is off until the user flips it)
       M2 per-skill        true if skill opted in
       M3 global default   true

OD_CRITIQUE_ROLLOUT_PHASE parser defaults to M0 on unknown input
so a fresh install never surprises a user with the feature on.
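
The priority order reads roughly as the sketch below. This is an
approximation of the decision table, not the shipped resolver:
`RolloutInputs` and `parseRolloutPhase` are hypothetical names, and
the whitespace / case tolerance in the parser is an assumption.

```typescript
type SkillPolicy = "required" | "opt-in" | "opt-out" | null;
type RolloutPhase = "M0" | "M1" | "M2" | "M3";

interface RolloutInputs {
  skillPolicy: SkillPolicy;
  projectOverride: boolean | null;
  envOverride: boolean | null;
  phase: RolloutPhase;
}

// Unknown or missing input falls back to M0 (feature off).
function parseRolloutPhase(raw: string | undefined): RolloutPhase {
  const token = (raw ?? "").trim().toUpperCase();
  if (token === "M1") return "M1";
  if (token === "M2") return "M2";
  if (token === "M3") return "M3";
  return "M0";
}

function isCritiqueEnabled(inputs: RolloutInputs): boolean {
  // 1. Skill-level policy is a hard veto in either direction.
  if (inputs.skillPolicy === "required") return true;
  if (inputs.skillPolicy === "opt-out") return false;
  // 2. Per-project override from the Settings toggle.
  if (inputs.projectOverride !== null) return inputs.projectOverride;
  // 3. Env override.
  if (inputs.envOverride !== null) return inputs.envOverride;
  // 4. Rollout phase default.
  switch (inputs.phase) {
    case "M0":
    case "M1":
      return false;
    case "M2":
      return inputs.skillPolicy === "opt-in";
    case "M3":
      return true;
  }
}
```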

10/10 vitest cases green covering every cell of the matrix.

* feat(web): Settings toggle hook for Critique Theater (Phase 15.2)

React hook that reads critiqueTheaterEnabled from the existing
open-design:config localStorage blob and stays in sync via:

  - the platform storage event (cross-tab)
  - an open-design:critique-theater-toggle CustomEvent (same-tab)

Same-tab event is the one that fires when the Settings panel saves
in the current window: the toggle and every mounted theater update
without a page reload.

setCritiqueTheaterEnabled(next) is the imperative setter the Settings
panel calls. It preserves the rest of the stored config (mode, apiKey,
etc.) and dispatches the same-tab event after the localStorage write.

The web hook reflects what the user toggled; the daemon-side
isCritiqueEnabled is the final routing authority (project override,
env, rollout phase). When they disagree, the daemon wins for backend
gating and the web reflects the toggle state.

6/6 vitest cases green covering first read, stored read, same-tab
event flip, config preservation, corrupted JSON tolerance, and
cross-tab storage event.

* test(web): Phase 15 toggle hook failure-mode coverage (PR #1320)

lefarcen P2 on PR #1320 flagged that the PR body claimed safe
behavior for disabled localStorage, non-object JSON, and missing
CustomEvent shim, but the suite only covered corrupt JSON plus
happy-path storage events. Added four failure-mode tests so the
swallowed errors are not silently traded for a throw in a future
refactor:

1. Returns false on a stored JSON value that parses to an array
   (non-object). Catches a regression where the guard treats
   anything truthy as a config blob.
2. Returns false on a stored JSON value of literal 'null'.
   typeof null === 'object' in JS, so the guard has to check null
   explicitly; this test pins that check.
3. Returns false when localStorage.getItem throws (private mode /
   disabled storage / SecurityError). The hook must swallow and
   return false so the rest of the app keeps rendering.
4. setCritiqueTheaterEnabled still dispatches the same-tab
   CustomEvent when localStorage.setItem throws (quota exceeded /
   disabled storage). The dispatch path is the in-session
   broadcast that keeps every mounted hook coherent even when
   persistence is unavailable; verified by mounting two probes
   and asserting both flip after the setter is called with a
   throwing setItem.
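
The read-side guards these tests pin can be sketched as follows.
`StorageLike` is a hypothetical seam standing in for
window.localStorage; the storage key and the critiqueTheaterEnabled
field are the ones named in this branch. Requiring a strict `true` is
an assumption of the sketch.

```typescript
const CONFIG_KEY = "open-design:config";

// Minimal slice of the Storage interface, injected for testability.
interface StorageLike {
  getItem(key: string): string | null;
}

function readToggle(storage: StorageLike): boolean {
  try {
    const raw = storage.getItem(CONFIG_KEY); // may throw in private mode
    if (raw === null) return false;
    const parsed: unknown = JSON.parse(raw); // may throw on corrupt JSON
    // typeof null === "object", and arrays are objects too, so both
    // need explicit rejection before treating the value as a config blob.
    if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
      return false;
    }
    return (parsed as Record<string, unknown>)["critiqueTheaterEnabled"] === true;
  } catch {
    // Swallow and report "off" so the rest of the app keeps rendering.
    return false;
  }
}
```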

10/10 vitest cases green (6 existing + 4 new).

* fix(web): honor CustomEvent payload in toggle hook listener (PR #1320)

Both Siri-Ray (blocking) and lefarcen (P2 new) caught the same
real bug in the failure-mode test I added in affcdd27: the test
asserts the in-session UI flips when localStorage.setItem throws,
but the CustomEvent listener was ignoring the event's typed
detail and just calling readToggle(). Under a throwing setItem
the localStorage value is stale (or absent), so the listener
would see the OLD value and the test would fail (or worse, the
production claim 'in-session event keeps mounts coherent' was
hollow).

Fixed the hook, not the test: the listener now reads
event.detail.enabled when it is a boolean, falling back to
readToggle() only for malformed events or for cross-tab storage
events (which do not carry a typed payload). The setter already
dispatched the detail; the listener just was not consuming it.
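
The listener's decision reduces to something like this sketch.
`resolveToggleFromEvent` is a hypothetical helper; the fallback
callback stands in for readToggle().

```typescript
// Prefer the typed payload; fall back to storage only when the event
// carries no usable boolean (malformed dispatcher or storage event).
function resolveToggleFromEvent(
  detail: unknown,
  fallback: () => boolean,
): boolean {
  if (typeof detail === "object" && detail !== null) {
    const enabled = (detail as Record<string, unknown>)["enabled"];
    if (typeof enabled === "boolean") return enabled;
  }
  return fallback();
}
```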

Test changes:

  - The existing 'setItem throws' test now asserts the right
    behavior for the right reason. Updated the inline comment to
    say the listener reads from detail, not localStorage.
  - New test 'falls back to readToggle when the CustomEvent
    carries no usable detail' pins the fallback path: a
    malformed dispatcher (no detail, or detail.enabled not a
    boolean) degrades cleanly instead of throwing or being
    silently ignored.

11 / 11 vitest cases green (10 prior + 1 new fallback).

* feat(daemon): route critique spawn-path eligibility through the rollout resolver

This is the wireup edit Phase 10 and Phase 15 carved out. Today server.ts gates
the critique pipeline on critiqueCfg.enabled, which is just the
OD_CRITIQUE_ENABLED env var. After this commit it gates on
isCritiqueEnabled(...) from the Phase 15 resolver, so the full
priority matrix is live:

  1. Per-skill od.critique.policy veto (opt-out / required)
  2. Per-project override (M1 Settings toggle, written through the
     existing Phase 6 settings endpoint)
  3. OD_CRITIQUE_ENABLED env override (power-user lane / CI fixtures)
  4. OD_CRITIQUE_ROLLOUT_PHASE default
       M0 dark-launch      false
       M1 settings only    false
       M2 per-skill        only when skillPolicy === 'opt-in'
       M3 global default   true

Default behaviour on a fresh install is unchanged: the resolver
returns false at M0 without an env override or a project override,
so prod traffic falls through to the legacy single-pass path
exactly the way it did before.

Inputs threaded today: phase from OD_CRITIQUE_ROLLOUT_PHASE,
envOverride from OD_CRITIQUE_ENABLED. skillPolicy and projectOverride
are passed as null for the v1 cutover; the daemon-side handler that
round-trips critiqueTheaterEnabled on the project settings row and
the od.critique.policy frontmatter resolver land as the next two
commits in this branch.

The three call sites that used critiqueCfg.enabled (the brand-thread
guard, the skill-thread guard, the top-line critiqueShouldRun
compound) now read from a single locally-scoped critiqueEnabledForRun
boolean, so the eligibility check is computed exactly once per spawn
and the prompt composer + orchestrator stay in lockstep the way
the existing comment already promised.

Tests still green: daemon vitest 22 / 22 across rollout +
conformance + adapter-degraded. Daemon typecheck clean.

* feat(web): mount CritiqueTheaterMount in ProjectView

The web counterpart of the daemon wireup. ProjectView now renders
<CritiqueTheaterMount projectId={project.id} enabled={...} /> as a
sibling of <AppChromeHeader> inside the top-level <div className="app">.

The mount is the drop-in from the Phase 9 stack: it owns the SSE
subscription, the kill-request handshake, and the phase-aware swap
from the live <TheaterStage> to the collapsed badge once a run
settles. The mount returns null until the daemon emits a
critique.run_started for the active project, so the visual surface
is byte-for-byte unchanged for users who have not opted in.

Enabled wiring: useCritiqueTheaterEnabled() reads the M1 Settings
toggle from the existing open-design:config localStorage blob and
stays in sync with both the platform storage event (cross-tab) and
the same-tab open-design:critique-theater-toggle CustomEvent the
Phase 15 setter dispatches. The hook honors the event payload
directly so a private-mode browser that cannot persist the toggle
still updates the in-session UI correctly.

The daemon-side gate (isCritiqueEnabled in apps/daemon/src/server.ts)
remains the authority for whether a run is actually wired through
the critique pipeline. This hook only governs whether the web layer
renders the resulting SSE stream when the daemon emits one. The
two-layer gate is intentional: an integrator embedding the Theater
in a custom UI can flip the web visibility independent of the
daemon's routing decision, and a daemon-side env override flips
backend gating without touching the web's localStorage.

Tests still green: web Theater suite 181 / 181 across 16 files.
Web typecheck clean.

* feat(daemon): resolve od.critique.policy frontmatter at the spawn site

The next step in the wireup branch's ladder: replace the placeholder
`skillPolicy: null` with the actual value parsed from the active
skill's SKILL.md frontmatter.

Three small edits, one new field on a public type:

1. SkillInfo gains a `critiquePolicy: SkillCritiquePolicy` field
   carrying the parsed `od.critique.policy` token (required /
   opt-in / opt-out / null). The field is null when the skill has
   no opinion, which lets the lower-priority resolver tiers
   (projectOverride, envOverride, phase default) decide.

2. listSkills() populates the new field via a small
   `normalizeCritiquePolicy` helper that tolerates the YAML
   scalar's casing and trims whitespace. Unknown tokens collapse
   to null so a typo in SKILL.md cannot accidentally force the
   panel on or off; it just falls through. Derived example cards
   inherit the parent's policy.

3. server.ts captures `skill.critiquePolicy` into a hoisted
   `skillCritiquePolicy` variable inside the existing skill-load
   block, then threads it into the isCritiqueEnabled call as the
   skillPolicy input. The hoisting keeps the variable in scope at
   the resolver call site without restructuring the spawn handler.
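
A sketch of what such a normalizer could look like. The function and
type names match the ones mentioned above, but the body is an
approximation written from the description, not a copy of the
helper.

```typescript
type SkillCritiquePolicy = "required" | "opt-in" | "opt-out" | null;

// Tolerates casing and surrounding whitespace; anything unrecognized
// (typos, non-strings, missing values) collapses to null.
function normalizeCritiquePolicy(raw: unknown): SkillCritiquePolicy {
  if (typeof raw !== "string") return null;
  const token = raw.trim().toLowerCase();
  if (token === "required") return "required";
  if (token === "opt-in") return "opt-in";
  if (token === "opt-out") return "opt-out";
  return null;
}
```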

After this commit, the priority matrix the rollout resolver was
designed for is live for its top tier. The previous commit wired
env + phase; this one wires skill. The projectOverride input
remains null pending the next commit that extends the Phase 6
settings endpoint.

Daemon vitest: 10 / 10 rollout cases pass against the new wiring.
Daemon typecheck: clean.

* feat(daemon): feed projectOverride into the rollout resolver from project metadata

Replaces the placeholder `projectOverride: null` in the spawn
handler with the actual value the Settings panel writes onto the
project's metadata blob: `critiqueTheaterEnabled?: boolean`.

The read is defensive at the boundary: the metadata object is
typed loosely (it round-trips through SQLite as a free-form JSON
blob), so the spawn handler narrows to `boolean` and falls
through to `null` for any other shape. A missing key, a malformed
value, or a project that has never visited Settings collapses to
`null`, which is exactly the resolver's "no opinion, fall
through to env / phase" signal.

The `critique` frontmatter slot also gets typed on the
SkillFrontmatter shape so the `od.critique.policy` chain the
previous commit introduced no longer needs a bracket-access
cast. Same pattern as the existing `craft`, `preview`, and
`design_system` nested-record slots.

After this commit, every tier of the rollout resolver's priority
matrix is wired:

  1. skillPolicy   (from SKILL.md od.critique.policy)
  2. projectOverride (from project metadata critiqueTheaterEnabled)
  3. envOverride   (from OD_CRITIQUE_ENABLED)
  4. rollout phase (from OD_CRITIQUE_ROLLOUT_PHASE)

The write path for projectOverride still flows through the
existing project-update handler the Settings panel already uses
to persist project metadata; no new endpoint is needed. The
Settings UI button that calls setCritiqueTheaterEnabled and
posts the new field is the next commit on this branch.

Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases
still green against the new wiring.

* fix(daemon): forward critique events to project sinks + align composer gate (PR #1338)

Two codex review items addressed in one commit since they share the
same root cause (resolver-enabled run hits a transport / prompt
contract that was still env-gated):

P1 (transport mismatch). The daemon emits critique.* SSE frames
through critiqueBus -> design.runs.emit, which fans out on
/api/runs/:runId/events. The web CritiqueTheaterMount subscribes to
/api/projects/:projectId/events (it's project-scoped, not run-
scoped, because the mount lives at the project workspace and
follows the user across runs). Result: in production the mount
never sees a real frame and the e2e tests' stubbed routes hide the
mismatch.

Fixed by extending critiqueBus.emit to fan out to BOTH sinks: the
existing runs.emit transport, AND the per-project event-sinks map.
The project-events route emits via sse.send(payload.type, payload),
so we pack the SSE channel name onto payload.type and let the sink
push the right channel. The web sseToPanelEvent overwrites type
from the channel name on the way back into a PanelEvent, so the
round-trip stays correct.

P2 (prompt gate misalignment). composeSystemPrompt reads
cfg.enabled to decide whether to append the panel addendum, but
critiqueCfg.enabled is loaded from OD_CRITIQUE_ENABLED only. A run
the resolver enabled via phase / project / skill (env unset) would
have critiqueShouldRun = true while critiqueCfg.enabled remained
false, dropping the panel prompt while still routing through
runOrchestrator -> parser waits for tags that never arrive -> run
degrades.

Fixed by passing a derived config { ...critiqueCfg, enabled: true }
to the composer when critiqueShouldRun is true. The composer's own
gate now agrees with the resolver decision on every input the
spec defines.

Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases
still green against the new wiring.

* fix: address PerishCode P1 + P2 follow-ups on PR #1338

Two follow-up items PerishCode flagged on the activation PR.
Non-blocking but both are real:

1. Phase 11 e2e suite was wired into test:ui:extended but lands
   the user on '/' (home route) where ProjectView (and therefore
   CritiqueTheaterMount) is never rendered. With the suite as
   written, every assertion would time out the first time the
   lane runs in CI, contradicting the PR body's claim that the
   suite stays parked behind test.describe.fixme.

   The state diverged from my earlier Phase 11 work because the
   merge from main on commit 4ab719c6 brought in #1307's
   squash-merged version of the e2e file (the pre-fixme shape).

   Re-applied test.describe.fixme to the describe block plus
   removed ui/critique-theater.test.ts from the test:ui:extended
   script in e2e/package.json. Added a file-header docblock
   explaining what the follow-up commit needs to do: replace
   goto('/') with /projects/:id navigation similar to
   app-design-files.test.ts, split the SSE fixture into a live
   prefix and terminal suffix (Codex P2 on PR #1320), and commit
   the first PNG baselines.

2. bestRoundOf in CritiqueTheaterMount returned the LAST round
   with a numeric composite, not the round with the HIGHEST
   composite, while bestCompositeOf correctly returned the max.
   A run that closed round 1 at 8.5 and round 2 at 6.0 would
   dispatch interrupted { bestRound: 2, composite: 8.5 } on a
   user-clicked interrupt.

   Folded the two helpers into a single bestRoundAndComposite
   that walks state.rounds once and returns the matching pair so
   the two values cannot drift. The onInterrupt callback now
   destructures from one helper instead of two independent reads.
   Falls back to (state.activeRound, 0) when no round has closed
   with a composite yet.
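
The single-pass helper can be sketched like this. `RoundView` and its
field names are assumptions about the state shape, and keeping the
earliest round on a tie is a choice made for the sketch.

```typescript
interface RoundView {
  round: number;
  composite?: number; // absent until the round closes
}

interface BestPair {
  bestRound: number;
  composite: number;
}

// One walk over the rounds returns a matching (round, composite) pair,
// so the two values cannot drift apart.
function bestRoundAndComposite(
  rounds: readonly RoundView[],
  activeRound: number,
): BestPair {
  let best: BestPair | null = null;
  for (const view of rounds) {
    if (typeof view.composite !== "number") continue;
    if (best === null || view.composite > best.composite) {
      best = { bestRound: view.round, composite: view.composite };
    }
  }
  // No round has closed with a composite yet.
  return best ?? { bestRound: activeRound, composite: 0 };
}
```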

Web typecheck: clean. CritiqueTheaterMount.test.tsx: 7 / 7 cases
still green against the new helper.

* fix: wire M1 project override end-to-end + correct deferred-surface doc claims (PR #1338)

Three lefarcen P2s on the latest review pass, all real:

1. M1 project override was half-wired: the daemon read
   metadata.critiqueTheaterEnabled but the web setter only
   wrote localStorage. A user opt-in would render the Theater
   on the web (localStorage was set) while the daemon resolved
   projectOverride=null and skipped critique unless env / phase
   already permitted. Two halves talking past each other.

   Extended setCritiqueTheaterEnabled to accept an optional
   { projectId, fetchProjectSettings } options bag. When a
   projectId is supplied, the setter ALSO sends a
   PATCH /api/projects/:id with { metadata: { critiqueTheaterEnabled
   } } so the daemon's spawn-time resolver picks the same value up
   on the next generation. The existing project-routes endpoint
   already accepts arbitrary metadata patches, so no new endpoint
   is needed. The local write + the CustomEvent dispatch still
   fire before the PATCH, so a network failure does not unwind
   the in-session UI flip. Three new vitest cases pin the new
   path: PATCHes when projectId is provided, skips when it is
   not, swallows a rejected PATCH so the in-session UI still
   flips.

2. Rollout docs (docs/critique-theater.md section 3) claimed the
   Settings toggle persists into the daemon settings store, but
   the previous implementation only had a localStorage reader /
   writer plus a daemon read of project metadata, with no
   round-trip. Rewrote the section to lead with the four-tier
   resolver (skill policy / project override / env / phase),
   document that the setter now round-trips via the existing
   PATCH endpoint when given a projectId, and call out the
   Settings panel UI control as a deliberate follow-up.

3. Troubleshooting table pointed users at /api/metrics/critique
   (Phase 12, deferred) and 'od adapters clear-degraded <id>'
   (CLI wrapper that does not exist). Replaced the metrics
   reference with the local conformance harness command
   (pnpm --filter @open-design/daemon vitest run
   tests/critique-conformance.test.ts) that ships today, with a
   note that the Phase 12 dashboard surfaces this status as a
   series once that PR lands. Replaced the CLI command with the
   programmatic clearDegraded() helper that exists today and
   flagged the CLI wrapper as planned follow-up.

Web typecheck: clean. Toggle hook tests: 14 / 14 green (11
existing + 3 new for the round-trip path).

* test(web): multi-round interrupt regression for bestRoundAndComposite (PR #1338)

lefarcen P3 follow-up to the previous bestRoundAndComposite fix:
the existing CritiqueTheaterMount.test.tsx interrupt cases only
exercised a single-round state, so a future refactor back to two
independent helpers wouldn't be caught by the test suite even
though it'd reintroduce the round / composite drift bug.

Added a regression case that:

  1. Drives the reducer through two complete rounds with the
     full 5-role cast closing at distinct composites: round 1
     at 8.5, round 2 at 6.0 (the high-composite round is NOT the
     most recent one).
  2. Clicks Interrupt + waits for the daemon ack via the test
     seam fetcher returning 204.
  3. Asserts the collapsed badge displays "round 1" (the
     correct best-composite round), and queryByText for
     "round 2 ... 8.5" returns null (the buggy pairing
     would have produced that string).

The bestRoundAndComposite helper walks state.rounds in one pass
and returns the matching pair, so the round number and the
composite cannot drift apart. This test locks the fix in: a
refactor that splits the helpers back into independent walks
will be caught here.

8 / 8 vitest cases green on the file.

* fix(web): read-merge-write the project metadata in setCritiqueTheaterEnabled (PerishCode P2 on PR #1338)

The previous round-trip sent { metadata: { critiqueTheaterEnabled: next } }
as the entire PATCH body. The daemon's project-routes handler only
re-stamps three immutable fields (baseDir, importedFrom,
fromTrustedPicker) before calling updateProject(db, id, patch),
which then does a shallow { ...existing, ...patch } in apps/daemon/
src/db.ts. So patch.metadata replaces the row's metadata wholesale,
dropping kind, templateId, linkedDirs, and every other field the rest
of the app reads.

No in-tree caller passes projectId today (only vitest cases), so the
bug had not surfaced yet. But the surface is documented in
docs/critique-theater.md section 3 and the function's own JSDoc as
the M1 round-trip path, so it would have shipped as a latent footgun
for the next integrator: a Settings UI follow-up, or any third party
that wires the setter into a project-aware surface.

Fix: read-merge-write rather than a bare patch.

- GET /api/projects/:id to read the row's current metadata.
- Spread that metadata into the PATCH body and overlay
  critiqueTheaterEnabled: next on top, mirroring the partial-metadata
  pattern already used in ChatComposer.tsx for linkedDirs.
- PATCH the merged object.

Failure handling:
- GET fails: skip the PATCH entirely. We cannot construct a safe
  merged body without the current state, and a bare patch would
  wipe other metadata. The in-session CustomEvent fired earlier in
  the setter still keeps every mounted hook consistent; the next
  save retries the round-trip.
- PATCH fails: log in dev. The in-session UI is already correct via
  the CustomEvent.
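
A rough sketch of the flow and its two failure modes, written against an injected fetch-like function so it runs without a daemon. The helper names, the return values, and the assumption that the GET body exposes a top-level metadata object are illustrative only:

```typescript
// Minimal fetch-like signature so the sketch can run against a stub.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

type Metadata = Record<string, unknown>;

// Pure merge step: keep every existing field, overlay only the toggle.
function mergeToggle(current: Metadata | undefined, next: boolean): Metadata {
  return { ...(current ?? {}), critiqueTheaterEnabled: next };
}

// Hypothetical helper; assumes the GET body carries a top-level `metadata`.
async function persistToggle(
  fetchLike: FetchLike,
  projectId: string,
  next: boolean,
): Promise<"patched" | "skipped" | "patch_failed"> {
  const url = `/api/projects/${encodeURIComponent(projectId)}`;

  // Read: without the current metadata there is no safe merged body.
  let current: Metadata | undefined;
  try {
    const res = await fetchLike(url);
    if (!res.ok) return "skipped";
    const body = (await res.json()) as { metadata?: Metadata } | null;
    current = body?.metadata;
  } catch {
    return "skipped";
  }

  // Merge + write: a failed PATCH is reported, never retried blindly.
  try {
    const res = await fetchLike(url, {
      method: "PATCH",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ metadata: mergeToggle(current, next) }),
    });
    return res.ok ? "patched" : "patch_failed";
  } catch {
    return "patch_failed";
  }
}
```

Keeping the merge in a pure function makes the "no fields dropped" property testable without any network stub.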

Tests (TDD, red-first):

- 'GETs the project then PATCHes with merged metadata when a
  projectId is supplied': stubs a GET that returns
  { kind: 'template', templateId: 'modern-blog', linkedDirs: [...] }
  and asserts the PATCH body equals the merge plus the toggle.
- 'PATCHes with just the toggle when the project has no prior
  metadata': stubs a GET that returns no metadata block.
- 'skips the PATCH (does not stomp metadata) when the prefetch GET
  fails': stubs a rejecting GET and asserts only the GET fires.
- 'swallows a rejected PATCH after a successful prefetch': stubs a
  successful GET and a rejecting PATCH; asserts the in-session UI
  still flips via the CustomEvent.

Updated the setter's JSDoc to describe the new three-step flow
(localStorage, CustomEvent, read-merge-write PATCH) and the two
failure modes.

Verified:
- pnpm --filter @open-design/web typecheck clean.
- pnpm --filter @open-design/web test: 111 files / 1055 tests green
  (was 1052, +3 from the new merge-flow cases).

* fix(web): restore wait-for-daemon-ack pattern on Theater interrupt

Same regression as flagged on PR #1316 post-main-merge: the
optimistic local dispatch fired before the POST resolved, so a
daemon 404 / 409 still terminalized the UI and the real SSE
terminal event got ignored by the sticky interrupted phase.

Snapshot runId / bestRound / composite at click time, dispatch
interrupted only on res.ok, clear interruptPending on rejection or
non-2xx so the user can retry. Tests cover rejection + 404 leaving
the run on the live stage; the 204 path waits for the ack.
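
The ack-gated ordering can be sketched as below. The post, dispatch, and setPending parameters are hypothetical stand-ins for the component's fetcher, reducer dispatch, and interruptPending setter; what happens to the pending flag on success is not stated above, so the sketch leaves it untouched:

```typescript
interface InterruptSnapshot {
  runId: string;
  bestRound: number;
  composite: number;
}

type InterruptAction = { type: "interrupted" } & InterruptSnapshot;

// Hypothetical handler; parameter names are assumptions for this sketch.
async function requestInterrupt(
  snapshot: InterruptSnapshot, // captured at click time, before any await
  post: (runId: string) => Promise<{ ok: boolean }>,
  dispatch: (action: InterruptAction) => void,
  setPending: (pending: boolean) => void,
): Promise<boolean> {
  setPending(true);
  try {
    const res = await post(snapshot.runId);
    if (res.ok) {
      // Only a 2xx ack from the daemon terminalizes the UI.
      dispatch({ type: "interrupted", ...snapshot });
      return true;
    }
  } catch {
    // Network rejection: fall through to the retry path.
  }
  // Non-2xx or rejection: stay on the live stage and allow a retry.
  setPending(false);
  return false;
}
```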

* feat(daemon): Critique Theater Phase 12 observability foundations

Lands the metrics registry, the structured logger, the /api/metrics
route, and the adapter-degraded bump that wires up the first data
point. The orchestrator-side bumps for runs / rounds / composite /
must-fix / interrupted / parser_errors / protocol_version land in a
follow-up commit on this branch (kept separate so the wiring diff
reads cleanly against the registry shape).

Surfaces added:

- apps/daemon/src/metrics/index.ts: 9 Prometheus series under the
  open_design_critique_* namespace with the histogram buckets the
  spec calls out (round_duration_ms at 100 / 250 / 500 / 1000 /
  2500 / 5000 / 10000 / 30000 / 60000 ms; composite_score at
  0-10 integer steps).
- apps/daemon/src/logging/critique.ts: 6 typed events, one JSON line
  per call on stdout, namespaced critique. Matches the JSON-per-line
  convention cli.ts already uses; no new logger framework.
- apps/daemon/src/server.ts: GET /api/metrics route. Honors
  OD_METRICS_ENDPOINT=disabled to opt out for air-gapped installs.
- apps/daemon/src/critique/adapter-degraded.ts: markDegraded now
  bumps degraded_total so the adapter-health dashboard panel
  reflects every TTL refresh and every fresh mark.
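
As an illustration of the JSON-per-line convention, a small sketch of a typed logger. The event names follow the log lines named elsewhere on this branch; the payload fields and the `ns` key are assumptions, not the shape in logging/critique.ts:

```typescript
// Payload fields and the `ns` key are assumptions for this sketch.
type CritiqueLogEvent =
  | { event: "run_started"; runId: string; protocolVersion: number }
  | { event: "round_closed"; runId: string; round: number; composite: number }
  | { event: "run_shipped"; runId: string }
  | { event: "run_failed"; runId: string; cause: string }
  | { event: "degraded"; runId: string; reason: string }
  | { event: "parser_recover"; runId: string; kind: string; position: number };

// One event becomes exactly one line: JSON.stringify never emits a raw
// newline, so a line-oriented collector can split on "\n" safely.
function formatCritiqueLine(e: CritiqueLogEvent): string {
  return JSON.stringify({ ns: "critique", ...e });
}

// The sink is injectable so tests can capture lines instead of stdout.
function logCritique(
  e: CritiqueLogEvent,
  write: (line: string) => void = (line) => console.log(line),
): void {
  write(formatCritiqueLine(e));
}

logCritique({ event: "run_failed", runId: "run-1", cause: "timed_out" });
```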

Deps: prom-client ^15.1.0, @opentelemetry/api ^1.9.0 added to
apps/daemon/package.json. Both are zero-config no-ops without an
exporter wired; daemon bundle size impact is ~150 KB uncompressed.
The @opentelemetry/api dep is in place ahead of the OTel-spans
follow-up commit; it adds no behavior in this commit.

Tests:
- tests/metrics/critique.test.ts (3 cases): registry shape +
  exposition text + reset-between-tests
- tests/logging/critique.test.ts (4 cases): event shape + ordering
  + newline framing + namespace stamping

Verification (Windows-local):
- pnpm --filter @open-design/daemon typecheck: clean
- New metrics + logging suites: 7 / 7 green
- Existing adapter-degraded + conformance + rollout suites:
  22 / 22 green; the bump is non-breaking

* feat(daemon): wire Critique Theater metrics + structured logs from the orchestrator

Lights up the bump sites the Phase 12 foundations PR registered the
series for. Every panel event the parser surfaces now reaches the
matching Prometheus counter / histogram and the matching JSON log
line on stdout.

Switch-loop bumps + logs:

- run_started: log run_started, set protocol_version gauge to the
  observed protocol version (small-integer cardinality).
- panelist_open: record the first-open wall-clock per round so
  round_end can compute round_duration_ms; subsequent opens in the
  same round leave the start time untouched.
- panelist_must_fix: bump must_fix_total with the panelist role.
  The wire event does not yet carry a dim name, so the label is
  'unspecified' for now; a future parser revision can drop in the
  real dim without a metric rename.
- round_end: bump rounds_total, observe composite_score, observe
  round_duration_ms (current ms minus the tracked start), log
  round_closed with the composite / mustFix / decision triple.
- parser_warning (parser-yielded): bump parser_errors_total with
  the kind label, log parser_recover with kind + position.
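
The first-open bookkeeping behind round_duration_ms can be sketched roughly as follows. RoundTimer is a hypothetical name for this illustration; the orchestrator may track the start time differently:

```typescript
// Hypothetical helper: remembers the first panelist_open per round.
class RoundTimer {
  private readonly startedAt = new Map<number, number>();

  // Called on every panelist_open; only the first open per round sticks.
  open(round: number, nowMs: number): void {
    if (!this.startedAt.has(round)) {
      this.startedAt.set(round, nowMs);
    }
  }

  // Called on round_end; returns null when no open was ever seen.
  close(round: number, nowMs: number): number | null {
    const start = this.startedAt.get(round);
    if (start === undefined) return null;
    this.startedAt.delete(round);
    return nowMs - start;
  }
}

const timer = new RoundTimer();
timer.open(1, 1_000); // first panelist opens
timer.open(1, 1_500); // later panelist, start time unchanged
console.log(timer.close(1, 4_000)); // 3000
```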

Orchestrator-side parser warnings (composite_mismatch and
duplicate_ship from the daemon-authoritative scoring checks) go
through a new emitParserWarning helper so the bus emit, the
collectedEvents push, the metric bump, and the log line stay in
lockstep. Three inline emission sites collapse to one-line helper
calls.

After the try/catch, a single terminal-status switch bumps
runs_total{status, adapter, skill} once per run, with branch-
specific log + counter:

- shipped / below_threshold: log run_shipped
- interrupted: bump interrupted_total, log run_failed{cause: interrupted}
- timed_out: log run_failed{cause: timed_out}
- failed: log run_failed{cause: orchestrator_internal}
- degraded: log degraded{reason: orchestrator_classified}

OrchestratorParams gains optional skill: string for the label;
defaults to 'unknown' so spawn sites that have not yet threaded it
keep working without a metric shape change.

Tests:
- The new metrics + logging suites (7 / 7) verify registry shape
  and event framing; orchestrator-side metric integration is
  exercised through the existing critique-conformance and
  critique-adapter-degraded suites (22 / 22 still green).
- Logger test reassigns process.stdout.write directly instead of
  vi.spyOn so the Node overloaded write signature does not
  collide with MockInstance<unknown>.

* feat(observability): Grafana dashboard JSON for Critique Theater

Three default rows mapping to the metrics this branch wires up:

1. Fleet quality: composite score p50 / p90 / p99 line graph by
   adapter, plus a heatmap of the composite distribution. The
   line graph answers 'are my agents getting better over time';
   the heatmap answers 'are the bad runs clustered around one
   adapter or smeared across the fleet'.

2. Adapter health: stacked bar charts for degraded marks (by
   adapter / reason) and parser errors (by adapter / kind) over
   a 5-minute window. The two queries together let an operator
   see 'is this adapter degraded because of malformed wire output
   or because of oversize blocks' without flipping panels.

3. Brief throughput: runs-per-hour by terminal status, an average
   rounds-per-run stat per adapter, and a round-duration ms p50 /
   p90 / p99 line. Throughput numbers fall straight out of the
   runs_total / rounds_total counters; the duration histogram is
   the same one the runs feed.

The dashboard uses a templated $datasource var (defaults to
'prometheus') so an operator with multiple Prometheus instances
can switch without editing JSON. Schema version 39 (Grafana 11).

Operators import via:

  pnpm dlx @grafana/cli dashboard import tools/dev/dashboards/critique.json

or paste into a provisioned dashboards directory. The file is
checked into the repo as a starting artifact; alert rules and
SLO panels ship after the first 1000 runs inform the right
thresholds. JSON validates with node -e 'JSON.parse(...)' (sanity
checked locally).

* feat(daemon): OpenTelemetry outer span around the critique run

Wraps each runOrchestrator call in a 'critique.run' span via the
existing @opentelemetry/api dep added in the Phase 12 foundations
commit. Attributes set on the span:

- critique.run_id, critique.adapter, critique.skill at start
- critique.final_status, critique.final_composite on terminal
  resolution
- span status flipped to ERROR for failed / timed_out runs so a
  Tempo / Honeycomb / Jaeger filter on traces.status=error
  surfaces the right slice without joining back to Prometheus

No exporter is wired by default; @opentelemetry/api is the API
package and is intentionally split from @opentelemetry/sdk-*, so
the span is zero-overhead until an operator attaches an SDK
through their runtime config.

Inner per-round / parse_chunk / scoreboard_eval / persist_round /
ship.persist spans defined in the Phase 12 plan are a follow-up:
the outer span alone gives the trace a duration + final status +
adapter/skill labels, which is the 80% value for dashboards that
correlate runs across services. Adding child spans inside the
existing 600-line orchestrator without restructuring is a separate
careful change.

Verification:
- pnpm --filter @open-design/daemon typecheck: clean
- 29 / 29 critique + metrics + logging tests still green

* fix(nix): bump pnpmDepsHash for prom-client + @opentelemetry/api lockfile bump

nix-check failed on PR #1485 with hash mismatch in
open-design-daemon-pnpm-deps and open-design-web-pnpm-deps after
the Phase 12 foundations commit (2b8b7445) added prom-client and
@opentelemetry/api to apps/daemon/package.json and refreshed
pnpm-lock.yaml.

CI reported the new sha:
  specified: HFLm+8hv3o5x3Xem4MXNsNclIgiVRc70+EBafL0rVn8=
  got:       7R1sQC38gOT0gsZ2oNOviCZ486cbbGJGJCis6WI8z9s=

Both nix files pin the same workspace lockfile, so both flip in
lockstep. No other Nix surface changes required.

* fix(daemon): four Phase 12 review findings (Codex P2 x2 + Siri-Ray P2 + lefarcen P2)

1. Siri-Ray P2 in orchestrator.ts (round metric / log used untrusted
   agent values). The new observability path now records rs.composite
   and rs.mustFix (daemon-authoritative) instead of event.composite
   and event.mustFix when rs exists, and skips the bumps + log
   entirely when rs is missing (a degenerate round_end without any
   matching panelist_open). The dashboard p50 / p90 / p99 now agrees
   with persistence and ship decisions; an adapter reporting <ROUND_END
   composite='10'> while the daemon computed 6 logs 6 and still emits
   the composite_mismatch parser warning the prior block was already
   producing.

2. Codex P2 in server.ts (skill label always 'unknown'). The spawn
   path called runOrchestrator without passing the resolved skill id,
   so every live run bumped open_design_critique_*{skill='unknown'}
   and the per-skill dashboard breakdown was always empty. Threaded
   effectiveSkillId (already computed at the same handler scope as
   the project skill fallback) through skill: . . . so the metric
   reflects the real skill when one is assigned, and the orchestrator
   default of 'unknown' only fires for runs that genuinely have none.

3. Codex P2 in conformance.ts (protocol-version mismatch let through).
   An adapter that emitted <CRITIQUE_RUN version='2'> followed by a
   valid SHIP classified as shipped because the harness only watched
   for terminal events. Added a guard inside the parse loop: if a
   run_started carries protocolVersion !== CRITIQUE_PROTOCOL_VERSION,
   mark the adapter degraded with reason 'protocol_version_mismatch'
   (already in DEGRADED_REASONS) and return early. ConformanceOutcome
   union widened to accept the new reason.

4. lefarcen P2 in tools/dev/dashboards/critique.json (runs-per-hour
   panel under-reported by 3600x). 'rate(...[1h])' returns a per-second
   average over the window. Multiplied the expression by 3600 so the
   panel title and unit match the actual value rendered.
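
The arithmetic behind finding 4, as a small worked example. The run count is a made-up number chosen to keep the floating-point math exact:

```typescript
// PromQL rate() yields an average per-second rate over the window.
function perSecondRate(increase: number, windowSeconds: number): number {
  return increase / windowSeconds;
}

const SECONDS_PER_HOUR = 3600;
const runsInWindow = 1800; // hypothetical counter increase over one hour

const perSecond = perSecondRate(runsInWindow, SECONDS_PER_HOUR); // 0.5
const perHour = perSecond * SECONDS_PER_HOUR;                    // 1800

// A panel titled "runs per hour" that plots perSecond shows 0.5, not 1800.
console.log({ perSecond, perHour });
```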

Verification:
- pnpm --filter @open-design/daemon typecheck: clean
- New metrics + logging suites (7), existing adapter-degraded (7),
  conformance (5), rollout (10): 29 / 29 green
- Grafana JSON re-parses with node -e 'JSON.parse(...)'

* feat(daemon): Critique Theater Phase 16 (M-phase rollout ratchet)

The PR that takes the rollout out of operator-flips-env-vars-by-hand and into the-fleet-conformance-numbers-decide. Stacks on Phase 12 (#1485): the ratchet reads from the conformance harness's daily output, which only exists once Phase 12's metrics + history surface land.

Five surfaces:

1. apps/daemon/src/critique/ratchet.ts (new)
   Pure evaluator. Takes the current RolloutPhase plus a rolling window of ConformanceDay rows and returns one of three decisions: promote, hold, or demote. Spec defaults (14-day window, 0.90 shipped, 0.95 clean-parse) match specs/current/critique-theater.md. Demote floor is half the promote threshold so a single noisy day does not bounce the rollout back; only sustained breakage walks things back. M0 cannot demote and M3 cannot promote, both collapse to hold with an explicit reason string.

2. apps/daemon/src/critique/conformance-history.ts (new)
   JSON-lines persistence at dataDir/conformance/adapter/date.jsonl. Append-only writer + windowed reader. Last entry per (adapter, date) wins so a retry-after-failure cron writes the right answer without a read-modify-write at write time. Malformed lines, missing files, and missing adapter directories all collapse to skip-this-row since a missing day is data missing, not data wrong.

3. apps/daemon/src/server.ts
   GET /api/critique/conformance returns { window, decision }. Tunables come from query string (windowDays, shippedThreshold, cleanParseThreshold) with spec defaults. The recommendation does not auto-flip OD_CRITIQUE_ROLLOUT_PHASE; an operator-driven follow-up consumes the JSON and decides whether to flip or alert.

4. .github/workflows/critique-conformance.yml (new)
   Nightly cron at 03:00 UTC. Builds the daemon, drives the conformance harness against the synthetic-good and synthetic-bad fixtures, and uploads the .od/conformance/ snapshot as a workflow artifact. The schedule sits outside the busy generation window so the cron does not contend with user runs for adapter rate-limit budgets.

5. apps/daemon/tests/critique-ratchet.test.ts + critique-conformance-history.test.ts
   17 cases. Ratchet: 10 cells of the promote / hold / demote matrix. History: 7 round-trip cases.
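
To make the "last entry per (adapter, date) wins" rule concrete, a minimal reader sketch. The ConformanceDay field names are assumptions for illustration, and the function works on file text rather than touching the filesystem:

```typescript
// Hypothetical row shape; the real ConformanceDay fields may differ.
interface ConformanceDay {
  adapter: string;
  date: string; // e.g. "2026-05-13"
  shippedRate: number;
  cleanParseRate: number;
}

function isConformanceDay(v: unknown): v is ConformanceDay {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return (
    typeof r.adapter === "string" &&
    typeof r.date === "string" &&
    typeof r.shippedRate === "number" &&
    typeof r.cleanParseRate === "number"
  );
}

// Append-only input: later lines overwrite earlier ones for the same key,
// and malformed lines are skipped rather than failing the whole read.
function readConformanceLines(text: string): ConformanceDay[] {
  const latest = new Map<string, ConformanceDay>();
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue;
    let row: unknown;
    try {
      row = JSON.parse(line);
    } catch {
      continue; // a missing day is data missing, not data wrong
    }
    if (!isConformanceDay(row)) continue;
    latest.set(`${row.adapter}\u0000${row.date}`, row);
  }
  return Array.from(latest.values());
}
```

Resolving duplicates at read time is what lets the writer stay a plain append, with no read-modify-write on retry.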

Verification:
- pnpm --filter @open-design/daemon typecheck: clean
- 17 / 17 new tests green
- Phase 12 metrics + logging + adapter-degraded + conformance + rollout suites (29) untouched and still green

* fix(daemon): three Phase 16 review findings (Codex P1/P2 + lefarcen P1 x3)

1. Duplicate parseRolloutPhase import in server.ts. The new standalone import collided with the existing grouped import; ESM would fail to parse at module load on every daemon startup path. Removed the standalone import; the grouped one already imports parseRolloutPhase.

2. Validation gap in evaluateRollout. A request like ?windowDays=0 fed passingDays >= windowDays = 0 >= 0 = true, returning promote with zero observed days. Now the evaluator rejects non-positive windowDays and out-of-range thresholds at the function entry with an explicit hold reason. The route also clamps query strings before they reach the evaluator (belt + suspenders so a future caller bypassing the route hits the same defense).

3. Missing nightly runner. The workflow called apps/daemon/src/critique/__fixtures__/run-nightly.ts, which the prior PR did not actually add, and || echo masked the failure. Added the runner: drives every synthetic adapter through runAdapterConformance, walks the resulting events for parser_warning to compute cleanParseRate, and writes one ConformanceDay row per adapter via appendConformanceDay. Removed the || echo mask so the workflow fails loudly when the runner throws.

Tests for the validation fix: four new ratchet cases (windowDays=0 holds with no evidence, windowDays=-7 holds, shippedThreshold > 1 holds, cleanParseThreshold < 0 holds). Ratchet suite goes from 10 -> 14 cases.
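
A sketch of the entry guard and the promote check, under stated assumptions: the DayRates shape and the function name are hypothetical, the demote branch and the M0 / M3 edge cases are omitted, and the integer check on windowDays is this sketch's choice rather than the evaluator's documented guard:

```typescript
type PromotionDecision = { decision: "promote" | "hold"; reason: string };

// Hypothetical row shape for this sketch.
interface DayRates {
  shippedRate: number;
  cleanParseRate: number;
}

function isRate(x: number): boolean {
  return Number.isFinite(x) && x >= 0 && x <= 1;
}

function evaluatePromotion(
  days: readonly DayRates[],
  windowDays = 14,
  shippedThreshold = 0.9,
  cleanParseThreshold = 0.95,
): PromotionDecision {
  // Guards run first so zero, negative, or NaN inputs can never promote.
  if (!Number.isInteger(windowDays) || windowDays <= 0) {
    return { decision: "hold", reason: "invalid windowDays" };
  }
  if (!isRate(shippedThreshold)) {
    return { decision: "hold", reason: "invalid shippedThreshold" };
  }
  if (!isRate(cleanParseThreshold)) {
    return { decision: "hold", reason: "invalid cleanParseThreshold" };
  }

  const recent = days.slice(-windowDays);
  const passingDays = recent.filter(
    (d) => d.shippedRate >= shippedThreshold && d.cleanParseRate >= cleanParseThreshold,
  ).length;
  const reason = `${passingDays} of ${windowDays} days passed`;

  return passingDays >= windowDays
    ? { decision: "promote", reason }
    : { decision: "hold", reason };
}
```

Without the first guard, windowDays = 0 makes passingDays >= windowDays trivially true, which is the zero-evidence promote described in finding 2.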

Verification:
- pnpm --filter @open-design/daemon typecheck: clean
- 33 / 33 critique tests green (14 ratchet, 7 conformance-history, 7 adapter-degraded, 5 conformance)

* test(daemon): explicit NaN regression cases for the ratchet evaluator (PerishCode follow-up on PR #1499)

The Number.isFinite() guard already rejects NaN on every numeric
input, so this is belt-and-suspenders: pinning the behavior so a
future refactor of the guard (a typed parser, a clamp helper, a
relaxed range check) cannot accidentally let NaN through and
surface a zero-evidence promote signal.

Three new assertions inside one case (windowDays=NaN, shippedThreshold=NaN,
cleanParseThreshold=NaN), each asserting hold + the matching
'invalid X' reason string. Ratchet suite goes from 14 -> 15 cases.

* fix(nix): regenerate lockfile + pin pnpmDepsHash for prom-client + @opentelemetry/api (lefarcen P1 on PR #1499)

---------

Co-authored-by: Nagendhra <nagendhra405@gmail.com>
2026-05-14 11:05:57 +08:00
Bryan
54498f1ac5
fix(web): parse Provenance with Markdown-bold labels (#1584)
* fix(web): parse Provenance with Markdown-bold labels (#1580)

The daemon's finalize synthesis prompt at apps/daemon/src/finalize-design.ts:560-565
lists the five Provenance fields without pinning field-label syntax, so Claude
renders them with Markdown-bold labels per Markdown convention
(`- **Field:** value`). The parser at apps/web/src/lib/parse-provenance.ts:32-36
uses `[:\s]+` as its label/value separator, which stops at the trailing `**`
after the colon; the capture group then slurps the `**` and any following
whitespace into the value. Downstream of that, transcriptMessageCount and
generatedAt parse as null because the captured tokens don't start with digits
or a valid ISO 8601 prefix, and the Continue in CLI clipboard prompt shows
`Design system: ** ...`, `Transcript message count when DESIGN.md was
generated: unknown`, `DESIGN.md generated at: unknown`.

Fix: strip leading and trailing Markdown emphasis (`*`, `_`, whitespace) from
every captured value via a single helper threaded through extractField /
extractFieldOrNone / extractNumber / extractDate. Widen the
transcriptMessageCount regex's capture from `(\d+)` to `([^\n]+)` so the
strip step gets a chance to run on `** 4`. Add `[^:]*` between `count` and
`[:\s]+` to mirror the other label-walking regexes for bolded label
variants.

Defense-in-depth: tell the synthesis prompt to emit plain `- Field: value`
bullets with no emphasis on the labels. The parser hardening is the
load-bearing fix; this is belt-and-suspenders for new model variants.

Red-Green-Refactor:
- Phase 1 (Red): 3 new parse-provenance tests covering bold labels with
  backticked values, bold labels with a short `Generated:` form, and bold
  labels with `none` sentinels. All 3 failed against pre-fix source.
- Phase 2 (Green): strip + regex widening. All 7 parse-provenance tests
  + 1158 web tests pass.
- Phase 3: empirically verified against a live finalized DESIGN.md — all
  five fields now parse correctly.
- Phase 4 (defense-in-depth): one-line addendum to synthesis prompt.
- Phase 5: bold-labelled Provenance fixture added to the hook test
  (useDesignMdState.test.tsx) so the round-7 `unknown-provenance`
  fail-closed path is regression-pinned end-to-end.

Backticks in field values are intentionally kept (out of scope per the
issue spec; rendered clipboard text reads fine with them). The variant
`- **Field**: value` (colon outside emphasis) is not in the issue
enumeration and is not handled.

Fixes #1580

* fix(web): narrow Provenance strip to Markdown residue only

Round-2 fix per lefarcen's review on PR #1584. The round-1 helper
used `^[\s*_]+` / `[\s*_]+$`, which stripped a literal leading or
trailing `*`/`_` from any captured value — `_draft.html` corrupted
to `draft.html`, and a build id like `build_id_v1_` lost its
trailing underscore.

Narrow stripMarkdownEmphasis to three explicit passes:
  1. Leading `*`/`_` tokens FOLLOWED BY WHITESPACE — only matches
     the `** ` residue left after `- **Field:** value` is captured
     starting at the `*`.
  2. Trailing WHITESPACE followed by `*`/`_` tokens — mirror of (1)
     if the value closes with emphasis after whitespace.
  3. A single balanced wrap around the remaining value
     (`**X**` / `*X*` / `__X__` / `_X_`) — handles the
     `- **Field:** **value**` shape and any plain-label
     `**value**` form.

Asymmetric literal `*`/`_` characters in the value (no whitespace
separator, no balanced closing token) are preserved by
construction.
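
One way the three passes could look, as a sketch only. The regexes below are an assumption reconstructed from the description above, not the shipped stripMarkdownEmphasis:

```typescript
// Sketch of the three-pass strip; regexes are assumptions, not the shipped code.
function stripMarkdownEmphasis(raw: string): string {
  let value = raw.trim();

  // Pass 1: leading emphasis tokens followed by whitespace ("** value").
  value = value.replace(/^[*_]+\s+/, "");

  // Pass 2: trailing whitespace followed by emphasis tokens ("value **").
  value = value.replace(/\s+[*_]+$/, "");

  // Pass 3: one balanced wrap (**X**, __X__, *X*, _X_). The backreference
  // requires the closing token to equal the opening token.
  value = value.replace(/^(\*\*|__|\*|_)(.+)\1$/, "$2");

  return value;
}

console.log(stripMarkdownEmphasis("** 4"));           // 4
console.log(stripMarkdownEmphasis("** _draft.html")); // _draft.html
console.log(stripMarkdownEmphasis("build_id_v1_"));   // build_id_v1_
console.log(stripMarkdownEmphasis("**wrapped-id**")); // wrapped-id
```

A value that is itself wrapped in matching underscores would still be unwrapped by pass 3; that follows from the balanced-wrap rule rather than from anything specific to this sketch.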

Added regression tests:
  - plain label + `_draft.html` value
  - plain label + `build_id_v1_` value (trailing underscore)
  - bold label + `_draft.html` value (residue stripped, literal
    leading underscore preserved)
  - plain label + `**wrapped-id**` value (balanced residue stripped)

All 11 parse-provenance tests + 1162 web tests pass. Empirically
re-verified against a live finalized DESIGN.md — all five fields
still parse correctly.

---------

Co-authored-by: DevForgeAI CI/CD Engineer <devforge-ai@development.ai>
2026-05-14 11:04:24 +08:00
nmsn
d1dcc6ab7d
fix(web): use background-color instead of background shorthand (#1601)
The shorthand `background` property resets all background-related
properties to initial values, including background-repeat (defaulting
to repeat). Using `background-color` instead preserves the chevron
SVG background-image set by the global select rule.
2026-05-14 11:03:05 +08:00
Prantik Medhi
0c0da7cc23
fix(web): confirm continue-in-cli copy (#1604) 2026-05-14 11:02:36 +08:00