open-design

mirror of https://github.com/nexu-io/open-design.git synced 2026-06-01 03:14:35 +07:00

Author	SHA1	Message	Date
chaoxiaoche	bcc58af931	refactor(web): rename Execution mode and tighten settings dialog UI (#1568 ) * refactor(web): rename Execution mode and tighten settings dialog UI - Rename "Settings → Execution & model" to "Settings → Execution mode" across the web UI, i18n keys, docs, and e2e selectors. - Redesign SettingsDialog: kicker + title row in the modal head, a flatMap-driven agent grid that renders the inline test-result row beside the selected card, compact unavailable cards with right-aligned install/docs links, and an install guide that only shows when the user has no working agent picked. - Trim verbose subtitle / hint copy across chat model, CLI proxy, media providers, custom instructions, and memory sections. - Add an `info` Icon variant for the redesigned settings hints. - Update e2e selectors and docs that referenced the old menu label. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(web): polish Settings dialog — media providers, skills, MCP Media providers - Hide internal Stub fixture provider (settingsVisible: false) - Split provider list into Available (integrated, editable) and Coming Soon (collapsed <details> drawer with name/hint/Docs link only) - Drop right-side Integrated/Configured badges from every row; all rows in the main list are integrated by definition; inline grey "Saved" chip next to the provider name is the only status indicator now - "Saved" badge moves inline to the right of the provider name and uses a neutral grey treatment (was a standalone green pill below the name) - "Reload from daemon" button shows a 2s green "✓ Reloaded" flash on success instead of leaving a permanent paragraph under the header; errors remain sticky Skills - Replace three pill-row filter banks (Source, Type, Category) with a compact single-row toolbar: search + three inline <select> dropdowns side by side; active filter highlighted with a stronger border MCP server - Shorten section hint to one line - Move WHAT YOUR AGENT CAN DO capabilities above the 
client dropdown (motivate before asking to act) - Move "Build the daemon first" warning below the code block where it contextually explains why the command might fail, not as a top-level error before the user has done anything - Downgrade "Restart your client" left-border from accent orange to border-strong grey — it is a next step, not a warning External MCP - Shorten section hint to one line Misc CSS - Add .sr-only utility for accessible off-screen live regions - Add button.ghost.is-success-flash for transient success feedback - Add .library-filter-selects / .library-filter-select for dropdown filter rows - Add .media-provider-coming-soon-* for the roadmap drawer Co-authored-by: Cursor <cursoragent@cursor.com> * [codex] Add Cursor Agent auth diagnostics (#1538) * Add Cursor Agent auth diagnostics * Handle Cursor not logged in auth status * Address Cursor auth review feedback * Classify Cursor stdout auth failures * test: expand Memory and Routines coverage (#1521) * test: expand settings and packaged coverage * test: extend memory settings coverage * test: cover routine settings failure states * test: cover routine operation failures * test: fix daemon test typing on CI * test: decouple packaged smoke from orbit bug * test: avoid live memory LLM calls in route tests * test: fix daemon fetch typing in CI * fix: restore preview comment and inspect toggles * test: align manual edit flow with current inspector UX * test: align comment attachment flow with current preview comments UI * fix: probe resolved Codex launch path during detection * fix: remove duplicate board activation helper after rebase * test: update ghost cli detection mock * test: align FileViewer toolbar expectation * ci: move full app tests to extended lane * ci: run app tests by changed scope * ci: cover shared app inputs in test scopes * ci: avoid setup-node cache in windows packaged smoke * test: align extended settings and manual edit flows * refactor(web): rename Execution mode and tighten 
settings dialog UI - Rename "Settings → Execution & model" to "Settings → Execution mode" across the web UI, i18n keys, docs, and e2e selectors. - Redesign SettingsDialog: kicker + title row in the modal head, a flatMap-driven agent grid that renders the inline test-result row beside the selected card, compact unavailable cards with right-aligned install/docs links, and an install guide that only shows when the user has no working agent picked. - Trim verbose subtitle / hint copy across chat model, CLI proxy, media providers, custom instructions, and memory sections. - Add an `info` Icon variant for the redesigned settings hints. - Update e2e selectors and docs that referenced the old menu label. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(web): settings dialog UX polish — layout, dedup, and interactions - Remove duplicate section headers from all settings sections (Notifications, Appearance, Privacy, About, Design Systems, Skills, MCP server, Connectors, Media providers, Routines) - Restructure Notifications cards: title + toggle on same row, hint below - Restructure Skills toolbar: search + New skill button in row 1, filter dropdowns in row 2 with left-aligned labels - Restructure Pet section: tabs and Wake button on same row - MCP server: group capabilities and setup into separate cards, remove nested double border on client picker - Connectors: show connect errors as toast instead of inline card text, position toast inside panel, hide single-provider tab - Media providers: move Reload button to left-aligned small ghost button - Memory: info icon shows path on hover, Path copied badge inline; Extraction history and MEMORY.md as standalone collapsible cards; group header hidden when only one type visible - Pet grid cards: Adopt button hidden until hover, icon-only when adopted, description truncated to 2 lines, text fills full width via abs positioning - Agent cards: selected state uses accent border only, no background change - Add sun/moon icons 
to Appearance theme buttons (Light/Dark) - Shorten several hint strings for clarity Co-authored-by: Cursor <cursoragent@cursor.com> * fix(web): resolve i18n review comments from PR #1568 - Update settings.title and settings.envConfigure to localized "Execution mode" in all 17 non-English locale files - Add settings.memoryFlashPathCopied to all locales and use t() in MemorySection instead of hardcoded English "Path copied" - Add settings.agentModelHead to all locales and use t() in SettingsDialog for "Model for:" agent model row header Co-authored-by: Cursor <cursoragent@cursor.com> * fix(web): update tests to match settings dialog redesign - Add role prop to Toast (alert/status) so error toasts from ConnectorsBrowser are announced immediately by screen readers - Clear connectErrorToast on successful connector retry - Update SettingsDialog.execution tests: - Remove heading assertions for About and MCP server (headers were intentionally removed as duplicate nav labels) - Rewrite CLI env test to use codex-only fields (per-agent filtering means only selected agent's fields are shown) - Update Composio key hint text assertion to match shortened copy - Replace filter button click with select change for Type filter - Replace Configured/Unsupported/Integrated badge checks with updated assertions matching the new media provider UI - Replace disabled BFL row test with coming-soon section check - Update SettingsDialog.media test: remove Fal.ai input assertions (non-integrated providers no longer have editable fields) Co-authored-by: Cursor <cursoragent@cursor.com> * fix(web): unblock CI for #1568 Three small fixes to get Playwright back to green on the settings dialog redesign: 1. `en.ts`: revert `settings.envConfigure` to "Configure execution mode". 
This PR collapsed both `settings.title` (header gear) and `settings.envConfigure` (entry-side foot pill) to the same string "Execution mode", so `getByRole('button', { name: 'Execution mode' })` resolved to two elements and tripped Playwright strict mode in the three Composio-flow tests (entry-configuration-flows.test.ts:174, 228, 285). Restoring the distinct label also gives screen readers a clearer hint for the pill, which doubles as a status display. Non-English locales still alias the two keys; happy to follow up on those, but they don't gate the (English-only) Playwright suite. 2. entry-configuration-flows.test.ts:167 — `Connectors` heading is now rendered at `<h2>` in the modal-head (SettingsDialog.tsx:1545), with the inner `<h3>` removed by design (see comment around line 1448). Updated the assertion from `level: 3` to `level: 2`. 3. project-management-flows.test.ts:360 — same change for the `Pets` heading. Verified locally with `pnpm --filter @open-design/web typecheck` and `pnpm --filter @open-design/e2e typecheck`. The actual Playwright specs need the dev server up; I didn't rerun them here, but the locator changes are mechanical and match the new DOM. * fix(web): use exact match for Execution mode button locator Playwright's `getByRole({ name })` defaults to substring matching, so `{ name: 'Execution mode' }` still resolved to both the header gear (aria-label "Execution mode") and the entry-side foot pill (aria-label "Configure execution mode" — substring contains "Execution mode"). Strict mode tripped in the three composio-flow tests at lines 202, 257, and 319. Adding `exact: true` makes each call resolve to just the header gear, which opens the same dialog the foot pill does — the test outcomes are unchanged. 
--------- Co-authored-by: chaoxiaoche <chaoxiaoche@chaoxiaochedeMacBook-Pro.local> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Caprika <56862773+alchemistklk@users.noreply.github.com> Co-authored-by: shangxinyu1 <shangxinyu@refly.ai> Co-authored-by: lefarcen <935902669@qq.com>	2026-05-15 14:35:06 +08:00
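The strict-mode collision described in the last fix of the commit above can be modelled without a browser. This is a minimal sketch, assuming a simplified matcher: `matchesName` and the label list are hypothetical stand-ins, and real Playwright matching also normalizes whitespace and accepts regular expressions.

```typescript
// Simplified model of accessible-name matching: case-insensitive substring
// by default, case-sensitive whole-string when `exact` is set.
// This is an illustration, not Playwright's implementation.
function matchesName(label: string, name: string, exact = false): boolean {
  if (exact) return label === name;
  return label.toLowerCase().includes(name.toLowerCase());
}

// The two aria-labels named in the commit message.
const ariaLabels = ["Execution mode", "Configure execution mode"];

const looseMatches = ariaLabels.filter((l) => matchesName(l, "Execution mode"));
const exactMatches = ariaLabels.filter((l) => matchesName(l, "Execution mode", true));

console.log(looseMatches.length); // 2, which is what trips strict mode
console.log(exactMatches.length); // 1
```

Under this model, the default lookup resolves to both buttons, while the exact lookup resolves to the header gear only.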
Quang Do	a41d4f6126	fix(web): keep chat pinned during content growth (#1716)	2026-05-15 14:12:00 +08:00
Quang Do	3d0e708720	fix(daemon): treat media generate handoff as success (#1715)	2026-05-15 14:11:40 +08:00
Prantik Medhi	cd3acda6f6	fix(web): cap project title width (#1784)	2026-05-15 13:29:03 +08:00
Prantik Medhi	01e54700a2	fix(web): make file grouping by kind work (#1551) * fix(web): group design files by kind * fix(web): unblock CI for #1551 - FileViewer test (line 434): add missing `projectKind="prototype"` to match every other instance; this was the source of the typecheck failure blocking workspace validation. - DesignFilesPanel "groups files by kind" test: assert against `.df-section-label` elements so the section header check is not ambiguous with the per-row kind cell text. - DesignFilesPanel batch-delete test: derive the expected file names from the rendered row testids and use `arrayContaining` so the assertion no longer depends on the (now kind-default) row order. * fix(web): satisfy strict-index typecheck in batch-delete test `onDeleteFiles.mock.calls[0][0]` tripped `noUncheckedIndexedAccess` ("Object is possibly 'undefined'"). Drop the separate length probe and assert the exact array instead — `selected` is a `Set`, `handleBatchDelete` spreads it with `[...selected]`, and the test clicks rows[0]/rows[1] in that order, so insertion order is deterministic and equals `[firstName, secondName]`. --------- Co-authored-by: lefarcen <935902669@qq.com>	2026-05-15 13:07:27 +08:00
Nagendhra Madishetti	ad275dbc02	fix(daemon): use danger-full-access codex sandbox on Windows to unblock PowerShell (#1745) Codex CLI's workspace-write sandbox on Windows blocks every shell invocation with 'powershell.exe ... rejected: blocked by policy', so the agent cannot list files, navigate the workspace, or call any shell-backed tool. Codex has no working OS-level sandbox on Windows and falls back to a coarse policy that rejects shell unconditionally. Switch to --sandbox danger-full-access on win32 only. macOS (Seatbelt) and Linux (Landlock+seccomp) keep workspace-write because their sandbox enforcement permits shell while restricting writes. Tests anchor the workspace-write expectations explicitly to darwin and linux via withPlatform(), and a new win32 case asserts the danger-full-access flag and that the workspace-write-scoped network config override is dropped. Fixes #1721. Co-authored-by: Nagendhra <nagendhra405@gmail.com>	2026-05-15 12:22:48 +08:00
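The platform switch in the commit above amounts to a small pure function. The sketch below is an assumption about its shape: `planCodexSandbox`, `CodexSandboxPlan`, and the `applyWorkspaceWriteNetworkOverride` field are hypothetical names, and only the `--sandbox` flag with its two values comes from the commit message.

```typescript
// Hypothetical helper: the name and return shape are illustrative,
// not the daemon's actual API.
type SandboxMode = "workspace-write" | "danger-full-access";

interface CodexSandboxPlan {
  args: string[];
  // Whether a network override scoped to workspace-write should still apply.
  applyWorkspaceWriteNetworkOverride: boolean;
}

function planCodexSandbox(platform: string): CodexSandboxPlan {
  // Only win32 switches away from workspace-write; darwin and linux keep it.
  const mode: SandboxMode =
    platform === "win32" ? "danger-full-access" : "workspace-write";
  return {
    args: ["--sandbox", mode],
    applyWorkspaceWriteNetworkOverride: mode === "workspace-write",
  };
}

console.log(planCodexSandbox("win32").args.join(" "));  // --sandbox danger-full-access
console.log(planCodexSandbox("darwin").args.join(" ")); // --sandbox workspace-write
```

Keeping the decision in one function that takes the platform as an argument is what lets tests pin darwin, linux, and win32 expectations without touching the real `process.platform`.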
Yuhao Chen	9c3a8ae3e4	fix(web): restore dark pagination select chevron (#1736)	2026-05-15 12:15:20 +08:00
Het Savani	d0ecb62a36	fix(runtime): improve DOM fallback target selection for comment picker (#1706)	2026-05-15 12:10:09 +08:00
Prantik Medhi	8aeedf368b	fix(web): localize accent controls in settings (#1565) * fix(web): localize accent controls * fix(web): localize accent default label * fix(web): unblock CI for #1565 Add missing `projectKind="prototype"` to the FileViewer deck-render test (line 434) so workspace typecheck stops failing on the `Property 'projectKind' is missing` error. This mirrors every other FileViewer render in the same file and is unrelated to the accent localization changes in this PR — it's drift from a recent change on main that made `projectKind` required. --------- Co-authored-by: lefarcen <935902669@qq.com>	2026-05-15 12:09:28 +08:00
enaktes9-hub	24a70c7ab2	fix(web): ensure routine history 'Open project' button text is visible on hover (#1766) The hover state used `color: var(--bg)` which could resolve to a color that blends with the panel background, making the button label invisible. Changed hover text color to `var(--bg-panel)` which is the panel surface color — it guarantees contrast against the `var(--text)` hover background in both light and dark themes. Also added focus-visible outline and active state for better affordance. Fixes #1357 Co-authored-by: Hermes PR Agent <enaktes9-hub@users.noreply.github.com>	2026-05-15 11:58:20 +08:00
Yuhao Chen	b0963fd874	fix(web): allow downloads from preview iframes (#1732)	2026-05-15 11:55:29 +08:00
lefarcen	75498838a9	chore: align issue templates to preview/v0.8.0 naming (#1723) Following the rename of the feature branch from preview/0.8.0 to preview/v0.8.0 (to match the release/v0.7.0 convention), update all issue-template references so the label, filename, and deep-link URL stay consistent. Changes: - git mv preview-0.8.0-feedback.yml → preview-v0.8.0-feedback.yml - update labels reference, title prefix, display name, body copy - update version placeholder example to 0.8.0-preview.2 (current build) - update cross-references in bug-report.yml and feature-request.yml - update config.yml first contact_link URL + about text	2026-05-14 23:21:37 +08:00
lakatos	7c9e620291	fix(web): stop conversation route sync from remounting ChatPane in a loop (#1710) PR #1508 added a routeConversationId -> activeConversationId sync effect next to the existing activeConversationId -> URL sync, with no arbitration between them. Creating or switching a conversation moves activeConversationId ahead of the URL; the route-sync effect then sees the stale routeConversationId and pulls activeConversationId back, while the URL sync pushes it forward again. ChatPane is keyed on activeConversationId, so the ping-pong remounts ChatPane and its composer on every flip and the composer never settles. Track the conversation id this view last pushed to the URL and have the route-sync effect ignore a routeConversationId that merely echoes it. Only a genuinely external navigation (deep-link, routine history row) differs from the last synced id, so PR #1508's deep-link behaviour is preserved while the self-inflicted remount loop is gone.	2026-05-14 21:06:26 +08:00
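The arbitration described in the commit above can be expressed as pure state transitions, independent of React. This is a sketch under assumptions: `SyncState`, `selectConversation`, `markUrlPushed`, and `applyRoute` are hypothetical names, and recording the adopted id as the last synced id on an external navigation is this sketch's choice rather than something the commit message spells out.

```typescript
interface SyncState {
  activeId: string | null;
  // The id this view last wrote to, or adopted from, the URL.
  lastSyncedId: string | null;
}

// User creates or switches a conversation: state moves ahead of the URL.
function selectConversation(state: SyncState, id: string): SyncState {
  return { ...state, activeId: id };
}

// State -> URL sync has pushed the active id into the route.
function markUrlPushed(state: SyncState): SyncState {
  return { ...state, lastSyncedId: state.activeId };
}

// URL -> state sync: ignore a route id that merely echoes our own last sync;
// adopt anything else as a genuinely external navigation.
function applyRoute(state: SyncState, routeId: string | null): SyncState {
  if (routeId === null || routeId === state.lastSyncedId) return state;
  return { activeId: routeId, lastSyncedId: routeId };
}

let sync: SyncState = { activeId: "a", lastSyncedId: "a" };
sync = selectConversation(sync, "b"); // URL still says "a"
sync = applyRoute(sync, "a");         // stale echo: ignored, active stays "b"
sync = markUrlPushed(sync);           // URL now says "b"
sync = applyRoute(sync, "c");         // deep link: adopted
console.log(sync.activeId);           // c
```

Because the stale route id equals the last synced id, the route effect no longer pulls the active conversation back, so a component keyed on the active id is not remounted.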
Nicholas-Xiong	118937d09b	fix: Change comment button label from 'Send to Claude' to 'Send to chat' (#1673) Fixes #1390 Update the comment popover button label to accurately describe the action and match product terminology. Before: - Button labeled 'Send to Claude' - Suggests model-specific or brand-specific destination - Inconsistent with visible chat-based workflow After: - Button labeled 'Send to chat' - Clearly describes the actual destination - Matches user mental model and product terminology - Consistent with visible UI flow Changes: - Updated both comment popover instances (batch send and side panel send) - Preserves 'Sending…' loading state text	2026-05-14 21:05:45 +08:00
Nagendhra Madishetti	98bc6d63e6	feat: Critique Theater wireup (activate the stack, M0 dark-launch by default) (#1338 ) * feat(web): pure reducer for Critique Theater states (Phase 7.1) Pure CritiqueState reducer driven by the contracts-level PanelEvent (the same shape both the live SSE stream and the recorded transcript emit), so a single reducer powers both the in-flight panel and the rerun replay. Lifecycle covers run_started → running → (shipped / degraded / interrupted / failed), with panelist_open / dim / must_fix / close / round_end events building per-round CritiquePanelistView entries as they arrive. Defensive behaviour that surfaced while writing the spec tests: - Terminal phases (shipped / degraded / interrupted / failed) are sticky against further lifecycle events for the same run, except for parser_warning which can land late and is recorded in a side channel without changing phase. - A new run_started for a different runId at any time discards the prior state and reboots, so the UI can launch consecutive runs without an explicit reset action. - Events whose runId does not match the active run return the same state reference, so React's useReducer doesn't re-render subscribers on stray traffic. - Round bookkeeping keys by round number rather than "always last", so an out-of-order panelist_dim for round 1 arriving after a round 2 dim does not corrupt the round 2 bucket. Test coverage: 18 cases covering each transition, the runId guard, sticky-terminal behaviour, the out-of-order round invariant, and the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire SSE + replay into the same reducer. 
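The defensive reducer behaviour listed for Phase 7.1 above (sticky terminal phases, late `parser_warning`, reboot on a new `runId`, stable identity for stray events) can be illustrated with a much smaller reducer. This is a sketch, not the project's `CritiqueState` reducer: the event set is cut down to four variants and every type name here is a hypothetical stand-in.

```typescript
type Phase = "idle" | "running" | "shipped" | "degraded" | "interrupted" | "failed";

type SketchEvent =
  | { type: "run_started"; runId: string }
  | { type: "ship"; runId: string }
  | { type: "failed"; runId: string }
  | { type: "parser_warning"; runId: string; kind: string };

interface SketchState {
  phase: Phase;
  runId: string | null;
  warnings: string[]; // side channel, never changes the phase
}

const initialState: SketchState = { phase: "idle", runId: null, warnings: [] };
const TERMINAL = new Set<Phase>(["shipped", "degraded", "interrupted", "failed"]);

function reduce(state: SketchState, ev: SketchEvent): SketchState {
  // A run_started for a different run discards prior state and reboots.
  if (ev.type === "run_started" && ev.runId !== state.runId) {
    return { phase: "running", runId: ev.runId, warnings: [] };
  }
  // Stray traffic for another run returns the same reference (no re-render).
  if (ev.runId !== state.runId) return state;
  // parser_warning may land late; record it without touching the phase.
  if (ev.type === "parser_warning") {
    return { ...state, warnings: [...state.warnings, ev.kind] };
  }
  // Terminal phases are sticky against further lifecycle events.
  if (TERMINAL.has(state.phase)) return state;
  switch (ev.type) {
    case "ship":
      return { ...state, phase: "shipped" };
    case "failed":
      return { ...state, phase: "failed" };
    default:
      return state; // duplicate run_started for the active run
  }
}
```

Returning the identical `state` object on ignored events is what lets `useReducer` skip re-rendering subscribers.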
* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2) createCritiqueEventsConnection is a pure connection manager that mirrors apps/web/src/providers/project-events.ts: opens an EventSource at /api/projects/:id/events, listens for every name in CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent (stripping the critique. prefix and merging the data payload), and hands it to the caller's onEvent. Reconnect uses exponential backoff (1s → 30s) and resets on `ready`; malformed payloads drop with a dev-mode warning rather than tearing the stream. useCritiqueStream wraps the manager in a useReducer that owns the CritiqueState. enabled=false or a null projectId tears down the connection cleanly; switching projectId closes the old connection and opens a fresh one. The returned dispatch lets local UI synthesise actions (e.g. an Esc keypress firing a synthetic interrupted while a kill request is in flight); production traffic comes from the SSE stream. Test coverage: - sse.test.ts (10 cases, node env): subscription set covers every CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire shape back to PanelEvent; malformed JSON is swallowed and does not stop the stream; exponential backoff schedule and ready-reset semantics are pinned with a setTimeout seam; close() cancels pending reconnects and shuts the live source; no-op fallback when EventSource is unavailable. - useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event, reducer driven by synthetic actions, no connection when disabled or projectId is null, clean close on unmount, projectId change reopens cleanly. * feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3) Fetches the per-run NDJSON transcript (one PanelEvent per line), parses every line via the shared isPanelEvent predicate, and dispatches into the same CritiqueState reducer the live SSE stream uses. 
A single reducer means the UI rendering a replay can be identical to the live panel, and a UI mounting both useCritiqueStream and useCritiqueReplay in parallel does not have to reconcile two state shapes. speed knob is `paused | instant | live | { intervalMs: N }`. - instant flushes every event synchronously, useful for opening a finished run already at its terminal state. - intervalMs paces dispatches at a fixed cadence so the reviewer can watch the run unfold. - paused parses the transcript but holds events back until the caller advances speed (consumers can drive a scrubber later). - live is reserved for the future "playback at original cadence" feature, currently treated as instant; replay timestamps are not yet persisted with each event so honest pacing requires a follow-up Phase 7+ task. gunzip seam handles `.ndjson.gz` transcripts via DecompressionStream when present; the production fetch path picks between text and arrayBuffer based on the URL extension. Both seams are injectable so the unit tests don't need to spin up a real network or a real gzip pipeline. Test coverage (8 cases, jsdom env): - Idle status before any URL is provided. - speed=instant flushes the full transcript synchronously to shipped state. - speed={intervalMs:N} paces with the setTimeout seam, reaching done after the last tick. - speed=paused leaves status=playing with no dispatches. - Empty transcript reports done with state still idle. - Fetch rejection surfaces an error status with the message. - Malformed NDJSON lines are skipped; valid events around them still land. - .gz transcripts route through the gunzip seam. Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream + replay), all on one branch ready for review. Phases 8+ (Theater components) consume these from this PR. 
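The transcript parsing step of Phase 7.3 (one event per NDJSON line, malformed lines skipped, valid neighbours kept) might look roughly like the sketch below. `parseTranscript` and `isPanelEventLike` are hypothetical: the real code uses the shared `isPanelEvent` predicate, whose exact checks are not shown in the commit message.

```typescript
interface PanelEventLike {
  type: string;
  runId: string;
}

// Stand-in for the shared predicate; assumes only type and a non-empty runId.
function isPanelEventLike(value: unknown): value is PanelEventLike {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const type = record["type"];
  const runId = record["runId"];
  return typeof type === "string" && typeof runId === "string" && runId.length > 0;
}

function parseTranscript(ndjson: string): PanelEventLike[] {
  const events: PanelEventLike[] = [];
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines and trailing newline
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (isPanelEventLike(parsed)) events.push(parsed);
    } catch {
      // Malformed line: skip it and keep reading the rest of the transcript.
    }
  }
  return events;
}

const sample = [
  '{"type":"run_started","runId":"run-1"}',
  "{not json",
  '{"type":"ship","runId":"run-1"}',
  "",
].join("\n");

console.log(parseTranscript(sample).map((e) => e.type).join(",")); // run_started,ship
```

An empty transcript yields an empty array, which matches the "reports done with state still idle" case in the test list.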
* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review) Two P1 fixes from lefarcen's review on PR #1307: SSE payload override `sseToPanelEvent` previously spread `data` after the channel-derived `type`, so a payload-provided `type` could override the channel and route a `critique.run_started` frame into the reducer as a `ship` action. Reversed the spread so the channel-derived `type` is authoritative, and revalidated the resulting object through the contracts-level `isPanelEvent` predicate before returning. Frames that fail validation (missing runId, empty runId, unknown type) are dropped, so a malformed or compromised SSE frame can no longer dispatch a wrong-shape action into the reducer. Three new sse.test.ts cases pin the regression: hostile `type:'ship'` in the payload still resolves to `run_started`, missing runId is dropped, empty runId is dropped. Replay pause/resume `useCritiqueReplay` had one big effect keyed on `transcriptUrl` only, so flipping `speed` from `paused` to `instant` never re-fired and the held events sat undispatched. Split into a parse effect (depends on URL, fetches and stores events in state) and a pace effect (depends on parsed-events + speed, owns the cursor + timers). The playback cursor lives in a ref that survives pause/resume cycles, so flipping `paused` -> `instant` flushes from the current position rather than restarting (which would double-dispatch `run_started` and reset the reducer). 
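The spread-order fix in the SSE payload override paragraph above comes down to which property wins in an object spread. A minimal sketch, assuming a simplified decoder: `decodeFrame`, `DecodedEvent`, and the two-entry `KNOWN_TYPES` set are hypothetical, and the real `sseToPanelEvent` validates through `isPanelEvent` instead.

```typescript
interface DecodedEvent {
  type: string;
  runId: string;
  [key: string]: unknown;
}

// Illustrative subset; the real channel list is CRITIQUE_SSE_EVENT_NAMES.
const KNOWN_TYPES = new Set<string>(["run_started", "ship"]);
const PREFIX = "critique.";

function decodeFrame(channel: string, data: Record<string, unknown>): DecodedEvent | null {
  const type = channel.startsWith(PREFIX) ? channel.slice(PREFIX.length) : channel;
  if (!KNOWN_TYPES.has(type)) return null;
  const runId = data["runId"];
  if (typeof runId !== "string" || runId === "") return null;
  // Spread the payload FIRST so the channel-derived `type` is authoritative.
  // Writing { type, ...data } instead would let a payload `type` override it.
  return { ...data, type, runId };
}

const hostile = decodeFrame("critique.run_started", { type: "ship", runId: "run-1" });
console.log(hostile?.type); // run_started
```

Later properties in an object literal overwrite earlier ones with the same key, so placing `type` after `...data` is enough to neutralize a payload-provided `type`.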
Two new useCritiqueReplay.test.tsx cases: - paused-then-instant transitions from `playing` to `done` and reaches the shipped terminal phase - intervalMs paced playback dispatches one event, pauses to drain the next scheduled timer, flips to instant, and confirms the remaining transcript drains exactly once (cursor was preserved) Doc consistency The earlier source comment in useCritiqueReplay.ts claimed `live` "paces by recorded timestamps" while the impl used zero-delay timers and the PR body said it behaves like `instant`. Aligned to reality: `live` currently behaves like `{ intervalMs: 0 }` (events drain on successive microtasks via setTimeoutFn) because transcripts do not yet carry per-event timestamps. Honest timestamp-driven pacing is queued as a Phase 7+ follow-up. Validated: pnpm guard, pnpm --filter @open-design/web typecheck, Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite 96 files / 888 tests. * feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread) * feat(web): Theater PanelistLane component (Phase 8.1) * feat(web): Theater ScoreTicker component (Phase 8.2) * feat(web): Theater RoundDivider component (Phase 8.3) * feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4) * feat(web): Theater TheaterDegraded chip (Phase 8.5) * feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6) * feat(web): Theater TheaterTranscript replay surface (Phase 8.7) * feat(web): Theater TheaterStage top-level container (Phase 8.8) * feat(web): Theater CSS using existing semantic tokens (no hex literals) * feat(web): Theater public exports barrel * fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314) Addresses all 4 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen. State-lifecycle fixes (3 x P2) 1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`). 
Host hooks dispatch it when their gating prop changes so a stale run from a prior project / transcript cannot bleed into the next context. Reset is idempotent on idle (returns the same reference). 2. `useCritiqueStream` dispatches `__reset__` at the top of its connection effect, so a workspace switch from project A (which streamed a critique) to project B clears the reducer before the new EventSource opens. enabled=false also clears. 3. `useCritiqueReplay` dispatches `__reset__` at the top of its parse effect, so transcriptUrl swaps (including swap-to-null after a replay reached `shipped`) lift the reducer back to idle before the new fetch starts. SSE validation (1 x P2) 4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape` check after the cheap `isPanelEvent` predicate. A `critique.ship` frame missing `composite` / `round` / `status` / `artifactRef` is rejected before reaching the reducer, so TheaterCollapsed can no longer crash on `undefined.toFixed(1)`. Every variant's required fields are validated: run_started (protocolVersion, non-empty cast, maxRounds, threshold, scale), panelist_* (round, role, plus variant-specific shape), round_end (round, composite, mustFix, decision in {continue,ship}, reason), ship (round, composite, status, artifactRef.{projectId,artifactId}, summary), degraded (reason, adapter), interrupted (bestRound, composite), failed (cause), parser_warning (kind, position). Reducer correctness (1 x P2) 5. `panelist_open` now materializes the round + an empty panelist view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight the in-progress lane the instant the tag opens. Before this, a stream that emitted only `panelist_open` after `run_started` left `rounds = []` and the UI rendered no current round until a later `panelist_dim` arrived. Polish (3 x P3) 6. Brand role tint swaps from `var(--magenta, var(--accent))` to `var(--purple, var(--accent))`. 
`--purple` is actually defined across the design systems; `--magenta` is not, so Brand was silently falling through to `--accent` and looking identical to Designer. 7. New i18n key `critiqueTheater.interruptedSummary` for the interrupted-collapse copy ("Interrupted at round N, best composite X.X"). Previously the interrupted branch reused `shippedSummary` and the UI read "Shipped at round..." for a run that specifically did not ship. Native value in en + zh-CN; other locales fall back via `...en` spread. 8. `TheaterDegraded` heading id comes from `useId()` instead of a hardcoded `theater-degraded-heading`, so two chips rendered on the same page (chat history with multiple completed runs) keep their aria-labelledby references unambiguous. Tests (15 new cases) - reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data. - sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship. - useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false. - useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped. - TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...". - TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`. 
Validated - pnpm guard clean - pnpm --filter @open-design/web typecheck clean - Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new) - tests/i18n/locales.test.ts 5 of 5 across 18 locales * feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1) * feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2) * fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315) Addresses every blocker from codex, Siri-Ray, and lefarcen. The three state-lifecycle and SSE-validation issues they also flagged inherit fixes from PR #1314's review pass that this branch now sits on top of after rebase. Real daemon kill on Interrupt (P1) - CritiqueTheaterMount now POSTs to /api/projects/:id/critique/:runId/interrupt alongside the optimistic local dispatch. Before this fix, clicking Interrupt only flipped the React state to interrupted while the daemon job kept running. The fetch is best-effort: a 404 (endpoint not wired yet, lands in Phase 15) is swallowed with a dev-mode console.warn so the UI still moves to the collapsed badge. - New fetchInterrupt test seam lets RTL assert on the URL / method and simulate the "daemon not ready yet" path. Two tests pin both: the happy URL proj-42/critique/run-abc/interrupt POSTs, and a rejected fetch still flips the UI. interruptPending reset on new run (P2) - A ref-backed effect compares the current runId against the last one we saw; when it changes, interruptPending is cleared. A user who interrupts run-1 and then triggers run-2 from the same mount now gets a fresh, enabled kill button instead of one stuck in "Interrupting…". Pinned by a new mount test. Escape keybind scope (P2) - InterruptButton now checks the keydown target. Escape inside an input, textarea, select, or contenteditable element is ignored (and any ancestor of those via closest() is treated the same way). Body-level focus still fires the keybind so the Theater area's affordance keeps working. 
Four new tests cover textarea, input, contenteditable, and the body-focus positive case. userFacingName i18n key (P2) - The spec at specs/current/critique-theater.md:6 mandates a single critiqueTheater.userFacingName key so the "Design Jury" label can be renamed without touching code. Phase 8 introduced critiqueTheater.title by mistake; renamed across types.ts, en.ts, zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer TheaterStage.tsx. The locale alignment test stays green. Validated - pnpm guard clean - pnpm --filter @open-design/web typecheck clean - Theater suite: 14 files, 112 tests (was 101 before, +11 new for the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope; the rest were already in #1314's review fix). - tests/i18n/locales.test.ts 5 of 5 across 18 locales. * feat(daemon): adapter-degraded registry with TTL (Phase 10.1) In-memory registry recording adapters that produced malformed or oversize transcripts so the orchestrator can skip them for a TTL window (default 24h) instead of cycling through known-bad providers on every run. Records carry reason (malformed_block | oversize_block | missing_artifact), source label, and expiresAt. The test-only clock seam lets the suite advance time deterministically and prove that an expired entry stops counting as degraded without anyone calling clearDegraded. 7/7 vitest cases green. * feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2) Two test-only adapters that read the existing v1 transcript fixtures (happy-3-rounds and malformed-unbalanced) and replay them as either a full string or a 512-byte chunked stream. The chunked form is what the conformance harness uses to prove the parser holds together when the transcript arrives in arbitrary network slices, not as one buffered blob. 
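The Phase 10.1 registry described above (TTL window, reason / source / expiresAt records, injectable clock) can be sketched in a few lines. The factory name `createDegradedRegistry` and the method names other than `clearDegraded` are assumptions for illustration; only the reasons, the 24h default, and the clock seam idea come from the commit message.

```typescript
type DegradedReason = "malformed_block" | "oversize_block" | "missing_artifact";

interface DegradedRecord {
  reason: DegradedReason;
  source: string;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 24h

// `now` is the clock seam: tests pass a fake clock, production uses Date.now.
function createDegradedRegistry(now: () => number = Date.now, ttlMs: number = DEFAULT_TTL_MS) {
  const records = new Map<string, DegradedRecord>();
  return {
    markDegraded(adapter: string, reason: DegradedReason, source: string): void {
      records.set(adapter, { reason, source, expiresAt: now() + ttlMs });
    },
    // Expiry is checked on read, so nobody has to call clearDegraded.
    isDegraded(adapter: string): boolean {
      const record = records.get(adapter);
      return record !== undefined && now() < record.expiresAt;
    },
    clearDegraded(adapter: string): void {
      records.delete(adapter);
    },
  };
}

let fakeNow = 0;
const registry = createDegradedRegistry(() => fakeNow);
registry.markDegraded("adapter-x", "malformed_block", "conformance");
console.log(registry.isDegraded("adapter-x")); // true
fakeNow = DEFAULT_TTL_MS;                       // advance exactly one TTL window
console.log(registry.isDegraded("adapter-x")); // false
```

Checking expiry lazily on read keeps the registry free of timers, which is what makes the fake clock sufficient for deterministic tests.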
* feat(daemon): adapter conformance harness (Phase 10.3) runAdapterConformance pulls a transcript through the same parseCritiqueStream pipeline the orchestrator uses and classifies the outcome as shipped, degraded, or failed. On a degraded outcome it forwards the matched reason to the adapter-degraded registry, so a single nightly conformance run is what populates the skip list rather than the orchestrator learning each adapter is broken at request time. 5/5 vitest cases green covering shipped, malformed degraded, oversize degraded, no-ship failure, and the harness-thrown failure path. * test(e2e): Critique Theater Playwright suite (Phase 11) Six tests, one viewport per visual case, deterministic SSE fixtures stubbed via page.route(). Adds the suite to test:ui:extended so the existing extended-UI lane picks it up. Coverage: 1. Happy path: a single mounted theater plays the full fixture (1 run_started, 5 panelists open / dim / must_fix / close, 1 round_end, 1 ship) and ends on the score badge. 2. Interrupt mid-run: the panelist that is open at the time the interrupt button is clicked closes with an interrupted marker and the transcript freezes there. 3. Visual regression at 375x720 mobile. 4. Visual regression at 768x1024 tablet. 5. Visual regression at 1280x800 desktop. 6. A11y role tree: the theater region exposes a labelled landmark, each panelist lane is a group with an accessible name, the score is a status live region. All SSE traffic is stubbed by page.route so the suite runs in CI without a daemon. The toggle is seeded via localStorage by bootAppWithCritiqueEnabled so the gate behaves as if Settings flipped it on. typecheck clean; playwright --list reports 6. * test(web): reducer p99 bench at 10k iterations (Phase 13.1) Locks the documented 2ms budget for the Critique Theater reducer on a representative SSE script (27 actions, one full happy run) behind a regression gate. 
Asserts p99 stays under 4ms (2x the documented budget) so CI runners with a noisy neighbour do not flake while a real regression to 20ms or 200ms still trips. The bench is a vitest case rather than a bare microbenchmark so it runs in the same CI lane as every other web test and does not need a parallel runner. * test(web): critique surface coverage walker (Phase 13.2) Walks the public critique surface (11 SSE event names, 5 panelist roles, 6 lifecycle phases, 9 named i18n keys) and asserts each named symbol appears in both the src corpus and the test corpus. The walker is the gate that catches a rename in one half of the codebase without a matching update in the other half: a future PR that drops 'panelist_must_fix' from the reducer without also removing its test reference fails this suite. 62 assertions, one per symbol per corpus. * docs: Critique Theater user guide (Phase 14.1) Seven sections aimed at end users (not contributors): 1. What is Design Jury 2. How it works (the five panelists, auto-converging rounds, the composite formula) 3. Settings (the M1 toggle and what it does) 4. Reading the score badge 5. Replay surface 6. Troubleshooting (degraded, interrupted, failed) 7. FAQ The composite formula is documented as designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2 because anyone trying to reverse-engineer the score is going to search for those weights and the docs are the place they should land first. * docs(daemon): critique module AGENTS map (Phase 14.2) Daemon-side wayfinder for the apps/daemon/src/critique directory. Tables every file, what owns what invariant, and the 'when you change anything here' guide so a future contributor does not have to reverse-engineer the rollout resolver before adding a new SSE event. * docs(web): Theater module AGENTS map (Phase 14.3) Web-side mirror of the daemon AGENTS map. Same file table, same invariants section, same change-impact guide, sized to the Theater component package. 
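The composite formula quoted in the user-guide commit above can be written out as a short sketch. The weights are the ones stated in the commit message; the names `COMPOSITE_WEIGHTS` and `compositeScore`, and the assumption that every panelist reports a numeric score on a common scale, are illustrative only.

```typescript
type PanelistRole = "designer" | "critic" | "brand" | "a11y" | "copy";

// Weights as documented: designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2
const COMPOSITE_WEIGHTS: Record<PanelistRole, number> = {
  designer: 0,
  critic: 0.4,
  brand: 0.2,
  a11y: 0.2,
  copy: 0.2,
};

// Weighted sum of per-role scores (hypothetical helper, not the project's API).
function compositeScore(scores: Record<PanelistRole, number>): number {
  let total = 0;
  for (const role of Object.keys(COMPOSITE_WEIGHTS) as PanelistRole[]) {
    total += COMPOSITE_WEIGHTS[role] * scores[role];
  }
  return total;
}
```

Because the designer weight is 0, the designer's own score does not move the composite under this formula; the remaining four weights sum to 1.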
* feat(daemon): rollout flag resolver (Phase 15.1) Single decision point every caller consults to know whether the orchestrator should wire the critique pipeline for a given run. Priority: 1. Skill-level policy (required wins, opt-out wins inversely) 2. Per-project override from the Settings toggle 3. OD_CRITIQUE_ENABLED env override 4. Rollout phase default M0 dark-launch false M1 settings only false (toggle is off until the user flips it) M2 per-skill true if skill opted in M3 global default true OD_CRITIQUE_ROLLOUT_PHASE parser defaults to M0 on unknown input so a fresh install never surprises a user with the feature on. 10/10 vitest cases green covering every cell of the matrix. * feat(web): Settings toggle hook for Critique Theater (Phase 15.2) React hook that reads critiqueTheaterEnabled from the existing open-design:config localStorage blob and stays in sync via: - the platform storage event (cross-tab) - a open-design:critique-theater-toggle CustomEvent (same-tab) Same-tab event is the one that fires when the Settings panel saves in the current window: the toggle and every mounted theater update without a page reload. setCritiqueTheaterEnabled(next) is the imperative setter the Settings panel calls. It preserves the rest of the stored config (mode, apiKey, etc.) and dispatches the same-tab event after the localStorage write. The web hook reflects what the user toggled; the daemon-side isCritiqueEnabled is the final routing authority (project override, env, rollout phase). When they disagree, the daemon wins for backend gating and the web reflects the toggle state. 6/6 vitest cases green covering first read, stored read, same-tab event flip, config preservation, corrupted JSON tolerance, and cross-tab storage event. 
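The four-tier priority in the rollout flag resolver commit above can be sketched as a pure function. This is an illustration under stated assumptions, not the daemon's `isCritiqueEnabled`: the input shape, the `null` "no opinion" convention for overrides, the names `isCritiqueEnabledSketch` and `parseRolloutPhaseSketch`, and the exact-match parsing of the phase string are all hypothetical.

```typescript
type SkillPolicy = "required" | "opt-in" | "opt-out" | null;
type RolloutPhase = "M0" | "M1" | "M2" | "M3";

interface ResolverInput {
  skillPolicy: SkillPolicy; // null means the skill has no opinion
  projectOverride: boolean | null; // Settings toggle, null when unset
  envOverride: boolean | null; // OD_CRITIQUE_ENABLED, null when unset
  phase: RolloutPhase; // OD_CRITIQUE_ROLLOUT_PHASE
}

// Unknown input falls back to M0 so a fresh install never gets the feature on.
function parseRolloutPhaseSketch(raw: string | undefined): RolloutPhase {
  return raw === "M1" || raw === "M2" || raw === "M3" ? raw : "M0";
}

function isCritiqueEnabledSketch(input: ResolverInput): boolean {
  // 1. Skill-level policy: required forces on, opt-out forces off.
  if (input.skillPolicy === "required") return true;
  if (input.skillPolicy === "opt-out") return false;
  // 2. Per-project override from the Settings toggle.
  if (input.projectOverride !== null) return input.projectOverride;
  // 3. Environment override.
  if (input.envOverride !== null) return input.envOverride;
  // 4. Rollout phase default.
  if (input.phase === "M3") return true;
  if (input.phase === "M2") return input.skillPolicy === "opt-in";
  return false; // M0 dark-launch and M1 settings-only both default to off
}
```

Keeping the decision in one pure function makes it straightforward to cover every cell of the matrix with table-style tests.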
* test(web): Phase 15 toggle hook failure-mode coverage (PR #1320) lefarcen P2 on PR #1320 flagged that the PR body claimed safe behavior for disabled localStorage, non-object JSON, and missing CustomEvent shim, but the suite only covered corrupt JSON plus happy-path storage events. Added four failure-mode tests so the swallowed errors are not silently traded for a throw in a future refactor: 1. Returns false on a stored JSON value that parses to an array (non-object). Catches a regression where the guard treats anything truthy as a config blob. 2. Returns false on a stored JSON value of literal 'null'. typeof null === 'object' in JS, so the guard has to check null explicitly; this test pins that check. 3. Returns false when localStorage.getItem throws (private mode / disabled storage / SecurityError). The hook must swallow and return false so the rest of the app keeps rendering. 4. setCritiqueTheaterEnabled still dispatches the same-tab CustomEvent when localStorage.setItem throws (quota exceeded / disabled storage). The dispatch path is the in-session broadcast that keeps every mounted hook coherent even when persistence is unavailable; verified by mounting two probes and asserting both flip after the setter is called with a throwing setItem. 10/10 vitest cases green (6 existing + 4 new). * fix(web): honor CustomEvent payload in toggle hook listener (PR #1320) Both Siri-Ray (blocking) and lefarcen (P2 new) caught the same real bug in the failure-mode test I added in `affcdd27`: the test asserts the in-session UI flips when localStorage.setItem throws, but the CustomEvent listener was ignoring the event's typed detail and just calling readToggle(). Under a throwing setItem the localStorage value is stale (or absent), so the listener would see the OLD value and the test would fail (or worse, the production claim 'in-session event keeps mounts coherent' was hollow). 
Fixed the hook, not the test: the listener now reads event.detail.enabled when it is a boolean, falling back to readToggle() only for malformed events or for cross-tab storage events (which do not carry a typed payload). The setter already dispatched the detail; the listener just was not consuming it. Test changes: - The existing 'setItem throws' test now asserts the right behavior for the right reason. Updated the inline comment to say the listener reads from detail, not localStorage. - New test 'falls back to readToggle when the CustomEvent carries no usable detail' pins the fallback path: a malformed dispatcher (no detail, or detail.enabled not a boolean) degrades cleanly instead of throwing or being silently ignored. 11 / 11 vitest cases green (10 prior + 1 new fallback). * feat(daemon): route critique spawn-path eligibility through the rollout resolver The wireup edit Phase 10 and Phase 15 carved out: today server.ts gates the critique pipeline on critiqueCfg.enabled, which is just the OD_CRITIQUE_ENABLED env var. After this commit it gates on isCritiqueEnabled(...) from the Phase 15 resolver, so the full priority matrix is live: 1. Per-skill od.critique.policy veto (opt-out / required) 2. Per-project override (M1 Settings toggle, written through the existing Phase 6 settings endpoint) 3. OD_CRITIQUE_ENABLED env override (power-user lane / CI fixtures) 4. OD_CRITIQUE_ROLLOUT_PHASE default M0 dark-launch false M1 settings only false M2 per-skill only when skillPolicy === 'opt-in' M3 global default true Default behaviour on a fresh install is unchanged: the resolver returns false at M0 without an env override or a project override, so prod traffic falls through to the legacy single-pass path exactly the way it did before. Inputs threaded today: phase from OD_CRITIQUE_ROLLOUT_PHASE, envOverride from OD_CRITIQUE_ENABLED. 
skillPolicy and projectOverride are passed as null for the v1 cutover; the daemon-side handler that round-trips critiqueTheaterEnabled on the project settings row and the od.critique.policy frontmatter resolver land as the next two commits in this branch. The three call sites that used critiqueCfg.enabled (the brand-thread guard, the skill-thread guard, the top-line critiqueShouldRun compound) now read from a single locally-scoped critiqueEnabledForRun boolean, so the eligibility check is computed exactly once per spawn and the prompt composer + orchestrator stay in lockstep the way the existing comment already promised. Tests still green: daemon vitest 22 / 22 across rollout + conformance + adapter-degraded. Daemon typecheck clean. * feat(web): mount CritiqueTheaterMount in ProjectView The web counterpart of the daemon wireup. ProjectView now renders <CritiqueTheaterMount projectId={project.id} enabled={...} /> as a sibling of <AppChromeHeader> inside the top-level <div className="app">. The mount is the drop-in from the Phase 9 stack: it owns the SSE subscription, the kill-request handshake, and the phase-aware swap from the live <TheaterStage> to the collapsed badge once a run settles. The mount returns null until the daemon emits a critique.run_started for the active project, so the visual surface is byte-for-byte unchanged for users who have not opted in. Enabled wiring: useCritiqueTheaterEnabled() reads the M1 Settings toggle from the existing open-design:config localStorage blob and stays in sync with both the platform storage event (cross-tab) and the same-tab open-design:critique-theater-toggle CustomEvent the Phase 15 setter dispatches. The hook honors the event payload directly so a private-mode browser that cannot persist the toggle still updates the in-session UI correctly. The daemon-side gate (isCritiqueEnabled in apps/daemon/src/server.ts) remains the authority for whether a run is actually wired through the critique pipeline. 
This hook only governs whether the web layer renders the resulting SSE stream when the daemon emits one. The two-layer gate is intentional: an integrator embedding the Theater in a custom UI can flip the web visibility independent of the daemon's routing decision, and a daemon-side env override flips backend gating without touching the web's localStorage. Tests still green: web Theater suite 181 / 181 across 16 files. Web typecheck clean. * feat(daemon): resolve od.critique.policy frontmatter at the spawn site The next step in the wireup branch's ladder: replace the placeholder `skillPolicy: null` with the actual value parsed from the active skill's SKILL.md frontmatter. Three small edits, one new field on a public type: 1. SkillInfo gains a `critiquePolicy: SkillCritiquePolicy` field carrying the parsed `od.critique.policy` token (required / opt-in / opt-out / null). The field is null when the skill has no opinion, which lets the lower-priority resolver tiers (projectOverride, envOverride, phase default) decide. 2. listSkills() populates the new field via a small `normalizeCritiquePolicy` helper that tolerates the YAML scalar's casing and trims whitespace. Unknown tokens collapse to null so a typo in SKILL.md cannot accidentally force the panel on or off; it just falls through. Derived example cards inherit the parent's policy. 3. server.ts captures `skill.critiquePolicy` into a hoisted `skillCritiquePolicy` variable inside the existing skill-load block, then threads it into the isCritiqueEnabled call as the skillPolicy input. The hoisting keeps the variable in scope at the resolver call site without restructuring the spawn handler. After this commit, the priority matrix the rollout resolver was designed for is live for its top tier. The previous commit wired env + phase; this one wires skill. The projectOverride input remains null pending the next commit that extends the Phase 6 settings endpoint. Daemon vitest: 10 / 10 rollout cases pass against the new wiring. 
Daemon typecheck: clean. * feat(daemon): feed projectOverride into the rollout resolver from project metadata Replaces the placeholder `projectOverride: null` in the spawn handler with the actual value the Settings panel writes onto the project's metadata blob: `critiqueTheaterEnabled?: boolean`. The read is defensive at the boundary: the metadata object is typed loosely (it round-trips through SQLite as a free-form JSON blob), so the spawn handler narrows to `boolean` and falls through to `null` for any other shape. A missing key, a malformed value, or a project that has never visited Settings collapses to `null`, which is exactly the resolver's "no opinion, fall through to env / phase" signal. The `critique` frontmatter slot also gets typed on the SkillFrontmatter shape so the `od.critique.policy` chain the previous commit introduced no longer needs a bracket-access cast. Same pattern as the existing `craft`, `preview`, and `design_system` nested-record slots. After this commit, every tier of the rollout resolver's priority matrix is wired: 1. skillPolicy (from SKILL.md od.critique.policy) 2. projectOverride (from project metadata critiqueTheaterEnabled) 3. envOverride (from OD_CRITIQUE_ENABLED) 4. rollout phase (from OD_CRITIQUE_ROLLOUT_PHASE) The write path for projectOverride still flows through the existing project-update handler the Settings panel already uses to persist project metadata; no new endpoint is needed. The Settings UI button that calls setCritiqueTheaterEnabled and posts the new field is the next commit on this branch. Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases still green against the new wiring. * fix(daemon): forward critique events to project sinks + align composer gate (PR #1338) Two codex review items addressed in one commit since they share the same root cause (resolver-enabled run hits a transport / prompt contract that was still env-gated): P1 (transport mismatch). 
The daemon emits critique.* SSE frames through critiqueBus -> design.runs.emit, which fans out on /api/runs/:runId/events. The web CritiqueTheaterMount subscribes to /api/projects/:projectId/events (it's project-scoped, not run- scoped, because the mount lives at the project workspace and follows the user across runs). Result: in production the mount never sees a real frame and the e2e tests' stubbed routes hide the mismatch. Fixed by extending critiqueBus.emit to fan out to BOTH sinks: the existing runs.emit transport, AND the per-project event-sinks map. The project-events route emits via sse.send(payload.type, payload), so we pack the SSE channel name onto payload.type and let the sink push the right channel. The web sseToPanelEvent overwrites type from the channel name on the way back into a PanelEvent, so the round-trip stays correct. P2 (prompt gate misalignment). composeSystemPrompt reads cfg.enabled to decide whether to append the panel addendum, but critiqueCfg.enabled is loaded from OD_CRITIQUE_ENABLED only. A run the resolver enabled via phase / project / skill (env unset) would have critiqueShouldRun = true while critiqueCfg.enabled remained false, dropping the panel prompt while still routing through runOrchestrator -> parser waits for tags that never arrive -> run degrades. Fixed by passing a derived config { ...critiqueCfg, enabled: true } to the composer when critiqueShouldRun is true. The composer's own gate now agrees with the resolver decision on every input the spec defines. Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases still green against the new wiring. * fix: address PerishCode P1 + P2 follow-ups on PR #1338 Two follow-up items PerishCode flagged on the activation PR. Non-blocking but both are real: 1. Phase 11 e2e suite was wired into test:ui:extended but lands the user on '/' (home route) where ProjectView (and therefore CritiqueTheaterMount) is never rendered. 
With the suite as written, every assertion would time out the first time the lane runs in CI, contradicting the PR body's claim that the suite stays parked behind test.describe.fixme. The state diverged from my earlier Phase 11 work because the merge from main on commit `4ab719c6` brought in #1307's squash-merged version of the e2e file (the pre-fixme shape). Re-applied test.describe.fixme to the describe block plus removed ui/critique-theater.test.ts from the test:ui:extended script in e2e/package.json. Added a file-header docblock explaining what the follow-up commit needs to do: replace goto('/') with /projects/:id navigation similar to app-design-files.test.ts, split the SSE fixture into a live prefix and terminal suffix (Codex P2 on PR #1320), and commit the first PNG baselines. 2. bestRoundOf in CritiqueTheaterMount returned the LAST round with a numeric composite, not the round with the HIGHEST composite, while bestCompositeOf correctly returned the max. A run that closed round 1 at 8.5 and round 2 at 6.0 would dispatch interrupted { bestRound: 2, composite: 8.5 } on a user-clicked interrupt. Folded the two helpers into a single bestRoundAndComposite that walks state.rounds once and returns the matching pair so the two values cannot drift. The onInterrupt callback now destructures from one helper instead of two independent reads. Falls back to (state.activeRound, 0) when no round has closed with a composite yet. Web typecheck: clean. CritiqueTheaterMount.test.tsx: 7 / 7 cases still green against the new helper. * fix: wire M1 project override end-to-end + correct deferred-surface doc claims (PR #1338) Three lefarcen P2s on the latest review pass, all real: 1. M1 project override was half-wired: the daemon read metadata.critiqueTheaterEnabled but the web setter only wrote localStorage. 
A user opt-in would render the Theater on the web (localStorage was set) while the daemon resolved projectOverride=null and skipped critique unless env / phase already permitted. Two halves talking past each other. Extended setCritiqueTheaterEnabled to accept an optional { projectId, fetchProjectSettings } options bag. When a projectId is supplied, the setter ALSO sends a PATCH /api/projects/:id with { metadata: { critiqueTheaterEnabled } } so the daemon's spawn-time resolver picks the same value up on the next generation. The existing project-routes endpoint already accepts arbitrary metadata patches, so no new endpoint is needed. The local write + the CustomEvent dispatch still fire before the PATCH, so a network failure does not unwind the in-session UI flip. Three new vitest cases pin the new path: PATCHes when projectId is provided, skips when it is not, swallows a rejected PATCH so the in-session UI still flips. 2. Rollout docs (docs/critique-theater.md section 3) claimed the Settings toggle persists into the daemon settings store, but the previous implementation only had a localStorage reader / writer plus a daemon read of project metadata, with no round-trip. Rewrote the section to lead with the four-tier resolver (skill policy / project override / env / phase), document that the setter now round-trips via the existing PATCH endpoint when given a projectId, and call out the Settings panel UI control as a deliberate follow-up. 3. Troubleshooting table pointed users at /api/metrics/critique (Phase 12, deferred) and 'od adapters clear-degraded <id>' (CLI wrapper that does not exist). Replaced the metrics reference with the local conformance harness command (pnpm --filter @open-design/daemon vitest run tests/critique-conformance.test.ts) that ships today, with a note that the Phase 12 dashboard surfaces this status as a series once that PR lands. 
Replaced the CLI command with the programmatic clearDegraded() helper that exists today and flagged the CLI wrapper as planned follow-up. Web typecheck: clean. Toggle hook tests: 14 / 14 green (11 existing + 3 new for the round-trip path). * test(web): multi-round interrupt regression for bestRoundAndComposite (PR #1338) lefarcen P3 follow-up to the previous bestRoundAndComposite fix: the existing CritiqueTheaterMount.test.tsx interrupt cases only exercised a single-round state, so a future refactor back to two independent helpers wouldn't be caught by the test suite even though it'd reintroduce the round / composite drift bug. Added a regression case that: 1. Drives the reducer through two complete rounds with the full 5-role cast closing at distinct composites: round 1 at 8.5, round 2 at 6.0 (the high-composite round is NOT the most recent one). 2. Clicks Interrupt + waits for the daemon ack via the test seam fetcher returning 204. 3. Asserts the collapsed badge displays "round 1" (the correct best-composite round), and queryByText for "round 2 ... 8.5" returns null (the buggy pairing would have produced that string). The bestRoundAndComposite helper walks state.rounds in one pass and returns the matching pair, so the round number and the composite cannot drift apart. This test locks the fix in: a refactor that splits the helpers back into independent walks will be caught here. 8 / 8 vitest cases green on the file. * fix(web): read-merge-write the project metadata in setCritiqueTheaterEnabled (PerishCode P2 on PR #1338) The previous round-trip sent { metadata: { critiqueTheaterEnabled: next } } as the entire PATCH body. The daemon's project-routes handler only re-stamps three immutable fields (baseDir, importedFrom, fromTrustedPicker) before calling updateProject(db, id, patch), which then does a shallow { ...existing, ...patch } in apps/daemon/ src/db.ts. 
So patch.metadata replaces the row's metadata wholesale, dropping kind, templateId, linkedDirs, and every other field the rest of the app reads. No in-tree caller passes projectId today (only vitest cases), so the bug had not surfaced yet. But the surface is documented in docs/critique-theater.md section 3 and the function's own JSDoc as the M1 round-trip path, so it would have shipped as a latent footgun for the next integrator: a Settings UI follow-up, or any third party that wires the setter into a project-aware surface. Fix: read-merge-write rather than a bare patch. - GET /api/projects/:id to read the row's current metadata. - Spread that metadata into the PATCH body and overlay critiqueTheaterEnabled: next on top, mirroring the partial-metadata pattern already used in ChatComposer.tsx for linkedDirs. - PATCH the merged object. Failure handling: - GET fails: skip the PATCH entirely. We cannot construct a safe merged body without the current state, and a bare patch would wipe other metadata. The in-session CustomEvent fired earlier in the setter still keeps every mounted hook consistent; the next save retries the round-trip. - PATCH fails: log in dev. The in-session UI is already correct via the CustomEvent. Tests (TDD, red-first): - 'GETs the project then PATCHes with merged metadata when a projectId is supplied': stubs a GET that returns { kind: 'template', templateId: 'modern-blog', linkedDirs: [...] } and asserts the PATCH body equals the merge plus the toggle. - 'PATCHes with just the toggle when the project has no prior metadata': stubs a GET that returns no metadata block. - 'skips the PATCH (does not stomp metadata) when the prefetch GET fails': stubs a rejecting GET and asserts only the GET fires. - 'swallows a rejected PATCH after a successful prefetch': stubs a successful GET and a rejecting PATCH; asserts the in-session UI still flips via the CustomEvent. 
Doc updated on the setter's JSDoc to describe the new three-step flow (localStorage, CustomEvent, read-merge-write PATCH) and the two failure modes. Verified: - pnpm --filter @open-design/web typecheck clean. - pnpm --filter @open-design/web test: 111 files / 1055 tests green (was 1052, +3 from the new merge-flow cases). * fix(web): restore wait-for-daemon-ack pattern on Theater interrupt Same regression as flagged on PR #1316 post-main-merge: the optimistic local dispatch fired before the POST resolved, so a daemon 404 / 409 still terminalized the UI and the real SSE terminal event got ignored by the sticky interrupted phase. Snapshot runId / bestRound / composite at click time, dispatch interrupted only on res.ok, clear interruptPending on rejection or non-2xx so the user can retry. Tests cover rejection + 404 leaving the run on the live stage; the 204 path waits for the ack. * fix(test): add projectKind prop to FileViewer deck render after v0.7.0 merge * fix(daemon): address PerishCode P3 trio on PR #1338 (emit helper reuse + spawn-input coverage + restored docs) --------- Co-authored-by: Nagendhra <nagendhra405@gmail.com>	2026-05-14 20:37:06 +08:00
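The `bestRoundAndComposite` fix described in the commit above (one pass over the rounds, returning the round number and composite as a matching pair) might look roughly like this. The `RoundSketch` and `StateSketch` shapes are assumptions made for illustration; only the helper's name, the single-walk behaviour, and the `(state.activeRound, 0)` fallback come from the commit message.

```typescript
interface RoundSketch {
  round: number;
  composite?: number; // absent until the round closes with a score
}

interface StateSketch {
  activeRound: number;
  rounds: RoundSketch[];
}

interface BestPair {
  bestRound: number;
  composite: number;
}

// Walks the rounds once so the round number and the composite cannot drift apart.
function bestRoundAndComposite(state: StateSketch): BestPair {
  let best: BestPair | null = null;
  for (const r of state.rounds) {
    if (typeof r.composite !== "number") continue;
    if (best === null || r.composite > best.composite) {
      best = { bestRound: r.round, composite: r.composite };
    }
  }
  // No round has closed with a composite yet: fall back to the active round and 0.
  return best ?? { bestRound: state.activeRound, composite: 0 };
}
```

With round 1 closing at 8.5 and round 2 at 6.0, this returns round 1 paired with 8.5 rather than the mismatched pair the two independent helpers produced.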
sakshyasinha	c4a67a7b3e	Fix Kimi CLI icon contrast in light mode (#1667 ) * fix(web): improve Kimi CLI icon contrast * fix(web): render Kimi icon via theme-aware CSS mask Move Kimi to the MONO_ICONS set so it renders through CSS mask with currentColor adaptation, making it legible in both light and dark themes instead of baking a single dark fill that fails on dark backgrounds. * fix(web): adjust Kimi icon secondary mark for dual-theme contrast Keep Kimi as a baked two-tone asset: blue accent (#1783ff) for brand identity, mid-tone gray (#666666) secondary mark for acceptable contrast on both light and dark card surfaces. Revert from mask path to preserve the blue branding. * fix(web): correct corrupted Kimi SVG and strengthen asset validation test Remove extraneous PR discussion text that was accidentally included in the SVG file. Strengthen the test to validate the bundled asset is valid SVG with the expected fills (blue accent + gray secondary mark), catching asset corruption that would otherwise go undetected.	2026-05-14 20:32:52 +08:00
sukumarp2022	9218fd649e	feat(ui): add copy to clipboard functionality for user messages with … (#1669 ) * feat(ui): add copy to clipboard functionality for user messages with localization support * fix(web): use setTimeout instead of window.setTimeout for correct Timeout type * docs: add copy prompt button screenshot for PR #1669 * docs: add copy button hover screenshot for PR #1669 * docs: add copy button copied state screenshot for PR #1669 * fix(ui): reset button border/background on copy prompt button The .user-copy-btn inherited border and background from the base button CSS, rendering as a bordered gray box instead of a clean icon overlay. This was especially visible in the Electron desktop app. Add border: none and background: none to the button, and a subtle hover background for feedback.	2026-05-14 20:19:20 +08:00
lefarcen	4693ddb00d	chore: add issue templates (bug, feature, preview/0.8.0) + chooser config (#1708 ) * chore: add issue template for preview/0.8.0 feedback Adds a guided issue form so community testers of the preview/0.8.0 branch (Skills tab + Automations) can submit structured feedback. The template auto-applies the preview/0.8.0 label, which lets maintainers filter all preview-related reports in one view: https://github.com/nexu-io/open-design/issues?q=is%3Aopen+label%3A%22preview%2F0.8.0%22 * chore: add generic bug-report issue template Pairs with the preview/0.8.0 template added in the previous commit. Until now the repo had no issue templates at all, which meant New Issue opened a blank textarea by default. The bug-report template: - Pre-applies the 'bug' label - Guides users through repro steps, version, platform, logs - Includes a callout pointing preview/0.8.0 testers to the dedicated feedback template so the two flows stay separate * chore: add feature-request template + chooser config Rounds out the issue-template basics: - feature-request.yml — 'what problem are you trying to solve' framing, willing-to-contribute dropdown so maintainers can route PRs - config.yml — disables blank-issue entry, redirects Q&A / Ideas / Show-and-tell / general chat to Discussions, points preview/0.8.0 reporters at the dedicated template After merge, the chooser at /issues/new/choose will be: Template 1. 🐛 Bug report 2. 💡 Feature request 3. 🧪 Preview 0.8.0 feedback Contact → Preview 0.8.0 feedback (dup, easy-access) → Ask a question (Discussions Q&A) → Discuss an idea (Discussions Ideas) → Show what you've made (Discussions Show-and-tell) → General discussion (Discussions General)	2026-05-14 20:13:36 +08:00
정수현	63baff5222	fix(skills): repoint coreyhaines31 upstream URLs to marketingskills (#1659) The upstream repo github.com/coreyhaines31/skills was renamed to github.com/coreyhaines31/marketingskills, so the four curated marketing-creative stubs (ad-creative, copywriting, marketing-psychology, paywall-upgrade-cro) advertised a source URL that now 404s. Update od.upstream and the body source/open links in all four SKILL.md stubs, plus the matching entries in the seed script so re-seeding stays consistent.	2026-05-14 20:10:14 +08:00
Nicholas-Xiong	0c5f03054e	fix: Add success toast feedback when saving artifact as template (#1671) Fixes #1190. Display a visible success toast after saving an artifact as a template, providing clear confirmation that the action completed successfully. Before: - No visible feedback after clicking Save - Success message only shown in menu button text (not visible after modal closes) - Users uncertain whether the template was saved After: - Success toast appears after saving - Toast displays for 2.2 seconds with the template name - Clear confirmation that the save action completed - Matches the pattern used for comment saves Implementation: - Added templateSavedToast state (similar to commentSavedToast) - Set toast message in handleSaveAsTemplate on success - Render toast using the existing Toast component - Auto-dismiss after 2.2 seconds (consistent with other toasts)	2026-05-14 20:09:23 +08:00
ashleyashli	1e9bcbf20d	fix(contributor-bot): serialize runs to avoid state.json races and duplicate cards (#1707)	2026-05-14 20:01:13 +08:00
Yuhao Chen	397098f231	fix(web): clean up routines form controls (#1609)	2026-05-14 19:57:44 +08:00
PerishFire	3fa12f71be	Add release preview workflow placeholder (#1705)	2026-05-14 18:55:08 +08:00
Yuhao Chen	7633d7a9b0	fix(packaged): forward proxy env to sidecars (#1678)	2026-05-14 17:59:14 +08:00
Nagendhra Madishetti	40766ef1ba	test(web): Critique Theater Phase 13 (reducer p99 bench + surface coverage walker) (#1318 ) * feat(web): pure reducer for Critique Theater states (Phase 7.1) Pure CritiqueState reducer driven by the contracts-level PanelEvent (the same shape both the live SSE stream and the recorded transcript emit), so a single reducer powers both the in-flight panel and the rerun replay. Lifecycle covers run_started → running → (shipped / degraded / interrupted / failed), with panelist_open / dim / must_fix / close / round_end events building per-round CritiquePanelistView entries as they arrive. Defensive behaviour that surfaced while writing the spec tests: - Terminal phases (shipped / degraded / interrupted / failed) are sticky against further lifecycle events for the same run, except for parser_warning which can land late and is recorded in a side channel without changing phase. - A new run_started for a different runId at any time discards the prior state and reboots, so the UI can launch consecutive runs without an explicit reset action. - Events whose runId does not match the active run return the same state reference, so React's useReducer doesn't re-render subscribers on stray traffic. - Round bookkeeping keys by round number rather than "always last", so an out-of-order panelist_dim for round 1 arriving after a round 2 dim does not corrupt the round 2 bucket. Test coverage: 18 cases covering each transition, the runId guard, sticky-terminal behaviour, the out-of-order round invariant, and the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire SSE + replay into the same reducer. 
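The defensive reducer rules listed in the Phase 7.1 message above (sticky terminal phases, the `runId` guard, reboot on a new run, and the stable-identity guarantee) can be illustrated with a heavily reduced sketch. The state shape, the trimmed event union, and the names `reduceSketch` and `initialSketchState` are hypothetical; the real reducer also tracks rounds and panelists. Treating a repeated `run_started` for the same `runId` as a no-op is an assumption that the commit message does not state.

```typescript
type Phase = "idle" | "running" | "shipped" | "degraded" | "interrupted" | "failed";

interface SketchState {
  phase: Phase;
  runId: string | null;
  warnings: string[]; // side channel for late parser warnings
}

type SketchEvent =
  | { type: "run_started"; runId: string }
  | { type: "ship"; runId: string }
  | { type: "interrupted"; runId: string }
  | { type: "parser_warning"; runId: string; message: string };

const TERMINAL_PHASES: ReadonlySet<Phase> = new Set<Phase>([
  "shipped",
  "degraded",
  "interrupted",
  "failed",
]);

const initialSketchState: SketchState = { phase: "idle", runId: null, warnings: [] };

function reduceSketch(state: SketchState, event: SketchEvent): SketchState {
  if (event.type === "run_started") {
    // Assumption: a duplicate start for the active run is ignored.
    if (event.runId === state.runId) return state;
    // A different runId discards the prior state and reboots.
    return { phase: "running", runId: event.runId, warnings: [] };
  }
  // Stray traffic for another run returns the same reference (no re-render).
  if (event.runId !== state.runId) return state;
  // parser_warning may land late; record it without changing phase.
  if (event.type === "parser_warning") {
    return { ...state, warnings: [...state.warnings, event.message] };
  }
  // Terminal phases are sticky against further lifecycle events.
  if (TERMINAL_PHASES.has(state.phase)) return state;
  return { ...state, phase: event.type === "ship" ? "shipped" : "interrupted" };
}
```

Returning the same state reference on ignored events is what lets React's `useReducer` skip re-rendering subscribers.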
* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2) createCritiqueEventsConnection is a pure connection manager that mirrors apps/web/src/providers/project-events.ts: opens an EventSource at /api/projects/:id/events, listens for every name in CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent (stripping the critique. prefix and merging the data payload), and hands it to the caller's onEvent. Reconnect uses exponential backoff (1s → 30s) and resets on `ready`; malformed payloads drop with a dev-mode warning rather than tearing the stream. useCritiqueStream wraps the manager in a useReducer that owns the CritiqueState. enabled=false or a null projectId tears down the connection cleanly; switching projectId closes the old connection and opens a fresh one. The returned dispatch lets local UI synthesise actions (e.g. an Esc keypress firing a synthetic interrupted while a kill request is in flight); production traffic comes from the SSE stream. Test coverage: - sse.test.ts (10 cases, node env): subscription set covers every CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire shape back to PanelEvent; malformed JSON is swallowed and does not stop the stream; exponential backoff schedule and ready-reset semantics are pinned with a setTimeout seam; close() cancels pending reconnects and shuts the live source; no-op fallback when EventSource is unavailable. - useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event, reducer driven by synthetic actions, no connection when disabled or projectId is null, clean close on unmount, projectId change reopens cleanly. * feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3) Fetches the per-run NDJSON transcript (one PanelEvent per line), parses every line via the shared isPanelEvent predicate, and dispatches into the same CritiqueState reducer the live SSE stream uses. 
A single reducer means the UI rendering a replay can be identical to the live panel, and a UI mounting both useCritiqueStream and useCritiqueReplay in parallel does not have to reconcile two state shapes.

speed knob is `paused | instant | live | { intervalMs: N }`.
- instant flushes every event synchronously, useful for opening a finished run already at its terminal state.
- intervalMs paces dispatches at a fixed cadence so the reviewer can watch the run unfold.
- paused parses the transcript but holds events back until the caller advances speed (consumers can drive a scrubber later).
- live is reserved for the future "playback at original cadence" feature, currently treated as instant; replay timestamps are not yet persisted with each event so honest pacing requires a follow-up Phase 7+ task.

gunzip seam handles `.ndjson.gz` transcripts via DecompressionStream when present; the production fetch path picks between text and arrayBuffer based on the URL extension. Both seams are injectable so the unit tests don't need to spin up a real network or a real gzip pipeline.

Test coverage (8 cases, jsdom env):
- Idle status before any URL is provided.
- speed=instant flushes the full transcript synchronously to shipped state.
- speed={intervalMs:N} paces with the setTimeout seam, reaching done after the last tick.
- speed=paused leaves status=playing with no dispatches.
- Empty transcript reports done with state still idle.
- Fetch rejection surfaces an error status with the message.
- Malformed NDJSON lines are skipped; valid events around them still land.
- .gz transcripts route through the gunzip seam.

Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream + replay), all on one branch ready for review. Phases 8+ (Theater components) consume these from this PR.
* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review) Two P1 fixes from lefarcen's review on PR #1307: SSE payload override `sseToPanelEvent` previously spread `data` after the channel-derived `type`, so a payload-provided `type` could override the channel and route a `critique.run_started` frame into the reducer as a `ship` action. Reversed the spread so the channel-derived `type` is authoritative, and revalidated the resulting object through the contracts-level `isPanelEvent` predicate before returning. Frames that fail validation (missing runId, empty runId, unknown type) are dropped, so a malformed or compromised SSE frame can no longer dispatch a wrong-shape action into the reducer. Three new sse.test.ts cases pin the regression: hostile `type:'ship'` in the payload still resolves to `run_started`, missing runId is dropped, empty runId is dropped. Replay pause/resume `useCritiqueReplay` had one big effect keyed on `transcriptUrl` only, so flipping `speed` from `paused` to `instant` never re-fired and the held events sat undispatched. Split into a parse effect (depends on URL, fetches and stores events in state) and a pace effect (depends on parsed-events + speed, owns the cursor + timers). The playback cursor lives in a ref that survives pause/resume cycles, so flipping `paused` -> `instant` flushes from the current position rather than restarting (which would double-dispatch `run_started` and reset the reducer). 
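The SSE payload-override fix described above (the channel-derived `type` wins over anything in the payload, and frames without a usable `runId` are dropped) can be sketched roughly as follows. This is an illustration only: `decodeFrameSketch` and `PanelEventLike` are hypothetical stand-ins, not the real `sseToPanelEvent` or the contracts-level `isPanelEvent`; only the spread ordering, the `critique.` prefix, and the non-empty `runId` rule come from the commit message.

```typescript
// Hypothetical, simplified event shape; the real PanelEvent lives in the contracts package.
type PanelEventLike = { type: string; runId: string; [key: string]: unknown };

const CHANNEL_PREFIX = "critique."; // prefix named in the commit message

// Decode one SSE frame. The payload is spread FIRST so the channel-derived
// `type` always overrides any `type` smuggled into the payload.
function decodeFrameSketch(channel: string, rawData: string): PanelEventLike | null {
  if (!channel.startsWith(CHANNEL_PREFIX)) return null;

  let data: unknown;
  try {
    data = JSON.parse(rawData);
  } catch {
    return null; // malformed JSON is dropped instead of tearing the stream down
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) return null;

  const payload = data as Record<string, unknown>;
  const runId = payload.runId;
  // Stand-in for the contract guard: require a non-empty string runId.
  if (typeof runId !== "string" || runId.length === 0) return null;

  return { ...payload, runId, type: channel.slice(CHANNEL_PREFIX.length) };
}
```

With this ordering, a hostile payload such as `{"type":"ship","runId":"r1"}` arriving on the `critique.run_started` channel still decodes as a `run_started` event.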
Two new useCritiqueReplay.test.tsx cases: - paused-then-instant transitions from `playing` to `done` and reaches the shipped terminal phase - intervalMs paced playback dispatches one event, pauses to drain the next scheduled timer, flips to instant, and confirms the remaining transcript drains exactly once (cursor was preserved) Doc consistency The earlier source comment in useCritiqueReplay.ts claimed `live` "paces by recorded timestamps" while the impl used zero-delay timers and the PR body said it behaves like `instant`. Aligned to reality: `live` currently behaves like `{ intervalMs: 0 }` (events drain on successive microtasks via setTimeoutFn) because transcripts do not yet carry per-event timestamps. Honest timestamp-driven pacing is queued as a Phase 7+ follow-up. Validated: pnpm guard, pnpm --filter @open-design/web typecheck, Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite 96 files / 888 tests. * feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread) * feat(web): Theater PanelistLane component (Phase 8.1) * feat(web): Theater ScoreTicker component (Phase 8.2) * feat(web): Theater RoundDivider component (Phase 8.3) * feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4) * feat(web): Theater TheaterDegraded chip (Phase 8.5) * feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6) * feat(web): Theater TheaterTranscript replay surface (Phase 8.7) * feat(web): Theater TheaterStage top-level container (Phase 8.8) * feat(web): Theater CSS using existing semantic tokens (no hex literals) * feat(web): Theater public exports barrel * fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314) Addresses all 4 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen. State-lifecycle fixes (3 x P2) 1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`). 
Host hooks dispatch it when their gating prop changes so a stale run from a prior project / transcript cannot bleed into the next context. Reset is idempotent on idle (returns the same reference). 2. `useCritiqueStream` dispatches `__reset__` at the top of its connection effect, so a workspace switch from project A (which streamed a critique) to project B clears the reducer before the new EventSource opens. enabled=false also clears. 3. `useCritiqueReplay` dispatches `__reset__` at the top of its parse effect, so transcriptUrl swaps (including swap-to-null after a replay reached `shipped`) lift the reducer back to idle before the new fetch starts. SSE validation (1 x P2) 4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape` check after the cheap `isPanelEvent` predicate. A `critique.ship` frame missing `composite` / `round` / `status` / `artifactRef` is rejected before reaching the reducer, so TheaterCollapsed can no longer crash on `undefined.toFixed(1)`. Every variant's required fields are validated: run_started (protocolVersion, non-empty cast, maxRounds, threshold, scale), panelist_* (round, role, plus variant-specific shape), round_end (round, composite, mustFix, decision in {continue,ship}, reason), ship (round, composite, status, artifactRef.{projectId,artifactId}, summary), degraded (reason, adapter), interrupted (bestRound, composite), failed (cause), parser_warning (kind, position). Reducer correctness (1 x P2) 5. `panelist_open` now materializes the round + an empty panelist view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight the in-progress lane the instant the tag opens. Before this, a stream that emitted only `panelist_open` after `run_started` left `rounds = []` and the UI rendered no current round until a later `panelist_dim` arrived. Polish (3 x P3) 6. Brand role tint swaps from `var(--magenta, var(--accent))` to `var(--purple, var(--accent))`. 
`--purple` is actually defined across the design systems; `--magenta` is not, so Brand was silently falling through to `--accent` and looking identical to Designer. 7. New i18n key `critiqueTheater.interruptedSummary` for the interrupted-collapse copy ("Interrupted at round N, best composite X.X"). Previously the interrupted branch reused `shippedSummary` and the UI read "Shipped at round..." for a run that specifically did not ship. Native value in en + zh-CN; other locales fall back via `...en` spread. 8. `TheaterDegraded` heading id comes from `useId()` instead of a hardcoded `theater-degraded-heading`, so two chips rendered on the same page (chat history with multiple completed runs) keep their aria-labelledby references unambiguous. Tests (15 new cases) - reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data. - sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship. - useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false. - useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped. - TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...". - TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`. 
Validated - pnpm guard clean - pnpm --filter @open-design/web typecheck clean - Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new) - tests/i18n/locales.test.ts 5 of 5 across 18 locales * feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1) * feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2) * fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315) Addresses every blocker from codex, Siri-Ray, and lefarcen. The three state-lifecycle and SSE-validation issues they also flagged inherit fixes from PR #1314's review pass that this branch now sits on top of after rebase. Real daemon kill on Interrupt (P1) - CritiqueTheaterMount now POSTs to /api/projects/:id/critique/:runId/interrupt alongside the optimistic local dispatch. Before this fix, clicking Interrupt only flipped the React state to interrupted while the daemon job kept running. The fetch is best-effort: a 404 (endpoint not wired yet, lands in Phase 15) is swallowed with a dev-mode console.warn so the UI still moves to the collapsed badge. - New fetchInterrupt test seam lets RTL assert on the URL / method and simulate the "daemon not ready yet" path. Two tests pin both: the happy URL proj-42/critique/run-abc/interrupt POSTs, and a rejected fetch still flips the UI. interruptPending reset on new run (P2) - A ref-backed effect compares the current runId against the last one we saw; when it changes, interruptPending is cleared. A user who interrupts run-1 and then triggers run-2 from the same mount now gets a fresh, enabled kill button instead of one stuck in "Interrupting…". Pinned by a new mount test. Escape keybind scope (P2) - InterruptButton now checks the keydown target. Escape inside an input, textarea, select, or contenteditable element is ignored (and any ancestor of those via closest() is treated the same way). Body-level focus still fires the keybind so the Theater area's affordance keeps working. 
Four new tests cover textarea, input, contenteditable, and the body-focus positive case.

userFacingName i18n key (P2)
- The spec at specs/current/critique-theater.md:6 mandates a single critiqueTheater.userFacingName key so the "Design Jury" label can be renamed without touching code. Phase 8 introduced critiqueTheater.title by mistake; renamed across types.ts, en.ts, zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer TheaterStage.tsx. The locale alignment test stays green.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 14 files, 112 tests (was 101 before, +11 new for the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope; the rest were already in #1314's review fix).
- tests/i18n/locales.test.ts 5 of 5 across 18 locales.

* feat(daemon): adapter-degraded registry with TTL (Phase 10.1)

In-memory registry recording adapters that produced malformed or oversize transcripts so the orchestrator can skip them for a TTL window (default 24h) instead of cycling through known-bad providers on every run. Records carry reason (malformed_block | oversize_block | missing_artifact), source label, and expiresAt. The test-only clock seam lets the suite advance time deterministically and prove that an expired entry stops counting as degraded without anyone calling clearDegraded. 7/7 vitest cases green.

* feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2)

Two test-only adapters that read the existing v1 transcript fixtures (happy-3-rounds and malformed-unbalanced) and replay them as either a full string or a 512-byte chunked stream. The chunked form is what the conformance harness uses to prove the parser holds together when the transcript arrives in arbitrary network slices, not as one buffered blob.
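The adapter-degraded registry from Phase 10.1 could look something like the sketch below. It is a guess at the shape, not the daemon's code: the class name, `markDegraded`, and `isDegraded` are hypothetical, while the three reasons, the `source` / `expiresAt` fields, the 24h default, the injectable clock, and the `clearDegraded` name are taken from the commit message.

```typescript
// Reasons listed in the commit message.
type DegradedReason = "malformed_block" | "oversize_block" | "missing_artifact";

interface DegradedRecord {
  reason: DegradedReason;
  source: string;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 24h default from the commit message

// Hypothetical in-memory registry with an injectable clock (the "clock seam").
class DegradedRegistrySketch {
  private readonly records = new Map<string, DegradedRecord>();
  private readonly now: () => number;
  private readonly ttlMs: number;

  constructor(now: () => number = () => Date.now(), ttlMs: number = DEFAULT_TTL_MS) {
    this.now = now;
    this.ttlMs = ttlMs;
  }

  markDegraded(adapterId: string, reason: DegradedReason, source: string): void {
    this.records.set(adapterId, { reason, source, expiresAt: this.now() + this.ttlMs });
  }

  // An expired record stops counting on its own; nobody has to call clearDegraded.
  isDegraded(adapterId: string): boolean {
    const record = this.records.get(adapterId);
    return record !== undefined && this.now() < record.expiresAt;
  }

  clearDegraded(adapterId: string): void {
    this.records.delete(adapterId);
  }
}
```

Passing the clock in as a function is what lets a test advance time by changing a variable rather than sleeping or mocking `Date`.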
* feat(daemon): adapter conformance harness (Phase 10.3) runAdapterConformance pulls a transcript through the same parseCritiqueStream pipeline the orchestrator uses and classifies the outcome as shipped, degraded, or failed. On a degraded outcome it forwards the matched reason to the adapter-degraded registry, so a single nightly conformance run is what populates the skip list rather than the orchestrator learning each adapter is broken at request time. 5/5 vitest cases green covering shipped, malformed degraded, oversize degraded, no-ship failure, and the harness-thrown failure path. * test(e2e): Critique Theater Playwright suite (Phase 11) Six tests, one viewport per visual case, deterministic SSE fixtures stubbed via page.route(). Adds the suite to test:ui:extended so the existing extended-UI lane picks it up. Coverage: 1. Happy path: a single mounted theater plays the full fixture (1 run_started, 5 panelists open / dim / must_fix / close, 1 round_end, 1 ship) and ends on the score badge. 2. Interrupt mid-run: the panelist that is open at the time the interrupt button is clicked closes with an interrupted marker and the transcript freezes there. 3. Visual regression at 375x720 mobile. 4. Visual regression at 768x1024 tablet. 5. Visual regression at 1280x800 desktop. 6. A11y role tree: the theater region exposes a labelled landmark, each panelist lane is a group with an accessible name, the score is a status live region. All SSE traffic is stubbed by page.route so the suite runs in CI without a daemon. The toggle is seeded via localStorage by bootAppWithCritiqueEnabled so the gate behaves as if Settings flipped it on. typecheck clean; playwright --list reports 6. * test(web): reducer p99 bench at 10k iterations (Phase 13.1) Locks the documented 2ms budget for the Critique Theater reducer on a representative SSE script (27 actions, one full happy run) behind a regression gate. 
Asserts p99 stays under 4ms (2x the documented budget) so CI runners with a noisy neighbour do not flake while a real regression to 20ms or 200ms still trips. The bench is a vitest case rather than a bare microbenchmark so it runs in the same CI lane as every other web test and does not need a parallel runner. * test(web): critique surface coverage walker (Phase 13.2) Walks the public critique surface (11 SSE event names, 5 panelist roles, 6 lifecycle phases, 9 named i18n keys) and asserts each named symbol appears in both the src corpus and the test corpus. The walker is the gate that catches a rename in one half of the codebase without a matching update in the other half: a future PR that drops 'panelist_must_fix' from the reducer without also removing its test reference fails this suite. 62 assertions, one per symbol per corpus. * fix(web): tighten Phase 13 gates from lefarcen review (PR #1318) Address the actionable items from lefarcen's review of the two Phase 13 CI gates. The two questions about longer-term DX (pre- commit hook to auto-update the symbol table, AST-walker swap) are documented as deferred follow-ups rather than landed here. reducer-bench: - Describe renamed to 'reducer p99 regression gate (Phase 13.1)' so it reads as a gate, not a comparative benchmark. - Failure message now carries the full distribution (p50 / p90 / p99 / max + ceiling), so triage on a tripped gate can distinguish a real 20ms regression from a 4.001ms CI hiccup without re-running locally (lefarcen Q3). - Captured a baseline (p50=0.011ms p90=0.013ms p99=0.018ms max=0.244ms on a local Node 24 / Win11 run, 2026-05-11) inside the docblock so reviewers can see the actual reading sits ~222x below the 4ms ceiling (lefarcen Q1). - Replaced 'role as any' casts with PanelistRole-typed casts so the fixture is typecheck-strict. - Phase numbering corrected (13.2 → 13.1 to match the PR body). 
critique-coverage: - Symbols now grouped under four describe blocks (SSE events / panelist roles / lifecycle phases / i18n keys) so a failure points at the category that drifted at a glance (lefarcen nit). - Docblock now explains the grep-over-AST trade-off (the bug class is structural at the string level, not at the AST level) and points at the future AST-walker work as a deferred follow- up (lefarcen Q2). - Docblock now walks a contributor through the four-step maintenance flow (add to contract → add caller → add test → add literal here), so the next person to add an SSE event or i18n key knows the gate exists and what to update (lefarcen Q4). - Phase strings switched from 'phase: <name>' to bare-quoted literals so the walker is robust against single vs double quotes and ':' vs '===' source-shape changes. - Dead try/catch around 'stack = [root]' removed (cannot throw). - Per-symbol failure messages name the symbol AND which corpus is missing it, so the gate is self-describing on the next CI red. - Phase numbering corrected (13.4 → 13.2 to match the PR body). 63 / 63 vitest cases green (1 bench + 62 coverage). Web typecheck clean. * fix(web): tighten coverage walker semantics from lefarcen P2/P3 (PR #1318) Two follow-on findings on commit `338a185`: P2 — coverage gate weakened. The previous revision used one helper `corpusReferences` for both SRC and TEST corpora, and that helper accepted the unprefixed PanelEvent type form (`type: 'panelist_must_fix'`) as a substitute for the prefixed SSE wire name (`critique.panelist_must_fix`). The fallback is correct on the TEST side (reducer tests dispatch PanelEvent literals) but it weakened the SRC side: production code could drop the SSE channel name silently and the PanelEvent type alias would keep the walker green. Split into two helpers: `srcReferences` is strict (exact substring match only, no fallback) and `testReferences` keeps the lenient fallback for SSE events. 
The production-side assertions now route through `srcReferences` so the wire name is load-bearing again. P3 — maintenance doc overclaimed. The previous revision said 'CI red if you forget step 4' but the symbol arrays are partially hand- maintained, so a contributor adding a NEW phase string or i18n key without updating the array leaves CI green (the walker never knew to look). Rewrote the failure-mode section to distinguish the two cases: - Renaming an EXISTING symbol without updating the walker → CI red (existing assertion fails because the old name is gone). - Adding a NEW hand-maintained symbol without updating the walker → CI stays green (walker does not know to look for it). Also clarified that `SSE_EVENTS` and `PANELIST_ROLE_STRINGS` are auto-built from contracts so step 4 is one-line for `PHASE_STRINGS` and `I18N_KEYS` only. 63 / 63 vitest cases still green. * fix(web): close two P2 findings on PR #1318 (Siri-Ray + lefarcen) P2 (coverage walker counted self as evidence). The walker walked apps/web/tests, which contains apps/web/tests/components/Theater/ critique-coverage.test.ts itself. The hand-maintained PHASE_STRINGS and I18N_KEYS literals inside that file would satisfy the test-side coverage assertion against themselves, so a real Theater test that covers a symbol could be deleted and the gate would still pass. Excluded the walker file from TEST_FILES via path.resolve(__filename) filter so the test corpus only contains independent evidence. Once the walker stopped seeing itself, the gate correctly red-flagged nine i18n keys that no INDEPENDENT test exercises: critiqueTheater.userFacingName, roundLabel, composite, threshold, interrupt, interrupted, degradedHeading, shippedSummary, interruptedSummary. Component tests like TheaterCollapsed.test.tsx exercise the rendered text but never mention the key STRING, so the walker couldn't see them. 
Closed that gap by adding apps/web/tests/components/Theater/critique-i18n-keys.test.ts: 9 cases, one per watched key, asserting the dictionary entry exists as a non-empty string. That's both real coverage (catches a stale dict) and the independent evidence the walker requires. P2 (interruptedSummary missing from de/ja/ko/zh-TW). The native locale overrides were missing the key, so an interrupted run on a German / Japanese / Korean / Traditional Chinese UI silently fell back to the English string via the ...en spread. Added the key with {round} and {composite} placeholders preserved, using PerishCode's suggested copy from the earlier review thread. Verified: - pnpm --filter @open-design/web typecheck clean. - pnpm exec vitest run tests/components/Theater tests/i18n: 20 files / 190 tests green (critique-coverage 62 / 62, critique-i18n-keys 9 / 9 new, reducer-bench 1 / 1, locales 5 / 5). * fix(web): drop the Dict cast in i18n key coverage test (lefarcen P1 / Siri-Ray on PR #1318) The previous revision used `(en as Record<string, string>)[key]` to read each watched key. Dict has no string index signature, so CI's strict typecheck rejected the broad cast with TS2352 even though the runtime assertion was fine. Replaced with the typed pattern lefarcen suggested: type WATCHED_KEYS as `readonly (keyof typeof en)[]` and read `en[key]` directly. That removes the cast and also strengthens the test, because a renamed or removed key now fails the type check immediately rather than at runtime. Verified: - pnpm --filter @open-design/web typecheck clean. - pnpm --filter @open-design/web exec vitest run tests/components/Theater/critique-i18n-keys.test.ts: 9 / 9 green. 
* fix(web): tighten isPanelEvent in contracts so enum + numeric fields are checked end-to-end (Siri-Ray round-3 P1 on PR #1314) The variant validator on the web SSE path previously accepted any `typeof === 'string'` for closed-enum fields (ship.status, panelist_.role, degraded.reason, failed.cause, parser_warning.kind, run_started.cast[]) and any `typeof === 'number'` for numeric fields, which let NaN / Infinity through. Downstream components index i18n tables by enum value, so an unknown status or role would land `SHIP_BADGE_KEY[final.status]` on undefined and crash the translator. The replay parser had a separate gap: `useCritiqueReplay.parseTranscript` called the cheap `isPanelEvent` header check directly, so a recorded line like `{"type":"ship","runId":"r"}` reached the reducer with composite, status, round, artifactRef, summary all undefined and TheaterCollapsed then called `final.composite.toFixed(1)` on undefined. Resolution: move all wire-side validation into the contract guard. - Export const arrays for the closed enums: SHIP_STATUSES, DEGRADED_REASONS, FAILED_CAUSES, PARSER_WARNING_KINDS, ROUND_DECISIONS (PANELIST_ROLES already existed). - Rewrite `isPanelEvent` in packages/contracts/src/critique.ts to be the single deep validator: header (known type + non-empty runId) plus every variant-specific required field plus closed-enum membership plus Number.isFinite on every numeric field. Documented as the wire source of truth. - Drop the local `hasValidVariantShape` from web/sse.ts; sseToPanelEvent now relies entirely on the contract guard, and parseTranscript in useCritiqueReplay (which already uses isPanelEvent) gets the deeper validation for free. Tests (TDD, red-first): - packages/contracts/tests/critique.test.ts: 13 new cases pinning the strict guard directly (well-formed across every variant, every rejection path: unknown type, empty/non-string runId, unknown enum, non-finite numeric, missing variant field). 
- apps/web/tests/components/Theater/state/sse.test.ts: 9 new cases for each closed-enum rejection on the wire path plus a positive sweep across every legal enum value across every variant. - apps/web/tests/components/Theater/hooks/useCritiqueReplay.test.tsx: 2 new cases for incomplete and unknown-enum transcript lines. Verified: - pnpm --filter @open-design/contracts test 4 files / 30 tests green. - pnpm --filter @open-design/contracts build clean. - pnpm --filter @open-design/web typecheck clean. - pnpm --filter @open-design/web test 107 files / 976 tests green. fix(contracts): enforce numeric domains in isPanelEvent (lefarcen P2 on PR #1314 round 4) The strict guard from PR #1314 round 3 enforced enum membership and Number.isFinite, but accepted any finite number where the contract intends a specific domain: scale: 0 (ScoreTicker divides by it), negative thresholds, fractional rounds, negative mustFix, etc. ScoreTicker.tsx writes `var(--scale, ${state.scale})` into inline CSS and divides by it for tick width, so a guard-passing scale: 0 shipped Infinity into the rendered style. Negative composite / score values reached downstream code that assumes >= 0. Resolution: mirror the daemon-side Zod domain constraints in the runtime guard. Three new helpers in packages/contracts/src/critique.ts: - isPositiveInt(v): integer with v > 0. Used for round, maxRounds, scale, protocolVersion (all 1-indexed in the orchestrator). - isNonNegativeInt(v): integer with v >= 0. Used for mustFix, position, bestRound. bestRound: 0 is the valid sentinel for 'interrupted before any round closed'. - isNonNegativeFinite(v): finite number with v >= 0. Used for composite, score, dimScore, threshold. Threshold may be fractional (e.g. 8.5 on a scale of 10). Cross-field check inside run_started: threshold <= scale (the daemon Zod schema enforces this with an epsilon refine, the wire guard matches the same intent). 
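The three numeric-domain helpers named above are small enough to sketch directly. The helper names and their intended domains come from the commit message; the bodies, and the `hasValidRunStartedNumbers` wrapper used to show the `threshold <= scale` cross-field check, are assumptions. The daemon-side Zod schema reportedly applies an epsilon to that comparison, which this sketch omits.

```typescript
// Integer with v > 0 (round, maxRounds, scale, protocolVersion).
function isPositiveInt(v: unknown): v is number {
  return typeof v === "number" && Number.isInteger(v) && v > 0;
}

// Integer with v >= 0 (mustFix, position, bestRound).
function isNonNegativeInt(v: unknown): v is number {
  return typeof v === "number" && Number.isInteger(v) && v >= 0;
}

// Finite number with v >= 0 (composite, score, dimScore, threshold).
function isNonNegativeFinite(v: unknown): v is number {
  return typeof v === "number" && Number.isFinite(v) && v >= 0;
}

// Hypothetical wrapper: numeric checks for a run_started payload,
// including the cross-field rule threshold <= scale (no epsilon here).
function hasValidRunStartedNumbers(e: {
  protocolVersion: unknown;
  maxRounds: unknown;
  scale: unknown;
  threshold: unknown;
}): boolean {
  const { protocolVersion, maxRounds, scale, threshold } = e;
  return (
    isPositiveInt(protocolVersion) &&
    isPositiveInt(maxRounds) &&
    isPositiveInt(scale) &&
    isNonNegativeFinite(threshold) &&
    threshold <= scale
  );
}
```

`Number.isInteger` and `Number.isFinite` both reject `NaN` and `Infinity`, so a guard built from these helpers no longer lets a `scale: 0` or a non-finite score through.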
Tests (TDD, red-first) added in packages/contracts/tests/critique.test.ts: - 22 new rejection cases across every numeric field that previously slipped through: scale: 0, negative scale, fractional scale, maxRounds: 0, fractional maxRounds, protocolVersion: 0, fractional protocolVersion, negative threshold, threshold > scale, round: 0, fractional round, negative dimScore / score, negative / fractional mustFix, negative composite, ship round: 0, negative / fractional bestRound, negative interrupted composite, negative / fractional parser_warning position. - 3 positive boundary cases that must still pass: threshold == scale, fractional threshold within [0, scale], interrupted with bestRound: 0 (no round completed before interrupt), parser_warning with position: 0 (start of stream). Verified: - pnpm --filter @open-design/contracts build clean. - pnpm --filter @open-design/contracts test: 4 files / 59 tests green (was 37 before the new domain cases). - pnpm --filter @open-design/web typecheck clean. - pnpm --filter @open-design/web test: 110 files / 1004 tests green; no regression on Theater suite, sse validator, replay parser, or assistant-feedback widget tests. * fix(web): restore wait-for-daemon-ack pattern on Theater interrupt Same regression as flagged on PR #1316 post-main-merge: the optimistic local dispatch fired before the POST resolved, so a daemon 404 / 409 still terminalized the UI and the real SSE terminal event got ignored by the sticky interrupted phase. Snapshot runId / bestRound / composite at click time, dispatch interrupted only on res.ok, clear interruptPending on rejection or non-2xx so the user can retry. Tests cover rejection + 404 leaving the run on the live stage; the 204 path waits for the ack. * test(e2e): move critique-coverage walker from apps/web/tests to e2e/tests (Siri-Ray P2) The walker is by definition a cross-app consistency check: it reads the web reducer, the daemon critique module, the contracts package, and the e2e UI suite. 
Hosting it under apps/web/tests/ violated the repo boundary rule (root AGENTS.md): app packages must not import another app's private src/ or tests/ as a shared helper, and cross-app consistency checks belong in e2e/tests/. The web test lane was effectively coupled to daemon and e2e file layout, so a daemon-only refactor could break the web lane. Moved the file to e2e/tests/critique-coverage.test.ts and switched the contracts import to the import.meta.glob shape the e2e package already uses (see localized-content.test.ts), so the e2e package does not have to add @open-design/contracts as a workspace dep just to load two const arrays. REPO_ROOT and SELF_PATH recalculated for the new location. Web test lane no longer depends on daemon, contracts, or e2e layout. The e2e walker covers the same 62 assertions as before: e2e/tests/critique-coverage.test.ts 62 / 62 green Web typecheck clean, e2e typecheck clean. * fix(test): add projectKind prop to FileViewer deck render after v0.7.0 merge --------- Co-authored-by: Nagendhra <nagendhra405@gmail.com>	2026-05-14 15:55:36 +08:00
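For readers unfamiliar with the p99 regression gate this commit is named after, a rough sketch of the idea follows. It assumes a nearest-rank percentile and takes pre-collected timings in milliseconds; `percentile` and `assertP99Under` are hypothetical names, and the real bench (a vitest case over 10k reducer iterations with a 4ms ceiling) may compute its percentiles differently.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of an empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p * sorted.length) / 100)));
  return sorted[rank - 1];
}

// Throws when p99 is not strictly under the ceiling. The message carries the
// whole distribution so a tripped gate can be triaged without a local re-run.
function assertP99Under(samplesMs: readonly number[], ceilingMs: number): void {
  const p99 = percentile(samplesMs, 99);
  if (p99 >= ceilingMs) {
    const p50 = percentile(samplesMs, 50);
    const p90 = percentile(samplesMs, 90);
    const max = percentile(samplesMs, 100);
    throw new Error(
      `p99 ${p99}ms is not under the ${ceilingMs}ms ceiling (p50=${p50}ms p90=${p90}ms max=${max}ms)`,
    );
  }
}
```

Setting the ceiling at twice the documented budget, as the commit describes, trades a little sensitivity for stability on noisy CI runners.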
Fl0rencess	53148d52c8	feat(media): add SenseAudio TTS provider (#1633) * feat(media): add SenseAudio TTS provider Add SenseAudio (https://docs.senseaudio.cn) as a new TTS provider alongside ElevenLabs / MiniMax / FishAudio / Volcengine. Surfaced as the `senseaudio-tts` catalogue id, mapped on the wire to `senseaudio-tts-1.5-260319`, SenseAudio's flagship model with emotion / polyphonic-character / formula-reading / clone / text-generated voice support. Scope here is HTTP non-streaming (POST /v1/t2a_v2 with stream=false) only; SSE and WebSocket transports are intentionally out of scope. - Mirror provider + model entries in apps/daemon and apps/web registries (catalogue drift check stays green). - ENV_KEYS gets `OD_SENSEAUDIO_API_KEY` / `SENSEAUDIO_API_KEY` so the alias scheme matches every other integrated provider. - `renderSenseAudioTTS` in media.ts mirrors renderMinimaxTTS: Bearer auth, voice_setting / audio_setting body, hex-decoded audio under `data.audio`, base_resp envelope split from HTTP-level failures. - NewProjectPanel's audio supportedProviders allowlist now includes `senseaudio` so the picker actually surfaces the new entry. - Audio shape (mp3 / 32kHz / 128kbps / stereo) and default voice (`female_0033_b`) hard-coded for parity with the other TTS paths; MediaContext is unchanged. - New apps/daemon/tests/media-senseaudio.test.ts (8 specs) covers defaults, custom voice, default base URL fall-back, env-key path, missing-key error, base_resp failures, missing audio, and HTTP non-2xx, patterned on media-elevenlabs.test.ts. * docs(media): drop Chinese from SenseAudio provider comment Translate the model-capabilities line in the SenseAudio block comment (media.ts) into English. Keeping the source comments in a single language matches the rest of the daemon and avoids reviewer churn over mixed-locale prose.
* fix(web): unblock openai and volcengine speech models in audio picker Per review on #1633, supportedModels()'s audio allowlist in NewProjectPanel was still filtering out gpt-4o-mini-tts (openai) and doubao-tts (volcengine) even though both are marked `integrated: true` in the shared media-models catalogue. Add the two ids so the picker matches the registry and the PR body's "alongside doubao-tts" claim holds true. * style(media): normalize speech hints to bare provider names Strip the trailing descriptions on the speech catalogue hints so every entry shows just the provider name (matching FishAudio / ElevenLabs / SenseAudio): `gpt-4o-mini-tts` → "OpenAI", `minimax-tts` → "MiniMax", `doubao-tts` → "Volcengine". Also move `gpt-4o-mini-tts` to the end of the list so the OpenAI entry sits after the upstream-focused providers, matching the recent picker grouping discussion on #1633. Mirrored in both apps/daemon/src/media-models.ts and apps/web/src/media/ models.ts; catalogue drift check + daemon (1848) + web (1150) suites all green.	2026-05-14 15:26:38 +08:00
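The response handling described for `renderSenseAudioTTS` (hex-decoded audio under `data.audio`, with `base_resp` failures kept separate from HTTP-level failures) might look roughly like the sketch below. It is not the code in media.ts. The `status_code` / `status_msg` field names and the `extractAudio` / `hexToBytes` helpers are assumptions made for illustration.

```typescript
// Hypothetical envelope shape; only `data.audio` and `base_resp` come from the
// commit message, the inner field names are assumed.
type TtsEnvelope = {
  data?: { audio?: string };
  base_resp?: { status_code?: number; status_msg?: string };
};

// Decode a hex string ("fffb90...") into raw bytes.
function hexToBytes(hex: string): Uint8Array {
  if (hex.length % 2 !== 0 || /[^0-9a-fA-F]/.test(hex)) {
    throw new Error("audio payload is not valid hex");
  }
  const out = new Uint8Array(hex.length / 2);
  for (let i = 0; i < out.length; i++) {
    out[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return out;
}

// Check the three failure layers in order: HTTP status, provider envelope,
// then a missing audio field.
function extractAudio(httpStatus: number, body: TtsEnvelope): Uint8Array {
  if (httpStatus < 200 || httpStatus >= 300) {
    throw new Error(`HTTP ${httpStatus}`);
  }
  const code = body.base_resp?.status_code ?? 0;
  if (code !== 0) {
    const msg = body.base_resp?.status_msg ?? "unknown";
    throw new Error(`provider error ${code}: ${msg}`);
  }
  const audio = body.data?.audio;
  if (!audio) throw new Error("response carried no audio");
  return hexToBytes(audio);
}
```

Keeping the HTTP check ahead of the envelope check means a non-2xx response is never reported as a provider-level error.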
nettee	6b3cc61714	Revert "Refactor agent runtime stream handling behind adapter (#1622)" (#1656) This reverts commit `8cb9cdb593`.	2026-05-14 15:23:19 +08:00
Weston Houghton	de4430cf4e	fix(web): route remaining crypto.randomUUID calls through utils/uuid (#849) (#1621) `crypto.randomUUID` is undefined on non-secure contexts (plain HTTP + non-localhost, the standard Docker / NAS / unRAID self-hosted setup, e.g. `http://192.168.1.x:7456`). PR #900 introduced `apps/web/src/utils/uuid.ts` as a tiered v4 helper that degrades to `crypto.getRandomValues` and ultimately `Math.random`, so the original "Create button silently does nothing" symptom (#849, #394) went away. PR #1428 added three unguarded `crypto.randomUUID()` calls in the new PostHog analytics provider, and `apps/web/src/runtime/exports.ts` carried a fourth from older PDF-export work. On non-secure contexts these throw `TypeError: crypto.randomUUID is not a function` during `<AnalyticsProvider>` rendering, taking the whole app shell down before any UI mounts. PDF export also fails when the print-ready handshake nonce is generated. Route all four sites through the existing `randomUUID()` helper.	2026-05-14 14:59:14 +08:00
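The tiered fallback described above (native `randomUUID`, then `getRandomValues`, then `Math.random`) can be sketched as follows. This is a minimal illustration, not the contents of `apps/web/src/utils/uuid.ts`. The `CryptoLike` type and the `uuidV4` / `formatV4` names are hypothetical, and the crypto object is passed in as a parameter so each tier can be exercised on its own.

```typescript
// Hypothetical sketch of a tiered UUID v4 helper; not the repo's actual code.
type CryptoLike = {
  randomUUID?: () => string;
  getRandomValues?: (array: Uint8Array) => Uint8Array;
};

// Stamp the version (4) and variant (10xx) bits, then format as 8-4-4-4-12 hex.
function formatV4(bytes: Uint8Array): string {
  bytes[6] = (bytes[6] & 0x0f) | 0x40;
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

function uuidV4(c?: CryptoLike): string {
  // Tier 1: native randomUUID (only present on secure contexts in browsers).
  if (c && typeof c.randomUUID === "function") return c.randomUUID();
  const bytes = new Uint8Array(16);
  if (c && typeof c.getRandomValues === "function") {
    // Tier 2: cryptographically strong bytes, formatted by hand.
    c.getRandomValues(bytes);
  } else {
    // Tier 3: last resort, not cryptographically strong.
    for (let i = 0; i < bytes.length; i++) {
      bytes[i] = Math.floor(Math.random() * 256);
    }
  }
  return formatV4(bytes);
}
```

In a browser this would be called as `uuidV4(globalThis.crypto)`; on a plain-HTTP LAN origin the first tier is skipped because `randomUUID` is absent there.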
Nagendhra Madishetti	d5566d7627	feat(daemon): user-configurable model alias for the media dispatcher (#1277 ) (#1309 ) * feat(daemon): user-configurable model alias / redirect for the media dispatcher (#1277) Tilmirs's use case in #1277: their Doubao access has moved from `doubao-seedream-3-0-t2i-250415` to `doubao-seedream-5-0`, but the project's registered catalog still emits the old id. Every call fails because the old name no longer resolves at Volcengine. Until now the only workaround was patching the source on every update. This adds a user-configurable alias layer that swaps the catalog id for whatever wire-name the provider expects, without changing the catalog itself. Two storage layers (env wins over disk, matching the rest of media-config): 1. Environment variable `OD_MEDIA_MODEL_ALIASES` carries a JSON map: `'{"doubao-seedream-3-0-t2i-250415":"doubao-seedream-5-0"}'`. Single var, portable across shells (Windows cmd.exe rejects hyphens in env-var names, so the per-id pattern lefarcen suggested wouldn't have worked on Windows). Malformed JSON is tolerated — falls through to the on-disk map rather than blowing up mid-generation. 2. media-config.json gains a top-level `aliases` field: ```json { "providers": { ... }, "aliases": { "doubao-seedream-3-0-t2i-250415": "doubao-seedream-5-0" } } ``` The Settings UI's existing PUT writes providers only, so the writeStored path now reads the existing aliases and preserves them on every write. Without that, a Settings save would silently wipe the user's aliases. The Settings UI surface for editing aliases is a separate follow-up; manual JSON edit and the env var are the v1 entry points. The resolution happens inside `startMediaGeneration` after the catalog lookup and surface validation have already accepted the registered id, so users still get the "unknown model" error if they request a catalog id that doesn't exist. 
The swap only changes what the provider receives on the wire (volcengine, openai, grok, fal, nanobanana etc. each pass `ctx.model` straight into their request body). Per-provider auto-output-name and the file-naming side use the function-level `model` parameter (the catalog id), so a `.png` named after `doubao-seedream-3-0-t2i-250415` keeps surfacing the registered id the agent / CLI asked for, not the wire-level alias. `providerNote` strings include the wire name so the user can see what was actually sent. Public API additions: - `resolveModelAlias(projectRoot, modelId)` -> the wire name (or the original if no alias matches). - `readAliasMap(projectRoot)` -> { effective, env, stored } for the future Settings UI's source-attribution display. Tests - 8 new cases in tests/media-config.test.ts (suite goes 14 -> 22): pass-through, stored map, env map, env-over-stored precedence, malformed-env fall-through, coercion of bad entries (null / number / nested object / empty string / blank key), readAliasMap source attribution, and a writeConfig regression that pins alias preservation on a Settings-style provider PUT. Validated - pnpm guard clean - pnpm --filter @open-design/daemon typecheck clean (both tsconfig.json and tsconfig.tests.json) - Media test suite (media-config + media-tasks-routes + media-tasks-persistence + media-nanobanana): 33/33 Pre-existing daemon test failures on Windows (symlinks, CODEX_BIN runtime resolution, MCP config, skills, server-paths) are unrelated to this change and reproduce on a clean main checkout. * fix(daemon): preserve catalog id for capability branches, surface aliases via /api/media/config (PR #1309 review) Lefarcen + codex P2 on PR #1309: the alias swap overwrote `ctx.model` globally, which silently disabled every renderer branch that keys behaviour off the catalog id. 
A user aliasing `dall-e-3 -> azure-dalle3-deployment` would have the wire name swapped correctly but `body.response_format = 'b64_json'` and `body.quality = 'hd'` would no longer be set, because the `ctx.model.startsWith('dall-e-')` / `ctx.model === 'dall-e-3'` checks now saw the alias. The same regression hit the gpt-image-* size selection, the gpt-4o-mini-tts instructions branch, and the openaiSizeFor() sizing function. MediaContext now carries both fields: - `model`: the registered catalog id (`dall-e-3`, `gpt-4o-mini-tts`, `doubao-seedream-3-0-t2i-250415`). All model-family capability branches read from here. - `wireModel`: the post-alias wire name. Every `body.model = `, every URL template, and every `providerNote` string reads from here so the user sees what was actually sent and the provider gets the alias. Renderers updated: openai image (body.model + providerNote + openaiSizeFor keeps catalog), openai speech (body.model + providerNote + gpt-4o-mini-tts instructions keeps catalog), volcengine video (body + note), volcengine image (body + note + openaiSizeFor keeps catalog), grok image (body + note), grok video (body + note), nanobanana (`credentials.model || ctx.wireModel || default` chain), minimax TTS, fishaudio TTS. The MINIMAX/FISHAUDIO hardcoded maps now sit BEHIND the user alias: explicit user alias wins over the project's legacy rebranding table, then the table wins over the catalog id fallback. Stub-fallback diagnostics (the SVG placeholder + stub providerNote string) keep the catalog id since those are debug surfaces, not provider calls. Lefarcen P3: the PR description claimed readAliasMap was the daemon-public API, but the /api/media/config route returned only readMaskedConfig (which had no aliases field). readMaskedConfig now returns `{ providers, aliases: { effective, env, stored } }` so the future Settings UI PR can consume the source-attributed map directly. 
The `aliases` field is always present (empty maps when nothing is configured) so the UI has a stable shape to read. Tests - New `media-alias-capability.test.ts` (2 jsdom cases) drives generateMedia end-to-end with a stubbed fetch and asserts on the request body. Pins the regression: aliased dall-e-3 still sends `response_format: 'b64_json'` + `quality: 'hd'`; aliased gpt-4o-mini-tts still attaches the instructions field from the voice prop. - `media-config.test.ts` grows by 2 cases (suite goes 22 -> 24): readMaskedConfig surfaces the alias map (both env and stored sources), and the empty-state shape for fresh installs. Validated - pnpm guard clean - pnpm --filter @open-design/daemon typecheck clean (both tsconfig.json and tsconfig.tests.json) - Media test suite (config + alias-capability + nanobanana + tasks-persistence + tasks-routes): 37/37 --------- Co-authored-by: Nagendhra <nagendhra405@gmail.com>	2026-05-14 14:58:39 +08:00
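The alias resolution described in this commit (env map wins over the stored map, malformed env JSON falls through, bad entries are dropped, unmatched ids pass through) could be sketched like this. The function names, the pure-function shape, and the per-key merge are assumptions for illustration; the daemon's real `resolveModelAlias(projectRoot, modelId)` reads from disk and `process.env`, which this sketch deliberately avoids.

```typescript
type AliasMap = Record<string, string>;

// Keep only entries whose key and value are both non-blank strings.
function coerceAliasMap(raw: unknown): AliasMap {
  const out: AliasMap = {};
  if (raw === null || typeof raw !== "object" || Array.isArray(raw)) return out;
  for (const [key, value] of Object.entries(raw as Record<string, unknown>)) {
    if (key.trim() !== "" && typeof value === "string" && value.trim() !== "") {
      out[key] = value;
    }
  }
  return out;
}

// Parse the JSON carried by the env var; malformed JSON yields an empty map
// so the stored map still applies.
function parseEnvAliases(envValue: string | undefined): AliasMap {
  if (!envValue) return {};
  try {
    return coerceAliasMap(JSON.parse(envValue));
  } catch {
    return {};
  }
}

// Build the source-attributed view: env entries override stored entries.
// (Per-key merge is an assumption; the source only says "env wins over disk".)
function buildAliasMaps(envValue: string | undefined, storedRaw: unknown) {
  const env = parseEnvAliases(envValue);
  const stored = coerceAliasMap(storedRaw);
  return { effective: { ...stored, ...env }, env, stored };
}

// Return the wire name, or the original id when no alias matches.
function resolveAlias(effective: AliasMap, modelId: string): string {
  return Object.prototype.hasOwnProperty.call(effective, modelId)
    ? effective[modelId]
    : modelId;
}
```

Usage would be along the lines of `resolveAlias(buildAliasMaps(envJson, storedAliases).effective, catalogId)`, with the catalog id kept separately for capability branches as the follow-up fix describes.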
shangxinyu1	2976c76fc3	test: expand Memory and Routines coverage (#1521) * test: expand settings and packaged coverage * test: extend memory settings coverage * test: cover routine settings failure states * test: cover routine operation failures * test: fix daemon test typing on CI * test: decouple packaged smoke from orbit bug * test: avoid live memory LLM calls in route tests * test: fix daemon fetch typing in CI * fix: restore preview comment and inspect toggles * test: align manual edit flow with current inspector UX * test: align comment attachment flow with current preview comments UI * fix: probe resolved Codex launch path during detection * fix: remove duplicate board activation helper after rebase * test: update ghost cli detection mock * test: align FileViewer toolbar expectation * ci: move full app tests to extended lane * ci: run app tests by changed scope * ci: cover shared app inputs in test scopes * ci: avoid setup-node cache in windows packaged smoke * test: align extended settings and manual edit flows	2026-05-14 14:48:40 +08:00
Yuhao Chen	c942d99b14	fix(orbit): avoid sample identity leakage (#1608)	2026-05-14 14:48:29 +08:00
Nagendhra Madishetti	e508fa3fbd	test(e2e): Critique Theater Phase 11 activation (un-fixme suite, seeded-project nav, split SSE fixture) (#1483)	2026-05-14 14:27:39 +08:00
Nagendhra Madishetti	5cb0508790	fix(web): deep-link Routines history rows to their specific conversation (Fixes #1505) (#1508)	2026-05-14 14:27:34 +08:00
PerishFire	59ed000903	Fix Windows resource cache for Orbit templates (#1554)	2026-05-14 14:27:29 +08:00
lakatos	51d1c4e287	ci: skip upstream-only workflows on forks (#1586)	2026-05-14 14:27:23 +08:00
Priyanshu Kayarkar	8101e430cf	fix(ui) : radio button issue (#1599)	2026-05-14 14:27:18 +08:00
soulme	2a8ebff11a	feat(web): add collapsible comment side panel (#1607)	2026-05-14 14:27:09 +08:00
github-actions[bot]	a7bebd926f	Auto-generated metrics for run #25835844101 (#1615)	2026-05-14 14:27:03 +08:00
Nicholas-Xiong	63b90685da	fix: Improve project card metadata truncation with min-width: 0 (#1629)	2026-05-14 14:26:57 +08:00
Nicholas-Xiong	743669e01d	fix: Add dropdown chevron to routines project select field (#1630)	2026-05-14 14:26:50 +08:00
sukumarp2022	8b3e22850a	fix: replace Microsoft Copilot logo with GitHub Copilot logo (#1648)	2026-05-14 14:26:45 +08:00
Siri-Ray	d2738924fb	fix(web): freeze completed run durations across conversations (#1351) * fix(web): freeze completed run durations across conversations * fix(web): finalize stopped API runs Generated-By: looper 0.6.0 (runner=fixer, agent=codex) * fix(daemon): optimize conversation latest run lookup Generated-By: looper 0.6.0 (runner=fixer, agent=codex) * fix(web): scope streaming cleanup to conversation Generated-By: looper 0.6.0 (runner=fixer, agent=codex) * fix(web): capture streaming conversation cleanup Generated-By: looper 0.6.0 (runner=fixer, agent=codex) * fix(web): guard stale run ref cleanup Generated-By: looper 0.6.0 (runner=fixer, agent=codex)	2026-05-14 14:25:37 +08:00
Marc Chan	055e55abd8	Add batch design system testing (#1515) * feat: add batch design system testing * fix: use daemon default agent for batch tests * fix: honor batch project prompt flags Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode) * fix: persist batch run output * fix: honor dry-run before daemon resolution Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode) * fix: persist batch assistant run ids Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode) * fix: cancel timed-out batch runs Generated-By: looper 0.0.0-dev (runner=fixer, agent=opencode)	2026-05-14 14:19:32 +08:00
chaoxiaoche	e57e028222	feat(daemon): make design-system token channel default-on (PR-D) (#1544 ) * feat(daemon): make design-system token channel default-on (PR-D) Flip `OD_DESIGN_TOKEN_CHANNEL` from default-off to default-on. Every chat that picks a brand with `tokens.css` + `components.html` siblings (today: `default`, `kami`) now gets the structured token contract appended to the system prompt automatically. `OD_DESIGN_TOKEN_CHANNEL=0` keeps the DESIGN.md-only path as a kill switch. Adds `scripts/check-design-system-flag-parity.ts`, registered in `pnpm guard`. The guard walks every brand and asserts: - 147 prose-only brands produce byte-identical prompts under flag-off vs flag-on (PR-D's "no-op for legacy brands" promise) - 2 structured brands diverge as expected (catches a future regression that silently dropped the structured blocks) Smoke evidence on #1385 (PR-C): - `default` — 10/10 brand tokens used byte-for-byte in treatment vs 0/10 invented colors in control - `kami` — treatment recovers brand name (`Kami · 纸`), the two-tier surface (`--bg` parchment + `--surface` ivory), the CN font stack override, and the `components.html` card pattern; control invented "Replica" as a brand name Co-authored-by: Cursor <cursoragent@cursor.com> * review: address @nettee + @lefarcen feedback on parity guard Two blocking findings from #1544 review: 1. @nettee — guard's inventory walk silently passed on unreadable filesystem state. `fileExists` swallowed every `stat` error and the bare `readdir` catch returned `[]` for any failure. A renamed `design-systems/` tree, a permission-denied DESIGN.md, or a directory at the brand path would have left `pnpm guard` happy after checking 0 brands — exactly the silent misconfiguration this guard exists to catch. Both error paths now treat only ENOENT / ENOTDIR as absence and rethrow everything else, mirroring the `readFileOptional` fix already applied to PR-C's `apps/daemon/src/design-systems.ts`. 2. 
@nettee — guard exercised `composeSystemPrompt` directly, bypassing the `process.env.OD_DESIGN_TOKEN_CHANNEL !== '0'` gate in server.ts that PR-D actually flipped. A regression that restored `=== '1'`, typo'd the env name, or stopped reading assets when the var is unset would still leave the guard green. Extracted the predicate into `isDesignTokenChannelEnabled(env)` next to `readDesignSystemAssets` and added 6 unit tests pinning every value that matters: unset / `'1'` / `'true'` / empty / `'0'` / whitespace-padded. server.ts now calls the predicate. Any regression on the env-flag semantics fails `tests/design-system-assets.test.ts` independently of the composer-level coverage. Verified: pnpm guard (13/13), tsc -p scripts/tsconfig.json (clean), @open-design/daemon typecheck (clean), 32/32 prompt + asset tests. Co-authored-by: Cursor <cursoragent@cursor.com> * review: pin server-layer asset resolution end-to-end (lefarcen P2) Round-2 review feedback from @lefarcen on #1544: the predicate suite in tests/design-system-assets.test.ts pinned the env-flag boolean but did NOT exercise the server prompt-assembly path that PR-D actually flipped — the seam where the daemon decides whether to read tokens.css / components.html from disk and hand them to composeSystemPrompt. A regression that, say, restored an inline `=== '1'` gate or stopped calling isDesignTokenChannelEnabled() from server.ts would still leave the predicate test green. Extracted that whole seam into `resolveDesignSystemAssets(id, builtInRoot, userInstalledRoot, env)` on apps/daemon/src/design-systems.ts. The function combines: 1. the env-flag gate (kill switch on `OD_DESIGN_TOKEN_CHANNEL=0`) 2. the built-in → user-installed root fallback chain (per-file) 3. the DesignSystemAssets result shape consumed by composeSystemPrompt server.ts at the prompt-assembly site is now a thin caller of this function. 
The previous 13-line inline block (env check + per-file fallback) collapses to one call, so the whole asset-resolution path now has a single testable seam. 7 new tests in tests/design-system-assets.test.ts run the full pipeline end-to-end against real disk fixtures: - env unset (default-on): returns built-in assets - env=`'0'` (kill switch): returns undefined even with files on disk - env=`'1'` (legacy opt-in): still works - mixed builtin/user-installed: per-file fallback merges correctly - both halves built-in: skips user-installed roundtrip verbatim - prose-only brand (no files): undefined / undefined - nonexistent brand directory: undefined / undefined Verified: pnpm guard (13/13), tsc -p scripts/tsconfig.json (clean), @open-design/daemon typecheck (clean), 39/39 prompt + asset tests (was 32; +7 new server-layer-resolution tests). Co-authored-by: Cursor <cursoragent@cursor.com> * fix(test): add missing projectKind to FileViewer deck preview test The deck preview test added in #1556 (`086be271`) renders <FileViewer/> without `projectKind`, which became a required prop in #1509. CI on main is currently red on this; pick up the trivial fix here so PR-D can land cleanly. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: chaoxiaoche <chaoxiaoche@chaoxiaochedeMacBook-Pro.local> Co-authored-by: Cursor <cursoragent@cursor.com>	2026-05-14 14:14:19 +08:00
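Two behaviours from this commit lend themselves to a short sketch: the default-on env gate (`OD_DESIGN_TOKEN_CHANNEL !== '0'`) and the guard's rule that only ENOENT / ENOTDIR count as "absent" while every other filesystem error is rethrown. The bodies below are assumptions reconstructed from the commit text, not the code in `apps/daemon/src/design-systems.ts`. `isAbsenceError` and `existsVia` are hypothetical names, and how the real predicate treats whitespace-padded values is not stated in the source, so this sketch compares the raw string without trimming.

```typescript
type Env = Record<string, string | undefined>;

// Default-on gate: only the exact string "0" disables the token channel.
function isDesignTokenChannelEnabled(env: Env): boolean {
  return env["OD_DESIGN_TOKEN_CHANNEL"] !== "0";
}

// Only "no such file" and "not a directory" mean the path is absent.
function isAbsenceError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return code === "ENOENT" || code === "ENOTDIR";
}

// Run a filesystem probe (for example a stat call). Absence maps to false;
// any other failure (EACCES, EIO, ...) is rethrown so a guard cannot pass
// silently after checking zero brands.
function existsVia(probe: () => void): boolean {
  try {
    probe();
    return true;
  } catch (err) {
    if (isAbsenceError(err)) return false;
    throw err;
  }
}
```

A caller could wrap a synchronous stat as `existsVia(() => statSync(path))`; the probe is injected here only to keep the sketch free of filesystem access.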
nettee	8cb9cdb593	Refactor agent runtime stream handling behind adapter (#1622)	2026-05-14 14:12:24 +08:00
sukumarp2022	852a005b32	feat(web): add export as image screenshot to share menu (#1569) Add an option to export the current preview viewport as a PNG image. - Add requestPreviewSnapshot() utility in exports.ts (reuses the existing srcdoc snapshot bridge via postMessage) - Add exportAsImage() and dataUrlToBlob() helpers for Blob download - Add Export as image menu item in the HTML viewer share menu, gated behind srcdoc mode (bridge only present in srcdoc, not URL-load mode) - Refactor PreviewDrawOverlay to delegate to the shared requestPreviewSnapshot() instead of duplicating the snapshot logic - Add fileViewer.exportImage i18n key across all 19 locale files - Add 7 unit tests covering snapshot request, timeout, error handling, and download filename sanitization Fixes #1500	2026-05-14 11:07:28 +08:00
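A `dataUrlToBlob()` helper of the kind named above typically splits the data URL at the first comma, base64-decodes the payload, and wraps the bytes in a `Blob`. The sketch below shows that general technique under the assumption that the snapshot arrives as a base64 data URL; it is not the implementation in exports.ts, and the fallback MIME type is an arbitrary choice.

```typescript
// Hypothetical sketch: convert a base64 data URL into a Blob.
function dataUrlToBlob(dataUrl: string): Blob {
  const comma = dataUrl.indexOf(",");
  if (!dataUrl.startsWith("data:") || comma === -1) {
    throw new Error("not a data URL");
  }
  // Header looks like "image/png;base64" once the "data:" prefix is removed.
  const header = dataUrl.slice(5, comma);
  if (!header.endsWith(";base64")) {
    throw new Error("only base64 data URLs are handled in this sketch");
  }
  const mime = header.split(";")[0] || "application/octet-stream";
  // atob yields a binary string: one character per decoded byte.
  const binary = atob(dataUrl.slice(comma + 1));
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mime });
}
```

The resulting Blob can then be handed to `URL.createObjectURL` and an anchor element with a `download` attribute to trigger the browser download.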
Nagendhra Madishetti	ff569fa50c	feat(daemon): Critique Theater Phase 16 (M-phase rollout ratchet + /api/critique/conformance) (#1499 ) * feat(web): pure reducer for Critique Theater states (Phase 7.1) Pure CritiqueState reducer driven by the contracts-level PanelEvent (the same shape both the live SSE stream and the recorded transcript emit), so a single reducer powers both the in-flight panel and the rerun replay. Lifecycle covers run_started → running → (shipped / degraded / interrupted / failed), with panelist_open / dim / must_fix / close / round_end events building per-round CritiquePanelistView entries as they arrive. Defensive behaviour that surfaced while writing the spec tests: - Terminal phases (shipped / degraded / interrupted / failed) are sticky against further lifecycle events for the same run, except for parser_warning which can land late and is recorded in a side channel without changing phase. - A new run_started for a different runId at any time discards the prior state and reboots, so the UI can launch consecutive runs without an explicit reset action. - Events whose runId does not match the active run return the same state reference, so React's useReducer doesn't re-render subscribers on stray traffic. - Round bookkeeping keys by round number rather than "always last", so an out-of-order panelist_dim for round 1 arriving after a round 2 dim does not corrupt the round 2 bucket. Test coverage: 18 cases covering each transition, the runId guard, sticky-terminal behaviour, the out-of-order round invariant, and the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire SSE + replay into the same reducer. 
* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2) createCritiqueEventsConnection is a pure connection manager that mirrors apps/web/src/providers/project-events.ts: opens an EventSource at /api/projects/:id/events, listens for every name in CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent (stripping the critique. prefix and merging the data payload), and hands it to the caller's onEvent. Reconnect uses exponential backoff (1s → 30s) and resets on `ready`; malformed payloads drop with a dev-mode warning rather than tearing the stream. useCritiqueStream wraps the manager in a useReducer that owns the CritiqueState. enabled=false or a null projectId tears down the connection cleanly; switching projectId closes the old connection and opens a fresh one. The returned dispatch lets local UI synthesise actions (e.g. an Esc keypress firing a synthetic interrupted while a kill request is in flight); production traffic comes from the SSE stream. Test coverage: - sse.test.ts (10 cases, node env): subscription set covers every CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire shape back to PanelEvent; malformed JSON is swallowed and does not stop the stream; exponential backoff schedule and ready-reset semantics are pinned with a setTimeout seam; close() cancels pending reconnects and shuts the live source; no-op fallback when EventSource is unavailable. - useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event, reducer driven by synthetic actions, no connection when disabled or projectId is null, clean close on unmount, projectId change reopens cleanly. * feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3) Fetches the per-run NDJSON transcript (one PanelEvent per line), parses every line via the shared isPanelEvent predicate, and dispatches into the same CritiqueState reducer the live SSE stream uses. 
A single reducer means the UI rendering a replay can be identical to the live panel, and a UI mounting both useCritiqueStream and useCritiqueReplay in parallel does not have to reconcile two state shapes. speed knob is `paused | instant | live | { intervalMs: N }`. - instant flushes every event synchronously, useful for opening a finished run already at its terminal state. - intervalMs paces dispatches at a fixed cadence so the reviewer can watch the run unfold. - paused parses the transcript but holds events back until the caller advances speed (consumers can drive a scrubber later). - live is reserved for the future "playback at original cadence" feature, currently treated as instant; replay timestamps are not yet persisted with each event so honest pacing requires a follow-up Phase 7+ task. gunzip seam handles `.ndjson.gz` transcripts via DecompressionStream when present; the production fetch path picks between text and arrayBuffer based on the URL extension. Both seams are injectable so the unit tests don't need to spin up a real network or a real gzip pipeline. Test coverage (8 cases, jsdom env): - Idle status before any URL is provided. - speed=instant flushes the full transcript synchronously to shipped state. - speed={intervalMs:N} paces with the setTimeout seam, reaching done after the last tick. - speed=paused leaves status=playing with no dispatches. - Empty transcript reports done with state still idle. - Fetch rejection surfaces an error status with the message. - Malformed NDJSON lines are skipped; valid events around them still land. - .gz transcripts route through the gunzip seam. Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream + replay), all on one branch ready for review. Phases 8+ (Theater components) consume these from this PR. 
* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review) Two P1 fixes from lefarcen's review on PR #1307: SSE payload override `sseToPanelEvent` previously spread `data` after the channel-derived `type`, so a payload-provided `type` could override the channel and route a `critique.run_started` frame into the reducer as a `ship` action. Reversed the spread so the channel-derived `type` is authoritative, and revalidated the resulting object through the contracts-level `isPanelEvent` predicate before returning. Frames that fail validation (missing runId, empty runId, unknown type) are dropped, so a malformed or compromised SSE frame can no longer dispatch a wrong-shape action into the reducer. Three new sse.test.ts cases pin the regression: hostile `type:'ship'` in the payload still resolves to `run_started`, missing runId is dropped, empty runId is dropped. Replay pause/resume `useCritiqueReplay` had one big effect keyed on `transcriptUrl` only, so flipping `speed` from `paused` to `instant` never re-fired and the held events sat undispatched. Split into a parse effect (depends on URL, fetches and stores events in state) and a pace effect (depends on parsed-events + speed, owns the cursor + timers). The playback cursor lives in a ref that survives pause/resume cycles, so flipping `paused` -> `instant` flushes from the current position rather than restarting (which would double-dispatch `run_started` and reset the reducer). 
Two new useCritiqueReplay.test.tsx cases: - paused-then-instant transitions from `playing` to `done` and reaches the shipped terminal phase - intervalMs paced playback dispatches one event, pauses to drain the next scheduled timer, flips to instant, and confirms the remaining transcript drains exactly once (cursor was preserved) Doc consistency The earlier source comment in useCritiqueReplay.ts claimed `live` "paces by recorded timestamps" while the impl used zero-delay timers and the PR body said it behaves like `instant`. Aligned to reality: `live` currently behaves like `{ intervalMs: 0 }` (events drain on successive microtasks via setTimeoutFn) because transcripts do not yet carry per-event timestamps. Honest timestamp-driven pacing is queued as a Phase 7+ follow-up. Validated: pnpm guard, pnpm --filter @open-design/web typecheck, Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite 96 files / 888 tests. * feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread) * feat(web): Theater PanelistLane component (Phase 8.1) * feat(web): Theater ScoreTicker component (Phase 8.2) * feat(web): Theater RoundDivider component (Phase 8.3) * feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4) * feat(web): Theater TheaterDegraded chip (Phase 8.5) * feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6) * feat(web): Theater TheaterTranscript replay surface (Phase 8.7) * feat(web): Theater TheaterStage top-level container (Phase 8.8) * feat(web): Theater CSS using existing semantic tokens (no hex literals) * feat(web): Theater public exports barrel * fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314) Addresses all 4 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen. State-lifecycle fixes (3 x P2) 1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`). 
Host hooks dispatch it when their gating prop changes so a stale run from a prior project / transcript cannot bleed into the next context. Reset is idempotent on idle (returns the same reference). 2. `useCritiqueStream` dispatches `__reset__` at the top of its connection effect, so a workspace switch from project A (which streamed a critique) to project B clears the reducer before the new EventSource opens. enabled=false also clears. 3. `useCritiqueReplay` dispatches `__reset__` at the top of its parse effect, so transcriptUrl swaps (including swap-to-null after a replay reached `shipped`) lift the reducer back to idle before the new fetch starts. SSE validation (1 x P2) 4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape` check after the cheap `isPanelEvent` predicate. A `critique.ship` frame missing `composite` / `round` / `status` / `artifactRef` is rejected before reaching the reducer, so TheaterCollapsed can no longer crash on `undefined.toFixed(1)`. Every variant's required fields are validated: run_started (protocolVersion, non-empty cast, maxRounds, threshold, scale), panelist_* (round, role, plus variant-specific shape), round_end (round, composite, mustFix, decision in {continue,ship}, reason), ship (round, composite, status, artifactRef.{projectId,artifactId}, summary), degraded (reason, adapter), interrupted (bestRound, composite), failed (cause), parser_warning (kind, position). Reducer correctness (1 x P2) 5. `panelist_open` now materializes the round + an empty panelist view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight the in-progress lane the instant the tag opens. Before this, a stream that emitted only `panelist_open` after `run_started` left `rounds = []` and the UI rendered no current round until a later `panelist_dim` arrived. Polish (3 x P3) 6. Brand role tint swaps from `var(--magenta, var(--accent))` to `var(--purple, var(--accent))`. 
`--purple` is actually defined across the design systems; `--magenta` is not, so Brand was silently falling through to `--accent` and looking identical to Designer. 7. New i18n key `critiqueTheater.interruptedSummary` for the interrupted-collapse copy ("Interrupted at round N, best composite X.X"). Previously the interrupted branch reused `shippedSummary` and the UI read "Shipped at round..." for a run that specifically did not ship. Native value in en + zh-CN; other locales fall back via `...en` spread. 8. `TheaterDegraded` heading id comes from `useId()` instead of a hardcoded `theater-degraded-heading`, so two chips rendered on the same page (chat history with multiple completed runs) keep their aria-labelledby references unambiguous. Tests (15 new cases) - reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data. - sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship. - useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false. - useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped. - TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...". - TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`. 
Validated - pnpm guard clean - pnpm --filter @open-design/web typecheck clean - Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new) - tests/i18n/locales.test.ts 5 of 5 across 18 locales * feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1) * feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2) * fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315) Addresses every blocker from codex, Siri-Ray, and lefarcen. The three state-lifecycle and SSE-validation issues they also flagged inherit fixes from PR #1314's review pass that this branch now sits on top of after rebase. Real daemon kill on Interrupt (P1) - CritiqueTheaterMount now POSTs to /api/projects/:id/critique/:runId/interrupt alongside the optimistic local dispatch. Before this fix, clicking Interrupt only flipped the React state to interrupted while the daemon job kept running. The fetch is best-effort: a 404 (endpoint not wired yet, lands in Phase 15) is swallowed with a dev-mode console.warn so the UI still moves to the collapsed badge. - New fetchInterrupt test seam lets RTL assert on the URL / method and simulate the "daemon not ready yet" path. Two tests pin both: the happy URL proj-42/critique/run-abc/interrupt POSTs, and a rejected fetch still flips the UI. interruptPending reset on new run (P2) - A ref-backed effect compares the current runId against the last one we saw; when it changes, interruptPending is cleared. A user who interrupts run-1 and then triggers run-2 from the same mount now gets a fresh, enabled kill button instead of one stuck in "Interrupting…". Pinned by a new mount test. Escape keybind scope (P2) - InterruptButton now checks the keydown target. Escape inside an input, textarea, select, or contenteditable element is ignored (and any ancestor of those via closest() is treated the same way). Body-level focus still fires the keybind so the Theater area's affordance keeps working. 
Four new tests cover textarea, input, contenteditable, and the body-focus positive case. userFacingName i18n key (P2) - The spec at specs/current/critique-theater.md:6 mandates a single critiqueTheater.userFacingName key so the "Design Jury" label can be renamed without touching code. Phase 8 introduced critiqueTheater.title by mistake; renamed across types.ts, en.ts, zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer TheaterStage.tsx. The locale alignment test stays green. Validated - pnpm guard clean - pnpm --filter @open-design/web typecheck clean - Theater suite: 14 files, 112 tests (was 101 before, +11 new for the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope; the rest were already in #1314's review fix). - tests/i18n/locales.test.ts 5 of 5 across 18 locales. * feat(daemon): adapter-degraded registry with TTL (Phase 10.1) In-memory registry recording adapters that produced malformed or oversize transcripts so the orchestrator can skip them for a TTL window (default 24h) instead of cycling through known-bad providers on every run. Records carry reason (malformed_block | oversize_block | missing_artifact), source label, and expiresAt. The test-only clock seam lets the suite advance time deterministically and prove that an expired entry stops counting as degraded without anyone calling clearDegraded. 7/7 vitest cases green. * feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2) Two test-only adapters that read the existing v1 transcript fixtures (happy-3-rounds and malformed-unbalanced) and replay them as either a full string or a 512-byte chunked stream. The chunked form is what the conformance harness uses to prove the parser holds together when the transcript arrives in arbitrary network slices, not as one buffered blob.
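The adapter-degraded registry from Phase 10.1 can be pictured with a minimal sketch. This is not the daemon's code: the factory name `createDegradedRegistry`, the `isDegraded` method, the option names, and the exact signatures of `markDegraded` / `clearDegraded` are assumptions made for illustration. Only the record fields (reason, source, expiresAt), the 24h default TTL, and the idea of an injectable clock come from the commit message.

```typescript
// Hypothetical sketch of an in-memory degraded-adapter registry with a TTL.
type DegradedReason = "malformed_block" | "oversize_block" | "missing_artifact";

interface DegradedRecord {
  reason: DegradedReason;
  source: string;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000; // 24h default window

function createDegradedRegistry(
  opts: { ttlMs?: number; now?: () => number } = {},
) {
  const ttlMs = opts.ttlMs ?? DEFAULT_TTL_MS;
  // Clock seam: tests inject a fake clock, production falls back to Date.now.
  const now = opts.now ?? (() => Date.now());
  const records = new Map<string, DegradedRecord>();

  return {
    // Marking again refreshes the TTL for that adapter.
    markDegraded(adapterId: string, reason: DegradedReason, source: string): DegradedRecord {
      const record: DegradedRecord = { reason, source, expiresAt: now() + ttlMs };
      records.set(adapterId, record);
      return record;
    },
    // An expired record no longer counts, even if nobody cleared it.
    isDegraded(adapterId: string): boolean {
      const record = records.get(adapterId);
      return record !== undefined && record.expiresAt > now();
    },
    clearDegraded(adapterId: string): boolean {
      return records.delete(adapterId);
    },
  };
}

// Usage: advance the fake clock past the TTL without calling clearDegraded.
let fakeNow = 0;
const registry = createDegradedRegistry({ now: () => fakeNow });
registry.markDegraded("adapter-a", "malformed_block", "conformance");
const degradedRightAway = registry.isDegraded("adapter-a");
fakeNow += DEFAULT_TTL_MS;
const degradedAfterTtl = registry.isDegraded("adapter-a");
```

Injecting `now` rather than mocking timers globally keeps the expiry check a pure comparison, which is what makes the "expired entry stops counting" case deterministic.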
* feat(daemon): adapter conformance harness (Phase 10.3) runAdapterConformance pulls a transcript through the same parseCritiqueStream pipeline the orchestrator uses and classifies the outcome as shipped, degraded, or failed. On a degraded outcome it forwards the matched reason to the adapter-degraded registry, so a single nightly conformance run is what populates the skip list rather than the orchestrator learning each adapter is broken at request time. 5/5 vitest cases green covering shipped, malformed degraded, oversize degraded, no-ship failure, and the harness-thrown failure path. * test(e2e): Critique Theater Playwright suite (Phase 11) Six tests, one viewport per visual case, deterministic SSE fixtures stubbed via page.route(). Adds the suite to test:ui:extended so the existing extended-UI lane picks it up. Coverage: 1. Happy path: a single mounted theater plays the full fixture (1 run_started, 5 panelists open / dim / must_fix / close, 1 round_end, 1 ship) and ends on the score badge. 2. Interrupt mid-run: the panelist that is open at the time the interrupt button is clicked closes with an interrupted marker and the transcript freezes there. 3. Visual regression at 375x720 mobile. 4. Visual regression at 768x1024 tablet. 5. Visual regression at 1280x800 desktop. 6. A11y role tree: the theater region exposes a labelled landmark, each panelist lane is a group with an accessible name, the score is a status live region. All SSE traffic is stubbed by page.route so the suite runs in CI without a daemon. The toggle is seeded via localStorage by bootAppWithCritiqueEnabled so the gate behaves as if Settings flipped it on. typecheck clean; playwright --list reports 6. * test(web): reducer p99 bench at 10k iterations (Phase 13.1) Locks the documented 2ms budget for the Critique Theater reducer on a representative SSE script (27 actions, one full happy run) behind a regression gate. 
Asserts p99 stays under 4ms (2x the documented budget) so CI runners with a noisy neighbour do not flake while a real regression to 20ms or 200ms still trips. The bench is a vitest case rather than a bare microbenchmark so it runs in the same CI lane as every other web test and does not need a parallel runner. * test(web): critique surface coverage walker (Phase 13.2) Walks the public critique surface (11 SSE event names, 5 panelist roles, 6 lifecycle phases, 9 named i18n keys) and asserts each named symbol appears in both the src corpus and the test corpus. The walker is the gate that catches a rename in one half of the codebase without a matching update in the other half: a future PR that drops 'panelist_must_fix' from the reducer without also removing its test reference fails this suite. 62 assertions, one per symbol per corpus. * docs: Critique Theater user guide (Phase 14.1) Seven sections aimed at end users (not contributors): 1. What is Design Jury 2. How it works (the five panelists, auto-converging rounds, the composite formula) 3. Settings (the M1 toggle and what it does) 4. Reading the score badge 5. Replay surface 6. Troubleshooting (degraded, interrupted, failed) 7. FAQ The composite formula is documented as designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2 because anyone trying to reverse-engineer the score is going to search for those weights and the docs are the place they should land first. * docs(daemon): critique module AGENTS map (Phase 14.2) Daemon-side wayfinder for the apps/daemon/src/critique directory. Tables every file, what owns what invariant, and the 'when you change anything here' guide so a future contributor does not have to reverse-engineer the rollout resolver before adding a new SSE event. * docs(web): Theater module AGENTS map (Phase 14.3) Web-side mirror of the daemon AGENTS map. Same file table, same invariants section, same change-impact guide, sized to the Theater component package. 
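The composite formula quoted in the user guide commit (Phase 14.1) can be written out as a small weighted sum. The weights below are the ones stated in the message; the names `COMPOSITE_WEIGHTS` and `compositeScore` are hypothetical, and the sketch assumes no rounding or clamping, which the message does not specify.

```typescript
// Hypothetical sketch of the documented composite formula.
type PanelistRole = "designer" | "critic" | "brand" | "a11y" | "copy";

// Weights as documented: designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2
const COMPOSITE_WEIGHTS: Record<PanelistRole, number> = {
  designer: 0,
  critic: 0.4,
  brand: 0.2,
  a11y: 0.2,
  copy: 0.2,
};

function compositeScore(scores: Record<PanelistRole, number>): number {
  let total = 0;
  for (const role of Object.keys(COMPOSITE_WEIGHTS) as PanelistRole[]) {
    total += COMPOSITE_WEIGHTS[role] * scores[role];
  }
  return total;
}

// Example: 0*9 + 0.4*8 + 0.2*7 + 0.2*6 + 0.2*5, which is approximately 6.8
// (floating-point arithmetic may leave a tiny rounding error).
const exampleComposite = compositeScore({ designer: 9, critic: 8, brand: 7, a11y: 6, copy: 5 });
```

Because the designer weight is 0, the designer's own score does not move the composite in this formula; only the four reviewing roles do.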
* feat(daemon): rollout flag resolver (Phase 15.1) Single decision point every caller consults to know whether the orchestrator should wire the critique pipeline for a given run. Priority: 1. Skill-level policy (required wins, opt-out wins inversely) 2. Per-project override from the Settings toggle 3. OD_CRITIQUE_ENABLED env override 4. Rollout phase default (M0 dark-launch: false; M1 settings only: false, toggle is off until the user flips it; M2 per-skill: true if skill opted in; M3 global default: true). The OD_CRITIQUE_ROLLOUT_PHASE parser defaults to M0 on unknown input so a fresh install never surprises a user with the feature on. 10/10 vitest cases green covering every cell of the matrix. * feat(web): Settings toggle hook for Critique Theater (Phase 15.2) React hook that reads critiqueTheaterEnabled from the existing open-design:config localStorage blob and stays in sync via: - the platform storage event (cross-tab) - an open-design:critique-theater-toggle CustomEvent (same-tab) The same-tab event is the one that fires when the Settings panel saves in the current window: the toggle and every mounted theater update without a page reload. setCritiqueTheaterEnabled(next) is the imperative setter the Settings panel calls. It preserves the rest of the stored config (mode, apiKey, etc.) and dispatches the same-tab event after the localStorage write. The web hook reflects what the user toggled; the daemon-side isCritiqueEnabled is the final routing authority (project override, env, rollout phase). When they disagree, the daemon wins for backend gating and the web reflects the toggle state. 6/6 vitest cases green covering first read, stored read, same-tab event flip, config preservation, corrupted JSON tolerance, and cross-tab storage event.
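The four-tier priority described for the Phase 15.1 resolver can be sketched as below. The function name `isCritiqueEnabled` and the input names (phase, skillPolicy, projectOverride, envOverride) appear in this log; the object-shaped parameter, the type names, and `parseRolloutPhase` are assumptions, so treat this as an illustration of the ordering rather than the daemon's actual signature.

```typescript
// Hypothetical sketch of the rollout priority matrix.
type RolloutPhase = "M0" | "M1" | "M2" | "M3";
type SkillCritiquePolicy = "required" | "opt-in" | "opt-out" | null;

interface RolloutInputs {
  phase: RolloutPhase;
  skillPolicy: SkillCritiquePolicy;
  projectOverride: boolean | null; // null means "no opinion"
  envOverride: boolean | null; // null means the env var is unset
}

// Unknown input collapses to M0 so a fresh install stays dark.
function parseRolloutPhase(raw: string | undefined): RolloutPhase {
  switch (raw?.trim()) {
    case "M1":
      return "M1";
    case "M2":
      return "M2";
    case "M3":
      return "M3";
    default:
      return "M0";
  }
}

function isCritiqueEnabled(input: RolloutInputs): boolean {
  // 1. Skill-level policy: required forces on, opt-out forces off.
  if (input.skillPolicy === "required") return true;
  if (input.skillPolicy === "opt-out") return false;
  // 2. Per-project override from the Settings toggle.
  if (input.projectOverride !== null) return input.projectOverride;
  // 3. Env override.
  if (input.envOverride !== null) return input.envOverride;
  // 4. Rollout phase default.
  if (input.phase === "M3") return true;
  if (input.phase === "M2") return input.skillPolicy === "opt-in";
  return false; // M0 dark-launch and M1 settings-only default to off
}

// Usage: a fresh install with nothing set resolves to false.
const freshInstall = isCritiqueEnabled({
  phase: parseRolloutPhase(undefined),
  skillPolicy: null,
  projectOverride: null,
  envOverride: null,
});
```

Using `null` for "no opinion" at each tier keeps the fall-through explicit: a tier only decides when it holds a real boolean or a vetoing policy.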
* test(web): Phase 15 toggle hook failure-mode coverage (PR #1320) lefarcen P2 on PR #1320 flagged that the PR body claimed safe behavior for disabled localStorage, non-object JSON, and missing CustomEvent shim, but the suite only covered corrupt JSON plus happy-path storage events. Added four failure-mode tests so the swallowed errors are not silently traded for a throw in a future refactor: 1. Returns false on a stored JSON value that parses to an array (non-object). Catches a regression where the guard treats anything truthy as a config blob. 2. Returns false on a stored JSON value of literal 'null'. typeof null === 'object' in JS, so the guard has to check null explicitly; this test pins that check. 3. Returns false when localStorage.getItem throws (private mode / disabled storage / SecurityError). The hook must swallow and return false so the rest of the app keeps rendering. 4. setCritiqueTheaterEnabled still dispatches the same-tab CustomEvent when localStorage.setItem throws (quota exceeded / disabled storage). The dispatch path is the in-session broadcast that keeps every mounted hook coherent even when persistence is unavailable; verified by mounting two probes and asserting both flip after the setter is called with a throwing setItem. 10/10 vitest cases green (6 existing + 4 new). * fix(web): honor CustomEvent payload in toggle hook listener (PR #1320) Both Siri-Ray (blocking) and lefarcen (P2 new) caught the same real bug in the failure-mode test I added in `affcdd27`: the test asserts the in-session UI flips when localStorage.setItem throws, but the CustomEvent listener was ignoring the event's typed detail and just calling readToggle(). Under a throwing setItem the localStorage value is stale (or absent), so the listener would see the OLD value and the test would fail (or worse, the production claim 'in-session event keeps mounts coherent' was hollow). 
Fixed the hook, not the test: the listener now reads event.detail.enabled when it is a boolean, falling back to readToggle() only for malformed events or for cross-tab storage events (which do not carry a typed payload). The setter already dispatched the detail; the listener just was not consuming it. Test changes: - The existing 'setItem throws' test now asserts the right behavior for the right reason. Updated the inline comment to say the listener reads from detail, not localStorage. - New test 'falls back to readToggle when the CustomEvent carries no usable detail' pins the fallback path: a malformed dispatcher (no detail, or detail.enabled not a boolean) degrades cleanly instead of throwing or being silently ignored. 11 / 11 vitest cases green (10 prior + 1 new fallback). * feat(daemon): route critique spawn-path eligibility through the rollout resolver The wireup edit Phase 10 and Phase 15 carved out: today server.ts gates the critique pipeline on critiqueCfg.enabled, which is just the OD_CRITIQUE_ENABLED env var. After this commit it gates on isCritiqueEnabled(...) from the Phase 15 resolver, so the full priority matrix is live: 1. Per-skill od.critique.policy veto (opt-out / required) 2. Per-project override (M1 Settings toggle, written through the existing Phase 6 settings endpoint) 3. OD_CRITIQUE_ENABLED env override (power-user lane / CI fixtures) 4. OD_CRITIQUE_ROLLOUT_PHASE default (M0 dark-launch: false; M1 settings only: false; M2 per-skill: only when skillPolicy === 'opt-in'; M3 global default: true). Default behaviour on a fresh install is unchanged: the resolver returns false at M0 without an env override or a project override, so prod traffic falls through to the legacy single-pass path exactly the way it did before. Inputs threaded today: phase from OD_CRITIQUE_ROLLOUT_PHASE, envOverride from OD_CRITIQUE_ENABLED.
skillPolicy and projectOverride are passed as null for the v1 cutover; the daemon-side handler that round-trips critiqueTheaterEnabled on the project settings row and the od.critique.policy frontmatter resolver land as the next two commits in this branch. The three call sites that used critiqueCfg.enabled (the brand-thread guard, the skill-thread guard, the top-line critiqueShouldRun compound) now read from a single locally-scoped critiqueEnabledForRun boolean, so the eligibility check is computed exactly once per spawn and the prompt composer + orchestrator stay in lockstep the way the existing comment already promised. Tests still green: daemon vitest 22 / 22 across rollout + conformance + adapter-degraded. Daemon typecheck clean. * feat(web): mount CritiqueTheaterMount in ProjectView The web counterpart of the daemon wireup. ProjectView now renders <CritiqueTheaterMount projectId={project.id} enabled={...} /> as a sibling of <AppChromeHeader> inside the top-level <div className="app">. The mount is the drop-in from the Phase 9 stack: it owns the SSE subscription, the kill-request handshake, and the phase-aware swap from the live <TheaterStage> to the collapsed badge once a run settles. The mount returns null until the daemon emits a critique.run_started for the active project, so the visual surface is byte-for-byte unchanged for users who have not opted in. Enabled wiring: useCritiqueTheaterEnabled() reads the M1 Settings toggle from the existing open-design:config localStorage blob and stays in sync with both the platform storage event (cross-tab) and the same-tab open-design:critique-theater-toggle CustomEvent the Phase 15 setter dispatches. The hook honors the event payload directly so a private-mode browser that cannot persist the toggle still updates the in-session UI correctly. The daemon-side gate (isCritiqueEnabled in apps/daemon/src/server.ts) remains the authority for whether a run is actually wired through the critique pipeline. 
This hook only governs whether the web layer renders the resulting SSE stream when the daemon emits one. The two-layer gate is intentional: an integrator embedding the Theater in a custom UI can flip the web visibility independent of the daemon's routing decision, and a daemon-side env override flips backend gating without touching the web's localStorage. Tests still green: web Theater suite 181 / 181 across 16 files. Web typecheck clean. * feat(daemon): resolve od.critique.policy frontmatter at the spawn site The next step in the wireup branch's ladder: replace the placeholder `skillPolicy: null` with the actual value parsed from the active skill's SKILL.md frontmatter. Three small edits, one new field on a public type: 1. SkillInfo gains a `critiquePolicy: SkillCritiquePolicy` field carrying the parsed `od.critique.policy` token (required / opt-in / opt-out / null). The field is null when the skill has no opinion, which lets the lower-priority resolver tiers (projectOverride, envOverride, phase default) decide. 2. listSkills() populates the new field via a small `normalizeCritiquePolicy` helper that tolerates the YAML scalar's casing and trims whitespace. Unknown tokens collapse to null so a typo in SKILL.md cannot accidentally force the panel on or off; it just falls through. Derived example cards inherit the parent's policy. 3. server.ts captures `skill.critiquePolicy` into a hoisted `skillCritiquePolicy` variable inside the existing skill-load block, then threads it into the isCritiqueEnabled call as the skillPolicy input. The hoisting keeps the variable in scope at the resolver call site without restructuring the spawn handler. After this commit, the priority matrix the rollout resolver was designed for is live for its top tier. The previous commit wired env + phase; this one wires skill. The projectOverride input remains null pending the next commit that extends the Phase 6 settings endpoint. Daemon vitest: 10 / 10 rollout cases pass against the new wiring. 
Daemon typecheck: clean. * feat(daemon): feed projectOverride into the rollout resolver from project metadata Replaces the placeholder `projectOverride: null` in the spawn handler with the actual value the Settings panel writes onto the project's metadata blob: `critiqueTheaterEnabled?: boolean`. The read is defensive at the boundary: the metadata object is typed loosely (it round-trips through SQLite as a free-form JSON blob), so the spawn handler narrows to `boolean` and falls through to `null` for any other shape. A missing key, a malformed value, or a project that has never visited Settings collapses to `null`, which is exactly the resolver's "no opinion, fall through to env / phase" signal. The `critique` frontmatter slot also gets typed on the SkillFrontmatter shape so the `od.critique.policy` chain the previous commit introduced no longer needs a bracket-access cast. Same pattern as the existing `craft`, `preview`, and `design_system` nested-record slots. After this commit, every tier of the rollout resolver's priority matrix is wired: 1. skillPolicy (from SKILL.md od.critique.policy) 2. projectOverride (from project metadata critiqueTheaterEnabled) 3. envOverride (from OD_CRITIQUE_ENABLED) 4. rollout phase (from OD_CRITIQUE_ROLLOUT_PHASE) The write path for projectOverride still flows through the existing project-update handler the Settings panel already uses to persist project metadata; no new endpoint is needed. The Settings UI button that calls setCritiqueTheaterEnabled and posts the new field is the next commit on this branch. Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases still green against the new wiring. * fix(daemon): forward critique events to project sinks + align composer gate (PR #1338) Two codex review items addressed in one commit since they share the same root cause (resolver-enabled run hits a transport / prompt contract that was still env-gated): P1 (transport mismatch). 
The daemon emits critique.* SSE frames through critiqueBus -> design.runs.emit, which fans out on /api/runs/:runId/events. The web CritiqueTheaterMount subscribes to /api/projects/:projectId/events (it's project-scoped, not run-scoped, because the mount lives at the project workspace and follows the user across runs). Result: in production the mount never sees a real frame and the e2e tests' stubbed routes hide the mismatch. Fixed by extending critiqueBus.emit to fan out to BOTH sinks: the existing runs.emit transport, AND the per-project event-sinks map. The project-events route emits via sse.send(payload.type, payload), so we pack the SSE channel name onto payload.type and let the sink push the right channel. The web sseToPanelEvent overwrites type from the channel name on the way back into a PanelEvent, so the round-trip stays correct. P2 (prompt gate misalignment). composeSystemPrompt reads cfg.enabled to decide whether to append the panel addendum, but critiqueCfg.enabled is loaded from OD_CRITIQUE_ENABLED only. A run the resolver enabled via phase / project / skill (env unset) would have critiqueShouldRun = true while critiqueCfg.enabled remained false. That drops the panel prompt while still routing through runOrchestrator, so the parser waits for tags that never arrive and the run degrades. Fixed by passing a derived config { ...critiqueCfg, enabled: true } to the composer when critiqueShouldRun is true. The composer's own gate now agrees with the resolver decision on every input the spec defines. Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases still green against the new wiring. * fix: address PerishCode P1 + P2 follow-ups on PR #1338 Two follow-up items PerishCode flagged on the activation PR. Non-blocking but both are real: 1. Phase 11 e2e suite was wired into test:ui:extended but lands the user on '/' (home route) where ProjectView (and therefore CritiqueTheaterMount) is never rendered.
With the suite as written, every assertion would time out the first time the lane runs in CI, contradicting the PR body's claim that the suite stays parked behind test.describe.fixme. The state diverged from my earlier Phase 11 work because the merge from main on commit `4ab719c6` brought in #1307's squash-merged version of the e2e file (the pre-fixme shape). Re-applied test.describe.fixme to the describe block plus removed ui/critique-theater.test.ts from the test:ui:extended script in e2e/package.json. Added a file-header docblock explaining what the follow-up commit needs to do: replace goto('/') with /projects/:id navigation similar to app-design-files.test.ts, split the SSE fixture into a live prefix and terminal suffix (Codex P2 on PR #1320), and commit the first PNG baselines. 2. bestRoundOf in CritiqueTheaterMount returned the LAST round with a numeric composite, not the round with the HIGHEST composite, while bestCompositeOf correctly returned the max. A run that closed round 1 at 8.5 and round 2 at 6.0 would dispatch interrupted { bestRound: 2, composite: 8.5 } on a user-clicked interrupt. Folded the two helpers into a single bestRoundAndComposite that walks state.rounds once and returns the matching pair so the two values cannot drift. The onInterrupt callback now destructures from one helper instead of two independent reads. Falls back to (state.activeRound, 0) when no round has closed with a composite yet. Web typecheck: clean. CritiqueTheaterMount.test.tsx: 7 / 7 cases still green against the new helper. * fix: wire M1 project override end-to-end + correct deferred-surface doc claims (PR #1338) Three lefarcen P2s on the latest review pass, all real: 1. M1 project override was half-wired: the daemon read metadata.critiqueTheaterEnabled but the web setter only wrote localStorage. 
A user opt-in would render the Theater on the web (localStorage was set) while the daemon resolved projectOverride=null and skipped critique unless env / phase already permitted. Two halves talking past each other. Extended setCritiqueTheaterEnabled to accept an optional { projectId, fetchProjectSettings } options bag. When a projectId is supplied, the setter ALSO sends a PATCH /api/projects/:id with { metadata: { critiqueTheaterEnabled } } so the daemon's spawn-time resolver picks the same value up on the next generation. The existing project-routes endpoint already accepts arbitrary metadata patches, so no new endpoint is needed. The local write + the CustomEvent dispatch still fire before the PATCH, so a network failure does not unwind the in-session UI flip. Three new vitest cases pin the new path: PATCHes when projectId is provided, skips when it is not, swallows a rejected PATCH so the in-session UI still flips. 2. Rollout docs (docs/critique-theater.md section 3) claimed the Settings toggle persists into the daemon settings store, but the previous implementation only had a localStorage reader / writer plus a daemon read of project metadata, with no round-trip. Rewrote the section to lead with the four-tier resolver (skill policy / project override / env / phase), document that the setter now round-trips via the existing PATCH endpoint when given a projectId, and call out the Settings panel UI control as a deliberate follow-up. 3. Troubleshooting table pointed users at /api/metrics/critique (Phase 12, deferred) and 'od adapters clear-degraded <id>' (CLI wrapper that does not exist). Replaced the metrics reference with the local conformance harness command (pnpm --filter @open-design/daemon vitest run tests/critique-conformance.test.ts) that ships today, with a note that the Phase 12 dashboard surfaces this status as a series once that PR lands. 
Replaced the CLI command with the programmatic clearDegraded() helper that exists today and flagged the CLI wrapper as planned follow-up. Web typecheck: clean. Toggle hook tests: 14 / 14 green (11 existing + 3 new for the round-trip path). * test(web): multi-round interrupt regression for bestRoundAndComposite (PR #1338) lefarcen P3 follow-up to the previous bestRoundAndComposite fix: the existing CritiqueTheaterMount.test.tsx interrupt cases only exercised a single-round state, so a future refactor back to two independent helpers wouldn't be caught by the test suite even though it'd reintroduce the round / composite drift bug. Added a regression case that: 1. Drives the reducer through two complete rounds with the full 5-role cast closing at distinct composites: round 1 at 8.5, round 2 at 6.0 (the high-composite round is NOT the most recent one). 2. Clicks Interrupt + waits for the daemon ack via the test seam fetcher returning 204. 3. Asserts the collapsed badge displays "round 1" (the correct best-composite round), and queryByText for "round 2 ... 8.5" returns null (the buggy pairing would have produced that string). The bestRoundAndComposite helper walks state.rounds in one pass and returns the matching pair, so the round number and the composite cannot drift apart. This test locks the fix in: a refactor that splits the helpers back into independent walks will be caught here. 8 / 8 vitest cases green on the file. * fix(web): read-merge-write the project metadata in setCritiqueTheaterEnabled (PerishCode P2 on PR #1338) The previous round-trip sent { metadata: { critiqueTheaterEnabled: next } } as the entire PATCH body. The daemon's project-routes handler only re-stamps three immutable fields (baseDir, importedFrom, fromTrustedPicker) before calling updateProject(db, id, patch), which then does a shallow { ...existing, ...patch } in apps/daemon/src/db.ts.
So patch.metadata replaces the row's metadata wholesale, dropping kind, templateId, linkedDirs, and every other field the rest of the app reads. No in-tree caller passes projectId today (only vitest cases), so the bug had not surfaced yet. But the surface is documented in docs/critique-theater.md section 3 and the function's own JSDoc as the M1 round-trip path, so it would have shipped as a latent footgun for the next integrator: a Settings UI follow-up, or any third party that wires the setter into a project-aware surface. Fix: read-merge-write rather than a bare patch. - GET /api/projects/:id to read the row's current metadata. - Spread that metadata into the PATCH body and overlay critiqueTheaterEnabled: next on top, mirroring the partial-metadata pattern already used in ChatComposer.tsx for linkedDirs. - PATCH the merged object. Failure handling: - GET fails: skip the PATCH entirely. We cannot construct a safe merged body without the current state, and a bare patch would wipe other metadata. The in-session CustomEvent fired earlier in the setter still keeps every mounted hook consistent; the next save retries the round-trip. - PATCH fails: log in dev. The in-session UI is already correct via the CustomEvent. Tests (TDD, red-first): - 'GETs the project then PATCHes with merged metadata when a projectId is supplied': stubs a GET that returns { kind: 'template', templateId: 'modern-blog', linkedDirs: [...] } and asserts the PATCH body equals the merge plus the toggle. - 'PATCHes with just the toggle when the project has no prior metadata': stubs a GET that returns no metadata block. - 'skips the PATCH (does not stomp metadata) when the prefetch GET fails': stubs a rejecting GET and asserts only the GET fires. - 'swallows a rejected PATCH after a successful prefetch': stubs a successful GET and a rejecting PATCH; asserts the in-session UI still flips via the CustomEvent. 
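The metadata-stomping bug and the read-merge-write fix described above can be reproduced in a few lines. Everything named here is illustrative: `shallowUpdate` stands in for the shallow `{ ...existing, ...patch }` the message attributes to updateProject, while `buildTogglePatch`, `ProjectClient`, and `syncToggle` are hypothetical helpers, and the client's method shapes are assumptions rather than the real fetch calls.

```typescript
// Hypothetical sketch: why a bare metadata patch drops fields, and the merged alternative.
type Metadata = Record<string, unknown>;

// Stand-in for the shallow row update described in the commit message.
function shallowUpdate<T extends object>(existing: T, patch: Partial<T>): T {
  return { ...existing, ...patch };
}

// Spread the current metadata first, then overlay the toggle on top.
function buildTogglePatch(current: Metadata | undefined, next: boolean): { metadata: Metadata } {
  return { metadata: { ...(current ?? {}), critiqueTheaterEnabled: next } };
}

const row: { name: string; metadata: Metadata } = {
  name: "demo",
  metadata: { kind: "template", templateId: "modern-blog" },
};

// Bare patch: metadata is replaced wholesale, so kind and templateId vanish.
const barePatched = shallowUpdate(row, { metadata: { critiqueTheaterEnabled: true } });
// Merged patch: prior metadata survives alongside the toggle.
const mergedPatched = shallowUpdate(row, buildTogglePatch(row.metadata, true));

// Assumed minimal client; the real setter talks to GET / PATCH /api/projects/:id.
interface ProjectClient {
  getMetadata(projectId: string): Promise<Metadata | undefined>;
  patchProject(projectId: string, body: { metadata: Metadata }): Promise<void>;
}

async function syncToggle(
  client: ProjectClient,
  projectId: string,
  next: boolean,
): Promise<"patched" | "skipped" | "patch_failed"> {
  let current: Metadata | undefined;
  try {
    current = await client.getMetadata(projectId);
  } catch {
    // Read failed: skip the write, since a bare patch would wipe other metadata.
    return "skipped";
  }
  try {
    await client.patchProject(projectId, buildTogglePatch(current, next));
    return "patched";
  } catch {
    // Write failed: the in-session UI is assumed to be correct already.
    return "patch_failed";
  }
}
```

The two early returns mirror the two failure modes in the message: a failed read skips the write entirely, and a failed write is tolerated because the in-session state was updated before the round-trip started.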
Doc updated on the setter's JSDoc to describe the new three-step flow (localStorage, CustomEvent, read-merge-write PATCH) and the two failure modes. Verified: - pnpm --filter @open-design/web typecheck clean. - pnpm --filter @open-design/web test: 111 files / 1055 tests green (was 1052, +3 from the new merge-flow cases). * fix(web): restore wait-for-daemon-ack pattern on Theater interrupt Same regression as flagged on PR #1316 post-main-merge: the optimistic local dispatch fired before the POST resolved, so a daemon 404 / 409 still terminalized the UI and the real SSE terminal event got ignored by the sticky interrupted phase. Snapshot runId / bestRound / composite at click time, dispatch interrupted only on res.ok, clear interruptPending on rejection or non-2xx so the user can retry. Tests cover rejection + 404 leaving the run on the live stage; the 204 path waits for the ack. * feat(daemon): Critique Theater Phase 12 observability foundations Lands the metrics registry, the structured logger, the /api/metrics route, and the adapter-degraded bump that wires up the first data point. The orchestrator-side bumps for runs / rounds / composite / must-fix / interrupted / parser_errors / protocol_version land in a follow-up commit on this branch (kept separate so the wiring diff reads cleanly against the registry shape). Surfaces added: - apps/daemon/src/metrics/index.ts: 9 Prometheus series under the open_design_critique_* namespace with the histogram buckets the spec calls out (round_duration_ms at 100 / 250 / 500 / 1000 / 2500 / 5000 / 10000 / 30000 / 60000 ms; composite_score at 0-10 integer steps). - apps/daemon/src/logging/critique.ts: 6 typed events, one JSON line per call on stdout, namespaced critique. Matches the JSON-per-line convention cli.ts already uses; no new logger framework. - apps/daemon/src/server.ts: GET /api/metrics route. Honors OD_METRICS_ENDPOINT=disabled to opt out for air-gapped installs. 
- apps/daemon/src/critique/adapter-degraded.ts: markDegraded now bumps degraded_total so the adapter-health dashboard panel reflects every TTL refresh and every fresh mark. Deps: prom-client ^15.1.0, @opentelemetry/api ^1.9.0 added to apps/daemon/package.json. Both are zero-config no-ops without an exporter wired; daemon bundle size impact is ~150 KB uncompressed. The @opentelemetry/api dep is in place ahead of the OTel-spans follow-up commit; it adds no behavior on this commit. Tests: - tests/metrics/critique.test.ts (3 cases): registry shape + exposition text + reset-between-tests - tests/logging/critique.test.ts (4 cases): event shape + ordering + newline framing + namespace stamping Verification (Windows-local): - pnpm --filter @open-design/daemon typecheck: clean - New metrics + logging suites: 7 / 7 green - Existing adapter-degraded + conformance + rollout suites: 22 / 22 green; the bump is non-breaking * feat(daemon): wire Critique Theater metrics + structured logs from the orchestrator Lights up the bump sites the Phase 12 foundations PR registered the series for. Every panel event the parser surfaces now reaches the matching Prometheus counter / histogram and the matching JSON log line on stdout. Switch-loop bumps + logs: - run_started: log run_started, set protocol_version gauge to the observed protocol version (small-integer cardinality). - panelist_open: record the first-open wall-clock per round so round_end can compute round_duration_ms; subsequent opens in the same round leave the start time untouched. - panelist_must_fix: bump must_fix_total with the panelist role. The wire event does not yet carry a dim name, so the label is 'unspecified' for now; a future parser revision can drop in the real dim without a metric rename. - round_end: bump rounds_total, observe composite_score, observe round_duration_ms (current ms minus the tracked start), log round_closed with the composite / mustFix / decision triple. 
- parser_warning (parser-yielded): bump parser_errors_total with the kind label, log parser_recover with kind + position. Orchestrator-side parser warnings (composite_mismatch and duplicate_ship from the daemon-authoritative scoring checks) go through a new emitParserWarning helper so the bus emit, the collectedEvents push, the metric bump, and the log line stay in lockstep. Three inline emission sites collapse to one-line helper calls. After the try/catch, a single terminal-status switch bumps runs_total{status, adapter, skill} once per run, with branch-specific log + counter: - shipped / below_threshold: log run_shipped - interrupted: bump interrupted_total, log run_failed{cause: interrupted} - timed_out: log run_failed{cause: timed_out} - failed: log run_failed{cause: orchestrator_internal} - degraded: log degraded{reason: orchestrator_classified} OrchestratorParams gains optional skill: string for the label; defaults to 'unknown' so spawn sites that have not yet threaded it keep working without a metric shape change. Tests: - The new metrics + logging suites (7 / 7) verify registry shape and event framing; orchestrator-side metric integration is exercised through the existing critique-conformance and critique-adapter-degraded suites (22 / 22 still green). - Logger test reassigns process.stdout.write directly instead of vi.spyOn so the Node overloaded write signature does not collide with MockInstance<unknown>. * feat(observability): Grafana dashboard JSON for Critique Theater Three default rows mapping to the metrics this branch wires up: 1. Fleet quality: composite score p50 / p90 / p99 line graph by adapter, plus a heatmap of the composite distribution. The line graph answers 'are my agents getting better over time'; the heatmap answers 'are the bad runs clustered around one adapter or smeared across the fleet'. 2. Adapter health: stacked bar charts for degraded marks (by adapter / reason) and parser errors (by adapter / kind) over a 5-minute window.
The two queries together let an operator see 'is this adapter degraded because of malformed wire output or because of oversize blocks' without flipping panels. 3. Brief throughput: runs-per-hour by terminal status, an average rounds-per-run stat per adapter, and a round-duration ms p50 / p90 / p99 line. Throughput numbers fall straight out of the runs_total / rounds_total counters; the duration histogram is the same one the runs feed. The dashboard uses a templated $datasource var (defaults to 'prometheus') so an operator with multiple Prometheus instances can switch without editing JSON. Schema version 39 (Grafana 11). Operators import via: pnpm dlx @grafana/cli dashboard import tools/dev/dashboards/critique.json or paste into a provisioned dashboards directory. The file is checked into the repo as a starting artifact; alert rules and SLO panels ship after the first 1000 runs inform the right thresholds. JSON validates with node -e 'JSON.parse(...)' (sanity checked locally). * feat(daemon): OpenTelemetry outer span around the critique run Wraps each runOrchestrator call in a 'critique.run' span via the existing @opentelemetry/api dep added in the Phase 12 foundations commit. Attributes set on the span: - critique.run_id, critique.adapter, critique.skill at start - critique.final_status, critique.final_composite on terminal resolution - span status flipped to ERROR for failed / timed_out runs so a Tempo / Honeycomb / Jaeger filter on traces.status=error surfaces the right slice without joining back to Prometheus No exporter is wired by default; @opentelemetry/api is the API package and intentionally splits from @opentelemetry/sdk-*, so the span is zero-overhead until an operator attaches an SDK through their runtime config. 
Inner per-round / parse_chunk / scoreboard_eval / persist_round / ship.persist spans defined in the Phase 12 plan are a follow-up: the outer span alone gives the trace a duration + final status + adapter/skill labels, which is the 80% value for dashboards that correlate runs across services. Adding child spans inside the existing 600-line orchestrator without restructuring is a separate careful change. Verification: - pnpm --filter @open-design/daemon typecheck: clean - 29 / 29 critique + metrics + logging tests still green fix(nix): bump pnpmDepsHash for prom-client + @opentelemetry/api lockfile bump nix-check failed on PR #1485 with hash mismatch in open-design-daemon-pnpm-deps and open-design-web-pnpm-deps after the Phase 12 foundations commit (`2b8b7445`) added prom-client and @opentelemetry/api to apps/daemon/package.json and refreshed pnpm-lock.yaml. CI reported the new sha: specified: HFLm+8hv3o5x3Xem4MXNsNclIgiVRc70+EBafL0rVn8= got: 7R1sQC38gOT0gsZ2oNOviCZ486cbbGJGJCis6WI8z9s= Both nix files pin the same workspace lockfile, so both flip in lockstep. No other Nix surface changes required. * fix(daemon): four Phase 12 review findings (Codex P2 x2 + Siri-Ray P2 + lefarcen P2) 1. Siri-Ray P2 in orchestrator.ts (round metric / log used untrusted agent values). The new observability path now records rs.composite and rs.mustFix (daemon-authoritative) instead of event.composite and event.mustFix when rs exists, and skips the bumps + log entirely when rs is missing (a degenerate round_end without any matching panelist_open). The dashboard p50 / p90 / p99 now agrees with persistence and ship decisions; an adapter reporting <ROUND_END composite='10'> while the daemon computed 6 logs 6 and still emits the composite_mismatch parser warning the prior block was already producing. 2. Codex P2 in server.ts (skill label always 'unknown'). 
The spawn path called runOrchestrator without passing the resolved skill id, so every live run bumped open_design_critique_{skill='unknown'} and the per-skill dashboard breakdown was always empty. Threaded effectiveSkillId (already computed at the same handler scope as the project skill fallback) through skill: . . . so the metric reflects the real skill when one is assigned, and the orchestrator default of 'unknown' only fires for runs that genuinely have none. 3. Codex P2 in conformance.ts (protocol-version mismatch let through). An adapter that emitted <CRITIQUE_RUN version='2'> followed by a valid SHIP classified as shipped because the harness only watched for terminal events. Added a guard inside the parse loop: if a run_started carries protocolVersion !== CRITIQUE_PROTOCOL_VERSION, mark the adapter degraded with reason 'protocol_version_mismatch' (already in DEGRADED_REASONS) and return early. ConformanceOutcome union widened to accept the new reason. 4. lefarcen P2 in tools/dev/dashboards/critique.json (runs-per-hour panel under-reported by 3600x). 'rate(...[1h])' returns per-second. Multiplied by 3600 so the panel title and unit match the actual value rendered. Verification: - pnpm --filter @open-design/daemon typecheck: clean - New metrics + logging suites (7), existing adapter-degraded (7), conformance (5), rollout (10): 29 / 29 green - Grafana JSON re-parses with node -e 'JSON.parse(...)' feat(daemon): Critique Theater Phase 16 (M-phase rollout ratchet) The PR that takes the rollout out of operator-flips-env-vars-by-hand and into the-fleet-conformance-numbers-decide. Stacks on Phase 12 (#1485): the ratchet reads from the conformance harness's daily output, which only exists once Phase 12's metrics + history surface land. Five surfaces: 1. apps/daemon/src/critique/ratchet.ts (new) Pure evaluator. Takes the current RolloutPhase plus a rolling window of ConformanceDay rows and returns one of three decisions: promote, hold, or demote. 
Spec defaults (14-day window, 0.90 shipped, 0.95 clean-parse) match specs/current/critique-theater.md. Demote floor is half the promote threshold so a single noisy day does not bounce the rollout back; only sustained breakage walks things back. M0 cannot demote and M3 cannot promote, both collapse to hold with an explicit reason string. 2. apps/daemon/src/critique/conformance-history.ts (new) JSON-lines persistence at dataDir/conformance/adapter/date.jsonl. Append-only writer + windowed reader. Last entry per (adapter, date) wins so a retry-after-failure cron writes the right answer without a read-modify-write at write time. Malformed lines, missing files, and missing adapter directories all collapse to skip-this-row since a missing day is data missing, not data wrong. 3. apps/daemon/src/server.ts GET /api/critique/conformance returns { window, decision }. Tunables come from query string (windowDays, shippedThreshold, cleanParseThreshold) with spec defaults. The recommendation does not auto-flip OD_CRITIQUE_ROLLOUT_PHASE; an operator-driven follow-up consumes the JSON and decides whether to flip or alert. 4. .github/workflows/critique-conformance.yml (new) Nightly cron at 03:00 UTC. Builds the daemon, drives the conformance harness against the synthetic-good and synthetic-bad fixtures, and uploads the .od/conformance/ snapshot as a workflow artifact. The schedule sits outside the busy generation window so the cron does not contend with user runs for adapter rate-limit budgets. 5. apps/daemon/tests/critique-ratchet.test.ts + critique-conformance-history.test.ts 17 cases. Ratchet: 10 cells of the promote / hold / demote matrix. History: 7 round-trip cases. Verification: - pnpm --filter @open-design/daemon typecheck: clean - 17 / 17 new tests green - Phase 12 metrics + logging + adapter-degraded + conformance + rollout suites (29) untouched and still green * fix(daemon): three Phase 16 review findings (Codex P1/P2 + lefarcen P1 x3) 1. 
Duplicate parseRolloutPhase import in server.ts. The new standalone import collided with the existing grouped import; ESM would fail to parse at module load on every daemon startup path. Removed the standalone import; the grouped one already exports parseRolloutPhase. 2. Validation gap in evaluateRollout. A request like ?windowDays=0 fed passingDays >= windowDays = 0 >= 0 = true, returning promote with zero observed days. Now the evaluator rejects non-positive windowDays and out-of-range thresholds at the function entry with an explicit hold reason. The route also clamps query strings before they reach the evaluator (belt + suspenders so a future caller bypassing the route hits the same defense). 3. Missing nightly runner. The workflow called apps/daemon/src/critique/__fixtures__/run-nightly.ts, which the prior PR did not actually add, and || echo masked the failure. Added the runner: drives every synthetic adapter through runAdapterConformance, walks the resulting events for parser_warning to compute cleanParseRate, and writes one ConformanceDay row per adapter via appendConformanceDay. Removed the || echo mask so the workflow fails loudly when the runner throws. Tests for the validation fix: four new ratchet cases (windowDays=0 holds with no evidence, windowDays=-7 holds, shippedThreshold > 1 holds, cleanParseThreshold < 0 holds). Ratchet suite goes from 10 -> 14 cases. Verification: - pnpm --filter @open-design/daemon typecheck: clean - 33 / 33 critique tests green (14 ratchet, 7 conformance-history, 7 adapter-degraded, 5 conformance) * test(daemon): explicit NaN regression cases for the ratchet evaluator (PerishCode follow-up on PR #1499) The Number.isFinite() guard already rejects NaN on every numeric input, so this is belt-and-suspenders: pinning the behavior so a future refactor of the guard (a typed parser, a clamp helper, a relaxed range check) cannot accidentally let NaN through and surface a zero-evidence promote signal. 
Three new assertions inside one case (windowDays=NaN, shippedThreshold=NaN, cleanParseThreshold=NaN), each asserting hold + the matching 'invalid X' reason string. Ratchet suite goes from 14 -> 15 cases. * fix(nix): regenerate lockfile + pin pnpmDepsHash for prom-client + @opentelemetry/api (lefarcen P1 on PR #1499) --------- Co-authored-by: Nagendhra <nagendhra405@gmail.com>	2026-05-14 11:05:57 +08:00
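The ratchet rules described in the Phase 16 commits above can be sketched roughly as follows. This is a minimal illustration, not the code in apps/daemon/src/critique/ratchet.ts: the function name `evaluateRolloutSketch`, the `shippedRate` field, the intermediate phases M1 and M2, the exact reason strings, and the "every day in a full window is below the floor" demote rule are all assumptions made for the example. Only the defaults (14 days, 0.90, 0.95), the half-threshold demote floor, the M0 / M3 edge cases, and the input validation come from the commit text.

```typescript
type RolloutPhase = "M0" | "M1" | "M2" | "M3"; // M1 / M2 are assumed intermediate phases
type Decision = "promote" | "hold" | "demote";

interface ConformanceDay {
  date: string;
  shippedRate: number; // hypothetical field name
  cleanParseRate: number;
}

interface RatchetOptions {
  windowDays?: number;
  shippedThreshold?: number;
  cleanParseThreshold?: number;
}

interface RatchetResult {
  decision: Decision;
  reason: string;
}

// A threshold is valid only if it is a finite number in [0, 1]; NaN fails here.
const isRate = (n: number): boolean => Number.isFinite(n) && n >= 0 && n <= 1;

function evaluateRolloutSketch(
  phase: RolloutPhase,
  days: ConformanceDay[],
  opts: RatchetOptions = {},
): RatchetResult {
  const windowDays = opts.windowDays ?? 14;
  const shippedThreshold = opts.shippedThreshold ?? 0.9;
  const cleanParseThreshold = opts.cleanParseThreshold ?? 0.95;

  // Reject bad tunables up front so a zero-length window can never promote.
  // Requiring an integer is an extra assumption of this sketch; it also rejects NaN.
  if (!Number.isInteger(windowDays) || windowDays <= 0) {
    return { decision: "hold", reason: "invalid windowDays" };
  }
  if (!isRate(shippedThreshold)) {
    return { decision: "hold", reason: "invalid shippedThreshold" };
  }
  if (!isRate(cleanParseThreshold)) {
    return { decision: "hold", reason: "invalid cleanParseThreshold" };
  }

  const window = days.slice(-windowDays);
  const passing = window.filter(
    (d) => d.shippedRate >= shippedThreshold && d.cleanParseRate >= cleanParseThreshold,
  ).length;
  // Demote floor is half the promote threshold.
  const broken = window.filter(
    (d) => d.shippedRate < shippedThreshold / 2 || d.cleanParseRate < cleanParseThreshold / 2,
  ).length;

  if (passing >= windowDays) {
    return phase === "M3"
      ? { decision: "hold", reason: "M3 cannot promote" }
      : { decision: "promote", reason: "full window above thresholds" };
  }
  if (broken >= windowDays) {
    return phase === "M0"
      ? { decision: "hold", reason: "M0 cannot demote" }
      : { decision: "demote", reason: "full window below demote floor" };
  }
  return { decision: "hold", reason: "insufficient evidence" };
}
```

Because `window` never holds more than `windowDays` rows, `passing >= windowDays` can only be true when a full window of passing days exists, which is the property the `?windowDays=0` fix protects.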
Bryan	54498f1ac5	fix(web): parse Provenance with Markdown-bold labels (#1584) * fix(web): parse Provenance with Markdown-bold labels (#1580) The daemon's finalize synthesis prompt at apps/daemon/src/finalize-design.ts:560-565 lists the five Provenance fields without pinning field-label syntax, so Claude renders them with Markdown-bold labels per Markdown convention (`- **Field:** value`). The parser at apps/web/src/lib/parse-provenance.ts:32-36 uses `[:\s]+` as its label/value separator, which stops at the trailing `**` after the colon; the capture group then slurps the `**` and any following whitespace into the value. Downstream of that, transcriptMessageCount and generatedAt parse as null because the captured tokens don't start with digits or a valid ISO 8601 prefix, and the Continue in CLI clipboard prompt shows `Design system: ** ...`, `Transcript message count when DESIGN.md was generated: unknown`, `DESIGN.md generated at: unknown`. Fix: strip leading and trailing Markdown emphasis (`*`, `_`, whitespace) from every captured value via a single helper threaded through extractField / extractFieldOrNone / extractNumber / extractDate. Widen the transcriptMessageCount regex's capture from `(\d+)` to `([^\n]+)` so the strip step gets a chance to run on `* 4`. Add `[^:]*` between `count` and `[:\s]+` to mirror the other label-walking regexes for bolded label variants. Defense-in-depth: tell the synthesis prompt to emit plain `- Field: value` bullets with no emphasis on the labels. The parser hardening is the load-bearing fix; this is belt-and-suspenders for new model variants. Red-Green-Refactor: - Phase 1 (Red): 3 new parse-provenance tests covering bold labels with backticked values, bold labels with a short `Generated:` form, and bold labels with `none` sentinels. All 3 failed against pre-fix source. - Phase 2 (Green): strip + regex widening. All 7 parse-provenance tests + 1158 web tests pass. 
- Phase 3: empirically verified against a live finalized DESIGN.md; all five fields now parse correctly. - Phase 4 (defense-in-depth): one-line addendum to synthesis prompt. - Phase 5: bold-labelled Provenance fixture added to the hook test (useDesignMdState.test.tsx) so the round-7 `unknown-provenance` fail-closed path is regression-pinned end-to-end. Backticks in field values are intentionally kept (out of scope per the issue spec; rendered clipboard text reads fine with them). The variant `- **Field**: value` (colon outside emphasis) is not in the issue enumeration and is not handled. Fixes #1580 fix(web): narrow Provenance strip to Markdown residue only Round-2 fix per lefarcen's review on PR #1584. The round-1 helper used `^[\s*_]+` / `[\s*_]+$`, which stripped a literal leading or trailing `*`/`_` from any captured value: `_draft.html` corrupted to `draft.html`, and a build id like `build_id_v1_` lost its trailing underscore. Narrow stripMarkdownEmphasis to three explicit passes: 1. Leading `*`/`_` tokens FOLLOWED BY WHITESPACE: only matches the `** ` residue left after `- **Field:** value` is captured starting at the `**`. 2. Trailing WHITESPACE followed by `*`/`_` tokens: mirror of (1) if the value closes with emphasis after whitespace. 3. A single balanced wrap around the remaining value (`**X**` / `*X*` / `__X__` / `_X_`): handles the `- **Field:** **value**` shape and any plain-label `**value**` form. Asymmetric literal `*`/`_` characters in the value (no whitespace separator, no balanced closing token) are preserved by construction. Added regression tests: - plain label + `_draft.html` value - plain label + `build_id_v1_` value (trailing underscore) - bold label + `_draft.html` value (residue stripped, literal leading underscore preserved) - plain label + `**wrapped-id**` value (balanced residue stripped) All 11 parse-provenance tests + 1162 web tests pass. Empirically re-verified against a live finalized DESIGN.md; all five fields still parse correctly. 
--------- Co-authored-by: DevForgeAI CI/CD Engineer <devforge-ai@development.ai>	2026-05-14 11:04:24 +08:00
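The three-pass strip described in the round-2 commit above might look something like the sketch below. It is an illustration written from the commit text, not the helper in apps/web/src/lib/parse-provenance.ts: the name `stripMarkdownEmphasisSketch` and the specific regular expressions are assumptions, chosen so that the regression cases listed in the commit behave as described.

```typescript
// Hypothetical re-creation of the narrowed strip; the real regexes may differ.
function stripMarkdownEmphasisSketch(raw: string): string {
  let value = raw.trim();

  // Pass 1: leading emphasis tokens that are FOLLOWED BY WHITESPACE, e.g. the
  // "** " residue left when capture starts at the closing "**" of a bold label.
  value = value.replace(/^[*_]+\s+/, "");

  // Pass 2: mirror of pass 1 at the end of the value (whitespace, then tokens).
  value = value.replace(/\s+[*_]+$/, "");

  // Pass 3: one balanced wrap around the whole value (**X**, __X__, *X*, _X_).
  // The backreference \1 requires the closing token to equal the opening one.
  value = value.replace(/^(\*\*|__|\*|_)(.+)\1$/, "$2");

  return value;
}

// Usage, mirroring the regression cases named in the commit message:
console.log(stripMarkdownEmphasisSketch("** _draft.html")); // residue stripped, leading underscore kept
console.log(stripMarkdownEmphasisSketch("build_id_v1_"));   // trailing underscore kept
console.log(stripMarkdownEmphasisSketch("**wrapped-id**")); // balanced wrap removed
```

A literal leading or trailing `_` survives because passes 1 and 2 insist on a whitespace separator and pass 3 insists on a matching closing token, which is the narrowing the review asked for.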
nmsn	d1dcc6ab7d	fix(web): use background-color instead of background shorthand (#1601) The shorthand `background` property resets all background-related properties to initial values, including background-repeat (defaulting to repeat). Using `background-color` instead preserves the chevron SVG background-image set by the global select rule.	2026-05-14 11:03:05 +08:00
Prantik Medhi	0c0da7cc23	fix(web): confirm continue-in-cli copy (#1604)	2026-05-14 11:02:36 +08:00


727 commits