open-design

mirror of https://github.com/nexu-io/open-design.git synced 2026-06-01 03:14:35 +07:00

Author	SHA1	Message	Date
Marc Chan	c45c5c9764	fix(ci): align visual selectors and nix hashes (#2471) * fix(ci): align visual selectors and nix hashes * fix(ci): add strict PR visual verification * fix(ci): repair visual-home captures Generated-By: looper 0.8.1 (runner=fixer, agent=opencode)	2026-05-21 10:45:37 +08:00
Chris Tam	7b1cc16988	fix(nix): force http:// scheme on bundled caddy site address (#2485 ) A bare `host:port` site address lets Caddy pick the listener scheme by port heuristic, which fights `auto_https off` and surfaces as TLS errors when the browser hits plain HTTP on a non-standard port. Hardcode the `http://` prefix in both the Home Manager and NixOS Caddyfile templates — the bundled proxy is plaintext-only by design, so users who need TLS run their own front-end with `webFrontend.enable = false`.	2026-05-21 10:44:51 +08:00
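To illustrate the site-address rule in the commit above, here is a minimal TypeScript sketch. The helper name `plaintextSiteAddress` and the port `8080` are hypothetical; the actual fix lives in the Home Manager and NixOS Caddyfile templates, not in TypeScript.

```typescript
// Hypothetical helper for illustration only.
// An explicit http:// prefix tells Caddy to serve plaintext on that address
// instead of choosing the listener scheme from the port.
function plaintextSiteAddress(host: string, port: number): string {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`invalid port: ${port}`);
  }
  // Drop any scheme the caller supplied so the result is never https://.
  const bareHost = host.replace(/^[a-z][a-z0-9+.-]*:\/\//i, "");
  return `http://${bareHost}:${port}`;
}

console.log(plaintextSiteAddress("127.0.0.1", 8080));
```

A bare `127.0.0.1:8080` would leave the scheme to Caddy; the prefixed form makes the plaintext intent explicit.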
lefarcen	80d305858b	feat(diagnostics): add one-click log export from Settings → About (#798 ) * feat(diagnostics): add one-click log export from Settings → About Adds a new "Export diagnostics" entry under the About section that bundles daemon/web/desktop logs, machine info, and recent macOS crash reports into a zip the user can share when reporting issues. - Browser hits a new daemon HTTP endpoint and triggers a download. - Electron uses an IPC bridge with the native save dialog and reveals the saved file in Finder/Explorer; the Help menu also exposes it as a fallback when the daemon is unresponsive. Packaging + redaction lives in a new @open-design/diagnostics package so both surfaces share it. Sensitive JSON keys, URL query secrets, and the current user's home path are redacted before packaging. * build(nix): include packages/diagnostics in daemon build targets The Nix daemon derivation builds workspace siblings in dependency order before compiling apps/daemon. Without @open-design/diagnostics in that list, the daemon TypeScript build fails inside the Nix sandbox with `Cannot find module '@open-design/diagnostics'` because pnpm install only creates the symlink — the dist output that the package.json exports point at isn't produced until each sibling's build script runs. * build(tools-pack): include @open-design/diagnostics in packaged INTERNAL_PACKAGES Without this, packaged win/mac/linux builds fail with `npm error 404` when the post-build `npm install --omit=dev --no-package-lock` step in the assembled app tries to resolve `@open-design/diagnostics@0.2.0` from the public npm registry. The package is workspace-private, so it has to be tarballed via `pnpm pack` and file:-referenced from the assembled package.json like every other internal workspace dep that daemon/desktop depend on. Also wires the package's `pnpm --filter ... 
build` into the pre-pack workspace build step so the dist/ exists before pnpm pack runs, and updates the two test fixtures (`win-app.test.ts`, `workspace-build.test.ts`) that mirror INTERNAL_PACKAGES. The diagnostics package itself is repinned to exact dependency versions already used elsewhere in the workspace (`jszip 3.10.1`, `@types/node 20.19.39`, `esbuild 0.28.0`, `typescript 5.9.3`, `vitest 4.1.6`) so it passes the new `pnpm guard` exact-version rule and produces a minimal lockfile diff vs main (additions only, no resolution-string churn). * fix(diagnostics): include `~` in bearer-token redaction char class RFC 6750 token68 syntax allows `~`, so tokens like `Authorization: Bearer abcd~efgh` were only partially matched by `HTTP_AUTH_SCHEME_RE`. The regex stopped at the first `~`, leaving the tail (`~efgh`) un-redacted in the exported diagnostics zip — a clear leak since this feature explicitly generates support bundles for external sharing. Add `~` to the character class and a regression test. * fix(diagnostics): only collect renderer.log from desktop `buildSidecarLogSources` unconditionally added `logs/${app}/renderer.log` for daemon/web/desktop, but only the desktop runtime writes a renderer log (see apps/desktop/src/main/runtime.ts) — daemon and web are pure Node services with no Electron renderer. Every export therefore produced missing-file placeholders and manifest warnings for the two phantom paths, polluting the bundle. Gate the renderer.log source on APP_KEYS.DESKTOP so the daemon-side collector matches the desktop-side collector in apps/desktop/src/main/ diagnostics.ts:63. * fix(diagnostics): mirror desktop-side renderer.log gate The previous fix only updated the daemon-side `buildSidecarLogSources` in `apps/daemon/src/diagnostics-export.ts`. 
The desktop-side collector at `apps/desktop/src/main/diagnostics.ts` had an identical copy of the same bug that I overlooked: it also unconditionally added `logs/${appKey}/renderer.log` for daemon/web/desktop, producing missing-file placeholders + manifest warnings for the two phantom paths on every desktop-initiated export. Apply the same `appKey === APP_KEYS.DESKTOP` gate here so both export entry points (browser via daemon HTTP, Electron via native save dialog) emit the same clean manifest. * feat(diagnostics): add `od diagnostics export` CLI subcommand AGENTS.md's dual-track capability-exposure contract requires every user-facing feature to ship on both the web UI and the `od` CLI. The diagnostics export was only reachable through Settings → About and the desktop Help menu; this commit closes the loop with an `od diagnostics export [<path>] [--json]` subcommand registered in SUBCOMMAND_MAP. The CLI is a thin shell over the existing GET /api/diagnostics/export endpoint — same zip output, same redaction, same crash-report scope. Defaults to writing `open-design-diagnostics-<timestamp>.zip` in the current directory; `--output <path>` or a positional arg overrides. `--json` prints `{path, sizeBytes}` for shell pipelines. Use cases this unlocks: - A CI script can `od diagnostics export ~/artifacts/bundle.zip` after a failed run. - Bug reporters on headless boxes can grab a bundle without booting the web UI. - `od doctor` follow-ups can collect a full snapshot when a probe fails. * fix(diagnostics): surface non-sidecar launch in manifest warnings `buildSidecarLogSources()` returns `[]` when the daemon has no sidecar runtime context, which is the standard `od` (plain) launch path — `runDaemonCliStartup()` -> `startDaemonRuntime()` does not pass a runtime. 
Settings → About and the new `od diagnostics export` previously reported success but produced a bundle with only the summary JSONs, so operators could not tell "no logs because plain launch" from "no logs because something genuinely broke." - Extend `DiagnosticsContext` with an optional upstream `warnings: string[]` that `buildManifest` merges into the manifest warnings. - Emit STANDALONE_LAUNCH_WARNING from the daemon handler when `options.runtime == null`. The warning names the limitation and points the user at the sidecar entry points that DO capture logs. - Add a regression spec at `apps/daemon/tests/diagnostics-export.test.ts` that drives the handler with `runtime: null` and asserts the warning surfaces in `summary/manifest.json` (and that `files` is empty so a user reading the bundle does not confuse "no log sources" with "missing files").	2026-05-20 09:10:51 +08:00
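The bearer-token redaction fix described in this commit can be sketched as follows. This is an assumed pattern, not the project's actual `HTTP_AUTH_SCHEME_RE`; the names `BEARER_RE` and `redactBearerTokens` are hypothetical. The character class follows the token grammar in RFC 6750, which permits `~`.

```typescript
// Assumed pattern for illustration; the project's real regex may differ.
// RFC 6750 token characters: ALPHA / DIGIT / "-" / "." / "_" / "~" / "+" / "/",
// followed by optional "=" padding.
const BEARER_RE = /\b(Bearer)\s+[A-Za-z0-9\-._~+\/]+=*/gi;

function redactBearerTokens(text: string): string {
  // Keep the scheme name, replace the whole token.
  return text.replace(BEARER_RE, "$1 [REDACTED]");
}

console.log(redactBearerTokens("Authorization: Bearer abcd~efgh"));
// prints "Authorization: Bearer [REDACTED]"
```

If `~` were missing from the class, the match would end at `abcd` and the tail `~efgh` would remain in the exported bundle.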
Marc Chan	60bd58a1f3	fix(nix): prebuild host package for web build (#2280)	2026-05-19 23:59:48 +08:00
PerishFire	bb13eee765	chore: optimize CI and beta release runtime (#2231) * chore(ci): add runtime trace summaries * chore(ci): tighten measured workspace steps * chore(release): tighten beta setup steps * chore(release): slim beta windows smoke * chore(ci): shard daemon tests * chore(ci): harden runtime trace lookup * chore(release): avoid mac pnpm cache in beta * chore(ci): split critical playwright checks * chore(release): publish beta platforms from builders * test(e2e): update beta release workflow expectation * chore(ci): stop gating PRs on nix check * fix(release): keep beta latest complete	2026-05-19 18:06:28 +08:00
PerishFire	bd48c597b0	chore: pin dependency versions and harden CI caches (#2189) * chore: pin dependency versions * ci: enforce pinned dependency specs * ci: fix pnpm executable invocation	2026-05-19 13:58:27 +08:00
lefarcen	14cff69435	fix(nix): refresh pnpmDepsHash after merging main The merge brought new entries into pnpm-lock.yaml (Leonardo.ai provider, custom select primitive, Critique Theater wireup, etc.), so the vendored pnpm store hash changed. CI computed the new hash from the freshly built fixed-output derivation. specified: sha256-mzGiD5K1l5SEFpQ++L4wq775ne+ViG4ruXYZYEdT6zQ= (pre-merge) got: sha256-lROdH5HgKFf3R7DYGbc8n/GrmINwLbfVwC4Xp7SrHN4= (post-merge)	2026-05-15 19:16:01 +08:00
PerishCode	545aed642e	Merge remote-tracking branch 'origin/preview/0.8.0' into preview/v0.8.0 # Conflicts: # nix/package-daemon.nix # scripts/postinstall.mjs	2026-05-14 21:46:05 +08:00
PerishCode	883598f556	Build registry protocol in packaged workspaces	2026-05-14 21:23:45 +08:00
PerishCode	dbfe08eaf3	Update Nix pnpm deps hash	2026-05-14 21:15:27 +08:00
Tom Huang	76defffb93	Garnet hemisphere (#1702 ) * feat(chat-composer): enhance mention handling and input overlay - Introduced a new overlay for inline mentions in the chat composer, improving user experience by visually indicating mentions as users type. - Updated the `ChatComposer` component to manage mention entities and integrate them into the input field, allowing for better context and interaction. - Enhanced the `AssistantMessage` component to support the display of plugin action panels based on the current project context, facilitating easier plugin management. - Refactored related components to ensure consistent handling of project files and mentions across the application. This update significantly improves the chat interaction model, making it more intuitive for users to engage with mentions and plugins. * feat(plugin-management): enhance plugin action panels and UI components - Updated the `AssistantMessage` component to include plugin action panels based on the latest project context, improving user interaction with generated plugins. - Refactored the `PluginsView` to support detailed views for available marketplace entries, allowing users to access more information and actions for each plugin. - Introduced new CSS styles for improved visual representation of plugin-related UI elements, enhancing overall user experience. - Enhanced the `listPlugins` function to include an option for fetching hidden plugins, providing more flexibility in plugin management. This update significantly improves the usability and functionality of the plugin management system, making it easier for users to interact with and manage their plugins. * fix(assistant-message): refine plugin folder candidate selection logic - Updated the `pluginFoldersTouchedThisTurn` function to improve the logic for selecting plugin folder candidates based on touched paths and message content. 
- Introduced a new helper function, `pathMatchesFolderFileBasename`, to enhance the matching criteria for folder candidates. - Added a check for explicit folder matches before falling back to a single candidate, improving accuracy in folder selection. - Modified the `shouldRenderSlotAsText` function in `HomeHero` to include the name parameter, refining the rendering logic for slot text. These changes enhance the functionality and reliability of the assistant message component in managing plugin folder candidates. * feat(plugin-folder-actions): implement agent-routed CLI actions for plugin management - Introduced a new `PluginFolderAgentAction` type to streamline actions related to plugin folders, including install, publish, and contribute. - Updated the `DesignFilesPanel`, `FileWorkspace`, and `AssistantMessage` components to utilize the new agent action handling, improving user interaction with generated plugins. - Refactored the action handling logic to send commands to the agent, enhancing the workflow for managing plugin folders. - Added corresponding tests to ensure the new functionality works as expected and integrates seamlessly with existing components. This update significantly enhances the plugin management experience by routing actions through the agent, allowing for a more cohesive and interactive user experience. * Fix PR 1702 CI blockers * Fix PR 1702 remaining CI checks * Prebuild AGUI adapter after install * Restore plugin project snapshot wiring * feat(marketplace): refactor marketplace URL handling and enhance fetching logic - Introduced new functions to normalize marketplace URLs and manage fetching of marketplace manifests, improving the reliability of marketplace integrations. - Updated the server and plugin logic to utilize the new fetching mechanisms, ensuring consistent handling of marketplace data. - Enhanced tests to cover new URL normalization and fetching scenarios, ensuring robustness in marketplace management. 
This update significantly improves the marketplace experience by streamlining URL handling and enhancing data fetching capabilities. * Fix project auto-send cleanup spec	2026-05-14 21:12:50 +08:00
Nagendhra Madishetti	38a5ab69e6	feat(daemon): Critique Theater Phase 12 (9 Prometheus metrics + 6 log events + OTel span + Grafana dashboard) (#1485 ) * feat(web): pure reducer for Critique Theater states (Phase 7.1) Pure CritiqueState reducer driven by the contracts-level PanelEvent (the same shape both the live SSE stream and the recorded transcript emit), so a single reducer powers both the in-flight panel and the rerun replay. Lifecycle covers run_started → running → (shipped / degraded / interrupted / failed), with panelist_open / dim / must_fix / close / round_end events building per-round CritiquePanelistView entries as they arrive. Defensive behaviour that surfaced while writing the spec tests: - Terminal phases (shipped / degraded / interrupted / failed) are sticky against further lifecycle events for the same run, except for parser_warning which can land late and is recorded in a side channel without changing phase. - A new run_started for a different runId at any time discards the prior state and reboots, so the UI can launch consecutive runs without an explicit reset action. - Events whose runId does not match the active run return the same state reference, so React's useReducer doesn't re-render subscribers on stray traffic. - Round bookkeeping keys by round number rather than "always last", so an out-of-order panelist_dim for round 1 arriving after a round 2 dim does not corrupt the round 2 bucket. Test coverage: 18 cases covering each transition, the runId guard, sticky-terminal behaviour, the out-of-order round invariant, and the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire SSE + replay into the same reducer. 
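The defensive behaviours listed for the Phase 7.1 reducer (sticky terminal phases, reboot on a new runId, same-reference return for stray traffic, late `parser_warning`) can be sketched in a few lines. The event and state shapes here are simplified and hypothetical; the real `PanelEvent` and `CritiqueState` carry more fields and more event types.

```typescript
// Simplified, hypothetical shapes for illustration.
type Phase = "idle" | "running" | "shipped" | "degraded" | "interrupted" | "failed";

type PanelEvent =
  | { type: "run_started"; runId: string }
  | { type: "ship"; runId: string }
  | { type: "failed"; runId: string }
  | { type: "parser_warning"; runId: string; kind: string };

interface CritiqueState {
  phase: Phase;
  runId: string | null;
  warnings: string[];
}

const initialState: CritiqueState = { phase: "idle", runId: null, warnings: [] };
const TERMINAL: ReadonlySet<Phase> = new Set<Phase>(["shipped", "degraded", "interrupted", "failed"]);

function reduce(state: CritiqueState, event: PanelEvent): CritiqueState {
  // A run_started for a different runId discards the prior state and reboots.
  if (event.type === "run_started" && event.runId !== state.runId) {
    return { phase: "running", runId: event.runId, warnings: [] };
  }
  // Stray traffic for another run: same reference, so subscribers do not re-render.
  if (event.runId !== state.runId) return state;
  // Late parser warnings go to a side channel without changing phase.
  if (event.type === "parser_warning") {
    return { ...state, warnings: [...state.warnings, event.kind] };
  }
  // Terminal phases are sticky against further lifecycle events for the same run.
  if (TERMINAL.has(state.phase)) return state;
  switch (event.type) {
    case "ship":
      return { ...state, phase: "shipped" };
    case "failed":
      return { ...state, phase: "failed" };
    default:
      return state;
  }
}

let s = reduce(initialState, { type: "run_started", runId: "run-1" });
s = reduce(s, { type: "ship", runId: "run-1" });
const afterFail = reduce(s, { type: "failed", runId: "run-1" });
console.log(afterFail.phase); // stays "shipped"
```

Returning the identical state object for ignored events is what lets `useReducer` bail out of a re-render.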
* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2) createCritiqueEventsConnection is a pure connection manager that mirrors apps/web/src/providers/project-events.ts: opens an EventSource at /api/projects/:id/events, listens for every name in CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent (stripping the critique. prefix and merging the data payload), and hands it to the caller's onEvent. Reconnect uses exponential backoff (1s → 30s) and resets on `ready`; malformed payloads drop with a dev-mode warning rather than tearing the stream. useCritiqueStream wraps the manager in a useReducer that owns the CritiqueState. enabled=false or a null projectId tears down the connection cleanly; switching projectId closes the old connection and opens a fresh one. The returned dispatch lets local UI synthesise actions (e.g. an Esc keypress firing a synthetic interrupted while a kill request is in flight); production traffic comes from the SSE stream. Test coverage: - sse.test.ts (10 cases, node env): subscription set covers every CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire shape back to PanelEvent; malformed JSON is swallowed and does not stop the stream; exponential backoff schedule and ready-reset semantics are pinned with a setTimeout seam; close() cancels pending reconnects and shuts the live source; no-op fallback when EventSource is unavailable. - useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event, reducer driven by synthetic actions, no connection when disabled or projectId is null, clean close on unmount, projectId change reopens cleanly. * feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3) Fetches the per-run NDJSON transcript (one PanelEvent per line), parses every line via the shared isPanelEvent predicate, and dispatches into the same CritiqueState reducer the live SSE stream uses. 
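The reconnect policy described above (exponential backoff from 1s to 30s, reset on `ready`) might look like the sketch below. The class name and the doubling factor are assumptions; the commit message only states the bounds and the reset rule.

```typescript
// Hypothetical helper; the doubling factor is an assumption, the 1s and 30s bounds come from the text.
const BASE_DELAY_MS = 1000;
const MAX_DELAY_MS = 30000;

class ReconnectBackoff {
  private attempt = 0;

  // Delay before the next reconnect attempt: doubles each time, capped at the maximum.
  nextDelayMs(): number {
    const delay = Math.min(BASE_DELAY_MS * 2 ** this.attempt, MAX_DELAY_MS);
    this.attempt += 1;
    return delay;
  }

  // Called when the stream reports `ready`, so the next failure starts from the base delay.
  reset(): void {
    this.attempt = 0;
  }
}

const backoff = new ReconnectBackoff();
const schedule = [1, 2, 3, 4, 5, 6].map(() => backoff.nextDelayMs());
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 30000]
```

Keeping the schedule in a small object with no timers of its own makes it easy to pin in tests with a `setTimeout` seam, as the commit describes.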
A single reducer means the UI rendering a replay can be identical to the live panel, and a UI mounting both useCritiqueStream and useCritiqueReplay in parallel does not have to reconcile two state shapes. speed knob is `paused \| instant \| live \| { intervalMs: N }`. - instant flushes every event synchronously, useful for opening a finished run already at its terminal state. - intervalMs paces dispatches at a fixed cadence so the reviewer can watch the run unfold. - paused parses the transcript but holds events back until the caller advances speed (consumers can drive a scrubber later). - live is reserved for the future "playback at original cadence" feature, currently treated as instant; replay timestamps are not yet persisted with each event so honest pacing requires a follow-up Phase 7+ task. gunzip seam handles `.ndjson.gz` transcripts via DecompressionStream when present; the production fetch path picks between text and arrayBuffer based on the URL extension. Both seams are injectable so the unit tests don't need to spin up a real network or a real gzip pipeline. Test coverage (8 cases, jsdom env): - Idle status before any URL is provided. - speed=instant flushes the full transcript synchronously to shipped state. - speed={intervalMs:N} paces with the setTimeout seam, reaching done after the last tick. - speed=paused leaves status=playing with no dispatches. - Empty transcript reports done with state still idle. - Fetch rejection surfaces an error status with the message. - Malformed NDJSON lines are skipped; valid events around them still land. - .gz transcripts route through the gunzip seam. Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream + replay), all on one branch ready for review. Phases 8+ (Theater components) consume these from this PR. 
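The transcript parsing rule from Phase 7.3 (one event per NDJSON line, malformed lines skipped) can be sketched as below. `PanelEventLike` and `isPanelEventLike` are hypothetical stand-ins for the shared `isPanelEvent` predicate, which validates more than this.

```typescript
// Hypothetical minimal event shape and predicate for illustration.
interface PanelEventLike {
  type: string;
  runId: string;
}

function isPanelEventLike(value: unknown): value is PanelEventLike {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const type = record["type"];
  const runId = record["runId"];
  return typeof type === "string" && typeof runId === "string" && runId.length > 0;
}

// Parse one event per line; blank or malformed lines are skipped rather than aborting the replay.
function parseTranscript(ndjson: string): PanelEventLike[] {
  const events: PanelEventLike[] = [];
  for (const line of ndjson.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (isPanelEventLike(parsed)) events.push(parsed);
    } catch {
      // Malformed JSON: drop this line and keep going.
    }
  }
  return events;
}

const sample = [
  '{"type":"run_started","runId":"run-1"}',
  "not json",
  '{"type":"ship","runId":"run-1"}',
].join("\n");
console.log(parseTranscript(sample).map((e) => e.type)); // ["run_started", "ship"]
```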
* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review) Two P1 fixes from lefarcen's review on PR #1307: SSE payload override `sseToPanelEvent` previously spread `data` after the channel-derived `type`, so a payload-provided `type` could override the channel and route a `critique.run_started` frame into the reducer as a `ship` action. Reversed the spread so the channel-derived `type` is authoritative, and revalidated the resulting object through the contracts-level `isPanelEvent` predicate before returning. Frames that fail validation (missing runId, empty runId, unknown type) are dropped, so a malformed or compromised SSE frame can no longer dispatch a wrong-shape action into the reducer. Three new sse.test.ts cases pin the regression: hostile `type:'ship'` in the payload still resolves to `run_started`, missing runId is dropped, empty runId is dropped. Replay pause/resume `useCritiqueReplay` had one big effect keyed on `transcriptUrl` only, so flipping `speed` from `paused` to `instant` never re-fired and the held events sat undispatched. Split into a parse effect (depends on URL, fetches and stores events in state) and a pace effect (depends on parsed-events + speed, owns the cursor + timers). The playback cursor lives in a ref that survives pause/resume cycles, so flipping `paused` -> `instant` flushes from the current position rather than restarting (which would double-dispatch `run_started` and reset the reducer). 
Two new useCritiqueReplay.test.tsx cases: - paused-then-instant transitions from `playing` to `done` and reaches the shipped terminal phase - intervalMs paced playback dispatches one event, pauses to drain the next scheduled timer, flips to instant, and confirms the remaining transcript drains exactly once (cursor was preserved) Doc consistency The earlier source comment in useCritiqueReplay.ts claimed `live` "paces by recorded timestamps" while the impl used zero-delay timers and the PR body said it behaves like `instant`. Aligned to reality: `live` currently behaves like `{ intervalMs: 0 }` (events drain on successive microtasks via setTimeoutFn) because transcripts do not yet carry per-event timestamps. Honest timestamp-driven pacing is queued as a Phase 7+ follow-up. Validated: pnpm guard, pnpm --filter @open-design/web typecheck, Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite 96 files / 888 tests. * feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread) * feat(web): Theater PanelistLane component (Phase 8.1) * feat(web): Theater ScoreTicker component (Phase 8.2) * feat(web): Theater RoundDivider component (Phase 8.3) * feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4) * feat(web): Theater TheaterDegraded chip (Phase 8.5) * feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6) * feat(web): Theater TheaterTranscript replay surface (Phase 8.7) * feat(web): Theater TheaterStage top-level container (Phase 8.8) * feat(web): Theater CSS using existing semantic tokens (no hex literals) * feat(web): Theater public exports barrel * fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314) Addresses all 4 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen. State-lifecycle fixes (3 x P2) 1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`). 
Host hooks dispatch it when their gating prop changes so a stale run from a prior project / transcript cannot bleed into the next context. Reset is idempotent on idle (returns the same reference). 2. `useCritiqueStream` dispatches `__reset__` at the top of its connection effect, so a workspace switch from project A (which streamed a critique) to project B clears the reducer before the new EventSource opens. enabled=false also clears. 3. `useCritiqueReplay` dispatches `__reset__` at the top of its parse effect, so transcriptUrl swaps (including swap-to-null after a replay reached `shipped`) lift the reducer back to idle before the new fetch starts. SSE validation (1 x P2) 4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape` check after the cheap `isPanelEvent` predicate. A `critique.ship` frame missing `composite` / `round` / `status` / `artifactRef` is rejected before reaching the reducer, so TheaterCollapsed can no longer crash on `undefined.toFixed(1)`. Every variant's required fields are validated: run_started (protocolVersion, non-empty cast, maxRounds, threshold, scale), panelist_* (round, role, plus variant-specific shape), round_end (round, composite, mustFix, decision in {continue,ship}, reason), ship (round, composite, status, artifactRef.{projectId,artifactId}, summary), degraded (reason, adapter), interrupted (bestRound, composite), failed (cause), parser_warning (kind, position). Reducer correctness (1 x P2) 5. `panelist_open` now materializes the round + an empty panelist view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight the in-progress lane the instant the tag opens. Before this, a stream that emitted only `panelist_open` after `run_started` left `rounds = []` and the UI rendered no current round until a later `panelist_dim` arrived. Polish (3 x P3) 6. Brand role tint swaps from `var(--magenta, var(--accent))` to `var(--purple, var(--accent))`. 
`--purple` is actually defined across the design systems; `--magenta` is not, so Brand was silently falling through to `--accent` and looking identical to Designer. 7. New i18n key `critiqueTheater.interruptedSummary` for the interrupted-collapse copy ("Interrupted at round N, best composite X.X"). Previously the interrupted branch reused `shippedSummary` and the UI read "Shipped at round..." for a run that specifically did not ship. Native value in en + zh-CN; other locales fall back via `...en` spread. 8. `TheaterDegraded` heading id comes from `useId()` instead of a hardcoded `theater-degraded-heading`, so two chips rendered on the same page (chat history with multiple completed runs) keep their aria-labelledby references unambiguous. Tests (15 new cases) - reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data. - sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship. - useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false. - useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped. - TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...". - TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`. 
Validated - pnpm guard clean - pnpm --filter @open-design/web typecheck clean - Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new) - tests/i18n/locales.test.ts 5 of 5 across 18 locales * feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1) * feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2) * fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315) Addresses every blocker from codex, Siri-Ray, and lefarcen. The three state-lifecycle and SSE-validation issues they also flagged inherit fixes from PR #1314's review pass that this branch now sits on top of after rebase. Real daemon kill on Interrupt (P1) - CritiqueTheaterMount now POSTs to /api/projects/:id/critique/:runId/interrupt alongside the optimistic local dispatch. Before this fix, clicking Interrupt only flipped the React state to interrupted while the daemon job kept running. The fetch is best-effort: a 404 (endpoint not wired yet, lands in Phase 15) is swallowed with a dev-mode console.warn so the UI still moves to the collapsed badge. - New fetchInterrupt test seam lets RTL assert on the URL / method and simulate the "daemon not ready yet" path. Two tests pin both: the happy URL proj-42/critique/run-abc/interrupt POSTs, and a rejected fetch still flips the UI. interruptPending reset on new run (P2) - A ref-backed effect compares the current runId against the last one we saw; when it changes, interruptPending is cleared. A user who interrupts run-1 and then triggers run-2 from the same mount now gets a fresh, enabled kill button instead of one stuck in "Interrupting…". Pinned by a new mount test. Escape keybind scope (P2) - InterruptButton now checks the keydown target. Escape inside an input, textarea, select, or contenteditable element is ignored (and any ancestor of those via closest() is treated the same way). Body-level focus still fires the keybind so the Theater area's affordance keeps working. 
Four new tests cover textarea, input, contenteditable, and the body-focus positive case. userFacingName i18n key (P2) - The spec at specs/current/critique-theater.md:6 mandates a single critiqueTheater.userFacingName key so the "Design Jury" label can be renamed without touching code. Phase 8 introduced critiqueTheater.title by mistake; renamed across types.ts, en.ts, zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer TheaterStage.tsx. The locale alignment test stays green. Validated - pnpm guard clean - pnpm --filter @open-design/web typecheck clean - Theater suite: 14 files, 112 tests (was 101 before, +11 new for the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope; the rest were already in #1314's review fix). - tests/i18n/locales.test.ts 5 of 5 across 18 locales. * feat(daemon): adapter-degraded registry with TTL (Phase 10.1) In-memory registry recording adapters that produced malformed or oversize transcripts so the orchestrator can skip them for a TTL window (default 24h) instead of cycling through known-bad providers on every run. Records carry reason (malformed_block \| oversize_block \| missing_artifact), source label, and expiresAt. The test-only clock seam lets the suite advance time deterministically and prove that an expired entry stops counting as degraded without anyone calling clearDegraded. 7/7 vitest cases green. * feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2) Two test-only adapters that read the existing v1 transcript fixtures (happy-3-rounds and malformed-unbalanced) and replay them as either a full string or a 512-byte chunked stream. The chunked form is what the conformance harness uses to prove the parser holds together when the transcript arrives in arbitrary network slices, not as one buffered blob. 
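The adapter-degraded registry from Phase 10.1 can be approximated with a map plus an injectable clock. The class and method names here are hypothetical; only the three reasons, the 24h default TTL, and the clock seam come from the commit message.

```typescript
// Hypothetical API sketch for illustration.
type DegradedReason = "malformed_block" | "oversize_block" | "missing_artifact";

interface DegradedRecord {
  reason: DegradedReason;
  source: string;
  expiresAt: number;
}

const DEFAULT_TTL_MS = 24 * 60 * 60 * 1000;

class AdapterDegradedRegistry {
  private readonly records = new Map<string, DegradedRecord>();
  private readonly now: () => number;
  private readonly ttlMs: number;

  // `now` is the clock seam: tests pass a fake clock instead of waiting for real time.
  constructor(now: () => number = () => Date.now(), ttlMs: number = DEFAULT_TTL_MS) {
    this.now = now;
    this.ttlMs = ttlMs;
  }

  markDegraded(adapterId: string, reason: DegradedReason, source: string): void {
    this.records.set(adapterId, { reason, source, expiresAt: this.now() + this.ttlMs });
  }

  // An expired entry stops counting as degraded without anyone clearing it.
  isDegraded(adapterId: string): boolean {
    const record = this.records.get(adapterId);
    return record !== undefined && this.now() < record.expiresAt;
  }
}

let fakeNow = 0;
const registry = new AdapterDegradedRegistry(() => fakeNow);
registry.markDegraded("adapter-a", "malformed_block", "conformance");
console.log(registry.isDegraded("adapter-a")); // true
fakeNow = DEFAULT_TTL_MS; // advance the fake clock to the expiry instant
console.log(registry.isDegraded("adapter-a")); // false
```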
* feat(daemon): adapter conformance harness (Phase 10.3) runAdapterConformance pulls a transcript through the same parseCritiqueStream pipeline the orchestrator uses and classifies the outcome as shipped, degraded, or failed. On a degraded outcome it forwards the matched reason to the adapter-degraded registry, so a single nightly conformance run is what populates the skip list rather than the orchestrator learning each adapter is broken at request time. 5/5 vitest cases green covering shipped, malformed degraded, oversize degraded, no-ship failure, and the harness-thrown failure path. * test(e2e): Critique Theater Playwright suite (Phase 11) Six tests, one viewport per visual case, deterministic SSE fixtures stubbed via page.route(). Adds the suite to test:ui:extended so the existing extended-UI lane picks it up. Coverage: 1. Happy path: a single mounted theater plays the full fixture (1 run_started, 5 panelists open / dim / must_fix / close, 1 round_end, 1 ship) and ends on the score badge. 2. Interrupt mid-run: the panelist that is open at the time the interrupt button is clicked closes with an interrupted marker and the transcript freezes there. 3. Visual regression at 375x720 mobile. 4. Visual regression at 768x1024 tablet. 5. Visual regression at 1280x800 desktop. 6. A11y role tree: the theater region exposes a labelled landmark, each panelist lane is a group with an accessible name, the score is a status live region. All SSE traffic is stubbed by page.route so the suite runs in CI without a daemon. The toggle is seeded via localStorage by bootAppWithCritiqueEnabled so the gate behaves as if Settings flipped it on. typecheck clean; playwright --list reports 6. * test(web): reducer p99 bench at 10k iterations (Phase 13.1) Locks the documented 2ms budget for the Critique Theater reducer on a representative SSE script (27 actions, one full happy run) behind a regression gate. 
Asserts p99 stays under 4ms (2x the documented budget) so CI runners with a noisy neighbour do not flake while a real regression to 20ms or 200ms still trips. The bench is a vitest case rather than a bare microbenchmark so it runs in the same CI lane as every other web test and does not need a parallel runner. * test(web): critique surface coverage walker (Phase 13.2) Walks the public critique surface (11 SSE event names, 5 panelist roles, 6 lifecycle phases, 9 named i18n keys) and asserts each named symbol appears in both the src corpus and the test corpus. The walker is the gate that catches a rename in one half of the codebase without a matching update in the other half: a future PR that drops 'panelist_must_fix' from the reducer without also removing its test reference fails this suite. 62 assertions, one per symbol per corpus. * docs: Critique Theater user guide (Phase 14.1) Seven sections aimed at end users (not contributors): 1. What is Design Jury 2. How it works (the five panelists, auto-converging rounds, the composite formula) 3. Settings (the M1 toggle and what it does) 4. Reading the score badge 5. Replay surface 6. Troubleshooting (degraded, interrupted, failed) 7. FAQ The composite formula is documented as designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2 because anyone trying to reverse-engineer the score is going to search for those weights and the docs are the place they should land first. * docs(daemon): critique module AGENTS map (Phase 14.2) Daemon-side wayfinder for the apps/daemon/src/critique directory. Tables every file, what owns what invariant, and the 'when you change anything here' guide so a future contributor does not have to reverse-engineer the rollout resolver before adding a new SSE event. * docs(web): Theater module AGENTS map (Phase 14.3) Web-side mirror of the daemon AGENTS map. Same file table, same invariants section, same change-impact guide, sized to the Theater component package. 
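The composite formula quoted in the user-guide commit (Phase 14.1) can be written out as a weighted sum. The weights below are the ones the commit message documents; the function name, the `PanelistRole` type, and the per-role score record are hypothetical names chosen for this sketch.

```typescript
// Hypothetical sketch of the documented composite formula.
type PanelistRole = "designer" | "critic" | "brand" | "a11y" | "copy";

// Weights as stated in the commit message: the designer score carries weight 0.
const COMPOSITE_WEIGHTS: Record<PanelistRole, number> = {
  designer: 0,
  critic: 0.4,
  brand: 0.2,
  a11y: 0.2,
  copy: 0.2,
};

// Weighted sum over the five panelist scores.
function compositeScore(scores: Record<PanelistRole, number>): number {
  let total = 0;
  for (const role of Object.keys(COMPOSITE_WEIGHTS) as PanelistRole[]) {
    total += COMPOSITE_WEIGHTS[role] * scores[role];
  }
  return total;
}
```

With critic 8, brand 6, a11y 7, and copy 9, the composite comes out at roughly 7.6 regardless of the designer score, because the designer weight is 0.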
* feat(daemon): rollout flag resolver (Phase 15.1)

Single decision point every caller consults to know whether the orchestrator should wire the critique pipeline for a given run. Priority:

1. Skill-level policy (required wins, opt-out wins inversely)
2. Per-project override from the Settings toggle
3. OD_CRITIQUE_ENABLED env override
4. Rollout phase default
   M0 dark-launch: false
   M1 settings only: false (toggle is off until the user flips it)
   M2 per-skill: true if skill opted in
   M3 global default: true

The OD_CRITIQUE_ROLLOUT_PHASE parser defaults to M0 on unknown input so a fresh install never surprises a user with the feature on. 10/10 vitest cases green covering every cell of the matrix.

* feat(web): Settings toggle hook for Critique Theater (Phase 15.2)

React hook that reads critiqueTheaterEnabled from the existing open-design:config localStorage blob and stays in sync via:

- the platform storage event (cross-tab)
- an open-design:critique-theater-toggle CustomEvent (same-tab)

The same-tab event is the one that fires when the Settings panel saves in the current window: the toggle and every mounted theater update without a page reload. setCritiqueTheaterEnabled(next) is the imperative setter the Settings panel calls. It preserves the rest of the stored config (mode, apiKey, etc.) and dispatches the same-tab event after the localStorage write. The web hook reflects what the user toggled; the daemon-side isCritiqueEnabled is the final routing authority (project override, env, rollout phase). When they disagree, the daemon wins for backend gating and the web reflects the toggle state. 6/6 vitest cases green covering first read, stored read, same-tab event flip, config preservation, corrupted JSON tolerance, and cross-tab storage event.
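The four-tier priority described for the Phase 15.1 resolver can be sketched as a single function. This is an illustrative reading of the commit message, not the daemon's implementation: the function names, the input record shape, and the literal phase tokens are assumptions. The tier order and the per-phase defaults follow the list above.

```typescript
// Hypothetical sketch of the rollout priority matrix; names are assumed.
type SkillPolicy = "required" | "opt-in" | "opt-out" | null;
type RolloutPhase = "M0" | "M1" | "M2" | "M3";

interface RolloutInputs {
  skillPolicy: SkillPolicy;
  projectOverride: boolean | null; // null means "no opinion"
  envOverride: boolean | null;     // null means the env var is unset
  phase: RolloutPhase;
}

// Unknown or missing input collapses to M0 so a fresh install stays dark.
function parseRolloutPhase(raw: string | undefined): RolloutPhase {
  return raw === "M1" || raw === "M2" || raw === "M3" ? raw : "M0";
}

function isCritiqueEnabledSketch(inputs: RolloutInputs): boolean {
  // Tier 1: skill-level policy. "required" forces on, "opt-out" forces off.
  if (inputs.skillPolicy === "required") return true;
  if (inputs.skillPolicy === "opt-out") return false;
  // Tier 2: per-project override from the Settings toggle.
  if (inputs.projectOverride !== null) return inputs.projectOverride;
  // Tier 3: environment override.
  if (inputs.envOverride !== null) return inputs.envOverride;
  // Tier 4: rollout phase default.
  switch (inputs.phase) {
    case "M2":
      return inputs.skillPolicy === "opt-in";
    case "M3":
      return true;
    default:
      return false; // M0 dark-launch and M1 settings-only
  }
}
```

Keeping the whole decision in one pure function makes every cell of the matrix testable without a running daemon.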
* test(web): Phase 15 toggle hook failure-mode coverage (PR #1320) lefarcen P2 on PR #1320 flagged that the PR body claimed safe behavior for disabled localStorage, non-object JSON, and missing CustomEvent shim, but the suite only covered corrupt JSON plus happy-path storage events. Added four failure-mode tests so the swallowed errors are not silently traded for a throw in a future refactor: 1. Returns false on a stored JSON value that parses to an array (non-object). Catches a regression where the guard treats anything truthy as a config blob. 2. Returns false on a stored JSON value of literal 'null'. typeof null === 'object' in JS, so the guard has to check null explicitly; this test pins that check. 3. Returns false when localStorage.getItem throws (private mode / disabled storage / SecurityError). The hook must swallow and return false so the rest of the app keeps rendering. 4. setCritiqueTheaterEnabled still dispatches the same-tab CustomEvent when localStorage.setItem throws (quota exceeded / disabled storage). The dispatch path is the in-session broadcast that keeps every mounted hook coherent even when persistence is unavailable; verified by mounting two probes and asserting both flip after the setter is called with a throwing setItem. 10/10 vitest cases green (6 existing + 4 new). * fix(web): honor CustomEvent payload in toggle hook listener (PR #1320) Both Siri-Ray (blocking) and lefarcen (P2 new) caught the same real bug in the failure-mode test I added in `affcdd27`: the test asserts the in-session UI flips when localStorage.setItem throws, but the CustomEvent listener was ignoring the event's typed detail and just calling readToggle(). Under a throwing setItem the localStorage value is stale (or absent), so the listener would see the OLD value and the test would fail (or worse, the production claim 'in-session event keeps mounts coherent' was hollow). 
Fixed the hook, not the test: the listener now reads event.detail.enabled when it is a boolean, falling back to readToggle() only for malformed events or for cross-tab storage events (which do not carry a typed payload). The setter already dispatched the detail; the listener just was not consuming it. Test changes: - The existing 'setItem throws' test now asserts the right behavior for the right reason. Updated the inline comment to say the listener reads from detail, not localStorage. - New test 'falls back to readToggle when the CustomEvent carries no usable detail' pins the fallback path: a malformed dispatcher (no detail, or detail.enabled not a boolean) degrades cleanly instead of throwing or being silently ignored. 11 / 11 vitest cases green (10 prior + 1 new fallback). * feat(daemon): route critique spawn-path eligibility through the rollout resolver The wireup edit Phase 10 and Phase 15 carved out: today server.ts gates the critique pipeline on critiqueCfg.enabled, which is just the OD_CRITIQUE_ENABLED env var. After this commit it gates on isCritiqueEnabled(...) from the Phase 15 resolver, so the full priority matrix is live: 1. Per-skill od.critique.policy veto (opt-out / required) 2. Per-project override (M1 Settings toggle, written through the existing Phase 6 settings endpoint) 3. OD_CRITIQUE_ENABLED env override (power-user lane / CI fixtures) 4. OD_CRITIQUE_ROLLOUT_PHASE default M0 dark-launch false M1 settings only false M2 per-skill only when skillPolicy === 'opt-in' M3 global default true Default behaviour on a fresh install is unchanged: the resolver returns false at M0 without an env override or a project override, so prod traffic falls through to the legacy single-pass path exactly the way it did before. Inputs threaded today: phase from OD_CRITIQUE_ROLLOUT_PHASE, envOverride from OD_CRITIQUE_ENABLED. 
skillPolicy and projectOverride are passed as null for the v1 cutover; the daemon-side handler that round-trips critiqueTheaterEnabled on the project settings row and the od.critique.policy frontmatter resolver land as the next two commits in this branch. The three call sites that used critiqueCfg.enabled (the brand-thread guard, the skill-thread guard, the top-line critiqueShouldRun compound) now read from a single locally-scoped critiqueEnabledForRun boolean, so the eligibility check is computed exactly once per spawn and the prompt composer + orchestrator stay in lockstep the way the existing comment already promised. Tests still green: daemon vitest 22 / 22 across rollout + conformance + adapter-degraded. Daemon typecheck clean. * feat(web): mount CritiqueTheaterMount in ProjectView The web counterpart of the daemon wireup. ProjectView now renders <CritiqueTheaterMount projectId={project.id} enabled={...} /> as a sibling of <AppChromeHeader> inside the top-level <div className="app">. The mount is the drop-in from the Phase 9 stack: it owns the SSE subscription, the kill-request handshake, and the phase-aware swap from the live <TheaterStage> to the collapsed badge once a run settles. The mount returns null until the daemon emits a critique.run_started for the active project, so the visual surface is byte-for-byte unchanged for users who have not opted in. Enabled wiring: useCritiqueTheaterEnabled() reads the M1 Settings toggle from the existing open-design:config localStorage blob and stays in sync with both the platform storage event (cross-tab) and the same-tab open-design:critique-theater-toggle CustomEvent the Phase 15 setter dispatches. The hook honors the event payload directly so a private-mode browser that cannot persist the toggle still updates the in-session UI correctly. The daemon-side gate (isCritiqueEnabled in apps/daemon/src/server.ts) remains the authority for whether a run is actually wired through the critique pipeline. 
This hook only governs whether the web layer renders the resulting SSE stream when the daemon emits one. The two-layer gate is intentional: an integrator embedding the Theater in a custom UI can flip the web visibility independent of the daemon's routing decision, and a daemon-side env override flips backend gating without touching the web's localStorage. Tests still green: web Theater suite 181 / 181 across 16 files. Web typecheck clean. * feat(daemon): resolve od.critique.policy frontmatter at the spawn site The next step in the wireup branch's ladder: replace the placeholder `skillPolicy: null` with the actual value parsed from the active skill's SKILL.md frontmatter. Three small edits, one new field on a public type: 1. SkillInfo gains a `critiquePolicy: SkillCritiquePolicy` field carrying the parsed `od.critique.policy` token (required / opt-in / opt-out / null). The field is null when the skill has no opinion, which lets the lower-priority resolver tiers (projectOverride, envOverride, phase default) decide. 2. listSkills() populates the new field via a small `normalizeCritiquePolicy` helper that tolerates the YAML scalar's casing and trims whitespace. Unknown tokens collapse to null so a typo in SKILL.md cannot accidentally force the panel on or off; it just falls through. Derived example cards inherit the parent's policy. 3. server.ts captures `skill.critiquePolicy` into a hoisted `skillCritiquePolicy` variable inside the existing skill-load block, then threads it into the isCritiqueEnabled call as the skillPolicy input. The hoisting keeps the variable in scope at the resolver call site without restructuring the spawn handler. After this commit, the priority matrix the rollout resolver was designed for is live for its top tier. The previous commit wired env + phase; this one wires skill. The projectOverride input remains null pending the next commit that extends the Phase 6 settings endpoint. Daemon vitest: 10 / 10 rollout cases pass against the new wiring. 
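The normalization step described in item 2 above (tolerate casing, trim whitespace, collapse unknown tokens to null) might look like the following. The function and type names come from the commit message; the body and the `unknown` input type are assumptions made for this sketch.

```typescript
// Hypothetical body for the helper named in the commit message.
type SkillCritiquePolicy = "required" | "opt-in" | "opt-out" | null;

function normalizeCritiquePolicy(raw: unknown): SkillCritiquePolicy {
  // Frontmatter values are loosely typed; anything that is not a string has no opinion.
  if (typeof raw !== "string") return null;
  const token = raw.trim().toLowerCase();
  if (token === "required" || token === "opt-in" || token === "opt-out") {
    return token;
  }
  // A typo collapses to null so it falls through to the lower resolver tiers
  // instead of forcing the panel on or off.
  return null;
}
```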
Daemon typecheck: clean. * feat(daemon): feed projectOverride into the rollout resolver from project metadata Replaces the placeholder `projectOverride: null` in the spawn handler with the actual value the Settings panel writes onto the project's metadata blob: `critiqueTheaterEnabled?: boolean`. The read is defensive at the boundary: the metadata object is typed loosely (it round-trips through SQLite as a free-form JSON blob), so the spawn handler narrows to `boolean` and falls through to `null` for any other shape. A missing key, a malformed value, or a project that has never visited Settings collapses to `null`, which is exactly the resolver's "no opinion, fall through to env / phase" signal. The `critique` frontmatter slot also gets typed on the SkillFrontmatter shape so the `od.critique.policy` chain the previous commit introduced no longer needs a bracket-access cast. Same pattern as the existing `craft`, `preview`, and `design_system` nested-record slots. After this commit, every tier of the rollout resolver's priority matrix is wired: 1. skillPolicy (from SKILL.md od.critique.policy) 2. projectOverride (from project metadata critiqueTheaterEnabled) 3. envOverride (from OD_CRITIQUE_ENABLED) 4. rollout phase (from OD_CRITIQUE_ROLLOUT_PHASE) The write path for projectOverride still flows through the existing project-update handler the Settings panel already uses to persist project metadata; no new endpoint is needed. The Settings UI button that calls setCritiqueTheaterEnabled and posts the new field is the next commit on this branch. Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases still green against the new wiring. * fix(daemon): forward critique events to project sinks + align composer gate (PR #1338) Two codex review items addressed in one commit since they share the same root cause (resolver-enabled run hits a transport / prompt contract that was still env-gated): P1 (transport mismatch). 
The daemon emits critique.* SSE frames through critiqueBus -> design.runs.emit, which fans out on /api/runs/:runId/events. The web CritiqueTheaterMount subscribes to /api/projects/:projectId/events (it's project-scoped, not run-scoped, because the mount lives at the project workspace and follows the user across runs). Result: in production the mount never sees a real frame and the e2e tests' stubbed routes hide the mismatch. Fixed by extending critiqueBus.emit to fan out to BOTH sinks: the existing runs.emit transport, AND the per-project event-sinks map. The project-events route emits via sse.send(payload.type, payload), so we pack the SSE channel name onto payload.type and let the sink push the right channel. The web sseToPanelEvent overwrites type from the channel name on the way back into a PanelEvent, so the round-trip stays correct.

P2 (prompt gate misalignment). composeSystemPrompt reads cfg.enabled to decide whether to append the panel addendum, but critiqueCfg.enabled is loaded from OD_CRITIQUE_ENABLED only. A run the resolver enabled via phase / project / skill (env unset) would have critiqueShouldRun = true while critiqueCfg.enabled remained false, dropping the panel prompt while still routing through runOrchestrator -> parser waits for tags that never arrive -> run degrades. Fixed by passing a derived config { ...critiqueCfg, enabled: true } to the composer when critiqueShouldRun is true. The composer's own gate now agrees with the resolver decision on every input the spec defines.

Daemon typecheck: clean. Daemon vitest: 10 / 10 rollout cases still green against the new wiring.

* fix: address PerishCode P1 + P2 follow-ups on PR #1338

Two follow-up items PerishCode flagged on the activation PR. Non-blocking but both are real:

1. Phase 11 e2e suite was wired into test:ui:extended but lands the user on '/' (home route) where ProjectView (and therefore CritiqueTheaterMount) is never rendered.
With the suite as written, every assertion would time out the first time the lane runs in CI, contradicting the PR body's claim that the suite stays parked behind test.describe.fixme. The state diverged from my earlier Phase 11 work because the merge from main on commit `4ab719c6` brought in #1307's squash-merged version of the e2e file (the pre-fixme shape). Re-applied test.describe.fixme to the describe block plus removed ui/critique-theater.test.ts from the test:ui:extended script in e2e/package.json. Added a file-header docblock explaining what the follow-up commit needs to do: replace goto('/') with /projects/:id navigation similar to app-design-files.test.ts, split the SSE fixture into a live prefix and terminal suffix (Codex P2 on PR #1320), and commit the first PNG baselines. 2. bestRoundOf in CritiqueTheaterMount returned the LAST round with a numeric composite, not the round with the HIGHEST composite, while bestCompositeOf correctly returned the max. A run that closed round 1 at 8.5 and round 2 at 6.0 would dispatch interrupted { bestRound: 2, composite: 8.5 } on a user-clicked interrupt. Folded the two helpers into a single bestRoundAndComposite that walks state.rounds once and returns the matching pair so the two values cannot drift. The onInterrupt callback now destructures from one helper instead of two independent reads. Falls back to (state.activeRound, 0) when no round has closed with a composite yet. Web typecheck: clean. CritiqueTheaterMount.test.tsx: 7 / 7 cases still green against the new helper. * fix: wire M1 project override end-to-end + correct deferred-surface doc claims (PR #1338) Three lefarcen P2s on the latest review pass, all real: 1. M1 project override was half-wired: the daemon read metadata.critiqueTheaterEnabled but the web setter only wrote localStorage. 
A user opt-in would render the Theater on the web (localStorage was set) while the daemon resolved projectOverride=null and skipped critique unless env / phase already permitted. Two halves talking past each other. Extended setCritiqueTheaterEnabled to accept an optional { projectId, fetchProjectSettings } options bag. When a projectId is supplied, the setter ALSO sends a PATCH /api/projects/:id with { metadata: { critiqueTheaterEnabled } } so the daemon's spawn-time resolver picks the same value up on the next generation. The existing project-routes endpoint already accepts arbitrary metadata patches, so no new endpoint is needed. The local write + the CustomEvent dispatch still fire before the PATCH, so a network failure does not unwind the in-session UI flip. Three new vitest cases pin the new path: PATCHes when projectId is provided, skips when it is not, swallows a rejected PATCH so the in-session UI still flips. 2. Rollout docs (docs/critique-theater.md section 3) claimed the Settings toggle persists into the daemon settings store, but the previous implementation only had a localStorage reader / writer plus a daemon read of project metadata, with no round-trip. Rewrote the section to lead with the four-tier resolver (skill policy / project override / env / phase), document that the setter now round-trips via the existing PATCH endpoint when given a projectId, and call out the Settings panel UI control as a deliberate follow-up. 3. Troubleshooting table pointed users at /api/metrics/critique (Phase 12, deferred) and 'od adapters clear-degraded <id>' (CLI wrapper that does not exist). Replaced the metrics reference with the local conformance harness command (pnpm --filter @open-design/daemon vitest run tests/critique-conformance.test.ts) that ships today, with a note that the Phase 12 dashboard surfaces this status as a series once that PR lands. 
Replaced the CLI command with the programmatic clearDegraded() helper that exists today and flagged the CLI wrapper as a planned follow-up.

Web typecheck: clean. Toggle hook tests: 14 / 14 green (11 existing + 3 new for the round-trip path).

* test(web): multi-round interrupt regression for bestRoundAndComposite (PR #1338)

lefarcen P3 follow-up to the previous bestRoundAndComposite fix: the existing CritiqueTheaterMount.test.tsx interrupt cases only exercised a single-round state, so a future refactor back to two independent helpers wouldn't be caught by the test suite even though it'd reintroduce the round / composite drift bug. Added a regression case that:

1. Drives the reducer through two complete rounds with the full 5-role cast closing at distinct composites: round 1 at 8.5, round 2 at 6.0 (the high-composite round is NOT the most recent one).
2. Clicks Interrupt + waits for the daemon ack via the test seam fetcher returning 204.
3. Asserts the collapsed badge displays "round 1" (the correct best-composite round), and queryByText for "round 2 ... 8.5" returns null (the buggy pairing would have produced that string).

The bestRoundAndComposite helper walks state.rounds in one pass and returns the matching pair, so the round number and the composite cannot drift apart. This test locks the fix in: a refactor that splits the helpers back into independent walks will be caught here. 8 / 8 vitest cases green on the file.

* fix(web): read-merge-write the project metadata in setCritiqueTheaterEnabled (PerishCode P2 on PR #1338)

The previous round-trip sent { metadata: { critiqueTheaterEnabled: next } } as the entire PATCH body. The daemon's project-routes handler only re-stamps three immutable fields (baseDir, importedFrom, fromTrustedPicker) before calling updateProject(db, id, patch), which then does a shallow { ...existing, ...patch } in apps/daemon/src/db.ts.
So patch.metadata replaces the row's metadata wholesale, dropping kind, templateId, linkedDirs, and every other field the rest of the app reads. No in-tree caller passes projectId today (only vitest cases), so the bug had not surfaced yet. But the surface is documented in docs/critique-theater.md section 3 and the function's own JSDoc as the M1 round-trip path, so it would have shipped as a latent footgun for the next integrator: a Settings UI follow-up, or any third party that wires the setter into a project-aware surface. Fix: read-merge-write rather than a bare patch. - GET /api/projects/:id to read the row's current metadata. - Spread that metadata into the PATCH body and overlay critiqueTheaterEnabled: next on top, mirroring the partial-metadata pattern already used in ChatComposer.tsx for linkedDirs. - PATCH the merged object. Failure handling: - GET fails: skip the PATCH entirely. We cannot construct a safe merged body without the current state, and a bare patch would wipe other metadata. The in-session CustomEvent fired earlier in the setter still keeps every mounted hook consistent; the next save retries the round-trip. - PATCH fails: log in dev. The in-session UI is already correct via the CustomEvent. Tests (TDD, red-first): - 'GETs the project then PATCHes with merged metadata when a projectId is supplied': stubs a GET that returns { kind: 'template', templateId: 'modern-blog', linkedDirs: [...] } and asserts the PATCH body equals the merge plus the toggle. - 'PATCHes with just the toggle when the project has no prior metadata': stubs a GET that returns no metadata block. - 'skips the PATCH (does not stomp metadata) when the prefetch GET fails': stubs a rejecting GET and asserts only the GET fires. - 'swallows a rejected PATCH after a successful prefetch': stubs a successful GET and a rejecting PATCH; asserts the in-session UI still flips via the CustomEvent. 
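The read-merge-write flow and its two failure modes can be sketched like this. The `ProjectApi` interface, the `persistToggle` name, and the string outcomes are hypothetical stand-ins for whatever fetch seam the real setter uses; only the GET, merge, PATCH order and the skip-on-GET-failure rule come from the commit message.

```typescript
// Hypothetical sketch of read-merge-write over an injected API seam.
type Metadata = Record<string, unknown>;

interface ProjectApi {
  getProject(id: string): Promise<{ metadata?: Metadata }>;
  patchProject(id: string, body: { metadata: Metadata }): Promise<void>;
}

type PersistOutcome = "patched" | "skipped" | "patch_failed";

async function persistToggle(
  api: ProjectApi,
  projectId: string,
  next: boolean,
): Promise<PersistOutcome> {
  let current: Metadata;
  try {
    // Read: fetch the row's current metadata first.
    const project = await api.getProject(projectId);
    current = project.metadata ?? {};
  } catch {
    // GET failed: a bare patch would wipe other metadata, so skip the PATCH.
    return "skipped";
  }
  try {
    // Merge + write: overlay the toggle on top of the existing fields.
    await api.patchProject(projectId, {
      metadata: { ...current, critiqueTheaterEnabled: next },
    });
    return "patched";
  } catch {
    // PATCH failed: swallow so the in-session UI flip is not unwound.
    return "patch_failed";
  }
}
```

Returning an outcome instead of throwing mirrors the stated intent that persistence failures never unwind the in-session state.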
Doc updated on the setter's JSDoc to describe the new three-step flow (localStorage, CustomEvent, read-merge-write PATCH) and the two failure modes. Verified: - pnpm --filter @open-design/web typecheck clean. - pnpm --filter @open-design/web test: 111 files / 1055 tests green (was 1052, +3 from the new merge-flow cases). * fix(web): restore wait-for-daemon-ack pattern on Theater interrupt Same regression as flagged on PR #1316 post-main-merge: the optimistic local dispatch fired before the POST resolved, so a daemon 404 / 409 still terminalized the UI and the real SSE terminal event got ignored by the sticky interrupted phase. Snapshot runId / bestRound / composite at click time, dispatch interrupted only on res.ok, clear interruptPending on rejection or non-2xx so the user can retry. Tests cover rejection + 404 leaving the run on the live stage; the 204 path waits for the ack. * feat(daemon): Critique Theater Phase 12 observability foundations Lands the metrics registry, the structured logger, the /api/metrics route, and the adapter-degraded bump that wires up the first data point. The orchestrator-side bumps for runs / rounds / composite / must-fix / interrupted / parser_errors / protocol_version land in a follow-up commit on this branch (kept separate so the wiring diff reads cleanly against the registry shape). Surfaces added: - apps/daemon/src/metrics/index.ts: 9 Prometheus series under the open_design_critique_* namespace with the histogram buckets the spec calls out (round_duration_ms at 100 / 250 / 500 / 1000 / 2500 / 5000 / 10000 / 30000 / 60000 ms; composite_score at 0-10 integer steps). - apps/daemon/src/logging/critique.ts: 6 typed events, one JSON line per call on stdout, namespaced critique. Matches the JSON-per-line convention cli.ts already uses; no new logger framework. - apps/daemon/src/server.ts: GET /api/metrics route. Honors OD_METRICS_ENDPOINT=disabled to opt out for air-gapped installs. 
- apps/daemon/src/critique/adapter-degraded.ts: markDegraded now bumps degraded_total so the adapter-health dashboard panel reflects every TTL refresh and every fresh mark. Deps: prom-client ^15.1.0, @opentelemetry/api ^1.9.0 added to apps/daemon/package.json. Both are zero-config no-ops without an exporter wired; daemon bundle size impact is ~150 KB uncompressed. The @opentelemetry/api dep is in place ahead of the OTel-spans follow-up commit; it adds no behavior on this commit. Tests: - tests/metrics/critique.test.ts (3 cases): registry shape + exposition text + reset-between-tests - tests/logging/critique.test.ts (4 cases): event shape + ordering + newline framing + namespace stamping Verification (Windows-local): - pnpm --filter @open-design/daemon typecheck: clean - New metrics + logging suites: 7 / 7 green - Existing adapter-degraded + conformance + rollout suites: 22 / 22 green; the bump is non-breaking * feat(daemon): wire Critique Theater metrics + structured logs from the orchestrator Lights up the bump sites the Phase 12 foundations PR registered the series for. Every panel event the parser surfaces now reaches the matching Prometheus counter / histogram and the matching JSON log line on stdout. Switch-loop bumps + logs: - run_started: log run_started, set protocol_version gauge to the observed protocol version (small-integer cardinality). - panelist_open: record the first-open wall-clock per round so round_end can compute round_duration_ms; subsequent opens in the same round leave the start time untouched. - panelist_must_fix: bump must_fix_total with the panelist role. The wire event does not yet carry a dim name, so the label is 'unspecified' for now; a future parser revision can drop in the real dim without a metric rename. - round_end: bump rounds_total, observe composite_score, observe round_duration_ms (current ms minus the tracked start), log round_closed with the composite / mustFix / decision triple. 
- parser_warning (parser-yielded): bump parser_errors_total with the kind label, log parser_recover with kind + position.

Orchestrator-side parser warnings (composite_mismatch and duplicate_ship from the daemon-authoritative scoring checks) go through a new emitParserWarning helper so the bus emit, the collectedEvents push, the metric bump, and the log line stay in lockstep. Three inline emission sites collapse to one-line helper calls.

After the try/catch, a single terminal-status switch bumps runs_total{status, adapter, skill} once per run, with branch-specific log + counter:

- shipped / below_threshold: log run_shipped
- interrupted: bump interrupted_total, log run_failed{cause: interrupted}
- timed_out: log run_failed{cause: timed_out}
- failed: log run_failed{cause: orchestrator_internal}
- degraded: log degraded{reason: orchestrator_classified}

OrchestratorParams gains optional skill: string for the label; defaults to 'unknown' so spawn sites that have not yet threaded it keep working without a metric shape change.

Tests:

- The new metrics + logging suites (7 / 7) verify registry shape and event framing; orchestrator-side metric integration is exercised through the existing critique-conformance and critique-adapter-degraded suites (22 / 22 still green).
- Logger test reassigns process.stdout.write directly instead of vi.spyOn so the Node overloaded write signature does not collide with MockInstance<unknown>.

* feat(observability): Grafana dashboard JSON for Critique Theater

Three default rows mapping to the metrics this branch wires up:

1. Fleet quality: composite score p50 / p90 / p99 line graph by adapter, plus a heatmap of the composite distribution. The line graph answers 'are my agents getting better over time'; the heatmap answers 'are the bad runs clustered around one adapter or smeared across the fleet'.
2. Adapter health: stacked bar charts for degraded marks (by adapter / reason) and parser errors (by adapter / kind) over a 5-minute window.
The two queries together let an operator see 'is this adapter degraded because of malformed wire output or because of oversize blocks' without flipping panels. 3. Brief throughput: runs-per-hour by terminal status, an average rounds-per-run stat per adapter, and a round-duration ms p50 / p90 / p99 line. Throughput numbers fall straight out of the runs_total / rounds_total counters; the duration histogram is the same one the runs feed. The dashboard uses a templated $datasource var (defaults to 'prometheus') so an operator with multiple Prometheus instances can switch without editing JSON. Schema version 39 (Grafana 11). Operators import via: pnpm dlx @grafana/cli dashboard import tools/dev/dashboards/critique.json or paste into a provisioned dashboards directory. The file is checked into the repo as a starting artifact; alert rules and SLO panels ship after the first 1000 runs inform the right thresholds. JSON validates with node -e 'JSON.parse(...)' (sanity checked locally). * feat(daemon): OpenTelemetry outer span around the critique run Wraps each runOrchestrator call in a 'critique.run' span via the existing @opentelemetry/api dep added in the Phase 12 foundations commit. Attributes set on the span: - critique.run_id, critique.adapter, critique.skill at start - critique.final_status, critique.final_composite on terminal resolution - span status flipped to ERROR for failed / timed_out runs so a Tempo / Honeycomb / Jaeger filter on traces.status=error surfaces the right slice without joining back to Prometheus No exporter is wired by default; @opentelemetry/api is the API package and intentionally splits from @opentelemetry/sdk-, so the span is zero-overhead until an operator attaches an SDK through their runtime config. 
Inner per-round / parse_chunk / scoreboard_eval / persist_round / ship.persist spans defined in the Phase 12 plan are a follow-up: the outer span alone gives the trace a duration + final status + adapter/skill labels, which is the 80% value for dashboards that correlate runs across services. Adding child spans inside the existing 600-line orchestrator without restructuring is a separate careful change. Verification: - pnpm --filter @open-design/daemon typecheck: clean - 29 / 29 critique + metrics + logging tests still green fix(nix): bump pnpmDepsHash for prom-client + @opentelemetry/api lockfile bump nix-check failed on PR #1485 with hash mismatch in open-design-daemon-pnpm-deps and open-design-web-pnpm-deps after the Phase 12 foundations commit (`2b8b7445`) added prom-client and @opentelemetry/api to apps/daemon/package.json and refreshed pnpm-lock.yaml. CI reported the new sha: specified: HFLm+8hv3o5x3Xem4MXNsNclIgiVRc70+EBafL0rVn8= got: 7R1sQC38gOT0gsZ2oNOviCZ486cbbGJGJCis6WI8z9s= Both nix files pin the same workspace lockfile, so both flip in lockstep. No other Nix surface changes required. * fix(daemon): four Phase 12 review findings (Codex P2 x2 + Siri-Ray P2 + lefarcen P2) 1. Siri-Ray P2 in orchestrator.ts (round metric / log used untrusted agent values). The new observability path now records rs.composite and rs.mustFix (daemon-authoritative) instead of event.composite and event.mustFix when rs exists, and skips the bumps + log entirely when rs is missing (a degenerate round_end without any matching panelist_open). The dashboard p50 / p90 / p99 now agrees with persistence and ship decisions; an adapter reporting <ROUND_END composite='10'> while the daemon computed 6 logs 6 and still emits the composite_mismatch parser warning the prior block was already producing. 2. Codex P2 in server.ts (skill label always 'unknown'). 
The spawn path called runOrchestrator without passing the resolved skill id, so every live run bumped open_design_critique_{skill='unknown'} and the per-skill dashboard breakdown was always empty. Threaded effectiveSkillId (already computed at the same handler scope as the project skill fallback) through skill: . . . so the metric reflects the real skill when one is assigned, and the orchestrator default of 'unknown' only fires for runs that genuinely have none. 3. Codex P2 in conformance.ts (protocol-version mismatch let through). An adapter that emitted <CRITIQUE_RUN version='2'> followed by a valid SHIP classified as shipped because the harness only watched for terminal events. Added a guard inside the parse loop: if a run_started carries protocolVersion !== CRITIQUE_PROTOCOL_VERSION, mark the adapter degraded with reason 'protocol_version_mismatch' (already in DEGRADED_REASONS) and return early. ConformanceOutcome union widened to accept the new reason. 4. lefarcen P2 in tools/dev/dashboards/critique.json (runs-per-hour panel under-reported by 3600x). 'rate(...[1h])' returns per-second. Multiplied by 3600 so the panel title and unit match the actual value rendered. Verification: - pnpm --filter @open-design/daemon typecheck: clean - New metrics + logging suites (7), existing adapter-degraded (7), conformance (5), rollout (10): 29 / 29 green - Grafana JSON re-parses with node -e 'JSON.parse(...)' fix(nix): set pnpmDepsHash to fakeHash so CI surfaces the real hash for the regenerated lockfile (lefarcen P1 on PR #1485) * fix(nix): pin pnpmDepsHash to sha256-NtXbiRU0YZ4EVJVNC6N3sR1S0ozA3BvCwgXI0L0OMH4= from CI nix-check output --------- Co-authored-by: Nagendhra <nagendhra405@gmail.com>	2026-05-13 22:11:27 +08:00
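The protocol-version guard from finding 3 above could look roughly like the sketch below. It is an illustration under assumptions, not the harness in conformance.ts: the event shapes, the `classify` function, the value assigned to `CRITIQUE_PROTOCOL_VERSION`, and the `no_terminal_event` reason are hypothetical. Only `run_started`, `protocolVersion`, and `protocol_version_mismatch` come from the commit message.

```typescript
// Assumed value for illustration; the real constant lives in the daemon.
const CRITIQUE_PROTOCOL_VERSION = 1;

// Hypothetical, heavily reduced event shapes.
type CritiqueEvent =
  | { type: "run_started"; protocolVersion: number }
  | { type: "ship" };

type ConformanceOutcome =
  | { kind: "shipped" }
  | { kind: "degraded"; reason: "protocol_version_mismatch" | "no_terminal_event" };

function classify(events: CritiqueEvent[]): ConformanceOutcome {
  for (const event of events) {
    // Guard inside the parse loop: a mismatched version degrades the
    // adapter and returns before any later SHIP can be seen.
    if (
      event.type === "run_started" &&
      event.protocolVersion !== CRITIQUE_PROTOCOL_VERSION
    ) {
      return { kind: "degraded", reason: "protocol_version_mismatch" };
    }
    if (event.type === "ship") {
      return { kind: "shipped" };
    }
  }
  // Hypothetical fallback for a stream that never reaches a terminal event.
  return { kind: "degraded", reason: "no_terminal_event" };
}
```

The early return is the point of the fix: a harness that only watched for terminal events would classify the version-2 stream as shipped.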
lefarcen	a1859b7f40	fix(nix): update pnpmDepsHash for merged lockfile The merged pnpm-lock.yaml (release/v0.7.0 contents on top of main) has a different hash than either parent. Adopt the value Nix computed in CI.	2026-05-13 18:26:30 +08:00
nettee	f621dbbfea	feat(web): Add Tailwind foundation (#1388)	2026-05-12 21:48:16 +08:00
Chris Tam	c61ba320fd	feat(nix): Add official flake with home-manager and NixOS support (#402) * nix: add official flake with home-manager and nixos modules * Pin pnpm version * Format README.md * Populate PATH files to discover installed CLIs * Revert "Populate PATH files to discover installed CLIs" This reverts commit 18d88781a88b8781913cf5a8b680dfb38eabf7e4. * Fix missing sqlite issue * Fix system issue * Reapply "Populate PATH files to discover installed CLIs" This reverts commit `d02ea994e6`. * Handle different ports for web frontend * Provide documentation for getting pnpm hash * Enable nix flake checks for code changes * Set `OD_WEB_PORT` on daemon when declared * fix: Fix environmentFile for macOS targets * chore: Ignore nix and direnv related files * fix: Read version directly from `package.json` * feat: Make nix shell entry prettier * chore: Update pnpm hashes * chore: Bump `pnpm` hashes * docs: Add blurb about dev shell in `README.md` * Address review comments * Add support for `OD_WEB_ORIGINS` * Fix `isLocalSameOrigin` * Update pnpm checksums * docs: Update documentation on host origins * Move allowedOrigins mapping out of the webFrontend.enable guard * fix: Bump pnpm hashes * Remove changes to `daemon` with `main` changes `main` merged a feature that addressed our need for allowed origins. Since this feature branch no longer needs it, remove any remaining changes in `daemon` code so that this is a pure Nix change. * Update documentation around `OD_DAEMON_URL` * Rewrite option docs to match same-origin proxy contract The port, webFrontend, and webFrontend.port option descriptions still described OD_DAEMON_URL as the runtime contract for the SPA, but the SPA issues relative /api/*, /artifacts/*, /frames/* requests and there is no runtime daemon-URL injection. Rewrite the three blocks to describe what the caddy / custom proxy must actually do. 
* Document daemon-side requirements for custom-server proxy paths The bring-your-own-server path in section (3) and the same-origin contract in section (4) understated what the daemon needs: any proxy whose origin differs from the daemon's bind (including loopback split-port like 127.0.0.1:8080 while the daemon stays on :7457) is 403'd by the daemon's same-origin gate until told about that origin. Add a callout under section (3)'s table, expand section (4) with a decision table covering same-port, loopback split-port (OD_WEB_PORT or webFrontend.allowedOrigins), and non-loopback (webFrontend.allowedOrigins) cases, and rewrite the webFrontend.allowedOrigins option description to enumerate the cases where it's required and surface OD_WEB_PORT as an alternative for the loopback split-port case. --------- Co-authored-by: lefarcen <935902669@qq.com>	2026-05-09 23:50:16 +08:00
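The three cases in the decision table described above can be sketched as a small lookup. This is one reading of the commit message, not code from the flake or the daemon: `classifyProxyOrigin`, `optionsThatAllow`, and the loopback host list are hypothetical, and the sketch assumes the daemon itself is bound to a loopback address on `daemonPort`.

```typescript
type OriginCase = "same-port" | "loopback-split-port" | "non-loopback";

// Assumed set of hosts treated as loopback for this illustration.
const LOOPBACK_HOSTS = ["127.0.0.1", "localhost", "::1"];

// Classifies where the proxy is reachable relative to the daemon's bind.
function classifyProxyOrigin(
  host: string,
  port: number,
  daemonPort: number,
): OriginCase {
  if (LOOPBACK_HOSTS.indexOf(host) === -1) {
    return "non-loopback";
  }
  return port === daemonPort ? "same-port" : "loopback-split-port";
}

// Which settings tell the daemon's same-origin gate about the proxy origin.
// An empty list means nothing extra is needed.
function optionsThatAllow(originCase: OriginCase): string[] {
  switch (originCase) {
    case "same-port":
      return [];
    case "loopback-split-port":
      return ["OD_WEB_PORT", "webFrontend.allowedOrigins"];
    case "non-loopback":
      return ["webFrontend.allowedOrigins"];
  }
}
```

For the example in the message, a proxy on 127.0.0.1:8080 with the daemon on :7457 falls in the loopback split-port case, so either `OD_WEB_PORT` or `webFrontend.allowedOrigins` would be needed before requests stop being rejected with 403.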

15 commits