open-design

mirror of https://github.com/nexu-io/open-design.git synced 2026-06-01 03:14:35 +07:00

Nagendhra Madishetti e2f409579d docs: Critique Theater Phase 14 (user guide + 2 AGENTS module maps) (#1319)

* feat(web): pure reducer for Critique Theater states (Phase 7.1)

Pure CritiqueState reducer driven by the contracts-level PanelEvent (the same shape both the live SSE stream and the recorded transcript emit), so a single reducer powers both the in-flight panel and the rerun replay.

Lifecycle covers run_started → running → (shipped / degraded / interrupted / failed), with panelist_open / dim / must_fix / close / round_end events building per-round CritiquePanelistView entries as they arrive.

Defensive behaviour that surfaced while writing the spec tests:

- Terminal phases (shipped / degraded / interrupted / failed) are sticky against further lifecycle events for the same run, except for parser_warning which can land late and is recorded in a side channel without changing phase.
- A new run_started for a different runId at any time discards the prior state and reboots, so the UI can launch consecutive runs without an explicit reset action.
- Events whose runId does not match the active run return the same state reference, so React's useReducer doesn't re-render subscribers on stray traffic.
- Round bookkeeping keys by round number rather than "always last", so an out-of-order panelist_dim for round 1 arriving after a round 2 dim does not corrupt the round 2 bucket.

Test coverage: 18 cases covering each transition, the runId guard, sticky-terminal behaviour, the out-of-order round invariant, and the stable-identity guarantee.

Sets up Phase 7.2 and 7.3 to wire SSE + replay into the same reducer.

* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2)

createCritiqueEventsConnection is a pure connection manager that mirrors apps/web/src/providers/project-events.ts: opens an EventSource at /api/projects/:id/events, listens for every name in CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent (stripping the critique.
prefix and merging the data payload), and hands it to the caller's onEvent. Reconnect uses exponential backoff (1s → 30s) and resets on `ready`; malformed payloads drop with a dev-mode warning rather than tearing the stream.

useCritiqueStream wraps the manager in a useReducer that owns the CritiqueState. enabled=false or a null projectId tears down the connection cleanly; switching projectId closes the old connection and opens a fresh one. The returned dispatch lets local UI synthesise actions (e.g. an Esc keypress firing a synthetic interrupted while a kill request is in flight); production traffic comes from the SSE stream.

Test coverage:

- sse.test.ts (10 cases, node env): subscription set covers every CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire shape back to PanelEvent; malformed JSON is swallowed and does not stop the stream; exponential backoff schedule and ready-reset semantics are pinned with a setTimeout seam; close() cancels pending reconnects and shuts the live source; no-op fallback when EventSource is unavailable.
- useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event, reducer driven by synthetic actions, no connection when disabled or projectId is null, clean close on unmount, projectId change reopens cleanly.

* feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3)

Fetches the per-run NDJSON transcript (one PanelEvent per line), parses every line via the shared isPanelEvent predicate, and dispatches into the same CritiqueState reducer the live SSE stream uses. A single reducer means the UI rendering a replay can be identical to the live panel, and a UI mounting both useCritiqueStream and useCritiqueReplay in parallel does not have to reconcile two state shapes.

speed knob is `paused | instant | live | { intervalMs: N }`.

- instant flushes every event synchronously, useful for opening a finished run already at its terminal state.
- intervalMs paces dispatches at a fixed cadence so the reviewer can watch the run unfold.
- paused parses the transcript but holds events back until the caller advances speed (consumers can drive a scrubber later).
- live is reserved for the future "playback at original cadence" feature, currently treated as instant; replay timestamps are not yet persisted with each event so honest pacing requires a follow-up Phase 7+ task.

gunzip seam handles `.ndjson.gz` transcripts via DecompressionStream when present; the production fetch path picks between text and arrayBuffer based on the URL extension. Both seams are injectable so the unit tests don't need to spin up a real network or a real gzip pipeline.

Test coverage (8 cases, jsdom env):

- Idle status before any URL is provided.
- speed=instant flushes the full transcript synchronously to shipped state.
- speed={intervalMs:N} paces with the setTimeout seam, reaching done after the last tick.
- speed=paused leaves status=playing with no dispatches.
- Empty transcript reports done with state still idle.
- Fetch rejection surfaces an error status with the message.
- Malformed NDJSON lines are skipped; valid events around them still land.
- .gz transcripts route through the gunzip seam.

Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream + replay), all on one branch ready for review. Phases 8+ (Theater components) consume these from this PR.

* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review)

Two P1 fixes from lefarcen's review on PR #1307:

SSE payload override

`sseToPanelEvent` previously spread `data` after the channel-derived `type`, so a payload-provided `type` could override the channel and route a `critique.run_started` frame into the reducer as a `ship` action. Reversed the spread so the channel-derived `type` is authoritative, and revalidated the resulting object through the contracts-level `isPanelEvent` predicate before returning.
Frames that fail validation (missing runId, empty runId, unknown type) are dropped, so a malformed or compromised SSE frame can no longer dispatch a wrong-shape action into the reducer. Three new sse.test.ts cases pin the regression: hostile `type:'ship'` in the payload still resolves to `run_started`, missing runId is dropped, empty runId is dropped.

Replay pause/resume

`useCritiqueReplay` had one big effect keyed on `transcriptUrl` only, so flipping `speed` from `paused` to `instant` never re-fired and the held events sat undispatched. Split into a parse effect (depends on URL, fetches and stores events in state) and a pace effect (depends on parsed-events + speed, owns the cursor + timers). The playback cursor lives in a ref that survives pause/resume cycles, so flipping `paused` -> `instant` flushes from the current position rather than restarting (which would double-dispatch `run_started` and reset the reducer).

Two new useCritiqueReplay.test.tsx cases:

- paused-then-instant transitions from `playing` to `done` and reaches the shipped terminal phase
- intervalMs paced playback dispatches one event, pauses to drain the next scheduled timer, flips to instant, and confirms the remaining transcript drains exactly once (cursor was preserved)

Doc consistency

The earlier source comment in useCritiqueReplay.ts claimed `live` "paces by recorded timestamps" while the impl used zero-delay timers and the PR body said it behaves like `instant`. Aligned to reality: `live` currently behaves like `{ intervalMs: 0 }` (events drain on successive microtasks via setTimeoutFn) because transcripts do not yet carry per-event timestamps. Honest timestamp-driven pacing is queued as a Phase 7+ follow-up.

Validated: pnpm guard, pnpm --filter @open-design/web typecheck, Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite 96 files / 888 tests.
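
The payload-override fix described above can be illustrated with a minimal TypeScript sketch. The `PanelEvent` shape, the `KNOWN_TYPES` subset, and this simplified `isPanelEvent` are assumptions made for illustration, not the project's contracts package; only the spread ordering and the revalidation step come from the commit text.

```typescript
// Hypothetical, simplified stand-ins for the contracts-level types.
type PanelEvent = { type: string; runId: string; [key: string]: unknown };

// Illustrative subset of event names; the real list lives elsewhere.
const KNOWN_TYPES = new Set(["run_started", "ship", "failed"]);

// Simplified predicate (assumption): a known type plus a non-empty runId.
function isPanelEvent(value: unknown): value is PanelEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.type === "string" &&
    KNOWN_TYPES.has(v.type) &&
    typeof v.runId === "string" &&
    v.runId.length > 0
  );
}

// Decode one SSE frame. The channel name (e.g. "critique.run_started") is
// authoritative for `type`: the payload is spread first, so a payload-provided
// `type` is overwritten by the channel-derived one.
function sseToPanelEvent(channel: string, rawData: string): PanelEvent | null {
  const prefix = "critique.";
  if (!channel.startsWith(prefix)) return null;
  let data: unknown;
  try {
    data = JSON.parse(rawData);
  } catch {
    return null; // malformed JSON is dropped; the stream keeps going
  }
  if (typeof data !== "object" || data === null || Array.isArray(data)) return null;
  const candidate = {
    ...(data as Record<string, unknown>),
    type: channel.slice(prefix.length),
  };
  // Revalidate the merged object before it can reach a reducer.
  return isPanelEvent(candidate) ? candidate : null;
}
```

With this ordering, a frame on `critique.run_started` whose payload claims `type: "ship"` still decodes as `run_started`, and frames with a missing or empty `runId` return `null`.
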
* feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread)

* feat(web): Theater PanelistLane component (Phase 8.1)

* feat(web): Theater ScoreTicker component (Phase 8.2)

* feat(web): Theater RoundDivider component (Phase 8.3)

* feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4)

* feat(web): Theater TheaterDegraded chip (Phase 8.5)

* feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6)

* feat(web): Theater TheaterTranscript replay surface (Phase 8.7)

* feat(web): Theater TheaterStage top-level container (Phase 8.8)

* feat(web): Theater CSS using existing semantic tokens (no hex literals)

* feat(web): Theater public exports barrel

* fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314)

Addresses all 4 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen.

State-lifecycle fixes (3 x P2)

1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`). Host hooks dispatch it when their gating prop changes so a stale run from a prior project / transcript cannot bleed into the next context. Reset is idempotent on idle (returns the same reference).
2. `useCritiqueStream` dispatches `__reset__` at the top of its connection effect, so a workspace switch from project A (which streamed a critique) to project B clears the reducer before the new EventSource opens. enabled=false also clears.
3. `useCritiqueReplay` dispatches `__reset__` at the top of its parse effect, so transcriptUrl swaps (including swap-to-null after a replay reached `shipped`) lift the reducer back to idle before the new fetch starts.

SSE validation (1 x P2)

4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape` check after the cheap `isPanelEvent` predicate. A `critique.ship` frame missing `composite` / `round` / `status` / `artifactRef` is rejected before reaching the reducer, so TheaterCollapsed can no longer crash on `undefined.toFixed(1)`.
Every variant's required fields are validated: run_started (protocolVersion, non-empty cast, maxRounds, threshold, scale), panelist_* (round, role, plus variant-specific shape), round_end (round, composite, mustFix, decision in {continue,ship}, reason), ship (round, composite, status, artifactRef.{projectId,artifactId}, summary), degraded (reason, adapter), interrupted (bestRound, composite), failed (cause), parser_warning (kind, position).

Reducer correctness (1 x P2)

5. `panelist_open` now materializes the round + an empty panelist view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight the in-progress lane the instant the tag opens. Before this, a stream that emitted only `panelist_open` after `run_started` left `rounds = []` and the UI rendered no current round until a later `panelist_dim` arrived.

Polish (3 x P3)

6. Brand role tint swaps from `var(--magenta, var(--accent))` to `var(--purple, var(--accent))`. `--purple` is actually defined across the design systems; `--magenta` is not, so Brand was silently falling through to `--accent` and looking identical to Designer.
7. New i18n key `critiqueTheater.interruptedSummary` for the interrupted-collapse copy ("Interrupted at round N, best composite X.X"). Previously the interrupted branch reused `shippedSummary` and the UI read "Shipped at round..." for a run that specifically did not ship. Native value in en + zh-CN; other locales fall back via `...en` spread.
8. `TheaterDegraded` heading id comes from `useId()` instead of a hardcoded `theater-degraded-heading`, so two chips rendered on the same page (chat history with multiple completed runs) keep their aria-labelledby references unambiguous.

Tests (15 new cases)

- reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data.
- sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship.
- useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false.
- useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped.
- TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...".
- TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`.

Validated

- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new)
- tests/i18n/locales.test.ts 5 of 5 across 18 locales

* feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1)

* feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2)

* fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315)

Addresses every blocker from codex, Siri-Ray, and lefarcen. The three state-lifecycle and SSE-validation issues they also flagged inherit fixes from PR #1314's review pass that this branch now sits on top of after rebase.

Real daemon kill on Interrupt (P1)

- CritiqueTheaterMount now POSTs to /api/projects/:id/critique/:runId/interrupt alongside the optimistic local dispatch. Before this fix, clicking Interrupt only flipped the React state to interrupted while the daemon job kept running. The fetch is best-effort: a 404 (endpoint not wired yet, lands in Phase 15) is swallowed with a dev-mode console.warn so the UI still moves to the collapsed badge.
- New fetchInterrupt test seam lets RTL assert on the URL / method and simulate the "daemon not ready yet" path.
Two tests pin both: the happy URL proj-42/critique/run-abc/interrupt POSTs, and a rejected fetch still flips the UI.

interruptPending reset on new run (P2)

- A ref-backed effect compares the current runId against the last one we saw; when it changes, interruptPending is cleared. A user who interrupts run-1 and then triggers run-2 from the same mount now gets a fresh, enabled kill button instead of one stuck in "Interrupting…". Pinned by a new mount test.

Escape keybind scope (P2)

- InterruptButton now checks the keydown target. Escape inside an input, textarea, select, or contenteditable element is ignored (and any ancestor of those via closest() is treated the same way). Body-level focus still fires the keybind so the Theater area's affordance keeps working. Four new tests cover textarea, input, contenteditable, and the body-focus positive case.

userFacingName i18n key (P2)

- The spec at specs/current/critique-theater.md:6 mandates a single critiqueTheater.userFacingName key so the "Design Jury" label can be renamed without touching code. Phase 8 introduced critiqueTheater.title by mistake; renamed across types.ts, en.ts, zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer TheaterStage.tsx. The locale alignment test stays green.

Validated

- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 14 files, 112 tests (was 101 before, +11 new for the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope; the rest were already in #1314's review fix).
- tests/i18n/locales.test.ts 5 of 5 across 18 locales.

* feat(daemon): adapter-degraded registry with TTL (Phase 10.1)

In-memory registry recording adapters that produced malformed or oversize transcripts so the orchestrator can skip them for a TTL window (default 24h) instead of cycling through known-bad providers on every run. Records carry reason (malformed_block | oversize_block | missing_artifact), source label, and expiresAt.
The test-only clock seam lets the suite advance time deterministically and prove that an expired entry stops counting as degraded without anyone calling clearDegraded. 7/7 vitest cases green.

* feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2)

Two test-only adapters that read the existing v1 transcript fixtures (happy-3-rounds and malformed-unbalanced) and replay them as either a full string or a 512-byte chunked stream. The chunked form is what the conformance harness uses to prove the parser holds together when the transcript arrives in arbitrary network slices, not as one buffered blob.

* feat(daemon): adapter conformance harness (Phase 10.3)

runAdapterConformance pulls a transcript through the same parseCritiqueStream pipeline the orchestrator uses and classifies the outcome as shipped, degraded, or failed. On a degraded outcome it forwards the matched reason to the adapter-degraded registry, so a single nightly conformance run is what populates the skip list rather than the orchestrator learning each adapter is broken at request time. 5/5 vitest cases green covering shipped, malformed degraded, oversize degraded, no-ship failure, and the harness-thrown failure path.

* test(e2e): Critique Theater Playwright suite (Phase 11)

Six tests, one viewport per visual case, deterministic SSE fixtures stubbed via page.route(). Adds the suite to test:ui:extended so the existing extended-UI lane picks it up.

Coverage:

1. Happy path: a single mounted theater plays the full fixture (1 run_started, 5 panelists open / dim / must_fix / close, 1 round_end, 1 ship) and ends on the score badge.
2. Interrupt mid-run: the panelist that is open at the time the interrupt button is clicked closes with an interrupted marker and the transcript freezes there.
3. Visual regression at 375x720 mobile.
4. Visual regression at 768x1024 tablet.
5. Visual regression at 1280x800 desktop.
6.
A11y role tree: the theater region exposes a labelled landmark, each panelist lane is a group with an accessible name, the score is a status live region.

All SSE traffic is stubbed by page.route so the suite runs in CI without a daemon. The toggle is seeded via localStorage by bootAppWithCritiqueEnabled so the gate behaves as if Settings flipped it on. typecheck clean; playwright --list reports 6.

* test(web): reducer p99 bench at 10k iterations (Phase 13.1)

Locks the documented 2ms budget for the Critique Theater reducer on a representative SSE script (27 actions, one full happy run) behind a regression gate. Asserts p99 stays under 4ms (2x the documented budget) so CI runners with a noisy neighbour do not flake while a real regression to 20ms or 200ms still trips. The bench is a vitest case rather than a bare microbenchmark so it runs in the same CI lane as every other web test and does not need a parallel runner.

* test(web): critique surface coverage walker (Phase 13.2)

Walks the public critique surface (11 SSE event names, 5 panelist roles, 6 lifecycle phases, 9 named i18n keys) and asserts each named symbol appears in both the src corpus and the test corpus. The walker is the gate that catches a rename in one half of the codebase without a matching update in the other half: a future PR that drops 'panelist_must_fix' from the reducer without also removing its test reference fails this suite. 62 assertions, one per symbol per corpus.

* docs: Critique Theater user guide (Phase 14.1)

Seven sections aimed at end users (not contributors):

1. What is Design Jury
2. How it works (the five panelists, auto-converging rounds, the composite formula)
3. Settings (the M1 toggle and what it does)
4. Reading the score badge
5. Replay surface
6. Troubleshooting (degraded, interrupted, failed)
7.
FAQ

The composite formula is documented as designer * 0 + critic * 0.4 + brand * 0.2 + a11y * 0.2 + copy * 0.2 because anyone trying to reverse-engineer the score is going to search for those weights and the docs are the place they should land first.

* docs(daemon): critique module AGENTS map (Phase 14.2)

Daemon-side wayfinder for the apps/daemon/src/critique directory. Tables every file, what owns what invariant, and the 'when you change anything here' guide so a future contributor does not have to reverse-engineer the rollout resolver before adding a new SSE event.

* docs(web): Theater module AGENTS map (Phase 14.3)

Web-side mirror of the daemon AGENTS map. Same file table, same invariants section, same change-impact guide, sized to the Theater component package.

* docs: tighten Phase 14 reasoning from lefarcen review (PR #1319)

Four content gaps lefarcen flagged in the Phase 14 docs review, addressed inline rather than deferred. The fifth item (scope-drift between 'docs only' PR body and the cumulative stacked diff) is handled by rewriting the PR body, not the docs.

1. Round exit conditions (lefarcen P2-1). docs/critique-theater.md §2 'Auto-converging rounds' now lists the five conditions that stop a run (threshold reached, round budget exhausted, per-round timeout, total timeout, user interrupt) with their default values. A user debugging a run that stopped at round 1 with composite 5.4 can read this list and find the matching cause without spelunking the orchestrator.
2. Prior-art comparison (lefarcen P2-2). New §1.5 'Why an in-CLI panel and not a third-party design lint' pre-answers the 'why not Figma lint / Adobe checker / Material You conformance' question. Three differences: rule engines vs generative reviewers, post-hoc vs in-loop, external service vs same-CLI-session.
3. Composite formula rationale (lefarcen P2-4).
§2 now explains why each weight is set the way it is: critic gates correctness so it gets 0.4; brand / a11y / copy are secondary quality dimensions at 0.2 each; designer is at 0.0 in v1 because aesthetic preference is not a ship gate. The slot stays in the schema so notes flow into the transcript and a v2 config release can bump the weight without a wire-shape change.
4. v2 cast-config ownership (lefarcen P2-3). Both AGENTS.md files (daemon + web) now declare a 'Designer weight frozen at 0.0 until v2 cast config' invariant. The daemon side calls out where the SKILL.md frontmatter resolver lands (apps/daemon/src/critique/config.ts); the web side calls out where the Settings surface lands (apps/web/src/components/Settings/). A contributor reading either AGENTS.md before implementing v2 sees which module to touch first.

* docs(web): mirror the Designer-weight invariant in Theater AGENTS.md (PR #1319)

lefarcen P1 follow-up on PR #1319: the daemon AGENTS.md already declares 'Designer weight is frozen at 0.0 until v2 cast config lands' as an invariant, but the web AGENTS.md's parallel bullet led with 'Composite weights are read-only on the web side' which buried the Designer-specific constraint. A web contributor reading that bullet would not realise the v1 weight distribution is wire-shape (changing it mid-v1 invalidates persisted critique_runs composite values).

Rewrote the bullet to lead with the same 'Designer weight is frozen at 0.0 until v2 cast config lands' phrasing the daemon side uses, and added an explicit cross-link to the daemon AGENTS.md so the two halves of the invariant read as one rule. Web-side specifics retained: ScoreTicker / TheaterCollapsed read composite off the wire (no client recompute), v2 lands as a Settings surface at apps/web/src/components/Settings/, do not add a 'weights' prop to any component in this directory until the contracts package carries the v2 cast type.
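
For reference, the documented v1 weighting can be sketched in a few lines of TypeScript. This is an illustration of the formula only: the `Role` type, `V1_WEIGHTS`, and `composite` are hypothetical names, and per the invariant above the web side reads composite off the wire rather than recomputing it.

```typescript
// Hypothetical names; only the weights themselves come from the commit text.
type Role = "designer" | "critic" | "brand" | "a11y" | "copy";

// v1 weights: designer is frozen at 0.0, critic gates correctness at 0.4,
// brand / a11y / copy are secondary dimensions at 0.2 each.
const V1_WEIGHTS: Record<Role, number> = {
  designer: 0,
  critic: 0.4,
  brand: 0.2,
  a11y: 0.2,
  copy: 0.2,
};

// Weighted sum of the per-role scores.
function composite(scores: Record<Role, number>): number {
  let total = 0;
  for (const role of Object.keys(V1_WEIGHTS) as Role[]) {
    total += V1_WEIGHTS[role] * scores[role];
  }
  return total;
}

// Example with made-up scores: 0*3 + 0.4*8 + 0.2*7 + 0.2*9 + 0.2*6
const example = composite({ designer: 3, critic: 8, brand: 7, a11y: 9, copy: 6 });
console.log(example.toFixed(1)); // prints "7.6"
```

Because the designer weight is 0, changing only the designer score leaves the result unchanged, which matches the "aesthetic preference is not a ship gate" rationale.
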
* docs: replace deferred metrics endpoint reference + refresh Theater module map (PR #1319)

Two carryover items lefarcen flagged across the PR #1319 + #1320 reviews.

1. docs/critique-theater.md was sending users to /api/metrics/critique as the conformance-status check on malformed_block, but the Phase 12 metrics endpoint is explicitly deferred until after orchestrator wiring lands. Replaced the link with the pnpm conformance-harness command that DOES exist today (pnpm --filter @open-design/daemon vitest run tests/critique-conformance.test.ts) and noted that the dashboard surfaces this status as a series once Phase 12 ships.
2. apps/web/src/components/Theater/AGENTS.md module map was stale after Phase 15: the index.ts row said 'only two hooks are exported' but the barrel now exports useCritiqueTheaterEnabled too (plus the setCritiqueTheaterEnabled setter). Updated the row to list all three hooks + the setter + the reducer-derived contract types, and added a new row for hooks/useCritiqueTheaterEnabled.ts in the file table so a web contributor scanning the table sees the new hook without inferring it from the index.ts blurb.

* fix(web): restore wait-for-daemon-ack pattern on Theater interrupt

Same regression as flagged on PR #1316 post-main-merge: the optimistic local dispatch fired before the POST resolved, so a daemon 404 / 409 still terminalized the UI and the real SSE terminal event got ignored by the sticky interrupted phase. Snapshot runId / bestRound / composite at click time, dispatch interrupted only on res.ok, clear interruptPending on rejection or non-2xx so the user can retry. Tests cover rejection + 404 leaving the run on the live stage; the 204 path waits for the ack.

---------

Co-authored-by: Nagendhra <nagendhra405@gmail.com>		2026-05-13 12:11:48 +08:00
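
The wait-for-daemon-ack pattern from the last fix above can be sketched as follows. All names here (`InterruptSnapshot`, `resolveInterrupt`, `requestInterrupt`, the `FetchInterrupt` seam signature, and the `allowRetry` flag) are hypothetical; only the endpoint path, the snapshot fields, and the "dispatch only on res.ok, allow retry otherwise" rule come from the commit text.

```typescript
// Values captured at click time, so a later state change cannot skew the action.
type InterruptSnapshot = { runId: string; bestRound: number; composite: number };

type InterruptedAction = { type: "interrupted" } & InterruptSnapshot;

// null models a rejected fetch (network error); otherwise only `ok` matters here.
type AckResult = { ok: boolean } | null;

type Outcome = { action: InterruptedAction | null; allowRetry: boolean };

// Pure decision step: produce the interrupted action only after a 2xx ack.
function resolveInterrupt(snapshot: InterruptSnapshot, ack: AckResult): Outcome {
  if (ack !== null && ack.ok) {
    return { action: { type: "interrupted", ...snapshot }, allowRetry: false };
  }
  // Rejection or non-2xx: stay on the live stage and let the user retry.
  return { action: null, allowRetry: true };
}

function interruptUrl(projectId: string, runId: string): string {
  return `/api/projects/${projectId}/critique/${runId}/interrupt`;
}

// Injectable fetch seam (assumed shape) so tests need no real network.
type FetchInterrupt = (url: string, init: { method: string }) => Promise<{ ok: boolean }>;

async function requestInterrupt(
  fetchInterrupt: FetchInterrupt,
  projectId: string,
  snapshot: InterruptSnapshot,
): Promise<Outcome> {
  let ack: AckResult = null;
  try {
    ack = await fetchInterrupt(interruptUrl(projectId, snapshot.runId), { method: "POST" });
  } catch {
    ack = null; // treat a rejected fetch like a failed ack
  }
  return resolveInterrupt(snapshot, ack);
}
```

Keeping the decision in a pure function separate from the network call is a design choice of this sketch; it lets the 204, 404, and rejection paths be checked without mocking timers or a daemon.
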
lib	fix(daemon): persist runStatus/endedAt on chat run termination (#1230)	2026-05-11 15:37:52 +08:00
resources	test: strengthen e2e PR coverage (#796)	2026-05-08 16:48:10 +08:00
scripts	test: harden e2e smoke and release reports (#1140)	2026-05-11 13:11:16 +08:00
specs	test: expand nightly UI and desktop regression coverage (#1256)	2026-05-11 19:23:13 +08:00
tests	feat(design-systems): add structured tokens.css schema (default + kami) (#1231)	2026-05-11 22:23:34 +08:00
ui	docs: Critique Theater Phase 14 (user guide + 2 AGENTS module maps) (#1319)	2026-05-13 12:11:48 +08:00
AGENTS.md	test: harden e2e smoke and release reports (#1140)	2026-05-11 13:11:16 +08:00
package.json	test: add Memory and Routines coverage (#1400)	2026-05-12 17:48:56 +08:00
playwright.config.ts	test: stabilize extended Playwright coverage (#1341)	2026-05-12 15:11:34 +08:00
tsconfig.json	test(e2e): gate beta packaged runtime (#637)	2026-05-06 17:44:29 +08:00
vitest.config.ts	test(e2e): gate beta packaged runtime (#637)	2026-05-06 17:44:29 +08:00