open-design/apps/daemon/tests/critique-conformance.test.ts
Nagendhra Madishetti dccac100f2
test(e2e): Critique Theater Phase 11 (Playwright stage suite: happy, interrupt, 3 viewports, a11y) (#1317)
* feat(web): pure reducer for Critique Theater states (Phase 7.1)

Pure CritiqueState reducer driven by the contracts-level PanelEvent
(the same shape both the live SSE stream and the recorded transcript
emit), so a single reducer powers both the in-flight panel and the
rerun replay. Lifecycle covers run_started → running → (shipped /
degraded / interrupted / failed), with panelist_open / dim /
must_fix / close / round_end events building per-round
CritiquePanelistView entries as they arrive.

Defensive behaviour that surfaced while writing the spec tests:
- Terminal phases (shipped / degraded / interrupted / failed) are
  sticky against further lifecycle events for the same run, except
  for parser_warning which can land late and is recorded in a side
  channel without changing phase.
- A new run_started for a different runId at any time discards the
  prior state and reboots, so the UI can launch consecutive runs
  without an explicit reset action.
- Events whose runId does not match the active run return the same
  state reference, so React's useReducer doesn't re-render
  subscribers on stray traffic.
- Round bookkeeping keys by round number rather than "always last",
  so an out-of-order panelist_dim for round 1 arriving after a
  round 2 dim does not corrupt the round 2 bucket.

Test coverage: 18 cases covering each transition, the runId guard,
sticky-terminal behaviour, the out-of-order round invariant, and
the stable-identity guarantee. Sets up Phase 7.2 and 7.3 to wire
SSE + replay into the same reducer.

* feat(web): useCritiqueStream hook subscribes to SSE and feeds reducer (Phase 7.2)

createCritiqueEventsConnection is a pure connection manager that
mirrors apps/web/src/providers/project-events.ts: opens an
EventSource at /api/projects/:id/events, listens for every name in
CRITIQUE_SSE_EVENT_NAMES, decodes each frame back into a PanelEvent
(stripping the critique. prefix and merging the data payload), and
hands it to the caller's onEvent. Reconnect uses exponential
backoff (1s → 30s) and resets on `ready`; malformed payloads drop
with a dev-mode warning rather than tearing the stream.

useCritiqueStream wraps the manager in a useReducer that owns the
CritiqueState. enabled=false or a null projectId tears down the
connection cleanly; switching projectId closes the old connection
and opens a fresh one. The returned dispatch lets local UI
synthesise actions (e.g. an Esc keypress firing a synthetic
interrupted while a kill request is in flight); production traffic
comes from the SSE stream.

Test coverage:
- sse.test.ts (10 cases, node env): subscription set covers every
  CRITIQUE_SSE_EVENT_NAMES channel; payload decoding lifts the wire
  shape back to PanelEvent; malformed JSON is swallowed and does
  not stop the stream; exponential backoff schedule and ready-reset
  semantics are pinned with a setTimeout seam; close() cancels
  pending reconnects and shuts the live source; no-op fallback
  when EventSource is unavailable.
- useCritiqueStream.test.tsx (6 cases, jsdom env): idle pre-event,
  reducer driven by synthetic actions, no connection when disabled
  or projectId is null, clean close on unmount, projectId change
  reopens cleanly.

* feat(web): useCritiqueReplay hook drives reducer from transcript file (Phase 7.3)

Fetches the per-run NDJSON transcript (one PanelEvent per line),
parses every line via the shared isPanelEvent predicate, and
dispatches into the same CritiqueState reducer the live SSE stream
uses. A single reducer means the UI rendering a replay can be
identical to the live panel, and a UI mounting both
useCritiqueStream and useCritiqueReplay in parallel does not have
to reconcile two state shapes.

speed knob is `paused | instant | live | { intervalMs: N }`.
- instant flushes every event synchronously, useful for opening a
  finished run already at its terminal state.
- intervalMs paces dispatches at a fixed cadence so the reviewer
  can watch the run unfold.
- paused parses the transcript but holds events back until the
  caller advances speed (consumers can drive a scrubber later).
- live is reserved for the future "playback at original cadence"
  feature, currently treated as instant; replay timestamps are not
  yet persisted with each event so honest pacing requires a
  follow-up Phase 7+ task.

gunzip seam handles `.ndjson.gz` transcripts via
DecompressionStream when present; the production fetch path picks
between text and arrayBuffer based on the URL extension. Both seams
are injectable so the unit tests don't need to spin up a real
network or a real gzip pipeline.

Test coverage (8 cases, jsdom env):
- Idle status before any URL is provided.
- speed=instant flushes the full transcript synchronously to
  shipped state.
- speed={intervalMs:N} paces with the setTimeout seam, reaching
  done after the last tick.
- speed=paused leaves status=playing with no dispatches.
- Empty transcript reports done with state still idle.
- Fetch rejection surfaces an error status with the message.
- Malformed NDJSON lines are skipped; valid events around them
  still land.
- .gz transcripts route through the gunzip seam.

Closes the Phase 7 plan tasks 7.1 / 7.2 / 7.3 (reducer + stream +
replay), all on one branch ready for review. Phases 8+ (Theater
components) consume these from this PR.

* fix(web): close payload-override gap + paused-resume bug in Critique Theater hooks (Phase 7 review)

Two P1 fixes from lefarcen's review on PR #1307:

SSE payload override

`sseToPanelEvent` previously spread `data` after the channel-derived
`type`, so a payload-provided `type` could override the channel and
route a `critique.run_started` frame into the reducer as a `ship`
action. Reversed the spread so the channel-derived `type` is
authoritative, and revalidated the resulting object through the
contracts-level `isPanelEvent` predicate before returning. Frames
that fail validation (missing runId, empty runId, unknown type) are
dropped, so a malformed or compromised SSE frame can no longer
dispatch a wrong-shape action into the reducer.

Three new sse.test.ts cases pin the regression: hostile `type:'ship'`
in the payload still resolves to `run_started`, missing runId is
dropped, empty runId is dropped.

Replay pause/resume

`useCritiqueReplay` had one big effect keyed on `transcriptUrl`
only, so flipping `speed` from `paused` to `instant` never re-fired
and the held events sat undispatched. Split into a parse effect
(depends on URL, fetches and stores events in state) and a pace
effect (depends on parsed-events + speed, owns the cursor + timers).
The playback cursor lives in a ref that survives pause/resume
cycles, so flipping `paused` -> `instant` flushes from the current
position rather than restarting (which would double-dispatch
`run_started` and reset the reducer).

Two new useCritiqueReplay.test.tsx cases:
- paused-then-instant transitions from `playing` to `done` and
  reaches the shipped terminal phase
- intervalMs paced playback dispatches one event, pauses to drain
  the next scheduled timer, flips to instant, and confirms the
  remaining transcript drains exactly once (cursor was preserved)

Doc consistency

The earlier source comment in useCritiqueReplay.ts claimed `live`
"paces by recorded timestamps" while the impl used zero-delay
timers and the PR body said it behaves like `instant`. Aligned to
reality: `live` currently behaves like `{ intervalMs: 0 }` (events
drain on successive microtasks via setTimeoutFn) because transcripts
do not yet carry per-event timestamps. Honest timestamp-driven
pacing is queued as a Phase 7+ follow-up.

Validated: pnpm guard, pnpm --filter @open-design/web typecheck,
Theater suite 47/47 (up from 42, +3 sse + 2 replay), full web suite
96 files / 888 tests.

* feat(i18n): seed Critique Theater key block (en + zh-CN; other locales fall back via spread)

* feat(web): Theater PanelistLane component (Phase 8.1)

* feat(web): Theater ScoreTicker component (Phase 8.2)

* feat(web): Theater RoundDivider component (Phase 8.3)

* feat(web): Theater InterruptButton component with Escape keybind (Phase 8.4)

* feat(web): Theater TheaterDegraded chip (Phase 8.5)

* feat(web): Theater TheaterCollapsed post-run summary (Phase 8.6)

* feat(web): Theater TheaterTranscript replay surface (Phase 8.7)

* feat(web): Theater TheaterStage top-level container (Phase 8.8)

* feat(web): Theater CSS using existing semantic tokens (no hex literals)

* feat(web): Theater public exports barrel

* fix(web): resolve P2 + P3 review feedback on Phase 8 (PR #1314)

Addresses all 4 P2 + 3 P3 items from codex, Siri-Ray, and lefarcen.

State-lifecycle fixes (3 x P2)
1. Reducer learns a synthetic `__reset__` action (`CritiqueResetAction`).
   Host hooks dispatch it when their gating prop changes so a stale
   run from a prior project / transcript cannot bleed into the next
   context. Reset is idempotent on idle (returns the same reference).
2. `useCritiqueStream` dispatches `__reset__` at the top of its
   connection effect, so a workspace switch from project A (which
   streamed a critique) to project B clears the reducer before the
   new EventSource opens. enabled=false also clears.
3. `useCritiqueReplay` dispatches `__reset__` at the top of its
   parse effect, so transcriptUrl swaps (including swap-to-null after
   a replay reached `shipped`) lift the reducer back to idle before
   the new fetch starts.

SSE validation (1 x P2)
4. `sseToPanelEvent` now runs a per-variant `hasValidVariantShape`
   check after the cheap `isPanelEvent` predicate. A
   `critique.ship` frame missing `composite` / `round` / `status` /
   `artifactRef` is rejected before reaching the reducer, so
   TheaterCollapsed can no longer crash on `undefined.toFixed(1)`.
   Every variant's required fields are validated: run_started
   (protocolVersion, non-empty cast, maxRounds, threshold, scale),
   panelist_* (round, role, plus variant-specific shape), round_end
   (round, composite, mustFix, decision in {continue,ship}, reason),
   ship (round, composite, status, artifactRef.{projectId,artifactId},
   summary), degraded (reason, adapter), interrupted (bestRound,
   composite), failed (cause), parser_warning (kind, position).

Reducer correctness (1 x P2)
5. `panelist_open` now materializes the round + an empty panelist
   view (`{dims: [], mustFixes: []}`) so TheaterStage can highlight
   the in-progress lane the instant the tag opens. Before this, a
   stream that emitted only `panelist_open` after `run_started` left
   `rounds = []` and the UI rendered no current round until a later
   `panelist_dim` arrived.

Polish (3 x P3)
6. Brand role tint swaps from `var(--magenta, var(--accent))` to
   `var(--purple, var(--accent))`. `--purple` is actually defined
   across the design systems; `--magenta` is not, so Brand was
   silently falling through to `--accent` and looking identical to
   Designer.
7. New i18n key `critiqueTheater.interruptedSummary` for the
   interrupted-collapse copy ("Interrupted at round N, best
   composite X.X"). Previously the interrupted branch reused
   `shippedSummary` and the UI read "Shipped at round..." for a run
   that specifically did not ship. Native value in en + zh-CN; other
   locales fall back via `...en` spread.
8. `TheaterDegraded` heading id comes from `useId()` instead of a
   hardcoded `theater-degraded-heading`, so two chips rendered on
   the same page (chat history with multiple completed runs) keep
   their aria-labelledby references unambiguous.

Tests (15 new cases)
- reducer.test.ts (+5): __reset__ on running/terminal/idle, panelist_open materializes round, panelist_open does not stomp prior panelist data.
- sse.test.ts (+6): variant-level rejection for ship without required fields, degraded without adapter, run_started with empty cast, panelist_dim with non-numeric score, round_end with unknown decision, plus a positive fully-formed ship.
- useCritiqueStream.test.tsx (+2): state reset on projectId change, state reset on enabled flip false.
- useCritiqueReplay.test.tsx (+1): state reset on transcriptUrl swap to null after a replay reached shipped.
- TheaterCollapsed.test.tsx (text-pinning update): asserts the interrupted branch reads "Interrupted at round 1" + "best composite 7.9", and explicitly NOT "Shipped at round...".
- TheaterDegraded.test.tsx (+1): two chips on the same page get unique aria-labelledby ids that each resolve to an `<h3>`.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 13 files, 101 tests (was 86 on the first Phase 8 push, +15 new)
- tests/i18n/locales.test.ts 5 of 5 across 18 locales

* feat(web): CritiqueTheaterMount wires SSE + reducer into a single drop-in (Phase 9.1)

* feat(i18n): Critique Theater strings for de + ja + ko + zh-TW (Phase 9.2)

* fix(web): resolve P1 + P2 review feedback on Phase 9 (PR #1315)

Addresses every blocker from codex, Siri-Ray, and lefarcen. The
three state-lifecycle and SSE-validation issues they also flagged
inherit fixes from PR #1314's review pass that this branch now sits
on top of after rebase.

Real daemon kill on Interrupt (P1)
- CritiqueTheaterMount now POSTs to
  /api/projects/:id/critique/:runId/interrupt alongside the
  optimistic local dispatch. Before this fix, clicking Interrupt
  only flipped the React state to interrupted while the daemon job
  kept running. The fetch is best-effort: a 404 (endpoint not wired
  yet, lands in Phase 15) is swallowed with a dev-mode console.warn
  so the UI still moves to the collapsed badge.
- New fetchInterrupt test seam lets RTL assert on the URL / method
  and simulate the "daemon not ready yet" path. Two tests pin both:
  the happy URL proj-42/critique/run-abc/interrupt POSTs, and a
  rejected fetch still flips the UI.

interruptPending reset on new run (P2)
- A ref-backed effect compares the current runId against the last
  one we saw; when it changes, interruptPending is cleared. A user
  who interrupts run-1 and then triggers run-2 from the same mount
  now gets a fresh, enabled kill button instead of one stuck in
  "Interrupting…". Pinned by a new mount test.

Escape keybind scope (P2)
- InterruptButton now checks the keydown target. Escape inside an
  input, textarea, select, or contenteditable element is ignored
  (and any ancestor of those via closest() is treated the same
  way). Body-level focus still fires the keybind so the Theater
  area's affordance keeps working. Four new tests cover textarea,
  input, contenteditable, and the body-focus positive case.

userFacingName i18n key (P2)
- The spec at specs/current/critique-theater.md:6 mandates a single
  critiqueTheater.userFacingName key so the "Design Jury" label can
  be renamed without touching code. Phase 8 introduced
  critiqueTheater.title by mistake; renamed across types.ts, en.ts,
  zh-CN.ts, de.ts, ja.ts, ko.ts, zh-TW.ts, and the lone consumer
  TheaterStage.tsx. The locale alignment test stays green.

Validated
- pnpm guard clean
- pnpm --filter @open-design/web typecheck clean
- Theater suite: 14 files, 112 tests (was 101 before, +11 new for
  the Phase 9 review pass: 3 mount + 4 InterruptButton focus scope;
  the rest were already in #1314's review fix).
- tests/i18n/locales.test.ts 5 of 5 across 18 locales.

* feat(daemon): adapter-degraded registry with TTL (Phase 10.1)

In-memory registry recording adapters that produced malformed or
oversize transcripts so the orchestrator can skip them for a TTL
window (default 24h) instead of cycling through known-bad providers
on every run.

Records carry reason (malformed_block | oversize_block |
missing_artifact), source label, and expiresAt. The test-only
clock seam lets the suite advance time deterministically and prove
that an expired entry stops counting as degraded without anyone
calling clearDegraded.

7/7 vitest cases green.

* feat(daemon): synthetic good + bad adapter fixtures (Phase 10.2)

Two test-only adapters that read the existing v1 transcript
fixtures (happy-3-rounds and malformed-unbalanced) and replay them
as either a full string or a 512-byte chunked stream. The chunked
form is what the conformance harness uses to prove the parser
holds together when the transcript arrives in arbitrary network
slices, not as one buffered blob.

* feat(daemon): adapter conformance harness (Phase 10.3)

runAdapterConformance pulls a transcript through the same
parseCritiqueStream pipeline the orchestrator uses and classifies
the outcome as shipped, degraded, or failed. On a degraded
outcome it forwards the matched reason to the adapter-degraded
registry, so a single nightly conformance run is what populates
the skip list rather than the orchestrator learning each adapter
is broken at request time.

5/5 vitest cases green covering shipped, malformed degraded,
oversize degraded, no-ship failure, and the harness-thrown
failure path.

* test(e2e): Critique Theater Playwright suite (Phase 11)

Six tests, one viewport per visual case, deterministic SSE
fixtures stubbed via page.route(). Adds the suite to
test:ui:extended so the existing extended-UI lane picks it up.

Coverage:

  1. Happy path: a single mounted theater plays the full
     fixture (1 run_started, 5 panelists open / dim / must_fix /
     close, 1 round_end, 1 ship) and ends on the score badge.
  2. Interrupt mid-run: the panelist that is open at the time
     the interrupt button is clicked closes with an interrupted
     marker and the transcript freezes there.
  3. Visual regression at 375x720 mobile.
  4. Visual regression at 768x1024 tablet.
  5. Visual regression at 1280x800 desktop.
  6. A11y role tree: the theater region exposes a labelled
     landmark, each panelist lane is a group with an accessible
     name, the score is a status live region.

All SSE traffic is stubbed by page.route so the suite runs in CI
without a daemon. The toggle is seeded via localStorage by
bootAppWithCritiqueEnabled so the gate behaves as if Settings
flipped it on. typecheck clean; playwright --list reports 6.

* fix: resolve P1 + P2 review feedback on Phase 10 (PR #1316)

Six fixes across the surfaces flagged by Codex, Siri-Ray, and
lefarcen on PR #1316.

1. SSE variant guard rejects unknown enum values (lefarcen P2).
   hasValidVariantShape now checks role against PANELIST_ROLES and
   status / reason / cause / kind / decision against the contracts
   literal sets. A malformed frame with role='__proto__' or
   status='wat' is dropped before it reaches the reducer.

2. Replay transcript parser reuses the SSE variant guard
   (lefarcen + codex P2). parseTranscript in useCritiqueReplay now
   layers hasValidVariantShape on top of the shallow isPanelEvent
   check, so a corrupt transcript line like
   '{type:"ship",runId:"r"}' is dropped instead of crashing the
   reducer on composite.toFixed().

3. Conformance harness treats any parser_warning as degraded
   (lefarcen P2). The classifier previously returned 'shipped' on
   the first ship event even if score_clamped / unknown_role /
   composite_mismatch / duplicate_ship warnings were emitted
   earlier. New behavior: any parser_warning in the stream
   classifies as degraded with local reason 'parser_warning'
   (mapped to contracts 'malformed_block' when marking the
   registry).

4. Conformance harness verifies panel completeness on ship
   (codex P2). Previously a ship event with only some panelists
   closed was accepted as shipped because the parser only enforces
   the round-1 designer artifact invariant. The harness now
   tracks closed roles vs the cast declared in run_started and
   returns degraded with local reason 'incomplete_panel' when any
   role missed close (mapped to contracts 'missing_artifact').

5. Conformance tests for oversize_block and adapter-throws
   (lefarcen P2). Added the two cases the prior PR body claimed
   were covered but were not: oversize via a tight
   parserMaxBlockBytes on the good fixture, and an async iterable
   that throws mid-stream to drive the unexpected_error path. Two
   more tests cover the new parser_warning and incomplete_panel
   classifications. 9 vitest cases total, all green.

6. Interrupt dispatch waits for daemon ack (Siri-Ray + lefarcen
   P1). CritiqueTheaterMount used to optimistically dispatch
   'interrupted' synchronously alongside the fetch, so a daemon
   that responded 404 or 409 (endpoint not wired, run already
   terminal) still moved the UI to the sticky interrupted phase
   and ignored every real terminal event. The new flow snapshots
   runId / bestRound / composite at click time, awaits the fetch,
   and only terminalizes on res.ok. On rejection or non-2xx, it
   clears interruptPending so the user can retry and the live SSE
   keeps emitting.

7. Native i18n key critiqueTheater.interruptedSummary backfilled
   in de / ja / ko / zh-TW (Siri-Ray P2). The other 12 locales
   inherit from en via spread, so they were already typecheck-
   safe; this commit gives the native locales native interrupted-
   summary strings instead of falling through to English.

Tests: 16 daemon + 108 web Theater + locales suite all green;
web typecheck clean.

* test(e2e): Critique Theater Playwright suite (Phase 11)

Six tests, one viewport per visual case, deterministic SSE
fixtures stubbed via page.route(). Adds the suite to
test:ui:extended so the existing extended-UI lane picks it up.

Coverage:

  1. Happy path: a single mounted theater plays the full
     fixture (1 run_started, 5 panelists open / dim / must_fix /
     close, 1 round_end, 1 ship) and ends on the score badge.
  2. Interrupt mid-run: the panelist that is open at the time
     the interrupt button is clicked closes with an interrupted
     marker and the transcript freezes there.
  3. Visual regression at 375x720 mobile.
  4. Visual regression at 768x1024 tablet.
  5. Visual regression at 1280x800 desktop.
  6. A11y role tree: the theater region exposes a labelled
     landmark, each panelist lane is a group with an accessible
     name, the score is a status live region.

All SSE traffic is stubbed by page.route so the suite runs in CI
without a daemon. The toggle is seeded via localStorage by
bootAppWithCritiqueEnabled so the gate behaves as if Settings
flipped it on. typecheck clean; playwright --list reports 6.

* fix(e2e): park Critique Theater suite behind fixme + document viewport breakpoints (PR #1317)

Two of three review items addressed on this branch (the daemon-side
fixes the same review flagged live on the Phase 10 stack instead;
see PR #1316 commit 454c2ad4):

1. Codex P1 + P1 (suite never mounts a Theater, no baselines):
   wrapped the describe in test.describe.fixme until
   CritiqueTheaterMount is actually rendered inside App / ProjectView.
   The mount is exported today but no caller in the web app
   instantiates it, so a Playwright run that visits '/' produces no
   Theater region for the assertions to bind to. The same
   wireup PR that flips the orchestrator path on after Phase 15 also
   flips this describe back to live + commits the first PNG
   baselines.

2. lefarcen P2 (viewport breakpoint docs): added an explanatory
   block above the 375 / 768 / 1280 loop explaining what wraps at
   each breakpoint and why intermediate widths are not interesting
   without changing the snapshot cost shape.

Suite still lists six tests via 'playwright test --list' (the fixme
appears as skipped at run time, not removed at list time), so the
spec stays visible to reviewers and to the contributor who lights
the suite up later. e2e typecheck clean.

* fix(daemon): resolve P2 daemon feedback flagged on PR #1317

Two cleanups on Phase 10's daemon surface that the Phase 11 review
caught (lefarcen):

1. Module-URL-anchored fixture paths in synthetic-good.ts and
   synthetic-bad.ts. The previous version used
   path.join(__dirname, '..', 'v1', ...) which silently breaks if
   the directory tree shifts. Replaced with new URL('../v1/...',
   import.meta.url) so a fixture move surfaces as a clear ENOENT
   pointing at this exact import. readFileSync accepts URL objects
   directly so no path conversion is needed at the call site;
   FIXTURE_PATH stays exported in string form for callers that
   still need it.

2. Surface shipPayload in the shipped ConformanceOutcome. The
   parser already hands the artifact bytes back via onArtifact, but
   the previous shipped variant dropped them and the body silenced
   the lint with void shipPayload. Added an artifact field
   (ShipArtifactPayload | null) and removed the lint trick. A new
   assertion in critique-conformance.test.ts pins MIME + non-empty
   body on the synthetic-good shipped path.

16/16 daemon tests still green. Daemon typecheck clean.

* fix(web): restore wait-for-daemon-ack pattern on Theater interrupt

Same regression as flagged on PR #1316 post-main-merge: the
optimistic local dispatch fired before the POST resolved, so a
daemon 404 / 409 still terminalized the UI and the real SSE
terminal event got ignored by the sticky interrupted phase.

Snapshot runId / bestRound / composite at click time, dispatch
interrupted only on res.ok, clear interruptPending on rejection or
non-2xx so the user can retry. Tests cover rejection + 404 leaving
the run on the live stage; the 204 path waits for the ack.

* fix(test): add projectKind prop to FileViewer deck render after v0.7.0 merge

* fix(daemon): parser_warning wins over no_ship in conformance harness (PerishCode P3 on PR #1317)

---------

Co-authored-by: Nagendhra <nagendhra405@gmail.com>
2026-05-13 21:23:18 +08:00

466 lines
22 KiB
TypeScript

/**
* End-to-end coverage for the adapter conformance harness
* (Phase 10, Task 10.1).
*
* Drives the same `parseCritiqueStream` the production orchestrator
* uses, but with the synthetic adapter fixtures so the assertion is
* about the harness's classification logic (shipped / degraded /
* failed) rather than the parser's correctness (already covered by
* the v1 parser tests).
*/
import { afterEach, beforeEach, describe, expect, it } from 'vitest';
import { PARSER_WARNING_KINDS } from '@open-design/contracts/critique';
import { runAdapterConformance } from '../src/critique/conformance.js';
import {
syntheticGoodStream,
} from '../src/critique/__fixtures__/adapters/synthetic-good.js';
import {
syntheticBadStream,
} from '../src/critique/__fixtures__/adapters/synthetic-bad.js';
import {
__resetDegradedRegistryForTests,
__setDegradedClockForTests,
isDegraded,
} from '../src/critique/adapter-degraded.js';
let now = 1_000_000;
beforeEach(() => {
now = 1_000_000;
__setDegradedClockForTests({ now: () => now });
});
afterEach(() => {
__setDegradedClockForTests(null);
__resetDegradedRegistryForTests();
});
describe('adapter conformance harness (Phase 10)', () => {
it('synthetic-good emits shipped and leaves the adapter undegraded', async () => {
const outcome = await runAdapterConformance({
adapterId: 'synthetic-good',
runId: 'run-good-1',
source: syntheticGoodStream(),
});
expect(outcome.kind).toBe('shipped');
if (outcome.kind !== 'shipped') return;
expect(outcome.round).toBeGreaterThan(0);
expect(outcome.composite).toBeGreaterThan(0);
// The harness must NOT mark the adapter degraded on success.
expect(isDegraded('synthetic-good')).toBe(false);
// Every panel event for the run should land in the events array
// for downstream inspection.
expect(outcome.events.length).toBeGreaterThan(0);
expect(outcome.events.find((e) => e.type === 'ship')).toBeTruthy();
// The shipped outcome must surface the artifact bytes the parser
// handed back via onArtifact, so a nightly cycle can pin MIME /
// byte-length / hash without re-parsing the transcript (lefarcen
// P2 on PR #1317).
expect(outcome.artifact).not.toBeNull();
expect(outcome.artifact?.mime).toMatch(/^text\/(html|markdown)/);
expect(outcome.artifact?.body.length).toBeGreaterThan(0);
});
it('synthetic-bad emits degraded with the parser-derived reason and marks the adapter', async () => {
const outcome = await runAdapterConformance({
adapterId: 'synthetic-bad',
runId: 'run-bad-1',
source: syntheticBadStream(),
});
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(['malformed_block', 'oversize_block', 'missing_artifact']).toContain(
outcome.reason,
);
expect(isDegraded('synthetic-bad')).toBe(true);
});
it('marks the adapter degraded for the default 24h TTL after a bad run', async () => {
await runAdapterConformance({
adapterId: 'synthetic-bad-2',
runId: 'run-bad-2',
source: syntheticBadStream(),
});
expect(isDegraded('synthetic-bad-2')).toBe(true);
// Advance the clock just shy of 24h, still degraded.
now += 24 * 60 * 60 * 1000 - 1;
expect(isDegraded('synthetic-bad-2')).toBe(true);
// Cross the boundary, mark falls off.
now += 2;
expect(isDegraded('synthetic-bad-2')).toBe(false);
});
it('classifies a stream that finishes without a ship event as failed (no_ship)', async () => {
async function* truncated(): AsyncIterable<string> {
// Open the critique-run envelope, emit a single panelist tag, then
// close cleanly. The parser yields no SHIP, so the harness must
// surface `failed: no_ship` rather than spinning forever or
// returning `shipped`.
yield '<CRITIQUE_RUN version="1" runId="run-x" projectId="p" artifactId="a">\n';
yield '</CRITIQUE_RUN>\n';
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-truncated',
runId: 'run-x',
source: truncated(),
});
expect(outcome.kind).toBe('failed');
if (outcome.kind !== 'failed') return;
expect(outcome.cause).toBe('no_ship');
});
it('threads the projectId / artifactId / runId through to the parser SHIP event', async () => {
const outcome = await runAdapterConformance({
adapterId: 'synthetic-good',
runId: 'custom-run-id',
source: syntheticGoodStream(),
projectId: 'proj-conformance',
artifactId: 'artifact-conformance',
});
if (outcome.kind !== 'shipped') {
throw new Error('expected shipped outcome');
}
const ship = outcome.events.find((e) => e.type === 'ship');
expect(ship?.type).toBe('ship');
if (ship?.type !== 'ship') return;
expect(ship.artifactRef.projectId).toBe('proj-conformance');
expect(ship.artifactRef.artifactId).toBe('artifact-conformance');
});
it('classifies an oversize block as degraded oversize_block (lefarcen P2)', async () => {
// The synthetic-good transcript is fine under the default 256 KB
// block budget. Replay it through the harness with a tiny budget so
// the parser throws OversizeBlockError on the first ARTIFACT body
// and the harness has to surface `degraded: oversize_block`.
const outcome = await runAdapterConformance({
adapterId: 'synthetic-oversize',
runId: 'run-oversize',
source: syntheticGoodStream(),
parserMaxBlockBytes: 256,
});
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(outcome.reason).toBe('oversize_block');
expect(isDegraded('synthetic-oversize')).toBe(true);
});
it('classifies an adapter that throws mid-stream as failed unexpected_error (lefarcen P2)', async () => {
class AdapterBoom extends Error {
constructor() {
super('adapter blew up');
this.name = 'AdapterBoom';
}
}
async function* throwing(): AsyncIterable<string> {
yield '<CRITIQUE_RUN version="1" runId="run-boom" projectId="p" artifactId="a">\n';
yield '<ROUND n="1">\n';
throw new AdapterBoom();
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-throwing',
runId: 'run-boom',
source: throwing(),
});
expect(outcome.kind).toBe('failed');
if (outcome.kind !== 'failed') return;
expect(outcome.cause).toBe('unexpected_error');
expect(outcome.error).toContain('adapter blew up');
// The adapter is NOT marked degraded here: an unexpected throw is a
// failure to evaluate, not evidence of a malformed stream. A real
// policy could choose to mark it after N consecutive throws; the
// harness leaves that decision to the caller.
expect(isDegraded('synthetic-throwing')).toBe(false);
});
it('classifies a clean SHIP that arrived alongside parser warnings as degraded parser_warning (lefarcen P2)', async () => {
// A panelist score outside [0, scale] makes the parser yield a
// `parser_warning` with kind=`score_clamped` BEFORE the panelist
// closes. The harness must promote the run to degraded even though
// a syntactically valid SHIP arrives later.
async function* withClampedScore(): AsyncIterable<string> {
yield '<CRITIQUE_RUN version="1" maxRounds="1" threshold="0.1" scale="10">\n';
yield ' <ROUND n="1">\n';
// designer must include an ARTIFACT in round 1 (parser invariant).
yield ' <PANELIST role="designer">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>x</p>]]></ARTIFACT>\n';
yield ' </PANELIST>\n';
// Out-of-range score on `critic` triggers score_clamped warning.
yield ' <PANELIST role="critic" score="99"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="brand" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="a11y" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="copy" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <ROUND_END n="1" composite="6.0" must_fix="0" decision="ship">\n';
yield ' <REASON>ok</REASON>\n';
yield ' </ROUND_END>\n';
yield ' </ROUND>\n';
yield ' <SHIP round="1" composite="6.0" status="shipped">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>final</p>]]></ARTIFACT>\n';
yield ' <SUMMARY>ok</SUMMARY>\n';
yield ' </SHIP>\n';
yield '</CRITIQUE_RUN>\n';
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-warned',
runId: 'run-warned',
source: withClampedScore(),
});
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(outcome.reason).toBe('parser_warning');
expect(outcome.events.some((e) => e.type === 'parser_warning')).toBe(true);
expect(isDegraded('synthetic-warned')).toBe(true);
});
it('classifies a SHIP that arrived before every panelist closed as degraded incomplete_panel (codex P2)', async () => {
// run_started declares the full 5-role cast, but only `designer`
// and `critic` ever emit panelist_close. The parser does not reject
// this on its own; the harness is the gate that catches it.
async function* incomplete(): AsyncIterable<string> {
yield '<CRITIQUE_RUN version="1" maxRounds="1" threshold="0.1" scale="10">\n';
yield ' <ROUND n="1">\n';
yield ' <PANELIST role="designer">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>x</p>]]></ARTIFACT>\n';
yield ' </PANELIST>\n';
yield ' <PANELIST role="critic" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <ROUND_END n="1" composite="6.0" must_fix="0" decision="ship">\n';
yield ' <REASON>ok</REASON>\n';
yield ' </ROUND_END>\n';
yield ' </ROUND>\n';
yield ' <SHIP round="1" composite="6.0" status="shipped">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>final</p>]]></ARTIFACT>\n';
yield ' <SUMMARY>ok</SUMMARY>\n';
yield ' </SHIP>\n';
yield '</CRITIQUE_RUN>\n';
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-incomplete',
runId: 'run-incomplete',
source: incomplete(),
});
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(outcome.reason).toBe('incomplete_panel');
expect(isDegraded('synthetic-incomplete')).toBe(true);
});
it('classifies a duplicate-SHIP stream as degraded parser_warning even though ship arrives first (lefarcen P2 follow-up)', async () => {
// Two `<SHIP>` blocks in the same transcript. The parser emits a
// SHIP event for the first and a `parser_warning` of kind
// `duplicate_ship` for the second; the warning arrives AFTER the
// ship. The harness must drain the rest of the stream and
// classify as degraded rather than returning on the first ship.
async function* duplicateShip(): AsyncIterable<string> {
yield '<CRITIQUE_RUN version="1" maxRounds="1" threshold="0.1" scale="10">\n';
yield ' <ROUND n="1">\n';
yield ' <PANELIST role="designer">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>x</p>]]></ARTIFACT>\n';
yield ' </PANELIST>\n';
yield ' <PANELIST role="critic" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="brand" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="a11y" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="copy" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <ROUND_END n="1" composite="6.0" must_fix="0" decision="ship">\n';
yield ' <REASON>ok</REASON>\n';
yield ' </ROUND_END>\n';
yield ' </ROUND>\n';
yield ' <SHIP round="1" composite="6.0" status="shipped">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>first</p>]]></ARTIFACT>\n';
yield ' <SUMMARY>first</SUMMARY>\n';
yield ' </SHIP>\n';
// Second SHIP block triggers the parser_warning (duplicate_ship).
yield ' <SHIP round="1" composite="6.0" status="shipped">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>second</p>]]></ARTIFACT>\n';
yield ' <SUMMARY>second</SUMMARY>\n';
yield ' </SHIP>\n';
yield '</CRITIQUE_RUN>\n';
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-duplicate-ship',
runId: 'run-dup',
source: duplicateShip(),
});
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(outcome.reason).toBe('parser_warning');
// The events array must hold both the first ship AND the
// duplicate_ship warning so a debugger can see what happened.
expect(outcome.events.filter((e) => e.type === 'ship')).toHaveLength(1);
expect(
outcome.events.some(
(e) => e.type === 'parser_warning' && e.kind === 'duplicate_ship',
),
).toBe(true);
expect(isDegraded('synthetic-duplicate-ship')).toBe(true);
});
it('classifies a SHIP whose round did not close every cast role as incomplete_panel even if earlier rounds closed everyone (lefarcen P2 follow-up)', async () => {
// Round 1 closes all five cast roles cleanly. Round 2 closes only
// designer + critic before <SHIP round="2"> arrives. A cumulative
// (non-per-round) tracker would happily say "all five closed
// somewhere, ship is fine"; the corrected per-round tracker
// looks only at the shipping round's panelist_close set and
// flags incomplete_panel because brand / a11y / copy never
// closed in round 2.
async function* incompleteShippingRound(): AsyncIterable<string> {
yield '<CRITIQUE_RUN version="1" maxRounds="2" threshold="0.1" scale="10">\n';
// Round 1 — all five close.
yield ' <ROUND n="1">\n';
yield ' <PANELIST role="designer">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>v1</p>]]></ARTIFACT>\n';
yield ' </PANELIST>\n';
yield ' <PANELIST role="critic" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="brand" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="a11y" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="copy" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <ROUND_END n="1" composite="6.0" must_fix="3" decision="continue">\n';
yield ' <REASON>more work</REASON>\n';
yield ' </ROUND_END>\n';
yield ' </ROUND>\n';
// Round 2 — only designer + critic close (the cumulative bug
// would let this slide; the fix catches it).
yield ' <ROUND n="2">\n';
yield ' <PANELIST role="designer">\n';
yield ' <NOTES>iterating</NOTES>\n';
yield ' </PANELIST>\n';
yield ' <PANELIST role="critic" score="7"><DIM name="x" score="7">n</DIM></PANELIST>\n';
yield ' <ROUND_END n="2" composite="7.0" must_fix="0" decision="ship">\n';
yield ' <REASON>ok</REASON>\n';
yield ' </ROUND_END>\n';
yield ' </ROUND>\n';
yield ' <SHIP round="2" composite="7.0" status="shipped">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>final</p>]]></ARTIFACT>\n';
yield ' <SUMMARY>ok</SUMMARY>\n';
yield ' </SHIP>\n';
yield '</CRITIQUE_RUN>\n';
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-incomplete-round-2',
runId: 'run-r2',
source: incompleteShippingRound(),
});
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(outcome.reason).toBe('incomplete_panel');
expect(isDegraded('synthetic-incomplete-round-2')).toBe(true);
});
it('classifies a parser_warning followed by EOF without SHIP as degraded parser_warning, not failed no_ship (PerishCode P3 on PR #1317)', async () => {
// The bug the priority-order fix in conformance.ts addresses: a
// stream that emits a `parser_warning` (out-of-range score) and
// then dies before a `SHIP` arrives (adapter crash, network
// drop, run-out-of-rounds) used to fall through to
// `failed:no_ship` because the `parserWarningSeen` check sat
// inside the post-no_ship branch. Rule 3 in the conformance
// docstring says parser_warning wins over no_ship; this test
// pins the docstring's "top-to-bottom priority" promise for the
// no-ship path so a future refactor cannot silently flip it.
async function* warnedThenEof(): AsyncIterable<string> {
// Well-formed stream that emits a score_clamped warning and
// ends with a `continue` decision on the last allowed round,
// so no SHIP block arrives but the parser does not flag
// malformed_block either. This is the exact shape the priority
// fix in conformance.ts is built to catch: rule 3 (warning) must
// win over rule 6 (no_ship).
yield '<CRITIQUE_RUN version="1" maxRounds="1" threshold="0.1" scale="10">\n';
yield ' <ROUND n="1">\n';
yield ' <PANELIST role="designer">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>x</p>]]></ARTIFACT>\n';
yield ' </PANELIST>\n';
// Out-of-range score triggers score_clamped warning.
yield ' <PANELIST role="critic" score="99"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="brand" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="a11y" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="copy" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
// decision="continue" with no SHIP block on a maxRounds=1 run.
yield ' <ROUND_END n="1" composite="6.0" must_fix="1" decision="continue">\n';
yield ' <REASON>more work needed but ran out of rounds</REASON>\n';
yield ' </ROUND_END>\n';
yield ' </ROUND>\n';
yield '</CRITIQUE_RUN>\n';
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-warned-then-died',
runId: 'run-warned-eof',
source: warnedThenEof(),
});
// Rule 3 (parser_warning) wins over rule 6 (no_ship); the adapter
// is marked degraded for 24h, not silently dropped as failed.
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(outcome.reason).toBe('parser_warning');
expect(outcome.events.some((e) => e.type === 'parser_warning')).toBe(true);
expect(isDegraded('synthetic-warned-then-died')).toBe(true);
});
// PerishCode P3 follow-up on PR #1317: the score_clamped case above
// exercises one of the five ParserWarningKind values. Rule 3 fires on
// ANY parser_warning kind, so this matrix drives the conformance gate
// off PARSER_WARNING_KINDS directly. Adding a sixth kind to the
// contracts export auto-grows the matrix without a harness-test edit.
// Kinds reachable in a single-fixture generator are covered here;
// kinds that need a multi-round or cross-panelist setup are marked
// `it.todo` so the gap is documented rather than silently uncovered.
describe('parser_warning matrix across PARSER_WARNING_KINDS (PerishCode P3 on PR #1317)', () => {
it('all kinds documented match the contracts enum', () => {
// Bare guard: if PARSER_WARNING_KINDS changes shape without the
// matrix being updated, this test points at the missing fixtures
// (it.todo lines below) before the next reviewer has to ask.
expect([...PARSER_WARNING_KINDS]).toEqual([
'weak_debate',
'unknown_role',
'score_clamped',
'composite_mismatch',
'duplicate_ship',
]);
});
it('classifies score_clamped as degraded parser_warning', async () => {
async function* fixture(): AsyncIterable<string> {
yield '<CRITIQUE_RUN version="1" maxRounds="1" threshold="0.1" scale="10">\n';
yield ' <ROUND n="1">\n';
yield ' <PANELIST role="designer">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>x</p>]]></ARTIFACT>\n';
yield ' </PANELIST>\n';
yield ' <PANELIST role="critic" score="99"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="brand" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="a11y" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <PANELIST role="copy" score="6"><DIM name="x" score="6">n</DIM></PANELIST>\n';
yield ' <ROUND_END n="1" composite="6.0" must_fix="0" decision="ship">\n';
yield ' <REASON>ok</REASON>\n';
yield ' </ROUND_END>\n';
yield ' </ROUND>\n';
yield ' <SHIP round="1" composite="6.0" status="shipped">\n';
yield ' <ARTIFACT mime="text/html"><![CDATA[<p>final</p>]]></ARTIFACT>\n';
yield ' <SUMMARY>ok</SUMMARY>\n';
yield ' </SHIP>\n';
yield '</CRITIQUE_RUN>\n';
}
const outcome = await runAdapterConformance({
adapterId: 'synthetic-warned-score-clamped',
runId: 'run-warned-score-clamped',
source: fixture(),
});
expect(outcome.kind).toBe('degraded');
if (outcome.kind !== 'degraded') return;
expect(outcome.reason).toBe('parser_warning');
const warnings = outcome.events.filter((e) => e.type === 'parser_warning');
expect(warnings.length).toBeGreaterThan(0);
expect(warnings.some((w) => w.type === 'parser_warning' && w.kind === 'score_clamped')).toBe(true);
});
// The four kinds below need single-fixture generators that the
// parser currently emits in isolation. The score_clamped case is
// the simplest because the trigger is a literal attribute on a
// single <PANELIST>. The other four need either cross-panelist
// (weak_debate, composite_mismatch), unknown-enum (unknown_role),
// or multi-block (duplicate_ship) setups whose isolation behavior
// depends on parser invariants the harness should not duplicate.
// Marking them it.todo documents the gap explicitly so the next
// contributor finishing the matrix sees what's missing rather than
// assuming the kind is uncovered by accident.
it.todo('classifies weak_debate as degraded parser_warning');
it.todo('classifies unknown_role as degraded parser_warning');
it.todo('classifies composite_mismatch as degraded parser_warning');
it.todo('classifies duplicate_ship as degraded parser_warning');
});
});