mirror of
https://github.com/nexu-io/open-design.git
synced 2026-06-01 03:14:35 +07:00
* fix(daemon): finish live-artifact chat runs via watchdog quiet-period handoff (#1451) Live-artifact runs were staying in `Working` for the full 10-minute inactivity window even after the deliverable had been registered, and sometimes finishing as `failed` with `Agent stalled without emitting any new output for 600s`. The agent process kept its stdin/stdout alive (claude-code stream-json idle stdin, post-write reasoning that never reaches the chat) so the existing watchdog could not tell the deliverable was already in the user's hands. Wire `/api/tools/live-artifacts/create` back into the chat run via a small per-run handle registry: on the first `created` event, the run flips a local `artifactRegistered` flag and rearms the watchdog with the shorter `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS` (default 60s) instead of the 10-minute pre-artifact ceiling. When that quiet timer trips, the watchdog no longer emits a stalled-error / `failed` finish; it SIGTERMs the child and lets the existing child-exit handler do final classification — the close handler now treats a SIGTERM exit after a registered artifact as `succeeded`, matching what the user actually got (a delivered artifact, not a failed run). The handoff stays with the existing child-exit lifecycle, so tool token revocation, cancel semantics, and exit-status classification keep their current owner — addressing the PR #1543 review history where finishing the run from the tool route bypassed those guarantees. Closes #1451. * fix(daemon): gate artifact quiet-period close on daemon-initiated flag (#1451 review follow-up) Reviewer (#2585) found that the close-handler branch reclassifying SIGTERM/SIGKILL as `succeeded` only checked `artifactRegistered`, so an unrelated later termination (external `kill`, OOM, container shutdown) after a successful artifact write would silently flip the run from `failed` to `succeeded` — the exact "completed without producing anything visible" failure mode the existing close handler is trying to prevent. Track the watchdog-initiated shutdown explicitly: set `artifactQuietShutdownRequested = true` immediately before `failForInactivity()` sends SIGTERM (covering the kill-grace SIGKILL escalation under the same flag), and require that flag in the close handler's quiet-period branch. Extract the final-status decision into a pure `classifyChatRunCloseStatus` so the daemon-initiated vs external signal cases can be pinned with focused unit tests instead of asserting closure-internal state via end-to-end timing. * fix(daemon): treat OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0 as disabled (#1451 review follow-up) Reviewer (#2585 non-blocking) found that an operator override of `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0` no longer behaved as a "disable the quiet period" knob: once the artifact was registered, `activeInactivityTimeoutMs()` dropped to 0, `noteAgentActivity()` early-returned without clearing the prior timer, and the pre-artifact 10-minute timer kept running while further agent activity stopped refreshing it. Make the quiet-period switch conditional on a positive value. A 0 override now means "do not shorten after artifact registration" — the pre-artifact ceiling stays active, subsequent activity continues to reschedule it, and the existing pre-artifact stalled-error path still fires when the agent genuinely hangs. Pin the resolver as a pure `resolveActiveInactivityTimeoutMs` helper so the four quiet-vs-pre matrix cases are unit-tested directly. * fix(daemon): arm the quiet-period watchdog when pre-artifact timeout is disabled (#1451 review follow-up) Reviewer (#2585 non-blocking, round 3) found that `OD_CHAT_RUN_INACTIVITY_TIMEOUT_MS=0` paired with `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS>0` left the watchdog disarmed forever. The `noteAgentActivity()` call at run start exited early because the pre-artifact delay was 0, so `inactivityTimer` was still `null` when the artifact was registered, and the prior `if (inactivityTimer) noteAgentActivity()` guard inside `noteArtifactRegistered()` then skipped the re-arm. The newly-positive quiet-period delay never armed a timer at all — a chat run that went silent right after artifact creation would stay `running` indefinitely. Drop the guard. `noteAgentActivity()` is already the function that decides whether to schedule (it bails when the active delay is 0), so calling it unconditionally keeps the behavior coherent across the four pre/quiet combinations: both non-zero (was already fine), pre=0 + quiet>0 (now arms the quiet timer), pre>0 + quiet=0 (still falls back to the pre-artifact ceiling via the existing resolver), both zero (still no watchdog at all — operator opted out). Pure-function coverage of the ceiling decision stays in `resolveActiveInactivityTimeoutMs` — exercised across the same four combinations in the existing unit suite.
This commit is contained in:
parent
53ecd36b9d
commit
48ed23c72f
2 changed files with 531 additions and 11 deletions
|
|
@ -1334,6 +1334,13 @@ async function refreshAndPersistToken(dataDir, serverId, current) {
|
|||
|
||||
const activeChatAgentEventSinks = new Map();
|
||||
const activeProjectEventSinks = new Map();
|
||||
// Per-chat-run handles, keyed by runId. Lets non-stream side effects
|
||||
// (live-artifact create, project events) reach back into the chat
|
||||
// run's local state — currently used by the artifact quiet-period
|
||||
// shortcut (#1451) so a successful artifact registration can shorten
|
||||
// the inactivity watchdog without the chat path having to poll a
|
||||
// store.
|
||||
const activeChatRunHandles = new Map();
|
||||
|
||||
function emitChatAgentEvent(runId, payload) {
|
||||
const sink = activeChatAgentEventSinks.get(runId);
|
||||
|
|
@ -1341,6 +1348,20 @@ function emitChatAgentEvent(runId, payload) {
|
|||
return sink(payload);
|
||||
}
|
||||
|
||||
// Exported for tests covering the artifact quiet-period plumbing
|
||||
// (#1451). The chat run path is a deep closure inside startServer, so
|
||||
// pin the hook contract at the emit/handle boundary instead of
|
||||
// driving a full fake-agent e2e for every invariant.
|
||||
export const __forTestChatRunHandles = activeChatRunHandles;
|
||||
|
||||
export function __forTestEmitLiveArtifactEvent(
|
||||
grant: { runId?: string; projectId?: string },
|
||||
action: 'created' | 'updated' | 'deleted',
|
||||
artifact: { id: string; projectId?: string; title?: string; refreshStatus?: string },
|
||||
) {
|
||||
return emitLiveArtifactEvent(grant, action, artifact);
|
||||
}
|
||||
|
||||
function emitLiveArtifactEvent(grant, action, artifact) {
|
||||
if (!artifact?.id) return false;
|
||||
const payload = {
|
||||
|
|
@ -1353,6 +1374,18 @@ function emitLiveArtifactEvent(grant, action, artifact) {
|
|||
};
|
||||
let emitted = emitProjectEvent(payload.projectId, payload);
|
||||
if (grant?.runId) emitted = emitChatAgentEvent(grant.runId, payload) || emitted;
|
||||
// After the deliverable exists, switch the chat run into a shorter
|
||||
// "quiet period" watchdog: agents sometimes keep their child process
|
||||
// alive after a successful artifact write (post-write reasoning, log
|
||||
// flushes, claude-code stream-json's idle stdin) and the 10-minute
|
||||
// default leaves the UI parked on Working until the watchdog fires
|
||||
// an unrelated "stalled" error. See #1451.
|
||||
if (action === 'created' && grant?.runId) {
|
||||
const handle = activeChatRunHandles.get(grant.runId);
|
||||
if (handle?.noteArtifactRegistered) {
|
||||
try { handle.noteArtifactRegistered(); } catch {}
|
||||
}
|
||||
}
|
||||
return emitted;
|
||||
}
|
||||
|
||||
|
|
@ -3013,6 +3046,12 @@ export interface StartServerOptions {
|
|||
|
||||
const DEFAULT_CHAT_RUN_INACTIVITY_TIMEOUT_MS = 10 * 60 * 1000;
|
||||
const MAX_CHAT_RUN_INACTIVITY_TIMEOUT_MS = 24 * 60 * 60 * 1000;
|
||||
// After a successful live-artifact registration the daemon switches the
|
||||
// chat-run inactivity watchdog from the long pre-artifact ceiling
|
||||
// (DEFAULT_CHAT_RUN_INACTIVITY_TIMEOUT_MS) down to a much shorter
|
||||
// "quiet period" — the deliverable exists, so further silence almost
|
||||
// always means the agent is winding down or hanging. See #1451.
|
||||
const DEFAULT_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS = 60 * 1000;
|
||||
|
||||
function resolveChatRunInactivityTimeoutMs() {
|
||||
const raw = Number(process.env.OD_CHAT_RUN_INACTIVITY_TIMEOUT_MS);
|
||||
|
|
@ -3026,6 +3065,67 @@ function resolveChatRunInactivityTimeoutMs() {
|
|||
return Math.min(MAX_CHAT_RUN_INACTIVITY_TIMEOUT_MS, Math.max(0, Math.floor(raw)));
|
||||
}
|
||||
|
||||
// Resolve the post-artifact quiet-period window. Same clamp as the outer
|
||||
// inactivity watchdog so an oversized override doesn't get Node-downgraded
|
||||
// to a 1ms timer. Exported so tests can pin the env behavior without
|
||||
// reaching into chat-run internals.
|
||||
export function resolveChatRunArtifactQuietPeriodMs() {
|
||||
const raw = Number(process.env.OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS);
|
||||
if (!Number.isFinite(raw)) return DEFAULT_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS;
|
||||
return Math.min(MAX_CHAT_RUN_INACTIVITY_TIMEOUT_MS, Math.max(0, Math.floor(raw)));
|
||||
}
|
||||
|
||||
// Pure resolver for the chat run's *currently active* inactivity
|
||||
// ceiling. Used by both `noteAgentActivity` and `noteArtifactRegistered`
|
||||
// to pick between the pre-artifact watchdog and the shortened quiet
|
||||
// period. Extracted so the `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0`
|
||||
// "disable the quiet period" semantics can be pinned with focused unit
|
||||
// tests (#1451 review: a 0-value override must not strand the pre-artifact
|
||||
// timer or stop further reschedules — it has to fall back to the
|
||||
// pre-artifact ceiling so subsequent activity keeps refreshing the timer).
|
||||
export function resolveActiveInactivityTimeoutMs(params: {
|
||||
inactivityTimeoutMs: number;
|
||||
artifactQuietPeriodMs: number;
|
||||
artifactRegistered: boolean;
|
||||
}): number {
|
||||
if (params.artifactRegistered && params.artifactQuietPeriodMs > 0) {
|
||||
return params.artifactQuietPeriodMs;
|
||||
}
|
||||
return params.inactivityTimeoutMs;
|
||||
}
|
||||
|
||||
// Pure final-status classifier for the chat run's child-close handler.
|
||||
// Extracted so the per-branch invariants can be unit-tested without
|
||||
// driving a full child process — in particular:
|
||||
// - cancel always wins over success/failure classification.
|
||||
// - the ACP forced-shutdown override is scoped to SIGTERM + clean
|
||||
// completion only (signed-32-bit-overflow SIGKILL or non-clean ACP
|
||||
// state still report `failed`).
|
||||
// - the artifact quiet-period override is gated on a daemon-initiated
|
||||
// flag, NOT on `artifactRegistered` alone — see #1451 review:
|
||||
// an external `kill -9` after the artifact write must still report
|
||||
// `failed`, only the watchdog-initiated SIGTERM/SIGKILL escalation
|
||||
// is allowed to flip the status to `succeeded`.
|
||||
export function classifyChatRunCloseStatus(params: {
|
||||
cancelRequested: boolean;
|
||||
code: number | null;
|
||||
signal: NodeJS.Signals | string | null;
|
||||
acpCleanCompletion: boolean;
|
||||
artifactQuietShutdownRequested: boolean;
|
||||
}): 'canceled' | 'succeeded' | 'failed' {
|
||||
if (params.cancelRequested) return 'canceled';
|
||||
if (params.code === 0) return 'succeeded';
|
||||
const acpForcedShutdown =
|
||||
params.code === null && params.signal === 'SIGTERM' && params.acpCleanCompletion;
|
||||
if (acpForcedShutdown) return 'succeeded';
|
||||
const artifactQuietShutdown =
|
||||
params.artifactQuietShutdownRequested &&
|
||||
params.code === null &&
|
||||
(params.signal === 'SIGTERM' || params.signal === 'SIGKILL');
|
||||
if (artifactQuietShutdown) return 'succeeded';
|
||||
return 'failed';
|
||||
}
|
||||
|
||||
function resolveChatRunShutdownGraceMs() {
|
||||
const raw = Number(process.env.OD_CHAT_RUN_SHUTDOWN_GRACE_MS);
|
||||
if (!Number.isFinite(raw)) return 3_000;
|
||||
|
|
@ -10078,11 +10178,26 @@ export async function startServer({
|
|||
design.runs.emit(run, event, data);
|
||||
};
|
||||
const inactivityTimeoutMs = resolveChatRunInactivityTimeoutMs();
|
||||
const artifactQuietPeriodMs = resolveChatRunArtifactQuietPeriodMs();
|
||||
const inactivityKillGraceMs = 3_000;
|
||||
let inactivityTimer = null;
|
||||
let childStdoutSeen = false;
|
||||
let lastAgentEventPhase = 'spawn pending';
|
||||
let lastToolResultChars = 0;
|
||||
// Becomes true once any live-artifact create has been registered for
|
||||
// this run. Subsequent watchdog scheduling uses the shorter quiet
|
||||
// period, and a watchdog trip after this point is treated as
|
||||
// "agent finished the deliverable and went idle" rather than
|
||||
// "agent stalled with nothing to show" (issue #1451).
|
||||
let artifactRegistered = false;
|
||||
// Only daemon-initiated quiet-period termination should be treated
|
||||
// as `succeeded` in the close handler. A later unrelated SIGTERM /
|
||||
// SIGKILL (external `kill`, OOM, container shutdown) must keep its
|
||||
// existing `failed` classification even when `artifactRegistered`
|
||||
// is true — those signals don't mean the agent finished cleanly,
|
||||
// they just terminated the process. Set strictly inside
|
||||
// `failForInactivity`'s quiet-period branch.
|
||||
let artifactQuietShutdownRequested = false;
|
||||
const summarizeAgentEventForInactivity = (payload) => {
|
||||
const type = payload?.type ? String(payload.type) : 'unknown';
|
||||
if (type === 'tool_result') {
|
||||
|
|
@ -10117,13 +10232,35 @@ export async function startServer({
|
|||
};
|
||||
const failForInactivity = () => {
|
||||
if (run.cancelRequested || design.runs.isTerminal(run.status)) return;
|
||||
clearInactivityWatchdog();
|
||||
if (artifactRegistered) {
|
||||
// The deliverable already exists. The agent process is either
|
||||
// genuinely idle (claude-code's stream-json child sitting on an
|
||||
// open stdin) or wedged in post-write reasoning that never
|
||||
// emits stdout. Either way, finishing the run via the normal
|
||||
// child-exit path (status decision in child.on('close') below)
|
||||
// is safer than tearing it down with a failure banner — the
|
||||
// tool token, cancel state, and exit-code classification stay
|
||||
// owned by the existing lifecycle. SIGTERM the child and let
|
||||
// the close handler classify the run as succeeded (via the
|
||||
// artifactQuietShutdown branch). Mark this termination as
|
||||
// daemon-initiated so an unrelated later signal (external
|
||||
// kill, OOM) is NOT silently reclassified to `succeeded` —
|
||||
// only signals from this watchdog branch should be.
|
||||
artifactQuietShutdownRequested = true;
|
||||
if (acpSession?.abort) {
|
||||
acpSession.abort();
|
||||
}
|
||||
if (child && !child.killed) child.kill('SIGTERM');
|
||||
scheduleForcedChildShutdown();
|
||||
return;
|
||||
}
|
||||
const message =
|
||||
`Agent stalled without emitting any new output for ${Math.round(inactivityTimeoutMs / 1000)}s. ` +
|
||||
'The model or CLI likely hung while generating. ' +
|
||||
`Phase details: spawned agent binary ${resolvedBin}; stdout arrived: ${childStdoutSeen ? 'yes' : 'no'}; ` +
|
||||
`last agent event: ${lastAgentEventPhase}; largest tool result observed: ${lastToolResultChars} chars. ` +
|
||||
'Retry the turn, pick a different model, or start a new conversation if the prior context is very large.';
|
||||
clearInactivityWatchdog();
|
||||
send('error', createSseErrorPayload('AGENT_EXECUTION_FAILED', message, { retryable: true }));
|
||||
design.runs.finish(run, 'failed', 1, null);
|
||||
if (acpSession?.abort) {
|
||||
|
|
@ -10132,14 +10269,41 @@ export async function startServer({
|
|||
if (child && !child.killed) child.kill('SIGTERM');
|
||||
scheduleForcedChildShutdown();
|
||||
};
|
||||
const activeInactivityTimeoutMs = () =>
|
||||
resolveActiveInactivityTimeoutMs({
|
||||
inactivityTimeoutMs,
|
||||
artifactQuietPeriodMs,
|
||||
artifactRegistered,
|
||||
});
|
||||
const noteAgentActivity = () => {
|
||||
if (inactivityTimeoutMs <= 0) return;
|
||||
const delay = activeInactivityTimeoutMs();
|
||||
if (delay <= 0) return;
|
||||
clearInactivityWatchdog();
|
||||
inactivityTimer = setTimeout(failForInactivity, inactivityTimeoutMs);
|
||||
inactivityTimer = setTimeout(failForInactivity, delay);
|
||||
inactivityTimer.unref?.();
|
||||
};
|
||||
const noteArtifactRegistered = () => {
|
||||
if (artifactRegistered) return;
|
||||
artifactRegistered = true;
|
||||
// Switch the watchdog to the shorter quiet-period window
|
||||
// immediately so we don't have to wait for the next agent event
|
||||
// before the new ceiling takes effect. Call unconditionally:
|
||||
// an earlier `if (inactivityTimer)` gate left the run in limbo
|
||||
// when `OD_CHAT_RUN_INACTIVITY_TIMEOUT_MS=0` but
|
||||
// `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS>0` — noteAgentActivity()
|
||||
// had returned early at run start (pre-artifact delay = 0,
|
||||
// no timer set), so the guard then skipped the re-arm and the
|
||||
// newly-positive quiet-period delay never armed a timer at all.
|
||||
// `noteAgentActivity` itself is the one that decides whether to
|
||||
// schedule (it bails when the active delay is 0), so leaving the
|
||||
// decision there keeps the behavior coherent across all four
|
||||
// combinations of pre / quiet timeouts.
|
||||
noteAgentActivity();
|
||||
};
|
||||
const unregisterChatAgentEventSink = () => {
|
||||
activeChatAgentEventSinks.delete(toolTokenGrant?.runId ?? runId);
|
||||
const sinkRunId = toolTokenGrant?.runId ?? runId;
|
||||
activeChatAgentEventSinks.delete(sinkRunId);
|
||||
activeChatRunHandles.delete(sinkRunId);
|
||||
};
|
||||
if (toolTokenGrant?.runId) {
|
||||
activeChatAgentEventSinks.set(toolTokenGrant.runId, (payload) => {
|
||||
|
|
@ -10147,6 +10311,7 @@ export async function startServer({
|
|||
noteAgentActivity();
|
||||
send('agent', payload);
|
||||
});
|
||||
activeChatRunHandles.set(toolTokenGrant.runId, { noteArtifactRegistered });
|
||||
}
|
||||
// If detection can't find the binary, surface a friendly SSE error
|
||||
// pointing at /api/agents instead of silently falling back to
|
||||
|
|
@ -10778,13 +10943,13 @@ export async function startServer({
|
|||
const acpCleanCompletion =
|
||||
typeof acpSession?.completedSuccessfully === 'function' &&
|
||||
acpSession.completedSuccessfully();
|
||||
const acpForcedShutdown =
|
||||
code === null && signal === 'SIGTERM' && acpCleanCompletion;
|
||||
const status = run.cancelRequested
|
||||
? 'canceled'
|
||||
: code === 0 || acpForcedShutdown
|
||||
? 'succeeded'
|
||||
: 'failed';
|
||||
const status = classifyChatRunCloseStatus({
|
||||
cancelRequested: !!run.cancelRequested,
|
||||
code,
|
||||
signal,
|
||||
acpCleanCompletion,
|
||||
artifactQuietShutdownRequested,
|
||||
});
|
||||
if (status === 'failed') {
|
||||
const diagnostic = diagnoseClaudeCliFailure({
|
||||
agentId: def.id,
|
||||
|
|
|
|||
355
apps/daemon/tests/chat-run-artifact-quiet-period.test.ts
Normal file
355
apps/daemon/tests/chat-run-artifact-quiet-period.test.ts
Normal file
|
|
@ -0,0 +1,355 @@
|
|||
/**
|
||||
* Artifact quiet-period plumbing (#1451).
|
||||
*
|
||||
* Live-artifact registration now feeds back into the chat-run
|
||||
* inactivity watchdog: once the deliverable exists, the daemon
|
||||
* switches to a shorter "quiet period" timeout instead of the
|
||||
* 10-minute pre-artifact ceiling, and a watchdog trip after the
|
||||
* artifact is in place is treated as "agent finished and went idle"
|
||||
* rather than "agent stalled with nothing to show".
|
||||
*
|
||||
* The full watchdog state machine sits inside a deep closure in
|
||||
* `startServer`, so these tests pin the emit/handle plumbing at the
|
||||
* boundary (via the `__forTest*` exports) and pin the env resolver
|
||||
* directly. The integration story — fake agent + live-artifact
|
||||
* create over HTTP + run-status check — would require setting up a
|
||||
* project, minting a tool token, and reproducing the chat-run path;
|
||||
* that is covered by manual verification in the PR body.
|
||||
*/
|
||||
|
||||
import { afterEach, describe, expect, it, vi } from 'vitest';
|
||||
import {
|
||||
__forTestChatRunHandles,
|
||||
__forTestEmitLiveArtifactEvent,
|
||||
classifyChatRunCloseStatus,
|
||||
resolveActiveInactivityTimeoutMs,
|
||||
resolveChatRunArtifactQuietPeriodMs,
|
||||
} from '../src/server.js';
|
||||
|
||||
const ENV_KEY = 'OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS';
|
||||
const ONE_MINUTE_MS = 60 * 1000;
|
||||
const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000;
|
||||
|
||||
describe('resolveChatRunArtifactQuietPeriodMs', () => {
|
||||
const originalEnv = process.env[ENV_KEY];
|
||||
|
||||
afterEach(() => {
|
||||
if (originalEnv === undefined) {
|
||||
delete process.env[ENV_KEY];
|
||||
} else {
|
||||
process.env[ENV_KEY] = originalEnv;
|
||||
}
|
||||
});
|
||||
|
||||
it('returns the 1-minute default when no env override is set', () => {
|
||||
delete process.env[ENV_KEY];
|
||||
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(ONE_MINUTE_MS);
|
||||
});
|
||||
|
||||
it('honors the env override when it is a finite number', () => {
|
||||
process.env[ENV_KEY] = '5000';
|
||||
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(5_000);
|
||||
});
|
||||
|
||||
it('falls back to the default when the env value is not parseable as a number', () => {
|
||||
process.env[ENV_KEY] = 'not-a-number';
|
||||
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(ONE_MINUTE_MS);
|
||||
});
|
||||
|
||||
it('clamps an oversized env override to the 24-hour ceiling (so Node does not silently downgrade the timer to 1ms)', () => {
|
||||
process.env[ENV_KEY] = String(TWENTY_FOUR_HOURS_MS * 100);
|
||||
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(TWENTY_FOUR_HOURS_MS);
|
||||
});
|
||||
|
||||
it('floors negative env overrides to 0 rather than scheduling a negative-delay timer', () => {
|
||||
process.env[ENV_KEY] = '-1000';
|
||||
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(0);
|
||||
});
|
||||
|
||||
it('honors env=0 to disable the artifact quiet-period entirely', () => {
|
||||
process.env[ENV_KEY] = '0';
|
||||
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(0);
|
||||
});
|
||||
});
|
||||
|
||||
describe('live-artifact create → chat-run handle hook (#1451)', () => {
|
||||
afterEach(() => {
|
||||
// Avoid leaking handles into other test files that touch the same
|
||||
// server-internal registry.
|
||||
__forTestChatRunHandles.clear();
|
||||
});
|
||||
|
||||
it('calls noteArtifactRegistered on the registered handle when emit fires with action="created"', () => {
|
||||
// The boundary contract: emitLiveArtifactEvent must route a
|
||||
// `created` action back into the chat run's quiet-period switch,
|
||||
// not just into the project SSE stream. Without this hook the
|
||||
// watchdog would never shorten, the user would still wait the
|
||||
// full 10 minutes, and `Working` would still get stuck after the
|
||||
// deliverable was already in the chat.
|
||||
const noteArtifactRegistered = vi.fn();
|
||||
__forTestChatRunHandles.set('run-1', { noteArtifactRegistered });
|
||||
|
||||
__forTestEmitLiveArtifactEvent(
|
||||
{ runId: 'run-1', projectId: 'project-1' },
|
||||
'created',
|
||||
{ id: 'artifact-1', projectId: 'project-1', title: 'Daily digest' },
|
||||
);
|
||||
|
||||
expect(noteArtifactRegistered).toHaveBeenCalledTimes(1);
|
||||
});
|
||||
|
||||
it('does not call noteArtifactRegistered on "updated" — only the first registration shortens the watchdog', () => {
|
||||
// An updated artifact is the same deliverable being rewritten,
|
||||
// not a fresh handoff. The watchdog already switched the first
|
||||
// time `created` fired (if it ever did); the chat run's
|
||||
// own noteAgentActivity path handles activity-driven resets
|
||||
// separately. Re-firing the hook here would be a no-op at best
|
||||
// and a double-arming bug at worst.
|
||||
const noteArtifactRegistered = vi.fn();
|
||||
__forTestChatRunHandles.set('run-1', { noteArtifactRegistered });
|
||||
|
||||
__forTestEmitLiveArtifactEvent(
|
||||
{ runId: 'run-1', projectId: 'project-1' },
|
||||
'updated',
|
||||
{ id: 'artifact-1', projectId: 'project-1' },
|
||||
);
|
||||
__forTestEmitLiveArtifactEvent(
|
||||
{ runId: 'run-1', projectId: 'project-1' },
|
||||
'deleted',
|
||||
{ id: 'artifact-1', projectId: 'project-1' },
|
||||
);
|
||||
|
||||
expect(noteArtifactRegistered).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('does not throw when no chat-run handle is registered for the runId (live-artifact emit from /api/projects path with no chat run)', () => {
|
||||
expect(() =>
|
||||
__forTestEmitLiveArtifactEvent(
|
||||
{ runId: 'no-such-run', projectId: 'project-1' },
|
||||
'created',
|
||||
{ id: 'artifact-1', projectId: 'project-1' },
|
||||
),
|
||||
).not.toThrow();
|
||||
});
|
||||
|
||||
it('does not throw when emit fires without a runId on the grant (project-scoped live-artifact emit)', () => {
|
||||
// The same `emitLiveArtifactEvent` is used by background refresh
|
||||
// routes where there is no owning chat run; passing a grant
|
||||
// without a runId must be a no-op for the handle path.
|
||||
const noteArtifactRegistered = vi.fn();
|
||||
__forTestChatRunHandles.set('run-1', { noteArtifactRegistered });
|
||||
|
||||
expect(() =>
|
||||
__forTestEmitLiveArtifactEvent(
|
||||
{ projectId: 'project-1' },
|
||||
'created',
|
||||
{ id: 'artifact-1', projectId: 'project-1' },
|
||||
),
|
||||
).not.toThrow();
|
||||
expect(noteArtifactRegistered).not.toHaveBeenCalled();
|
||||
});
|
||||
|
||||
it('does not propagate exceptions thrown from the handle hook (the artifact emit must not fail because of a broken consumer)', () => {
|
||||
// The chat run's noteArtifactRegistered touches local timer
|
||||
// state; a future refactor could throw if e.g. the timer was
|
||||
// already cleared mid-shutdown. The artifact-create endpoint
|
||||
// already wrote the deliverable, so the HTTP response must not
|
||||
// 500 because of a downstream hook failure.
|
||||
__forTestChatRunHandles.set('run-1', {
|
||||
noteArtifactRegistered: () => {
|
||||
throw new Error('boom');
|
||||
},
|
||||
});
|
||||
expect(() =>
|
||||
__forTestEmitLiveArtifactEvent(
|
||||
{ runId: 'run-1', projectId: 'project-1' },
|
||||
'created',
|
||||
{ id: 'artifact-1', projectId: 'project-1' },
|
||||
),
|
||||
).not.toThrow();
|
||||
});
|
||||
});
|
||||
|
||||
describe('resolveActiveInactivityTimeoutMs (#1451 quiet-period switch)', () => {
|
||||
const TEN_MIN = 10 * 60 * 1000;
|
||||
const ONE_MIN = 60 * 1000;
|
||||
|
||||
it('returns the pre-artifact ceiling when no artifact has been registered yet', () => {
|
||||
expect(
|
||||
resolveActiveInactivityTimeoutMs({
|
||||
inactivityTimeoutMs: TEN_MIN,
|
||||
artifactQuietPeriodMs: ONE_MIN,
|
||||
artifactRegistered: false,
|
||||
}),
|
||||
).toBe(TEN_MIN);
|
||||
});
|
||||
|
||||
it('switches to the quiet ceiling once an artifact has been registered', () => {
|
||||
expect(
|
||||
resolveActiveInactivityTimeoutMs({
|
||||
inactivityTimeoutMs: TEN_MIN,
|
||||
artifactQuietPeriodMs: ONE_MIN,
|
||||
artifactRegistered: true,
|
||||
}),
|
||||
).toBe(ONE_MIN);
|
||||
});
|
||||
|
||||
it('treats artifactQuietPeriodMs=0 as "disable the quiet period" — keeps the pre-artifact ceiling after registration', () => {
|
||||
// The bug from the #2585 review: when an operator sets
|
||||
// OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0, the prior implementation
|
||||
// dropped the active ceiling to 0 once the artifact was registered,
|
||||
// which made noteAgentActivity() early-return without rescheduling,
|
||||
// stranding the pre-artifact timer. Falling back to the pre-artifact
|
||||
// ceiling instead means subsequent activity keeps the timer fresh
|
||||
// and the existing pre-artifact stalled-error path still works.
|
||||
expect(
|
||||
resolveActiveInactivityTimeoutMs({
|
||||
inactivityTimeoutMs: TEN_MIN,
|
||||
artifactQuietPeriodMs: 0,
|
||||
artifactRegistered: true,
|
||||
}),
|
||||
).toBe(TEN_MIN);
|
||||
});
|
||||
|
||||
it('keeps a 0 pre-artifact ceiling at 0 when no artifact is registered (watchdog fully disabled)', () => {
|
||||
expect(
|
||||
resolveActiveInactivityTimeoutMs({
|
||||
inactivityTimeoutMs: 0,
|
||||
artifactQuietPeriodMs: ONE_MIN,
|
||||
artifactRegistered: false,
|
||||
}),
|
||||
).toBe(0);
|
||||
});
|
||||
|
||||
it('honors a 0 pre-artifact ceiling after artifact registration when quiet is also 0 (both disabled)', () => {
|
||||
expect(
|
||||
resolveActiveInactivityTimeoutMs({
|
||||
inactivityTimeoutMs: 0,
|
||||
artifactQuietPeriodMs: 0,
|
||||
artifactRegistered: true,
|
||||
}),
|
||||
).toBe(0);
|
||||
});
|
||||
});
|
||||
|
||||
describe('classifyChatRunCloseStatus (#1451 close-handler classification)', () => {
|
||||
const base = {
|
||||
cancelRequested: false,
|
||||
code: 0 as number | null,
|
||||
signal: null as string | null,
|
||||
acpCleanCompletion: false,
|
||||
artifactQuietShutdownRequested: false,
|
||||
};
|
||||
|
||||
it('returns canceled when cancelRequested wins regardless of other signals', () => {
|
||||
expect(
|
||||
classifyChatRunCloseStatus({ ...base, cancelRequested: true, code: 0 }),
|
||||
).toBe('canceled');
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
cancelRequested: true,
|
||||
code: null,
|
||||
signal: 'SIGTERM',
|
||||
artifactQuietShutdownRequested: true,
|
||||
}),
|
||||
).toBe('canceled');
|
||||
});
|
||||
|
||||
it('returns succeeded on clean exit code 0', () => {
|
||||
expect(classifyChatRunCloseStatus({ ...base, code: 0 })).toBe('succeeded');
|
||||
});
|
||||
|
||||
it('returns succeeded on ACP forced shutdown (SIGTERM + clean ACP completion)', () => {
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
code: null,
|
||||
signal: 'SIGTERM',
|
||||
acpCleanCompletion: true,
|
||||
}),
|
||||
).toBe('succeeded');
|
||||
});
|
||||
|
||||
it('returns failed when ACP shutdown was via SIGKILL (not the narrow override)', () => {
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
code: null,
|
||||
signal: 'SIGKILL',
|
||||
acpCleanCompletion: true,
|
||||
}),
|
||||
).toBe('failed');
|
||||
});
|
||||
|
||||
it('returns succeeded when the watchdog-initiated quiet-period SIGTERM fires', () => {
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
code: null,
|
||||
signal: 'SIGTERM',
|
||||
artifactQuietShutdownRequested: true,
|
||||
}),
|
||||
).toBe('succeeded');
|
||||
});
|
||||
|
||||
it('returns succeeded when the watchdog quiet-period escalates to SIGKILL (kill-grace timer)', () => {
|
||||
// After SIGTERM the inactivityKillGraceMs timer escalates to
|
||||
// SIGKILL if the child has not exited yet. Both signals belong to
|
||||
// the same daemon-initiated shutdown, so the close handler must
|
||||
// accept either when the flag is set.
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
code: null,
|
||||
signal: 'SIGKILL',
|
||||
artifactQuietShutdownRequested: true,
|
||||
}),
|
||||
).toBe('succeeded');
|
||||
});
|
||||
|
||||
it('returns failed when SIGTERM/SIGKILL arrive but no quiet-period shutdown was requested', () => {
|
||||
// The reviewer-correctness fix: external `kill`, OOM, container
|
||||
// shutdown after a successful artifact registration must NOT be
|
||||
// silently reclassified as `succeeded`. The previous version only
|
||||
// checked `artifactRegistered` (true here, implied via the flag
|
||||
// being false because we never called failForInactivity), which
|
||||
// would have flipped these to succeeded incorrectly.
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
code: null,
|
||||
signal: 'SIGTERM',
|
||||
artifactQuietShutdownRequested: false,
|
||||
}),
|
||||
).toBe('failed');
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
code: null,
|
||||
signal: 'SIGKILL',
|
||||
artifactQuietShutdownRequested: false,
|
||||
}),
|
||||
).toBe('failed');
|
||||
});
|
||||
|
||||
it('returns failed on a non-zero exit code regardless of the quiet-period flag', () => {
|
||||
// The quiet-period override is signal-only; a clean process exit
|
||||
// that returned non-zero is a real failure (agent CLI bug, model
|
||||
// error, etc.) and must propagate as such.
|
||||
expect(
|
||||
classifyChatRunCloseStatus({
|
||||
...base,
|
||||
code: 1,
|
||||
signal: null,
|
||||
artifactQuietShutdownRequested: true,
|
||||
}),
|
||||
).toBe('failed');
|
||||
});
|
||||
|
||||
it('returns failed on the standard failure shape (non-zero exit, no special flags)', () => {
|
||||
expect(
|
||||
classifyChatRunCloseStatus({ ...base, code: 1, signal: null }),
|
||||
).toBe('failed');
|
||||
});
|
||||
});
|
||||
Loading…
Reference in a new issue