fix(daemon): finish live-artifact chat runs via watchdog quiet-period handoff (#1451) (#2585)

* fix(daemon): finish live-artifact chat runs via watchdog quiet-period handoff (#1451)

Live-artifact runs were staying in `Working` for the full 10-minute
inactivity window even after the deliverable had been registered, and
sometimes finishing as `failed` with `Agent stalled without emitting
any new output for 600s`. The agent process kept its stdin/stdout
alive (claude-code stream-json idle stdin, post-write reasoning that
never reaches the chat) so the existing watchdog could not tell the
deliverable was already in the user's hands.

Wire `/api/tools/live-artifacts/create` back into the chat run via a
small per-run handle registry: on the first `created` event, the run
flips a local `artifactRegistered` flag and rearms the watchdog with
the shorter `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS` (default 60s)
instead of the 10-minute pre-artifact ceiling. When that quiet timer
trips, the watchdog no longer emits a stalled-error / `failed` finish;
it SIGTERMs the child and lets the existing child-exit handler do
final classification — the close handler now treats a SIGTERM exit
after a registered artifact as `succeeded`, matching what the user
actually got (a delivered artifact, not a failed run).

The handoff stays with the existing child-exit lifecycle, so tool
token revocation, cancel semantics, and exit-status classification
keep their current owner — addressing the PR #1543 review history
where finishing the run from the tool route bypassed those guarantees.

Closes #1451.

* fix(daemon): gate artifact quiet-period close on daemon-initiated flag (#1451 review follow-up)

Reviewer (#2585) found that the close-handler branch reclassifying
SIGTERM/SIGKILL as `succeeded` only checked `artifactRegistered`, so an
unrelated later termination (external `kill`, OOM, container shutdown)
after a successful artifact write would silently flip the run from
`failed` to `succeeded` — the exact "completed without producing
anything visible" failure mode the existing close handler is trying
to prevent.

Track the watchdog-initiated shutdown explicitly: set
`artifactQuietShutdownRequested = true` immediately before
`failForInactivity()` sends SIGTERM (covering the kill-grace SIGKILL
escalation under the same flag), and require that flag in the close
handler's quiet-period branch.

Extract the final-status decision into a pure
`classifyChatRunCloseStatus` so the daemon-initiated vs external
signal cases can be pinned with focused unit tests instead of
asserting closure-internal state via end-to-end timing.

* fix(daemon): treat OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0 as disabled (#1451 review follow-up)

Reviewer (#2585 non-blocking) found that an operator override of
`OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0` no longer behaved as a
"disable the quiet period" knob: once the artifact was registered,
`activeInactivityTimeoutMs()` dropped to 0, `noteAgentActivity()`
early-returned without clearing the prior timer, and the pre-artifact
10-minute timer kept running while further agent activity stopped
refreshing it.

Make the quiet-period switch conditional on a positive value. A 0
override now means "do not shorten after artifact registration" — the
pre-artifact ceiling stays active, subsequent activity continues to
reschedule it, and the existing pre-artifact stalled-error path still
fires when the agent genuinely hangs. Pin the resolver as a pure
`resolveActiveInactivityTimeoutMs` helper so the four quiet-vs-pre
matrix cases are unit-tested directly.

* fix(daemon): arm the quiet-period watchdog when pre-artifact timeout is disabled (#1451 review follow-up)

Reviewer (#2585 non-blocking, round 3) found that
`OD_CHAT_RUN_INACTIVITY_TIMEOUT_MS=0` paired with
`OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS>0` left the watchdog disarmed
forever. The `noteAgentActivity()` call at run start exited early
because the pre-artifact delay was 0, so `inactivityTimer` was still
`null` when the artifact was registered, and the prior
`if (inactivityTimer) noteAgentActivity()` guard inside
`noteArtifactRegistered()` then skipped the re-arm. The
newly-positive quiet-period delay never armed a timer at all — a
chat run that went silent right after artifact creation would stay
`running` indefinitely.

Drop the guard. `noteAgentActivity()` is already the function that
decides whether to schedule (it bails when the active delay is 0),
so calling it unconditionally keeps the behavior coherent across the
four pre/quiet combinations: both non-zero (was already fine), pre=0
+ quiet>0 (now arms the quiet timer), pre>0 + quiet=0 (still falls
back to the pre-artifact ceiling via the existing resolver), both
zero (still no watchdog at all — operator opted out).

Pure-function coverage of the ceiling decision stays in
`resolveActiveInactivityTimeoutMs` — exercised across the same four
combinations in the existing unit suite.
This commit is contained in:
YOMXXX 2026-05-22 19:06:13 +08:00 committed by GitHub
parent 53ecd36b9d
commit 48ed23c72f
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 531 additions and 11 deletions

View file

@ -1334,6 +1334,13 @@ async function refreshAndPersistToken(dataDir, serverId, current) {
const activeChatAgentEventSinks = new Map();
const activeProjectEventSinks = new Map();
// Per-chat-run handles, keyed by runId. Lets non-stream side effects
// (live-artifact create, project events) reach back into the chat
// run's local state — currently used by the artifact quiet-period
// shortcut (#1451) so a successful artifact registration can shorten
// the inactivity watchdog without the chat path having to poll a
// store.
const activeChatRunHandles = new Map();
function emitChatAgentEvent(runId, payload) {
const sink = activeChatAgentEventSinks.get(runId);
@ -1341,6 +1348,20 @@ function emitChatAgentEvent(runId, payload) {
return sink(payload);
}
// Exported for tests covering the artifact quiet-period plumbing
// (#1451). The chat run path is a deep closure inside startServer, so
// pin the hook contract at the emit/handle boundary instead of
// driving a full fake-agent e2e for every invariant.
export const __forTestChatRunHandles = activeChatRunHandles;
export function __forTestEmitLiveArtifactEvent(
grant: { runId?: string; projectId?: string },
action: 'created' | 'updated' | 'deleted',
artifact: { id: string; projectId?: string; title?: string; refreshStatus?: string },
) {
return emitLiveArtifactEvent(grant, action, artifact);
}
function emitLiveArtifactEvent(grant, action, artifact) {
if (!artifact?.id) return false;
const payload = {
@ -1353,6 +1374,18 @@ function emitLiveArtifactEvent(grant, action, artifact) {
};
let emitted = emitProjectEvent(payload.projectId, payload);
if (grant?.runId) emitted = emitChatAgentEvent(grant.runId, payload) || emitted;
// After the deliverable exists, switch the chat run into a shorter
// "quiet period" watchdog: agents sometimes keep their child process
// alive after a successful artifact write (post-write reasoning, log
// flushes, claude-code stream-json's idle stdin) and the 10-minute
// default leaves the UI parked on Working until the watchdog fires
// an unrelated "stalled" error. See #1451.
if (action === 'created' && grant?.runId) {
const handle = activeChatRunHandles.get(grant.runId);
if (handle?.noteArtifactRegistered) {
try { handle.noteArtifactRegistered(); } catch {}
}
}
return emitted;
}
@ -3013,6 +3046,12 @@ export interface StartServerOptions {
const DEFAULT_CHAT_RUN_INACTIVITY_TIMEOUT_MS = 10 * 60 * 1000;
const MAX_CHAT_RUN_INACTIVITY_TIMEOUT_MS = 24 * 60 * 60 * 1000;
// After a successful live-artifact registration the daemon switches the
// chat-run inactivity watchdog from the long pre-artifact ceiling
// (DEFAULT_CHAT_RUN_INACTIVITY_TIMEOUT_MS) down to a much shorter
// "quiet period" — the deliverable exists, so further silence almost
// always means the agent is winding down or hanging. See #1451.
const DEFAULT_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS = 60 * 1000;
function resolveChatRunInactivityTimeoutMs() {
const raw = Number(process.env.OD_CHAT_RUN_INACTIVITY_TIMEOUT_MS);
@ -3026,6 +3065,67 @@ function resolveChatRunInactivityTimeoutMs() {
return Math.min(MAX_CHAT_RUN_INACTIVITY_TIMEOUT_MS, Math.max(0, Math.floor(raw)));
}
// Resolve the post-artifact quiet-period window. Same clamp as the outer
// inactivity watchdog so an oversized override doesn't get Node-downgraded
// to a 1ms timer. Exported so tests can pin the env behavior without
// reaching into chat-run internals.
export function resolveChatRunArtifactQuietPeriodMs() {
const raw = Number(process.env.OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS);
if (!Number.isFinite(raw)) return DEFAULT_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS;
return Math.min(MAX_CHAT_RUN_INACTIVITY_TIMEOUT_MS, Math.max(0, Math.floor(raw)));
}
// Pure resolver for the chat run's *currently active* inactivity
// ceiling. Used by both `noteAgentActivity` and `noteArtifactRegistered`
// to pick between the pre-artifact watchdog and the shortened quiet
// period. Extracted so the `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0`
// "disable the quiet period" semantics can be pinned with focused unit
// tests (#1451 review: a 0-value override must not strand the pre-artifact
// timer or stop further reschedules — it has to fall back to the
// pre-artifact ceiling so subsequent activity keeps refreshing the timer).
export function resolveActiveInactivityTimeoutMs(params: {
inactivityTimeoutMs: number;
artifactQuietPeriodMs: number;
artifactRegistered: boolean;
}): number {
if (params.artifactRegistered && params.artifactQuietPeriodMs > 0) {
return params.artifactQuietPeriodMs;
}
return params.inactivityTimeoutMs;
}
// Pure final-status classifier for the chat run's child-close handler.
// Extracted so the per-branch invariants can be unit-tested without
// driving a full child process — in particular:
// - cancel always wins over success/failure classification.
// - the ACP forced-shutdown override is scoped to SIGTERM + clean
// completion only (signed-32-bit-overflow SIGKILL or non-clean ACP
// state still report `failed`).
// - the artifact quiet-period override is gated on a daemon-initiated
// flag, NOT on `artifactRegistered` alone — see #1451 review:
// an external `kill -9` after the artifact write must still report
// `failed`, only the watchdog-initiated SIGTERM/SIGKILL escalation
// is allowed to flip the status to `succeeded`.
export function classifyChatRunCloseStatus(params: {
cancelRequested: boolean;
code: number | null;
signal: NodeJS.Signals | string | null;
acpCleanCompletion: boolean;
artifactQuietShutdownRequested: boolean;
}): 'canceled' | 'succeeded' | 'failed' {
if (params.cancelRequested) return 'canceled';
if (params.code === 0) return 'succeeded';
const acpForcedShutdown =
params.code === null && params.signal === 'SIGTERM' && params.acpCleanCompletion;
if (acpForcedShutdown) return 'succeeded';
const artifactQuietShutdown =
params.artifactQuietShutdownRequested &&
params.code === null &&
(params.signal === 'SIGTERM' || params.signal === 'SIGKILL');
if (artifactQuietShutdown) return 'succeeded';
return 'failed';
}
function resolveChatRunShutdownGraceMs() {
const raw = Number(process.env.OD_CHAT_RUN_SHUTDOWN_GRACE_MS);
if (!Number.isFinite(raw)) return 3_000;
@ -10078,11 +10178,26 @@ export async function startServer({
design.runs.emit(run, event, data);
};
const inactivityTimeoutMs = resolveChatRunInactivityTimeoutMs();
const artifactQuietPeriodMs = resolveChatRunArtifactQuietPeriodMs();
const inactivityKillGraceMs = 3_000;
let inactivityTimer = null;
let childStdoutSeen = false;
let lastAgentEventPhase = 'spawn pending';
let lastToolResultChars = 0;
// Becomes true once any live-artifact create has been registered for
// this run. Subsequent watchdog scheduling uses the shorter quiet
// period, and a watchdog trip after this point is treated as
// "agent finished the deliverable and went idle" rather than
// "agent stalled with nothing to show" (issue #1451).
let artifactRegistered = false;
// Only daemon-initiated quiet-period termination should be treated
// as `succeeded` in the close handler. A later unrelated SIGTERM /
// SIGKILL (external `kill`, OOM, container shutdown) must keep its
// existing `failed` classification even when `artifactRegistered`
// is true — those signals don't mean the agent finished cleanly,
// they just terminated the process. Set strictly inside
// `failForInactivity`'s quiet-period branch.
let artifactQuietShutdownRequested = false;
const summarizeAgentEventForInactivity = (payload) => {
const type = payload?.type ? String(payload.type) : 'unknown';
if (type === 'tool_result') {
@ -10117,13 +10232,35 @@ export async function startServer({
};
const failForInactivity = () => {
if (run.cancelRequested || design.runs.isTerminal(run.status)) return;
clearInactivityWatchdog();
if (artifactRegistered) {
// The deliverable already exists. The agent process is either
// genuinely idle (claude-code's stream-json child sitting on an
// open stdin) or wedged in post-write reasoning that never
// emits stdout. Either way, finishing the run via the normal
// child-exit path (status decision in child.on('close') below)
// is safer than tearing it down with a failure banner — the
// tool token, cancel state, and exit-code classification stay
// owned by the existing lifecycle. SIGTERM the child and let
// the close handler classify the run as succeeded (via the
// artifactQuietShutdown branch). Mark this termination as
// daemon-initiated so an unrelated later signal (external
// kill, OOM) is NOT silently reclassified to `succeeded` —
// only signals from this watchdog branch should be.
artifactQuietShutdownRequested = true;
if (acpSession?.abort) {
acpSession.abort();
}
if (child && !child.killed) child.kill('SIGTERM');
scheduleForcedChildShutdown();
return;
}
const message =
`Agent stalled without emitting any new output for ${Math.round(inactivityTimeoutMs / 1000)}s. ` +
'The model or CLI likely hung while generating. ' +
`Phase details: spawned agent binary ${resolvedBin}; stdout arrived: ${childStdoutSeen ? 'yes' : 'no'}; ` +
`last agent event: ${lastAgentEventPhase}; largest tool result observed: ${lastToolResultChars} chars. ` +
'Retry the turn, pick a different model, or start a new conversation if the prior context is very large.';
clearInactivityWatchdog();
send('error', createSseErrorPayload('AGENT_EXECUTION_FAILED', message, { retryable: true }));
design.runs.finish(run, 'failed', 1, null);
if (acpSession?.abort) {
@ -10132,14 +10269,41 @@ export async function startServer({
if (child && !child.killed) child.kill('SIGTERM');
scheduleForcedChildShutdown();
};
const activeInactivityTimeoutMs = () =>
resolveActiveInactivityTimeoutMs({
inactivityTimeoutMs,
artifactQuietPeriodMs,
artifactRegistered,
});
const noteAgentActivity = () => {
if (inactivityTimeoutMs <= 0) return;
const delay = activeInactivityTimeoutMs();
if (delay <= 0) return;
clearInactivityWatchdog();
inactivityTimer = setTimeout(failForInactivity, inactivityTimeoutMs);
inactivityTimer = setTimeout(failForInactivity, delay);
inactivityTimer.unref?.();
};
const noteArtifactRegistered = () => {
if (artifactRegistered) return;
artifactRegistered = true;
// Switch the watchdog to the shorter quiet-period window
// immediately so we don't have to wait for the next agent event
// before the new ceiling takes effect. Call unconditionally:
// an earlier `if (inactivityTimer)` gate left the run in limbo
// when `OD_CHAT_RUN_INACTIVITY_TIMEOUT_MS=0` but
// `OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS>0` — noteAgentActivity()
// had returned early at run start (pre-artifact delay = 0,
// no timer set), so the guard then skipped the re-arm and the
// newly-positive quiet-period delay never armed a timer at all.
// `noteAgentActivity` itself is the one that decides whether to
// schedule (it bails when the active delay is 0), so leaving the
// decision there keeps the behavior coherent across all four
// combinations of pre / quiet timeouts.
noteAgentActivity();
};
const unregisterChatAgentEventSink = () => {
activeChatAgentEventSinks.delete(toolTokenGrant?.runId ?? runId);
const sinkRunId = toolTokenGrant?.runId ?? runId;
activeChatAgentEventSinks.delete(sinkRunId);
activeChatRunHandles.delete(sinkRunId);
};
if (toolTokenGrant?.runId) {
activeChatAgentEventSinks.set(toolTokenGrant.runId, (payload) => {
@ -10147,6 +10311,7 @@ export async function startServer({
noteAgentActivity();
send('agent', payload);
});
activeChatRunHandles.set(toolTokenGrant.runId, { noteArtifactRegistered });
}
// If detection can't find the binary, surface a friendly SSE error
// pointing at /api/agents instead of silently falling back to
@ -10778,13 +10943,13 @@ export async function startServer({
const acpCleanCompletion =
typeof acpSession?.completedSuccessfully === 'function' &&
acpSession.completedSuccessfully();
const acpForcedShutdown =
code === null && signal === 'SIGTERM' && acpCleanCompletion;
const status = run.cancelRequested
? 'canceled'
: code === 0 || acpForcedShutdown
? 'succeeded'
: 'failed';
const status = classifyChatRunCloseStatus({
cancelRequested: !!run.cancelRequested,
code,
signal,
acpCleanCompletion,
artifactQuietShutdownRequested,
});
if (status === 'failed') {
const diagnostic = diagnoseClaudeCliFailure({
agentId: def.id,

View file

@ -0,0 +1,355 @@
/**
* Artifact quiet-period plumbing (#1451).
*
* Live-artifact registration now feeds back into the chat-run
* inactivity watchdog: once the deliverable exists, the daemon
* switches to a shorter "quiet period" timeout instead of the
* 10-minute pre-artifact ceiling, and a watchdog trip after the
* artifact is in place is treated as "agent finished and went idle"
* rather than "agent stalled with nothing to show".
*
* The full watchdog state machine sits inside a deep closure in
* `startServer`, so these tests pin the emit/handle plumbing at the
* boundary (via the `__forTest*` exports) and pin the env resolver
* directly. The integration story fake agent + live-artifact
* create over HTTP + run-status check would require setting up a
* project, minting a tool token, and reproducing the chat-run path;
* that is covered by manual verification in the PR body.
*/
import { afterEach, describe, expect, it, vi } from 'vitest';
import {
__forTestChatRunHandles,
__forTestEmitLiveArtifactEvent,
classifyChatRunCloseStatus,
resolveActiveInactivityTimeoutMs,
resolveChatRunArtifactQuietPeriodMs,
} from '../src/server.js';
const ENV_KEY = 'OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS';
const ONE_MINUTE_MS = 60 * 1000;
const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000;
describe('resolveChatRunArtifactQuietPeriodMs', () => {
const originalEnv = process.env[ENV_KEY];
afterEach(() => {
if (originalEnv === undefined) {
delete process.env[ENV_KEY];
} else {
process.env[ENV_KEY] = originalEnv;
}
});
it('returns the 1-minute default when no env override is set', () => {
delete process.env[ENV_KEY];
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(ONE_MINUTE_MS);
});
it('honors the env override when it is a finite number', () => {
process.env[ENV_KEY] = '5000';
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(5_000);
});
it('falls back to the default when the env value is not parseable as a number', () => {
process.env[ENV_KEY] = 'not-a-number';
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(ONE_MINUTE_MS);
});
it('clamps an oversized env override to the 24-hour ceiling (so Node does not silently downgrade the timer to 1ms)', () => {
process.env[ENV_KEY] = String(TWENTY_FOUR_HOURS_MS * 100);
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(TWENTY_FOUR_HOURS_MS);
});
it('floors negative env overrides to 0 rather than scheduling a negative-delay timer', () => {
process.env[ENV_KEY] = '-1000';
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(0);
});
it('honors env=0 to disable the artifact quiet-period entirely', () => {
process.env[ENV_KEY] = '0';
expect(resolveChatRunArtifactQuietPeriodMs()).toBe(0);
});
});
describe('live-artifact create → chat-run handle hook (#1451)', () => {
afterEach(() => {
// Avoid leaking handles into other test files that touch the same
// server-internal registry.
__forTestChatRunHandles.clear();
});
it('calls noteArtifactRegistered on the registered handle when emit fires with action="created"', () => {
// The boundary contract: emitLiveArtifactEvent must route a
// `created` action back into the chat run's quiet-period switch,
// not just into the project SSE stream. Without this hook the
// watchdog would never shorten, the user would still wait the
// full 10 minutes, and `Working` would still get stuck after the
// deliverable was already in the chat.
const noteArtifactRegistered = vi.fn();
__forTestChatRunHandles.set('run-1', { noteArtifactRegistered });
__forTestEmitLiveArtifactEvent(
{ runId: 'run-1', projectId: 'project-1' },
'created',
{ id: 'artifact-1', projectId: 'project-1', title: 'Daily digest' },
);
expect(noteArtifactRegistered).toHaveBeenCalledTimes(1);
});
it('does not call noteArtifactRegistered on "updated" — only the first registration shortens the watchdog', () => {
// An updated artifact is the same deliverable being rewritten,
// not a fresh handoff. The watchdog already switched the first
// time `created` fired (if it ever did); the chat run's
// own noteAgentActivity path handles activity-driven resets
// separately. Re-firing the hook here would be a no-op at best
// and a double-arming bug at worst.
const noteArtifactRegistered = vi.fn();
__forTestChatRunHandles.set('run-1', { noteArtifactRegistered });
__forTestEmitLiveArtifactEvent(
{ runId: 'run-1', projectId: 'project-1' },
'updated',
{ id: 'artifact-1', projectId: 'project-1' },
);
__forTestEmitLiveArtifactEvent(
{ runId: 'run-1', projectId: 'project-1' },
'deleted',
{ id: 'artifact-1', projectId: 'project-1' },
);
expect(noteArtifactRegistered).not.toHaveBeenCalled();
});
it('does not throw when no chat-run handle is registered for the runId (live-artifact emit from /api/projects path with no chat run)', () => {
expect(() =>
__forTestEmitLiveArtifactEvent(
{ runId: 'no-such-run', projectId: 'project-1' },
'created',
{ id: 'artifact-1', projectId: 'project-1' },
),
).not.toThrow();
});
it('does not throw when emit fires without a runId on the grant (project-scoped live-artifact emit)', () => {
// The same `emitLiveArtifactEvent` is used by background refresh
// routes where there is no owning chat run; passing a grant
// without a runId must be a no-op for the handle path.
const noteArtifactRegistered = vi.fn();
__forTestChatRunHandles.set('run-1', { noteArtifactRegistered });
expect(() =>
__forTestEmitLiveArtifactEvent(
{ projectId: 'project-1' },
'created',
{ id: 'artifact-1', projectId: 'project-1' },
),
).not.toThrow();
expect(noteArtifactRegistered).not.toHaveBeenCalled();
});
it('does not propagate exceptions thrown from the handle hook (the artifact emit must not fail because of a broken consumer)', () => {
// The chat run's noteArtifactRegistered touches local timer
// state; a future refactor could throw if e.g. the timer was
// already cleared mid-shutdown. The artifact-create endpoint
// already wrote the deliverable, so the HTTP response must not
// 500 because of a downstream hook failure.
__forTestChatRunHandles.set('run-1', {
noteArtifactRegistered: () => {
throw new Error('boom');
},
});
expect(() =>
__forTestEmitLiveArtifactEvent(
{ runId: 'run-1', projectId: 'project-1' },
'created',
{ id: 'artifact-1', projectId: 'project-1' },
),
).not.toThrow();
});
});
describe('resolveActiveInactivityTimeoutMs (#1451 quiet-period switch)', () => {
const TEN_MIN = 10 * 60 * 1000;
const ONE_MIN = 60 * 1000;
it('returns the pre-artifact ceiling when no artifact has been registered yet', () => {
expect(
resolveActiveInactivityTimeoutMs({
inactivityTimeoutMs: TEN_MIN,
artifactQuietPeriodMs: ONE_MIN,
artifactRegistered: false,
}),
).toBe(TEN_MIN);
});
it('switches to the quiet ceiling once an artifact has been registered', () => {
expect(
resolveActiveInactivityTimeoutMs({
inactivityTimeoutMs: TEN_MIN,
artifactQuietPeriodMs: ONE_MIN,
artifactRegistered: true,
}),
).toBe(ONE_MIN);
});
it('treats artifactQuietPeriodMs=0 as "disable the quiet period" — keeps the pre-artifact ceiling after registration', () => {
// The bug from the #2585 review: when an operator sets
// OD_CHAT_RUN_ARTIFACT_QUIET_PERIOD_MS=0, the prior implementation
// dropped the active ceiling to 0 once the artifact was registered,
// which made noteAgentActivity() early-return without rescheduling,
// stranding the pre-artifact timer. Falling back to the pre-artifact
// ceiling instead means subsequent activity keeps the timer fresh
// and the existing pre-artifact stalled-error path still works.
expect(
resolveActiveInactivityTimeoutMs({
inactivityTimeoutMs: TEN_MIN,
artifactQuietPeriodMs: 0,
artifactRegistered: true,
}),
).toBe(TEN_MIN);
});
it('keeps a 0 pre-artifact ceiling at 0 when no artifact is registered (watchdog fully disabled)', () => {
expect(
resolveActiveInactivityTimeoutMs({
inactivityTimeoutMs: 0,
artifactQuietPeriodMs: ONE_MIN,
artifactRegistered: false,
}),
).toBe(0);
});
it('honors a 0 pre-artifact ceiling after artifact registration when quiet is also 0 (both disabled)', () => {
expect(
resolveActiveInactivityTimeoutMs({
inactivityTimeoutMs: 0,
artifactQuietPeriodMs: 0,
artifactRegistered: true,
}),
).toBe(0);
});
});
describe('classifyChatRunCloseStatus (#1451 close-handler classification)', () => {
const base = {
cancelRequested: false,
code: 0 as number | null,
signal: null as string | null,
acpCleanCompletion: false,
artifactQuietShutdownRequested: false,
};
it('returns canceled when cancelRequested wins regardless of other signals', () => {
expect(
classifyChatRunCloseStatus({ ...base, cancelRequested: true, code: 0 }),
).toBe('canceled');
expect(
classifyChatRunCloseStatus({
...base,
cancelRequested: true,
code: null,
signal: 'SIGTERM',
artifactQuietShutdownRequested: true,
}),
).toBe('canceled');
});
it('returns succeeded on clean exit code 0', () => {
expect(classifyChatRunCloseStatus({ ...base, code: 0 })).toBe('succeeded');
});
it('returns succeeded on ACP forced shutdown (SIGTERM + clean ACP completion)', () => {
expect(
classifyChatRunCloseStatus({
...base,
code: null,
signal: 'SIGTERM',
acpCleanCompletion: true,
}),
).toBe('succeeded');
});
it('returns failed when ACP shutdown was via SIGKILL (not the narrow override)', () => {
expect(
classifyChatRunCloseStatus({
...base,
code: null,
signal: 'SIGKILL',
acpCleanCompletion: true,
}),
).toBe('failed');
});
it('returns succeeded when the watchdog-initiated quiet-period SIGTERM fires', () => {
expect(
classifyChatRunCloseStatus({
...base,
code: null,
signal: 'SIGTERM',
artifactQuietShutdownRequested: true,
}),
).toBe('succeeded');
});
it('returns succeeded when the watchdog quiet-period escalates to SIGKILL (kill-grace timer)', () => {
// After SIGTERM the inactivityKillGraceMs timer escalates to
// SIGKILL if the child has not exited yet. Both signals belong to
// the same daemon-initiated shutdown, so the close handler must
// accept either when the flag is set.
expect(
classifyChatRunCloseStatus({
...base,
code: null,
signal: 'SIGKILL',
artifactQuietShutdownRequested: true,
}),
).toBe('succeeded');
});
it('returns failed when SIGTERM/SIGKILL arrive but no quiet-period shutdown was requested', () => {
// The reviewer-correctness fix: external `kill`, OOM, container
// shutdown after a successful artifact registration must NOT be
// silently reclassified as `succeeded`. The previous version only
// checked `artifactRegistered` (true here, implied via the flag
// being false because we never called failForInactivity), which
// would have flipped these to succeeded incorrectly.
expect(
classifyChatRunCloseStatus({
...base,
code: null,
signal: 'SIGTERM',
artifactQuietShutdownRequested: false,
}),
).toBe('failed');
expect(
classifyChatRunCloseStatus({
...base,
code: null,
signal: 'SIGKILL',
artifactQuietShutdownRequested: false,
}),
).toBe('failed');
});
it('returns failed on a non-zero exit code regardless of the quiet-period flag', () => {
// The quiet-period override is signal-only; a clean process exit
// that returned non-zero is a real failure (agent CLI bug, model
// error, etc.) and must propagate as such.
expect(
classifyChatRunCloseStatus({
...base,
code: 1,
signal: null,
artifactQuietShutdownRequested: true,
}),
).toBe('failed');
});
it('returns failed on the standard failure shape (non-zero exit, no special flags)', () => {
expect(
classifyChatRunCloseStatus({ ...base, code: 1, signal: null }),
).toBe('failed');
});
});