open-design/apps
Weston Houghton 65802542a2
fix(chat): surface OpenCode usage-limit/provider failures instead of a bare timeout (#3316)
* fix(chat): surface OpenCode provider failures from its log on a silent stall

OpenCode's headless `run --format json` mode swallows provider failures: a
429 usage-limit is marked retryable and retried silently with nothing on
stdout/stderr, so the chat run only dies via the inactivity watchdog and the
daemon shows a bare "request timed out" with no reason. The real error
(statusCode + "Monthly usage limit reached…") is recorded only in OpenCode's
own session log.

On a failed OpenCode close where stdout/stderr carry no signal, read the
newest OpenCode session log, extract the latest `service=llm` provider error
(scoped to that one line so the embedded request body can't contaminate the
classification), and emit a structured, retryable SSE error (RATE_LIMITED /
AGENT_AUTH_REQUIRED / UPSTREAM_UNAVAILABLE) carrying the provider's message.

Refs #982.

* fix(chat): emit recovered OpenCode failure from the watchdog path, bound to the run

Addresses review on #3316.

Blocking: the recovery previously ran only in the child-close handler, but in
the inactivity-watchdog stall path (the exact case this targets)
failForInactivity sends its error and finish()es the run — which clears
run.clients — before the child closes. So the structured error reached zero
live SSE clients and only surfaced on reload. Recover and send the OpenCode
failure inside failForInactivity, before finish(), on the same pre-teardown
send path the generic stall message already uses. Keep the close-handler
branch for the case where OpenCode exits non-zero on its own (clients still
attached).

Non-blocking: bind the log lookup to the current run via an mtime gate
(since=run.createdAt) so a stale or concurrent session's error can't be
misattributed — skip log files last written before the run started.

* docs(opencode-log): note the concurrent-run limitation of the mtime gate

* fix(chat): skip close-handler failure emit when the watchdog already finished the run

Non-blocking review follow-up on #3316: on the silent-stall path both
failForInactivity and the child-close handler fired for the same run, so the
recovered RATE_LIMITED error was sent twice and the events-log stream was
reopened after finish() had closed it. Guard the close-handler failure emit
with !design.runs.isTerminal(run.status) — the watchdog already sent the
error and finalized the run; finalization below still runs (finish() no-ops
once terminal).
2026-05-30 03:23:58 +00:00
..
daemon fix(chat): surface OpenCode usage-limit/provider failures instead of a bare timeout (#3316) 2026-05-30 03:23:58 +00:00
desktop feat(runtimes): register AMR (vela) as an ACP stdio agent (#2355) 2026-05-28 05:09:55 +00:00
landing-page fix(landing-page): wire up mobile nav toggle on the homepage (#3295) 2026-05-29 10:19:37 +00:00
packaged fix(platform): support live system proxy changes (#3093) 2026-05-28 06:11:47 +00:00
telemetry-worker chore: pin dependency versions and harden CI caches (#2189) 2026-05-19 13:58:27 +08:00
web fix(web): snapshot the srcDoc bridge frame in Mark mode so deck capture works (#3304) 2026-05-29 11:50:37 +00:00
AGENTS.md refactor(daemon): split agent runtime definitions (#1063) 2026-05-11 15:01:55 +08:00