Commit graph

9 commits

Author SHA1 Message Date
lefarcen
a6a56099ca
ci: show per-case pass/fail status emoji in agent report (#3118)
Reviewers asked for at-a-glance outcomes. Instruct the agent to begin each
"Cases Tested" bullet with a status emoji (pass, fail, ⚠️ warning, or
inconclusive) and a bold case name, so the report shows which checks
passed or failed without reading each line.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 13:34:29 +00:00
lefarcen
bf61a39cb5
ci: clean agent report (write-to-file) + slim artifacts/uploads (#3116)
* ci: clean agent report (write-to-file) + slim artifacts/uploads

Four related cleanups to the agent PR exploration output:

1. Clean report. The PR comment / report.md was assembled by dumping the
   entire verbose expect.log (ACP init logs, "Git failed" warnings, the
   ~24KB echoed prompt, ANSI codes, progress checklist) under the trace
   header -- ~28KB of noise. Instead, instruct the agent to write its
   final Markdown report to a file via its file-write tool, and have the
   runner read that file directly. Verified: Codex writes a clean report
   to the given absolute path. Falls back to an inconclusive note if the
   agent did not finish.

2. Drop duplicate trace/video. The script copied
   playwright-smoke-trace.zip -> playwright-trace.zip (a ~28MB legacy
   duplicate) and the webm likewise, and uploaded both to R2. Keep only
   the canonical smoke-named artifacts.

3. Slim the GitHub artifact. The trace zips and videos are already on R2;
   exclude *.zip / *.webm from the uploaded artifact so it drops from
   ~56MB to <1MB (report + logs only).

4. Persist report on the runner. Copy the report / agent-report /
   expect.log / trace URL to a stable host dir
   ($HOME/.cache/agent-pr-explore/reports/pr-<n>) so dry runs
   (skip_comment) can be inspected without downloading the artifact.
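
The report flow in items 1 and 4 might look roughly like the sketch below. The function name `finalize_report`, its arguments, the `REPORTS_ROOT` override, and the fallback wording are illustrative assumptions, not the script's actual code; only the `$HOME/.cache/agent-pr-explore/reports/pr-<n>` layout comes from the message above.

```shell
# Hypothetical helper; names and fallback text are assumptions.
# $1 = file the agent was told to write
# $2 = final report path
# $3 = PR number
finalize_report() {
  agent_report_file=$1
  report_file=$2
  pr_number=$3

  if [ -s "$agent_report_file" ]; then
    # The agent finished: use its Markdown as-is instead of the verbose log.
    cp "$agent_report_file" "$report_file"
  else
    # The agent did not finish: fall back to an inconclusive note.
    printf '%s\n' 'Inconclusive: the agent did not write a final report.' > "$report_file"
  fi

  # Keep a copy on the runner so dry runs can be inspected without the artifact.
  # REPORTS_ROOT is a hypothetical override used here for illustration.
  persist_dir="${REPORTS_ROOT:-$HOME/.cache/agent-pr-explore/reports}/pr-$pr_number"
  mkdir -p "$persist_dir"
  cp "$report_file" "$persist_dir/"
}
```

Testing for a non-empty file with `-s` rather than mere existence with `-f` means a truncated, zero-byte report also takes the fallback path.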

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: address review — keep advisory reports + recursive artifact excludes

Review findings on the report/artifact cleanup:

1. Regression fix: the non-app-surface and deterministic-verifier branches
   write their pre-baked advisory report (Inconclusive / Pass / Fail) and
   never run the agent, so they don't produce agent-report.md. After
   switching write_agent_report_artifact to read only agent-report.md they
   fell through to the "agent did not write a final report" fallback,
   dropping the real advisory (and mis-reporting on .github-only PRs like
   this one). Fix: those branches now write their advisory directly to
   $agent_report_file — single source of truth for the report body.

2. Recursive artifact excludes: the source Playwright recording lives at
   artifacts/playwright-video/<uuid>.webm; non-recursive !*.webm / !*.zip
   didn't match the subdirectory. Use **/*.zip and **/*.webm so the slim
   actually holds.

3. Drop the now-dangling summary.legacyTrace field (the legacy trace copy
   is no longer produced), matching the legacyVideo removal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:30:05 +00:00
lefarcen
995601de9c
ci: make agent exploration finalize promptly (avoid inactivity abort) (#3100)
Real Codex runs (#3060) explored correctly — verifying 3-4 UI cases with
DOM evidence — but Codex over-planned (6 steps), executed the high-value
ones, then went silent chasing a remaining planned step and tripped
expect-cli's ~180s no-output watchdog, aborting the turn before it emitted
a final report. The run then fell back to an advisory artifact, so the
real findings never reached report.md.

Tighten the prompt so Codex finishes and submits before going idle:
- cap at 3 cases (was 6) and target 2-3, quality over breadth;
- add a CRITICAL instruction stating the runner aborts with no report
  after ~3 min of no output, so Codex must stop after 2-3 cases and emit
  the complete report in one final turn rather than leaving planned steps
  pending or retrying silently.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 17:15:06 +08:00
lefarcen
fed464509b
ci: drive agent PR exploration with the Codex ACP backend (#3086)
expect-cli defaults to the Claude Code ACP provider, which is not
installed on the self-hosted runner, so the exploration step errored
(AcpProviderNotInstalledError) and fell back to a reachability-only smoke
trace instead of real UI exploration.

Pass `-a codex` to expect-cli so it drives the Codex agent (installed on
the runner, authenticated via CODEX_HOME). Configurable via
OD_EXPECT_AGENT (set to empty to use expect-cli's default). When the
agent is unavailable the existing smoke-trace fallback still applies, so
this is safe even before Codex is authenticated.
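
One way to get the "unset means codex, empty means expect-cli's default" behavior is the colon-less form of parameter expansion. This is a minimal sketch; the helper name `expect_agent_args` is an assumption, and only the `-a codex` flag and the `OD_EXPECT_AGENT` variable come from the message above.

```shell
# Hypothetical helper: prints the agent flag for expect-cli, or nothing.
#   OD_EXPECT_AGENT unset  -> "-a codex"
#   OD_EXPECT_AGENT empty  -> no flag (expect-cli uses its own default)
#   OD_EXPECT_AGENT=name   -> "-a name"
expect_agent_args() {
  # "${VAR-default}" (no colon) substitutes only when VAR is unset,
  # so an explicitly empty value is preserved.
  agent=${OD_EXPECT_AGENT-codex}
  if [ -n "$agent" ]; then
    printf '%s\n' "-a $agent"
  fi
}
```

The more common `${VAR:-default}` form would treat an empty value like an unset one, which would make it impossible to opt back into expect-cli's default.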

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 15:05:03 +08:00
lefarcen
114be63a4e
ci: route agent sandbox installs through the China npm mirror (#3084)
After fixing source acquisition (#3078), the #3060 validation run reached
the container and got through most of `pnpm install`, then failed building
the better-sqlite3 native module: prebuild-install could not reach github
releases and the node-gyp fallback could not fetch node headers from
nodejs.org (ECONNRESET). The electron postinstall hits the same blocked
hosts, and package tarballs from npmjs were throttled to ~20 KB/s.

The runner's network to npmjs / nodejs.org / github releases is throttled
or reset by GFW; the China npm mirror (npmmirror.com) is fast and complete
(verified from the runner: registry ~2.4 MB/s, node headers ~3.6 MB/s,
better-sqlite3 prebuilt present). Point the in-container install at it via
registry + disturl (node-gyp headers) + electron / electron-builder /
better-sqlite3 binary mirrors + Playwright download host.
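
As a rough illustration of the settings listed above, the sketch below prints the environment as `KEY=VALUE` lines. The helper name `mirror_env`, the mirror sub-paths, and the `npm_config_better_sqlite3_binary_host` variable name are assumptions to verify against npmmirror.com and the prebuild-install documentation; the actual script may set these differently.

```shell
# Hypothetical helper: prints KEY=VALUE lines for the in-container install.
# $1 = mirror base URL (optional). ASSUMPTION: every sub-path below is
# illustrative and should be checked against the mirror before use.
mirror_env() {
  base=${1:-https://npmmirror.com/mirrors}
  cat <<EOF
npm_config_registry=https://registry.npmmirror.com
npm_config_disturl=$base/node
ELECTRON_MIRROR=$base/electron/
ELECTRON_BUILDER_BINARIES_MIRROR=$base/electron-builder-binaries/
npm_config_better_sqlite3_binary_host=$base/better-sqlite3
PLAYWRIGHT_DOWNLOAD_HOST=$base/playwright
EOF
}
```

Lines in this form could be written to a file and passed to the container with `docker run --env-file`, which keeps the mirror settings out of the untrusted PR checkout.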

Package integrity is still verified against the lockfile, so the mirror
only changes transport. Once a native module builds, pnpm's side-effects
cache in the persistent store keeps it warm for later runs.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 14:30:59 +08:00
lefarcen
12141648e4
ci: fetch agent sandbox PR source on the host over SSH via a local mirror (#3078)
The sandbox checked out PR code with `git fetch https://github.com/...`
*inside* the container. The self-hosted runner's bandwidth to github.com
is throttled across every transport (HTTPS/SSH/codeload/API, all
~30-90 KB/s) and the HTTPS handshake is frequently RST'd, so a
from-scratch fetch of this ~200MB repo is impractical and unreliable per
run (run 26491460889 failed here with repeated GnuTLS resets).

Move source acquisition to the trusted host and make it incremental:

- Keep a persistent bare mirror of the base repo
  ($HOME/.cache/agent-pr-explore/open-design.git, overridable via
  OD_SANDBOX_REPO_MIRROR). Each run fetches only the PR's delta via
  `refs/pull/<n>/head` over SSH -- the one transport GFW doesn't reset --
  using a read-only deploy key (OD_SANDBOX_GIT_SSH_KEY).
- Take the head from the BASE repo's pull ref so fork PRs work without
  depending on the head fork, and verify it equals the resolved HEAD_SHA.
- Check the PR head into a per-run worktree and mount it read-only into
  the container; the container copies it into a writable workdir and no
  longer needs (or has) any github access.

The deploy key stays on the trusted host and is never exposed to the
untrusted PR code. The mirror must be seeded once on the runner (the
error message prints the exact clone command if it is missing).
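
The host-side steps above can be sketched as follows. The function name `fetch_pr_head`, its argument order, the local ref name, and the wording of the seed hint are assumptions; the `<base-repo-ssh-url>` placeholder is left unfilled on purpose.

```shell
# Hypothetical helper; runs on the trusted host, never in the container.
# $1 = bare mirror dir, $2 = PR number,
# $3 = expected head SHA, $4 = per-run worktree dir
fetch_pr_head() {
  mirror=$1
  pr=$2
  expected_sha=$3
  worktree=$4

  if [ ! -d "$mirror" ]; then
    echo "mirror missing; seed it once with: git clone --bare <base-repo-ssh-url> $mirror" >&2
    return 1
  fi

  # Fetch only the PR's delta from the BASE repo's pull ref, over SSH,
  # using the read-only deploy key.
  GIT_SSH_COMMAND="ssh -i $OD_SANDBOX_GIT_SSH_KEY -o IdentitiesOnly=yes" \
    git --git-dir="$mirror" fetch origin \
      "+refs/pull/$pr/head:refs/pull/$pr/head" || return 1

  # Verify the fetched head is the commit the workflow resolved earlier.
  actual_sha=$(git --git-dir="$mirror" rev-parse "refs/pull/$pr/head") || return 1
  if [ "$actual_sha" != "$expected_sha" ]; then
    echo "PR head $actual_sha does not match expected $expected_sha" >&2
    return 1
  fi

  # Check the head out into a per-run worktree for a read-only mount.
  git --git-dir="$mirror" worktree add --detach "$worktree" "$actual_sha"
}
```

The resulting worktree could then be mounted with a `:ro` volume suffix (for example `docker run -v "$worktree:/src:ro"`, where `/src` is a hypothetical container path), so the container never needs the key or network access to github.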

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 05:36:13 +00:00
lefarcen
2ed93e9c5d
ci: reuse cached docker image and persist pnpm store for agent sandbox (#3074)
* ci: skip docker pull when agent sandbox image is already cached

The agent PR exploration script ran an unconditional `docker pull
"$image"` before `docker run`. Under `set -e`, a transient registry
timeout (the self-hosted runner's network to docker.io is unreliable)
aborts the whole run even when the base image (node:24-bookworm) is
already cached locally — which is what happened on run 26490782540.

Skip the pull entirely when the image is already present, and only pull
when it is missing. This avoids both the failure and the wasted pull
timeout on every run, and keeps a run's base image stable. Refreshing
the cached image is a separate, explicit operation on the runner.
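
A minimal sketch of the cache check, assuming a helper named `ensure_image` (the name and log lines are illustrative, not the script's actual code):

```shell
# Hypothetical helper: pull the image only when it is not cached locally.
# $1 = image reference, e.g. node:24-bookworm
ensure_image() {
  image=$1
  # "docker image inspect" exits non-zero when the image is absent.
  # Because it runs as an "if" condition, "set -e" does not abort on that.
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "using cached image: $image"
  else
    echo "image not cached, pulling: $image"
    docker pull "$image"
  fi
}
```

With this shape a registry timeout can only fail a run when the image is genuinely missing, which matches the behavior described above.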

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: persist agent sandbox pnpm store across runs

The pnpm store was placed under $RUNNER_TEMP, which the Actions runner
wipes per job, so every agent exploration re-downloaded all dependencies
from the npm registry — slow, and as fragile as the runner's docker.io
access (the same network class that already broke the docker pull).

Move the store to a persistent host path ($HOME/.cache/agent-pr-explore/
pnpm-store, overridable via OD_SANDBOX_PNPM_STORE) so a warm,
content-addressed store is reused across runs. `rm -rf "$root"` no longer
touches it since it lives outside the per-run root.
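
Resolving the store path might look like the sketch below; the helper name `pnpm_store_dir` is an assumption, while the default path and the `OD_SANDBOX_PNPM_STORE` override come from the message above.

```shell
# Hypothetical helper: resolve (and create) the persistent pnpm store.
# It lives outside the per-run root, so cleaning that root leaves it intact.
pnpm_store_dir() {
  store=${OD_SANDBOX_PNPM_STORE:-$HOME/.cache/agent-pr-explore/pnpm-store}
  mkdir -p "$store"
  printf '%s\n' "$store"
}
```

The returned directory could be mounted into the container and handed to pnpm with its `--store-dir` option, for example `pnpm install --store-dir /pnpm-store`, where `/pnpm-store` is a hypothetical mount point.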

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:49:26 +08:00
lefarcen
7b8bf0d9fb
ci: map agent trace upload to existing R2 secrets (#3013)
* ci: map agent trace upload to existing R2 secrets

* ci: make agent report comments macos-compatible

* ci: ensure Playwright browsers for agent traces
2026-05-27 03:01:36 +00:00
lefarcen
b5bf28060b
Add sandboxed agent PR exploration (#2604)
2026-05-26 07:52:42 +00:00