open-design/design-templates/eng-runbook/example.html
Tom Huang b5eb8c1647
feat: generic skills + split skills/design-templates + finalize-design API (#955)
* feat: general-purpose skills with @-mention composition and user import

Lift skills from "one mode-bound skill per project" to a generic capability
the user can compose per turn:

- Daemon: scan multiple skill roots (user-skills under runtime data, then
  the bundled `skills/`); user-imported skills can shadow built-ins by id.
- New `POST /api/skills/import` and `DELETE /api/skills/:id` endpoints,
  with CONFLICT/BAD_REQUEST/NOT_FOUND error codes and built-in delete
  protection.
- ChatRequest gains `skillIds: string[]`; the chat run concatenates each
  picked skill's body (and merges craftRequires) into the system prompt
  for that turn only — the project's persistent `skillId` is untouched.
- Web composer: `@` popover now lists skills alongside project files;
  picks render as removable chips above the textarea and ride along with
  the request as `skillIds`.
- Settings → Library: import form (name/description/triggers/body),
  per-card delete for user skills, "user" origin badge.

* chore(web): drop welcome pet teaser + add ds→prompt-template mapping util

- SettingsDialog: remove the inline pet adoption teaser from the welcome
  panel so the first-run modal stays focused on configuration.
- New `inferPromptTemplateCategoriesForDs(ds)` helper that maps a design
  system's authored metadata to prompt-template gallery categories.
  Imported by the design-system gallery wiring on a sibling branch; no
  callers in this branch yet.

* feat: split skills/design-templates and add finalize-design API

Phase 0 of the skills/design-templates refactor (specs/current/
skills-and-design-templates.md):

- Move ~104 rendering catalogue entries from skills/ to design-templates/
  and keep skills/ for the small set of functional skills that *do work*
  on user input (utilities, briefs, packagers).
- Add design-templates/AGENTS.md and skills/AGENTS.md describing the
  contract, and a brand-agnostic craft/ surface for opt-in craft rules.
- Daemon: add DESIGN_TEMPLATES_DIR / USER_DESIGN_TEMPLATES_DIR roots and
  an /api/design-templates surface mirroring /api/skills. Asset/example
  routes still span both registries so existing srcdoc URLs keep
  resolving across the rename.
- Web: split LibrarySection into SkillsSection + DesignSystemsSection,
  rename the EntryView "Examples" tab to "Templates", and update locales
  + the New-project picker accordingly.

Adds the finalize-design endpoint:

- New apps/daemon/src/finalize-design.ts and packages/contracts/src/api/
  finalize.ts — one-shot synthesis of a project's transcript + active
  design system + current artifact into <projectDir>/DESIGN.md via the
  Anthropic Messages API. Per-project .finalize.lock mirrors the
  transcript-export hygiene from PR #493; provider credentials are not
  persisted by the daemon.

Other supporting changes:

- README + AGENTS.md updates to document the new directory split and
  craft/ surface, plus i18n strings across 13 locales.
- Test refactors and new coverage (finalize-design, runs, sidecar
  server, plus refreshed daemon integration tests).
- .gitignore: scope the *.exe ignore to /OpenDesign.exe so legitimate
  vendor binaries are no longer hidden.

* fix(merge): move clinical-case-report to design-templates/

Origin/main added the clinical-case-report skill under skills/ before
the skills/design-templates split landed. Its od.mode is prototype, so
per specs/current/skills-and-design-templates.md it is a design template
and belongs alongside the other rendering catalogue entries — not under
the slimmed-down functional skills/ root. Moving it keeps the EntryView
Templates tab consistent with origin/main's intent.

* feat(skills): curated design/creative catalogue + collapsible Settings rows

Seed ~100 curated design/creative skill stubs under skills/ sourced from
awesome-claude-skills (ComposioHQ) and awesome-agent-skills (VoltAgent).
Each stub carries an od.category tag so the new filter pill row in
Settings -> Skills can group them. The seed script
(scripts/seed-curated-design-skills.ts, pnpm seed:curated-design-skills)
is idempotent: it only creates folders that don't already exist, so
hand-edited stubs are never overwritten.

- Daemon: parse and surface od.category on SkillInfo with a strict slug
  normaliser; mirror the field on SkillSummary in @open-design/contracts.
  Category is purely a UI hint — system-prompt composition is unchanged.
- Web: rewrite SkillsSection from a left-list / right-detail grid into a
  vertical stack of collapsible rows mirroring the External MCP panel
  (header always visible with name + mode/source/category pills + per-row
  enable toggle; SKILL.md preview, file tree and inline edit form expand
  on demand). Add a Category filter row above the list. Reorder Settings
  nav so Skills + External MCP sit above the Composio/MCP cluster. Update
  composer placeholder/hint across 17 locales to advertise '@ files or
  skills · / for commands'.
- Docs: extend skills/AGENTS.md with the curated catalogue rules
  (idempotency, category vocabulary, no upstream vendoring).

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(skills): teach localized-content + system-prompt tests about the skills/design-templates split

mrcfps blocking review on PR #955: the skills/design-templates split
(b5993385) moved ~110 SKILL.md entries out of `skills/` and into
`design-templates/`, but two repo-level tests still hard-coded the
single-root layout, so CI gates went red on the merged branch:

- `e2e/tests/localized-content.test.ts` only scanned `<repo>/skills`
  while the locale `skillCopy` map keeps id-keyed entries spanning
  both roots (ExamplesTab/Templates uses one lookup regardless of
  origin). Teach the helper to read both `skills/` and
  `design-templates/`, deduplicating ids so the union matches the
  localized claim.
- `apps/daemon/tests/prompts/system.test.ts` read
  `skills/live-artifact/SKILL.md`, which now lives under
  `design-templates/live-artifact/`. Update the absolute path so
  composeSystemPrompt's coverage of the live-artifact preamble is
  exercised again.

Also enroll the curated design/creative catalogue (PR #955, ~91
stubs sourced from awesome-claude-skills / awesome-agent-skills) in
the DE / FR / RU `_SKILL_IDS_WITH_EN_FALLBACK` lists. The stubs are
English-only by design (frontmatter advertises an upstream URL); the
fallback list is exactly the place to acknowledge "we know this id
exists, English copy is fine here" so the localized-content coverage
gate passes without forcing a translation task per locale.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(skills): always quote frontmatter name so importUserSkill round-trips numeric / boolean ids

mrcfps PR #955 review: `buildSkillMarkdown` emitted `name:
${escapeYamlString(name)}` without quotes, so YAML coerced names
like `123`, `true`, `false`, or `null` into non-string scalars on
re-parse. listSkills() then read `data.name` as a number/boolean
and the import flow's follow-up `findSkillById(skills, result.id)`
missed it, falling into `/api/skills/import`'s "imported skill
could not be re-read" 500 path for those ids.

Switch the emitter to a quoted scalar (`name: "..."`) — the
double-escape already in `escapeYamlString` makes the quoted form
safe — and add a round-trip test covering `123`, `true`, `false`,
`null`, and `0` to lock in the contract.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): drop staged-skill chips when the matching @<id> token leaves the draft

mrcfps PR #955 review: `submit()` always forwarded every id in
`stagedSkills`, but that state was only mutated on picker click and
chip removal. Hand-deleting an `@<id>` token from the textarea left
the chip staged, so the request still carried `skillIds: [<id>]` and
the daemon composed a skill the prompt no longer referenced.

Sync the chips with the draft inside `handleChange()` by pruning
`stagedSkills` whenever the new value no longer contains the
`@<id>` token (using the same whitespace boundary as
`removeStagedSkill`'s strip regex). Comment explains why this
prune does not run for `staged` file attachments — users frequently
add files via the upload button without leaving an `@<path>` token,
so a symmetric prune there would erase legitimate uploads.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(daemon): stage @-composed skills' side files alongside the active skill

codex PR #955 review: composing a per-turn `@`-picked skill into the
system prompt appended its body (with the `withSkillRootPreamble`
guidance pointing at relative paths under `<cwd>/.od-skills/<folder>/`)
but never staged the actual folder. `startChatRun` only copied
`activeSkillDir`, so when the project's primary skill was different
(or absent) the composed skill's references/, examples/, and scripts/
files lived only at their absolute repo path — agents that honour
the cwd-relative form (or that don't get `--add-dir`, e.g. Codex with
allowlisted gpt-image projects) couldn't reach them.

Thread the composed skills' dirs out of `composeDaemonSystemPrompt`
as `extraSkillDirs` and stage each one through the same
`stageActiveSkill` API used for the primary skill. Dedupe by folder
basename so a project whose primary skill is also `@`-composed isn't
copied twice. Each preamble already advertises its own folder, so the
prompt and the staged tree stay aligned without further changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(web): respect the Library disable toggle in the project @-mention picker

codex PR #955 review: only `EntryView` received `enabledSkills`
(filtered against `config.disabledSkills`); active projects still
got `skills={skills}` raw, so a skill the user disabled in Settings
kept appearing in the project's `@`-mention popover and could ride
along to the daemon via `skillIds`. That broke the Library toggle
for any project opened on the post-split branch.

Compute a functional-skills-only enabled subset
(`enabledFunctionalSkills`) and pass it into `<ProjectView>` instead.
Templates stay separate — design-templates are filtered through their
own `enabledDesignTemplates` memo for the Templates gallery — so
ProjectView's chat composer still only sees skills, never templates,
matching the pre-split prop surface.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(e2e): mock /api/design-templates for example-use-prompt flow

The Templates tab in EntryView fetches from /api/design-templates after
the skills/design-templates split (specs/current/skills-and-design-templates.md).
The example-use-prompt Playwright scenario only mocked /api/skills, so the
gallery card never appeared and the test timed out waiting on
example-card-warm-utility-example. Serve the same fixture summary on both
endpoints so the templates gallery renders the card the test clicks.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(tools-pack): create design-templates fixture for resources test

The packaging resources copy now bundles the new design-templates tree
alongside skills (see resources.ts BUNDLED_RESOURCE_TREES). The
copyBundledResourceTrees fixture only created skills, design-systems,
craft, etc., so the recursive copy crashed with ENOENT on
design-templates before it could check the prompt-templates assertion.
Add the missing fixture directory so the test exercises the same set
of resource trees the packaged build does.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(skills): clone built-in side files into the shadow on first edit

mrcfps PR #955 review: editing a built-in skill wrote a USER_SKILLS_DIR
shadow folder that contained only a new SKILL.md. The next listSkills()
pass surfaced the shadow as the active dir, but every side-file resolver
(/api/skills/:id/files, /example, /assets/*, the system-prompt preamble,
and the per-turn cwd staging) reads through skill.dir. With nothing but
SKILL.md in the shadow, the bundled assets/, references/, scripts/, and
examples/ disappeared the moment the user hit save — a built-in like
last30days or live-artifact would break immediately after edit instead
of just having its body overridden.

Teach updateUserSkill() to take a `sourceDir` and clone every entry
except SKILL.md / dotfiles into the shadow on the very first edit. The
shadow stays self-contained, so all the resolvers keep working without
fallback bookkeeping. Subsequent edits detect the existing shadow and
skip the clone, so user tweaks under the side tree survive a re-save.

Wire `sourceDir: skill.dir` from server.ts's PUT /api/skills/:id handler
and add two regression tests:
- 'clones built-in side files into the shadow on the first edit' walks
  the file tree after save and asserts assets/template.html, references/
  notes.md, and scripts/helper.sh all round-trip from the built-in.
- 'preserves user-edited side files on subsequent edits' edits the
  staged assets/template.html, re-saves, and confirms the user content
  is still there.

Co-authored-by: Cursor <cursoragent@cursor.com>

* test(e2e): rename home tab from Examples to Templates

The Examples tab was renamed to Templates in EntryView (b5993385's
skills/design-templates split — entry.tabExamples became entry.tabTemplates
and the tab value moved from 'examples' to 'templates'), but
entry-chrome-flows still asserted the old label and testId. Update both.

* fix(skills+web): preserve template body in API mode and dir-based skill delete

Two follow-ups from PR #955 review:

1. ProjectView only received `enabledFunctionalSkills`, but
   `composedSystemPrompt()` still resolved `project.skillId` through that
   prop and `fetchSkill()`. Projects created from the new
   `/api/design-templates` surface keep a template id in `project.skillId`,
   so opening one in API mode dropped the template body from the system
   prompt and the upstream request ran without the project's primary
   template instructions. Now ProjectView takes a separate
   `designTemplates` prop (the unfiltered template list, so a
   later-disabled template still loads for projects already created from
   it) and `composedSystemPrompt()` plus the metadata / `isDeck` lookups
   fall back to that list, with `fetchDesignTemplate()` as the body-fetch
   fallback to `fetchSkill()`. The chat composer's `@`-picker keeps
   receiving only the enabled functional skills.

2. `DELETE /api/skills/:id` used `deleteUserSkill(USER_SKILLS_DIR, skill.id)`
   which re-slugified the frontmatter id and removed
   `<userSkillsDir>/<slug>/`. That matched the import shape but missed the
   install shape — `installFromTarget` writes the folder at
   `sanitizeRepoName(url)` (GitHub) or `path.basename(realpath)` (local
   symlink), neither of which is guaranteed to equal the slugified
   frontmatter `name`. A duplicate `app.delete('/api/skills/:id', ...)`
   handler at the install routes never fired because Express resolved the
   earlier registration first, leaving the install/uninstall path without
   working teardown. The handler now removes `skill.dir` (the absolute
   path listSkills already discovered) under a USER_SKILLS_DIR safety
   check, using `lstat` + `unlinkSync` so symlinked local installs unlink
   cleanly without recursing into the user's source tree. The dead
   duplicate handler is removed; `deleteUserSkill` is dropped from the
   server.ts import set (still exported and unit-tested in skills.ts).
   Regression coverage in `apps/daemon/tests/skills-delete-route.test.ts`
   pins both shapes plus the symlink-preserves-source case.

* test(daemon): point hyperframes system-prompt test at design-templates

The merge with main brought in a hyperframes system-prompt test that
reads `skills/hyperframes/SKILL.md`, but this branch's split moved
`hyperframes` into `design-templates/` (same migration as `live-artifact`
already handled above in this file). CI was failing with ENOENT on the
old path.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-11 17:48:34 +08:00

250 lines
14 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Auth Service · Runbook</title>
<style>
:root {
--bg: #0c0e14;
--paper: #14171f;
--paper-2: #1c2030;
--ink: #eaecf3;
--muted: #8b94ad;
--line: #262b3b;
--accent: #6ee7b7;
--accent-soft: rgba(110,231,183,0.1);
--warn: #fbbf24;
--danger: #f87171;
--display: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
--body: -apple-system, BlinkMacSystemFont, 'Segoe UI', Inter, sans-serif;
--mono: ui-monospace, 'JetBrains Mono', SFMono-Regular, Menlo, monospace;
}
* { box-sizing: border-box; }
body { margin: 0; background: var(--bg); color: var(--ink); font-family: var(--body); font-size: 14px; line-height: 1.6; }
.page { max-width: 1100px; margin: 0 auto; padding: 32px 28px 64px; }
/* Header */
.head { display: flex; justify-content: space-between; align-items: flex-end; padding-bottom: 24px; border-bottom: 1px solid var(--line); margin-bottom: 28px; }
.head-left { display: flex; flex-direction: column; gap: 6px; }
.crumb { font-family: var(--mono); font-size: 11.5px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.06em; }
h1 { font-family: var(--display); font-size: 36px; margin: 4px 0; font-weight: 700; letter-spacing: -0.02em; }
.head-meta { font-family: var(--mono); font-size: 11.5px; color: var(--muted); }
.head-meta span { color: var(--accent); }
.pill {
display: inline-flex; align-items: center; gap: 6px; padding: 5px 12px; border-radius: 999px;
font-family: var(--mono); font-size: 11px; text-transform: uppercase; letter-spacing: 0.06em; font-weight: 600;
}
.pill.tier { background: var(--accent-soft); color: var(--accent); border: 1px solid rgba(110,231,183,0.3); }
.pill .dot { width: 6px; height: 6px; border-radius: 50%; background: var(--accent); }
section { margin-top: 40px; }
h2 { font-family: var(--display); font-size: 22px; margin: 0 0 14px; letter-spacing: -0.005em; font-weight: 700; }
h2 .index { font-family: var(--mono); font-size: 12px; color: var(--muted); margin-right: 12px; vertical-align: middle; }
/* Summary */
.summary { display: grid; grid-template-columns: 1.4fr 1fr; gap: 14px; }
.panel { padding: 22px 24px; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; }
.panel p { margin: 0 0 12px; }
.panel p:last-child { margin: 0; }
.deps h3 { font-family: var(--mono); font-size: 11px; text-transform: uppercase; letter-spacing: 0.08em; color: var(--muted); margin: 0 0 10px; font-weight: 500; }
.deps ul { padding: 0; margin: 0; list-style: none; display: flex; flex-direction: column; gap: 8px; font-family: var(--mono); font-size: 12.5px; }
.deps li { display: flex; justify-content: space-between; padding: 8px 12px; background: var(--paper-2); border-radius: 6px; }
.deps li .ok { color: var(--accent); }
.deps li .warn { color: var(--warn); }
/* Tables */
table { width: 100%; border-collapse: collapse; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; overflow: hidden; }
th, td { text-align: left; padding: 12px 16px; border-bottom: 1px solid var(--line); font-size: 13px; vertical-align: top; }
th { font-family: var(--mono); font-size: 10.5px; text-transform: uppercase; letter-spacing: 0.06em; color: var(--muted); background: var(--paper-2); }
tr:last-child td { border-bottom: none; }
td.code, .panel code { font-family: var(--mono); }
.sev { display: inline-flex; align-items: center; gap: 6px; padding: 3px 9px; border-radius: 4px; font-family: var(--mono); font-size: 10.5px; text-transform: uppercase; letter-spacing: 0.04em; font-weight: 600; }
.sev-1 { background: rgba(248,113,113,0.15); color: var(--danger); }
.sev-2 { background: rgba(251,191,36,0.15); color: var(--warn); }
.sev-3 { background: rgba(110,231,183,0.15); color: var(--accent); }
/* Procedure cards */
.procs { display: flex; flex-direction: column; gap: 14px; }
.proc { padding: 18px 22px; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; }
.proc-head { display: flex; justify-content: space-between; align-items: baseline; margin-bottom: 10px; }
.proc-head h3 { margin: 0; font-family: var(--display); font-size: 17px; }
.proc-head .when { font-family: var(--mono); font-size: 11px; color: var(--muted); }
pre { background: var(--paper-2); border: 1px solid var(--line); border-radius: 8px; padding: 14px 16px; overflow-x: auto; font-family: var(--mono); font-size: 12.5px; line-height: 1.6; color: #cdd6f4; margin: 8px 0 0; }
pre .cmt { color: var(--muted); }
pre .var { color: var(--warn); }
pre .ok { color: var(--accent); }
/* On-call */
.rota { background: var(--paper); border: 1px solid var(--line); border-radius: 12px; overflow: hidden; }
/* Checklist */
.checklist { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; }
.step { padding: 18px 20px; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; display: flex; gap: 16px; align-items: flex-start; }
.step-num { flex: 0 0 36px; width: 36px; height: 36px; border-radius: 50%; background: var(--accent); color: var(--bg); display: inline-flex; align-items: center; justify-content: center; font-weight: 700; font-family: var(--display); font-size: 16px; }
.step h4 { margin: 0 0 6px; font-family: var(--display); font-size: 15px; }
.step p { margin: 0; color: var(--muted); font-size: 13px; }
.step code { font-family: var(--mono); background: var(--paper-2); padding: 2px 6px; border-radius: 4px; font-size: 12px; color: var(--accent); }
footer { margin-top: 56px; padding-top: 18px; border-top: 1px solid var(--line); display: flex; justify-content: space-between; font-family: var(--mono); font-size: 11.5px; color: var(--muted); }
@media (max-width: 880px) {
.summary, .checklist { grid-template-columns: 1fr; }
h1 { font-size: 26px; }
}
</style>
</head>
<body>
<div class="page">
<header class="head">
<div class="head-left">
<div class="crumb">Northwind / Identity / Auth</div>
<h1>auth-service</h1>
<div class="head-meta">Owned by <span>@identity-platform</span> · v4.7.2 · Last reviewed 14 Oct 2025</div>
</div>
<span class="pill tier"><span class="dot"></span>Tier 0 · production-critical</span>
</header>
<section>
<h2><span class="index">01</span>Service summary</h2>
<div class="summary">
<div class="panel">
<p><strong>auth-service</strong> issues, validates, and revokes session tokens for every Northwind product surface — web, mobile, and the public API. It owns the password store, the TOTP/WebAuthn enrollments, and the audit-log writer for all auth events.</p>
<p>If <code>auth-service</code> is down, customers cannot log in or refresh sessions. Existing valid sessions continue to work for their TTL (15 minutes) but no new auth happens.</p>
</div>
<div class="panel deps">
<h3>Dependencies</h3>
<ul>
<li><span>Postgres · auth-db</span><span class="ok">healthy</span></li>
<li><span>Redis · session-cache</span><span class="ok">healthy</span></li>
<li><span>KMS · auth-keyring</span><span class="ok">healthy</span></li>
<li><span>SES · transactional</span><span class="warn">degraded</span></li>
<li><span>Pager · oncall.northwind</span><span class="ok">healthy</span></li>
</ul>
</div>
</div>
</section>
<section>
<h2><span class="index">02</span>Alerts you might wake up to</h2>
<table>
<thead><tr><th>Alert</th><th>Severity</th><th>What it means</th><th>First response</th></tr></thead>
<tbody>
<tr>
<td class="code">auth.login_5xx_rate &gt; 1%</td>
<td><span class="sev sev-1">SEV-1</span></td>
<td>Login endpoint returning errors. Customers are locked out.</td>
<td>Check Postgres + Redis dashboards. Roll back last deploy if &lt; 30 min old.</td>
</tr>
<tr>
<td class="code">auth.token_refresh_lag_p95 &gt; 800ms</td>
<td><span class="sev sev-2">SEV-2</span></td>
<td>Refresh path is slow. Web app starts to feel sluggish.</td>
<td>Inspect Redis CPU + connection count. Scale read replicas if needed.</td>
</tr>
<tr>
<td class="code">auth.signup_failure &gt; 10/min</td>
<td><span class="sev sev-2">SEV-2</span></td>
<td>New signups are failing. Often SES bounces or SMTP auth.</td>
<td>Check SES bounce rate. Failover transactional queue to backup region.</td>
</tr>
<tr>
<td class="code">auth.kms_signing_errors &gt; 0</td>
<td><span class="sev sev-1">SEV-1</span></td>
<td>KMS can't sign session tokens. New logins fail; existing sessions OK.</td>
<td>Page the security team. Do not roll keys without a security engineer.</td>
</tr>
<tr>
<td class="code">auth.audit_writer_backlog &gt; 5k</td>
<td><span class="sev sev-3">SEV-3</span></td>
<td>Audit log writer is falling behind. Compliance impact.</td>
<td>Drain manually. Open a ticket; not a wake-up.</td>
</tr>
</tbody>
</table>
</section>
<section>
<h2><span class="index">03</span>Common procedures</h2>
<div class="procs">
<div class="proc">
<div class="proc-head"><h3>Deploy a new version</h3><span class="when">Use during business hours</span></div>
<p>Deploys are blue/green. The script waits for two consecutive healthchecks before promoting traffic.</p>
<pre><span class="cmt"># Deploy auth-service v4.7.3 to production</span>
$ nw deploy auth-service --tag <span class="var">v4.7.3</span> --env production
<span class="cmt"># Wait for two consecutive healthchecks (~90 s), then promote.</span>
$ nw deploy promote auth-service --env production
<span class="ok">→ traffic shifted: 10% / 50% / 100%</span></pre>
</div>
<div class="proc">
<div class="proc-head"><h3>Roll back to last known good</h3><span class="when">Use when error rate &gt; 1% post-deploy</span></div>
<pre><span class="cmt"># Rolls back to the previously promoted version, no rebuild.</span>
$ nw deploy rollback auth-service --env production
<span class="ok">→ rolled back to v4.7.2 in 38 s</span></pre>
</div>
<div class="proc">
<div class="proc-head"><h3>Rotate signing keys</h3><span class="when">Schedule with security; never solo</span></div>
<pre><span class="cmt"># 1. Generate the new signing key in KMS</span>
$ nw kms create-key --alias auth-signing-<span class="var">$(date +%Y%m%d)</span>
<span class="cmt"># 2. Mark the new key as the primary; old key remains valid for 24h</span>
$ nw kms set-primary auth-signing --key <span class="var">&lt;arn&gt;</span>
<span class="cmt"># 3. After 24h, schedule deletion of the previous key</span>
$ nw kms schedule-deletion auth-signing --key <span class="var">&lt;old-arn&gt;</span> --days 30</pre>
</div>
<div class="proc">
<div class="proc-head"><h3>Drain audit-log backlog</h3><span class="when">Use when audit_writer_backlog alert fires</span></div>
<pre>$ nw exec auth-service -- bin/audit-drain --batch <span class="var">5000</span>
<span class="ok">→ drained 4,812 entries in 12 s; backlog now 0</span></pre>
</div>
</div>
</section>
<section>
<h2><span class="index">04</span>On-call rotation · this month</h2>
<table class="rota">
<thead><tr><th>Week</th><th>Primary</th><th>Secondary</th><th>Backup (escalation)</th></tr></thead>
<tbody>
<tr><td>Oct 27 Nov 02</td><td>Devon Park</td><td>Priya Banerjee</td><td>Sasha Lin</td></tr>
<tr><td>Nov 03 Nov 09</td><td>Caleb Renner</td><td>Devon Park</td><td>Sasha Lin</td></tr>
<tr><td>Nov 10 Nov 16</td><td>Priya Banerjee</td><td>Caleb Renner</td><td>Mira Reddy</td></tr>
<tr><td>Nov 17 Nov 23</td><td>Sasha Lin</td><td>Priya Banerjee</td><td>Mira Reddy</td></tr>
</tbody>
</table>
</section>
<section>
<h2><span class="index">05</span>Incident response — first 30 minutes</h2>
<div class="checklist">
<div class="step">
<div class="step-num">1</div>
<div><h4>Acknowledge the page within 5 min.</h4><p>Type <code>/ack</code> in <code>#incidents-auth</code>. The bot stops re-paging and tags the on-call.</p></div>
</div>
<div class="step">
<div class="step-num">2</div>
<div><h4>Open the incident channel.</h4><p>Run <code>/incident open auth-service "&lt;short title&gt;"</code>. Slack bot creates a dedicated channel and pages the secondary.</p></div>
</div>
<div class="step">
<div class="step-num">3</div>
<div><h4>Post a status snapshot.</h4><p>Customer-impact in one line, what you know, what you're checking next. Re-post every 10 minutes.</p></div>
</div>
<div class="step">
<div class="step-num">4</div>
<div><h4>Mitigate before you diagnose.</h4><p>If a recent deploy is suspect, roll back. If KMS is degraded, fail open is <em>never</em> the answer for auth — escalate to security.</p></div>
</div>
<div class="step">
<div class="step-num">5</div>
<div><h4>Hand off or stand down.</h4><p>If you can't resolve in 30 min, hand to the secondary. When healthy, close with <code>/incident close</code>; postmortem is owed within 5 business days.</p></div>
</div>
</div>
</section>
<footer>
<span>Northwind Identity Platform · runbook v3.2</span>
<span>Source: ops-docs/auth-service.md</span>
</footer>
</div>
</body>
</html>