mirror of
https://github.com/nexu-io/open-design.git
synced 2026-05-31 19:04:39 +07:00
* feat: general-purpose skills with @-mention composition and user import
Lift skills from "one mode-bound skill per project" to a generic capability
the user can compose per turn:
- Daemon: scan multiple skill roots (user-skills under runtime data, then
the bundled `skills/`); user-imported skills can shadow built-ins by id.
- New `POST /api/skills/import` and `DELETE /api/skills/:id` endpoints,
with CONFLICT/BAD_REQUEST/NOT_FOUND error codes and built-in delete
protection.
- ChatRequest gains `skillIds: string[]`; the chat run concatenates each
picked skill's body (and merges craftRequires) into the system prompt
for that turn only — the project's persistent `skillId` is untouched.
- Web composer: `@` popover now lists skills alongside project files;
picks render as removable chips above the textarea and ride along with
the request as `skillIds`.
- Settings → Library: import form (name/description/triggers/body),
per-card delete for user skills, "user" origin badge.
* chore(web): drop welcome pet teaser + add ds→prompt-template mapping util
- SettingsDialog: remove the inline pet adoption teaser from the welcome
panel so the first-run modal stays focused on configuration.
- New `inferPromptTemplateCategoriesForDs(ds)` helper that maps a design
system's authored metadata to prompt-template gallery categories.
Imported by the design-system gallery wiring on a sibling branch; no
callers in this branch yet.
* feat: split skills/design-templates and add finalize-design API
Phase 0 of the skills/design-templates refactor (specs/current/
skills-and-design-templates.md):
- Move ~104 rendering catalogue entries from skills/ to design-templates/
and keep skills/ for the small set of functional skills that *do work*
on user input (utilities, briefs, packagers).
- Add design-templates/AGENTS.md and skills/AGENTS.md describing the
contract, and a brand-agnostic craft/ surface for opt-in craft rules.
- Daemon: add DESIGN_TEMPLATES_DIR / USER_DESIGN_TEMPLATES_DIR roots and
an /api/design-templates surface mirroring /api/skills. Asset/example
routes still span both registries so existing srcdoc URLs keep
resolving across the rename.
- Web: split LibrarySection into SkillsSection + DesignSystemsSection,
rename the EntryView "Examples" tab to "Templates", and update locales
+ the New-project picker accordingly.
Adds the finalize-design endpoint:
- New apps/daemon/src/finalize-design.ts and packages/contracts/src/api/
finalize.ts — one-shot synthesis of a project's transcript + active
design system + current artifact into <projectDir>/DESIGN.md via the
Anthropic Messages API. Per-project .finalize.lock mirrors the
transcript-export hygiene from PR #493; provider credentials are not
persisted by the daemon.
Other supporting changes:
- README + AGENTS.md updates to document the new directory split and
craft/ surface, plus i18n strings across 13 locales.
- Test refactors and new coverage (finalize-design, runs, sidecar
server, plus refreshed daemon integration tests).
- .gitignore: scope the *.exe ignore to /OpenDesign.exe so legitimate
vendor binaries are no longer hidden.
* fix(merge): move clinical-case-report to design-templates/
Origin/main added the clinical-case-report skill under skills/ before
the skills/design-templates split landed. Its od.mode is prototype, so
per specs/current/skills-and-design-templates.md it is a design template
and belongs alongside the other rendering catalogue entries — not under
the slimmed-down functional skills/ root. Moving it keeps the EntryView
Templates tab consistent with origin/main's intent.
* feat(skills): curated design/creative catalogue + collapsible Settings rows
Seed ~100 curated design/creative skill stubs under skills/ sourced from
awesome-claude-skills (ComposioHQ) and awesome-agent-skills (VoltAgent).
Each stub carries an od.category tag so the new filter pill row in
Settings -> Skills can group them. The seed script
(scripts/seed-curated-design-skills.ts, pnpm seed:curated-design-skills)
is idempotent: it only creates folders that don't already exist, so
hand-edited stubs are never overwritten.
- Daemon: parse and surface od.category on SkillInfo with a strict slug
normaliser; mirror the field on SkillSummary in @open-design/contracts.
Category is purely a UI hint — system-prompt composition is unchanged.
- Web: rewrite SkillsSection from a left-list / right-detail grid into a
vertical stack of collapsible rows mirroring the External MCP panel
(header always visible with name + mode/source/category pills + per-row
enable toggle; SKILL.md preview, file tree and inline edit form expand
on demand). Add a Category filter row above the list. Reorder Settings
nav so Skills + External MCP sit above the Composio/MCP cluster. Update
composer placeholder/hint across 17 locales to advertise '@ files or
skills · / for commands'.
- Docs: extend skills/AGENTS.md with the curated catalogue rules
(idempotency, category vocabulary, no upstream vendoring).
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(skills): teach localized-content + system-prompt tests about the skills/design-templates split
mrcfps blocking review on PR #955: the skills/design-templates split
(b5993385) moved ~110 SKILL.md entries out of `skills/` and into
`design-templates/`, but two repo-level tests still hard-coded the
single-root layout, so CI gates went red on the merged branch:
- `e2e/tests/localized-content.test.ts` only scanned `<repo>/skills`
while the locale `skillCopy` map keeps id-keyed entries spanning
both roots (ExamplesTab/Templates uses one lookup regardless of
origin). Teach the helper to read both `skills/` and
`design-templates/`, deduplicating ids so the union matches the
localized claim.
- `apps/daemon/tests/prompts/system.test.ts` read
`skills/live-artifact/SKILL.md`, which now lives under
`design-templates/live-artifact/`. Update the absolute path so
composeSystemPrompt's coverage of the live-artifact preamble is
exercised again.
Also enroll the curated design/creative catalogue (PR #955, ~91
stubs sourced from awesome-claude-skills / awesome-agent-skills) in
the DE / FR / RU `_SKILL_IDS_WITH_EN_FALLBACK` lists. The stubs are
English-only by design (frontmatter advertises an upstream URL); the
fallback list is exactly the place to acknowledge "we know this id
exists, English copy is fine here" so the localized-content coverage
gate passes without forcing a translation task per locale.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(skills): always quote frontmatter name so importUserSkill round-trips numeric / boolean ids
mrcfps PR #955 review: `buildSkillMarkdown` emitted `name:
${escapeYamlString(name)}` without quotes, so YAML coerced names
like `123`, `true`, `false`, or `null` into non-string scalars on
re-parse. listSkills() then read `data.name` as a number/boolean
and the import flow's follow-up `findSkillById(skills, result.id)`
missed it, falling into `/api/skills/import`'s "imported skill
could not be re-read" 500 path for those ids.
Switch the emitter to a quoted scalar (`name: "..."`) — the
double-escape already in `escapeYamlString` makes the quoted form
safe — and add a round-trip test covering `123`, `true`, `false`,
`null`, and `0` to lock in the contract.
Co-authored-by: Cursor <cursoragent@cursor.com>
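A minimal sketch of the quoting fix described above, not the repository's actual code. The helper names `escapeForDoubleQuotes` and `emitNameLine` are hypothetical stand-ins for `escapeYamlString` and the `name:` line that `buildSkillMarkdown` emits, and the escaping shown covers only backslashes and double quotes.

```typescript
// Hypothetical stand-in for `escapeYamlString`; the real helper may escape more.
function escapeForDoubleQuotes(value: string): string {
  // Escape backslashes first, then double quotes, so the result can sit
  // inside a YAML double-quoted scalar.
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Hypothetical stand-in for the `name:` line written into SKILL.md frontmatter.
function emitNameLine(name: string): string {
  // Always quote, so ids such as 123, true, false, and null are emitted
  // as strings rather than bare scalars a YAML parser would coerce.
  return `name: "${escapeForDoubleQuotes(name)}"`;
}

const riskyIds: string[] = ["123", "true", "false", "null", "0"];
const frontmatterLines: string[] = riskyIds.map(emitNameLine);
```

Each emitted line carries the id inside double quotes, which is the property the round-trip test in the commit is said to lock in.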
* fix(web): drop staged-skill chips when the matching @<id> token leaves the draft
mrcfps PR #955 review: `submit()` always forwarded every id in
`stagedSkills`, but that state was only mutated on picker click and
chip removal. Hand-deleting an `@<id>` token from the textarea left
the chip staged, so the request still carried `skillIds: [<id>]` and
the daemon composed a skill the prompt no longer referenced.
Sync the chips with the draft inside `handleChange()` by pruning
`stagedSkills` whenever the new value no longer contains the
`@<id>` token (using the same whitespace boundary as
`removeStagedSkill`'s strip regex). Comment explains why this
prune does not run for `staged` file attachments — users frequently
add files via the upload button without leaving an `@<path>` token,
so a symmetric prune there would erase legitimate uploads.
Co-authored-by: Cursor <cursoragent@cursor.com>
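The pruning rule described above could look roughly like the following sketch. `pruneStagedSkills` and `hasMentionToken` are hypothetical names, and the exact regex is an assumption; the commit only states that the token check uses a whitespace boundary.

```typescript
// Escape regex metacharacters so a skill id is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when the draft still contains `@<id>` delimited by whitespace
// or by the start/end of the draft (assumed boundary rule).
function hasMentionToken(draft: string, id: string): boolean {
  const pattern = new RegExp(`(^|\\s)@${escapeRegExp(id)}(?=\\s|$)`);
  return pattern.test(draft);
}

// Keep only the staged skill ids whose token is still present in the draft.
function pruneStagedSkills(stagedIds: string[], draft: string): string[] {
  return stagedIds.filter((id) => hasMentionToken(draft, id));
}
```

Under this sketch, deleting `@brief` by hand from a draft drops `brief` from the staged list on the next change event, while a longer token such as `@briefing` does not keep `brief` alive.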
* fix(daemon): stage @-composed skills' side files alongside the active skill
codex PR #955 review: composing a per-turn `@`-picked skill into the
system prompt appended its body (with the `withSkillRootPreamble`
guidance pointing at relative paths under `<cwd>/.od-skills/<folder>/`)
but never staged the actual folder. `startChatRun` only copied
`activeSkillDir`, so when the project's primary skill was different
(or absent) the composed skill's references/, examples/, and scripts/
files lived only at their absolute repo path — agents that honour
the cwd-relative form (or that don't get `--add-dir`, e.g. Codex with
allowlisted gpt-image projects) couldn't reach them.
Thread the composed skills' dirs out of `composeDaemonSystemPrompt`
as `extraSkillDirs` and stage each one through the same
`stageActiveSkill` API used for the primary skill. Dedupe by folder
basename so a project whose primary skill is also `@`-composed isn't
copied twice. Each preamble already advertises its own folder, so the
prompt and the staged tree stay aligned without further changes.
Co-authored-by: Cursor <cursoragent@cursor.com>
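As an illustration of the dedupe-by-basename step described above, the sketch below picks which skill folders to stage. `dirsToStage` and `folderBasename` are hypothetical names, and the path handling is deliberately simplified (a real implementation would more likely use Node's `path.basename`).

```typescript
// Last path segment of a directory, tolerating both `/` and `\` separators.
function folderBasename(dir: string): string {
  const parts = dir.split(/[\\\/]+/).filter((part) => part.length > 0);
  return parts.length > 0 ? parts[parts.length - 1] : "";
}

// Return the directories to stage: the active skill first, then each composed
// skill whose folder basename has not been seen yet.
function dirsToStage(activeSkillDir: string | null, extraSkillDirs: string[]): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  const candidates = activeSkillDir ? [activeSkillDir, ...extraSkillDirs] : extraSkillDirs;
  for (const dir of candidates) {
    const name = folderBasename(dir);
    if (name === "" || seen.has(name)) continue; // skip empty or duplicate folders
    seen.add(name);
    result.push(dir);
  }
  return result;
}
```

Because the active skill is placed first, a project whose primary skill is also `@`-composed keeps the primary copy and skips the duplicate.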
* fix(web): respect the Library disable toggle in the project @-mention picker
codex PR #955 review: only `EntryView` received `enabledSkills`
(filtered against `config.disabledSkills`); active projects still
got `skills={skills}` raw, so a skill the user disabled in Settings
kept appearing in the project's `@`-mention popover and could ride
along to the daemon via `skillIds`. That broke the Library toggle
for any project opened on the post-split branch.
Compute a functional-skills-only enabled subset
(`enabledFunctionalSkills`) and pass it into `<ProjectView>` instead.
Templates stay separate — design-templates are filtered through their
own `enabledDesignTemplates` memo for the Templates gallery — so
ProjectView's chat composer still only sees skills, never templates,
matching the pre-split prop surface.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(e2e): mock /api/design-templates for example-use-prompt flow
The Templates tab in EntryView fetches from /api/design-templates after
the skills/design-templates split (specs/current/skills-and-design-templates.md).
The example-use-prompt Playwright scenario only mocked /api/skills, so the
gallery card never appeared and the test timed out waiting on
example-card-warm-utility-example. Serve the same fixture summary on both
endpoints so the templates gallery renders the card the test clicks.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(tools-pack): create design-templates fixture for resources test
The packaging resources copy now bundles the new design-templates tree
alongside skills (see resources.ts BUNDLED_RESOURCE_TREES). The
copyBundledResourceTrees fixture only created skills, design-systems,
craft, etc., so the recursive copy crashed with ENOENT on
design-templates before it could check the prompt-templates assertion.
Add the missing fixture directory so the test exercises the same set
of resource trees the packaged build does.
Co-authored-by: Cursor <cursoragent@cursor.com>
* fix(skills): clone built-in side files into the shadow on first edit
mrcfps PR #955 review: editing a built-in skill wrote a USER_SKILLS_DIR
shadow folder that contained only a new SKILL.md. The next listSkills()
pass surfaced the shadow as the active dir, but every side-file resolver
(/api/skills/:id/files, /example, /assets/*, the system-prompt preamble,
and the per-turn cwd staging) reads through skill.dir. With nothing but
SKILL.md in the shadow, the bundled assets/, references/, scripts/, and
examples/ disappeared the moment the user hit save — a built-in like
last30days or live-artifact would break immediately after edit instead
of just having its body overridden.
Teach updateUserSkill() to take a `sourceDir` and clone every entry
except SKILL.md / dotfiles into the shadow on the very first edit. The
shadow stays self-contained, so all the resolvers keep working without
fallback bookkeeping. Subsequent edits detect the existing shadow and
skip the clone, so user tweaks under the side tree survive a re-save.
Wire `sourceDir: skill.dir` from server.ts's PUT /api/skills/:id handler
and add two regression tests:
- 'clones built-in side files into the shadow on the first edit' walks
the file tree after save and asserts assets/template.html, references/
notes.md, and scripts/helper.sh all round-trip from the built-in.
- 'preserves user-edited side files on subsequent edits' edits the
staged assets/template.html, re-saves, and confirms the user content
is still there.
Co-authored-by: Cursor <cursoragent@cursor.com>
* test(e2e): rename home tab from Examples to Templates
The Examples tab was renamed to Templates in EntryView (b5993385's
skills/design-templates split — entry.tabExamples became entry.tabTemplates
and the tab value moved from 'examples' to 'templates'), but
entry-chrome-flows still asserted the old label and testId. Update both.
* fix(skills+web): preserve template body in API mode and dir-based skill delete
Two follow-ups from PR #955 review:
1. ProjectView only received `enabledFunctionalSkills`, but
`composedSystemPrompt()` still resolved `project.skillId` through that
prop and `fetchSkill()`. Projects created from the new
`/api/design-templates` surface keep a template id in `project.skillId`,
so opening one in API mode dropped the template body from the system
prompt and the upstream request ran without the project's primary
template instructions. Now ProjectView takes a separate
`designTemplates` prop (the unfiltered template list, so a
later-disabled template still loads for projects already created from
it) and `composedSystemPrompt()` plus the metadata / `isDeck` lookups
fall back to that list, with `fetchDesignTemplate()` as the body-fetch
fallback to `fetchSkill()`. The chat composer's `@`-picker keeps
receiving only the enabled functional skills.
2. `DELETE /api/skills/:id` used `deleteUserSkill(USER_SKILLS_DIR, skill.id)`
which re-slugified the frontmatter id and removed
`<userSkillsDir>/<slug>/`. That matched the import shape but missed the
install shape — `installFromTarget` writes the folder at
`sanitizeRepoName(url)` (GitHub) or `path.basename(realpath)` (local
symlink), neither of which is guaranteed to equal the slugified
frontmatter `name`. A duplicate `app.delete('/api/skills/:id', ...)`
handler at the install routes never fired because Express resolved the
earlier registration first, leaving the install/uninstall path without
working teardown. The handler now removes `skill.dir` (the absolute
path listSkills already discovered) under a USER_SKILLS_DIR safety
check, using `lstat` + `unlinkSync` so symlinked local installs unlink
cleanly without recursing into the user's source tree. The dead
duplicate handler is removed; `deleteUserSkill` is dropped from the
server.ts import set (still exported and unit-tested in skills.ts).
Regression coverage in `apps/daemon/tests/skills-delete-route.test.ts`
pins both shapes plus the symlink-preserves-source case.
* test(daemon): point hyperframes system-prompt test at design-templates
The merge with main brought in a hyperframes system-prompt test that
reads `skills/hyperframes/SKILL.md`, but this branch's split moved
`hyperframes` into `design-templates/` (same migration as `live-artifact`
already handled above in this file). CI was failing with ENOENT on the
old path.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
250 lines
14 KiB
HTML
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Auth Service · Runbook</title>
<style>
:root {
--bg: #0c0e14;
--paper: #14171f;
--paper-2: #1c2030;
--ink: #eaecf3;
--muted: #8b94ad;
--line: #262b3b;
--accent: #6ee7b7;
--accent-soft: rgba(110,231,183,0.1);
--warn: #fbbf24;
--danger: #f87171;
--display: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', sans-serif;
--body: -apple-system, BlinkMacSystemFont, 'Segoe UI', Inter, sans-serif;
--mono: ui-monospace, 'JetBrains Mono', SFMono-Regular, Menlo, monospace;
}
* { box-sizing: border-box; }
body { margin: 0; background: var(--bg); color: var(--ink); font-family: var(--body); font-size: 14px; line-height: 1.6; }
.page { max-width: 1100px; margin: 0 auto; padding: 32px 28px 64px; }

/* Header */
.head { display: flex; justify-content: space-between; align-items: flex-end; padding-bottom: 24px; border-bottom: 1px solid var(--line); margin-bottom: 28px; }
.head-left { display: flex; flex-direction: column; gap: 6px; }
.crumb { font-family: var(--mono); font-size: 11.5px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.06em; }
h1 { font-family: var(--display); font-size: 36px; margin: 4px 0; font-weight: 700; letter-spacing: -0.02em; }
.head-meta { font-family: var(--mono); font-size: 11.5px; color: var(--muted); }
.head-meta span { color: var(--accent); }
.pill {
display: inline-flex; align-items: center; gap: 6px; padding: 5px 12px; border-radius: 999px;
font-family: var(--mono); font-size: 11px; text-transform: uppercase; letter-spacing: 0.06em; font-weight: 600;
}
.pill.tier { background: var(--accent-soft); color: var(--accent); border: 1px solid rgba(110,231,183,0.3); }
.pill .dot { width: 6px; height: 6px; border-radius: 50%; background: var(--accent); }

section { margin-top: 40px; }
h2 { font-family: var(--display); font-size: 22px; margin: 0 0 14px; letter-spacing: -0.005em; font-weight: 700; }
h2 .index { font-family: var(--mono); font-size: 12px; color: var(--muted); margin-right: 12px; vertical-align: middle; }

/* Summary */
.summary { display: grid; grid-template-columns: 1.4fr 1fr; gap: 14px; }
.panel { padding: 22px 24px; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; }
.panel p { margin: 0 0 12px; }
.panel p:last-child { margin: 0; }
.deps h3 { font-family: var(--mono); font-size: 11px; text-transform: uppercase; letter-spacing: 0.08em; color: var(--muted); margin: 0 0 10px; font-weight: 500; }
.deps ul { padding: 0; margin: 0; list-style: none; display: flex; flex-direction: column; gap: 8px; font-family: var(--mono); font-size: 12.5px; }
.deps li { display: flex; justify-content: space-between; padding: 8px 12px; background: var(--paper-2); border-radius: 6px; }
.deps li .ok { color: var(--accent); }
.deps li .warn { color: var(--warn); }

/* Tables */
table { width: 100%; border-collapse: collapse; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; overflow: hidden; }
th, td { text-align: left; padding: 12px 16px; border-bottom: 1px solid var(--line); font-size: 13px; vertical-align: top; }
th { font-family: var(--mono); font-size: 10.5px; text-transform: uppercase; letter-spacing: 0.06em; color: var(--muted); background: var(--paper-2); }
tr:last-child td { border-bottom: none; }
td.code, .panel code { font-family: var(--mono); }
.sev { display: inline-flex; align-items: center; gap: 6px; padding: 3px 9px; border-radius: 4px; font-family: var(--mono); font-size: 10.5px; text-transform: uppercase; letter-spacing: 0.04em; font-weight: 600; }
.sev-1 { background: rgba(248,113,113,0.15); color: var(--danger); }
.sev-2 { background: rgba(251,191,36,0.15); color: var(--warn); }
.sev-3 { background: rgba(110,231,183,0.15); color: var(--accent); }

/* Procedure cards */
.procs { display: flex; flex-direction: column; gap: 14px; }
.proc { padding: 18px 22px; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; }
.proc-head { display: flex; justify-content: space-between; align-items: baseline; margin-bottom: 10px; }
.proc-head h3 { margin: 0; font-family: var(--display); font-size: 17px; }
.proc-head .when { font-family: var(--mono); font-size: 11px; color: var(--muted); }
pre { background: var(--paper-2); border: 1px solid var(--line); border-radius: 8px; padding: 14px 16px; overflow-x: auto; font-family: var(--mono); font-size: 12.5px; line-height: 1.6; color: #cdd6f4; margin: 8px 0 0; }
pre .cmt { color: var(--muted); }
pre .var { color: var(--warn); }
pre .ok { color: var(--accent); }

/* On-call */
.rota { background: var(--paper); border: 1px solid var(--line); border-radius: 12px; overflow: hidden; }

/* Checklist */
.checklist { display: grid; grid-template-columns: 1fr 1fr; gap: 14px; }
.step { padding: 18px 20px; background: var(--paper); border: 1px solid var(--line); border-radius: 12px; display: flex; gap: 16px; align-items: flex-start; }
.step-num { flex: 0 0 36px; width: 36px; height: 36px; border-radius: 50%; background: var(--accent); color: var(--bg); display: inline-flex; align-items: center; justify-content: center; font-weight: 700; font-family: var(--display); font-size: 16px; }
.step h4 { margin: 0 0 6px; font-family: var(--display); font-size: 15px; }
.step p { margin: 0; color: var(--muted); font-size: 13px; }
.step code { font-family: var(--mono); background: var(--paper-2); padding: 2px 6px; border-radius: 4px; font-size: 12px; color: var(--accent); }

footer { margin-top: 56px; padding-top: 18px; border-top: 1px solid var(--line); display: flex; justify-content: space-between; font-family: var(--mono); font-size: 11.5px; color: var(--muted); }

@media (max-width: 880px) {
.summary, .checklist { grid-template-columns: 1fr; }
h1 { font-size: 26px; }
}
</style>
</head>
<body>
<div class="page">
<header class="head">
<div class="head-left">
<div class="crumb">Northwind / Identity / Auth</div>
<h1>auth-service</h1>
<div class="head-meta">Owned by <span>@identity-platform</span> · v4.7.2 · Last reviewed 14 Oct 2025</div>
</div>
<span class="pill tier"><span class="dot"></span>Tier 0 · production-critical</span>
</header>

<section>
<h2><span class="index">01</span>Service summary</h2>
<div class="summary">
<div class="panel">
<p><strong>auth-service</strong> issues, validates, and revokes session tokens for every Northwind product surface — web, mobile, and the public API. It owns the password store, the TOTP/WebAuthn enrollments, and the audit-log writer for all auth events.</p>
<p>If <code>auth-service</code> is down, customers cannot log in or refresh sessions. Existing valid sessions continue to work for their TTL (15 minutes) but no new auth happens.</p>
</div>
<div class="panel deps">
<h3>Dependencies</h3>
<ul>
<li><span>Postgres · auth-db</span><span class="ok">healthy</span></li>
<li><span>Redis · session-cache</span><span class="ok">healthy</span></li>
<li><span>KMS · auth-keyring</span><span class="ok">healthy</span></li>
<li><span>SES · transactional</span><span class="warn">degraded</span></li>
<li><span>Pager · oncall.northwind</span><span class="ok">healthy</span></li>
</ul>
</div>
</div>
</section>

<section>
<h2><span class="index">02</span>Alerts you might wake up to</h2>
<table>
<thead><tr><th>Alert</th><th>Severity</th><th>What it means</th><th>First response</th></tr></thead>
<tbody>
<tr>
<td class="code">auth.login_5xx_rate > 1%</td>
<td><span class="sev sev-1">SEV-1</span></td>
<td>Login endpoint returning errors. Customers are locked out.</td>
<td>Check Postgres + Redis dashboards. Roll back last deploy if &lt; 30 min old.</td>
</tr>
<tr>
<td class="code">auth.token_refresh_lag_p95 > 800ms</td>
<td><span class="sev sev-2">SEV-2</span></td>
<td>Refresh path is slow. Web app starts to feel sluggish.</td>
<td>Inspect Redis CPU + connection count. Scale read replicas if needed.</td>
</tr>
<tr>
<td class="code">auth.signup_failure > 10/min</td>
<td><span class="sev sev-2">SEV-2</span></td>
<td>New signups are failing. Often SES bounces or SMTP auth.</td>
<td>Check SES bounce rate. Failover transactional queue to backup region.</td>
</tr>
<tr>
<td class="code">auth.kms_signing_errors > 0</td>
<td><span class="sev sev-1">SEV-1</span></td>
<td>KMS can't sign session tokens. New logins fail; existing sessions OK.</td>
<td>Page the security team. Do not roll keys without a security engineer.</td>
</tr>
<tr>
<td class="code">auth.audit_writer_backlog > 5k</td>
<td><span class="sev sev-3">SEV-3</span></td>
<td>Audit log writer is falling behind. Compliance impact.</td>
<td>Drain manually. Open a ticket; not a wake-up.</td>
</tr>
</tbody>
</table>
</section>

<section>
<h2><span class="index">03</span>Common procedures</h2>
<div class="procs">
<div class="proc">
<div class="proc-head"><h3>Deploy a new version</h3><span class="when">Use during business hours</span></div>
<p>Deploys are blue/green. The script waits for two consecutive healthchecks before promoting traffic.</p>
<pre><span class="cmt"># Deploy auth-service v4.7.3 to production</span>
$ nw deploy auth-service --tag <span class="var">v4.7.3</span> --env production

<span class="cmt"># Wait for two consecutive healthchecks (~90 s), then promote.</span>
$ nw deploy promote auth-service --env production
<span class="ok">→ traffic shifted: 10% / 50% / 100%</span></pre>
</div>
<div class="proc">
<div class="proc-head"><h3>Roll back to last known good</h3><span class="when">Use when error rate > 1% post-deploy</span></div>
<pre><span class="cmt"># Rolls back to the previously promoted version, no rebuild.</span>
$ nw deploy rollback auth-service --env production
<span class="ok">→ rolled back to v4.7.2 in 38 s</span></pre>
</div>
<div class="proc">
<div class="proc-head"><h3>Rotate signing keys</h3><span class="when">Schedule with security; never solo</span></div>
<pre><span class="cmt"># 1. Generate the new signing key in KMS</span>
$ nw kms create-key --alias auth-signing-<span class="var">$(date +%Y%m%d)</span>

<span class="cmt"># 2. Mark the new key as the primary; old key remains valid for 24h</span>
$ nw kms set-primary auth-signing --key <span class="var">&lt;arn&gt;</span>

<span class="cmt"># 3. After 24h, schedule deletion of the previous key</span>
$ nw kms schedule-deletion auth-signing --key <span class="var">&lt;old-arn&gt;</span> --days 30</pre>
</div>
<div class="proc">
<div class="proc-head"><h3>Drain audit-log backlog</h3><span class="when">Use when audit_writer_backlog alert fires</span></div>
<pre>$ nw exec auth-service -- bin/audit-drain --batch <span class="var">5000</span>
<span class="ok">→ drained 4,812 entries in 12 s; backlog now 0</span></pre>
</div>
</div>
</section>

<section>
<h2><span class="index">04</span>On-call rotation · this month</h2>
<table class="rota">
<thead><tr><th>Week</th><th>Primary</th><th>Secondary</th><th>Backup (escalation)</th></tr></thead>
<tbody>
<tr><td>Oct 27 – Nov 02</td><td>Devon Park</td><td>Priya Banerjee</td><td>Sasha Lin</td></tr>
<tr><td>Nov 03 – Nov 09</td><td>Caleb Renner</td><td>Devon Park</td><td>Sasha Lin</td></tr>
<tr><td>Nov 10 – Nov 16</td><td>Priya Banerjee</td><td>Caleb Renner</td><td>Mira Reddy</td></tr>
<tr><td>Nov 17 – Nov 23</td><td>Sasha Lin</td><td>Priya Banerjee</td><td>Mira Reddy</td></tr>
</tbody>
</table>
</section>

<section>
<h2><span class="index">05</span>Incident response — first 30 minutes</h2>
<div class="checklist">
<div class="step">
<div class="step-num">1</div>
<div><h4>Acknowledge the page within 5 min.</h4><p>Type <code>/ack</code> in <code>#incidents-auth</code>. The bot stops re-paging and tags the on-call.</p></div>
</div>
<div class="step">
<div class="step-num">2</div>
<div><h4>Open the incident channel.</h4><p>Run <code>/incident open auth-service "&lt;short title&gt;"</code>. Slack bot creates a dedicated channel and pages the secondary.</p></div>
</div>
<div class="step">
<div class="step-num">3</div>
<div><h4>Post a status snapshot.</h4><p>Customer-impact in one line, what you know, what you're checking next. Re-post every 10 minutes.</p></div>
</div>
<div class="step">
<div class="step-num">4</div>
<div><h4>Mitigate before you diagnose.</h4><p>If a recent deploy is suspect, roll back. If KMS is degraded, fail open is <em>never</em> the answer for auth — escalate to security.</p></div>
</div>
<div class="step">
<div class="step-num">5</div>
<div><h4>Hand off or stand down.</h4><p>If you can't resolve in 30 min, hand to the secondary. When healthy, close with <code>/incident close</code>; postmortem is owed within 5 business days.</p></div>
</div>
</div>
</section>

<footer>
<span>Northwind Identity Platform · runbook v3.2</span>
<span>Source: ops-docs/auth-service.md</span>
</footer>
</div>
</body>
</html>