open-design/apps/daemon/src/installation.ts
lefarcen 6bb0f0fd91
feat(observability): web lifecycle telemetry + stable installationId migration (#2527)
* feat(observability): web lifecycle telemetry + stable installationId migration

Two intertwined safety-telemetry additions for the 0.8.0 release.

Web lifecycle observability
---------------------------
New `apps/web/src/observability/` module installed at module load via
client-app.tsx — alongside the existing error-tracking exception hooks
from #2521. Reuses error-tracking's direct-fetch transport (the same
consent-bypass + early-buffer guarantees) so every event flows even when
the user has opted out of general analytics:

  - client_long_task         PerformanceObserver longtask >100ms (real
                             "feels janky" signal, FPS proxy)
  - client_white_screen      app fails to mount after 5s; MutationObserver
                             cancels the timer the moment the React root
                             renders so a normal boot is zero events
  - client_resource_error    capture-phase window.error catches failed
                             <script>/<link>/<img>/<iframe> loads
                             (chunk-load failures, broken artifact refs)
  - client_boot_timing       navigation start → load timings via
                             Navigation Timing Level 2
  - client_visibility_change visibilitychange + page lifetime
  - client_session_summary   real foreground duration emitted on pagehide
  - client_run_stuck         5min watchdog on SSE runs that don't progress
                             (#2464 / #2405 / #1451 in data form)
  - client_iframe_error      FileViewer iframe load failures (iframe
                             errors don't bubble to window, so the global
                             resource-error observer can't see them)
  - desktop_renderer_crash   Electron main observes render-process-gone
                             and forwards to daemon /api/observability/event
  - daemon_uncaught_exception
    daemon_unhandled_rejection
                             process-level handlers on the daemon
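
As an illustration of the client_run_stuck entry above, a watchdog of this kind can be sketched as follows. This is a minimal sketch, not the code in `apps/web/src/observability/`: the `RunWatchdog` class, its injected clock, and the "report once per stall" behaviour are all assumptions made for the example.

```typescript
// Hypothetical watchdog for illustration only; names are not from the PR.
const STUCK_AFTER_MS = 5 * 60 * 1000; // the 5min threshold described above

class RunWatchdog {
  private readonly now: () => number;
  private readonly onStuck: (idleMs: number) => void;
  private lastProgressAt: number;
  private reported = false;

  constructor(now: () => number, onStuck: (idleMs: number) => void) {
    this.now = now;
    this.onStuck = onStuck;
    this.lastProgressAt = now();
  }

  // Call whenever the SSE run delivers a message.
  progress(): void {
    this.lastProgressAt = this.now();
    this.reported = false;
  }

  // Call periodically (for example from setInterval).
  // Reports at most once per stall (assumed behaviour).
  check(): void {
    const idleMs = this.now() - this.lastProgressAt;
    if (!this.reported && idleMs >= STUCK_AFTER_MS) {
      this.reported = true;
      this.onStuck(idleMs);
    }
  }
}
```

Injecting the clock keeps the sketch testable without real timers; a real implementation would more likely tie `check()` to a timer and clear it when the run ends.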

error-tracking.ts is generalised: `reportSafetyEvent(name, props)` now
exposes the same buffer + direct-fetch transport that `reportHandledException`
used, with identical $exception wire shape preserved for the existing
exception path.

Daemon cross-process bridge
---------------------------
New `AnalyticsService.captureSafety()` skips the consent re-check and
posts via posthog-node with installationId as distinct_id. Wired into:

  - `POST /api/observability/event` for desktop main and any future
    helper process that needs to ship a safety event (no consent check —
    same contract as web's direct-fetch path)
  - `process.on('uncaughtException')` / `unhandledRejection` on the
    daemon itself

Stable installationId across reinstalls (critical for 0.8.0 rollout)
--------------------------------------------------------------------
installationId previously lived in `<namespace>/data/app-config.json`,
so a packaged reinstall that churned the namespace token (or any future
namespace-scoped data wipe) rotated the id and the user showed up as a
brand-new PostHog person. This is the immediate trigger: when 0.8.0
ships, every 0.7.x user upgrading would silently double the user count.

New module `apps/daemon/src/installation.ts` reads/writes
`<installationDir>/installation.json` at the channel root. The daemon
gets the path from `OD_INSTALLATION_DIR`, set by
`apps/packaged/src/sidecars.ts` to `paths.installationRoot`
(one level above `namespaces/` — e.g.
`~/Library/Application Support/Open Design Nightly/` on mac).

`readAppConfig` transparently merges: if installation.json has an id it
wins; if only app-config.json has one (the 0.7.x state), it gets mirrored
to installation.json on the next read. `writeAppConfig` mirrors any
explicit installationId write, including the null-clear path used by
Settings → "Delete my data". 7 call sites of readAppConfig keep their
signatures unchanged.
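
The precedence rule above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the actual `readAppConfig` code: `mergeInstallationId` and `MergeResult` are hypothetical names introduced only for this example.

```typescript
// Hypothetical helper illustrating the merge precedence; not the real readAppConfig.
interface MergeResult {
  installationId: string | undefined;
  // True when the legacy app-config.json id should be copied into installation.json.
  mirrorToInstallationFile: boolean;
}

function mergeInstallationId(
  fromInstallationFile: string | undefined,
  fromAppConfig: string | undefined,
): MergeResult {
  // installation.json is authoritative whenever it carries an id.
  if (fromInstallationFile) {
    return { installationId: fromInstallationFile, mirrorToInstallationFile: false };
  }
  // 0.7.x state: only app-config.json has an id, so adopt it and mirror it.
  if (fromAppConfig) {
    return { installationId: fromAppConfig, mirrorToInstallationFile: true };
  }
  // Fresh install: neither file has an id yet.
  return { installationId: undefined, mirrorToInstallationFile: false };
}
```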

Survives:
  - same-channel reinstall (DMG drag-replace, NSIS reinstall)
  - namespace churn between packaged builds
  - per-namespace data reset (future installer that clears `<ns>/data/`)

Still rotates (intentionally):
  - explicit "Delete my data"
  - manual `rm -rf ~/"Library/Application Support/Open Design <Channel>/"`
  - different channel (Stable vs Nightly stay distinct because userData
    paths differ; that's the existing channel-isolation contract)

What this changes for posthog-js
--------------------------------
client.ts already sets `capture_exceptions: false` (from #2521), and that
stays as is. autocapture / $pageview / $autocapture / track() / daemon
analyticsService.capture() are all unchanged. New events are additive.

Validation
----------
  - pnpm guard                              pass
  - pnpm typecheck                          whole repo pass
  - pnpm --filter @open-design/web test     200 files / 1824 tests
  - pnpm --filter @open-design/daemon test  251 files / 2981 tests
    (includes 10 new tests in installation.test.ts pinning the 0.7.x →
    0.8.0 migration, namespace-wipe survival, delete-my-data clear, and
    fresh-id rotation)
  - pnpm --filter @open-design/packaged test 9 files / 89 tests
  - Pre-existing baseline: apps/desktop/src/main/updater.ts has typecheck
    references to RELEASE_CHANNEL_NAMES.PREVIEW/NIGHTLY on release/v0.8.0;
    unrelated to this PR.

* fix(observability): preserve fatal exit on uncaught + skip loading shell in white-screen check

Addresses codex review on PR #2527 (Siri-Ray).

1) Daemon process handlers must keep Node fatal semantics

Installing an uncaughtException listener silences Node's default
crash/exit; Node 15+ does the same for unhandledRejection when a
listener is present. The previous handlers logged telemetry and let
control return to the event loop, leaving a corrupted daemon serving
requests instead of letting the supervisor restart it cleanly.

triggerFatalShutdown() now:
  - dispatches captureSafety once (guarded against re-entry from
    cascading faults)
  - races posthog-node's shutdown against a 1s bounded timeout so a
    slow flush can't keep the process alive
  - calls process.exit(1) after the race resolves
Both uncaughtException and unhandledRejection route through it.

apps/daemon/tests/uncaught-fatal-shutdown.test.ts pins:
  - captureSafety is invoked exactly once even on repeated faults
  - exit(1) fires on the happy path
  - exit(1) still fires when shutdown hangs past the timeout
  - exit(1) still fires when captureSafety itself throws

2) White-screen detector treated the loading shell as a successful mount

apps/web/app/[[...slug]]/client-app.tsx renders the dynamic-import
fallback as <div class="od-loading-shell">Loading Open Design…</div>
whose visible text (19 chars) exceeded the previous 10-char floor.
monitorMount() would therefore cancel the 5s timer the instant Next
swapped the loading shell in, completely missing the white-screen
signal the observer is meant to add.

isAppMounted() now:
  - primary signal: <html data-od-app-mounted="1"> set by App.tsx's
    first useEffect — authoritative because once App has mounted at
    least once, any later tree crash is an $exception story, not a
    white-screen story
  - fallback: only counts children of the root container whose
    classList does NOT include known loading-shell markers
    (od-loading-shell). Their visible text drives the > MIN_VISIBLE_TEXT
    check, so the loading sentinel can never be mistaken for a mount.
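
The two-tier check above can be sketched without a DOM. This is a rough sketch under assumptions, not the shipped `isAppMounted()`: `ChildLike` and `RootLike` are hypothetical stand-ins for DOM nodes, and `MIN_VISIBLE_TEXT = 10` is assumed from the "previous 10-char floor" mentioned earlier.

```typescript
// Hypothetical structural stand-ins for DOM nodes so the sketch runs outside a browser.
interface ChildLike {
  classNames: string[];
  textContent: string;
}
interface RootLike {
  children: ChildLike[];
}

const LOADING_SHELL_CLASSES = ['od-loading-shell'];
// Assumed value; the real constant may differ.
const MIN_VISIBLE_TEXT = 10;

function isAppMountedSketch(mountedAttr: string | null, root: RootLike | null): boolean {
  // Primary signal: <html data-od-app-mounted="1">.
  if (mountedAttr === '1') return true;
  if (!root) return false;
  // Fallback: only text from children that are not loading-shell markers counts.
  const visibleText = root.children
    .filter((child) => !child.classNames.some((name) => LOADING_SHELL_CLASSES.includes(name)))
    .map((child) => child.textContent.trim())
    .join('');
  return visibleText.length > MIN_VISIBLE_TEXT;
}
```

In a browser the inputs would come from `document.documentElement.getAttribute('data-od-app-mounted')` and the root container's children.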

apps/web/tests/observability/white-screen.test.ts pins:
  - fires client_white_screen when only the loading shell is present
    after the timeout
  - does NOT fire when data-od-app-mounted is set before the timeout
  - cancels the timer the moment a real workspace-shell child appears
    alongside the loading shell
  - still fires when only sub-MIN_VISIBLE_TEXT non-shell content is
    present (effectively blank)

Validation:
  - pnpm guard pass
  - pnpm typecheck pass
  - pnpm --filter @open-design/daemon test  252 files / 2985 tests
  - pnpm --filter @open-design/web test     201 files / 1828 tests

* fix(observability): await captureSafety enqueue before fatal shutdown flush

Addresses second-pass codex review on PR #2527 (Siri-Ray, 3279268246).

The previous fatal-shutdown path called `analyticsService.captureSafety()`
synchronously and immediately raced `analyticsService.shutdown()` against
the bounded timeout. captureSafety in apps/daemon/src/analytics.ts does
its real `client.capture()` call only inside an async IIFE after
`await readInstallationIdSafe()` — so shutdown could win the race,
drain an empty posthog-node queue, and let `process.exit(1)` run BEFORE
the daemon crash event ever got enqueued. We'd then preserve the
process-lifecycle contract but lose the exact signal this PR is adding.

Changes:

  - AnalyticsService.captureSafety now returns Promise<void>. The async
    IIFE is gone; the body awaits readInstallationIdSafe directly so the
    returned promise resolves only AFTER client.capture() has been
    invoked (which is when posthog-node's local buffer contains the
    event).
  - server.ts triggerFatalShutdown awaits captureSafety, then calls
    shutdown, and races that whole sequence against the 1s bounded
    timeout. Capture failures still don't block exit (try/catch around
    the await).
  - NOOP_SERVICE.captureSafety becomes `async () => undefined` to
    match the new signature.
  - Fire-and-forget callers (/api/observability/event) are unaffected;
    voiding the returned promise keeps them non-blocking.
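
The resulting ordering (capture, then shutdown, all raced against a bounded timeout, then exit) can be sketched as below. This is a simplified sketch, not the code in server.ts: `createFatalShutdown` and the `FatalDeps` shape are hypothetical, with the analytics service and `process.exit` injected so the example stays self-contained.

```typescript
// Hypothetical dependency bundle; in the daemon these would be the analytics
// service methods and process.exit.
interface FatalDeps {
  captureSafety: (name: string, props: Record<string, unknown>) => Promise<void>;
  shutdown: () => Promise<void>;
  exit: (code: number) => void;
  timeoutMs?: number; // defaults to the 1s bound described above
}

function createFatalShutdown(deps: FatalDeps) {
  let triggered = false;
  return async function triggerFatalShutdown(event: string, error: unknown): Promise<void> {
    // Re-entry guard: cascading faults must not dispatch a second event.
    if (triggered) return;
    triggered = true;

    const flush = (async () => {
      try {
        // Wait until the event is enqueued before asking the client to flush.
        await deps.captureSafety(event, { message: String(error) });
      } catch {
        // A capture failure must not block the exit.
      }
      await deps.shutdown();
    })().catch(() => undefined);

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<void>((resolve) => {
      timer = setTimeout(resolve, deps.timeoutMs ?? 1000);
    });

    // Whichever settles first wins; a hung capture or flush cannot keep the process alive.
    await Promise.race([flush, timeout]);
    if (timer !== undefined) clearTimeout(timer);
    deps.exit(1);
  };
}
```

Wiring would look roughly like `process.on('uncaughtException', (err) => void trigger('daemon_uncaught_exception', err))`, with the same function registered for `unhandledRejection`.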

apps/daemon/tests/uncaught-fatal-shutdown.test.ts adds the reviewer-
requested fixture:

  - 'waits for the captureSafety promise to settle before invoking
    shutdown' — gives capture a 50ms delay and shutdown a separate 50ms
    delay so the intermediate "capture done / shutdown not yet" state
    is observable.
  - 'still aborts and exits if captureSafety hangs past the bounded
    timeout' — captureSafety never resolves; the outer 1s timeout still
    forces process.exit(1).

Validation:
  - pnpm guard                                pass
  - pnpm typecheck                            whole repo pass
  - pnpm --filter @open-design/daemon test    252 files / 2987 tests
2026-05-21 15:37:48 +08:00


// Channel-root installation identity.
//
// `installationId` was historically stored in `<dataRoot>/app-config.json`,
// which lives at `<userData>/namespaces/<namespace>/data/app-config.json`
// in packaged builds. Two reinstall scenarios then silently rotated the
// id:
//
//   1. **Namespace churn.** If a packaged build bakes a different
//      `namespace` token than the previous version, the daemon writes to
//      a different `<namespace>/data/` subtree and the old installationId
//      is invisible. The user shows up in PostHog as a brand-new person.
//
//   2. **Clean reinstall.** Even when the namespace is stable, anything
//      that wipes `<userData>/namespaces/<ns>/data/` (a future installer
//      that resets per-namespace data, a manual `rm -rf`) takes the id
//      down with it.
//
// To preserve person continuity across both, we mirror the id into a
// stable file at the **channel root** — one level above the namespaces
// directory — and treat that file as authoritative on read. The legacy
// app-config.json field is still written so any code path that reads it
// directly (legacy / future fallbacks) keeps seeing the same value.
//
// Locations:
//
//   packaged (mac):   ~/Library/Application Support/Open Design Nightly/installation.json
//   packaged (win):   %APPDATA%/Open Design Nightly/installation.json
//   packaged (linux): $XDG_CONFIG_HOME/Open Design Nightly/installation.json
//   tools-dev / OSS:  <dataDir>/installation.json (no namespace concept; fall back to dataDir)
//
// `OD_INSTALLATION_DIR` is the env override. Packaged sidecars.ts sets it
// to the channel root explicitly; everything else falls back to dataDir
// (where it sits next to app-config.json and behaves like the legacy
// path — fine for dev because dev doesn't have namespace churn).
import { mkdir, readFile, writeFile } from 'node:fs/promises';
import { dirname, join } from 'node:path';

/**
 * Wire shape persisted at `<installationDir>/installation.json`.
 * Kept intentionally narrow: only fields that need to survive a
 * namespace-scoped data-dir wipe belong here.
 */
export interface InstallationFile {
  installationId?: string;
  // Future fields (privacy decision timestamp, telemetry flags) can join
  // this list as soon as we have a use case for "they must outlive a
  // namespace reset". Today, only installationId carries that contract.
}

export function resolveInstallationDir(dataDir: string): string {
  const env = process.env.OD_INSTALLATION_DIR;
  if (env && env.length > 0) return env;
  return dataDir;
}

function installationFilePath(installationDir: string): string {
  return join(installationDir, 'installation.json');
}

export async function readInstallationFile(
  installationDir: string,
): Promise<InstallationFile> {
  try {
    const raw = await readFile(installationFilePath(installationDir), 'utf8');
    const parsed: unknown = JSON.parse(raw);
    if (parsed == null || typeof parsed !== 'object' || Array.isArray(parsed)) {
      return {};
    }
    const obj = parsed as Record<string, unknown>;
    const out: InstallationFile = {};
    if (typeof obj.installationId === 'string' && obj.installationId.length > 0) {
      out.installationId = obj.installationId;
    }
    return out;
  } catch (err) {
    const e = err as { code?: string; name?: string };
    if (e.code === 'ENOENT') return {};
    if (e.name === 'SyntaxError') return {};
    // Anything else (permission denied, EIO) — treat as empty so the
    // fallback path through app-config.json keeps the daemon alive.
    return {};
  }
}

// Serialize writes to the same installationDir so concurrent persists
// can't truncate each other. Mirrors the writeAppConfig lock strategy.
const writeLocks = new Map<string, Promise<unknown>>();

/**
 * Patch shape for {@link writeInstallationFile}.
 *
 * Distinct from `Partial<InstallationFile>` because `exactOptionalPropertyTypes`
 * blocks `{ installationId: undefined }` and we explicitly need a way to
 * **clear** the id (Settings → "Delete my data", or an explicit null write
 * via `writeAppConfig`). The convention here:
 *
 *   - field present with a non-empty string → assign the new value
 *   - field present with null / empty string → delete the field on disk
 *   - field absent → leave the existing value alone
 */
export type InstallationFilePatch = {
  installationId?: string | null;
};

export async function writeInstallationFile(
  installationDir: string,
  patch: InstallationFilePatch,
): Promise<InstallationFile> {
  const prev = writeLocks.get(installationDir) ?? Promise.resolve();
  const task = prev.catch(() => undefined).then(() => doWrite(installationDir, patch));
  writeLocks.set(installationDir, task);
  try {
    return await task;
  } finally {
    if (writeLocks.get(installationDir) === task) writeLocks.delete(installationDir);
  }
}

async function doWrite(
  installationDir: string,
  patch: InstallationFilePatch,
): Promise<InstallationFile> {
  const existing = await readInstallationFile(installationDir);
  const next: InstallationFile = { ...existing };
  if (Object.prototype.hasOwnProperty.call(patch, 'installationId')) {
    if (typeof patch.installationId === 'string' && patch.installationId.length > 0) {
      next.installationId = patch.installationId;
    } else {
      delete next.installationId;
    }
  }
  await mkdir(dirname(installationFilePath(installationDir)), { recursive: true });
  // The file is small and is written rarely: on consent, on delete-my-data,
  // and when a legacy app-config.json id is mirrored. We deliberately don't
  // use a temp-file + rename dance: after a partial write,
  // `readInstallationFile` returns `{}` and the caller falls back to
  // app-config.json, the same fallback used when the file doesn't exist yet.
  await writeFile(
    installationFilePath(installationDir),
    JSON.stringify(next, null, 2) + '\n',
    'utf8',
  );
  return next;
}
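
The three patch cases documented on `InstallationFilePatch` can be illustrated in isolation. This is a minimal sketch that mirrors only the in-memory part of `doWrite`, with no file I/O or locking; `applyPatch` and the `*Sketch` types are hypothetical names introduced so the example is self-contained.

```typescript
// Stand-in types mirroring the declarations above (hypothetical names).
interface InstallationFileSketch {
  installationId?: string;
}
type InstallationFilePatchSketch = {
  installationId?: string | null;
};

// Applies a patch in memory using the same present / null / absent convention.
function applyPatch(
  existing: InstallationFileSketch,
  patch: InstallationFilePatchSketch,
): InstallationFileSketch {
  const next: InstallationFileSketch = { ...existing };
  if (Object.prototype.hasOwnProperty.call(patch, 'installationId')) {
    if (typeof patch.installationId === 'string' && patch.installationId.length > 0) {
      next.installationId = patch.installationId; // assign the new value
    } else {
      delete next.installationId; // null or empty string clears the field
    }
  }
  return next; // an absent field leaves the existing value alone
}

const current: InstallationFileSketch = { installationId: 'abc' };
const assigned = applyPatch(current, { installationId: 'def' }); // { installationId: 'def' }
const cleared = applyPatch(current, { installationId: null }); // {}
const untouched = applyPatch(current, {}); // { installationId: 'abc' }
```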