mirror of
https://github.com/nexu-io/open-design.git
synced 2026-06-01 03:14:35 +07:00
* feat(observability): web lifecycle telemetry + stable installationId migration

Two intertwined safety-telemetry additions for the 0.8.0 release.

Web lifecycle observability
---------------------------

New `apps/web/src/observability/` module installed at module load via client-app.tsx, alongside the existing error-tracking exception hooks from #2521. Reuses error-tracking's direct-fetch transport (the same consent-bypass + early-buffer guarantees) so every event flows even when the user has opted out of general analytics:

- client_long_task: PerformanceObserver longtask >100ms (real "feels janky" signal, FPS proxy)
- client_white_screen: app fails to mount after 5s; MutationObserver cancels the timer the moment the React root renders so a normal boot is zero events
- client_resource_error: capture-phase window.error catches failed <script>/<link>/<img>/<iframe> loads (chunk-load failures, broken artifact refs)
- client_boot_timing: navigationStart → load timings via Navigation Timing v2
- client_visibility_change: visibilitychange + page lifetime
- client_session_summary: real foreground duration emitted on pagehide
- client_run_stuck: 5min watchdog on SSE runs that don't progress (#2464 / #2405 / #1451 in data form)
- client_iframe_error: FileViewer iframe load failures (iframe errors don't bubble to window, so the global resource-error observer can't see them)
- desktop_renderer_crash: Electron main observes render-process-gone and forwards to daemon /api/observability/event
- daemon_uncaught_exception / daemon_unhandled_rejection: process-level handlers on the daemon

error-tracking.ts is generalised: `reportSafetyEvent(name, props)` now exposes the same buffer + direct-fetch transport that `reportHandledException` used, with identical $exception wire shape preserved for the existing exception path.

Daemon cross-process bridge
---------------------------

New `AnalyticsService.captureSafety()` skips the consent re-check and posts via posthog-node with installationId as distinct_id. Wired into:

- `POST /api/observability/event` for desktop main and any future helper process that needs to ship a safety event (no consent check; same contract as web's direct-fetch path)
- `process.on('uncaughtException')` / `unhandledRejection` on the daemon itself

Stable installationId across reinstalls (critical for 0.8.0 rollout)
--------------------------------------------------------------------

installationId previously lived in `<namespace>/data/app-config.json`, so a packaged reinstall that churned the namespace token (or any future namespace-scoped data wipe) rotated the id and the user showed up as a brand-new PostHog person. This is the immediate trigger: when 0.8.0 ships, every 0.7.x user upgrading would silently double the user count.

New module `apps/daemon/src/installation.ts` reads/writes `<installationDir>/installation.json` at the channel root. The daemon gets the path from `OD_INSTALLATION_DIR`, set by `apps/packaged/src/sidecars.ts` to `paths.installationRoot` (one level above `namespaces/`, e.g. `~/Library/Application Support/Open Design Nightly/` on mac).

`readAppConfig` transparently merges: if installation.json has an id it wins; if only app-config.json has one (the 0.7.x state), it gets mirrored to installation.json on the next read. `writeAppConfig` mirrors any explicit installationId write, including the null-clear path used by Settings → "Delete my data". 7 call sites of readAppConfig keep their signatures unchanged.
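
The read-side merge rule described above can be sketched as a pure function. This is an illustration only: `mergeInstallationId`, `MergeResult`, and the two file shapes are hypothetical names, not the actual exports of `app-config.ts` or `installation.ts`, and the real code performs file I/O around this decision.

```typescript
// Hypothetical shapes; the real on-disk types are not shown in this commit message.
interface InstallationFile {
  installationId?: string;
}
interface AppConfigFile {
  installationId?: string | null;
}

interface MergeResult {
  // The id that a read would return to callers.
  installationId: string | null | undefined;
  // Set only when a legacy id should be backfilled into installation.json.
  mirrorToInstallation?: string;
}

function mergeInstallationId(
  install: InstallationFile,
  appConfig: AppConfigFile,
): MergeResult {
  // installation.json wins whenever it holds an id.
  if (typeof install.installationId === 'string') {
    return { installationId: install.installationId };
  }
  // 0.7.x state: only app-config.json has the id, so serve it and mirror it.
  if (typeof appConfig.installationId === 'string') {
    return {
      installationId: appConfig.installationId,
      mirrorToInstallation: appConfig.installationId,
    };
  }
  // Neither file has an id: pass through null (cleared) or undefined (cold boot).
  return { installationId: appConfig.installationId };
}
```

Keeping the decision separate from the I/O is a choice made for this sketch so the precedence rule can be checked without touching the filesystem.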

Survives:

- same-channel reinstall (DMG drag-replace, NSIS reinstall)
- namespace churn between packaged builds
- per-namespace data reset (future installer that clears `<ns>/data/`)

Still rotates (intentionally):

- explicit "Delete my data"
- manual `rm -rf "~/Library/Application Support/Open Design <Channel>/"`
- different channel (Stable vs Nightly stay distinct because userData paths differ; that's the existing channel-isolation contract)

What this changes for posthog-js
--------------------------------

client.ts had `capture_exceptions: false` from #2521; nothing else changes. autocapture / $pageview / $autocapture / track() / daemon analyticsService.capture() are all unchanged. New events are additive.

Validation
----------

- pnpm guard: pass
- pnpm typecheck: whole repo pass
- pnpm --filter @open-design/web test: 200 files / 1824 tests
- pnpm --filter @open-design/daemon test: 251 files / 2981 tests (includes 10 new tests in installation.test.ts pinning the 0.7.x → 0.8.0 migration, namespace-wipe survival, delete-my-data clear, and fresh-id rotation)
- pnpm --filter @open-design/packaged test: 9 files / 89 tests
- Pre-existing baseline: apps/desktop/src/main/updater.ts has typecheck references to RELEASE_CHANNEL_NAMES.PREVIEW/NIGHTLY on release/v0.8.0; unrelated to this PR.

* fix(observability): preserve fatal exit on uncaught + skip loading shell in white-screen check

Addresses codex review on PR #2527 (Siri-Ray).

1) Daemon process handlers must keep Node fatal semantics

Installing an uncaughtException listener silences Node's default crash/exit; Node 15+ does the same for unhandledRejection when a listener is present. The previous handlers logged telemetry and let control return to the event loop, leaving a corrupted daemon serving requests instead of letting the supervisor restart it cleanly.

triggerFatalShutdown() now:

- dispatches captureSafety once (guarded against re-entry from cascading faults)
- races posthog-node's shutdown against a 1s bounded timeout so a slow flush can't keep the process alive
- calls process.exit(1) after the race resolves

Both uncaughtException and unhandledRejection route through it.

apps/daemon/tests/uncaught-fatal-shutdown.test.ts pins:

- captureSafety is invoked exactly once even on repeated faults
- exit(1) fires on the happy path
- exit(1) still fires when shutdown hangs past the timeout
- exit(1) still fires when captureSafety itself throws

2) White-screen detector treated the loading shell as a successful mount

apps/web/app/[[...slug]]/client-app.tsx renders the dynamic-import fallback as <div class="od-loading-shell">Loading Open Design…</div> whose visible text (19 chars) exceeded the previous 10-char floor. monitorMount() would therefore cancel the 5s timer the instant Next swapped the loading shell in, completely missing the white-screen signal the observer is meant to add.

isAppMounted() now:

- primary signal: <html data-od-app-mounted="1"> set by App.tsx's first useEffect. This is authoritative because once App has mounted at least once, any later tree crash is an $exception story, not a white-screen story
- fallback: only counts children of the root container whose classList does NOT include known loading-shell markers (od-loading-shell). Their visible text drives the > MIN_VISIBLE_TEXT check, so the loading sentinel can never be mistaken for a mount.
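
The two-tier mount check above could look roughly like the sketch below. It works on a plain snapshot object instead of the live DOM so it can be read and run in isolation; `MountSnapshot`, `RootChild`, and the value of `MIN_VISIBLE_TEXT` are assumptions made for illustration, and the real `isAppMounted()` presumably reads `document` directly.

```typescript
// Hypothetical snapshot of the parts of the DOM the check cares about.
interface RootChild {
  classNames: string[]; // classList of one direct child of the root container
  visibleText: string;  // that child's visible text
}
interface MountSnapshot {
  appMountedAttr: string | null; // value of data-od-app-mounted on <html>
  rootChildren: RootChild[];
}

const LOADING_SHELL_CLASSES = ['od-loading-shell'];
// Assumed value; the commit message only says the previous floor was 10 chars.
const MIN_VISIBLE_TEXT = 10;

function isAppMounted(snapshot: MountSnapshot): boolean {
  // Primary signal: App has mounted at least once.
  if (snapshot.appMountedAttr === '1') return true;

  // Fallback: count visible text only from children that are not loading shells.
  const visibleChars = snapshot.rootChildren
    .filter((child) => !child.classNames.some((name) => LOADING_SHELL_CLASSES.includes(name)))
    .reduce((total, child) => total + child.visibleText.trim().length, 0);

  return visibleChars > MIN_VISIBLE_TEXT;
}
```

With this shape, a root that contains only the loading shell never counts as mounted, regardless of how long the shell's text is.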

apps/web/tests/observability/white-screen.test.ts pins:

- fires client_white_screen when only the loading shell is present after the timeout
- does NOT fire when data-od-app-mounted is set before the timeout
- cancels the timer the moment a real workspace-shell child appears alongside the loading shell
- still fires when only sub-MIN_VISIBLE_TEXT non-shell content is present (effectively blank)

Validation:

- pnpm guard: pass
- pnpm typecheck: pass
- pnpm --filter @open-design/daemon test: 252 files / 2985 tests
- pnpm --filter @open-design/web test: 201 files / 1828 tests

* fix(observability): await captureSafety enqueue before fatal shutdown flush

Addresses second-pass codex review on PR #2527 (Siri-Ray, 3279268246).

The previous fatal-shutdown path called `analyticsService.captureSafety()` synchronously and immediately raced `analyticsService.shutdown()` against the bounded timeout. captureSafety in apps/daemon/src/analytics.ts does its real `client.capture()` call only inside an async IIFE after `await readInstallationIdSafe()`, so shutdown could win the race, drain an empty posthog-node queue, and let `process.exit(1)` run BEFORE the daemon crash event ever got enqueued. We'd then preserve the process-lifecycle contract but lose the exact signal this PR is adding.

Changes:

- AnalyticsService.captureSafety now returns Promise<void>. The async IIFE is gone; the body awaits readInstallationIdSafe directly so the returned promise resolves only AFTER client.capture() has been invoked (which is when posthog-node's local buffer contains the event).
- server.ts triggerFatalShutdown awaits captureSafety, then calls shutdown, and races that whole sequence against the 1s bounded timeout. Capture failures still don't block exit (try/catch around the await).
- NOOP_SERVICE.captureSafety becomes `async () => undefined` to match the new signature.
- Fire-and-forget callers (/api/observability/event) are unaffected; voiding the returned promise keeps them non-blocking.

apps/daemon/tests/uncaught-fatal-shutdown.test.ts adds the reviewer-requested fixture:

- 'waits for the captureSafety promise to settle before invoking shutdown': gives capture a 50ms delay and shutdown a separate 50ms delay so the intermediate "capture done / shutdown not yet" state is observable.
- 'still aborts and exits if captureSafety hangs past the bounded timeout': captureSafety never resolves; the outer 1s timeout still forces process.exit(1).

Validation:

- pnpm guard: pass
- pnpm typecheck: whole repo pass
- pnpm --filter @open-design/daemon test: 252 files / 2987 tests
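
The final shape of the fatal path (capture first, then shutdown, the whole sequence raced against a bounded timeout, then exit) might be sketched as follows. `createFatalShutdown` and `FatalDeps` are hypothetical names introduced here, and the dependencies are injected only so the sketch is runnable on its own; the real `triggerFatalShutdown` in server.ts is not shown in this message and may be structured differently.

```typescript
// Hypothetical dependency bundle; in the daemon these would be the analytics
// service methods and process.exit.
interface FatalDeps {
  captureSafety: () => Promise<void>;
  shutdown: () => Promise<void>;
  exit: (code: number) => void;
  timeoutMs?: number; // defaults to the 1s bound described above
}

function createFatalShutdown(deps: FatalDeps): () => Promise<void> {
  let triggered = false; // re-entry guard for cascading faults

  return async function triggerFatalShutdown(): Promise<void> {
    if (triggered) return;
    triggered = true;

    // Enqueue the event before flushing, so shutdown never drains an empty queue.
    const flush = (async () => {
      try {
        await deps.captureSafety();
      } catch {
        // A capture failure must not block the exit.
      }
      await deps.shutdown();
    })().catch(() => undefined);

    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<void>((resolve) => {
      timer = setTimeout(resolve, deps.timeoutMs ?? 1000);
    });

    // Whichever settles first wins; a hung capture or flush cannot delay exit.
    await Promise.race([flush, timeout]);
    if (timer !== undefined) clearTimeout(timer);
    deps.exit(1);
  };
}
```

Both process-level handlers would route through the single returned function, which is what makes the exactly-once guarantee hold across repeated faults.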
153 lines
6.5 KiB
TypeScript
import { mkdir, mkdtemp, readFile, rm, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

import { afterEach, beforeEach, describe, expect, it } from 'vitest';

import { readAppConfig, writeAppConfig } from '../src/app-config.js';
import {
  readInstallationFile,
  writeInstallationFile,
} from '../src/installation.js';

/**
 * The contract these tests pin down is the **0.7.x → 0.8.0 person
 * continuity guarantee**: an existing user upgrading from a daemon that
 * only wrote to `app-config.json` must not produce a new PostHog person
 * after the upgrade. They also pin the "namespace churn / clean
 * reinstall" survival path: when `<dataDir>/app-config.json` disappears
 * but `<installationDir>/installation.json` still has the id, reads must
 * return that id so the next event ships with the same distinct_id.
 *
 * Two directories per test:
 *   - `dataDir`: simulates `<namespace>/data/`
 *   - `installDir`: simulates the channel root (`<userData>/`)
 * They live as siblings so we can independently reset either side.
 */

let rootDir: string;
let dataDir: string;
let installDir: string;
const SAVED_INSTALL_ENV = process.env.OD_INSTALLATION_DIR;

beforeEach(async () => {
  rootDir = await mkdtemp(join(tmpdir(), 'od-install-test-'));
  dataDir = join(rootDir, 'namespace', 'data');
  installDir = join(rootDir, 'channel-root');
  await mkdir(dataDir, { recursive: true });
  await mkdir(installDir, { recursive: true });
  process.env.OD_INSTALLATION_DIR = installDir;
});

afterEach(async () => {
  if (SAVED_INSTALL_ENV == null) {
    delete process.env.OD_INSTALLATION_DIR;
  } else {
    process.env.OD_INSTALLATION_DIR = SAVED_INSTALL_ENV;
  }
  if (rootDir != null) {
    await rm(rootDir, { recursive: true, force: true });
  }
});

describe('installation.json migration', () => {
  it('reads installationId from app-config when installation.json is absent (0.7.x state)', async () => {
    await writeFile(
      join(dataDir, 'app-config.json'),
      JSON.stringify({ installationId: 'legacy-id-7' }),
      'utf8',
    );
    const cfg = await readAppConfig(dataDir);
    expect(cfg.installationId).toBe('legacy-id-7');
  });

  it('mirrors a legacy app-config installationId into installation.json on first read (0.7.x → 0.8.0)', async () => {
    await writeFile(
      join(dataDir, 'app-config.json'),
      JSON.stringify({ installationId: 'legacy-id-8' }),
      'utf8',
    );
    // No installation.json yet; readAppConfig should backfill it.
    await readAppConfig(dataDir);
    const persisted = await readInstallationFile(installDir);
    expect(persisted.installationId).toBe('legacy-id-8');
  });

  it('serves installation.json even when app-config.json was wiped (namespace churn / clean reinstall)', async () => {
    await writeInstallationFile(installDir, { installationId: 'stable-id' });
    // dataDir is empty: no app-config.json
    const cfg = await readAppConfig(dataDir);
    expect(cfg.installationId).toBe('stable-id');
  });

  it('prefers installation.json over a divergent value in app-config.json (post-migration writes win)', async () => {
    await writeInstallationFile(installDir, { installationId: 'new-id' });
    await writeFile(
      join(dataDir, 'app-config.json'),
      JSON.stringify({ installationId: 'old-id' }),
      'utf8',
    );
    const cfg = await readAppConfig(dataDir);
    expect(cfg.installationId).toBe('new-id');
  });

  it('mirrors installationId from writeAppConfig into installation.json so reinstalls survive', async () => {
    await writeAppConfig(dataDir, { installationId: 'fresh-id', telemetry: { metrics: true } });
    const installContents = JSON.parse(
      await readFile(join(installDir, 'installation.json'), 'utf8'),
    ) as { installationId: string };
    expect(installContents.installationId).toBe('fresh-id');
    // Legacy app-config is also kept current so any code path that reads
    // it directly (or a downgrade to 0.7.x) still sees the same id.
    const appConfigContents = JSON.parse(
      await readFile(join(dataDir, 'app-config.json'), 'utf8'),
    ) as { installationId: string };
    expect(appConfigContents.installationId).toBe('fresh-id');
  });

  it('does not touch installation.json on an unrelated writeAppConfig (no churn)', async () => {
    await writeInstallationFile(installDir, { installationId: 'untouched' });
    await writeAppConfig(dataDir, { telemetry: { metrics: true } });
    const persisted = await readInstallationFile(installDir);
    expect(persisted.installationId).toBe('untouched');
  });

  it('falls back to dataDir when OD_INSTALLATION_DIR is unset (dev / OSS / tools-dev paths)', async () => {
    delete process.env.OD_INSTALLATION_DIR;
    await writeAppConfig(dataDir, { installationId: 'devmode-id' });
    // With no override, the install file should land next to app-config.json.
    const persisted = JSON.parse(
      await readFile(join(dataDir, 'installation.json'), 'utf8'),
    ) as { installationId: string };
    expect(persisted.installationId).toBe('devmode-id');
  });

  it('returns an empty object when neither file exists (cold first boot)', async () => {
    const cfg = await readAppConfig(dataDir);
    expect(cfg.installationId).toBeUndefined();
    const persisted = await readInstallationFile(installDir);
    expect(persisted.installationId).toBeUndefined();
  });

  it('clears installation.json when "Delete my data" sets installationId to null', async () => {
    // First write a real id; both files now hold it.
    await writeAppConfig(dataDir, { installationId: 'before-delete' });
    expect((await readInstallationFile(installDir)).installationId).toBe('before-delete');
    // "Delete my data" path: PUT /api/app-config with installationId: null.
    // Without the mirror-on-clear in writeAppConfig, the next readAppConfig
    // would still serve `before-delete` from installation.json and the
    // user's reset action would silently no-op.
    await writeAppConfig(dataDir, { installationId: null });
    const persisted = await readInstallationFile(installDir);
    expect(persisted.installationId).toBeUndefined();
    const cfg = await readAppConfig(dataDir);
    expect(cfg.installationId).toBeNull();
  });

  it('overwrites installation.json when "Delete my data" rotates to a fresh id', async () => {
    await writeAppConfig(dataDir, { installationId: 'old' });
    await writeAppConfig(dataDir, { installationId: 'new' });
    const persisted = await readInstallationFile(installDir);
    expect(persisted.installationId).toBe('new');
  });
});