open-design/specs/current/maintainability-roadmap.md

Maintainability Roadmap

Purpose

This document captures the maintainability risks in the current apps/web + apps/daemon architecture and the recommended optimization path.

The architectural boundary stays unchanged:

  • apps/web: Next.js frontend and thin BFF/proxy layer.
  • apps/daemon: local runtime/backend for SQLite, .od filesystem state, AI agent CLI processes, and SSE streaming.

The first-principles maintainability goals are:

  • Understandability: engineers can locate behavior quickly and reason about data flow.
  • Changeability: common changes can be made with bounded blast radius.
  • Verifiability: contracts, tests, and types catch regressions early.
  • Isolation: high-risk capabilities are contained behind explicit boundaries.
  • Recoverability: failures produce actionable state, logs, and cleanup behavior.

Priority Scale

| Priority | Meaning |
| --- | --- |
| P0 | Blocks safe evolution or creates high-risk runtime/security failure modes. |
| P1 | Major maintainability risk that increases regression and debugging cost. |
| P2 | Medium-term risk that affects reliability, portability, or architecture clarity. |
| P3 | Supporting documentation/process improvement. |

Risk List and Optimization Plan

| ID | Priority | Risk | Evidence | Impact | Optimization Plan |
| --- | --- | --- | --- | --- | --- |
| R1 | P0 | Daemon TypeScript enforcement needs ongoing maintenance. | apps/daemon now typechecks across source and tests, and all @ts-nocheck suppressions have been removed. | New daemon payload, DB row, agent event, or task-state changes can reintroduce drift if they bypass shared contracts or typed boundaries. | Keep daemon source and tests under TypeScript enforcement; preserve zero @ts-nocheck; route shared API/SSE/error shapes through packages/contracts; add runtime validation for untrusted daemon inputs. |
| R2 | P0 | Web/daemon API contract is implicit. | apps/web calls daemon through /api/* rewrites; web has TypeScript types, daemon returns manually shaped JSON. | Field mismatches surface at runtime; API evolution is fragile. | Create packages/api-contract or an equivalent shared contract layer for request, response, error, and SSE event types. |
| R3 | P0 | Runtime validation is incomplete at the daemon boundary. | Daemon requests can trigger local filesystem access, SQLite writes, and child_process.spawn(). | Type correctness alone cannot protect against malformed runtime input, path traversal, invalid agent IDs, or unsafe args. | Add schema validation at HTTP boundaries with Zod/TypeBox; centralize validation for workspace paths, task IDs, agent IDs, models, reasoning options, uploaded files, and command arguments. |
| R4 | P0 | Local capability security boundary needs explicit rules. | Daemon owns high-permission capabilities: local files, .od, project workspaces, agent CLIs, and logs. | Unsafe path handling, broad command execution, token leakage, and unintended workspace access become possible failure modes. | Treat daemon as a capability server: bind to localhost, use workspace/path allowlists, normalize and jail paths, allowlist agent commands, and redact sensitive output. |
| R5 | P0 | Agent process lifecycle needs a first-class manager. | /api/chat spawns multiple agent runtimes and streams output to the frontend. | Zombie processes, cancellation gaps, orphaned tasks, inconsistent exit handling, and concurrent process conflicts. | Introduce a process/task manager with task state machine, cancellation, timeout, cleanup, exit code capture, signal handling, and concurrency limits. |
| R6 | P1 | server.ts is too monolithic. | apps/daemon/src/server.ts contains many routes plus orchestration, filesystem logic, streaming, uploads, and artifact handling. | Harder to understand, test, and change; unrelated edits share the same file and increase regression risk. | Split into thin routes plus services/adapters: routes/, services/, agents/, db/, fs/, streams/, artifacts/. |
| R7 | P1 | Error handling is inconsistent. | Handlers commonly use local try/catch and return ad hoc JSON errors. | UI receives inconsistent failures; logs lose context; task state can stall after partial failures. | Define a unified error model with code, message, details, retryable, and requestId/taskId; add centralized Express error middleware and adapter-level error mapping. |
| R8 | P1 | SSE protocol is under-specified. | Daemon manually writes text/event-stream events for agent output and status. | Frontend parsing is fragile; disconnect, heartbeat, terminal events, and error semantics can drift. | Version the SSE event contract and define canonical events such as task.started, task.output, task.error, task.completed, task.cancelled, and heartbeat. |
| R9 | P1 | SQLite schema and migration lifecycle need stronger guarantees. | apps/daemon/src/db.ts owns local better-sqlite3 tables and migrations. | Local user data upgrades can fail unpredictably; schema drift is hard to diagnose and recover. | Add explicit migration table, ordered forward migrations, startup migration checks, schema version logging, backup-before-migrate strategy, and migration tests. |
| R10 | P1 | Test coverage is thin around daemon behavior. | Existing daemon tests focus on stream parsing and artifact manifest behavior; HTTP/DB/spawn flows have limited coverage. | Changes are validated by manual testing; regressions in filesystem, SQLite, SSE, or agent mocks can ship. | Build layered tests: shared contract tests, route integration tests, service unit tests, SQLite migration tests, SSE parser tests, and agent mock integration tests. |
| R11 | P1 | Logging and observability are insufficient for local runtime debugging. | Agent execution involves long-lived tasks, subprocess output, filesystem state, and frontend SSE consumption. | User issues are hard to reproduce; failures lack correlated context. | Add structured logs with requestId, taskId, agentId, workspace, exit code, and duration; separate app logs from agent output; redact secrets. |
| R12 | P2 | Configuration, port, and health behavior can become fragile. | Web proxies /api/* to daemon; dev startup coordinates Next.js and daemon ports. | Port conflicts, daemon-not-ready states, and mismatched environment variables can break startup or distribution. | Centralize config resolution; expose /health; add daemon readiness checks; make port selection and UI fallback deterministic. |
| R13 | P2 | Cross-platform behavior is a recurring risk. | Daemon uses filesystem paths, SQLite native bindings, shell/process behavior, and signals. | macOS, Linux, and Windows/WSL can differ in path normalization, quoting, permissions, and process termination. | Use Node path APIs consistently, avoid shell string composition, isolate platform-specific process logic, and add CI coverage for supported platforms. |
| R14 | P2 | Framework migration can distract from core maintainability issues. | Current complexity is concentrated in FS/spawn/SSE/SQLite and module boundaries. | A framework rewrite can consume time while preserving the risky domain logic. | Keep Express for now; revisit Fastify only after TS, contracts, validation, tests, and modularization are in place and Express becomes a clear limiter. |
| R15 | P2 | Web/daemon boundary can erode over time. | Next.js has BFF capability and daemon has backend capability; future edits may blur ownership. | High-permission local runtime logic may leak into apps/web; deployment and security assumptions become unclear. | Document and enforce ownership: web handles UI/BFF/proxy; daemon owns local runtime capabilities; shared code contains contracts and pure logic only. |
| R16 | P3 | Operational documentation is incomplete. | Local-first daemon behavior depends on ports, .od, agent CLIs, runtime logs, and recovery flows. | Onboarding and support costs rise; troubleshooting relies on oral knowledge. | Document daemon architecture, API/SSE contract, task lifecycle, .od data layout, agent dependency checks, and common recovery procedures. |
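The "normalize and jail paths" and "allowlist" items in R3 and R4 can be illustrated with a small sketch. This is not the daemon's actual code: `resolveInsideWorkspace`, `parseAgentId`, and the agent IDs in `ALLOWED_AGENT_IDS` are hypothetical names chosen for illustration, and plain checks stand in for the Zod/TypeBox schemas the plan mentions so that the example has no dependencies.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve a client-supplied path against the workspace
// root and reject anything that would land outside of it.
function resolveInsideWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  const target = path.resolve(root, requested);
  const relative = path.relative(root, target);
  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`Path escapes workspace: ${requested}`);
  }
  return target;
}

// Placeholder IDs; a real allowlist would come from the daemon's agent registry.
const ALLOWED_AGENT_IDS = new Set(["agent-a", "agent-b"]);

// Hypothetical helper: accept only string IDs that are on the allowlist.
function parseAgentId(input: unknown): string {
  if (typeof input !== "string" || !ALLOWED_AGENT_IDS.has(input)) {
    throw new Error("Unknown agent ID");
  }
  return input;
}
```

The check is purely lexical. It does not follow symlinks, so a production version would probably also compare real paths (for example via `fs.realpath`) before touching the filesystem.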

Optimization Dependencies

The optimization work should proceed in dependency order. Some items can run in parallel once their prerequisites are stable.

| Workstream | Status | Optimization | Covers | Depends on | Output |
| --- | --- | --- | --- | --- | --- |
| W1 | Completed | Confirm architecture and capability boundaries | R4, R15 | | Written ownership rules for web, daemon, shared contracts, and dangerous local capabilities. See specs/current/architecture-boundaries.md. |
| W2 | Completed | Define API, SSE, and error contracts | R2, R7, R8 | W1 | packages/contracts now provides shared request/response types, SSE event unions, and error model helpers consumed by web and daemon. |
| W3 | Partial | Migrate project-owned code to TypeScript | R1 | W2 for highest-value shared types | Daemon source and tests have most @ts-nocheck suppressions removed. High-conflict daemon core files are deferred for follow-up to avoid blocking parallel PRs: apps/daemon/src/server.ts, apps/daemon/src/agents.ts, apps/daemon/src/projects.ts, apps/daemon/src/runs.ts, and apps/daemon/src/cli.ts. Remaining work is to migrate those files, restore full daemon typecheck coverage, and rerun pnpm --filter @open-design/daemon typecheck, pnpm typecheck, and pnpm guard. |
| W4 | Planned | Add runtime validation at daemon boundaries | R3, R4 | W2 | Schemas for HTTP requests, paths, agents, models, uploads, task IDs, and command args. |
| W5 | Planned | Modularize server.ts | R6 | W2, W3, W4 | Thin route handlers plus services/adapters for agents, DB, FS, streams, and artifacts. |
| W6 | Partial | Introduce agent process/task manager | R5, R8, R11 | W2, W5 | apps/daemon/src/runs.ts now provides an in-memory chat run service with run states, event replay, SSE streaming, cancellation, waiting, terminal cleanup, and exit metadata; critique also has an in-process run registry for interrupts. Remaining work is a unified agent process manager with explicit concurrency limits, stronger timeout/cleanup policy, and consistent lifecycle ownership across agent surfaces. |
| W7 | Planned | Strengthen SQLite migrations | R9 | W5 or a clear DB adapter boundary | Migration table, ordered migrations, startup checks, backup strategy, migration tests. |
| W8 | Partial | Build the daemon test pyramid | R10 | W2, W4, W5 | Daemon now has broad Vitest coverage under apps/daemon/tests/, including route, agent, DB, SSE, critique, live-artifact, connector, config, and filesystem behavior. Remaining work is to make the layers explicit: shared contract tests, route integration suites, service unit tests, migration tests, canonical SSE protocol tests, and mocked agent-process lifecycle tests. |
| W9 | Planned | Add structured logs and observability | R11 | W2, W6 | Correlated request/task logs, sanitized agent output, durations, exit status, and diagnostic context. |
| W10 | Partial | Harden config, port, and readiness behavior | R12 | W1 | Daemon exposes GET /api/health with basic { ok, version } health data. Remaining work is centralized config resolution, richer readiness checks, deterministic port behavior, and UI-visible daemon-not-ready handling. |
| W11 | Partial | Harden cross-platform behavior | R13 | W4, W5, W6 | Some process and path hardening exists, including shared platform command invocation and Windows command-line budget checks for agent CLIs. Remaining work is to formalize platform-specific process handling, path normalization rules, and supported-platform CI coverage. |
| W12 | Planned | Revisit HTTP framework choice | R14 | W2, W3, W4, W5, W8 | Evidence-based decision on whether Express remains adequate or Fastify provides clear net value. |
| W13 | Partial | Complete operational documentation | R16 | W1 through W11 as sections stabilize | Boundary and ownership documentation exists in AGENTS.md, apps/AGENTS.md, packages/AGENTS.md, and specs/current/architecture-boundaries.md. Remaining work is current-state daemon docs, API/SSE lifecycle docs, runbooks, troubleshooting guides, and recovery procedures. |

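
The run states, event replay, and cancellation described for W6, together with the canonical event names proposed in R8, can be sketched as a small state machine. This is an illustrative sketch and not the contents of apps/daemon/src/runs.ts: the `Run` class, the state names, the `TRANSITIONS` table, the event payload fields, and `toSseFrame` are all assumptions; only the event names come from R8.

```typescript
type RunState = "queued" | "running" | "completed" | "failed" | "cancelled";

// Event names follow R8; the payload fields are assumptions for illustration.
type RunEvent =
  | { type: "task.started"; taskId: string }
  | { type: "task.output"; taskId: string; chunk: string }
  | { type: "task.error"; taskId: string; message: string }
  | { type: "task.completed"; taskId: string; exitCode: number }
  | { type: "task.cancelled"; taskId: string }
  | { type: "heartbeat" };

// Allowed moves; terminal states have no outgoing transitions.
const TRANSITIONS: Record<RunState, readonly RunState[]> = {
  queued: ["running", "cancelled"],
  running: ["completed", "failed", "cancelled"],
  completed: [],
  failed: [],
  cancelled: [],
};

class Run {
  readonly taskId: string;
  state: RunState = "queued";
  readonly events: RunEvent[] = []; // retained so late subscribers can replay

  constructor(taskId: string) {
    this.taskId = taskId;
  }

  // Applies a transition only if it is legal from the current state.
  private transition(next: RunState, event: RunEvent): boolean {
    if (!TRANSITIONS[this.state].includes(next)) return false;
    this.state = next;
    this.events.push(event);
    return true;
  }

  start(): boolean {
    return this.transition("running", { type: "task.started", taskId: this.taskId });
  }

  output(chunk: string): boolean {
    if (this.state !== "running") return false; // drop output after a terminal state
    this.events.push({ type: "task.output", taskId: this.taskId, chunk });
    return true;
  }

  // Maps a child process exit code to a terminal state.
  exit(exitCode: number): boolean {
    return exitCode === 0
      ? this.transition("completed", { type: "task.completed", taskId: this.taskId, exitCode })
      : this.transition("failed", { type: "task.error", taskId: this.taskId, message: `exit code ${exitCode}` });
  }

  cancel(): boolean {
    return this.transition("cancelled", { type: "task.cancelled", taskId: this.taskId });
  }
}

// Serializes one event as an SSE frame: event line, data line, blank line.
function toSseFrame(event: RunEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}
```

Keeping the legal transitions in one table means a late `cancel()` or a duplicate exit notification becomes a no-op instead of overwriting a terminal state. Because `JSON.stringify` escapes newlines, each event fits on a single `data:` line.
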

  • Phase 1: W1 -> W2 -> W3 -> W4
  • Phase 2: W5 -> W6 -> W7 -> W8
  • Phase 3: W9 -> W10 -> W11 -> W13
  • Phase 4: W12
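
The phase order can be checked mechanically against the "Depends on" column. The sketch below transcribes both into data and reports any workstream scheduled before one of its prerequisites. The names `dependsOn`, `phaseOrder`, and `findOrderViolations` are hypothetical. The qualified entries are simplified: W3 and W7 are reduced to W2 and W5, and "W1 through W11" is expanded into the full list.

```typescript
// Dependencies transcribed from the workstream table (simplified as noted above).
const dependsOn: Record<string, string[]> = {
  W1: [],
  W2: ["W1"],
  W3: ["W2"],
  W4: ["W2"],
  W5: ["W2", "W3", "W4"],
  W6: ["W2", "W5"],
  W7: ["W5"],
  W8: ["W2", "W4", "W5"],
  W9: ["W2", "W6"],
  W10: ["W1"],
  W11: ["W4", "W5", "W6"],
  W12: ["W2", "W3", "W4", "W5", "W8"],
  W13: Array.from({ length: 11 }, (_, i) => `W${i + 1}`), // W1 through W11
};

// Phases 1 to 4 flattened into a single sequence.
const phaseOrder = [
  "W1", "W2", "W3", "W4",
  "W5", "W6", "W7", "W8",
  "W9", "W10", "W11", "W13",
  "W12",
];

// Returns one message per workstream that appears before a dependency
// (or whose dependency is missing from the order entirely).
function findOrderViolations(order: string[], deps: Record<string, string[]>): string[] {
  const position = new Map<string, number>();
  order.forEach((id, index) => position.set(id, index));

  const violations: string[] = [];
  order.forEach((id, index) => {
    for (const dep of deps[id] ?? []) {
      const depIndex = position.get(dep);
      if (depIndex === undefined || depIndex > index) {
        violations.push(`${id} is scheduled before its dependency ${dep}`);
      }
    }
  });
  return violations;
}

console.log(findOrderViolations(phaseOrder, dependsOn)); // prints []
```

An empty result means the listed phases respect every transcribed dependency. Reordering the plan, for example moving W6 ahead of W5, would surface as a violation message.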

The core principle is to reduce risk before changing framework foundations: establish contracts, types, validation, and module boundaries first; then evaluate whether Express remains the right transport layer.