Maintainability Roadmap

Purpose

This document captures the maintainability risks in the current apps/web + apps/daemon architecture and the recommended optimization path.

The architectural boundary stays unchanged:

apps/web: Next.js frontend and thin BFF/proxy layer.
apps/daemon: local runtime/backend for SQLite, .od filesystem state, AI agent CLI processes, and SSE streaming.

The first-principles maintainability goals are:

Understandability: engineers can locate behavior quickly and reason about data flow.
Changeability: common changes can be made with bounded blast radius.
Verifiability: contracts, tests, and types catch regressions early.
Isolation: high-risk capabilities are contained behind explicit boundaries.
Recoverability: failures produce actionable state, logs, and cleanup behavior.

Priority Scale

Priority	Meaning
P0	Blocks safe evolution or creates high-risk runtime/security failure modes.
P1	Major maintainability risk that increases regression and debugging cost.
P2	Medium-term risk that affects reliability, portability, or architecture clarity.
P3	Supporting documentation/process improvement.

Risk List and Optimization Plan

ID	Priority	Risk	Evidence	Impact	Optimization Plan
R1	P0	Daemon TypeScript enforcement needs ongoing maintenance.	`apps/daemon` now typechecks across source and tests, and all `@ts-nocheck` suppressions have been removed.	New daemon payload, DB row, agent event, or task-state changes can reintroduce drift if they bypass shared contracts or typed boundaries.	Keep daemon source and tests under TypeScript enforcement; preserve zero `@ts-nocheck`; route shared API/SSE/error shapes through `packages/contracts`; add runtime validation for untrusted daemon inputs.
R2	P0	Web/daemon API contract is implicit.	`apps/web` calls daemon through `/api/*` rewrites; web has TypeScript types, daemon returns manually shaped JSON.	Field mismatches surface at runtime; API evolution is fragile.	Create `packages/api-contract` or an equivalent shared contract layer for request, response, error, and SSE event types.
R3	P0	Runtime validation is incomplete at the daemon boundary.	Daemon requests can trigger local filesystem access, SQLite writes, and `child_process.spawn()`.	Type correctness alone cannot protect against malformed runtime input, path traversal, invalid agent IDs, or unsafe args.	Add schema validation at HTTP boundaries with Zod/TypeBox; centralize validation for workspace paths, task IDs, agent IDs, models, reasoning options, uploaded files, and command arguments.
R4	P0	Local capability security boundary needs explicit rules.	Daemon owns high-permission capabilities: local files, `.od`, project workspaces, agent CLIs, and logs.	Unsafe path handling, broad command execution, token leakage, and unintended workspace access become possible failure modes.	Treat daemon as a capability server: bind to localhost, use workspace/path allowlists, normalize and jail paths, allowlist agent commands, and redact sensitive output.
R5	P0	Agent process lifecycle needs a first-class manager.	`/api/chat` spawns multiple agent runtimes and streams output to the frontend.	Without central ownership, zombie processes, cancellation gaps, orphaned tasks, inconsistent exit handling, and concurrent process conflicts can occur.	Introduce a process/task manager with a task state machine, cancellation, timeout, cleanup, exit code capture, signal handling, and concurrency limits.
R6	P1	`server.ts` is too monolithic.	`apps/daemon/src/server.ts` contains many routes plus orchestration, filesystem logic, streaming, uploads, and artifact handling.	Harder to understand, test, and change; unrelated edits share the same file and increase regression risk.	Split into thin routes plus services/adapters: `routes/`, `services/`, `agents/`, `db/`, `fs/`, `streams/`, `artifacts/`.
R7	P1	Error handling is inconsistent.	Handlers commonly use local `try/catch` and return ad hoc JSON errors.	UI receives inconsistent failures; logs lose context; task state can stall after partial failures.	Define a unified error model with `code`, `message`, `details`, `retryable`, and `requestId/taskId`; add centralized Express error middleware and adapter-level error mapping.
R8	P1	SSE protocol is under-specified.	Daemon manually writes `text/event-stream` events for agent output and status.	Frontend parsing is fragile; disconnect, heartbeat, terminal events, and error semantics can drift.	Version the SSE event contract and define canonical events such as `task.started`, `task.output`, `task.error`, `task.completed`, `task.cancelled`, and `heartbeat`.
R9	P1	SQLite schema and migration lifecycle need stronger guarantees.	`apps/daemon/src/db.ts` owns local `better-sqlite3` tables and migrations.	Local user data upgrades can fail unpredictably; schema drift is hard to diagnose and recover.	Add explicit migration table, ordered forward migrations, startup migration checks, schema version logging, backup-before-migrate strategy, and migration tests.
R10	P1	Test coverage is thin around daemon behavior.	Existing daemon tests focus on stream parsing and artifact manifest behavior; HTTP/DB/spawn flows have limited coverage.	Changes are validated by manual testing; regressions in filesystem, SQLite, SSE, or agent mocks can ship.	Build layered tests: shared contract tests, route integration tests, service unit tests, SQLite migration tests, SSE parser tests, and agent mock integration tests.
R11	P1	Logging and observability are insufficient for local runtime debugging.	Agent execution involves long-lived tasks, subprocess output, filesystem state, and frontend SSE consumption.	User issues are hard to reproduce; failures lack correlated context.	Add structured logs with `requestId`, `taskId`, `agentId`, `workspace`, exit code, and duration; separate app logs from agent output; redact secrets.
R12	P2	Configuration, port, and health behavior can become fragile.	Web proxies `/api/*` to daemon; dev startup coordinates Next.js and daemon ports.	Port conflicts, daemon-not-ready states, and mismatched environment variables can break startup or distribution.	Centralize config resolution; expose `/health`; add daemon readiness checks; make port selection and UI fallback deterministic.
R13	P2	Cross-platform behavior is a recurring risk.	Daemon uses filesystem paths, SQLite native bindings, shell/process behavior, and signals.	macOS, Linux, and Windows/WSL can differ in path normalization, quoting, permissions, and process termination.	Use Node path APIs consistently, avoid shell string composition, isolate platform-specific process logic, and add CI coverage for supported platforms.
R14	P2	Framework migration can distract from core maintainability issues.	Current complexity is concentrated in FS/spawn/SSE/SQLite and module boundaries.	A framework rewrite can consume time while preserving the risky domain logic.	Keep Express for now; revisit Fastify only after TypeScript enforcement, contracts, validation, tests, and modularization are in place and Express has become a clear limiting factor.
R15	P2	Web/daemon boundary can erode over time.	Next.js has BFF capability and daemon has backend capability; future edits may blur ownership.	High-permission local runtime logic may leak into `apps/web`; deployment and security assumptions become unclear.	Document and enforce ownership: web handles UI/BFF/proxy; daemon owns local runtime capabilities; shared code contains contracts and pure logic only.
R16	P3	Operational documentation is incomplete.	Local-first daemon behavior depends on ports, `.od`, agent CLIs, runtime logs, and recovery flows.	Onboarding and support costs rise; troubleshooting relies on undocumented, word-of-mouth knowledge.	Document daemon architecture, API/SSE contract, task lifecycle, `.od` data layout, agent dependency checks, and common recovery procedures.
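
As an illustration of R7 and R8, the unified error model and the canonical SSE events could be expressed as a single discriminated union. This is a minimal sketch, not the contents of `packages/contracts`: the error field names and event names come from the table above, while the payload fields (`chunk`, `exitCode`), the `v` version field, and the names `ApiError`, `TaskEvent`, `SSE_PROTOCOL_VERSION`, and `formatSseEvent` are assumptions made for the example.

```typescript
// Hypothetical error model: field names follow R7; optionality is an assumption.
interface ApiError {
  code: string;
  message: string;
  details?: unknown;
  retryable: boolean;
  requestId?: string;
  taskId?: string;
}

// Hypothetical event union: event names follow R8; payload fields are assumptions.
type TaskEvent =
  | { type: "task.started"; taskId: string }
  | { type: "task.output"; taskId: string; chunk: string }
  | { type: "task.error"; taskId: string; error: ApiError }
  | { type: "task.completed"; taskId: string; exitCode: number }
  | { type: "task.cancelled"; taskId: string }
  | { type: "heartbeat" };

// Assumed version marker so clients can detect protocol changes.
const SSE_PROTOCOL_VERSION = 1;

// Serialize one event as a text/event-stream frame.
// JSON.stringify escapes newlines, so the payload always fits on one `data:` line.
function formatSseEvent(event: TaskEvent): string {
  const data = JSON.stringify({ v: SSE_PROTOCOL_VERSION, ...event });
  return `event: ${event.type}\ndata: ${data}\n\n`;
}

// Exhaustive switch: adding a new event type without handling it here
// becomes a compile-time error via the `never` assignment.
function describeEvent(event: TaskEvent): string {
  switch (event.type) {
    case "task.started":
      return `task ${event.taskId} started`;
    case "task.output":
      return `task ${event.taskId} wrote ${event.chunk.length} chars`;
    case "task.error":
      return `task ${event.taskId} failed: ${event.error.code}`;
    case "task.completed":
      return `task ${event.taskId} exited with ${event.exitCode}`;
    case "task.cancelled":
      return `task ${event.taskId} cancelled`;
    case "heartbeat":
      return "heartbeat";
    default: {
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```

Because both sides would import the same union, a renamed field or a new event type surfaces as a type error in web and daemon rather than as a runtime parsing failure.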

Optimization Dependencies

The optimization work should proceed in dependency order. Some items can run in parallel once their prerequisites are stable.

Workstream	Status	Optimization	Covers	Depends on	Output
W1	Completed	Confirm architecture and capability boundaries	R4, R15	—	Written ownership rules for web, daemon, shared contracts, and dangerous local capabilities. See `specs/current/architecture-boundaries.md`.
W2	Completed	Define API, SSE, and error contracts	R2, R7, R8	W1	`packages/contracts` now provides shared request/response types, SSE event unions, and error model helpers consumed by web and daemon.
W3	Partial	Migrate project-owned code to TypeScript	R1	W2 for highest-value shared types	Daemon source and tests have most `@ts-nocheck` suppressions removed. High-conflict daemon core files are deferred for follow-up to avoid blocking parallel PRs: `apps/daemon/src/server.ts`, `apps/daemon/src/agents.ts`, `apps/daemon/src/projects.ts`, `apps/daemon/src/runs.ts`, and `apps/daemon/src/cli.ts`. Remaining work is to migrate those files, restore full daemon typecheck coverage, and rerun `pnpm --filter @open-design/daemon typecheck`, `pnpm typecheck`, and `pnpm guard`.
W4	Planned	Add runtime validation at daemon boundaries	R3, R4	W2	Schemas for HTTP requests, paths, agents, models, uploads, task IDs, and command args.
W5	Planned	Modularize `server.ts`	R6	W2, W3, W4	Thin route handlers plus services/adapters for agents, DB, FS, streams, and artifacts.
W6	Partial	Introduce agent process/task manager	R5, R8, R11	W2, W5	`apps/daemon/src/runs.ts` now provides an in-memory chat run service with run states, event replay, SSE streaming, cancellation, waiting, terminal cleanup, and exit metadata; critique also has an in-process run registry for interrupts. Remaining work is a unified agent process manager with explicit concurrency limits, stronger timeout/cleanup policy, and consistent lifecycle ownership across agent surfaces.
W7	Planned	Strengthen SQLite migrations	R9	W5 or a clear DB adapter boundary	Migration table, ordered migrations, startup checks, backup strategy, migration tests.
W8	Partial	Build the daemon test pyramid	R10	W2, W4, W5	Daemon now has broad Vitest coverage under `apps/daemon/tests/`, including route, agent, DB, SSE, critique, live-artifact, connector, config, and filesystem behavior. Remaining work is to make the layers explicit: shared contract tests, route integration suites, service unit tests, migration tests, canonical SSE protocol tests, and mocked agent-process lifecycle tests.
W9	Planned	Add structured logs and observability	R11	W2, W6	Correlated request/task logs, sanitized agent output, durations, exit status, and diagnostic context.
W10	Partial	Harden config, port, and readiness behavior	R12	W1	Daemon exposes `GET /api/health` with basic `{ ok, version }` health data. Remaining work is centralized config resolution, richer readiness checks, deterministic port behavior, and UI-visible daemon-not-ready handling.
W11	Partial	Harden cross-platform behavior	R13	W4, W5, W6	Some process and path hardening exists, including shared platform command invocation and Windows command-line budget checks for agent CLIs. Remaining work is to formalize platform-specific process handling, path normalization rules, and supported-platform CI coverage.
W12	Planned	Revisit HTTP framework choice	R14	W2, W3, W4, W5, W8	Evidence-based decision on whether Express remains adequate or Fastify provides clear net value.
W13	Partial	Complete operational documentation	R16	W1 through W11 as sections stabilize	Boundary and ownership documentation exists in `AGENTS.md`, `apps/AGENTS.md`, `packages/AGENTS.md`, and `specs/current/architecture-boundaries.md`. Remaining work is current-state daemon docs, API/SSE lifecycle docs, runbooks, troubleshooting guides, and recovery procedures.
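
For the path validation planned under W4 (and the "normalize and jail paths" rule in R4), one possible shape is a helper that resolves a requested path against the workspace root and rejects anything that lands outside it. This is a sketch under stated assumptions: `resolveInsideWorkspace` is a hypothetical name, not an existing daemon function, and the check is purely lexical, so it does not detect symlinks that point outside the workspace.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve `requested` under `workspaceRoot` and throw
// if the result escapes the root. Uses Node path APIs only (no string
// concatenation), so separators are handled per platform.
function resolveInsideWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  const target = path.resolve(root, requested);
  const relative = path.relative(root, target);

  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    // On Windows, a target on another drive yields an absolute "relative" path.
    path.isAbsolute(relative);

  if (escapes) {
    throw new Error(`Path escapes workspace: ${requested}`);
  }
  // Caveat: this is a lexical check. A stricter version would also compare
  // real paths (for example via fs.realpath) to account for symlinks.
  return target;
}
```

A route handler would call this before any filesystem access and map the thrown error to the unified error model instead of letting it surface as an ad hoc response.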

Recommended Execution Order

Phase 1: W1 -> W2 -> W3 -> W4
Phase 2: W5 -> W6 -> W7 -> W8
Phase 3: W9 -> W10 -> W11 -> W13
Phase 4: W12

The core principle is to reduce risk before changing framework foundations: establish contracts, types, validation, and module boundaries first; then evaluate whether Express remains the right transport layer.
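
The phase ordering can be checked mechanically against the "Depends on" column. The sketch below encodes the dependency table and the four phases as data and reports any workstream scheduled before one of its dependencies. Two simplifications are assumptions of the example: W7's "W5 or a clear DB adapter boundary" is treated as a dependency on W5, and W13's "W1 through W11" is expanded to the full list. The names `dependsOn`, `executionOrder`, and `findOrderViolations` are hypothetical.

```typescript
// Dependency table transcribed from the "Depends on" column (see assumptions above).
const dependsOn: Record<string, string[]> = {
  W1: [],
  W2: ["W1"],
  W3: ["W2"],
  W4: ["W2"],
  W5: ["W2", "W3", "W4"],
  W6: ["W2", "W5"],
  W7: ["W5"],
  W8: ["W2", "W4", "W5"],
  W9: ["W2", "W6"],
  W10: ["W1"],
  W11: ["W4", "W5", "W6"],
  W12: ["W2", "W3", "W4", "W5", "W8"],
  W13: ["W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8", "W9", "W10", "W11"],
};

// Phases 1 through 4 flattened into one sequence.
const executionOrder: string[] = [
  "W1", "W2", "W3", "W4",
  "W5", "W6", "W7", "W8",
  "W9", "W10", "W11", "W13",
  "W12",
];

// Return one message per workstream that is scheduled before a dependency
// (or whose dependency is missing from the order entirely).
function findOrderViolations(order: string[], deps: Record<string, string[]>): string[] {
  const position = new Map<string, number>();
  order.forEach((id, index) => position.set(id, index));

  const violations: string[] = [];
  order.forEach((id, index) => {
    for (const dep of deps[id] ?? []) {
      const depIndex = position.get(dep);
      if (depIndex === undefined || depIndex > index) {
        violations.push(`${id} is scheduled before its dependency ${dep}`);
      }
    }
  });
  return violations;
}

console.log(findOrderViolations(executionOrder, dependsOn));
```

Under these assumptions the recommended order produces no violations, and the same check can be rerun whenever a dependency or phase assignment changes.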
