Braide manages agent and terminal processes as child processes of the Next.js server. Each process type has its own startup, tracking, and shutdown behaviour. This page documents the full lifecycle and the guarantees the system provides on exit.
When an agent is enabled (from the UI or restored from persisted settings on server start), the process manager runs a three-stage pipeline:
- The agent is launched with `child_process.spawn()` using `detached: true` so the child gets its own process group. Stdio is piped for the ACP protocol transport.
- A `ClientSideConnection` is created over the piped stdio using the ndjson stream protocol. The `initialize` handshake negotiates protocol version and capabilities.

Each running agent is tracked as a `ManagedAgent` entry in a global `Map` keyed by agent ID. The map is stored on `globalThis` so it survives hot-module reloading during development.
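The `globalThis`-backed map can be sketched as below. This is a minimal illustration under stated assumptions: the property name `__braideAgents`, the trimmed `ManagedAgentSketch` shape, and the helper `getAgentRegistry` are hypothetical and not taken from the codebase.

```typescript
// Hypothetical, trimmed-down entry; the real ManagedAgent carries more fields.
interface ManagedAgentSketch {
  status: "downloading" | "starting" | "running" | "error" | "stopped";
  restartCount: number;
}

// Assumed property name; any unique key on globalThis would work.
type GlobalWithAgents = typeof globalThis & {
  __braideAgents?: Map<string, ManagedAgentSketch>;
};

// Return the shared map, creating it only on first access. A hot-reloaded
// module re-runs this code but finds the existing map and reuses it.
function getAgentRegistry(): Map<string, ManagedAgentSketch> {
  const g = globalThis as GlobalWithAgents;
  if (!g.__braideAgents) {
    g.__braideAgents = new Map<string, ManagedAgentSketch>();
  }
  return g.__braideAgents;
}
```

A module-level `const agents = new Map()` would be recreated on every reload, losing track of processes that are still running; hanging the map off `globalThis` avoids that.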
A managed agent tracks:
| Field | Purpose |
|---|---|
| `process` | The `ChildProcess` reference |
| `connection` | The ACP `ClientSideConnection` for protocol messages |
| `initResponse` | Negotiated capabilities from the `initialize` handshake |
| `scope` | An Effect `CloseableScope` whose finalizer kills the process tree |
| `status` | Current state: `downloading`, `starting`, `running`, `error`, or `stopped` |
| `knownSessions` | Set of ACP session IDs currently open on this agent |
| `restartCount` | Number of automatic restarts since last successful start |
If a running agent process exits unexpectedly (status is not stopped), the process manager automatically restarts it with exponential backoff:
| Attempt | Delay |
|---|---|
| 1st | 1 second |
| 2nd | 2 seconds |
| 3rd | 4 seconds |
After 3 failed restarts, the agent enters the error state and stays there until it is manually restarted. A successful ACP initialize resets the restart counter.
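The schedule in the table follows a doubling rule, which can be written as a small function. This is a sketch only: `restartDelayMs`, `BASE_DELAY_MS`, and `MAX_RESTARTS` are illustrative names, and the values are taken from the table above.

```typescript
const BASE_DELAY_MS = 1000; // the 1st restart waits 1 second
const MAX_RESTARTS = 3; // after this many failed restarts the agent gives up

// Returns the delay before restart attempt `attempt` (1-based), or null when
// the restart budget is exhausted and the agent should enter the error state.
function restartDelayMs(attempt: number): number | null {
  if (attempt < 1 || attempt > MAX_RESTARTS) return null;
  return BASE_DELAY_MS * Math.pow(2, attempt - 1);
}
```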
Agent shutdown follows a layered strategy with timeouts at each stage to prevent hangs:
All sessions tracked in knownSessions are closed via unstable_closeSession if the agent advertises the sessionCapabilities.close capability. This gives the agent a chance to persist state and clean up internal resources.
A 3-second timeout bounds this step. If the agent is unresponsive, session close is abandoned and shutdown proceeds.
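One common way to bound a step like this is to race it against a timer. The sketch below assumes a hypothetical helper `settleWithin`; the promise passed in stands for whatever closes the known sessions, and nothing here is taken from the actual implementation.

```typescript
// Resolves to true if `work` settles before the deadline, false otherwise.
// A rejection from `work` counts as settled, since shutdown proceeds either way.
async function settleWithin(work: Promise<unknown>, ms: number): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), ms);
  });
  const settled = work.then(
    () => true,
    () => true,
  );
  try {
    return await Promise.race([settled, deadline]);
  } finally {
    // Clear the timer so a fast close does not leave a pending timeout behind.
    clearTimeout(timer);
  }
}
```

With a 3000 ms limit, a `false` result corresponds to the "session close is abandoned" case described above.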
The Effect scope finalizer sends SIGTERM to the entire process group (using kill(-pid)), not just the direct child. This ensures any sub-processes the agent spawned (JVMs, language servers, etc.) also receive the signal.
If process group signalling fails (e.g. the group no longer exists), the signal falls back to the direct PID.
The system waits up to 5 seconds for the process to exit after receiving SIGTERM.
If the process has not exited after the grace period, SIGKILL is sent to the entire process group, forcing immediate termination of the agent and all its children.
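The escalation described above (group SIGTERM with a direct-PID fallback, a bounded wait, then group SIGKILL) can be sketched as follows. This is an illustration rather than Braide's implementation: `signalTree`, `terminateTree`, `KillFn`, and the injectable `kill` parameter are hypothetical, and the caller is assumed to pass a promise that resolves when the child emits `exit`.

```typescript
// Hypothetical signalling hook, injectable so the logic can be exercised
// without real processes.
type KillFn = (pid: number, signal: string) => void;

// Default: Node's process.kill, which targets a process group when pid is negative.
const nodeKill: KillFn = (pid, signal) => {
  process.kill(pid, signal);
};

// Signal the whole group; if that throws (e.g. the group no longer exists),
// fall back to the direct child.
function signalTree(pid: number, signal: string, kill: KillFn = nodeKill): void {
  try {
    kill(-pid, signal);
  } catch {
    try {
      kill(pid, signal);
    } catch {
      // The process has already exited; nothing left to signal.
    }
  }
}

// SIGTERM, wait up to graceMs for exit, then SIGKILL if still running.
async function terminateTree(
  pid: number,
  exited: Promise<void>,
  graceMs: number = 5000,
  kill: KillFn = nodeKill,
): Promise<"exited" | "killed"> {
  signalTree(pid, "SIGTERM", kill);
  let timer: ReturnType<typeof setTimeout> | undefined;
  const graceElapsed = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), graceMs);
  });
  const exitedInTime = await Promise.race([exited.then(() => true), graceElapsed]);
  clearTimeout(timer);
  if (exitedInTime) return "exited";
  signalTree(pid, "SIGKILL", kill);
  return "killed";
}
```

Group signalling only reaches the agent's sub-processes if the agent was spawned with `detached: true`, so that it leads its own process group.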
When the server itself is shutting down, a 10-second overall timeout ensures process.exit() fires even if the graceful shutdown sequence hangs. This timer is unref'd so it does not keep the event loop alive on its own.
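A force-exit timer of this kind might look like the sketch below. The helper name `armForceExit` and the default exit code of 1 are assumptions; the source does not state which code the forced exit uses.

```typescript
// Arm a timer that forces an exit if graceful shutdown stalls. Calling unref()
// means this timer alone will not keep the event loop, and so the process, alive.
function armForceExit(ms: number, exit: (code: number) => void, code: number = 1) {
  const timer = setTimeout(() => exit(code), ms);
  timer.unref();
  return timer;
}
```

In a server this could be armed as `armForceExit(10_000, (code) => process.exit(code))` at the start of shutdown. If the graceful path finishes first, the process exits normally and the timer never fires.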

    shutdown()
      │
      ├─ stopScheduleHeartbeat()
      ├─ stopArchivePruner()
      ├─ stopTerminalWs()
      ├─ start 10s force-exit timer (unref'd)
      │
      └─ shutdownAllAgents() ── runs all agents in parallel:
           │
           └─ gracefulStopAgent(id)
                │
                ├─ set status = "stopped", remove from map
                ├─ closeKnownSessions()  ── 3s timeout
                ├─ Scope.close()         ── SIGTERM to process group
                ├─ wait for exit         ── 5s timeout
                ├─ SIGKILL process group ── if still running
                └─ done

Terminal processes (PTY shells spawned for interactive use) follow a similar pattern with process group management.
Terminals are spawned via node-pty with a login shell. Each terminal is tracked by a composite key of project ID and session ID.
The same kill(-pid) process group technique is used, with a fallback to direct PID signalling.
Terminal cleanup handlers are registered on process.on("exit"), SIGINT, and SIGTERM to kill all terminal PTYs when the server shuts down. These handlers are guarded by a globalThis flag to prevent double-registration during hot reloads.
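The double-registration guard can be sketched like this. The flag name `__terminalCleanupRegistered` and the helper `registerTerminalCleanupOnce` are hypothetical, chosen only for illustration.

```typescript
// Assumed flag name; it lives on globalThis so it outlives module reloads.
type GlobalWithCleanupFlag = typeof globalThis & {
  __terminalCleanupRegistered?: boolean;
};

// Runs `register` the first time only. Returns true if handlers were installed
// by this call, false if an earlier module instance had already done so.
function registerTerminalCleanupOnce(register: () => void): boolean {
  const g = globalThis as GlobalWithCleanupFlag;
  if (g.__terminalCleanupRegistered) return false;
  g.__terminalCleanupRegistered = true;
  register();
  return true;
}
```

The `register` callback is where the `process.on("exit", ...)`, `SIGINT`, and `SIGTERM` listeners would be attached. Without the guard, every hot reload would add another set of listeners to the same `process` object.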
Client terminals (spawned for agent-initiated terminal sessions) use direct process management rather than a PTY.
The server registers handlers for multiple signals and error conditions:
| Signal / Event | Action |
|---|---|
| `SIGTERM` | Graceful shutdown (exit code 0) |
| `SIGINT` | Graceful shutdown (exit code 0) |
| `SIGUSR2` | Graceful shutdown (exit code 0); used by nodemon-style restarts |
| `uncaughtException` | Log fatal error, then shutdown (exit code 1) |
| `unhandledRejection` | Log fatal error, then shutdown (exit code 1) |
| `beforeExit` | Graceful shutdown if not already shutting down |
All handlers are idempotent — a shuttingDown flag prevents re-entry if multiple signals arrive.
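A re-entry guard of this kind can be sketched as below. `createShutdown` and its parameters are hypothetical; `cleanup` stands for the graceful sequence and `exit` for `process.exit`, both injected here so the logic is easy to follow in isolation.

```typescript
// Build a shutdown function that runs `cleanup` at most once, however many
// signals arrive. Resolves to true for the call that performed the shutdown
// and false for calls that found one already in progress.
function createShutdown(
  cleanup: () => Promise<void>,
  exit: (code: number) => void,
): (code: number) => Promise<boolean> {
  let shuttingDown = false;
  return async function shutdown(code: number): Promise<boolean> {
    if (shuttingDown) return false; // a shutdown is already in flight
    shuttingDown = true;
    try {
      await cleanup();
    } finally {
      exit(code); // exit even if cleanup throws
    }
    return true;
  };
}
```

Each signal handler would then call the same function, for example `process.on("SIGTERM", () => void shutdown(0))`, so a second `SIGINT` during cleanup is ignored instead of restarting the sequence.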
Both agent and terminal shutdown use process group killing to prevent orphaned child processes. The technique works as follows:
- Each child is spawned with `detached: true`, which creates a new process group with the child as the group leader.
- `process.kill(-pid, signal)` sends the signal to every process in the group (the negative PID targets the group, not a single process).
- If group signalling fails, the signal falls back to `process.kill(pid, signal)`, targeting the direct child.

This is important for agents like Junie that spawn JVM sub-processes: without process group killing, those sub-processes would be orphaned when the parent agent is terminated.