Process Lifecycle

Braide manages agent and terminal processes as child processes of the Next.js server. Each process type has its own startup, tracking, and shutdown behaviour. This page documents the full lifecycle and the guarantees the system provides on exit.

Agent Processes

Startup

When an agent is enabled (from the UI or restored from persisted settings on server start), the process manager runs a three-stage pipeline:

  1. Download: If the agent uses a binary distribution and the binary is not already cached, it is downloaded with progress reporting. NPX and UVX distributions skip this step.
  2. Spawn: The agent binary is launched with child_process.spawn() using detached: true so the child gets its own process group. Stdio is piped to carry the ACP transport.
  3. ACP Initialize: A ClientSideConnection is created over the piped stdio using the ndjson stream protocol. The initialize handshake negotiates the protocol version and capabilities.
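The Spawn stage and the ndjson framing can be sketched as follows. This is a minimal illustration under stated assumptions, not Braide's implementation: spawnAgent, encodeNdjson, and decodeNdjson are hypothetical names, and the decoder assumes it is handed complete lines.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical helper: launch an agent binary in its own process group
// with piped stdio, as the Spawn stage describes.
function spawnAgent(command: string, args: string[] = []): ChildProcess {
  return spawn(command, args, {
    detached: true,                  // child becomes leader of a new process group
    stdio: ["pipe", "pipe", "pipe"], // stdin/stdout carry messages, stderr carries logs
  });
}

// ndjson framing: exactly one JSON value per line.
function encodeNdjson(message: unknown): string {
  return JSON.stringify(message) + "\n";
}

// Assumes `chunk` ends on a line boundary. A real reader would buffer
// partial lines between chunks.
function decodeNdjson(chunk: string): unknown[] {
  return chunk
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```

A caller would write encodeNdjson(message) to the child's stdin and feed stdout data through a line-buffering decoder. In practice the ACP client library handles this framing itself.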

Tracking

Each running agent is tracked as a ManagedAgent entry in a global Map keyed by agent ID. The map is stored on globalThis so it survives hot-module reloading during development.
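A registry that survives hot-module reloading could look roughly like this. The key name __agentRegistry and the simplified entry type are assumptions made for illustration. The point is only that the Map lives on globalThis rather than in module scope, so a re-evaluated module finds the existing instance.

```typescript
// Simplified stand-in for the real ManagedAgent entry.
interface ManagedAgentStub {
  status: "downloading" | "starting" | "running" | "error" | "stopped";
}

// Hypothetical property name used to stash the map on globalThis.
const REGISTRY_KEY = "__agentRegistry";

function getAgentRegistry(): Map<string, ManagedAgentStub> {
  const globals = globalThis as unknown as Record<string, unknown>;
  // Create the map only once. Later module reloads reuse the same instance.
  if (!(globals[REGISTRY_KEY] instanceof Map)) {
    globals[REGISTRY_KEY] = new Map<string, ManagedAgentStub>();
  }
  return globals[REGISTRY_KEY] as Map<string, ManagedAgentStub>;
}
```

Module-level state would be recreated on every reload, which would lose track of child processes that are still running.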

A managed agent tracks:

Field          Purpose
process        The ChildProcess reference
connection     The ACP ClientSideConnection for protocol messages
initResponse   Negotiated capabilities from the initialize handshake
scope          An Effect CloseableScope whose finalizer kills the process tree
status         Current state: downloading, starting, running, error, or stopped
knownSessions  Set of ACP session IDs currently open on this agent
restartCount   Number of automatic restarts since last successful start

Auto-Restart

If a running agent process exits unexpectedly (status is not stopped), the process manager automatically restarts it with exponential backoff:

Attempt  Delay
1st      1 second
2nd      2 seconds
3rd      4 seconds

After 3 failed restarts, the agent enters the error state and stays there until it is restarted manually. A successful ACP initialize resets the restart counter.
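The backoff schedule doubles the delay on each attempt, which can be written as a small pure function. The names restartDelayMs, MAX_RESTARTS, and BASE_DELAY_MS are hypothetical. Only the 1, 2, and 4 second delays and the limit of 3 come from the text above.

```typescript
const MAX_RESTARTS = 3;     // after this many failed restarts the agent stays in "error"
const BASE_DELAY_MS = 1000; // delay before the first restart attempt

// `restartCount` is the number of automatic restarts already attempted
// since the last successful start. Returns null when no further
// restart should be scheduled.
function restartDelayMs(restartCount: number): number | null {
  if (restartCount >= MAX_RESTARTS) return null;
  return BASE_DELAY_MS * 2 ** restartCount;
}
```

With these constants, restart counts 0, 1, and 2 map to 1000, 2000, and 4000 milliseconds, and a count of 3 yields null.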

Shutdown

Agent shutdown follows a layered strategy with timeouts at each stage to prevent hangs:

1. Close ACP Sessions (up to 3 seconds)

All sessions tracked in knownSessions are closed via unstable_closeSession if the agent advertises the sessionCapabilities.close capability. This gives the agent a chance to persist state and clean up internal resources.

A 3-second timeout bounds this step. If the agent is unresponsive, session close is abandoned and shutdown proceeds.
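One way to bound the session-close step is to race the work against a timer. In this sketch, withTimeout and closeKnownSessions are hypothetical helpers, and closeSession stands in for whatever wrapper issues the actual ACP close request. Failures on individual sessions are swallowed so that one bad session cannot block the rest.

```typescript
// Resolves with the work's value, or with "timeout" if `ms` elapses first.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | "timeout"> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => resolve("timeout"), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Closes every known session in parallel, giving up after `timeoutMs`.
async function closeKnownSessions(
  sessionIds: Iterable<string>,
  closeSession: (sessionId: string) => Promise<void>, // assumed wrapper around the ACP call
  timeoutMs = 3000,
): Promise<"closed" | "timeout"> {
  const closing = Promise.all(
    Array.from(sessionIds).map((id) => closeSession(id).catch(() => undefined)),
  );
  const outcome = await withTimeout(closing, timeoutMs);
  return outcome === "timeout" ? "timeout" : "closed";
}
```

Whichever outcome is returned, the caller proceeds to the next shutdown stage.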

2. SIGTERM to Process Group

The Effect scope finalizer sends SIGTERM to the entire process group (using kill(-pid)), not just the direct child. This ensures any sub-processes the agent spawned (JVMs, language servers, etc.) also receive the signal.

If process group signalling fails (e.g. the group no longer exists), the signal falls back to the direct PID.

3. Wait for Exit (up to 5 seconds)

After sending SIGTERM, the system waits up to 5 seconds for the process to exit.

4. SIGKILL to Process Group

If the process has not exited after the grace period, SIGKILL is sent to the entire process group, forcing immediate termination of the agent and all its children.
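Steps 2 to 4 form an escalation: signal politely, wait, then force. The sketch below captures that control flow under a few assumptions. terminateWithGrace and waitForExit are hypothetical names, and sendSignal is a caller-supplied callback, so the same logic works whether the signal goes to a process group or to a single PID.

```typescript
import { EventEmitter } from "node:events";

// Resolves true if `proc` emits "exit" within `timeoutMs`, false otherwise.
// A ChildProcess is an EventEmitter, so it can be passed directly.
function waitForExit(proc: EventEmitter, timeoutMs: number): Promise<boolean> {
  return new Promise((resolve) => {
    const onExit = () => { clearTimeout(timer); resolve(true); };
    const timer = setTimeout(() => { proc.off("exit", onExit); resolve(false); }, timeoutMs);
    proc.once("exit", onExit);
  });
}

async function terminateWithGrace(
  proc: EventEmitter,
  sendSignal: (signal: "SIGTERM" | "SIGKILL") => void,
  graceMs = 5000,
): Promise<"exited" | "killed"> {
  // Subscribe before signalling so an immediate exit is not missed.
  const exited = waitForExit(proc, graceMs);
  sendSignal("SIGTERM");
  if (await exited) return "exited";
  sendSignal("SIGKILL"); // grace period elapsed, force termination
  return "killed";
}
```

A real implementation would also check whether the process had already exited before subscribing, since "exit" fires only once.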

5. Overall Shutdown Timeout (10 seconds)

When the server itself is shutting down, a 10-second overall timeout ensures process.exit() fires even if the graceful shutdown sequence hangs. This timer is unref'd so it does not keep the event loop alive on its own.
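A force-exit timer of this kind might be armed as shown below. armForceExitTimer is a hypothetical name, and the exit code 1 used for a forced exit is an assumption, since the text does not state which code is used. The exit callback is injectable so the sketch can be exercised without terminating the process.

```typescript
import { setTimeout, clearTimeout } from "node:timers";

function armForceExitTimer(
  timeoutMs = 10_000,
  exit: (code: number) => void = (code) => process.exit(code),
) {
  // Assumed exit code for the forced path: 1.
  const timer = setTimeout(() => exit(1), timeoutMs);
  // unref() lets the process exit naturally if nothing else is pending,
  // so the timer alone never keeps the event loop alive.
  timer.unref();
  return { timer, cancel: () => clearTimeout(timer) };
}
```

If graceful shutdown finishes first, the process exits normally and the timer never fires.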

Shutdown Sequence Diagram

shutdown()
  │
  ├─ stopScheduleHeartbeat()
  ├─ stopArchivePruner()
  ├─ stopTerminalWs()
  ├─ start 10s force-exit timer (unref'd)
  │
  └─ shutdownAllAgents()          ── runs all agents in parallel:
       │
       └─ gracefulStopAgent(id)
            │
            ├─ set status = "stopped", remove from map
            ├─ closeKnownSessions()   ── 3s timeout
            ├─ Scope.close()          ── SIGTERM to process group
            ├─ wait for exit           ── 5s timeout
            ├─ SIGKILL process group   ── if still running
            └─ done

Terminal Processes

Terminal processes (PTY shells spawned for interactive use) follow a similar pattern with process group management.

Startup

Terminals are spawned via node-pty with a login shell. Each terminal is tracked by a composite key of project ID and session ID.
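A composite key can be built by encoding both IDs into one string. The format below is an assumption chosen for illustration, and terminalKey is a hypothetical name. JSON-encoding the pair avoids collisions that a plain separator would allow when an ID itself contains the separator.

```typescript
// Hypothetical key format: JSON array of [projectId, sessionId].
function terminalKey(projectId: string, sessionId: string): string {
  return JSON.stringify([projectId, sessionId]);
}

// Simplified registry: real entries would hold the PTY handle.
const terminals = new Map<string, { pid: number }>();

function findTerminal(projectId: string, sessionId: string) {
  return terminals.get(terminalKey(projectId, sessionId));
}
```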

Shutdown

  1. SIGHUP to process group — Signals the terminal and all child processes (running commands, background jobs) to hang up.
  2. Wait up to 5 seconds for the process group to exit.
  3. SIGKILL to process group — Force-kills the entire group if it has not exited.

The same kill(-pid) process group technique is used, with a fallback to direct PID signalling.

Cleanup on Exit

Terminal cleanup handlers are registered for the process exit event and for the SIGINT and SIGTERM signals, and they kill all terminal PTYs when the server shuts down. A globalThis flag guards the registration so that hot reloads do not register the handlers twice.
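The double-registration guard can be sketched like this. The flag name __terminalCleanupRegistered and the function registerTerminalCleanup are hypothetical. killAll is assumed to be synchronous, because listeners on the exit event cannot perform asynchronous work.

```typescript
// Hypothetical flag name stored on globalThis.
const CLEANUP_FLAG = "__terminalCleanupRegistered";

// Returns true if handlers were registered by this call, false if a
// previous module evaluation had already registered them.
function registerTerminalCleanup(killAll: () => void): boolean {
  const globals = globalThis as unknown as Record<string, unknown>;
  if (globals[CLEANUP_FLAG]) return false;
  globals[CLEANUP_FLAG] = true;

  process.on("exit", killAll);    // must be synchronous
  process.on("SIGINT", killAll);  // the server's own signal handlers drive the exit
  process.on("SIGTERM", killAll);
  return true;
}
```

Without the flag, each hot reload would add another set of listeners, and every one of them would run on shutdown.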

Client Terminals

Client terminals (spawned for agent-initiated terminal sessions) use direct process management rather than a PTY:

  1. SIGTERM to the child process.
  2. Wait up to 5 seconds for exit.
  3. SIGKILL if still running.

Signal Handling

The server registers handlers for multiple signals and error conditions:

Signal / Event      Action
SIGTERM             Graceful shutdown (exit code 0)
SIGINT              Graceful shutdown (exit code 0)
SIGUSR2             Graceful shutdown (exit code 0); used by nodemon-style restarts
uncaughtException   Log fatal error, then shutdown (exit code 1)
unhandledRejection  Log fatal error, then shutdown (exit code 1)
beforeExit          Graceful shutdown if not already shutting down

All handlers are idempotent — a shuttingDown flag prevents re-entry if multiple signals arrive.
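The re-entry guard amounts to a flag that is set synchronously before any asynchronous work begins. In this sketch, shutdownOnce is a hypothetical name, and both the shutdown work and the exit function are passed in so the behaviour can be observed without exiting the process.

```typescript
let shuttingDown = false;

// Runs `doShutdown` at most once. Later calls return false immediately.
async function shutdownOnce(
  exitCode: number,
  doShutdown: () => Promise<void>,
  exit: (code: number) => void = (code) => process.exit(code),
): Promise<boolean> {
  if (shuttingDown) return false;
  shuttingDown = true; // set before the first await so a second signal sees it
  try {
    await doShutdown();
  } finally {
    exit(exitCode); // exit even if the graceful path throws
  }
  return true;
}
```

Each handler in the table would then call shutdownOnce with its own exit code, for example 0 for SIGTERM and 1 for uncaughtException.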

Process Group Killing

Both agent and terminal shutdown use process group killing to prevent orphaned child processes. The technique works as follows:

  1. Agents and terminals are spawned with detached: true, which creates a new process group with the child as the group leader.
  2. process.kill(-pid, signal) sends the signal to every process in the group (the negative PID targets the group, not a single process).
  3. If group signalling fails (e.g. the process already exited), the system falls back to process.kill(pid, signal) targeting the direct child.

This is important for agents like Junie that spawn JVM sub-processes — without process group killing, those sub-processes would be orphaned when the parent agent is terminated.
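The group-then-direct fallback described above can be sketched as a single function. signalProcessTree is a hypothetical name, and the kill parameter defaults to process.kill but is injectable so the fallback path can be exercised without sending real signals.

```typescript
type KillFn = (pid: number, signal: NodeJS.Signals) => unknown;

// Reports which target accepted the signal, or "gone" if neither did.
function signalProcessTree(
  pid: number,
  signal: NodeJS.Signals,
  kill: KillFn = (p, s) => process.kill(p, s),
): "group" | "direct" | "gone" {
  try {
    kill(-pid, signal); // negative PID targets the whole process group
    return "group";
  } catch {
    try {
      kill(pid, signal); // fall back to the direct child
      return "direct";
    } catch {
      return "gone"; // nothing left to signal
    }
  }
}
```

Negative-PID signalling is a POSIX convention, so this sketch assumes a Unix-like host.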