Proactive agent — operator runbook

The proactive agent is a scheduled harness in fusion-ai (@tikab-interactive/fusion-ai/agent): a cron wakes it, it evaluates a set of checks, the model proposes findings, and deterministic code decides whether to notify, batch, or stay quiet. The harness ships only ports; the example app at example/src/lib/agent-*.ts wires them to real infrastructure (Postgres, the mailer, realtime, metrics). The full design rationale is in PROACTIVE_AGENT.md. This page has two halves: How the heartbeat works — a code-grounded walk through the loop, aimed at someone debugging "why didn't the agent notify me?" — and then the operator view: how to pause it, drain failures, and read its signals when something goes wrong.

Everything here describes mechanisms that exist in the codebase. Where a lever is a server function it can be invoked from the /sandbox/agent page; where it is a cron it is registered in registerAgentCron (example/src/lib/agent-jobs.ts) and runs on the worker.

Loading diagram...

How the heartbeat works

This section is the one to read when a user says "the agent never told me about X." There are about eight places that answer could die — a cron that isn't running, a consent flag, an interval gate, a degraded check, the model staying silent, a dedupe row, a cap, or a delivery that rolled into a digest — and each is deterministic code you can point at. We'll walk the loop top to bottom and name the exact file and function at every step.

Coming from Django? The whole thing is Celery beat plus a delivery policy. Beat fires a periodic task on a schedule; here Hatchet (a self-hosted scheduler) fires proactive-agent-tick every minute. The difference is what the task does: instead of "always send the email," it reads your data, asks a model what's worth your attention, and then runs that proposal through rails (dedupe, cooldown, caps) that decide whether to actually interrupt you. The "didn't fire" bugs you'd debug in Celery (worker down, beat not scheduling, task raised) are all here too — plus a few new ones above.

What fires a tick — the Hatchet cron

The heartbeat is a Hatchet cron, registered once when the worker boots. Hatchet is the self-hosted workflow engine the stack runs on (see deploy); think of it as Celery beat + a result/lease store. The registration lives in registerAgentCron (example/src/lib/agent-jobs.ts):

example/src/lib/agent-jobs.ts — registerAgentCron

export async function registerAgentCron(): Promise<void> {
	await ensureCron(proactiveAgentTick, "proactive-agent-every-minute", "* * * * *");
	await ensureCron(deliveryRetryTick, "agent-delivery-retry-every-minute", "* * * * *");
	await ensureCron(agentRetentionTick, "agent-retention-daily", "0 3 * * *");
	await ensureCron(embeddingBackfillTick, "agent-embedding-backfill", "*/5 * * * *");
	await ensureCron(memoryConsolidationTick, "agent-memory-consolidation-daily", "30 3 * * *");
	await ensureCron(digestTick, "agent-digest-briefing", "30 7 * * *");
}

So the cadence is once a minute (* * * * *) for the tick itself; the other crons are the supporting cast (digest briefing at 07:30, retention at 03:00, embedding back-fill every 5 minutes, memory consolidation at 03:30). ensureCron swallows Hatchet's "already exists" 400 so the worker is safe to restart — re-registering the same (name, expression) pair would otherwise crash it on every boot.

Each cron firing invokes the matching Hatchet task. The tick task (proactiveAgentTick, same file) is a thin wrapper: it resolves the owner, checks consent, builds the config + actor + stores, and calls the harness entry point runAgentTick. Two scheduling decisions are made here, before the harness:

Consent gate (getUserConsent): autonomous monitoring is opt-in per user, so a tick for a non-consenting owner returns { skipped: "no-consent" } and never touches the harness. (More in GDPR & retention.)
Missed-tick policy (shouldCatchUp): if the worker was down and ticks were missed, the task tags this run trigger: "catchup" instead of "timer". A catch-up runs exactly once and bypasses the interval gate — it is not a replay of every missed minute. Otherwise the trigger is "timer".

example/src/lib/agent-jobs.ts — proactiveAgentTick (excerpt)

		const trigger: Trigger = shouldCatchUp(
			systemClock.now(),
			rt.lastTickAt,
			config.baseIntervalSeconds,
		)
			? "catchup"
			: "timer";

Running it locally. The worker is a standalone process — start it with bun run worker (example/src/worker.ts). It loads .env.local first, then guards on isJobsEnabled() before importing anything Hatchet-related.

Without Hatchet: the `HATCHET_CLIENT_TOKEN` gate

The single most common reason "the agent never told me about X" is that no cron is running at all — because there's no worker. Background jobs are opt-in via one environment variable. isJobsEnabled() is just:

fusion-jobs/src/index.ts

/**
 * Background jobs are optional. They light up only when `HATCHET_CLIENT_TOKEN`
 * is set — the same graceful-degradation contract as the mailer (`SMTP_HOST`
 * in @tikab-interactive/fusion-mailer). When unset, the web app boots normally and enqueue
 * helpers simply log-and-skip.
 *
 * This entry imports NOTHING (no Hatchet SDK), so the web server can check the
 * flag — and the app's enqueue helpers can guard on it — without pulling the
 * gRPC client into the request bundle. The SDK lives behind the separate
 * `@tikab-interactive/fusion-jobs/hatchet` entry, which is only ever loaded via dynamic import
 * (in the app's enqueue helpers) or by the standalone worker.
 */
export function isJobsEnabled(): boolean {
	return Boolean(process.env.HATCHET_CLIENT_TOKEN);
}

This gate appears in three places, and together they make Hatchet a clean progressive enhancement:

The worker (worker.ts) checks it first and exits cleanly if it's unset — "background jobs disabled; nothing to run." No worker means no proactive-agent-tick cron, so the autonomous timer never fires. The app still runs fine.
getHatchet() (fusion-jobs/src/hatchet.ts) wraps Hatchet.init(), which throws without the token. Because importing agent-jobs.ts calls getHatchet() at module load, that module is imported only by the worker, after the guard — never by the web server.
The web server's own server functions (getAgentStatus in agent-server.ts) surface jobsEnabled so the /sandbox/agent page can show an amber "timer is off" state.

So the inline fallback is: there is no autonomous loop, but every tick is still runnable by hand. The exact same runAgentTick runs inline in the web process via the manual trigger (run_agent_check_now / the sandbox "Run tick" button — see Run a tick now). That's how the feature stays fully demoable with zero infrastructure: the timer is the only thing Hatchet provides; the mechanism is identical either way.

Debugging checkpoint #1. Agent silent on the timer? Confirm a worker is up (bun run worker printed "crons registered") and HATCHET_CLIENT_TOKEN is set. With no token there is no timer — only manual ticks. This is by far the most common cause.

What runs inside a tick — `runAgentTick`

runAgentTick (fusion-ai/src/agent/loop.ts) is the orchestrator. It is pure given its injected dependencies (clock, model, store, sink) — the loop file never imports the LLM or the database, so the whole thing is unit-testable with scripted fakes (loop.test.ts). A throw anywhere inside is caught and the run transitioned to FAILED; the worker never crashes. The order is fixed, and every gate below is a place a finding can legitimately not happen:

Kill switch — if (config.killSwitch) return skip("killSwitch"). Instant no-op, no run row. (See Pause one agent vs pause all runs.)
Interval gate — isDueTick(...) (scheduling.ts). A timer trigger must have waited a full baseIntervalSeconds since the last tick; manual and catchup bypass it. The demo sets baseIntervalSeconds: 60 (agent-config.ts), so although the cron fires every minute, the gate is what actually paces real ticks.
Single-flight lease — store.openRun(...). The example claims a RUNNING row under a Postgres advisory lock with a lease (leaseExpiresAt), not a boolean flag (agent-store.ts). If a live, unexpired run already holds it, this tick returns skip("overlap"). Because it's a lease, a crashed run's claim expires on its own and no longer blocks the next tick — that's the recovery story.
Which checks are due — the loop filters input.checks by three things: the check is enabled, it's due on its own per-check interval (isCheckDue), and we're inside active hours (isWithinActiveHours) unless the check is criticalAlways.
Gather signals (gather.ts) — run each due check, failure-isolated, with a breaker + bounded retry + per-tool timeout. Detailed below.
Model proposes (model.ts) — the gathered signals go to the LLM, which returns candidate findings (or an empty array). One bounded re-ask on malformed output, then a clean FAILED. Detailed in How a finding becomes a notification.
Token-budget cap — if proposed.tokens > config.maxTokensPerRun, the run does not deliver or write memory: it ends FAILED with a budget operator alert. A runaway-cost guard.
Delivery decision (delivery.ts) — deterministic dedupe / cooldown / caps decide what actually reaches you. Detailed below.
Dispatch (dispatch.ts) — push the surviving items to the sink and record them as surfaced; route act findings to the approval queue.
Atomic success commit — store.finishRun(...) writes memory and flips RUNNING → SUCCEEDED in one transaction (agent-store.ts).

Here is the loop as one diagram — every branch that ends without a notification is a real, debuggable outcome:

Loading diagram...

The checks: live, owner-scoped reads → a signal string

A check is a small object: an id, a prompt telling the model what to watch for, an interval, and a gatherSignal(actor) function that does a live, owner-scoped, bounded query and returns a short string. The real ones live in realChecks() (example/src/lib/agent-checks.ts). There are three:

Check	Reads	Watches for
`deadlines`	`processn_assigned_task` (joined)	the user's assigned tasks overdue or due within 48 h
`inbox`	`mailbox`	messages to the user in the last 24 h that need a reply
`ops`	`audit_log`	`agent.alert.*` / auto-disable incidents in the last hour

Three properties make these safe and make their output trustworthy:

Owner-scoped. Every query filters on actor.ownerId (e.g. the deadlines check: eq(processnAssignedTask.userId, actor.ownerId)). A check can only ever see one user's data — the same §8 cross-tenant boundary the chat tools obey.
Bounded. Ranking and limits are pushed into the SQL — a recent-time window plus ORDER BY … LIMIT 10 — so the model gets a short, pre-ranked candidate list, not a firehose.
A signal, not a verdict. gatherSignal returns prose for the model to judge — e.g. - "Submit drawings" due 2026-06-22T… (OVERDUE) — including the explicit "nothing here" case ("No assigned tasks are overdue or due within the next 48 hours."). The check never decides to notify; it only reports.

The crucial invariant: "couldn't check" is never faked as "all clear." If a gatherSignal throws, the harness marks that check degraded and tells the model so (the user prompt carries STATUS: could not be checked this tick (treat as UNKNOWN, not OK)). This is enforced in gatherSignals (fusion-ai/src/agent/gather.ts), which wraps each check in: a circuit-breaker gate → a bounded retry (GATHER_ATTEMPTS = 3) → a per-tool timeout (withTimeout, capped at 30 s). After repeated failures the breaker trips open, an operator alert fires, and the harness emits a synthetic outage finding (makeOutageFinding) that flows through the same delivery pipeline as model findings — so a persistently broken source surfaces as a high-severity "Can't reach X" notification rather than silence.

Debugging checkpoint #2. A check that returns "nothing notable" produces no finding — that's correct, quiet behaviour, not a bug. But a check that's silently failing is different: look at agent_check_state.last_status (it'll read degraded) and the agent.alert.breaker_open audit rows. Degraded ≠ healthy; the model is told as much.

How a finding becomes a notification (or doesn't)

This is the heart of the "why didn't it notify me?" question. A finding is the model's candidate (shape in fusion-ai/src/agent/findings.ts): a title, detail, a severity (critical | high | normal | low | info), a confidence, a suggestedAction (notify | act | log_only), and a dedupeKey — a stable identity of the underlying real-world thing (e.g. task:Submit drawings), which is what makes dedupe possible. The model proposes; it never notifies. Two layers stand between a finding and your bell.

Layer 2 — the model (model.ts). createFindingModel turns the gathered checks into findings via a structured-output LLM call. The system prompt forces "default to silence; only surface what genuinely matters," fences each signal as untrusted <signal> DATA (prompt-injection defence), and — for a Swedish deployment — writes the title/detail in the user's language while keeping the dedupeKey a stable machine identifier. If nothing is notable the model returns an empty array, and the loop stays completely quiet (it never touches the sink — that's the "don't be annoying" guarantee). With no AI provider configured, generateObject returns null → zero findings (graceful degradation); the offline demo uses a deterministic model that fires on an [URGENT] marker so the flow is testable without an LLM.

Layer 3 — the deterministic delivery policy (delivery.ts, decideDelivery). This is pure code, not the model's discretion — the single most important production rail. Each proposed finding runs this gauntlet, in order:

Mute — if the user muted this agent, non-critical findings are suppressed (muted).
Classify mode (classifyMode) — log_only or info → suppressed outright; critical/high (or a user's must-know check) → immediate; everything else → digest.
Dedupe / cooldown / ack (checkSuppression) — the loop looks up this dedupeKey in the surfaced_finding ledger (store.surfacedFor). If it was already notified recently (within cooldownSeconds, default 1 h) → cooldown. If the user acknowledged or resolved it → acknowledged. Exception: a finding whose severity escalated since last time (isEscalation) bypasses cooldown once — a problem getting worse still reaches you.
Caps (applyCapMode) — maxNotificationsPerRun (default 3) and maxNotificationsPerDay (default 20). Overflow doesn't get dropped — it rolls into the digest. critical is exempt from caps entirely. Findings are ranked by priorityScore (severity dominates, confidence breaks ties) before caps apply, so the most important survive.

Whatever survives as immediate goes to the DeliverySink (agent-delivery.ts), which is where a finding finally becomes user-visible:

immediate → an in-app notification row (the bell) — inserted onConflictDoNothing on (userId, dedupeKey, runId), so a retried run cannot double-notify the same finding (this is the "effectively-once" guarantee, recorded in agent_side_effect_log) — plus a best-effort realtime push (a no-op when Centrifugo is off; the bell reads the durable row on mount).
digest → queued to the digest_item table; the agent-digest-briefing cron later batches a user's pending rows into one email.

And the ledger is updated: recordSurfaced upserts the surfaced_finding row (lastNotifiedAt, lastSeverity, notifyCount), which is exactly what the next tick's cooldown check reads. (When you acknowledge a notification, acknowledgeNotification in agent-server.ts also flips the matching surfaced_finding to acknowledged, so the same dedupeKey stops resurfacing.)

Debugging checkpoint #3 — the decision table. Walk it in order for the finding you expected:

Symptom Where it died How to confirm
No matching data exists the check returned "nothing notable" run the check's query; check agent_check_state.last_status = ok
The model judged it not worth surfacing Layer 2 returned no finding for it agent_run is SUCCEEDED with delivered: 0; inspect the signal
severity = info or suggestedAction = log_only classifyMode → suppressed (policy) the finding's severity/action
Already notified recently cooldown (default 1 h, unless escalated) surfaced_finding.last_notified_at for that dedupeKey
You already acked/resolved it acknowledged suppression surfaced_finding.status
It went to email, not the bell digest mode or a cap rolled it to digest digest_item rows; wait for agent-digest-briefing (07:30)
It wants to act, not notify routed to the approval queue (HITL) agent_approval rows; the "Needs you" surface
The run failed before delivery a throw, or over token budget agent_run.state (FAILED/DEAD_LETTER) + error

The delivered / suppressed counters on the RunAgentTickResult (and the fusion_agent_notifications_total{disposition} metric) tell you which bucket a tick's findings fell into without reading rows.

Symptom	Where it died	How to confirm
No matching data exists	the check returned "nothing notable"	run the check's query; check `agent_check_state.last_status = ok`
The model judged it not worth surfacing	Layer 2 returned no finding for it	`agent_run` is `SUCCEEDED` with `delivered: 0`; inspect the signal
`severity = info` or `suggestedAction = log_only`	`classifyMode` → suppressed (`policy`)	the finding's severity/action
Already notified recently	cooldown (default 1 h, unless escalated)	`surfaced_finding.last_notified_at` for that `dedupeKey`
You already acked/resolved it	acknowledged suppression	`surfaced_finding.status`
It went to email, not the bell	digest mode or a cap rolled it to digest	`digest_item` rows; wait for `agent-digest-briefing` (07:30)
It wants to act, not notify	routed to the approval queue (HITL)	`agent_approval` rows; the "Needs you" surface
The run failed before delivery	a throw, or over token budget	`agent_run.state` (`FAILED`/`DEAD_LETTER`) + `error`

Memory: durable facts, embedded and recalled

Memory is what makes the agent feel continuous across its isolated ticks, and it feeds the conversational front door too. It lives in one pgvector-backed table, agent_memory (example/src/db/schema/agent.ts), with a kind of fact, preference, observation, or summary. There are two write paths and two read paths:

Writing. Every successful tick writes a one-line observation summarizing the run, as part of the atomic finishRun transaction (agent-store.ts) — so memory and SUCCEEDED commit together. The conversational path is rememberMemory: when you tell Carola "call me Andy," the chat's remember tool persists a durable fact/preference. Both embed the text (768-dim) when an embedder is configured; when it isn't, the row is written embedding_pending = true and stays keyword-searchable — never silently lost. A back-fill cron (backfillEmbeddings, every 5 min) vectorizes pending rows once a provider is reachable, so historical memory becomes vector-recallable without a re-write.

Recalling. Inside a tick the model can call the owner-scoped memory_search / memory_get tools (createMemoryTools + createDrizzleMemoryStore) — cosine search when embedded, keyword/recency fallback otherwise. But the more visible recall is at the conversational front door: the chat endpoint (example/src/routes/api/agent/chat.ts) calls recentUserMemories(db, actor) on every turn and injects the user's durable facts straight into the system prompt, so Carola addresses you correctly without having to call a tool first:

example/src/routes/api/agent/chat.ts

				const memories = await recentUserMemories(db, actor).catch(() => []);
				const remembered =
					memories.length > 0
						? `Here is what you already remember about this user — use it naturally (address them by their preferred name and respect these preferences), and don't ask again for things you already know: ${memories
								.map((mem) => `• ${mem}`)
								.join("  ")}`
						: "You don't have any saved facts about this user yet — when they share one, save it with the remember tool.";

To keep memory from growing without bound, a daily consolidation ("dreaming") cron (consolidateMemories) folds the oldest raw observation rows into one higher-level summary and prunes the originals — atomically, under an advisory lock, idempotent under concurrency. Memory is the one thing the retention sweep never deletes (it has no TTL by design — consolidation, not deletion, bounds it). Every memory read is owner-scoped (visibleMemoryWhere), the same cross-tenant boundary as everything else.

Debugging checkpoint #4. "It forgot my nickname." Check agent_memory for the row (kind = 'fact'/'preference', your owner_id). If the row is there but still embedding_pending and you expected vector recall, the embedder isn't configured yet — recency/keyword recall still works, and the back-fill cron will vectorize it. The chat injects the 12 most recent facts via recentUserMemories, so a very old fact past that window relies on the model calling search_agent_memory.

Run a tick now

Two manual triggers run the exact same runAgentTick as the cron, which makes the whole loop demoable and debuggable with no scheduler:

run_agent_check_now — a chat tool (example/src/lib/agent-assistant.ts). When you ask Carola to "kolla nu" / "check now," it runs a real tick with the real checks (realChecks()), trigger: "manual", and reports state / delivered / approvalsRequested. Because the trigger is manual, it bypasses the interval gate — it always runs — but single-flight, dedupe, cooldown and caps still apply exactly as on the timer. So "run now" surfaces genuinely new findings but won't re-spam things already in cooldown.
The sandbox tick — runAgentTickNow (example/src/lib/agent-server.ts), behind the /sandbox/agent "Run tick" button. It takes a scenario (finding | quiet | flaky | approval | real) so you can drive the deterministic demo checks — e.g. flaky throws a 503 to exercise the retry + circuit breaker, approval emits an act finding to exercise the human-in-the-loop round-trip — or pick real to run the production checks. It defaults to the deterministic demo model (predictable, e2e-stable); opt into the real provider to watch an actual model judge the same signals.

Both record an agent_run row and an audit entry, so a manual tick is observable in exactly the same places as a timer tick — which is what makes this section's debugging checkpoints work whether or not Hatchet is running.

The conversational front door

Besides the autonomous timer, the agent — Carola — has a chat: the conversation-first shell (Carola) where you land on /home, every thread addressable by URL (/c/$id, or /project/$key/c/$id inside a building) and revisitable from the sidebar history. A per-conversation header carries the thread controls (rename / star / add-to-project / delete + Files + the floating-Carola pop-out), and the right rail is Carola's companion — the live conversation alongside conversational maps. Every thread is also indexed in universal search. The chat binds owner-scoped server tools (example/src/lib/agent-assistant.ts) that read the same data the timer uses, so answers come from reality, never invention:

Tool	Answers
`list_recent_findings`	what the agent has surfaced ("vad är brådskande?")
`list_upcoming_deadlines`	overdue / due-soon assigned tasks
`search_agent_memory`	what the agent has observed about this user
`run_agent_check_now`	run the checks now and report
`search_project_wiki`	the documentation handbook (WikiN)
`search_project_news`	the project news feed (NyhetN)
`search_attached_documents`	exact passages from files attached to the chat, with page citations

The two search_project_* tools are backed by Postgres Swedish full-text search (to_tsvector('swedish') @@ websearch_to_tsquery, ILIKE fallback) — no embedder, so they work air-gapped. They let the chat answer concrete project questions ("Vilken Revit-familj används?") by quoting the actual article. search_attached_documents covers files the user attaches to a chat — see Document attachments below. These tools are read-only and owner/visibility-scoped, so they don't appear in the operator levers below.

Document attachments (RAG)

A user can attach files to a Carola conversation (Carola) — owner-private, scoped to that one thread (the chat_document_attachment join), including 1000-page standards like a Eurocode — and ask for an exact detail ("what partial factor does §6.1 give?"). Each attachment is stored (chat_document + fusion-storage), and a Hatchet worker processes it: Kreuzberg (a docker compose service) extracts the text (PDF / Office / OCR; plain text is read directly), it's chunked with page metadata, and each chunk is embedded (768-dim, pgvector — the same embedder the agent's memory uses). Carola's search_attached_documents tool then cosine-searches the chunks scoped to that conversation (keyword fallback when no embedder) and answers strictly from the returned excerpts, citing the file and page. Graceful degradation: without Hatchet the upload processes inline; without Kreuzberg only plain-text attachments are read. KREUZBERG_URL (preset in .env.local) toggles extraction.

How it compares — OpenClaw and other systems

The agent is OpenClaw's idea rebuilt as a production harness. OpenClaw is the open-source reference implementation this is modelled on (hence "OpenClaw-style" in PROACTIVE_AGENT.md); it demonstrated the mechanism. Almost everything Fusion adds is about making that mechanism safe, multi-tenant, and operable. The thesis of the design doc is one line: production-readiness lives in the harness, not the model.

Same as OpenClaw (the shared idea)

The core loop is identical, and intentionally so:

A timer wakes the agent on an interval — the "heartbeat." It is not driven by user input. That autonomy is the "proactive" in proactive agent.
Each wake-up reads a checklist of things worth watching (the checks), gathers live signals with tools, and asks the model to judge what — if anything — matters.
The model proposes; the harness then notifies, batches, or stays silent. A quiet run produces zero output — the "don't be annoying" guarantee.
Each heartbeat is an isolated session (bounded blast radius and token cost), and the agent keeps long-term memory across runs.

If you've seen OpenClaw, the loop here is familiar.

Deliberately different from OpenClaw (the hardening)

The differences are all in the harness, not the model:

Aspect	OpenClaw	This agent
Deciding what's important	hands the checklist to the LLM and trusts its judgment	the LLM judges inside deterministic rails — explicit severity/score, dedupe, cooldown, caps, digest-vs-immediate. Testable, auditable, no notification storms.
What it can touch	open shell, files, a browser — very powerful, very risky	typed tools only, default-deny allow-list. No shell, no arbitrary files, no browser. The single biggest deliberate deviation.
What can trigger a run	the heartbeat could be re-fired by non-timer events (≈1–2 min)	the interval gate is the only trigger — a tool finishing or a memory write can never start a run.
Concurrency & crashes	best-effort	a run lease (not a boolean flag) makes a crashed run recoverable; single-flight per agent; missed ticks → one catch-up, not a replay storm; per-run timeout; dead-letter + auto-disable for poison runs.
A watched thing breaks	—	per-tool timeout, error classification, bounded retry, a circuit breaker per integration, and "couldn't check" never reads as "all clear."
Risky actions	the model just does them	human-in-the-loop approval — consequential actions pause for an explicit approve/deny (see `agent-approvals.ts`).
Multi-tenant & privacy	a single power-user tool	every tool + memory read is permission- and tenant-scoped (authorization); data residency can force a local model (Ollama) for air-gapped tenants — never the cloud.
Delivery	—	effectively-once (at-least-once transport + idempotent sinks + dedupe keys) — a computed alert is never lost and never double-sent.
Operability	—	per-run audit rows, Prometheus metrics, operator alerts (a separate channel from user notifications), a kill switch, and the SLOs below.

Net: safer and multi-tenant, but narrower — OpenClaw can roam your machine; this can only do the specific, vetted things its tools allow.

Where it sits next to other categories

It reuses well-known patterns rather than inventing new ones:

Category	Relationship
Cron / scheduled scripts	The base is a cron tick — but with model judgment and a delivery policy on top, instead of "always fire."
Monitoring & alerting (Alertmanager, PagerDuty, Datadog)	Borrows the alerting discipline — severity, dedupe keys, cooldown, escalation, digests to fight fatigue. Difference: the "is this worth surfacing?" call is an LLM over heterogeneous app data, not threshold rules over metrics — and it can propose actions, not just alert.
Workflow engines (Hatchet, Temporal, Airflow)	It runs on one — Hatchet provides the scheduling, leases, retries and dead-letter; the agent is the domain harness on top, not a re-implementation. See deploy.
Autonomous-agent frameworks (AutoGPT/BabyAGI, LangChain)	Same tool-calling LLM loop, but those are usually user-initiated and "run until the goal is done." This is scheduled, bounded per heartbeat, and gated — capped steps/tokens, no open-ended autonomy, a kill switch.
RAG chatbots / copilots	That's the reactive half (Carola's chat, above) — it answers when asked. The proactive loop is the inverse: it acts without being asked. This system ships both, sharing the same tools and data.

At a glance

I need to…	Lever	Effect
Pause all proactive runs	`setAgentKillSwitch({ on: true })` → `agent:killSwitch`	every tick skips instantly (`skipped: "killSwitch"`)
Pause one agent	flip that agent's `killSwitch` (per-agent config)	only that agent's ticks skip
See failed runs	`agent_run` rows where `state = 'DEAD_LETTER'`	the poison runs that exhausted retries
Reset a tripped breaker	restart the web/worker process (`breakerStore` is in-memory)	all breakers return to `closed`
Re-run a missed digest	`drainDigestsNow` (manual) or wait for the `agent-digest-briefing` cron	sends each user's queued `digest_item` rows as one briefing email
Watch the system	`fusion_agent_` on `/api/metrics` + `agent.alert.` in the audit log	run health, breaker trips, delivery failures
Erase a user's data	`eraseAgentDataForUser(db, ownerId)` (self-service via `eraseMyAgentData`)	one transactional, orphan-free cascade

Pause one agent vs pause all runs

The global kill switch is the agent:killSwitch setting. Toggle it with the setAgentKillSwitch server function (example/src/lib/agent-server.ts), which calls setKillSwitch (example/src/lib/agent-config.ts):

example/src/lib/agent-config.ts

export async function setKillSwitch(on: boolean): Promise<void> {
	await setSetting(db, "agent:killSwitch", on);
	clearSettingsCache("agent:killSwitch"); // bypass the 30s TTL so the toggle is instant
}

The clearSettingsCache call is what makes it instant — without it the change would wait out the settings cache TTL. runAgentTick reads the flag at the very top of the tick and bails before doing any work:

fusion-ai/src/agent — runAgentTick

if (config.killSwitch) return skip("killSwitch");

So flipping it on means the next tick and every tick after stops immediately — no run record, no model call, no notifications. Flip it back off to resume. This is the big red button for an incident (runaway cost, a misbehaving integration, a bad deploy).

One agent vs all. getAgentConfig layers a per-agent override key (agent:<id>:config) over tenant defaults and then forces the global agent:killSwitch on top, so the global switch is authoritative for every agent. To pause a single agent without stopping the rest, set killSwitch: true inside that agent's agent:<id>:config override (or disable its individual checks via the per-check enabled flag) and leave the global switch off. The example runs one shared demo agent (AGENT_ID = "demo"), so in the sandbox the global switch and the per-agent switch are the same lever.

Auto-disable on a poison run

A run that keeps crashing is a poison run. After the attempt cap (default 3) the harness moves it to DEAD_LETTER instead of retrying forever, fires a dead_letter operator alert, and invokes its onAutoDisable hook. The example wires that hook to autoDisableAgent, which flips the kill switch so the agent stops being scheduled and writes an agent.auto_disable audit row:

fusion-ai/src/agent — onAutoDisable contract

/** Called when a poison run is dead-lettered (§6.2): the consumer disables the agent,
 *  e.g. flips its kill switch, so it stops being scheduled. Best-effort — errors here
 *  never re-fail the already-dead-lettered run. */
onAutoDisable?: (p: { agentId: string; reason: string }) => Promise<void>;

The point for an operator: a dead-lettered agent is meant to stay stopped until you look at it. It will not silently start hammering the same broken dependency again. Recovery is deliberate — investigate the cause, then re-enable by flipping the kill switch back off (see Drain dead-letters below).

Drain dead-letters

Failed-beyond-retry runs land in agent_run with state = 'DEAD_LETTER'. The state machine (fusion-ai/src/agent/pure.ts) treats DEAD_LETTER as terminal — it never transitions back to QUEUED, so a dead-lettered run is inert and will not be retried by the worker.

Find them:

-- Dead-lettered runs, newest first
SELECT id, agent_id, owner_id, trigger, attempt, error, created_at
FROM agent_run
WHERE state = 'DEAD_LETTER'
ORDER BY created_at DESC;

What to do with them:

Read error on the row (and the matching dead_letter line in the audit log, agent.alert.dead_letter) to find the cause — a permanently broken check, a malformed config, a downstream that is genuinely gone.
Confirm the agent is stopped. A poison run dead-letters and auto-disable is expected to have flipped the kill switch (above). Verify the agent isn't still being scheduled before you fix anything.
Fix the root cause — repair the check/config or the downstream integration.
Re-enable by flipping the kill switch back off (setAgentKillSwitch({ on: false }), or clear the per-agent killSwitch). The next proactive-agent-tick then runs a single fresh tick.

The dead-lettered rows themselves are historical records. They are not re-run; they age out on the normal retention schedule (terminal runs, see GDPR & retention). There is no "requeue a dead-letter" button by design — replaying a poison run would just re-poison.

Reset a stuck circuit breaker

Each downstream integration has a per-integration circuit breaker so a dead dependency is fast-failed instead of hammered every tick. The breaker state machine (closed → open → half_open) lives in fusion-ai/src/agent/resilience.ts; the example persists it through breakerStore in example/src/lib/agent-resilience.ts:

example/src/lib/agent-resilience.ts

const breakerState = new Map<string, BreakerState>();

Behavior: failures count up and the breaker trips open at the threshold; while open it fast-fails until the cooldown elapses, then flips to half_open to allow exactly one probe; a successful probe closes it, a failed probe re-opens it. Each trip fires a breaker_open operator alert and increments fusion_agent_breaker_open_total.

To reset it now: because breakerStore is an in-memory Map, restarting the process that holds it clears every breaker back to closed (CLOSED_BREAKER). In the example that is the web process (manual ticks) and/or the worker (cron ticks). If you don't want to wait for the half-open probe, a restart is the reset.

The standard recovery path is to just wait — the breaker self-heals via the half-open probe once the dependency is back. Restart only when you need it closed immediately. Note the in-memory caveat: a breaker open state does not survive a restart, and is not shared across processes — a production deployment would back breakerStore with the DB or Redis so an open breaker is durable and fleet-wide.

Replay a missed digest

normal/low findings are batched into a digest instead of interrupting the user one at a time. The DeliverySink (example/src/lib/agent-delivery.ts) does not email these inline — it queues each one to the digest_item table (enqueueDigestItems). A separate scheduled job then drains the queue: agent-digest-tick (cron agent-digest-briefing, e.g. a morning slot) calls drainDigests, which batches each user's pending rows into one briefing email (kind: "agent_digest") and marks them sent. This is the main defense against notification fatigue.

So to replay / force a briefing now:

Manually: call drainDigestsNow (the /sandbox/agent page exposes a "Send briefing now" button). It drains all currently-pending digest_item rows into briefing emails immediately.
Scheduled: the agent-digest-briefing cron runs the same drainDigests path on its schedule; the next firing sweeps whatever is queued.

Two things that make a replay safe to do:

Idempotent delivery. Immediate notifications are inserted onConflictDoNothing on (user, dedupeKey, runId) and recorded in agent_side_effect_log, so re-running a tick cannot double-notify the same finding.
Dedupe/cooldown. A finding already surfaced recently won't resurface unless its severity escalates, so a manual replay won't spam the user with everything again.

If the digest email itself failed to send (SMTP down), it rides the mailer outbox — fix the mailer and the outbox redelivers; you do not need to re-run the agent for that.

Operator alerts vs user notifications

Keep these two channels separate — they have different audiences.

Channel	Audience	Where it lands
Operator alerts	whoever runs the system	the audit log as `agent.alert.*` (+ a `console.warn`)
User notifications	the end user	the in-app bell (`notification` rows) + optional realtime push

Operator alerts are written by the operatorAlert port (example/src/lib/agent-resilience.ts) as audit rows with action = "agent.alert.<kind>". The kinds are fixed (OperatorAlertKind in fusion-ai/src/agent/resilience.ts):

Audit action	Fires when
`agent.alert.dead_letter`	a poison run exhausted its attempts and was dead-lettered
`agent.alert.breaker_open`	a per-integration circuit breaker tripped open
`agent.alert.delivery_failed`	a realtime push dead-lettered after the retry cap (per user)
`agent.alert.budget`	a run hit its token/step budget and ended `FAILED`

GDPR actions are audited separately under agent.gdpr.* — the self-service erasure writes agent.gdpr.erase (example/src/lib/agent-server.ts).

example/src/lib/agent-resilience.ts

		await recordAudit(db, {
			actorUserId: null,
			actorName: `agent:${event.agentId}`,
			action: `agent.alert.${event.kind}`,
			entityType: "agent_alert",
			summary: event.detail,
			meta: { kind: event.kind, runId: event.runId ?? null },
		});

Query for operator alerts:

SELECT created_at, action, summary, meta
FROM audit_log
WHERE action LIKE 'agent.alert.%'
ORDER BY created_at DESC
LIMIT 100;

User notifications are the in-app bell — notification rows written by the DeliverySink, surfaced via listMyNotifications, with an optional live realtime push (a no-op when Centrifugo is off; the bell reads the durable row on mount). These never page you; they are the product surface.

Metrics to watch

The harness's AgentMetrics port is wired onto the shared fusion-metrics Prometheus registry (example/src/lib/agent-metrics.ts), so proactive-agent series appear on the same /api/metrics exposition as the rest of the stack — no new endpoint. They are recorded only when METRICS_ENABLED is set; otherwise the sink no-ops.

Series	Type	Labels	Watch for
`fusion_agent_runs_total`	counter	`state`, `trigger`	a rising `state="FAILED"` / `"TIMED_OUT"` / `"DEAD_LETTER"` rate
`fusion_agent_run_duration_seconds`	histogram	`state`, `trigger`	p95/p99 creeping toward the per-run timeout (runs about to time out)
`fusion_agent_notifications_total`	counter	`disposition`	`disposition="delivered"` spiking = possible notification fatigue/dedupe gap
`fusion_agent_breaker_open_total`	counter	`agent`	any increase = a downstream integration is failing repeatedly

Concretely:

Run health — split fusion_agent_runs_total by state. A healthy system is almost all SUCCEEDED; a climbing DEAD_LETTER means poison runs (check the audit log), TIMED_OUT means runs are exceeding the per-run cap, FAILED with a budget alert means cost caps are biting.
Latency — fusion_agent_run_duration_seconds p95 trending up toward the timeout is your early warning that ticks are about to start timing out and being skipped by single-flight.
Notification volume — fusion_agent_notifications_total{disposition="delivered"} far above suppressed suggests dedupe/cooldown isn't catching repeats; a flood of identical alerts is the classic proactive-agent failure mode.
Breaker trips — any fusion_agent_breaker_open_total increase points at a specific failing integration (the agent label) — correlate with the agent.alert.breaker_open audit rows.

Two distinct paths (example/src/lib/agent-retention.ts):

Self-service erasure — eraseAgentDataForUser(db, ownerId). One transaction that cascades all of a user's agent data, orphan-free, ordered so run-keyed tables go before the runs they reference: side-effect log → working log → notifications → delivery attempts → surfaced findings → approvals → digest items → memory → runs. It returns per-table delete counts. The user-facing entry point is the eraseMyAgentData server function, which runs the cascade and writes an agent.gdpr.erase audit row with the counts. (Deleting the whole account also erases everything — the FKs are onDelete: "cascade"; this path is for "forget my assistant history" while the account stays.)

Consent / opt-in. Autonomous (timer) monitoring is opt-in per user (§10): getUserConsent / setUserConsent store the consent (with a timestamp) in user.metadata.agentConsent, and the proactive-agent-tick cron skips any owner who hasn't consented (skipped: "no-consent"). A manual tick is the user's own explicit action, so it isn't gated. Operator note: if an agent isn't running for a user on the timer, check their consent first — the /sandbox/agent page has a consent toggle.

Daily retention sweep — sweepAgentRetention. Run off-peak by the agent-retention-tick cron (registerAgentCron, daily at 03:00). It ages out only terminal/settled rows — never live ones, and never durable memory (memory has no TTL by design; consolidation, not deletion, bounds its growth). Defaults, all overridable via env without code:

Data	Aged out when…	Default TTL	Env override
working log (scratch)	older than TTL	7 days	`AGENT_RETENTION_WORKING_LOG_DAYS`
`agent_run` (terminal)	older than TTL, not `RUNNING`/`QUEUED`	90 days	`AGENT_RETENTION_RUN_DAYS`
notifications	older than TTL, status not `unread`	90 days	`AGENT_RETENTION_NOTIFICATION_DAYS`
delivery attempts	older than TTL, status not `pending`	30 days	`AGENT_RETENTION_DELIVERY_DAYS`
resolved findings	older than TTL, status `resolved`	90 days	`AGENT_RETENTION_RESOLVED_FINDING_DAYS`

The sweep is idempotent and safe to run repeatedly. It will never reap a RUNNING run out from under a worker, an unread notification, a pending delivery retry, or an active finding.

Service-level objectives (SLOs)

These are starting targets to agree with the team, not yet enforced thresholds. They are framed around the guarantees the harness is built to provide (see PROACTIVE_AGENT.md §9), and each maps to a signal above so it's measurable.

Objective	Starting target	Measured by
Schedule adherence — a due check runs near its schedule	within 1× the base interval	gap between expected and actual tick time; `proactive-agent-tick` runs
No false "OK" — an unreachable source never reads as healthy	0 false-OK	unreachable checks become `degraded`, never silent success (§6.1 invariant)
Duplicate-notification rate — the same thing isn't re-alerted	≈ 0	repeats of one `dedupeKey`; `fusion_agent_notifications_total{disposition="delivered"}` vs surfaced findings
Run latency — ticks finish well under their cap	p95 run duration < 30 s	`fusion_agent_run_duration_seconds` (the example caps a tick at 120 s)
Run success rate — most ticks succeed	≥ 99% over a rolling window	`fusion_agent_runs_total{state="SUCCEEDED"}` ÷ total
No infinite poison loop — bad runs stop	0 runs past the attempt cap	`DEAD_LETTER` count vs the retry cap; `agent.alert.dead_letter`
Delivery — a computed alert is never lost	effectively-once	`agent.alert.delivery_failed` backlog stays near zero; dead delivery rows
Erasure — a deletion leaves nothing behind	0 orphaned rows	`eraseAgentDataForUser` counts vs the user's prior `agentDataFootprint`

Note the harness targets effectively-once delivery (at-least-once transport + idempotent sinks), not exactly-once — true exactly-once isn't achievable across external systems, so the SLO is "no lost alert, no duplicate," not "exactly one attempt."