Proactive agent — operator runbook
The proactive agent is a scheduled harness in fusion-ai
(@tikab-interactive/fusion-ai/agent): a cron wakes it, it evaluates a set of
checks, the model proposes findings, and deterministic code decides whether to
notify, batch, or stay quiet. The harness ships only ports; the example app at
example/src/lib/agent-*.ts wires them to real infrastructure (Postgres, the mailer,
realtime, metrics). The full design rationale is in PROACTIVE_AGENT.md. This page has
two halves: How the heartbeat works — a code-grounded
walk through the loop, aimed at someone debugging "why didn't the agent notify me?" —
and then the operator view: how to pause it, drain failures, and read its signals
when something goes wrong.
Everything here describes mechanisms that exist in the codebase. Where a lever is a
server function it can be invoked from the /sandbox/agent page; where it is a
cron it is registered in registerAgentCron (example/src/lib/agent-jobs.ts) and runs
on the worker.
How the heartbeat works
This section is the one to read when a user says "the agent never told me about X." There are about eight places that answer could die — a cron that isn't running, a consent flag, an interval gate, a degraded check, the model staying silent, a dedupe row, a cap, or a delivery that rolled into a digest — and each is deterministic code you can point at. We'll walk the loop top to bottom and name the exact file and function at every step.
Coming from Django? The whole thing is Celery beat plus a delivery policy. Beat fires a periodic task on a schedule; here Hatchet (a self-hosted scheduler) fires
proactive-agent-tickevery minute. The difference is what the task does: instead of "always send the email," it reads your data, asks a model what's worth your attention, and then runs that proposal through rails (dedupe, cooldown, caps) that decide whether to actually interrupt you. The "didn't fire" bugs you'd debug in Celery (worker down, beat not scheduling, task raised) are all here too — plus a few new ones above.
What fires a tick — the Hatchet cron
The heartbeat is a Hatchet cron, registered once when the worker boots. Hatchet is the
self-hosted workflow engine the stack runs on (see deploy); think of it as
Celery beat + a result/lease store. The registration lives in registerAgentCron
(example/src/lib/agent-jobs.ts):
export async function registerAgentCron(): Promise<void> {
await ensureCron(proactiveAgentTick, "proactive-agent-every-minute", "* * * * *");
await ensureCron(deliveryRetryTick, "agent-delivery-retry-every-minute", "* * * * *");
await ensureCron(agentRetentionTick, "agent-retention-daily", "0 3 * * *");
await ensureCron(embeddingBackfillTick, "agent-embedding-backfill", "*/5 * * * *");
await ensureCron(memoryConsolidationTick, "agent-memory-consolidation-daily", "30 3 * * *");
await ensureCron(digestTick, "agent-digest-briefing", "30 7 * * *");
}So the cadence is once a minute (* * * * *) for the tick itself; the other crons are
the supporting cast (digest briefing at 07:30, retention at 03:00, embedding back-fill
every 5 minutes, memory consolidation at 03:30). ensureCron swallows Hatchet's
"already exists" 400 so the worker is safe to restart — re-registering the same
(name, expression) pair would otherwise crash it on every boot.
Each cron firing invokes the matching Hatchet task. The tick task
(proactiveAgentTick, same file) is a thin wrapper: it resolves the owner, checks
consent, builds the config + actor + stores, and calls the harness entry point
runAgentTick. Two scheduling decisions are made here, before the harness:
- Consent gate (
getUserConsent): autonomous monitoring is opt-in per user, so a tick for a non-consenting owner returns{ skipped: "no-consent" }and never touches the harness. (More in GDPR & retention.) - Missed-tick policy (
shouldCatchUp): if the worker was down and ticks were missed, the task tags this runtrigger: "catchup"instead of"timer". A catch-up runs exactly once and bypasses the interval gate — it is not a replay of every missed minute. Otherwise the trigger is"timer".
const trigger: Trigger = shouldCatchUp(
systemClock.now(),
rt.lastTickAt,
config.baseIntervalSeconds,
)
? "catchup"
: "timer";Running it locally. The worker is a standalone process — start it with bun run worker (example/src/worker.ts). It loads .env.local first, then guards on
isJobsEnabled() before importing anything Hatchet-related.
Without Hatchet: the HATCHET_CLIENT_TOKEN gate
The single most common reason "the agent never told me about X" is that no cron is
running at all — because there's no worker. Background jobs are opt-in via one
environment variable. isJobsEnabled() is just:
/**
* Background jobs are optional. They light up only when `HATCHET_CLIENT_TOKEN`
* is set — the same graceful-degradation contract as the mailer (`SMTP_HOST`
* in @tikab-interactive/fusion-mailer). When unset, the web app boots normally and enqueue
* helpers simply log-and-skip.
*
* This entry imports NOTHING (no Hatchet SDK), so the web server can check the
* flag — and the app's enqueue helpers can guard on it — without pulling the
* gRPC client into the request bundle. The SDK lives behind the separate
* `@tikab-interactive/fusion-jobs/hatchet` entry, which is only ever loaded via dynamic import
* (in the app's enqueue helpers) or by the standalone worker.
*/
export function isJobsEnabled(): boolean {
return Boolean(process.env.HATCHET_CLIENT_TOKEN);
}This gate appears in three places, and together they make Hatchet a clean progressive enhancement:
- The worker (
worker.ts) checks it first and exits cleanly if it's unset — "background jobs disabled; nothing to run." No worker means noproactive-agent-tickcron, so the autonomous timer never fires. The app still runs fine. getHatchet()(fusion-jobs/src/hatchet.ts) wrapsHatchet.init(), which throws without the token. Because importingagent-jobs.tscallsgetHatchet()at module load, that module is imported only by the worker, after the guard — never by the web server.- The web server's own server functions (
getAgentStatusinagent-server.ts) surfacejobsEnabledso the/sandbox/agentpage can show an amber "timer is off" state.
So the inline fallback is: there is no autonomous loop, but every tick is still runnable
by hand. The exact same runAgentTick runs inline in the web process via the manual
trigger (run_agent_check_now / the sandbox "Run tick" button — see
Run a tick now). That's how the feature stays fully demoable with zero
infrastructure: the timer is the only thing Hatchet provides; the mechanism is identical
either way.
Debugging checkpoint #1. Agent silent on the timer? Confirm a worker is up (
bun run workerprinted "crons registered") andHATCHET_CLIENT_TOKENis set. With no token there is no timer — only manual ticks. This is by far the most common cause.
What runs inside a tick — runAgentTick
runAgentTick (fusion-ai/src/agent/loop.ts) is the orchestrator. It is pure given its
injected dependencies (clock, model, store, sink) — the loop file never imports the LLM
or the database, so the whole thing is unit-testable with scripted fakes (loop.test.ts).
A throw anywhere inside is caught and the run transitioned to FAILED; the worker never
crashes. The order is fixed, and every gate below is a place a finding can legitimately
not happen:
- Kill switch —
if (config.killSwitch) return skip("killSwitch"). Instant no-op, no run row. (See Pause one agent vs pause all runs.) - Interval gate —
isDueTick(...)(scheduling.ts). Atimertrigger must have waited a fullbaseIntervalSecondssince the last tick;manualandcatchupbypass it. The demo setsbaseIntervalSeconds: 60(agent-config.ts), so although the cron fires every minute, the gate is what actually paces real ticks. - Single-flight lease —
store.openRun(...). The example claims aRUNNINGrow under a Postgres advisory lock with a lease (leaseExpiresAt), not a boolean flag (agent-store.ts). If a live, unexpired run already holds it, this tick returnsskip("overlap"). Because it's a lease, a crashed run's claim expires on its own and no longer blocks the next tick — that's the recovery story. - Which checks are due — the loop filters
input.checksby three things: the check isenabled, it's due on its own per-check interval (isCheckDue), and we're inside active hours (isWithinActiveHours) unless the check iscriticalAlways. - Gather signals (
gather.ts) — run each due check, failure-isolated, with a breaker + bounded retry + per-tool timeout. Detailed below. - Model proposes (
model.ts) — the gathered signals go to the LLM, which returns candidate findings (or an empty array). One bounded re-ask on malformed output, then a cleanFAILED. Detailed in How a finding becomes a notification. - Token-budget cap — if
proposed.tokens > config.maxTokensPerRun, the run does not deliver or write memory: it endsFAILEDwith abudgetoperator alert. A runaway-cost guard. - Delivery decision (
delivery.ts) — deterministic dedupe / cooldown / caps decide what actually reaches you. Detailed below. - Dispatch (
dispatch.ts) — push the surviving items to the sink and record them as surfaced; routeactfindings to the approval queue. - Atomic success commit —
store.finishRun(...)writes memory and flipsRUNNING → SUCCEEDEDin one transaction (agent-store.ts).
Here is the loop as one diagram — every branch that ends without a notification is a real, debuggable outcome:
The checks: live, owner-scoped reads → a signal string
A check is a small object: an id, a prompt telling the model what to watch for, an
interval, and a gatherSignal(actor) function that does a live, owner-scoped, bounded
query and returns a short string. The real ones live in realChecks()
(example/src/lib/agent-checks.ts). There are three:
| Check | Reads | Watches for |
|---|---|---|
deadlines | processn_assigned_task (joined) | the user's assigned tasks overdue or due within 48 h |
inbox | mailbox | messages to the user in the last 24 h that need a reply |
ops | audit_log | agent.alert.* / auto-disable incidents in the last hour |
Three properties make these safe and make their output trustworthy:
- Owner-scoped. Every query filters on
actor.ownerId(e.g. the deadlines check:eq(processnAssignedTask.userId, actor.ownerId)). A check can only ever see one user's data — the same §8 cross-tenant boundary the chat tools obey. - Bounded. Ranking and limits are pushed into the SQL — a recent-time window plus
ORDER BY … LIMIT 10— so the model gets a short, pre-ranked candidate list, not a firehose. - A signal, not a verdict.
gatherSignalreturns prose for the model to judge — e.g.- "Submit drawings" due 2026-06-22T… (OVERDUE)— including the explicit "nothing here" case ("No assigned tasks are overdue or due within the next 48 hours."). The check never decides to notify; it only reports.
The crucial invariant: "couldn't check" is never faked as "all clear." If a
gatherSignal throws, the harness marks that check degraded and tells the model so (the
user prompt carries STATUS: could not be checked this tick (treat as UNKNOWN, not OK)).
This is enforced in gatherSignals (fusion-ai/src/agent/gather.ts), which wraps each
check in: a circuit-breaker gate → a bounded retry (GATHER_ATTEMPTS = 3) → a
per-tool timeout (withTimeout, capped at 30 s). After repeated failures the breaker
trips open, an operator alert fires, and the harness emits a synthetic outage finding
(makeOutageFinding) that flows through the same delivery pipeline as model findings —
so a persistently broken source surfaces as a high-severity "Can't reach X" notification
rather than silence.
Debugging checkpoint #2. A check that returns "nothing notable" produces no finding — that's correct, quiet behaviour, not a bug. But a check that's silently failing is different: look at
agent_check_state.last_status(it'll readdegraded) and theagent.alert.breaker_openaudit rows. Degraded ≠ healthy; the model is told as much.
How a finding becomes a notification (or doesn't)
This is the heart of the "why didn't it notify me?" question. A finding is the model's
candidate (shape in fusion-ai/src/agent/findings.ts): a title, detail, a severity
(critical | high | normal | low | info), a confidence, a suggestedAction
(notify | act | log_only), and a dedupeKey — a stable identity of the underlying
real-world thing (e.g. task:Submit drawings), which is what makes dedupe possible. The
model proposes; it never notifies. Two layers stand between a finding and your bell.
Layer 2 — the model (model.ts). createFindingModel turns the gathered checks into
findings via a structured-output LLM call. The system prompt forces "default to silence;
only surface what genuinely matters," fences each signal as untrusted <signal> DATA
(prompt-injection defence), and — for a Swedish deployment — writes the title/detail in
the user's language while keeping the dedupeKey a stable machine identifier. If nothing
is notable the model returns an empty array, and the loop stays completely quiet (it never
touches the sink — that's the "don't be annoying" guarantee). With no AI provider configured,
generateObject returns null → zero findings (graceful degradation); the offline demo uses
a deterministic model that fires on an [URGENT] marker so the flow is testable without an LLM.
Layer 3 — the deterministic delivery policy (delivery.ts, decideDelivery). This is
pure code, not the model's discretion — the single most important production rail. Each
proposed finding runs this gauntlet, in order:
- Mute — if the user muted this agent, non-critical findings are suppressed (
muted). - Classify mode (
classifyMode) —log_onlyorinfo→ suppressed outright;critical/high(or a user's must-know check) → immediate; everything else → digest. - Dedupe / cooldown / ack (
checkSuppression) — the loop looks up thisdedupeKeyin thesurfaced_findingledger (store.surfacedFor). If it was already notified recently (withincooldownSeconds, default 1 h) →cooldown. If the user acknowledged or resolved it →acknowledged. Exception: a finding whose severity escalated since last time (isEscalation) bypasses cooldown once — a problem getting worse still reaches you. - Caps (
applyCapMode) —maxNotificationsPerRun(default 3) andmaxNotificationsPerDay(default 20). Overflow doesn't get dropped — it rolls into the digest.criticalis exempt from caps entirely. Findings are ranked bypriorityScore(severity dominates, confidence breaks ties) before caps apply, so the most important survive.
Whatever survives as immediate goes to the DeliverySink (agent-delivery.ts), which
is where a finding finally becomes user-visible:
- immediate → an in-app
notificationrow (the bell) — insertedonConflictDoNothingon(userId, dedupeKey, runId), so a retried run cannot double-notify the same finding (this is the "effectively-once" guarantee, recorded inagent_side_effect_log) — plus a best-effort realtime push (a no-op when Centrifugo is off; the bell reads the durable row on mount). - digest → queued to the
digest_itemtable; theagent-digest-briefingcron later batches a user's pending rows into one email.
And the ledger is updated: recordSurfaced upserts the surfaced_finding row
(lastNotifiedAt, lastSeverity, notifyCount), which is exactly what the next tick's
cooldown check reads. (When you acknowledge a notification, acknowledgeNotification in
agent-server.ts also flips the matching surfaced_finding to acknowledged, so the same
dedupeKey stops resurfacing.)
Debugging checkpoint #3 — the decision table. Walk it in order for the finding you expected:
Symptom Where it died How to confirm No matching data exists the check returned "nothing notable" run the check's query; check agent_check_state.last_status = okThe model judged it not worth surfacing Layer 2 returned no finding for it agent_runisSUCCEEDEDwithdelivered: 0; inspect the signalseverity = infoorsuggestedAction = log_onlyclassifyMode→ suppressed (policy)the finding's severity/action Already notified recently cooldown (default 1 h, unless escalated) surfaced_finding.last_notified_atfor thatdedupeKeyYou already acked/resolved it acknowledged suppression surfaced_finding.statusIt went to email, not the bell digest mode or a cap rolled it to digest digest_itemrows; wait foragent-digest-briefing(07:30)It wants to act, not notify routed to the approval queue (HITL) agent_approvalrows; the "Needs you" surfaceThe run failed before delivery a throw, or over token budget agent_run.state(FAILED/DEAD_LETTER) +errorThe
delivered/suppressedcounters on theRunAgentTickResult(and thefusion_agent_notifications_total{disposition}metric) tell you which bucket a tick's findings fell into without reading rows.
Memory: durable facts, embedded and recalled
Memory is what makes the agent feel continuous across its isolated ticks, and it feeds the
conversational front door too. It lives in one pgvector-backed table, agent_memory
(example/src/db/schema/agent.ts), with a kind of fact, preference, observation,
or summary. There are two write paths and two read paths:
Writing. Every successful tick writes a one-line observation summarizing the run, as
part of the atomic finishRun transaction (agent-store.ts) — so memory and SUCCEEDED
commit together. The conversational path is rememberMemory: when you tell Carola
"call me Andy," the chat's remember tool persists a durable fact/preference. Both
embed the text (768-dim) when an embedder is configured; when it isn't, the row is written
embedding_pending = true and stays keyword-searchable — never silently lost. A
back-fill cron (backfillEmbeddings, every 5 min) vectorizes pending rows once a provider
is reachable, so historical memory becomes vector-recallable without a re-write.
Recalling. Inside a tick the model can call the owner-scoped memory_search /
memory_get tools (createMemoryTools + createDrizzleMemoryStore) — cosine search when
embedded, keyword/recency fallback otherwise. But the more visible recall is at the
conversational front door: the chat endpoint (example/src/routes/api/agent/chat.ts)
calls recentUserMemories(db, actor) on every turn and injects the user's durable facts
straight into the system prompt, so Carola addresses you correctly without having to call
a tool first:
const memories = await recentUserMemories(db, actor).catch(() => []);
const remembered =
memories.length > 0
? `Here is what you already remember about this user — use it naturally (address them by their preferred name and respect these preferences), and don't ask again for things you already know: ${memories
.map((mem) => `• ${mem}`)
.join(" ")}`
: "You don't have any saved facts about this user yet — when they share one, save it with the remember tool.";To keep memory from growing without bound, a daily consolidation ("dreaming") cron
(consolidateMemories) folds the oldest raw observation rows into one higher-level
summary and prunes the originals — atomically, under an advisory lock, idempotent under
concurrency. Memory is the one thing the retention sweep never deletes (it has no TTL by
design — consolidation, not deletion, bounds it). Every memory read is owner-scoped
(visibleMemoryWhere), the same cross-tenant boundary as everything else.
Debugging checkpoint #4. "It forgot my nickname." Check
agent_memoryfor the row (kind = 'fact'/'preference', yourowner_id). If the row is there but stillembedding_pendingand you expected vector recall, the embedder isn't configured yet — recency/keyword recall still works, and the back-fill cron will vectorize it. The chat injects the 12 most recent facts viarecentUserMemories, so a very old fact past that window relies on the model callingsearch_agent_memory.
Run a tick now
Two manual triggers run the exact same runAgentTick as the cron, which makes the
whole loop demoable and debuggable with no scheduler:
run_agent_check_now— a chat tool (example/src/lib/agent-assistant.ts). When you ask Carola to "kolla nu" / "check now," it runs a real tick with the real checks (realChecks()),trigger: "manual", and reportsstate/delivered/approvalsRequested. Because the trigger ismanual, it bypasses the interval gate — it always runs — but single-flight, dedupe, cooldown and caps still apply exactly as on the timer. So "run now" surfaces genuinely new findings but won't re-spam things already in cooldown.- The sandbox tick —
runAgentTickNow(example/src/lib/agent-server.ts), behind the/sandbox/agent"Run tick" button. It takes ascenario(finding | quiet | flaky | approval | real) so you can drive the deterministic demo checks — e.g.flakythrows a 503 to exercise the retry + circuit breaker,approvalemits anactfinding to exercise the human-in-the-loop round-trip — or pickrealto run the production checks. It defaults to the deterministic demo model (predictable, e2e-stable); opt into the real provider to watch an actual model judge the same signals.
Both record an agent_run row and an audit entry, so a manual tick is observable in
exactly the same places as a timer tick — which is what makes this section's debugging
checkpoints work whether or not Hatchet is running.
The conversational front door
Besides the autonomous timer, the agent — Carola — has a chat: the conversation-first
shell (Carola) where you land on /home, every thread addressable by URL
(/c/$id, or /project/$key/c/$id inside a building) and revisitable from the sidebar
history. A per-conversation header carries the thread controls (rename / star /
add-to-project / delete + Files + the floating-Carola pop-out), and the right rail is Carola's
companion — the live conversation alongside conversational maps. Every thread is
also indexed in universal search. The chat binds owner-scoped server tools
(example/src/lib/agent-assistant.ts) that read the same data the timer uses, so answers
come from reality, never invention:
| Tool | Answers |
|---|---|
list_recent_findings | what the agent has surfaced ("vad är brådskande?") |
list_upcoming_deadlines | overdue / due-soon assigned tasks |
search_agent_memory | what the agent has observed about this user |
run_agent_check_now | run the checks now and report |
search_project_wiki | the documentation handbook (WikiN) |
search_project_news | the project news feed (NyhetN) |
search_attached_documents | exact passages from files attached to the chat, with page citations |
The two search_project_* tools are backed by Postgres Swedish full-text search
(to_tsvector('swedish') @@ websearch_to_tsquery, ILIKE fallback) — no embedder, so they
work air-gapped. They let the chat answer concrete project questions ("Vilken Revit-familj
används?") by quoting the actual article. search_attached_documents covers files the user
attaches to a chat — see Document attachments below. These tools
are read-only and owner/visibility-scoped, so they don't appear in the operator levers below.
Document attachments (RAG)
A user can attach files to a Carola conversation (Carola) — owner-private,
scoped to that one thread (the chat_document_attachment join), including 1000-page standards
like a Eurocode — and ask for an exact detail ("what partial factor does §6.1 give?").
Each attachment is stored (chat_document + fusion-storage), and a
Hatchet worker processes it: Kreuzberg (a docker compose service)
extracts the text (PDF / Office / OCR; plain text is read directly), it's chunked with page
metadata, and each chunk is embedded (768-dim, pgvector — the same embedder the agent's memory
uses). Carola's search_attached_documents tool then cosine-searches the chunks scoped to
that conversation (keyword fallback when no embedder) and answers strictly from the returned
excerpts, citing the file and page. Graceful degradation: without Hatchet the upload
processes inline; without Kreuzberg only plain-text attachments are read. KREUZBERG_URL
(preset in .env.local) toggles extraction.
How it compares — OpenClaw and other systems
The agent is OpenClaw's idea rebuilt as a production harness. OpenClaw is the
open-source reference implementation this is modelled on (hence
"OpenClaw-style" in PROACTIVE_AGENT.md); it demonstrated the mechanism. Almost
everything Fusion adds is about making that mechanism safe, multi-tenant, and operable.
The thesis of the design doc is one line: production-readiness lives in the harness, not
the model.
Same as OpenClaw (the shared idea)
The core loop is identical, and intentionally so:
- A timer wakes the agent on an interval — the "heartbeat." It is not driven by user input. That autonomy is the "proactive" in proactive agent.
- Each wake-up reads a checklist of things worth watching (the
checks), gathers live signals with tools, and asks the model to judge what — if anything — matters. - The model proposes; the harness then notifies, batches, or stays silent. A quiet run produces zero output — the "don't be annoying" guarantee.
- Each heartbeat is an isolated session (bounded blast radius and token cost), and the agent keeps long-term memory across runs.
If you've seen OpenClaw, the loop here is familiar.
Deliberately different from OpenClaw (the hardening)
The differences are all in the harness, not the model:
| Aspect | OpenClaw | This agent |
|---|---|---|
| Deciding what's important | hands the checklist to the LLM and trusts its judgment | the LLM judges inside deterministic rails — explicit severity/score, dedupe, cooldown, caps, digest-vs-immediate. Testable, auditable, no notification storms. |
| What it can touch | open shell, files, a browser — very powerful, very risky | typed tools only, default-deny allow-list. No shell, no arbitrary files, no browser. The single biggest deliberate deviation. |
| What can trigger a run | the heartbeat could be re-fired by non-timer events (≈1–2 min) | the interval gate is the only trigger — a tool finishing or a memory write can never start a run. |
| Concurrency & crashes | best-effort | a run lease (not a boolean flag) makes a crashed run recoverable; single-flight per agent; missed ticks → one catch-up, not a replay storm; per-run timeout; dead-letter + auto-disable for poison runs. |
| A watched thing breaks | — | per-tool timeout, error classification, bounded retry, a circuit breaker per integration, and "couldn't check" never reads as "all clear." |
| Risky actions | the model just does them | human-in-the-loop approval — consequential actions pause for an explicit approve/deny (see agent-approvals.ts). |
| Multi-tenant & privacy | a single power-user tool | every tool + memory read is permission- and tenant-scoped (authorization); data residency can force a local model (Ollama) for air-gapped tenants — never the cloud. |
| Delivery | — | effectively-once (at-least-once transport + idempotent sinks + dedupe keys) — a computed alert is never lost and never double-sent. |
| Operability | — | per-run audit rows, Prometheus metrics, operator alerts (a separate channel from user notifications), a kill switch, and the SLOs below. |
Net: safer and multi-tenant, but narrower — OpenClaw can roam your machine; this can only do the specific, vetted things its tools allow.
Where it sits next to other categories
It reuses well-known patterns rather than inventing new ones:
| Category | Relationship |
|---|---|
| Cron / scheduled scripts | The base is a cron tick — but with model judgment and a delivery policy on top, instead of "always fire." |
| Monitoring & alerting (Alertmanager, PagerDuty, Datadog) | Borrows the alerting discipline — severity, dedupe keys, cooldown, escalation, digests to fight fatigue. Difference: the "is this worth surfacing?" call is an LLM over heterogeneous app data, not threshold rules over metrics — and it can propose actions, not just alert. |
| Workflow engines (Hatchet, Temporal, Airflow) | It runs on one — Hatchet provides the scheduling, leases, retries and dead-letter; the agent is the domain harness on top, not a re-implementation. See deploy. |
| Autonomous-agent frameworks (AutoGPT/BabyAGI, LangChain) | Same tool-calling LLM loop, but those are usually user-initiated and "run until the goal is done." This is scheduled, bounded per heartbeat, and gated — capped steps/tokens, no open-ended autonomy, a kill switch. |
| RAG chatbots / copilots | That's the reactive half (Carola's chat, above) — it answers when asked. The proactive loop is the inverse: it acts without being asked. This system ships both, sharing the same tools and data. |
At a glance
| I need to… | Lever | Effect |
|---|---|---|
| Pause all proactive runs | setAgentKillSwitch({ on: true }) → agent:killSwitch | every tick skips instantly (skipped: "killSwitch") |
| Pause one agent | flip that agent's killSwitch (per-agent config) | only that agent's ticks skip |
| See failed runs | agent_run rows where state = 'DEAD_LETTER' | the poison runs that exhausted retries |
| Reset a tripped breaker | restart the web/worker process (breakerStore is in-memory) | all breakers return to closed |
| Re-run a missed digest | drainDigestsNow (manual) or wait for the agent-digest-briefing cron | sends each user's queued digest_item rows as one briefing email |
| Watch the system | fusion_agent_* on /api/metrics + agent.alert.* in the audit log | run health, breaker trips, delivery failures |
| Erase a user's data | eraseAgentDataForUser(db, ownerId) (self-service via eraseMyAgentData) | one transactional, orphan-free cascade |
Pause one agent vs pause all runs
The global kill switch is the agent:killSwitch setting. Toggle it with the
setAgentKillSwitch server function (example/src/lib/agent-server.ts), which calls
setKillSwitch (example/src/lib/agent-config.ts):
export async function setKillSwitch(on: boolean): Promise<void> {
await setSetting(db, "agent:killSwitch", on);
clearSettingsCache("agent:killSwitch"); // bypass the 30s TTL so the toggle is instant
}The clearSettingsCache call is what makes it instant — without it the change
would wait out the settings cache TTL. runAgentTick reads the flag at the very top
of the tick and bails before doing any work:
if (config.killSwitch) return skip("killSwitch");So flipping it on means the next tick and every tick after stops immediately — no run record, no model call, no notifications. Flip it back off to resume. This is the big red button for an incident (runaway cost, a misbehaving integration, a bad deploy).
One agent vs all. getAgentConfig layers a per-agent override key
(agent:<id>:config) over tenant defaults and then forces the global
agent:killSwitch on top, so the global switch is authoritative for every agent. To
pause a single agent without stopping the rest, set killSwitch: true inside that
agent's agent:<id>:config override (or disable its individual checks via the
per-check enabled flag) and leave the global switch off. The example runs one shared
demo agent (AGENT_ID = "demo"), so in the sandbox the global switch and the
per-agent switch are the same lever.
Auto-disable on a poison run
A run that keeps crashing is a poison run. After the attempt cap (default 3) the
harness moves it to DEAD_LETTER instead of retrying forever, fires a dead_letter
operator alert, and invokes its onAutoDisable hook. The example wires that hook to
autoDisableAgent, which flips the kill switch so the agent stops being scheduled and
writes an agent.auto_disable audit row:
/** Called when a poison run is dead-lettered (§6.2): the consumer disables the agent,
* e.g. flips its kill switch, so it stops being scheduled. Best-effort — errors here
* never re-fail the already-dead-lettered run. */
onAutoDisable?: (p: { agentId: string; reason: string }) => Promise<void>;The point for an operator: a dead-lettered agent is meant to stay stopped until you look at it. It will not silently start hammering the same broken dependency again. Recovery is deliberate — investigate the cause, then re-enable by flipping the kill switch back off (see Drain dead-letters below).
Drain dead-letters
Failed-beyond-retry runs land in agent_run with state = 'DEAD_LETTER'. The state
machine (fusion-ai/src/agent/pure.ts) treats DEAD_LETTER as terminal — it
never transitions back to QUEUED, so a dead-lettered run is inert and will not be
retried by the worker.
Find them:
-- Dead-lettered runs, newest first
SELECT id, agent_id, owner_id, trigger, attempt, error, created_at
FROM agent_run
WHERE state = 'DEAD_LETTER'
ORDER BY created_at DESC;What to do with them:
- Read
erroron the row (and the matchingdead_letterline in the audit log,agent.alert.dead_letter) to find the cause — a permanently broken check, a malformed config, a downstream that is genuinely gone. - Confirm the agent is stopped. A poison run dead-letters and auto-disable is expected to have flipped the kill switch (above). Verify the agent isn't still being scheduled before you fix anything.
- Fix the root cause — repair the check/config or the downstream integration.
- Re-enable by flipping the kill switch back off (
setAgentKillSwitch({ on: false }), or clear the per-agentkillSwitch). The nextproactive-agent-tickthen runs a single fresh tick.
The dead-lettered rows themselves are historical records. They are not re-run; they age out on the normal retention schedule (terminal runs, see GDPR & retention). There is no "requeue a dead-letter" button by design — replaying a poison run would just re-poison.
Reset a stuck circuit breaker
Each downstream integration has a per-integration circuit breaker so a dead
dependency is fast-failed instead of hammered every tick. The breaker state machine
(closed → open → half_open) lives in fusion-ai/src/agent/resilience.ts; the
example persists it through breakerStore in example/src/lib/agent-resilience.ts:
const breakerState = new Map<string, BreakerState>();Behavior: failures count up and the breaker trips open at the threshold; while open
it fast-fails until the cooldown elapses, then flips to half_open to allow exactly
one probe; a successful probe closes it, a failed probe re-opens it. Each trip fires a
breaker_open operator alert and increments fusion_agent_breaker_open_total.
To reset it now: because breakerStore is an in-memory Map, restarting the
process that holds it clears every breaker back to closed (CLOSED_BREAKER). In
the example that is the web process (manual ticks) and/or the worker (cron ticks). If
you don't want to wait for the half-open probe, a restart is the reset.
The standard recovery path is to just wait — the breaker self-heals via the
half-open probe once the dependency is back. Restart only when you need it closed
immediately. Note the in-memory caveat: a breaker open state does not survive a
restart, and is not shared across processes — a production deployment would back
breakerStore with the DB or Redis so an open breaker is durable and fleet-wide.
Replay a missed digest
normal/low findings are batched into a digest instead of interrupting the user
one at a time. The DeliverySink (example/src/lib/agent-delivery.ts) does not email
these inline — it queues each one to the digest_item table (enqueueDigestItems). A
separate scheduled job then drains the queue: agent-digest-tick (cron
agent-digest-briefing, e.g. a morning slot) calls drainDigests, which batches each
user's pending rows into one briefing email (kind: "agent_digest") and marks them
sent. This is the main defense against notification fatigue.
So to replay / force a briefing now:
- Manually: call
drainDigestsNow(the/sandbox/agentpage exposes a "Send briefing now" button). It drains all currently-pendingdigest_itemrows into briefing emails immediately. - Scheduled: the
agent-digest-briefingcron runs the samedrainDigestspath on its schedule; the next firing sweeps whatever is queued.
Two things that make a replay safe to do:
- Idempotent delivery. Immediate notifications are inserted
onConflictDoNothingon(user, dedupeKey, runId)and recorded inagent_side_effect_log, so re-running a tick cannot double-notify the same finding. - Dedupe/cooldown. A finding already surfaced recently won't resurface unless its severity escalates, so a manual replay won't spam the user with everything again.
If the digest email itself failed to send (SMTP down), it rides the mailer outbox — fix the mailer and the outbox redelivers; you do not need to re-run the agent for that.
Operator alerts vs user notifications
Keep these two channels separate — they have different audiences.
| Channel | Audience | Where it lands |
|---|---|---|
| Operator alerts | whoever runs the system | the audit log as agent.alert.* (+ a console.warn) |
| User notifications | the end user | the in-app bell (notification rows) + optional realtime push |
Operator alerts are written by the operatorAlert port
(example/src/lib/agent-resilience.ts) as audit rows with
action = "agent.alert.<kind>". The kinds are fixed
(OperatorAlertKind in fusion-ai/src/agent/resilience.ts):
| Audit action | Fires when |
|---|---|
agent.alert.dead_letter | a poison run exhausted its attempts and was dead-lettered |
agent.alert.breaker_open | a per-integration circuit breaker tripped open |
agent.alert.delivery_failed | a realtime push dead-lettered after the retry cap (per user) |
agent.alert.budget | a run hit its token/step budget and ended FAILED |
GDPR actions are audited separately under agent.gdpr.* — the self-service erasure
writes agent.gdpr.erase (example/src/lib/agent-server.ts).
await recordAudit(db, {
actorUserId: null,
actorName: `agent:${event.agentId}`,
action: `agent.alert.${event.kind}`,
entityType: "agent_alert",
summary: event.detail,
meta: { kind: event.kind, runId: event.runId ?? null },
});Query for operator alerts:
SELECT created_at, action, summary, meta
FROM audit_log
WHERE action LIKE 'agent.alert.%'
ORDER BY created_at DESC
LIMIT 100;User notifications are the in-app bell — notification rows written by the
DeliverySink, surfaced via listMyNotifications, with an optional live realtime push
(a no-op when Centrifugo is off; the bell reads the durable row on mount). These never
page you; they are the product surface.
Metrics to watch
The harness's AgentMetrics port is wired onto the shared fusion-metrics Prometheus
registry (example/src/lib/agent-metrics.ts), so proactive-agent series appear on the
same /api/metrics exposition as the rest of the stack — no new endpoint. They are
recorded only when METRICS_ENABLED is set; otherwise the sink no-ops.
| Series | Type | Labels | Watch for |
|---|---|---|---|
fusion_agent_runs_total | counter | state, trigger | a rising state="FAILED" / "TIMED_OUT" / "DEAD_LETTER" rate |
fusion_agent_run_duration_seconds | histogram | state, trigger | p95/p99 creeping toward the per-run timeout (runs about to time out) |
fusion_agent_notifications_total | counter | disposition | disposition="delivered" spiking = possible notification fatigue/dedupe gap |
fusion_agent_breaker_open_total | counter | agent | any increase = a downstream integration is failing repeatedly |
Concretely:
- Run health — split
fusion_agent_runs_totalbystate. A healthy system is almost allSUCCEEDED; a climbingDEAD_LETTERmeans poison runs (check the audit log),TIMED_OUTmeans runs are exceeding the per-run cap,FAILEDwith abudgetalert means cost caps are biting. - Latency —
fusion_agent_run_duration_secondsp95 trending up toward the timeout is your early warning that ticks are about to start timing out and being skipped by single-flight. - Notification volume —
fusion_agent_notifications_total{disposition="delivered"}far abovesuppressedsuggests dedupe/cooldown isn't catching repeats; a flood of identical alerts is the classic proactive-agent failure mode. - Breaker trips — any
fusion_agent_breaker_open_totalincrease points at a specific failing integration (theagentlabel) — correlate with theagent.alert.breaker_openaudit rows.
GDPR & retention operations
Two distinct paths (example/src/lib/agent-retention.ts):
Self-service erasure — eraseAgentDataForUser(db, ownerId). One transaction that
cascades all of a user's agent data, orphan-free, ordered so run-keyed tables go
before the runs they reference: side-effect log → working log → notifications →
delivery attempts → surfaced findings → approvals → digest items → memory → runs. It
returns per-table delete counts. The user-facing entry point is the eraseMyAgentData
server function, which runs the cascade and writes an agent.gdpr.erase audit row with
the counts. (Deleting the whole account also erases everything — the FKs are
onDelete: "cascade"; this path is for "forget my assistant history" while the account
stays.)
Consent / opt-in. Autonomous (timer) monitoring is opt-in per user (§10):
getUserConsent / setUserConsent store the consent (with a timestamp) in
user.metadata.agentConsent, and the proactive-agent-tick cron skips any owner who
hasn't consented (skipped: "no-consent"). A manual tick is the user's own explicit
action, so it isn't gated. Operator note: if an agent isn't running for a user on the
timer, check their consent first — the /sandbox/agent page has a consent toggle.
Daily retention sweep — sweepAgentRetention. Run off-peak by the
agent-retention-tick cron (registerAgentCron, daily at 03:00). It ages out only
terminal/settled rows — never live ones, and never durable memory (memory has
no TTL by design; consolidation, not deletion, bounds its growth). Defaults, all
overridable via env without code:
| Data | Aged out when… | Default TTL | Env override |
|---|---|---|---|
| working log (scratch) | older than TTL | 7 days | AGENT_RETENTION_WORKING_LOG_DAYS |
agent_run (terminal) | older than TTL, not RUNNING/QUEUED | 90 days | AGENT_RETENTION_RUN_DAYS |
| notifications | older than TTL, status not unread | 90 days | AGENT_RETENTION_NOTIFICATION_DAYS |
| delivery attempts | older than TTL, status not pending | 30 days | AGENT_RETENTION_DELIVERY_DAYS |
| resolved findings | older than TTL, status resolved | 90 days | AGENT_RETENTION_RESOLVED_FINDING_DAYS |
The sweep is idempotent and safe to run repeatedly. It will never reap a RUNNING run
out from under a worker, an unread notification, a pending delivery retry, or an active
finding.
Service-level objectives (SLOs)
These are starting targets to agree with the team, not yet enforced thresholds.
They are framed around the guarantees the harness is built to provide (see
PROACTIVE_AGENT.md §9), and each maps to a signal above so it's measurable.
| Objective | Starting target | Measured by |
|---|---|---|
| Schedule adherence — a due check runs near its schedule | within 1× the base interval | gap between expected and actual tick time; proactive-agent-tick runs |
| No false "OK" — an unreachable source never reads as healthy | 0 false-OK | unreachable checks become degraded, never silent success (§6.1 invariant) |
| Duplicate-notification rate — the same thing isn't re-alerted | ≈ 0 | repeats of one dedupeKey; fusion_agent_notifications_total{disposition="delivered"} vs surfaced findings |
| Run latency — ticks finish well under their cap | p95 run duration < 30 s | fusion_agent_run_duration_seconds (the example caps a tick at 120 s) |
| Run success rate — most ticks succeed | ≥ 99% over a rolling window | fusion_agent_runs_total{state="SUCCEEDED"} ÷ total |
| No infinite poison loop — bad runs stop | 0 runs past the attempt cap | DEAD_LETTER count vs the retry cap; agent.alert.dead_letter |
| Delivery — a computed alert is never lost | effectively-once | agent.alert.delivery_failed backlog stays near zero; dead delivery rows |
| Erasure — a deletion leaves nothing behind | 0 orphaned rows | eraseAgentDataForUser counts vs the user's prior agentDataFootprint |
Note the harness targets effectively-once delivery (at-least-once transport + idempotent sinks), not exactly-once — true exactly-once isn't achievable across external systems, so the SLO is "no lost alert, no duplicate," not "exactly one attempt."