OpenRemedy
Whitepaper v1.1 · June 2026

The moat is the harness, not the model.

How OpenRemedy approaches autonomous operations on Linux fleets — from values to implementation. A design-space document for operators, founders, and investors evaluating the agentic platforms emerging around incident response.

1.6% AI decision logic · 98.4% deterministic harness · 5 independent gates · 3 trust modes per server

01 Executive summary

Modern incident response runs on the same five investigations against the same five runbooks, repeated tens of thousands of times per day across every datacenter on the planet.

A senior engineer is paged at 03:00, spends 20 to 40 minutes proving which of the known fixes applies, and applies it. The work is structurally repetitive; the human is the bottleneck; and the failure mode is operator carelessness under fatigue — the wrong service restarted at the wrong time, the wrong hostname typed into a destructive command.

OpenRemedy is an autonomous SRE platform that closes those incidents under operator control. It runs the proving step that humans do today, proposes the fix it knows is safe, and executes only what the operator has approved — directly or in advance via the trust ladder.

The moat in autonomous operations is the operational harness, not the model.

This is not a novel claim. A recent academic analysis of Anthropic's Claude Code platform — a comparable autonomous agent in the adjacent domain of software engineering — found that only 1.6% of the codebase was AI decision logic; the remaining 98.4% was deterministic infrastructure: permission gates, context management, tool routing, recovery logic.[1] Frontier models are converging in capability. The durable competitive surface is the system that decides what to show the model, what the model can touch, and how to recover when the model makes a mistake.

OpenRemedy is built on that thesis, applied to a different domain and a different buyer.

02 The thesis

Harness, not model.

The agent loop is simple. Assemble the incident context, call the model, dispatch the tools the model wants to call, check whether the operator authorized those tools to run on this host, run the ones that clear, collect the results, repeat until the model says it is done. Maybe two hundred lines of code.
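
A minimal sketch of that loop, with the model stubbed out so the example runs on its own. The names `run_agent_loop`, `ToolCall`, `ModelReply`, `is_authorized`, and the two tools are illustrative assumptions, not OpenRemedy's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class ModelReply:
    done: bool
    tool_calls: list = field(default_factory=list)


def run_agent_loop(context, call_model, tools, is_authorized, host, max_turns=10):
    """Call the model, run only the tools authorized for this host, repeat."""
    transcript = list(context)
    for _ in range(max_turns):
        reply = call_model(transcript)
        if reply.done:
            break
        for call in reply.tool_calls:
            if not is_authorized(call.name, host):
                # The harness, not the model, decides what may run.
                transcript.append({"tool": call.name, "status": "denied"})
                continue
            result = tools[call.name](**call.args)
            transcript.append({"tool": call.name, "status": "ok", "result": result})
    return transcript


def scripted_model(transcript):
    """Stand-in for the LLM: request two tools once, then report done."""
    if any("tool" in entry for entry in transcript):
        return ModelReply(done=True)
    return ModelReply(done=False, tool_calls=[
        ToolCall("disk_usage", {"path": "/"}),
        ToolCall("restart_service", {"unit": "nginx"}),
    ])


tools = {
    "disk_usage": lambda path: "91% used",
    "restart_service": lambda unit: f"{unit} restarted",
}
# Hypothetical policy: only the read-only tool is cleared on this host.
transcript = run_agent_loop(
    context=[{"incident": "disk pressure on web-1"}],
    call_model=scripted_model,
    tools=tools,
    is_authorized=lambda name, host: name == "disk_usage",
    host="web-1",
)
for entry in transcript:
    print(entry)
```

In this run the diagnostic call executes and the restart is recorded as denied, which is the point of the sketch: the authorization check lives in the loop, outside the model.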

What surrounds the loop is the platform.

[Figure 1. Signals in: Daemon (15s heartbeat), Webhook (HMAC-signed), Scheduled probes (Ansible), Manual incidents (operator-opened). Agent pipeline: Triage → Diagnose → Validate → Execute (gated) → Review → Audit. Gate: approval / trust × risk.]

Figure 1 — Signals enter from four sources and converge into the agent pipeline. Every stage is deterministic platform code; only Triage and Diagnose call the model. The Execute stage is gated by approval or trust × risk. OpenRemedy Guardian raises the risk of destructive actions before the gate.

Every box outside Triage and Diagnose is deterministic infrastructure. The agent does not decide whether a recipe is safe to run on this host; the platform does. The agent does not write to the audit log; the platform does. The agent does not promote a server from shadow to live; the platform does, and only after an operator has clicked accept on a suggestion the platform earned through recorded outcomes.

This is not a constraint we accept reluctantly. It is the design.

Frontier models will continue to improve, and the buyer of an autonomous-operations platform — particularly in regulated industries — is not buying a model. They are buying the operator-control surface, the audit trail, the safety gates, the ability to defend a fix in front of an auditor. None of that comes from the model. All of it comes from the harness.

When we choose what to build next, we ask: does this make the platform smarter, or does it make the harness more trustworthy? We prefer the second.

03 Five values

What we optimise for.

Every architectural choice in OpenRemedy traces back to one of five values. The values are not aspirational; they are decision criteria. When two design alternatives both work technically, we pick the one that scores higher across the five.

1 · Operator decision authority

Humans own every action that matters. The agent proposes; the operator decides. This is a hard rule, not a default that can be relaxed. The platform refuses to ship a configuration where an LLM can self-authorize a remediation outside the bounds the operator has explicitly set.

The trust ladder (audit → shadow → live) is the mechanism by which operators grant autonomy in stages. A server in audit mode never sees a remediation proposal. A server in shadow mode sees every proposal pause for human approval, regardless of the trust × risk gate that governs live mode. Promotion between modes is operator-driven, with an audit trail that records who decided to grant autonomy, when, and based on what evidence.

2 · Defense in depth via independent failure modes

A safety boundary that depends on a single mechanism is not defense in depth; it is defense in name. The platform is designed around five independent gates that sit in front of any remediation, each with a different failure mode. The fifth, the safety classifier, is still planned.

| Gate | Depends on | Fails when |
| --- | --- | --- |
| Server mode | Operator-set field on the server row | Operator changes mode |
| Trust × risk | Recipe risk level + agent trust level | Recipe metadata corrupted |
| Approval gate | Operator action via UI | Operator unavailable |
| Tool filter | Hardcoded list of remediation tool names | Tool renamed without filter update |
| Safety classifier (planned) | Separate LLM call with hardened prompt | LLM provider outage |

Figure 2 — Five gates, five distinct failure modes. None share a single point of failure.

If any two of those gates depended on the same external service (say, all of them required an LLM call to evaluate), they would not be five gates. They would be one gate wearing five hats. We explicitly audit for shared failure modes and move the layers apart when they coincide. This is the pattern Liu et al. flag as the single most common failure in agent platforms: defense in depth that degrades to a single point of failure under load.[1]
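
The audit for shared failure modes can be sketched as a pairwise check over each gate's declared dependencies. The dependency labels below paraphrase the "Depends on" column of the gate table; the identifiers and the function name are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical labels, paraphrasing the "Depends on" column of the gate table.
GATE_DEPENDENCIES = {
    "server_mode": {"server_row"},
    "trust_x_risk": {"recipe_metadata", "agent_trust_level"},
    "approval_gate": {"operator_ui"},
    "tool_filter": {"hardcoded_tool_list"},
    "safety_classifier": {"llm_provider"},
}


def shared_dependencies(gates):
    """Return {(gate_a, gate_b): common} for every pair sharing a dependency."""
    overlaps = {}
    for (a, deps_a), (b, deps_b) in combinations(sorted(gates.items()), 2):
        common = deps_a & deps_b
        if common:
            overlaps[(a, b)] = common
    return overlaps


# As declared, no two gates share a dependency.
print(shared_dependencies(GATE_DEPENDENCIES))

# "One gate wearing five hats": every gate also needs the LLM provider.
five_hats = {name: deps | {"llm_provider"} for name, deps in GATE_DEPENDENCIES.items()}
print(len(shared_dependencies(five_hats)))
```

A check like this only sees dependencies that someone declared; it is a prompt for review, not proof of independence.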

3 · Reversibility-graduated autonomy

Granting an agent the authority to act on production infrastructure is not a binary decision. It is a gradient, and the platform's job is to make that gradient legible.

Every server in OpenRemedy lives in one of three modes.

[Figure 3. AUDIT (observe only) → SHADOW (every action waits for approval) → LIVE (trust × risk decides). The operator promotes; the promotion ladder suggests; the operator demotes at any time.]

Figure 3 — The trust ladder. Promotion is operator-gated; demotion is always available.

In audit, the agent classifies the incident and resolves it immediately. No remediation is proposed; nothing executes; the operator gets visibility without commitment.

In shadow, the agent runs the full diagnostic and proposal pipeline, and every proposal waits for human approval.

In live, the trust × risk gate decides whether a recipe runs autonomously.

A server enters in audit and only earns its way to live through recorded operator approvals — not through configuration, not through a one-time decision the operator might forget about. The promotion ladder generates suggestions when a (server, recipe) pair has accumulated at least ten approvals with zero rejections in a 30-day window. The operator accepts or dismisses; the platform never self-promotes.
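
The suggestion rule can be sketched as a filter over recorded outcomes. This assumes outcomes are stored as simple timestamped records; `Outcome`, `should_suggest_promotion`, and the recipe name are hypothetical, and the sketch covers only the approval and rejection counts stated above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Outcome:
    server: str
    recipe: str
    kind: str  # "approval" or "rejection" in this sketch
    at: datetime


def should_suggest_promotion(outcomes, server, recipe, now,
                             min_approvals=10, window_days=30):
    """True when the (server, recipe) pair has enough approvals and no rejections."""
    cutoff = now - timedelta(days=window_days)
    recent = [o for o in outcomes
              if o.server == server and o.recipe == recipe and o.at >= cutoff]
    approvals = sum(o.kind == "approval" for o in recent)
    rejections = sum(o.kind == "rejection" for o in recent)
    return approvals >= min_approvals and rejections == 0


now = datetime(2026, 6, 1)
history = [Outcome("web-1", "restart-nginx", "approval", now - timedelta(days=d))
           for d in range(10)]
print(should_suggest_promotion(history, "web-1", "restart-nginx", now))
```

The function only produces a suggestion flag. Acting on it stays with the operator.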

4 · Transparency as compliance primitive

Audit-readiness is not a feature added at the end of the build. It is a constraint applied at every layer.

Every state-changing action writes an audit row tagged with the tenant, the actor, the IP address, and the precise change. The audit log is append-only. Every incident's reasoning is reconstructable from the timeline, not from a vendor's opaque summary. Memory — the corpus of past resolutions the agent learns from — will be served as plain markdown files per tenant, version-controllable and exportable on demand. Operators own their data.
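
As an illustration of the audit row described above, here is an in-memory sketch with immutable rows and a log that exposes only append and read. The class and field names are assumptions; a real deployment would enforce append-only behaviour in the storage layer rather than in application code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRow:
    at: datetime
    tenant: str
    actor: str
    ip: str
    change: str


class AuditLog:
    """Append-only by construction: no update or delete method is offered."""

    def __init__(self):
        self._rows = []

    def append(self, tenant, actor, ip, change):
        row = AuditRow(datetime.now(timezone.utc), tenant, actor, ip, change)
        self._rows.append(row)
        return row

    def rows(self, tenant):
        # Return an immutable snapshot scoped to one tenant.
        return tuple(r for r in self._rows if r.tenant == tenant)


log = AuditLog()
row = log.append("acme", "alice", "203.0.113.7", "server web-1: shadow -> live")
print(log.rows("acme"))
```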

We pick auditability over query power when they conflict. A compliance auditor wants to read the trail; a dashboard wants to filter it. The first wins.

5 · Domain specialization

We are not building a generalist agent. We are building a runbook executor for Linux fleets.

Generalist coding agents like Claude Code answer the question “what would a senior engineer do here?” across a broad surface. OpenRemedy answers a narrower question: “given a known incident shape on a known kind of server, which of the five known fixes applies, and is it safe to run right now?”

The narrower question lets us be more specific in every layer. Our recipe catalogue is curated, not synthesized. Our agent prompts are domain-specific. Our trust ladder is per-recipe per-server, not a global autonomy slider. Our safety classifier (planned) will train on the tenant's actual incident corpus, not on internet-scale generic data.

04 Operator control surface

Three mechanisms layered on top of each other.

The product surface that the operator interacts with — the visible shape of the platform — is governed by three mechanisms layered on top of each other.

Server modes

The coarsest control. Every server is in audit, shadow, or live. The mode determines which stages of the agent pipeline run.

AUDIT: Triage → Auto-resolve (ends here; Diagnose, Propose, Execute, and Review skipped)
SHADOW: Triage → Diagnose → Propose → Awaiting approval → Execute (if approved) → Review
LIVE: Triage → Diagnose → Propose → Trust × risk gate → Execute (auto or approved) → Review

Figure 4 — Mode-aware pipeline gating. Audit short-circuits after triage; shadow runs the full pipeline with mandatory human approval; live consults the trust × risk gate.
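
The mode-aware gating in Figure 4 can be sketched as a lookup plus one decision function. The stage names follow the figure. The numeric risk scale and the comparison "trust level at least equal to risk level" are assumptions, since the exact trust × risk rule is not spelled out here.

```python
PIPELINE = {
    "audit": ["triage", "auto_resolve"],
    "shadow": ["triage", "diagnose", "propose", "await_approval", "execute", "review"],
    "live": ["triage", "diagnose", "propose", "trust_risk_gate", "execute", "review"],
}

RISK_LEVELS = {"low": 1, "medium": 2, "high": 3}  # hypothetical scale


def execution_decision(mode, agent_trust, recipe_risk, approved=False):
    """Decide what happens to a proposed recipe under the server's mode."""
    if mode not in PIPELINE:
        raise ValueError(f"unknown server mode: {mode}")
    if mode == "audit":
        return "skip"  # audit ends after triage; nothing executes
    # Assumed gate rule: auto-run only in live mode when trust covers the risk.
    if mode == "live" and agent_trust >= RISK_LEVELS[recipe_risk]:
        return "auto_execute"
    return "execute" if approved else "await_approval"


print(execution_decision("shadow", agent_trust=3, recipe_risk="low"))
print(execution_decision("live", agent_trust=2, recipe_risk="low"))
print(execution_decision("live", agent_trust=1, recipe_risk="high"))
```

Shadow never auto-executes in this sketch, whatever the trust level, which mirrors the rule that every shadow proposal pauses for human approval.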

The trust ladder

Operators never have to choose between “the agent has full autonomy” and “the agent has none.” The trust ladder records every approval, rejection, auto-execute, and post-run failure for each (server, recipe) pair. When a pair accumulates ten approvals with zero rejections in 30 days, the platform suggests promotion.

What “Accept” does branches on the server's current mode. On a shadow server, accepting flips the server to live — the historical mechanic. On a server already in live, accepting grants a per-tenant recipe role override for the tuple (recipe, server role): the next time the pipeline proposes that recipe on a server with that role, the trust × risk gate is short-circuited and the validate stage is skipped. The operator can revoke the override at any time from the Agents page; revocation re-imposes the gate.

The platform never self-promotes. The suggestion thresholds have hard floors of 5 approvals over 7 days; below those, no suggestion is generated. The override path is operator-explicit by the same rule: it is never granted without an Accept click, and a single rejection in the window disqualifies the pair. Together, these rules prevent an accidental “auto-promote on first approval” footgun while still letting trusted recipes graduate out of the approval queue without flipping a whole server's posture.

Per-execution preview

Independent of mode, any pending execution can be marked as preview. A preview run executes the recipe in dry-run posture against the host — reporting what would change without applying — and lands in a dedicated preview-completed state distinct from the success state.

Preview runs do not advance the incident to review (there is nothing to review), do not appear in metrics that count applied changes, and hide the rollback button (there is nothing to roll back). The operator chooses preview at approval time, on a per-execution basis.

The three mechanisms compose. A server in shadow can run a preview of a proposal before deciding whether to approve the live run. A server in live can run a preview of a risky recipe before letting the trust × risk gate auto-execute the next one. A server in audit will not surface execution at all.

Planned change — maintenance plans

Incident response is reactive. The same harness also runs planned change. A maintenance plan is a markdown document with a YAML front-matter (risk, strategy, snapshot dimensions) and a sequence of typed steps — validate, custom_tool, recipe, wait, notify, human_gate. Each step declares whether it requires approval; the same gate model that governs incident remediation governs every non-trivial step here too.

Plans run under a strategy: rolling (one server at a time, easy to abort mid-fleet), parallel_batched (fixed batch size with soak), or rings (canary → pilot → broad with per-ring soak windows). On approval, the plan's markdown is frozen onto the schedule so future edits to the plan template never affect a run already greenlit; operator edits to the schedule itself are preserved through approval. An AI editor lets the operator draft and revise the markdown conversationally before approving, and an audit log entry records every chat-driven change so the planned-change story stays as inspectable as the incident-response one.
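
A sketch of how the three strategies might split a fleet into execution waves. The strategy names come from the text; `plan_batches`, the default batch size, and the ring sizes are illustrative assumptions, and soak windows are left out.

```python
def plan_batches(servers, strategy, batch_size=2, ring_sizes=(1, 2)):
    """Split a fleet into ordered waves according to the plan's strategy."""
    servers = list(servers)
    if strategy == "rolling":
        # One server at a time, so a run can be aborted mid-fleet.
        return [[s] for s in servers]
    if strategy == "parallel_batched":
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        return [servers[i:i + batch_size] for i in range(0, len(servers), batch_size)]
    if strategy == "rings":
        # Canary, then pilot, then everything else (broad).
        canary_n, pilot_n = ring_sizes
        rings = [
            servers[:canary_n],
            servers[canary_n:canary_n + pilot_n],
            servers[canary_n + pilot_n:],
        ]
        return [ring for ring in rings if ring]
    raise ValueError(f"unknown strategy: {strategy}")


fleet = ["web-1", "web-2", "web-3", "web-4", "web-5"]
for name in ("rolling", "parallel_batched", "rings"):
    print(name, plan_batches(fleet, name))
```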

05 OpenRemedy Guardian

A standing guard against destructive actions.

The trust ladder and the approval gate decide who may run what. OpenRemedy Guardian decides something narrower and blunter: whether the specific action in front of the platform is the kind that destroys data or takes a system down — a wiped disk, a dropped database, a flushed firewall, a forced reboot of the wrong host — no matter how routine the surrounding request looked.

Guardian is a dedicated model that runs ahead of the approval gate as an independent pre-triage signal. It reads the operation about to be taken — the concrete command, not just the alert that triggered it — recognises destructive intent even when it is obfuscated, and assigns a severity. That severity is folded into the action's risk before the gate ever sees it: the platform takes the higher of the catalogue-declared risk and Guardian's reading. A routine-looking recipe that turns out to be destructive is escalated to human approval automatically.

Guardian never relaxes a decision — it can only raise risk, never lower it — and it never executes anything itself. It is a separate failure domain from the model that reasons about the incident and from the gate that authorises execution, so a destructive action has to slip past all three independently. When Guardian is unreachable, each tenant chooses the posture: proceed, with the existing gates still standing, or force human approval until it returns.
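
The "raise only, never lower" rule and the per-tenant outage posture can be sketched in a few lines. The severity labels, the posture names `proceed` and `force_approval`, and the function names are assumptions made for illustration.

```python
RISK_ORDER = ["low", "medium", "high", "critical"]  # hypothetical severity scale
OUTAGE_POSTURES = {"proceed", "force_approval"}     # hypothetical setting names


def effective_risk(catalogue_risk, guardian_reading):
    """Take the higher of the two readings, so Guardian can only raise risk."""
    return max(catalogue_risk, guardian_reading, key=RISK_ORDER.index)


def risk_for_gate(catalogue_risk, guardian_reading, on_outage="proceed"):
    """Return (risk, force_approval); guardian_reading is None when unreachable."""
    if on_outage not in OUTAGE_POSTURES:
        raise ValueError(f"unknown outage posture: {on_outage}")
    if guardian_reading is None:
        # Guardian is down: the tenant's posture decides; existing gates still apply.
        return catalogue_risk, on_outage == "force_approval"
    return effective_risk(catalogue_risk, guardian_reading), False


print(risk_for_gate("low", "high"))
print(risk_for_gate("high", "low"))
print(risk_for_gate("low", None, on_outage="force_approval"))
```

Because the combination is a maximum, a low Guardian reading leaves the catalogue-declared risk untouched.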

Every reading Guardian makes is recorded on the incident timeline, so the operator can see at a glance that the platform looked at an action and judged it safe — or that it raised the alarm and routed the action to a human.

A separate guard whose only verdict is “this is destructive” — and whose only power is to slow things down.

06 What is shipped

Honest numbers from a single dogfooded server.

OpenRemedy is in private testing. The platform runs at app.openremedy.io and the founder is dogfooding it on his own production infrastructure — a single server hosting barrahome.org — before opening it to other operators. There are no external customers yet, by design.

What that single dogfooded server has produced, as of the time of writing:

96
Incidents handled
37
Progressed to execute
~36s
Median resolution
641
Audit log entries

The 59 incidents that did not progress to execution were classified as transient or were artifacts of platform bugs found by dogfooding, which is exactly why we dogfood: to find them ourselves before a real customer does. Faster resolutions (0–5 seconds) were predominantly automated reclassifications; slower ones (up to 4 minutes) were investigations that called multiple diagnostic tools before deciding. The median ticked up as the platform graduated from naive single-agent runs to the multi-agent pipeline (Atlas triages, a separate diagnose / execute / review agent handles the rest), which deepens reasoning per stage at the cost of a few seconds of latency.

The numbers are small. The shape of the deployment is what matters at this stage: a single operator with twenty years of Linux production experience, running the platform against his own infrastructure, finding bugs by using it, and shipping a self-improvement every time he finds one. The infrastructure is mature enough that the operator does not have to babysit it; the safety-classifier work and per-tenant specialized model that come next will let it generalize beyond a single operator's intuitions.

07 What comes next

Depth and breadth.

The current platform handles the closing-the-loop wedge for Linux fleets — alerts in, classification, diagnosis, gated remediation, and audit. The roadmap extends in two directions.

Depth. A separate ML safety classifier (an independent LLM call with a hardened prompt template) will sit between the trust × risk gate and the worker dispatch. Its only job is to answer yes or no to the question “is this specific action safe on this specific host right now?” It can never relax the existing gates; it can only veto. The natural home for the per-tenant trained model is this classifier slot.

Per-tenant file-based memory — exportable, version-controllable markdown files instead of opaque database rows — will replace the current resolution corpus. Sidechain transcripts will capture each pipeline stage's full reasoning as JSONL, separately from the summary view in the dashboard.
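
Since sidechain transcripts are planned rather than shipped, the following is only a sketch of what one JSONL record per reasoning event might look like. The record fields (`incident`, `stage`, `seq`, `event`) and both function names are assumptions.

```python
import io
import json


def write_transcript(stream, incident_id, stage, events):
    """Write one JSON object per line, one line per reasoning event."""
    for seq, event in enumerate(events):
        record = {"incident": incident_id, "stage": stage, "seq": seq, "event": event}
        stream.write(json.dumps(record, sort_keys=True) + "\n")


def read_transcript(stream):
    """Parse a JSONL stream back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


buffer = io.StringIO()  # stands in for a per-incident file on disk
write_transcript(buffer, "inc-42", "diagnose",
                 ["checked disk usage", "found a full log partition"])
buffer.seek(0)
records = read_transcript(buffer)
print(records)
```

One object per line keeps each stage's transcript appendable and readable with ordinary text tools, separately from the dashboard summary.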

Breadth. The wedge is closing incidents on existing infrastructure. The trajectory is the full lifecycle of an operator's infrastructure under a single agentic surface, what we think of as vibe deploy, alongside the vibe coding the industry already has. Describe a project; the platform picks the cloud, provisions, watches, fixes what breaks, and grows a per-tenant resolution corpus the operator owns. The wedge unlocks the rest: once an operator trusts the platform to close a tier-one incident at three in the morning, they trust it to ship the next environment, and the same agentic loop handles both.

The idea is large and the road is long. We know what we are signing up for.

08 Closing note

For builders in regulated domains.

Three things we have learned by building.

The harness is auditable; the model is not. When a compliance auditor asks why a fix was applied, the answer cannot be “the model decided.” The answer has to be a chain of platform decisions — what mode the server was in, what trust level the agent had, what risk level the recipe carried, who approved, what the platform recorded — ending with the model's reasoning as one input among several. The harness is what produces a defendable audit trail.

Independent failure modes are the only real defense in depth. Five gates that all depend on the same LLM provider are one gate. Five gates that depend on a server mode (operator-set), recipe risk (catalogue-curated), trust level (per-agent), tool filter (hardcoded in code), and a separate classifier call (independent LLM) fail under different conditions and give the operator a real safety margin. Audit for independence early; correct it before the system grows.

Trust gradients beat trust switches. Operators do not want a checkbox that says “let the AI run things.” They want a path from observation to autonomy that they can step through, demonstrate defensibility on, and walk back from at any point. Building the platform around that path is the difference between something an operator will run on production infrastructure and something they will give a five-minute demo on and never touch again.

These are not unique to OpenRemedy. They are the consequences of taking autonomous operations seriously enough to put it in front of a buyer who has compliance requirements. The technology is the model; the product is the harness.

Citations

[1] Liu, J., Zhao, X., Shang, X., and Shen, Z. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems. arXiv:2604.14228, 2026. The 1.6% / 98.4% finding (AI decision logic versus deterministic infrastructure) is the source of the central thesis applied throughout this document.