OpenRemedy — Whitepaper

01 Executive summary

Modern incident response runs on the same five investigations against the same five runbooks, repeated tens of thousands of times per day across every datacenter on the planet.

A senior engineer is paged at 03:00, spends 20 to 40 minutes proving which of the known fixes applies, and applies it. The work is structurally repetitive; the human is the bottleneck; and the failure mode is operator carelessness under fatigue — the wrong service restarted at the wrong time, the wrong hostname typed into a destructive command.

OpenRemedy is an autonomous SRE platform that closes those incidents under operator control. It runs the proving step that humans do today, proposes the fix it knows is safe, and executes only what the operator has approved — directly or in advance via the trust ladder.

This document is a design-space whitepaper. It is not a sales sheet. It explains what we believe about autonomous operations, why we made the specific architectural choices we made, and how those choices land in the codebase.

The moat in autonomous operations is the operational harness, not the model.

This is not a novel claim. A recent academic analysis of Anthropic's Claude Code platform — a comparable autonomous agent in the adjacent domain of software engineering — found that only 1.6% of the codebase was AI decision logic; the remaining 98.4% was deterministic infrastructure: permission gates, context management, tool routing, recovery logic.^[1] Frontier models are converging in capability. The durable competitive surface is the system that decides what to show the model, what the model can touch, and how to recover when the model makes a mistake.

OpenRemedy is built on that thesis, applied to a different domain and a different buyer.

02 The thesis

Harness, not model.

The agent loop is simple. Assemble the incident context, call the model, dispatch the tools the model wants to call, check whether the operator authorized those tools to run on this host, run the ones that clear, collect the results, repeat until the model says it is done. Maybe two hundred lines of code.
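The loop described above can be sketched in a few dozen lines. This is a minimal illustration, not OpenRemedy's code: `run_agent_loop`, `ToolCall`, `ModelReply`, the stub model, and the tool names are all hypothetical, and the model is replaced by a plain function so the example runs without a provider.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class ModelReply:
    done: bool
    tool_calls: list = field(default_factory=list)


def run_agent_loop(context, call_model, tools, is_authorized, host, max_turns=10):
    """Call the model, gate each requested tool, run the cleared ones, repeat."""
    transcript = list(context)
    for _ in range(max_turns):
        reply = call_model(transcript)
        if reply.done:
            break
        for call in reply.tool_calls:
            # The platform, not the model, decides whether a tool may run on this host.
            if is_authorized(call.name, host):
                output = tools[call.name](**call.args)
                transcript.append({"tool": call.name, "status": "ok", "output": output})
            else:
                transcript.append({"tool": call.name, "status": "denied"})
    return transcript


def stub_model(transcript):
    """Hypothetical stand-in for the LLM: ask for two tools, then finish."""
    if any("tool" in entry for entry in transcript):
        return ModelReply(done=True)
    return ModelReply(done=False, tool_calls=[
        ToolCall("disk_usage", {"path": "/"}),
        ToolCall("restart_service", {"name": "nginx"}),
    ])


TOOLS = {
    "disk_usage": lambda path: f"{path} is 91% full",
    "restart_service": lambda name: f"restarted {name}",
}
AUTHORIZED = {("disk_usage", "web-01")}  # operator-granted (tool, host) pairs

transcript = run_agent_loop(
    [{"incident": "disk almost full"}], stub_model, TOOLS,
    lambda tool, host: (tool, host) in AUTHORIZED, "web-01",
)
```

In this run the diagnostic tool executes and the restart is recorded as denied, because only the first (tool, host) pair was authorized. Everything interesting sits in `is_authorized`, which is the point of the section.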

What surrounds the loop is the platform.

Figure 1 — Signals enter from four sources and converge into the agent pipeline. Every stage is deterministic platform code; only Triage and Diagnose call the model. The Execute stage is gated by approval or trust × risk.

Every box outside Triage and Diagnose is deterministic infrastructure. The agent does not decide whether a recipe is safe to run on this host; the platform does. The agent does not write to the audit log; the platform does. The agent does not promote a server from shadow to live; the platform does, and only after an operator has clicked accept on a suggestion the platform earned through recorded outcomes.

This is not a constraint we accept reluctantly. It is the design.

Frontier models will continue to improve, and the buyer of an autonomous-operations platform — particularly in regulated industries — is not buying a model. They are buying the operator-control surface, the audit trail, the safety gates, the ability to defend a fix in front of an auditor. None of that comes from the model. All of it comes from the harness.

When we choose what to build next, we ask: does this make the platform smarter, or does it make the harness more trustworthy? We prefer the second.

03 Five values

What we optimise for.

Every architectural choice in OpenRemedy traces back to one of five values. The values are not aspirational; they are decision criteria. When two design alternatives both work technically, we pick the one that scores higher across the five.

1 · Operator decision authority

Humans own every action that matters. The agent proposes; the operator decides. This is a hard rule, not a default that can be relaxed. The platform refuses to ship a configuration where an LLM can self-authorize a remediation outside the bounds the operator has explicitly set.

The trust ladder (audit → shadow → live) is the mechanism by which operators grant autonomy in stages. A server in audit mode never sees a remediation proposal. A server in shadow mode sees every proposal pause for human approval, regardless of the trust × risk gate that governs live mode. Promotion between modes is operator-driven, with an audit trail that records who decided to grant autonomy, when, and based on what evidence.

2 · Defense in depth via independent failure modes

A safety boundary that depends on a single mechanism is not defense in depth; it is defense in name. The platform places five independent gates in front of every remediation, each with a different failure mode. The fifth, the safety classifier, is planned rather than shipped.

Gate	Depends on	Fails when
Server mode	Operator-set field on the server row	Operator changes mode
Trust × risk	Recipe risk level + agent trust level	Recipe metadata corrupted
Approval gate	Operator action via UI	Operator unavailable
Tool filter	Hardcoded list of remediation tool names	Tool renamed without filter update
Safety classifier (planned)	Separate LLM call with hardened prompt	LLM provider outage

Figure 2 — Five gates, five distinct failure modes. None share a single point of failure.

If any two of those gates depended on the same external service — say, all of them required an LLM call to evaluate — they would not be five gates. They would be one gate wearing five hats. We explicitly audit for shared failure modes and move the layers apart when they coincide. This is the principle Liu et al. flag as the single most common failure pattern in agent platforms: defense in depth that degrades to a single point of failure under load.
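One way to picture independent gates is as separate predicates that each read a different input and are evaluated without sharing state. The sketch below is illustrative only: the `Request` fields, the numeric trust and risk scales, the simplified gate logic, and the choice to treat a gate that raises as a denial are all assumptions, not the platform's actual rules.

```python
from typing import NamedTuple


class Request(NamedTuple):
    server_mode: str         # "audit", "shadow", or "live"
    recipe_risk: int         # hypothetical scale: 1 (low) to 3 (high)
    agent_trust: int         # hypothetical scale: 1 (low) to 3 (high)
    operator_approved: bool


def server_mode_gate(req):
    # Reads only the operator-set mode.
    return req.server_mode != "audit"


def authorization_gate(req):
    # Simplified: explicit approval always clears; otherwise only live mode
    # with enough trust for the recipe's risk may proceed.
    if req.operator_approved:
        return True
    return req.server_mode == "live" and req.agent_trust >= req.recipe_risk


def classifier_unavailable(req):
    # Simulates the planned classifier during an LLM provider outage.
    raise ConnectionError("LLM provider outage")


def evaluate(req, gates):
    """Run every gate; any denial or error blocks execution."""
    blocked = []
    for name, gate in gates:
        try:
            allowed = bool(gate(req))
        except Exception:
            allowed = False  # assumption: a gate that cannot answer denies
        if not allowed:
            blocked.append(name)
    return (not blocked, blocked)


GATES = [("server_mode", server_mode_gate), ("authorization", authorization_gate)]
```

Because `evaluate` reports which gates blocked, an outage in one gate shows up as exactly one name in the list while the others still return their own verdicts. If two gates always appeared in that list together, that would be a hint they share a failure mode.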

3 · Reversibility-graduated autonomy

Granting an agent the authority to act on production infrastructure is not a binary decision. It is a gradient, and the platform's job is to make that gradient legible.

Every server in OpenRemedy lives in one of three modes.

Figure 3 — The trust ladder. Promotion is operator-gated; demotion is always available.

In audit, the agent classifies the incident and resolves it immediately. No remediation is proposed; nothing executes; the operator gets visibility without commitment.

In shadow, the agent runs the full diagnostic and proposal pipeline, and every proposal waits for human approval.

In live, the trust × risk gate decides whether a recipe runs autonomously.

A server enters in audit and only earns its way to live through recorded operator approvals — not through configuration, not through a one-time decision the operator might forget about. The promotion ladder generates suggestions when a (server, recipe) pair has accumulated at least ten approvals with zero rejections in a 30-day window. The operator accepts or dismisses; the platform never self-promotes.

4 · Transparency as compliance primitive

Audit-readiness is not a feature added at the end of the build. It is a constraint applied at every layer.

Every state-changing action writes an audit row tagged with the tenant, the actor, the IP address, and the precise change. The audit log is append-only. Every incident's reasoning is reconstructable from the timeline, not from a vendor's opaque summary. Memory — the corpus of past resolutions the agent learns from — will be served as plain markdown files per tenant, version-controllable and exportable on demand. Operators own their data.
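As a rough illustration of the audit row described above, the sketch below models a row as an immutable record and a log that exposes only append and read. The class names, field names, and in-memory storage are hypothetical; a real deployment would enforce append-only behaviour in the database, which this example does not attempt to show.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRow:
    tenant: str
    actor: str
    ip_address: str
    change: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class AuditLog:
    """In-memory stand-in: rows can be appended and read, never edited or removed."""

    def __init__(self):
        self._rows = []

    def append(self, tenant, actor, ip_address, change):
        row = AuditRow(tenant, actor, ip_address, change)
        self._rows.append(row)
        return row

    def rows(self):
        # Return an immutable copy so callers cannot alter the stored history.
        return tuple(self._rows)


log = AuditLog()
entry = log.append("tenant-a", "alice", "203.0.113.7", "server web-01 mode: shadow -> live")
```

Assigning to a field of `entry` raises an error because the dataclass is frozen, and `rows()` hands back a tuple, so the only way to change the log is to add to it.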

We pick auditability over query power when they conflict. A compliance auditor wants to read the trail; a dashboard wants to filter it. The first wins.

5 · Domain specialization

We are not building a generalist agent. We are building a runbook executor for Linux fleets.

Generalist coding agents like Claude Code answer the question “what would a senior engineer do here?” across a broad surface. OpenRemedy answers a narrower question: “given a known incident shape on a known kind of server, which of the five known fixes applies, and is it safe to run right now?”

The narrower question lets us be more specific in every layer. Our recipe catalogue is curated, not synthesized. Our agent prompts are domain-specific. Our trust ladder is per-recipe per-server, not a global autonomy slider. Our safety classifier (planned) will train on the tenant's actual incident corpus, not on internet-scale generic data.

04 Operator control surface

Three mechanisms layered on top of each other.

The product surface that the operator interacts with — the visible shape of the platform — is governed by three mechanisms layered on top of each other.

Server modes

The coarsest control. Every server is in audit, shadow, or live. The mode determines which stages of the agent pipeline run.

Figure 4 — Mode-aware pipeline gating. Audit short-circuits after triage; shadow runs the full pipeline with mandatory human approval; live consults the trust × risk gate.
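The gating in Figure 4 can be expressed as a small lookup from mode to pipeline plan. This is a sketch under assumptions: the stage names, the `Plan` type, and the rule that live mode falls back to human approval when the trust × risk gate does not clear are illustrative rather than taken from the codebase.

```python
from enum import Enum
from typing import NamedTuple


class Mode(Enum):
    AUDIT = "audit"
    SHADOW = "shadow"
    LIVE = "live"


class Plan(NamedTuple):
    stages: tuple          # pipeline stages that run, in order
    needs_approval: bool   # whether a human must approve before execution


def plan_pipeline(mode, trust_risk_allows_auto):
    if mode is Mode.AUDIT:
        # Short-circuit after triage: nothing is proposed, nothing executes.
        return Plan(("triage",), needs_approval=False)
    if mode is Mode.SHADOW:
        # Full pipeline, but every proposal pauses for a human.
        return Plan(("triage", "diagnose", "propose", "execute"), needs_approval=True)
    # Live: the trust x risk gate decides; assumed fallback is human approval.
    return Plan(("triage", "diagnose", "propose", "execute"),
                needs_approval=not trust_risk_allows_auto)
```

Note that `trust_risk_allows_auto` is ignored outside live mode, which mirrors the statement that shadow proposals wait for approval regardless of the trust × risk gate.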

The trust ladder

Operators never have to choose between “the agent has full autonomy” and “the agent has none.” The trust ladder records every approval, rejection, auto-execute, and post-run failure for each (server, recipe) pair. When a pair accumulates ten approvals with zero rejections in 30 days, the platform suggests promotion. The operator accepts (the server flips from shadow to live) or dismisses (the suggestion clears, the pair becomes eligible again on the next scan).

The platform never self-promotes. The promotion thresholds have floors of 5 approvals over 7 days: they cannot be configured lower, and below those floors no suggestion is generated. This prevents an accidental “auto-promote on first approval” footgun.
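The promotion rule can be sketched as a pure function over the recorded outcomes for one (server, recipe) pair. The function and field names are hypothetical, and the sketch reads the floors as a lower clamp on configurable thresholds, which is one interpretation of the text rather than a confirmed behaviour.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

APPROVALS_FLOOR = 5     # thresholds are never treated as lower than this
WINDOW_DAYS_FLOOR = 7


@dataclass(frozen=True)
class Outcome:
    kind: str       # "approved", "rejected", "auto_executed", or "failed"
    at: datetime


def should_suggest_promotion(outcomes, now, min_approvals=10, window_days=30):
    """True when the pair has enough approvals and no rejections in the window."""
    min_approvals = max(min_approvals, APPROVALS_FLOOR)
    window_days = max(window_days, WINDOW_DAYS_FLOOR)
    cutoff = now - timedelta(days=window_days)
    recent = [o for o in outcomes if o.at >= cutoff]
    approvals = sum(o.kind == "approved" for o in recent)
    rejections = sum(o.kind == "rejected" for o in recent)
    return approvals >= min_approvals and rejections == 0
```

The function only answers whether a suggestion should be generated. Accepting or dismissing it stays with the operator, so nothing here changes a server's mode.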

Per-execution preview

Independent of mode, any pending execution can be marked as preview. A preview run executes the recipe in dry-run posture against the host — reporting what would change without applying — and lands in a dedicated preview-completed state distinct from the success state.

Preview runs do not advance the incident to review (there is nothing to review), do not appear in metrics that count applied changes, and hide the rollback button (there is nothing to roll back). The operator chooses preview at approval time, on a per-execution basis.
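The preview behaviour above amounts to a distinct terminal state with its own downstream rules. The sketch below uses hypothetical names (`ExecState`, `Execution`, `finish`) to show how a single state check can drive the review, metrics, and rollback decisions.

```python
from dataclasses import dataclass
from enum import Enum


class ExecState(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    PREVIEW_COMPLETED = "preview_completed"


@dataclass
class Execution:
    recipe: str
    preview: bool = False              # chosen by the operator at approval time
    state: ExecState = ExecState.PENDING


def finish(execution):
    """Move a completed run to its terminal state; previews never become SUCCESS."""
    execution.state = (
        ExecState.PREVIEW_COMPLETED if execution.preview else ExecState.SUCCESS
    )
    return execution


def advances_to_review(execution):
    return execution.state is ExecState.SUCCESS


def counts_as_applied_change(execution):
    return execution.state is ExecState.SUCCESS


def can_roll_back(execution):
    return execution.state is ExecState.SUCCESS


dry_run = finish(Execution("restart_nginx", preview=True))
applied = finish(Execution("restart_nginx"))
```

Keeping preview as its own state, rather than a flag on a successful run, means a metrics query or UI check that only looks for the success state cannot accidentally count a dry run as an applied change.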

The three mechanisms compose. A server in shadow can run a preview of a proposal before deciding whether to approve the live run. A server in live can run a preview of a risky recipe before letting the trust × risk gate auto-execute the next one. A server in audit will not surface execution at all.

05 Comparison

Where we sit relative to incumbents.

OpenRemedy sits at the intersection of three categories. Each category has incumbents. None of them, in our reading, close the loop the way OpenRemedy does.

Capability	OpenRemedy	Datadog Bits AI	New Relic SRE Agent	PagerDuty AI Agents
AI-driven incident investigation	✓	✓	✓	✓
Autonomous remediation on the host	✓ (Ansible)	— (suggests fixes)	✓ (claimed)	✓ (well-known issues)
Explicit operator approval gate	✓ (per recipe risk)	n/a	varies	partial
Trust gradient per server (observe → dry-run → live)	✓	—	—	—
Per-execution preview / dry-run	✓	—	—	—
Self-hostable / on-prem option	✓ (planned)	partial (DPM)	partial (legacy on-prem)	—
Bring-your-own-model (BYOM)	✓ (multi-provider routing)	vendor-managed	vendor-managed	vendor-managed
Operator-extensible harness (custom tools, monitors, skills, plugins)	✓	—	—	—
Scope: Linux fleets (bare metal / VMs / Docker hosts)	✓	✓ (observability)	✓ (observability)	n/a (orchestration)

Figure 5 — Capability matrix based on public product documentation reviewed April 2026. The defensible differentiators for OpenRemedy are the trust gradient per server, the per-execution preview, and the regulated-buyer self-host path.

The agentic-AI category moved fast in 2025–26. Every observability and incident-management incumbent now ships some form of AI remediation: Datadog Bits AI investigates autonomously and suggests fixes; New Relic's SRE Agent claims investigation and remediation; PagerDuty's AI Agents autonomously resolve well-known issues. Kubernetes-specific platforms make the same moves inside their domain.

The differentiation for OpenRemedy is not whether we have agentic AI — that's table stakes now — but what scope we serve and how we expose control.

Scope. The observability incumbents stretch across every stack but do not run remediations on the host with the depth regulated operators want. The incident-management incumbents orchestrate humans and now autoremediate well-known cases, but they do not own the host. We are built specifically for the Linux fleets — bare metal, virtual machines, Docker hosts, traditional service stacks — that the Kubernetes-native platforms do not address and the observability incumbents do not act on.
Trust gradient as a documented operator control. Variable autonomy is implicit in most of these platforms. We make it explicit and per-server: a three-mode dial (audit / shadow / live) the operator sets, with an automated promotion ladder driven by recorded approvals.
Per-execution preview as a first-class control. An operator can rehearse any approved recipe before a live run, on any incident, regardless of mode. Not a configuration buried in defaults — a visible control on the approval screen.
Self-hostable for regulated buyers. Datadog and New Relic offer limited on-prem options; PagerDuty is SaaS-only. OpenRemedy is built so the entire control plane — Postgres, Redis, secret store, model gateway — runs inside the customer's perimeter for buyers who cannot put their fleet under a SaaS contract.
Bring your own model. The incumbents run their AI agents on a vendor-managed model stack; operators get whatever model the vendor chose, not the model their compliance review approved. OpenRemedy routes across providers per workload (cheap models for classification, frontier models for delicate reasoning), and an Azure-OpenAI-only or on-prem deployment is a configuration, not a custom build. The same hook is the natural home for the per-tenant specialised model we are training: a fine-tuned local model trained on the tenant's own resolved-incident corpus, slotting into the same routing layer as a fifth provider.
Operator-extensible harness. The thesis of this paper is that the harness is the moat. The operator-facing version of that thesis is that the harness must be extensible — what the daemon checks, what the agent can call, what knowledge the agent has access to, and how new event sources plug in. OpenRemedy exposes that extension surface directly: custom monitors per server (operator-defined checks the daemon runs), custom tools the agent can invoke (HTTP requests with resolved-IP SSRF blocking, shell commands rendered with argument quoting), skills as markdown knowledge the operator attaches to an agent, plugins for new alert sources and notification channels, an MCP integration for external tool surfaces, and a marketplace of pre-packaged bundles. Adapting the platform to a customer's specific fleet is a configuration the operator does in the dashboard, not a professional-services engagement.
Audit trail as compliance primitive, not a bolt-on. Every state-changing action, every approval, every mode flip writes to an append-only log designed to satisfy a compliance auditor reading the trail, not a dashboard filtering it.
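The two custom-tool safeguards mentioned in the extensibility item above, resolved-IP SSRF blocking and argument quoting, can be illustrated with the Python standard library. This is a sketch of the general technique, not OpenRemedy's implementation: `url_is_safe`, `render_command`, and the injectable `resolve` parameter are hypothetical names, and treating any non-global address as blocked is an assumed policy.

```python
import ipaddress
import shlex
import socket
from urllib.parse import urlsplit


def resolve_host(host, port):
    """Return the set of IP addresses the host currently resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def is_public(address):
    # Strip any IPv6 scope id, then reject private, loopback, link-local, etc.
    return ipaddress.ip_address(address.split("%")[0]).is_global


def url_is_safe(url, resolve=resolve_host):
    """Allow only http(s) URLs whose every resolved address is publicly routable."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        addresses = resolve(parts.hostname, port)
    except (OSError, ValueError):
        return False  # unresolvable host or malformed port: deny
    return bool(addresses) and all(is_public(a) for a in addresses)


def render_command(argv):
    """Render an argument list as one shell string with each argument quoted."""
    return shlex.join(argv)
```

Checking the resolved addresses rather than the hostname matters because a harmless-looking name can point at an internal address. A caller would also need to connect to the address that was checked instead of resolving the name a second time, otherwise the answer could change between the check and the request. For shell tools, `render_command(["echo", "a; rm -rf /"])` yields a single quoted argument, so the semicolon is never interpreted by the shell.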

The wedge for general-purpose Linux fleets in regulated industries is open.

06 What is shipped

Honest numbers from a single dogfooded server.

OpenRemedy is in private testing. The platform runs at app.openremedy.io and the founder is dogfooding it on his own production infrastructure — a single server hosting barrahome.org — before opening it to other operators. There are no external customers yet, by design.

What that single dogfooded server has produced, as of the time of writing:

Incidents handled

Progressed to execute

Median resolution: ~22s

Audit log entries: 293

The 28 incidents that did not progress to execution were classified as transient or were artifacts of platform bugs we found by dogfooding — the same reason we dogfood, to find them ourselves before a real customer does. Faster resolutions (0–5 seconds) were predominantly automated reclassifications; slower ones (up to 4 minutes) were investigations that called multiple diagnostic tools before deciding.

The numbers are small. The shape of the deployment is what matters at this stage: a single operator with twenty years of Linux production experience, running the platform against his own infrastructure, finding bugs by using it, and shipping a self-improvement every time he finds one. The infrastructure is mature enough that the operator does not have to babysit it; the safety-classifier work and per-tenant specialized model that come next will let it generalize beyond a single operator's intuitions.

07 What comes next

Depth and breadth.

The current platform handles the closing-the-loop wedge for Linux fleets — alerts in, classification, diagnosis, gated remediation, and audit. The roadmap extends in two directions.

Depth. A separate ML safety classifier (an independent LLM call with a hardened prompt template) will sit between the trust × risk gate and the worker dispatch. Its only job is to answer yes or no to the question “is this specific action safe on this specific host right now?” It can never relax the existing gates; it can only veto. The natural home for the per-tenant trained model is this classifier slot.

Per-tenant file-based memory — exportable, version-controllable markdown files instead of opaque database rows — will replace the current resolution corpus. Sidechain transcripts will capture each pipeline stage's full reasoning as JSONL, separately from the summary view in the dashboard.
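A JSONL sidechain transcript is simply one JSON object per line, appended as each stage finishes. The sketch below shows that shape with the standard library. The record fields (`incident_id`, `stage`, `reasoning`) and function names are hypothetical, since the planned format is not specified here.

```python
import io
import json


def append_stage(stream, incident_id, stage, reasoning):
    """Append one pipeline stage's reasoning as a single JSON line."""
    record = {"incident_id": incident_id, "stage": stage, "reasoning": reasoning}
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_transcript(stream):
    """Parse a JSONL stream back into a list of stage records."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory buffer stands in for a per-tenant transcript file.
buffer = io.StringIO()
append_stage(buffer, "inc-42", "triage", "Disk alert on /var; classified as log growth.")
append_stage(buffer, "inc-42", "diagnose", "journald usage is the largest consumer.")
buffer.seek(0)
records = read_transcript(buffer)
```

Because each line is self-contained, the file can be appended to without rewriting earlier entries and can be diffed or exported with ordinary text tools, which fits the version-controllable, operator-owned memory the section describes.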

Breadth. The wedge is closing incidents on existing infrastructure. The trajectory is the full lifecycle of an operator's infrastructure under a single agentic surface, what we think of as vibe deploy, alongside the vibe coding the industry already has. Describe a project; the platform picks the cloud, provisions, watches, fixes what breaks, and grows a per-tenant resolution corpus the operator owns. The wedge unlocks the rest: once an operator trusts the platform to close a tier-one incident at three in the morning, they trust it to ship the next environment, and the same agentic loop handles both.

The idea is large and the road is long. We know what we are signing up for.

08 Closing note

For builders in regulated domains.

Three things we have learned by building.

The harness is auditable; the model is not. When a compliance auditor asks why a fix was applied, the answer cannot be “the model decided.” The answer has to be a chain of platform decisions — what mode the server was in, what trust level the agent had, what risk level the recipe carried, who approved, what the platform recorded — ending with the model's reasoning as one input among several. The harness is what produces a defendable audit trail.

Independent failure modes are the only real defense in depth. Five gates that all depend on the same LLM provider are one gate. Five gates that depend on a server mode (operator-set), recipe risk (catalogue-curated), trust level (per-agent), tool filter (hardcoded in code), and a separate classifier call (independent LLM) fail under different conditions and give the operator a real safety margin. Audit for independence early; correct it before the system grows.

Trust gradients beat trust switches. Operators do not want a checkbox that says “let the AI run things.” They want a path from observation to autonomy that they can step through, demonstrate defensibility on, and walk back from at any point. Building the platform around that path is the difference between something an operator will run on production infrastructure and something they will give a five-minute demo on and never touch again.

These are not unique to OpenRemedy. They are the consequences of taking autonomous operations seriously enough to put it in front of a buyer who has compliance requirements. The technology is the model; the product is the harness.

Citations

[1] Liu, J., Zhao, X., Shang, X., and Shen, Z. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems. arXiv:2604.14228, 2026. The 1.6% / 98.4% finding (AI decision logic versus deterministic infrastructure) is the source of the central thesis applied throughout this document.

The moat is the harness, not the model.