How it works
Detect early, decide together, fix safely.
OpenRemedy is built around a single loop: watch continuously, recognise the deviation, classify and route, propose a remedy, approve where it matters, execute, verify, and write the post-mortem. The following pages walk through that loop one scene at a time.
Continuous, on the server and from the platform.
OpenRemedy runs a small daemon on each managed server. It reports facts every fifteen seconds and runs the health checks the platform sent it — each one HMAC-signed so a tampered configuration can't smuggle arbitrary commands to the host.
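As a rough sketch of that signature check, assuming a per-host shared key provisioned at enrolment (the key handling, payload layout, and function name below are illustrative, not the daemon's actual wire format):

```python
import hashlib
import hmac
import json

# Illustrative only: SHARED_KEY and verify_check_bundle are hypothetical names,
# and the payload layout is an assumption, not OpenRemedy's real format.
SHARED_KEY = b"per-host-secret-provisioned-at-enrolment"

def verify_check_bundle(payload: bytes, signature_hex: str) -> dict:
    """Refuse to run any health-check bundle whose HMAC doesn't match."""
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        raise ValueError("signature mismatch: rejecting tampered check configuration")
    return json.loads(payload)
```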
Servers without an agent stay covered too. Scheduled probes run from the platform on a configurable cadence, and AI patrols periodically inspect the fleet for conditions that no one wrote a rule for.
An incident opens the moment something is off.
Whether the signal comes from the daemon, a scheduled probe, an agent on patrol, or an external alert source like Alertmanager or Datadog, it lands in the same place: a new incident with severity, type, server context, and the relevant evidence already attached.
The detection floor is roughly fifteen seconds for threshold conditions on the daemon, sub-second for push alerts. Most issues land here long before any user is affected.
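Whatever the source, the record that opens could look roughly like this (the Incident shape and its field names are assumptions for illustration, not OpenRemedy's actual schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative schema: field names are assumptions, not the platform's real model.
@dataclass
class Incident:
    source: str            # "daemon", "probe", "patrol", or an external alert source
    severity: str          # e.g. "low", "high", "critical"
    incident_type: str     # e.g. "cpu_high", "service_failed"
    server: str            # host the evidence was collected from
    evidence: list[str] = field(default_factory=list)
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

incident = Incident(
    source="daemon",
    severity="high",
    incident_type="cpu_high",
    server="api-prod-03",
    evidence=["load sustained above threshold for 5 min",
              "top consumer: python ingest_worker.py"],
)
```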
Sustained CPU on api-prod-03
opened 41 seconds ago · daemon · resolved automatically
Triage · Atlas
Pattern matches a recurring spike from the daily ingest job. Routed to the diagnose stage.
Diagnose · Forge
Captured a fresh top -bn1 snapshot. Top consumer: python ingest_worker.py at 89% (expected behaviour during ingest).
Resolved · Default SRE
Transient — load returning to baseline. No action taken. Marked for monitoring.
Triage routes to the right specialist.
A triage agent reads the incident, looks up past resolutions for similar conditions, and decides who handles it. Recurring patterns get matched to known fixes; novel issues get more investigation.
The agent records its reasoning as it goes, so the decision trail is visible later — no opaque “the AI did something.”
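The routing decision itself can be pictured as a lookup-then-classify step; the sketch below is hypothetical (the route function, its matching rule, and the returned stage names are assumptions, not the triage agent's real logic):

```python
# Hypothetical routing step: the function, the matching rule, and the returned
# stage names are illustrative, not the triage agent's actual implementation.
def route(incident, past_resolutions):
    """Match the incident against prior resolutions and pick the next stage."""
    matches = [r for r in past_resolutions
               if r["server"] == incident.server
               and r["incident_type"] == incident.incident_type]
    if matches and all(r["recipe"] == matches[0]["recipe"] for r in matches):
        # Recurring pattern with one known fix: hand to diagnose with that context.
        return "diagnose", {"pattern": "recurring-spike",
                            "candidate_recipe": matches[0]["recipe"]}
    # Novel condition: diagnose from scratch, with no prior recipe attached.
    return "diagnose", {"pattern": "novel", "candidate_recipe": None}
```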
Triage timeline · Atlas
Searched past resolutions
Found 3 matching incidents on api-prod-03 in the last 30 days. All resolved by the same recipe.
Classified
cpu_high · severity high · pattern: scheduled-job spike
Handed off
Routed to diagnose stage with the recurring-spike context attached.
Evidence-first, with sandboxed tools only.
The diagnose agent collects evidence using a curated set of read-only tools — service status, log tails, process snapshots, container introspection. It reasons through the root cause and proposes a remedy.
The LLM never gets a free-form shell. Every diagnostic call is a typed verb against the platform's sandboxed catalogue. Operators can extend that catalogue, but each new tool runs through the same parameter-quoting and host-allowlisting machinery as the built-ins.
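A minimal sketch of what such a catalogue entry and its guard rails could look like, assuming a template-per-verb design (the tool names, command templates, and build_command helper are illustrative, not the platform's real API):

```python
import shlex

# Sketch under assumptions: the verbs, templates, and allowlist are illustrative.
READ_ONLY_TOOLS = {
    "service_status": "systemctl status {unit} --no-pager",
    "log_tail":       "journalctl -u {unit} -n {lines} --no-pager",
    "process_top":    "top -bn1",
}
ALLOWED_HOSTS = {"api-prod-03", "edge-04"}

def build_command(tool: str, host: str, **params) -> str:
    """Resolve a typed verb into a concrete command; free-form shell is never accepted."""
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"host {host!r} is not on the allowlist")
    template = READ_ONLY_TOOLS[tool]                      # unknown verbs fail here
    quoted = {k: shlex.quote(str(v)) for k, v in params.items()}
    return template.format(**quoted)                      # parameters arrive pre-quoted

# e.g. build_command("log_tail", "edge-04", unit="nginx", lines=12)
```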
Low-risk fixes run. Risky ones wait for you.
Every recipe carries an explicit risk level set by the operator who wrote it. Autonomous agents can run low-risk fixes on their own; medium risk and above always pause for human approval. The LLM does not get to grade its own homework on whether something is safe.
Approval requests show full context: what the agent proposes, why, what could go wrong, and what past runs of the same recipe looked like. One click approves and the platform executes; one click declines and it walks away.
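The gate itself reduces to a few lines; this version is a sketch that reuses the risk labels above (the execute_or_queue function and the approval queue are hypothetical, not the platform's actual code):

```python
# Hypothetical gate: the risk labels mirror the docs, but execute_or_queue and
# approval_queue are assumptions about how such a check could be written.
AUTONOMOUS_RISKS = {"low"}

def execute_or_queue(recipe: dict, approval_queue: list, run) -> str:
    """Run low-risk recipes immediately; medium and above wait for a human."""
    if recipe["risk"] in AUTONOMOUS_RISKS:
        run(recipe)                       # autonomous path, still fully logged
        return "executed"
    approval_queue.append(recipe)         # pauses here until someone clicks approve
    return "pending approval"
```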
Approval requested
Restart nginx on edge-04?
Proposed recipe
systemd-restart-service
risk: medium · trust: supervised · expected duration: ~8 s
Why this fix
nginx.service has been in the failed state for 4 min. The last 12 lines of the journal show a recurring SIGSEGV after a config reload. A clean restart is the standard remedy and matches past resolutions on this server.
Ansible runs the fix. The platform watches the run.
The approved recipe is dispatched to a worker that runs the underlying Ansible playbook over SSH. Output streams to the dashboard in real time so you can watch the run, intervene if needed, and see exactly what changed on the host.
A review agent verifies the resolution after the playbook completes — service back up, metrics back to normal, no regression elsewhere — and only then closes the incident.
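On the worker side, the dispatch can be pictured as a streaming subprocess call; the sketch below is illustrative (the run_recipe function, its arguments, and the stream_line callback are assumptions, not OpenRemedy's actual worker code):

```python
import subprocess

# Illustrative worker step: run_recipe, its arguments, and stream_line are
# placeholder names, not the platform's real dispatch code.
def run_recipe(playbook: str, host: str, stream_line) -> int:
    """Run an Ansible playbook against one host and stream every output line."""
    proc = subprocess.Popen(
        ["ansible-playbook", playbook, "--limit", host],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for line in proc.stdout:
        stream_line(line.rstrip())        # forwarded to the dashboard as it happens
    return proc.wait()                    # non-zero exit marks the run as failed
```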
Every incident writes its own post-mortem.
The platform stitches the timeline, evidence, agent reasoning, approval chain, and execution outcome into a single report. If a human stepped in, their reasoning is captured alongside the agent's.
Recurring patterns become candidates for new pre-approved playbooks. The team's runbook grows on its own, curated by the work the platform has actually been doing.
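A rough sketch of the stitching step (the build_postmortem function and its field names are hypothetical, not the platform's real report model); the card below shows the kind of report that comes out:

```python
# Hypothetical assembly step: build_postmortem and its fields are illustrative.
def build_postmortem(incident, timeline, approvals, execution, human_notes=None):
    """Stitch one incident's record into a single report after verification."""
    return {
        "title": f"{incident.incident_type} on {incident.server}",
        "timeline": timeline,                  # detection through resolution, in order
        "agent_reasoning": [e for e in timeline if e.get("kind") == "reasoning"],
        "approval_chain": approvals,           # who approved what, and when
        "outcome": execution,                  # playbook result plus verification probes
        "human_notes": human_notes or [],      # operator reasoning, if anyone stepped in
    }
```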
nginx restart on edge-04
Tue, 14:08 UTC · 12 minutes total · approved by alberto@…
What happened
nginx.service crashed with SIGSEGV after a config reload at 14:01. The daemon detected the failed state within 12 seconds and opened an incident.
Root cause
A reload pulled in a partially written /etc/nginx/sites-enabled/api.conf. The deploy pipeline had no atomic-write step.
What we did
Approved the systemd-restart-service recipe at 14:09. Service back to active (running) in 6 s. Verified with three follow-up health probes.
Follow-up
Open ticket against the deploy pipeline to add atomic file-replace. Add a proactive policy to alert on partial config files.
That's the whole shape.
Every detection source, every agent, every recipe follows the same loop. The dashboard, the audit trail, and the docs are organised around it.