OpenRemedy

How it works

Detect early, decide together, fix safely.

OpenRemedy is built around a single loop: watch continuously, recognise the deviation, classify and route, propose a remedy, approve where it matters, execute, verify, and write the post-mortem. The following pages walk through that loop one scene at a time.

01 · Watch

Continuous, on the server and from the platform.

OpenRemedy runs a small daemon on each managed server. It reports facts every fifteen seconds and runs the health checks the platform sent it — each one HMAC-signed so a tampered configuration can't smuggle arbitrary commands to the host.
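A minimal sketch of that signature check, assuming a per-server shared key; the field names are illustrative, not OpenRemedy's actual wire format:

import hashlib
import hmac
import json

def verify_check(check: dict, key: bytes) -> dict:
    # Return the check definition only if its HMAC matches. Assumes the
    # platform signs the canonical JSON body (minus the signature field)
    # with a key the agent already holds.
    body = {k: v for k, v in check.items() if k != "signature"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, check.get("signature", "")):
        raise ValueError("signature mismatch: refusing to run this check")
    return body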

Servers without an agent stay covered too. Scheduled probes run from the platform on a configurable cadence, and AI patrols periodically inspect the fleet for conditions that no one wrote a rule for.

journalctl -u openremedy-agent -f
12:04:18 INFO [reporter] cycle #1842 — 7 monitors, discovery=false
12:04:18 INFO [reporter] evidence sent OK
12:04:33 INFO [reporter] cycle #1843 — 7 monitors, discovery=false
12:04:33 WARN [collector/cpu] threshold exceeded: load1m=4.21 (max=3.0)
12:04:33 INFO [reporter] alert cpu_high queued for next evidence push
12:04:33 INFO [reporter] evidence sent OK
12:04:48 INFO [tasks] platform pushed config: 7 monitors, 2 with HMAC signatures
02 · Detect

An incident opens the moment something is off.

Whether the signal comes from the daemon, a scheduled probe, an agent on patrol, or an external alert source like Alertmanager or Datadog, it lands in the same place: a new incident with severity, type, server context, and the relevant evidence already attached.

The detection floor is roughly fifteen seconds for threshold conditions on the daemon and sub-second for push alerts. Most issues land here long before any user is affected.
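However the signal arrives, it is normalised into one incident shape. A rough sketch of that record, with field names assumed for illustration:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    # One incident, identical in shape whether it came from the daemon,
    # a scheduled probe, a patrol, or an external alert source.
    severity: str                     # e.g. "high"
    type: str                         # e.g. "cpu_high"
    server: str                       # e.g. "api-prod-03"
    source: str                       # "daemon" | "probe" | "patrol" | "external"
    evidence: list = field(default_factory=list)
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

incident = Incident(severity="high", type="cpu_high",
                    server="api-prod-03", source="daemon",
                    evidence=[{"kind": "metric", "load1m": 4.21, "max": 3.0}])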

high · cpu_high

Sustained CPU on api-prod-03

opened 41 seconds ago · daemon · resolved automatically

  1. Triage · Atlas

    Pattern matches a recurring spike from the daily ingest job. Routed to the diagnose stage.

  2. Diagnose · Forge

    Captured a fresh top -bn1 snapshot. Top consumer: python ingest_worker.py at 89% (expected behaviour during ingest).

  3. Resolved · Default SRE

    Transient — load returning to baseline. No action taken. Marked for monitoring.

3 tool calls · 2.4 s wall
03 · Classify

Triage routes to the right specialist.

A triage agent reads the incident, looks up past resolutions for similar conditions, and decides who handles it. Recurring patterns get matched to known fixes; novel issues get more investigation.

The agent records its reasoning as it goes, so the decision trail is visible later — no opaque “the AI did something.”
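A rough sketch of that lookup-then-route step, assuming a simple history match; the stage names and fields are illustrative:

def route(incident: dict, history: list) -> dict:
    # Match past incidents of the same type on the same server. A recurring
    # pattern travels to diagnose with its known fix attached; anything
    # novel goes there without a shortcut, for a full investigation.
    matches = [p for p in history
               if p["type"] == incident["type"] and p["server"] == incident["server"]]
    fixes = {p["resolution"] for p in matches}
    if matches and len(fixes) == 1:
        reason = f"{len(matches)} prior incidents, all resolved by {fixes.pop()!r}"
        return {"stage": "diagnose", "known_fix": True, "reason": reason}
    return {"stage": "diagnose", "known_fix": False,
            "reason": "no recurring pattern, full investigation"}

The returned reason string is what ends up in the visible decision trail.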

Triage timeline · Atlas

  1. Searched past resolutions

    Found 3 matching incidents on api-prod-03 in the last 30 days. All resolved by the same recipe.

  2. Classified

    cpu_high · severity high · pattern: scheduled-job spike

  3. Handed off

    Routed to diagnose stage with the recurring-spike context attached.

04 · Diagnose

Evidence-first, with sandboxed tools only.

The diagnose agent collects evidence using a curated set of read-only tools — service status, log tails, process snapshots, container introspection. It reasons through the root cause and proposes a remedy.

The LLM never gets a free-form shell. Every diagnostic call is a typed verb against the platform's sandboxed catalogue. Operators can extend that catalogue, but each new tool runs through the same parameter-quoting and host-allowlisting machinery as the built-ins.
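A minimal sketch of that machinery, with illustrative verbs and hosts; the real catalogue is richer:

import shlex

ALLOWED_HOSTS = {"api-prod-03", "edge-04"}        # operator-maintained allowlist

# Every verb expands to a fixed argv. The model chooses a verb and an
# argument; it never composes a command line of its own.
CATALOGUE = {
    "service_status": lambda arg: ["systemctl", "status", arg],
    "log_tail":       lambda arg: ["journalctl", "-n", arg, "--no-pager"],
    "top_snapshot":   lambda arg: ["top", "-b", "-n", "1"],   # arg unused here
}

def build_command(host: str, verb: str, arg: str) -> str:
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"host {host!r} is not allowlisted")
    if verb not in CATALOGUE:
        raise ValueError(f"unknown diagnostic verb {verb!r}")
    # Quote every token so a crafted argument cannot smuggle a second command.
    return " ".join(shlex.quote(t) for t in CATALOGUE[verb](arg))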

agent · diagnose · Forge
tool_call: run_diagnostic_command({
  verb: "top_snapshot",
  arg: "30"
})
30-line top output captured (1.4 KB)
tool_call: propose_recipe({
  slug: "throttle-ingest-worker",
  risk: "medium"
})
05 · Decide

Low-risk fixes run. Risky ones wait for you.

Every recipe carries an explicit risk level set by the operator who wrote it. Autonomous agents can run low-risk fixes on their own; medium and above always pause for human approval. The LLM does not get to grade its own homework on whether something is safe.

Approval requests show full context: what the agent proposes, why, what could go wrong, and what past runs of the same recipe looked like. One click and the platform executes; one click and it walks away.
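In sketch form the gate is deliberately simple: a fixed comparison against the operator-set risk level, with names assumed for illustration:

RISK_ORDER = ["low", "medium", "high"]

def decide(recipe_risk: str, autonomy_ceiling: str = "low") -> str:
    # The ceiling is operator policy. Nothing the model says about its own
    # proposal can raise it; anything above the ceiling waits for a human.
    if RISK_ORDER.index(recipe_risk) <= RISK_ORDER.index(autonomy_ceiling):
        return "execute"
    return "await_approval"

assert decide("low") == "execute"
assert decide("medium") == "await_approval"   # pauses, as in the card below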

Approval requested

Restart nginx on edge-04?

Proposed recipe

systemd-restart-service

risk: medium · trust: supervised · expected duration: ~8 s

Why this fix

nginx.service has been in the failed state for 4 min. The last 12 lines of the journal show recurring SIGSEGV after a config reload. A clean restart is the standard remedy and matches past resolutions on this server.

Anyone with the admin role can approve
06 · Remediate

Ansible runs the fix. The platform watches the run.

The approved recipe is dispatched to a worker that runs the underlying Ansible playbook over SSH. Output streams to the dashboard in real time so you can watch the run, intervene if needed, and see exactly what changed on the host.

A review agent verifies the resolution after the playbook completes — service back up, metrics back to normal, no regression elsewhere — and only then closes the incident.
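A rough sketch of the dispatch step, assuming the worker shells out to ansible-playbook and forwards each line as it arrives; the playbook path and publish hook are illustrative:

import subprocess

def run_recipe(playbook: str, host: str, publish) -> int:
    # Stream the playbook run line by line so the dashboard shows it live.
    proc = subprocess.Popen(
        ["ansible-playbook", playbook, "--limit", host],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        publish(line.rstrip())        # hypothetical dashboard hook
    return proc.wait()                # non-zero exit fails the remediation step

# run_recipe("recipes/systemd-restart-service.yml", "edge-04", print)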

execution · live output · running
PLAY [Restart and verify nginx]
TASK [capture pre-state]
ok: [edge-04]
TASK [restart nginx]
changed: [edge-04]
TASK [wait for healthy]
ok: [edge-04]
── recap ── ok=3 changed=1 failed=0
07 · Report

Every incident writes its own post-mortem.

The platform stitches the timeline, evidence, agent reasoning, approval chain, and execution outcome into a single report. If a human stepped in, their reasoning is captured alongside the agent's.

Recurring patterns become candidates for new pre-approved playbooks. The team's runbook grows on its own, curated by the work the platform has actually been doing.
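A sketch of the stitching step, with section names mirroring the report below and field names assumed for illustration:

def render_report(incident: dict) -> str:
    # Each stage of the loop leaves an artefact behind; the report is
    # just those artefacts assembled in order.
    sections = [("What happened", incident["detection_summary"]),
                ("Root cause",    incident["diagnosis"]),
                ("What we did",   incident["execution_summary"]),
                ("Follow-up",     incident["follow_up"])]
    lines = [f"Post-incident report · {incident['title']}"]
    for heading, body in sections:
        lines += ["", heading, body]
    return "\n".join(lines)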

Post-incident report · auto-generated

nginx restart on edge-04

Tue, 14:08 UTC · 12 minutes total · approved by alberto@…

What happened
nginx.service crashed with SIGSEGV after a config reload at 14:01. The daemon detected the failed state within 12 seconds and opened an incident.
Root cause
A reload pulled in a partially written /etc/nginx/sites-enabled/api.conf. The deploy pipeline had no atomic-write step.
What we did
Approved the systemd-restart-service recipe at 14:09. Service back to active (running) in 6 s. Verified with three follow-up health probes.
Follow-up
Open ticket against the deploy pipeline to add atomic file-replace. Add a proactive policy to alert on partial config files.

That's the whole shape.

Every detection source, every agent, every recipe follows the same loop. The dashboard, the audit trail, and the docs are organised around it.