The Autonomous AI SRE for Linux fleets.
OpenRemedy is your trusted autonomous SRE for Linux fleets. It watches, diagnoses, and runs the fix under three escalating trust modes per server — observe, dry-run, live. Built by operators with 20+ years on production. For the boxes Kubernetes doesn't run on.
Already curious? Read the docs or see how it works.
Sustained CPU on api-prod-03
opened by daemon · 00:00 ago
Agent pipeline starting…
Atlas is being assigned to the incident.
Replay of a real CPU spike. Detected, classified, and closed automatically in under three seconds.
The premise
Detection is the easy part. Closing the loop is the work.
Most platforms tell you something broke and route the page to a human. OpenRemedy ingests signals from every source you already have, then runs the loop the human used to run — diagnose, decide, act, document — under your control, with the audit trail compliance teams expect.
On the server
A small Go agent reports system facts and runs platform-signed health checks every 15 seconds.
On the platform
Scheduled Ansible probes cover servers without an agent, and AI patrols look for unusual patterns.
From the outside
Webhook ingestion accepts alerts from Alertmanager, Grafana, Datadog, or any HTTP source — HMAC-signed.
From the operator
Manual incident creation for ad-hoc inquiries; the agent runs the requested check fresh and reports back.
Trust gradients, not all-or-nothing autonomy
Fast where it should be fast. Careful where it shouldn't.
Every server lives in one of three escalating modes — observe (the agent watches but never touches), dry-run (it simulates the fix and shows you exactly what would have happened), and live (it runs, with low-risk fixes autonomous and medium-and-up gated by human approval). You promote a host through modes when you've seen the agent earn it. The LLM never decides on its own that something is safe enough to run.
Approval requested
Restart nginx on edge-04?
Proposed recipe
systemd-restart-service
risk: medium · trust: supervised · expected duration: ~8 s
Why this fix
nginx.service has been in failed for 4 min. Last 12 lines of the journal show recurring SIGSEGV after a config reload. A clean restart is the standard remedy and reproduces past resolutions on this server.
nginx restart on edge-04
Tue, 14:08 UTC · 12 minutes total · approved by alberto@…
- What happened
- nginx.service crashed with SIGSEGV after a config reload at 14:01. The daemon detected the failed state within 12 seconds and opened an incident.
- Root cause
- A reload pulled in a partially-written /etc/nginx/sites-enabled/api.conf. The deploy pipeline had no atomic-write step.
- What we did
- Approved the systemd-restart-service recipe at 14:09. Service back to active (running) in 6 s. Verified with three follow-up health probes.
- Follow-up
- Open ticket against the deploy pipeline to add atomic file-replace. Add a proactive policy to alert on partial config files.
Audit-ready by construction
Every incident writes its own post-mortem.
The platform documents what happened, what the agent reasoned, what was approved, who approved it, and how the fix was verified. You walk into the weekly review — or the next compliance audit — with the report already written.
Want a slot?
We're still teaching the platform to operate Linux fleets safely — currently dogfooding on our own production infrastructure. Drop your email and we'll reach out when the safety story is good enough to put on yours.
No spam, no marketing drip — one email when your slot opens.