OpenRemedy
In private testing — join the waitlist

Linux server monitoring that catches problems early.

OpenRemedy watches your fleet continuously and lets AI agents handle the routine fixes — under human approval where it matters. Less alert fatigue, fewer 3 a.m. pages.

Already curious? Read the docs or see how it works.

cpu_high · #7c3d

Sustained CPU on api-prod-03

opened by daemon · 00:00 ago

open
SLA · 00:00 / 04:00

Agent pipeline starting…

Atlas is being assigned to the incident.

live · ws

Replay of a real CPU spike. Detected, classified, and closed automatically in under three seconds.

The premise

We don't wait for things to break.

A reactive ops loop is one alert away from a 3 a.m. page. OpenRemedy watches every server continuously, through multiple independent mechanisms, so deviations are caught before they propagate.

journalctl -u openremedy-agent -f
12:04:18 INFO [reporter] cycle #1842 — 7 monitors, discovery=false
12:04:18 INFO [reporter] evidence sent OK
12:04:33 INFO [reporter] cycle #1843 — 7 monitors, discovery=false
12:04:33 WARN [collector/cpu] threshold exceeded: load1m=4.21 (max=3.0)
12:04:33 INFO [reporter] alert cpu_high queued for next evidence push
12:04:33 INFO [reporter] evidence sent OK
12:04:48 INFO [tasks] platform pushed config: 7 monitors, 2 with HMAC signatures
  • On the server

    A small Go agent reports system facts and runs platform-signed health checks every 15 seconds (sketched just after this list).

  • On the platform

    Scheduled Ansible probes cover servers without an agent, and AI patrols look for unusual patterns.

  • From the outside

    Webhook ingestion accepts alerts from Alertmanager, Grafana, Datadog, or any HTTP source; every request is verified with an HMAC signature.

  • From the operator

    Manual incident creation for ad-hoc inquiries; the agent runs the requested check fresh and reports back.
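For a sense of scale, here is what that first mechanism can look like in Go. This is a minimal sketch, not OpenRemedy's source: the Check shape, the field names, and the shared enrollment key are assumptions. Only the 15-second cadence and the rule that unsigned checks never run come from the description above.

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"log"
	"time"
)

// Check is a hypothetical platform-pushed health check: a shell
// snippet plus the HMAC the platform computed over it.
type Check struct {
	Name      string
	Script    string
	Signature string // hex-encoded HMAC-SHA256 of Script
}

// verify recomputes the HMAC with the shared key and compares in
// constant time; unsigned or tampered checks are never executed.
func verify(c Check, key []byte) bool {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(c.Script))
	expected, err := hex.DecodeString(c.Signature)
	if err != nil {
		return false
	}
	return hmac.Equal(mac.Sum(nil), expected)
}

func main() {
	key := []byte("shared-secret-from-enrollment") // placeholder
	checks := []Check{ /* pushed by the platform */ }

	// One report cycle every 15 seconds, mirroring the journal above.
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for _, c := range checks {
			if !verify(c, key) {
				log.Printf("WARN skipping unsigned check %q", c.Name)
				continue
			}
			// run c.Script, collect system facts, append to evidence…
		}
		// …then push the evidence batch to the platform.
	}
}

The constant-time comparison is the important line: signature checks that short-circuit on the first mismatched byte leak timing information.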

The full story on proactive monitoring

Human in the loop

Fast where it should be fast. Careful where it shouldn't.

Every recipe carries an explicit risk level. Low-risk fixes run autonomously; medium and above pause for human approval with the agent's reasoning attached. The LLM never decides on its own that something is safe enough to run. That decision is yours.
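The gate itself can be a pure function. A minimal sketch with hypothetical names: the branch is on the recipe's declared risk level, and the model's output never appears in the condition.

package main

import "fmt"

type Risk int

const (
	Low Risk = iota
	Medium
	High
)

// Recipe is a hypothetical remediation with its declared risk level.
type Recipe struct {
	Name string
	Risk Risk
}

// Decision is what the pipeline does with a proposed recipe.
type Decision int

const (
	RunAutonomously Decision = iota
	AwaitApproval
)

// gate encodes the one hard rule: the risk level on the recipe,
// not the LLM's judgment, decides whether a human sees it first.
func gate(r Recipe) Decision {
	if r.Risk == Low {
		return RunAutonomously
	}
	return AwaitApproval // medium and above always pause
}

func main() {
	for _, r := range []Recipe{
		{Name: "clear-tmp-files", Risk: Low},
		{Name: "systemd-restart-service", Risk: Medium},
	} {
		if gate(r) == AwaitApproval {
			fmt.Printf("%s: queued for approval, reasoning attached\n", r.Name)
		} else {
			fmt.Printf("%s: running autonomously\n", r.Name)
		}
	}
}

Keeping that branch out of the model's hands means the worst a misfiring LLM can do is propose a fix, never run one.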

Approval requested

Restart nginx on edge-04?

Proposed recipe

systemd-restart-service

risk: medium · trust: supervised · expected duration: ~8 s

Why this fix

nginx.service has been in the failed state for 4 min. The last 12 lines of the journal show a recurring SIGSEGV after a config reload. A clean restart is the standard remedy and matches past resolutions on this server.

Anyone with the admin role can approve
Post-incident report · auto-generated

nginx restart on edge-04

Tue, 14:08 UTC · 12 minutes total · approved by alberto@…

What happened
nginx.service crashed with SIGSEGV after a config reload at 14:01. The daemon detected the failed state within 12 seconds and opened an incident.
Root cause
A reload pulled in a partially written /etc/nginx/sites-enabled/api.conf. The deploy pipeline had no atomic-write step.
What we did
Approved the systemd-restart-service recipe at 14:09. Service back to active (running) in 6 s. Verified with three follow-up health probes.
Follow-up
Open ticket against the deploy pipeline to add atomic file-replace. Add a proactive policy to alert on partial config files.

After the dust settles

Every incident writes its own post-mortem.

The platform documents what happened, what the agent reasoned, what was approved, and how it was verified. You walk into the weekly review with the report already written.
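One way to picture the report is as a plain record whose fields mirror the sections of the example above. A hypothetical shape, not the platform's actual schema; the example values, including the email address, are placeholders.

package main

import (
	"fmt"
	"time"
)

// IncidentReport is a hypothetical shape for the auto-generated
// post-mortem; the fields mirror the sections of the example report.
type IncidentReport struct {
	Title        string
	OpenedAt     time.Time
	Duration     time.Duration
	ApprovedBy   string // empty when the fix ran autonomously
	WhatHappened string
	RootCause    string
	WhatWeDid    string
	FollowUp     []string // open items, e.g. deploy-pipeline tickets
}

func main() {
	r := IncidentReport{
		Title:      "nginx restart on edge-04",
		Duration:   12 * time.Minute,
		ApprovedBy: "alberto@example.com", // hypothetical address
		RootCause:  "partially written api.conf pulled in by a reload",
		FollowUp:   []string{"add atomic file-replace to the deploy pipeline"},
	}
	fmt.Printf("%s (%s, approved by %s)\n", r.Title, r.Duration, r.ApprovedBy)
}

Because approvals and agent reasoning are captured as they happen, the record is filled in during the incident, not reconstructed afterward.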

Want a slot?

We're onboarding teams in waves. Drop your email and we'll reach out as soon as space opens.

No spam, no marketing drip — one email when your slot opens.