OpenRemedy
In private testing — join the waitlist

Linux server monitoring that catches problems early.

OpenRemedy watches your fleet continuously and lets AI agents handle the routine fixes — under human approval where it matters. Less alert fatigue, fewer 3 a.m. pages.

Already curious? Read the docs or see how it works.

cpu_high · #7c3d

Sustained CPU on api-prod-03

opened by daemon · 00:00 ago

open
SLA · 00:00 / 04:00

Agent pipeline starting…

Atlas is being assigned to the incident.

live · ws

Replay of a real CPU spike. Detected, classified, and closed automatically in under three seconds.

The premise

We don't wait for things to break.

A reactive ops loop is one alert away from a 3 a.m. page. OpenRemedy watches every server continuously, through multiple independent mechanisms, so deviations are caught before they propagate.

journalctl -u openremedy-agent -f
12:04:18 INFO [reporter] cycle #1842 — 7 monitors, discovery=false
12:04:18 INFO [reporter] evidence sent OK
12:04:33 INFO [reporter] cycle #1843 — 7 monitors, discovery=false
12:04:33 WARN [collector/cpu] threshold exceeded: load1m=4.21 (max=3.0)
12:04:33 INFO [reporter] alert cpu_high queued for next evidence push
12:04:33 INFO [reporter] evidence sent OK
12:04:48 INFO [tasks] platform pushed config: 7 monitors, 2 with HMAC signatures
  • On the server

    A small Go agent reports system facts and runs platform-signed health checks every 15 seconds (sketched just after this list).

  • On the platform

    Scheduled Ansible probes cover servers without an agent, and AI patrols look for unusual patterns.

  • From the outside

    Webhook ingestion accepts alerts from Alertmanager, Grafana, Datadog, or any HTTP source; every request is verified with an HMAC signature.

  • From the operator

    Manual incident creation for ad-hoc inquiries; the agent runs the requested check fresh and reports back.
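For a sense of scale, here is what that first mechanism can look like in Go. This is a minimal sketch, not OpenRemedy's source: the Check shape, the field names, and the shared enrollment key are assumptions. Only the 15-second cadence and the rule that unsigned checks never run come from the description above.

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"log"
	"time"
)

// Check is a hypothetical platform-pushed health check: a shell
// snippet plus the HMAC the platform computed over it.
type Check struct {
	Name      string
	Script    string
	Signature string // hex-encoded HMAC-SHA256 of Script
}

// verify recomputes the HMAC with the shared key and compares in
// constant time; unsigned or tampered checks are never executed.
func verify(c Check, key []byte) bool {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(c.Script))
	expected, err := hex.DecodeString(c.Signature)
	if err != nil {
		return false
	}
	return hmac.Equal(mac.Sum(nil), expected)
}

func main() {
	key := []byte("shared-secret-from-enrollment") // placeholder
	checks := []Check{ /* pushed by the platform */ }

	// One report cycle every 15 seconds, mirroring the journal above.
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for _, c := range checks {
			if !verify(c, key) {
				log.Printf("WARN skipping unsigned check %q", c.Name)
				continue
			}
			// run c.Script, collect system facts, append to evidence…
		}
		// …then push the evidence batch to the platform.
	}
}

The constant-time comparison is the important line: signature checks that short-circuit on the first mismatched byte leak timing information.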

The full story on proactive monitoring

Human in the loop

Fast where it should be fast. Careful where it shouldn't.

Every recipe carries an explicit risk level. Low-risk fixes run autonomously; medium and above pause for human approval with the agent's reasoning attached. The LLM never decides on its own that something is safe enough to run. That decision is yours.
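The gate itself can be a pure function. A minimal sketch with hypothetical names: the branch is on the recipe's declared risk level, and the model's output never appears in the condition.

package main

import "fmt"

type Risk int

const (
	Low Risk = iota
	Medium
	High
)

// Recipe is a hypothetical remediation with its declared risk level.
type Recipe struct {
	Name string
	Risk Risk
}

// Decision is what the pipeline does with a proposed recipe.
type Decision int

const (
	RunAutonomously Decision = iota
	AwaitApproval
)

// gate encodes the one hard rule: the risk level on the recipe,
// not the LLM's judgment, decides whether a human sees it first.
func gate(r Recipe) Decision {
	if r.Risk == Low {
		return RunAutonomously
	}
	return AwaitApproval // medium and above always pause
}

func main() {
	for _, r := range []Recipe{
		{Name: "clear-tmp-files", Risk: Low},
		{Name: "systemd-restart-service", Risk: Medium},
	} {
		if gate(r) == AwaitApproval {
			fmt.Printf("%s: queued for approval, reasoning attached\n", r.Name)
		} else {
			fmt.Printf("%s: running autonomously\n", r.Name)
		}
	}
}

Keeping that branch out of the model's hands means the worst a misfiring LLM can do is propose a fix, never run one.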

Approval requested

Restart nginx on edge-04?

Proposed recipe

systemd-restart-service

risk: medium · trust: supervised · expected duration: ~8 s

Why this fix

nginx.service has been in the failed state for 4 min. The last 12 lines of the journal show a recurring SIGSEGV after a config reload. A clean restart is the standard remedy and matches past resolutions on this server.

Anyone with the admin role can approve
Post-incident report · auto-generated

nginx restart on edge-04

Tue, 14:08 UTC · 12 minutes total · approved by alberto@…

What happened
nginx.service crashed with SIGSEGV after a config reload at 14:01. The daemon detected the failed state within 12 seconds and opened an incident.
Root cause
A reload pulled in a partially written /etc/nginx/sites-enabled/api.conf. The deploy pipeline had no atomic-write step.
What we did
Approved the systemd-restart-service recipe at 14:09. Service back to active (running) in 6 s. Verified with three follow-up health probes.
Follow-up
Open ticket against the deploy pipeline to add atomic file-replace. Add a proactive policy to alert on partial config files.

After the dust settles

Every incident writes its own post-mortem.

The platform documents what happened, what the agent reasoned, what was approved, and how it was verified. You walk into the weekly review with the report already written.
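One way to picture the report is as a plain record whose fields mirror the sections of the example above. A hypothetical shape, not the platform's actual schema; the example values, including the email address, are placeholders.

package main

import (
	"fmt"
	"time"
)

// IncidentReport is a hypothetical shape for the auto-generated
// post-mortem; the fields mirror the sections of the example report.
type IncidentReport struct {
	Title        string
	OpenedAt     time.Time
	Duration     time.Duration
	ApprovedBy   string // empty when the fix ran autonomously
	WhatHappened string
	RootCause    string
	WhatWeDid    string
	FollowUp     []string // open items, e.g. deploy-pipeline tickets
}

func main() {
	r := IncidentReport{
		Title:      "nginx restart on edge-04",
		Duration:   12 * time.Minute,
		ApprovedBy: "alberto@example.com", // hypothetical address
		RootCause:  "partially written api.conf pulled in by a reload",
		FollowUp:   []string{"add atomic file-replace to the deploy pipeline"},
	}
	fmt.Printf("%s (%s, approved by %s)\n", r.Title, r.Duration, r.ApprovedBy)
}

Because approvals and agent reasoning are captured as they happen, the record is filled in during the incident, not reconstructed afterward.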

Want a slot?

We're onboarding teams in waves. Drop your email and we'll reach out as soon as space opens.

No spam, no marketing drip — one email when your slot opens.