Features
The platform, in depth.
Four pillars: how OpenRemedy detects, how it reasons, how it acts, and the operational fabric that makes it usable across a real ops team. Each section lists the specific capabilities that ship in the box.
Five independent ways to spot a problem early.
OpenRemedy doesn't depend on one signal. Five mechanisms run concurrently, with overlapping coverage so a single missed alarm doesn't mean a missed incident.
Continuous daemon monitors
A small Go agent on each managed server checks system health every fifteen seconds. CPU, memory, disk, ports, services, processes, log patterns, Docker containers — built-in monitor types you can compose into policies. Custom shell checks ride the same channel, signed by the platform so a tampered config can't smuggle arbitrary commands to the host.
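As a rough illustration of that signing step, here is a minimal Go sketch assuming an HMAC-SHA256 signature over the check definition; the `CustomCheck` shape, the payload layout, and the key handling are assumptions, not the shipped wire format.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// CustomCheck is a hypothetical shape for a custom shell check pushed
// down in a policy. The signature covers the command, so a tampered
// config can't execute anything the platform didn't sign.
type CustomCheck struct {
	Name      string
	Command   string
	Signature string // hex-encoded HMAC-SHA256 from the platform
}

// verifyCheck recomputes the HMAC with the daemon's enrollment key and
// compares in constant time before the command is allowed to run.
func verifyCheck(c CustomCheck, key []byte) error {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(c.Name + "\n" + c.Command))
	want, err := hex.DecodeString(c.Signature)
	if err != nil {
		return errors.New("malformed signature")
	}
	if !hmac.Equal(mac.Sum(nil), want) {
		return errors.New("signature mismatch: refusing to run check")
	}
	return nil
}

func main() {
	key := []byte("enrollment-key")
	check := CustomCheck{Name: "disk-inode", Command: "df -i /"}
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(check.Name + "\n" + check.Command))
	check.Signature = hex.EncodeToString(mac.Sum(nil))
	fmt.Println(verifyCheck(check, key)) // <nil>: safe to execute
}
```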
Scheduled platform-side probes
Servers without a daemon stay covered. Recipes tagged as recipe_check fire on a configurable schedule, run as Ansible playbooks from the platform, and feed results back through an LLM evaluator that decides pass / fail with full context.
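A compact sketch of that loop, with the Ansible run and the LLM verdict stubbed behind stand-ins; `runChecks`, `Evaluator`, and every other name here are illustrative, not the platform's API.

```go
package main

import (
	"fmt"
	"time"
)

// Evaluator stands in for the LLM call that turns raw playbook output
// into a pass/fail verdict with a reason attached.
type Evaluator func(recipe, output string) (pass bool, reason string)

// runChecks fires each recipe_check on a fixed interval. The real
// platform executes the recipe as an Ansible playbook; runPlaybook is
// a stand-in for that step.
func runChecks(recipes []string, every time.Duration,
	runPlaybook func(string) string, eval Evaluator) {
	tick := time.NewTicker(every)
	defer tick.Stop()
	for range tick.C {
		for _, r := range recipes {
			out := runPlaybook(r)
			if pass, reason := eval(r, out); !pass {
				// A failing check feeds the same incident pipeline
				// that daemon alerts use.
				fmt.Printf("%s failed: %s\n", r, reason)
			}
		}
	}
}

func main() {
	runChecks([]string{"tls-cert-expiry"}, 2*time.Second,
		func(r string) string { return "cert expires in 3 days" },
		func(r, out string) (bool, string) { return false, out })
}
```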
AI agents on patrol
Per-agent patrol intervals let the AI make its own rounds of the fleet, looking for anomalies you didn't write a rule for. The 3 a.m. quiet that shouldn't be quiet, the load that flatlined, the service that restarted three times in an hour — the agent opens an incident with its own reasoning attached.
Webhook ingestion
Existing monitoring stacks plug in via HMAC-signed webhook. Alertmanager, Grafana, Datadog, PagerDuty, custom HTTP clients — any push source converges on the same incident pipeline. Sub-second latency from alert to incident in the database.
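Verification on the receiving side could look roughly like this: a sketch assuming the hex-encoded signature travels in an X-Signature header over the raw body (the header name and encoding are assumptions, not the documented contract).

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"net/http"
)

// webhookHandler verifies an HMAC-SHA256 signature over the raw body
// before anything reaches the incident pipeline. hmac.Equal gives a
// constant-time compare, closing the timing side channel.
func webhookHandler(secret []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		mac := hmac.New(sha256.New, secret)
		mac.Write(body)
		want, err := hex.DecodeString(r.Header.Get("X-Signature"))
		if err != nil || !hmac.Equal(mac.Sum(nil), want) {
			http.Error(w, "invalid signature", http.StatusUnauthorized)
			return
		}
		// Signature checks out: hand the alert to the incident pipeline.
		w.WriteHeader(http.StatusAccepted)
	}
}

func main() {
	http.Handle("/webhook", webhookHandler([]byte("tenant-webhook-secret")))
	http.ListenAndServe(":8080", nil)
}
```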
Manual ad-hoc inquiries
Operators can open a custom incident with a free-form question. The agent runs the requested check fresh against the live server and reports back — useful for one-off forensics that don't justify a permanent monitor.
Specialised agents, each with its own role and trust level.
Every incident moves through a pipeline of distinct agent invocations. Each stage has its own prompt, tool budget, and tenant-scoped context. No single LLM call carries the whole burden.
Five-stage pipeline
Triage → diagnose → validate → execute → review. Custom-type incidents skip execution and resolve after diagnose. Each stage records its tool calls and reasoning as discrete events, so the audit trail tells you what the agent thought and why — not just what it did.
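The stage sequencing reduces to something like this hypothetical sketch, where a custom-type incident gets the short pipeline.

```go
package main

import "fmt"

type Stage string

const (
	Triage   Stage = "triage"
	Diagnose Stage = "diagnose"
	Validate Stage = "validate"
	Execute  Stage = "execute"
	Review   Stage = "review"
)

// stagesFor returns the pipeline for an incident type. Custom-type
// incidents stop after diagnose and resolve without execution.
func stagesFor(incidentType string) []Stage {
	if incidentType == "custom" {
		return []Stage{Triage, Diagnose}
	}
	return []Stage{Triage, Diagnose, Validate, Execute, Review}
}

func main() {
	fmt.Println(stagesFor("custom"))  // [triage diagnose]
	fmt.Println(stagesFor("monitor")) // the full five-stage pipeline
}
```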
Trust × risk gate
Each agent has a trust level (autonomous / supervised / manual). Each recipe has a risk level (low / medium / high). The platform's gate runs server-side; the LLM cannot self-approve. Autonomous trust permits auto-execute only at low risk. Everything else asks a human.
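The gate itself is small enough to state as code. A sketch with illustrative type names; the point is that the decision is a pure server-side function of trust and risk, never an LLM output.

```go
package main

import "fmt"

type Trust int
type Risk int

const (
	Manual Trust = iota
	Supervised
	Autonomous
)

const (
	Low Risk = iota
	Medium
	High
)

// canAutoExecute is the whole gate: only an autonomous agent running a
// low-risk recipe skips the human. Everything else waits for approval.
func canAutoExecute(t Trust, r Risk) bool {
	return t == Autonomous && r == Low
}

func main() {
	fmt.Println(canAutoExecute(Autonomous, Low))    // true
	fmt.Println(canAutoExecute(Autonomous, Medium)) // false: a human approves
	fmt.Println(canAutoExecute(Supervised, Low))    // false: a human approves
}
```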
Skills as knowledge modules
Markdown documents attached to an agent — "nginx operations", "PostgreSQL fundamentals", "Docker triage" — loaded into context at runtime. Skills sharpen domain reasoning without forcing one giant system prompt. Built-in catalogue plus per-tenant authoring.
Per-tenant prompt overrides
Default Jinja templates ship with the platform; any tenant can override the prompt for any pipeline stage from the dashboard. Tune for your fleet, your terminology, your escalation conventions — without touching code.
Past-resolution awareness
Triage searches historical incidents for matching patterns before deciding. Recurring conditions get matched to known fixes. The agent doesn't relearn yesterday's solution every morning.
Curated remediation, signed off where it matters, fully audited.
The action surface is deliberately narrow. The LLM never gets a free-form shell. Remediation is an explicit catalogue of vetted Ansible playbooks, executed under approval rules you control.
Recipe catalogue
Every remediation is a pre-approved Ansible playbook in a global catalogue. Each carries a risk level, a category, a rollback path if relevant. Read and execute are open to your team; create / update / delete are restricted to superadmin so tenant admins can't smuggle in malicious playbook paths.
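An illustrative shape for a catalogue entry (the real schema isn't published here): the playbook path is fixed at authoring time by a superadmin, which is what keeps execution pinned to vetted files.

```go
package main

import "fmt"

// Recipe is a hypothetical catalogue record, not the shipped schema.
type Recipe struct {
	Slug         string
	Category     string
	Risk         string // low | medium | high
	PlaybookPath string // set by superadmin only; tenants can't rewrite it
	RollbackPath string // empty when no rollback applies
}

func main() {
	r := Recipe{
		Slug:         "systemd-restart-service",
		Category:     "services",
		Risk:         "medium",
		PlaybookPath: "playbooks/systemd_restart.yml",
	}
	fmt.Printf("%+v\n", r)
}
```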
Approval workflow
Awaiting-approval incidents show full context — what the agent observed, what it proposes, why, what the rollback looks like. One click runs the playbook with live output streaming to the dashboard; one click rejects and walks away.
Live execution output
Recipe runs stream stdout via WebSocket directly to the incident detail page. Per-task status (ok / changed / failed), expandable stdout/stderr per task, return code on completion. The platform's own audit log gets the playbook output verbatim.
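A guess at the per-task frame that travels over that WebSocket; the field names are illustrative, not the actual protocol.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TaskEvent is a hypothetical per-task frame pushed to the dashboard
// while a playbook runs.
type TaskEvent struct {
	IncidentID string `json:"incident_id"`
	Task       string `json:"task"`
	Status     string `json:"status"` // ok | changed | failed
	Stdout     string `json:"stdout,omitempty"`
	Stderr     string `json:"stderr,omitempty"`
	ReturnCode *int   `json:"rc,omitempty"` // set on the final frame only
}

func main() {
	rc := 0
	frames := []TaskEvent{
		{IncidentID: "inc-042", Task: "Restart nginx", Status: "changed"},
		{IncidentID: "inc-042", Task: "Wait for port 443", Status: "ok", ReturnCode: &rc},
	}
	for _, f := range frames {
		b, _ := json.Marshal(f)
		fmt.Println(string(b)) // each frame fans out to subscribed dashboards
	}
}
```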
Verified-and-closed reviews
After a playbook completes, a review agent runs verification checks — service back up, metrics back to normal, no regression elsewhere — before closing the incident. If verification fails, the incident reopens with the agent's findings.
Auto-generated post-mortems
Every resolved incident writes its own RCA report stitched from the timeline, evidence, agent reasoning, approval chain, and execution outcome. You walk into the weekly review with the document already done.
Approval requested
Restart nginx on edge-04?
Proposed recipe
systemd-restart-service
risk: medium · trust: supervised · expected duration: ~8 s
Why this fix
nginx.service has been in a failed state for 4 min. The last 12 lines of the journal show a recurring SIGSEGV after a config reload. A clean restart is the standard remedy and matches past resolutions on this server.
Multi-tenant scaffolding for the teams that actually run things.
Built for SRE / managed-service teams from the start. The plumbing is the part you usually have to build yourself; here it ships in the box.
Multi-tenant isolation
Servers, recipes (read), policies, agents, audit logs, secrets, webhook secret, and users are all scoped per tenant. Real-time WebSocket fanout is filtered server-side by tenant_id. A superadmin role with explicit impersonation crosses tenant boundaries when ops need it; everyone else stays in their lane.
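The server-side filter is conceptually a one-liner; a sketch with stand-in types:

```go
package main

import "fmt"

// conn is a stand-in for a dashboard WebSocket connection.
type conn struct {
	tenantID string
	user     string
}

// fanout delivers an event only to connections in the event's tenant.
// The filter runs server-side; a client can't subscribe its way across
// a tenant boundary. Superadmin impersonation would be handled upstream
// by swapping the effective tenant, not by bypassing the filter.
func fanout(conns []conn, eventTenant, payload string) {
	for _, c := range conns {
		if c.tenantID != eventTenant {
			continue
		}
		fmt.Printf("-> %s (%s): %s\n", c.user, c.tenantID, payload)
	}
}

func main() {
	conns := []conn{{"acme", "alice"}, {"globex", "bob"}}
	fanout(conns, "acme", `{"type":"incident.created"}`) // only alice sees it
}
```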
Encrypted secret store
SSH keys, bearer tokens, basic-auth pairs, database credentials — stored AES-256-GCM-encrypted at rest. Referenced by name from server records and from custom HTTP tools. The encryption key never appears in the database itself.
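A minimal sketch of AES-256-GCM sealing in Go, assuming the 32-byte key arrives from the environment or a KMS rather than the database (which matches the claim above); the key-management details are assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// seal encrypts a secret with AES-256-GCM. The random nonce is
// prepended to the ciphertext so decryption needs only the key.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

func main() {
	key := make([]byte, 32) // in practice: loaded from env/KMS, never stored in the DB
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	out, err := seal(key, []byte("-----BEGIN OPENSSH PRIVATE KEY-----"))
	if err != nil {
		panic(err)
	}
	fmt.Println(hex.EncodeToString(out))
}
```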
Plugin and marketplace ecosystem
Extension points for new alert sources, hook events (Slack / email / PagerDuty notifications), storage backends, and LLM providers. Pre-packaged bundles in the marketplace combine an agent template with skills and tools — install in one click into your tenant.
SLA timers with breach escalation
Per-severity policies define time-to-acknowledge / first-response / resolution. Live timers on every incident card, breach indicator turns red on miss, escalation rules can flag the agent or page a human when the clock runs out.
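The timer math is simple; a sketch assuming per-severity deadlines measured from incident creation, with illustrative field names:

```go
package main

import (
	"fmt"
	"time"
)

// SLAPolicy holds per-severity deadlines, all measured from the moment
// the incident is opened.
type SLAPolicy struct {
	Acknowledge   time.Duration
	FirstResponse time.Duration
	Resolution    time.Duration
}

// ackBreached reports whether the acknowledge clock has run out, using
// the acknowledgement time if one exists and the wall clock otherwise.
func ackBreached(openedAt time.Time, ackedAt *time.Time, p SLAPolicy) bool {
	if ackedAt != nil {
		return ackedAt.Sub(openedAt) > p.Acknowledge
	}
	return time.Since(openedAt) > p.Acknowledge
}

func main() {
	critical := SLAPolicy{Acknowledge: 5 * time.Minute,
		FirstResponse: 15 * time.Minute, Resolution: 2 * time.Hour}
	opened := time.Now().Add(-7 * time.Minute)
	fmt.Println(ackBreached(opened, nil, critical)) // true: escalate, page a human
}
```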
Maintenance windows
Plans + schedules with optional approval gates and recurrence. Active maintenance windows automatically suppress incident creation on the targeted servers — no false alarms during planned changes.
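Suppression reduces to an interval-overlap check against the active windows; a sketch with assumed types, after recurrence has been expanded to concrete start/end times:

```go
package main

import (
	"fmt"
	"time"
)

// Window is a hypothetical maintenance-window record after recurrence
// expansion: a concrete interval plus the servers it covers.
type Window struct {
	Start, End time.Time
	Servers    map[string]bool
}

// suppressed reports whether incident creation for a server should be
// skipped because an approved window is active right now.
func suppressed(server string, now time.Time, windows []Window) bool {
	for _, w := range windows {
		if w.Servers[server] && !now.Before(w.Start) && now.Before(w.End) {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	w := Window{Start: now.Add(-time.Hour), End: now.Add(time.Hour),
		Servers: map[string]bool{"edge-04": true}}
	fmt.Println(suppressed("edge-04", now, []Window{w})) // true: no incident opens
}
```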
Append-only audit log
Every state-changing action — create, update, delete, execute, approve, login, login failure — gets a row. Tenant-scoped, IP-captured (with trusted-proxy logic so X-Forwarded-For can't be spoofed), and indexed by resource so the dashboard's per-resource history filter is one click.
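The trusted-proxy logic matters because X-Forwarded-For is attacker-writable. A sketch of one common approach (walk the chain right to left past trusted proxies), assuming a static trusted set; the real platform's proxy configuration may differ.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// clientIP resolves the address to log: only when the direct peer is a
// trusted proxy does X-Forwarded-For get believed, and even then the
// chain is walked right to left until the first untrusted hop, so a
// client-supplied prefix can't spoof the recorded IP.
func clientIP(remoteAddr, xff string, trusted map[string]bool) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr
	}
	if !trusted[host] || xff == "" {
		return host
	}
	hops := strings.Split(xff, ",")
	for i := len(hops) - 1; i >= 0; i-- {
		ip := strings.TrimSpace(hops[i])
		if !trusted[ip] {
			return ip
		}
	}
	return host
}

func main() {
	trusted := map[string]bool{"10.0.0.2": true}
	// The client prepended a fake hop; the walk stops at 203.0.113.9.
	fmt.Println(clientIP("10.0.0.2:41234",
		"1.2.3.4, 203.0.113.9", trusted)) // 203.0.113.9
}
```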
nginx restart on edge-04
Tue, 14:08 UTC · 12 minutes total · approved by alberto@…
What happened
nginx.service crashed with SIGSEGV after a config reload at 14:01. The daemon detected the failed state within 12 seconds and opened an incident.
Root cause
A reload pulled in a partially-written /etc/nginx/sites-enabled/api.conf. The deploy pipeline had no atomic-write step.
What we did
Approved the systemd-restart-service recipe at 14:09. Service back to active (running) in 6 s. Verified with three follow-up health probes.
Follow-up
Open ticket against the deploy pipeline to add atomic file-replace. Add a proactive policy to alert on partial config files.
Want to see it running on your servers?
The waitlist is the fastest path. Reading order from there: start with the overview, then the proactive monitoring page, then install the daemon.