Overview
What OpenRemedy is and how it fits together.
OpenRemedy is a multi-tenant Linux server monitoring and remediation platform with AI-driven incident response and continuous early detection.
What it does (plain language)
A team running production servers spends a large portion of its time on the same handful of recurring problems: a service crashed, a disk filled up, a process is stuck, a port stopped responding. Most of those have known fixes that an experienced operator would apply in seconds — but only if they catch them in time.
OpenRemedy does two things at once. It watches every server continuously so problems are caught early, often before any user notices. And when something is detected, whether by the platform itself, by an external monitoring stack, or by an AI agent on a routine patrol of the fleet, it classifies, decides, and either fixes or escalates. The riskier the fix, the more human approval it needs. Every step is logged.
Think of it as a junior on-call engineer who:
- never sleeps,
- watches every server constantly,
- spots small deviations before they become outages,
- knows the runbooks by heart,
- always asks before touching anything important,
- writes a complete report afterward.
How it works, in one diagram
Detection is not a single source; see proactive for the five mechanisms that feed the same pipeline. The rest of this document is written for operators and engineers who need to understand the platform internals.
Detection sources
OpenRemedy creates incidents from five independent mechanisms, running concurrently and converging on the same pipeline. Full breakdown in proactive. Brief summary:
| Source | Where it runs | Cadence | Best for |
|---|---|---|---|
| Daemon monitors | The customer server (Go agent) | Continuous, ~15 s | Standard system metrics on servers where the agent is installed |
| Webhook ingestion | Platform, push-based | Sub-second | Existing monitoring stacks (Alertmanager, Grafana, Datadog, PagerDuty) |
| CheckScheduler + Evaluator | Platform, proactive container | 60 s sweep | Stateful or context-aware checks (DB queries, multi-step probes), legacy hosts without daemon |
| Agent patrols | Platform, swarm container | Per-agent patrol_interval minutes | Anomalies that aren't explicit alarm conditions — "the 3 a.m. quiet that shouldn't be quiet" |
| Manual entry | UI | On demand | Ad-hoc inquiries; informational queries (incident_type=custom) |
A sixth mechanism, IncidentWatcher, is not a detection source per se — it re-invokes the agent pipeline whenever a human comments on an escalated incident, closing the loop between operator and agent without manual re-triggering.
Webhook authentication
External webhook requests must be HMAC-SHA256 signed against the tenant's `webhook_secret`, with the signature presented in the `X-OpenRemedy-Signature: sha256=<hex>` header. See integrations for signing examples and adapter patterns.
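For example, a client might sign and send an alert like this (a minimal Python sketch; the payload and placeholder values are illustrative, while the header format and HMAC scheme are as described above):

```python
import hashlib
import hmac

import requests  # third-party: pip install requests

# Placeholders: substitute your tenant's values.
WEBHOOK_SECRET = b"<tenant webhook_secret>"
URL = "https://<host>/api/v1/webhooks/alerts/<tenant_slug>"

# Sign the raw bytes that go on the wire; re-serializing JSON after
# signing would change the body and invalidate the signature.
body = b'{"alertname": "DiskFull", "severity": "critical"}'  # illustrative payload
signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()

resp = requests.post(
    URL,
    data=body,
    headers={
        "Content-Type": "application/json",
        "X-OpenRemedy-Signature": f"sha256={signature}",
    },
    timeout=10,
)
resp.raise_for_status()
```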
Daemon authentication
The daemon authenticates via session token (`Bearer` in `Authorization`). Custom monitor commands carry an HMAC signature issued by the platform; the daemon refuses to execute unsigned or tampered commands, defending against compromise of the platform DB.
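The daemon is a Go binary, but the check it performs reduces to a few lines; a Python sketch, assuming the signing key is provisioned to the daemon outside the platform DB (which is what makes the DB-tampering defense hold):

```python
import hashlib
import hmac

def verify_monitor_command(command: str, signature_hex: str, signing_key: bytes) -> bool:
    """Recompute the HMAC over the command and compare in constant time.

    Assumption: the signing key lives with the daemon/platform, not in the
    DB, so tampering with monitor rows cannot yield a signature the daemon
    will accept.
    """
    expected = hmac.new(signing_key, command.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

# The daemon drops anything that fails the check instead of executing it:
# if not verify_monitor_command(cmd, sig, key): refuse and report.
```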
Detailed pipeline
The full flow that runs after an incident is created:
Each stage is a separate agent invocation with its own prompt, tool budget, and tenant-scoped context. The risk gate is a hard server-side check (`should_request_approval`); the LLM cannot self-approve a medium- or high-risk recipe. Autonomous trust permits auto-execution only at low risk; supervised and manual trust levels require human approval at every step.
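In outline, the gate is a small pure function; a sketch assuming only the semantics stated above (the name should_request_approval comes from this page, the signature is illustrative):

```python
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

def should_request_approval(recipe_risk: str, agent_trust: str) -> bool:
    """Hard server-side gate; the LLM never evaluates or overrides this.

    supervised/manual trust: human approval at every step.
    autonomous trust: auto-execution permitted only at low risk.
    """
    if agent_trust in ("supervised", "manual"):
        return True
    return RISK_ORDER[recipe_risk] > RISK_ORDER["low"]
```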
Concept reference
| Concept | Definition |
|---|---|
| Incident | A problem detected on a server. Created via webhook, manual entry, or the daemon. Lifecycle: open → classifying → recipe_proposed → awaiting_approval → executing → resolved (or failed → escalated). Custom-type incidents are informational queries. |
| Recipe | An Ansible playbook the platform is allowed to run. Carries a risk level (low, medium, high) which gates auto-execution. The catalog is global; only superadmin may create, update, or delete recipes. |
| Agent | An LLM-backed entity with a trust level (autonomous, supervised, manual), a role set (triage / diagnose / validate / execute / review), and a system prompt. |
| Skill | A Markdown knowledge module attached to an agent (e.g., "nginx operations"). Loaded into the agent's context at runtime. |
| Tool | A function callable by the agent during reasoning. Built-in (curated diagnostic verbs, management functions) or custom (operator-defined shell_command or http_request templates with sandboxed parameter substitution; see the sketch after this table). |
| Policy | A flow definition (visual editor) mapping a trigger to a recipe over a set of servers. Drives proactive monitoring. |
| Daemon | Optional Go binary on the managed server. Provides heartbeats, evidence collection, and platform-signed custom monitor execution. |
| Tenant | Isolation boundary. Servers, recipes (read), policies, agents, audit logs, secrets, webhook secret, and users are scoped per tenant. |
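As an illustration of the sandboxed parameter substitution mentioned in the Tool row, a minimal Python sketch; the allow-list pattern and quoting strategy are assumptions, the source states only that substitution is sandboxed:

```python
import re
import shlex

# Illustrative allow-list: short tokens with no shell metacharacters.
PARAM_PATTERN = re.compile(r"^[A-Za-z0-9_.:/-]{1,128}$")

def render_shell_command(template: str, params: dict[str, str]) -> str:
    """Substitute agent-supplied values into an operator-defined template.

    Every value is validated against the allow-list and shell-quoted, so a
    parameter like service="nginx; rm -rf /" is rejected rather than
    concatenated into the command line.
    """
    safe = {}
    for name, value in params.items():
        if not PARAM_PATTERN.match(value):
            raise ValueError(f"parameter {name!r} rejected by sandbox")
        safe[name] = shlex.quote(value)
    return template.format(**safe)

# render_shell_command("systemctl restart {service}", {"service": "nginx"})
# -> 'systemctl restart nginx'
```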
Comparison: OpenRemedy vs OpenClaw
OpenRemedy is sometimes compared to OpenClaw because both are AI agents that take real actions on systems. They sit in adjacent design spaces with materially different threat models.
| Dimension | OpenRemedy | OpenClaw |
|---|---|---|
| Audience | SRE / ops teams managing fleets | Single-user personal assistant |
| Tenancy | Multi-tenant SaaS, hard isolation per tenant | Single-user, runs on the owner's machine |
| Action surface | Curated Ansible recipes, sandboxed tool catalog, daemon with HMAC-signed configs | Free-form shell, browser, camera, location, arbitrary code |
| Approval model | Risk-gated; medium+ requires human approval | Owner trusts the assistant by definition |
| Threat model | Tenant admin compromise, DB tampering, prompt injection, cross-tenant leakage | None — owner is the security boundary |
OpenClaw's freedom is the feature for a personal assistant. For a multi-tenant ops platform, that same freedom would be a liability. OpenRemedy ships the same agentic capability with explicit guardrails: no LLM-driven shell execution, HMAC-signed daemon configs, parameter sandboxing on every custom tool, and tenant scoping on every fanout channel.
Authentication model
- JWTs travel as `HttpOnly` + `Secure` + `SameSite=strict` cookies (`access_token`, `refresh_token`). JavaScript never sees them.
- Webhooks require `X-OpenRemedy-Signature: sha256=<hex>`, an HMAC of the raw body using `tenant.webhook_secret`.
- WebSocket handshakes authenticate via the cookie (browser default) or via `Sec-WebSocket-Protocol: bearer, <jwt>` (programmatic clients). Fanout is filtered server-side by `tenant_id`.
- `/auth/login` is rate-limited at 10 requests/minute per client IP. `/webhooks/alerts/{slug}` is rate-limited at 60 requests/minute.
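For programmatic WebSocket clients, the subprotocol handshake looks roughly like this (a sketch using the third-party websockets library; only the header format and endpoint come from this page, and token acquisition is out of scope):

```python
import asyncio

import websockets  # third-party: pip install websockets

async def watch_incidents(jwt: str) -> None:
    # The JWT travels as the second entry in Sec-WebSocket-Protocol; the
    # server authenticates it and filters fanout by tenant_id before any
    # message reaches this client.
    async with websockets.connect(
        "wss://<host>/ws/incidents",
        subprotocols=["bearer", jwt],
    ) as ws:
        async for message in ws:
            print(message)

# asyncio.run(watch_incidents("<jwt>"))
```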
Documentation index
| Path | Audience | Contents |
|---|---|---|
| proactive | Operators, ops leads | The five mechanisms that detect and create incidents. Cadence, tuning knobs, when to use each. |
| integrations | Integrators | Connecting Alertmanager, Grafana, Datadog, PagerDuty, or custom clients to the webhook endpoint. Signing examples in bash, Python, Node.js. |
| security | Operators, auditors | Required environment variables, authentication model, webhook and daemon HMAC, tenant isolation, approval gate, custom tool sandbox, encryption at rest, audit. |
| Dashboard | Operators | Section-by-section reference for every menu item in the web UI. |
Endpoint glossary
| Surface | Endpoint |
|---|---|
| REST | https://<host>/api/v1/... |
| WebSocket | wss://<host>/ws/incidents and wss://<host>/ws/executions/{id} |
| Daemon | https://<host>/daemon/v1/{heartbeat,evidence,tasks} |
| Webhooks | https://<host>/api/v1/webhooks/alerts/{tenant_slug} |
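To make the daemon surface concrete, a heartbeat call would look roughly like this (the payload fields are hypothetical; only the path and the Bearer scheme come from this page):

```python
import requests  # third-party: pip install requests

SESSION_TOKEN = "<daemon session token>"  # placeholder

resp = requests.post(
    "https://<host>/daemon/v1/heartbeat",
    headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
    json={"hostname": "web-01"},  # hypothetical payload field
    timeout=5,
)
resp.raise_for_status()
```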