Overview
What OpenRemedy is and how it fits together.
OpenRemedy is a multi-tenant Linux server monitoring and remediation platform with AI-driven incident response and continuous early detection.
What it does (plain language)
A team running production servers spends a large portion of its time on the same handful of recurring problems: a service crashed, a disk filled up, a process is stuck, a port stopped responding. Most of those have known fixes that an experienced operator would apply in seconds — but only if they catch them in time.
OpenRemedy does two things at once. It watches every server continuously so problems are caught early, often before any user notices. And when something is detected, whether by the platform itself, by an external monitoring stack, or by an AI agent on a routine patrol of the fleet, it classifies, decides, and either fixes or escalates. The riskier the fix, the more human approval it needs. Every step is logged.
Think of it as a junior on-call engineer who:
- never sleeps,
- watches every server constantly,
- spots small deviations before they become outages,
- knows the runbooks by heart,
- always asks before touching anything important,
- writes a complete report afterward.
How it works, in one diagram
Detection is not a single source; see proactive for the five mechanisms that feed the same pipeline. The rest of this document is written for operators and engineers who need to understand the platform internals.
Detection sources
OpenRemedy creates incidents from five independent mechanisms, running concurrently and converging on the same pipeline. Full breakdown in proactive. Brief summary:
| Source | Where it runs | Cadence | Best for |
|---|---|---|---|
| Daemon monitors | The customer server (Go agent) | Continuous, ~15 s | Standard system metrics on servers where the agent is installed |
| Webhook ingestion | Platform, push-based | Sub-second | Existing monitoring stacks (Alertmanager, Grafana, Datadog, PagerDuty) |
| CheckScheduler + Evaluator | Platform, proactive container | 60 s sweep | Stateful or context-aware checks (DB queries, multi-step probes), legacy hosts without daemon |
| Agent patrols | Platform, swarm container | Per-agent patrol_interval minutes | Anomalies that aren't explicit alarm conditions — "the 3 a.m. quiet that shouldn't be quiet" |
| Manual entry | UI | On demand | Ad-hoc inquiries; informational queries (incident_type=custom) |
A sixth mechanism, IncidentWatcher, is not a detection source per se — it re-invokes the agent pipeline whenever a human comments on an escalated incident, closing the loop between operator and agent without manual re-triggering.
Webhook authentication
External webhook requests must be HMAC-SHA256 signed against the tenant's `webhook_secret`, with the signature presented in the `X-OpenRemedy-Signature: sha256=<hex>` header. See integrations for signing examples and adapter patterns.
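For example, a client might sign and send an alert like this (a minimal Python sketch; the payload and placeholder values are illustrative, while the header format and HMAC scheme are as described above):

```python
import hashlib
import hmac

import requests  # third-party: pip install requests

# Placeholders: substitute your tenant's values.
WEBHOOK_SECRET = b"<tenant webhook_secret>"
URL = "https://<host>/api/v1/webhooks/alerts/<tenant_slug>"

# Sign the raw bytes that go on the wire; re-serializing JSON after
# signing would change the body and invalidate the signature.
body = b'{"alertname": "DiskFull", "severity": "critical"}'  # illustrative payload
signature = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()

resp = requests.post(
    URL,
    data=body,
    headers={
        "Content-Type": "application/json",
        "X-OpenRemedy-Signature": f"sha256={signature}",
    },
    timeout=10,
)
resp.raise_for_status()
```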
Daemon authentication
The daemon authenticates via session token (`Bearer` in `Authorization`). Custom monitor commands carry an HMAC signature issued by the platform; the daemon refuses to execute unsigned or tampered commands, defending against compromise of the platform DB.
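The daemon is a Go binary, but the check it performs reduces to a few lines; a Python sketch, assuming the signing key is provisioned to the daemon outside the platform DB (which is what makes the DB-tampering defense hold):

```python
import hashlib
import hmac

def verify_monitor_command(command: str, signature_hex: str, signing_key: bytes) -> bool:
    """Recompute the HMAC over the command and compare in constant time.

    Assumption: the signing key lives with the daemon/platform, not in the
    DB, so tampering with monitor rows cannot yield a signature the daemon
    will accept.
    """
    expected = hmac.new(signing_key, command.encode("utf-8"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

# The daemon drops anything that fails the check instead of executing it:
# if not verify_monitor_command(cmd, sig, key): refuse and report.
```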
Detailed pipeline
The full flow that runs after an incident is created:
Each stage is a separate agent invocation with its own prompt, tool budget, and tenant-scoped context. The risk gate is a hard server-side check (`should_request_approval`); the LLM cannot self-approve a medium- or high-risk recipe. Autonomous trust permits auto-execution only at low risk; supervised and manual trust levels require human approval at every step.
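In outline, the gate is a small pure function; a sketch assuming only the semantics stated above (the name should_request_approval comes from this page, the signature is illustrative):

```python
RISK_ORDER = {"low": 0, "medium": 1, "high": 2}

def should_request_approval(recipe_risk: str, agent_trust: str) -> bool:
    """Hard server-side gate; the LLM never evaluates or overrides this.

    supervised/manual trust: human approval at every step.
    autonomous trust: auto-execution permitted only at low risk.
    """
    if agent_trust in ("supervised", "manual"):
        return True
    return RISK_ORDER[recipe_risk] > RISK_ORDER["low"]
```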
Concept reference
| Concept | Definition |
|---|---|
| Incident | A problem detected on a server. Created via webhook, manual entry, or the daemon. Lifecycle: open → classifying → recipe_proposed → awaiting_approval → executing → resolved (or failed → escalated). Custom-type incidents are informational queries. |
| Recipe | An Ansible playbook the platform is allowed to run. Carries a risk level (low, medium, high) which gates auto-execution. The catalog is global; only superadmin may create, update, or delete recipes. |
| Agent | An LLM-backed entity with a trust level (autonomous, supervised, manual), a role set (triage / diagnose / validate / execute / review), and a system prompt. |
| Skill | A Markdown knowledge module attached to an agent (e.g., "nginx operations"). Loaded into the agent's context at runtime. |
| Tool | A function callable by the agent during reasoning. Built-in (curated diagnostic verbs, management functions) or custom (operator-defined shell_command or http_request templates with sandboxed parameter substitution; see the sketch after this table). |
| Policy | A flow definition (visual editor) mapping a trigger to a recipe over a set of servers. Drives proactive monitoring. |
| Daemon | Optional Go binary on the managed server. Provides heartbeats, evidence collection, and platform-signed custom monitor execution. |
| Tenant | Isolation boundary. Servers, recipes (read), policies, agents, audit logs, secrets, webhook secret, and users are scoped per tenant. |
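As an illustration of the sandboxed parameter substitution mentioned in the Tool row, a minimal Python sketch; the allow-list pattern and quoting strategy are assumptions, the source states only that substitution is sandboxed:

```python
import re
import shlex

# Illustrative allow-list: short tokens with no shell metacharacters.
PARAM_PATTERN = re.compile(r"^[A-Za-z0-9_.:/-]{1,128}$")

def render_shell_command(template: str, params: dict[str, str]) -> str:
    """Substitute agent-supplied values into an operator-defined template.

    Every value is validated against the allow-list and shell-quoted, so a
    parameter like service="nginx; rm -rf /" is rejected rather than
    concatenated into the command line.
    """
    safe = {}
    for name, value in params.items():
        if not PARAM_PATTERN.match(value):
            raise ValueError(f"parameter {name!r} rejected by sandbox")
        safe[name] = shlex.quote(value)
    return template.format(**safe)

# render_shell_command("systemctl restart {service}", {"service": "nginx"})
# -> 'systemctl restart nginx'
```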
Comparison: OpenRemedy vs OpenClaw
OpenRemedy is sometimes compared to OpenClaw because both are AI agents that take real actions on systems. They sit in adjacent design spaces with materially different threat models.
| Dimension | OpenRemedy | OpenClaw |
|---|---|---|
| Audience | SRE / ops teams managing fleets | Single-user personal assistant |
| Tenancy | Multi-tenant SaaS, hard isolation per tenant | Single-user, runs on the owner's machine |
| Action surface | Curated Ansible recipes, sandboxed tool catalog, daemon with HMAC-signed configs | Free-form shell, browser, camera, location, arbitrary code |
| Approval model | Risk-gated; medium+ requires human approval | Owner trusts the assistant by definition |
| Threat model | Tenant admin compromise, DB tampering, prompt injection, cross-tenant leakage | None — owner is the security boundary |
OpenClaw's freedom is the feature for a personal assistant. For a multi-tenant ops platform, that same freedom would be a liability. OpenRemedy ships the same agentic capability with explicit guardrails: no LLM-driven shell execution, HMAC-signed daemon configs, parameter sandboxing on every custom tool, and tenant scoping on every fanout channel.
Authentication model
- JWTs travel as `HttpOnly` + `Secure` + `SameSite=strict` cookies (`access_token`, `refresh_token`). JavaScript never sees them.
- Webhooks require `X-OpenRemedy-Signature: sha256=<hex>`, an HMAC of the raw body using `tenant.webhook_secret`.
- WebSocket handshakes authenticate via the cookie (browser default) or via `Sec-WebSocket-Protocol: bearer, <jwt>` (programmatic clients). Fanout is filtered server-side by `tenant_id`.
- `/auth/login` is rate-limited at 10 requests/minute per client IP. `/webhooks/alerts/{slug}` is rate-limited at 60 requests/minute.
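For programmatic WebSocket clients, the subprotocol handshake looks roughly like this (a sketch using the third-party websockets library; only the header format and endpoint come from this page, and token acquisition is out of scope):

```python
import asyncio

import websockets  # third-party: pip install websockets

async def watch_incidents(jwt: str) -> None:
    # The JWT travels as the second entry in Sec-WebSocket-Protocol; the
    # server authenticates it and filters fanout by tenant_id before any
    # message reaches this client.
    async with websockets.connect(
        "wss://<host>/ws/incidents",
        subprotocols=["bearer", jwt],
    ) as ws:
        async for message in ws:
            print(message)

# asyncio.run(watch_incidents("<jwt>"))
```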
Documentation index
| Path | Audience | Contents |
|---|---|---|
| proactive | Operators, ops leads | The five mechanisms that detect and create incidents. Cadence, tuning knobs, when to use each. |
| integrations | Integrators | Connecting Alertmanager, Grafana, Datadog, PagerDuty, or custom clients to the webhook endpoint. Signing examples in bash, Python, Node.js. |
| security | Operators, auditors | Required environment variables, authentication model, webhook and daemon HMAC, tenant isolation, approval gate, custom tool sandbox, encryption at rest, audit. |
| Dashboard | Operators | Section-by-section reference for every menu item in the web UI. |
Endpoint glossary
| Surface | Endpoint |
|---|---|
| REST | https://<host>/api/v1/... |
| WebSocket | wss://<host>/ws/incidents and wss://<host>/ws/executions/{id} |
| Daemon | https://<host>/daemon/v1/{heartbeat,evidence,tasks} |
| Webhooks | https://<host>/api/v1/webhooks/alerts/{tenant_slug} |
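To make the daemon surface concrete, a heartbeat call would look roughly like this (the payload fields are hypothetical; only the path and the Bearer scheme come from this page):

```python
import requests  # third-party: pip install requests

SESSION_TOKEN = "<daemon session token>"  # placeholder

resp = requests.post(
    "https://<host>/daemon/v1/heartbeat",
    headers={"Authorization": f"Bearer {SESSION_TOKEN}"},
    json={"hostname": "web-01"},  # hypothetical payload field
    timeout=5,
)
resp.raise_for_status()
```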