OpenRemedy

Overview

What OpenRemedy is and how it fits together.

OpenRemedy is a multi-tenant Linux server monitoring and remediation platform with AI-driven incident response and continuous early detection.


What it does (plain language)

A team running production servers spends a large portion of its time on the same handful of recurring problems: a service crashed, a disk filled up, a process is stuck, a port stopped responding. Most of those have known fixes that an experienced operator would apply in seconds — but only if they catch them in time.

OpenRemedy does two things at once. It watches every server continuously so problems are caught early, often before any user notices. And when something is detected, whether by the platform itself, by an external monitoring stack, or by an AI agent doing a routine round of the fleet, it classifies, decides, and either fixes or escalates. The riskier the fix, the more human approval it needs. Every step is logged.

Think of it as a junior on-call engineer who:

  • never sleeps,
  • watches every server constantly,
  • spots small deviations before they become outages,
  • knows the runbooks by heart,
  • always asks before touching anything important,
  • writes a complete report afterward.

How it works, in one diagram

Detection is not a single source — see proactive for the five mechanisms that feed the same pipeline. The rest of this document is for operators and engineers who need to understand the platform internals.


Detection sources

OpenRemedy creates incidents from five independent mechanisms, running concurrently and converging on the same pipeline. Full breakdown in proactive. Brief summary:

| Source | Where it runs | Cadence | Best for |
|---|---|---|---|
| Daemon monitors | The customer server (Go agent) | Continuous, ~15 s | Standard system metrics on servers where the agent is installed |
| Webhook ingestion | Platform, push-based | Sub-second | Existing monitoring stacks (Alertmanager, Grafana, Datadog, PagerDuty) |
| CheckScheduler + Evaluator | Platform, proactive container | 60 s sweep | Stateful or context-aware checks (DB queries, multi-step probes); legacy hosts without the daemon |
| Agent patrols | Platform, swarm container | Per-agent patrol_interval minutes | Anomalies that aren't explicit alarm conditions: "the 3 a.m. quiet that shouldn't be quiet" |
| Manual entry | UI | On demand | Ad-hoc inquiries; informational queries (incident_type=custom) |

A sixth mechanism, IncidentWatcher, is not a detection source per se — it re-invokes the agent pipeline whenever a human comments on an escalated incident, closing the loop between operator and agent without manual re-triggering.

Webhook authentication

External webhook requests must be HMAC-SHA256 signed against the tenant's webhook_secret, presented in the X-OpenRemedy-Signature: sha256=<hex> header. See integrations for signing examples and adapter patterns.
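For programmatic clients, the signing step looks roughly like this minimal Python sketch. The header name and `sha256=<hex>` format are as documented above; the secret and payload values are illustrative only:

```python
import hashlib
import hmac

def sign_payload(secret: str, raw_body: bytes) -> str:
    """Return the X-OpenRemedy-Signature header value for raw_body."""
    digest = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

# Illustrative values; use the tenant's real webhook_secret in practice.
body = b'{"alert": "disk_full", "host": "web-01"}'
headers = {
    "Content-Type": "application/json",
    "X-OpenRemedy-Signature": sign_payload("tenant-webhook-secret", body),
}
```

Note that the HMAC must be computed over the raw request body bytes, not a re-serialized copy, or the platform-side comparison will fail.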

Daemon authentication

The daemon authenticates via session token (Bearer in Authorization). Custom monitor commands carry an HMAC signature issued by the platform; the daemon refuses to execute unsigned or tampered commands, defending against compromise of the platform DB.
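A daemon-side verification step might look like this sketch. The key and command values are illustrative, not the daemon's actual wire format; the constant-time comparison is the important part:

```python
import hashlib
import hmac

def verify_command(platform_key: bytes, command: str, signature_hex: str) -> bool:
    """Daemon-side check: execute the monitor command only if the
    platform-issued HMAC signature matches (constant-time compare)."""
    expected = hmac.new(platform_key, command.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

An unsigned or tampered command simply fails this check and is refused, so an attacker who can only modify rows in the platform DB cannot push arbitrary commands to the fleet.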


Detailed pipeline

The full flow that runs after an incident is created:

Each stage is a separate agent invocation with its own prompt, tool budget, and tenant-scoped context. The risk gate is a hard server-side check (should_request_approval); the LLM cannot self-approve a medium- or high-risk recipe. The autonomous trust level permits auto-execution only at low risk; the supervised and manual trust levels require human approval at every step.
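The gate described above reduces to a small predicate. This sketch assumes only the risk and trust labels named in this document; it is not the platform's actual should_request_approval implementation:

```python
def should_request_approval(risk_level: str, trust_level: str) -> bool:
    """Sketch of the server-side risk gate: only an autonomous agent
    running a low-risk recipe may skip human approval. Any other
    combination (supervised/manual trust, or medium/high risk) must
    wait for a human."""
    return not (trust_level == "autonomous" and risk_level == "low")
```

Because this check runs server-side on every execution request, a prompt-injected agent that "decides" a high-risk recipe is safe still cannot run it without a human clicking approve.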


Concept reference

| Concept | Definition |
|---|---|
| Incident | A problem detected on a server. Created via webhook, manual entry, or the daemon. Lifecycle: open → classifying → recipe_proposed → awaiting_approval → executing → resolved (or failed → escalated). Custom-type incidents are informational queries. |
| Recipe | An Ansible playbook the platform is allowed to run. Carries a risk level (low, medium, high) which gates auto-execution. The catalog is global; only superadmin may create, update, or delete recipes. |
| Agent | An LLM-backed entity with a trust level (autonomous, supervised, manual), a role set (triage / diagnose / validate / execute / review), and a system prompt. |
| Skill | A Markdown knowledge module attached to an agent (e.g., "nginx operations"). Loaded into the agent's context at runtime. |
| Tool | A function callable by the agent during reasoning. Built-in (curated diagnostic verbs, management functions) or custom (operator-defined shell_command or http_request templates with sandboxed parameter substitution). |
| Policy | A flow definition (visual editor) mapping a trigger to a recipe over a set of servers. Drives proactive monitoring. |
| Daemon | Optional Go binary on the managed server. Provides heartbeats, evidence collection, and platform-signed custom monitor execution. |
| Tenant | Isolation boundary. Servers, recipes (read), policies, agents, audit logs, secrets, webhook secret, and users are scoped per tenant. |
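The "sandboxed parameter substitution" mentioned for custom tools can be sketched as a whitelist-plus-quoting step. The `{name}` placeholder syntax and the ALLOWED_PARAMS set below are assumptions for illustration, not the platform's actual mechanism:

```python
import re
import shlex

ALLOWED_PARAMS = {"hostname", "service"}  # hypothetical whitelist, not the platform's

def render_tool_command(template: str, params: dict) -> str:
    """Substitute only whitelisted placeholders into a shell_command
    template, shell-quoting every value so an LLM-supplied parameter
    cannot inject extra commands."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name not in ALLOWED_PARAMS or name not in params:
            raise ValueError(f"parameter not allowed: {name}")
        return shlex.quote(str(params[name]))
    return re.sub(r"\{(\w+)\}", repl, template)
```

With this shape, a malicious value like `nginx; rm -rf /` is rendered as a single quoted argument rather than a second command, and any placeholder outside the whitelist is rejected outright.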

Comparison: OpenRemedy vs OpenClaw

OpenRemedy is sometimes compared to OpenClaw because both are AI agents that take real actions on systems. They sit in adjacent design spaces with materially different threat models.

| Dimension | OpenRemedy | OpenClaw |
|---|---|---|
| Audience | SRE / ops teams managing fleets | Single-user personal assistant |
| Tenancy | Multi-tenant SaaS, hard isolation per tenant | Single-user, runs on the owner's machine |
| Action surface | Curated Ansible recipes, sandboxed tool catalog, daemon with HMAC-signed configs | Free-form shell, browser, camera, location, arbitrary code |
| Approval model | Risk-gated; medium+ requires human approval | Owner trusts the assistant by definition |
| Threat model | Tenant admin compromise, DB tampering, prompt injection, cross-tenant leakage | None; the owner is the security boundary |

OpenClaw's freedom is the feature for a personal assistant. For a multi-tenant ops platform, that same freedom would be a liability. OpenRemedy ships the same agentic capability with explicit guardrails: no LLM-driven shell execution, HMAC-signed daemon configs, parameter sandboxing on every custom tool, and tenant scoping on every fanout channel.


Authentication model

  • JWTs travel as HttpOnly + Secure + SameSite=strict cookies (access_token, refresh_token). JavaScript never sees them.
  • Webhooks require X-OpenRemedy-Signature: sha256=<hex> HMAC of the raw body using tenant.webhook_secret.
  • WebSocket handshakes authenticate via the cookie (browser default) or via Sec-WebSocket-Protocol: bearer, <jwt> (programmatic clients). Fanout is filtered server-side by tenant_id.
  • /auth/login is rate-limited at 10 requests/minute per client IP. /webhooks/alerts/{slug} is rate-limited at 60 requests/minute.
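The documentation does not specify the limiter algorithm behind those per-minute figures; a token bucket is one common shape that matches them. A sketch under that assumption, not the platform's implementation:

```python
import time

class TokenBucket:
    """Per-client limiter: allows `rate_per_min` requests per minute,
    refilling smoothly rather than in fixed windows (an assumption)."""
    def __init__(self, rate_per_min: int):
        self.capacity = rate_per_min
        self.tokens = float(rate_per_min)
        self.fill_rate = rate_per_min / 60.0  # tokens per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.fill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

login_limiter = TokenBucket(10)    # /auth/login: 10 req/min per client IP
webhook_limiter = TokenBucket(60)  # /webhooks/alerts/{slug}: 60 req/min
```

Clients that receive a 429 should back off rather than retry immediately; a burst of 10 failed logins exhausts the login bucket for roughly a minute.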

Documentation index

| Path | Audience | Contents |
|---|---|---|
| proactive | Operators, ops leads | The five mechanisms that detect and create incidents. Cadence, tuning knobs, when to use each. |
| integrations | Integrators | Connecting Alertmanager, Grafana, Datadog, PagerDuty, or custom clients to the webhook endpoint. Signing examples in bash, Python, Node.js. |
| security | Operators, auditors | Required environment variables, authentication model, webhook and daemon HMAC, tenant isolation, approval gate, custom tool sandbox, encryption at rest, audit. |
| Dashboard | Operators | Section-by-section reference for every menu item in the web UI. |

Endpoint glossary

REST       https://<host>/api/v1/...
WebSocket  wss://<host>/ws/incidents
           wss://<host>/ws/executions/{id}
Daemon     https://<host>/daemon/v1/{heartbeat,evidence,tasks}
Webhooks   https://<host>/api/v1/webhooks/alerts/{tenant_slug}