Proactive monitoring
Five mechanisms for continuous early detection.
OpenRemedy does not wait for your servers to fail. The platform runs five independent proactive mechanisms that watch the fleet continuously and create incidents the moment a deviation is detected, typically before it propagates into a user-visible outage. Whether the incident comes from an external alert manager, a daemon threshold, a scheduled health probe, or an agent's unsolicited observation, the same pipeline handles it from there.
This document covers what each mechanism does, what it is good at, and how to tune it.
The five incident sources
Each of the five mechanisms is described below.
1 · Daemon monitors (on the managed server)
Runs in `openremedy-agent` on the customer server.
The daemon executes its assigned monitors locally and reports the results to the platform every 15 seconds. Monitor types:
| Type | Notes |
|---|---|
| `cpu` | Load percent threshold |
| `memory` | RSS percent threshold |
| `disk` | Per-mount percent threshold |
| `port` | TCP probe |
| `service` | systemd unit state |
| `url` | HTTP status code |
| `log` | Regex pattern over a log file with a sliding window |
| `process` | Minimum process count for a name |
| `docker` | Container running / `docker_health` status |
| `custom` | Operator-defined shell, HMAC-signed by the platform |
When a monitor crosses its threshold, the daemon emits an Alert in
the next evidence POST. The backend creates the incident immediately.
Tuning
- Heartbeat / report cadence. Defaults to 15 seconds; overridable in `/etc/openremedy-agent/config.json` (`heartbeat_interval_seconds`, `report_interval_seconds`).
- Thresholds. Set per policy in the dashboard. Defaults applied to a new server come from /settings/servers.
- Custom monitors. Added by the operator in the server detail page (Settings tab → `custom_monitors` JSON). The platform HMAC-signs each command before serving it; the daemon refuses to exec unsigned commands.
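The signing scheme can be sketched with the standard library. The exact key handling and digest algorithm are assumptions; the shape (platform signs, daemon verifies before exec) follows the description above.

```python
import hashlib
import hmac

def sign_command(secret: bytes, command: str) -> str:
    """Platform side: attach an HMAC-SHA256 signature to a custom
    monitor command before serving it to the daemon."""
    return hmac.new(secret, command.encode(), hashlib.sha256).hexdigest()

def verify_command(secret: bytes, command: str, signature: str) -> bool:
    """Daemon side: refuse to exec unless the signature matches."""
    expected = sign_command(secret, command)
    return hmac.compare_digest(expected, signature)

secret = b"shared-platform-secret"            # hypothetical key material
sig = sign_command(secret, "df -h /var/lib")
assert verify_command(secret, "df -h /var/lib", sig)
assert not verify_command(secret, "rm -rf /tmp/x", sig)  # tampered command rejected
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels on the comparison.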
When to use
Default for any server where the daemon can be installed. Lowest detection latency, no platform-side scheduling overhead.
2 · CheckScheduler (on the platform)
Runs in the `proactive` container.
Sweeps every 60 seconds. Loads every policy whose `flow_definition`
contains a `recipe_check` trigger node. For each one whose schedule
interval has elapsed, it dispatches the recipe to the worker queue.
The worker runs the playbook (Ansible) over SSH against the target
servers and writes the result to the `check_results` table.
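The per-policy dispatch decision reduces to an interval check each sweep. A sketch with assumed field names (`check_interval_minutes` and `last_dispatched` are illustrative, not the real schema):

```python
from datetime import datetime, timedelta

def due_policies(policies: list[dict], now: datetime) -> list[dict]:
    """Select policies whose per-policy check interval has elapsed
    since the last dispatch; these get pushed to the worker queue."""
    due = []
    for p in policies:
        interval = timedelta(minutes=p["check_interval_minutes"])
        if now - p["last_dispatched"] >= interval:
            due.append(p)
    return due

now = datetime(2024, 1, 1, 12, 0)
policies = [
    {"id": 1, "check_interval_minutes": 5,
     "last_dispatched": now - timedelta(minutes=6)},   # interval elapsed
    {"id": 2, "check_interval_minutes": 30,
     "last_dispatched": now - timedelta(minutes=10)},  # not yet due
]
```

Because the sweep itself runs every 60 seconds, a policy's effective frequency is quantised to the sweep cadence.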
What this is for
Health checks that don't fit on the daemon:
- Servers without a daemon. Legacy boxes, third-party-managed hosts, or systems where you cannot install an agent.
- Multi-step or stateful checks. Authenticated HTTP scrapes, database queries, validations that need values from multiple commands.
- Synthetic probes. "Hit this URL with this payload, expect this field in the response."
Tuning
- Sweep interval. Constant `CYCLE_INTERVAL = 60` seconds in `scheduler.py`; this is the practical lower bound for a sweep.
- Per-policy frequency. Set on the policy's `recipe_check` trigger in the visual flow editor.
- Worker concurrency. Controlled by ARQ worker settings.
When to use
Whenever a check is too rich, too stateful, or too policy-driven to live on the daemon.
3 · CheckEvaluator (on the platform)
Runs in the `proactive` container alongside the scheduler.
Reads new rows from `check_results` and decides whether each check
passed or failed. The decision uses:
- The recipe's structured success criteria (exit code, regex, expected JSON shape) when present.
- An LLM evaluation when the criteria are ambiguous or the recipe asks for context-aware judgement.
A failed result becomes an incident, published to the `incidents`
Redis channel, which the swarm Manager picks up.
What this is for
Checks where the meaning of the output depends on context. "92 % disk" is not the same on a database box during nightly backup as it is on a cache box at 4 AM. The evaluator can apply that context without the operator having to encode every nuance into the recipe.
Tuning
- Evaluation model. Per-tenant config in /settings/llm.
- Default-pass / default-fail. Per recipe.
- Suppression windows. Maintenance plans suppress incident creation; the evaluator honours an active maintenance window.
When to use
In conjunction with the CheckScheduler whenever the check output needs human-style interpretation.
4 · PatrolScheduler (agent rounds)
Runs in the `swarm` container.
Every agent has a `patrol_interval` (minutes) in its configuration.
When greater than zero, the patrol scheduler periodically asks the
agent to perform a patrol — an unscheduled round of diagnostic
checks across the agent's assigned servers.
The agent uses its built-in diagnostic tools (`check_service`,
`check_systemd_unit`, `top_snapshot`, etc.) to look for anomalies
that might not have triggered any explicit alarm: a load that
suddenly dropped to zero on a normally-busy server, a service that
restarted three times in an hour, an unusually large log file. If
the agent finds something, it opens an incident on its own and the
pipeline runs.
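Whether a patrol is due is a per-agent interval check. A sketch, with `last_patrol` as an assumed bookkeeping field:

```python
from datetime import datetime, timedelta

def patrol_due(agent: dict, now: datetime) -> bool:
    """True when the scheduler should ask this agent to patrol.
    A patrol_interval of zero disables patrols entirely."""
    interval = agent["patrol_interval"]  # minutes
    if interval <= 0:
        return False
    return now - agent["last_patrol"] >= timedelta(minutes=interval)

now = datetime(2024, 1, 1, 12, 0)
busy = {"patrol_interval": 30, "last_patrol": now - timedelta(minutes=45)}
off  = {"patrol_interval": 0,  "last_patrol": now - timedelta(days=7)}
```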
What this is for
Catching the deviations that are not explicit alarm conditions. Threshold-based monitors miss everything below the threshold; an agent rounding the fleet can notice patterns no one wrote a check for.
Tuning
- Per-agent `patrol_interval` in the agent detail page. Set to zero to disable patrols for that agent. Common values: 15-60 minutes.
- Server scope. Patrols cover the servers the agent is assigned to.
- Token budget. Patrols consume the agent's monthly token allowance; the budget gauge reflects this.
When to use
For high-value servers where you want a second pair of eyes that isn't bound to a fixed alarm definition. Effective complement to, not replacement for, the daemon and the scheduled checks.
5 · IncidentWatcher (re-invocation on human input)
Runs in the `proactive` container.
Subscribes to the Redis `incidents` and `approvals` channels.
- When a human comments on an incident in `escalated` or `monitoring` state, the watcher re-invokes the agent pipeline with the comment as added context. The agent gets a second turn, informed by what the human just said.
- When a human approves or rejects a pending recipe execution, the watcher publishes the decision and the worker either runs or cancels the playbook.
This is the channel that closes the loop between the human and the agent without forcing the operator to manually re-trigger anything.
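The watcher's two reactions reduce to small state checks. A sketch of both decisions, with the function names as illustrative stand-ins:

```python
REINVOKE_STATES = {"escalated", "monitoring"}

def on_comment(incident_status: str) -> bool:
    """A human comment re-invokes the agent pipeline only when the
    incident is in escalated or monitoring state."""
    return incident_status in REINVOKE_STATES

def on_approval(approved: bool) -> str:
    """An approval decision tells the worker to run or cancel the
    pending playbook."""
    return "run" if approved else "cancel"
```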
Tuning
There is nothing to configure on the watcher itself. It reacts to incident state. The behaviour is governed by:
- The agent's `trust_level` and assigned roles.
- The recipe's risk level (drives the approval gate).
- The incident status (`escalated`, `monitoring`, `awaiting_approval`).
When to use
Always on. This is part of the platform's normal operation, not an opt-in feature.
Mechanism selection
| You have… | Use |
|---|---|
| Daemon installed and standard system metrics | Daemon |
| Server you cannot install an agent on | CheckScheduler with an Ansible recipe |
| Check whose pass/fail depends on context | CheckScheduler + CheckEvaluator |
| Existing monitoring stack (Prometheus, Grafana, Datadog) | Webhook (this is the passive entry point — see integrations) |
| High-value server you want continuously eyeballed | Patrol enabled on the assigned agent |
| Live incident where a human just added context | IncidentWatcher (automatic) |
The mechanisms are additive — most production setups run all five simultaneously. They funnel into the same incident pipeline, so the downstream handling is uniform regardless of how the incident was born.
Latency profile
| Source | Time from condition → incident in DB |
|---|---|
| Daemon | ≤ 15 s (one report cycle) |
| Webhook | sub-second (push) |
| CheckScheduler | up to (sweep interval + check frequency) |
| PatrolScheduler | up to patrol_interval minutes |
Practical detection floor in a default deployment: ~15 seconds via the daemon for threshold-based conditions; sub-second for push-based external alerts.
Suppression: maintenance windows
Active maintenance plans suppress incident creation on their target servers for the duration of the schedule. This applies across all five sources. See dashboard/maintenances.
Operational tips
- Start with the daemon and webhooks. Add CheckScheduler recipes for things the daemon cannot do.
- Enable patrols selectively. Patrolling every agent on every server burns tokens for marginal returns. Start with one agent patrolling the most critical servers at a 30-minute cadence.
- Tune thresholds in audit, not in panic. Every threshold change is logged in /audit; review false positives weekly and adjust.
- Use the CheckEvaluator's LLM judgement sparingly. It is the most expensive option. Reserve it for checks where structured pass/fail criteria genuinely cannot capture the intent.