OpenRemedy

Proactive monitoring

Five mechanisms for continuous early detection.

OpenRemedy does not wait for your servers to fail. The platform runs five independent proactive mechanisms that watch the fleet continuously and create incidents the moment a deviation is detected — typically before it propagates into a user-visible outage. Whether the incident comes from an external alert manager, a daemon threshold, a scheduled health probe, or an agent's unsolicited observation, the same pipeline handles it from there.

This document covers what each mechanism does, what it is good at, and how to tune it.


The five incident sources

Each mechanism is described in detail below.


1 · Daemon monitors (on the managed server)

Runs in openremedy-agent on the customer server.

The daemon executes its assigned monitors locally and reports the results to the platform every 15 seconds. Monitor types:

Type       Notes
cpu        Load percent threshold
memory     RSS percent threshold
disk       Per-mount percent threshold
port       TCP probe
service    systemd unit state
url        HTTP status code
log        Regex pattern over a log file with a sliding window
process    Minimum process count for a name
docker     Container running / docker_health status
custom     Operator-defined shell, HMAC-signed by the platform

When a monitor crosses its threshold, the daemon emits an Alert in the next evidence POST. The backend creates the incident immediately.

Tuning

  • Heartbeat / report cadence — defaults to 15 seconds; overridable in /etc/openremedy-agent/config.json (heartbeat_interval_seconds, report_interval_seconds).
  • Thresholds — set per policy in the dashboard. Defaults applied to a new server come from /settings/servers.
  • Custom monitors — added by the operator in the server detail page (Settings tab → custom_monitors JSON). The platform HMAC-signs each command before serving it; the daemon refuses to exec unsigned commands.
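
For example, slowing both cadences to 30 seconds would look roughly like this in /etc/openremedy-agent/config.json. The two keys are the ones named above; the values are illustrative, and placing them at the top level of the file is an assumption:

    {
      "heartbeat_interval_seconds": 30,
      "report_interval_seconds": 30
    }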

When to use

Default for any server where the daemon can be installed. Lowest detection latency, no platform-side scheduling overhead.


2 · CheckScheduler (on the platform)

Runs in the proactive container.

Sweeps every 60 seconds. Loads every policy whose flow_definition contains a recipe_check trigger node. For each one whose schedule interval has elapsed, it dispatches the recipe to the worker queue. The worker runs the playbook (Ansible) over SSH against the target servers and writes the result to the check_results table.
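
In outline, the sweep behaves like the Python sketch below. CYCLE_INTERVAL and the general flow follow the description above; the Policy shape, the loader, and the job name are hypothetical stand-ins rather than the actual scheduler.py internals.

    import asyncio
    import time
    from dataclasses import dataclass

    from arq import create_pool
    from arq.connections import RedisSettings

    CYCLE_INTERVAL = 60  # seconds; the sweep cadence described above

    @dataclass
    class Policy:
        id: int
        frequency_seconds: int   # from the policy's recipe_check trigger
        last_dispatched: float   # monotonic timestamp of the last dispatch

    async def load_recipe_check_policies() -> list[Policy]:
        # Hypothetical loader: the real scheduler reads policies whose
        # flow_definition contains a recipe_check trigger node.
        return []

    async def sweep(pool) -> None:
        now = time.monotonic()
        for policy in await load_recipe_check_policies():
            if now - policy.last_dispatched >= policy.frequency_seconds:
                # Hand the recipe to the worker queue; the job name is assumed.
                await pool.enqueue_job("run_recipe_check", policy.id)
                policy.last_dispatched = now

    async def main() -> None:
        pool = await create_pool(RedisSettings())
        while True:
            await sweep(pool)
            await asyncio.sleep(CYCLE_INTERVAL)

    if __name__ == "__main__":
        asyncio.run(main())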

What this is for

Health checks that don't fit on the daemon:

  • Servers without a daemon. Legacy boxes, third-party-managed hosts, or systems where you cannot install an agent.
  • Multi-step or stateful checks. Authenticated HTTP scrapes, database queries, validations that need values from multiple commands.
  • Synthetic probes. "Hit this URL with this payload, expect this field in the response."
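
The recipes themselves are Ansible playbooks, but the logic of such a probe is small. A rough Python equivalent of the last example, with a made-up URL, payload, and expected field:

    import requests

    # Purely illustrative synthetic probe; nothing here is an OpenRemedy API.
    resp = requests.post("https://app.example.com/api/checkout/health",
                         json={"probe": "synthetic"}, timeout=10)
    resp.raise_for_status()
    assert resp.json().get("status") == "ok", "expected field missing from response"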

Tuning

  • Sweep interval. The constant CYCLE_INTERVAL = 60 seconds in scheduler.py; this is the practical lower bound on how often any scheduled check can run.
  • Per-policy frequency. Set on the policy's recipe_check trigger in the visual flow editor.
  • Worker concurrency. Controlled by ARQ worker settings.
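
ARQ concurrency is usually capped with max_jobs on the worker settings class. A minimal sketch, with an assumed task name and placeholder body:

    from arq.connections import RedisSettings

    async def run_recipe_check(ctx, policy_id):
        # Placeholder: the real task runs the Ansible playbook over SSH
        # against the target servers and writes the outcome to check_results.
        ...

    class WorkerSettings:
        functions = [run_recipe_check]
        redis_settings = RedisSettings()
        max_jobs = 10  # upper bound on checks executing concurrently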

When to use

Whenever a check is too rich, too stateful, or too policy-driven to live on the daemon.


3 · CheckEvaluator (on the platform)

Runs in the proactive container alongside the scheduler.

Reads new rows from check_results and decides whether each check passed or failed. The decision uses:

  1. The recipe's structured success criteria (exit code, regex, expected JSON shape) when present.
  2. An LLM evaluation when the criteria are ambiguous or the recipe asks for context-aware judgement.

A failed result becomes an incident, published to the incidents Redis channel, which the swarm Manager picks up.
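
A rough sketch of the structured-criteria path follows; the row and criteria field names are assumptions, and only the incidents channel and the check_results source come from the description above.

    import json
    import re

    import redis

    def passed(row: dict, criteria: dict) -> bool:
        # Structured success criteria from the recipe; field names are assumed.
        if "exit_code" in criteria and row.get("exit_code") != criteria["exit_code"]:
            return False
        if "regex" in criteria and not re.search(criteria["regex"], row.get("stdout", "")):
            return False
        return True  # ambiguous cases would go to the LLM evaluation instead

    r = redis.Redis()
    row = {"check_result_id": 101, "exit_code": 2, "stdout": "mount /var 92% used"}
    if not passed(row, {"exit_code": 0}):
        # A failed result becomes an incident on the incidents channel,
        # which the swarm Manager picks up.
        r.publish("incidents", json.dumps({"source": "check_evaluator",
                                           "check_result_id": row["check_result_id"]}))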

What this is for

Checks where the meaning of the output depends on context. "92 % disk" is not the same on a database box during nightly backup as it is on a cache box at 4 AM. The evaluator can apply that context without the operator having to encode every nuance into the recipe.

Tuning

  • Evaluation model. Per-tenant config in /settings/llm.
  • Default-pass / default-fail. Per recipe.
  • Suppression windows. Maintenance plans suppress incident creation; the evaluator honours an active maintenance window.

When to use

In conjunction with the CheckScheduler whenever the check output needs human-style interpretation.


4 · PatrolScheduler (agent rounds)

Runs in the swarm container.

Every agent has a patrol_interval (minutes) in its configuration. When the interval is greater than zero, the patrol scheduler periodically asks the agent to perform a patrol — an unscheduled round of diagnostic checks across the agent's assigned servers.

The agent uses its built-in diagnostic tools (check_service, check_systemd_unit, top_snapshot, etc.) to look for anomalies that might not have triggered any explicit alarm: a load that suddenly dropped to zero on a normally-busy server, a service that restarted three times in an hour, an unusually large log file. If the agent finds something, it opens an incident on its own and the pipeline runs.
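
On the scheduling side, the gate is simply a per-agent interval check. A minimal sketch; last_patrol_at and the dict shape are assumptions for illustration:

    import time

    def due_for_patrol(agent: dict, now: float | None = None) -> bool:
        # patrol_interval is in minutes, as described above; zero disables patrols.
        now = now if now is not None else time.time()
        if agent["patrol_interval"] <= 0:
            return False
        return now - agent["last_patrol_at"] >= agent["patrol_interval"] * 60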

What this is for

Catching the deviations that are not explicit alarm conditions. Threshold-based monitors miss everything below the threshold; an agent rounding the fleet can notice patterns no one wrote a check for.

Tuning

  • Per-agent patrol_interval in the agent detail page. Set to zero to disable patrols for that agent. Common values: 15-60 minutes.
  • Server scope. Patrols cover the servers the agent is assigned to.
  • Token budget. Patrols consume the agent's monthly token allowance; the budget gauge reflects this.

When to use

For high-value servers where you want a second pair of eyes that isn't bound to a fixed alarm definition. Effective complement to, not replacement for, the daemon and the scheduled checks.


5 · IncidentWatcher (re-invocation on human input)

Runs in the proactive container.

Subscribes to the Redis incidents and approvals channels.

  • When a human comments on an incident in escalated or monitoring state, the watcher re-invokes the agent pipeline with the comment as added context. The agent gets a second turn, informed by what the human just said.
  • When a human approves or rejects a pending recipe execution, the watcher publishes the decision and the worker either runs or cancels the playbook.

This is the channel that closes the loop between the human and the agent without forcing the operator to manually re-trigger anything.
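
The watcher's subscription loop can be pictured roughly as follows. The channel names come from the text; the message fields and the two handlers are hypothetical stand-ins.

    import json

    import redis

    def reinvoke_pipeline(incident_id, comment):
        # Hypothetical stand-in: give the agent a second turn with the
        # human comment as added context.
        print(f"re-invoking pipeline for incident {incident_id}: {comment}")

    def forward_decision(execution_id, approved):
        # Hypothetical stand-in: let the worker run or cancel the pending playbook.
        print(f"execution {execution_id} approved={approved}")

    r = redis.Redis()
    pubsub = r.pubsub()
    pubsub.subscribe("incidents", "approvals")

    for message in pubsub.listen():
        if message["type"] != "message":
            continue
        channel = message["channel"].decode()
        event = json.loads(message["data"])
        if channel == "incidents" and event.get("kind") == "human_comment":
            reinvoke_pipeline(event["incident_id"], event["comment"])
        elif channel == "approvals":
            forward_decision(event["execution_id"], event["approved"])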

Tuning

There is nothing to configure on the watcher itself. It reacts to incident state. The behaviour is governed by:

  • The agent's trust_level and assigned roles.
  • The recipe's risk level (drives the approval gate).
  • The incident status (escalated, monitoring, awaiting_approval).

When to use

Always on. This is part of the platform's normal operation, not an opt-in feature.


Mechanism selection

You have…                                                 Use
Daemon installed and standard system metrics              Daemon
Server you cannot install an agent on                     CheckScheduler with an Ansible recipe
Check whose pass/fail depends on context                  CheckScheduler + CheckEvaluator
Existing monitoring stack (Prometheus, Grafana, Datadog)  Webhook (this is the passive entry point — see integrations)
High-value server you want continuously eyeballed         Patrol enabled on the assigned agent
Live incident where a human just added context            IncidentWatcher (automatic)

The mechanisms are additive — most production setups run all five simultaneously. They funnel into the same incident pipeline, so the downstream handling is uniform regardless of how the incident was born.


Latency profile

Source             Time from condition → incident in DB
Daemon             ≤ 15 s (one report cycle)
Webhook            sub-second (push)
CheckScheduler     up to (sweep interval + check frequency)
PatrolScheduler    up to patrol_interval minutes

Practical detection floor in a default deployment: ~15 seconds via the daemon for threshold-based conditions; sub-second for push-based external alerts.


Suppression: maintenance windows

Active maintenance plans suppress incident creation on their target servers for the duration of the schedule. This applies across all five sources. See dashboard/maintenances.


Operational tips

  • Start with the daemon and webhooks. Add CheckScheduler recipes for things the daemon cannot do.
  • Enable patrols selectively. Patrolling every agent on every server burns tokens for marginal returns. Start with one agent patrolling the most critical servers at a 30-minute cadence.
  • Tune thresholds in audit, not in panic. Every threshold change is logged in /audit; review false positives weekly and adjust.
  • Use the CheckEvaluator's LLM judgement sparingly. It is the most expensive option. Reserve for checks where structured pass/fail criteria genuinely cannot capture the intent.