
Distributed Monitoring Without Prometheus or Grafana

February 24, 2026 · monitoring · operations · event-stream

You want to monitor 15 servers. You need CPU, memory, disk, and uptime. You want alerts when something breaks. You want a dashboard showing what is up and what is down. How hard should this be?

In the current monitoring ecosystem, the standard answer is: install Prometheus on a central server, deploy node_exporter on every target, configure scrape intervals, set up alertmanager, install Grafana, configure data sources, build dashboards, manage retention policies, and maintain all of it. That is six components, each with its own configuration format, update cycle, and failure mode. For 15 servers.

The frustration is real. "I am tired of special snowflakey prone to breaking hand rolled solutions," wrote one sysadmin searching for alternatives. Another asked simply: "Recommendations for simple performance monitoring infrastructure -- an API to report/store measurements and a web-based dashboard to view data with history." A third pointed out that even supposedly simple tools have gaps: "Uptime Kuma cannot SSH into a server, it only checks if the SSH daemon is up."

The problem is not that Prometheus is bad. Prometheus is excellent at what it does. The problem is that the standard monitoring stack has accumulated so much complexity that setting it up correctly is itself a significant engineering project. And once it is running, scaling it introduces yet another layer: "Scaling Prometheus isn't straightforward -- federation, remote-write, Thanos, Cortex." Each scaling solution adds components, configuration, and operational burden.

What People Actually Want

Read enough monitoring threads and a clear pattern emerges. People are not asking for a better Prometheus. They are asking for something fundamentally simpler: a lightweight agent, a single collector, and alerts that work across networks without a dashboard stack to maintain.

Agent-Based Monitoring Over Pilot

Pilot Protocol's event stream (port 1002) provides a pub/sub system that works over encrypted tunnels, across NATs, with built-in node discovery. These are exactly the properties a monitoring system needs. The approach: each node runs a lightweight monitoring agent that publishes metrics to topics, and a central aggregator subscribes to all metric topics.

The result is not a monitoring product. It is a monitoring pattern that uses existing Pilot primitives. The trade-off: you get simplicity and cross-network operation at the cost of building your own storage and visualization. For many small-to-medium deployments, this trade-off is correct.

# Architecture overview:
#
# [Server 1] pilotctl publish "metrics/cpu"    ──┐
# [Server 2] pilotctl publish "metrics/cpu"    ──┤
# [Server 3] pilotctl publish "metrics/cpu"    ──┼──> [Aggregator] pilotctl subscribe "metrics/*"
# [Server 1] pilotctl publish "metrics/disk"   ──┤
# [Server 2] pilotctl publish "metrics/memory" ──┘
#
# All connections: encrypted, NAT-traversing, authenticated

The Monitoring Agent: A Shell Script

The monitoring agent on each node is a shell script. Not a Go program, not a Python application, not a compiled binary. A shell script that reads system metrics and publishes them to Pilot topics. This is intentional -- if the monitoring agent itself is complex enough to have bugs, you have defeated the purpose.

#!/bin/bash
# monitor-agent.sh -- publish system metrics to Pilot event stream
# Run this on every server you want to monitor

INTERVAL=30  # seconds between metric publications
HOSTNAME=$(hostname)

while true; do
  # CPU usage (1-minute load average, normalized to core count)
  CORES=$(nproc)
  LOAD=$(awk '{print $1}' /proc/loadavg)
  CPU_PCT=$(awk "BEGIN {printf \"%.1f\", ($LOAD / $CORES) * 100}")

  # Memory usage
  MEM_TOTAL=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
  MEM_AVAIL=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  MEM_USED=$((MEM_TOTAL - MEM_AVAIL))
  MEM_PCT=$(awk "BEGIN {printf \"%.1f\", ($MEM_USED / $MEM_TOTAL) * 100}")

  # Disk usage (root partition)
  DISK_PCT=$(df / | awk 'NR==2 {gsub(/%/,""); print $5}')

  # Uptime in seconds
  UPTIME=$(awk '{print int($1)}' /proc/uptime)

  # Publish each metric to its topic
  pilotctl publish "metrics/cpu/$HOSTNAME" \
    --data "{\"host\":\"$HOSTNAME\",\"cpu_pct\":$CPU_PCT,\"load\":$LOAD,\"cores\":$CORES,\"ts\":$(date +%s)}"

  pilotctl publish "metrics/memory/$HOSTNAME" \
    --data "{\"host\":\"$HOSTNAME\",\"mem_pct\":$MEM_PCT,\"mem_used_kb\":$MEM_USED,\"mem_total_kb\":$MEM_TOTAL,\"ts\":$(date +%s)}"

  pilotctl publish "metrics/disk/$HOSTNAME" \
    --data "{\"host\":\"$HOSTNAME\",\"disk_pct\":$DISK_PCT,\"ts\":$(date +%s)}"

  pilotctl publish "metrics/uptime/$HOSTNAME" \
    --data "{\"host\":\"$HOSTNAME\",\"uptime_sec\":$UPTIME,\"ts\":$(date +%s)}"

  sleep $INTERVAL
done

This script is about 40 lines. It reads four metrics from /proc and df, formats them as JSON, and publishes them to hierarchical topics. The Pilot daemon handles encryption, authentication, and delivery. The script does not know or care where the aggregator is, what network it is on, or whether there is a NAT in the way.

To deploy it on a new server, you install pilotctl, start the daemon, establish trust with the aggregator, and run the script. Total setup time: under two minutes.

# Deploy monitoring on a new server
$ go install github.com/TeoSlayer/pilotprotocol/cmd/pilotctl@latest
$ pilotctl init --hostname web-server-04
$ pilotctl daemon start --registry rendezvous.internal:9000 --beacon rendezvous.internal:9001
$ pilotctl set-tags monitoring server web

# Establish trust with the aggregator
$ pilotctl handshake metrics-aggregator "Monitoring agent for web-server-04"

# Start publishing metrics
$ nohup bash monitor-agent.sh &

The Aggregator: Collecting and Alerting

The aggregator is a single node that subscribes to all metric topics and processes the incoming data. Here is a Python implementation that collects metrics, checks thresholds, and publishes alerts.

#!/usr/bin/env python3
"""Metrics aggregator: subscribe to all metrics, alert on thresholds."""
import subprocess
import json
import sys
import threading
from datetime import datetime

# Alert thresholds
THRESHOLDS = {
    "cpu_pct": 90.0,
    "mem_pct": 85.0,
    "disk_pct": 90.0,
}

# Track latest metrics per host
latest = {}

def subscribe_metrics():
    """Subscribe to all metric topics using wildcard."""
    proc = subprocess.Popen(
        ["pilotctl", "subscribe", "metrics/*"],
        stdout=subprocess.PIPE, text=True
    )
    for line in proc.stdout:
        line = line.strip()
        if not line:
            continue
        try:
            # Parse topic and payload
            topic, payload = line.split("] ", 1)
            topic = topic.lstrip("[")
            data = json.loads(payload)
            host = data.get("host", "unknown")

            # Store latest metric
            if host not in latest:
                latest[host] = {}
            latest[host][topic] = data
            latest[host]["last_seen"] = datetime.now().isoformat()

            # Check thresholds
            check_alerts(host, data)

        except (json.JSONDecodeError, ValueError) as e:
            print(f"Parse error: {e}", file=sys.stderr)

def check_alerts(host, data):
    """Check metric values against thresholds and publish alerts."""
    for metric, threshold in THRESHOLDS.items():
        if metric in data and data[metric] > threshold:
            alert = {
                "host": host,
                "metric": metric,
                "value": data[metric],
                "threshold": threshold,
                "severity": "critical" if data[metric] > threshold + 5 else "warning",
                "ts": data.get("ts", 0)
            }
            # Publish alert to dedicated topic
            subprocess.run([
                "pilotctl", "publish", f"alerts/{metric}/{host}",
                "--data", json.dumps(alert)
            ])
            print(f"ALERT [{alert['severity']}] {host}: {metric}={data[metric]} (threshold: {threshold})")

def print_status():
    """Periodically print fleet status summary."""
    import time
    while True:
        time.sleep(60)
        print(f"\n--- Fleet Status ({datetime.now().isoformat()}) ---")
        for host, metrics in sorted(latest.items()):
            last = metrics.get("last_seen", "never")
            cpu = "?"
            mem = "?"
            disk = "?"
            for topic, data in metrics.items():
                if "cpu" in topic:
                    cpu = f"{data.get('cpu_pct', '?')}%"
                if "memory" in topic:
                    mem = f"{data.get('mem_pct', '?')}%"
                if "disk" in topic:
                    disk = f"{data.get('disk_pct', '?')}%"
            print(f"  {host:20s}  cpu:{cpu:>6s}  mem:{mem:>6s}  disk:{disk:>6s}  seen:{last}")
        print("---")

if __name__ == "__main__":
    # Run status printer in background thread
    threading.Thread(target=print_status, daemon=True).start()
    # Block on subscription stream
    subscribe_metrics()

The aggregator subscribes to metrics/* (wildcard) and receives every metric from every host. It maintains an in-memory map of the latest values, checks thresholds on every update, and publishes alerts to alerts/* topics. Any interested party -- a Slack bot, an email script, a pager integration -- can subscribe to alerts/* and receive notifications.

# Subscribe to all alerts (from any monitoring consumer)
$ pilotctl subscribe "alerts/*"
[alerts/cpu/web-server-02] {"host":"web-server-02","metric":"cpu_pct","value":94.2,"threshold":90.0,"severity":"warning","ts":1709136000}
[alerts/disk/db-server-01] {"host":"db-server-01","metric":"disk_pct","value":97.1,"threshold":90.0,"severity":"critical","ts":1709136030}

Real Service Checks, Not Port Scans

The monitoring agent can do more than read /proc. Because it runs on the target server with full local access, it can execute actual service checks -- something that external monitoring tools cannot do without SSH access or dedicated agents.

# Add service checks to the monitoring agent

# Check if PostgreSQL can execute a query (discard query output; capture only ok/fail)
PG_OK=$(psql -U monitor -d mydb -c "SELECT 1" >/dev/null 2>&1 && echo "ok" || echo "fail")
pilotctl publish "health/postgres/$HOSTNAME" \
  --data "{\"host\":\"$HOSTNAME\",\"service\":\"postgres\",\"status\":\"$PG_OK\",\"ts\":$(date +%s)}"

# Check if the web application returns 200
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
pilotctl publish "health/webapp/$HOSTNAME" \
  --data "{\"host\":\"$HOSTNAME\",\"service\":\"webapp\",\"http_code\":$HTTP_CODE,\"ts\":$(date +%s)}"

# Check if a process is running (pgrep -c prints 0 when nothing matches;
# default to 0 in case pgrep itself is unavailable)
WORKERS=$(pgrep -c worker-process 2>/dev/null)
WORKERS=${WORKERS:-0}
pilotctl publish "health/workers/$HOSTNAME" \
  --data "{\"host\":\"$HOSTNAME\",\"service\":\"workers\",\"count\":$WORKERS,\"ts\":$(date +%s)}"

# Check disk I/O latency (field index assumes ioping's
# "min/avg/max/mdev = A us / B us / ..." summary line; verify on your version)
IO_LATENCY=$(ioping -c 3 -q / 2>&1 | awk '/min\/avg\/max/ {print $6}')
pilotctl publish "metrics/io_latency/$HOSTNAME" \
  --data "{\"host\":\"$HOSTNAME\",\"io_latency_us\":\"$IO_LATENCY\",\"ts\":$(date +%s)}"

This is the advantage of agent-based monitoring over pull-based monitoring. The agent runs locally, so it can check anything the server can check. PostgreSQL query execution, HTTP response codes, process counts, disk I/O latency -- all of these require local access that an external scraper does not have. Uptime Kuma can tell you port 5432 is open. The monitoring agent can tell you the database is responding to queries in 3ms.
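To report actual latency rather than a bare ok/fail, the agent can time the check itself. Here is a minimal Python sketch of that idea; the `timed_check` helper and its payload shape are illustrative, not part of Pilot, and the resulting JSON could be handed to `pilotctl publish` exactly like the shell examples above:

```python
import json
import subprocess
import time

def timed_check(cmd, service, host):
    """Run a local check command, time it, and build a metrics payload."""
    start = time.monotonic()
    ok = subprocess.run(cmd, capture_output=True).returncode == 0
    latency_ms = round((time.monotonic() - start) * 1000, 1)
    return json.dumps({
        "host": host,
        "service": service,
        "status": "ok" if ok else "fail",
        "latency_ms": latency_ms,
        "ts": int(time.time()),
    })

# Example with a trivially successful command; in practice you would swap in
# something like ["psql", "-U", "monitor", "-d", "mydb", "-c", "SELECT 1"]
payload = timed_check(["true"], "demo", "web-server-04")
print(payload)
```

The same helper works for any local check, which is the point: one timing wrapper, many services.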

The Built-In Dashboard

The Polo dashboard at polo.pilotprotocol.network already provides a fleet overview. Every node registered with the rendezvous server appears with its hostname, address, tags, online status, and polo score. For basic monitoring -- "which of my servers are up?" -- the dashboard works without any additional setup.

Tag your servers by role, and the dashboard becomes a filtered view of your infrastructure:

# Tag servers by role for dashboard filtering
$ pilotctl set-tags web-server production us-east
$ pilotctl set-tags db-server production us-east postgres
$ pilotctl set-tags cache-server production us-east redis

# View all production servers
$ pilotctl peers --search "tag:production"
1:0001.0000.0001  web-server-01    [web-server, production, us-east]          online
1:0001.0000.0002  web-server-02    [web-server, production, us-east]          online
1:0001.0000.0003  db-server-01     [db-server, production, us-east, postgres] online
1:0001.0000.0004  cache-server-01  [cache-server, production, us-east, redis] OFFLINE

The "OFFLINE" status is detected automatically. The Pilot daemon sends keepalive probes every 30 seconds, with an idle timeout of 120 seconds. If a node stops responding, the registry marks it offline. No configuration needed. No separate health check system.
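The aggregator can apply the same idle-timeout rule to the metrics themselves: a host whose last metric is older than the timeout is presumed down. A small sketch, where the 120-second window mirrors the registry's timeout but the `is_stale` function is our own, operating on the `last_seen` timestamps the aggregator already records:

```python
from datetime import datetime, timedelta

IDLE_TIMEOUT = timedelta(seconds=120)  # mirrors the registry's idle timeout

def is_stale(last_seen_iso, now=None):
    """True if a host's last metric is older than the idle timeout."""
    now = now or datetime.now()
    return now - datetime.fromisoformat(last_seen_iso) > IDLE_TIMEOUT

# A host seen 30 seconds ago is fine; one silent for 5 minutes is not.
now = datetime(2026, 2, 24, 12, 0, 0)
fresh = (now - timedelta(seconds=30)).isoformat()
stale = (now - timedelta(seconds=300)).isoformat()
print(is_stale(fresh, now), is_stale(stale, now))  # → False True
```

Run this check inside the status loop and a silent host shows up as down even if its daemon-level keepalive is still healthy.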

Cross-Network Monitoring Without VPNs

This is where Pilot's architecture provides a genuine advantage over traditional monitoring stacks. Consider this scenario: you have servers in AWS us-east, a few in a Hetzner data center, and a couple under your desk at the office. With Prometheus, you need each environment to be reachable from the Prometheus server -- which means VPN tunnels, port forwarding, or running multiple Prometheus instances with federation.

With Pilot, every server runs the same daemon and connects to the same rendezvous server. NAT traversal is automatic. The monitoring agent on the office server publishes metrics to the same topics as the monitoring agent on the AWS server. The aggregator subscribes once and receives everything.

# Server in AWS (public IP)
$ pilotctl daemon start --registry rendezvous.mycompany.com:9000 --beacon rendezvous.mycompany.com:9001

# Server in Hetzner (different data center)
$ pilotctl daemon start --registry rendezvous.mycompany.com:9000 --beacon rendezvous.mycompany.com:9001

# Server under desk (behind office NAT)
$ pilotctl daemon start --registry rendezvous.mycompany.com:9000 --beacon rendezvous.mycompany.com:9001

# All three publish to the same topics.
# The aggregator (anywhere) subscribes and receives from all three.
# No VPN. No port forwarding. No firewall rules.

The office server behind NAT connects through STUN discovery and, if necessary, beacon relay. The aggregator sees metrics from all three environments in a single subscription stream. The encryption is end-to-end -- the rendezvous server and beacon never see the metric data in cleartext.

Comparison: Pilot + Scripts vs. The Alternatives

| Property | Pilot + Scripts | Prometheus + Grafana | Netdata | Uptime Kuma |
|---|---|---|---|---|
| Components to install | 1 (pilotctl + script) | 4-6 (Prometheus, exporters, Grafana, Alertmanager) | 1 (agent) | 1 (server) |
| Local service checks | Yes (agent runs locally) | Via custom exporters | Yes (plugin system) | No (external probes only) |
| Cross-network | Yes (NAT traversal built-in) | Requires VPN/federation | Requires Netdata Cloud | Requires VPN/public access |
| Encryption | Mandatory (X25519 + AES-GCM) | Optional (TLS config) | TLS (Cloud uses HTTPS) | Optional (reverse proxy) |
| Time-series storage | None (bring your own) | Built-in (TSDB) | Built-in (custom DB) | SQLite |
| Built-in alerting | Via pub/sub topics | Alertmanager (separate) | Built-in | Built-in |
| Dashboard | Polo dashboard (basic) | Grafana (full-featured) | Built-in (excellent) | Built-in (status page) |
| Query language | None | PromQL | Custom | None |
| Scaling | Add more subscribers | Federation/Thanos/Cortex | Netdata Cloud | Single instance |
| Memory per node | ~10MB (daemon) | ~50-100MB (exporter + Prometheus) | ~100-200MB | N/A (server only) |
| Setup time | ~5 minutes | ~2-4 hours | ~15 minutes | ~10 minutes |

Adding Storage: Pipe to SQLite or InfluxDB

The obvious gap in the Pilot monitoring pattern is storage. Prometheus has a built-in time-series database. Netdata has its own storage engine. Pilot has nothing -- the event stream is ephemeral. Once a metric is published and consumed, it is gone.

For historical data, you pipe the aggregator output to a storage backend. The simplest option is SQLite:

#!/usr/bin/env python3
"""Pipe Pilot metrics into SQLite for historical queries."""
import subprocess
import json
import sqlite3

DB_PATH = "/var/lib/monitoring/metrics.db"

def init_db():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            ts INTEGER,
            host TEXT,
            metric TEXT,
            value REAL,
            raw TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_ts ON metrics(ts)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_host ON metrics(host)")
    conn.commit()
    return conn

def main():
    conn = init_db()
    proc = subprocess.Popen(
        ["pilotctl", "subscribe", "metrics/*"],
        stdout=subprocess.PIPE, text=True
    )
    batch = []
    try:
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue
            try:
                topic, payload = line.split("] ", 1)
                data = json.loads(payload)
                host = data.get("host", "unknown")
                ts = data.get("ts", 0)

                # Extract numeric metrics
                for key, val in data.items():
                    if isinstance(val, (int, float)) and key not in ("ts",):
                        batch.append((ts, host, key, val, payload))

                # Write in batches of 100 to limit commit overhead
                if len(batch) >= 100:
                    conn.executemany(
                        "INSERT INTO metrics VALUES (?,?,?,?,?)", batch
                    )
                    conn.commit()
                    batch = []
            except (json.JSONDecodeError, ValueError):
                continue
    finally:
        # Flush any partial batch when the subscription stream ends
        if batch:
            conn.executemany("INSERT INTO metrics VALUES (?,?,?,?,?)", batch)
            conn.commit()

if __name__ == "__main__":
    main()

For larger deployments, replace SQLite with InfluxDB, TimescaleDB, or any time-series database. The integration point is the same: subscribe to metrics/*, parse the JSON, insert into the database. The Pilot layer handles collection, encryption, and delivery. The storage layer handles retention, indexing, and queries.
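The shape of that integration can be sketched for InfluxDB's line protocol. This formatter is illustrative, not an InfluxDB client: it just turns one metric sample into the `measurement,tag=value field=value timestamp` text format (strict InfluxDB usage would also add an `i` suffix for integer fields, omitted here for brevity):

```python
def to_line_protocol(measurement, host, fields, ts_sec):
    """Format one metric sample as an InfluxDB line-protocol record.

    Line protocol: measurement,tag=value field=value timestamp(ns)
    """
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    # InfluxDB expects nanosecond timestamps; the agent publishes seconds
    return f"{measurement},host={host} {field_str} {ts_sec * 1_000_000_000}"

line = to_line_protocol("cpu", "web-server-01", {"cpu_pct": 42.5, "cores": 8}, 1709136000)
print(line)
# → cpu,host=web-server-01 cpu_pct=42.5,cores=8 1709136000000000000
```

From there the storage script is identical to the SQLite version, except the batch is POSTed to the InfluxDB write endpoint instead of inserted with executemany.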

This is the trade-off: Prometheus gives you collection, storage, querying, and alerting in one system. Pilot gives you collection, encryption, and cross-network delivery, and you bring your own storage. If you need PromQL, histograms, and recording rules, use Prometheus. If you need encrypted monitoring across NATs with minimal setup, this pattern works.

Alerting Patterns

Because alerts are published to Pilot topics, any consumer can subscribe and route them. Here are three common patterns:

# Pattern 1: Slack webhook on critical alerts
pilotctl subscribe "alerts/*" | while read -r line; do
  SEVERITY=$(echo "$line" | jq -r '.severity // empty' 2>/dev/null)
  if [ "$SEVERITY" = "critical" ]; then
    curl -s -X POST "$SLACK_WEBHOOK" \
      -H 'Content-Type: application/json' \
      -d "{\"text\": \"CRITICAL: $line\"}"
  fi
done

# Pattern 2: Email on any alert
pilotctl subscribe "alerts/*" | while read -r line; do
  echo "$line" | mail -s "Monitoring Alert" [email protected]
done

# Pattern 3: PagerDuty integration for critical only
pilotctl subscribe "alerts/*" | while read -r line; do
  SEVERITY=$(echo "$line" | jq -r '.severity // empty' 2>/dev/null)
  if [ "$SEVERITY" = "critical" ]; then
    curl -s -X POST "https://events.pagerduty.com/v2/enqueue" \
      -H 'Content-Type: application/json' \
      -d "{\"routing_key\": \"$PD_KEY\", \"event_action\": \"trigger\", \"payload\": {\"summary\": \"$line\", \"severity\": \"critical\", \"source\": \"pilot-monitoring\"}}"
  fi
done

Each pattern is a shell one-liner piped from a Pilot subscription. There is no alerting rule language to learn, no routing tree to configure, no notification template system. The alert is a JSON message on a topic. What you do with it is up to you.
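One refinement worth adding to any of these consumers is a cooldown, so a host pinned at 95% CPU does not page you every 30 seconds. A Python sketch of the idea; the 10-minute window is an arbitrary choice, not a Pilot feature:

```python
import time

COOLDOWN_SEC = 600  # suppress repeats of the same (host, metric) for 10 minutes
_last_fired = {}

def should_notify(host, metric, now=None):
    """True if no alert for this (host, metric) fired within the cooldown."""
    now = time.time() if now is None else now
    key = (host, metric)
    last = _last_fired.get(key)
    if last is None or now - last >= COOLDOWN_SEC:
        _last_fired[key] = now
        return True
    return False

print(should_notify("web-02", "cpu_pct", now=1000))  # first alert: fires
print(should_notify("web-02", "cpu_pct", now=1030))  # repeat inside cooldown: suppressed
print(should_notify("web-02", "cpu_pct", now=1700))  # cooldown elapsed: fires again
```

Wrap any of the three shell consumers above with a check like this and a flapping metric produces one notification per window instead of a flood.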

Honest Limitations

This approach is not a replacement for Prometheus. It is an alternative for a specific set of requirements. What it cannot do: store history (the event stream is ephemeral until you attach your own database), answer ad-hoc queries (there is no PromQL equivalent), render rich dashboards (the Polo view is a fleet overview, not Grafana), or manage alert routing and silencing (you write those consumers yourself).

The right question is not "is this better than Prometheus?" It is "do I need everything Prometheus provides?" If you have 15 servers, need basic metrics and alerts, and want it working in 5 minutes instead of 4 hours, this pattern delivers. If you have 1,500 servers, need PromQL, and have a dedicated observability team, use Prometheus.

For a deeper look at the event stream that powers this pattern, see Replace Your Agent Message Broker with 12 Lines of Go. For the NAT traversal that makes cross-network monitoring work, see NAT Traversal: A Deep Dive. For the operational story behind scaling the Pilot network itself, see How We Run 10,000 Agents on 3 VMs.

Try Pilot Protocol

Encrypted pub/sub monitoring across any network. One binary per node, shell scripts for metrics, built-in NAT traversal. No Prometheus, no Grafana, no VPN.

View on GitHub