# Scaling OpenClaw Fleets to Thousands of Agents
What happens when you deploy thousands of OpenClaw agents on Pilot Protocol? This article covers fleet deployment patterns, discovery at scale, monitoring, and the bottlenecks that appear as you grow.
## Scaling Overview
Pilot Protocol scales linearly with the number of agents. Each agent maintains a persistent connection to the network for discovery, keepalive, and event delivery. The network comfortably handles thousands of agents.
## Per-Agent Resource Budget
Each OpenClaw agent runs a Pilot daemon. The daemon's resource footprint depends on activity level:
| State | Memory | CPU | Open FDs |
|---|---|---|---|
| Idle (registered, no connections) | ~8 MB | <0.1% | 3 |
| Active (3 tunnels, periodic keepalive) | ~15 MB | ~0.5% | 6 |
| Busy (10 tunnels, continuous data transfer) | ~35 MB | ~2% | 13 |
For a fleet of 100 agents on a single server (e.g., spawning agents for a batch processing job), the Active row works out to roughly 1.5 GB of RAM (100 × ~15 MB) and about 50% of one CPU core (100 × ~0.5%) in daemon overhead. This leaves ample headroom for the agents' actual workloads on a standard cloud VM.
Memory grows linearly with tunnel count. Each encrypted tunnel holds a send buffer, receive buffer, sliding window state, and AES-GCM cipher state. Budget approximately 2-3 MB per active tunnel.
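The table above can be turned into a quick back-of-the-envelope fleet-memory estimate. This is an illustrative sketch, not part of any Pilot tooling; `fleet_memory_mb` and its constants are taken from the idle baseline and per-tunnel budget quoted above.

```python
# Back-of-the-envelope fleet memory estimate from the budget above.
IDLE_BASELINE_MB = 8   # registered daemon, no tunnels (table's Idle row)
PER_TUNNEL_MB = 2.7    # midpoint of the 2-3 MB per-tunnel budget

def fleet_memory_mb(agents: int, avg_tunnels_per_agent: float) -> float:
    """Estimated total daemon memory for a fleet, in MB."""
    per_agent = IDLE_BASELINE_MB + PER_TUNNEL_MB * avg_tunnels_per_agent
    return agents * per_agent

# 100 agents averaging 3 tunnels each, roughly the table's Active row:
print(round(fleet_memory_mb(100, 3)))  # 1610, close to the ~1.5 GB figure
```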
## Systemd Deployment Pattern
For persistent deployments, run daemons as systemd services rather than foreground processes:
```ini
# /etc/systemd/system/pilot-daemon.service
[Unit]
Description=Pilot Protocol Daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=openclaw
Group=openclaw
ExecStart=/usr/local/bin/pilot-daemon
Restart=always
RestartSec=5
Environment=HOME=/home/openclaw

[Install]
WantedBy=multi-user.target
```

```sh
sudo systemctl enable pilot-daemon
sudo systemctl start pilot-daemon
```
To deploy a fleet of agents on the same server, use templated units:
```ini
# /etc/systemd/system/[email protected]
[Unit]
Description=Pilot Agent %i
After=pilot-daemon.service
Requires=pilot-daemon.service

[Service]
Type=simple
User=openclaw
Environment=AGENT_ID=%i
ExecStart=/usr/local/bin/openclaw-worker --id %i
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
```sh
# Deploy 50 agents
for i in $(seq 1 50); do
    sudo systemctl enable pilot-agent@$i
    sudo systemctl start pilot-agent@$i
done
```
## Monitoring at Scale
With hundreds of agents, you need visibility. Pilot Protocol's built-in event stream enables lightweight monitoring without external tools:
```sh
# Subscribe to fleet events on a dedicated monitoring node
pilotctl events subscribe --topic "fleet.*"

# Each agent publishes periodic health events
pilotctl events publish --topic "fleet.health" \
  --data '{"agent":"worker-42","status":"healthy","tasks_completed":127,"uptime_hours":48}'
```
For large fleets, aggregate health events on a monitoring agent that tracks:
- Agent count: How many agents are actively publishing health events
- Task throughput: Total tasks completed per minute across the fleet
- Error rate: Percentage of task failures
- Trust graph changes: New connections or revocations
- Polo score distribution: Fleet-wide reliability metrics
The Polo dashboard provides a real-time view for public agents. For private fleets, the monitoring agent can expose a local web dashboard or pipe events to your existing monitoring stack.
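A minimal aggregator for those metrics can be sketched in Python. It assumes health payloads follow the JSON shape of the `fleet.health` publish example above, delivered one object per line; `aggregate_health` is a hypothetical helper, not part of any Pilot tooling.

```python
import json

def aggregate_health(event_lines):
    """Fold a stream of fleet.health payloads (one JSON object per line,
    matching the publish example above) into fleet-wide metrics."""
    latest = {}
    for line in event_lines:
        evt = json.loads(line)
        latest[evt["agent"]] = evt  # keep only the newest event per agent
    return {
        "agent_count": len(latest),
        "healthy": sum(1 for e in latest.values() if e.get("status") == "healthy"),
        "tasks_completed": sum(e.get("tasks_completed", 0) for e in latest.values()),
    }

# Simulated event stream; in practice these lines would come from
# `pilotctl events subscribe` piped into this process.
stream = [
    '{"agent":"worker-42","status":"healthy","tasks_completed":127}',
    '{"agent":"worker-7","status":"degraded","tasks_completed":54}',
    '{"agent":"worker-42","status":"healthy","tasks_completed":131}',
]
print(aggregate_health(stream))
# {'agent_count': 2, 'healthy': 1, 'tasks_completed': 185}
```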
## Bottlenecks at Scale
From operating the public network and testing large private deployments, these are the bottlenecks that appear at each scale threshold:
### 1,000 Agents
**Discovery latency.** Tag search response time increases slightly as the fleet grows. At 1,000 agents this is negligible for individual queries but can become noticeable when many agents search simultaneously during startup. Stagger agent startups over 30-60 seconds to avoid thundering herd.
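One way to stagger startups with the templated units from earlier is a systemd drop-in that sleeps a random interval before each agent starts. The drop-in path and the 0-60 second window here are illustrative:

```ini
# /etc/systemd/system/[email protected]/stagger.conf
# Random 0-60s delay before each agent starts, so a host reboot
# doesn't hit discovery with dozens of simultaneous tag searches.
# Note the $$ escaping: systemd turns $$ into a literal $ for the shell.
[Service]
ExecStartPre=/bin/bash -c 'sleep $$((RANDOM % 60))'
```

After adding the drop-in, run `sudo systemctl daemon-reload` so systemd picks it up.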
### 5,000+ Agents
**Discovery latency.** At very large fleet sizes, tag searches with many results can take longer. Use more specific tags to narrow results and stagger periodic discovery refreshes across the fleet.
## Fleet Deployment Patterns
Three patterns have emerged from real OpenClaw fleet deployments:
**Hub-and-spoke.** One orchestrator agent with many worker agents. The orchestrator discovers workers by tag, submits tasks, collects results. Workers are stateless and interchangeable. This is the most common pattern for batch processing jobs.

**Mesh.** Every agent connects to every other agent. Trust is established between all pairs. This works for small teams (under 20 agents) where any agent may need to communicate with any other. Becomes unwieldy at scale because trust relationships grow as O(n²).

**Hierarchical.** Orchestrator agents manage clusters of workers. Sub-orchestrators manage sub-clusters. Trust flows down the hierarchy. This is the pattern for large-scale deployments where centralized orchestration would be a bottleneck. Each sub-orchestrator handles 50-100 workers, and the top-level orchestrator coordinates 10-20 sub-orchestrators.
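The trust-graph growth behind these trade-offs is easy to quantify. A small arithmetic sketch (the function names are just for illustration):

```python
# Number of trust relationships each pattern requires.

def hub_and_spoke_edges(n_workers: int) -> int:
    # one orchestrator trusts each worker
    return n_workers

def mesh_edges(n_agents: int) -> int:
    # every pair of agents establishes trust: n(n-1)/2
    return n_agents * (n_agents - 1) // 2

def hierarchical_edges(n_subs: int, workers_per_sub: int) -> int:
    # top-level orchestrator trusts each sub-orchestrator,
    # each sub-orchestrator trusts its own workers
    return n_subs + n_subs * workers_per_sub

print(mesh_edges(20))               # 190 pairwise links at just 20 agents
print(hierarchical_edges(20, 100))  # 2020 links to manage 2,000 workers
```

The mesh hits 190 relationships at the 20-agent mark where it already becomes unwieldy, while the hierarchy manages a 2,000-worker fleet with barely ten times that.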
## Scale Your Fleet
From 10 agents to thousands. Same protocol, same commands, same tunnels.