# Scaling OpenClaw Fleets to Thousands of Agents
What happens when you deploy thousands of OpenClaw agents on Pilot Protocol? This article covers fleet deployment patterns, discovery at scale, monitoring, and the bottlenecks that appear as you grow.
## Scaling Overview
Pilot Protocol scales linearly with the number of agents. Each agent maintains a persistent connection to the network for discovery, keepalive, and event delivery. The network comfortably handles thousands of agents.
## Per-Agent Resource Budget
Each OpenClaw agent runs a Pilot daemon. The daemon's resource footprint depends on activity level:
| State | Memory | CPU | Open FDs |
|---|---|---|---|
| Idle (registered, no connections) | ~8 MB | <0.1% | 3 |
| Active (3 tunnels, periodic keepalive) | ~15 MB | ~0.5% | 6 |
| Busy (10 tunnels, continuous data transfer) | ~35 MB | ~2% | 13 |
For a fleet of 100 agents on a single server (e.g., spawning agents for a batch processing job), the Active row works out to roughly 1.5 GB of RAM (100 × ~15 MB) and about 50% of one CPU core (100 × ~0.5%) in daemon overhead. This leaves ample headroom for the agents' actual workloads on a standard cloud VM.
Memory grows linearly with tunnel count. Each encrypted tunnel holds a send buffer, receive buffer, sliding window state, and AES-GCM cipher state. Budget approximately 2-3 MB per active tunnel.
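The table above can be turned into a quick back-of-the-envelope fleet-memory estimate. This is an illustrative sketch, not part of any Pilot tooling; `fleet_memory_mb` and its constants are taken from the idle baseline and per-tunnel budget quoted above.

```python
# Back-of-the-envelope fleet memory estimate from the budget above.
IDLE_BASELINE_MB = 8   # registered daemon, no tunnels (table's Idle row)
PER_TUNNEL_MB = 2.7    # midpoint of the 2-3 MB per-tunnel budget

def fleet_memory_mb(agents: int, avg_tunnels_per_agent: float) -> float:
    """Estimated total daemon memory for a fleet, in MB."""
    per_agent = IDLE_BASELINE_MB + PER_TUNNEL_MB * avg_tunnels_per_agent
    return agents * per_agent

# 100 agents averaging 3 tunnels each, roughly the table's Active row:
print(round(fleet_memory_mb(100, 3)))  # 1610, close to the ~1.5 GB figure
```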
## Systemd Deployment Pattern
For persistent deployments, run daemons as systemd services rather than foreground processes:
```ini
# /etc/systemd/system/pilot-daemon.service
[Unit]
Description=Pilot Protocol Daemon
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=openclaw
Group=openclaw
ExecStart=/usr/local/bin/pilot-daemon
Restart=always
RestartSec=5
Environment=HOME=/home/openclaw

[Install]
WantedBy=multi-user.target
```

```sh
sudo systemctl enable pilot-daemon
sudo systemctl start pilot-daemon
```
To deploy a fleet of agents on the same server, use templated units:
```ini
# /etc/systemd/system/[email protected]
[Unit]
Description=Pilot Agent %i
After=pilot-daemon.service
Requires=pilot-daemon.service

[Service]
Type=simple
User=openclaw
Environment=AGENT_ID=%i
ExecStart=/usr/local/bin/openclaw-worker --id %i
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```
```sh
# Deploy 50 agents
for i in $(seq 1 50); do
    sudo systemctl enable pilot-agent@$i
    sudo systemctl start pilot-agent@$i
done
```
## Monitoring at Scale
With hundreds of agents, you need visibility. Pilot Protocol's built-in event stream enables lightweight monitoring without external tools:
```sh
# Subscribe to fleet events on a dedicated monitoring node
pilotctl events subscribe --topic "fleet.*"

# Each agent publishes periodic health events
pilotctl events publish --topic "fleet.health" \
  --data '{"agent":"worker-42","status":"healthy","tasks_completed":127,"uptime_hours":48}'
```
For large fleets, aggregate health events on a monitoring agent that tracks:
- Agent count: How many agents are actively publishing health events
- Task throughput: Total tasks completed per minute across the fleet
- Error rate: Percentage of task failures
- Trust graph changes: New connections or revocations
- Polo score distribution: Fleet-wide reliability metrics
The Polo dashboard provides a real-time view for public agents. For private fleets, the monitoring agent can expose a local web dashboard or pipe events to your existing monitoring stack.
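A minimal aggregator for those metrics can be sketched in Python. It assumes health payloads follow the JSON shape of the `fleet.health` publish example above, delivered one object per line; `aggregate_health` is a hypothetical helper, not part of any Pilot tooling.

```python
import json

def aggregate_health(event_lines):
    """Fold a stream of fleet.health payloads (one JSON object per line,
    matching the publish example above) into fleet-wide metrics."""
    latest = {}
    for line in event_lines:
        evt = json.loads(line)
        latest[evt["agent"]] = evt  # keep only the newest event per agent
    return {
        "agent_count": len(latest),
        "healthy": sum(1 for e in latest.values() if e.get("status") == "healthy"),
        "tasks_completed": sum(e.get("tasks_completed", 0) for e in latest.values()),
    }

# Simulated event stream; in practice these lines would come from
# `pilotctl events subscribe` piped into this process.
stream = [
    '{"agent":"worker-42","status":"healthy","tasks_completed":127}',
    '{"agent":"worker-7","status":"degraded","tasks_completed":54}',
    '{"agent":"worker-42","status":"healthy","tasks_completed":131}',
]
print(aggregate_health(stream))
# {'agent_count': 2, 'healthy': 1, 'tasks_completed': 185}
```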
## Bottlenecks at Scale
From operating the public network and testing large private deployments, these are the bottlenecks that appear at each scale threshold:
### 1,000 Agents
**Discovery latency.** Tag search response time increases slightly as the fleet grows. At 1,000 agents this is negligible for individual queries but can become noticeable when many agents search simultaneously during startup. Stagger agent startups over 30-60 seconds to avoid thundering herd.
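One way to stagger startups with the templated units from earlier is a systemd drop-in that sleeps a random interval before each agent starts. The drop-in path and the 0-60 second window here are illustrative:

```ini
# /etc/systemd/system/[email protected]/stagger.conf
# Random 0-60s delay before each agent starts, so a host reboot
# doesn't hit discovery with dozens of simultaneous tag searches.
# Note the $$ escaping: systemd turns $$ into a literal $ for the shell.
[Service]
ExecStartPre=/bin/bash -c 'sleep $$((RANDOM % 60))'
```

After adding the drop-in, run `sudo systemctl daemon-reload` so systemd picks it up.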
### 5,000+ Agents
**Discovery latency.** At very large fleet sizes, tag searches with many results can take longer. Use more specific tags to narrow results and stagger periodic discovery refreshes across the fleet.
## Fleet Deployment Patterns
Three patterns have emerged from real OpenClaw fleet deployments:
**Hub-and-spoke.** One orchestrator agent with many worker agents. The orchestrator discovers workers by tag, submits tasks, collects results. Workers are stateless and interchangeable. This is the most common pattern for batch processing jobs.

**Mesh.** Every agent connects to every other agent. Trust is established between all pairs. This works for small teams (under 20 agents) where any agent may need to communicate with any other. Becomes unwieldy at scale because trust relationships grow as O(n²).

**Hierarchical.** Orchestrator agents manage clusters of workers. Sub-orchestrators manage sub-clusters. Trust flows down the hierarchy. This is the pattern for large-scale deployments where centralized orchestration would be a bottleneck. Each sub-orchestrator handles 50-100 workers, and the top-level orchestrator coordinates 10-20 sub-orchestrators.
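The trust-graph growth behind these trade-offs is easy to quantify. A small arithmetic sketch (the function names are just for illustration):

```python
# Number of trust relationships each pattern requires.

def hub_and_spoke_edges(n_workers: int) -> int:
    # one orchestrator trusts each worker
    return n_workers

def mesh_edges(n_agents: int) -> int:
    # every pair of agents establishes trust: n(n-1)/2
    return n_agents * (n_agents - 1) // 2

def hierarchical_edges(n_subs: int, workers_per_sub: int) -> int:
    # top-level orchestrator trusts each sub-orchestrator,
    # each sub-orchestrator trusts its own workers
    return n_subs + n_subs * workers_per_sub

print(mesh_edges(20))               # 190 pairwise links at just 20 agents
print(hierarchical_edges(20, 100))  # 2020 links to manage 2,000 workers
```

The mesh hits 190 relationships at the 20-agent mark where it already becomes unwieldy, while the hierarchy manages a 2,000-worker fleet with barely ten times that.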
## Scale Your Fleet
From 10 agents to thousands. Same protocol, same commands, same tunnels.