When we tell people we run over 9,500 Pilot Protocol agents across three GCP virtual machines, the first reaction is usually disbelief. The second is "why?" The answer: because the best way to prove an agent networking stack works at scale is to actually run it at scale. No simulations. No theoretical models. Real daemons, real UDP tunnels, real registry lookups.
This post is the full operational deep dive. What hardware we use, how we budget memory, every kernel limit we hit, the spawning scripts that work (and the ones that definitely did not), and the monitoring that keeps the whole fleet visible.
Our fleet runs on three GCP VMs in the vulture-vision-cloud project. Each machine has a different role based on its resource profile:
| VM | Machine Type | vCPUs | RAM | Target Agents |
|---|---|---|---|---|
| swarm-1 | e2-standard-16 | 16 | 64 GB | ~3,000 |
| swarm-2 | n2d-highcpu-128 | 128 | 125 GB | ~5,000 |
| swarm-3 | e2-standard-8 | 8 | 32 GB | ~1,500 |
swarm-2 is the workhorse. With 128 vCPUs and 125 GB of RAM, it can theoretically handle over 12,000 daemons. In practice, we keep it around 5,000 to maintain headroom for bursts, registry traffic, and occasional memory spikes from active connections.
swarm-1 is our medium-tier machine. 64 GB gives us comfortable room for 3,000 daemons with monitoring overhead, log buffers, and the occasional SSH session that requires its own chunk of memory.
swarm-3 is the smallest, running about 1,500 agents. We intentionally keep one smaller machine in the fleet to test that our operational tooling works across heterogeneous hardware, not just big boxes.
All three connect to a single rendezvous server running on pilot-rendezvous in us-central1-a. The rendezvous handles registry lookups, beacon coordination for NAT traversal, and the Polo dashboard at port 3000.
The most important number in this entire operation is 10 MB. That is the resident set size (RSS) of a single Pilot daemon in idle state — registered with the rendezvous, keepalives running, no active connections.
Here is the math for swarm-2:
# swarm-2: n2d-highcpu-128, 125 GB RAM
Total RAM: 125 GB = 128,000 MB # binary units: 125 GiB expressed in MiB
OS + overhead: 3,000 MB # kernel, systemd, sshd, monitoring
Available for daemons: 125,000 MB
Per-daemon RSS (idle): 10 MB
Max theoretical daemons: 125,000 / 10 = 12,500
# We target 5,000 to leave 60% headroom
Actual daemon memory: 5,000 * 10 MB = 50,000 MB
Remaining headroom: 75,000 MB
Why so much headroom? Because "idle" is a best case. When daemons establish active connections, each connection adds roughly 2-4 MB for buffers, sliding window state, and encryption context. If 1,000 agents simultaneously open connections, memory jumps by 2-4 GB. During benchmarks or swarm coordination tests, this happens regularly.
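As a sanity check, here is the arithmetic behind that spike, using the post's numbers and assuming the worst-case 4 MB per connection:

```shell
# Back-of-envelope: a burst of ACTIVE connections on top of idle fleet RSS.
# Worst-case 4 MB per connection assumed (the range is 2-4 MB).
DAEMONS=5000; IDLE_MB=10; CONN_MB=4; ACTIVE=1000
idle=$((DAEMONS * IDLE_MB))
spike=$((ACTIVE * CONN_MB))
total=$((idle + spike))
echo "idle: ${idle} MB, burst: +${spike} MB, peak: ${total} MB"
# prints: idle: 50000 MB, burst: +4000 MB, peak: 54000 MB
```

On swarm-2, a 1,000-connection burst pushes the fleet from 50 GB to 54 GB, still well inside the 75 GB of headroom.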
We do not trust top or htop for aggregate memory. Their VIRT/VSZ columns are wildly inflated for Go binaries because the Go runtime reserves large virtual address spaces up front. Instead, we measure actual RSS:
# Sum RSS for all pilot-daemon processes (in KB)
ps -eo rss,comm | grep pilot-daemon | awk '{sum+=$1} END {print sum/1024 " MB"}'
# Per-daemon breakdown
ps -eo pid,rss,comm | grep pilot-daemon | sort -k2 -n | tail -20
# Quick count
pgrep -c pilot-daemon
On swarm-2 with 5,000 daemons idle, the total RSS hovers around 48-52 GB. The variance comes from Go's GC cycle — some daemons have just run GC and are at 8 MB, others have pending garbage and sit at 12 MB. The 10 MB average holds over time.
At this scale, even a tiny memory leak becomes visible fast. A 1 KB/hour leak per daemon adds up to nearly 10 MB/hour across 9,500 agents. We track total RSS every 5 minutes and alert if the growth rate exceeds 100 MB/hour with no corresponding increase in active connections. This has caught two genuine leaks in the event stream buffer code during development.
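The alerting logic is simple enough to sketch in a few lines of shell. The state-file path and variable names below are ours, not from our production tooling; the 5-minute sampling interval and 100 MB/hour threshold come from the numbers above:

```shell
#!/bin/bash
# Leak-alert sketch: compare total daemon RSS with the previous sample
# (taken INTERVAL seconds ago by cron/systemd timer) and extrapolate an
# hourly growth rate. STATE path is illustrative.
STATE="${STATE:-/tmp/pilot-rss.last}"
INTERVAL=300        # seconds between samples (5 minutes)
THRESHOLD=100       # alert above this growth rate, in MB/hour
now_mb=$(ps -eo rss,comm | awk '$2=="pilot-daemon" {s+=$1} END {print int(s/1024)}')
rate=0
if [ -s "$STATE" ]; then
  last_mb=$(cat "$STATE")
  rate=$(( (now_mb - last_mb) * 3600 / INTERVAL ))
fi
if [ "$rate" -gt "$THRESHOLD" ]; then
  echo "ALERT: total RSS growing at ${rate} MB/hour"
fi
echo "$now_mb" > "$STATE"
```

In production this would also compare against the active-connection count from the registry API before alerting, so connection-driven growth is not mistaken for a leak.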
The first time we tried to spawn 3,000 daemons on swarm-1, we hit a wall at 4,096 processes. Not a memory wall. Not a CPU wall. A PID limit.
Linux kernels have multiple layers of process limits, and they all conspire against you when running thousands of processes under a single user:
- **kernel.pid_max** — Global maximum PID number. Default: 32768 on most distributions. Each daemon gets a PID, so this caps your fleet at ~32K processes system-wide. Sounds generous until you realize each daemon also runs several OS threads that the scheduler tracks as tasks.
- **TasksMax (systemd cgroup)** — Per-slice limit on tasks (threads + processes). The default user-.slice in systemd caps each user at 33% of kernel.pid_max, roughly 10,000 tasks. A Go binary typically uses 4-8 OS threads to schedule its goroutines, so 3,000 daemons easily exceed this.
- **UserTasksMax (logind)** — PAM/logind limit per user. Another independent cap, often defaulting to 12288.

The error message when you hit any of these is unhelpfully generic:
fork/exec ./pilot-daemon: resource temporarily unavailable
Here is how we fix all three:
# Set immediately
sudo sysctl -w kernel.pid_max=4194304
# Persist across reboots
echo "kernel.pid_max=4194304" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
4,194,304 is the maximum value on 64-bit kernels. There is no performance penalty for setting it this high. The kernel allocates PIDs from a bitmap, and a 512 KB bitmap is negligible.
# Create override for user slices
sudo mkdir -p /etc/systemd/system/user-.slice.d/
cat <<EOF | sudo tee /etc/systemd/system/user-.slice.d/99-nolimit.conf
[Slice]
TasksMax=infinity
EOF
# Reload systemd
sudo systemctl daemon-reload
# Edit logind configuration
sudo sed -i 's/#UserTasksMax=.*/UserTasksMax=infinity/' /etc/systemd/logind.conf
sudo systemctl restart systemd-logind
Warning: Restarting systemd-logind can kill your SSH session. Do this via GCE serial console or a startup script, not over SSH. We learned this the hard way.
After fixing PID limits, the next wall is file descriptors. Each daemon uses at minimum: 1 UDP socket (tunnel), 1 Unix socket (IPC), and 1 TCP connection (registry). Active connections add more. We set:
# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576
# System-wide
sudo sysctl -w fs.file-max=2097152
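Whether those limits are actually under pressure can be read straight from the kernel. A quick check we use conceptually (Linux /proc interface; fields per proc(5)):

```shell
# /proc/sys/fs/file-nr reports three numbers: allocated file handles,
# allocated-but-unused handles, and the fs.file-max ceiling.
read -r alloc unused max < /proc/sys/fs/file-nr
echo "file handles: ${alloc} allocated of ${max} max (${unused} unused)"
```

If `alloc` creeps toward `max` as daemons spawn, raise fs.file-max before the fleet starts failing socket calls.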
GCE instances have a concept of "startup scripts" — shell scripts that run via the instance metadata service. The critical thing most people miss: startup scripts run on every boot, not just on first creation.
Our first version of the spawning script had no idempotency guard. Every time a VM rebooted (GCE live migration, host maintenance, our own restarts), the startup script would spawn another batch of daemons on top of the ones already running. We woke up to swarm-1 running 6,000 daemons instead of 3,000, consuming all available memory, with the OOM killer randomly terminating processes.
#!/bin/bash
# Guard: only run once per boot
[ -f /tmp/daemon-spawn-done ] && exit 0
# Apply kernel limits
sysctl -w kernel.pid_max=4194304
sysctl -w fs.file-max=2097152
# Wait for network
until curl -s --max-time 2 http://metadata.google.internal/; do
sleep 1
done
# Wait for rendezvous
until nc -z 35.193.106.76 9000 2>/dev/null; do
echo "Waiting for rendezvous..."
sleep 5
done
# Spawn daemons (see next section)
/opt/pilot/spawn-daemons.sh 3000
# Mark as done
touch /tmp/daemon-spawn-done
The /tmp/daemon-spawn-done file lives on tmpfs on our images, so it is cleared on reboot. This means the script runs exactly once per boot cycle — which is what we want. If the VM reboots, we want fresh daemons (the old PIDs are gone), but not duplicates.
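On images where /tmp is not tmpfs-mounted, the guard file would survive reboots and silently block respawning. A variant we would suggest (ours, not from the original tooling) keys the guard off the kernel's boot ID instead:

```shell
#!/bin/bash
# Boot-scoped idempotency guard that does not depend on /tmp being tmpfs:
# record the kernel boot ID and skip spawning when it matches the last run.
GUARD="${GUARD:-/tmp/daemon-spawn.bootid}"   # illustrative path
boot_id=$(cat /proc/sys/kernel/random/boot_id)
if [ -s "$GUARD" ] && [ "$(cat "$GUARD")" = "$boot_id" ]; then
  status="already-spawned-this-boot"
else
  printf '%s\n' "$boot_id" > "$GUARD"
  status="proceed-with-spawn"   # this is where spawn-daemons.sh would run
fi
echo "$status"
```

The boot ID changes on every boot, so the guard works the same way regardless of where the file lives.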
Even with the guard, we hit a bizarre issue on swarm-2: agents kept appearing that we did not spawn. After hours of debugging, we traced it to GCE instance metadata. A previous deployment had baked a startup script into the instance template metadata, and our new startup script was set at the project level. Both ran. GCE runs instance-level metadata scripts first, then project-level. The ghost script was spawning 500 agents before our guarded script even started.
Lesson: always check curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/startup-script -H "Metadata-Flavor: Google" to see if there is a hidden startup script on the instance itself.
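That one-off check can be folded into a small audit that inspects both metadata levels. This is our sketch; it only returns real data on a GCE VM, but it exits cleanly anywhere:

```shell
#!/bin/bash
# Audit both metadata scopes for a startup script. Off-GCE, the curl calls
# simply fail and the report shows "no startup script".
H="Metadata-Flavor: Google"
BASE="http://metadata.google.internal/computeMetadata/v1"
report=""
for scope in instance project; do
  body=$(curl -s --max-time 2 -H "$H" "$BASE/$scope/attributes/startup-script") || body=""
  if [ -n "$body" ]; then
    report="${report}${scope}: startup script present ($(printf '%s' "$body" | wc -c) bytes)\n"
  else
    report="${report}${scope}: no startup script (or not on GCE)\n"
  fi
done
printf '%b' "$report"
```

Seeing "present" at both levels is the smoking gun for the double-spawn incident described later.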
Spawning 3,000 daemons is not as simple as a for loop. Each daemon, on startup, connects to the rendezvous server to register itself. If you spawn 50 daemons per second, that is 50 simultaneous TCP connections to the registry, 50 identity generations, 50 registration handshakes. The registry is single-threaded for writes (to maintain consistency), so it becomes a bottleneck fast.
At PAR=50 (50 concurrent spawns), the registry's single-threaded write path backed up almost immediately, so we throttled spawning into small rounds with a delay between them:
#!/bin/bash
# spawn-daemons.sh - Controlled daemon spawning
COUNT=${1:-1000}
PAR=15 # concurrent spawns per round
ROUND_DELAY=3 # seconds between rounds
REGISTRY="35.193.106.76:9000"
BEACON="35.193.106.76:9001"
spawned=0
while [ $spawned -lt $COUNT ]; do
batch=$PAR
remaining=$((COUNT - spawned))
[ $remaining -lt $batch ] && batch=$remaining
for i in $(seq 1 $batch); do
port=$((10000 + spawned + i))
dir="/tmp/pilot-agent-${spawned}-${i}"
mkdir -p "$dir"
nohup ./pilot-daemon \
-registry "$REGISTRY" \
-beacon "$BEACON" \
-store "$dir/identity.json" \
-ipc "$dir/pilot.sock" \
-port "$port" \
&>/dev/null &
done
spawned=$((spawned + batch))
echo "Spawned $spawned / $COUNT"
sleep $ROUND_DELAY
done
echo "Done. Total spawned: $spawned"
The key parameters we converged on after testing:
- **PAR=10-20**: Sweet spot. Below 10 is too slow (spawning 3,000 agents takes 15+ minutes). Above 20 starts stressing the registry.
- **ROUND_DELAY=3**: Gives the registry time to process the current batch, write to the persistence file, and update the Polo dashboard before the next wave hits.

Each daemon needs a unique UDP port for its tunnel and a unique Unix socket for IPC. We use a simple scheme: base port 10000 + sequential index. The IPC socket goes into a per-daemon directory under /tmp. This avoids the classic "address already in use" error and keeps each daemon's state isolated.
On swarm-2 with 5,000 daemons, that means ports 10001-15000 and 5,000 directories under /tmp/pilot-agent-*. We verified with ss -ulnp | wc -l that all 5,000 UDP sockets bind successfully.
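The uniqueness of that scheme is easy to verify offline by simulating the spawn loop's port arithmetic with the same COUNT and PAR as swarm-2:

```shell
#!/bin/bash
# Simulate port allocation for a full swarm-2 run and look for collisions.
COUNT=5000; PAR=15
spawned=0
tmpfile=$(mktemp)
while [ $spawned -lt $COUNT ]; do
  batch=$PAR
  remaining=$((COUNT - spawned))
  [ $remaining -lt $batch ] && batch=$remaining
  i=1
  while [ $i -le $batch ]; do
    echo $((10000 + spawned + i)) >> "$tmpfile"   # same formula as spawn-daemons.sh
    i=$((i + 1))
  done
  spawned=$((spawned + batch))
done
total=$(wc -l < "$tmpfile" | tr -d ' ')
dups=$(sort "$tmpfile" | uniq -d | wc -l | tr -d ' ')
echo "allocated ${total} ports, ${dups} duplicates"
# prints: allocated 5000 ports, 0 duplicates
```

Because `spawned` only advances after each round, the index is strictly increasing and no two daemons can ever share a port.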
Running 9,500 agents without monitoring is like flying blind in a hurricane. Our primary monitoring surface is the Polo dashboard, accessible at polo.pilotprotocol.network.
For scripted monitoring, we query the rendezvous HTTP API directly:
# Get fleet stats
curl -s http://35.193.106.76:3000/api/stats | jq .
{
"total_agents": 9523,
"total_networks": 47,
"online_agents": 9401,
"connections_active": 234
}
# Check agents per network
curl -s http://35.193.106.76:3000/api/networks | jq '.[] | {name, agents}'
# Health check loop (runs every 60s)
while true; do
count=$(curl -s http://35.193.106.76:3000/api/stats | jq .online_agents)
echo "$(date): $count agents online"
[ "$count" -lt 9000 ] && echo "ALERT: Agent count dropped below 9000"
sleep 60
done
On each swarm VM, we run a simple watchdog script that checks local daemon health:
# Count running daemons
pgrep -c pilot-daemon
# Total RSS in MB
ps -eo rss,comm | grep pilot-daemon | awk '{sum+=$1} END {printf "%.0f MB\n", sum/1024}'
# Daemons with high memory (potential leak)
ps -eo pid,rss,comm | grep pilot-daemon | awk '$2 > 20480 {print $1, $2/1024 "MB"}'
Operating at this scale surfaces bugs that never appear in development. Here is every major incident and its resolution.
What happened: We needed to restart all daemons on swarm-1. Ran pkill pilot-daemon. The command killed 3,000 processes, which generated 3,000 deregistration events, which flooded the registry connection, which caused a TCP write buffer overflow, which crashed the local network stack, which killed our SSH session.
Fix: Never pkill over SSH when thousands of processes will die simultaneously. Instead, use the detached pattern:
# Graceful mass kill that survives SSH drop
nohup bash -c "pkill -TERM pilot-daemon; sleep 30; pkill -9 pilot-daemon" &>/dev/null &
The nohup wrapper ensures the kill sequence completes even if SSH disconnects. The 30-second delay between SIGTERM and SIGKILL gives daemons time to deregister gracefully before the hard kill sweeps up any stragglers.
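Even a simultaneous SIGTERM still produces one burst of deregistrations. A batched variant — our sketch, with guessed batch size and pacing rather than the post's values — spreads that load across the registry:

```shell
#!/bin/bash
# Batched shutdown sketch: SIGTERM daemons 100 at a time so deregistration
# events trickle into the registry instead of arriving as one burst.
n=0
for pid in $(pgrep pilot-daemon); do
  kill -TERM "$pid" 2>/dev/null || true
  n=$((n + 1))
  if [ $((n % 100)) -eq 0 ]; then
    sleep 2   # pause between batches of 100
  fi
done
echo "signalled $n daemons"
```

Wrapped in the same nohup pattern, this survives an SSH drop while keeping the registry's write path comfortable.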
What happened: After a planned reboot of swarm-2 for kernel updates, the VM came back with 10,000 agents instead of 5,000. Memory exhaustion started within minutes, and the OOM killer began randomly terminating daemons, which caused flapping (daemon dies, startup script respawns it, repeat).
Root cause: Two startup scripts running. One in instance metadata, one in project metadata. Both spawning the full quota.
Fix: Clear the instance-level metadata script and rely only on the project-level one with the idempotency guard. We also added a daemon count check before spawning:
# Only spawn if below threshold
current=$(pgrep -c pilot-daemon 2>/dev/null || true)  # pgrep -c prints 0 (but exits non-zero) on no match
current=${current:-0}
target=5000
if [ "$current" -ge "$target" ]; then
echo "Already running $current daemons, target is $target. Skipping."
exit 0
fi
remaining=$((target - current))
echo "Running $current, spawning $remaining more"
What happened: swarm-3 (32 GB) was running 2,000 agents normally. We triggered a fleet-wide benchmark test that caused ~800 agents to open simultaneous connections. Each connection allocated ~4 MB of buffer space. Total spike: 800 * 4 MB = 3.2 GB on top of the 20 GB baseline. The system hit swap, performance collapsed, and the OOM killer started its work.
Fix: Reduced swarm-3's target from 2,000 to 1,500 agents. Added a memory pressure check to the monitoring script that automatically pauses benchmark operations when available memory drops below 20%. We also tuned Go's GOGC environment variable to 50 (default 100) on swarm-3, trading CPU for lower memory usage.
# Set lower GC threshold for memory-constrained VMs
export GOGC=50
What happened: With 9,500 agents, the registry persistence file (/var/lib/pilot/registry.json) grew to 45 MB. During a routine snapshot write, the rendezvous VM ran a GCE live migration. The write was interrupted mid-flush, leaving a truncated JSON file. On restart, the registry could not parse its state and started empty. All 9,500 agents had to re-register.
Fix: The registry already uses atomic writes (write to temp file, then rename), but the rename was happening after an fsync on the temp file, not the directory. On ext4, the directory entry is not guaranteed durable without a directory fsync. We added the directory sync. Re-registration after a cold start now takes about 4 minutes with controlled parallelism.
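The write sequence is worth spelling out. Here it is as a shell sketch — the real fix lives in the Go registry code; the paths here are illustrative, and GNU coreutils is assumed, since its sync accepts file and directory arguments:

```shell
#!/bin/bash
# Durable atomic snapshot write, step by step:
# write temp -> fsync file -> atomic rename -> fsync directory entry.
DIR="${DIR:-$(mktemp -d)}"          # stands in for /var/lib/pilot
PAYLOAD="${PAYLOAD:-snapshot-v1}"   # stands in for the registry JSON
tmp=$(mktemp "$DIR/registry.XXXXXX")
printf '%s' "$PAYLOAD" > "$tmp"     # 1. write the full snapshot to a temp file
sync "$tmp"                         # 2. fsync the temp file's contents
mv "$tmp" "$DIR/registry.json"      # 3. atomic rename within the same filesystem
sync "$DIR"                         # 4. fsync the directory entry (the missing step)
echo "wrote $DIR/registry.json"
```

Without step 4, ext4 may persist the file data but lose the rename itself across a crash, which is exactly the truncated-state failure the registry hit.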
What happened: On swarm-2, after spawning 5,000 daemons, we could not SSH in anymore. Investigation revealed that the 5,000 UDP sockets (ports 10000-14999) plus system services had exhausted the kernel's socket buffer memory. The default net.core.rmem_max and net.core.wmem_max were too low.
Fix:
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_default=1048576
sudo sysctl -w net.core.wmem_default=1048576
sudo sysctl -w net.ipv4.udp_mem="4096 87380 16777216" # note: udp_mem is measured in pages, not bytes
After months of operating this fleet, we have distilled our operational procedures into a repeatable runbook. Here are the key procedures:
To restart the entire fleet:

1. Kill daemons with the nohup pattern, waiting 60 seconds between VMs.
2. Confirm pgrep -c pilot-daemon returns 0 on each VM.
3. Clear stale IPC sockets: rm -f /tmp/pilot-agent-*/pilot.sock.
4. Respawn and verify the agent count with curl /api/stats.

When we deploy a new daemon binary:
- Replace the binary on each VM: rm -f pilot-daemon && scp pilot-daemon vm:
- After respawning, watch dmesg for OOM-killer activity while memory settles.

Running nearly 10,000 agents on three VMs taught us things that no amount of unit testing could reveal.
We are not done scaling. The next milestone is 50,000 agents, which will require registry sharding and a federated rendezvous architecture. But at 9,500, we have proven that a single-binary agent daemon with a single-binary rendezvous server can handle a fleet that would make most distributed systems engineers uncomfortable. And that is exactly the point.
Pilot Protocol is open source. Start with two agents, scale to thousands.
View on GitHub