When we tell people we run over 9,500 Pilot Protocol agents across three GCP virtual machines, the first reaction is usually disbelief. The second is "why?" The answer: because the best way to prove an agent networking stack works at scale is to actually run it at scale. No simulations. No theoretical models. Real daemons, real UDP tunnels, real registry lookups.
This post is the full operational deep dive. What hardware we use, how we budget memory, every kernel limit we hit, the spawning scripts that work (and the ones that definitely did not), and the monitoring that keeps the whole fleet visible.
Our fleet runs on three GCP VMs in the vulture-vision-cloud project. Each machine has a different role based on its resource profile:
| VM | Machine Type | vCPUs | RAM | Target Agents |
|---|---|---|---|---|
| swarm-1 | e2-standard-16 | 16 | 64 GB | ~3,000 |
| swarm-2 | n2d-highcpu-128 | 128 | 125 GB | ~5,000 |
| swarm-3 | e2-standard-8 | 8 | 32 GB | ~1,500 |
swarm-2 is the workhorse. With 128 vCPUs and 125 GB of RAM, it can theoretically handle over 12,000 daemons. In practice, we keep it around 5,000 to maintain headroom for bursts, registry traffic, and occasional memory spikes from active connections.
swarm-1 is our medium-tier machine. 64 GB gives us comfortable room for 3,000 daemons with monitoring overhead, log buffers, and the occasional SSH session that requires its own chunk of memory.
swarm-3 is the smallest, running about 1,500 agents. We intentionally keep one smaller machine in the fleet to test that our operational tooling works across heterogeneous hardware, not just big boxes.
All three connect to a single rendezvous server running on pilot-rendezvous in us-central1-a. The rendezvous handles registry lookups, beacon coordination for NAT traversal, and the Polo dashboard at port 3000.
The most important number in this entire operation is 10 MB. That is the resident set size (RSS) of a single Pilot daemon in idle state — registered with the rendezvous, keepalives running, no active connections.
Here is the math for swarm-2:
# swarm-2: n2d-highcpu-128, 125 GB RAM
Total RAM: 125 GB = 128,000 MB # binary units: 125 GiB expressed in MiB
OS + overhead: 3,000 MB # kernel, systemd, sshd, monitoring
Available for daemons: 125,000 MB
Per-daemon RSS (idle): 10 MB
Max theoretical daemons: 125,000 / 10 = 12,500
# We target 5,000 to leave 60% headroom
Actual daemon memory: 5,000 * 10 MB = 50,000 MB
Remaining headroom: 75,000 MB
Why so much headroom? Because "idle" is a best case. When daemons establish active connections, each connection adds roughly 2-4 MB for buffers, sliding window state, and encryption context. If 1,000 agents simultaneously open connections, memory jumps by 2-4 GB. During benchmarks or swarm coordination tests, this happens regularly.
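As a sanity check, here is the arithmetic behind that spike, using the post's numbers and assuming the worst-case 4 MB per connection:

```shell
# Back-of-envelope: a burst of ACTIVE connections on top of idle fleet RSS.
# Worst-case 4 MB per connection assumed (the range is 2-4 MB).
DAEMONS=5000; IDLE_MB=10; CONN_MB=4; ACTIVE=1000
idle=$((DAEMONS * IDLE_MB))
spike=$((ACTIVE * CONN_MB))
total=$((idle + spike))
echo "idle: ${idle} MB, burst: +${spike} MB, peak: ${total} MB"
# prints: idle: 50000 MB, burst: +4000 MB, peak: 54000 MB
```

On swarm-2, a 1,000-connection burst pushes the fleet from 50 GB to 54 GB, still well inside the 75 GB of headroom.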
We do not trust top or htop for aggregate memory. Their VIRT/VSZ columns are wildly inflated for Go binaries because the Go runtime reserves large virtual address spaces up front. Instead, we measure actual RSS:
# Sum RSS for all pilot-daemon processes (in KB)
ps -eo rss,comm | grep pilot-daemon | awk '{sum+=$1} END {print sum/1024 " MB"}'
# Per-daemon breakdown
ps -eo pid,rss,comm | grep pilot-daemon | sort -k2 -n | tail -20
# Quick count
pgrep -c pilot-daemon
On swarm-2 with 5,000 daemons idle, the total RSS hovers around 48-52 GB. The variance comes from Go's GC cycle — some daemons have just run GC and are at 8 MB, others have pending garbage and sit at 12 MB. The 10 MB average holds over time.
At this scale, even a tiny memory leak becomes visible fast. A 1 KB/hour leak per daemon adds up to nearly 10 MB/hour across 9,500 agents. We track total RSS every 5 minutes and alert if the growth rate exceeds 100 MB/hour with no corresponding increase in active connections. This has caught two genuine leaks in the event stream buffer code during development.
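The alerting logic is simple enough to sketch in a few lines of shell. The state-file path and variable names below are ours, not from our production tooling; the 5-minute sampling interval and 100 MB/hour threshold come from the numbers above:

```shell
#!/bin/bash
# Leak-alert sketch: compare total daemon RSS with the previous sample
# (taken INTERVAL seconds ago by cron/systemd timer) and extrapolate an
# hourly growth rate. STATE path is illustrative.
STATE="${STATE:-/tmp/pilot-rss.last}"
INTERVAL=300        # seconds between samples (5 minutes)
THRESHOLD=100       # alert above this growth rate, in MB/hour
now_mb=$(ps -eo rss,comm | awk '$2=="pilot-daemon" {s+=$1} END {print int(s/1024)}')
rate=0
if [ -s "$STATE" ]; then
  last_mb=$(cat "$STATE")
  rate=$(( (now_mb - last_mb) * 3600 / INTERVAL ))
fi
if [ "$rate" -gt "$THRESHOLD" ]; then
  echo "ALERT: total RSS growing at ${rate} MB/hour"
fi
echo "$now_mb" > "$STATE"
```

In production this would also compare against the active-connection count from the registry API before alerting, so connection-driven growth is not mistaken for a leak.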
The first time we tried to spawn 3,000 daemons on swarm-1, we hit a wall at 4,096 processes. Not a memory wall. Not a CPU wall. A PID limit.
Linux kernels have multiple layers of process limits, and they all conspire against you when running thousands of processes under a single user:
- **kernel.pid_max** — Global maximum PID number. Default: 32768 on most distributions. Each daemon gets a PID, so this caps your fleet at ~32K processes system-wide. Sounds generous until you realize each daemon also runs several OS threads that the scheduler tracks as tasks.
- **TasksMax (systemd cgroup)** — Per-slice limit on tasks (threads + processes). The default user-.slice in systemd caps each user at 33% of kernel.pid_max, roughly 10,000 tasks. A Go binary typically uses 4-8 OS threads to schedule its goroutines, so 3,000 daemons easily exceed this.
- **UserTasksMax (logind)** — PAM/logind limit per user. Another independent cap, often defaulting to 12288.

The error message when you hit any of these is unhelpfully generic:
fork/exec ./pilot-daemon: resource temporarily unavailable
Here is how we fix all three:
# Set immediately
sudo sysctl -w kernel.pid_max=4194304
# Persist across reboots
echo "kernel.pid_max=4194304" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
4,194,304 is the maximum value on 64-bit kernels. There is no performance penalty for setting it this high. The kernel allocates PIDs from a bitmap, and a 512 KB bitmap is negligible.
# Create override for user slices
sudo mkdir -p /etc/systemd/system/user-.slice.d/
cat <<EOF | sudo tee /etc/systemd/system/user-.slice.d/99-nolimit.conf
[Slice]
TasksMax=infinity
EOF
# Reload systemd
sudo systemctl daemon-reload
# Edit logind configuration
sudo sed -i 's/#UserTasksMax=.*/UserTasksMax=infinity/' /etc/systemd/logind.conf
sudo systemctl restart systemd-logind
Warning: Restarting systemd-logind can kill your SSH session. Do this via GCE serial console or a startup script, not over SSH. We learned this the hard way.
After fixing PID limits, the next wall is file descriptors. Each daemon uses at minimum: 1 UDP socket (tunnel), 1 Unix socket (IPC), and 1 TCP connection (registry). Active connections add more. We set:
# /etc/security/limits.conf
* soft nofile 1048576
* hard nofile 1048576
# System-wide
sudo sysctl -w fs.file-max=2097152
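Whether those limits are actually under pressure can be read straight from the kernel. A quick check we use conceptually (Linux /proc interface; fields per proc(5)):

```shell
# /proc/sys/fs/file-nr reports three numbers: allocated file handles,
# allocated-but-unused handles, and the fs.file-max ceiling.
read -r alloc unused max < /proc/sys/fs/file-nr
echo "file handles: ${alloc} allocated of ${max} max (${unused} unused)"
```

If `alloc` creeps toward `max` as daemons spawn, raise fs.file-max before the fleet starts failing socket calls.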
GCE instances have a concept of "startup scripts" — shell scripts that run via the instance metadata service. The critical thing most people miss: startup scripts run on every boot, not just on first creation.
Our first version of the spawning script had no idempotency guard. Every time a VM rebooted (GCE live migration, host maintenance, our own restarts), the startup script would spawn another batch of daemons on top of the ones already running. We woke up to swarm-1 running 6,000 daemons instead of 3,000, consuming all available memory, with the OOM killer randomly terminating processes.
#!/bin/bash
# Guard: only run once per boot
[ -f /tmp/daemon-spawn-done ] && exit 0
# Apply kernel limits
sysctl -w kernel.pid_max=4194304
sysctl -w fs.file-max=2097152
# Wait for network
until curl -s --max-time 2 http://metadata.google.internal/; do
sleep 1
done
# Wait for rendezvous
until nc -z 35.193.106.76 9000 2>/dev/null; do
echo "Waiting for rendezvous..."
sleep 5
done
# Spawn daemons (see next section)
/opt/pilot/spawn-daemons.sh 3000
# Mark as done
touch /tmp/daemon-spawn-done
The /tmp/daemon-spawn-done file lives on tmpfs on our images, so it is cleared on reboot. This means the script runs exactly once per boot cycle — which is what we want. If the VM reboots, we want fresh daemons (the old PIDs are gone), but not duplicates.
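On images where /tmp is not tmpfs-mounted, the guard file would survive reboots and silently block respawning. A variant we would suggest (ours, not from the original tooling) keys the guard off the kernel's boot ID instead:

```shell
#!/bin/bash
# Boot-scoped idempotency guard that does not depend on /tmp being tmpfs:
# record the kernel boot ID and skip spawning when it matches the last run.
GUARD="${GUARD:-/tmp/daemon-spawn.bootid}"   # illustrative path
boot_id=$(cat /proc/sys/kernel/random/boot_id)
if [ -s "$GUARD" ] && [ "$(cat "$GUARD")" = "$boot_id" ]; then
  status="already-spawned-this-boot"
else
  printf '%s\n' "$boot_id" > "$GUARD"
  status="proceed-with-spawn"   # this is where spawn-daemons.sh would run
fi
echo "$status"
```

The boot ID changes on every boot, so the guard works the same way regardless of where the file lives.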
Even with the guard, we hit a bizarre issue on swarm-2: agents kept appearing that we did not spawn. After hours of debugging, we traced it to GCE instance metadata. A previous deployment had baked a startup script into the instance template metadata, and our new startup script was set at the project level. Both ran. GCE runs instance-level metadata scripts first, then project-level. The ghost script was spawning 500 agents before our guarded script even started.
Lesson: always check curl http://metadata.google.internal/computeMetadata/v1/instance/attributes/startup-script -H "Metadata-Flavor: Google" to see if there is a hidden startup script on the instance itself.
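That one-off check can be folded into a small audit that inspects both metadata levels. This is our sketch; it only returns real data on a GCE VM, but it exits cleanly anywhere:

```shell
#!/bin/bash
# Audit both metadata scopes for a startup script. Off-GCE, the curl calls
# simply fail and the report shows "no startup script".
H="Metadata-Flavor: Google"
BASE="http://metadata.google.internal/computeMetadata/v1"
report=""
for scope in instance project; do
  body=$(curl -s --max-time 2 -H "$H" "$BASE/$scope/attributes/startup-script") || body=""
  if [ -n "$body" ]; then
    report="${report}${scope}: startup script present ($(printf '%s' "$body" | wc -c) bytes)\n"
  else
    report="${report}${scope}: no startup script (or not on GCE)\n"
  fi
done
printf '%b' "$report"
```

Seeing "present" at both levels is the smoking gun for the double-spawn incident described later.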
Spawning 3,000 daemons is not as simple as a for loop. Each daemon, on startup, connects to the rendezvous server to register itself. If you spawn 50 daemons per second, that is 50 simultaneous TCP connections to the registry, 50 identity generations, 50 registration handshakes. The registry is single-threaded for writes (to maintain consistency), so it becomes a bottleneck fast.
At PAR=50 (50 concurrent spawns), the registry's single-threaded write path backed up almost immediately, so we throttled spawning into small rounds with a delay between them:
#!/bin/bash
# spawn-daemons.sh - Controlled daemon spawning
COUNT=${1:-1000}
PAR=15 # concurrent spawns per round
ROUND_DELAY=3 # seconds between rounds
REGISTRY="35.193.106.76:9000"
BEACON="35.193.106.76:9001"
spawned=0
while [ $spawned -lt $COUNT ]; do
batch=$PAR
remaining=$((COUNT - spawned))
[ $remaining -lt $batch ] && batch=$remaining
for i in $(seq 1 $batch); do
port=$((10000 + spawned + i))
dir="/tmp/pilot-agent-${spawned}-${i}"
mkdir -p "$dir"
nohup ./pilot-daemon \
-registry "$REGISTRY" \
-beacon "$BEACON" \
-store "$dir/identity.json" \
-ipc "$dir/pilot.sock" \
-port "$port" \
&>/dev/null &
done
spawned=$((spawned + batch))
echo "Spawned $spawned / $COUNT"
sleep $ROUND_DELAY
done
echo "Done. Total spawned: $spawned"
The key parameters we converged on after testing:
- **PAR=10-20**: Sweet spot. Below 10 is too slow (spawning 3,000 agents takes 15+ minutes). Above 20 starts stressing the registry.
- **ROUND_DELAY=3**: Gives the registry time to process the current batch, write to the persistence file, and update the Polo dashboard before the next wave hits.

Each daemon needs a unique UDP port for its tunnel and a unique Unix socket for IPC. We use a simple scheme: base port 10000 + sequential index. The IPC socket goes into a per-daemon directory under /tmp. This avoids the classic "address already in use" error and keeps each daemon's state isolated.
On swarm-2 with 5,000 daemons, that means ports 10001-15000 and 5,000 directories under /tmp/pilot-agent-*. We verified with ss -ulnp | wc -l that all 5,000 UDP sockets bind successfully.
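The uniqueness of that scheme is easy to verify offline by simulating the spawn loop's port arithmetic with the same COUNT and PAR as swarm-2:

```shell
#!/bin/bash
# Simulate port allocation for a full swarm-2 run and look for collisions.
COUNT=5000; PAR=15
spawned=0
tmpfile=$(mktemp)
while [ $spawned -lt $COUNT ]; do
  batch=$PAR
  remaining=$((COUNT - spawned))
  [ $remaining -lt $batch ] && batch=$remaining
  i=1
  while [ $i -le $batch ]; do
    echo $((10000 + spawned + i)) >> "$tmpfile"   # same formula as spawn-daemons.sh
    i=$((i + 1))
  done
  spawned=$((spawned + batch))
done
total=$(wc -l < "$tmpfile" | tr -d ' ')
dups=$(sort "$tmpfile" | uniq -d | wc -l | tr -d ' ')
echo "allocated ${total} ports, ${dups} duplicates"
# prints: allocated 5000 ports, 0 duplicates
```

Because `spawned` only advances after each round, the index is strictly increasing and no two daemons can ever share a port.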
Running 9,500 agents without monitoring is like flying blind in a hurricane. Our primary monitoring surface is the Polo dashboard, accessible at polo.pilotprotocol.network.
For scripted monitoring, we query the rendezvous HTTP API directly:
# Get fleet stats
curl -s http://35.193.106.76:3000/api/stats | jq .
{
"total_agents": 9523,
"total_networks": 47,
"online_agents": 9401,
"connections_active": 234
}
# Check agents per network
curl -s http://35.193.106.76:3000/api/networks | jq '.[] | {name, agents}'
# Health check loop (runs every 60s)
while true; do
count=$(curl -s http://35.193.106.76:3000/api/stats | jq .online_agents)
echo "$(date): $count agents online"
[ "$count" -lt 9000 ] && echo "ALERT: Agent count dropped below 9000"
sleep 60
done
On each swarm VM, we run a simple watchdog script that checks local daemon health:
# Count running daemons
pgrep -c pilot-daemon
# Total RSS in MB
ps -eo rss,comm | grep pilot-daemon | awk '{sum+=$1} END {printf "%.0f MB\n", sum/1024}'
# Daemons with high memory (potential leak)
ps -eo pid,rss,comm | grep pilot-daemon | awk '$2 > 20480 {print $1, $2/1024 "MB"}'
Operating at this scale surfaces bugs that never appear in development. Here is every major incident and its resolution.
What happened: We needed to restart all daemons on swarm-1. Ran pkill pilot-daemon. The command killed 3,000 processes, which generated 3,000 deregistration events, which flooded the registry connection, which caused a TCP write buffer overflow, which crashed the local network stack, which killed our SSH session.
Fix: Never pkill over SSH when thousands of processes will die simultaneously. Instead, use the detached pattern:
# Graceful mass kill that survives SSH drop
nohup bash -c "pkill -TERM pilot-daemon; sleep 30; pkill -9 pilot-daemon" &>/dev/null &
The nohup wrapper ensures the kill sequence completes even if SSH disconnects. The 30-second delay between SIGTERM and SIGKILL gives daemons time to deregister gracefully before the hard kill sweeps up any stragglers.
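Even a simultaneous SIGTERM still produces one burst of deregistrations. A batched variant — our sketch, with guessed batch size and pacing rather than the post's values — spreads that load across the registry:

```shell
#!/bin/bash
# Batched shutdown sketch: SIGTERM daemons 100 at a time so deregistration
# events trickle into the registry instead of arriving as one burst.
n=0
for pid in $(pgrep pilot-daemon); do
  kill -TERM "$pid" 2>/dev/null || true
  n=$((n + 1))
  if [ $((n % 100)) -eq 0 ]; then
    sleep 2   # pause between batches of 100
  fi
done
echo "signalled $n daemons"
```

Wrapped in the same nohup pattern, this survives an SSH drop while keeping the registry's write path comfortable.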
What happened: After a planned reboot of swarm-2 for kernel updates, the VM came back with 10,000 agents instead of 5,000. Memory exhaustion started within minutes, and the OOM killer began randomly terminating daemons, which caused flapping (daemon dies, startup script respawns it, repeat).
Root cause: Two startup scripts running. One in instance metadata, one in project metadata. Both spawning the full quota.
Fix: Clear the instance-level metadata script and rely only on the project-level one with the idempotency guard. We also added a daemon count check before spawning:
# Only spawn if below threshold
current=$(pgrep -c pilot-daemon 2>/dev/null || true)  # pgrep -c prints 0 (but exits non-zero) on no match
current=${current:-0}
target=5000
if [ "$current" -ge "$target" ]; then
echo "Already running $current daemons, target is $target. Skipping."
exit 0
fi
remaining=$((target - current))
echo "Running $current, spawning $remaining more"
What happened: swarm-3 (32 GB) was running 2,000 agents normally. We triggered a fleet-wide benchmark test that caused ~800 agents to open simultaneous connections. Each connection allocated ~4 MB of buffer space. Total spike: 800 * 4 MB = 3.2 GB on top of the 20 GB baseline. The system hit swap, performance collapsed, and the OOM killer started its work.
Fix: Reduced swarm-3's target from 2,000 to 1,500 agents. Added a memory pressure check to the monitoring script that automatically pauses benchmark operations when available memory drops below 20%. We also tuned Go's GOGC environment variable to 50 (default 100) on swarm-3, trading CPU for lower memory usage.
# Set lower GC threshold for memory-constrained VMs
export GOGC=50
What happened: With 9,500 agents, the registry persistence file (/var/lib/pilot/registry.json) grew to 45 MB. During a routine snapshot write, the rendezvous VM ran a GCE live migration. The write was interrupted mid-flush, leaving a truncated JSON file. On restart, the registry could not parse its state and started empty. All 9,500 agents had to re-register.
Fix: The registry already uses atomic writes (write to temp file, then rename), but the rename was happening after an fsync on the temp file, not the directory. On ext4, the directory entry is not guaranteed durable without a directory fsync. We added the directory sync. Re-registration after a cold start now takes about 4 minutes with controlled parallelism.
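The write sequence is worth spelling out. Here it is as a shell sketch — the real fix lives in the Go registry code; the paths here are illustrative, and GNU coreutils is assumed, since its sync accepts file and directory arguments:

```shell
#!/bin/bash
# Durable atomic snapshot write, step by step:
# write temp -> fsync file -> atomic rename -> fsync directory entry.
DIR="${DIR:-$(mktemp -d)}"          # stands in for /var/lib/pilot
PAYLOAD="${PAYLOAD:-snapshot-v1}"   # stands in for the registry JSON
tmp=$(mktemp "$DIR/registry.XXXXXX")
printf '%s' "$PAYLOAD" > "$tmp"     # 1. write the full snapshot to a temp file
sync "$tmp"                         # 2. fsync the temp file's contents
mv "$tmp" "$DIR/registry.json"      # 3. atomic rename within the same filesystem
sync "$DIR"                         # 4. fsync the directory entry (the missing step)
echo "wrote $DIR/registry.json"
```

Without step 4, ext4 may persist the file data but lose the rename itself across a crash, which is exactly the truncated-state failure the registry hit.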
What happened: On swarm-2, after spawning 5,000 daemons, we could not SSH in anymore. Investigation revealed that the 5,000 UDP sockets (ports 10000-14999) plus system services had exhausted the kernel's socket buffer memory. The default net.core.rmem_max and net.core.wmem_max were too low.
Fix:
sudo sysctl -w net.core.rmem_max=16777216
sudo sysctl -w net.core.wmem_max=16777216
sudo sysctl -w net.core.rmem_default=1048576
sudo sysctl -w net.core.wmem_default=1048576
sudo sysctl -w net.ipv4.udp_mem="4096 87380 16777216" # note: udp_mem is measured in pages, not bytes
After months of operating this fleet, we have distilled our operational procedures into a repeatable runbook. Here are the key procedures:
To restart the entire fleet:

1. Kill daemons with the nohup pattern, waiting 60 seconds between VMs.
2. Confirm pgrep -c pilot-daemon returns 0 on each VM.
3. Clear stale IPC sockets: rm -f /tmp/pilot-agent-*/pilot.sock.
4. Respawn and verify the agent count with curl /api/stats.

When we deploy a new daemon binary:
- Replace the binary on each VM: rm -f pilot-daemon && scp pilot-daemon vm:
- After respawning, watch dmesg for OOM-killer activity while memory settles.

Running nearly 10,000 agents on three VMs taught us things that no amount of unit testing could reveal.
We are not done scaling. The next milestone is 50,000 agents, which will require registry sharding and a federated rendezvous architecture. But at 9,500, we have proven that a single-binary agent daemon with a single-binary rendezvous server can handle a fleet that would make most distributed systems engineers uncomfortable. And that is exactly the point.
Pilot Protocol is open source. Start with two agents, scale to thousands.
View on GitHub