
Connect AI Agents Behind NAT Without a VPN

February 22, 2026 | NAT, networking, P2P

"How do IoT devices behind NAT communicate peer-to-peer?" This is one of the most searched networking questions on Stack Overflow, and it has been for years. The underlying problem is simple: 88% of networked devices sit behind a NAT. They can make outbound connections, but nothing from outside can reach them. And when two devices behind different NATs want to talk directly, neither can reach the other.

The workarounds are well-known and universally disliked. VPNs work but require configuration on every device, a central server, and ongoing maintenance. Port forwarding works but requires router access, static IPs, and breaks when the network changes. Cloud relay services work but add latency, cost money per gigabyte, and become a single point of failure.

Developers building robot fleets, federated learning systems, CI/CD pipelines, and multi-agent AI deployments all hit the same wall. As one robotics engineer put it: "The network grinds to a halt 90% of the time when I add a second robot." Researchers working on federated learning report that "NAT/firewall traversal is unsolved for P2P federated learning."

Pilot Protocol solves this with a three-tier NAT traversal strategy that is completely automatic. The agent developer writes zero networking code. The daemon handles STUN discovery, hole-punching, and relay fallback transparently. This article explains why NAT breaks peer-to-peer communication, why existing workarounds fail, and how Pilot's approach works in practice.

Why NAT Breaks Everything

Network Address Translation was introduced as a temporary fix for IPv4 address exhaustion in the 1990s. Thirty years later, it is more prevalent than ever and has become the single largest obstacle to peer-to-peer communication.

A NAT device assigns private IP addresses to devices on a local network (192.168.x.x, 10.x.x.x) and translates them to a single public IP address for outgoing traffic. This works perfectly for client-server communication: the device behind NAT initiates a connection, the NAT creates a mapping, and responses flow back through the mapping.

But peer-to-peer communication requires both sides to accept incoming connections. When Device A behind NAT-1 wants to talk to Device B behind NAT-2, neither side's first packet gets through, because neither device is directly addressable from outside. Device A sends a packet to Device B's public NAT address, but NAT-2 has no mapping for this unsolicited packet and drops it. The same happens in reverse.

The Four NAT Types

Not all NATs are equally hostile. RFC 3489 classifies NATs into four types based on how they create and enforce address mappings:

| NAT Type | Mapping Rule | P2P Possible? | Prevalence |
| --- | --- | --- | --- |
| Full Cone | Any external host can send to the mapped port | Direct connection | ~15% |
| Restricted Cone | Only hosts the device has sent to can reply | Hole-punching | ~25% |
| Port-Restricted | Only the exact host:port the device sent to can reply | Hole-punching | ~35% |
| Symmetric | Different external port for each destination | Relay only | ~25% |

Full Cone NATs are the friendliest: once a mapping is created, anyone can use it. These are common in residential routers and consumer-grade equipment. Restricted and Port-Restricted Cone NATs are more common in enterprise environments and allow peer-to-peer connections through hole-punching. Symmetric NATs -- found in corporate firewalls and carrier-grade NATs -- defeat hole-punching entirely because the external port changes with each destination.

The practical consequence: roughly 75% of NAT configurations allow direct peer-to-peer connections with the right technique. The remaining 25% require a relay. A complete NAT traversal solution must handle all four types automatically.
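To make the filtering rules in the table concrete, here is a toy model of the three cone types. It is a sketch under simplifying assumptions: one inside host, one mapped port, no timeouts. The class name, the `probe` helper, and the addresses are invented for illustration. Symmetric NAT is left out because its difference lies in how ports are allocated, not in how inbound packets are filtered.

```python
from dataclasses import dataclass, field


@dataclass
class ConeNat:
    """Toy inbound filter for one inside host with a single mapped port."""
    kind: str  # "full_cone", "restricted_cone" or "port_restricted_cone"
    contacted: set = field(default_factory=set)  # (ip, port) the host has sent to

    def send_out(self, dst_ip: str, dst_port: int) -> None:
        # Any outbound packet creates (or refreshes) the mapping.
        self.contacted.add((dst_ip, dst_port))

    def accepts(self, src_ip: str, src_port: int) -> bool:
        if not self.contacted:
            return False  # no mapping yet: everything inbound is dropped
        if self.kind == "full_cone":
            return True
        if self.kind == "restricted_cone":
            return any(ip == src_ip for ip, _ in self.contacted)
        if self.kind == "port_restricted_cone":
            return (src_ip, src_port) in self.contacted
        raise ValueError(f"unknown NAT kind: {self.kind}")


BEACON = ("203.0.113.10", 9000)    # made-up documentation-range addresses
STRANGER = ("198.51.100.7", 4000)


def probe(kind: str) -> tuple:
    nat = ConeNat(kind)
    nat.send_out(*BEACON)  # the host has only ever talked to the beacon
    return (
        nat.accepts(*STRANGER),        # a host it never contacted
        nat.accepts(BEACON[0], 5555),  # known IP, different source port
        nat.accepts(*BEACON),          # the exact host:port it contacted
    )


for kind in ("full_cone", "restricted_cone", "port_restricted_cone"):
    print(kind, probe(kind))
```

Each row of output reads as (stranger, same IP with another port, exact endpoint). Only Full Cone admits the stranger, which is why the other two types need both peers to send first.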

Current Workarounds and Why They Fail

VPNs: Powerful but Complex

WireGuard, Tailscale, and ZeroTier can connect devices behind NAT. They work well. They also require installing and configuring software on every device, managing key distribution, maintaining a coordination server, and debugging when connections drop. For a fleet of 50 IoT sensors, a VPN is viable. For a fleet of 10,000 ephemeral AI agents that spin up and down dynamically, the operational overhead is prohibitive.

VPNs also create a flat network -- every device on the VPN can reach every other device. This is a security liability when you want fine-grained access control between agents. Pilot Protocol's trust model allows per-agent-pair authorization, which a VPN does not provide. See How to Secure AI Agent Communication With Zero Trust for details.

DDS and Multicast: LAN Only

Robot Operating System 2 (ROS2) uses DDS (Data Distribution Service) for communication. DDS relies on UDP multicast for discovery, which works on a single LAN segment. The moment you cross a router, multicast traffic is dropped. The robotics engineer who reported "the network grinds to a halt 90% of the time when I add a second robot" was experiencing multicast storms -- DDS flooding the local network with discovery packets that never reach the intended destination.

DDS was designed for real-time systems on a local network, not for distributed agents across the internet. Using it for wide-area agent communication is using a hammer on a screw.

ngrok and Tunneling Services: Costly at Scale

Services like ngrok create public HTTPS endpoints for local services. They are perfect for demos and development. They are impractical for production agent fleets because they charge per connection or per bandwidth, they add latency (traffic routes through the provider's infrastructure), and they create a dependency on a third-party service that you do not control.

More fundamentally, ngrok solves the wrong problem. It makes a single device reachable from the internet. Agent communication is peer-to-peer: every agent needs to reach every other agent, not just be reachable from a central server. With N agents, you need N ngrok tunnels to cover N(N-1)/2 potential peer pairs. The tunnel bill grows linearly, but the number of connections to manage grows quadratically.
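A quick back-of-the-envelope check of how the pair count grows with fleet size. The function names are just for illustration.

```python
def tunnels_needed(agents: int) -> int:
    # One public tunnel per agent that must be reachable.
    return agents


def peer_pairs(agents: int) -> int:
    # Unordered pairs of distinct agents: N * (N - 1) / 2.
    return agents * (agents - 1) // 2


for n in (10, 100, 10_000):
    print(f"{n} agents: {tunnels_needed(n)} tunnels, {peer_pairs(n)} peer pairs")
```

At 10,000 agents that is just under 50 million potential pairs, each of which would be routed through a third party's infrastructure.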

Cloud MQTT/Relay: The Bottleneck

Many IoT platforms route all traffic through a cloud MQTT broker or relay server. This works but makes the cloud server a bottleneck for throughput, a single point of failure for availability, and a privacy concern for data that should never leave the local network. If two agents are in the same building, their traffic still round-trips through a data center on another continent.

The common thread: Every workaround either requires manual configuration per device, creates a central bottleneck, adds a third-party dependency, or scales poorly. None of them provide automatic, peer-to-peer NAT traversal that handles all four NAT types out of the box.

Pilot's Three-Tier Approach: STUN, Hole-Punch, Relay

Pilot Protocol handles NAT traversal automatically through a three-tier strategy. The daemon tries the fastest approach first and falls back to slower approaches only when necessary. The agent developer never sees any of this -- they just call pilotctl connect and get a working connection.

Tier 1: STUN Discovery and Direct Connection

When a Pilot daemon starts, it discovers its own public endpoint using STUN (Session Traversal Utilities for NAT). The daemon sends a UDP packet to the beacon server, and the beacon replies with the source address it observed -- which is the agent's public NAT-mapped endpoint.

# Daemon startup: STUN discovery happens automatically
$ pilotctl daemon start
STUN: discovered public endpoint 34.148.103.117:4000
NAT type: restricted_cone
Registered with registry
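For readers who want to see what "the beacon replies with the source address it observed" can look like on the wire, the sketch below encodes and decodes a standard STUN Binding exchange as defined in RFC 5389. This is an assumption for illustration: the article does not say whether Pilot's beacon uses the standard STUN message format, and the helper names are invented. No network traffic is sent; the response is built locally and then parsed.

```python
import ipaddress
import os
import struct

MAGIC_COOKIE = 0x2112A442          # fixed value from RFC 5389
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
ATTR_XOR_MAPPED_ADDRESS = 0x0020
HEADER = struct.Struct("!HHI12s")  # type, length, cookie, transaction ID


def build_binding_request(txn_id=None):
    """20-byte STUN header with no attributes."""
    return HEADER.pack(BINDING_REQUEST, 0, MAGIC_COOKIE, txn_id or os.urandom(12))


def build_binding_response(txn_id, ip, port):
    """What a server would send back: the observed source, XOR-obfuscated."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = int(ipaddress.IPv4Address(ip)) ^ MAGIC_COOKIE
    value = struct.pack("!BBHI", 0, 0x01, x_port, x_addr)  # 0x01 = IPv4
    attr = struct.pack("!HH", ATTR_XOR_MAPPED_ADDRESS, len(value)) + value
    return HEADER.pack(BINDING_SUCCESS, len(attr), MAGIC_COOKIE, txn_id) + attr


def parse_mapped_endpoint(message):
    """Return (ip, port) from the XOR-MAPPED-ADDRESS attribute (IPv4 only)."""
    msg_type, length, cookie, _txn = HEADER.unpack_from(message)
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN binding success response")
    offset, end = HEADER.size, HEADER.size + length
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", message, offset)
        if attr_type == ATTR_XOR_MAPPED_ADDRESS:
            _, family, x_port, x_addr = struct.unpack_from("!BBHI", message, offset + 4)
            if family != 0x01:
                raise ValueError("only IPv4 is handled in this sketch")
            ip = ipaddress.IPv4Address(x_addr ^ MAGIC_COOKIE)
            return str(ip), x_port ^ (MAGIC_COOKIE >> 16)
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are 4-byte aligned
    raise ValueError("no XOR-MAPPED-ADDRESS attribute found")


txn = os.urandom(12)
response = build_binding_response(txn, "34.148.103.117", 4000)
print(parse_mapped_endpoint(response))
```

The XOR step exists because some NATs rewrite anything in the payload that looks like their own address. Obfuscating the value keeps it intact in transit.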

If both agents are behind Full Cone NATs, direct connection works immediately. Agent A looks up Agent B's public endpoint from the registry and sends packets directly. The NAT forwards them because Full Cone NATs accept packets from any source once a mapping exists.

Tier 2: Coordinated Hole-Punching

When one or both agents are behind Restricted or Port-Restricted NATs, direct packets get dropped because the NAT has no mapping for the sender. Hole-punching solves this by coordinating simultaneous outbound packets from both sides.

The sequence is choreographed by the beacon server:

  1. Agent A sends a punch request to the beacon: "I want to connect to Agent B"
  2. The beacon sends a punch command to both agents: "Send a UDP packet to each other's public endpoint now"
  3. Both agents send UDP packets simultaneously. These outbound packets create NAT mappings on both sides.
  4. Subsequent packets from either side pass through the newly created mappings
  5. The hole is established. Traffic flows directly, peer-to-peer, with no relay

Hole-punch timing (simplified):

Agent A              Beacon              Agent B
   |                   |                   |
   |-- PunchRequest -->|                   |
   |                   |-- PunchCommand -->|
   |<-- PunchCommand --|                   |
   |                   |                   |
   |===== UDP to B's endpoint ============>|  (creates mapping on A's NAT)
   |<============ UDP to A's endpoint =====|  (creates mapping on B's NAT)
   |                   |                   |
   |<========= direct traffic ============>|  (both NATs have mappings now)

After hole-punching succeeds, the connection is indistinguishable from a direct connection. The beacon is no longer involved. Traffic flows peer-to-peer at full speed with no added latency.
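The sequence above can be simulated in a few lines. This is a toy model, not Pilot's implementation: it assumes two Port-Restricted Cone NATs whose public endpoints stay fixed for every destination, and the class and function names are made up.

```python
class PortRestrictedNat:
    """Admits inbound packets only from endpoints the inside host has sent to."""

    def __init__(self, public_endpoint):
        self.public_endpoint = public_endpoint
        self.allowed = set()

    def outbound(self, dst):
        # Sending out is what "punches the hole" for replies from dst.
        self.allowed.add(dst)

    def inbound_allowed(self, src):
        return src in self.allowed


def deliver(sender_nat, receiver_nat):
    """Send one packet to the receiver's public endpoint.

    Returns True if the receiver's NAT lets the packet through.
    """
    sender_nat.outbound(receiver_nat.public_endpoint)
    return receiver_nat.inbound_allowed(sender_nat.public_endpoint)


nat_a = PortRestrictedNat(("73.162.88.14", 4000))
nat_b = PortRestrictedNat(("98.45.211.33", 4000))

first = deliver(nat_a, nat_b)   # dropped by B's NAT, but opens A's NAT for B
second = deliver(nat_b, nat_a)  # admitted: A has already sent to B
third = deliver(nat_a, nat_b)   # admitted: B has now sent to A
print(first, second, third)
```

The first packet is lost, and that is fine. Its job is to create the mapping, which is why the beacon asks both sides to send at roughly the same time.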

Tier 3: Relay Fallback for Symmetric NAT

Symmetric NATs assign a different external port for each destination. STUN discovers port 32761 (the port assigned when talking to the beacon), but when the agent sends to a peer, the NAT assigns port 32762. The peer is trying to reach port 32761, which does not accept traffic from the peer's IP. Hole-punching cannot work.
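The port mismatch is easy to reproduce in a toy model. The sketch below assumes the NAT hands out external ports sequentially, which matches the 32761/32762 example but is not guaranteed in practice, since many NATs randomize. The class name and the beacon and peer addresses are invented.

```python
class SymmetricNat:
    """Allocates a separate external port for every destination endpoint."""

    def __init__(self, public_ip, first_port):
        self.public_ip = public_ip
        self.next_port = first_port
        self.mappings = {}  # destination (ip, port) -> external port

    def external_endpoint_for(self, dst):
        if dst not in self.mappings:
            self.mappings[dst] = self.next_port
            self.next_port += 1  # assumption: sequential allocation
        return (self.public_ip, self.mappings[dst])


BEACON = ("203.0.113.10", 9000)  # made-up addresses
PEER = ("73.162.88.14", 4000)

nat = SymmetricNat("91.203.45.67", 32761)
seen_by_beacon = nat.external_endpoint_for(BEACON)  # what STUN reports
seen_by_peer = nat.external_endpoint_for(PEER)      # what the peer must hit
print(seen_by_beacon, seen_by_peer)
```

The endpoint published through STUN is valid only for traffic from the beacon. The peer would have to guess the second port, which is why this tier goes straight to relay.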

For these cases, Pilot falls back to relay mode. All traffic between the two agents routes through the beacon. The relay is transparent to the application -- the agent still sees a normal connection object with the same API. The only difference is higher latency and lower throughput.

Relay path (symmetric NAT):

Agent A ──→ Beacon ──→ Agent B
        encrypted       encrypted
        payload         payload

# The beacon never sees plaintext.
# Traffic is encrypted end-to-end with X25519 + AES-256-GCM.
# The beacon forwards opaque bytes.

A critical property: even in relay mode, the beacon cannot read the traffic. The tunnel between Agent A and Agent B is encrypted with X25519 key exchange and AES-256-GCM. The beacon sees only encrypted bytes, a source node ID, and a destination node ID. It is a dumb forwarder. A compromised beacon cannot eavesdrop on agent communications.
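The "dumb forwarder" idea can be sketched as follows. Everything here is hypothetical: the frame layout (two 32-bit node IDs followed by the payload), the class, and the node IDs are invented for illustration, and random bytes stand in for AES-256-GCM ciphertext. The point is only that routing needs the header and nothing else.

```python
import os
import struct

HEADER = struct.Struct("!II")  # hypothetical layout: source ID, destination ID


def make_frame(src_id: int, dst_id: int, ciphertext: bytes) -> bytes:
    return HEADER.pack(src_id, dst_id) + ciphertext


class Relay:
    """Routes frames by destination ID without inspecting the payload."""

    def __init__(self):
        self.inboxes = {}

    def register(self, node_id: int) -> None:
        self.inboxes[node_id] = []

    def forward(self, frame: bytes) -> bool:
        _src_id, dst_id = HEADER.unpack_from(frame)
        if dst_id not in self.inboxes:
            return False  # unknown destination: drop
        self.inboxes[dst_id].append(frame)  # bytes are passed on untouched
        return True


relay = Relay()
relay.register(8)
relay.register(9)

ciphertext = os.urandom(48)  # stand-in for an end-to-end encrypted payload
ok = relay.forward(make_frame(8, 9, ciphertext))
received = relay.inboxes[9][0][HEADER.size:]
print(ok, received == ciphertext)
```

Because the relay holds no keys, the worst a compromised one can do in this model is drop or delay frames, or observe who talks to whom.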

Zero Configuration: It Just Works

The three-tier approach is automatic. The daemon tries all tiers in sequence without any input from the agent developer or operator. Here is what happens when two agents behind different NATs connect:

# Agent A (behind Port-Restricted NAT)
$ pilotctl daemon start
STUN: public endpoint 73.162.88.14:4000
NAT type: port_restricted_cone

# Agent B (behind Symmetric NAT)
$ pilotctl daemon start
STUN: public endpoint 91.203.45.67:32761
NAT type: symmetric

# Agent A connects to Agent B -- all traversal is automatic
$ pilotctl connect agent-b --message "hello from behind NAT"
Trying direct... failed (3 attempts)
Trying hole-punch... failed (symmetric NAT on remote)
Falling back to relay...
Connected via relay (encrypted, 15ms overhead)
Message delivered

The developer's command is pilotctl connect agent-b. The retry attempts, the hole-punch coordination, and the relay fallback all happen inside the DialConnection function. The developer never writes NAT traversal code, never configures port forwarding, and never sets up a VPN.
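The control flow behind that transcript is a plain ordered fallback. The sketch below is not the DialConnection function. It is a minimal Python illustration of the same idea, with stub attempts and retry counts chosen arbitrarily.

```python
from typing import Callable, Optional, Sequence, Tuple

Attempt = Callable[[], bool]


def dial(tiers: Sequence[Tuple[str, Attempt, int]]) -> Optional[str]:
    """Try each tier in order; return the name of the first that connects."""
    for name, attempt, max_tries in tiers:
        for _ in range(max_tries):
            if attempt():
                return name
    return None  # every tier was exhausted


def always_fails() -> bool:
    return False


def always_works() -> bool:
    return True


# Stubs standing in for real network attempts (hypothetical retry counts).
tiers = [
    ("direct", always_fails, 3),
    ("hole_punch", always_fails, 3),
    ("relay", always_works, 1),
]
print(dial(tiers))
```

Ordering the tiers from cheapest to most expensive means the relay carries traffic only when nothing faster worked.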

For cloud VMs with stable public IPs, the --endpoint flag skips STUN entirely:

# Cloud VM with known public IP -- skip STUN
$ pilotctl daemon start --endpoint 34.148.103.117:4000 --public
Registered with fixed endpoint (STUN skipped)

Performance by Tier

The traversal tier directly affects connection performance. Here are measurements from Pilot's test fleet across five GCP regions:

| Connection Type | Latency (RTT) | Throughput | Setup Time |
| --- | --- | --- | --- |
| Direct (same region) | ~2ms | ~850 Mbps | ~50ms |
| Direct (cross-region) | ~40ms | ~400 Mbps | ~80ms |
| Hole-punched | +5ms setup overhead | Same as direct | ~600ms |
| Relay (same beacon) | +15ms per hop | ~200 Mbps | ~100ms |
| Relay (cross-beacon) | +25ms per hop | ~120 Mbps | ~150ms |

The key insight: hole-punched connections perform identically to direct connections after the initial setup. The ~600ms setup time is the cost of coordinating the punch through the beacon. Once the hole is established, traffic flows peer-to-peer with no added latency. This means roughly 75% of NAT scenarios (Full Cone, Restricted Cone, Port-Restricted Cone) get direct-speed performance.

Relay adds ~15ms of latency per hop because every packet takes two trips (sender to beacon, beacon to receiver). Throughput drops to roughly 120-200 Mbps in the measurements above because the beacon becomes the bottleneck. For most agent workloads -- task delegation, status updates, small data transfers -- this overhead is imperceptible. For bulk data transfer, hole-punching is strongly preferred.
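To see where the tiers diverge, here is a rough transfer-time estimate built from the table's figures. It is deliberately crude: it counts only setup time plus payload size divided by throughput, ignores round-trip latency and protocol overhead, and assumes the hole-punched row inherits the same-region direct throughput.

```python
def transfer_seconds(payload_bytes: int, throughput_mbps: float, setup_ms: float) -> float:
    """Setup time plus serialization time, ignoring RTT and protocol overhead."""
    return setup_ms / 1000 + (payload_bytes * 8) / (throughput_mbps * 1_000_000)


# (throughput in Mbps, setup in ms), taken from the table above.
TIERS = {
    "direct (same region)": (850, 50),
    "hole-punched (same region)": (850, 600),  # assumes direct throughput
    "relay (same beacon)": (200, 100),
}

for label, size in (("10 KB", 10_000), ("1 GB", 1_000_000_000)):
    for tier, (mbps, setup_ms) in TIERS.items():
        print(f"{label:>5} via {tier}: {transfer_seconds(size, mbps, setup_ms):.2f}s")
```

For a small message the relay's shorter setup can even come out ahead of a fresh hole-punch, but for a gigabyte the relay takes about four times as long under these assumptions.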

Use Cases

IoT Device Fleets

Sensors, cameras, and edge devices behind residential or industrial NATs can form peer-to-peer networks without cloud relay. A sensor in a factory can stream data directly to a processing agent in the same building, or to an analytics agent in a different city. The NAT type does not matter -- Pilot handles it. No port forwarding, no VPN, no cloud dependency.

Federated Learning Nodes

Federated learning requires model updates to flow between training nodes without centralizing the data. When those nodes are on university networks, home labs, and corporate campuses -- all behind different NATs -- the communication layer is the hardest part. Pilot provides authenticated, encrypted peer-to-peer connections between any two nodes, regardless of their NAT type. The trust model ensures that only authorized nodes participate in the training federation.

CI/CD Runners and Dev Containers

Self-hosted CI/CD runners behind corporate firewalls cannot accept inbound connections. Cloud-based orchestrators cannot push jobs to them. Pilot flips this: the runner registers with the Pilot registry on startup, and the orchestrator connects through the automatic NAT traversal. No firewall rules to modify, no VPN to maintain, no IT ticket to file.

Multi-Agent AI Deployments

When your AI agents run on a mix of cloud VMs, office workstations, and edge devices, the network topology is inherently heterogeneous. Some agents have public IPs, some are behind residential NAT, some are behind corporate firewalls. Pilot's three-tier traversal unifies all of these into a single overlay network where every agent can reach every other agent through its virtual address.

CLI Example: Two Agents Behind Different NATs

Here is the complete workflow for connecting two agents on different home networks, each behind a consumer-grade NAT router.

# Home network 1 (Port-Restricted Cone NAT)
$ go install github.com/TeoSlayer/pilotprotocol/cmd/pilotctl@latest
$ pilotctl init --hostname home-agent-1
$ pilotctl daemon start --registry rendezvous.example.com:9000
STUN: public endpoint 73.162.88.14:4000 (port_restricted_cone)
Registered as home-agent-1 (1:0001.0000.0008)

# Home network 2 (Restricted Cone NAT)
$ go install github.com/TeoSlayer/pilotprotocol/cmd/pilotctl@latest
$ pilotctl init --hostname home-agent-2 --public
$ pilotctl daemon start --registry rendezvous.example.com:9000
STUN: public endpoint 98.45.211.33:4000 (restricted_cone)
Registered as home-agent-2 (1:0001.0000.0009)

# Home 1: establish trust and connect
$ pilotctl handshake home-agent-2 "Collaborative data processing"
Handshake sent...

# Home 2: approve
$ pilotctl approve 1:0001.0000.0008
Trust established

# Home 1: connect — hole-punching happens automatically
$ pilotctl ping home-agent-2 --count 4
PING home-agent-2 (1:0001.0000.0009):
  Reply: 12ms (hole-punched, encrypted)
  Reply: 11ms
  Reply: 11ms
  Reply: 12ms
4 packets, 0% loss, avg 11.5ms

# Direct peer-to-peer — no relay, no VPN, no port forwarding

Both agents are behind NAT on different home networks. The hole-punch succeeded because both NAT types support it (Port-Restricted and Restricted Cone). After the initial punch (~600ms), traffic flows directly between the two home IPs at ~12ms round-trip time. No cloud relay, no VPN tunnel, no router configuration. The only infrastructure is the rendezvous server that coordinated the initial discovery and hole-punch.

For a deeper technical treatment of the NAT traversal protocol, including wire formats, beacon gossip, and cross-beacon relay, see NAT Traversal for AI Agents: A Deep Dive. For the encryption that protects all traffic regardless of traversal tier, see Zero-Dependency Agent Encryption. For the trust model that determines which agents can connect, see How to Secure AI Agent Communication With Zero Trust.

Try Pilot Protocol

Automatic NAT traversal for AI agents, IoT devices, and distributed systems. No VPN, no port forwarding, no cloud relay bills. Install and connect in under 5 minutes.
