
NAT Traversal for AI Agents: A Deep Dive

February 13, 2026 · deep-dive · nat · networking

88% of networked devices sit behind a NAT. Every corporate laptop, every cloud VM behind a load balancer, every home lab GPU rig. When Google's A2A specification says "agents communicate via HTTP endpoints," it quietly assumes every agent has a reachable IP address. Most do not.

Pilot Protocol solves this with a three-tier NAT traversal strategy: direct connection for simple NATs, UDP hole-punching for restrictive NATs, and relay for the worst case (symmetric NAT). This article is the definitive technical reference on how it works, why each tier exists, and what happens at the wire level.

NAT Taxonomy: What You Are Up Against

Not all NATs are created equal. RFC 3489 defines four types, and each has different implications for peer-to-peer connectivity between agents.

Full Cone NAT

Once an internal host sends a packet through the NAT, any external host can send packets back to the mapped address. The NAT creates a mapping (internal_ip:port) → (external_ip:port) and allows any external source to use it.

Full Cone: the friendly NAT
Agent A (192.168.1.5:4000)  ──→  NAT maps to 34.148.103.117:4000
                                   ↑
                                   Any external host can send to this address

Traversal strategy: Direct connection. STUN discovers the public endpoint, the agent registers it, and any peer can connect directly. This is the easiest case.

Restricted Cone NAT

The NAT only forwards packets from external hosts that the internal host has previously sent a packet to. The mapping exists, but the NAT filters by source IP (not port).

Restricted Cone: must "know" the sender
Agent A sends to Beacon (1.2.3.4)  ──→  NAT allows packets FROM 1.2.3.4 (any port)
Agent B (5.6.7.8) tries to reach A  ──→  NAT DROPS (A never sent to 5.6.7.8)

Traversal strategy: Hole-punching. Both agents send a UDP packet to each other's mapped address simultaneously. This "punches" a hole in both NATs, allowing bidirectional traffic.

Port-Restricted Cone NAT

Like Restricted Cone, but the NAT also filters by source port. The internal host must have sent to the exact (IP, port) pair.

Port-Restricted Cone: must know sender IP AND port
Agent A sends to 1.2.3.4:9001  ──→  NAT allows packets FROM 1.2.3.4:9001 only
Agent B at 5.6.7.8:4000       ──→  NAT DROPS (A never sent to 5.6.7.8:4000)

Traversal strategy: Hole-punching still works, but both sides must target the exact port. The beacon coordinates this by sharing the mapped endpoints.

Symmetric NAT

The NAT creates a different mapping for each destination. If Agent A sends to Beacon from internal port 4000, the NAT might map it to external port 32761. If Agent A then sends to Agent B from the same internal port 4000, the NAT creates a new mapping on a different external port, say 32762.

Symmetric NAT: different external port for each destination
Agent A (internal:4000) → Beacon (1.2.3.4)  =  NAT maps to 34.x.x.x:32761
Agent A (internal:4000) → Agent B (5.6.7.8) =  NAT maps to 34.x.x.x:32762  ← DIFFERENT!
                                                 ↑
                                                 STUN learned 32761, but B sees 32762

Traversal strategy: Hole-punching cannot work because the STUN-discovered port is different from the port the NAT will assign for communication with the peer. The only option is relay.

Summary table

NAT Type         Direct  Hole-Punch  Relay     Prevalence
Full Cone        Yes     N/A         N/A       ~15%
Restricted Cone  No      Yes         N/A       ~25%
Port-Restricted  No      Yes         N/A       ~35%
Symmetric        No      No          Required  ~25%

The upshot: roughly 75% of NAT scenarios can be handled with direct connection or hole-punching. The remaining 25% require relay. Pilot Protocol handles all four automatically.
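The symmetric-vs-cone distinction can be made mechanically: probe two different destinations from the same internal port and compare the external ports the far side observed. The sketch below illustrates that logic; the names (`Observation`, `Classify`) are illustrative, not part of Pilot's API.

```go
package main

import "fmt"

// Observation records the external port seen when the agent sent to a
// given destination from one fixed internal port.
type Observation struct {
	Destination  string
	ExternalPort int
}

// Classify returns "symmetric" when different destinations yield
// different external ports, and "cone" otherwise. Telling the three
// cone variants apart additionally requires inbound probes from
// unknown sources, which is omitted here.
func Classify(obs []Observation) string {
	if len(obs) < 2 {
		return "unknown"
	}
	for _, o := range obs[1:] {
		if o.ExternalPort != obs[0].ExternalPort {
			return "symmetric"
		}
	}
	return "cone"
}

func main() {
	obs := []Observation{
		{"1.2.3.4:9001", 32761}, // STUN probe toward the beacon
		{"5.6.7.8:4000", 32762}, // probe toward the peer
	}
	fmt.Println(Classify(obs)) // symmetric
}
```

A cone NAT would have reported 32761 for both probes; the mismatch is exactly why the STUN-discovered endpoint is useless against a symmetric NAT.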

Tier 1: STUN Discovery

Every Pilot agent starts by discovering its own public endpoint. This is the STUN (Session Traversal Utilities for NAT) phase, and it runs during daemon startup, before the tunnel is active.

The temporary socket trick

STUN discovery and the tunnel both use UDP. If they shared the same socket, they would race: STUN replies would interleave with tunnel frames, and the tunnel's readLoop might consume STUN responses. Pilot avoids this with a deliberate design:

  1. The daemon opens a temporary UDP socket bound to the tunnel port
  2. It sends a STUN request (MsgDiscover) to the beacon
  3. It receives the STUN reply (MsgDiscoverReply) with its public IP:port
  4. It closes the temporary socket
  5. It starts the tunnel, which binds the same port
// STUN must complete BEFORE tunnel readLoop starts
// They share the UDP port — cannot run concurrently

// Step 1: temporary socket for STUN
stunConn, _ := net.ListenPacket("udp", ":4000")
publicAddr := doSTUN(stunConn, beaconAddr)  // MsgDiscover → MsgDiscoverReply
stunConn.Close()                            // MUST close before tunnel binds

// Step 2: tunnel binds the same port
tunnel := NewTunnel(":4000")
tunnel.Start()                              // readLoop now owns the socket

Why not a different port? The public endpoint discovered by STUN is only valid for the port it was discovered on. If STUN uses port 4000 and the tunnel uses port 4001, the NAT mapping for port 4000 is useless for tunnel traffic on port 4001. They must share the same port.

The beacon protocol: MsgDiscover

The STUN exchange uses two message types:

MsgDiscover (0x01)
Client → Beacon
[1 byte]  Message type: 0x01
[4 bytes] Node ID (big-endian uint32)

MsgDiscoverReply (0x02)
Beacon → Client
[1 byte]  Message type: 0x02
[4 bytes] Node ID
[N bytes] Public address as string (e.g., "34.148.103.117:4000")

The beacon simply reads the source address from the UDP packet and sends it back. This is how the agent learns its own public-facing IP and port as seen from the outside.
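The wire format above is small enough to encode by hand. Here is a hedged sketch of the two discovery messages using `encoding/binary`; these helpers are illustrative, not Pilot's actual source.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	MsgDiscover      = 0x01
	MsgDiscoverReply = 0x02
)

// EncodeDiscover builds a MsgDiscover: 1-byte type + 4-byte node ID.
func EncodeDiscover(nodeID uint32) []byte {
	buf := make([]byte, 5)
	buf[0] = MsgDiscover
	binary.BigEndian.PutUint32(buf[1:], nodeID)
	return buf
}

// EncodeDiscoverReply echoes the node ID and appends the observed
// source address as a string, e.g. "34.148.103.117:4000".
func EncodeDiscoverReply(nodeID uint32, addr string) []byte {
	buf := make([]byte, 5, 5+len(addr))
	buf[0] = MsgDiscoverReply
	binary.BigEndian.PutUint32(buf[1:], nodeID)
	return append(buf, addr...)
}

// DecodeDiscoverReply extracts the node ID and public address.
func DecodeDiscoverReply(msg []byte) (uint32, string) {
	return binary.BigEndian.Uint32(msg[1:5]), string(msg[5:])
}

func main() {
	reply := EncodeDiscoverReply(42, "34.148.103.117:4000")
	id, addr := DecodeDiscoverReply(reply)
	fmt.Println(id, addr) // 42 34.148.103.117:4000
}
```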

What the agent does with the result

After STUN discovery, the daemon registers its public endpoint with the registry. Other agents can then look up this endpoint and attempt to connect directly.

// Registration includes the STUN-discovered endpoint
registry.Register(nodeID, publicAddr)  // "34.148.103.117:4000"

If the agent is behind a Full Cone NAT, this is all that is needed. Peers look up the endpoint and send packets directly. For other NAT types, more work is required.

Tier 2: UDP Hole-Punching

When Agent A wants to connect to Agent B, and one or both are behind a Restricted or Port-Restricted NAT, direct packets get dropped. Hole-punching solves this by coordinating simultaneous outbound packets from both sides.

The sequence

Hole-punch sequence (simplified):

1. Agent A looks up Agent B's endpoint from registry
2. Agent A sends MsgPunchRequest(0x03) to the beacon:
   "I want to connect to Agent B. My endpoint is 34.148.103.117:4000"

3. Beacon sends MsgPunchCommand(0x04) to Agent A:
   "Send UDP to Agent B at 34.79.161.216:4000"

4. Beacon sends MsgPunchCommand(0x04) to Agent B:
   "Send UDP to Agent A at 34.148.103.117:4000"

5. Agent A sends a UDP packet to 34.79.161.216:4000
   → Agent A's NAT creates a mapping allowing replies from 34.79.161.216

6. Agent B sends a UDP packet to 34.148.103.117:4000
   → Agent B's NAT creates a mapping allowing replies from 34.148.103.117

7. Both NATs now have holes. Subsequent packets flow directly.

Wire format

MsgPunchRequest (0x03)
Client → Beacon
[1 byte]  Message type: 0x03
[4 bytes] Requester node ID
[4 bytes] Target node ID

MsgPunchCommand (0x04)
Beacon → Both clients
[1 byte]  Message type: 0x04
[4 bytes] Peer node ID (who to punch toward)
[N bytes] Peer endpoint as string (e.g., "34.79.161.216:4000")
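The punch coordination messages follow the same one-byte-type convention. A minimal sketch of the client side, assuming the layouts above (helper names are illustrative):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	MsgPunchRequest = 0x03
	MsgPunchCommand = 0x04
)

// EncodePunchRequest builds the message an agent sends to the beacon:
// type + requester node ID + target node ID.
func EncodePunchRequest(requester, target uint32) []byte {
	buf := make([]byte, 9)
	buf[0] = MsgPunchRequest
	binary.BigEndian.PutUint32(buf[1:5], requester)
	binary.BigEndian.PutUint32(buf[5:9], target)
	return buf
}

// ParsePunchCommand extracts the peer to punch toward and its
// NAT-mapped endpoint from a beacon MsgPunchCommand.
func ParsePunchCommand(msg []byte) (peerID uint32, endpoint string) {
	return binary.BigEndian.Uint32(msg[1:5]), string(msg[5:])
}

func main() {
	req := EncodePunchRequest(1, 2)
	fmt.Printf("% x\n", req) // 03 00 00 00 01 00 00 00 02
}
```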

Timing and retries

Hole-punching is timing-sensitive. If Agent A's packet arrives at B's NAT before B's outbound packet has created a mapping, the packet is dropped. Pilot handles this with retries:

  1. Both agents send 3 punch packets at 200ms intervals
  2. If any packet gets through, the hole is established
  3. The tunnel readLoop detects the first valid frame and completes the handshake
  4. If all 3 attempts fail, the system falls back to relay
// DialConnection: 3 direct retries → auto-switch to relay
for attempt := 0; attempt < 3; attempt++ {
    conn, err := directDial(peerEndpoint)
    if err == nil {
        return conn  // hole-punch succeeded
    }
    time.Sleep(200 * time.Millisecond)
}
// Fall through to relay
return relayDial(peerNodeID)

Why it works for Restricted Cone

A Restricted Cone NAT drops packets from sources the internal host has not communicated with. But "communicated with" means "sent a packet to that IP." When Agent A sends a UDP packet to Agent B's NAT-mapped address, Agent A's NAT records that it has communicated with B's public IP. When B's reply arrives, the NAT allows it through.

Port-Restricted Cone adds a port check, but the same principle applies: Agent A sends to B's exact (IP, port), and B sends to A's exact (IP, port). Both NATs create the necessary mappings.

Tier 3: Relay Fallback

Symmetric NATs defeat hole-punching because the STUN-discovered port is not the port the NAT will use for a different destination. When hole-punching fails, Pilot automatically falls back to relay mode.

How relay works

In relay mode, all traffic between two agents flows through the beacon. The beacon acts as a packet forwarder, not a processor. It does not decrypt, inspect, or modify the tunnel frames. It wraps them in a relay envelope and forwards them.

Relay path:
Agent A ──→ Beacon ──→ Agent B

Agent A sends: MsgRelay(0x05) to Beacon
  [1 byte]  Message type: 0x05
  [4 bytes] Sender node ID
  [4 bytes] Destination node ID
  [N bytes] Encrypted tunnel frame (opaque to beacon)

Beacon receives, looks up Agent B's endpoint, and sends:
  MsgRelayDeliver(0x06) to Agent B
  [1 byte]  Message type: 0x06
  [4 bytes] Sender node ID
  [N bytes] Encrypted tunnel frame
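The beacon's rewrite from MsgRelay to MsgRelayDeliver is a pure envelope operation: drop the destination ID, keep the sender ID, and copy the frame bytes through untouched. A sketch under the formats above (names are illustrative):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	MsgRelay        = 0x05
	MsgRelayDeliver = 0x06
)

// WrapRelay builds the envelope an agent sends to the beacon.
func WrapRelay(sender, dest uint32, frame []byte) []byte {
	buf := make([]byte, 9, 9+len(frame))
	buf[0] = MsgRelay
	binary.BigEndian.PutUint32(buf[1:5], sender)
	binary.BigEndian.PutUint32(buf[5:9], dest)
	return append(buf, frame...)
}

// ToDeliver converts a MsgRelay into the MsgRelayDeliver the beacon
// forwards: the destination ID is consumed for routing, the encrypted
// frame stays opaque.
func ToDeliver(relay []byte) (dest uint32, deliver []byte) {
	dest = binary.BigEndian.Uint32(relay[5:9])
	deliver = append(deliver, MsgRelayDeliver)
	deliver = append(deliver, relay[1:5]...) // sender node ID
	deliver = append(deliver, relay[9:]...)  // frame, untouched
	return dest, deliver
}

func main() {
	relay := WrapRelay(1, 2, []byte("ciphertext"))
	dest, deliver := ToDeliver(relay)
	fmt.Println(dest, string(deliver[5:])) // 2 ciphertext
}
```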

Auto-detection

The receiving agent detects that it is receiving data via relay rather than direct connection by the message type. When it sees a MsgRelayDeliver (0x06) instead of a raw tunnel frame, it knows the path is relayed and adjusts its connection state accordingly.

// Auto-detect relay mode
switch msgType {
case MsgRelayDeliver:
    // Data arrived via beacon relay
    // Mark this connection as relayed
    conn.SetRelayed(true)
    // Extract and process the inner tunnel frame
    frame := msg[5:]  // skip type + sender node ID
    tunnel.HandleFrame(frame)
}

Relay is encrypted end-to-end

A critical property: the beacon never sees plaintext. The tunnel frame inside the relay envelope is encrypted with X25519 + AES-256-GCM between the two agents. The beacon merely forwards opaque bytes. This means a compromised beacon cannot read agent traffic.

Privacy guarantee: Even in relay mode, the beacon sees only encrypted bytes, source node ID, and destination node ID. It cannot read message contents, file data, or task payloads. See our encryption deep dive for the cryptographic details.

Beacon Gossip Protocol

In a multi-beacon deployment, beacons need to know about each other's agents so they can relay cross-beacon traffic. This is handled by the gossip protocol.

MsgSync: node list broadcast

MsgSync (0x07)
Beacon → Beacon (every 10 seconds)
[1 byte]   Message type: 0x07
[4 bytes]  Number of entries (big-endian uint32)
[N entries] Each entry:
  [4 bytes] Node ID
  [N bytes] Endpoint as string (null-terminated)

Every 10 seconds, each beacon broadcasts its full node list to all known peer beacons. This is a simple, reliable mechanism: there are no deltas to reconcile, and a beacon that misses one broadcast simply converges on the next.
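The MsgSync layout above can be encoded and decoded in a few lines. A sketch, with illustrative helper names:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const MsgSync = 0x07

// EncodeSync serializes a node table in the MsgSync wire format:
// type, entry count, then (node ID, null-terminated endpoint) pairs.
func EncodeSync(nodes map[uint32]string) []byte {
	buf := []byte{MsgSync, 0, 0, 0, 0}
	binary.BigEndian.PutUint32(buf[1:5], uint32(len(nodes)))
	for id, ep := range nodes {
		entry := make([]byte, 4)
		binary.BigEndian.PutUint32(entry, id)
		buf = append(buf, entry...)
		buf = append(buf, ep...)
		buf = append(buf, 0) // null terminator
	}
	return buf
}

// DecodeSync parses a MsgSync back into a node table.
func DecodeSync(msg []byte) map[uint32]string {
	count := binary.BigEndian.Uint32(msg[1:5])
	nodes := make(map[uint32]string, count)
	off := 5
	for i := uint32(0); i < count; i++ {
		id := binary.BigEndian.Uint32(msg[off : off+4])
		off += 4
		end := off
		for msg[end] != 0 {
			end++
		}
		nodes[id] = string(msg[off:end])
		off = end + 1
	}
	return nodes
}

func main() {
	table := map[uint32]string{7: "34.148.103.117:4000"}
	got := DecodeSync(EncodeSync(table))
	fmt.Println(got[7]) // 34.148.103.117:4000
}
```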

Cross-beacon relay

When Agent A is connected to Beacon 1 and Agent B is connected to Beacon 2, relay traffic crosses beacons:

Cross-beacon relay:
Agent A ──→ Beacon 1 ──→ Beacon 2 ──→ Agent B

Agent A sends MsgRelay to Beacon 1 (destination: Agent B)
Beacon 1 does not know Agent B directly
Beacon 1 knows (via gossip) that Beacon 2 has Agent B
Beacon 1 forwards the MsgRelay to Beacon 2
Beacon 2 sends MsgRelayDeliver to Agent B

This creates a mesh topology where any agent can reach any other agent, regardless of which beacon they are connected to. The beacon network forms a routing backbone that agents use transparently.
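The forwarding decision each beacon makes can be sketched as a two-level lookup: deliver locally when the destination is attached here, otherwise forward to whichever peer beacon gossip says owns it. The types and field names below are illustrative, not Pilot's actual source.

```go
package main

import "fmt"

// Beacon holds the two routing tables a beacon consults for relay
// traffic: agents attached directly, and agents learned via gossip.
type Beacon struct {
	Local  map[uint32]string // nodeID → agent endpoint (attached here)
	Remote map[uint32]string // nodeID → peer beacon address (via gossip)
}

// Route returns where a MsgRelay for dest should go next.
func (b *Beacon) Route(dest uint32) (next string, local bool, ok bool) {
	if ep, found := b.Local[dest]; found {
		return ep, true, true // send MsgRelayDeliver directly
	}
	if peer, found := b.Remote[dest]; found {
		return peer, false, true // forward the MsgRelay to the peer beacon
	}
	return "", false, false // unknown destination
}

func main() {
	b := &Beacon{
		Local:  map[uint32]string{1: "34.148.103.117:4000"},
		Remote: map[uint32]string{2: "beacon2:9000"},
	}
	next, local, _ := b.Route(2)
	fmt.Println(next, local) // beacon2:9000 false
}
```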

Cloud VM Shortcut

Cloud VMs with stable public IPs do not need STUN at all. Pilot provides a --endpoint flag that skips the STUN phase entirely:

# Cloud VM with a known public IP — skip STUN
pilotctl daemon start --endpoint 34.148.103.117:4000 --public

This tells the daemon: "My public endpoint is 34.148.103.117:4000. Do not bother with STUN. Register this endpoint directly."

The --public flag makes the agent visible in the registry, which is typically what you want for a cloud VM that is meant to accept connections.

When to use --endpoint

Do not use --endpoint behind a NAT. The address must be reachable from the public internet. If you specify a wrong endpoint, peers will attempt to connect and fail.

Performance by Traversal Tier

The traversal tier has a direct impact on latency and throughput. Here are real-world numbers from Pilot's test fleet deployed across GCP regions:

Tier                   RTT             Throughput      Setup Time
Direct (same region)   ~2ms            ~850 Mbps       ~50ms
Direct (cross-region)  ~40ms           ~400 Mbps       ~80ms
Hole-punched           ~5ms overhead   Same as direct  ~600ms
Relay (same beacon)    ~15ms overhead  ~200 Mbps       ~100ms
Relay (cross-beacon)   ~25ms overhead  ~120 Mbps       ~150ms

Key observations: hole-punching pays a one-time setup cost (~600ms) and then performs like a direct connection, while relay caps throughput and adds latency on every packet for the life of the connection.

The automatic fallback

Pilot's DialConnection function implements the fallback transparently:

// DialConnection: try direct, then relay
func DialConnection(peerNodeID uint32) (*Conn, error) {
    endpoint := registry.Lookup(peerNodeID)

    // Tier 1: try direct connection (3 attempts)
    for i := 0; i < 3; i++ {
        conn, err := directDial(endpoint)
        if err == nil {
            return conn, nil
        }
        time.Sleep(200 * time.Millisecond)
    }

    // Tier 2: request hole-punch via beacon
    beacon.SendPunchRequest(peerNodeID)
    time.Sleep(500 * time.Millisecond)
    for i := 0; i < 3; i++ {
        conn, err := directDial(endpoint)
        if err == nil {
            return conn, nil
        }
        time.Sleep(200 * time.Millisecond)
    }

    // Tier 3: fall back to relay (3 attempts)
    for i := 0; i < 3; i++ {
        conn, err := relayDial(peerNodeID)
        if err == nil {
            return conn, nil
        }
        time.Sleep(200 * time.Millisecond)
    }

    return nil, fmt.Errorf("all traversal tiers failed")
}

The caller does not know or care which tier succeeded. They get a *Conn that works, with encryption active regardless of the path.

Diagnosing NAT Issues

When connections fail, Pilot provides diagnostic tools:

# Check your agent's NAT situation
pilotctl --json info

# Trace the connection path to a peer
pilotctl traceroute agent-b

# Check if a peer is reachable
pilotctl ping agent-b --count 4

# View active connections and their types
pilotctl --json connections
# Each connection shows whether it's direct or relayed

The info output includes the STUN-discovered public endpoint, which tells you what external IP:port your agent is reachable at. If this differs from what you expect, you likely have a multi-layered NAT (NAT behind NAT), which behaves like symmetric NAT and requires relay.
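One check behind that diagnostic is simple: if the local socket address differs from the STUN-discovered endpoint, at least one NAT rewrote the packet on the way out. A sketch of that comparison (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"net"
)

// BehindNAT reports whether the STUN-observed endpoint differs from
// the local bind address, i.e. something translated the packet in
// flight. A changed IP means address translation; a changed port with
// the same IP still indicates a port-rewriting NAT in the path.
func BehindNAT(localAddr, stunAddr string) (bool, error) {
	lh, lp, err := net.SplitHostPort(localAddr)
	if err != nil {
		return false, err
	}
	sh, sp, err := net.SplitHostPort(stunAddr)
	if err != nil {
		return false, err
	}
	return lh != sh || lp != sp, nil
}

func main() {
	nat, _ := BehindNAT("192.168.1.5:4000", "34.148.103.117:4000")
	fmt.Println(nat) // true
}
```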


Design Decisions and Tradeoffs

Why UDP instead of TCP?

TCP hole-punching is unreliable. TCP's three-way handshake (SYN, SYN-ACK, ACK) requires both sides to have open ports at the right time. If the first SYN arrives before the other side's NAT has been punched, it is dropped or answered with a RST, and the connection attempt dies. UDP is fire-and-forget: lost packets just disappear, and the next attempt can succeed.

Pilot implements its own reliable transport on top of UDP: sliding window, congestion control (AIMD), flow control, Nagle's algorithm, and retransmission. This gives TCP-like reliability with UDP's NAT traversal properties.
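AIMD itself fits in a few lines: grow the congestion window additively per acknowledged round trip, halve it on loss. The sketch below illustrates the shape of that control loop; the constants and type are illustrative, not Pilot's tuning.

```go
package main

import "fmt"

// AIMD tracks a congestion window in packets.
type AIMD struct {
	Window float64
}

// OnAck grows the window additively (+1 packet per acked RTT).
func (c *AIMD) OnAck() { c.Window++ }

// OnLoss halves the window (multiplicative decrease), never dropping
// below one packet in flight.
func (c *AIMD) OnLoss() {
	c.Window /= 2
	if c.Window < 1 {
		c.Window = 1
	}
}

func main() {
	c := &AIMD{Window: 1}
	for i := 0; i < 9; i++ {
		c.OnAck() // nine clean round trips
	}
	fmt.Println(c.Window) // 10
	c.OnLoss() // a loss event halves the window
	fmt.Println(c.Window) // 5
}
```

The sawtooth this produces is what keeps competing tunnels sharing a bottleneck link fairly.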

Why a beacon instead of a TURN server?

TURN (RFC 5766) is the standard relay protocol, but it is complex: allocations, permissions, channels, and lifetime management. Pilot's beacon relay is simpler by design.

The beacon processes each relay packet independently. There is no session state on the beacon. This makes it horizontally scalable: add more beacons, and the gossip protocol distributes agents across them.

Why full-state gossip instead of incremental updates?

Full-state broadcast every 10 seconds is wasteful in terms of bandwidth, but it is maximally simple and robust. With 10,000 agents at ~50 bytes per entry, the gossip message is ~500 KB every 10 seconds. For a beacon with gigabit connectivity, this is negligible.

Incremental updates (deltas) are more efficient but introduce complexity: ordering, conflict resolution, and reconciliation. For the beacon's use case, full-state wins on simplicity.

Putting It All Together

Here is the complete journey of a connection between two agents behind different NAT types:

Agent A: behind Restricted Cone NAT
Agent B: behind Symmetric NAT

1. Agent A starts daemon
   → STUN: discovers public endpoint 34.148.103.117:4000
   → Registers with registry

2. Agent B starts daemon
   → STUN: discovers public endpoint 34.79.161.216:32761
   → Registers with registry

3. Agent A: pilotctl connect agent-b --message "hello"
   → DialConnection(B's nodeID)
   → Lookup B's endpoint: 34.79.161.216:32761
   → Try direct: FAIL (B is behind Symmetric NAT, port changed)
   → Try direct: FAIL
   → Try direct: FAIL
   → Request hole-punch via beacon
   → Try direct: FAIL (Symmetric NAT: port is different now)
   → Try direct: FAIL
   → Try direct: FAIL
   → Fall back to relay
   → MsgRelay to beacon, destination: B
   → Beacon forwards as MsgRelayDeliver to B
   → B receives, responds via relay
   → Connection established (relayed, encrypted)

4. Message "hello" delivered over the relayed tunnel

Agent A's code is simply pilotctl connect agent-b --message "hello". The repeated direct attempts, the hole-punch request, and the relay fallback all happen inside DialConnection. The agent developer never sees any of this.

For more on the transport protocol that runs on top of these tunnels, see How Pilot Protocol Works. For benchmarks comparing direct and relayed connections, see Benchmarking Agent Communication. For the encryption layer that protects all traffic regardless of traversal tier, see Zero-Dependency Agent Encryption.

Connect agents through any NAT

Install Pilot Protocol and establish your first connection. NAT traversal is automatic; you do not need to configure anything.
