
Lightweight Swarm Communication for Drones and Robots

February 25, 2026 · robotics · swarm · edge

"The network grinds to a halt 90% of the time when I add a second robot." This complaint from a ROS2 user captures a problem that robotics teams encounter repeatedly. DDS, the middleware underneath ROS2, uses multicast for peer discovery. On WiFi networks -- where most mobile robots and drones actually operate -- multicast is unreliable, generates broadcast storms, and consumes bandwidth that the robots need for telemetry and coordination. The result: adding more robots to the swarm makes communication worse, not better.

The question robotics teams keep asking is straightforward: what affordable, stable system provides communication between multiple robots without the overhead of ROS2/DDS? This article examines why the DDS multicast model breaks for swarms, what lightweight alternatives exist, and how a single-socket UDP overlay with tag-based discovery can coordinate robot swarms on constrained hardware.

The DDS/ROS2 Multicast Problem

ROS2 uses DDS (Data Distribution Service) as its communication middleware. DDS was designed for military and industrial systems where participants are on the same wired LAN with reliable multicast support. It works well in that environment. The problems start when you move to WiFi, cellular, or mixed networks -- which is where most modern robot swarms actually operate.

Discovery storms. DDS uses SPDP (Simple Participant Discovery Protocol) to find other participants on the network. Each participant periodically multicasts its presence. With 10 robots, that is 10 discovery messages every few seconds, each broadcast to all network participants. On WiFi, multicast frames are transmitted at the lowest data rate (often 1 Mbps on 802.11b/g compatibility mode), consuming disproportionate airtime. With 50 robots, the discovery traffic alone can saturate the WiFi channel.
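To get a feel for the scale of the problem, here is a back-of-envelope airtime calculation. The frame size and announcement period are illustrative assumptions, not measured DDS defaults:

```go
package main

import "fmt"

func main() {
	// Illustrative assumptions (not measured DDS defaults):
	const (
		frameBytes = 400.0 // assumed size of one SPDP announcement frame
		basicRate  = 1e6   // multicast sent at the 1 Mbps basic rate (bits/s)
		periodSec  = 2.0   // assumed announcement interval per participant
	)
	airtimePerFrame := frameBytes * 8 / basicRate // seconds on air per frame

	for _, robots := range []int{10, 50, 100} {
		// Every announcement occupies the shared channel once per period.
		fraction := float64(robots) * airtimePerFrame / periodSec
		fmt.Printf("%3d robots: %4.1f%% of channel airtime on discovery alone\n",
			robots, fraction*100)
	}
}
```

Even with these conservative numbers, 50 robots spend several percent of the channel on announcements alone, and that counts only the raw frames: 802.11 preamble, contention overhead, and user-data multicast push the real figure higher.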

50% latency overhead. Benchmarks consistently show that ROS2 adds roughly 50% latency on top of raw DDS. The ROS2 middleware layer -- type support, serialization, executor callbacks -- adds measurable delay to every message. For a swarm exchanging position updates at 10 Hz, that overhead translates directly into staler position data and slower reactions to neighboring robots.

WiFi multicast is fundamentally broken. WiFi access points handle multicast by converting it to broadcast, which is transmitted without acknowledgment or retransmission. If a robot misses a multicast frame due to interference (common in outdoor environments with 2.4 GHz congestion), the message is silently lost. DDS has reliability mechanisms, but they depend on the transport layer delivering the multicast in the first place. On WiFi, this assumption fails routinely.

Memory and CPU overhead. A ROS2 node with DDS middleware typically consumes 50-100 MB of RSS depending on the number of topics and participants. On a Raspberry Pi 4 with 4 GB of RAM running a navigation stack, vision processing, and motor control, the communication middleware alone uses roughly 1-2.5% of available memory. On a Jetson Nano with 2 GB, it is 2.5-5%. Every megabyte matters when you are running neural networks on edge hardware.
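You can check these numbers on your own hardware by reading the resident set size with ps; `my_ros2_node` below is a placeholder for your actual node binary:

```shell
# RSS of a running process in MB (Linux). "my_ros2_node" is a placeholder.
pid=$(pidof my_ros2_node)
ps -o rss= -p "$pid" | awk '{printf "%.1f MB\n", $1/1024}'
```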

Why Swarms Need Something Lighter Than ROS2

Robot swarms have specific communication requirements that differ from single-robot ROS2 systems: many peers joining and leaving the network, lossy WiFi or cellular links, constrained compute and memory budgets, and field networks where participants sit behind NAT.

ROS2/DDS was not designed for these constraints. It was designed for fixed installations with wired networks and powerful compute nodes. Using it for swarms works in the lab (where the WiFi is clean and the robot count is small) but fails in the field.

Pilot's Approach: Single Socket, No Multicast, Tag Discovery

Pilot Protocol takes a fundamentally different approach to the problems DDS struggles with.

Single UDP socket. Each Pilot daemon uses one UDP socket for all communication. There is no multicast, no broadcast, no SPDP discovery traffic. All messages are unicast -- sent directly from one peer to another. On WiFi, unicast frames are transmitted at the highest negotiated data rate with acknowledgment and retransmission. This means every Pilot message gets the reliability that WiFi provides for unicast, which multicast discards.

Tag-based discovery. Instead of multicast discovery, robots register with a lightweight registry server and tag themselves with capabilities. Other robots query the registry to discover peers. The registry is a single process that can run on the ground station, a cloud VM, or one of the robots themselves. Discovery traffic is unicast TCP to the registry, not multicast on the WiFi channel.

# Each drone tags itself on startup
pilotctl set-tags swarm-member drone quadcopter survey-team-alpha

# Ground station discovers all drones
pilotctl find-by-tag drone --json

# Or find specific capabilities
pilotctl find-by-tag drone thermal-camera --json

10 MB RSS per daemon. The Pilot daemon is a single Go binary with no external dependencies. It uses 10 MB of RSS at idle, which is 5-10x less than a ROS2 node with DDS middleware. On a Raspberry Pi 4, this is 0.25% of available memory. On a Jetson Nano, it is 0.5%. The remaining resources are available for the robot's actual workload: navigation, perception, and control.

# Memory comparison on Raspberry Pi 4 (4GB)
# ROS2 Humble + FastDDS:  ~80MB RSS  (2.0% of RAM)
# ROS2 Humble + CycloneDDS: ~60MB RSS (1.5% of RAM)
# Pilot daemon:            ~10MB RSS  (0.25% of RAM)

Architecture: Drone Swarm Coordinating via Event Stream

Here is a concrete architecture for a swarm of 10 survey drones coordinated by a ground station. Each drone publishes telemetry (position, battery, status) to the event stream. The ground station subscribes to all telemetry, computes the survey plan, and publishes waypoints back to individual drones.

Components:

  - 10 survey drones, each running a Pilot daemon on its companion computer.
  - One ground station running a Pilot daemon, the registry, and the survey planner.
  - A beacon server for NAT traversal when drones connect over cellular links.

Communication flow:

  1. Each drone publishes telemetry to swarm.telemetry.drone-N every 100ms.
  2. Ground station subscribes to swarm.telemetry.* -- receives all drone telemetry.
  3. Ground station publishes waypoints to swarm.command.drone-N for each drone.
  4. Each drone subscribes to swarm.command.drone-N (its own command channel).
  5. Drones publish events to swarm.events.* (low battery, obstacle detected, survey complete).

# Drone startup sequence
pilotctl daemon start
pilotctl join --network 1
pilotctl set-tags swarm-member drone drone-07 survey-team-alpha

# Ground station startup
pilotctl daemon start
pilotctl join --network 1
pilotctl set-tags ground-station survey-team-alpha
pilotctl subscribe "swarm.telemetry.*"
pilotctl subscribe "swarm.events.*"

Code Example: Telemetry Publishing and Subscribing

Here is a Go program that runs on each drone, publishing GPS telemetry at 10 Hz over the Pilot event stream.

package main

import (
    "encoding/json"
    "fmt"
    "os"
    "time"

    "github.com/TeoSlayer/pilotprotocol/pkg/driver"
)

type Telemetry struct {
    DroneID    string  `json:"drone_id"`
    Lat        float64 `json:"lat"`
    Lon        float64 `json:"lon"`
    AltM       float64 `json:"alt_m"`
    SpeedMps   float64 `json:"speed_mps"`
    HeadingDeg float64 `json:"heading_deg"`
    BatteryPct int     `json:"battery_pct"`
    State      string  `json:"state"` // "idle", "flying", "returning", "landed"
    Ts         int64   `json:"ts"`
}

func main() {
    droneID := os.Getenv("DRONE_ID") // e.g., "drone-07"
    if droneID == "" {
        droneID = "drone-01"
    }

    d, err := driver.Connect()
    if err != nil {
        panic(err)
    }
    stream, err := d.OpenEventStream()
    if err != nil {
        panic(err)
    }

    topic := fmt.Sprintf("swarm.telemetry.%s", droneID)

    // Also subscribe to commands addressed to this drone
    cmdTopic := fmt.Sprintf("swarm.command.%s", droneID)
    cmdCh, err := stream.Subscribe(cmdTopic)
    if err != nil {
        panic(err)
    }

    // Handle incoming commands in background
    go func() {
        for event := range cmdCh {
            var cmd struct {
                Action      string  `json:"action"`
                WaypointLat float64 `json:"waypoint_lat,omitempty"`
                WaypointLon float64 `json:"waypoint_lon,omitempty"`
                WaypointAlt float64 `json:"waypoint_alt,omitempty"`
            }
            if err := json.Unmarshal(event.Data, &cmd); err != nil {
                continue // ignore malformed commands
            }
            fmt.Printf("[CMD] %s: %+v\n", cmd.Action, cmd)
            // Send command to flight controller...
        }
    }()

    // Publish telemetry at 10 Hz
    ticker := time.NewTicker(100 * time.Millisecond)
    defer ticker.Stop()

    for range ticker.C {
        telem := Telemetry{
            DroneID:    droneID,
            Lat:        readGPSLat(),
            Lon:        readGPSLon(),
            AltM:       readAltimeter(),
            SpeedMps:   readAirspeed(),
            HeadingDeg: readCompass(),
            BatteryPct: readBattery(),
            State:      getFlightState(),
            Ts:         time.Now().UnixMilli(),
        }
        data, _ := json.Marshal(telem)
        stream.Publish(topic, data)
    }
}

// Hardware interface stubs -- replace with actual sensor reads
func readGPSLat() float64    { return 37.7749 }
func readGPSLon() float64    { return -122.4194 }
func readAltimeter() float64 { return 50.0 }
func readAirspeed() float64  { return 5.2 }
func readCompass() float64   { return 270.0 }
func readBattery() int       { return 85 }
func getFlightState() string { return "flying" }

And the ground station subscriber that aggregates telemetry from all drones and sends waypoint commands:

package main

import (
    "encoding/json"
    "fmt"

    "github.com/TeoSlayer/pilotprotocol/pkg/driver"
)

type Telemetry struct {
    DroneID    string  `json:"drone_id"`
    Lat        float64 `json:"lat"`
    Lon        float64 `json:"lon"`
    AltM       float64 `json:"alt_m"`
    BatteryPct int     `json:"battery_pct"`
    State      string  `json:"state"`
    Ts         int64   `json:"ts"`
}

type Waypoint struct {
    Action      string  `json:"action"`
    WaypointLat float64 `json:"waypoint_lat"`
    WaypointLon float64 `json:"waypoint_lon"`
    WaypointAlt float64 `json:"waypoint_alt"`
}

func main() {
    d, err := driver.Connect()
    if err != nil {
        panic(err)
    }
    stream, err := d.OpenEventStream()
    if err != nil {
        panic(err)
    }

    // Subscribe to ALL drone telemetry
    telemCh, err := stream.Subscribe("swarm.telemetry.*")
    if err != nil {
        panic(err)
    }

    // Subscribe to swarm events (low battery, obstacles, etc.)
    eventCh, err := stream.Subscribe("swarm.events.*")
    if err != nil {
        panic(err)
    }

    // Handle events in background
    go func() {
        for event := range eventCh {
            fmt.Printf("[EVENT] %s: %s\n", event.Topic, string(event.Data))
        }
    }()

    // Process telemetry and track per-drone state
    droneState := make(map[string]Telemetry)
    updates := 0

    for event := range telemCh {
        var telem Telemetry
        if err := json.Unmarshal(event.Data, &telem); err != nil {
            continue // skip malformed telemetry
        }
        droneState[telem.DroneID] = telem

        // Check for low battery and send return-to-home command
        if telem.BatteryPct < 20 && telem.State == "flying" {
            fmt.Printf("[ALERT] %s battery at %d%%, sending RTH\n",
                telem.DroneID, telem.BatteryPct)

            cmd := Waypoint{
                Action:      "return-to-home",
                WaypointLat: 37.7749, // Home base coordinates
                WaypointLon: -122.4194,
                WaypointAlt: 30.0,
            }
            data, _ := json.Marshal(cmd)
            cmdTopic := fmt.Sprintf("swarm.command.%s", telem.DroneID)
            stream.Publish(cmdTopic, data)
        }

        // Display swarm status every 100 telemetry updates
        updates++
        if updates%100 == 0 {
            fmt.Printf("[SWARM] %d drones active\n", len(droneState))
        }
    }
}

NAT Traversal for Field Deployment

Field deployments introduce a network topology that lab setups do not have. Drones connect via cellular (4G/5G), ground robots via WiFi, and the ground station via a wired connection or a different WiFi network. These networks assign private IPs behind NAT, and the participants cannot directly reach each other.

Pilot handles this transparently with its three-stage NAT traversal:

  1. STUN discovery. Each drone discovers its public IP and NAT type by querying the beacon server over UDP. This takes ~50ms.
  2. Hole-punching. For restricted-cone and port-restricted-cone NAT (the most common types on consumer and carrier networks), the beacon coordinates a simultaneous hole-punch. Both peers send a UDP packet to each other's public endpoint at the same time, creating a NAT mapping that allows bidirectional traffic.
  3. Relay fallback. For symmetric NAT (common on enterprise networks and some cellular carriers), where hole-punching cannot work, the beacon relays traffic. This adds ~10ms of latency (one extra hop) but guarantees connectivity.

# Drone on cellular (behind carrier-grade NAT)
pilotctl daemon start --beacon beacon.example.com:9001
# NAT type is detected automatically
# Hole-punch or relay is selected automatically
# No configuration needed on the drone

# Ground station on WiFi (behind home router NAT)
pilotctl daemon start --beacon beacon.example.com:9001
# Same: automatic NAT traversal

# Both can now communicate regardless of NAT type
pilotctl ping 1:0001.0002.0001  # Ground station pings drone

This is critical for field deployment. You do not ask a farmer to configure port forwarding on the cellular modem attached to a survey drone. You do not ask a search-and-rescue team to set up a VPN before deploying robots in a disaster zone. The communication just works because NAT traversal is built into the protocol, not bolted on as a configuration step.

Comparison: Pilot vs ROS2/DDS vs MAVLink vs Zenoh

| Property | ROS2/DDS | MAVLink | Zenoh | Pilot |
| --- | --- | --- | --- | --- |
| Discovery | Multicast SPDP | Heartbeat | Scouting (multicast) | Registry (unicast) |
| Transport | UDP multicast + unicast | Serial / UDP | UDP / TCP / QUIC | UDP unicast |
| WiFi compatibility | Poor (multicast issues) | Good (unicast) | Mixed (multicast scout) | Good (unicast only) |
| NAT traversal | No | No | Router-based | STUN + hole-punch + relay |
| Encryption | DDS Security (complex) | MAVLink 2 signing | TLS / QUIC | AES-256-GCM (built-in) |
| Memory (daemon) | 60-100 MB | ~1 MB (library) | ~20 MB | ~10 MB |
| QoS/Reliability | Full DDS QoS | None (best-effort) | Reliability levels | Congestion control, flow control |
| Pub/Sub | Yes (DDS topics) | No (point-to-point) | Yes | Yes (event stream) |
| Ecosystem | Massive (ROS packages) | Autopilot-focused | Growing | Agent-focused |
| Target use case | General robotics | UAV telemetry/command | IoT / Edge / Robotics | Agent networks / Swarms |

ROS2/DDS has the largest ecosystem. If you need MoveIt for manipulation, Nav2 for navigation, or integration with hundreds of sensor drivers, ROS2 is the pragmatic choice despite its communication overhead. The communication layer is the price you pay for the ecosystem.

MAVLink is purpose-built for drone autopilots. It is extremely lightweight (~1 MB) and well-suited for telemetry and command between a ground station and drones using standard autopilot firmware (ArduPilot, PX4). But it is point-to-point and has no pub/sub, no encryption (MAVLink 2 signing authenticates messages but does not encrypt them), and no NAT traversal. For swarm coordination beyond basic telemetry, you need something on top of MAVLink.

Zenoh is the closest competitor for this use case. It has low overhead (~20 MB), supports pub/sub, and has a router-based approach to bridging networks. However, its scouting mechanism uses multicast by default (though it can be configured for unicast), and its NAT traversal relies on Zenoh routers rather than built-in hole-punching. For constrained environments, Zenoh is a strong option.

Pilot occupies the niche of zero-configuration swarm communication. No multicast, built-in NAT traversal, 10 MB footprint, and encrypted by default. It trades the rich QoS model of DDS for simplicity and the massive ecosystem of ROS2 for lightweight deployment.

Limitations: Not Real-Time, Not Safety-Critical

Honesty about limitations matters more than feature lists. Pilot is not suitable for everything in robotics:

  - Not hard real-time. The daemon runs over UDP on a general-purpose OS and cannot guarantee the deterministic latency bounds that motor control or flight stabilization loops require. Keep those loops on the flight controller or a real-time bus.
  - Not safety-critical. Pilot has no safety certification. Failsafes like loss-of-link return-to-home belong in the autopilot firmware, not on the network.
  - No DDS-style QoS policies. Congestion control and flow control are built in, but there are no deadline, liveliness, or durability guarantees.

The practical architecture: use ROS2 for intra-robot communication (sensors, actuators, navigation) and Pilot for inter-robot communication (swarm telemetry, task assignment, coordination). Each layer does what it is good at.

Getting Started with a Robot Swarm

Set up a 3-robot swarm with a ground station:

# On the ground station (also runs the registry)
go install github.com/TeoSlayer/pilotprotocol/cmd/pilotctl@latest
pilotctl daemon start
pilotctl set-tags ground-station swarm-controller
pilotctl subscribe "swarm.telemetry.*"

# On each robot (Raspberry Pi / Jetson Nano)
go install github.com/TeoSlayer/pilotprotocol/cmd/pilotctl@latest
pilotctl daemon start --beacon ground-station.local:9001
pilotctl join --network 1
pilotctl set-tags swarm-member robot ground-unit

# Verify connectivity
pilotctl find-by-tag swarm-member --json
pilotctl ping 1:0001.0001.0001

# Start publishing telemetry
pilotctl publish swarm.telemetry.robot-01 '{"lat":37.77,"lon":-122.42,"battery":95}'

The entire communication stack -- daemon, registry client, NAT traversal, encryption -- is a single binary with zero external dependencies. Cross-compile it for ARM (Raspberry Pi) or ARM64 (Jetson) with one command:

GOOS=linux GOARCH=arm64 go build -o pilotctl ./cmd/pilotctl
# Copy the binary to the robot
scp pilotctl [email protected]:/usr/local/bin/

No ROS2 installation. No DDS configuration. No multicast tuning. A single 15 MB binary that runs on any Linux ARM device and provides encrypted, NAT-traversing swarm communication out of the box.

Try Pilot Protocol

10 MB daemon, single UDP socket, no multicast storms. Coordinate robot swarms on constrained hardware with zero-configuration networking.

View on GitHub