Chain AI Models Across Machines With Persistent Tunnels
A single complex query now routinely requires 5 to 10 model invocations. The user asks a question. An orchestrator routes it to a retrieval model. The retrieval model queries an embedding model. The results go to a reasoning model. The reasoning model's output feeds a code generation model. The generated code gets validated by yet another model. Each invocation adds hundreds of milliseconds of latency, and when they happen over HTTP, the connection overhead compounds at every hop.
Loading all these models on a single machine risks out-of-memory crashes. A 7B model needs roughly 14 GB of VRAM in FP16. Two models exhaust a consumer GPU. Three crash the process. The obvious solution is to spread models across machines, but then you need a way to connect them -- and the way you connect them determines whether your pipeline takes 800ms or 5 seconds.
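That arithmetic is easy to sanity-check before provisioning hardware. A minimal sketch (weights only; activations, KV cache, and CUDA overhead come on top, which is exactly why a "theoretical fit" can still crash in practice):

```go
package main

import "fmt"

// estimateWeightsGB returns the approximate VRAM needed for model weights
// alone: parameter count times bytes per parameter, in decimal gigabytes.
// Activations, KV cache, and CUDA context overhead add more on top.
func estimateWeightsGB(params, bytesPerParam float64) float64 {
	return params * bytesPerParam / 1e9
}

func main() {
	fmt.Println(estimateWeightsGB(7e9, 2)) // 7B in FP16: 14
	fmt.Println(estimateWeightsGB(7e9, 1)) // 7B in INT8: 7
}
```

Two FP16 7B models already claim 28 GB for weights alone, which is how a 32 GB GPU runs out of headroom once inference starts.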
The Multi-Model Problem
Modern AI applications chain specialized models together. A typical pipeline might look like this:
- Machine A (A100 80GB): Large language model for reasoning
- Machine B (T4 16GB): Whisper model for audio transcription
- Machine C (A10G 24GB): Stable Diffusion for image generation
- Machine D (CPU): Orchestrator, embedding model, and lightweight classifiers
The three problems with this architecture are OOM risk, latency compounding, and the KV cache incompatibility wall.
OOM risk. Loading multiple large models on the same GPU is a balancing act. VRAM is not fungible -- each model allocates its weights, activations, and KV cache independently. Two 7B models that theoretically fit in 28 GB of VRAM will in practice crash on a 32 GB GPU because of fragmentation, CUDA overhead, and activation memory spikes during inference. Spreading models across machines is the reliable solution, but it introduces a network dependency.
Latency compounding. If each model invocation takes 200ms of compute time, a 5-stage pipeline takes 1 second of pure compute. Add HTTP connection overhead -- DNS lookup, TCP handshake, TLS negotiation -- at each stage, and you add 50-150ms per hop. That is an extra 250-750ms on top of the compute time, a 25 to 75 percent overhead from networking alone. For real-time applications (voice assistants, interactive agents), this pushes response times past the threshold where users perceive delay.
KV cache incompatibility. Different model frameworks (vLLM, TGI, Ollama, TensorRT-LLM) use incompatible KV cache formats. You cannot pass the internal state from one model to the next. Every hop in the pipeline must serialize the output to text, send it over the wire, and have the next model re-tokenize and re-encode it. This is unavoidable, but it means the network transfer is the only thing you can optimize -- you are already paying the serialization tax.
Why HTTP Per Request Adds Unnecessary Overhead
The standard approach is to give each model an HTTP API (typically REST or gRPC) and have the orchestrator make HTTP calls to each one sequentially. This works, but every request pays the full connection lifecycle cost.
Consider a chain of three model calls over HTTPS:
# Each HTTP call includes:
# - DNS resolution: ~5ms (cached after first)
# - TCP handshake: ~10ms (1 RTT)
# - TLS handshake: ~30ms (2 RTTs)
# - Request/response: ~200ms (model inference)
# - Connection close: ~5ms
# Total per call: ~250ms
# Total for 3 calls: ~750ms
# Of which networking: ~150ms (20%)
# With HTTP keep-alive (connection reuse):
# First call: ~250ms (full setup)
# Subsequent calls: ~200ms (reuse connection)
# Total for 3 calls: ~650ms
# But: keep-alive breaks across machines, through NAT, and after idle timeouts
HTTP keep-alive helps within the same client-server pair, but it has limitations. Keep-alive connections expire after idle timeouts (typically 60 seconds). Load balancers may reassign connections. And if the model services are behind NAT or on different networks, keep-alive requires maintaining public endpoints for each service.
For a pipeline that processes thousands of requests per day across the same set of models, the connection setup overhead is pure waste. What you want is a persistent tunnel: connect once, keep it open indefinitely, and stream data through it for every inference call.
Persistent Tunnels: Connect Once, Stream Continuously
A Pilot Protocol connection is persistent by design. When two agents connect, they establish an encrypted UDP tunnel with keepalive probes every 30 seconds and an idle timeout of 120 seconds. The tunnel survives network changes, NAT rebinding, and transient packet loss. There is no per-request handshake because the connection is already established.
The model for chaining is simple: each model service registers as a Pilot agent and listens on port 80 for HTTP requests over the overlay. The orchestrator connects to each model agent once, then reuses those persistent tunnels for every inference call. The connection setup cost (STUN discovery, key exchange, tunnel establishment) is paid once, not per request.
# Orchestrator connects to each model agent (one-time cost)
pilotctl ping 1:0001.0001.0001 # LLM agent - ~200ms first connection
pilotctl ping 1:0001.0002.0001 # Whisper agent
pilotctl ping 1:0001.0003.0001 # Image gen agent
# After tunnels are established, every subsequent call
# goes directly through the existing encrypted tunnel.
# No DNS, no TCP handshake, no TLS negotiation.
Architecture: Model Agents on Pilot
Each machine runs a Pilot daemon and a model server. The model server listens on port 80 over the Pilot overlay using the Go driver. The orchestrator calls each model sequentially, passing the output of one as the input to the next.
Here is the layout:
- Machine A -- LLM agent, address 1:0001.0001.0001, serves /v1/completions on port 80
- Machine B -- Whisper agent, address 1:0001.0002.0001, serves /v1/transcribe on port 80
- Machine C -- Image agent, address 1:0001.0003.0001, serves /v1/generate on port 80
- Machine D -- Orchestrator, address 1:0001.0004.0001, chains calls across A, B, and C
Each model agent registers with tags describing its capability:
# On Machine A
pilotctl set-tags model-service llm reasoning
# On Machine B
pilotctl set-tags model-service whisper audio
# On Machine C
pilotctl set-tags model-service diffusion image
The orchestrator discovers available models by tag, not by hardcoded address:
pilotctl find-by-tag model-service --json
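That JSON output can feed address selection directly at orchestrator startup. The exact shape emitted by pilotctl is not shown here, so this sketch assumes a minimal array of objects with `address` and `tags` fields; adjust the struct tags to match your pilotctl version:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Agent mirrors an assumed minimal shape of one entry in the
// `pilotctl find-by-tag model-service --json` output.
type Agent struct {
	Address string   `json:"address"`
	Tags    []string `json:"tags"`
}

// pickByTag returns the address of the first agent carrying the given tag.
func pickByTag(agents []Agent, tag string) (string, bool) {
	for _, a := range agents {
		for _, t := range a.Tags {
			if t == tag {
				return a.Address, true
			}
		}
	}
	return "", false
}

func main() {
	// Example discovery output captured at startup (hypothetical shape).
	raw := []byte(`[
		{"address": "1:0001.0001.0001", "tags": ["model-service", "llm", "reasoning"]},
		{"address": "1:0001.0002.0001", "tags": ["model-service", "whisper", "audio"]}
	]`)

	var agents []Agent
	if err := json.Unmarshal(raw, &agents); err != nil {
		panic(err)
	}
	if addr, ok := pickByTag(agents, "llm"); ok {
		fmt.Println("LLM agent:", addr) // 1:0001.0001.0001
	}
}
```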
Code Example: Go Service That Chains Model Calls
Here is a Go orchestrator that receives a user query, chains it through three model agents over persistent Pilot tunnels, and returns the final result.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/TeoSlayer/pilotprotocol/pkg/driver"
)

// Model agent addresses (discovered via tags at startup)
var (
	llmAddr     = "1:0001.0001.0001"
	whisperAddr = "1:0001.0002.0001"
	imageAddr   = "1:0001.0003.0001"
)

type ChainRequest struct {
	AudioURL string `json:"audio_url"`
}

type ChainResponse struct {
	Transcript string `json:"transcript"`
	Analysis   string `json:"analysis"`
	ImageURL   string `json:"image_url"`
	TotalMs    int64  `json:"total_ms"`
}

func main() {
	d, err := driver.Connect()
	if err != nil {
		panic(err)
	}

	// Listen on port 80 over the Pilot overlay
	ln, err := d.Listen(80)
	if err != nil {
		panic(err)
	}

	// Create an HTTP client that routes through Pilot tunnels
	client := &http.Client{
		Transport: d.HTTPTransport(),
		Timeout:   60 * time.Second,
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/chain", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()

		var req ChainRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid request body", http.StatusBadRequest)
			return
		}

		// Stage 1: Transcribe audio via Whisper agent
		transcript, err := callModel(client, whisperAddr,
			"/v1/transcribe", map[string]string{"audio_url": req.AudioURL})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// Stage 2: Analyze transcript via LLM agent
		analysis, err := callModel(client, llmAddr,
			"/v1/completions", map[string]string{
				"prompt": "Analyze this transcript and summarize key points: " + transcript,
			})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		// Stage 3: Generate visualization via Image agent
		imageURL, err := callModel(client, imageAddr,
			"/v1/generate", map[string]string{
				"prompt": "Infographic summarizing: " + analysis,
			})
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		resp := ChainResponse{
			Transcript: transcript,
			Analysis:   analysis,
			ImageURL:   imageURL,
			TotalMs:    time.Since(start).Milliseconds(),
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(resp)
	})

	fmt.Println("Orchestrator listening on port 80")
	http.Serve(ln, mux)
}

func callModel(client *http.Client, agentAddr, path string, payload any) (string, error) {
	body, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}

	// Request goes through the persistent Pilot tunnel:
	// no DNS, no TCP handshake, no TLS negotiation.
	url := fmt.Sprintf("http://%s%s", agentAddr, path)
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", fmt.Errorf("model call to %s failed: %w", agentAddr, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("model at %s returned status %d", agentAddr, resp.StatusCode)
	}

	result, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	var parsed struct {
		Result string `json:"result"`
	}
	if err := json.Unmarshal(result, &parsed); err != nil {
		return "", fmt.Errorf("invalid response from %s: %w", agentAddr, err)
	}
	return parsed.Result, nil
}
The key detail is d.HTTPTransport(). This returns an http.RoundTripper that routes HTTP requests through Pilot tunnels instead of the standard network stack. The http.Client uses it transparently -- the code looks like normal HTTP calls, but the transport is a persistent encrypted tunnel. There are no DNS lookups, no TCP handshakes, no TLS negotiations on each request.
Performance: Persistent Tunnel vs Per-Request HTTP
We benchmarked a 3-stage model chain processing 1,000 sequential inference requests. Each model takes approximately 200ms of compute time (mocked). The total compute time is 600ms per request (3 x 200ms). The question is: how much does the network add?
| Metric | Per-Request HTTPS | HTTP Keep-Alive | Pilot Persistent Tunnel |
|---|---|---|---|
| First request | ~750ms | ~750ms | ~800ms (tunnel setup) |
| Subsequent requests | ~750ms | ~620ms | ~605ms |
| After 60s idle | ~750ms | ~750ms (keep-alive expired) | ~605ms (tunnel persists) |
| 1,000 requests total | ~750s | ~625s | ~605s |
| Network overhead | ~150ms/req (20%) | ~20ms/req (3%) | ~5ms/req (<1%) |
| Works through NAT | Requires public IPs | Requires public IPs | Yes (automatic) |
| Survives network change | No | No | Yes (keepalive + reconnect) |
The persistent tunnel reduces per-request network overhead to under 5ms -- essentially the UDP round-trip time plus minimal framing. Over 1,000 requests, this saves roughly 145 seconds compared to per-request HTTPS. For latency-sensitive applications where every 100ms matters, the tunnel approach reduces tail latency because there are no sporadic TLS handshakes or DNS timeouts.
The more significant advantage is resilience. HTTP keep-alive connections break after idle timeouts, load balancer reshuffling, or network changes. Pilot tunnels persist through all of these because the keepalive mechanism actively maintains the tunnel with probes every 30 seconds. If a probe fails, the tunnel reconnects automatically. For long-running inference pipelines that process requests over hours or days, this eliminates an entire class of transient connection failures.
When to Use This vs Single-Machine Serving
Distributed model chaining adds complexity. It is the right choice when:
- Models exceed single-machine VRAM. If your pipeline needs 3 models that each require 20+ GB of VRAM, you physically cannot fit them on one GPU. Distribute them.
- Models need different hardware. LLMs benefit from A100s. Whisper runs well on T4s. Image generation needs different batch sizes. Matching model to hardware saves cost.
- Models have different scaling profiles. Your LLM handles 10 requests/second. Your embedding model handles 1,000. Running them on the same machine wastes the embedding model's capacity or throttles the LLM.
- Teams own different models. The NLP team maintains the language model. The vision team maintains the image model. Letting each team deploy and scale independently reduces coordination overhead.
Single-machine serving is better when:
- Models fit in memory together. Two 3B models on an A100 80GB work fine. Avoid the distributed complexity.
- Latency is critical below 200ms. No network can match local function calls. If sub-200ms end-to-end latency is a hard requirement, keep everything local.
- Pipeline is simple. Two models in sequence do not justify the overhead of running multiple Pilot daemons. Use subprocess calls or in-process model loading.
The persistent tunnel architecture shines for medium-complexity pipelines (3-10 models) processing sustained traffic (hundreds to thousands of requests per hour) where models are too large to co-locate and need to run on heterogeneous hardware across multiple machines or regions.
Getting Started
Set up the pipeline in four steps:
# Step 1: Install Pilot on each machine
go install github.com/TeoSlayer/pilotprotocol/cmd/pilotctl@latest
# Step 2: Start daemons and join the same network
pilotctl daemon start
pilotctl join --network 1
# Step 3: Tag each model agent
pilotctl set-tags model-service llm # On Machine A
# Step 4: Run your model server using driver.Listen(80)
# The orchestrator discovers agents via tags and chains calls
pilotctl find-by-tag model-service --json
Once the tunnels are established, model calls flow through persistent encrypted connections. The orchestrator code looks like standard HTTP, but the transport is Pilot. No infrastructure changes, no certificate management, no firewall rules. Just persistent tunnels that stay connected and keep your inference pipeline fast.
Try Pilot Protocol
Connect AI model services with persistent encrypted tunnels. No per-request overhead, automatic NAT traversal, and standard HTTP semantics.
View on GitHub