
Secure Research Collaboration: Share Models, Not Data

February 23, 2026 · research · privacy · federated-learning

A medical researcher at University A has a model that detects early-stage lung cancer from CT scans. The model is good, but it was trained on 5,000 images from a single hospital. University B, across the country, has 8,000 additional scans that could significantly improve the model's accuracy. The researcher at University A cannot access University B's data. HIPAA prohibits it. The institutional review board at University B would not approve it. The IT security team at University A cannot whitelist another institution's network. And even if all of those barriers were overcome, the data transfer itself would need to be encrypted, logged, and auditable.

This is the research collaboration paradox: the most valuable scientific progress requires combining data across institutions, and the regulations designed to protect patients, subjects, and institutions make that combination extraordinarily difficult. The researchers know exactly how to improve the model. The legal, compliance, and technical barriers prevent them from doing it.

The result is that most cross-institutional ML collaboration does not happen. Projects that could benefit from larger, more diverse datasets remain constrained to whatever data a single institution can collect. The models are worse, the papers have smaller sample sizes, and the science moves slower than it should.

Current Approaches and Their Gaps

The research community has developed several mechanisms for cross-institutional collaboration. Each solves part of the problem while creating new ones.

Hugging Face Hub and Model Registries

Hugging Face Hub is the de facto standard for sharing pre-trained models. Researchers upload model weights, training configurations, and evaluation metrics. Other researchers download, fine-tune, and build upon them. This works well for public models, but it fails for sensitive research. Clinical models trained on patient data cannot be uploaded to a public repository -- even if the model weights themselves do not contain identifiable information, the institutional data governance policy often prohibits it. The model registry approach assumes openness, and regulated research often cannot be open.

Delta Sharing and Data Clean Rooms

Databricks Delta Sharing provides a protocol for sharing datasets across organizations without copying data. Recipients access the data through a standardized API, and the data owner retains control. The problem is that the data still leaves the originating network in some form -- the recipient's query results contain derived data that may fall under the same regulatory constraints as the source. For HIPAA-covered data, even aggregate statistics can be considered protected health information if the cohort is small enough.

VPNs and Direct Network Connections

The brute-force approach: establish a VPN tunnel between the two institutions, transfer data or model files through the tunnel, and rely on the VPN encryption for security. This works technically, but it fails operationally. University IT departments have different VPN standards, different firewall policies, and different approval timelines. A researcher described the process: "Cross-institution security inconsistency -- each university has different IT policies. Getting a VPN approved took four months and required three meetings with two security committees."

Federated Learning Frameworks

Federated learning (FL) addresses the core problem directly: train a model across multiple institutions without moving the data. Each institution trains on its local data, shares only model updates (gradients or weights), and a central aggregator combines the updates into an improved global model. The data never leaves its home institution.
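The aggregation step at the heart of FL is simple to state. Here is a minimal sketch of federated averaging (FedAvg), the standard aggregation rule, using plain Python lists as stand-ins for real weight tensors:

```python
# Minimal FedAvg sketch: each site trains locally and shares only its
# parameters; the aggregator averages them element-wise. Plain lists
# stand in for real model tensors.

def fedavg(updates):
    """Average a list of weight vectors element-wise (equal site weighting)."""
    n = len(updates)
    return [sum(ws) / n for ws in zip(*updates)]

# Two sites' locally trained weights (illustrative values)
site_a = [1.0, 4.0, 9.0]
site_b = [3.0, 2.0, 7.0]

global_weights = fedavg([site_a, site_b])
print(global_weights)  # [2.0, 3.0, 8.0]
```

Real deployments typically weight each site's contribution by its sample count, but the structure is the same: updates in, one global model out, raw data never moved.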

FL frameworks like Flower, PySyft, and NVIDIA FLARE provide the ML machinery. What they do not provide is the network infrastructure. How do the institutions' training nodes find each other? How do they establish secure connections through institutional firewalls and NATs? How do they authenticate each other? FL frameworks assume that network connectivity is already solved. In practice, it is the hardest part of the deployment.

The gap is not in the ML. The algorithms for federated learning, differential privacy, and secure aggregation are mature. The gap is in the plumbing: encrypted connectivity between institutions that have incompatible network architectures, different security policies, and no common authentication infrastructure.

Pilot as a Transport Layer for Research Collaboration

Pilot Protocol does not do federated learning. It does not do differential privacy. It does not do secure aggregation. What it does is solve the connectivity problem that blocks every other approach: how do two machines at different institutions establish an encrypted, authenticated, NAT-traversing connection without involving either institution's IT department in a months-long VPN approval process?

Each lab runs a Pilot daemon. The daemons connect to a shared rendezvous server (which can be self-hosted on neutral infrastructure). The protocol handles NAT traversal, encryption, and authentication. The researchers exchange model weights, gradient updates, or evaluation results through encrypted tunnels. No plaintext ever touches third-party servers. No VPN configuration required.

# Lab A (University Hospital, behind institutional NAT)
$ pilotctl init --hostname lab-a-trainer
$ pilotctl daemon start --registry rendezvous.research-consortium.org:9000 \
    --beacon rendezvous.research-consortium.org:9001
Agent online: lab-a-trainer (1:0001.0000.0010)

# Lab B (Research Institute, different network)
$ pilotctl init --hostname lab-b-trainer
$ pilotctl daemon start --registry rendezvous.research-consortium.org:9000 \
    --beacon rendezvous.research-consortium.org:9001
Agent online: lab-b-trainer (1:0001.0000.0020)

# Establish trust (cryptographic handshake = collaboration agreement)
# Lab A initiates:
$ pilotctl handshake lab-b-trainer "Federated lung cancer detection study, IRB #2026-0142"
Handshake request sent. Waiting for approval...

# Lab B reviews and approves:
$ pilotctl pending
1:0001.0000.0010 (lab-a-trainer)
  Justification: "Federated lung cancer detection study, IRB #2026-0142"
  Signed by: 5a2f...c8d1 (verified)
$ pilotctl approve 1:0001.0000.0010
Trust established with lab-a-trainer
Encrypted tunnel active (X25519 + AES-256-GCM)

The handshake justification is not a formality. It is a signed, cryptographically verifiable statement. When an auditor asks "who authorized this data exchange and why?", the answer is recorded in the handshake: the IRB number, the study description, the Ed25519 signatures of both parties, and the timestamp. This is better audit evidence than most VPN approval forms.
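As an illustration of why such a record is auditable, here is a sketch of the fields involved. The field names are hypothetical (Pilot's wire format is not shown here), and the real record carries Ed25519 signatures from both parties; since the Python standard library has no Ed25519 primitive, this sketch substitutes a canonical-JSON digest to show how a record becomes tamper-evident:

```python
# Illustrative audit record for a handshake. Field names are hypothetical.
# The real record is Ed25519-signed by both parties (omitted here); the
# canonical serialization + SHA-256 digest shows how any auditor can
# re-derive a stable fingerprint of exactly what was agreed.
import hashlib
import json

record = {
    "initiator": "1:0001.0000.0010",   # lab-a-trainer
    "responder": "1:0001.0000.0020",   # lab-b-trainer
    "justification": "Federated lung cancer detection study, IRB #2026-0142",
    "timestamp": "2026-02-23T14:02:11Z",
}

# Canonical serialization: sorted keys, no extra whitespace
canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
digest = hashlib.sha256(canonical.encode()).hexdigest()

print(digest[:16])  # fingerprint an auditor can independently recompute
```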

Sharing Model Weights Over Encrypted Tunnels

The most common operation in cross-institutional ML collaboration is transferring model weight files. After a round of local training, Lab A sends its updated weights to Lab B (or to a central aggregator). Here is the complete workflow:

# Lab A: Train locally, then send updated weights
$ python train.py --data /local/ct-scans --output /models/round-3-weights.pt
Training complete. Model saved: /models/round-3-weights.pt (142MB)

# Send weights to Lab B via encrypted Pilot tunnel
$ pilotctl send-file lab-b-trainer /models/round-3-weights.pt
Sending: round-3-weights.pt (142MB)
Transfer: ████████████████████ 100% (4.2s, 33.8 MB/s)
File delivered (encrypted, verified)

# Lab B: Receive weights and aggregate
$ ls ~/pilot-received/
round-3-weights.pt

# Lab B aggregates with its own local weights
$ python aggregate.py \
    --local /models/lab-b-round-3.pt \
    --remote ~/pilot-received/round-3-weights.pt \
    --output /models/global-round-3.pt
Aggregation complete. Global model: /models/global-round-3.pt

The file transfer uses Pilot's data exchange port (1001). The entire payload is encrypted with AES-256-GCM. If the connection is relayed through the beacon (because both labs are behind institutional NATs), the beacon sees only encrypted bytes. At no point does the model file exist in cleartext on any third-party infrastructure.
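The transfer output above reports the file as "verified". Independently of whatever verification Pilot performs internally (not specified here), the two labs can cross-check a transferred weight file themselves with a streamed SHA-256 digest -- a sketch:

```python
# Independent integrity check after a weight-file transfer: the sender
# publishes a digest, the receiver recomputes it over the delivered file.
# Streaming keeps memory flat even for multi-GB weight files.
import hashlib
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo with a stand-in file (a real run would use round-3-weights.pt)
demo = Path("demo-weights.bin")
demo.write_bytes(b"\x00" * 1024)
sender_digest = sha256_file(demo)    # computed by Lab A before sending
receiver_digest = sha256_file(demo)  # recomputed by Lab B on arrival
assert sender_digest == receiver_digest
demo.unlink()
```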

For automated workflows, the transfer can be scripted into the training loop:

#!/usr/bin/env python3
"""Federated training round with Pilot transport."""
import os
import subprocess

PARTNER = "lab-b-trainer"
ROUNDS = 10
RECEIVED_DIR = os.path.expanduser("~/pilot-received")

for round_num in range(1, ROUNDS + 1):
    print(f"\n--- Round {round_num}/{ROUNDS} ---")

    # Step 1: Train locally (round-0 weights are the initial shared model)
    subprocess.run([
        "python", "train.py",
        "--data", "/local/ct-scans",
        "--weights", f"/models/global-round-{round_num - 1}.pt",
        "--output", f"/models/local-round-{round_num}.pt",
        "--epochs", "5"
    ], check=True)

    # Step 2: Send local weights to partner, then announce availability
    subprocess.run([
        "pilotctl", "send-file", PARTNER,
        f"/models/local-round-{round_num}.pt"
    ], check=True)
    subprocess.run([
        "pilotctl", "publish", "training/weights-ready",
        "--data", f'{{"round": {round_num}}}'
    ], check=True)

    # Step 3: Wait for the partner's weights announcement
    print("Waiting for partner weights...")
    subprocess.run([
        "pilotctl", "subscribe", "training/weights-ready",
    ], check=True, timeout=600)  # 10 min timeout

    # Step 4: Aggregate (expand ~ ourselves; subprocess does not)
    subprocess.run([
        "python", "aggregate.py",
        "--local", f"/models/local-round-{round_num}.pt",
        "--remote", os.path.join(RECEIVED_DIR, f"local-round-{round_num}.pt"),
        "--output", f"/models/global-round-{round_num}.pt"
    ], check=True)

    # Step 5: Notify partner that aggregation is done
    subprocess.run([
        "pilotctl", "publish", "training/round-complete",
        "--data", f'{{"round": {round_num}, "status": "complete"}}'
    ], check=True)

print("Federated training complete.")

The event stream (port 1002) coordinates the training rounds. Each lab publishes to training/weights-ready when its weights are available and subscribes to training/round-complete to know when to proceed to the next round. The data exchange (port 1001) handles the actual weight file transfer. The entire coordination and transport layer is provided by Pilot. The ML framework -- PyTorch, TensorFlow, JAX -- is the researcher's choice.
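For reuse across scripts, the pub/sub commands above can be wrapped in small helpers. The pilotctl invocations are the ones shown in the workflow; the helper names are illustrative, not part of Pilot:

```python
# Thin helpers around the pilotctl pub/sub commands used in the training
# loop. Building the argv lists in one place keeps topic names and payload
# encoding consistent. Helper names are illustrative.
import json
import subprocess

def publish_cmd(topic, payload):
    """argv for `pilotctl publish <topic> --data <json>`."""
    return ["pilotctl", "publish", topic, "--data", json.dumps(payload)]

def subscribe_cmd(topic):
    """argv for `pilotctl subscribe <topic>`."""
    return ["pilotctl", "subscribe", topic]

def notify_round_complete(round_num):
    """Announce to the partner that this round's aggregation is done."""
    subprocess.run(publish_cmd("training/round-complete",
                               {"round": round_num, "status": "complete"}),
                   check=True)

print(publish_cmd("training/round-complete", {"round": 3, "status": "complete"}))
```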

The Trust Model as Collaboration Agreement

In traditional research collaboration, the collaboration agreement is a legal document. A Data Use Agreement (DUA), a Memorandum of Understanding (MOU), or an IRB-approved protocol. These documents specify who can access what data, for what purpose, and under what conditions. They are essential, but they are not enforced by the technology. Once a VPN is established, the technical controls do not prevent someone from transferring data outside the scope of the agreement.

Pilot's trust model provides a technical analog to the legal agreement. The handshake is a cryptographic commitment: both parties explicitly agree to communicate, the justification is signed and immutable, and either party can revoke trust instantly.

# Revoke collaboration access (either side can do this)
$ pilotctl untrust 1:0001.0000.0020
Trust revoked for lab-b-trainer (1:0001.0000.0020)
Active tunnel torn down
Peer notified

# Lab B can no longer send or receive any data
# The revocation is effective immediately

Self-Hosted Rendezvous: No Data on Third-Party Servers

A critical property for regulated research: no data touches infrastructure that is not controlled by the collaborating institutions. The Pilot rendezvous server can be hosted by either institution or by a neutral third party (a research consortium, a federal computing facility, a university's shared research computing service).

# Option 1: Host rendezvous at one institution
$ pilot-rendezvous -registry-addr :9000 -beacon-addr :9001

# Option 2: Host on neutral research infrastructure
# (e.g., XSEDE/ACCESS allocation, institutional research computing)
$ ssh research-cluster.xsede.org
$ pilot-rendezvous -registry-addr :9000 -beacon-addr :9001

The rendezvous server stores only metadata: virtual addresses, hostnames, public keys, and online status. It never sees the contents of the data exchange. Model weights, gradients, evaluation results -- all of these are encrypted end-to-end between the lab nodes. Even if the rendezvous server is compromised, the attacker learns that Lab A and Lab B are communicating, but not what they are exchanging.
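As a sketch of that minimal footprint, here is roughly what a per-agent rendezvous record contains, based on the fields listed above (the field names are hypothetical, not Pilot's actual schema):

```python
# What the rendezvous server stores per agent, per the description above:
# metadata only, never payload. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class AgentRecord:
    virtual_address: str   # e.g. "1:0001.0000.0010"
    hostname: str          # e.g. "lab-a-trainer"
    public_key: str        # used to verify handshake signatures
    online: bool

entry = AgentRecord(
    virtual_address="1:0001.0000.0010",
    hostname="lab-a-trainer",
    public_key="5a2f...c8d1",  # truncated, as shown in the pending listing
    online=True,
)
# Note what is absent: no model weights, no gradients, no file contents.
print(entry.hostname)
```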

For maximum isolation, each collaboration can run its own rendezvous server. A multi-site cancer study runs one rendezvous. A separate genomics collaboration runs another. The networks are completely independent. Agents on one network cannot discover, enumerate, or connect to agents on the other.

Compliance Properties

Pilot Protocol is not a compliance product. It does not certify your deployment as HIPAA-compliant or GDPR-compliant. But it provides technical properties that compliance teams evaluate when reviewing research infrastructure: end-to-end encryption of every payload (X25519 key exchange, AES-256-GCM), signed and timestamped handshake records that document who authorized each connection and why, self-hosted rendezvous infrastructure so no research data transits servers outside the collaborators' control, and instant, unilateral trust revocation.

Important: These properties support compliance but do not guarantee it. HIPAA compliance requires a full risk assessment, administrative safeguards, physical safeguards, and organizational policies that go far beyond the transport layer. GDPR compliance requires a Data Protection Impact Assessment, lawful basis for processing, and data subject rights implementations. Pilot handles the transport. Your institution handles the rest.

Combining With Federated Learning Frameworks

The most powerful use of Pilot in research is as the transport layer for a federated learning framework. Pilot handles the hard networking problems (NAT traversal, encryption, authentication). The FL framework handles the hard ML problems (aggregation, differential privacy, convergence).

Here is how the layers combine:

# Layer stack for cross-institutional federated learning:
#
# ┌────────────────────────────────────────────┐
# │  ML Framework (Flower / PySyft / FLARE)    │  Training, aggregation, DP
# ├────────────────────────────────────────────┤
# │  Application (train.py / aggregate.py)     │  Study-specific logic
# ├────────────────────────────────────────────┤
# │  Pilot Protocol                            │  Addressing, encryption, NAT
# ├────────────────────────────────────────────┤
# │  UDP / IP                                  │  Physical transport
# └────────────────────────────────────────────┘

Pilot replaces the network configuration layer in the FL framework. Instead of configuring IP addresses, ports, TLS certificates, and VPN tunnels, you configure Pilot hostnames. The FL clients connect to each other using virtual addresses that work across any network topology.

This has been tested with gradient exchange via the data exchange port. Two nodes on different continents, both behind NAT, exchanged model gradients every 30 seconds over Pilot tunnels. The latency overhead of the overlay network was approximately 5-15ms per transfer -- negligible compared to the minutes-long training rounds that produce the gradients.

Scaling to Multiple Institutions

Research consortia often involve more than two institutions. A multi-site clinical trial might have 10 hospitals contributing data. A genomics consortium might have 20 labs. The trust model scales naturally: each link that needs to communicate gets its own handshake, so in a star topology each site handshakes only with the aggregator. The rendezvous server supports thousands of agents on minimal infrastructure (3 VMs handle 10,000 agents).

# 5-institution research consortium
# Each institution runs a Pilot node tagged by role

# Aggregator (hosted by consortium coordinator)
$ pilotctl init --hostname fl-aggregator
$ pilotctl set-tags aggregator lung-cancer-study
$ pilotctl set-visibility public

# Each institution's trainer connects and handshakes with the aggregator
$ pilotctl handshake fl-aggregator "Site 3 trainer, IRB #2026-0142, lung cancer FL study"

# Aggregator approves all verified sites
$ pilotctl pending
1:0001.0000.0010 (johns-hopkins-trainer) — "Site 1 trainer, IRB #2026-0142"
1:0001.0000.0020 (mayo-clinic-trainer)   — "Site 2 trainer, IRB #2026-0142"
1:0001.0000.0030 (stanford-trainer)      — "Site 3 trainer, IRB #2026-0142"
1:0001.0000.0040 (charité-trainer)       — "Site 4 trainer, IRB #2026-0142"
1:0001.0000.0050 (tokyo-u-trainer)       — "Site 5 trainer, IRB #2026-0142"

# Approve all verified participants
$ pilotctl approve 1:0001.0000.0010
$ pilotctl approve 1:0001.0000.0020
$ pilotctl approve 1:0001.0000.0030
$ pilotctl approve 1:0001.0000.0040
$ pilotctl approve 1:0001.0000.0050

Each institution has a trust relationship with the aggregator only. The institutions do not have trust with each other. Stanford cannot see Johns Hopkins' traffic. The aggregator receives weights from all five, aggregates them, and distributes the global model back. This star topology mirrors the standard federated learning architecture, but with cryptographic trust enforcement at each link.
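The scaling benefit of the star topology is easy to quantify: a full mesh needs one handshake per pair of sites, while the star needs one per site. A quick sketch:

```python
# Trust links needed for an n-site consortium: full mesh is one handshake
# per pair, star is one per site (each site trusts only the aggregator).

def mesh_links(n):
    """Handshakes for a full mesh: n choose 2."""
    return n * (n - 1) // 2

def star_links(n):
    """Handshakes for a star: one per site, all to the aggregator."""
    return n

for n in (5, 10, 20):
    print(n, mesh_links(n), star_links(n))
# 5 sites: 10 vs 5; 10 hospitals: 45 vs 10; 20 labs: 190 vs 20
```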

Are Distributed Remote Research Labs Possible?

"Are distributed, remote research labs possible now?" This question surfaces frequently in research computing forums. The answer is yes, but with careful architectural choices.

A distributed research lab needs three things that traditional VPN-based infrastructure provides poorly: cross-network connectivity for researchers working from different institutions and locations, encrypted data exchange that satisfies compliance requirements, and a trust model that maps to research collaboration agreements rather than network perimeters.

Pilot provides all three. Researchers at different institutions join a shared Pilot network. The rendezvous server is hosted on research infrastructure. Each researcher's node gets a permanent virtual address that works regardless of their physical location -- at the university lab, at home, or at a conference. Trust relationships define who can communicate with whom, independently of network topology.

The event stream (port 1002) adds coordination: researchers publish experimental results, training status, and coordination messages to shared topics. This replaces the Slack channels and email threads that currently coordinate most distributed research -- with the added benefit that the messages are encrypted and the participants are cryptographically authenticated.

Limitations

Pilot Protocol solves the transport problem for research collaboration. Here is what it does not solve: it does not implement federated learning algorithms, differential privacy, or secure aggregation (that is the job of frameworks like Flower, PySyft, and NVIDIA FLARE); it does not certify HIPAA or GDPR compliance; and it does not govern what researchers choose to send once a tunnel is open -- the Data Use Agreement still defines the scope of the collaboration, and the transport layer cannot enforce it.

The honest pitch: Pilot makes it possible for two machines at two institutions to establish an encrypted connection in two minutes instead of two months. What flows over that connection -- model weights, gradients, evaluation metrics, or coordination messages -- is up to the researchers and their compliance teams.

For the encryption details behind these transfers, see Zero-Dependency Encryption: X25519 + AES-256-GCM. For the NAT traversal that connects machines behind institutional firewalls, see NAT Traversal: A Deep Dive. For the file transfer protocol used for weight exchange, see Peer-to-Peer File Transfer Between AI Agents.

Try Pilot Protocol

Encrypted, authenticated, NAT-traversing connections between research institutions. Self-hosted rendezvous, trust-gated file exchange, zero cloud dependencies. Connect in minutes, not months.

View on GitHub