
Distributed RAG Without a Central Knowledge Base

February 27, 2026 · RAG · multi-agent · privacy

The standard RAG architecture puts all your documents in one vector database. Every agent queries the same store. Every document is copied into the same embedding space. This creates three problems that get worse as you scale: privacy violations (sensitive documents are centralized and accessible to any query), single point of failure (if the vector database goes down, every agent loses retrieval capability), and access control gaps (vector databases were not built to enforce per-document, per-user permissions).

The coordination nightmare is real. Teams report spending much of their time debugging where things break down in their RAG pipelines. Every time the retriever's output format changes, downstream agents break. And vector databases hold copies of private data without the access controls of the source systems they were copied from.

There is an alternative: distribute the knowledge. Instead of copying all documents into one database, let each agent own its domain. The legal agent keeps legal documents. The engineering agent keeps code and architecture docs. The HR agent keeps employee data. Queries are routed to the relevant agents, who search their local data and return results. No document leaves its owner. No central database to secure, scale, or maintain.

The Centralized RAG Problem

A centralized RAG system has a deceptively simple architecture: documents go into a vector store, queries hit the store, top-K results are retrieved, and an LLM generates a response using those results as context. The problems emerge at scale.

Privacy by violation. When you ingest documents from multiple departments into a single vector store, you create a copy of every document that exists outside the access controls of the source system. An HR performance review, a legal contract, and a public marketing document all become equal vectors in the same index. The retrieval query has no concept of "this user should not see HR documents." You can add metadata filters, but these are application-level patches on a fundamentally flat data store. One misconfigured filter exposes sensitive data.
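To make the failure mode concrete, here is a toy illustration (an in-memory list standing in for a vector store, not a real database API): access control implemented as an optional query-time filter, where any code path that forgets the filter exposes every document.

```python
# Toy model of a flat central store with filter-based access control.
# The department names and documents are illustrative.
docs = [
    {"text": "Q3 marketing plan", "dept": "marketing"},
    {"text": "J. Smith performance review", "dept": "hr"},
    {"text": "Acme MSA, indemnification clause", "dept": "legal"},
]

def query(store, allowed_depts=None):
    """Return documents, optionally restricted by a metadata filter."""
    if allowed_depts is None:
        return store  # no filter applied: every document is reachable
    return [d for d in store if d["dept"] in allowed_depts]

# Correctly filtered call: a marketing user sees only marketing docs
print([d["dept"] for d in query(docs, {"marketing"})])  # ['marketing']

# One code path that forgets the filter leaks HR and legal data
print([d["dept"] for d in query(docs)])  # ['marketing', 'hr', 'legal']
```

The point is structural: the sensitive documents are always reachable from the store, and safety depends on every caller remembering to pass the right predicate.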

Single point of failure. Every agent depends on the vector database. If Pinecone has an outage, or your self-hosted Qdrant instance runs out of disk, or the embedding model endpoint throttles, the entire system stops retrieving. You can add replicas and load balancers, but the architectural fragility remains: one component serves all queries for all domains.

Coordination overhead. As the document corpus grows, so does the challenge of keeping the index fresh. Different data sources update at different frequencies. The engineering wiki changes daily. Legal contracts change monthly. Employee records change quarterly. Synchronizing all of these into one index requires ETL pipelines, change detection, re-embedding, and index rebuilds. Each pipeline is a potential failure point, and debugging "why did the RAG return stale data" means tracing through multiple ingestion paths.

Format coupling. The retriever and the generator must agree on the output format. When the retriever changes its response structure -- adding a new metadata field, changing the chunk format, switching embedding models -- every downstream consumer breaks. This is tight coupling between components that should be independent.
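A minimal sketch of that coupling, with hypothetical field names (`chunks`, `passages`) chosen for illustration:

```python
# A downstream consumer hard-wired to the retriever's response schema.
def consume(retriever_response):
    # Assumes the retriever returns its hits under the "chunks" key
    return [hit["text"] for hit in retriever_response["chunks"]]

old_response = {"chunks": [{"text": "clause 4.2 ..."}]}
print(consume(old_response))  # works: ['clause 4.2 ...']

# The retriever team renames "chunks" to "passages" in a release:
new_response = {"passages": [{"text": "clause 4.2 ..."}]}
try:
    consume(new_response)
except KeyError as e:
    print(f"downstream consumer broke: missing key {e}")
```

Every consumer of the retriever has a copy of this assumption, and each one breaks independently when the schema shifts.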

Sharded Knowledge: Each Agent Owns Its Domain

The distributed alternative inverts the architecture. Instead of moving documents to a central store, you move the query to the documents. Each agent maintains its own local knowledge base -- a local vector store, a SQLite database, a directory of files, whatever fits the domain. When a query arrives, the orchestrator routes it to the relevant agents, each agent searches its local data, and the results are synthesized.

This is not a new idea. It is how organizations actually work. The legal department does not give engineering access to all contracts. Engineering does not give HR access to all source code. Each department answers questions about its domain. The distributed RAG architecture mirrors this real-world access pattern.

The advantages are immediate: sensitive data never leaves the system that owns it; there is no single database whose outage takes down every agent; each owner indexes its own documents, so freshness does not depend on a shared ETL pipeline; and access control is enforced by which agents can talk to each other, not by query-time filters.

Architecture: Retriever Agents + Synthesis Agent

The system has two types of agents:

Retriever agents own a knowledge domain. Each runs a local vector store (FAISS, ChromaDB, or any embedding-based index) over its documents. It accepts queries, retrieves relevant chunks, and returns them. It does not generate answers -- it only retrieves.

The synthesis agent (orchestrator) receives user queries, determines which retriever agents to query based on the topic, fans out the query, collects results, and feeds them into an LLM to generate the final answer.

# Architecture overview
#
# User Query
#     |
#     v
# [Synthesis Agent]  (1:0001.0001.0001)
#     |         |         |
#     v         v         v
# [Legal]   [Eng]    [HR]
# (0002)    (0003)   (0004)
#   |         |         |
#   v         v         v
# local DB  local DB  local DB

The synthesis agent discovers retriever agents by tags. Each retriever agent tags itself with its domain:

# Legal agent
pilotctl set-tags rag-retriever legal contracts compliance

# Engineering agent
pilotctl set-tags rag-retriever engineering code architecture

# HR agent
pilotctl set-tags rag-retriever hr employees benefits

The synthesis agent queries for available retrievers:

pilotctl find-by-tag rag-retriever --json

This returns the address and tags of each retriever. The synthesis agent uses the tags to route queries: a question about "contract terms" goes to the legal agent, a question about "deployment architecture" goes to the engineering agent, and a question about "vacation policy" goes to the HR agent.

Example: Three Domain Agents Queried by an Orchestrator

Suppose a user asks: "What is our policy on using open-source libraries in customer-facing products?" This question spans two domains: legal (licensing compliance) and engineering (technical policy). The synthesis agent routes to both.

# Step 1: Synthesis agent sends query to Legal and Engineering agents
pilotctl send-message 1:0001.0002.0001 --data '{"query": "open-source licensing policy for customer products"}'
pilotctl send-message 1:0001.0003.0001 --data '{"query": "open-source library usage policy in production"}'

# Step 2: Each agent searches its local knowledge and responds
# Legal agent returns: relevant contract clauses, compliance requirements
# Engineering agent returns: internal wiki pages on OSS policy, approved license list

# Step 3: Synthesis agent combines results and generates answer via LLM

The legal agent searches its local vector store of contracts and compliance documents. The engineering agent searches its local index of wiki pages and policy documents. Neither agent sees the other's data. The synthesis agent receives both result sets, combines them into a context window, and uses an LLM to generate a coherent answer that cites both sources.

If the user asks "How much PTO does John Smith have left?" -- the synthesis agent routes only to the HR agent. The legal and engineering agents never see the query, never receive it, and have no way to access HR data. Privacy is enforced by architecture, not by application code.

Trust-Gated Access

In Pilot Protocol, agents are private by default. An agent is not discoverable and cannot receive messages from peers it has not explicitly trusted. This property is critical for distributed RAG because it enforces access control at the network layer.

The trust flow works like this:

# Synthesis agent requests trust from the Legal retriever
pilotctl trust request 1:0001.0002.0001 --reason "RAG orchestrator: needs to query legal documents"

# Legal agent operator reviews and approves
pilotctl trust approve 1:0001.0001.0001

# Now the synthesis agent can send queries to the legal agent
# Other agents that have NOT been approved cannot query legal data

This means a rogue agent that joins the network cannot query the HR agent's employee database. It cannot even discover the HR agent exists (private by default). Only agents that have completed the mutual trust handshake can communicate.

Trust can be revoked instantly:

# HR agent revokes trust from a compromised orchestrator
pilotctl trust revoke 1:0001.0001.0001
# The orchestrator can no longer query HR data, effective immediately

Compare this to a centralized vector database where access control is a metadata filter applied at query time. If the filter has a bug, or if someone queries the database directly (bypassing the application layer), all documents are exposed. With distributed RAG on Pilot, the data never leaves the owner. The access control is not a filter -- it is the absence of a network path.
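The difference can be stated as a conceptual model (this is illustrative Python, not Pilot internals): in the trust-gated design there is no predicate to misconfigure, because an unapproved peer's query is never handled at all.

```python
# Conceptual model of trust-gated access: trust membership is the
# precondition for any delivery path, not a query-time filter.
class TrustGatedAgent:
    def __init__(self, docs):
        self.docs = docs
        self.trusted = set()  # peers approved via the mutual handshake

    def approve(self, peer_address):
        self.trusted.add(peer_address)

    def revoke(self, peer_address):
        self.trusted.discard(peer_address)  # effective immediately

    def handle_query(self, peer_address, query):
        if peer_address not in self.trusted:
            # Nothing to misconfigure: the message is simply not handled
            raise PermissionError("untrusted peer: no delivery path")
        return [d for d in self.docs if query.lower() in d.lower()]

hr = TrustGatedAgent(["PTO balance: J. Smith, 12 days remaining"])
hr.approve("1:0001.0001.0001")                     # approve the orchestrator
print(hr.handle_query("1:0001.0001.0001", "pto"))  # query is delivered

hr.revoke("1:0001.0001.0001")                      # revocation closes the path
try:
    hr.handle_query("1:0001.0001.0001", "pto")
except PermissionError as e:
    print(e)
```

In a centralized store, the equivalent of `revoke` is a filter change that must be correct everywhere the store is queried; here it is one set removal on the data owner's side.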

Code Example: Python RAG Agent

Here is a retriever agent implemented in Python. It maintains a local ChromaDB collection, listens for queries over Pilot messaging, searches its local data, and returns results.

import subprocess
import json
import chromadb

# Initialize local vector store
chroma = chromadb.PersistentClient(path="./legal_knowledge")
collection = chroma.get_or_create_collection(
    name="legal_docs",
    metadata={"hnsw:space": "cosine"},
)

def ingest_documents(doc_dir):
    """Index local documents into ChromaDB."""
    import os
    for filename in os.listdir(doc_dir):
        filepath = os.path.join(doc_dir, filename)
        with open(filepath, "r") as f:
            text = f.read()
        # Chunk document into 512-char segments
        chunks = [text[i:i+512] for i in range(0, len(text), 512)]
        for idx, chunk in enumerate(chunks):
            doc_id = f"{filename}_chunk_{idx}"
            collection.upsert(
                ids=[doc_id],
                documents=[chunk],
                metadatas=[{"source": filename, "chunk": idx}],
            )
    print(f"Indexed {collection.count()} chunks from {doc_dir}")

def search(query, top_k=5):
    """Search local vector store."""
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
    )
    return [
        {
            "text": doc,
            "source": meta["source"],
            "score": round(1 - dist, 4),
        }
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

def listen_for_queries():
    """Listen for incoming queries via Pilot messaging."""
    while True:
        # Receive message from any trusted peer
        result = subprocess.run(
            ["pilotctl", "receive-message", "--timeout", "60", "--json"],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            continue

        msg = json.loads(result.stdout)
        sender = msg["from"]
        query = json.loads(msg["data"])["query"]

        print(f"Query from {sender}: {query}")

        # Search local knowledge base
        results = search(query)

        # Send results back to the querying agent
        response = json.dumps({
            "domain": "legal",
            "query": query,
            "results": results,
        })
        subprocess.run([
            "pilotctl", "send-message", sender,
            "--data", response,
        ])
        print(f"Sent {len(results)} results to {sender}")

# Tag this agent as a legal retriever
subprocess.run(["pilotctl", "set-tags", "rag-retriever", "legal", "contracts"])

# Index local documents
ingest_documents("./legal_documents/")

# Start listening for queries
print("Legal RAG agent ready. Listening for queries...")
listen_for_queries()
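One refinement worth noting: the ingest step above cuts documents into fixed 512-character segments with no overlap, so a sentence that straddles a boundary is split across two chunks and may match neither. A common variant (a sketch, not something the retriever requires) adds an overlap so boundary text appears in two adjacent chunks:

```python
def chunk_with_overlap(text, size=512, overlap=64):
    """Fixed-size chunks with overlap: each chunk repeats the last
    `overlap` characters of its predecessor. The step must stay positive."""
    step = size - overlap
    assert step > 0, "overlap must be smaller than chunk size"
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_with_overlap("".join(str(i % 10) for i in range(1000)))
print(len(chunks), [len(c) for c in chunks])  # 3 [512, 512, 104]
```

Swapping this in for the list comprehension in `ingest_documents` costs a little index size (each chunk repeats 64 characters) in exchange for fewer retrieval misses at chunk boundaries.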

And here is the synthesis agent that orchestrates queries across multiple retrievers:

import subprocess
import json
import concurrent.futures
import openai

def discover_retrievers():
    """Find all retriever agents on the network."""
    result = subprocess.run(
        ["pilotctl", "find-by-tag", "rag-retriever", "--json"],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def route_query(query, retrievers):
    """Determine which retrievers are relevant based on tags."""
    # Simple keyword-based routing; use an LLM classifier for production
    keywords = {
        "legal": ["contract", "compliance", "license", "legal", "policy"],
        "engineering": ["code", "architecture", "deploy", "technical", "api"],
        "hr": ["employee", "pto", "benefits", "salary", "vacation"],
    }
    relevant = []
    query_lower = query.lower()
    for r in retrievers:
        for tag in r.get("tags", []):
            if tag in keywords:
                if any(kw in query_lower for kw in keywords[tag]):
                    relevant.append(r)
                    break
    return relevant if relevant else retrievers  # fallback: query all

def query_retriever(address, query):
    """Send query to a retriever agent and get results."""
    subprocess.run([
        "pilotctl", "send-message", address,
        "--data", json.dumps({"query": query}),
    ])
    # Wait for response
    result = subprocess.run(
        ["pilotctl", "receive-message", "--timeout", "30",
         "--from", address, "--json"],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        return json.loads(json.loads(result.stdout)["data"])
    return {"domain": "unknown", "results": []}

def synthesize(query, all_results):
    """Combine retrieval results and generate answer with LLM."""
    context_parts = []
    for result_set in all_results:
        domain = result_set.get("domain", "unknown")
        for r in result_set.get("results", []):
            context_parts.append(
                f"[{domain} - {r['source']}] {r['text']}"
            )
    context = "\n\n".join(context_parts)

    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. Cite sources."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content

# Main query flow
query = "What is our policy on using open-source libraries in customer products?"

retrievers = discover_retrievers()
relevant = route_query(query, retrievers)
print(f"Routing to {len(relevant)} retrievers: {[r['address'] for r in relevant]}")

# Query relevant retrievers in parallel
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {
        pool.submit(query_retriever, r["address"], query): r
        for r in relevant
    }
    all_results = [f.result() for f in concurrent.futures.as_completed(futures)]

# Synthesize answer
answer = synthesize(query, all_results)
print(f"\nAnswer:\n{answer}")

Comparison: Centralized vs Distributed RAG

| Property | Centralized Vector DB | Pilot Distributed RAG |
| --- | --- | --- |
| Data location | Copied to central store | Stays with owner |
| Access control | Metadata filters (app-level) | Trust handshake (network-level) |
| Single point of failure | Vector DB is SPOF | Each agent independent |
| Index freshness | ETL pipeline lag | Owner indexes in real time |
| Privacy compliance | Data copied to third-party store | Data never leaves origin |
| Query latency | ~50ms (single store) | ~200ms (network + search) |
| Scaling | Scale the vector DB | Add more agents |
| Cross-domain queries | Single query searches all | Fan-out to multiple agents |
| Infrastructure | Vector DB + embedding API | Pilot daemon (10MB per agent) |
| Works through NAT | Requires network access to DB | Automatic NAT traversal |

The latency trade-off is real. A centralized vector database returns results in roughly 50ms because the query is a single local operation. Distributed RAG adds network round-trips: the synthesis agent sends queries to retrievers (~5ms per peer on Pilot), each retriever searches its local store (~50ms), and results come back (~5ms). With serialization and orchestration overhead on top, the end-to-end fan-out lands at roughly 100-200ms -- and because retrievers are queried in parallel, the slowest retriever, not the sum, sets the total. For interactive applications where sub-100ms retrieval is required, centralized wins on speed.
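The arithmetic can be made explicit with a back-of-envelope model (the per-leg numbers come from the paragraph above; the 40ms orchestration overhead is an assumption for illustration):

```python
SEND_MS, RETURN_MS = 5, 5  # per-peer network legs on Pilot (assumed)

def fanout_latency_ms(search_times_ms, overhead_ms=40):
    """Parallel fan-out: end-to-end cost is the slowest retriever's
    round-trip plus fixed orchestration overhead, not the sum."""
    return max(SEND_MS + t + RETURN_MS for t in search_times_ms) + overhead_ms

# Three retrievers, each searching its local store in ~50ms
print(fanout_latency_ms([50, 50, 50]))   # → 100
# One slow retriever dominates; the parallel peers add nothing
print(fanout_latency_ms([50, 50, 150]))  # → 200
```

The `max` is the key design property: adding more retrievers widens coverage without stacking latency, as long as no single agent is slow.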

But for applications where privacy, compliance, and fault tolerance matter more than 150ms of latency -- healthcare, legal, finance, multi-tenant SaaS -- distributed RAG is the architecture that matches the requirements.

Limitations and When Centralized RAG Is Better

Distributed RAG is not universally better. It is worse when sub-100ms retrieval latency is a hard requirement, when the corpus is small and owned by a single team (so the privacy and coordination benefits never materialize and the agent infrastructure is pure overhead), and when queries need a single globally-ranked search across all documents rather than a per-domain fan-out.

The right choice depends on your constraints. If privacy and data residency are non-negotiable, distribute. If query latency and simplicity are the priority, centralize. For many real-world systems, a hybrid works: keep non-sensitive shared knowledge in a central store for fast access, and keep sensitive domain knowledge distributed behind trust-gated agents.
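The hybrid split described above reduces to a routing decision per knowledge domain. A minimal sketch, where the set of sensitive domains is an assumption standing in for whatever your compliance policy dictates:

```python
# Hypothetical policy: HR and legal knowledge stays behind
# trust-gated agents; everything else goes to the fast central store.
SENSITIVE_DOMAINS = {"hr", "legal"}

def pick_backend(domain):
    """Route a knowledge domain to the appropriate retrieval backend."""
    if domain in SENSITIVE_DOMAINS:
        return "distributed-agent"
    return "central-store"

for domain in ["marketing", "hr", "engineering", "legal"]:
    print(domain, "->", pick_backend(domain))
```

The synthesis agent then queries both backends the same way it queries multiple retrievers: fan out, collect, synthesize.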

Try Pilot Protocol

Build RAG pipelines where data stays with its owner. Trust-gated access, encrypted queries, and no centralized knowledge store to manage.

View on GitHub