Enterprise Production Complete: 99 Features, 234 Tests, Zero Dependencies

Pilot Protocol now ships 99 features across 53 protocol commands, backed by 234 tests. This post catalogs everything the enterprise stack does today — identity providers, directory sync, declarative provisioning, SIEM-grade audit export, JWT validation, webhook reliability, per-network Prometheus metrics, and the security hardening that ties it all together. Every feature described here is implemented, tested, and running in production on the live registry.

Enterprise Feature Gating

Enterprise features are opt-in at the network level. Set enterprise: true when creating a network, and the full enterprise stack activates for that network only. Non-enterprise networks are unaffected — zero overhead, zero behavior change. Toggling enterprise on mid-lifecycle initializes RBAC roles for all existing members.

Seven registry handlers enforce the enterprise gate: promote, demote, kick, set policy, invite, respond to invite, and set key expiry. Each returns a clear "enterprise feature" error when called against a non-enterprise network. The gate check is a single function call at the top of each handler — no configuration, no flags, no environment variables.
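The gate pattern described above can be sketched in a few lines of Go. This is a minimal illustration, not the registry's actual code: the `Network` type, `requireEnterprise`, `handlePromote`, and the error text are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Network is a hypothetical, trimmed-down registry record.
type Network struct {
	ID         uint16
	Enterprise bool
}

// ErrEnterpriseOnly is a hypothetical sentinel; the real error text may differ.
var ErrEnterpriseOnly = errors.New("enterprise feature: network is not enterprise-enabled")

// requireEnterprise is the gate: a single call at the top of a handler.
func requireEnterprise(n *Network) error {
	if n == nil || !n.Enterprise {
		return ErrEnterpriseOnly
	}
	return nil
}

// handlePromote shows where the gate sits relative to the handler's real work.
func handlePromote(n *Network, nodeID uint32) (string, error) {
	if err := requireEnterprise(n); err != nil {
		return "", err
	}
	return fmt.Sprintf("promoted node %d in network %d", nodeID, n.ID), nil
}

func main() {
	_, err := handlePromote(&Network{ID: 1}, 685)
	fmt.Println("non-enterprise:", err)

	msg, _ := handlePromote(&Network{ID: 3, Enterprise: true}, 685)
	fmt.Println("enterprise:", msg)
}
```

Keeping the check in one function means every gated handler fails the same way, which is what makes the error predictable for callers.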

Identity Provider Integration

The registry validates identity tokens from five provider types: OIDC, SAML, Entra ID, LDAP, and webhook. Configuration is per-registry (global) via the set_idp_config protocol command or a blueprint.

Built-in JWT Validation

For OIDC providers, the registry validates JWTs directly, with no sidecar, proxy, or external validator. Two signature algorithms are supported: RS256 and HS256.

Claims validated: iss (issuer), aud (audience, supports both string and array), exp (expiry), nbf (not-before). A 60-second clock skew tolerance prevents failures from minor time drift between the IDP and registry.
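A rough sketch of those claim checks follows, assuming the signature has already been verified and the payload has been base64-decoded. The names `Audience`, `Claims`, and `checkClaims` are hypothetical, and treating a missing exp as expired is an assumption of this example rather than documented registry behavior.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"time"
)

// clockSkew is the 60-second tolerance described above.
const clockSkew = 60 * time.Second

// Audience accepts both the string and the array form of the "aud" claim.
type Audience []string

func (a *Audience) UnmarshalJSON(b []byte) error {
	var one string
	if err := json.Unmarshal(b, &one); err == nil {
		*a = Audience{one}
		return nil
	}
	var many []string
	if err := json.Unmarshal(b, &many); err != nil {
		return err
	}
	*a = many
	return nil
}

// Claims holds only the four claims listed above.
type Claims struct {
	Iss string   `json:"iss"`
	Aud Audience `json:"aud"`
	Exp int64    `json:"exp"`
	Nbf int64    `json:"nbf"`
}

// checkClaims parses a decoded JWT payload and validates it at nowUnix.
func checkClaims(payload, issuer, audience string, nowUnix int64) error {
	var c Claims
	if err := json.Unmarshal([]byte(payload), &c); err != nil {
		return fmt.Errorf("malformed payload: %w", err)
	}
	now := time.Unix(nowUnix, 0)
	if c.Iss != issuer {
		return errors.New("issuer mismatch")
	}
	audOK := false
	for _, a := range c.Aud {
		if a == audience {
			audOK = true
		}
	}
	if !audOK {
		return errors.New("audience mismatch")
	}
	// A missing exp decodes to 0 and is therefore treated as expired.
	if now.After(time.Unix(c.Exp, 0).Add(clockSkew)) {
		return errors.New("token expired")
	}
	if c.Nbf != 0 && now.Before(time.Unix(c.Nbf, 0).Add(-clockSkew)) {
		return errors.New("token not yet valid")
	}
	return nil
}

func main() {
	payload := `{"iss":"https://idp.example","aud":["pilot-prod","other"],"exp":1000,"nbf":900}`
	// 30 seconds past exp, still inside the skew window.
	fmt.Println(checkClaims(payload, "https://idp.example", "pilot-prod", 1030))
	// 61 seconds past exp, outside the skew window.
	fmt.Println(checkClaims(payload, "https://idp.example", "pilot-prod", 1061))
}
```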

Algorithm confusion attacks are blocked. The registry enforces that the JWKS key type matches the JWT header algorithm — an HS256 header with an RSA key, or an RS256 header with an oct key, is rejected before signature verification.
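The key-type check can be illustrated as a small lookup that runs before any signature work. This is a sketch under the assumption that the header `alg` and the JWKS `kty` are already parsed into strings; `checkAlgMatchesKey` is a hypothetical name.

```go
package main

import "fmt"

// checkAlgMatchesKey rejects a token whose header algorithm does not fit
// the key type of the JWKS entry selected for it.
func checkAlgMatchesKey(alg, kty string) error {
	var want string
	switch alg {
	case "RS256":
		want = "RSA"
	case "HS256":
		want = "oct"
	default:
		// Anything else, including "none", is refused outright.
		return fmt.Errorf("unsupported algorithm %q", alg)
	}
	if kty != want {
		return fmt.Errorf("algorithm %s requires key type %s, got %s", alg, want, kty)
	}
	return nil
}

func main() {
	fmt.Println(checkAlgMatchesKey("RS256", "RSA")) // matching pair
	fmt.Println(checkAlgMatchesKey("HS256", "RSA")) // classic confusion attempt
	fmt.Println(checkAlgMatchesKey("none", "RSA"))  // unsigned token
}
```

Doing this check first means an attacker-chosen header never decides which verification routine touches the key.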

JWKS Caching

JWKS keys are cached for 5 minutes with automatic refresh. The first validation triggers a fetch; subsequent validations within the TTL use the cache. Cache entries are keyed by JWKS URL and key ID (kid). Response size is capped at 64KB.
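One way to picture the cache is a map keyed by (URL, kid) with a timestamp per entry. The sketch below is illustrative only: `JWKSCache`, `fakeClock`, and the injected `fetch` function are hypothetical, and the 64KB response cap is left out (in a real fetch it could be enforced with `io.LimitReader`).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const jwksTTL = 5 * time.Minute

type cacheKey struct{ url, kid string }

type cacheEntry struct {
	key     []byte // raw key material; a placeholder for a parsed key
	fetched time.Time
}

// JWKSCache caches keys per (JWKS URL, kid) and refetches after the TTL.
type JWKSCache struct {
	mu      sync.Mutex
	entries map[cacheKey]cacheEntry
	fetch   func(url, kid string) ([]byte, error)
	now     func() time.Time // injectable clock
}

func NewJWKSCache(fetch func(url, kid string) ([]byte, error), now func() time.Time) *JWKSCache {
	return &JWKSCache{entries: make(map[cacheKey]cacheEntry), fetch: fetch, now: now}
}

// Get returns the cached key if it is younger than the TTL, else refetches.
// The lock is held across the fetch, which keeps the sketch simple.
func (c *JWKSCache) Get(url, kid string) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := cacheKey{url, kid}
	if e, ok := c.entries[k]; ok && c.now().Sub(e.fetched) < jwksTTL {
		return e.key, nil
	}
	key, err := c.fetch(url, kid)
	if err != nil {
		return nil, err
	}
	c.entries[k] = cacheEntry{key: key, fetched: c.now()}
	return key, nil
}

// fakeClock is a manually advanced clock for the demonstration.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time          { return f.t }
func (f *fakeClock) Advance(d time.Duration) { f.t = f.t.Add(d) }

func main() {
	fetches := 0
	clk := &fakeClock{}
	cache := NewJWKSCache(func(url, kid string) ([]byte, error) {
		fetches++
		return []byte("key-material"), nil
	}, clk.Now)

	cache.Get("https://idp.example/jwks", "kid-1") // miss: fetches
	cache.Get("https://idp.example/jwks", "kid-1") // hit: served from cache
	clk.Advance(jwksTTL)
	cache.Get("https://idp.example/jwks", "kid-1") // TTL elapsed: fetches again
	fmt.Println("fetches:", fetches)
}
```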

# Configure OIDC identity provider
{
  "type": "set_idp_config",
  "admin_token": "$TOKEN",
  "provider_type": "oidc",
  "url": "https://login.microsoftonline.com/tenant/.well-known/jwks",
  "issuer": "https://login.microsoftonline.com/tenant/v2.0",
  "client_id": "your-app-client-id"
}

# Validate a JWT token
{
  "type": "validate_token",
  "admin_token": "$TOKEN",
  "token": "eyJhbGciOiJSUzI1NiI..."
}

Webhook-Based Identity

For providers that do not expose JWKS (LDAP, SAML, custom systems), the registry delegates verification to an external webhook. The webhook receives the token and returns a verified external_id. The external ID links the node to a directory entry for RBAC pre-assignment and directory sync.

Directory Sync

Enterprise directories (Active Directory, Entra ID, LDAP) push membership listings to the registry via the directory_sync command. The registry maps directory entries to existing nodes by external_id and applies the resulting changes, such as role updates and disabling members:

# Push a directory listing
{
  "type": "directory_sync",
  "admin_token": "$TOKEN",
  "network_id": 3,
  "remove_unlisted": true,
  "entries": [
    {"external_id": "[email protected]", "role": "admin", "display_name": "Alice"},
    {"external_id": "[email protected]", "role": "member"},
    {"external_id": "[email protected]", "disabled": true}
  ]
}

# Response
{
  "type": "directory_sync_ok",
  "mapped": 2,
  "unmapped": 0,
  "updated": 1,
  "disabled": 1,
  "actions": [
    "role [email protected]: member → admin (node 685)",
    "disabled [email protected] (node 687)"
  ]
}
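To make the mapping step concrete, here is a hedged sketch of how a listing might be applied to known members. The types, the `applySync` function, and the `alice@example.com` style IDs are hypothetical, and the counting rules (for instance, that a disabled entry is not counted as mapped) are guesses chosen to reproduce the sample response above. Handling of remove_unlisted is omitted.

```go
package main

import "fmt"

// Entry mirrors one element of the "entries" array in the request.
type Entry struct {
	ExternalID string
	Role       string
	Disabled   bool
}

// Member is a hypothetical registry-side record for a node in the network.
type Member struct {
	NodeID   uint32
	Role     string
	Disabled bool
}

// Result mirrors the counters and action log of the response.
type Result struct {
	Mapped, Unmapped, Updated, Disabled int
	Actions                             []string
}

// applySync matches entries to members by external ID and applies changes.
func applySync(byExternalID map[string]*Member, entries []Entry) Result {
	var r Result
	for _, e := range entries {
		m, ok := byExternalID[e.ExternalID]
		if !ok {
			r.Unmapped++ // no node carries this external ID yet
			continue
		}
		if e.Disabled {
			if !m.Disabled {
				m.Disabled = true
				r.Disabled++
				r.Actions = append(r.Actions,
					fmt.Sprintf("disabled %s (node %d)", e.ExternalID, m.NodeID))
			}
			continue
		}
		r.Mapped++
		if e.Role != "" && e.Role != m.Role {
			r.Actions = append(r.Actions,
				fmt.Sprintf("role %s: %s → %s (node %d)", e.ExternalID, m.Role, e.Role, m.NodeID))
			m.Role = e.Role
			r.Updated++
		}
	}
	return r
}

func main() {
	members := map[string]*Member{
		"alice@example.com": {NodeID: 685, Role: "member"},
		"bob@example.com":   {NodeID: 686, Role: "member"},
		"carol@example.com": {NodeID: 687, Role: "member"},
	}
	res := applySync(members, []Entry{
		{ExternalID: "alice@example.com", Role: "admin"},
		{ExternalID: "bob@example.com", Role: "member"},
		{ExternalID: "carol@example.com", Disabled: true},
	})
	fmt.Printf("%+v\n", res)
}
```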

The directory_status command returns current sync state: total members, mapped (with external ID), unmapped (without), pre-assignment count, and last sync timestamp.

Blueprint Provisioning

Declare an entire network configuration in a single JSON document. The provision_network command creates the network (or finds it by name), enables enterprise, applies policy, configures the identity provider, sets up webhooks, configures audit export, and stores RBAC pre-assignments — all in one atomic operation.

# Blueprint: production network with full enterprise config
{
  "name": "prod-fleet",
  "join_rule": "invite",
  "enterprise": true,
  "policy": {
    "max_members": 100,
    "allowed_ports": [7, 80, 443, 1001],
    "description": "Production agent fleet"
  },
  "identity_provider": {
    "type": "oidc",
    "url": "https://login.microsoftonline.com/tenant/.well-known/jwks",
    "issuer": "https://login.microsoftonline.com/tenant/v2.0",
    "client_id": "pilot-prod"
  },
  "webhooks": {
    "audit_url": "https://siem.acme.com/pilot/events"
  },
  "audit_export": {
    "format": "splunk_hec",
    "endpoint": "https://splunk.acme.com:8088/services/collector",
    "token": "HEC-TOKEN",
    "index": "pilot_audit",
    "source": "pilot-registry"
  },
  "roles": [
    {"external_id": "[email protected]", "role": "owner"},
    {"external_id": "[email protected]", "role": "admin"},
    {"external_id": "[email protected]", "role": "member"}
  ]
}

Blueprints are idempotent. Running the same blueprint twice finds the existing network by name and applies updates. The result lists every action taken:

{
  "type": "provision_network_ok",
  "network_id": 3,
  "name": "prod-fleet",
  "created": true,
  "actions": [
    "created network 3 (prod-fleet)",
    "enabled enterprise features",
    "applied network policy",
    "configured oidc identity provider",
    "configured audit webhook",
    "configured splunk_hec audit export to https://splunk.acme.com:8088/services/collector",
    "stored 3 RBAC pre-assignments"
  ]
}

Validation runs before any mutation. Invalid join rules, missing tokens for token-gated networks, unrecognized IDP types, and invalid audit export formats are all rejected with clear errors.
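The validate-then-mutate ordering might look like the following sketch. Only "invite", "oidc", and "splunk_hec" appear as literal values in this post; the other identifiers in the lookup tables (such as "open", "token", "entra_id", "cef") are assumed spellings for illustration, and the flattened `Blueprint` struct is a hypothetical simplification of the JSON document.

```go
package main

import (
	"errors"
	"fmt"
)

// Blueprint is a flattened, hypothetical view of the fields that get validated.
type Blueprint struct {
	Name        string
	JoinRule    string
	Token       string
	IDPType     string // empty means no identity_provider section
	AuditFormat string // empty means no audit_export section
}

// Accepted values; most spellings here are assumptions, not documented names.
var (
	joinRules    = map[string]bool{"open": true, "token": true, "invite": true}
	idpTypes     = map[string]bool{"oidc": true, "saml": true, "entra_id": true, "ldap": true, "webhook": true}
	auditFormats = map[string]bool{"splunk_hec": true, "cef": true, "json": true}
)

// validateBlueprint returns the first problem found, or nil.
func validateBlueprint(b Blueprint) error {
	switch {
	case b.Name == "":
		return errors.New("name is required")
	case !joinRules[b.JoinRule]:
		return fmt.Errorf("invalid join_rule %q", b.JoinRule)
	case b.JoinRule == "token" && b.Token == "":
		return errors.New("token-gated network requires a token")
	case b.IDPType != "" && !idpTypes[b.IDPType]:
		return fmt.Errorf("unrecognized identity provider type %q", b.IDPType)
	case b.AuditFormat != "" && !auditFormats[b.AuditFormat]:
		return fmt.Errorf("invalid audit export format %q", b.AuditFormat)
	}
	return nil
}

// provision runs validation first; apply is only reached for a valid blueprint.
func provision(b Blueprint, apply func(Blueprint) []string) ([]string, error) {
	if err := validateBlueprint(b); err != nil {
		return nil, err // nothing has been mutated at this point
	}
	return apply(b), nil
}

func main() {
	apply := func(b Blueprint) []string {
		return []string{"created network (" + b.Name + ")"}
	}
	fmt.Println(provision(Blueprint{Name: "prod-fleet", JoinRule: "invite", IDPType: "oidc"}, apply))
	fmt.Println(provision(Blueprint{Name: "prod-fleet", JoinRule: "token"}, apply))
}
```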

Audit Export

Every registry mutation emits a structured audit event. Events flow through three channels simultaneously: slog structured logging, the webhook system, and the audit exporter. Three export formats are supported, covering the major SIEM platforms:

Splunk HEC

Events are formatted as Splunk HTTP Event Collector payloads with time (epoch), source (configurable, defaults to pilot-registry), sourcetype (pilot:audit), optional index, and the full event object. Authentication via Authorization: Splunk <token> header.
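As an illustration of that payload shape, the sketch below builds the JSON body and the auth header value. The JSON field names follow the HEC event format described above; the Go names (`hecEvent`, `buildHECPayload`, `hecAuthHeader`) are hypothetical and the registry's own encoder may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// hecEvent is the envelope around one audit event.
type hecEvent struct {
	Time       int64  `json:"time"` // epoch seconds
	Source     string `json:"source"`
	Sourcetype string `json:"sourcetype"`
	Index      string `json:"index,omitempty"` // optional
	Event      any    `json:"event"`
}

// buildHECPayload wraps an event; an empty source falls back to the default.
func buildHECPayload(unixTime int64, source, index string, event any) ([]byte, error) {
	if source == "" {
		source = "pilot-registry"
	}
	return json.Marshal(hecEvent{
		Time:       unixTime,
		Source:     source,
		Sourcetype: "pilot:audit",
		Index:      index,
		Event:      event,
	})
}

// hecAuthHeader returns the value for the Authorization header.
func hecAuthHeader(token string) string {
	return "Splunk " + token
}

func main() {
	body, err := buildHECPayload(1700000000, "", "pilot_audit", map[string]any{
		"action":     "member.promoted",
		"network_id": 3,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
	fmt.Println("Authorization:", hecAuthHeader("HEC-TOKEN"))
}
```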

CEF/Syslog

Events are formatted in Common Event Format for syslog-based SIEMs: CEF:0|Pilot|Registry|1.0|<action>|<action>|<severity>|<extensions>. Severity is dynamic: 6 for kick/delete operations, 4 for promote/demote, 3 (informational) for everything else. Extensions include network_id, node_id, and action.
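A small sketch of that CEF line and the severity mapping follows. The action strings matched here ("kick", "delete", "promote", "demote" as substrings) are assumptions about how actions are named, and the escaping follows the general CEF convention of backslash-escaping pipes in header fields and equals signs in extension values.

```go
package main

import (
	"fmt"
	"strings"
)

var (
	// Header fields escape backslash and pipe; extension values escape backslash and equals.
	cefHeaderEscaper = strings.NewReplacer(`\`, `\\`, `|`, `\|`)
	cefExtEscaper    = strings.NewReplacer(`\`, `\\`, `=`, `\=`)
)

// cefSeverity maps an action name to the severities described above.
func cefSeverity(action string) int {
	switch {
	case strings.Contains(action, "kick"), strings.Contains(action, "delete"):
		return 6
	case strings.Contains(action, "promote"), strings.Contains(action, "demote"):
		return 4
	default:
		return 3
	}
}

// formatCEF renders one audit event as a CEF line.
func formatCEF(action string, networkID, nodeID int) string {
	h := cefHeaderEscaper.Replace(action)
	return fmt.Sprintf("CEF:0|Pilot|Registry|1.0|%s|%s|%d|network_id=%d node_id=%d action=%s",
		h, h, cefSeverity(action), networkID, nodeID, cefExtEscaper.Replace(action))
}

func main() {
	fmt.Println(formatCEF("member.promoted", 3, 685))
	fmt.Println(formatCEF("member.kicked", 3, 687))
	fmt.Println(formatCEF("policy.updated", 3, 0))
}
```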

JSON

Plain JSON AuditEntry objects posted to an HTTP endpoint. Each entry includes timestamp, action, network_id, node_id, and a freeform details string.

The exporter is asynchronous with a 1024-slot buffered channel. Delivery uses 3 retries with exponential backoff (1s, 2s, 4s). Non-blocking: if the buffer is full, events are dropped and counted. The get_audit_export command returns export stats (delivered, dropped counts).
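The non-blocking enqueue and the retry schedule can be sketched as below. This is an approximation under stated assumptions: "3 retries" is read as one initial attempt plus up to three retries, `Flush` is a synchronous stand-in for what would normally be a background worker goroutine, and all type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// backoff lists the waits before each of the three retries.
var backoff = []time.Duration{1 * time.Second, 2 * time.Second, 4 * time.Second}

var errTransient = errors.New("transient failure")

// Exporter buffers events in a channel and counts outcomes.
type Exporter struct {
	queue     chan string
	delivered atomic.Int64
	dropped   atomic.Int64
}

func NewExporter(slots int) *Exporter {
	return &Exporter{queue: make(chan string, slots)}
}

// Enqueue never blocks: if the buffer is full the event is dropped and counted.
func (e *Exporter) Enqueue(event string) bool {
	select {
	case e.queue <- event:
		return true
	default:
		e.dropped.Add(1)
		return false
	}
}

func (e *Exporter) Delivered() int64 { return e.delivered.Load() }
func (e *Exporter) Dropped() int64   { return e.dropped.Load() }

// deliverWithRetry makes one attempt, then retries after each backoff wait.
func deliverWithRetry(send func() error, sleep func(time.Duration)) error {
	err := send()
	for _, wait := range backoff {
		if err == nil {
			return nil
		}
		sleep(wait)
		err = send()
	}
	return err
}

// Flush delivers everything currently queued, then returns.
func (e *Exporter) Flush(send func(string) error, sleep func(time.Duration)) {
	for {
		select {
		case ev := <-e.queue:
			if deliverWithRetry(func() error { return send(ev) }, sleep) == nil {
				e.delivered.Add(1)
			}
		default:
			return
		}
	}
}

// noSleep skips the real waits so the demo finishes instantly.
func noSleep(time.Duration) {}

func main() {
	e := NewExporter(2) // the post describes 1024 slots; 2 makes the drop visible
	for _, ev := range []string{"a", "b", "c"} {
		e.Enqueue(ev)
	}
	e.Flush(func(string) error { return nil }, noSleep)
	fmt.Println("delivered:", e.Delivered(), "dropped:", e.Dropped())
}
```

Dropping on a full buffer trades completeness for latency: registry handlers never stall on a slow SIEM, and the dropped counter makes the loss visible.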

Webhook Dead Letter Queue

Building on the Phase 3 webhook system, failed deliveries now land in a dead letter queue. Server errors (5xx) are retried 3 times with exponential backoff; client errors (4xx) go directly to the DLQ without retry. The DLQ holds the last 100 failed events for inspection via the get_webhook_dlq command.

# Inspect the dead letter queue
{
  "type": "get_webhook_dlq",
  "admin_token": "$TOKEN"
}

# Response
{
  "type": "webhook_dlq_ok",
  "events": [
    {
      "event_id": 47,
      "action": "member.promoted",
      "timestamp": "2026-03-28T10:30:00Z",
      "details": {"node_id": 685, "network_id": 3}
    }
  ]
}
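The retry-versus-DLQ decision can be pictured with the sketch below. It assumes one initial attempt plus three retries for 5xx responses, sends any other non-2xx status straight to the queue, and leaves out the backoff waits for brevity. `DLQ`, `Event`, and `deliver` are hypothetical names, and `post` stands in for the real HTTP call by returning only a status code.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	dlqCapacity = 100 // the queue keeps only the most recent failures
	maxRetries  = 3
)

type Event struct {
	ID     int
	Action string
}

// DLQ is a bounded, mutex-protected list of failed events.
type DLQ struct {
	mu     sync.Mutex
	events []Event
}

// Push appends an event and discards the oldest entries beyond the capacity.
func (q *DLQ) Push(e Event) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.events = append(q.events, e)
	if len(q.events) > dlqCapacity {
		q.events = q.events[len(q.events)-dlqCapacity:]
	}
}

// Snapshot returns a copy that is safe to inspect.
func (q *DLQ) Snapshot() []Event {
	q.mu.Lock()
	defer q.mu.Unlock()
	return append([]Event(nil), q.events...)
}

// deliver reports whether the event was accepted by the endpoint.
func deliver(e Event, post func(Event) int, q *DLQ) bool {
	for attempt := 0; attempt <= maxRetries; attempt++ {
		status := post(e)
		switch {
		case status >= 200 && status < 300:
			return true
		case status >= 500:
			continue // server error: try again
		default:
			q.Push(e) // client error: retrying will not help
			return false
		}
	}
	q.Push(e) // retries exhausted
	return false
}

func main() {
	q := &DLQ{}
	calls := 0
	deliver(Event{ID: 47, Action: "member.promoted"}, func(Event) int { calls++; return 503 }, q)
	fmt.Println("attempts:", calls, "queued:", len(q.Snapshot()))
}
```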

Ownership Transfer

Network owners can now transfer ownership to another member. The transfer is atomic: the previous owner is demoted to admin and the new owner receives the owner role. Six edge cases are enforced, including: non-owners cannot transfer, non-members cannot receive, and self-transfer is rejected. The audit trail records both the old and new owner.

# Transfer ownership
pilotctl network transfer-ownership 1 --to 686

Chain transfers work: A transfers to B, then B can transfer to C. Each transfer generates its own audit event with the complete provenance.
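The transfer rules above can be sketched as one function that checks the preconditions and swaps both roles under a single lock. The `Network` type, role strings, and audit message format are hypothetical; only the rules themselves come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Network is a hypothetical in-memory view of one network's roles.
type Network struct {
	mu    sync.Mutex
	roles map[uint32]string // node ID -> "owner", "admin", or "member"
	audit []string
}

// TransferOwnership validates the request, then swaps both roles while
// holding the lock so no reader sees zero or two owners.
func (n *Network) TransferOwnership(from, to uint32) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.roles[from] != "owner" {
		return errors.New("only the current owner can transfer ownership")
	}
	if from == to {
		return errors.New("cannot transfer ownership to yourself")
	}
	if _, ok := n.roles[to]; !ok {
		return errors.New("target is not a member of the network")
	}
	n.roles[from] = "admin"
	n.roles[to] = "owner"
	n.audit = append(n.audit, fmt.Sprintf("ownership transferred: %d -> %d", from, to))
	return nil
}

func main() {
	n := &Network{roles: map[uint32]string{685: "owner", 686: "member", 687: "member"}}
	fmt.Println(n.TransferOwnership(685, 686)) // A transfers to B
	fmt.Println(n.TransferOwnership(686, 687)) // B transfers to C
	fmt.Println(n.TransferOwnership(685, 686)) // A is no longer the owner
	fmt.Println(n.roles, n.audit)
}
```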

Per-Network Prometheus Metrics

The Prometheus endpoint now includes per-network labeled metrics for Grafana dashboards:

Metric | Type | Description
pilot_network_members{network="prod"} | Gauge | Members per network
pilot_network_admins{network="prod"} | Gauge | Admins per network
pilot_network_owners{network="prod"} | Gauge | Owners per network
pilot_network_enterprise{network="prod"} | Gauge | Enterprise flag (0/1)
pilot_network_policy_set{network="prod"} | Gauge | Policy configured (0/1)

Enterprise-specific counters track operational volume:

Metric | Description
pilot_invites_sent_total | Invites sent
pilot_invites_accepted_total | Invites accepted
pilot_rbac_operations_total{op="promote"} | RBAC operations by type
pilot_policy_changes_total | Policy mutations
pilot_key_rotations_total | Key rotations
pilot_provisions_total | Blueprint provisions
pilot_audit_exports_total | Audit events exported
pilot_idp_verifications_total | Identity verifications
pilot_directory_synced_networks | Networks with directory sync

Status gauges show configuration state at a glance: pilot_idp_configured, pilot_webhook_configured, pilot_audit_export_active. All metrics are zero-dependency — the Prometheus exposition format is implemented from scratch with custom counter, gauge, histogram, counterVec, and histogramVec types using atomic operations.
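For a sense of what a from-scratch implementation involves, here is a minimal sketch of an atomic counter and a labeled gauge rendered in the Prometheus text exposition format. It is not the registry's code: `Counter`, `GaugeVec`, and `render` are hypothetical, histograms are omitted, and label values are quoted with Go's `%q`, which matches Prometheus escaping only for simple values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
	"sync/atomic"
)

// Counter is a monotonically increasing value updated with atomic operations.
type Counter struct {
	name string
	v    atomic.Uint64
}

func (c *Counter) Inc() { c.v.Add(1) }

// GaugeVec is a gauge with one label, stored per label value.
type GaugeVec struct {
	name, label string
	mu          sync.Mutex
	values      map[string]int64
}

func (g *GaugeVec) Set(labelValue string, v int64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.values == nil {
		g.values = make(map[string]int64)
	}
	g.values[labelValue] = v
}

// render writes both metrics in text exposition format, with label values
// sorted so the output is stable between scrapes.
func render(c *Counter, g *GaugeVec) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n%s %d\n", c.name, c.name, c.v.Load())
	fmt.Fprintf(&b, "# TYPE %s gauge\n", g.name)
	g.mu.Lock()
	defer g.mu.Unlock()
	keys := make([]string, 0, len(g.values))
	for k := range g.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %d\n", g.name, g.label, k, g.values[k])
	}
	return b.String()
}

func main() {
	provisions := &Counter{name: "pilot_provisions_total"}
	members := &GaugeVec{name: "pilot_network_members", label: "network"}
	provisions.Inc()
	provisions.Inc()
	members.Set("staging", 1)
	members.Set("prod", 3)
	fmt.Print(render(provisions, members))
}
```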

Security Hardening

The latest security audit covered JWT handling, rate limiting, and protocol-level defenses.

234 Tests Across 21 Files

The enterprise stack ships with comprehensive test coverage. Every feature has dedicated tests, including concurrent stress tests, persistence round-trips, and negative-path validation.

Test File | Tests | Category
enterprise_gate_test.go | 43 | Feature gating, RBAC edge cases, ownership transfer
fuzz_registry_server_test.go | 42 | Registry API surface coverage
invite_acceptance_test.go | 19 | Consent flow, concurrent accepts, inbox cap
integration_test.go | 17 | E2E: blueprint, RS256 JWT, Splunk HEC
network_test.go | 17 | Name validation, join rules, leave/rejoin
pilotctl_network_test.go | 14 | Driver IPC, concurrent join/leave
audit_test.go | 12 | Concurrent audit, persistence, enriched events
key_lifecycle_test.go | 9 | Rotation, expiry, heartbeat warning
auto_join_test.go | 9 | Fleet enrollment, mixed success/failure
rbac_test.go | 8 | Per-network tokens, role hierarchy
syn_trust_gate_test.go | 8 | SYN/datagram trust, defense-in-depth
handshake_test.go | 8 | P2P trust, revocation, persistence
metrics_test.go | 7 | Prometheus, enterprise counters, DLQ
replication_test.go | 6 | Hot-standby failover, enterprise data
network_policy_test.go | 6 | MaxMembers, AllowedPorts, persistence
Other (6 files) | 9 | Dashboard, hostname privacy, admin tokens, etc.

Testing methodology includes clock injection for time-sensitive tests (invite TTL, key expiry), concurrent goroutine stress tests (up to 10 workers), mock HTTP servers for external systems (Splunk HEC, OIDC JWKS, CEF/Syslog), and full registry restart round-trips for persistence verification.

53 Protocol Commands

The complete protocol surface, grouped by authentication requirement:

Auth | Commands
Admin token | create_network, set_network_enterprise, get_audit_log, set_webhook, get_webhook, get_webhook_dlq, set_identity_webhook, set_external_id, get_identity, provision_network, set_audit_export, get_audit_export, set_idp_config, get_idp_config, get_provision_status, directory_sync, directory_status, validate_token
Signature + enterprise | invite_to_network, respond_invite, kick_member, promote_member, demote_member, transfer_ownership, set_network_policy, set_key_expiry
Signature or admin | join_network, leave_network, delete_network, rename_network, resolve, deregister, set_visibility, set_hostname, set_tags, set_task_exec
Signature | report_trust, revoke_trust, request_handshake, poll_handshakes, respond_handshake, heartbeat, punch, rotate_key
None (read) | register, lookup, list_networks, list_nodes, check_trust, resolve_hostname, beacon_register, beacon_list, get_member_role, get_network_policy, get_key_info, poll_invites, update_polo_score, set_polo_score, get_polo_score
Replication token | subscribe_replication

99 Features, 21 Categories

The full feature inventory spans 21 categories.

What Is Next

The enterprise stack is production-complete. What follows is the console control plane: a web UI for managing enterprise networks, nodes, policies, audit logs, and identity configuration — all backed by the protocol commands described here. The registry is the engine. The console is the dashboard.

Try the Enterprise Stack

Install the latest release and provision your first enterprise network from a blueprint.