Most AI infrastructure advice assumes a working credit card and an outbound HTTPS connection. Strip those away and the architecture changes shape. The model weights still load, the inference still runs, but every assumption underneath — managed identity, hosted vector stores, telemetry pipelines, automatic security patches — has to be rebuilt from parts that exist on the inside of the boundary.
Private AI infrastructure for regulated environments is not cloud-minus-the-cloud. It is a distinct discipline with its own failure modes. These notes summarize what holds up, what breaks, and where the surprises tend to cluster when deploying on-premises or air-gapped AI for customers in defense, healthcare, finance, and critical infrastructure.
What changes when the public cloud is off the table
The first thing to go is elasticity. In a sensitive environment the GPU pool is fixed, often for the lifetime of the accreditation. Capacity planning shifts from a runtime concern to a procurement concern with a six-to-twelve-month lead time. Bursting to a public region is not an option, so the design has to absorb peak load on hardware that is already on the floor.
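Because the pool cannot grow at runtime, the sizing arithmetic has to be done before procurement. The sketch below is one simple way to frame it, not a sizing method from any vendor: every input (request rate, tokens per request, per-GPU throughput, headroom fraction) is an illustrative placeholder that would need to be replaced with measured values, and the function name `gpus_required` is hypothetical.

```python
import math


def gpus_required(peak_requests_per_s: float,
                  tokens_per_request: float,
                  tokens_per_s_per_gpu: float,
                  headroom: float = 0.25) -> int:
    """GPUs needed to absorb peak load while holding back a reserve fraction.

    headroom is the share of each GPU's measured throughput that is kept
    free for failures, maintenance windows, and forecast error.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    peak_tokens_per_s = peak_requests_per_s * tokens_per_request
    usable_per_gpu = tokens_per_s_per_gpu * (1 - headroom)
    return math.ceil(peak_tokens_per_s / usable_per_gpu)


# Placeholder inputs: 40 req/s at peak, 600 tokens each, 2500 tokens/s per GPU.
print(gpus_required(40, 600, 2500))  # 13
```

With no burst capacity to fall back on, the headroom term is doing the job that autoscaling does in a public region, so it deserves the same scrutiny as the throughput measurement.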
The second thing to go is the managed control plane. Identity, secrets, observability, container registries, model registries, and the CI pipeline all need on-prem equivalents inside the boundary. Each of those substitutions carries its own hardening obligation, its own backup story, and its own audit trail. A reference architecture that looks tidy in a slide deck tends to expand into thirty or forty supporting services once the compliance reviewer asks where each artifact comes from.
The third thing to go is the assumption of continuous updates. Air-gapped AI deployments receive software through a one-way diode or an approved media transfer process. Patches arrive in batches, weeks behind upstream, after they have been scanned, signed, and approved. Any architecture that assumes a fresh CVE fix within hours will not pass review.
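On the receiving side of a transfer, one common check is to compare each imported file against a manifest of approved digests. The following is a minimal sketch under stated assumptions: the manifest is a plain mapping of relative path to SHA-256 hex digest, its own signature has already been verified by a separate step not shown here, and the names `sha256_file` and `verify_bundle` are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte artifacts do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the bundle matches.

    Assumption: `manifest` maps relative POSIX paths to approved SHA-256
    hex digests and was itself authenticated before this call.
    """
    problems = []
    for rel_name, expected in manifest.items():
        target = bundle_dir / rel_name
        if not target.is_file():
            problems.append(f"missing: {rel_name}")
        elif not hmac.compare_digest(sha256_file(target), expected.lower()):
            problems.append(f"digest mismatch: {rel_name}")
    # Anything on the media that the approver never listed is also a finding.
    present = {p.relative_to(bundle_dir).as_posix()
               for p in bundle_dir.rglob("*") if p.is_file()}
    for extra in sorted(present - set(manifest)):
        problems.append(f"unapproved file: {extra}")
    return problems
```

Rejecting unlisted files as well as mismatched ones keeps the approval record and the contents of the media in agreement, which is usually what a reviewer asks to see.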
How the threat model shifts inside the boundary
Public cloud threat models center on tenant isolation, credential theft, and exposed endpoints. Inside a hardened environment the population of attackers narrows, but the consequences of a single compromise widen. The same operator who runs the inference cluster may also hold keys to the data lake, the logging plane, and the model registry. Lateral movement is the dominant concern, not initial access.
Two attack surfaces deserve more attention than they usually get. The first is the model itself: weights, tokenizers, and adapters are executable artifacts that arrive from outside the boundary. A poisoned checkpoint or a backdoored embedding model is functionally indistinguishable from a clean one until it is probed. The second is the retrieval layer. A self-hosted vector database with permissive access controls becomes a parallel copy of the source documents, frequently with weaker authorization than the system of record.
Mitigations are unglamorous and effective. Treat every imported weight as untrusted code: scan, sign, pin, and run under the same egress controls applied to any other binary. Enforce row-level authorization at the retrieval boundary, not only at the application layer, so a prompt injection cannot exfiltrate documents the user could not have read directly.
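To make the retrieval-boundary point concrete, here is a minimal sketch of an authorization filter applied to retrieved chunks before they reach the prompt. It assumes each chunk carries an access list copied from the system of record at index time; the `Chunk` type, the `allowed_groups` field, and `authorized_chunks` are hypothetical names. A production design would more likely push the same predicate into the vector store query so unauthorized chunks never leave the store; filtering afterwards is a simplification for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    # Assumption: ACL copied from the system of record when the chunk was indexed.
    allowed_groups: frozenset[str]


def authorized_chunks(candidates, principal_groups):
    """Keep only chunks the requesting principal could have read directly.

    Fails closed: a chunk with an empty ACL is visible to nobody.
    """
    groups = frozenset(principal_groups)
    return [c for c in candidates if c.allowed_groups & groups]


index_hits = [
    Chunk("hr-001", "Salary bands", frozenset({"hr"})),
    Chunk("eng-042", "Runbook", frozenset({"engineering", "sre"})),
    Chunk("orphan-7", "No ACL recorded", frozenset()),
]
visible = authorized_chunks(index_hits, {"engineering"})
print([c.doc_id for c in visible])  # ['eng-042']
```

The filter keys on the identity of the requesting user rather than on anything in the prompt, so injected instructions have no way to widen what is returned.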
Reference building blocks for private AI infrastructure
The stack below is not exhaustive, but it covers the components most often missed in a first-pass design for AI over sensitive data. Each line is a category, not a product endorsement; the right choice depends on the accreditation target, existing tooling, and the operations team that will inherit it.
- Hardened base images with a documented provenance chain, rebuilt on a known cadence and scanned against the operator's CVE feed.
- On-prem inference servers with GPU partitioning, so a single A100 or H100 can host multiple tenants without sharing memory contexts.
- A model registry inside the boundary that stores weights, tokenizer files, signatures, and the evaluation results that justified promotion.
- A self-hosted vector database with authentication tied to the enterprise identity provider and per-namespace encryption keys.
- A key management service that supports hardware-backed roots — HSM or equivalent — and rotates inference-time signing keys without an outbound call.
- An audit logging pipeline that captures prompts, retrieved chunks, model outputs, and the identity of the requesting principal, retained on write-once storage for the period the regulator requires.
- A separate evaluation environment that mirrors production data classes but is reachable only by the model engineering team, not by end users.
The audit logging line is the one most likely to be underestimated. Volumes for an enterprise AI deployment can exceed a terabyte per day once full prompt and retrieval context is captured. Storage, indexing, and redaction need to be sized for that reality from day one, not retrofitted after the first incident review.
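A back-of-the-envelope calculation shows how quickly the volume adds up. All figures below are illustrative placeholders rather than measurements, and `daily_audit_bytes` is a hypothetical helper; the point is the shape of the arithmetic, in which retrieved context usually dominates.

```python
def daily_audit_bytes(requests_per_day: int,
                      prompt_bytes: int,
                      retrieved_chunks: int,
                      bytes_per_chunk: int,
                      output_bytes: int,
                      metadata_bytes: int = 1_000) -> int:
    """Uncompressed bytes written per day when full context is logged."""
    per_request = (prompt_bytes
                   + retrieved_chunks * bytes_per_chunk
                   + output_bytes
                   + metadata_bytes)
    return requests_per_day * per_request


# Placeholder workload: 5M requests/day, 20 retrieved chunks of 10 kB each.
per_day = daily_audit_bytes(
    requests_per_day=5_000_000,
    prompt_bytes=8_000,
    retrieved_chunks=20,
    bytes_per_chunk=10_000,
    output_bytes=4_000,
)
retention_days = 365  # assumption; the regulator sets the real figure
print(f"{per_day / 1e12:.3f} TB/day")
print(f"{per_day * retention_days / 1e12:.1f} TB retained")
```

Under these assumptions the retrieved chunks account for 200 kB of each 213 kB record, so decisions about whether to log chunk text or only chunk identifiers and digests move the total far more than anything else.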
Why FedRAMP, IL5, and HIPAA push the architecture in similar directions
FedRAMP High, DoD Impact Level 5, and HIPAA enforcement have different scopes, but they converge on a small set of architectural requirements: a documented boundary, encryption in transit and at rest with operator-controlled keys, identity tied to enterprise directories, immutable audit logs, and a change management process that can produce evidence for any artifact in production.
For AI in regulated environments, this convergence is useful. A design that satisfies the strictest control set in scope will usually satisfy the others with documentation work rather than architectural rework. The economically painful path is the reverse: building for the lightest regime first and discovering that data residency, key custody, or log retention forces a rebuild before the second customer.
The honest caveat: convergence applies to architecture, not to evidence. Each regime has its own assessment artifacts, control mappings, and continuous monitoring expectations. Plan for the documentation effort to roughly match the engineering effort, not to be a thin layer on top of it.
Where the surprises usually are
Three categories of surprise show up repeatedly. The first is licensing. Open-weight models with permissive licenses for research often carry restrictions on use in classified or commercial settings, and those restrictions are enforced through terms that a procurement office will read carefully. Confirm the license against the deployment context before the model is selected, not after the integration is built.
The second is GPU driver and firmware management. Inside an air-gapped network, a driver update is a change request, a media transfer, and a maintenance window. Coupling model performance tightly to the latest CUDA or ROCm release creates an upgrade treadmill the operations team cannot run. Pin versions, validate against the pinned stack, and treat driver changes as first-class architectural events.
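Pinning is easier to enforce when the pinned stack is a checked-in artifact that a node can be compared against. The sketch below assumes the observed versions have already been collected by some site-specific mechanism; the component names and version strings are placeholders, and `stack_drift` is a hypothetical helper.

```python
# Placeholder pins; real values come from the validated, accredited stack.
PINNED_STACK = {
    "gpu_driver": "535.104.05",
    "cuda_runtime": "12.2",
    "inference_server": "1.4.2",
}


def stack_drift(pinned: dict[str, str],
                observed: dict[str, str]) -> dict[str, tuple]:
    """Map each drifted component to (pinned, observed); empty means in sync.

    A component missing from `observed` is reported with None as its version.
    """
    drift = {}
    for component, want in pinned.items():
        have = observed.get(component)
        if have != want:
            drift[component] = (want, have)
    return drift


# Assumption: this dict was gathered from the node by existing tooling.
node_report = {"gpu_driver": "535.129.03", "cuda_runtime": "12.2"}
print(stack_drift(PINNED_STACK, node_report))
```

Exact string comparison is deliberate here: inside a change-controlled environment, a newer-than-pinned driver is as much a finding as an older one.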
The third is people. A private AI infrastructure stack requires operators who can debug a stuck inference pod at three in the morning without external support, and engineers who treat compliance as a design input rather than a hurdle. Staffing for that profile takes longer than standing up the hardware. Start hiring before the racks arrive.
A short checklist before signing off on a design
- Identify every artifact that crosses the boundary (weights, container images, datasets, telemetry) and document the transfer mechanism and approval owner for each.
- Run a failure drill for the longest plausible disconnection from the outside, including expired certificates, license check-ins, and CVE feeds.
- Confirm that audit logs capture enough context to reconstruct any model response, and that the retention policy matches the strictest regime in scope.
- Verify that the identity model covers humans, services, and models as distinct principals, each with its own credentials and authorization scope.
- Walk an accreditor through the design before construction, not after, and capture their objections as design constraints.
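The disconnection drill in the checklist can start as a desk exercise: list everything with an expiry date and ask what lapses inside the window. The sketch below assumes such an inventory already exists as a mapping of item name to expiry date; the item names, dates, and `expiring_within` are hypothetical.

```python
from datetime import date, timedelta


def expiring_within(inventory: dict[str, date],
                    start: date,
                    disconnect_days: int) -> list[str]:
    """Names of items that expire on or before the end of the disconnection."""
    horizon = start + timedelta(days=disconnect_days)
    return sorted(name for name, expires in inventory.items()
                  if expires <= horizon)


# Placeholder inventory of certificates and license check-ins.
inventory = {
    "registry-tls": date(2025, 3, 15),
    "license-server": date(2025, 6, 30),
    "root-ca": date(2030, 1, 1),
}
at_risk = expiring_within(inventory, start=date(2025, 1, 1), disconnect_days=180)
print(at_risk)  # ['license-server', 'registry-tls']
```

Each name in the result needs either a renewal path that works without outside contact or an explicit acceptance of the outage.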
Private systems for sensitive environments reward conservatism in dependencies and rigor in evidence. The teams that ship successfully are the ones that treat the boundary as a product surface — versioned, tested, monitored, and owned — rather than a perimeter to be drawn once and forgotten. The hardware is the easy part. The discipline is the work.
