GET AI Labs
Infrastructure · RN-009

Building on-premise AI systems for sensitive data

On-prem is not just cloud-without-the-cloud. The constraints reshape the architecture, the deployment model, and the kind of model you can run in the first place.

Published: 2024 · 03
Read: 11 min
Author: GET Team
Category: Infrastructure

On-premise AI is not cloud-without-the-cloud. The moment you commit to running models inside a customer's data center, an air-gapped enclave, or a sovereign region, the assumptions baked into hyperscaler architectures stop holding. Elastic GPU pools disappear. Managed inference endpoints disappear. The model registry, the secrets manager, the observability stack — all of it has to be rebuilt against a smaller, harder substrate.

For organizations in defense, healthcare, finance, and critical infrastructure, this is rarely optional. Data residency rules, IL4/IL5 boundaries, HIPAA constraints, and contractual obligations frequently rule out sending sensitive inputs to a third-party API. The interesting question is not whether to run AI on-prem. It is how to design a system that holds together under those constraints without quietly degrading into a slower, less reliable copy of what the cloud already does well.

Why on-premise AI is different from cloud AI

Cloud inference assumes near-infinite hardware behind an autoscaler. On-prem inference assumes a fixed fleet — eight H100s, four L40S, or a handful of older A100s repurposed from a training cluster. Capacity planning becomes a hard constraint rather than a billing question. Every additional concurrent user, every increase in context length, every new model variant competes for the same finite VRAM.
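To make the competition for VRAM concrete, here is a minimal sizing sketch. It is not a measurement: the model shape (80 layers, 8 key/value heads, head dimension 128, FP16 cache) is an assumption modeled on a Llama-2-70B-style architecture, and the 20 GiB of free memory is a made-up figure. It also gives the worst case, where every sequence uses its full context; paged allocators only hand out cache blocks as tokens arrive.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_sequences(free_vram_gib: float, context_tokens: int,
                             per_token_bytes: int) -> int:
    """How many full-length sequences fit in the VRAM left over after weights."""
    per_sequence = context_tokens * per_token_bytes
    return int(free_vram_gib * 1024**3 // per_sequence)


# Assumed shape: 80 layers, 8 KV heads, head_dim 128, FP16 cache values.
per_token = kv_cache_bytes_per_token(80, 8, 128)
print(per_token)                                         # 327680 bytes (320 KiB)
print(max_concurrent_sequences(20.0, 4096, per_token))   # 16
print(max_concurrent_sequences(20.0, 8192, per_token))   # 8
```

Doubling the context length halves the number of sequences that fit. On a fixed fleet that trade-off has to be decided up front.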

The deployment model shifts too. There is no rolling deploy across a managed region. Updates ship as signed artifacts moved through a one-way diode or an approved transfer process. Rollbacks have to be local. Telemetry cannot phone home. The CI/CD pipeline that worked for a SaaS product needs to be re-expressed as a release bundle that a customer's operators can verify, install, and audit without external network access.
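One piece of that release bundle can be sketched as an offline integrity check. The layout is hypothetical: a `manifest.json` holding a `files` map from relative path to SHA-256 digest. The sketch also assumes the manifest's own signature has already been verified by a separate step, because the Python standard library has no asymmetric signature primitives.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large weight files never load into RAM at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(bundle_dir: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return a list of problems; an empty list means every listed file matches."""
    manifest = json.loads((bundle_dir / manifest_name).read_text())
    problems = []
    for rel_path, expected in manifest["files"].items():
        target = bundle_dir / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"digest mismatch: {rel_path}")
    return problems
```

An operator would run this after the transfer and refuse to install unless the result is empty. The check uses no network access, so it works the same way inside the enclave.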

Most consequentially, the model choice narrows. Closed frontier models that are reachable only through a vendor's hosted API are off the table in air-gapped environments. The practical universe collapses to open-weight families (Llama, Mistral, Qwen, Gemma, Phi), quantized aggressively enough to fit the hardware that the customer actually has, not the hardware a vendor wishes they had.

What hardware to plan for in private AI infrastructure

Hardware is the first constraint that shapes everything downstream. A 70B-class model in FP16 needs roughly 140 GB of VRAM for weights alone, before any KV cache. The same model in INT8 (about 70 GB of weights) fits on two H100s with room for batched inference; in NF4 (about 35 GB) it can run on a single 80 GB card, at the cost of measurable quality loss on reasoning-heavy tasks. The right answer depends on what the workload actually demands, whether that is long-context summarization, structured extraction, retrieval-augmented generation, or agentic tool use, and benchmarking on the customer's real prompts is the only honest way to decide.
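The weight figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them. It is an approximation that ignores quantization metadata (scales and zero points), activations, and KV cache, so real footprints come out somewhat higher.

```python
# Approximate storage width per parameter for each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weight footprint in decimal gigabytes, ignoring quantization overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(precision, weight_memory_gb(70e9, precision))
# fp16 140.0
# int8 70.0
# nf4 35.0
```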

For deployments that must survive procurement cycles measured in quarters, oversizing the memory budget pays for itself. Quantization can always be loosened later. Adding cards to a sealed rack rarely happens on the timeline anyone hopes for.

  • H100 or H200 SXM for high-throughput, multi-tenant inference where NVLink bandwidth matters.
  • L40S or RTX 6000 Ada for single-tenant or workstation-class deployments with lower power envelopes.
  • A100 80 GB as a defensible floor for organizations with existing inventory.
  • AMD MI300X where CUDA lock-in is a procurement issue, with the caveat that the inference-server ecosystem is still maturing.
  • CPU-only inference with Intel AMX for small models, classifiers, and embeddings where latency budgets allow.

How to choose an inference stack for air-gapped AI

The serving layer is where on-prem deployments most often underperform their cloud equivalents. vLLM has become the default for throughput-oriented workloads — PagedAttention, continuous batching, and tensor parallelism handle most of the heavy lifting. TensorRT-LLM with Triton wins on raw latency for fixed-shape workloads where compile-time optimization is worth the operational complexity. SGLang and TGI occupy the middle ground.
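As an illustration of what an air-gapped launch might look like, the sketch below builds the command line and environment for vLLM's OpenAI-compatible server. The model path is hypothetical, and the flags shown should be checked against the vLLM version that actually ships in the bundle. The two offline environment variables tell the Hugging Face libraries never to attempt a network download.

```python
import os


def build_vllm_launch(model_dir: str, tensor_parallel: int, max_model_len: int,
                      gpu_memory_utilization: float = 0.90):
    """Return (argv, env) for starting the server without any outbound network use."""
    argv = [
        "vllm", "serve", model_dir,  # a local directory, never a hub model ID
        "--tensor-parallel-size", str(tensor_parallel),
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    env = dict(os.environ)
    env["HF_HUB_OFFLINE"] = "1"        # huggingface_hub: fail instead of downloading
    env["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use only local files
    return argv, env


# Hypothetical path and sizing, for illustration only.
argv, env = build_vllm_launch("/models/llama-70b-int8", tensor_parallel=2,
                              max_model_len=8192)
print(" ".join(argv))
```

Keeping the launch configuration as data makes it easy to record in the release bundle and to audit later. The returned values can be passed to `subprocess.run(argv, env=env)`.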

Choosing among them is less interesting than the boundary conditions around them. The inference server needs to run inside a hardened container image — preferably one built from a FIPS-validated base, with no package manager available at runtime, no debug shells, and signed at every layer. Model weights need to be encrypted at rest and decrypted into memory by a process that can attest to its own integrity. Token streams need to be observable without being exfiltrated.

Operators in regulated environments will ask for an SBOM, a vulnerability scan from a tool they trust, and a clear story for how CVEs get patched without breaking the air gap. None of that is exotic. These are the table stakes that separate a demo from a system.
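A rough sketch of the offline side of that story: match the components listed in a CycloneDX-style JSON SBOM against an advisory list that was carried across the air gap. The advisory format here (a dict keyed by name and version) and the advisory identifiers are hypothetical. A real scanner also handles version ranges and package URLs, which this sketch does not.

```python
import json


def vulnerable_components(sbom_json: str, advisories: dict) -> list:
    """Return (name, version, advisory_id) for every exact name/version match."""
    sbom = json.loads(sbom_json)
    findings = []
    for component in sbom.get("components", []):
        key = (component.get("name"), component.get("version"))
        for advisory_id in advisories.get(key, []):
            findings.append((key[0], key[1], advisory_id))
    return findings


# Illustrative data only; the advisory ID is a placeholder, not a real CVE.
sbom = json.dumps({
    "bomFormat": "CycloneDX",
    "components": [
        {"name": "openssl", "version": "3.0.0"},
        {"name": "zlib", "version": "1.3.1"},
    ],
})
offline_advisories = {("openssl", "3.0.0"): ["ADVISORY-0001"]}
print(vulnerable_components(sbom, offline_advisories))
```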

Security, compliance, and the boundaries that actually matter

Compliance frameworks — FedRAMP High, IL4, IL5, HIPAA, PCI, ITAR — share a small set of underlying primitives. Encryption at rest with customer-controlled keys. Encryption in transit with mutually authenticated TLS. Strong identity for every service-to-service call. Audit logs that cannot be edited by the operator they describe. Separation of duties between the people who deploy the system and the people who can read the data flowing through it.
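The audit-log requirement can be illustrated with a hash chain, where each entry commits to the one before it. This is a sketch of the idea, not a complete design. It makes edits detectable rather than impossible, so the latest hash still has to be anchored somewhere the operator cannot rewrite, such as write-once storage or a second party. The field names are assumptions, and timestamps are left out to keep the example deterministic.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(actor: str, action: str, prev: str) -> str:
    body = {"actor": actor, "action": action, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list, actor: str, action: str) -> dict:
    """Append an entry whose hash covers its content and the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"actor": actor, "action": action, "prev": prev,
             "hash": _digest(actor, action, prev)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        expected = _digest(entry["actor"], entry["action"], entry["prev"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```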

AI workloads complicate two of these. First, prompts and responses are themselves sensitive data, often more sensitive than the documents they reference. Treat them as such: encrypt them, redact them in logs by default, and require explicit opt-in for any human review path. Second, model weights are intellectual property and, for fine-tuned models, a leakage vector for training data. Apply the same key management, the same access controls, and the same audit posture that already governs source code and customer data.
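Redaction by default can be enforced at the logging layer so that no call site has to remember it. The sketch below uses a standard `logging.Filter`. It assumes that callers pass prompt and response text as structured fields through `extra=` and do not interpolate them into the message. Both the field names and that convention are assumptions.

```python
import logging


class RedactPromptFilter(logging.Filter):
    """Replace sensitive structured fields with a length-only placeholder."""

    SENSITIVE_FIELDS = ("prompt", "response")

    def filter(self, record: logging.LogRecord) -> bool:
        for field in self.SENSITIVE_FIELDS:
            value = getattr(record, field, None)
            if isinstance(value, str):
                setattr(record, field, f"[REDACTED {len(value)} chars]")
        return True  # keep the record; only its sensitive fields are rewritten


logger = logging.getLogger("inference")
logger.addFilter(RedactPromptFilter())
# The record keeps its size metadata but not the text itself.
logger.warning("inference complete", extra={"prompt": "patient record text"})
```

Keeping the length lets operators reason about load and truncation without ever seeing content. An opt-in human review path would need a separate, audited channel and should not be a logging flag.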

Observability without exfiltration

Standard observability stacks assume telemetry can leave the environment. On-prem deployments require a local equivalent — Prometheus, Loki or OpenSearch, Grafana, and an OTLP collector — packaged as part of the release and operable by the customer's own SRE team. Metrics on tokens per second, queue depth, GPU utilization, KV cache pressure, and per-tenant quotas matter as much as application-level latency.
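To show what the local Prometheus would scrape, here is a dependency-free sketch that renders per-tenant gauges in the Prometheus text exposition format. The metric names are hypothetical. A real service would more likely use the official client library, which also handles label escaping; this sketch does not.

```python
def render_metrics(samples: dict) -> str:
    """samples maps metric name -> (help text, {tenant: value}); returns scrape text."""
    lines = []
    for name, (help_text, per_tenant) in samples.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for tenant, value in sorted(per_tenant.items()):
            lines.append(f'{name}{{tenant="{tenant}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical metric names and values.
print(render_metrics({
    "inference_queue_depth": ("Requests waiting for a batch slot",
                              {"radiology": 3, "billing": 0}),
    "inference_tokens_per_second": ("Decode throughput per tenant",
                                    {"radiology": 412.5, "billing": 97.0}),
}))
```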

Quality observability is harder. Evaluation pipelines need to run locally against held-out datasets that the customer controls, with results visible to them and to the vendor only through whatever transfer mechanism the contract allows. Drift detection, prompt-injection monitoring, and guardrail telemetry should be first-class citizens, not bolt-ons. Without them, regressions land silently and confidence in the system erodes faster than it took to build.
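A local evaluation loop does not have to be elaborate to catch silent regressions. The sketch below scores a model on exact-match accuracy over a held-out set and flags a drop beyond a tolerance. The `generate` callable stands in for whatever client the deployment exposes, and both the metric and the 0.02 tolerance are illustrative assumptions, not recommendations.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(generate: Callable[[str], str],
                         dataset: Iterable[Tuple[str, str]]) -> float:
    """Fraction of prompts whose output equals the expected answer (case-insensitive)."""
    total = correct = 0
    for prompt, expected in dataset:
        total += 1
        if generate(prompt).strip().lower() == expected.strip().lower():
            correct += 1
    return correct / total if total else 0.0


def is_regression(candidate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the candidate scores worse than the baseline by more than tolerance."""
    return candidate < baseline - tolerance
```

Running this against the customer's held-out data before every model or quantization change yields one number that both parties can compare over time, with no raw data leaving the environment.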

What to build first

On-prem AI rewards sequencing. The teams that succeed tend to converge on a similar order of operations — hardware and quantization choices benchmarked on real workloads, a hardened inference stack with reproducible builds, key management and audit before any feature work, and observability wired in from the first deployment rather than retrofitted after the first incident.

  1. Profile the workload against two or three candidate models at the quantization levels the hardware can support.
  2. Lock the inference server, container base image, and signing pipeline before touching the application layer.
  3. Stand up key management, secrets, and audit logging as a precondition for accepting any sensitive data.
  4. Build the evaluation harness against customer-controlled data before scaling out users.
  5. Plan the patch and CVE workflow at the same time as the initial deployment, not after the first finding.
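The first step in the list starts with narrowing the benchmark grid to combinations the hardware can hold at all. The sketch below does that under simplifying assumptions: model sizes are given in billions of parameters, weight size is parameters times bytes per parameter, and a fixed 30% of VRAM is held back for KV cache and activations. The model names and the headroom figure are hypothetical.

```python
from itertools import product

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}


def feasible_candidates(models: dict, vram_gb: float, headroom: float = 0.30) -> list:
    """(model, precision) pairs whose weights fit while leaving `headroom` of VRAM free."""
    budget = vram_gb * (1.0 - headroom)
    fits = []
    for (name, params_b), (precision, width) in product(models.items(),
                                                        BYTES_PER_PARAM.items()):
        if params_b * width <= budget:  # billions of params * bytes each = GB
            fits.append((name, precision))
    return fits


# Two 80 GB cards, one small and one large candidate (hypothetical names).
print(feasible_candidates({"small-8b": 8, "large-70b": 70}, vram_gb=160))
```

Everything this filter returns still has to be benchmarked on real prompts. The filter only removes combinations that cannot fit.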

Done in that order, an on-prem AI system stops being a compromise version of a cloud product and starts being its own category — slower to ship, harder to change, and dramatically more defensible in the rooms where these decisions actually get made.

Authored by GET Team · GET AI Labs