Deploying AI in Air-Gapped Environments: Architecture Patterns for IL4/IL5

Everyone's asking how to get ChatGPT on their classified network. Here's the honest answer: you can't. But you can get something close—if you understand the tradeoffs.

The Challenge: Why Standard Cloud AI Doesn't Work

I get this question constantly. Leadership sees what ChatGPT can do, analysts are drowning in documents, and everyone wants that capability on the high side. "Just deploy it in the SCIF," they say. If only it were that simple.

The models that power ChatGPT, Claude, and the rest are massive—we're talking hundreds of billions of parameters that require specialized hardware clusters to run. They're also proprietary. You can't just download GPT-4 and spin it up on a server. And even if you could, it's been trained on internet data that may include things you don't want anywhere near a classified system.

For IL4, IL5, and classified environments, the standard cloud delivery model breaks down completely:

  • No internet connectivity: Air-gapped networks have no path to cloud services by design
  • Data sovereignty: Classified data cannot leave the secure enclave, period
  • Model provenance: You need to know exactly what's in the model and how it was trained
  • Supply chain security: Every component must be vetted and approved for the environment
  • Limited infrastructure: You're working with whatever hardware exists in the SCIF, not infinite cloud scale

Despite these constraints, the mission need is real. Analysts drowning in documents need help. Operators need faster intelligence. Leadership needs better decision support. The question isn't whether to deploy AI in these environments—it's how.

Architecture Options

There's no single "right" architecture for air-gapped AI. The best approach depends on your specific constraints: available hardware, acceptable latency, model size requirements, and security boundaries. Here are the patterns we've seen work.

Pattern 1: On-Premises Inference with Quantized Models

The most common pattern is running inference locally on hardware within the security boundary. Modern open-source models (Llama 2, Mistral, Mixtral) can run on surprisingly modest hardware when properly quantized.

Architecture Components

  • Inference Server: vLLM, llama.cpp, or text-generation-inference running on local hardware
  • Model Storage: Quantized model weights (GGUF, GPTQ, AWQ format) stored locally
  • API Layer: OpenAI-compatible API for application integration
  • Load Balancer: Distribute requests across multiple inference nodes

When to use: You need general-purpose LLM capabilities (summarization, Q&A, analysis) and have at least one machine with 32GB+ RAM or a GPU with 16GB+ VRAM.
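To make the API layer concrete, here's a minimal stdlib-only sketch of application-side integration against an OpenAI-compatible endpoint. The URL and model name are placeholders for whatever your local server exposes; vLLM and llama.cpp both serve a `/v1/chat/completions` route in this shape:

```python
import json
import urllib.request

# Assumed local endpoint and model name -- substitute your own.
API_URL = "http://inference.local:8000/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "llama-2-70b-chat") -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.2,
    }

def query(prompt: str) -> str:
    """POST the request to the local inference server (no internet path)."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```

Because the API shape matches OpenAI's, applications written against it can move between vLLM, llama.cpp, and text-generation-inference backends with little change.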

Pattern 2: Specialized Small Models

Not every use case needs GPT-4-class capabilities. For specific tasks, smaller specialized models often outperform general-purpose LLMs while requiring a fraction of the resources.

Example Configurations

  • Named Entity Recognition: SpaCy or fine-tuned BERT models (runs on CPU)
  • Document Classification: Sentence transformers + classifier (minimal resources)
  • Translation: NLLB or MarianMT models (can run on modest GPUs)
  • Summarization: BART or T5 variants fine-tuned for your domain

When to use: You have a well-defined task, limited hardware, or need high throughput. A 500MB model running on CPU can process thousands of documents in the time a 70B-parameter model processes dozens.
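As an illustration of how little a narrow task can require, here's a pure-Python nearest-centroid classifier over bag-of-words vectors. A real deployment would use sentence transformers or a fine-tuned BERT as listed above; this sketch (with made-up class centroids) just shows the resource profile of the pattern:

```python
import math
from collections import Counter

def bow(text: str) -> Counter:
    """Lowercase bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc: str, centroids: dict) -> str:
    """Assign the document to the class with the nearest centroid."""
    return max(centroids, key=lambda label: cosine(bow(doc), centroids[label]))

# Illustrative centroids; in practice these come from labeled training data.
centroids = {
    "logistics": bow("supply convoy fuel depot route shipment"),
    "communications": bow("radio signal frequency transmission intercept"),
}
```

This entire pipeline runs on commodity CPUs with no model weights to import, which can matter when accreditation timelines are long.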

Pattern 3: RAG (Retrieval Augmented Generation)

RAG combines the reasoning capabilities of LLMs with retrieval from your document corpus. This is often the right architecture for intelligence analysis use cases where you need to answer questions about specific documents.

Architecture Components

  • Vector Database: Milvus, Qdrant, or pgvector for embedding storage
  • Embedding Model: Local embedding model (e5, bge, or OpenAI-compatible)
  • Document Processor: Chunking, parsing, and metadata extraction pipeline
  • LLM: For synthesis and answer generation
  • Orchestration: LangChain, LlamaIndex, or custom pipeline

When to use: Users need to query large document repositories with natural language. Critical for DOMEX, all-source analysis, and policy research use cases.
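The document processor and retrieval stages above can be sketched in a few lines. This assumes embeddings are produced elsewhere by your local embedding model; the chunk sizes and similarity metric are illustrative defaults, not a recommendation:

```python
import math

def chunk_text(text, size=200, overlap=40):
    """Split a document into overlapping word-window chunks for embedding."""
    words = text.split()
    chunks, i = [], 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break
        i += size - overlap  # step forward, keeping an overlap for context
    return chunks

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    """index: (chunk, embedding) pairs; in production this lives in
    Milvus/Qdrant/pgvector rather than an in-memory list."""
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

The retrieved chunks are then assembled into a prompt for the LLM synthesis step.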

Pattern 4: Hybrid Architecture with Cross-Domain Solutions

Some organizations need to leverage more powerful models that can't run on local hardware. Cross-domain solutions (CDS) can enable controlled data flow between classification levels, but this adds significant complexity.

Considerations

  • Data must be sanitized before crossing the boundary
  • CDS adds latency (often 100ms+ per request)
  • Requires extensive ATO documentation for the data flow
  • Not appropriate for all data types or classification levels

When to use: You need capabilities that truly cannot run locally, and the data being processed can be appropriately sanitized. This is rare but sometimes necessary.
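To give a flavor of the sanitization step, here's a toy pre-filter that strips portion markings and redacts coordinate-like strings before a request is allowed to cross the boundary. The patterns are purely illustrative; a real CDS filter is formally specified and accredited, not a handful of regexes:

```python
import re

# Illustrative patterns only -- not a real sanitization policy.
MARKING = re.compile(r"\((?:TS|S|C|U)(?://[A-Z]+)*\)")
COORDS = re.compile(r"\b\d{1,2}\.\d+[NS],?\s*\d{1,3}\.\d+[EW]\b")

def sanitize(text: str) -> str:
    """Strip portion markings and redact coordinate-like strings
    before the payload leaves the enclave."""
    text = MARKING.sub("", text)
    text = COORDS.sub("[REDACTED]", text)
    return text
```

In practice the filter runs on both sides of the CDS, and anything it cannot positively clear is dropped rather than forwarded.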

Hardware Considerations

Hardware availability in classified environments is often the binding constraint. You can't just order GPUs from Amazon—everything must go through the procurement and accreditation process.

GPU Options

  GPU          VRAM     Max Model Size    Notes
  NVIDIA A100  40/80GB  70B (quantized)   Gold standard, limited availability
  NVIDIA A10   24GB     13B-30B           More available, good price/performance
  NVIDIA T4    16GB     7B-13B            Widely deployed, often already available
  NVIDIA L4    24GB     13B-30B           Newer, efficient inference
  CPU only     N/A      7B (slow)         Fallback option with llama.cpp

Model Quantization

Quantization reduces model precision to fit larger models on smaller hardware. The trade-off is slight quality degradation—usually acceptable for most use cases.

  Quantization     Memory Reduction  Quality Impact            Best For
  FP16             50%               Negligible                Default if you have the memory
  INT8             75%               Minimal                   Good balance
  INT4 (GPTQ/AWQ)  87%               Noticeable on edge cases  Memory-constrained setups
  GGUF Q4_K_M      ~85%              Good quality retention    CPU inference

Practical example: A 70B parameter model at FP16 requires ~140GB of VRAM. At INT4, that drops to ~35GB—runnable on a single A100-80GB or two A100-40GBs.
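The arithmetic behind that example is simple enough to capture in a helper for capacity planning. Note this covers weights only; KV cache, activations, and framework overhead add more on top:

```python
# Approximate bytes per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight memory in GB: billions of params x bytes/param.
    Real usage is higher once KV cache and overhead are included."""
    return params_billion * BYTES_PER_PARAM[precision]
```

For example, `model_memory_gb(70, "fp16")` gives 140 GB and `model_memory_gb(70, "int4")` gives 35 GB, matching the figures above.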

Data Handling: Training vs Inference

One of the most important architectural decisions is whether you'll do any training or fine-tuning in the classified environment, or only inference.

Inference Only (Recommended Starting Point)

  • Pre-trained model is imported once (after security review)
  • No classified data touches the model weights
  • Simpler security boundary—model is read-only
  • Easier to update models (just import new weights)

Fine-Tuning in Environment

  • Model weights become classified after training on classified data
  • Requires more compute resources and MLOps infrastructure
  • Training data management becomes critical
  • May be necessary for domain-specific performance

Our recommendation: Start with inference only. Use RAG and prompt engineering to adapt base models to your domain. Only pursue fine-tuning if you've validated the use case and have proven you can't achieve acceptable results with RAG.
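Adapting a base model with RAG and prompt engineering can be as simple as a disciplined prompt template. The wording below is illustrative, not a tuned prompt; the point is that domain adaptation starts as string assembly, not training:

```python
# Illustrative system prompt -- tune the wording for your own domain.
SYSTEM_PROMPT = (
    "You are an intelligence analysis assistant. Answer only from the "
    "provided context. If the context is insufficient, say so explicitly."
)

def build_prompt(question: str, chunks: list) -> str:
    """Assemble retrieved chunks and the user question into one prompt."""
    context = "\n---\n".join(chunks)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Iterating on this template is cheap and reversible; fine-tuning is neither, which is why it comes last.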

Security Boundaries and Access Control

Even within a classified environment, you need to think carefully about who can access what through the AI system.

Critical Security Considerations

Document-Level Access Control

If your RAG system indexes documents with varying access controls, the retrieval system must enforce those controls. A user with SECRET clearance shouldn't see TS/SCI content in their results, even within a TS/SCI environment.

  • Tag documents with classification and access control markings during ingestion
  • Filter retrieval results based on user's clearance and need-to-know
  • Audit all queries and returned documents
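A minimal sketch of that retrieval-layer filter, assuming documents were tagged at ingestion: clearance must dominate the document's level, and the user must hold every compartment on the document. The level ordering and record shapes here are illustrative:

```python
# Illustrative ordering of classification levels.
CLEARANCE_ORDER = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}

def authorized(user: dict, doc: dict) -> bool:
    """Clearance must dominate the doc's level AND the user must hold
    every compartment tagged on the doc (need-to-know)."""
    level_ok = CLEARANCE_ORDER[user["clearance"]] >= CLEARANCE_ORDER[doc["classification"]]
    compartments_ok = set(doc.get("compartments", [])) <= set(user.get("compartments", []))
    return level_ok and compartments_ok

def filter_results(user: dict, hits: list) -> list:
    """Drop unauthorized documents BEFORE they reach the LLM context window."""
    return [d for d in hits if authorized(user, d)]
```

The filter must run before synthesis, not after: once a document enters the model's context, its content can leak into the response.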

Prompt Injection Protection

Adversarial prompts can attempt to extract information the user shouldn't have access to or manipulate the system's behavior.

  • Input validation and sanitization on all user prompts
  • System prompts that instruct the model on appropriate behavior
  • Output filtering for classification markings and sensitive patterns
  • Rate limiting to prevent enumeration attacks
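A toy version of the input and output screens, to show where they sit in the request path. The patterns are illustrative stand-ins; a production filter is policy-driven and far more thorough than a keyword list:

```python
import re

# Illustrative deny-patterns -- not a complete injection defense.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal .*system prompt", re.I),
]
MARKING = re.compile(r"\b(?:TOP SECRET|TS//[A-Z]+)\b")

def screen_input(prompt: str) -> bool:
    """True if the prompt passes the basic injection screen."""
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)

def screen_output(text: str) -> str:
    """Withhold responses containing markings the requesting user is not
    cleared for (placeholder policy -- real checks use the user's profile)."""
    return "[WITHHELD]" if MARKING.search(text) else text
```

Pattern matching alone is easy to evade, which is why the list above pairs it with system prompts, output filtering, and rate limiting rather than relying on any one layer.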

Audit Trail Requirements

Every interaction with the AI system should be logged:

  • User identity (tied to PKI/CAC authentication)
  • Timestamp and session information
  • Full prompt text
  • Retrieved documents (if RAG)
  • Model response
  • Any errors or security events
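The fields above map naturally to one append-only JSON line per interaction. A sketch, with the field names as assumptions to adapt to your own audit schema:

```python
import json
from datetime import datetime, timezone

def audit_record(user_id, session_id, prompt, retrieved, response, error=None):
    """Serialize one interaction as a JSON line for the append-only audit log."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user_id,            # identity from PKI/CAC authentication
        "session": session_id,
        "prompt": prompt,           # full prompt text
        "retrieved_docs": retrieved,  # document IDs returned by RAG, if any
        "response": response,
        "error": error,             # errors or security events, if any
    })
```

JSON lines keep the log greppable and easy to ship to whatever SIEM the enclave already runs.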

Real Patterns That Work

Without revealing client specifics, here are deployment patterns we've seen succeed in production classified environments:

Pattern: Intelligence Report Summarization

  • Model: Llama 2 70B (INT4 quantized)
  • Hardware: 2x A100-40GB
  • Use case: Summarizing long-form intelligence reports into executive briefs
  • Throughput: ~50 reports/hour
  • Key success factor: Careful prompt engineering for consistent output format

Pattern: Document Repository Q&A

  • Architecture: RAG with Milvus + Mistral 7B
  • Hardware: 1x A10 (24GB) for inference, separate server for vector DB
  • Corpus: 500K+ documents across multiple security compartments
  • Key success factor: Document-level access control in the retrieval layer

Pattern: Entity Extraction Pipeline

  • Model: Fine-tuned SpaCy NER + custom patterns
  • Hardware: CPU only (runs on standard servers)
  • Use case: Extracting people, places, organizations from DOMEX materials
  • Throughput: 10K+ documents/hour
  • Key success factor: Domain-specific training data and entity types

Getting Started

If you're looking to bring AI capabilities to your classified environment, here's a practical starting path:

  1. Inventory your hardware. What GPUs (if any) are currently deployed or available for procurement? What's the approval timeline for new hardware?
  2. Define your use case precisely. "AI for analysts" is too vague. "Summarize multi-page intelligence reports into 3-paragraph briefs" is actionable.
  3. Prototype on unclassified. Build and test your architecture on an unclassified system first. Work out the kinks before dealing with SCIF logistics.
  4. Plan your model import. How will you get model weights into the environment? What security review is required?
  5. Start small. Deploy a 7B model on modest hardware. Prove value before scaling.

Ready to Bring AI Capabilities to Your Classified Mission?

We specialize in deploying AI in environments where cloud services aren't an option. From architecture design to production deployment, we can help you bring modern AI capabilities inside your security boundary.

Get in Touch
Merlin System Solutions