Deploying AI in Air-Gapped Environments: Architecture Patterns for IL4/IL5
Everyone's asking how to get ChatGPT on their classified network. Here's the honest answer: you can't. But you can get something close—if you understand the tradeoffs.
The Challenge: Why Standard Cloud AI Doesn't Work
I get this question constantly. Leadership sees what ChatGPT can do, analysts are drowning in documents, and everyone wants that capability on the high side. "Just deploy it in the SCIF," they say. If only it were that simple.
The models that power ChatGPT, Claude, and the rest are massive—we're talking hundreds of billions of parameters that require specialized hardware clusters to run. They're also proprietary. You can't just download GPT-4 and spin it up on a server. And even if you could, it's been trained on internet data that may include things you don't want anywhere near a classified system.
For IL4, IL5, and classified environments, this model breaks down completely:
- No internet connectivity: Air-gapped networks have no path to cloud services by design
- Data sovereignty: Classified data cannot leave the secure enclave, period
- Model provenance: You need to know exactly what's in the model and how it was trained
- Supply chain security: Every component must be vetted and approved for the environment
- Limited infrastructure: You're working with whatever hardware exists in the SCIF, not infinite cloud scale
Despite these constraints, the mission need is real. Analysts drowning in documents need help. Operators need faster intelligence. Leadership needs better decision support. The question isn't whether to deploy AI in these environments—it's how.
Architecture Options
There's no single "right" architecture for air-gapped AI. The best approach depends on your specific constraints: available hardware, acceptable latency, model size requirements, and security boundaries. Here are the patterns we've seen work.
Pattern 1: On-Premises Inference with Quantized Models
The most common pattern is running inference locally on hardware within the security boundary. Modern open-source models (Llama 2, Mistral, Mixtral) can run on surprisingly modest hardware when properly quantized.
Architecture Components
- Inference Server: vLLM, llama.cpp, or text-generation-inference running on local hardware
- Model Storage: Quantized model weights (GGUF, GPTQ, AWQ format) stored locally
- API Layer: OpenAI-compatible API for application integration
- Load Balancer: Distribute requests across multiple inference nodes
When to use: You need general-purpose LLM capabilities (summarization, Q&A, analysis) and have at least one machine with 32GB+ RAM or a GPU with 16GB+ VRAM.
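Because vLLM and text-generation-inference expose an OpenAI-compatible API, applications inside the boundary talk to the local server the same way they would talk to a cloud endpoint. A minimal sketch of assembling such a request (the endpoint URL and model name are placeholders, not values from any specific deployment):

```python
import json

# Hypothetical local endpoint; point this at your inference server.
INFERENCE_URL = "http://inference.local:8000/v1/chat/completions"

def build_chat_request(system_prompt: str, user_prompt: str,
                       model: str = "mistral-7b-instruct") -> dict:
    """Assemble the JSON body an application would POST to the local server."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": 512,
        "temperature": 0.2,  # low temperature keeps analytical output consistent
    }

payload = build_chat_request(
    "You are a summarization assistant for intelligence reports.",
    "Summarize the attached report in three paragraphs.",
)
print(json.dumps(payload, indent=2))
```

Keeping to the OpenAI-compatible shape means off-the-shelf client libraries and internal tools can be repointed at the local server with only a base-URL change.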
Pattern 2: Specialized Small Models
Not every use case needs GPT-4-class capabilities. For specific tasks, smaller specialized models often outperform general-purpose LLMs while requiring a fraction of the resources.
Example Configurations
- Named Entity Recognition: SpaCy or fine-tuned BERT models (runs on CPU)
- Document Classification: Sentence transformers + classifier (minimal resources)
- Translation: NLLB or MarianMT models (can run on modest GPUs)
- Summarization: BART or T5 variants fine-tuned for your domain
When to use: You have a well-defined task, limited hardware, or need high throughput. A 500MB model running on CPU can process thousands of documents in the time a 70B-parameter model processes dozens.
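To make the shape of a specialized pipeline concrete, here is a toy rule-based extractor: in production this stage would be a fine-tuned spaCy or BERT model, but the extract-and-tag flow on CPU looks the same. The patterns below are illustrative stand-ins, not operational rules.

```python
import re

# Toy stand-in for a specialized extraction model. Each pattern plays the
# role a learned entity type would play in a real spaCy/BERT pipeline.
PATTERNS = {
    "MGRS_COORD": re.compile(r"\b\d{1,2}[C-X][A-Z]{2}\d{4,10}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def extract_entities(text: str) -> list[dict]:
    """Scan text with every pattern and return labeled matches."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append({"label": label, "text": match.group()})
    return found

entities = extract_entities("Contact 555-867-5309 near 18SUJ2337006479.")
print(entities)
```

A pipeline like this, whether rule-based or model-based, parallelizes trivially across CPU cores, which is where the thousands-of-documents-per-hour throughput comes from.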
Pattern 3: RAG (Retrieval Augmented Generation)
RAG combines the reasoning capabilities of LLMs with retrieval from your document corpus. This is often the right architecture for intelligence analysis use cases where you need to answer questions about specific documents.
Architecture Components
- Vector Database: Milvus, Qdrant, or pgvector for embedding storage
- Embedding Model: Local embedding model (e5, bge, or OpenAI-compatible)
- Document Processor: Chunking, parsing, and metadata extraction pipeline
- LLM: For synthesis and answer generation
- Orchestration: LangChain, LlamaIndex, or custom pipeline
When to use: Users need to query large document repositories with natural language. Critical for DOMEX, all-source analysis, and policy research use cases.
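The core RAG loop is small enough to sketch end to end. Here the "embeddings" are hand-made three-dimensional vectors so the flow is visible; in a real deployment the vectors come from a local embedding model and live in Milvus, Qdrant, or pgvector rather than a Python dict.

```python
import math

# Toy corpus: doc_id -> (embedding, chunk text). Vectors are hand-made
# placeholders standing in for a local embedding model's output.
CORPUS = {
    "doc-001": ([0.9, 0.1, 0.0], "Port activity increased in Q3."),
    "doc-002": ([0.1, 0.8, 0.1], "New rail line opened near the border."),
    "doc-003": ([0.8, 0.2, 0.1], "Shipping volume at the port doubled."),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, k=2):
    """Rank chunks by similarity to the query embedding; keep the top k."""
    ranked = sorted(CORPUS.items(),
                    key=lambda item: cosine(query_vec, item[1][0]),
                    reverse=True)
    return [(doc_id, text) for doc_id, (_vec, text) in ranked[:k]]

def build_prompt(question, query_vec):
    """Stuff retrieved chunks into the context the LLM will synthesize from."""
    context = "\n".join(f"[{doc_id}] {text}"
                        for doc_id, text in retrieve(query_vec))
    return (f"Answer using only the context below.\n\n{context}\n\n"
            f"Question: {question}")

# A query "about ports" points roughly along the first axis.
prompt = build_prompt("What changed at the port?", [1.0, 0.0, 0.0])
print(prompt)
```

Everything else in a production RAG stack (chunking, metadata, access-control filtering, reranking) hangs off this retrieve-then-synthesize skeleton.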
Pattern 4: Hybrid Architecture with Cross-Domain Solutions
Some organizations need to leverage more powerful models that can't run on local hardware. Cross-domain solutions (CDS) can enable controlled data flow between classification levels, but this adds significant complexity.
Considerations
- Data must be sanitized before crossing the boundary
- CDS adds latency (often 100ms+ per request)
- Requires extensive ATO documentation for the data flow
- Not appropriate for all data types or classification levels
When to use: You need capabilities that truly cannot run locally, and the data being processed can be appropriately sanitized. This is rare but sometimes necessary.
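To give a flavor of the sanitization step, here is a hedged sketch of a pre-boundary redaction pass. An accredited CDS filter is a dedicated appliance with a far stricter, formally reviewed rule set; the two patterns below are illustrative placeholders only.

```python
import re

# Illustrative redaction rules applied before data approaches the boundary.
# Real CDS rule sets are accredited and much more comprehensive.
REDACTION_PATTERNS = [
    (re.compile(r"\b(TOP SECRET|SECRET|CONFIDENTIAL)(//[A-Z/]+)?\b"),
     "[MARKING REDACTED]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN REDACTED]"),  # US SSN shape
]

def sanitize(text: str) -> str:
    """Apply every redaction rule in order and return the scrubbed text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

clean = sanitize("SECRET//NOFORN report mentions 123-45-6789.")
print(clean)
```

The important design point is that sanitization is deny-by-default in real systems: anything the filter cannot positively classify as safe does not cross.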
Hardware Considerations
Hardware availability in classified environments is often the binding constraint. You can't just order GPUs from Amazon—everything must go through the procurement and accreditation process.
GPU Options
| GPU | VRAM | Max Model Size | Notes |
|---|---|---|---|
| NVIDIA A100 | 40/80GB | 70B (quantized) | Gold standard, limited availability |
| NVIDIA A10 | 24GB | 13B-30B | More available, good price/performance |
| NVIDIA T4 | 16GB | 7B-13B | Widely deployed, often already available |
| NVIDIA L4 | 24GB | 13B-30B | Newer, efficient inference |
| CPU Only | N/A | 7B (slow) | Fallback option with llama.cpp |
Model Quantization
Quantization reduces model precision to fit larger models on smaller hardware. The trade-off is slight quality degradation—usually acceptable for most use cases.
| Quantization | Memory Reduction | Quality Impact | Best For |
|---|---|---|---|
| FP16 | 50% | Negligible | Default if you have memory |
| INT8 | 75% | Minimal | Good balance |
| INT4 (GPTQ/AWQ) | 87% | Noticeable on edge cases | Memory constrained |
| GGUF Q4_K_M | ~85% | Good quality retention | CPU inference |
Practical example: A 70B parameter model at FP16 requires ~140GB of VRAM. At INT4, that drops to ~35GB—runnable on a single A100-80GB or two A100-40GBs.
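The arithmetic behind those numbers is just parameters times bytes per parameter. A quick estimator (note it ignores activation memory and KV cache, which add real overhead on top of the weights):

```python
# Weight-memory estimate: parameters x bytes per parameter.
# Activations and KV cache add further overhead not counted here.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate GB of memory needed just to hold the model weights."""
    return params_billions * BYTES_PER_PARAM[precision]

print(weight_memory_gb(70, "fp16"))  # 140.0 GB, as in the example above
print(weight_memory_gb(70, "int4"))  # 35.0 GB
print(weight_memory_gb(7, "int8"))   # 7.0 GB: a 7B model fits on a T4
```

Running this estimate against the GPU table above is usually the first step in deciding which model sizes are even candidates for your environment.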
Data Handling: Training vs Inference
One of the most important architectural decisions is whether you'll do any training or fine-tuning in the classified environment, or only inference.
Inference Only (Recommended Starting Point)
- ✓ Pre-trained model is imported once (after security review)
- ✓ No classified data touches the model weights
- ✓ Simpler security boundary—model is read-only
- ✓ Easier to update models (just import new weights)
Fine-Tuning in Environment
- Model weights become classified after training on classified data
- Requires more compute resources and MLOps infrastructure
- Training data management becomes critical
- May be necessary for domain-specific performance
Our recommendation: Start with inference only. Use RAG and prompt engineering to adapt base models to your domain. Only pursue fine-tuning if you've validated the use case and have proven you can't achieve acceptable results with RAG.
Security Boundaries and Access Control
Even within a classified environment, you need to think carefully about who can access what through the AI system.
Critical Security Considerations
Document-Level Access Control
If your RAG system indexes documents with varying access controls, the retrieval system must enforce those controls. A user with SECRET clearance shouldn't see TS/SCI content in their results, even within a TS/SCI environment.
- Tag documents with classification and access control markings during ingestion
- Filter retrieval results based on user's clearance and need-to-know
- Audit all queries and returned documents
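A minimal sketch of that retrieval-time filter, with a simplified clearance ordering and compartment check (real systems also enforce need-to-know, dissemination controls, and more):

```python
# Simplified clearance lattice; real policy engines handle far more nuance.
LEVEL = {"UNCLASSIFIED": 0, "CONFIDENTIAL": 1, "SECRET": 2, "TOP SECRET": 3}

def authorized(user: dict, doc: dict) -> bool:
    """User must dominate the doc's level and hold all its compartments."""
    return (LEVEL[user["clearance"]] >= LEVEL[doc["classification"]]
            and set(doc["compartments"]) <= set(user["compartments"]))

def filter_results(user: dict, results: list[dict]) -> list[dict]:
    """Drop retrieved chunks the user is not cleared to see."""
    return [doc for doc in results if authorized(user, doc)]

docs = [
    {"id": "d1", "classification": "SECRET", "compartments": []},
    {"id": "d2", "classification": "TOP SECRET", "compartments": ["SI"]},
]
analyst = {"clearance": "SECRET", "compartments": []}
visible = filter_results(analyst, docs)  # only d1 survives the filter
```

Crucially, this filter belongs in the retrieval layer, before anything reaches the LLM's context window: once a chunk is in the prompt, no downstream control can reliably keep the model from paraphrasing it.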
Prompt Injection Protection
Adversarial prompts can attempt to extract information the user shouldn't have access to or manipulate the system's behavior.
- Input validation and sanitization on all user prompts
- System prompts that instruct the model on appropriate behavior
- Output filtering for classification markings and sensitive patterns
- Rate limiting to prevent enumeration attacks
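Output filtering is the easiest of these to illustrate: scan every model response for classification-marking patterns before it reaches the user. The pattern below is a simplified stand-in for a real marking grammar.

```python
import re

# Simplified marking detector; a production filter would implement the
# full marking syntax plus other sensitive-data patterns.
MARKING = re.compile(r"\b(TS|TOP SECRET|SECRET)//(SI|TK|NOFORN)[A-Z/]*\b")

def screen_response(text: str) -> tuple[bool, str]:
    """Return (ok, text); withhold the response if a marking leaks through."""
    if MARKING.search(text):
        return False, "[RESPONSE WITHHELD: classification marking detected]"
    return True, text

ok, out = screen_response("The summary references a TS//SI cable.")
print(ok, out)
```

Output filtering is a backstop, not a primary control: it catches accidents, while document-level access control in retrieval prevents the leak from entering the prompt in the first place.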
Audit Trail Requirements
Every interaction with the AI system should be logged:
- User identity (tied to PKI/CAC authentication)
- Timestamp and session information
- Full prompt text
- Retrieved documents (if RAG)
- Model response
- Any errors or security events
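The fields above map naturally onto one structured record per interaction. A sketch of such a record as JSON (field names are illustrative; align them with your environment's audit policy):

```python
import json
from datetime import datetime, timezone

def audit_record(user_dn, session_id, prompt, retrieved_ids,
                 response, events=()):
    """Build one structured audit entry covering a full AI interaction."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user_dn,  # subject DN from the PKI/CAC certificate
        "session": session_id,
        "prompt": prompt,
        "retrieved_documents": list(retrieved_ids),  # empty if not RAG
        "response": response,
        "security_events": list(events),
    }

record = audit_record("CN=DOE.JANE.1234567890", "sess-42",
                      "Summarize report A-17", ["doc-001"],
                      "Summary text...")
print(json.dumps(record))
```

Writing these records as append-only structured logs makes the after-the-fact questions (who asked what, which documents were surfaced) answerable without reconstructing state.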
Real Patterns That Work
Without revealing client specifics, here are deployment patterns we've seen succeed in production classified environments:
Pattern: Intelligence Report Summarization
- Model: Llama 2 70B (INT4 quantized)
- Hardware: 2x A100-40GB
- Use case: Summarizing long-form intelligence reports into executive briefs
- Throughput: ~50 reports/hour
- Key success factor: Careful prompt engineering for consistent output format
Pattern: Document Repository Q&A
- Architecture: RAG with Milvus + Mistral 7B
- Hardware: 1x A10 (24GB) for inference, separate server for vector DB
- Corpus: 500K+ documents across multiple security compartments
- Key success factor: Document-level access control in the retrieval layer
Pattern: Entity Extraction Pipeline
- Model: Fine-tuned SpaCy NER + custom patterns
- Hardware: CPU only (runs on standard servers)
- Use case: Extracting people, places, organizations from DOMEX materials
- Throughput: 10K+ documents/hour
- Key success factor: Domain-specific training data and entity types
Getting Started
If you're looking to bring AI capabilities to your classified environment, here's a practical starting path:
1. Inventory your hardware. What GPUs (if any) are currently deployed or available for procurement? What's the approval timeline for new hardware?
2. Define your use case precisely. "AI for analysts" is too vague. "Summarize multi-page intelligence reports into 3-paragraph briefs" is actionable.
3. Prototype on unclassified. Build and test your architecture on an unclassified system first. Work out the kinks before dealing with SCIF logistics.
4. Plan your model import. How will you get model weights into the environment? What security review is required?
5. Start small. Deploy a 7B model on modest hardware. Prove value before scaling.
Ready to Bring AI Capabilities to Your Classified Mission?
We specialize in deploying AI in environments where cloud services aren't an option. From architecture design to production deployment, we can help you bring modern AI capabilities inside your security boundary.