8bit.tr Journal
Open-Source Models in Production: System Requirements, Tokens, and Context Windows
A technical, engineering-first guide to hardware sizing for open-source LLMs, including VRAM, RAM, tokens, and context window tradeoffs.
Why System Requirements Decide Model Choice
Open-source models are no longer a curiosity; they power production-grade copilots, assistants, and internal agents. The bottleneck is not model availability; it is hardware fit.
A model that is accurate but slow or too expensive in VRAM will fail under real traffic. Engineering teams need clear sizing rules that balance quality, latency, and cost.
This guide stays practical: parameter count, precision, context window, and throughput are the knobs that determine whether a model can run on a laptop, a single GPU, or a multi-node cluster.
The Four Variables That Drive Hardware Needs
Parameter count determines baseline memory. A 7B model is fundamentally cheaper to host than a 70B model, regardless of framework.
Precision determines memory per parameter. FP16/BF16 is high quality but heavy; INT8 and INT4 can cut memory by 2x to 4x with some quality tradeoffs.
Context window drives key-value cache size. Longer context means more memory and lower token throughput at the same batch size.
Throughput is a system-level outcome. You trade VRAM and compute for tokens per second, especially under concurrent requests.
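To make the precision tradeoff concrete, here is the arithmetic behind the 2x to 4x figure, assuming standard weight-only bit widths. This is a rough sketch: real quantized checkpoints add small overheads for scales and zero points.

```python
# Bytes per parameter by precision; the 2x/4x savings fall out directly.
# Real quantized formats add a little overhead for scales and zero points.
bytes_per_param = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    savings = bytes_per_param["fp16/bf16"] / nbytes
    print(f"{precision}: {nbytes} bytes/param, {savings:.0f}x vs FP16")
```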
Baseline VRAM by Parameter Size (Approximate)
These ranges assume modern inference stacks, weight-only quantization for the INT8 and INT4 figures, and a modest batch size. Exact numbers vary by architecture and implementation, but the order of magnitude holds.
7B models: FP16/BF16 typically require 14 to 16 GB VRAM; INT8 often fits in 8 to 10 GB; INT4 can run in 5 to 7 GB.
13B models: FP16/BF16 often need 26 to 32 GB VRAM; INT8 fits in 14 to 18 GB; INT4 can run in 9 to 12 GB.
34B models: FP16/BF16 can require 70 to 80 GB VRAM; INT8 lands near 40 to 48 GB; INT4 can fit in 24 to 32 GB.
70B models: FP16/BF16 usually need 140 to 160 GB VRAM; INT8 around 80 to 96 GB; INT4 around 48 to 64 GB.
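As a quick check on these ranges, multiply parameter count by bytes per parameter and add some runtime overhead. A minimal sketch: the 10% overhead factor is an assumption for buffers and allocator slack, and it deliberately excludes KV cache, which the next section covers.

```python
# Weight memory in decimal GB: params * bytes/param * overhead.
# The 1.1 overhead factor is an assumption; KV cache is excluded entirely.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gb(params_billion: float, precision: str, overhead: float = 1.1) -> float:
    return params_billion * BYTES_PER_PARAM[precision] * overhead

for size in (7, 13, 34, 70):
    print(size, {p: round(weight_vram_gb(size, p), 1) for p in BYTES_PER_PARAM})
# 70B at fp16 -> ~154 GB, consistent with the 140 to 160 GB range above
```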
Context Window and KV Cache: The Hidden Multiplier
Tokens are not just output length. The context window controls how many tokens must be kept in memory for attention during generation.
The KV cache grows roughly in proportion to context length, batch size, and model size. A 32k context can add tens of GB of VRAM for large models.
Doubling the context length does not only cost time. You also pay in memory, which can force you into a larger GPU class.
For long-context workloads, consider models optimized for memory efficiency or sliding-window attention to reduce cache growth.
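To see why the multiplier matters: the cache holds one key and one value vector per token, per layer, per KV head. A minimal estimator, assuming an FP16 cache and illustrative 70B-class dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128); read the real values from your model's config.

```python
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim
#                  * bytes/element * context_len * batch.
# Layer, head, and dim defaults are illustrative 70B-class assumptions.

def kv_cache_gb(context_len: int, batch: int, layers: int = 80,
                kv_heads: int = 8, head_dim: int = 128,
                bytes_per_elem: int = 2) -> float:
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return context_len * batch * per_token_bytes / 1e9

print(kv_cache_gb(8_192, 1))    # ~2.7 GB: one 8k-context request
print(kv_cache_gb(32_768, 8))   # ~86 GB: eight concurrent 32k requests
```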
Token Throughput: What Performance Actually Means
Tokens per second is the real KPI for user experience. It is affected by model size, precision, context length, and batch size.
Small models can sustain 100 to 200 tokens per second on a single modern GPU at short context lengths. Large models can drop below 20 tokens per second in long-context scenarios.
Batching improves throughput but increases per-request latency. For interactive workloads, target lower batch sizes and prioritize tail latency.
Measure both prefill (prompt processing) and decode (generation). Prefill cost scales with context length and often dominates for long prompts.
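A simple way to separate the two phases is to time a streaming response: time to first token approximates prefill, and the spacing of later tokens gives decode speed. A framework-agnostic sketch; `stream_tokens` is a hypothetical stand-in for whatever streaming client your inference server exposes.

```python
import time
from typing import Callable, Iterable

def measure_request(stream_tokens: Callable[[str], Iterable[str]], prompt: str) -> dict:
    """Split latency into prefill (time to first token) and decode (tokens/s after it)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens(prompt):   # hypothetical streaming client
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    prefill_s = (first_token_at or end) - start
    decode_tps = (count - 1) / (end - first_token_at) if count > 1 else 0.0
    return {"prefill_s": prefill_s, "decode_tokens_per_s": decode_tps}
```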
CPU-Only and Hybrid Inference
CPU-only deployments are viable for small models or offline batch tasks, but they are rarely competitive for interactive UX.
If you must run on CPU, quantize aggressively and keep context windows short. Token throughput can be an order of magnitude lower than on a GPU.
Hybrid setups offload embeddings or retrieval to CPU while keeping generation on GPU. This can control costs without breaking latency budgets.
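In code, the hybrid split usually looks like this: embedding and retrieval stay on CPU threads, and only the final generation call touches the GPU. A minimal sketch; all three helpers are placeholders (assumptions) for your real embedding model, vector index, and inference client.

```python
from typing import List

def embed_on_cpu(text: str) -> List[float]:
    return [float(len(text))]                 # placeholder: a real CPU embedding model goes here

def search_index(vec: List[float], top_k: int) -> List[str]:
    return ["(retrieved passage)"] * top_k    # placeholder: CPU-side vector search

def generate_on_gpu(prompt: str) -> str:
    return "(generated answer)"               # placeholder: GPU-backed generation client

def answer(query: str) -> str:
    query_vec = embed_on_cpu(query)               # CPU: cheap, parallelizable
    passages = search_index(query_vec, top_k=5)   # CPU: retrieval
    prompt = "\n\n".join(passages) + "\n\nQuestion: " + query
    return generate_on_gpu(prompt)                # GPU: only the decode path consumes VRAM
```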
Storage and Disk I/O Considerations
Model weights are large assets. A 70B FP16 checkpoint can exceed 140 GB on disk, which requires fast SSD storage to avoid load bottlenecks.
Plan for at least 2x storage overhead for checkpoints, merged weights, and quantized variants you want to evaluate.
If you rotate models frequently, prioritize NVMe and model streaming to reduce downtime during swaps.
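A rough disk budget follows from the same bytes-per-parameter arithmetic, assuming you keep an FP16 base plus the quantized variants you evaluate and apply the 2x overhead rule above. Both the variant list and the safety factor are assumptions to adjust.

```python
# Disk budget in decimal GB: sum of the variants you keep, times a safety
# factor for merged weights, staging copies, and temporary conversion output.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def disk_budget_gb(params_billion: float,
                   variants=("fp16", "int8", "int4"),
                   safety_factor: float = 2.0) -> float:
    weights_gb = sum(params_billion * BYTES_PER_PARAM[v] for v in variants)
    return weights_gb * safety_factor

print(disk_budget_gb(70))   # ~490 GB for a 70B model with three precision variants
```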
Sizing Patterns That Work in Practice
Single GPU, interactive apps: 7B to 13B with INT4 or INT8, 8k to 16k context, target 30 to 80 tokens per second per request.
Single GPU, high-quality apps: 13B to 34B with INT8, 8k context, careful batching and routing to keep latency stable.
Multi-GPU, high-reliability apps: 34B to 70B with INT4 or INT8, 8k to 32k context, tensor parallelism and request routing.
If you need long-context reasoning, smaller models with larger context often outperform larger models that cannot fit the KV cache.
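A quick way to sanity-check any of these patterns is a simple fit test: weights plus worst-case KV cache must fit inside usable VRAM. A sketch with example inputs; the 0.9 usable fraction and the sample numbers are assumptions.

```python
# Fit check: weights + worst-case KV cache must fit in usable VRAM.
# The 0.9 usable fraction leaves headroom for buffers and fragmentation (an assumption).

def fits_on_gpu(weight_gb: float, kv_cache_gb: float,
                gpu_vram_gb: float, usable_fraction: float = 0.9) -> bool:
    return weight_gb + kv_cache_gb <= gpu_vram_gb * usable_fraction

# Example: ~7 GB of INT4 13B weights plus ~13 GB of 16k-context cache on a 24 GB card.
print(fits_on_gpu(7.0, 13.0, 24.0))   # True, but with little headroom left
```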
A Practical Deployment Checklist
Decide your maximum context window and stick to it. Oversized defaults silently destroy throughput.
Measure VRAM headroom with real prompts. Always leave margin for KV cache growth under load.
Run load tests that mimic real concurrency, not single-request benchmarks.
Version and test quantization variants. A stable INT8 model can outperform a fragile FP16 model in production.
Instrument prefill time, decode time, and queue depth. These three metrics explain nearly all latency spikes.
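A minimal shape for that instrumentation, as an in-process sketch: in production you would export these to your metrics backend, and the percentile helper here is deliberately naive.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RequestMetrics:
    prefill_s: float            # prompt processing time
    decode_s: float             # generation time
    tokens_out: int

@dataclass
class ServerStats:
    queue_depth: int = 0        # requests waiting for a decode slot
    history: List[RequestMetrics] = field(default_factory=list)

    def record(self, m: RequestMetrics) -> None:
        self.history.append(m)

    def p95(self, attr: str) -> float:
        xs = sorted(getattr(r, attr) for r in self.history)
        return xs[int(0.95 * (len(xs) - 1))] if xs else 0.0

stats = ServerStats()
stats.record(RequestMetrics(prefill_s=0.4, decode_s=2.1, tokens_out=180))
print(stats.p95("prefill_s"))
```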
FAQ: Open-Source Model Requirements
How many tokens should I allow per request? Start with 2k to 8k for interactive apps and only increase when you have a clear workflow need.
Is INT4 always good enough? Not always. It is excellent for many tasks, but some domains need INT8 or FP16 for accuracy and stability.
When do I need multi-GPU? When model weights plus KV cache exceed a single GPU budget or when throughput requirements exceed one device.