8bit.tr Journal
Inference
7 articles tagged with Inference.
January 5, 2026
Test-Time Compute Scaling: Self-Consistency and Reasoning Gains
A technical look at test-time compute strategies that improve reasoning without retraining the model.
December 23, 2025
LLM Latency Profiling and Optimization: Finding the Real Bottlenecks
How to profile LLM latency end-to-end and optimize the slowest paths in production.
December 16, 2025
Speculative Decoding and Fast Inference: Making LLMs Feel Instant
A technical guide to speculative decoding, draft models, and system tricks that cut latency without sacrificing quality.
December 12, 2025
AI Inference Optimization Stack: Latency, Cost, and Quality
A production-focused guide to optimizing AI inference with batching, caching, quantization, and routing strategies.
December 11, 2025
Knowledge Distillation for Inference: Smaller Models, Real Speed
A deep dive into distillation pipelines that preserve quality while cutting inference cost.
December 7, 2025
Distributed Inference and Load Balancing: Serving LLMs at Planet Scale
A systems-level guide to distributed inference, load balancing, and traffic shaping for large-scale LLM services.
December 5, 2025
Kernel Fusion and Inference Kernels: Squeezing Latency Out of GPUs
A deep dive into kernel fusion, custom kernels, and GPU-level optimizations for fast LLM inference.