8bit.tr Journal

Inference

7 articles tagged with Inference.

January 5, 2026

Test-Time Compute Scaling: Self-Consistency and Reasoning Gains

A technical look at test-time compute strategies that improve reasoning without retraining the model.

December 23, 2025

How to profile LLM latency end to end and optimize the slowest paths in production.

December 16, 2025

A technical guide to speculative decoding, draft models, and system-level techniques that cut latency without sacrificing quality.

December 12, 2025

A production-focused guide to optimizing AI inference with batching, caching, quantization, and routing strategies.

December 11, 2025

A deep dive into distillation pipelines that preserve quality while cutting inference cost.

December 7, 2025

A systems-level guide to distributed inference, load balancing, and traffic shaping for large-scale LLM services.

December 5, 2025

A deep dive into kernel fusion, custom kernels, and GPU-level optimizations for fast LLM inference.