Optimizing Dense LLM Inference on Trillium TPUs: A Production-Grade vLLM Deployment Guide

By Codcompass Team · 84 min read

Current Situation Analysis

The industry is currently experiencing a structural shift in how large language models are served at scale. Architectural debates heavily favor Mixture-of-Experts (MoE) designs, with teams assuming that sparse activation automatically translates to lower costs, higher throughput, and better latency. While MoE models undeniably reduce active parameter counts per token, this assumption overlooks a critical hardware reality: modern accelerator architectures are increasingly optimized for dense matrix multiplication patterns. When serving dense models on next-generation silicon, the expected efficiency gap narrows dramatically, and in specific throughput profiles, dense architectures can actually outperform their sparse counterparts.

This problem is frequently misunderstood because benchmarking is often conducted in isolation, without accounting for continuous batching dynamics, KV cache fragmentation, or hardware topology alignment. Engineering teams default to MoE for cost savings, only to discover that dense models on Trillium-class TPUs deliver comparable or superior peak throughput due to tighter memory bandwidth utilization and reduced routing overhead. The misconception stems from evaluating models purely on parameter counts rather than on actual silicon utilization metrics.

Empirical data from recent production deployments confirms this shift. When running google/gemma-4-31B-it on Cloud TPU v6e-4 (Trillium) infrastructure, the dense architecture achieves a peak prefill throughput of 463,345 tokens per second. At Flex-start pricing (~$0.40/hour), this translates to approximately 308 million tokens processed per dollar. The system maintains stability under extreme concurrency, handling 1,024 simultaneous requests without memory exhaustion. These metrics demonstrate that dense models, when paired with optimized serving engines and correctly tuned concurrency windows, remain highly competitive for both interactive and batch workloads. The engineering challenge is no longer about choosing dense versus sparse; it is about aligning model architecture with hardware execution patterns and request scheduling strategies.
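The cost figure above is simple arithmetic on throughput and hourly price, and it is worth checking before building a budget on it. The sketch below is a minimal illustration, not part of the original benchmark; the function names are hypothetical. It computes the hourly price implied by the quoted throughput and tokens-per-dollar numbers. With these inputs, about 308M tokens per dollar corresponds to roughly $5.42 per hour for the serving slice. How that relates to the quoted ~$0.40/hour (for example, a per-chip or otherwise different billing unit) is not stated here, so treat the price input as an assumption and substitute your own billing rate.

```python
def tokens_per_dollar(tokens_per_second: float, price_per_hour: float) -> float:
    """Tokens processed per dollar at a sustained throughput and hourly price."""
    if price_per_hour <= 0:
        raise ValueError("price_per_hour must be positive")
    return tokens_per_second * 3600 / price_per_hour


def implied_price_per_hour(tokens_per_second: float, tokens_per_usd: float) -> float:
    """Hourly price implied by a throughput and a tokens-per-dollar figure."""
    if tokens_per_usd <= 0:
        raise ValueError("tokens_per_usd must be positive")
    return tokens_per_second * 3600 / tokens_per_usd


PEAK_TOK_PER_S = 463_345        # peak prefill throughput quoted in the text
QUOTED_TOKENS_PER_USD = 308e6   # cost efficiency quoted in the text

tokens_per_hour = PEAK_TOK_PER_S * 3600
implied = implied_price_per_hour(PEAK_TOK_PER_S, QUOTED_TOKENS_PER_USD)

print(f"{tokens_per_hour:,} tokens/hour at peak")
print(f"implied price: ${implied:.2f}/hour")
```

Peak prefill throughput is a best-case input; plugging in a sustained mixed prefill/decode rate will give a lower, more realistic tokens-per-dollar figure.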

WOW Moment: Key Findings

The most significant insight from recent benchmarking cycles is the throughput parity between dense and sparse architectures on Trillium hardware, coupled with distinct latency and context scaling tradeoffs. This finding fundamentally changes how teams should evaluate model selection for production inference pipelines.

| Architecture | Peak Throughput (v6e-4) | Interactive TTFT (Low Load) | Active Compute per Token | Max Context Window | Cost Efficiency (Peak) |
| --- | --- | --- | --- | --- | --- |
| Gemma-4 31B (Dense) | 463,345 tok/s | 0.314 s | 31B parameters | 64K (tested to 16K) | ~308M tokens/$ |
| Gemma-4 26B (MoE A4B) | ~457,000 tok/s | <1.200 s | 3.8B parameters | 256K (Shared KV) | Lower active compute, higher routing overhead |

Why this matters: The data suggests that Trillium's matrix multiplication units are heavily co-optimized for dense workloads. The dense model's slight throughput advantage (463k vs. 457k tok/s, about 1.4%) indicates that routing overhead in the MoE architecture partially offsets the benefit of sparse activation. For interactive APIs requiring sub-second time-to-first-token (TTFT), the dense model's 0.314 s response time at low concurrency is a decisive advantage. Conversely, MoE's shared KV cache and roughly 8x reduction in active parameters (3.8B versus 31B per token) make it the better fit for long-context applications and for multi-tenant environments where thermal and power constraints dictate scaling limits. Understanding this tradeoff lets infrastructure teams route workloads by context length and latency target rather than applying a one-size-fits-all model strategy.
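The routing idea can be sketched as a small policy function. This is a hypothetical illustration, not a component of vLLM or of the benchmark setup: the `Request` type, the `choose_backend` name, and the decision order are all assumptions, and only the thresholds come from the comparison above (with "16K" and "256K" read as 16 × 1024 and 256 × 1024 tokens).

```python
from dataclasses import dataclass

# Thresholds from the comparison table; the policy around them is an assumption.
DENSE_TESTED_CONTEXT = 16 * 1024   # dense model was tested up to 16K tokens
MOE_MAX_CONTEXT = 256 * 1024       # MoE context window with shared KV cache
MOE_TTFT_BOUND_S = 1.2             # MoE low-load TTFT was reported as < 1.2 s


@dataclass(frozen=True)
class Request:
    prompt_tokens: int
    ttft_budget_s: float  # caller's time-to-first-token target, in seconds


def choose_backend(req: Request) -> str:
    """Pick "dense" or "moe" for a request (hypothetical policy)."""
    if req.prompt_tokens > MOE_MAX_CONTEXT:
        raise ValueError("prompt exceeds the largest supported context window")
    # Beyond the context length the dense model was tested at, prefer MoE.
    if req.prompt_tokens > DENSE_TESTED_CONTEXT:
        return "moe"
    # Budgets tighter than the MoE TTFT bound go to the lower-latency dense model.
    if req.ttft_budget_s < MOE_TTFT_BOUND_S:
        return "dense"
    # Relaxed budgets default to MoE for its lower active compute per token.
    return "moe"


print(choose_backend(Request(prompt_tokens=2_000, ttft_budget_s=0.5)))
print(choose_backend(Request(prompt_tokens=100_000, ttft_budget_s=0.5)))
```

The last branch is a judgment call: a team optimizing purely for peak throughput could just as reasonably default short, latency-tolerant requests to the dense model.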

Core Solution

Deploying dense models efficiently on TPU v6e-4 requires a systematic approach that aligns the serving engine, concurrency management, and hardware topology.
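As a starting point, the three concerns can be expressed as arguments to `vllm serve`. The sketch below only assembles a command line and does not import vLLM. The flags `--tensor-parallel-size`, `--max-model-len`, and `--max-num-seqs` are standard vLLM engine arguments, but the values are assumptions drawn from the figures quoted earlier (four chips in a v6e-4 slice, 16K tested context, 1,024 concurrent requests), not a configuration confirmed by the benchmark. The helper name `build_vllm_serve_command` is hypothetical.

```python
import shlex


def build_vllm_serve_command(
    model: str,
    tensor_parallel_size: int,
    max_model_len: int,
    max_num_seqs: int,
) -> list[str]:
    """Assemble a `vllm serve` argument list (hypothetical helper)."""
    return [
        "vllm", "serve", model,
        # Hardware topology: shard the model across the chips in the slice.
        "--tensor-parallel-size", str(tensor_parallel_size),
        # Context ceiling: bounds the KV cache needed per sequence.
        "--max-model-len", str(max_model_len),
        # Concurrency window: maximum sequences batched per scheduler step.
        "--max-num-seqs", str(max_num_seqs),
    ]


cmd = build_vllm_serve_command(
    model="google/gemma-4-31B-it",
    tensor_parallel_size=4,   # assumption: one shard per chip on v6e-4
    max_model_len=16_384,     # assumption: the 16K context tested above
    max_num_seqs=1024,        # assumption: matches the 1,024-request test
)
print(shlex.join(cmd))
```

Keeping the settings in one function makes it easy to sweep the concurrency window during load tests without editing deployment manifests by hand.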
