Netflix Serves 84% of Query Results from Cache with Interval-Aware Caching in Apache Druid

By Codcompass Team · 7 min read

Optimizing Rolling Windows in Apache Druid: A Segment-Based Caching Strategy

Current Situation Analysis

Real-time analytics dashboards frequently rely on sliding time windows to display metrics such as "active users in the last 24 hours" or "error rates over the past hour." When results are cached by exact query parameters, these rolling windows cause cache thrashing: because the window shifts continuously with time, the query signature changes on every request. A query executed at 10:00:00 for the window [T-24h, T] produces a different cache key than the same query executed at 10:00:01, even though more than 99.99% of the underlying data is identical.

This forces the system to rescan historical data repeatedly, consuming CPU and I/O resources for redundant computation. The problem is often overlooked because standard caching mechanisms treat query results as atomic units based on exact parameter matching. They lack temporal awareness, meaning they cannot recognize that a new query overlaps significantly with a previously cached result.
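To make the problem concrete, here is a minimal Python sketch of an exact-match cache key for a 24-hour rolling window. It is illustrative only: `exact_match_key`, `rolling_window`, and `overlap_fraction` are hypothetical helpers written for this example, not Druid APIs, and the hashing scheme is an assumption about how a parameter-keyed cache might behave.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def exact_match_key(metric: str, start: datetime, end: datetime) -> str:
    """Hash the full query parameters, as an exact-match result cache might (assumed scheme)."""
    payload = json.dumps(
        {"metric": metric, "start": start.isoformat(), "end": end.isoformat()},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def rolling_window(now: datetime, length: timedelta):
    """Return the (start, end) pair for a window of `length` ending at `now`."""
    return now - length, now


def overlap_fraction(a, b) -> float:
    """Fraction of window `a` that is also covered by window `b`."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    return max(overlap, timedelta(0)) / (a[1] - a[0])


t0 = datetime(2024, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
t1 = t0 + timedelta(seconds=1)
day = timedelta(hours=24)

w0 = rolling_window(t0, day)
w1 = rolling_window(t1, day)

# The two windows differ by one second, so the exact-match keys differ...
keys_differ = exact_match_key("active_users", *w0) != exact_match_key("active_users", *w1)
# ...even though the windows share 86,399 of their 86,400 seconds.
shared = overlap_fraction(w0, w1)
print(keys_differ, shared)
```

The one-second shift changes the key completely, so a parameter-keyed cache has no way to notice that almost all of the window was already computed.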

Netflix's engineering team addressed this by implementing Interval-Aware Caching within Apache Druid. By decomposing rolling window queries into fixed, reusable time segments, the system enables partial cache reuse. Instead of recomputing the entire window, the engine only recalculates the most recent segment where data is changing. This approach yielded measurable production gains: 84% of query results were served directly from cache, and overall query load was reduced by 33%. This demonstrates that temporal decomposition is critical for scaling real-time analytics without linear infrastructure growth.

WOW Moment: Key Findings

The impact of interval-aware caching becomes evident when comparing standard caching behavior against segment-based decomposition in a rolling window workload. The following data reflects the performance delta observed in high-throughput Druid deployments utilizing this strategy.

Strategy | Cache Hit Rate | Query Load Reduction | P90 Latency Impact
Standard Key-Based Caching | < 10% | Baseline | High variance; spikes during peak window shifts
Interval-Aware Segment Caching | 84% | 33% | Stabilized; significant reduction in tail latency

Why this matters: The 84% cache hit rate means that most query results no longer require a rescan of historical data. The 33% reduction in query load translates directly to lower compute costs and the ability to handle higher query concurrency. Most importantly, the more stable P90 latency gives dashboard users consistent responsiveness, even as the underlying data volume grows. This enables organizations to maintain sub-second response times for complex aggregations over large time horizons without provisioning additional query nodes.

Core Solution

The core mechanism relies on temporal decomposition. Rather than caching the result of a full rolling window, the system breaks the window into smaller, fixed-duration segments. When a new query arrives, the engine identifies which segments overlap with existing cache entries and which segments require fresh computation.
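The decomposition described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Netflix's or Druid's implementation: `split_interval` and `query_with_interval_cache` are hypothetical names, the cache is a plain dictionary, buckets are aligned to the Unix epoch, and the aggregate is assumed to be additive (for example an event count), so per-segment results can be summed. Partial pieces at the edges of the window are always computed fresh in this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Tuple

Interval = Tuple[datetime, datetime]
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_interval(
    start: datetime, end: datetime, bucket: timedelta
) -> Tuple[List[Interval], List[Interval]]:
    """Split [start, end) into epoch-aligned full buckets and partial edge pieces."""
    full: List[Interval] = []
    partial: List[Interval] = []
    cursor = start
    while cursor < end:
        # Start of the aligned bucket that contains `cursor`.
        bucket_start = EPOCH + ((cursor - EPOCH) // bucket) * bucket
        bucket_end = bucket_start + bucket
        piece = (cursor, min(bucket_end, end))
        if piece == (bucket_start, bucket_end):
            full.append(piece)      # stable boundaries: safe to cache
        else:
            partial.append(piece)   # boundaries move with the window
        cursor = piece[1]
    return full, partial


def query_with_interval_cache(
    start: datetime,
    end: datetime,
    bucket: timedelta,
    cache: Dict[Interval, int],
    compute: Callable[[datetime, datetime], int],
) -> int:
    """Sum an additive aggregate over [start, end), reusing cached full buckets."""
    full, partial = split_interval(start, end, bucket)
    total = 0
    for piece in full:
        if piece not in cache:
            cache[piece] = compute(*piece)  # cache miss: compute once, then reuse
        total += cache[piece]
    for piece in partial:
        total += compute(*piece)            # edge pieces are never cached here
    return total


if __name__ == "__main__":
    # Hypothetical backend: "counts" one event per second in the interval.
    def count_events(s: datetime, e: datetime) -> int:
        return int((e - s).total_seconds())

    segment_cache: Dict[Interval, int] = {}
    now = datetime(2024, 1, 1, 10, 0, 30, tzinfo=timezone.utc)
    window = timedelta(hours=3)
    print(query_with_interval_cache(now - window, now, timedelta(hours=1),
                                    segment_cache, count_events))
```

Because full buckets have fixed boundaries, their cache keys stay the same as the window slides, so a second query issued one second later reuses them and recomputes only the edge pieces. Non-additive aggregates such as distinct counts would need mergeable intermediate results rather than a plain sum.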

Architecture Decisions

  1. Segment Granularity Alignment: Cache segments must align with Druid's data segments or a logical m