Redis eviction policies in production

Redis is one of those tools that feels simple until the day it does not. I have run it as a cache, a session store, a rate limiter, and a queue, and each role wants a different eviction policy. Getting the policy wrong has caused more 3 am pages for me than any other piece of caching infrastructure, so these are my collected notes on what to set and why.

The eviction policy is what Redis does when used_memory reaches maxmemory and a new write needs space. The default in current builds is noeviction, which simply rejects writes with an OOM error rather than dropping anything. That is the right policy if Redis is your primary store and you would rather have explicit failures than silent data loss. It is exactly the wrong policy if Redis is a cache, because under memory pressure your service will start returning errors on every write, including the ones that would have populated the cache for the next reader. I have watched that exact failure mode wipe out a service in about ninety seconds.

The policies in plain language

There are eight policies, but in practice you almost always want one of four. allkeys-lru evicts the least recently used key out of the entire keyspace, regardless of whether it has a TTL. This is the default I reach for in a generic cache where everything is fair game. volatile-lru only evicts keys that have a TTL set, leaving keys without expiration alone; useful when Redis holds a mix of cache entries and persistent state in the same instance, though I personally prefer to split those into separate instances if I can. allkeys-lfu evicts the least frequently used key, which behaves better than LRU for workloads with strong hot-key patterns because a one-time scan does not flush genuinely hot keys out of memory. noeviction, as I said, is for when Redis is your source of truth.
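To make the allkeys versus volatile distinction concrete, here is a minimal sketch of which keys each policy family is allowed to consider. It is a toy model, not Redis code: the function name eviction_candidates and the example key names are hypothetical, and it only answers "which keys are eligible", not "which one gets picked".

```python
from typing import Dict, List, Optional


def eviction_candidates(ttls: Dict[str, Optional[int]], policy: str) -> List[str]:
    """Return the keys a policy is allowed to evict.

    ttls maps key name -> remaining TTL in seconds, or None for no expiry.
    """
    if policy == "noeviction":
        return []  # nothing is ever evicted; writes fail instead
    if policy.startswith("allkeys-"):
        return sorted(ttls)  # every key is fair game
    if policy.startswith("volatile-"):
        # only keys that carry a TTL are eligible
        return sorted(k for k, ttl in ttls.items() if ttl is not None)
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical keyspace: two cache entries with TTLs, one flag without.
keys = {"cache:product:1": 300, "cache:product:2": 300, "flag:new-pricing": None}
print(eviction_candidates(keys, "allkeys-lru"))
print(eviction_candidates(keys, "volatile-lru"))
```

Note the corner case this makes visible: if a volatile policy finds no key with a TTL, the candidate list is empty and Redis has nothing to evict, so writes fail the same way they do under noeviction.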

The thing many people miss is that Redis does not maintain a real LRU or LFU list. Doing so would require extra memory per key and pointer maintenance on every access. Instead it samples. When eviction needs to run, Redis picks N random keys, computes their idle time (for LRU) or frequency counter (for LFU), and evicts the worst one. The sample size N is controlled by maxmemory-samples, and the default of 5 is, in my experience, slightly too aggressive a CPU-versus-accuracy tradeoff for most production workloads.
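The sampling step described above can be sketched in a few lines. This is a simplified model under stated assumptions: a plain dict of last-access timestamps stands in for Redis's per-key metadata, the name evict_sampled_lru is hypothetical, and each round starts from scratch, whereas real Redis, as far as I know, also carries good candidates over between rounds.

```python
import random
from typing import Dict, Optional


def evict_sampled_lru(
    last_access: Dict[str, float],
    now: float,
    samples: int = 5,
    rng: Optional[random.Random] = None,
) -> str:
    """Evict one key by approximated LRU: sample, then drop the idlest sample."""
    if not last_access:
        raise KeyError("nothing to evict")
    rng = rng or random.Random()
    n = min(samples, len(last_access))
    candidates = rng.sample(sorted(last_access), n)
    # Idle time is "now minus last access"; the largest idle time loses.
    victim = max(candidates, key=lambda k: now - last_access[k])
    del last_access[victim]
    return victim


# key:0 was touched longest ago, key:999 most recently.
store = {f"key:{i}": float(i) for i in range(1000)}
victim = evict_sampled_lru(store, now=1000.0, samples=5, rng=random.Random(42))
print(victim, len(store))
```

With 5 samples out of 1000 keys the victim is usually old but rarely the oldest; set samples to the size of the keyspace and the function degenerates into exact LRU.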

Why the sample size matters

With 5 samples, Redis approximates true LRU reasonably well when the keyspace is small, but as the keyspace grows the chance of picking the actual oldest key drops fast. The Redis documentation includes a great chart showing the difference: at 5 samples there is a visible "haze" of recently used keys getting evicted alongside the truly cold ones. At 10 samples that haze nearly disappears. The cost is CPU during eviction: each sample requires reading the key's metadata. On a healthy instance evictions are rare enough that doubling the sample size is invisible in CPU usage. On an unhealthy instance where evictions are constant, you have bigger problems than the sampling overhead.
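A back-of-the-envelope calculation shows why doubling the sample size helps. The sketch below assumes keys are sampled uniformly at random and ignores any candidates Redis retains between eviction rounds, so treat the numbers as a pessimistic model rather than a measurement; both function names are hypothetical.

```python
def p_sample_contains_oldest(keyspace: int, samples: int) -> float:
    """Chance that the single oldest key is among `samples` distinct random keys."""
    return min(1.0, samples / keyspace)


def p_evict_from_coldest(fraction: float, samples: int) -> float:
    """Chance that at least one sampled key lies in the coldest `fraction` of keys.

    If so, the evicted key (the idlest sample) comes from that cold fraction.
    """
    return 1.0 - (1.0 - fraction) ** samples


print(p_sample_contains_oldest(1_000_000, 5))
for n in (5, 10):
    print(n, round(p_evict_from_coldest(0.10, n), 3))
```

Under these assumptions the odds of hitting the one true oldest key shrink linearly with keyspace size, while the odds of evicting something from the coldest 10% depend only on the sample count: roughly 0.41 at 5 samples and 0.65 at 10.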

The other parameter worth setting explicitly is maxmemory itself. Leaving it at zero means Redis will grow until the OS kills it, which is not eviction; it is an OOM kill. I size maxmemory at roughly 70-75% of the instance's available RAM, leaving headroom for replication buffers, client output buffers, and the copy-on-write (COW) overhead of BGSAVE. A forked save process can briefly double resident memory if writes are heavy; budgeting for that on a small instance is the difference between a clean snapshot and a kernel OOM.
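The sizing rule is simple arithmetic, sketched here so the numbers are easy to rerun for a different host. The helper name maxmemory_budget_bytes is hypothetical, and the 0.75 default is just the upper end of the range given above, not a recommendation from Redis itself.

```python
GIB = 1024 ** 3  # redis.conf treats "1gb" as 1024*1024*1024 bytes


def maxmemory_budget_bytes(available_ram_bytes: int, fraction: float = 0.75) -> int:
    """Return a maxmemory value that leaves (1 - fraction) of RAM as headroom."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(available_ram_bytes * fraction)


budget = maxmemory_budget_bytes(12 * GIB, 0.75)
print(f"maxmemory {budget // GIB}gb")  # prints "maxmemory 9gb"
```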

A config I actually use

# Cache instance backing the product catalog service.
# Roughly 12 GB available on the host; 9 GB budget for Redis.

maxmemory 9gb
maxmemory-policy allkeys-lfu
maxmemory-samples 10

# LFU tuning: lfu-decay-time is in minutes, so 60 ages an idle key's
# counter once per hour and yesterday's hot keys do not stay hot forever.
lfu-log-factor 10
lfu-decay-time 60

# Keep the slow log honest; eviction storms show up here.
# Threshold is in microseconds, so 10000 = 10 ms.
slowlog-log-slower-than 10000
slowlog-max-len 256

# Active expiration so TTL'd keys vacate before LFU has to work.
hz 20
active-expire-effort 4

# Replication safety: refuse writes when fewer than this many healthy
# replicas are connected. Cache instance, so disabled here.
min-replicas-to-write 0

The catalog service has a very Zipfian access pattern: about 80% of reads hit roughly 2% of the keys. LFU was a clear win over LRU there; switching from allkeys-lru to allkeys-lfu dropped the cache miss rate from 9.4% to 5.1% with no change in memory budget. The decay parameters matter for that: without decay, an item that was popular last week stays "popular" forever and crowds out today's hot keys. An hour of decay was a good compromise for our workload, but I have used as short as 10 minutes for a flash-sale-facing service.
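For intuition about what lfu-log-factor and lfu-decay-time do, here is a small model of the LFU counter. It is a sketch based on my understanding of how Redis handles the counter (an 8-bit value, incremented probabilistically, decremented once per elapsed decay period); the constants and function names are assumptions of this model, not an API.

```python
import random

LFU_INIT_VAL = 5  # assumed starting counter for a new key
LFU_MAX = 255     # the counter fits in 8 bits


def lfu_log_incr(counter: int, log_factor: int = 10, rng=random) -> int:
    """Increment with probability 1 / ((counter - init) * log_factor + 1)."""
    if counter >= LFU_MAX:
        return LFU_MAX
    base = max(counter - LFU_INIT_VAL, 0)
    p = 1.0 / (base * log_factor + 1)
    return counter + 1 if rng.random() < p else counter


def lfu_decay(counter: int, idle_minutes: int, decay_time: int = 60) -> int:
    """Subtract one from the counter per full decay period spent idle."""
    if decay_time <= 0:
        raise ValueError("this sketch only models a positive decay time")
    periods = idle_minutes // decay_time
    return max(counter - periods, 0)


rng = random.Random(7)
counter = LFU_INIT_VAL
for _ in range(100_000):  # simulate 100k hits on one key
    counter = lfu_log_incr(counter, log_factor=10, rng=rng)
print("after 100k hits:", counter)
print("after 3 idle hours:", lfu_decay(counter, idle_minutes=180, decay_time=60))
```

The logarithmic increment is why a higher log factor needs far more hits to saturate the counter, and the per-period decrement is why a longer decay time keeps formerly hot keys resident for longer.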

Incidents I keep paying for

Two stories worth remembering. The first was a session store running with noeviction that we forgot about. A traffic spike from a marketing campaign pushed memory past maxmemory, and the application started failing every login because new sessions could not be written. Users hammered retry, which generated more failed sessions, which... you can see where this is going. The fix was a one-line config change, but it took twenty minutes to diagnose because the application logs said "OOM command not allowed when used memory > 'maxmemory'" and we had nobody on call who recognized that string.

The second was a service that used Redis for both cache entries and a small set of feature flags. Someone set the policy to allkeys-lru for cache reasons and forgot the flags were in the same instance. During an unrelated cache pressure event, several feature flag keys were evicted, defaulted to false in the application, and shut off a critical pricing path. We moved the flags to a separate instance with noeviction that afternoon. The lesson I take from both: the eviction policy is part of the contract between Redis and your application, and you should write it down somewhere your future self will find it.
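One way to write the contract down is to have the application assert it at startup. The sketch below is hypothetical throughout: the role names, the EXPECTED_POLICY table, and the function name are mine, and the config mapping is assumed to be whatever your client returns for CONFIG GET maxmemory-policy, passed in as a plain dict.

```python
from typing import Mapping

# Hypothetical contract: which policies each role is willing to run under.
EXPECTED_POLICY = {
    "cache": {"allkeys-lru", "allkeys-lfu"},
    "source-of-truth": {"noeviction"},
}


def check_eviction_contract(role: str, config: Mapping[str, str]) -> None:
    """Raise if the instance's eviction policy does not match the role."""
    policy = config.get("maxmemory-policy")
    allowed = EXPECTED_POLICY[role]
    if policy not in allowed:
        raise RuntimeError(
            f"role {role!r} expects one of {sorted(allowed)}, "
            f"but the instance reports {policy!r}"
        )


# A flag store on noeviction passes; a cache on noeviction does not.
check_eviction_contract("source-of-truth", {"maxmemory-policy": "noeviction"})
try:
    check_eviction_contract("cache", {"maxmemory-policy": "noeviction"})
except RuntimeError as err:
    print("contract violated:", err)
```

Failing fast at deploy time turns both incidents above into a startup error instead of a production page.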