Engineering Notes

Redis eviction policies in production

June 15, 2026 redis cache

Redis is one of those tools that feels simple until the day it does not. I have run it as a cache, a session store, a rate limiter, and a queue, and each role wants a different eviction policy. Getting the policy wrong has caused more 3 am pages for me than any other piece of cac

Avoiding head-of-line blocking with HTTP/2 multiplexing

June 6, 2026 http performance

One of the more counter-intuitive lessons I picked up while running an edge proxy fleet is that HTTP/2 sometimes makes tail latency worse, not better, despite multiplexing being its headline feature. The reason is head-of-line blocking, and it shows up in two different places dep

Connection pool sizing for high-concurrency Go services

June 2, 2026 go

Last quarter I spent a frustrating week chasing a latency regression in one of our payment routing services. The p99 grew from 12ms to 80ms after we doubled traffic, but CPU was idle and the database itself was bored. The culprit was a misconfigured pgxpool that I had inherited a

Backpressure in gRPC streaming

March 14, 2026 grpc go flow-control

HTTP/2 flow control gives you per-stream window updates, but if the producer pushes faster than the consumer can read, the server-side queue starts to grow before the window closes. Reactor and Akka Streams both expose a request-driven cursor that makes this explicit. Here is how I wired it for a streaming aggregation endpoint that backs a real-time dashboard.
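To make the request-driven idea concrete, here is a minimal sketch in plain Go, with no gRPC involved. It is not the wiring from the post: the `stream` function, the credit channel, and the window size are all hypothetical names chosen for illustration. The producer may only send while it holds a credit, and the consumer returns one credit per item it has finished reading, so the number of unread items stays bounded by the window.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// stream sends the integers 0..total-1 from a producer goroutine to the caller.
// The producer needs one credit per send; the consumer returns a credit after
// it has read an item. window must be at least 1, otherwise nothing can be sent.
// It returns the items in arrival order and the peak number of unread items.
func stream(total, window int) (received []int, maxInFlight int64) {
	credits := make(chan struct{}, window)
	for i := 0; i < window; i++ {
		credits <- struct{}{} // initial demand granted by the consumer
	}
	data := make(chan int, window)
	var inFlight int64

	go func() {
		defer close(data)
		for i := 0; i < total; i++ {
			<-credits // block here instead of growing a queue
			if n := atomic.AddInt64(&inFlight, 1); n > maxInFlight {
				maxInFlight = n
			}
			data <- i
		}
	}()

	for v := range data {
		received = append(received, v)
		atomic.AddInt64(&inFlight, -1)
		credits <- struct{}{} // signal demand for one more item
	}
	// The producer closed data after its last write to maxInFlight,
	// so reading it here is safe.
	return received, maxInFlight
}

func main() {
	items, peak := stream(100, 4)
	fmt.Printf("received %d items, peak unread %d (window 4)\n", len(items), peak)
}
```

In a real streaming handler the credit would more likely travel as a message on the stream itself; the channel here only stands in for that signal.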

Choosing PgBouncer pooling modes

February 22, 2026 postgres pgbouncer

Session, transaction, and statement pooling each break a different subset of Postgres client features. After the third incident with a long-running advisory lock surviving a connection reuse, I went back through our connection inventory and re-tagged each consumer with the minimum pooling mode it can actually run in. Notes on the matrix below.
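The re-tagging exercise can be sketched as a small classifier. This is an illustration only, not the matrix from the post: the `Consumer` struct, its fields, and the example consumer names are hypothetical, and the feature list is deliberately short. It encodes a few well-known constraints: session-scoped state (session-level advisory locks, LISTEN/NOTIFY, session-level SET) needs session pooling, and multi-statement transactions need at least transaction pooling.

```go
package main

import "fmt"

// PoolMode is ordered from least to most restrictive on connection reuse.
type PoolMode int

const (
	Statement PoolMode = iota
	Transaction
	Session
)

// String returns the value PgBouncer expects for pool_mode.
func (m PoolMode) String() string {
	return [...]string{"statement", "transaction", "session"}[m]
}

// Consumer is a hypothetical inventory entry describing what a client relies on.
type Consumer struct {
	Name                     string
	UsesSessionAdvisoryLocks bool // pg_advisory_lock held across transactions
	UsesListenNotify         bool
	UsesSessionSet           bool // SET without LOCAL
	UsesMultiStatementTx     bool // BEGIN ... COMMIT spanning several statements
}

// MinPoolMode returns the most aggressive pooling mode the consumer tolerates.
func MinPoolMode(c Consumer) PoolMode {
	switch {
	case c.UsesSessionAdvisoryLocks || c.UsesListenNotify || c.UsesSessionSet:
		return Session
	case c.UsesMultiStatementTx:
		return Transaction
	default:
		return Statement
	}
}

func main() {
	inventory := []Consumer{
		{Name: "job-scheduler", UsesSessionAdvisoryLocks: true},
		{Name: "checkout-api", UsesMultiStatementTx: true},
		{Name: "metrics-reader"},
	}
	for _, c := range inventory {
		fmt.Printf("%s: pool_mode = %s\n", c.Name, MinPoolMode(c))
	}
}
```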

Tail sampling with the OpenTelemetry collector

January 30, 2026 opentelemetry tracing

Head sampling at the SDK is cheap but blind: you decide whether to keep a trace before the trace finishes. Tail sampling at the collector buffers each trace's spans in memory for a decision window (decision_wait in the tail_sampling processor), then runs a policy. Memory budgeting and the trade-off between sample rate and policy latency are the interesting parts.

Static membership saved our rebalance budget

December 12, 2025 kafka consumer-groups

Rolling deploys with 40 consumer instances generated more than a minute of stop-the-world rebalance per release, which started leaking SLO budget after we ramped throughput. Setting a stable group.instance.id and tuning session.timeout.ms moved that cost down to a few seconds per restart.