Connection pool sizing for high-concurrency Go services

Last quarter I spent a frustrating week chasing a latency regression in one of our payment routing services. The p99 grew from 12ms to 80ms after we doubled traffic, but CPU was idle and the database itself was bored. The culprit was a misconfigured pgxpool that I had inherited and never bothered to revisit. This post is the writeup I wish I had read before that week started.

Go makes it deceptively easy to scale request handlers. Goroutines are cheap, the runtime schedules them well, and a single binary can comfortably hold tens of thousands of concurrent in-flight requests. Database connections are the opposite of that: each one costs memory on the server, a TCP socket, a backend process in Postgres, and a slot in the pooler if you run one. Mixing cheap concurrency primitives with expensive connection primitives is where most of my production incidents come from.

Why Little's Law is the right starting point

Little's Law states that the average number of in-flight requests L equals arrival rate λ multiplied by average service time W. For database calls this gives a direct way to size the pool: if you serve 4000 queries per second and each query takes 8ms on average, you need at least 4000 * 0.008 = 32 concurrent connections to keep up with steady state. Anything below that and queries will queue inside the pool; anything well above that and connections sit idle wasting backend resources.

The interesting part is what happens near the limit. Once the arrival rate approaches the pool's throughput ceiling, waiting time grows non-linearly. I have a small spreadsheet I use for sanity checks: with utilization at 70% the expected wait is roughly 2.3x the service time; at 90% it is 9x; at 95% it is 19x. So a pool that "almost" fits will not just be a little slower, it will explode. The 80ms p99 I mentioned earlier was almost exactly what an M/M/c queue predicts for 28 connections handling 32 average concurrent waiters.

Picking a number with pgxpool

For Postgres workloads I default to pgx and pgxpool. The defaults are not unreasonable but they assume a small service. For a high-concurrency edge service I usually set the following:

config, err := pgxpool.ParseConfig(dsn)
if err != nil {
    return nil, fmt.Errorf("parse pool config: %w", err)
}

// Sized for ~4000 qps with 8ms avg service time and 30% headroom.
config.MaxConns = 48
config.MinConns = 8
config.MaxConnLifetime = 30 * time.Minute
config.MaxConnIdleTime = 5 * time.Minute
config.HealthCheckPeriod = 30 * time.Second

config.ConnConfig.RuntimeParams["application_name"] = "router-edge"
config.ConnConfig.RuntimeParams["statement_timeout"] = "2000"

pool, err := pgxpool.NewWithConfig(ctx, config)
if err != nil {
    return nil, fmt.Errorf("create pool: %w", err)
}

MinConns matters more than people think: it keeps warm connections around so the first burst after a quiet period does not pay the full TCP plus TLS plus authentication cost. MaxConnLifetime is the other quiet hero. Without it, long-lived connections accumulate prepared statement cache bloat and, more importantly, prevent rolling restarts of pgbouncer from rebalancing the fleet. Thirty minutes is a good compromise: short enough to recycle, long enough that the churn is negligible at our query rate.

The metrics that actually predict pain

Two gauges tell me almost everything I need to know about a pool's health. The first is connections_in_use: how many slots are currently leased to goroutines. The second is connections_waiting: how many goroutines are blocked on Acquire with no slot available. In pgxpool these are exposed through pool.Stat() as AcquiredConns and EmptyAcquireCount plus AcquireDuration.

The rule I follow is simple: if connections_waiting is ever non-zero in steady state, the pool is too small. A short spike during a deployment or a cache miss storm is fine. A sustained non-zero value means every request is paying queueing latency that does not show up in your database metrics at all, because from the database's perspective everything is healthy. I have seen teams chase phantom slow queries for days because their slow request logs were dominated by pool acquire time and not actual SQL time.

The other thing worth instrumenting is the ratio of AcquiredConns to MaxConns. Alerting at 80% utilization gives you a window to scale before Little's Law starts charging interest. I learned this the hard way; now it is the first dashboard I build for any new service.

One more thing: the database has its own limits

It does not matter how big you make your Go pool if Postgres is configured for 100 max connections and you run 20 replicas of your service each asking for 48. The pool will happily hand out leases that fail at connect time. Always do the multiplication: replicas * MaxConns must comfortably fit within max_connections minus whatever you reserve for ops tooling and other services. A connection pooler like pgbouncer in transaction mode is usually the right answer once that math gets uncomfortable, but it adds its own latency floor and changes how prepared statements behave. That is a topic for another post.