Postgres: executor bottlenecks

July 08, 2025

I wrote this post to expand on the first slide of a talk I gave recently at PGConf.in 2025 and again at POSETTE 2025, covering executor performance improvements in PostgreSQL. It revisits some observations about where the executor spends its time and sketches directions that could realistically improve single-node query execution efficiency without a full executor rewrite – though we’d need that someday too.

OLTP workloads and executor overhead

Short OLTP-style queries – primary key lookups, targeted updates – still dominate many production systems. For most of these, executor overhead is not the limiting factor. I/O latency, WAL flushes, locks, or client round trips tend to matter more. And while such queries do traverse the full executor plan tree from root to leaf and back, the tree is usually shallow – just a scan node and maybe a filter or limit – and they return after a few tuples, so they do not stress much of the executor’s runtime machinery.

That said, executor efficiency does become visible in certain OLTP scenarios: high-frequency queries under tight response time constraints, in-memory workloads on replicas or caches where CPU is the main bottleneck, and situations where small per-tuple overheads compound across thousands of queries per second. So while OLTP is not the executor’s primary stress test, reducing unnecessary work in hot paths like tuple deforming or qual evaluation is still worthwhile.

OLAP is where overheads pile up

In OLAP workloads, the same inefficiencies multiply quickly. Queries that scan millions of tuples, join wide rows, and apply many filters or aggregates can turn a few dozen nanoseconds of overhead per tuple into hundreds of milliseconds of total query time. Even with parallel or distributed plans, each worker runs its slice of the executor in a single-threaded loop, so CPU time per core is what matters, and the executor’s tight loops determine how efficiently that time is used.

What gets in the way

Postgres’ executor favors flexibility over CPU efficiency. A few design choices stand out as persistent sources of overhead.

The volcano-style execution model processes one tuple at a time, with each plan node calling the one below it to fetch the next tuple. This leads to frequent function calls, virtual dispatch into plan-node-specific code, and poor instruction cache locality.
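The shape of that per-tuple dispatch can be sketched in a few lines. This is a hypothetical miniature, not Postgres code – the struct and function names are invented – but it mirrors the pattern: each node pulls one tuple at a time from its child through an indirect call, so every tuple pays for at least one function pointer dispatch per plan node.

```c
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

/* Invented, simplified volcano-style plan node: each node pulls one
 * tuple at a time from its child via an indirect call, loosely in the
 * spirit of ExecProcNode(). */
typedef struct PlanNode PlanNode;
struct PlanNode
{
    /* returns the next value via *out; false when exhausted */
    bool (*next)(PlanNode *self, int *out);
    PlanNode *child;
    void *state;
};

typedef struct { const int *rows; size_t n, pos; } ScanState;

static bool scan_next(PlanNode *self, int *out)
{
    ScanState *st = self->state;
    if (st->pos >= st->n)
        return false;
    *out = st->rows[st->pos++];   /* one tuple per call */
    return true;
}

/* Filter node: for every tuple, one indirect call into the child plus
 * the qual check -- the per-tuple overhead described above. */
static bool filter_next(PlanNode *self, int *out)
{
    int v;
    while (self->child->next(self->child, &v))
    {
        if (v % 2 == 0)           /* the "qual": keep even values */
        {
            *out = v;
            return true;
        }
    }
    return false;
}
```

Running a query here means the top node's next() is called once per output tuple, and each call cascades down the tree – exactly the call pattern that shows up as executor overhead in profiles.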

Tuple deforming happens one tuple at a time, one attribute at a time, with branches on every attribute to handle nulls, alignment, and type-specific layout. There is no batching, no tight loop over a group of similar tuples.
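To make the branchiness concrete, here is a hypothetical, much-simplified row layout and deforming loop – a one-byte null bitmap followed by packed values, nothing like the real HeapTuple format – that still exhibits the same structure: one pass per tuple, with a null-check branch and a width branch on every attribute.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>

enum { MAX_ATTRS = 8 };

/* Invented minimal row descriptor: attribute count and byte widths. */
typedef struct
{
    int natts;
    uint8_t attlen[MAX_ATTRS];
} RowDesc;

/* Extract all attributes into values[]/isnull[], one attribute at a
 * time: a null-bit branch and a width branch per attribute, with the
 * offset of each attribute depending on everything before it --
 * data-dependent control flow that is hard for the CPU to pipeline. */
static void deform_row(const RowDesc *desc, const uint8_t *row,
                       int64_t *values, bool *isnull)
{
    uint8_t nullbits = row[0];
    const uint8_t *p = row + 1;

    for (int i = 0; i < desc->natts; i++)
    {
        if (nullbits & (1u << i))       /* null branch */
        {
            isnull[i] = true;
            values[i] = 0;
            continue;                   /* nulls store no data */
        }
        isnull[i] = false;
        switch (desc->attlen[i])        /* width branch */
        {
            case 1: values[i] = *p; break;
            case 2: { int16_t v; memcpy(&v, p, 2); values[i] = v; break; }
            case 4: { int32_t v; memcpy(&v, p, 4); values[i] = v; break; }
            case 8: { int64_t v; memcpy(&v, p, 8); values[i] = v; break; }
        }
        p += desc->attlen[i];
    }
}
```

Real heap deforming also deals with alignment padding and varlena headers, which add further branches on top of what this sketch shows.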

Expression evaluation runs a linear sequence of opcodes per tuple (the ExecInterpExpr() interpreter introduced in Postgres 10), which avoids the overhead of recursive tree evaluation, but is still fundamentally row-at-a-time. Each tuple is processed in isolation, intermediate results are boxed into Datums, and there is little opportunity for SIMD or CPU pipelining.
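A toy version of that interpreter style looks like the following – the opcodes and names are invented for illustration, but the shape matches: the expression "col = const" compiled into a flat step array, re-executed from the top for every single row.

```c
#include <assert.h>

/* Invented miniature of a linear expression interpreter in the spirit
 * of ExecInterpExpr(): a flat opcode array instead of a recursive
 * expression tree, evaluated once per tuple. */
typedef enum { OP_VAR, OP_CONST, OP_EQ, OP_DONE } OpCode;

typedef struct { OpCode op; long arg; } Step;

/* Evaluate one row against the program; a tiny stack machine.
 * Note the row-at-a-time structure: every tuple pays the full
 * dispatch loop, and intermediate results are materialized
 * individually (as Datums, in the real executor). */
static long eval_expr(const Step *prog, const long *row)
{
    long stack[8];
    int sp = 0;

    for (int pc = 0; ; pc++)
    {
        switch (prog[pc].op)
        {
            case OP_VAR:   stack[sp++] = row[prog[pc].arg]; break;
            case OP_CONST: stack[sp++] = prog[pc].arg; break;
            case OP_EQ:
                sp--;
                stack[sp - 1] = (stack[sp - 1] == stack[sp]);
                break;
            case OP_DONE:
                return stack[sp - 1];
        }
    }
}
```

Nothing in this loop can use SIMD or amortize dispatch across rows – the program restarts from step zero for each tuple, which is precisely the limitation batching would address.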

In analytical queries processing millions of tuples, these per-tuple inefficiencies add up and CPU usage becomes visibly dominated by executor logic rather than I/O. The following perf profile of a backend running SELECT agg(col1) FROM t WHERE col2 = ? illustrates the point:

                --99.99%--exec_simple_query
                          |
                           --99.97%--PortalRun
                                     PortalRunSelect
                                     standard_ExecutorRun
                                     ExecSeqScanWithQualProject
                                     |
                                     |--71.50%--ExecInterpExpr
                                     |          |
                                     |          |--62.97%--slot_getsomeattrs_int
                                     |          |          |
                                     |          |           --60.89%--tts_buffer_heap_getsomeattrs
                                     |          |
                                     |           --0.72%--int4eq
                                     |
                                     |--23.78%--heap_getnextslot
                                     |          |
                                     |          |--16.17%--heapgettup_pagemode
                                     |          |          |
                                     |          |          |--6.81%--heap_prepare_pagescan
                                     |          |          |          |
                                     |          |          |           --1.22%--HeapTupleSatisfiesVisibility
                                     |          |          |
                                     |          |           --4.35%--read_stream_next_buffer
                                     |          |                     |
                                     |          |                      --4.20%--StartReadBuffer
                                     |          |                                |
                                     |          |                                 --3.12%--BufTableLookup
                                     |          |                                           |
                                     |          |                                            --3.10%--hash_search_with_hash_value
                                     |          |
                                     |           --4.88%--ExecStoreBufferHeapTuple
                                     |
                                      --1.32%--MemoryContextReset

ExecInterpExpr() takes 71% of CPU time here. Nearly all of that is slot_getsomeattrs_int() deforming HeapTuples to extract the columns needed for the WHERE condition and the aggregation – not the actual filter comparison or aggregate computation, which barely register.

Realistic improvements

Worthwhile improvements are possible within the current executor architecture, particularly by introducing batching: doing deforming and qual evaluation in tight loops over groups of tuples rather than one tuple at a time.

Batching between the scan node and the table AM

Today, ExecScanExtended() pulls one tuple at a time from the table access method: one call, one deform, one qual evaluation. A natural improvement is to let AMs return small batches of tuples – initially as arrays of TupleTableSlots, each wrapping a HeapTuple. That preserves existing executor abstractions while enabling fewer function calls, loop-based qual and deform processing, and better instruction cache behavior. In the longer term this could evolve toward a more compact batch representation that eliminates redundancy like repeated TupleDesc references and per-tuple metadata, but even the simple version shows measurable improvement.
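The interface change can be sketched like this. This is a hypothetical batched scan API with invented names – not the actual tableam callbacks – but it shows the essential shift: one call into the AM fills an array of tuples, and the scan node then runs its qual in a tight loop over the batch.

```c
#include <assert.h>

enum { BATCH_SIZE = 64 };

/* Invented stand-in for a table AM scan over int "tuples". */
typedef struct { const int *rows; int n, pos; } TableScan;

/* One call returns up to BATCH_SIZE tuples; a short count means the
 * scan is exhausted.  Compare: the current interface would make one
 * call per tuple. */
static int scan_getnextbatch(TableScan *scan, int *batch)
{
    int count = 0;
    while (count < BATCH_SIZE && scan->pos < scan->n)
        batch[count++] = scan->rows[scan->pos++];
    return count;
}

/* The scan node applies the qual over the whole batch, amortizing the
 * AM call (and, in the real executor, slot setup and deforming setup)
 * across BATCH_SIZE tuples. */
static int scan_node_filter(TableScan *scan, int *out)
{
    int batch[BATCH_SIZE];
    int nout = 0, n;

    while ((n = scan_getnextbatch(scan, batch)) > 0)
        for (int i = 0; i < n; i++)
            if (batch[i] % 2 == 0)    /* the qual */
                out[nout++] = batch[i];
    return nout;
}
```

The inner for-loop is where the wins come from: it is a predictable, cache-friendly loop over homogeneous work, rather than a chain of per-tuple calls through several abstraction layers.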

Batched tuple deforming

Even without changing the deforming logic itself, running it in a loop over a batch of slots rather than once per tuple improves things. The same deforming code is reused across consecutive calls, which helps instruction cache locality and branch predictor accuracy when the rows being processed have similar shapes. Function call overhead is also amortized across the batch. This is a baseline improvement that sets the stage for more ambitious optimizations like SIMD later.
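As a rough illustration of the loop shape – again with an invented, fixed-width row format rather than the real heap layout – batched deforming of a single attribute across many rows looks like this: the same extraction code, including its null branch, runs back to back, and the output lands in a contiguous array that later SIMD work could operate on.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>

/* Invented row format: one null-bitmap byte followed by fixed-width
 * int32 attribute slots (nulls leave their slot unused).  Extract
 * attribute `att` from a whole batch of rows in one tight loop, so
 * the branchy extraction code stays hot in the i-cache and the branch
 * predictor sees a long run of similar decisions. */
static void deform_attr_batch(const uint8_t *const rows[], int att,
                              int nrows, int32_t *values, bool *isnull)
{
    for (int i = 0; i < nrows; i++)
    {
        const uint8_t *row = rows[i];

        if (row[0] & (1u << att))      /* null bit for this attribute */
        {
            isnull[i] = true;
            values[i] = 0;
        }
        else
        {
            isnull[i] = false;
            memcpy(&values[i], row + 1 + 4 * att, 4);
        }
    }
}
```

Producing per-attribute arrays like this is also a natural stepping stone toward a columnar batch representation inside the executor, even while the underlying storage stays row-oriented.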

Final thoughts

No matter how smart the planner is or how fast the storage layer gets, the executor still runs the actual computation – filtering, joining, aggregating, returning results. For OLTP queries, trimming executor overhead keeps latency lean. For OLAP queries, it is where most of the CPU time goes and where the biggest gains are available. And reducing architectural friction in the executor is also what makes future work, like vectorized execution, feasible without a ground-up rewrite.


© 2025 Amit Langote