July 08, 2025
I wrote this post to expand on the first slide of a talk I gave recently at PGConf.in 2025 and then again at POSETTE 2025, covering executor performance improvements in PostgreSQL. It revisits some observations about where the executor spends its time and sketches directions for improvement that could realistically increase single-node query execution efficiency without a full executor rewrite – though we’d need that someday too.
Short OLTP-style queries – primary key lookups, targeted updates – still dominate many production systems. For most of these, executor overhead is not the limiting factor. I/O latency, WAL flushes, locks, or client round trips tend to matter more. And while such queries do traverse the full executor plan tree from root to leaf and back, the tree is usually shallow – just a scan node and maybe a filter or limit – and they return after a few tuples, so they do not stress much of the executor’s runtime machinery.
That said, executor efficiency does become visible in certain OLTP scenarios: high-frequency queries under tight response time constraints, in-memory workloads on replicas or caches where CPU is the main bottleneck, and situations where small per-tuple overheads compound across thousands of queries per second. So while OLTP is not the executor’s primary stress test, reducing unnecessary work in hot paths like tuple deforming or qual evaluation is still worthwhile.
In OLAP workloads, the same inefficiencies multiply quickly. Queries that scan millions of tuples, join wide rows, and apply many filters or aggregates can turn a few dozen nanoseconds of overhead per tuple into hundreds of milliseconds of total query time. Even with parallel or distributed plans, each worker runs its slice of the executor in a single-threaded loop, so CPU time per core is what matters, and the executor’s tight loops determine how efficiently that time is used.
Postgres’ executor favors flexibility over CPU efficiency. A few design choices stand out as persistent sources of overhead.
The volcano-style execution model processes one tuple at a time, with each plan node calling the one below it to fetch the next tuple. This leads to frequent function calls, virtual dispatch into plan-node-specific code, and poor instruction cache locality.
Tuple deforming happens one tuple at a time, one attribute at a time, with branches on every attribute to handle nulls, alignment, and type-specific layout. There is no batching, no tight loop over a group of similar tuples.
Expression evaluation runs a linear sequence of opcodes per tuple (the ExecInterpExpr() interpreter introduced in Postgres 10), which avoids the overhead of recursive tree evaluation but is still fundamentally row-at-a-time. Each tuple is processed in isolation, intermediate results are boxed into Datums, and there is little opportunity for SIMD or CPU pipelining.
In analytical queries processing millions of tuples, these per-tuple inefficiencies add up and CPU usage becomes visibly dominated by executor logic rather than I/O. The following perf profile of a backend running SELECT agg(col1) FROM t WHERE col2 = ? illustrates the point:
--99.99%--exec_simple_query
|
--99.97%--PortalRun
PortalRunSelect
standard_ExecutorRun
ExecSeqScanWithQualProject
|
|--71.50%--ExecInterpExpr
| |
| |--62.97%--slot_getsomeattrs_int
| | |
| | --60.89%--tts_buffer_heap_getsomeattrs
| |
| --0.72%--int4eq
|
|--23.78%--heap_getnextslot
| |
| |--16.17%--heapgettup_pagemode
| | |
| | |--6.81%--heap_prepare_pagescan
| | | |
| | | --1.22%--HeapTupleSatisfiesVisibility
| | |
| | --4.35%--read_stream_next_buffer
| | |
| | --4.20%--StartReadBuffer
| | |
| | --3.12%--BufTableLookup
| | |
| | --3.10%--hash_search_with_hash_value
| |
| --4.88%--ExecStoreBufferHeapTuple
|
--1.32%--MemoryContextReset
ExecInterpExpr() takes 71% of CPU time here. Nearly all of that is slot_getsomeattrs_int() deforming HeapTuples to extract the columns needed for the WHERE condition and the aggregation – not the actual filter comparison or aggregate computation, which barely register.
There are worthwhile improvements possible within the current executor architecture, particularly by introducing batching – doing deforming and qual evaluation in tight loops over groups of tuples rather than one at a time.
Today, ExecScanExtended() pulls one tuple at a time from the table access method: one call, one deform, one qual evaluation. A natural improvement is to let AMs return small batches of tuples – initially as arrays of TupleTableSlot, each wrapping a HeapTuple – preserving existing executor abstractions while enabling fewer function calls, loop-based qual and deform processing, and better instruction cache behavior. In the longer term this could evolve toward a more compact batch representation that eliminates redundancy like repeated TupleDesc references and per-tuple metadata, but even the simple version shows measurable improvement.
Even without changing the deforming logic itself, running it in a loop over a batch of slots rather than once per tuple improves things. The same deforming code is reused across consecutive calls, which helps instruction cache locality and branch predictor accuracy when the rows being processed have similar shapes. Function call overhead is also amortized across the batch. This is a baseline improvement that sets the stage for more ambitious optimizations like SIMD later.
No matter how smart the planner is or how fast the storage layer gets, the executor still runs the actual computation – filtering, joining, aggregating, returning results. For OLTP queries, trimming executor overhead keeps latency lean. For OLAP queries, it is where most of the CPU time goes and where the biggest gains are available. And reducing architectural friction in the executor is also what makes future work, like vectorized execution, feasible without a ground-up rewrite.