July 08, 2025
As OLAP performance and cloud-native database architectures draw more attention, it’s worth remembering that the executor remains the core engine doing the work on every worker, even in parallel and distributed plans.
I wrote this post to expand on the first slide of a talk I gave recently at PGConf.in 2025 and then again at POSETTE 2025, covering executor performance improvements in PostgreSQL. It revisits some observations about executor bottlenecks and sketches directions for future work that could realistically improve single-node query execution efficiency without a full executor rewrite, though we may need one someday.
Short OLTP-style queries, such as primary key lookups or targeted updates, still dominate many production systems that back user applications. For example:
SELECT * FROM orders WHERE order_id = $1;
UPDATE accounts SET balance = balance - $1 WHERE account_id = $2;
In most cases, executor overhead isn’t the dominant cost for these – I/O latency, WAL flushes, locks, or client round trips often matter more.
And while these queries do traverse the full executor plan tree – from the root down to the leaf that reads from storage, and back up to the root, which returns the tuple(s) to the application – the tree is usually shallow: just a scan node and maybe a filter or limit. Because they return after a few tuples, they don’t stress much of the executor’s runtime machinery, such as join logic, aggregates, or sort nodes.
Still, there are edge cases where executor efficiency can become noticeable.
So while OLTP isn’t the executor’s biggest stress test, it still benefits from reducing unnecessary work in hot paths like tuple deforming or qual evaluation.
In OLAP workloads, the same inefficiencies can multiply very quickly. These queries scan millions of tuples, join wide rows, and apply many filters or aggregates. A few dozen nanoseconds of overhead per tuple can easily add up to hundreds of milliseconds in total – 50 ns per tuple across 10 million tuples is already 500 ms of CPU time.
Even with parallel or distributed plans, each worker still runs its slice of the executor in a single-threaded loop. CPU time remains the dominant cost on each core, and the executor’s tight loops determine how efficiently that time is used.
This is where executor efficiency becomes a central optimization target.
PostgreSQL’s executor favors flexibility, but not CPU efficiency. Several design choices lead to persistent overheads: tuples flow through the plan tree one at a time, each paying a chain of per-node function calls; values are passed around as generic Datums; and there’s little opportunity for SIMD or CPU pipelining.
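For reference, here is a simplified sketch of the pull loop at the top of the executor, modeled on ExecutePlan() in execMain.c; error handling, scan direction, and row-count logic are omitted:

for (;;)
{
    /* One indirect function call per tuple; the call chain recurses
     * down the whole plan tree before a single tuple comes back up. */
    TupleTableSlot *slot = ExecProcNode(planstate);

    if (TupIsNull(slot))
        break;              /* end of plan output */

    /* ... hand the tuple to the DestReceiver, one at a time ... */
}

Every node below the root repeats the same pattern, so the per-tuple call overhead compounds with plan depth.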
In analytical queries where the system processes millions of tuples, these small per-tuple inefficiencies add up, and CPU usage becomes visibly dominated by executor logic, not I/O. You can see that in the following perf profile of a backend running select agg(col1) from table where col2 = ?:
--99.99%--exec_simple_query
|
--99.97%--PortalRun
PortalRunSelect
standard_ExecutorRun
ExecSeqScanWithQualProject
|
|--71.50%--ExecInterpExpr
| |
| |--62.97%--slot_getsomeattrs_int
| | |
| | --60.89%--tts_buffer_heap_getsomeattrs
| |
| --0.72%--int4eq
|
|--23.78%--heap_getnextslot
| |
| |--16.17%--heapgettup_pagemode
| | |
| | |--6.81%--heap_prepare_pagescan
| | | |
| | | --1.22%--HeapTupleSatisfiesVisibility
| | |
| | --4.35%--read_stream_next_buffer
| | |
| | --4.20%--StartReadBuffer
| | |
| | --3.12%--BufTableLookup
| | |
| | --3.10%--hash_search_with_hash_value
| |
| --4.88%--ExecStoreBufferHeapTuple
|
--1.32%--MemoryContextReset
That ExecInterpExpr() taking up 71% of CPU time is mostly spent deforming Postgres HeapTuples to extract the column needed to evaluate the WHERE condition (col2) and then the column needed to apply the aggregation function (col1).
There are realistic improvements possible within the current executor structure, particularly by introducing batching, where execution work such as deforming and qual evaluation is done in tight loops over batches of tuples instead of one tuple at a time.
ExecScanExtended() and the table AM

Today, ExecScanExtended() pulls one tuple at a time from the table access method. That means one function call per tuple, one deform, one qual eval. A straightforward improvement would be to allow AMs to return small batches of tuples – initially as arrays of TupleTableSlot, each wrapping a HeapTuple.
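As a rough sketch, such a batched call could be layered on top of the existing table_scan_getnextslot(); the name table_scan_getnextbatch() and its signature below are assumptions for illustration, not a PostgreSQL API:

/* Hypothetical batched variant of table_scan_getnextslot(). Fills up
 * to maxslots slots and returns the count, so the scan node pays one
 * call per batch instead of one call per tuple. */
static int
table_scan_getnextbatch(TableScanDesc scan, ScanDirection dir,
                        TupleTableSlot **slots, int maxslots)
{
    int     n = 0;

    while (n < maxslots && table_scan_getnextslot(scan, dir, slots[n]))
        n++;

    return n;       /* 0 means end of scan */
}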
This would preserve existing executor abstractions while enabling batched deforming and qual evaluation in the nodes above.
In the longer term, this could evolve into a more compact representation such as a TupleBatch, reducing redundancy like repeated TupleDesc or per-tuple metadata.
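One possible shape for that, purely as an illustration – the struct and its field names are assumptions, not a PostgreSQL type:

/* Hypothetical columnar-ish batch: one shared TupleDesc for the whole
 * batch instead of per-slot metadata. */
typedef struct TupleBatch
{
    TupleDesc   desc;       /* one shared descriptor for all tuples */
    int         ntuples;    /* number of valid tuples in the batch */
    Datum      *values;     /* ntuples x natts column values */
    bool       *isnull;     /* ntuples x natts null flags */
} TupleBatch;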
Tuple deforming today walks over attributes one by one, checking for nulls and format flags for every attribute in every tuple. There’s no batching, and no tight loop over a group of tuples.
Even without rewriting the deforming logic, a simple loop that performs deforming over a batch of TupleTableSlots (each wrapping a HeapTuple) can improve CPU efficiency.
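A minimal sketch, assuming a hypothetical ExecDeformBatch() helper built on the existing slot_getsomeattrs():

/* Deform every tuple in the batch back to back, so the deforming code
 * and its branch history stay hot in the CPU across the whole batch.
 * Each call extracts the first natts attributes into the slot's
 * tts_values/tts_isnull arrays. */
static void
ExecDeformBatch(TupleTableSlot **slots, int nslots, int natts)
{
    for (int i = 0; i < nslots; i++)
        slot_getsomeattrs(slots[i], natts);
}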
This baseline change can bring measurable improvements even before more complex optimizations like SIMD are introduced.
Executor performance matters because no matter how smart the planner is or how fast the storage layer gets, the executor still runs the actual computation. It’s where the tuples are filtered, joined, aggregated, and returned, and it’s still CPU-bound work.