Postgres: executor bottlenecks

July 08, 2025

I wrote this post to expand on the first slide of a talk I gave recently at PGConf.in 2025 and again at POSETTE 2025, covering executor performance improvements in PostgreSQL. It revisits some observations about where the executor spends its time and sketches directions that could realistically improve single-node query execution efficiency without a full executor rewrite – though we’d need that someday too.

OLTP workloads and executor overhead

Short OLTP-style queries – primary key lookups, targeted updates – still dominate many production systems. For most of these, executor overhead is not the limiting factor. I/O latency, WAL flushes, locks, or client round trips tend to matter more. And while such queries do traverse the full executor plan tree from root to leaf and back, the tree is usually shallow – just a scan node and maybe a filter or limit – and they return after a few tuples, so they do not stress much of the executor’s runtime machinery.

That said, executor efficiency does become visible in certain OLTP scenarios: high-frequency queries under tight response time constraints, in-memory workloads on replicas or caches where CPU is the main bottleneck, and situations where small per-tuple overheads compound across thousands of queries per second. So while OLTP is not the executor’s primary stress test, reducing unnecessary work in hot paths like tuple deforming or qual evaluation is still worthwhile.

OLAP is where overheads pile up

In OLAP workloads, the same inefficiencies multiply quickly. Queries that scan millions of tuples, join wide rows, and apply many filters or aggregates can turn a few dozen nanoseconds of overhead per tuple into hundreds of milliseconds of total query time. Even with parallel or distributed plans, each worker runs its slice of the executor in a single-threaded loop, so CPU time per core is what matters, and the executor’s tight loops determine how efficiently that time is used.

What gets in the way

Postgres’ executor favors flexibility over CPU efficiency. A few design choices stand out as persistent sources of overhead.

The volcano-style execution model processes one tuple at a time, with each plan node calling the one below it to fetch the next tuple. This leads to frequent function calls, virtual dispatch into plan-node-specific code, and poor instruction cache locality.
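The shape of that per-tuple dispatch can be sketched in a few lines. This is a hypothetical miniature, not Postgres code – the struct and function names are invented – but it mirrors the pattern: each node pulls one tuple at a time from its child through an indirect call, so every tuple pays for at least one function pointer dispatch per plan node.

```c
#include <stddef.h>
#include <stdbool.h>
#include <assert.h>

/* Invented, simplified volcano-style plan node: each node pulls one
 * tuple at a time from its child via an indirect call, loosely in the
 * spirit of ExecProcNode(). */
typedef struct PlanNode PlanNode;
struct PlanNode
{
    /* returns the next value via *out; false when exhausted */
    bool (*next)(PlanNode *self, int *out);
    PlanNode *child;
    void *state;
};

typedef struct { const int *rows; size_t n, pos; } ScanState;

static bool scan_next(PlanNode *self, int *out)
{
    ScanState *st = self->state;
    if (st->pos >= st->n)
        return false;
    *out = st->rows[st->pos++];   /* one tuple per call */
    return true;
}

/* Filter node: for every tuple, one indirect call into the child plus
 * the qual check -- the per-tuple overhead described above. */
static bool filter_next(PlanNode *self, int *out)
{
    int v;
    while (self->child->next(self->child, &v))
    {
        if (v % 2 == 0)           /* the "qual": keep even values */
        {
            *out = v;
            return true;
        }
    }
    return false;
}
```

Running a query here means the top node's next() is called once per output tuple, and each call cascades down the tree – exactly the call pattern that shows up as executor overhead in profiles.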

Tuple deforming happens one tuple at a time, one attribute at a time, with branches on every attribute to handle nulls, alignment, and type-specific layout. There is no batching, no tight loop over a group of similar tuples.
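To make the branchiness concrete, here is a hypothetical, much-simplified row layout and deforming loop – a one-byte null bitmap followed by packed values, nothing like the real HeapTuple format – that still exhibits the same structure: one pass per tuple, with a null-check branch and a width branch on every attribute.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>

enum { MAX_ATTRS = 8 };

/* Invented minimal row descriptor: attribute count and byte widths. */
typedef struct
{
    int natts;
    uint8_t attlen[MAX_ATTRS];
} RowDesc;

/* Extract all attributes into values[]/isnull[], one attribute at a
 * time: a null-bit branch and a width branch per attribute, with the
 * offset of each attribute depending on everything before it --
 * data-dependent control flow that is hard for the CPU to pipeline. */
static void deform_row(const RowDesc *desc, const uint8_t *row,
                       int64_t *values, bool *isnull)
{
    uint8_t nullbits = row[0];
    const uint8_t *p = row + 1;

    for (int i = 0; i < desc->natts; i++)
    {
        if (nullbits & (1u << i))       /* null branch */
        {
            isnull[i] = true;
            values[i] = 0;
            continue;                   /* nulls store no data */
        }
        isnull[i] = false;
        switch (desc->attlen[i])        /* width branch */
        {
            case 1: values[i] = *p; break;
            case 2: { int16_t v; memcpy(&v, p, 2); values[i] = v; break; }
            case 4: { int32_t v; memcpy(&v, p, 4); values[i] = v; break; }
            case 8: { int64_t v; memcpy(&v, p, 8); values[i] = v; break; }
        }
        p += desc->attlen[i];
    }
}
```

Real heap deforming also deals with alignment padding and varlena headers, which add further branches on top of what this sketch shows.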

Expression evaluation runs a linear sequence of opcodes per tuple (the ExecInterpExpr() interpreter introduced in Postgres 10), which avoids the overhead of recursive tree evaluation, but is still fundamentally row-at-a-time. Each tuple is processed in isolation, intermediate results are boxed into Datums, and there is little opportunity for SIMD or CPU pipelining.
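A toy version of that interpreter style looks like the following – the opcodes and names are invented for illustration, but the shape matches: the expression "col = const" compiled into a flat step array, re-executed from the top for every single row.

```c
#include <assert.h>

/* Invented miniature of a linear expression interpreter in the spirit
 * of ExecInterpExpr(): a flat opcode array instead of a recursive
 * expression tree, evaluated once per tuple. */
typedef enum { OP_VAR, OP_CONST, OP_EQ, OP_DONE } OpCode;

typedef struct { OpCode op; long arg; } Step;

/* Evaluate one row against the program; a tiny stack machine.
 * Note the row-at-a-time structure: every tuple pays the full
 * dispatch loop, and intermediate results are materialized
 * individually (as Datums, in the real executor). */
static long eval_expr(const Step *prog, const long *row)
{
    long stack[8];
    int sp = 0;

    for (int pc = 0; ; pc++)
    {
        switch (prog[pc].op)
        {
            case OP_VAR:   stack[sp++] = row[prog[pc].arg]; break;
            case OP_CONST: stack[sp++] = prog[pc].arg; break;
            case OP_EQ:
                sp--;
                stack[sp - 1] = (stack[sp - 1] == stack[sp]);
                break;
            case OP_DONE:
                return stack[sp - 1];
        }
    }
}
```

Nothing in this loop can use SIMD or amortize dispatch across rows – the program restarts from step zero for each tuple, which is precisely the limitation batching would address.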

In analytical queries processing millions of tuples, these per-tuple inefficiencies add up and CPU usage becomes visibly dominated by executor logic rather than I/O. The following perf profile of a backend running SELECT agg(col1) FROM t WHERE col2 = ? illustrates the point:

                --99.99%--exec_simple_query
                          |
                           --99.97%--PortalRun
                                     PortalRunSelect
                                     standard_ExecutorRun
                                     ExecSeqScanWithQualProject
                                     |
                                     |--71.50%--ExecInterpExpr
                                     |          |
                                     |          |--62.97%--slot_getsomeattrs_int
                                     |          |          |
                                     |          |           --60.89%--tts_buffer_heap_getsomeattrs
                                     |          |
                                     |           --0.72%--int4eq
                                     |
                                     |--23.78%--heap_getnextslot
                                     |          |
                                     |          |--16.17%--heapgettup_pagemode
                                     |          |          |
                                     |          |          |--6.81%--heap_prepare_pagescan
                                     |          |          |          |
                                     |          |          |           --1.22%--HeapTupleSatisfiesVisibility
                                     |          |          |
                                     |          |           --4.35%--read_stream_next_buffer
                                     |          |                     |
                                     |          |                      --4.20%--StartReadBuffer
                                     |          |                                |
                                     |          |                                 --3.12%--BufTableLookup
                                     |          |                                           |
                                     |          |                                            --3.10%--hash_search_with_hash_value
                                     |          |
                                     |           --4.88%--ExecStoreBufferHeapTuple
                                     |
                                      --1.32%--MemoryContextReset

ExecInterpExpr() takes 71% of CPU time here. Nearly all of that is slot_getsomeattrs_int() deforming HeapTuples to extract the columns needed for the WHERE condition and the aggregation – not the actual filter comparison or aggregate computation, which barely register.

Realistic improvements

Worthwhile improvements are possible within the current executor architecture, particularly by introducing batching: doing deforming and qual evaluation in tight loops over groups of tuples rather than one tuple at a time.

Batching between the scan node and the table AM

Today, ExecScanExtended() pulls one tuple at a time from the table access method: one call, one deform, one qual evaluation. A natural improvement is to let AMs return small batches of tuples – initially as arrays of TupleTableSlots, each wrapping a HeapTuple. That preserves existing executor abstractions while enabling fewer function calls, loop-based qual and deform processing, and better instruction cache behavior. In the longer term this could evolve toward a more compact batch representation that eliminates redundancy like repeated TupleDesc references and per-tuple metadata, but even the simple version shows measurable improvement.
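The interface change can be sketched like this. This is a hypothetical batched scan API with invented names – not the actual tableam callbacks – but it shows the essential shift: one call into the AM fills an array of tuples, and the scan node then runs its qual in a tight loop over the batch.

```c
#include <assert.h>

enum { BATCH_SIZE = 64 };

/* Invented stand-in for a table AM scan over int "tuples". */
typedef struct { const int *rows; int n, pos; } TableScan;

/* One call returns up to BATCH_SIZE tuples; a short count means the
 * scan is exhausted.  Compare: the current interface would make one
 * call per tuple. */
static int scan_getnextbatch(TableScan *scan, int *batch)
{
    int count = 0;
    while (count < BATCH_SIZE && scan->pos < scan->n)
        batch[count++] = scan->rows[scan->pos++];
    return count;
}

/* The scan node applies the qual over the whole batch, amortizing the
 * AM call (and, in the real executor, slot setup and deforming setup)
 * across BATCH_SIZE tuples. */
static int scan_node_filter(TableScan *scan, int *out)
{
    int batch[BATCH_SIZE];
    int nout = 0, n;

    while ((n = scan_getnextbatch(scan, batch)) > 0)
        for (int i = 0; i < n; i++)
            if (batch[i] % 2 == 0)    /* the qual */
                out[nout++] = batch[i];
    return nout;
}
```

The inner for-loop is where the wins come from: it is a predictable, cache-friendly loop over homogeneous work, rather than a chain of per-tuple calls through several abstraction layers.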

Batched tuple deforming

Even without changing the deforming logic itself, running it in a loop over a batch of slots rather than once per tuple improves things. The same deforming code is reused across consecutive calls, which helps instruction cache locality and branch predictor accuracy when the rows being processed have similar shapes. Function call overhead is also amortized across the batch. This is a baseline improvement that sets the stage for more ambitious optimizations like SIMD later.
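As a rough illustration of the loop shape – again with an invented, fixed-width row format rather than the real heap layout – batched deforming of a single attribute across many rows looks like this: the same extraction code, including its null branch, runs back to back, and the output lands in a contiguous array that later SIMD work could operate on.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>

/* Invented row format: one null-bitmap byte followed by fixed-width
 * int32 attribute slots (nulls leave their slot unused).  Extract
 * attribute `att` from a whole batch of rows in one tight loop, so
 * the branchy extraction code stays hot in the i-cache and the branch
 * predictor sees a long run of similar decisions. */
static void deform_attr_batch(const uint8_t *const rows[], int att,
                              int nrows, int32_t *values, bool *isnull)
{
    for (int i = 0; i < nrows; i++)
    {
        const uint8_t *row = rows[i];

        if (row[0] & (1u << att))      /* null bit for this attribute */
        {
            isnull[i] = true;
            values[i] = 0;
        }
        else
        {
            isnull[i] = false;
            memcpy(&values[i], row + 1 + 4 * att, 4);
        }
    }
}
```

Producing per-attribute arrays like this is also a natural stepping stone toward a columnar batch representation inside the executor, even while the underlying storage stays row-oriented.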

Final thoughts

No matter how smart the planner is or how fast the storage layer gets, the executor still runs the actual computation – filtering, joining, aggregating, returning results. For OLTP queries, trimming executor overhead keeps latency lean. For OLAP queries, it is where most of the CPU time goes and where the biggest gains are available. And reducing architectural friction in the executor is also what makes future work, like vectorized execution, feasible without a ground-up rewrite.


© 2025 Amit Langote