Postgres: executor bottlenecks

July 08, 2025

As OLAP performance and cloud-native database architectures draw more attention, it’s worth remembering that the executor remains the core engine doing the work on every worker, even in parallel and distributed plans.

I wrote this post to expand on the first slide of a talk I gave recently at PGConf.in 2025 and then again at POSETTE 2025, covering executor performance improvements in PostgreSQL. It revisits some observations about executor bottlenecks and sketches directions for future work that could realistically improve single-node query execution efficiency without a full executor rewrite, though we may well need one someday.

OLTP workloads and executor path

Short OLTP-style queries, such as primary key lookups or targeted updates, still dominate many production systems that back user applications. For example:

SELECT * FROM orders WHERE order_id = $1;
UPDATE accounts SET balance = balance - $1 WHERE account_id = $2;

In most cases, executor overhead isn’t the dominant cost for these – I/O latency, WAL flushes, locks, or client round trips often matter more.

And while these queries do traverse the full executor plan tree, from the root down to the leaf node that reads from storage and back up to the root that returns the tuple(s) to the application, the tree is usually shallow – just a scan node and maybe a filter or limit. Because they return after a few tuples, they don’t stress much of the executor’s runtime machinery, such as join logic, aggregates, or sort nodes.
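For example, assuming orders has its primary key on order_id, and substituting a literal for the parameter, the entire plan is a single index-scan node (costs elided):

EXPLAIN SELECT * FROM orders WHERE order_id = 42;

 Index Scan using orders_pkey on orders
   Index Cond: (order_id = 42)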

Still, there are edge cases where executor efficiency can become noticeable – for example, lookups on wide rows that pay more for tuple deforming, or filters with long lists of quals to evaluate.

So while OLTP isn’t the executor’s biggest stress test, it still benefits from reducing unnecessary work in hot paths like tuple deforming or qual evaluation.

OLAP is where executor path overheads pile up

In OLAP workloads, the same inefficiencies multiply quickly. These queries scan millions of tuples, join wide rows, and apply many filters or aggregates. A few dozen nanoseconds of overhead per tuple can easily add up to hundreds of milliseconds in total – for instance, 30 ns of extra work per tuple across 10 million tuples is 300 ms of added CPU time.

Even with parallel or distributed plans, each worker still runs its slice of the executor in a single-threaded loop. CPU time remains the dominant cost on each core, and the executor’s tight loops determine how efficiently that time is used.

This is where executor efficiency becomes a central optimization target.

What gets in the way

PostgreSQL’s executor favors flexibility over CPU efficiency, and several design choices lead to persistent overheads: tuples are pulled through the plan tree one at a time, with a function call per tuple at every node; each tuple is deformed attribute by attribute, on demand; and quals and projections are evaluated by an interpreter for every single tuple.

In analytical queries where the system processes millions of tuples, these small per-tuple inefficiencies add up, and CPU usage becomes visibly dominated by executor logic, not I/O. You can see that in the following perf profile of a backend running select agg(col1) from table where col2 = ?:

                --99.99%--exec_simple_query
                          |          
                           --99.97%--PortalRun
                                     PortalRunSelect
                                     standard_ExecutorRun
                                     ExecSeqScanWithQualProject
                                     |          
                                     |--71.50%--ExecInterpExpr
                                     |          |          
                                     |          |--62.97%--slot_getsomeattrs_int
                                     |          |          |          
                                     |          |           --60.89%--tts_buffer_heap_getsomeattrs
                                     |          |          
                                     |           --0.72%--int4eq
                                     |          
                                     |--23.78%--heap_getnextslot
                                     |          |          
                                     |          |--16.17%--heapgettup_pagemode
                                     |          |          |          
                                     |          |          |--6.81%--heap_prepare_pagescan
                                     |          |          |          |          
                                     |          |          |           --1.22%--HeapTupleSatisfiesVisibility
                                     |          |          |          
                                     |          |           --4.35%--read_stream_next_buffer
                                     |          |                     |          
                                     |          |                      --4.20%--StartReadBuffer
                                     |          |                                |          
                                     |          |                                 --3.12%--BufTableLookup
                                     |          |                                           |          
                                     |          |                                            --3.10%--hash_search_with_hash_value
                                     |          |          
                                     |           --4.88%--ExecStoreBufferHeapTuple
                                     |          
                                      --1.32%--MemoryContextReset
 

That ExecInterpExpr() taking up ~71% of CPU time is spent mostly on deforming Postgres HeapTuples: extracting the column needed to evaluate the WHERE condition (col2) and then the column needed to apply the aggregation function (col1).

Low-hanging improvements

There are realistic improvements possible within the current executor structure, particularly by introducing batching, where execution work such as deforming and qual evaluation is done in tight loops over batches of tuples instead of one tuple at a time.

1. Batching between ExecScanExtended() and the table AM

Today, ExecScanExtended() pulls one tuple at a time from the table access method. That means one function call per tuple, one deform, one qual eval.

A straightforward improvement would be to allow AMs to return small batches of tuples – initially as arrays of TupleTableSlot, each wrapping a HeapTuple. This would preserve existing executor abstractions while enabling fewer per-tuple calls across the executor/AM boundary and giving the executor a natural unit over which to batch deforming and qual evaluation, as the sketch below illustrates.

In the longer term, this could evolve into a more compact representation such as a TupleBatch, reducing redundancy like repeated TupleDesc or per-tuple metadata.
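To make the idea concrete, here is a minimal sketch of executor-side batching built only on the existing table_scan_getnextslot() API; fetch_scan_batch() and SCAN_BATCH_SIZE are hypothetical names rather than existing PostgreSQL interfaces, and a real design would more likely push the loop down into the AM so a whole batch can be filled from a single pinned page.

#include "postgres.h"

#include "access/tableam.h"
#include "executor/tuptable.h"

#define SCAN_BATCH_SIZE 64      /* hypothetical batch size */

/*
 * Fill up to SCAN_BATCH_SIZE slots from the scan and return how many were
 * fetched; 0 means the scan is exhausted.  Each slot still wraps one
 * HeapTuple, so existing slot-based executor code keeps working.
 */
static int
fetch_scan_batch(TableScanDesc scan, ScanDirection dir,
                 TupleTableSlot **slots)
{
    int         n = 0;

    while (n < SCAN_BATCH_SIZE &&
           table_scan_getnextslot(scan, dir, slots[n]))
        n++;

    return n;
}

The scan node would then deform the batch and evaluate quals over it in tight loops, rather than interleaving those steps one tuple at a time.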

2. Batched tuple deforming

Tuple deforming today walks over attributes one by one, checking for nulls and format flags for every attribute in every tuple. There’s no batching, and no tight loop over a group of tuples.

Even without rewriting the deforming logic, a simple loop that performs deforming over a batch of TupleTableSlots (each wrapping a HeapTuple) can improve CPU efficiency:
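Here is a minimal sketch of such a loop, assuming the batch of slots from the previous section; deform_slot_batch() is a hypothetical helper, while slot_getsomeattrs() is the existing entry point that today ends up being called from inside ExecInterpExpr() once per tuple.

#include "postgres.h"

#include "executor/tuptable.h"

/*
 * Deform a whole batch of slots in one tight loop, rather than lazily,
 * one tuple at a time, in the middle of expression evaluation.  natts is
 * the highest attribute number the plan actually needs.
 */
static void
deform_slot_batch(TupleTableSlot **slots, int nslots, int natts)
{
    for (int i = 0; i < nslots; i++)
        slot_getsomeattrs(slots[i], natts);     /* fills tts_values/tts_isnull */
}

The work per tuple is unchanged; what improves is locality, because all the deforming for a batch happens back to back instead of being interleaved with scan calls and qual evaluation.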

This baseline change can bring measurable improvements even before more complex optimizations like SIMD are introduced.

Final notes

Executor performance matters because per-tuple overheads multiply across the millions of tuples an analytical query touches, and because every worker in a parallel or distributed plan still runs its slice of the executor in a single-threaded loop.

No matter how smart the planner is or how fast the storage layer gets, the executor still runs the actual computation. It’s where the tuples are filtered, joined, aggregated, and returned, and it’s still CPU-bound work.

