May 16, 2022
Prepared statements (aka parameterized queries) suffer when you have partitioned tables mentioned in them. Let’s see why.
Using prepared statements, a client can issue a query in two stages. First, prepare
PREPARE a_name AS <query>, followed by multiple executions using
EXECUTE a_name. During the
PREPARE step, the backend will only parse-analyze the
query and remember the resulting parse tree in a cache using the provided name as its
lookup key. During each
EXECUTE step, the backend will look that parse tree up in
the cache, create or look up a plan for it, and execute it to return the query’s output
rows. Note that both
PREPARE and subsequent
EXECUTEs must run in the same connection,
because the storage used for the parse tree and the plan for a given query is process
local at the moment.
The query specified in
PREPARE may put numbered parameters ($1, $2, …) in place of
the actual constant values in the query’s conditions (like
WHERE). The values themselves
are only supplied during a given
The benefit of this 2-stage processing is that a lot of CPU cycles are saved by not redoing the parse-analysis processing on every execution, which is fine because the parse tree does not change unless objects mentioned in the query were changed by DDL, something that tends to happen rather unfrequently.
Even more CPU cycles are saved if the
EXECUTE step is able to use a plan that is also
cached, instead of building it from scratch for every execution, that is, for each fresh
set of parameter values.
The performance benefit of this 2-stage protocol of performing queries can be easily
verified using the handy
pgbench tool, which allows specifying which protocol to use
when running the benchmark queries using the parameter
--protocol=querymode. The value
simple instructs it to execute the queries in one go (parse-plan-execute) and prepared
to use the 2-stage method described above.
$ pgbench -i > /dev/null 2>&1 $ pgbench -S -t 100000 --protocol=simple pgbench (15devel) starting vacuum...end. transaction type: <builtin: select only> scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 maximum number of tries: 1 number of transactions per client: 100000 number of transactions actually processed: 100000/100000 number of failed transactions: 0 (0.000%) latency average = 0.058 ms initial connection time = 9.907 ms tps = 17201.844726 (without initial connection time)
$ pgbench -i > /dev/null 2>&1 $ pgbench -S -t 100000 --protocol=prepared ... latency average = 0.031 ms initial connection time = 9.686 ms tps = 32234.211200 (without initial connection time)
The latency average for query
SELECT abalance FROM pgbench_accounts WHERE aid = ? is 0.031
milliseconds when performed using the 2-stage protocol versus 0.058 when using the simple protocol,
so the caching does seem productive.
It’s interesting to look at the mechanism of how the plan itself is cached, because that is where the problems with partitioned tables can be traced to. The backend code relies on the plancache module which hosts the plan caching logic. When presented with the parse tree of a query, its job is to figure out if a plan matching the query already exists and if one does, check if it’s still valid by acquiring read locks on all the relations that it scans. If any of the relations has changed since the plan was created, this validation step (the locking) detects those changes and triggers re-planning.
An important bit about such a cached plan is that, when creating it, the plancache would not have
passed the actual values of parameters mentioned in
EXECUTE to the planner, so it is a generic
(parameter-value-independent) plan. In most cases, the plan is the same as one that the plancache
would have gotten by passing the values, because the planner is smart enough to perform some of its
optimizations even in the absence of the actual constant values to compare against the table/index
statistics, at least for simpler queries. For example, it can heuristically determine that an index
colname can be used to optimize
colname = $1 in a similar manner as it can be used for
colname = <actual-constant-value>.
However, parameter values not being available to the planner when creating a generic plan can be a big problem if the query contains partitioned tables. Without plan-time partition pruning, which relies on constant values being available for the planner to compare against the partition bounds found in the system catalog, the generic plan must include all partitions.
Now, while pruning will occur during execution (run-time pruning), there are a number of steps
that manipulate the generic plan that must take place before the pruning can occur, each of which
currently takes O(n) time, where n is the number of partitions present in the generic plan. One of
those steps, the one that consumes the most time, is the plancache module’s validation of the plan
which must lock all relations mentioned in the plan. The more partitions there are, the longer it
will take the validation step to finish. The following graph shows the average latency reported by
pgbench -S --protocol=prepared with varying number of partitions (initialized with
pgbench -i --partitions=N in each case), which shows the latency increases as the number of
The following graph also plots the latencies one may see by forcing the plancache module to
create a parameter value dependent (custom) plan on every
EXECUTE (by setting
force_custom_plan in postgresql.conf), whereby the planner is able to prune the unnecessary
However, forcing re-planning on each
EXECUTE means leaving the peformance benefit of plan caching
on the table. So I proposed a patch to fix the
generic plan validation step such that it only locks the partitions that match the parameter values
EXECUTE by performing the run-time pruning earlier than the validation step. Here is
the graph showing the improved generic plan execution latency after applying the patch:
As expected, the latency no longer degrades, and perhaps more importantly, is indeed lower than with
forced re-planning on each
Note: With default setting of
auto), Postgres currently will choose a custom plan
for partitions because a generic plan looks expensive given that it must contain all partitions, so
the performance one gets by default looks more like the custom line in 2nd graph. Along with the patch
I mentioned above which improves the execution performance of generic plans containing partitions, we
will also need to fix things such that generic plans don’t appear more expensive than custom plans to the
plan caching logic for it to choose the former. Till that’s also fixed, users will need to use
force_generic_plan to have plan caching for partitions.