Nov 24, 2022
In a previous post, I
described some performance problems when using prepared statements with partitioning due to
excessive locking of partitions, causing a bottleneck as the number of partitions increases.
I also mentioned a patch
that I have proposed to address that particular problem. The following graph shows a comparison
of the TPS performance of pgbench -T60 --protocol=prepared
with varying number of partitions,
initialized using pgbench -i --partitions=$count
, before and after applying that patch. (Note:
I have set plan_cache_mode = force_generic_plan
in postgresql.conf to ensure the benchmark
measures the performance with cached generic plans.)
In this post, I would like to highlight another bottleneck that pops its head out when the locking bottleneck is addressed. Addressing that overhead with another patch that is also in the pipeline for v16 gives a somewhat significant performance improvement, provided the locking overhead is addressed first.
Let’s look at the perf
profile of a Postgres backend process running the above benchmark
without applying any patch, at low partition counts (say, 32) first:
- 97.99% 0.00% postgres libc-2.17.so [.] __libc_start_main
__libc_start_main
main
PostmasterMain
- ServerLoop
- 96.13% PostgresMain
+ 39.86% PortalStart
+ 12.67% finish_xact_command
- 11.09% GetCachedPlan
+ 9.09% LockRelationOid
+ 0.88% RevalidateCachedQuery
+ 9.53% pq_getbyte
+ 7.23% PortalRun
+ 5.36% ReadyForQuery
+ 1.85% start_xact_command
+ 0.69% EndCommand
0.54% OidInputFunctionCall
0.50% CreatePortal
Process spends a non-trivial amount of time in GetCachedPlan()
, the backend function that
looks up, validates, and returns a cached plan for a given prepared statement. One of the
things it does is locking the relations referenced in the plan. This gets worse as the
partition count increases as can be seen in the following profile taken when the same
benchmark is repeated with with 2048 partitions:
- 99.71% 0.00% postgres libc-2.17.so [.] __libc_start_main
__libc_start_main
main
PostmasterMain
- ServerLoop
- 99.54% PostgresMain
- 50.57% GetCachedPlan
+ 46.76% LockRelationOid
1.07% IsSharedRelation
+ 40.37% finish_xact_command
+ 4.56% PortalStart
+ 1.60% PortalRun
+ 0.72% pq_getbyte
+ 0.65% ReadyForQuery
Almost half of the query execution is time spent locking the partitions, all 2048 of them in this case.
The idea of the proposed patch (one mentioned in the 1st paragraph) is to reduce the set of partitions
that are locked by GetCachedPlan()
to just those that will be scanned during query execution, that is,
skipping the locking for those that can be pruned.
Here’s the profile again, with the patch applied:
- 96.97% 0.00% postgres libc-2.17.so [.] __libc_start_main
__libc_start_main
main
PostmasterMain
- ServerLoop
- 95.52% PostgresMain
+ 33.99% PortalStart
+ 15.54% finish_xact_command
+ 14.18% PortalRun
- 8.84% GetCachedPlan
+ 6.82% ExecutorDoInitialPruning
+ 0.96% RevalidateCachedQuery
+ 8.68% pq_getbyte
+ 5.05% ReadyForQuery
+ 1.48% start_xact_command
Unlike without the patch, the time spent under GetCachedPlan()
hasn’t ballooned to over 50%.
Comparing the TPS figure is even more helpful – almost 10x improvement!
OK, so where’s the other bottleneck?
PortalStart()
, whose job is to initialize the plan tree for execution spends a non-trivial
amount of time too, found in roughly over 30% of the samples. Expanding that frame, one hopes to
find the culprits:
- 97.20% 0.00% postgres libc-2.17.so [.] __libc_start_main
__libc_start_main
main
PostmasterMain
- ServerLoop
- 95.61% PostgresMain
- 34.23% PortalStart
- 33.62% standard_ExecutorStart
+ 19.52% ExecInitNode
13.23% ExecCheckRTPerms
0.55% ExecInitRangeTable
+ 15.57% finish_xact_command
+ 14.27% PortalRun
+ 9.27% GetCachedPlan
+ 8.53% pq_getbyte
+ 4.47% ReadyForQuery
+ 1.58% start_xact_command
0.51% OidInputFunctionCall
ExecCheckRTPerms()
among those is there to check whether the user running the query
has needed permissions on the table(s) mentioned in the query. It should really be
quick in this case (that is, not show as much in the perf
profile) as there is just
one table to be checked in this case (the “root” partitioned table). But the current
implementation of ExecCheckRTPerms()
is such that it spends O(n)
amount of time in
the number of partitions. Note that the code visits each partition, it doesn’t actually
checks its permissions, so that’s just wasteful looping.
The patch I mentioned in the 2nd paragraph changes things (specifically, the list contained
the plan that records the relations whose permissions should be checked) such that
ExecCheckRTPerms()
no longer needs to visit each partition. That reduces the time
spent in that function significantly, as can be seen in the following updated profile
after applying that patch:
- 96.37% 0.00% postgres libc-2.17.so [.] __libc_start_main
__libc_start_main
main
PostmasterMain
- ServerLoop
- 94.58% PostgresMain
- 24.37% PortalStart
- 23.64% standard_ExecutorStart
+ 22.30% ExecInitNode
+ 0.70% ExecInitRangeTable
+ 19.20% finish_xact_command
+ 17.54% PortalRun
+ 9.73% GetCachedPlan
+ 8.90% pq_getbyte
+ 4.86% ReadyForQuery
+ 1.63% start_xact_command
0.54% OidInputFunctionCall
So the time spent in PortalStart()
goes from 34% down to 24%. The TPS is slightly improved too
as can be seen in the following graph:
While it is unquestionable that these patches are necessary for seeing these bottlenecks out the door, the changes they make to the query execution pipeline are non-trivial and haven’t been fully reviewed yet. The work is ongoing and I hope they make the finish line for the next year’s v16.
Addendum: The finish_xact_command()
also spends more time than it really should in this particular
benchmark (generic plan, with many partitions). There’s work underway to fix that too.
Addendum 2: 2nd patch mentioned above, one that reworks ExecCheckRTPerms()
, has been committed for now by
Álvaro as a61b1f748, so will hopefully be a part of the next (v16) release.