Reading Parallel Plans Correctly

Historically, the PostgreSQL server has followed a process-based model: many independent processes with partially shared memory. Each process serves exactly one client connection and handles one query at a time, with no multithreading involved.

That’s why, within each individual process, you won’t find any of the usual oddities of parallel code execution: locks, race conditions, and so on. This makes developing the DBMS itself a pleasant and simple task.

But this very simplicity also brings a significant limitation. Since there is only one worker thread per process, it can use no more than one CPU core to execute a query, which means the server's speed depends directly on the frequency and architecture of a single core.

In today's world where multicore and multiprocessor systems have stolen the limelight, such behavior is an unacceptable luxury and waste of resources. For this reason, starting with PostgreSQL 9.6, a number of operations in a query can be parallelized, i.e. performed by several processes simultaneously.

You can find a number of parallelism cases described in the article “Parallelism in PostgreSQL” by Ibrar Ahmed.

If you're using one of the latest versions of PostgreSQL, there is a good chance you'll see Parallel… in the plan when dealing with Seq Scan, Join, Aggregate, Append, etc.
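Whether the planner actually picks a parallel plan depends on a handful of settings. Here is a minimal check you can run, assuming a stock installation (the value of 2 below is just an illustration, not a recommendation):

-- how many extra worker processes a single Gather / Gather Merge node may use
SHOW max_parallel_workers_per_gather;

-- allow up to 2 workers per gather node for the current session
SET max_parallel_workers_per_gather = 2;

-- the planner also weighs these costs and thresholds before going parallel
SHOW parallel_setup_cost;
SHOW parallel_tuple_cost;
SHOW min_parallel_table_scan_size;  -- PostgreSQL 10+; 9.6 used min_parallel_relation_size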

Parallelism brings certain complications, but don't worry — we've already studied these peculiarities and have you covered.

As we keep sharing more features of our PostgreSQL query analysis service, Saby Explain, below you’ll find our experience in addressing these challenges.

Time oddities

Let’s have a look at a plan from PostgreSQL 9.6:
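The full plan in the original post is shown as a screenshot, and the exact query is not reproduced here. Judging by the fragments quoted below (table tst, an integer column i, a filter (i % 3) = 0, roughly a million rows), a setup along these lines yields a plan of the same shape; the table definition and row count are assumptions:

-- hypothetical reconstruction of the test table (~1M integers)
CREATE TABLE tst AS
SELECT i FROM generate_series(1, 999999) i;

-- the query whose plan fragments are discussed below
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tst WHERE i % 3 = 0;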

Parallel Seq Scan alone took 153.621 ms within its subtree, while Gather with all of its subnodes took only 104.867 ms. Why does the parent, children included, show less time? Let's have a closer look at the Gather node:

Gather (actual time=0.969..104.867 rows=333333 loops=1)
    Workers Planned: 2
    Workers Launched: 2
    Buffers: shared hit=4425

Workers Launched: 2 shows that two more processes were started below this node in addition to the main one, so a total of 3 processes were working within the Gather subtree.

Now let's see what’s happening in Parallel Seq Scan:

Parallel Seq Scan on tst (actual time=0.024..51.207 rows=111111 loops=3)
    Filter: ((i % 3) = 0)
    Rows Removed by Filter: 222222
    Buffers: shared hit=4425

loops=3 indicates that all three processes worked on the data in this node. Each pass took 51.207 ms on average, meaning the server needed 51.207 x 3 = 153.621 milliseconds of CPU time to process this node in total. And this is exactly the number we need to get a real picture of what the server was busy with.

Note that to recover the real (wall-clock) execution time, we go the other way and divide the total time by the number of participating processes: [actual time] x [loops] / ([Workers Launched] + 1), since the leader process works on the node alongside the launched workers.

In our example, each process executed only one loop of the node, so 153.621 / 3 = 51.207 ms. With that in mind, it's no longer strange that Gather, which runs only in the parent process, reports a smaller time.

The main takeaway is that you need the total execution time of the node (all processes summed up) to see what was keeping your server busy and to identify which part of the query to optimize first.

From this perspective, explain.depesz.com, for example, which shows the “averaged real” time per node, looks less useful for debugging purposes.

Disagree? Feel free to comment below.

Gather Merge loses it all

Let’s now run the same query on PostgreSQL 10.

Note that we now have the Gather Merge node instead of Gather in the plan. Here is what the manual says about it:

When the node at the top of the parallel portion of the plan is Gather Merge rather than Gather, it indicates that each process executing the parallel portion of the plan is producing tuples in sorted order, and that the leader is performing an order-preserving merge. In contrast, Gather reads tuples from the workers in whatever order is convenient, destroying any sort order that may have existed.
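The query text is not shown in this excerpt, but judging by the Sort Key: i and the Limit returning 10000 rows in the plans below, a query of roughly this shape would produce a Gather Merge on top of per-worker sorts; the ORDER BY and LIMIT here are inferred from the plan, not taken from the original:

-- hypothetical query shape matching the Sort Key and Limit seen below
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tst
WHERE i % 3 = 0
ORDER BY i
LIMIT 10000;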

But it seems something is rotten in the state of Denmark:

Limit (actual time=110.740..113.138 rows=10000 loops=1)
  Buffers: shared hit=888 read=801, temp read=18 written=218
  I/O Timings: read=9.709
  ->  Gather Merge (actual time=110.739..117.654 rows=10000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=2943 read=1578, temp read=24 written=571
        I/O Timings: read=17.156

When the Buffers and I/O Timings attributes were passed up the tree, part of the data got lost: the Limit node reports only 888 + 801 = 1689 shared buffers versus the 2943 + 1578 = 4521 seen at Gather Merge. Roughly 2/3 of these statistics, the part accumulated by the auxiliary worker processes, never made it to the top.

Unfortunately, there is no way to recover this information from the plan itself, which is why per-node calculations end up with negative values for the top node. And if we look at the further evolution of this plan in PostgreSQL 12, it remains essentially unchanged, except for some stats added for each worker on the Sort node:

Limit (actual time=77.063..80.480 rows=10000 loops=1)
  Buffers: shared hit=1764, temp read=223 written=355
  ->  Gather Merge (actual time=77.060..81.892 rows=10000 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=4519, temp read=575 written=856
        ->  Sort (actual time=72.630..73.252 rows=4278 loops=3)
              Sort Key: i
              Sort Method: external merge  Disk: 1832kB
              Worker 0:  Sort Method: external merge  Disk: 1512kB
              Worker 1:  Sort Method: external merge  Disk: 1248kB
              Buffers: shared hit=4519, temp read=575 written=856
              ->  Parallel Seq Scan on tst (actual time=0.014..44.970 rows=111111 loops=3)
                    Filter: ((i % 3) = 0)
                    Rows Removed by Filter: 222222
                    Buffers: shared hit=4425
Planning Time: 0.142 ms
Execution Time: 83.884 ms
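As a side note, if you want the per-worker breakdown instead of the merged totals, adding the VERBOSE option to EXPLAIN makes recent PostgreSQL versions print separate Worker N lines for parallel nodes; how much per-worker detail you get depends on the server version, so treat this as a sketch to try on your own setup:

-- VERBOSE asks EXPLAIN to print per-worker detail lines
-- (e.g. Worker 0: actual time=... rows=... loops=1) for parallel nodes
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT * FROM tst
WHERE i % 3 = 0
ORDER BY i
LIMIT 10000;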

There is one more takeaway I’d like to leave you with. Do not rely blindly on the data you see above the Gather Merge node.

Get Saby Explain