November 6, 2017

Understanding Query Plans

I was recently trying a product feed for all products on the Spring website for a 3rd party vendor. The implementation plan was as follows:

  1. Create a cron job that runs once a day to create a product feed.
  2. The product feed is a CSV file in S3 which contains information such as product name, description, brand name, photo urls, taxonomy, etc.
  3. Only active products should be included in the feed.

For this use case, an active product is defined as a product with: - An active brand - Active posts - Taxonomy - Status not deleted

These products can be retreived from our database with the following query:

SELECT *
FROM
    products
JOIN
    brands ON products.vendor_user_id = brands.id
WHERE
    products.taxonomy != '{}'
    AND products.taxonomy IS NOT NULL
    AND products.deleted IS FALSE
    AND brands.status = 'active'
    AND(
        EXISTS(
            SELECT 1
            FROM
                posts
            WHERE
                posts.product_id = products.id
                AND posts.author_id = products.vendor_user_id
                AND (posts.active_dates @> NOW())
        )
    )

The query plan for this query is as follows:

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=11.43..17.88 rows=1 width=8821)
   Join Filter: (vendors.id = products.vendor_user_id)
   ->  Nested Loop  (cost=11.30..12.38 rows=1 width=3762)
         Join Filter: (vendors.id = posts.author_id)
         ->  Seq Scan on vendors  (cost=0.00..1.01 rows=1 width=3754)
               Filter: (is_brand AND (status = 'active'::text))
         ->  HashAggregate  (cost=11.30..11.33 rows=3 width=8)
               Group Key: posts.product_id, posts.author_id
               ->  Bitmap Heap Scan on posts  (cost=4.17..11.28 rows=3 width=8)
                     Recheck Cond: (active_dates @> now())
                     ->  Bitmap Index Scan on index_posts_on_active_dates  (cost=0.00..4.17 rows=3 width=0)
                           Index Cond: (active_dates @> now())
   ->  Index Scan using products_vendor_user_id_created_at on products  (cost=0.14..5.49 rows=1 width=5067)
         Index Cond: (vendor_user_id = posts.author_id)
         Filter: ((taxonomy IS NOT NULL) AND (deleted IS FALSE) AND (taxonomy <> '{}'::text[]) AND (posts.product_id = id))
(15 rows)

Strategies for Retrieving Products in Batches

Running SELECT max(id) FROM products currently returns 54,368,440. Obviously, we can’t pull out all 54 million + products at the same time, so we have to limit the products that are returned. There are two ways in which I tried to go about doing this:

  1. Provide a start product id, and limit the results to some batch size (let’s say, 100).

    • This method guarantees that exactly 100 products will be returned everytime (until the last batch).
  2. Provide a start product id and end product id.

    • This method means that at most 100 products will be returned everytime, and sometimes, no products will be returned.

At first glance, method 1 seems to be the right move. However, upon trying it out, I ran into timeouts at batch size of as little as 10. This was unexpected, so I took a look at the query plans.

Query Plan 1: Provide a Start Id, Limit Products to a Batch Size

The query looks like this:

EXPLAIN SELECT *
FROM
    products
JOIN
    brands ON products.vendor_user_id = brands.id
WHERE
    products.taxonomy != '{}'
    AND products.taxonomy IS NOT NULL
    AND products.deleted IS FALSE
    AND brands.status = 'active'
    AND(
        EXISTS(
            SELECT 1
            FROM
                posts
            WHERE
                posts.product_id = products.id
                AND posts.author_id = products.vendor_user_id
                AND (posts.active_dates @> NOW())
        )
    )
		AND products.id >= 1
ORDER BY products.id
LIMIT 100
;
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Limit  (cost=16.34..21625.27 rows=100 width=3557)
   ->  Nested Loop  (cost=16.34..14388108.63 rows=66584 width=3557)
         ->  Merge Semi Join  (cost=16.06..14332198.19 rows=163224 width=1447)
               Merge Cond: (products.id = posts.product_id)
               Join Filter: (products.vendor_user_id = posts.author_id)
               ->  Index Scan using products_pkey on products  (cost=0.56..14182152.56 rows=2473929 width=1443)
                     Index Cond: (id >= 1)
                     Filter: ((taxonomy IS NOT NULL) AND (deleted IS FALSE) AND (taxonomy <> '{}'::text[]))
               ->  Index Scan using maciek_foobarbaz on posts  (cost=0.43..152587.13 rows=428285 width=8)
                     Filter: (active_dates @> now())
         ->  Index Scan using vendors_pkey on vendors  (cost=0.28..0.33 rows=1 width=2110)
               Index Cond: (id = products.vendor_user_id)
               Filter: (is_brand AND (status = 'active'::text))
(13 rows)

Reading the Postgres EXPLAIN Query Plan:

To understand why this query timed out, let’s break down the query execution plan. In this example, we used EXPLAIN to understand the SQL statement. This means that the statement is not actually executed and we are just getting a prediction of what happens. If we wanted to actually run the query, we would use EXPLAIN ANALYZE.

Postgres builds a tree structure of the different actions that it takes to run the query. In the case of this query, the structure is as follows:

Nested Loop
└── Hash Join
    ├── Merge Semi Join
        └── Merge Cond
        └── Join Filter
            └── Index Scan
            └── Index Scan
    └── Index Scan
Pt1: Index Scan

Query plans are meant to be read inside out, so to figure out what’s happening “first,” let’s start with:

->  Index Scan using vendors_pkey on vendors  (cost=0.28..0.33 rows=1 width=2110)
      Index Cond: (id = products.vendor_user_id)
          Filter: (is_brand AND (status = 'active'::text))

This corresponds to the SQL WHERE brands.status = 'active'. In order for Postgres to find all the rows that match this condition,

Postgres explains a total “cost” of 5.04 to find these values, with a startup cost of 0.28 to start doing this computation. It explains to return 1 row and estimated 2110 bytes.

And index scan is…

Pt2: Breaking down the Merge Semi Join
->  Merge Semi Join  (cost=16.06..14332198.19 rows=163224 width=1447)
      Merge Cond: (products.id = posts.product_id)
      Join Filter: (products.vendor_user_id = posts.author_id)
      ->  Index Scan using products_pkey on products  (cost=0.56..14182152.56 rows=2473929 width=1443)
            Index Cond: (id >= 1)
            Filter: ((taxonomy IS NOT NULL) AND (deleted IS FALSE) AND (taxonomy <> '{}'::text[]))
      ->  Index Scan using maciek_foobarbaz on posts  (cost=0.43..152587.13 rows=428285 width=8)
            Filter: (active_dates @> now())

This corresponds to the SQL:

WHERE
  EXISTS(
      SELECT 1
      FROM
          posts
      WHERE
          posts.product_id = products.id
          AND posts.author_id = products.vendor_user_id
          AND (posts.active_dates @> NOW())
  )

Query Plan 2: Provide a Start Id and a End Id

EXPLAIN SELECT *
FROM
    products
JOIN
    brands ON products.vendor_user_id = brands.id
WHERE
    products.taxonomy != '{}'
    AND products.taxonomy IS NOT NULL
    AND products.deleted IS FALSE
    AND brands.status = 'active'
    AND(
        EXISTS(
            SELECT 1
            FROM
                posts
            WHERE
                posts.product_id = products.id
                AND posts.author_id = products.vendor_user_id
                AND (posts.active_dates @> NOW())
        )
    )
		AND products.id >= 1
		AND products.id < 101
;

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.27..1114.92 rows=1 width=3553)
   Join Filter: (products.vendor_user_id = vendors.id)
   ->  Nested Loop Semi Join  (cost=0.99..1113.27 rows=3 width=1447)
         ->  Index Scan using products_pkey on products  (cost=0.56..511.26 rows=47 width=1443)
               Index Cond: ((id >= 1) AND (id < 101))
               Filter: ((taxonomy IS NOT NULL) AND (deleted IS FALSE) AND (taxonomy <> '{}'::text[]))
         ->  Index Scan using maciek_foobarbaz on posts  (cost=0.43..12.80 rows=1 width=8)
               Index Cond: (product_id = products.id)
               Filter: ((products.vendor_user_id = author_id) AND (active_dates @> now()))
   ->  Index Scan using vendors_pkey on vendors  (cost=0.28..0.54 rows=1 width=2110)
         Index Cond: (id = posts.author_id)
         Filter: (is_brand AND (status = 'active'::text))
(12 rows)