I’m computing percentile sales data grouped by product / US state, and the query is extremely slow for a large list of locations (see the long "dma" list in the query below). The table is around 15 million rows.
SELECT 'us_state' AS type,
       us_state AS type_id,
       product,
       COUNT(*) AS count,
       MIN(depletions) AS min,
       MAX(depletions) AS max,
       SUM(depletions) AS sum,
       MIN(start_date) AS start_date,
       MAX(end_date) AS end_date,
percentile_cont(0.2) WITHIN GROUP (ORDER BY depletions) AS p20,
percentile_cont(0.4) WITHIN GROUP (ORDER BY depletions) AS p40,
percentile_cont(0.5) WITHIN GROUP (ORDER BY depletions) AS p50,
percentile_cont(0.6) WITHIN GROUP (ORDER BY depletions) AS p60,
percentile_cont(0.8) WITHIN GROUP (ORDER BY depletions) AS p80
FROM "venue_product_depletions_locations"
WHERE "account_id" = 6352 AND "dma" IN ('862','716','532','551','521','622','752','575','604','881','576','555','801','825','569','641','506','588','531','619','855','636','790','623','523','567','512','519','807','819','581','508','600','751','839','691','518','658','693','630','510','635','789','669','661','709','770','673','705','736','500','644','544','866','507','821','520','556','686','592'',89','522','560','557','676','573','529','515','516','570','602','679','634','540','541','678','534','828','671','613','649','771','640','757','659','550','633','698','528','504','804','625','548','559','637','535','618','527','611','517','617','670','561','571','813'',65','765','509','514','566','753','538','650','820','662','546','539','803','563','582','810','501','616','773','758','609','524','530','584','702','533','656','505','511','547') AND (us_state IS NOT NULL)
GROUP BY product, us_state
I’ve got this really expensive sort in the query plan that I can’t figure out how to optimize. I’ve tried indexes on:
- account_id & dma
- product & us_state
- product, us_state & depletions
Currently the query takes ~21 seconds. If I remove the percentile_cont selects, it completes in ~7.5 seconds. I’d like to get it as fast as possible.
Here’s the query plan as well (updated to EXPLAIN (ANALYZE, BUFFERS)):
GroupAggregate (cost=1110518.81..1330560.79 rows=44443 width=137) (actual time=17829.623..26151.760 rows=5799 loops=1)
Group Key: product, us_state
Buffers: shared hit=3279444, temp read=44945 written=44950
-> Sort (cost=1110518.81..1126164.67 rows=6258343 width=41) (actual time=17829.559..21253.068 rows=6220661 loops=1)
Sort Key: product, us_state
Sort Method: external merge Disk: 359560kB
Buffers: shared hit=3279444, temp read=44945 written=44950
-> Index Scan using account_id_dma_index on venue_product_depletions_locations (cost=0.43..291274.59 rows=6258343 width=41) (actual time=0.207..4629.167 rows=6220661 loops=1)
Index Cond: ((account_id = 6352) AND ((dma)::text = ANY ('{862,716,532,551,521,622,752,575,604,881,576,555,801,825,569,641,506,588,531,619,855,636,790,623,523,567,512,519,807,819,581,508,600,751,839,691,518,658,693,630,510,635,789,669,661,709,770,673,705,736,500,644,544,866,507,821,520,556,686,"592'',89",522,560,557,676,573,529,515,516,570,602,679,634,540,541,678,534,828,671,613,649,771,640,757,659,550,633,698,528,504,804,625,548,559,637,535,618,527,611,517,617,670,561,571,"813'',65",765,509,514,566,753,538,650,820,662,546,539,803,563,582,810,501,616,773,758,609,524,530,584,702,533,656,505,511,547}'::text[])))
Filter: (us_state IS NOT NULL)
Rows Removed by Filter: 10187
Buffers: shared hit=3279444
Planning Time: 0.870 ms
Execution Time: 26200.913 ms
2 Answers
Are you sure that account_id is a string? I will suppose it’s an INT.
Turn the contents of your IN clause into an indexed temporary table and use a join:
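A minimal sketch of that approach, using the table and column names from the question (the temp-table name and the shortened value list are illustrative; the percentile columns are abbreviated to one):

```sql
-- Load the DMA codes once into an indexed temporary table
-- (name "dma_filter" is illustrative, not from the question)
CREATE TEMP TABLE dma_filter (dma text PRIMARY KEY);
INSERT INTO dma_filter VALUES ('862'), ('716'), ('532');  -- ... rest of the list

SELECT 'us_state' AS type, us_state AS type_id, product,
       COUNT(*) AS count, MIN(depletions) AS min, MAX(depletions) AS max,
       SUM(depletions) AS sum,
       MIN(start_date) AS start_date, MAX(end_date) AS end_date,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY depletions) AS p50
FROM venue_product_depletions_locations v
JOIN dma_filter f ON f.dma = v.dma          -- replaces the large IN list
WHERE v.account_id = 6352
  AND v.us_state IS NOT NULL
GROUP BY product, us_state;
```

The PRIMARY KEY on the temp table gives the planner an index to join against instead of re-checking a 130-element array per row.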
When you don’t use the percentile functions, it probably uses a HashAggregate rather than a GroupAggregate. But HashAggregate is not an option with the percentile functions, as they need to see all the within-group data at the same time.
You can avoid the sort by making an index which can provide the data already in order, such as (account_id, product, us_state). But that is likely to need to jump all around the table to fetch the rows in index order, so to really get a big improvement you might need to add the rest of the columns used by this query to the end of the index so you can get an index-only scan.
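A sketch of such a covering index, assuming the column names from the query (the index name is illustrative; INCLUDE requires PostgreSQL 11 or later):

```sql
-- The key prefix (account_id, product, us_state) matches the filter and GROUP BY,
-- so rows arrive pre-sorted for GroupAggregate and the external sort disappears.
-- The INCLUDE columns let every referenced column be read from the index,
-- enabling an index-only scan instead of random heap fetches.
CREATE INDEX idx_acct_product_state_covering
    ON venue_product_depletions_locations (account_id, product, us_state)
    INCLUDE (dma, depletions, start_date, end_date);
```

On older PostgreSQL versions the same effect can be approximated by appending the extra columns to the index key itself; note that index-only scans also depend on the visibility map being reasonably up to date (run VACUUM after bulk loads).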