
Describing the scenario

I have a table called event defined in PostgreSQL that, besides some irrelevant ones, has 5 relevant columns:

  • id (integer): primary key of the table
  • date (timestamp): date when the event occurred
  • fk_type (integer): foreign key to the type of the event (e.g. Sports, Charity, etc.)
  • fk_city (integer): foreign key to the city of the event
  • status (varchar): status of the event (e.g. Canceled, Closed, Confirmed, etc.)

There are 4 indexes defined for this table:

  • event_pkey, for column id
  • ix_event_fk_type, for column fk_type
  • ix_event_fk_city, for column fk_city
  • ix_event_date, for column date
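
For reference, these indexes presumably correspond to DDL along these lines (index and column names taken from the list above; the exact statements used are an assumption):

```sql
-- event_pkey is created implicitly by the PRIMARY KEY constraint on id
CREATE INDEX ix_event_fk_type ON event (fk_type);
CREATE INDEX ix_event_fk_city ON event (fk_city);
CREATE INDEX ix_event_date    ON event (date);
```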

I have a query that joins each event e1 to the event e2 that happened immediately after it, with the same type and city as e1. Given a certain fk_type (e.g. 0), the goal is to compute the difference between the dates of e2 and e1 and determine whether it took too long (30 days or more) for e2 to happen. If there is no e2 after e1, e1 is paired with null (hence the left join), and in that case "took too long" defaults to true.

select 
    e1.id as e1_id, 
    e1.date as e1_date, 
    min(e2.date) as next_event_date,
    min(e2.date) - e1.date as date_diff,
    case 
        when min(e2.date) - e1.date is not null then min(e2.date) - e1.date >= interval '30 days'
        else true
    end took_too_long
from event e1
left join event e2 on e2.date > e1.date 
    and e2.fk_type = e1.fk_type
    and e2.fk_city = e1.fk_city 
where e1.fk_type = 0
group by e1.id, e1.date

This query executes successfully, with the following execution plan:

Finalize GroupAggregate  (cost=1000.72..129436.66 rows=78930 width=37)
  Group Key: e1.id
  ->  Gather Merge  (cost=1000.72..126871.43 rows=78930 width=20)
        Workers Planned: 1
        ->  Partial GroupAggregate  (cost=0.71..116991.80 rows=78930 width=20)
              Group Key: e1.id
              ->  Nested Loop Left Join  (cost=0.71..29119.91 rows=17416518 width=20)
                    ->  Parallel Index Scan using ix_event_id on event e1  (cost=0.42..7456.53 rows=46429 width=20)
                          Filter: (fk_type = 0)
                    ->  Index Scan using ix_event_fk_city on event e2  (cost=0.29..0.46 rows=1 width=16)
                          Index Cond: (fk_city = e1.fk_city)
                          Filter: ((fk_type = 0) AND (date > e1.date) AND (fk_type = e1.fk_type))

The problem

However, if I add further filters, say I don't want to consider events with status "Canceled" or "Closed", and I only want events that happened in 2022 (event.date >= '2022-01-01' and event.date < '2023-01-01'), the query breaks down and takes forever to execute.

select 
    e1.id as e1_id, 
    e1.date as e1_date, 
    min(e2.date) as next_event_date,
    min(e2.date) - e1.date as date_diff,
    case 
        when min(e2.date) - e1.date is not null then min(e2.date) - e1.date >= interval '30 days'
        else true
    end took_too_long
from event e1
left join event e2 on e2.date > e1.date 
    and e2.fk_type = e1.fk_type
    and e2.fk_city = e1.fk_city
    and (e2.status not in ('CANCELED', 'CLOSED') or e2.status is null)
    and e2.date >= '2022-01-01 00:00:00'
where
    (e1.status not in ('CANCELED', 'CLOSED') or e1.status is null) 
    and e1.date >= '2022-01-01 00:00:00' and e1.date < '2023-01-01 00:00:00' 
    and e1.fk_type = 0
group by e1.id, e1.date

The execution plan is quite different as well. Instead of a Partial GroupAggregate, the optimizer chose a Partial HashAggregate.

Finalize GroupAggregate  (cost=48403.48..53704.04 rows=35936 width=37)
  Group Key: e1.id
  ->  Gather Merge  (cost=48403.48..52536.12 rows=35936 width=20)
        Workers Planned: 1
        ->  Sort  (cost=47403.47..47493.31 rows=35936 width=20)
              Sort Key: e1.id
              ->  Partial HashAggregate  (cost=44324.98..44684.34 rows=35936 width=20)
                    Group Key: e1.id
                    ->  Nested Loop Left Join  (cost=0.29..15104.27 rows=5844143 width=20)
                          ->  Parallel Seq Scan on event e1  (cost=0.00..2670.71 rows=21139 width=20)
                                Filter: ((((status)::text <> ALL ('{CANCELED,CLOSED}'::text[])) OR (status IS NULL)) AND (date >= '2022-01-01 00:00:00'::timestamp without time zone) AND (date < '2023-01-01 00:00:00'::timestamp without time zone) AND (fk_type = 0))
                          ->  Index Scan using ix_event_fk_city on event e2  (cost=0.29..0.58 rows=1 width=16)
                                Index Cond: (fk_city = e1.fk_city)
                                Filter: ((((status)::text <> ALL ('{CANCELED,CLOSED}'::text[])) OR (status IS NULL)) AND (date >= '2022-01-01 00:00:00'::timestamp without time zone) AND (fk_type = 0) AND (date > e1.date) AND (fk_type = e1.fk_type))

The choice of HashAggregate seems to be the problem: if I manually disable it with SET enable_hashagg = off;, the query executes normally without issues.

Why is this happening? How can I get Postgres to use GroupAggregate instead of HashAggregate so my query works properly, without having to manually disable HashAggregate? Is there some optimization I am missing that could improve the query?

Important things to know that I noted

If I remove some of the filters I set in the query, everything goes back to normal. For example, if I remove one of the statuses from the not in ('CANCELED', 'CLOSED') filter (say 'CLOSED'), the execution succeeds.

select 
    e1.id as e1_id, 
    e1.date as e1_date, 
    min(e2.date) as next_event_date,
    min(e2.date) - e1.date as date_diff,
    case 
        when min(e2.date) - e1.date is not null then min(e2.date) - e1.date >= interval '30 days'
        else true
    end took_too_long
from event e1
left join event e2 on e2.date > e1.date 
    and e2.fk_type = e1.fk_type
    and e2.fk_city = e1.fk_city
    and (e2.status not in ('CANCELED') or e2.status is null)
    and e2.date >= '2022-01-01 00:00:00'
where
    (e1.status not in ('CANCELED') or e1.status is null) 
    and e1.date >= '2022-01-01 00:00:00' and e1.date < '2023-01-01 00:00:00' 
    and e1.fk_type = 0
group by e1.id, e1.date
Finalize GroupAggregate  (cost=1000.72..69924.89 rows=42348 width=37)
  Group Key: e1.id
  ->  Gather Merge  (cost=1000.72..68548.58 rows=42348 width=20)
        Workers Planned: 1
        ->  Partial GroupAggregate  (cost=0.71..62784.42 rows=42348 width=20)
              Group Key: e1.id
              ->  Nested Loop Left Join  (cost=0.71..21782.65 rows=8115659 width=20)
                    ->  Parallel Index Scan using ix_event_id on event e1  (cost=0.42..7868.54 rows=24911 width=20)
                          Filter: ((((status)::text <> 'CANCELED'::text) OR (status IS NULL)) AND (date >= '2022-01-01 00:00:00'::timestamp without time zone) AND (date < '2023-01-01 00:00:00'::timestamp without time zone) AND (fk_type = 0))
                    ->  Index Scan using ix_event_fk_city on event e2  (cost=0.29..0.55 rows=1 width=16)
                          Index Cond: (fk_city = e1.fk_city)
                          Filter: ((((status)::text <> 'CANCELED'::text) OR (status IS NULL)) AND (date >= '2022-01-01 00:00:00'::timestamp without time zone) AND (fk_type = 0) AND (date > e1.date) AND (fk_type = e1.fk_type))

The same behavior happens if I remove:

  1. The whole status filtering;
  2. The whole date filtering (both the "from date" and the "to date");
  3. Only the "to date" piece from the date filtering;
  4. The fk_type filtering.

For my needs, this query has to keep working even if I specify 20 status options to filter, for instance.

Additional info

  1. event table has exactly 93453 rows/occurrences;
  2. PostgreSQL version: PostgreSQL 13.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-12), 64-bit


Answers


  1. As you already found, SET enable_hashagg = off; works around the problem; here is a likely reason why.

    The reason the query slows to a crawl when you add the status filtering is probably not hash aggregation per se, but the row estimates feeding it: in your second plan, the Nested Loop Left Join is estimated to produce about 5.8 million rows from a table of roughly 93k rows.

    A Partial HashAggregate has to build an in-memory hash table keyed on the GROUP BY columns while it consumes that entire join output. If the hash table outgrows work_mem, which with millions of joined rows it easily can, PostgreSQL 13 spills it to disk, and the query can become dramatically slower.

    The GroupAggregate operator, on the other hand, reads its input already sorted on the GROUP BY columns and aggregates consecutive rows with equal keys in a single pass. It never builds a hash table, so it degrades much more gracefully when the join produces far more rows than estimated.

    To force PostgreSQL to use the GroupAggregate operator, you can set the enable_hashagg parameter to off. This tells PostgreSQL to never use the HashAggregate operator, even in cases where it would be more efficient.
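
    If you go this route, it may be safer to scope the setting to a single transaction rather than the whole session (standard PostgreSQL syntax; substitute your own query for the placeholder comment):

    ```sql
    -- Disable hash aggregation for this transaction only;
    -- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
    BEGIN;
    SET LOCAL enable_hashagg = off;
    -- run the slow query here, e.g. under EXPLAIN (ANALYZE, BUFFERS)
    COMMIT;
    ```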

    A few other points worth considering:

    • Use the DISTINCT keyword to remove duplicate rows from the results.
    • Try raising the work_mem configuration parameter so sorting and aggregation have more memory to work with, which can keep the hash table from spilling to disk:

      SET work_mem = '64MB'; -- adjust to the memory you have available
    
  2. Use a window function and avoid the self-join entirely.

    Start with this, and your WHERE criteria should have little to no impact on the runtime of the wide-open query:

    select id, date, fk_type, fk_city,
           lead(date) over w as next_event_date,
           lead(date) over w - date as date_diff,
           coalesce(
             date + interval '30 days' <= lead(date) over w, 
             true
           ) as took_too_long
      from event
    window w as (partition by fk_type, fk_city order by date);
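
    To layer the original filters on top (a sketch using the table and filters from the question, not tested against the real schema): filters that decide which rows count as a possible "next event" belong in the inner WHERE, before the window is computed, while filters that only restrict which rows get reported belong in the outer query, after it:

    ```sql
    select *
      from (
        select id, date, fk_type, fk_city,
               lead(date) over w as next_event_date,
               lead(date) over w - date as date_diff,
               coalesce(date + interval '30 days' <= lead(date) over w,
                        true) as took_too_long
          from event
         where (status not in ('CANCELED', 'CLOSED') or status is null)
           and date >= '2022-01-01 00:00:00'
        window w as (partition by fk_type, fk_city order by date)
      ) t
     where fk_type = 0
       and date < '2023-01-01 00:00:00';
    ```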
    