skip to Main Content

This is the table schema:

CREATE TABLE IF NOT EXISTS events_table
(
      id                    UUID NOT NULL,
      company_id            INT8 NOT NULL,
      employee_id           INT8 NOT NULL,
      category_type_id      UUID, 
      event_type            VARCHAR(255) NOT NULL, 
      data_payload          bytea NOT NULL, 
      occurred_at     TIMESTAMP WITH TIME ZONE NOT NULL,
      stored_at       TIMESTAMP WITH TIME ZONE NOT NULL, 

      PRIMARY KEY (id)
);

We have the following index:

CREATE INDEX events_idx ON events_table (company_id, employee_id, category_type_id, occurred_at, stored_at);

And this is the query:

SELECT 
    table_name.id,
    table_name.company_id,
    table_name.employee_id,
    table_name.category_type_id,
    table_name.type,
    table_name.payload,
    table_name.occurred_at,
    table_name.stored_at
FROM 
    schema_name.table_name
WHERE 
    (table_name.company_id = ? 
    AND table_name.employee_id IN (?) 
    AND table_name.category_type_id = cast ( ? )
    AND table_name.occurred_at <= CAST(? AS TIMESTAMP WITH TIME ZONE))

ORDER BY 
    table_name.occurred_at ASC, 
    table_name.stored_at ASC;

Why is the DB sorting in the first place:

 Gather Merge  (cost=116700.30..129758.08 rows=111916 width=122) (actual time=357.017..512.239 rows=295813 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=115700.28..115840.17 rows=55958 width=122) (actual time=345.202..364.049 rows=98604 loops=3)
         Sort Key: occurred_at, stored_at
         Sort Method: external merge  Disk: 13760kB
         Worker 0:  Sort Method: external merge  Disk: 6280kB
         Worker 1:  Sort Method: external merge  Disk: 6120kB
         ->  Parallel Bitmap Heap Scan on balance_state_event  (cost=6311.95..107650.92 rows=55958 width=122) (actual time=51.027..280.405 rows=98604 loops=3)
               Recheck Cond: ((company_id = 100) AND (category_type_id = '0baee077-c735-4f95-bd26-0c6b4b1eddd4'::uuid) AND (employee_id = ANY ('{17,1,19,2,18,11,15,16,20,861,8,12,4,3,5,13,7,6,14,9,10,1021,1022,1023,1024,1025,1026
,1027,1028,1029,1030,1031,1032,1033,1034,1035,1036,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1049,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1063,1064,1065,1066,1067,1068,1069,1070,1071,10
72,1073,1074,1075,1076,1077,1078,1079,1080,1081,1082,1083,1084,1085,1086,1087,1088,1089,1090,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100}'::bigint[])) AND (occurred_at <= '2024-05-03 00:00:00+00'::timestamp with time zone)
)
               Rows Removed by Index Recheck: 64790
               Heap Blocks: exact=21346 lossy=17655
               ->  Bitmap Index Scan on events_idx  (cost=0.00..6278.13 rows=134298 width=0) (actual time=49.032..49.032 rows=295813 loops=1)
                     Index Cond: ((company_id = 100) AND (absence_type_id = '0baee077-c735-4f95-bd26-0c6b4b1eddd4'::uuid) AND (employee_id = ANY ('{17,1,19,2,18,11,15,16,20,861,8,12,4,3,5,13,7,6,14,9,10,1021,1022,1023,1024,1025,
1026,1027,1028,1029,1030,1031,1032,1033,1034,1035,1036,1037,1038,1039,1040,1041,1042,1043,1044,1045,1046,1047,1048,1049,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1063,1064,1065,1066,1067,1068,1069,1070,107
1,1072,1073,1074,1075,1076,1077,1078,1079,1080,1081,1082,1083,1084,1085,1086,1087,1088,1089,1090,1091,1092,1093,1094,1095,1096,1097,1098,1099,1100}'::bigint[])) AND (occurred_at <= '2024-05-03 00:00:00+00'::timestamp with time z
one))
 Planning Time: 0.445 ms
 Execution Time: 528.740 ms
(16 rows)

Parallel Bitmap Heap Scan on balance_state_event first?

should not sort, since the index is maintaining the sort order.

2

Answers


  1. The reason an explicit sort is needed is because the index scan does not necessarily return the rows in sorted order. The cause is the IN condition.

    If you have a three-column index on (a, b, c), the following query does not need an explicit sort, because the index scan returns the rows in sorted order:

    SELECT ... WHERE a = 42 AND b = 'something' ORDER BY c
    

    But this query will need an explicit sort:

    SELECT ... WHERE a 0 42 AND b IN ('something', 'other') ORDER BY c
    
    Login or Signup to reply.
  2. The rows would be returned by the index only piecewise ordered, that is, ordered separately for each chunk derived from the same employee_id. This is not the same thing as a total ordering your query specified.

    PostgreSQL could get each sorted piece and then interleave them together with something like a "Merge Append", but no one has bothered to implement this. It isn’t clear how this would even be implemented as doing the merge within the index scan node would be weird, while getting the boundaries up into some separate merge node would be tricky.

    It is hard to believe this matters, as the sort takes up a minority of the total time and that total time is pretty fast to start with given the number of rows being returned. Getting rid of the sort might even make it slower, as then you would not get the IO benefit that you might be getting from sequential-read-like property of the bitmap and the parallelization might also be more difficult.

    You could get a literal Merge Append mode by writing the query as a UNION ALL of a bunch of subqueries, once for each value in the IN-list. You would need to specify the ORDER BY on each subquery as well as on the entire UNION ALL in order to get the desired behavior. This would be rather awkward to write, and while the performance improvement can be spectacular in certain situations it is hard to see how your query is one of those situations.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search