Postgresql - DISTINCT ON needing Bitmap Heap Scan

JiewMeng
November 10, 2024
60 views
1 vote
2 Answers

I have a query like:

SELECT 
  DISTINCT ON (services.group_id) 
  services.service_id,
  ...
FROM services
WHERE services.group_id IN ('67240181b97477f054f9b1bc', '67240181b97477f054f9b1be')

I have an index on group_id.

Why does Postgres need to do an extra heap scan? I just want any 1st match with a matching group_id. Shouldn’t it already have the correct rows from the Bitmap Index Scan which checks for matching group_id?

Unique  (cost=609.73..611.66 rows=50 width=240) (actual time=0.353..0.385 rows=2 loops=1)
  ->  Sort  (cost=609.73..610.70 rows=386 width=240) (actual time=0.353..0.364 rows=386 loops=1)
        Sort Key: group_id
        Sort Method: quicksort  Memory: 133kB
        ->  Bitmap Heap Scan on services  (cost=11.56..593.15 rows=386 width=240) (actual time=0.030..0.180 rows=386 loops=1)
              Recheck Cond: ((group_id)::text = ANY ('{67240181b97477f054f9b1bc,67240181b97477f054f9b1be}'::text[]))
              Heap Blocks: exact=39
              ->  Bitmap Index Scan on ix_services_group_id  (cost=0.00..11.46 rows=386 width=0) (actual time=0.019..0.020 rows=386 loops=1)
                    Index Cond: ((group_id)::text = ANY ('{67240181b97477f054f9b1bc,67240181b97477f054f9b1be}'::text[]))
Planning Time: 0.505 ms
Execution Time: 0.434 ms

Answers

- nbk
- November 8, 2024 at 11:04 pm
- 0 votes
0
the IN clause works like a OR and that is the cuase for the heap

Use
```
SELECT 
  DISTINCT ON (services.group_id) 
  services.service_id,
  ...
FROM services
WHERE services.group_id = '67240181b97477f054f9b1bc'
UNION 
SELECT 
  DISTINCT ON (services.group_id) 
  services.service_id,
  ...
FROM services
WHERE services.group_id = '67240181b97477f054f9b1be'
```
Login or Signup to reply.

- ErwinBrandstetter
- November 10, 2024 at 3:46 am
- 0 votes
0
The Bitmap Heap Scan is just an integral step of a bitmap index scan. After building a bitmap for (estimated many) qualifying tuples, Postgres accesses each harboring data page once.

But your query is inefficient to begin with. For more than a few hits per group_id (avg. 193 in the example) this should be much faster:
```
SELECT s.*
FROM   unnest('{67240181b97477f054f9b1bc, 67240181b97477f054f9b1be}'::text[]) i(gid)
CROSS  JOIN LATERAL (  
   SELECT s.group_id, s.service_id  -- more?
   FROM   services s
   WHERE  s.group_id = i.gid
-- ORDER  BY ???  -- typically you want to pick a deterministic row - matching an index ...
   LIMIT  1
   ) s;
```
Assuming group_id is type text. Else adapt the input accordingly.

With this query, Postgres will fetch the first live row from the index. Unless you can make an index-only scan work with a covering index (seems unlikely), a plain index scan (not a bitmap index scan like in your query) still goes to the heap to satisfy the SELECT list – but only once per input value (that finds any match). Ignoring possible HOT chains to keep it simple)

Besides the more expensive bitmap index scan, this also avoids removing all the duplicates fetched in vain.

Your index ix_services_group_id is good for it. If you add ORDER BY criteria, ideally have a matching multicolumn index.
See:
- Optimize GROUP BY query to retrieve latest row per user
- Select first row in each GROUP BY group?
Note a subtle difference: This query (unlike your original) does not fold duplicates in the input array (input list in your case). Make sure not to pass duplicates or remove those before the LATERAL subquery.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – DISTINCT ON needing Bitmap Heap Scan

Answers