Postgresql - Fastest way to find SQL intersection between 2 subsections of a huge dataset?

LittleBobbyTables
December 5, 2024
102 views
0 votes
2 Answers

I have a gigantic indexed (104 million row) table of docs, pages and words:

CREATE TABLE my_table (
    doc_id integer,
    page_id integer,
    word_id integer
);

CREATE INDEX my_table_word ON my_table (word_id);
CREATE INDEX my_table_doc ON my_table (doc_id);
CREATE INDEX my_table_page my_table (page_id);

I want to find pages within the same documents that have both word A and word B. My current attempts are as follows:

Attempt 1 – aggregate things:

SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2 
having count(distinct word_id) = 2

-- ~39k row result, took 20 seconds

Attempt 2) with CTEs, marginally faster

with foo as (
    select doc_id, page_id
    from my_table
    where word_id = 123 -- foo -- 44k rows
),

bar as (
    select doc_id, page_id
    from my_table
    where word_id = 456 -- bar -- 439k rows
)

select f.doc_id, f.page_id
from foo f
inner join bar b on f.doc_id = b.doc_id and f.page_id = b.page_id

-- same results, takes 15 seconds

Attempt 3) – doing an INTERSECT between the two CTEs is exactly the same 15 seconds, probably same query plan.

Is there a faster way to do this? I’m hoping to get this down to < 1 second for a web app with somewhat impatient users.

Answers

- Zegarek
- December 5, 2024 at 6:34 pm
- 0 votes
0
A basic intersect seems to perform fine:
_{demo at db<>fiddle}
```
select doc_id, page_id
from my_table
where word_id = 456
INTERSECT
select doc_id, page_id
from my_table
where word_id = 123;
```
In my test on 700k docs, 100 pages each, using 1k words, it runs about as fast as your second example, while being simpler and duplicate-free because it defaults to INTERSECT DISTINCT.

You can also use an exists:
```
explain analyze verbose
select distinct doc_id, page_id
from my_table as a
where word_id = 456
  and exists(select from my_table as b 
             where a.doc_id=b.doc_id
               and a.page_id=b.page_id
               and word_id = 123);
```
And that runs about just as fast.

Still, you were right to try and find everything you needed in a single pass, a single query, using aggregation and having:
```
select doc_id, page_id
from my_table
group by doc_id, page_id
having bool_or(word_id=456) 
   and bool_or(word_id=123);
```
That’s 8x faster on my setup. The two bool_or()‘s check if any of the words on the page is your first word, and the same for the other one.
Login or Signup to reply.

- Bohemian
- December 5, 2024 at 6:37 pm
- 0 votes
0
Try a self join:
```
SELECT a.doc_id, a.page_id
FROM my_table a
JOIN my_table b ON b.page_id = a.page_id
  AND b.word_id = 456
WHERE a.word_id = 123
```
With an index on (word_id, page_id)
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Fastest way to find SQL intersection between 2 subsections of a huge dataset?

Answers