I have a gigantic indexed (104 million row) table of docs, pages and words:
CREATE TABLE my_table (
doc_id integer,
page_id integer,
word_id integer
);
CREATE INDEX my_table_word ON my_table (word_id);
CREATE INDEX my_table_doc ON my_table (doc_id);
CREATE INDEX my_table_page my_table (page_id);
I want to find pages within the same documents that have both word A and word B. My current attempts are as follows:
Attempt 1 – aggregate things:
SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2
having count(distinct word_id) = 2
-- ~39k row result, took 20 seconds
Attempt 2) with CTEs, marginally faster
with foo as (
select doc_id, page_id
from my_table
where word_id = 123 -- foo -- 44k rows
),
bar as (
select doc_id, page_id
from my_table
where word_id = 456 -- bar -- 439k rows
)
select f.doc_id, f.page_id
from foo f
inner join bar b on f.doc_id = b.doc_id and f.page_id = b.page_id
-- same results, takes 15 seconds
Attempt 3) – doing an INTERSECT
between the two CTEs is exactly the same 15 seconds, probably same query plan.
Is there a faster way to do this? I’m hoping to get this down to < 1 second for a web app with somewhat impatient users.
2
Answers
A basic
intersect
seems to perform fine:demo at db<>fiddle
In my test on 700k docs, 100 pages each, using 1k words, it runs about as fast as your second example, while being simpler and duplicate-free because it defaults to
INTERSECT DISTINCT
.You can also use an
exists
:And that runs about just as fast.
Still, you were right to try and find everything you needed in a single pass, a single query, using aggregation and
having
:That’s 8x faster on my setup. The two
bool_or()
‘s check if any of the words on the page is your first word, and the same for the other one.Try a self join:
With an index on
(word_id, page_id)