skip to Main Content

I have a gigantic indexed (104 million row) table of docs, pages and words:

CREATE TABLE my_table (
    doc_id integer,
    page_id integer,
    word_id integer
);

CREATE INDEX my_table_word ON my_table (word_id);
CREATE INDEX my_table_doc ON my_table (doc_id);
CREATE INDEX my_table_page my_table (page_id);

I want to find pages within the same documents that have both word A and word B. My current attempts are as follows:

Attempt 1 – aggregate things:

SELECT doc_id, page_id
FROM my_table
WHERE word_id in (123, 456)
group by 1,2 
having count(distinct word_id) = 2

-- ~39k row result, took 20 seconds

Attempt 2) with CTEs, marginally faster

with foo as (
    select doc_id, page_id
    from my_table
    where word_id = 123 -- foo -- 44k rows
),

bar as (
    select doc_id, page_id
    from my_table
    where word_id = 456 -- bar -- 439k rows
)

select f.doc_id, f.page_id
from foo f
inner join bar b on f.doc_id = b.doc_id and f.page_id = b.page_id

-- same results, takes 15 seconds

Attempt 3) – doing an INTERSECT between the two CTEs is exactly the same 15 seconds, probably same query plan.

Is there a faster way to do this? I’m hoping to get this down to < 1 second for a web app with somewhat impatient users.

2

Answers


  1. A basic intersect seems to perform fine:
    demo at db<>fiddle

    select doc_id, page_id
    from my_table
    where word_id = 456
    INTERSECT
    select doc_id, page_id
    from my_table
    where word_id = 123;
    

    In my test on 700k docs, 100 pages each, using 1k words, it runs about as fast as your second example, while being simpler and duplicate-free because it defaults to INTERSECT DISTINCT.

    You can also use an exists:

    explain analyze verbose
    select distinct doc_id, page_id
    from my_table as a
    where word_id = 456
      and exists(select from my_table as b 
                 where a.doc_id=b.doc_id
                   and a.page_id=b.page_id
                   and word_id = 123);
    

    And that runs about just as fast.

    Still, you were right to try and find everything you needed in a single pass, a single query, using aggregation and having:

    select doc_id, page_id
    from my_table
    group by doc_id, page_id
    having bool_or(word_id=456) 
       and bool_or(word_id=123);
    

    That’s 8x faster on my setup. The two bool_or()‘s check if any of the words on the page is your first word, and the same for the other one.

    Login or Signup to reply.
  2. Try a self join:

    SELECT a.doc_id, a.page_id
    FROM my_table a
    JOIN my_table b ON b.page_id = a.page_id
      AND b.word_id = 456
    WHERE a.word_id = 123
    

    With an index on (word_id, page_id)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search