
I have a table with a text column that holds millions of text documents of varying length (from a couple of KB to a few MB). There are many duplicates.

I would like to create a table that stores each distinct text once, keyed by its md5 hash. The table looks like this:

create table txt_document
(
    id serial primary key,
    doc_hash uuid not null unique,
    doc_text text
)
;

create index idx_txt_document__hash on txt_document(doc_hash); -- note: redundant, the unique constraint already creates an index on doc_hash

To copy all documents into the table I could just do:

with txt_data as (
    select html_text
    from many_texts
    where
        html_text is not null
)
insert into txt_document(doc_hash, doc_text)
select md5(html_text)::uuid
    , html_text
from txt_data
on conflict do nothing
;

The execution plan looks like this:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 Insert on txt_document  (cost=0.00..2829451.85 rows=0 width=0)
   Conflict Resolution: NOTHING
   ->  Seq Scan on many_texts  (cost=0.00..2829451.85 rows=10438262 width=52)
         Filter: (html_text IS NOT NULL)
(4 rows)

But this query seems to run forever without ever taxing the CPU at all.

Are there any strategies to make such an operation faster?

My Postgres version is 14.x.

2 Answers


  1. My idea would be to split the operation: first write a Python script
    that handles the duplicates, then create the table with the "clean"
    data in a second step (the same split can also be sketched in SQL, as
    shown below).
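
    A sketch of that split done entirely in SQL instead, staging the
    deduplicated rows first (the staging table name clean_texts is made up
    for illustration):

       -- stage one row per hash; which duplicate survives is arbitrary
       create table clean_texts as
         select distinct on (doc_hash) doc_hash, html_text
         from (
           select md5(html_text) as doc_hash, html_text
           from many_texts
           where html_text is not null
         ) t
         order by doc_hash;

       -- then load the deduplicated rows into the destination table
       insert into txt_document(doc_hash, doc_text)
         select doc_hash::uuid, html_text
         from clean_texts;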

  2. To achieve this, you have to compute the md5 hash for each record in the source table anyway. Here is what I would consider:

    1. add a GENERATED md5 hash column to the source table (note that
       adding a STORED generated column rewrites the whole table once):

       alter table many_texts
         add column html_text_md5 text
           generated always as (md5(html_text)) stored;
      
    2. (optional) create an index on the md5 hash (it may let the planner avoid a sort in step 3):

       create index on many_texts (html_text_md5) where html_text_md5 is not null;
      
    3. Use that column to insert only unique rows into the destination table:

       insert into txt_document(doc_hash, doc_text)
         select distinct on (html_text_md5)
           html_text_md5::uuid, html_text
         from many_texts
         where html_text_md5 is not null
         order by html_text_md5;
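
    If txt_document already contains rows from an earlier partial load, the
    same insert can keep the conflict guard from the question (doc_hash is
    the unique column):

       insert into txt_document(doc_hash, doc_text)
         select distinct on (html_text_md5)
           html_text_md5::uuid, html_text
         from many_texts
         where html_text_md5 is not null
         order by html_text_md5
       on conflict (doc_hash) do nothing;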
      
    4. optionally, instead of (3), you can try to speed up the insert by
       running 16 copies of it in parallel sessions, one per leading hex
       digit of the hash ('0', ..., '9', 'a', ..., 'f'). The slice for
       digit '0' looks like this:

       insert into txt_document(doc_hash, doc_text)
         select distinct on (html_text_md5)
           html_text_md5::uuid, html_text
         from many_texts
         where html_text_md5 LIKE '0%'  -- 0,...,9,a,...,f
         order by html_text_md5;
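
    A sketch for generating all 16 slice statements (each must run in its
    own connection; a single session would only execute them sequentially).
    Because the prefixes partition the hash space, the parallel inserts
    cannot conflict with each other:

       -- emits one INSERT statement per leading hex digit
       select format(
           $sql$
           insert into txt_document(doc_hash, doc_text)
             select distinct on (html_text_md5)
               html_text_md5::uuid, html_text
             from many_texts
             where html_text_md5 like %L
             order by html_text_md5;
           $sql$, d.digit || '%') as stmt
       from regexp_split_to_table('0123456789abcdef', '') as d(digit);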
      