I have a table with a text column that holds millions of text documents of various lengths (from a couple of KB to a few MB). There are many duplicates.
I would like to create a table with a unique md5 hash and the text. The table looks like this:
create table txt_document
(
    id serial primary key,
    doc_hash uuid not null unique,
    doc_text text
);
create index idx_txt_document__hash on txt_document(doc_hash);
To copy all documents into the table I could just do:
with txt_data as (
    select html_text
    from many_texts
    where html_text is not null
)
insert into txt_document (doc_hash, doc_text)
select md5(html_text)::uuid
     , html_text
from txt_data
on conflict do nothing;
The execution plan looks like this:
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Insert on txt_document  (cost=0.00..2829451.85 rows=0 width=0)
   Conflict Resolution: NOTHING
   ->  Seq Scan on many_texts  (cost=0.00..2829451.85 rows=10438262 width=52)
         Filter: (html_text IS NOT NULL)
(4 rows)
But this query seems to run forever without ever taxing the CPU at all.
Are there any strategies to make such an operation faster?
My Postgres version is 14.x
2 Answers
My idea would be to split the operation into two steps.
That means you first write a script (for example in Python) that handles the duplicates, and then in a second script you create the table with the "clean" data.
To achieve this, you have to compute an md5 hash for each record in the source table anyway. Here is what I would consider:
1. Add a GENERATED md5 hash column to the source table:
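A minimal sketch, assuming the source table is many_texts as in the question; the column name html_md5 is just a placeholder. Postgres 14 only supports STORED generated columns, so this rewrites the table once:

alter table many_texts
    add column html_md5 uuid generated always as (md5(html_text)::uuid) stored;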
2. (Optional) Create an index on the md5 hash:
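For example (the index name is illustrative; CONCURRENTLY avoids blocking writes on the source table, but cannot run inside a transaction block):

create index concurrently if not exists idx_many_texts__html_md5
    on many_texts (html_md5);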
3. Use that column to insert only unique rows into the destination table:
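Something along these lines, reusing the generated html_md5 column: DISTINCT ON keeps one row per hash (any row will do, since the hash is computed from the text), and ON CONFLICT stays in place as a safety net:

insert into txt_document (doc_hash, doc_text)
select distinct on (html_md5)
       html_md5
     , html_text
from many_texts
where html_text is not null
on conflict do nothing;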
4. Optionally, instead of (3), you can try to speed up the insert by using 16 parallel workers:
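Plain INSERT ... SELECT cannot use a parallel plan in Postgres 14, so one way to read this (an assumption on my part, not necessarily what was intended) is to partition the work by the first hex digit of the hash and run 16 such inserts concurrently, each from its own session:

-- one slice per session; run this for each digit '0' .. 'f'
insert into txt_document (doc_hash, doc_text)
select distinct on (html_md5)
       html_md5
     , html_text
from many_texts
where html_text is not null
  and left(html_md5::text, 1) = '0'   -- replace '0' with this slice's digit
on conflict do nothing;

Because each hash lands in exactly one slice, the slices never duplicate each other, and ON CONFLICT DO NOTHING makes any individual slice safe to re-run if its session fails.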