skip to Main Content

I am using a Postgres table to store HTML data. One of the columns is of type varchar, and this is where the HTML data is stored. My understanding is that this column datatype has no maximum length.

I tried to create a unique index on this column to prevent duplicate HTML entries from being added.

This table is referenced by another table. Duplicate HTML entries are not allowed. This is a form of data compression using data (or table) normalization.

If Table A references Table B, and Table B stores de-duplicated HTML data, then less storage space is required to store the complete dataset since rather than storing duplicated HTML entries in Table A, these entries are normalized and stored in a separate table, Table B.

When I tried to create a unique index, I get an error:

CREATE UNIQUE INDEX html_source_div_html_source_div_idx 
ON rightmove.html_source_div (html_source_div)

SQL Error [54000]: ERROR: index row size 4536 exceeds btree version 4 maximum 2704 for index "html_source_div_html_source_div_idx"
Detail: Index row references tuple (0,46) in relation "html_source_div"
Hint: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.

I want to add both an index for fast lookup of existing data and a unique constraint to prevent duplicate entries from being inserted. Postgres can probably do both functions using a single index provided the right index type is used.

I do not want to md5hash the data, because there is a risk of a collision. If there is a collision the whole process will break.

  • Is there a type of index supported by Postgres which is designed to work on text based data of arbitrary length?
  • Could I use such an index to enforce the unique constraint as well as improve text-search performance for select queries?

2

Answers


  1. PostreSQL indexes indeed have a max size, you could try to mitigate hash collisions with a computed column, maybe try something like this:

    ALTER TABLE html_source_div ADD COLUMN html_hash bytea;
    UPDATE html_source_div SET html_hash = digest(html_source_div, 'sha256');
    

    and

    CREATE UNIQUE INDEX html_source_div_html_hash_idx ON html_source_div (html_hash);
    
    Login or Signup to reply.
  2. You can use a hash index. While they do not support unique constraints, they do support EXCLUDE constraints which fulfill much the same function.

    create table jj (x text);
    alter table jj add constraint lkj exclude using hash (x with =);
    

    It will automatically resolve hash collision (I think, I haven’t tested as I don’t know how to generate collisions at will)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search