Postgresql - What is the most suitable data type for a column in a table?

bylazy
July 4, 2024
238 views
0 votes
2 Answers

Currently I’m trying to develop a PostgreSQL database schema, it has only 2 tables, one of which contains usernames. The difficulty is that for certain reasons I cannot store the username directly, so I have to store the hash (SHA256).

Postgres has a bytea data type, which is an array of bytes, which is technically SHA256.

My question is whether there is a better type to store the hash in terms of speed of searching whether the username exists in the database or not.

Perhaps I should look towards NoSQL solutions, where such a search is faster?

—–Add—–

The answer and comments suggest that it is optimal to use bytea.

CREATE TABLE users (
    username bytea UNIQUE NOT NULL,
    somedata text NOT NULL
);

CREATE INDEX idx_username ON users USING HASH (username);

Added a hash index because I’m only interested in equality checks. Is the scheme ok?

Answers

- LaurenzAlbe
- July 4, 2024 at 12:49 pm
- 0 votes
0
It is probably a micro-optimization, but bytea would use significantly less space and compare faster.

If you end up using text (avoid character), make sure that you are using the C collation:
```
hash_value text COLLATE "C"
```
Other collations will result in much more expensive comparisons.
Login or Signup to reply.

- Zegarek
- July 4, 2024 at 2:38 pm
- 0 votes
0
Try it. As you suspected and as already confirmed in comments and the answer from @Laurenz Albe, bytea wins by being the most compact and fastest to look up.

Here’s how storing 400k hashes compares between these:

variant pg_total_relation_size pg_size_pretty

table1_bytea 48709632 46 MB

table2_text 75530240 72 MB

table3_text_collate_c 75522048 72 MB

And here’s how much time it takes to look one up:

variant avg

hashes_in_bytea 00:00:00.00003

hashes_in_text_collate_c 00:00:00.000035

hashes_in_text 00:00:00.000042

hashes_in_text_otf 00:00:00.00011

hash index because I’m only interested in equality checks

You can’t be using hash for unique indexes, so you’d need to maintain both that and the separate unique index that gets created to handle the UNIQUE constraint on your username column. I’d just stick with the default unique you already have.

Login or Signup to reply.