How to make big postgres db faster? - Postgresql

unbrokendub
December 23, 2022
276 views
0 votes
2 Answers

I have big Postgres database(around 75 GB) and queries are very slow. Is there any way to make them faster?

About database:

                                             List of relations
 Schema |       Name        |   Type   |  Owner   | Persistence | Access method |    Size    | Description
--------+-------------------+----------+----------+-------------+---------------+------------+-------------
 public | fingerprints      | table    | postgres | permanent   | heap          | 35 GB      |
 public | songs             | table    | postgres | permanent   | heap          | 26 MB      |
 public | songs_song_id_seq | sequence | postgres | permanent   |               | 8192 bytes |

d+ fingerprints
                                                    Table "public.fingerprints"
    Column     |            Type             | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 hash          | bytea                       |           | not null |         | extended |             |              |
 song_id       | integer                     |           | not null |         | plain    |             |              |
 offset        | integer                     |           | not null |         | plain    |             |              |
 date_created  | timestamp without time zone |           | not null | now()   | plain    |             |              |
 date_modified | timestamp without time zone |           | not null | now()   | plain    |             |              |
Indexes:
    "ix_fingerprints_hash" hash (hash)
    "uq_fingerprints" UNIQUE CONSTRAINT, btree (song_id, "offset", hash)
Foreign-key constraints:
    "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap

d+ songs
                                                                       Table "public.songs"
    Column     |            Type             | Collation | Nullable |                Default                 | Storage  | Compression | Stats target | Description
---------------+-----------------------------+-----------+----------+----------------------------------------+----------+-------------+--------------+-------------
 song_id       | integer                     |           | not null | nextval('songs_song_id_seq'::regclass) | plain    |             |              |
 song_name     | character varying(250)      |           | not null |                                        | extended |             |              |
 fingerprinted | smallint                    |           |          | 0                                      | plain    |             |              |
 file_sha1     | bytea                       |           |          |                                        | extended |             |              |
 total_hashes  | integer                     |           | not null | 0                                      | plain    |             |              |
 date_created  | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
 date_modified | timestamp without time zone |           | not null | now()                                  | plain    |             |              |
Indexes:
    "pk_songs_song_id" PRIMARY KEY, btree (song_id)
Referenced by:
    TABLE "fingerprints" CONSTRAINT "fk_fingerprints_song_id" FOREIGN KEY (song_id) REFERENCES songs(song_id) ON DELETE CASCADE
Access method: heap

DB Scheme
DB Amount

No need to write to database, only read. All queries are very simple:

SELECT song_id 
WHERE hash in fingerpints = X

 EXPLAIN(analyze, buffers, format text) SELECT "song_id", "offset" FROM "fingerprints" WHERE "hash" = decode('eeafdd7ce9130f9697','hex');
                                                                QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using ix_fingerprints_hash on fingerprints  (cost=0.00..288.28 rows=256 width=8) (actual time=0.553..234.257 rows=871 loops=1)
   Index Cond: (hash = 'xeeafdd7ce9130f9697'::bytea)
   Buffers: shared hit=118 read=749
 Planning Time: 0.225 ms
 Execution Time: 234.463 ms
(5 rows)

234 ms looks fine where it is one query. But in reality there 3000 query per time, that takes about 600 seconds. It is audio recognition application, so algoritm works like that.

About indexes:

CREATE INDEX  "ix_fingerprints_hash" ON "fingerprints" USING hash ("hash");

For pooler I use Odyssey.

Little bit of info from config:

shared_buffers = 4GB   
huge_pages = try
work_mem = 582kB
maintenance_work_mem = 2GB  
effective_io_concurrency = 200
max_worker_processes = 24
max_parallel_workers_per_gather = 12
max_parallel_maintenance_workers = 4
max_parallel_workers = 24
wal_buffers = 16MB
checkpoint_completion_target = 0.9
max_wal_size = 16GB
min_wal_size = 4GB
random_page_cost = 1.1
effective_cache_size = 12GB

Info about hardware:

Xeon 12 core (24 threads)
RAM DDR4 16 GB ECC
NVME disk

Will the database be accelerated by purchase more RAM to handle all DB inside (128 GB in example)? And what parameters should I change to say to Postgres to store db in ram?

I read about several topics about pg_tune, etc. but experiments don’t show any good results.

Answers

- ErwinBrandstetter
- December 23, 2022 at 11:32 pm
- 0 votes
0
No need to write to database, only read.

So the DB is read-only.
And in comments:

db worked fine on small amount of data(few GB), but after i filled out database started to slowdown.

So indexes have been built up incrementally.

Indexes

UNIQUE CONSTRAINT on (song_id, "offset", hash)

I would replace that with:
```
ALTER TABLE fingerprints
  DROP CONSTRAINT uq_fingerprints
, ADD CONSTRAINT uq_fingerprints UNIQUE(hash, song_id, "offset") WITH (FILLFACTOR = 100)
```
This enforces the same constraint, but the leading hash column in the underlying B-tree index now supports the filter on hash in your displayed query. And the fact that all needed columns are included in the index, further allows much faster index-only scans. The (smaller) index should also be more easily cached than the (bigger) table (plus index).

See:
- Is a composite index also good for queries on the first field?
Also rewrites the index in pristine condition, and with FILLFACTOR 100 for the read-only DB. (Instead of the default 90 for a B-tree index.)

Hash index on (hash) and CLUSTER

The name of the column "hash" has nothing to do with the name of the index type, which also happens to be "hash". (The column should probably not be named "hash" to begin with.)

If (and only if) you also have other queries centered around one of few hash values, that cannot use index-only scans (and you actually see faster queries than without) keep the hash index, additionally. And optimize it. (Else drop it!)
```
ALTER INDEX ix_fingerprints_hash SET (FILLFACTOR = 100);
```
An incrementally grown index may end up with bloat or unbalanced overflow pages in case of a hash index. REINDEX should take care of that. While being at it, increase FILLFACTER to 100 (from the default 75 for a hash index) for your read-only (!) DB. You can REINDEX to make the change effective.
```
REINDEX INDEX ix_fingerprints_hash;
```
Or you can CLUSTER (like jjanes already suggested) on the rearranged B-tree index from above:
```
CLUSTER fingerprints USING uq_fingerprints;
```
Rewrites the table and all indexes; rows are physically sorted according to the given index, so "clustered" around the leading column(s). Effects are permanent for your read-only DB. But index-only scans do not benefit from this.

When done optimizing, run once:
```
VACUUM ANALYZE fingerprints;
```
work_mem

The tiny setting for work_mem stands out:

work_mem = 582kB

Even the (very conservative!) default is 4MB.
But after reading your question again, it would seem you only have tiny queries. So maybe that’s ok after all.

Else, with 16GB RAM you can typically afford a 100 times as much. Depends on your work load of course.
- Many small queries, many parallel workers –> keep small work_mem (like 4MB?)
- Few big queries, few parallel workers –> go high (like 256MB? or more)
Large amounts of temporary files written in your database over time, and mentions of "disk" in the output of EXPLAIN ANALYZE would indicate the need for more work_mem.

Additional questions

Will the database be accelerated by purchase more RAM to handle all DB inside (128 GB in example)?

More RAM almost always helps until the whole DB can be cached in RAM and all processes can afford all the work_mem they desire.

And what parameters should I change to say to Postgres to store db in ram?

Everything that’s read from the database is cached automatically in system cache and Postgres cache, up to the limit of available RAM. (Setting work_mem too high competes for that same resource.)
Login or Signup to reply.

- jjanes
- December 24, 2022 at 1:04 am
- 0 votes
0
Increasing the RAM so that everything can stay in cache (perhaps after using pg_prewarm to get it into cache in the first place) would certainly work. But it is expensive and shouldn’t be necessary.

Having a hash index on something which is already a hashed value is probably not very helpful. Have you tried just a default (btree) index instead?

If you CLUSTER the table on the index over the column named "hash" (which you can only do if it is a btree index) then rows with the same hash code should mostly share the same table page, which would greatly cut down on the number of different buffer reads needed to fetch them all.

If you could get it do a bitmap heap scan instead of an index scan, then it should be able to have a large number of read requests outstanding at a time, due to effective_io_concurrency. But the planner does not account for effective_io_concurrency when doing planning, which means it won’t choose a bitmap heap scan specifically to get it that benefit. Normally an index read finding hundreds of rows on different pages would automatically choose a bitmap heap scan method, but in your case it is probably the low setting of random_page_cost which is inhibiting it from doing so. The low setting of random_page_cost is probably reasonable in itself, but it does have this unfortunate side effect. A problem with this strategy is that it doesn’t reduce the overall amount of IO needed, it just allows them overlap and so make better use of multiple IO channels. But if many sessions are running many instances of this query at the same time, they will start filling up those channels and so start competing with each other. So the CLUSTER method is probably superior as it gets the same answer with less IO. If you want to play around with bitmap scans, you could temporarily increase random_page_cost or temporarily set enable_indexscan to off.

Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

How to make big postgres db faster? – Postgresql

Answers

Indexes

`UNIQUE CONSTRAINT` on `(song_id, "offset", hash)`

Hash index on `(hash)` and `CLUSTER`

`work_mem`

Additional questions

How to make big postgres db faster? – Postgresql

Answers

Indexes

UNIQUE CONSTRAINT on (song_id, "offset", hash)

Hash index on (hash) and CLUSTER

work_mem

Additional questions

`UNIQUE CONSTRAINT` on `(song_id, "offset", hash)`

Hash index on `(hash)` and `CLUSTER`

`work_mem`