I have a PostgreSQL cluster composed of four machines, configured as follows:
Three PostgreSQL nodes managed by Patroni and etcd, using PgBouncer. One of these nodes is the primary, and the other two are replicas.
One machine using HAProxy to accept connections and perform load balancing.
We have a database with a table containing millions of rows. This table is a hypertable created with TimescaleDB. In our organization, we collect data from meters and store it in this hypertable, resulting in thousands of rows being inserted daily.
However, the database cannot handle this load. The number of locks keeps increasing, and RAM usage also grows continuously, eventually causing the primary node to crash.
For context, each node has 12GB of RAM and 16 CPU cores.
What could be causing the continuous increase in locks and RAM usage?
What are the best practices to handle high insert rates in a TimescaleDB hypertable?
How can we optimize our configuration to prevent the primary node from crashing?
I appreciate any insights or recommendations on how to address these performance issues. Thank you!
I am a young developer with little experience, and I’m doing my best to explain the situation.
PostgreSQL configuration with Patroni:
loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
max_parallel_workers: 16
max_parallel_workers_per_gather: 4
parameters:
archive_mode: false
autovacuum: true
autovacuum_max_workers: 5
autovacuum_naptime: 1min
autovacuum_vacuum_cost_delay: 20ms
effective_cache_size: 8GB
effective_io_concurrency: 4
hot_standby: true
log_directory: /var/log/postgresql/pg_log
log_filename: postgresql.log
log_hostname: true
log_statement: all
logging_collector: true
maintenance_work_mem: 512MB
max_connections: 3000
max_locks_per_transaction: 1024
max_prepared_transactions: 0
max_wal_senders: 10
max_worker_processes: 16
shared_buffers: 4096MB
shared_preload_libraries: timescaledb
use_pg_rewind: true
use_slots: true
wal_keep_segments: 128
wal_level: hot_standby
wal_sender_timeout: 60s
work_mem: 64MB
retry_timeout: 10
ttl: 30
PgBouncer configuration:
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid
;; IP address or * which means all IPs
listen_addr = 0.0.0.0
listen_port = 6432
unix_socket_dir = /var/run/postgresql
pool_mode = transaction
max_client_conn = 5000
default_pool_size = 20
reserve_pool_size = 20
reserve_pool_timeout = 2
server_idle_timeout = 5
Here is the pg_locks view from pgAdmin:
2 Answers
PostgreSQL was designed around a multi-process architecture (one operating system process per connection), not a thread model like other RDBMSs such as SQL Server. The result is an excessive consumption of resources of all kinds, poorly suited to absorbing a heavy concurrent load.
https://www.postgresql.org/message-id/31cc6df9-53fe-3cd9-af5b-ac0d801163f4%40iki.fi
Having more than 300 users with high concurrency in a single PG cluster is out of PG's scope… even with some external pooling…
Another reason: PostgreSQL's MVCC model keeps old row versions inside the table itself (other RDBMSs such as SQL Server or Oracle keep them in a separate store), so it needs a cleaner called VACUUM, which takes many locks and causes contention on resources.
https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html
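To get an idea of how much cleanup work VACUUM is facing, a query along these lines against the standard pg_stat_user_tables view shows dead tuples and the last autovacuum run per table (TimescaleDB chunks appear there as ordinary tables in the _timescaledb_internal schema):
-- tables with the most dead tuples and their last autovacuum time
SELECT schemaname,
       relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;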
One thing you can try is to add much more hardware resources to work around the heavy PG process model… That is:
Also, PostgreSQL does not have in-memory tables (which are 10 to 30 times faster than persistent tables), is very limited in query parallelism (PG cannot parallelize INSERT, UPDATE, or DELETE queries…), and does not offer columnar tables or columnar indexes… which are essential when dealing with thousands of millions of rows…
First, I’d suggest checking whether you have blocking. Some locking can be normal, depending on your database activity, but the important thing is not to have blocking, where one process is waiting for another.
In your case I see some exclusive locks, which can cause blocking.
You can check it by running, for example, the following query:
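For example, on PostgreSQL 9.6 or later, something along these lines (using pg_blocking_pids against pg_stat_activity) lists the sessions that are currently blocked and the PIDs blocking them:
-- sessions waiting on a lock, and the PIDs holding the conflicting locks
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;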
Depending on your PostgreSQL version the exact query may differ, but you can find the correct one for your version and check the output.
Also, I agree that max_connections looks pretty high if you're using PgBouncer. What is the actual number of connections in use? Is PgBouncer running on the same server as the database?
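As a quick check of how many server-side connections are actually in use (this counts what is open on the database itself, not the clients connected to PgBouncer), something like this works:
-- connections currently open on the server, grouped by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
On the PgBouncer side, SHOW POOLS; on the admin console gives the per-pool client and server connection counts.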