
I have a PostgreSQL cluster composed of four machines, configured as follows:

Three PostgreSQL nodes managed by Patroni and etcd, using PgBouncer. One of these nodes is the primary, and the other two are replicas.
One machine running HAProxy to accept connections and perform load balancing.
We have a database with a table containing millions of rows. This table is a hypertable created with TimescaleDB. In our organization we collect data from meters and store it in this hypertable, resulting in thousands of rows inserted daily.
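For reference, the hypertable is created roughly along these lines (the table name, columns, and chunk interval below are illustrative placeholders, not our actual schema):

```sql
-- Hypothetical sketch of the kind of table involved;
-- the real schema and chunk interval may differ.
CREATE TABLE meter_readings (
    time     TIMESTAMPTZ NOT NULL,
    meter_id INTEGER     NOT NULL,
    value    DOUBLE PRECISION
);

-- Convert it into a TimescaleDB hypertable partitioned by time.
SELECT create_hypertable('meter_readings', 'time',
                         chunk_time_interval => INTERVAL '1 day');
```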

However, the database cannot handle this load. The number of locks keeps increasing, and the RAM usage also continuously grows, eventually causing the primary node to crash.

For context, each node has 12GB of RAM and 16 CPU cores.

What could be causing the continuous increase in locks and RAM usage?
What are the best practices to handle high insert rates in a TimescaleDB hypertable?
How can we optimize our configuration to prevent the primary node from crashing?

I appreciate any insights or recommendations on how to address these performance issues. Thank you!

I am a young developer with little experience, and I’m doing my best to explain the situation.

PostgreSQL configuration with Patroni:

loop_wait: 10
maximum_lag_on_failover: 1048576
postgresql:
  max_parallel_workers: 16
  max_parallel_workers_per_gather: 4
  parameters:
    archive_mode: false
    autovacuum: true
    autovacuum_max_workers: 5
    autovacuum_naptime: 1min
    autovacuum_vacuum_cost_delay: 20ms
    effective_cache_size: 8GB
    effective_io_concurrency: 4
    hot_standby: true
    log_directory: /var/log/postgresql/pg_log
    log_filename: postgresql.log
    log_hostname: true
    log_statement: all
    logging_collector: true
    maintenance_work_mem: 512MB
    max_connections: 3000
    max_locks_per_transaction: 1024
    max_prepared_transactions: 0
    max_wal_senders: 10
    max_worker_processes: 16
    shared_buffers: 4096MB
    shared_preload_libraries: timescaledb
    use_pg_rewind: true
    use_slots: true
    wal_keep_segments: 128
    wal_level: hot_standby
    wal_sender_timeout: 60s
    work_mem: 64MB
retry_timeout: 10
ttl: 30

PgBouncer configuration:

logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid

;; IP address or * which means all IPs
listen_addr = 0.0.0.0
listen_port = 6432

unix_socket_dir = /var/run/postgresql

pool_mode = transaction

max_client_conn = 5000

default_pool_size = 20

reserve_pool_size = 20

reserve_pool_timeout = 2

server_idle_timeout = 5

Here is the pg_locks view from pgAdmin:

[screenshot of the pg_locks view]

2 Answers


  1. PostgreSQL was designed around a process-per-connection model (each backend is a separate OS process, comparable to a standalone executable), not a thread model like other RDBMSs such as SQL Server. The result is high resource consumption of all kinds, poorly suited to absorbing a heavy concurrent load.

    https://www.postgresql.org/message-id/31cc6df9-53fe-3cd9-af5b-ac0d801163f4%40iki.fi

    Having more than about 300 users with high concurrency in a single PG cluster is beyond what PG handles well, even with external connection pooling.

    There is another reason: PostgreSQL's MVCC keeps old row versions inside the table itself (other RDBMSs such as SQL Server and Oracle store them in an external copy), so it needs a cleanup process called VACUUM, which takes many locks and causes contention on resources.

    https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html
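    To see how much dead-tuple cleanup work VACUUM is facing, you can query the standard statistics view on the primary (this is generic monitoring SQL, not specific to the poster's schema):

    ```sql
    -- Dead tuples accumulated and last (auto)vacuum times per table.
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           last_vacuum,
           last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 10;
    ```

    A steadily growing n_dead_tup with a stale last_autovacuum suggests autovacuum is not keeping up with the insert/update rate.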

    One approach you can try is to throw much more hardware at it to compensate. That is:

    • high-frequency CPUs (in turbo mode) able to reach around 5 GHz
    • a large amount of RAM (depending on data volume), at least 32 GB to start
    • NVMe storage for all the transaction log (WAL) files

    Also, PostgreSQL has no in-memory tables (which can be 10 to 30 times faster than persistent tables), is limited in query parallelism (PG cannot parallelize INSERT, UPDATE, or DELETE statements), and has no columnar tables or columnar indexes, which are essential when dealing with billions of rows.

  2. First, I’d suggest checking whether you have blocking. Some locking can be normal, depending on your database activity; the important thing is not to have blocking, where one process is waiting on another.

    In your case I do see some exclusive locks, which can cause blocking.

    You can check it by running, for example, the following query:

    SELECT blocked_locks.pid          AS blocked_pid,
           blocked_activity.usename   AS blocked_user,
           blocking_locks.pid         AS blocking_pid,
           blocking_activity.usename  AS blocking_user,
           blocked_activity.query     AS blocked_statement,
           blocking_activity.query    AS current_statement_in_blocking_process
    FROM pg_catalog.pg_locks AS blocked_locks
    JOIN pg_catalog.pg_stat_activity AS blocked_activity
         ON blocked_activity.pid = blocked_locks.pid
    JOIN pg_catalog.pg_locks AS blocking_locks
         ON blocking_locks.locktype = blocked_locks.locktype
        AND blocking_locks.database      IS NOT DISTINCT FROM blocked_locks.database
        AND blocking_locks.relation      IS NOT DISTINCT FROM blocked_locks.relation
        AND blocking_locks.page          IS NOT DISTINCT FROM blocked_locks.page
        AND blocking_locks.tuple         IS NOT DISTINCT FROM blocked_locks.tuple
        AND blocking_locks.virtualxid    IS NOT DISTINCT FROM blocked_locks.virtualxid
        AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
        AND blocking_locks.classid       IS NOT DISTINCT FROM blocked_locks.classid
        AND blocking_locks.objid         IS NOT DISTINCT FROM blocked_locks.objid
        AND blocking_locks.objsubid      IS NOT DISTINCT FROM blocked_locks.objsubid
        AND blocking_locks.pid != blocked_locks.pid
    JOIN pg_catalog.pg_stat_activity AS blocking_activity
         ON blocking_activity.pid = blocking_locks.pid
    WHERE NOT blocked_locks.granted;
    
    

    Depending on your PostgreSQL version, this query may need small adjustments, but you can adapt it for your version and check the output.

    Also, I agree that max_connections looks pretty high if you’re using PgBouncer. What is the actual number of connections in use? Is PgBouncer running on the same server as the database?
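    To measure the actual number of connections in use, you can count sessions by state on the primary (run as a superuser or monitoring role so that all backends are visible):

    ```sql
    -- Current sessions grouped by state (active, idle, idle in transaction, ...).
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;
    ```

    If the total is nowhere near 3000, lowering max_connections would free a significant amount of memory on a 12 GB node.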
