
I use AWS RDS PostgreSQL as my database. I have a table storing childbirth data (currently 5M records) and another storing message-to-child stats, not the actual messages (30M records). Around 0.3M person records and 3M message records are added to these tables every month.

Query:

SELECT message_date, message_from, sms_type, message_type, direction, status,
       cd.state, cd.date_uploaded
FROM "Suvita".messages m, "Suvita".childbirth_data cd
WHERE m.contact_record_id = cd.id

Explain analyze results:

Hash Join  (cost=568680.28..6272284.94 rows=29688640 width=319) (actual time=5473.787..96501.807 rows=30893261 loops=1)
  Hash Cond: (m.contact_record_id = cd.id)
  ->  Seq Scan on messages m  (cost=0.00..2739237.50 rows=29719350 width=174) (actual time=2.364..41071.274 rows=30936614 loops=1)
  ->  Hash  (cost=402612.68..402612.68 rows=5347568 width=121) (actual time=5448.157..5448.158 rows=5349228 loops=1)
        Buckets: 32768  Batches: 256  Memory Usage: 3518kB
        ->  Seq Scan on childbirth_data cd  (cost=0.00..402612.68 rows=5347568 width=121) (actual time=0.011..3013.382 rows=5349228 loops=1)
Planning Time: 18.349 ms
Execution Time: 97849.004 ms

I have created aggregate tables for the dashboard where summary stats are required, and materialized views for different states or message types, depending on the reporting needs.
Writes to these tables happen via a scheduled job that runs every night, so even if writes become comparatively slow because of additional indexes, that is not a concern.

I have already put in the following indexes:

person - id, state, mother name, mother phone, rchid, telerivet_contact_id
messages - id, contact_record_id, message_date, state

My query is still very slow, and it seems the indexes are not being used. How do I make PostgreSQL use the indexes?

Would partitioning help? I was thinking of partitioning by state, but one state holds 90% of the data, so I doubt it would help. Also, I do not want to touch the code that does the data input.

I am not a database expert, so please advise: is there anything else I can do to make reads on these tables faster?

2 Answers


  1. There is no index that can help with a hash join.

    One problem is that work_mem is set too small for efficient processing: the 256 batches in your plan mean the hash table does not fit in memory and is being spilled to disk in pieces. You can use EXPLAIN (ANALYZE, BUFFERS) to see the temporary file usage. If you can afford to increase work_mem for this query, it should help.
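
    A minimal sketch of that check; the 256MB value is an illustrative assumption, so size it to what your RDS instance can spare per session:

    -- Session-local setting; it does not affect other connections.
    SET work_mem = '256MB';

    -- Re-run the plan with buffer statistics; if "Batches" drops to 1,
    -- the hash table now fits in memory and no temporary files are used.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT m.message_date, m.message_from, m.sms_type, m.message_type,
           m.direction, m.status, cd.state, cd.date_uploaded
    FROM "Suvita".messages m
        JOIN "Suvita".childbirth_data cd ON m.contact_record_id = cd.id;

    RESET work_mem;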

    I am surprised that PostgreSQL does not consider a merge join if you give it that little memory. What does the plan look like if you create indexes on childbirth_data.id and messages.contact_record_id? If you temporarily set enable_hashjoin and enable_nestloop to off in your session, how does the execution plan look?
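
    A sketch of that experiment, using session-local settings only (the index names are assumptions; skip the CREATE INDEX statements if the columns are already indexed):

    CREATE INDEX IF NOT EXISTS childbirth_data_id_idx
        ON "Suvita".childbirth_data (id);
    CREATE INDEX IF NOT EXISTS messages_contact_record_id_idx
        ON "Suvita".messages (contact_record_id);

    SET enable_hashjoin = off;  -- discourage the hash join for this session
    SET enable_nestloop = off;  -- and the nested loop, leaving the merge join

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT m.message_date, m.message_from, m.sms_type, m.message_type,
           m.direction, m.status, cd.state, cd.date_uploaded
    FROM "Suvita".messages m
        JOIN "Suvita".childbirth_data cd ON m.contact_record_id = cd.id;

    RESET enable_hashjoin;
    RESET enable_nestloop;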

  2. You are correct that adding indexes to the table is the best way to make reads faster. The tradeoff is that writes will be slower, so go easy on how many you add to the table (even if right now you think it won’t matter much); it’s important to choose them carefully.

    Your problem

    When picking an index, you want to look at the queries you’re making. You’ve indicated that you want to run the following query:

    SELECT
        message_date,
        message_from,
        sms_type,
        message_type,
        direction,
        status,
        cd.state,
        cd.date_uploaded
    FROM
        "Suvita".messages AS m
    CROSS JOIN
        "Suvita".childbirth_data AS cd
    WHERE
        m.contact_record_id = cd.id
    

    Here I’ve reformatted things for clarity, but it is equivalent.

    This is a roundabout way to write the query. Semantically, you are:

    • cross joining every row in m to every row in cd
    • filtering on a condition that throws away all combined rows where the IDs don’t match

    On paper that intermediate result is enormous (30M x 5M = 150T row combinations). In practice, PostgreSQL’s planner recognises the equality predicate and executes this as an ordinary join (your plan shows a Hash Join, not a materialised cross product), so the syntax is not the direct cause of the slowness. It does, however, obscure your intent and make the query easy to get wrong.

    The good news

    The query you’re making seems to me to be precisely equivalent to

    SELECT
        message_date,
        message_from,
        sms_type,
        message_type,
        direction,
        status,
        cd.state,
        cd.date_uploaded
    FROM
        "Suvita".messages AS m
    INNER JOIN
        "Suvita".childbirth_data AS cd
    ON
        m.contact_record_id = cd.id
    

    This is the idiomatic way to write the join. Because the planner treats the two forms identically, the rewrite alone will not change the timings; it makes the query clearer, and the real gains come from indexing.

    If you need additional performance improvements, the way forward would be to add an index on contact_record_id to messages and an index on id to childbirth_data.

    Part of this may already be in place: if you made id a primary key, it will certainly have a unique index. Note, however, that PostgreSQL does not automatically index foreign key columns, so declaring contact_record_id as a foreign key back to childbirth_data enforces referential integrity but gives you no index; you still have to create one on messages.contact_record_id yourself. Declaring the constraints this way is preferable to adding bare extra indexes, since they also document and enforce the relationship.
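
    A quick way to verify what is already there, followed by a sketch of the missing pieces (the constraint and index names are illustrative assumptions):

    -- List the indexes that already exist on both tables.
    SELECT tablename, indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'Suvita'
      AND tablename IN ('messages', 'childbirth_data');

    -- The foreign key enforces the relationship but creates no index...
    ALTER TABLE "Suvita".messages
        ADD CONSTRAINT messages_contact_record_id_fkey
        FOREIGN KEY (contact_record_id)
        REFERENCES "Suvita".childbirth_data (id);

    -- ...so the referencing column must be indexed separately.
    CREATE INDEX IF NOT EXISTS messages_contact_record_id_idx
        ON "Suvita".messages (contact_record_id);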

    With the primary key index and an index on contact_record_id in place, the planner has everything it needs, and I am confident the query above will run as well as it can.
