I have an AWS RDS PostgreSQL database. One table stores childbirth_data (currently 5M records) and another stores message-to-child stats (not the actual messages) with 30M records. Around 0.3M person records and 3M message records are added to these tables each month.
Query:
SELECT message_date, message_from, sms_type, message_type, direction, status,
       cd.state, cd.date_uploaded
FROM "Suvita".messages m, "Suvita".childbirth_data cd
WHERE m.contact_record_id = cd.id;
EXPLAIN ANALYZE results:
Hash Join  (cost=568680.28..6272284.94 rows=29688640 width=319) (actual time=5473.787..96501.807 rows=30893261 loops=1)
  Hash Cond: (m.contact_record_id = cd.id)
  ->  Seq Scan on messages m  (cost=0.00..2739237.50 rows=29719350 width=174) (actual time=2.364..41071.274 rows=30936614 loops=1)
  ->  Hash  (cost=402612.68..402612.68 rows=5347568 width=121) (actual time=5448.157..5448.158 rows=5349228 loops=1)
        Buckets: 32768  Batches: 256  Memory Usage: 3518kB
        ->  Seq Scan on childbirth_data cd  (cost=0.00..402612.68 rows=5347568 width=121) (actual time=0.011..3013.382 rows=5349228 loops=1)
Planning Time: 18.349 ms
Execution Time: 97849.004 ms
I have made aggregated tables for the dashboard where aggregate stats are required, and materialized views for different states or types of messages, depending on the reporting needs.
My writes to these tables happen via a scheduled job that runs every night, so even if writes become comparatively slow because of additional indexes, that will not be a concern.
I have already put in the following indexes:
person - id, state, mother name, mother phone, rchid, telerivet_contact_id
messages - id, contact_record_id, message_date and state
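Roughly, in SQL (the "person" table here is childbirth_data, and the column names are approximate):

-- approximate definitions; actual index and column names may differ
CREATE INDEX ON "Suvita".childbirth_data (state);
CREATE INDEX ON "Suvita".childbirth_data (mother_name);
CREATE INDEX ON "Suvita".childbirth_data (mother_phone);
CREATE INDEX ON "Suvita".childbirth_data (rchid);
CREATE INDEX ON "Suvita".childbirth_data (telerivet_contact_id);
CREATE INDEX ON "Suvita".messages (contact_record_id);
CREATE INDEX ON "Suvita".messages (message_date);
CREATE INDEX ON "Suvita".messages (state);
-- the id columns are the primary keys, which already have unique indexes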
My query is still very slow, and it seems the indexes are not being used. How do I make PostgreSQL use the indexes?
Would partitioning help? I was thinking of partitioning by state, but one of the states has 90% of the data, so I don't think it would help. Also, I do not want to touch the code that does the data input.
I am not a database expert, so please advise if there is anything else I can do to make reads on these tables faster.
2 Answers
There is no index that can help with a hash join.
One problem is that work_mem is set too small for efficient processing. You can use EXPLAIN (ANALYZE, BUFFERS) to see the temporary file usage. If you can afford to increase work_mem for this query, it should help.

I am surprised that PostgreSQL does not consider a merge join if you give it that little memory. What does the plan look like if you create indexes on childbirth_data.id and messages.contact_record_id? If you temporarily set enable_hashjoin and enable_nestloop to off in your session, how does the execution plan look?
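For example, in a psql session (the work_mem value below is only illustrative; pick one your instance's RAM can support):

SET work_mem = '256MB';        -- affects only this session
EXPLAIN (ANALYZE, BUFFERS)     -- BUFFERS shows temporary file reads/writes
SELECT ... ;                   -- your join query here

-- to see which plan the optimizer falls back to:
SET enable_hashjoin = off;
SET enable_nestloop = off;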
You are correct that adding indexes to the table is the best way to make reads faster. The tradeoff is that writes will be slower, so go easy on how many you add to the table (even if right now you think it won't matter much); it's important to choose them carefully.
Your problem
When picking an index, you want to look at the queries you’re making. You’ve indicated that you want to run the following query:
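-- (the question's query, with the comma join written as an explicit CROSS JOIN)
SELECT m.message_date,
       m.message_from,
       m.sms_type,
       m.message_type,
       m.direction,
       m.status,
       cd.state,
       cd.date_uploaded
FROM "Suvita".messages m
CROSS JOIN "Suvita".childbirth_data cd
WHERE m.contact_record_id = cd.id;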
Here I’ve reformatted things for clarity, but it is equivalent.
This is quite an odd thing to be doing. You are:

- cross joining every row in m to every row in cd
- filtering the result down to the rows where m.contact_record_id = cd.id

I am fairly sure this is the cause of your problems. First, this operation massively increases the amount of data being wrangled (30M x 5M = 150T rows before filtering). Secondly, there is no index on the cross-joined data, so no index can be used.
The good news
The query you’re making seems to me to be precisely equivalent to
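SELECT m.message_date,
       m.message_from,
       m.sms_type,
       m.message_type,
       m.direction,
       m.status,
       cd.state,
       cd.date_uploaded
FROM "Suvita".messages m
INNER JOIN "Suvita".childbirth_data cd
        ON m.contact_record_id = cd.id;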
This is a very typical query and should run much faster.
If you need additional performance improvements, the way forward would be to add an index on contact_record_id to messages and an index on id to childbirth_data.

Part of that is likely already in place: if you made id a primary key, then it will certainly have a unique index on it, and you should rely on that rather than adding a duplicate. Be aware, though, that making contact_record_id a foreign key back to childbirth_data does not automatically index it; unlike some other databases, PostgreSQL does not create an index on the referencing column of a foreign key, so you need to create that index yourself if it does not already exist.

With these indexes in place I am very confident that the query I have written will run in an optimised way.
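A minimal sketch, assuming the index names below (adjust them to your schema):

-- the foreign-key column needs its own index; PostgreSQL does not create one automatically
CREATE INDEX IF NOT EXISTS messages_contact_record_id_idx
    ON "Suvita".messages (contact_record_id);

-- only needed if id is not already the primary key of childbirth_data
CREATE UNIQUE INDEX IF NOT EXISTS childbirth_data_id_idx
    ON "Suvita".childbirth_data (id);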