Table A: (columns: id (integer), name (varchar), ...)
Table B: (columns: id (integer), a_id (integer), value (numeric), ...)
Table C: (columns: id (integer), a_id (integer), b_id (integer), date (date), ...)
SELECT A.name, SUM(B.value) AS total_value,
COUNT(DISTINCT C.date) AS distinct_dates
FROM A
JOIN B ON A.id = B.a_id
JOIN C ON B.id = C.b_id
WHERE C.date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY A.name
ORDER BY total_value DESC;
The query takes a considerable amount of time to execute, even when limited to a specific date range. Indexes are in place on the relevant columns.
Table A has approximately 10 million rows.
Table B has approximately 20 million rows.
Table C has approximately 50 million rows.
The query execution time increases exponentially with larger date ranges.
What are specific indexes, query rewrites, or configuration changes to improve the execution time?
What are alternative approaches for handling such a complex join and aggregation operation on a large dataset?
2
Answers
It’s difficult to say without knowing the actual dataset and database configuration. Some general tips would be:
A.id
,B.a_id
,B.id
,C.b_id
, andC.date
).EXPLAIN
to analyze your query: it can help you understand how the query will be executed so you can identify potential bottlenecks.Are they? To have this query run fast, you should have an index on C’s
date
column and itsa_id
, so to get quickly from theWHERE
clause to the first join.Do you have this index?
If this is still too slow, you may want to consider providing covering indexes on the other tables:
In the end, this all still depends on the date range you provide and the decision the DBMS makes based on this. If your date range covers 90% of the C rows, then full tabke scans may be a lot faster than using indexes. If your date range covers 1% or even less of the C rows, then running through the indexes will be faster. We don’t know the threshold when indexes become preferable, but the DBMS will do its best to make the best decision.
As table A is not really part of the aggregation, it may help to make this plainly visible to the DBMS. Ideally, the optimizer should see this is the same query and come up with the same execution plan, but optimizers are not perfect, so a hint may help every now and then.