Table A: (columns: id (integer), name (varchar), ...)
Table B: (columns: id (integer), a_id (integer), value (numeric), ...)
Table C: (columns: id (integer), a_id (integer), b_id (integer), date (date), ...)
SELECT A.name, SUM(B.value) AS total_value,
    COUNT(DISTINCT C.date) AS distinct_dates
FROM A
JOIN B ON A.id = B.a_id
JOIN C ON B.id = C.b_id
WHERE C.date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY A.name
ORDER BY total_value DESC;

The query takes a considerable amount of time to execute, even when limited to a specific date range. Indexes are in place on the relevant columns.

Table A has approximately 10 million rows.
Table B has approximately 20 million rows.
Table C has approximately 50 million rows.

The query execution time grows far more than proportionally as the date range widens.

What specific indexes, query rewrites, or configuration changes would improve the execution time?

What are alternative approaches for handling such a complex join and aggregation operation on a large dataset?

2 Answers


  1. It’s difficult to say without knowing the actual dataset and database configuration. Some general tips would be:

    • Make sure you have the right indexes: indexes can improve query performance by reducing the amount of data that has to be scanned. In your case, you should have indexes on all columns used in join conditions, filters, and ORDER BY clauses (A.id, B.a_id, B.id, C.b_id, and C.date).
    • Use EXPLAIN to analyze your query: it can help you understand how the query will be executed so you can identify potential bottlenecks.
    • Try rewriting the query: restructuring sometimes helps the optimizer. You can use subqueries or CTEs to break the query down into smaller parts.
    • Consider partitioning your data: it can also reduce the amount of data to be scanned, but it depends on whether your data can be split into manageable chunks (for example, by date); a sketch of an EXPLAIN call and a range-partitioning setup follows this list.
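
    For illustration, here is a minimal sketch of the EXPLAIN and partitioning points, assuming PostgreSQL (EXPLAIN options and declarative partitioning syntax differ between systems, and the partition bounds below are only examples):

    -- Assumes PostgreSQL; adapt the syntax to your DBMS.
    -- 1) Look at the actual plan, row estimates, and buffer usage:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT A.name, SUM(B.value) AS total_value,
           COUNT(DISTINCT C.date) AS distinct_dates
    FROM A
    JOIN B ON A.id = B.a_id
    JOIN C ON B.id = C.b_id
    WHERE C.date BETWEEN '2022-01-01' AND '2022-12-31'
    GROUP BY A.name
    ORDER BY total_value DESC;

    -- 2) Range-partition the largest table by date so that a one-year
    --    predicate only has to touch the matching partition.
    --    (Hypothetical table name; existing data would need to be migrated.)
    CREATE TABLE c_partitioned (
      id   integer,
      a_id integer,
      b_id integer,
      date date
    ) PARTITION BY RANGE (date);

    CREATE TABLE c_2022 PARTITION OF c_partitioned
      FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');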
  2. "Indexes are in place on the relevant columns"

    Are they? To make this query run fast, you should have an index on C’s date column that also covers its b_id, so you can get quickly from the WHERE clause to the first join.

    create index idx_c on c (date) include (b_id);
    

    Do you have this index?

    If this is still too slow, you may want to consider providing covering indexes on the other tables:

    create unique index idx_b on b (id) include (a_id, value);
    
    create unique index idx_a on a (id) include (name);
    

    In the end, this all still depends on the date range you provide and the decision the DBMS makes based on it. If your date range covers 90% of the rows in C, full table scans may be a lot faster than using indexes. If it covers 1% or less of the rows in C, going through the indexes will be faster. We don’t know the exact threshold at which indexes become preferable, but the DBMS will do its best to pick the right plan.
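
    As a rough check of which side of that threshold you are on, you can measure what fraction of C the date range actually covers (plain aggregate SQL, so it should run on most systems; note that it scans C once):

    -- Rough selectivity check for the date filter on C.
    -- A large fraction (tens of percent) tends to favour a full scan;
    -- a small fraction (a few percent or less) tends to favour the index.
    SELECT
      SUM(CASE WHEN date BETWEEN '2022-01-01' AND '2022-12-31'
               THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS fraction_in_range
    FROM c;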

    As table A is not really part of the aggregation, it may help to make this plainly visible to the DBMS by aggregating B and C first and joining A afterwards. Ideally, the optimizer would recognize that this is equivalent to the original query and come up with the same plan anyway, but optimizers are not perfect, so a hint like this can help every now and then.

    SELECT a.name, bc.total_value, bc.distinct_dates
    FROM 
    (
      SELECT
        b.a_id,
        SUM(b.value) AS total_value,
        COUNT(DISTINCT c.date) AS distinct_dates
      FROM c
      JOIN b ON b.id = c.b_id
      WHERE c.date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
      GROUP BY b.a_id
    ) bc
    JOIN a ON a.id = bc.a_id
    ORDER BY bc.total_value DESC;
    