Table A: (columns: id (integer), name (varchar), ...)
Table B: (columns: id (integer), a_id (integer), value (numeric), ...)
Table C: (columns: id (integer), a_id (integer), b_id (integer), date (date), ...)
SELECT A.name, SUM(B.value) AS total_value,
    COUNT(DISTINCT C.date) AS distinct_dates
FROM A
JOIN B ON A.id = B.a_id
JOIN C ON B.id = C.b_id
WHERE C.date BETWEEN '2022-01-01' AND '2022-12-31'
GROUP BY A.name
ORDER BY total_value DESC;

The query takes a considerable amount of time to execute, even when limited to a specific date range. Indexes are in place on the relevant columns.

Table A has approximately 10 million rows.
Table B has approximately 20 million rows.
Table C has approximately 50 million rows.

The query execution time grows far more than proportionally as the date range widens.

What specific indexes, query rewrites, or configuration changes would improve the execution time?

What are alternative approaches for handling such a complex join and aggregation operation on a large dataset?

2 Answers


  1. It’s difficult to say without knowing the actual dataset and database configuration. Some general tips would be:

    • Make sure you have the right indexes: indexes can improve query performance by reducing the amount of data that has to be scanned. In your case, you should have indexes on all columns used in join conditions, filters, and ORDER BY clauses (A.id, B.a_id, B.id, C.b_id, and C.date).
    • Use EXPLAIN to analyze your query: it can help you understand how the query will be executed so you can identify potential bottlenecks.
    • Try rewriting the query: restructuring sometimes helps the optimizer. You can use subqueries or CTEs to break the query down into smaller parts.
    • Consider partitioning your data: it can also reduce the amount of data to be scanned, but it depends on whether your data can be split into manageable chunks (for example, by date); a sketch of an EXPLAIN call and a range-partitioning setup follows this list.
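
    For illustration, here is a minimal sketch of the EXPLAIN and partitioning points, assuming PostgreSQL (EXPLAIN options and declarative partitioning syntax differ between systems, and the partition bounds below are only examples):

    -- Assumes PostgreSQL; adapt the syntax to your DBMS.
    -- 1) Look at the actual plan, row estimates, and buffer usage:
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT A.name, SUM(B.value) AS total_value,
           COUNT(DISTINCT C.date) AS distinct_dates
    FROM A
    JOIN B ON A.id = B.a_id
    JOIN C ON B.id = C.b_id
    WHERE C.date BETWEEN '2022-01-01' AND '2022-12-31'
    GROUP BY A.name
    ORDER BY total_value DESC;

    -- 2) Range-partition the largest table by date so that a one-year
    --    predicate only has to touch the matching partition.
    --    (Hypothetical table name; existing data would need to be migrated.)
    CREATE TABLE c_partitioned (
      id   integer,
      a_id integer,
      b_id integer,
      date date
    ) PARTITION BY RANGE (date);

    CREATE TABLE c_2022 PARTITION OF c_partitioned
      FOR VALUES FROM ('2022-01-01') TO ('2023-01-01');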
  2. "Indexes are in place on the relevant columns"

    Are they? To make this query run fast, you should have an index on C’s date column that also covers its b_id, so you can get quickly from the WHERE clause to the first join.

    create index idx_c on c (date) include (b_id);
    

    Do you have this index?

    If this is still too slow, you may want to consider providing covering indexes on the other tables:

    create unique index idx_b on b (id) include (a_id, value);
    
    create unique index idx_a on a (id) include (name);
    

    In the end, this all still depends on the date range you provide and the decision the DBMS makes based on it. If your date range covers 90% of the rows in C, full table scans may be a lot faster than using indexes. If it covers 1% or less of the rows in C, going through the indexes will be faster. We don’t know the exact threshold at which indexes become preferable, but the DBMS will do its best to pick the right plan.
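
    As a rough check of which side of that threshold you are on, you can measure what fraction of C the date range actually covers (plain aggregate SQL, so it should run on most systems; note that it scans C once):

    -- Rough selectivity check for the date filter on C.
    -- A large fraction (tens of percent) tends to favour a full scan;
    -- a small fraction (a few percent or less) tends to favour the index.
    SELECT
      SUM(CASE WHEN date BETWEEN '2022-01-01' AND '2022-12-31'
               THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS fraction_in_range
    FROM c;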

    As table A is not really part of the aggregation, it may help to make this plainly visible to the DBMS by aggregating B and C first and joining A afterwards. Ideally, the optimizer would recognize that this is equivalent to the original query and come up with the same plan anyway, but optimizers are not perfect, so a hint like this can help every now and then.

    SELECT a.name, bc.total_value, bc.distinct_dates
    FROM 
    (
      SELECT
        b.a_id,
        SUM(b.value) AS total_value,
        COUNT(DISTINCT c.date) AS distinct_dates
      FROM c
      JOIN b ON b.id = c.b_id
      WHERE c.date BETWEEN DATE '2022-01-01' AND DATE '2022-12-31'
      GROUP BY b.a_id
    ) bc
    JOIN a ON a.id = bc.a_id
    ORDER BY bc.total_value DESC;
    