skip to Main Content

Resolved: I was misunderstanding how indexing works. The issue was that I only had single-column indexes and so by joining another table or adding a WHERE clause, SQL could no longer do a simple run through the index, but needed to perform additional operations (accessing the column and/or not using the index entirely). By creating covering composite indexes search times went way down.

I’ve been having big performance issues with GROUP BY queries to count occurrences by date. I have been working on a relatively large table (900k records) and in trying to isolate the issue I have just ended up confusing myself.

I’ll try to demonstrate.

With no filtering applied on the table the query runs in well under a second (0.1 – 0.2s) (the date column is indexed):

SELECT date, COUNT(*) AS occurrences
FROM tableA
GROUP BY date
ORDER BY date ASC

Now some of my columns in tableA reference other tables and I want to join them. However, simply joining them (not even using them in a WHERE clause or otherwise) bump the query time to 5+ seconds.

e.g:

SELECT date, COUNT(date) AS occurrences
FROM tableA
LEFT JOIN tableB ON tableA_refColumn = tableB_id
GROUP BY date
ORDER BY date ASC

If I don’t JOIN foreign tables, but try including a WHERE clause in the query, I run into a similar issue. I have an indexed single digit binary column, binColumn. Filtering by this column – expectedly – is very fast and runs in virtually 0 seconds:

SELECT date
FROM tableA
WHERE binColumn = 1
*Returns roughly 300k results

But if I try adding this WHERE clause to the GROUP BY query, it similarly takes around 4-5 seconds:

SELECT date, COUNT(*) AS occurrences
FROM tableA
WHERE binColumn = 1
GROUP BY date
ORDER BY date ASC

I would think filtering the data would improve the performance of the GROUP BY, but clearly it doesn’t in this case.

I have tried using sub-queries and storing intermediate results to no avail.

e.g:

SELECT date, COUNT(date) AS occurrences
FROM (SELECT date FROM tableA WHERE binColumn = 1) AS subquery
GROUP BY date
ORDER BY date ASC

I did notice that selecting * instead of just date makes the GROUP BY take roughly 5 seconds like the other queries I have been having issues with, so I suspect it may have something to do with the way the WHERE and JOINs affect how/which columns are stored while processing.:

SELECT *, COUNT(*) AS occurrences
FROM tableA
GROUP BY date
ORDER BY date ASC

I have a limited knowledge of how SQL works, so I would love to understand what might be going on with the engine to produce these results. But any solution that will allow me to JOIN the tables and apply filtering criteria to the query without making it take forever would be greatly appreciated.

Changing the table structure is not out of question and I can be more specific if I haven’t painted the whole picture.

Thank you in advance.

2

Answers


  1. First query

    You have a covering index here. This means that the DBMS can just run through the index and get all information needed. It doesn’t have to read the table. Moreover, all you do is group and sort by date. As the index is on date, the DBMS can just read the index sequentially and get the result right away.

    Second query

    You join. There is no WHERE clause to limit the rows. So you want to join the whole tables. I don’t know how the DBMS will approach this problem. As it will have to read every row in the tables, it may just take both tables, sort all rows by the join keys and then start comparing the two data sets. At last you want to sort by date. So it will take the whole result set and sort it again.

    Is there anything we can do about this? Well, we can consider the task as follows: Get tableA rows where exists a matching tableB row and then sort by date. So the appropriate indexes would be:

    CREATE INDEX idx1 ON tableA (tableA_refColumn, date);
    CREATE INDEX idx2 ON tableB (tableB_id);
    

    Thus we can again run through an index (idx1), use its tableA_refColumn to find the matching tableB rows (by idx2) and have already selected the date we want to use for sorting. This will still be much slower than the first query, but the indexes may help a bit.

    Third query

    It is very easy for a DBMS to run through a table and just keep the rows matching the specified criteria. As no sorting has to be done, it is just look at each row and decide whether to show it or not. Simple.

    Fourth query

    This query could use the index on date again, but then it would have to run through the entire index and for each entry still access the able to learn about binColumn. Or it just runs through the tabke sequentially and then sorts all result rows. This query would greatly benefit from a covering index starting with the filtering column binColumn:

    CREATE INDEX idx3 ON tableB (binColumn, date);
    

    Thus the DBMS could again just run though the index. But well, once it has all the data collected, it will still need to sort it, because we had to put binColumn first in order to find the rows quickly, which is particularly important, if the column is very selective, so we only access very few rows per value. If the binColumn is not selective (e.g. only alows two values 0 and 1 and not millions) we can turn around the index, because binColumn may not filter much out anyway. Thus we get a pre-sorted result where only some rows get dismissed:

    CREATE INDEX idx4 ON tableB (date, binColumn);
    

    Fifth query

    You have re-written the query, but the DBMS easily sees that it is just the same thing you want: get the rows with binColumn = 1, then sort by date. It makes no difference to the DBMS that you re-wrote the query. A general advice: Don’t do this. Keep your queries as simple as possible. This even helps the DBMS to find the optimal execution plan.

    Sixth query

    COUNT(*) is the simple thing: just count rows. COUNT(expression) is the more complicated thing: look at each row whether the expression is null or not and only count non-nulls. IF date is nullable then COUNT(date) may take a little more time than COUNT(*). If date is not nullable and we use COUNT(date), then we are lucky that the DBMS sees this and silently applies COUNT(*)instead, rather than looking at each row only to see everytime that its date is not null.

    Login or Signup to reply.
  2. Query 2 is wasting time doing the LEFT JOIN even though nothing is needed from the extra table.

    Query 4 needs heeds this composite index:

    INDEX(bin_column, date)
    

    Replace the existing INDEX(bin_column) with that. Then Query 3 sill still be fast, but without using an extra index. Or you may want COUNT(DISTINCT date).

    A note: COUNT(date) checks each date for being non-null. This is probably a waste of effort, so use simply COUNT*).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search