skip to Main Content

I have a MySQL dev_table with thousands of products identified by their Name, containing changes to price and availability (among other values) over time. The table contains only the dates where a change occurred.

I am looking for an efficient way to select the latest results for all products available on a certain date – e.g. on 2016-01-01 – even though there may be no entry for the product for the given date.

SET GLOBAL sql_mode=(SELECT REPLACE(@@sql_mode,'ONLY_FULL_GROUP_BY',''));

CREATE TABLE `dev_table` (`uid` int(11) NOT NULL, `Name` varchar(200) NOT NULL, `Date` date NOT NULL, `Available` tinyint(1) NOT NULL, `Price` decimal(5,2) NOT NULL, `Size` tinyint(1) NOT NULL); 

ALTER TABLE `dev_table` ADD PRIMARY KEY (`uid`);

INSERT INTO `dev_table` (`uid`, `Name`, `Date`, `Available`, `Price`, `Size`) VALUES
(1, 'product A', '2000-01-01', 1, '123.33', 5), (2, 'product B', '2000-01-01', 1, '17.50', 1), (3, 'product C', '2003-06-01', 1, '578.25', 6), (4, 'product A', '2015-08-07', 1, '131.67', 2), (5, 'product D', '2016-01-01', 1, '8.02', 2),
(6, 'product C', '2016-01-01', 0, '478.72', 6), (7, 'product A', '2023-01-27', 1, '134.07', 5), (8, 'product B', '2023-09-01', 0, '12.01', 1);
+-----+------------+-------------+-----------+--------+------+ 
| uid | Name       | Date        | Available | Price  | Size |
+-----+------------+-------------+-----------+--------+------+
|   1 | product A  | 2000-01-01  | 1         | 123.33 |    5 |
|   2 | product B  | 2000-01-01  | 1         |  17.50 |    1 |
|   3 | product C  | 2003-06-01  | 1         | 578.25 |    6 |
|   4 | product A  | 2015-08-07  | 1         | 131.67 |    2 |
|   5 | product D  | 2016-01-01  | 1         |   8.02 |    2 |
|   6 | product C  | 2016-01-01  | 0         | 478.72 |    6 |
|   7 | product A  | 2023-01-27  | 1         | 134.07 |    5 |
|   8 | product B  | 2023-09-01  | 0         |  12.01 |    1 |
+-----+------------+-------------+-----------+--------+------+ 

I use 2 nested subqueries to achieve this:

SELECT * FROM `dev_table`
WHERE `uid` IN (
    SELECT max(subquery1.`uid`) AS lastUid FROM `dev_table` AS subquery1
    WHERE subquery1.`Date` <= '2016-01-01' 
    AND subquery1.`Name` NOT IN (
        SELECT subquery2.`Name` FROM `dev_table` AS subquery2
        WHERE subquery2.`Date` <= '2016-01-01' 
        AND subquery2.`Available` != 1
    ) 
    GROUP BY subquery1.`Name`
    ORDER by lastUid DESC
);

This does return the rows for uid 2, 4 and 5 as expected, these being the latest entries for product B, A and D respectively, and not returning uid 6, as product D is not available on the given date.

This solution is inefficient on larger tables, however. It gets even worse when adding more criteria such as MATCH (`Name`) AGAINST ('product' IN BOOLEAN MODE). Also, ONLY_FULL_GROUP_BY does seem a bit outdated.

How can I improve this query?

3

Answers


  1. You can use the HAVING clause to exclude those with at least one unavailable flag, to do so, we must select only those with total available products. Total products ‘SUM(d.Available)‘ equal total products ‘COUNT(*)‘ :

    SELECT d.Name, max(d.uid) AS lastUid
    FROM dev_table AS d
    WHERE d.`Date` <= '2016-01-01'
    GROUP BY d.Name
    HAVING SUM(d.Available) = COUNT(*);
    

    Demo here

    Login or Signup to reply.
  2. You’re right that nested subqueries can become quite inefficient, especially as the table grows. One approach to optimize your query is by using window functions and common table expressions (CTEs).

    WITH LatestProductInfo AS (
        SELECT uid, Name, Date, Available, Price, Size,
               ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Date DESC, uid DESC) AS rn
          FROM dev_table
         WHERE Date <= '2016-01-01'
    )
    SELECT uid, Name, Date, Available, Price, Size
      FROM LatestProductInfo
    WHERE rn = 1
      AND (Available = 1);
    
    Login or Signup to reply.
  3. Your current query suggests that a product cannot go from Available = 0 back to Available = 1. Is this correct? If not, both your query and SelVazi’s will return incorrect results, as they will not include products which become available again.

    Assuming products can become available again, you need to check the availability of the most recent row on, or before, the desired date. Whether a window function (as suggested by Amit) or max subquery is more efficient will depend on the scale and distribution of your data, as well as the rest of the query.

    This query requires that both uid and Date are in sequence, but should be slightly more efficient as it joins on PK, not a secondary index:

    SELECT `dt`.*
    FROM (
        SELECT MAX(`uid`) AS `max_uid`
        FROM `dev_table`
        WHERE `Date` <= '2016-01-01'
        GROUP BY `Name`
    ) AS `most_recent`
    JOIN `dev_table` AS `dt`
        ON `most_recent`.`max_uid` = `dt`.`uid`
    WHERE `dt`.`available` = 1;
    

    This query joins on the secondary index, so may be slightly less efficient than the query above, and it will produce multiple rows for the same Name if there are multiple rows for the same Name on the same Date (e.g. different sizes):

    SELECT `dt`.*
    FROM (
        SELECT `Name`, MAX(`Date`) AS `max_date`
        FROM `dev_table`
        WHERE `Date` <= '2016-01-01'
        GROUP BY `Name`
    ) AS `most_recent`
    JOIN `dev_table` AS `dt`
        ON `most_recent`. `Name` = `dt`.`Name`
        AND `most_recent`.`max_date` = `dt`.`Date`
    WHERE `dt`.`available` = 1;
    

    Both of these queries require an index on (Name, Date):

    ALTER TABLE `dev_table` ADD INDEX `idx_name_date` (`Name`, `Date`);
    

    Here’s a db<>fiddle.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search