skip to Main Content

I need to adjust the following select statement to effectively filter out 523 records where ‘first_date_purchased’ is less than ‘first_date_watched’. The following select statement returns 20778 records, of which 20255 should remain.

    SELECT se.student_id,
           si.date_registered,
           MIN(se.date_watched) AS first_date_watched,
           MIN(sp.date_purchased) AS first_date_purchased
    FROM student_engagement se
      LEFT JOIN student_info si ON se.student_id = si.student_id
      LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
    GROUP BY se.student_id
    ORDER BY- first_date_purchased DESC
    -- Gives 20778 records

I have come up with the following solution, using NOT IN, but it has a substantial runtime of 1 minute. Initially I cancelled the process, but was advised by @Akina to include a sample dataset, on which both this query and another solution by @Raphi worked perfectly. My query is as follows:

    SELECT se.student_id,
           si.date_registered,
           MIN(se.date_watched) AS first_date_watched,
           MIN(sp.date_purchased) AS first_date_purchased
    FROM student_engagement se
      LEFT JOIN student_info si ON se.student_id = si.student_id
      LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
    WHERE se.student_id NOT IN (SELECT sp2.student_id
                                FROM student_purchases sp2
                                WHERE sp2.date_purchased < se.date_watched)
    GROUP BY se.student_id
    ORDER BY- first_date_purchased DESC;
    -- Gives 20255 records

Although, I wanted to complete all filtering within a single select statement, I initially cast the unfiltered result of 20778 records into a new table called ‘unfiltered_true_dataset’ and got the desired filtering done with the following select statement based on that:

    SELECT *
    FROM unfiltered_true_dataset
    WHERE student_id NOT IN (SELECT student_id
                             FROM unfiltered_true_dataset
                             WHERE first_date_purchased < first_date_watched)
    -- Successfully obtained 20255 records

My question is whether there might be a more effective way to apply such a filter in a single select statement? Here is a sample dataset. The unfiltered select stetement returns 5 records, where the solution using the NOT IN subquery correctly returns only 4 records:

    CREATE TABLE `student_info` 
    (`student_id` INT NOT NULL, 
    `date_registered` DATE DEFAULT NULL, 
    PRIMARY KEY (`student_id`));
    
    INSERT INTO `student_info` VALUES
    (255193, '2021-12-01'),(255194, '2021-12-01'),(255196, '2021-12-01'),
    (255198, '2021-12-01'),(255199, '2021-12-01'),(255200, '2021-12-01'),
    (255201, '2021-12-01'),(255203, '2021-12-01'),(255204, '2021-12-01'),
    (255205, '2021-12-01');
    -- Full dataset has 40979 records
    
    CREATE TABLE `student_engagement` 
    (`student_id` INT DEFAULT NULL, 
    `date_watched` DATE DEFAULT NULL);
    
    INSERT INTO `student_engagement` VALUES
    (255200, '2021-12-01'),(255201, '2021-12-02'),(255201, '2021-12-06'),
    (255201, '2021-12-08'),(255201, '2022-10-05'),(255203, '2022-03-08'),
    (255203, '2022-03-17'),(255203, '2022-03-23'),(255203, '2021-12-01'),
    (255203, '2022-01-25'),(255203, '2022-01-26'),(255203, '2022-01-27'),
    (255203, '2022-02-08'),(255203, '2022-02-09'),(255203, '2022-02-21'),
    (255203, '2022-02-22'),(255203, '2022-02-23'),(255204, '2021-12-01'),
    (255204, '2021-12-02'),(255204, '2021-12-03'),(255204, '2021-12-14'),
    (255204, '2021-12-17'),(255204, '2021-12-07'),(255204, '2021-12-05'),
    (255204, '2021-12-09'),(255204, '2021-12-23'),(255204, '2022-01-11'),
    (255204, '2022-01-18'),(255204, '2022-03-05'),(255204, '2022-06-19'),
    (255204, '2022-07-06'),(255204, '2021-12-04'),(255204, '2021-12-18'),
    (255204, '2021-12-20'),(255204, '2021-12-29'),(255204, '2022-03-28'),
    (255204, '2022-03-31'),(255204, '2022-07-17'),(255204, '2022-07-21'),
    (255204, '2022-07-31'),(255204, '2021-12-12'),(255204, '2021-12-13'),
    (255204, '2021-12-21'),(255204, '2021-12-27'),(255204, '2022-01-07'),
    (255204, '2021-12-08'),(255204, '2022-01-27'),(255205, '2021-12-01'),
    (255205, '2021-12-02'),(255205, '2021-12-07'),(255205, '2022-01-22'),
    (255205, '2022-04-05');
    -- Full dataset has 74246 records
    
    CREATE TABLE `student_purchases` 
    (`purchase_id` INT NOT NULL, 
    `student_id` INT DEFAULT NULL, 
    `date_purchased` DATE DEFAULT NULL, 
    PRIMARY KEY (`purchase_id`));
    
    INSERT INTO `student_purchases` VALUES
    (15055, 255201, '2021-12-01'),(15057, 255203, '2021-12-01'),
    (15059, 255204, '2021-12-01'),(15060, 255196, '2021-12-01');
    -- Full dataset has 5922 records

The objective is to filter out the records where ‘first_date_purchased’ is less than ‘first_date_watched’ in a single select statement, in the most effective way possible. I’m writing this query as a student in Data Analysis and am still new to programming in general.

Thank you in advance for you assistance!

2

Answers


  1. If this is your actual DDL, you should create some indexes for the columns you’re joining on:
    student_purchases.student_id and student_engagement.student_id.
    If it’s still too slow, consider creating an index on the columns where you’re calculating the MIN of

    I’m not 100% sure about the syntax for mysql but you should get the idea 😉

    CREATE INDEX sp_student_id ON student_purchases (`student_id`);
    CREATE INDEX se_student_id ON student_engagement (`student_id`);
    

    As a rule of thumb, you would typically create an index on your foreign key columns, as these are typically the columns, you’re joining on.

    Login or Signup to reply.
  2. Consider:
    Using a HAVING clause which executes like a where clause AFTER the selected values and aggregation is complete.
    Note it isn’t standard SQL to not include all non-aggregated values in a select in the Where clause, so I added the date-registered. MySQL supports this syntax but it can be dangerous to use if you don’t fully understand what it’s doing. (Basically it’s free to pick any value in the group even if those values are different) so if a student_Id had different date_Registered for some reason, you could get different results in each run. STANDARD SQL is to always group by all values not aggregated in the select.

    Demo dbfiddle.uk

       SELECT se.student_id,
               si.date_registered,
               MIN(se.date_watched) AS first_date_watched,
               MIN(sp.date_purchased) AS first_date_purchased
        FROM student_engagement se
          LEFT JOIN student_info si ON se.student_id = si.student_id
          LEFT JOIN student_purchases sp ON se.student_id = sp.student_id
        GROUP BY se.student_id, si.date_registered
        HAVING first_date_purchased >= first_date_watched
         or first_date_purchased is null
        -- or use this in the having coalesce(first_date_purchased,first_date_watched) >= first_date_watched
        ORDER BY first_date_purchased DESC
    

    It’s odd that you expect 4 records to me. Since first_date_purchased is null on two of the 5 records and would fail equality checks. but we can account for that by using an Or in the having. or it might be faster to coalesce

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search