
Suppose I have a table A with columns: (id, item, price). It has records like

id item price
1 banana 1
2 banana 1
3 apple 2

I want to calculate the average price after deduplication based on the item column. In the example that I just gave, the average should be (1 + 2)/ 2 = 1.5.

There is a simple-minded way of doing this:

SELECT AVG(price) FROM (SELECT MIN(price) AS price FROM A GROUP BY item) t

However, in reality I have a huge table, so it is not realistic to run the subquery first. I wonder whether there is any way around this.

3 Answers


  1. Outside of deduplicating this per some of the comments, one approach is to add an index on both item and price. With one million rows, the execution time drops from ~0.875 s to a reported 0.000 s (below the client's timer resolution).

    SELECT AVG(price)
    FROM (
        SELECT MIN(price) price
        FROM A
        GROUP BY item
    ) t;
    
    --  1 row(s) returned   0.875 sec / 0.000 sec
    
    ALTER TABLE A ADD INDEX itemPriceIndex (item, price);
    --  0 row(s) affected Records: 0  Duplicates: 0  Warnings: 0    1.891 sec
    
    SELECT AVG(price)
    FROM (
        SELECT MIN(price) price
        FROM A
        GROUP BY item
    ) t;
    -- 1 row(s) returned    0.000 sec / 0.000 sec
    

    Data generation code (only 100,000 rows generated below but timings above were 1,000,000 rows):

    DROP DATABASE IF EXISTS stackOverflow;
    CREATE DATABASE stackOverflow;
    USE stackOverflow;
    
    CREATE TABLE A (
       id INT NOT NULL AUTO_INCREMENT,
       item VARCHAR(20),
       price INT,
       PRIMARY KEY (id)
    );
    
    DROP PROCEDURE if EXISTS generateData;
    delimiter #
    CREATE PROCEDURE generateData(in nReps int)
    BEGIN
        DECLARE v_counter int unsigned DEFAULT 0;
        
        TRUNCATE TABLE A;
        START transaction;
        while v_counter < nReps DO
            INSERT INTO A (item, price) VALUES ("banana", 1);
            SET v_counter = v_counter + 1;
        END WHILE;
        INSERT INTO A (item, price) VALUES("apple", 2);
        COMMIT;
    END #
    delimiter ;
    CALL generateData(100000);
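
    One way to check that the new index is actually being used is to ask MySQL for the query plan. This is a minimal sketch, assuming the table A and the index itemPriceIndex defined in this answer; the exact EXPLAIN output varies with the MySQL version and the table statistics.

    ```sql
    -- Show the plan for the inner, per-item aggregation.
    EXPLAIN
    SELECT MIN(price) AS price
    FROM A
    GROUP BY item;
    -- If the optimizer picks the index, the "key" column should name itemPriceIndex,
    -- and "Extra" may report "Using index for group-by" (a loose index scan).
    ```

    If the plan shows a full table scan instead, the timings above would not be expected to carry over.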
    
  2. You can use AVG() window function after deduplication:

    SELECT DISTINCT AVG(MAX(price)) OVER () AS avg_price
    FROM tablename
    GROUP BY item;
    

    See the demo.
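
    For reference, the query can be tried against the three sample rows from the question. This is a sketch that assumes MySQL 8.0 or later (window functions are not available before that) and uses a scratch table named tablename to match the query above. Note that this answer takes MAX(price) per item where the question uses MIN(price); the two agree whenever duplicate items share the same price, as they do in the sample.

    ```sql
    -- Scratch table mirroring the question's sample data (hypothetical name).
    CREATE TABLE tablename (
        id INT NOT NULL AUTO_INCREMENT,
        item VARCHAR(20),
        price INT,
        PRIMARY KEY (id)
    );

    INSERT INTO tablename (item, price)
    VALUES ('banana', 1), ('banana', 1), ('apple', 2);

    -- GROUP BY collapses the rows to one per item (banana -> 1, apple -> 2);
    -- AVG(...) OVER () then averages those per-item values across all groups,
    -- and DISTINCT reduces the identical per-group results to a single row.
    SELECT DISTINCT AVG(MAX(price)) OVER () AS avg_price
    FROM tablename
    GROUP BY item;
    ```

    For the sample rows this should yield the average of 1 and 2, that is 1.5, in a single row.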

  3. First of all, it is unlikely that the DBMS will deduplicate, so your question is misleading. The DBMS will scan the table and create a list of unique values.

    If you want to improve the performance of this query, you need to look at the statistics of how this query is performed.

    1. This query can only be answered with at least one pass. Unless you have extra constructs (like a table that gets updated with a trigger), the DBMS needs to read every tuple to find: a) the different values of item, and b) the minimum price for each. In other words, the fastest time in which this query can be answered (even in the presence of indices) is the time it takes to do a select * from table.

    2. This query requires space proportional to the number of different values of item (this number is stored in the statistics). If there is not enough memory in the DBMS to store all these different values and their minimums, a temp table will be created on disk.

    3. If there is not enough memory for all the different tuples, the query will start to degrade because it needs to do some sorting (effectively turning your O(n) query into an O(n log n) one).

    So, if I were you, I would:

    1. Inspect the plan that the DBMS uses to answer the query and make sure it is linear in the retrieval of the tuples (a sequential scan). Use EXPLAIN to do this.

    2. If it is not, it is most likely due to lack of memory. Can you increase it?

    3. If neither works, you need to think of a different schema to solve this query (a trigger can keep the minimum value for each item in a table with little overhead).
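
    The trigger idea in the last step could look roughly like the following in MySQL. This is only a sketch under stated assumptions: the summary table item_min_price and the trigger a_after_insert are hypothetical names, item and price are assumed to be non-NULL, and only inserts are handled (updates and deletes on A would need their own triggers to keep the minimum correct).

    ```sql
    -- Hypothetical summary table: one row per item with its current minimum price.
    CREATE TABLE item_min_price (
        item VARCHAR(20) NOT NULL,
        min_price INT NOT NULL,
        PRIMARY KEY (item)
    );

    -- One-off backfill from the existing rows (skips NULLs, per the assumption above).
    INSERT INTO item_min_price (item, min_price)
    SELECT item, MIN(price)
    FROM A
    WHERE item IS NOT NULL AND price IS NOT NULL
    GROUP BY item;

    -- Keep the summary current on every insert into A:
    -- add the item if it is new, otherwise lower the stored minimum if needed.
    CREATE TRIGGER a_after_insert
    AFTER INSERT ON A
    FOR EACH ROW
        INSERT INTO item_min_price (item, min_price)
        VALUES (NEW.item, NEW.price)
        ON DUPLICATE KEY UPDATE min_price = LEAST(min_price, NEW.price);

    -- The average over deduplicated items now reads only the small summary table.
    SELECT AVG(min_price) AS avg_price
    FROM item_min_price;
    ```

    The trade-off is a small extra write on each insert in exchange for a read that no longer touches the large table.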
