skip to Main Content

I want to query the last data from the database table below for the receive time of each machine_id.There are about 130 million records in the table machine_data.

machine data table:

create table public.machine_data( 
    machine_id integer not null 
    ,receive_time timestamp (6) with time zone not null 
    ,code integer not null 
    ,primary key (machine_id, receive_time)
);

machine master table:

create table public.machine_info (
  machine_id serial not null
  , model integer not null
  , primary key (machine_id)
);

At the beginning, I used the following SQL. My Java service retrieval timed out.

WITH LatestMachineData AS ( 
    SELECT
        machine_id
        , receive_time
        , code
        , ROW_NUMBER() OVER (PARTITION BY machine_id ORDER BY receive_time DESC) as rn 
    FROM
        machine_data
) 
SELECT
    * 
FROM
    LatestMachineData 
WHERE
    rn = 1;

Then I asked the AI and got the following SQL.But service still timed out.

SELECT
    md.* 
FROM
    machine_data md 
    inner join machine_info mi 
        on mi.machine_id = md.machine_id 
    INNER JOIN ( 
        SELECT
            machine_id
            , MAX(receive_time) AS latest_receive_time 
        FROM
            machine_data 
        GROUP BY
            machine_id
    ) latest_data 
        ON mi.machine_id = latest_data.machine_id 
        AND md.receive_time = latest_data.latest_receive_time;

finally I try the following SQL, and it works very well.

select
    * 
from
    machine_data md 
where
    (md.machine_id,md.receive_time) in ( 
        SELECT
            a.machine_id
            , a.receive_time as receive_time 
        FROM
            ( 
                SELECT
                    mi.machine_id
                    , ( 
                        SELECT
                            receive_time 
                        FROM
                            machine_data md 
                        WHERE
                            mi.machine_id = md.machine_id 
                        ORDER BY
                            receive_time DESC 
                        LIMIT
                            1
                    ) as receive_time 
                FROM
                    machine_info as mi 
            ) a 
        where
            a.receive_time is not null
    );

The question is:
why is ORDER BY LIMIT so faster then others ?
Can the first and second SQL statements be optimized?

2

Answers


  1. The first query is slow by design, it must sort all data and then keep only records where the row number equals 1. You’re asking the database to use all 130 million records and that’s what it does.

    The second query is rather unusual since there is no foreign key between the table and you’re using a sequence for both columns that suddenly relate to each other in the JOIN. I would not trust the results from this query, could be sort of random results.

    Same issue for the 3rd one, also an unreliable relation between the tables. Performance is imho irrelevant when the result isn’t reliable.

    Login or Signup to reply.
  2. The first query has to read all of the rows from machine_data, sort them to determine rn, and then scan the result set to select only those that have rn = 1.

    The second query is somewhat more efficient; however, it still requires reading significantly more data than is necessary. Similar results can be achieved by using DISTINCT ON (md.machine_id) with ORDER BY md.machine_id, md.receive_time DESC instead of using GROUP BY.

    The third approach is able to use the index on (machine_id, receive_time) to efficiently retrieve only the desired rows. If the number of rows per machine_id in machine_data were small, then this would likely not be as fast as the second query because of the I/O required to process the large number of subqueries.

    The following establishes a test environment (I’ve added a foreign key that is missing from the OP’s orginal post):

    DROP TABLE IF EXISTS machine_data;
    
    DROP TABLE IF EXISTS machine_info;
    
    CREATE TABLE public.machine_info (
      machine_id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      model INTEGER NOT NULL,
      delete_flag BOOLEAN NOT NULL DEFAULT FALSE
    );
    
    CREATE TABLE public.machine_data (
      machine_id INTEGER NOT NULL REFERENCES machine_info (machine_id),
      receive_time TIMESTAMP(6) WITH TIME ZONE NOT NULL,
      code INTEGER NOT NULL,
      PRIMARY KEY (machine_id, receive_time)
    );
    

    The following DO block populates machine_info and machine_data. The constants declared at the top of the block control the data characteristics. This facilitates exploring the behavior of queries as related to data distribution.

    DO language plpgsql $block$
    DECLARE
      fraction_deleted constant float := 0.01;
      num_codes constant integer := 5;
      num_machines constant integer := 10000;
      num_models constant integer := 10;
      num_samples constant integer := 10000000;
    BEGIN
    INSERT INTO
      machine_info (model, delete_flag)
    SELECT
      CEIL(RANDOM() * num_models)::INTEGER,
      RANDOM() * 100 <= fraction_deleted
    FROM
      GENERATE_SERIES(1, num_machines) gs (n);
    
    INSERT INTO
      machine_data (machine_id, receive_time, code)
    SELECT
      CEIL(RANDOM() * num_machines)::INTEGER,
      '2024-01-01'::TIMESTAMP + INTERVAL '1' SECOND * RANDOM(),
      CEIL(RANDOM() * num_codes)::INTEGER
    FROM
      GENERATE_SERIES(1, num_samples) AS gs (n)
    ON CONFLICT DO NOTHING;
    END;
    $block$;
    

    When generating rows for machine_data, some pairs of (machine_id, receive_time) occurred more than once. Those duplicates were ignored which resulted in only 9,994,981 rows instead of the 10,000,000 requested. This difference is insignificant in regards to evaluating query execution performance.

    The following is an efficient query (in terms of both code structure and execution time) for retrieving the most recent data for each machine when there are many rows in machine_data for each machine_id:

    SELECT
      md.*
    FROM
      machine_data md
    WHERE
      (md.machine_id, md.receive_time) IN (
        SELECT
          mi.machine_id,
          (
            SELECT
              MAX(md.receive_time)
            FROM
              machine_data md
            WHERE
              md.machine_id = mi.machine_id
          ) AS receive_time
        FROM
          machine_info mi
      );
    

    The explain plan shows that the query is taking advantage of the index on (machine_id, receive_time) to reduce the number of required data reads.

    QUERY PLAN
    Nested Loop  (cost=0.92..82986.52 rows=11 width=16) (actual time=0.060..285.896 rows=10000 loops=1)
      Output: md.machine_id, md.receive_time, md.code
      Inner Unique: true
      Buffers: shared hit=58178 read=16864 written=2114
      ->  Seq Scan on public.machine_info mi  (cost=0.00..155.00 rows=10000 width=4) (actual time=0.008..1.674 rows=10000 loops=1)
            Output: mi.machine_id, mi.model, mi.delete_flag
            Buffers: shared hit=18 read=37 written=4
      ->  Index Scan using machine_data_pkey on public.machine_data md  (cost=0.92..8.28 rows=1 width=16) (actual time=0.007..0.007 rows=1 loops=10000)
            Output: md.machine_id, md.receive_time, md.code
            Index Cond: ((md.machine_id = mi.machine_id) AND (md.receive_time = (SubPlan 2)))
            Buffers: shared hit=33172 read=6828 written=858
            SubPlan 2
              ->  Result  (cost=0.48..0.49 rows=1 width=8) (actual time=0.020..0.020 rows=1 loops=10000)
                    Output: $1
                    Buffers: shared hit=24988 read=9999 written=1252
                    InitPlan 1 (returns $1)
                      ->  Limit  (cost=0.43..0.48 rows=1 width=8) (actual time=0.020..0.020 rows=1 loops=10000)
                            Output: md_1.receive_time
                            Buffers: shared hit=24988 read=9999 written=1252
                            ->  Index Only Scan Backward using machine_data_pkey on public.machine_data md_1  (cost=0.43..44.47 rows=1002 width=8) (actual time=0.019..0.019 rows=1 loops=10000)
                                  Output: md_1.receive_time
                                  Index Cond: ((md_1.machine_id = mi.machine_id) AND (md_1.receive_time IS NOT NULL))
                                  Heap Fetches: 2
                                  Buffers: shared hit=24988 read=9999 written=1252
    Planning:
      Buffers: shared hit=18 read=5
    Planning Time: 0.520 ms
    Execution Time: 286.905 ms
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search