I want to query, for each machine_id, the row with the latest receive_time from the table below. There are about 130 million records in the machine_data table.
machine data table:
create table public.machine_data(
machine_id integer not null
,receive_time timestamp (6) with time zone not null
,code integer not null
,primary key (machine_id, receive_time)
);
machine master table:
create table public.machine_info (
machine_id serial not null
, model integer not null
, primary key (machine_id)
);
At first, I used the following SQL, but the retrieval in my Java service timed out.
WITH LatestMachineData AS (
SELECT
machine_id
, receive_time
, code
, ROW_NUMBER() OVER (PARTITION BY machine_id ORDER BY receive_time DESC) as rn
FROM
machine_data
)
SELECT
*
FROM
LatestMachineData
WHERE
rn = 1;
Then I asked an AI and got the following SQL, but the service still timed out.
SELECT
md.*
FROM
machine_data md
inner join machine_info mi
on mi.machine_id = md.machine_id
INNER JOIN (
SELECT
machine_id
, MAX(receive_time) AS latest_receive_time
FROM
machine_data
GROUP BY
machine_id
) latest_data
ON mi.machine_id = latest_data.machine_id
AND md.receive_time = latest_data.latest_receive_time;
Finally, I tried the following SQL, and it works very well.
select
*
from
machine_data md
where
(md.machine_id,md.receive_time) in (
SELECT
a.machine_id
, a.receive_time as receive_time
FROM
(
SELECT
mi.machine_id
, (
SELECT
receive_time
FROM
machine_data md
WHERE
mi.machine_id = md.machine_id
ORDER BY
receive_time DESC
LIMIT
1
) as receive_time
FROM
machine_info as mi
) a
where
a.receive_time is not null
);
The question is:
Why is the ORDER BY ... LIMIT approach so much faster than the others?
Can the first and second SQL statements be optimized?
2 Answers
The first query is slow by design: it must sort all the data and then keep only the records where the row number equals 1. You're asking the database to process all 130 million records, and that's what it does.
The second query is rather unusual: there is no foreign key between the tables, and you're relying on a sequence-generated machine_id in one table matching a plain integer column in the other, two columns that suddenly relate to each other only in the JOIN. I would not trust the results from this query; they could be more or less random.
The same issue applies to the third one: the relation between the tables is just as unreliable. Performance is, imho, irrelevant when the result isn't reliable.
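If the relationship between the tables is intended, it can be made explicit with a foreign key; a sketch (the constraint name is an assumption):

alter table public.machine_data
  add constraint machine_data_machine_id_fkey
  foreign key (machine_id) references public.machine_info (machine_id);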
The first query has to read all of the rows from machine_data, sort them to determine rn, and then scan the result set to select only those that have rn = 1. The second query is somewhat more efficient; however, it still requires reading significantly more data than is necessary. Similar results can be achieved by using DISTINCT ON (md.machine_id) with ORDER BY md.machine_id, md.receive_time DESC instead of using GROUP BY.
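A sketch of that DISTINCT ON rewrite (column list assumed from the OP's schema):

SELECT DISTINCT ON (md.machine_id)
  md.machine_id
  , md.receive_time
  , md.code
FROM
  machine_data md
ORDER BY
  md.machine_id
  , md.receive_time DESC;

DISTINCT ON keeps only the first row of each machine_id group, and the ORDER BY makes that first row the one with the latest receive_time.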
The third approach is able to use the index on (machine_id, receive_time) to efficiently retrieve only the desired rows. If the number of rows per machine_id in machine_data were small, then this would likely not be as fast as the second query because of the I/O required to process the large number of subqueries.

The following establishes a test environment (I've added a foreign key that is missing from the OP's original post):
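A minimal sketch of such a setup, assuming the OP's two tables plus the added foreign key; the composite primary key supplies the (machine_id, receive_time) index the queries rely on:

create table public.machine_info (
  machine_id serial not null
  , model integer not null
  , primary key (machine_id)
);

create table public.machine_data (
  machine_id integer not null references public.machine_info (machine_id)
  , receive_time timestamp (6) with time zone not null
  , code integer not null
  , primary key (machine_id, receive_time)
);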
The following DO block populates machine_info and machine_data. The constants declared at the top of the block control the data characteristics. This facilitates exploring the behavior of queries as related to data distribution.
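A sketch of such a generator; the constants (number of machines, rows requested per machine) and the random-value ranges are assumptions, not the original answer's values:

DO $$
DECLARE
  -- assumed data-shape constants: 1,000 machines x 10,000 rows = 10,000,000 rows requested
  machine_count    constant integer := 1000;
  rows_per_machine constant integer := 10000;
BEGIN
  -- machine_id is supplied by the serial sequence
  INSERT INTO machine_info (model)
  SELECT (random() * 10)::integer
  FROM generate_series(1, machine_count);

  -- random receive_time within the last year; duplicate (machine_id, receive_time)
  -- pairs are silently skipped, so slightly fewer rows than requested are inserted
  INSERT INTO machine_data (machine_id, receive_time, code)
  SELECT mi.machine_id
       , now() - make_interval(secs => random() * 86400 * 365)
       , (random() * 100)::integer
  FROM machine_info mi
  CROSS JOIN generate_series(1, rows_per_machine) AS gs(n)
  ON CONFLICT (machine_id, receive_time) DO NOTHING;
END
$$;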
When generating rows for machine_data, some pairs of (machine_id, receive_time) occurred more than once. Those duplicates were ignored, which resulted in only 9,994,981 rows instead of the 10,000,000 requested. This difference is insignificant with regard to evaluating query execution performance.

The following is an efficient query (in terms of both code structure and execution time) for retrieving the most recent data for each machine when there are many rows in machine_data for each machine_id:
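A sketch of this style of query, using a LATERAL subquery so that each machine is resolved with a single index descent (the exact form is an assumption):

SELECT md.*
FROM machine_info mi
CROSS JOIN LATERAL (
  SELECT *
  FROM machine_data
  WHERE machine_data.machine_id = mi.machine_id
  ORDER BY receive_time DESC
  LIMIT 1
) md;

As written it drops machines that have no rows in machine_data, matching the OP's third query; LEFT JOIN LATERAL ... ON true would keep them.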
The explain plan shows that the query is taking advantage of the index on (machine_id, receive_time) to reduce the number of required data reads.