I want to query, for each machine_id, the row with the latest receive_time from the table below. There are about 130 million records in the machine_data table.
machine data table:
create table public.machine_data(
machine_id integer not null
,receive_time timestamp (6) with time zone not null
,code integer not null
,primary key (machine_id, receive_time)
);
machine master table:
create table public.machine_info (
machine_id serial not null
, model integer not null
, primary key (machine_id)
);
At first, I used the following SQL, but the retrieval in my Java service timed out.
WITH LatestMachineData AS (
SELECT
machine_id
, receive_time
, code
, ROW_NUMBER() OVER (PARTITION BY machine_id ORDER BY receive_time DESC) as rn
FROM
machine_data
)
SELECT
*
FROM
LatestMachineData
WHERE
rn = 1;
Then I asked an AI and got the following SQL, but the service still timed out.
SELECT
md.*
FROM
machine_data md
inner join machine_info mi
on mi.machine_id = md.machine_id
INNER JOIN (
SELECT
machine_id
, MAX(receive_time) AS latest_receive_time
FROM
machine_data
GROUP BY
machine_id
) latest_data
ON mi.machine_id = latest_data.machine_id
AND md.receive_time = latest_data.latest_receive_time;
Finally, I tried the following SQL, and it works very well.
select
*
from
machine_data md
where
(md.machine_id,md.receive_time) in (
SELECT
a.machine_id
, a.receive_time as receive_time
FROM
(
SELECT
mi.machine_id
, (
SELECT
receive_time
FROM
machine_data md
WHERE
mi.machine_id = md.machine_id
ORDER BY
receive_time DESC
LIMIT
1
) as receive_time
FROM
machine_info as mi
) a
where
a.receive_time is not null
);
The question is:
Why is the ORDER BY ... LIMIT approach so much faster than the others?
Can the first and second SQL statements be optimized?
2 Answers
The first query is slow by design: it must sort all the data and then keep only the records where the row number equals 1. You're asking the database to process all 130 million records, and that's what it does.
The second query is rather unusual: there is no foreign key between the tables, and you're relying on a sequence-generated machine_id in one table matching a plain integer column in the other, two columns that suddenly relate to each other only in the JOIN. I would not trust the results from this query; they could be more or less random.
The same issue applies to the third one: the relation between the tables is just as unreliable. Performance is, imho, irrelevant when the result isn't reliable.
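If the relationship between the tables is intended, it can be made explicit with a foreign key; a sketch (the constraint name is an assumption):

alter table public.machine_data
  add constraint machine_data_machine_id_fkey
  foreign key (machine_id) references public.machine_info (machine_id);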
The first query has to read all of the rows from machine_data, sort them to determine rn, and then scan the result set to select only those that have rn = 1. The second query is somewhat more efficient; however, it still requires reading significantly more data than is necessary. Similar results can be achieved by using DISTINCT ON (md.machine_id) with ORDER BY md.machine_id, md.receive_time DESC instead of using GROUP BY.
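A sketch of that DISTINCT ON rewrite (column list assumed from the OP's schema):

SELECT DISTINCT ON (md.machine_id)
  md.machine_id
  , md.receive_time
  , md.code
FROM
  machine_data md
ORDER BY
  md.machine_id
  , md.receive_time DESC;

DISTINCT ON keeps only the first row of each machine_id group, and the ORDER BY makes that first row the one with the latest receive_time.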
The third approach is able to use the index on (machine_id, receive_time) to efficiently retrieve only the desired rows. If the number of rows per machine_id in machine_data were small, then this would likely not be as fast as the second query because of the I/O required to process the large number of subqueries.

The following establishes a test environment (I've added a foreign key that is missing from the OP's original post):
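A minimal sketch of such a setup, assuming the OP's two tables plus the added foreign key; the composite primary key supplies the (machine_id, receive_time) index the queries rely on:

create table public.machine_info (
  machine_id serial not null
  , model integer not null
  , primary key (machine_id)
);

create table public.machine_data (
  machine_id integer not null references public.machine_info (machine_id)
  , receive_time timestamp (6) with time zone not null
  , code integer not null
  , primary key (machine_id, receive_time)
);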
The following DO block populates machine_info and machine_data. The constants declared at the top of the block control the data characteristics. This facilitates exploring the behavior of queries as related to data distribution.
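A sketch of such a generator; the constants (number of machines, rows requested per machine) and the random-value ranges are assumptions, not the original answer's values:

DO $$
DECLARE
  -- assumed data-shape constants: 1,000 machines x 10,000 rows = 10,000,000 rows requested
  machine_count    constant integer := 1000;
  rows_per_machine constant integer := 10000;
BEGIN
  -- machine_id is supplied by the serial sequence
  INSERT INTO machine_info (model)
  SELECT (random() * 10)::integer
  FROM generate_series(1, machine_count);

  -- random receive_time within the last year; duplicate (machine_id, receive_time)
  -- pairs are silently skipped, so slightly fewer rows than requested are inserted
  INSERT INTO machine_data (machine_id, receive_time, code)
  SELECT mi.machine_id
       , now() - make_interval(secs => random() * 86400 * 365)
       , (random() * 100)::integer
  FROM machine_info mi
  CROSS JOIN generate_series(1, rows_per_machine) AS gs(n)
  ON CONFLICT (machine_id, receive_time) DO NOTHING;
END
$$;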
When generating rows for machine_data, some pairs of (machine_id, receive_time) occurred more than once. Those duplicates were ignored, which resulted in only 9,994,981 rows instead of the 10,000,000 requested. This difference is insignificant with regard to evaluating query execution performance.

The following is an efficient query (in terms of both code structure and execution time) for retrieving the most recent data for each machine when there are many rows in machine_data for each machine_id:
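A sketch of this style of query, using a LATERAL subquery so that each machine is resolved with a single index descent (the exact form is an assumption):

SELECT md.*
FROM machine_info mi
CROSS JOIN LATERAL (
  SELECT *
  FROM machine_data
  WHERE machine_data.machine_id = mi.machine_id
  ORDER BY receive_time DESC
  LIMIT 1
) md;

As written it drops machines that have no rows in machine_data, matching the OP's third query; LEFT JOIN LATERAL ... ON true would keep them.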
The explain plan shows that the query is taking advantage of the index on (machine_id, receive_time) to reduce the number of required data reads.