Overview
I am trying to develop a query that selects as many rows as necessary so that I update up to LIMIT
unique thing.thing_id
s and is ordered by the status_1_date
column (meaning older things should be updated first). thing_id
is NOT the primary key.
Example Schema
CREATE TABLE IF NOT EXISTS thing (
name VARCHAR(255) PRIMARY KEY,
thing_id VARCHAR(255),
c_id VARCHAR(255),
status VARCHAR(255),
etag VARCHAR(255),
last_modified VARCHAR(255),
size VARCHAR(255),
status_1_date DATE
);
Sample Data
INSERT INTO thing (name,thing_id, c_id,status, status_1_date)
values
('protocol://full/path/to/thing/thing_1.file1', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:01'),
('protocol://full/path/to/thing/thing_1.file2', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:02'),
('protocol://full/path/to/thing/thing_2.file5', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:02.5'),
('protocol://something else', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:02.8'),
('protocol://full/path/to/thing/thing_2.file1', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:03'),
('protocol://full/path/to/thing/thing_2.file2', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:04'),
('protocol://full/path/to/thing/thing_2.file3', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:05'),
('protocol://full/path/to/thing/thing_2.file4', 'thing_2','different', 'status_1', '2023-09-29 09:00:05'),
('protocol://full/path/to/thing/thing_3.file1', 'thing_3','c_id', 'status_1', '2023-09-29 09:00:06'),
('protocol://full/path/to/thing/thing_3.file2', 'thing_3','c_id', 'status_1', '2023-09-29 09:00:06.2'),
('protocol://full/path/to/thing/thing_4.file1', 'thing_4','c_id', 'status_1', '2023-09-29 09:00:06.1'),
('protocol://full/path/to/thing/thing_4.file2', 'thing_4','c_id', 'status_1', '2023-09-29 09:00:06.3'),
('protocol://full/path/to/thing/thing_5.file1', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.4'),
('protocol://full/path/to/thing/thing_5.file2', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.5'),
('protocol://full/path/to/thing/thing_5.file3', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.6'),
('protocol://full/path/to/thing/thing_6.file1', 'thing_6','c_id', 'status_1', '2023-09-29 09:00:06.7');
Desired Results
Using this data I am trying to get a query that updates rows 1-6 and 7-9 when LIMIT 3
is used. The rows for thing_id
s thing_5
, thing_6
, and thing_1
that match there WHERE
clause should be updated.
Additional Details
The issues I am trying to solve is I work with a large database and I need to limit the data returned from it but ensure I get a complete set of data for the entities I am querying for. The thing.thing_id
is not the primary key in our database however the thing_id
represents an entity that could be composed of data from N rows as indicated by a shared thing_id
. I need to make sure I get a complete set of rows for all thing entities where LIMIT
represents the total number of thing entities I am fetching data for.
This question spawned from a previous thread but no solution: Unable to get Postgresql LIMIT to work with UPDATE and subquery
Issues with Answer
- It does not get the
thing_id
with the oldest file: https://dbfiddle.uk/GStIbrYA - If two queries targeting the same
thing.c_id
andthing.name
prefix run at the same time a double update will happen. This update in the database I am working on could be working with 100,000s of rows at one time so this is not desirable.
Alternative Attempt 1
It looks like FOR UPDATE
might be needed to prevent the double update from occurring but it cannot be used with the window function.
with ranked_matching_things as (
select
dense_rank() over (order by thing_id) as thing_rank,
name
from thing
where c_id = 'c_id'
and name like 'protocol://full/path/to/thing/%'
order by status_1_date
for update
)
update thing
set status = 'CHANGED'
from ranked_matching_things
where thing.name = ranked_matching_things.name
and ranked_matching_things.thing_rank <= 3
ERROR: FOR UPDATE is not allowed with window functions
Alternative Attempt 2
- It does not get the
thing_id
with the oldest file first: https://dbfiddle.uk/tMG-vvU4 - It seems redundant with the duplication of the where clause
UPDATE thing
SET status = 'CHANGED'
FROM (
SELECT name
FROM thing
WHERE thing.thing_id in (
SELECT DISTINCT thing.thing_id
FROM thing
WHERE c_id = 'c_id' AND name like 'protocol://full/path/to/thing/%' AND status = 'status_1'
ORDER BY thing.thing_id
LIMIT (3)
) AND thing.c_id = 'c_id' AND thing.name like 'protocol://full/path/to/thing/%' AND thing.status = 'status_1'
ORDER BY thing.status_1_date
FOR UPDATE
) AS rows
WHERE thing.name = rows.name
RETURNING *
Test Setup
The following link may be useful for testing: https://dbfiddle.uk/ZTcbkqsF
2
Answers
I was able to derive the following, likely suboptimal, solution.
Query to get a list of valid
thing_id
s with the oldest file associated:Query to get
LIMIT
thing_id
s getting thething_id
s with the oldest file first:Query to get the necessary rows
FOR UPDATE
:Query to
UPDATE
valid rows:Solution: https://dbfiddle.uk/Z7trYUKa
Revision 1:
Revision 1 Solution: https://dbfiddle.uk/dg8poZaA
I'm sure there must be some way to prevent the redundant
WHERE
clause but I have been unable to accomplish this so far.This can’t be done with a limit, because it’s too… limited. Instead we can use a window function. Let’s break it down…
First, we identify the matching rows and put them in order.
Simple enough.
From those results, we need to pick the first 3 distinct matching thing_ids. We can add a column to the result with
dense_rank
.Now we have a way to identify the rows which correspond to the first three thing_ids: thing_1 has a rank of 1, thing_2 has a rank of 2, and thing_3 has a rank of 3.
We use
dense_rank
, notrank
, to ensure the ranks are 1, 2, 3… Otherwise we’d have gaps.Join with that sub-query on the primary key, and only update those with a rank <= 3.
Demonstration