SQL Query Help in Postgresql

CogitoErgoSum
October 9, 2023
233 views
0 votes
2 Answers

Overview
I am trying to develop a query that selects as many rows as necessary so that I update up to LIMIT unique thing.thing_ids and is ordered by the status_1_date column (meaning older things should be updated first). thing_id is NOT the primary key.

Example Schema

CREATE TABLE IF NOT EXISTS thing (
  name VARCHAR(255) PRIMARY KEY,
  thing_id VARCHAR(255),
  c_id VARCHAR(255),
  status VARCHAR(255),
  etag VARCHAR(255),
  last_modified VARCHAR(255),
  size VARCHAR(255),
  status_1_date DATE
);

Sample Data

INSERT INTO thing (name,thing_id, c_id,status, status_1_date)
values 
('protocol://full/path/to/thing/thing_1.file1', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:01'),
('protocol://full/path/to/thing/thing_1.file2', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:02'),
('protocol://full/path/to/thing/thing_2.file5', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:02.5'),
  ('protocol://something else', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:02.8'),
('protocol://full/path/to/thing/thing_2.file1', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:03'),
('protocol://full/path/to/thing/thing_2.file2', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:04'),
('protocol://full/path/to/thing/thing_2.file3', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:05'),
('protocol://full/path/to/thing/thing_2.file4', 'thing_2','different', 'status_1', '2023-09-29 09:00:05'),
('protocol://full/path/to/thing/thing_3.file1', 'thing_3','c_id', 'status_1', '2023-09-29 09:00:06'),
('protocol://full/path/to/thing/thing_3.file2', 'thing_3','c_id', 'status_1', '2023-09-29 09:00:06.2'),
('protocol://full/path/to/thing/thing_4.file1', 'thing_4','c_id', 'status_1', '2023-09-29 09:00:06.1'),
('protocol://full/path/to/thing/thing_4.file2', 'thing_4','c_id', 'status_1', '2023-09-29 09:00:06.3'),
('protocol://full/path/to/thing/thing_5.file1', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.4'),
('protocol://full/path/to/thing/thing_5.file2', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.5'),
('protocol://full/path/to/thing/thing_5.file3', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.6'),
('protocol://full/path/to/thing/thing_6.file1', 'thing_6','c_id', 'status_1', '2023-09-29 09:00:06.7');

Desired Results
Using this data I am trying to get a query that updates rows 1-6 and 7-9 when LIMIT 3 is used. The rows for thing_ids thing_5, thing_6, and thing_1 that match there WHERE clause should be updated.

Additional Details
The issues I am trying to solve is I work with a large database and I need to limit the data returned from it but ensure I get a complete set of data for the entities I am querying for. The thing.thing_id is not the primary key in our database however the thing_id represents an entity that could be composed of data from N rows as indicated by a shared thing_id. I need to make sure I get a complete set of rows for all thing entities where LIMIT represents the total number of thing entities I am fetching data for.

This question spawned from a previous thread but no solution: Unable to get Postgresql LIMIT to work with UPDATE and subquery

Issues with Answer

It does not get the thing_id with the oldest file: https://dbfiddle.uk/GStIbrYA
If two queries targeting the same thing.c_id and thing.name prefix run at the same time a double update will happen. This update in the database I am working on could be working with 100,000s of rows at one time so this is not desirable.

Alternative Attempt 1
It looks like FOR UPDATE might be needed to prevent the double update from occurring but it cannot be used with the window function.

with ranked_matching_things as (
  select
    dense_rank() over (order by thing_id) as thing_rank,
    name
  from thing
  where c_id = 'c_id'
    and name like 'protocol://full/path/to/thing/%'
  order by status_1_date
  for update
)
update thing
set status = 'CHANGED'
from ranked_matching_things
where thing.name = ranked_matching_things.name
  and ranked_matching_things.thing_rank <= 3

ERROR:  FOR UPDATE is not allowed with window functions

Alternative Attempt 2

It does not get the thing_id with the oldest file first: https://dbfiddle.uk/tMG-vvU4
It seems redundant with the duplication of the where clause

UPDATE thing
SET status = 'CHANGED'
FROM (
    SELECT name 
    FROM thing
    WHERE thing.thing_id in (
        SELECT DISTINCT thing.thing_id
        FROM thing
        WHERE c_id = 'c_id' AND name like 'protocol://full/path/to/thing/%' AND status = 'status_1'
        ORDER BY thing.thing_id
        LIMIT (3)
    ) AND thing.c_id = 'c_id' AND thing.name like 'protocol://full/path/to/thing/%' AND thing.status = 'status_1'
    ORDER BY thing.status_1_date
    FOR UPDATE
) AS rows
WHERE thing.name = rows.name 
RETURNING *

Test Setup
The following link may be useful for testing: https://dbfiddle.uk/ZTcbkqsF

Tags: postgresql

Answers

Chosen as BEST ANSWER

I was able to derive the following, likely suboptimal, solution.

Query to get a list of valid thing_ids with the oldest file associated:

SELECT thing_id, MIN(status_1_date)
FROM thing
WHERE status = ('status_1') AND c_id = ('c_id') AND name LIKE ('protocol://full/path/to/thing/%')
GROUP BY thing_id

Query to get LIMIT thing_ids getting the thing_ids with the oldest file first:

SELECT thing_id
FROM (
    SELECT thing_id, MIN(status_1_date) AS oldest_file
    FROM thing
    GROUP BY thing_id
) as thing_ids
ORDER BY thing_ids.oldest_file
LIMIT 3

Query to get the necessary rows FOR UPDATE:

SELECT * 
FROM thing
WHERE thing.thing_id in (
    SELECT thing_id
        FROM (
            SELECT thing_id, MIN(status_1_date) AS oldest_file
            FROM thing
            GROUP BY thing_id
        ) as thing_ids
    ORDER BY thing_ids.oldest_file
)
FOR UPDATE

Query to UPDATE valid rows:

WITH rows AS (
    SELECT * 
    FROM thing
    WHERE thing.thing_id in (
        SELECT thing_id
            FROM (
                SELECT thing_id, MIN(status_1_date) AS oldest_file
                FROM thing
                WHERE status = ('status_1') AND c_id = ('c_id') AND name LIKE ('protocol://full/path/to/thing/%')
                GROUP BY thing_id
            ) as thing_ids
        WHERE thing.status = ('status_1') AND thing.c_id = ('c_id') AND thing.name LIKE ('protocol://full/path/to/thing/%')
        ORDER BY thing_ids.oldest_file
        LIMIT 3
    )
    FOR UPDATE
)
UPDATE thing
SET status = 'CHANGED'
FROM rows
WHERE thing.name = rows.name
RETURNING *

Solution: https://dbfiddle.uk/Z7trYUKa

Revision 1:

WITH thing_ids as (
    SELECT thing_id, MIN(status_1_date) AS oldest
    FROM thing
    WHERE status = ('status_1') AND c_id = ('c_id') AND name LIKE ('protocol://full/path/to/thing/%')
    GROUP BY thing_id
    ORDER BY oldest
    LIMIT 3
),
names AS (
    SELECT name
    FROM thing
    WHERE thing.thing_id IN (SELECT thing_id FROM thing_ids) AND thing.status = ('status_1') AND thing.c_id = ('c_id') AND thing.name LIKE ('protocol://full/path/to/thing/%')
    FOR UPDATE
)
UPDATE thing
SET status = 'CHANGED'
FROM names
WHERE thing.name = names.name
RETURNING *

Revision 1 Solution: https://dbfiddle.uk/dg8poZaA

I'm sure there must be some way to prevent the redundant WHERE clause but I have been unable to accomplish this so far.

(Edit)

This can’t be done with a limit, because it’s too… limited. Instead we can use a window function. Let’s break it down…

First, we identify the matching rows and put them in order.

select
  *
from thing
where c_id = 'c_id'
  and name like 'protocol://full/path/to/thing/%'
order by status_1_date;

Simple enough.

From those results, we need to pick the first 3 distinct matching thing_ids. We can add a column to the result with dense_rank.

select
  dense_rank() over (order by thing_id) as thing_rank,
  *
from thing
where c_id = 'c_id'
  and name like 'protocol://full/path/to/thing/%'
order by status_1_date;

thing_rank  name    thing_id    c_id    status  status_1_date
1   protocol://full/path/to/thing/thing_1.file1     thing_1     c_id    status_1    2023-09-29 09:00:01
1   protocol://full/path/to/thing/thing_1.file2     thing_1     c_id    status_1    2023-09-29 09:00:02
2   protocol://full/path/to/thing/thing_2.file5     thing_2     c_id    status_1    2023-09-29 09:00:02.5
2   protocol://full/path/to/thing/thing_2.file1     thing_2     c_id    status_1    2023-09-29 09:00:03
2   protocol://full/path/to/thing/thing_2.file2     thing_2     c_id    status_1    2023-09-29 09:00:04
2   protocol://full/path/to/thing/thing_2.file3     thing_2     c_id    status_1    2023-09-29 09:00:05
3   protocol://full/path/to/thing/thing_3.file1     thing_3     c_id    status_1    2023-09-29 09:00:06
4   protocol://full/path/to/thing/thing_4.file1     thing_4     c_id    status_1    2023-09-29 09:00:06.1
3   protocol://full/path/to/thing/thing_3.file2     thing_3     c_id    status_1    2023-09-29 09:00:06.2
4   protocol://full/path/to/thing/thing_4.file2     thing_4     c_id    status_1    2023-09-29 09:00:06.3
5   protocol://full/path/to/thing/thing_5.file1     thing_5     c_id    status_1    2023-09-29 09:00:06.4
5   protocol://full/path/to/thing/thing_5.file2     thing_5     c_id    status_1    2023-09-29 09:00:06.5
5   protocol://full/path/to/thing/thing_5.file3     thing_5     c_id    status_1    2023-09-29 09:00:06.6
6   protocol://full/path/to/thing/thing_6.file1     thing_6     c_id    status_1    2023-09-29 09:00:06.7

Now we have a way to identify the rows which correspond to the first three thing_ids: thing_1 has a rank of 1, thing_2 has a rank of 2, and thing_3 has a rank of 3.

We use dense_rank, not rank, to ensure the ranks are 1, 2, 3… Otherwise we’d have gaps.

Join with that sub-query on the primary key, and only update those with a rank <= 3.

with ranked_matching_things as (
  select
    dense_rank() over (order by thing_id) as thing_rank,
    name
  from thing
  where c_id = 'c_id'
    and name like 'protocol://full/path/to/thing/%'
  order by status_1_date
)
update thing
set status = 'CHANGED'
from ranked_matching_things
where thing.name = ranked_matching_things.name
  and ranked_matching_things.thing_rank <= 3

Demonstration

Please signup or login to give your own answer.

Click here to cancel reply.