skip to Main Content

Overview
I am trying to develop a query that selects as many rows as necessary so that I update up to LIMIT unique thing.thing_ids and is ordered by the status_1_date column (meaning older things should be updated first). thing_id is NOT the primary key.

Example Schema

CREATE TABLE IF NOT EXISTS thing (
  name VARCHAR(255) PRIMARY KEY,
  thing_id VARCHAR(255),
  c_id VARCHAR(255),
  status VARCHAR(255),
  etag VARCHAR(255),
  last_modified VARCHAR(255),
  size VARCHAR(255),
  status_1_date DATE
);

Sample Data

INSERT INTO thing (name,thing_id, c_id,status, status_1_date)
values 
('protocol://full/path/to/thing/thing_1.file1', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:01'),
('protocol://full/path/to/thing/thing_1.file2', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:02'),
('protocol://full/path/to/thing/thing_2.file5', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:02.5'),
  ('protocol://something else', 'thing_1','c_id', 'status_1', '2023-09-29 09:00:02.8'),
('protocol://full/path/to/thing/thing_2.file1', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:03'),
('protocol://full/path/to/thing/thing_2.file2', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:04'),
('protocol://full/path/to/thing/thing_2.file3', 'thing_2','c_id', 'status_1', '2023-09-29 09:00:05'),
('protocol://full/path/to/thing/thing_2.file4', 'thing_2','different', 'status_1', '2023-09-29 09:00:05'),
('protocol://full/path/to/thing/thing_3.file1', 'thing_3','c_id', 'status_1', '2023-09-29 09:00:06'),
('protocol://full/path/to/thing/thing_3.file2', 'thing_3','c_id', 'status_1', '2023-09-29 09:00:06.2'),
('protocol://full/path/to/thing/thing_4.file1', 'thing_4','c_id', 'status_1', '2023-09-29 09:00:06.1'),
('protocol://full/path/to/thing/thing_4.file2', 'thing_4','c_id', 'status_1', '2023-09-29 09:00:06.3'),
('protocol://full/path/to/thing/thing_5.file1', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.4'),
('protocol://full/path/to/thing/thing_5.file2', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.5'),
('protocol://full/path/to/thing/thing_5.file3', 'thing_5','c_id', 'status_1', '2023-09-29 09:00:06.6'),
('protocol://full/path/to/thing/thing_6.file1', 'thing_6','c_id', 'status_1', '2023-09-29 09:00:06.7');

Desired Results
Using this data I am trying to get a query that updates rows 1-6 and 7-9 when LIMIT 3 is used. The rows for thing_ids thing_5, thing_6, and thing_1 that match there WHERE clause should be updated.

Additional Details
The issues I am trying to solve is I work with a large database and I need to limit the data returned from it but ensure I get a complete set of data for the entities I am querying for. The thing.thing_id is not the primary key in our database however the thing_id represents an entity that could be composed of data from N rows as indicated by a shared thing_id. I need to make sure I get a complete set of rows for all thing entities where LIMIT represents the total number of thing entities I am fetching data for.

This question spawned from a previous thread but no solution: Unable to get Postgresql LIMIT to work with UPDATE and subquery

Issues with Answer

  • It does not get the thing_id with the oldest file: https://dbfiddle.uk/GStIbrYA
  • If two queries targeting the same thing.c_id and thing.name prefix run at the same time a double update will happen. This update in the database I am working on could be working with 100,000s of rows at one time so this is not desirable.

Alternative Attempt 1
It looks like FOR UPDATE might be needed to prevent the double update from occurring but it cannot be used with the window function.

with ranked_matching_things as (
  select
    dense_rank() over (order by thing_id) as thing_rank,
    name
  from thing
  where c_id = 'c_id'
    and name like 'protocol://full/path/to/thing/%'
  order by status_1_date
  for update
)
update thing
set status = 'CHANGED'
from ranked_matching_things
where thing.name = ranked_matching_things.name
  and ranked_matching_things.thing_rank <= 3

ERROR:  FOR UPDATE is not allowed with window functions

Alternative Attempt 2

  • It does not get the thing_id with the oldest file first: https://dbfiddle.uk/tMG-vvU4
  • It seems redundant with the duplication of the where clause
UPDATE thing
SET status = 'CHANGED'
FROM (
    SELECT name 
    FROM thing
    WHERE thing.thing_id in (
        SELECT DISTINCT thing.thing_id
        FROM thing
        WHERE c_id = 'c_id' AND name like 'protocol://full/path/to/thing/%' AND status = 'status_1'
        ORDER BY thing.thing_id
        LIMIT (3)
    ) AND thing.c_id = 'c_id' AND thing.name like 'protocol://full/path/to/thing/%' AND thing.status = 'status_1'
    ORDER BY thing.status_1_date
    FOR UPDATE
) AS rows
WHERE thing.name = rows.name 
RETURNING *

Test Setup
The following link may be useful for testing: https://dbfiddle.uk/ZTcbkqsF

2

Answers


  1. Chosen as BEST ANSWER

    I was able to derive the following, likely suboptimal, solution.

    Query to get a list of valid thing_ids with the oldest file associated:

    SELECT thing_id, MIN(status_1_date)
    FROM thing
    WHERE status = ('status_1') AND c_id = ('c_id') AND name LIKE ('protocol://full/path/to/thing/%')
    GROUP BY thing_id
    

    Query to get LIMIT thing_ids getting the thing_ids with the oldest file first:

    SELECT thing_id
    FROM (
        SELECT thing_id, MIN(status_1_date) AS oldest_file
        FROM thing
        GROUP BY thing_id
    ) as thing_ids
    ORDER BY thing_ids.oldest_file
    LIMIT 3
    

    Query to get the necessary rows FOR UPDATE:

    SELECT * 
    FROM thing
    WHERE thing.thing_id in (
        SELECT thing_id
            FROM (
                SELECT thing_id, MIN(status_1_date) AS oldest_file
                FROM thing
                GROUP BY thing_id
            ) as thing_ids
        ORDER BY thing_ids.oldest_file
    )
    FOR UPDATE
    

    Query to UPDATE valid rows:

    WITH rows AS (
        SELECT * 
        FROM thing
        WHERE thing.thing_id in (
            SELECT thing_id
                FROM (
                    SELECT thing_id, MIN(status_1_date) AS oldest_file
                    FROM thing
                    WHERE status = ('status_1') AND c_id = ('c_id') AND name LIKE ('protocol://full/path/to/thing/%')
                    GROUP BY thing_id
                ) as thing_ids
            WHERE thing.status = ('status_1') AND thing.c_id = ('c_id') AND thing.name LIKE ('protocol://full/path/to/thing/%')
            ORDER BY thing_ids.oldest_file
            LIMIT 3
        )
        FOR UPDATE
    )
    UPDATE thing
    SET status = 'CHANGED'
    FROM rows
    WHERE thing.name = rows.name
    RETURNING *
    

    Solution: https://dbfiddle.uk/Z7trYUKa

    Revision 1:

    WITH thing_ids as (
        SELECT thing_id, MIN(status_1_date) AS oldest
        FROM thing
        WHERE status = ('status_1') AND c_id = ('c_id') AND name LIKE ('protocol://full/path/to/thing/%')
        GROUP BY thing_id
        ORDER BY oldest
        LIMIT 3
    ),
    names AS (
        SELECT name
        FROM thing
        WHERE thing.thing_id IN (SELECT thing_id FROM thing_ids) AND thing.status = ('status_1') AND thing.c_id = ('c_id') AND thing.name LIKE ('protocol://full/path/to/thing/%')
        FOR UPDATE
    )
    UPDATE thing
    SET status = 'CHANGED'
    FROM names
    WHERE thing.name = names.name
    RETURNING *
    

    Revision 1 Solution: https://dbfiddle.uk/dg8poZaA

    I'm sure there must be some way to prevent the redundant WHERE clause but I have been unable to accomplish this so far.


  2. This can’t be done with a limit, because it’s too… limited. Instead we can use a window function. Let’s break it down…

    First, we identify the matching rows and put them in order.

    select
      *
    from thing
    where c_id = 'c_id'
      and name like 'protocol://full/path/to/thing/%'
    order by status_1_date;
    

    Simple enough.

    From those results, we need to pick the first 3 distinct matching thing_ids. We can add a column to the result with dense_rank.

    select
      dense_rank() over (order by thing_id) as thing_rank,
      *
    from thing
    where c_id = 'c_id'
      and name like 'protocol://full/path/to/thing/%'
    order by status_1_date;
    
    thing_rank  name    thing_id    c_id    status  status_1_date
    1   protocol://full/path/to/thing/thing_1.file1     thing_1     c_id    status_1    2023-09-29 09:00:01
    1   protocol://full/path/to/thing/thing_1.file2     thing_1     c_id    status_1    2023-09-29 09:00:02
    2   protocol://full/path/to/thing/thing_2.file5     thing_2     c_id    status_1    2023-09-29 09:00:02.5
    2   protocol://full/path/to/thing/thing_2.file1     thing_2     c_id    status_1    2023-09-29 09:00:03
    2   protocol://full/path/to/thing/thing_2.file2     thing_2     c_id    status_1    2023-09-29 09:00:04
    2   protocol://full/path/to/thing/thing_2.file3     thing_2     c_id    status_1    2023-09-29 09:00:05
    3   protocol://full/path/to/thing/thing_3.file1     thing_3     c_id    status_1    2023-09-29 09:00:06
    4   protocol://full/path/to/thing/thing_4.file1     thing_4     c_id    status_1    2023-09-29 09:00:06.1
    3   protocol://full/path/to/thing/thing_3.file2     thing_3     c_id    status_1    2023-09-29 09:00:06.2
    4   protocol://full/path/to/thing/thing_4.file2     thing_4     c_id    status_1    2023-09-29 09:00:06.3
    5   protocol://full/path/to/thing/thing_5.file1     thing_5     c_id    status_1    2023-09-29 09:00:06.4
    5   protocol://full/path/to/thing/thing_5.file2     thing_5     c_id    status_1    2023-09-29 09:00:06.5
    5   protocol://full/path/to/thing/thing_5.file3     thing_5     c_id    status_1    2023-09-29 09:00:06.6
    6   protocol://full/path/to/thing/thing_6.file1     thing_6     c_id    status_1    2023-09-29 09:00:06.7
    

    Now we have a way to identify the rows which correspond to the first three thing_ids: thing_1 has a rank of 1, thing_2 has a rank of 2, and thing_3 has a rank of 3.

    We use dense_rank, not rank, to ensure the ranks are 1, 2, 3… Otherwise we’d have gaps.

    Join with that sub-query on the primary key, and only update those with a rank <= 3.

    with ranked_matching_things as (
      select
        dense_rank() over (order by thing_id) as thing_rank,
        name
      from thing
      where c_id = 'c_id'
        and name like 'protocol://full/path/to/thing/%'
      order by status_1_date
    )
    update thing
    set status = 'CHANGED'
    from ranked_matching_things
    where thing.name = ranked_matching_things.name
      and ranked_matching_things.thing_rank <= 3
    

    Demonstration

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search