DELETE excess rows per group - Postgresql

ee_engineer
March 30, 2023
225 views
0 votes
2 Answers

I have a Postgres table with these columns:

  id          int8
, user_id     varchar
, is_favorite boolean
, join_time   timestamptz

I want to delete some rows in this table with some conditions:

Keep a maximum of 10 rows for each user_id.
These 10 rows must include all of that user_id's rows with is_favorite = true.
(There can't be more than 5 rows with is_favorite = true per user_id.)
The remaining slots must be filled with the rows that have the latest join_time.

I want to delete rows past the 10 per user_id in this table.

Example

id|user_id                             |is_favorite|join_time
--+------------------------------------+-----------+-----------------------------
1 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-04 15:16:40.000 +0300
2 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-03 15:16:25.000 +0300
3 |655caab8-ce81-11ed-afa1-0242ac120002|true       |2023-03-02 15:16:40.000 +0300
4 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-04-22 15:16:40.000 +0300
5 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-23 15:16:25.000 +0300
6 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-21 15:16:25.000 +0300
7 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-20 15:16:40.000 +0300
8 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-19 15:16:25.000 +0300
9 |655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-18 15:16:40.000 +0300
10|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-17 15:16:25.000 +0300
11|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-16 15:16:40.000 +0300
12|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-15 15:16:25.000 +0300
13|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-14 15:16:40.000 +0300
14|655caab8-ce81-11ed-afa1-0242ac120002|false      |2023-03-14 15:16:39.000 +0300
15|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 12:16:25.000 +0300
16|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 11:16:25.000 +0300
17|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 10:16:25.000 +0300
18|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 09:16:25.000 +0300
19|81c126b6-ce81-11ed-afa1-0242ac120002|true       |2023-03-01 08:16:25.000 +0300
20|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 07:16:25.000 +0300
21|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 06:16:25.000 +0300
22|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 05:16:25.000 +0300
23|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 04:16:25.000 +0300
24|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 03:16:25.000 +0300
25|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 02:16:25.000 +0300
26|81c126b6-ce81-11ed-afa1-0242ac120002|false      |2023-03-01 01:16:25.000 +0300

For user_id = 655caab8-ce81-11ed-afa1-0242ac120002 these IDs must be deleted: 11,12,13,14

For user_id = 81c126b6-ce81-11ed-afa1-0242ac120002 these IDs must be deleted: 25,26

Answers

- ErwinBrandstetter
- March 30, 2023 at 1:16 am
- 0 votes
Since you are processing the whole table, using a simple subquery with row_number() should be fastest:
```
DELETE FROM tbl t
USING (
   SELECT id, row_number() OVER (PARTITION BY user_id
                                 ORDER BY is_favorite DESC, join_time DESC
                                 ROWS UNBOUNDED PRECEDING) AS rn
   FROM   tbl t
   ) del
WHERE  t.id = del.id
AND    del.rn > 10;
```
Adding ROWS UNBOUNDED PRECEDING is optional, but should make it substantially faster (until Postgres 16 is released). See:
- Efficient downsampling of a selected timeseries to equidistant samples
Applying the right sort order, this skips the top 10 of most desirable rows per user and deletes the rest.
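
Before running the DELETE, it may help to preview which rows would be removed. The following is a minimal sketch that reuses the same window definition in a plain SELECT; the table name tbl is taken from the query above, and the derived-table alias sub is only illustrative:

```sql
-- Preview only: lists the rows the DELETE would remove, without changing data.
SELECT id, user_id, is_favorite, join_time, rn
FROM  (
   SELECT *, row_number() OVER (PARTITION BY user_id
                                ORDER BY is_favorite DESC, join_time DESC) AS rn
   FROM   tbl
   ) sub
WHERE  rn > 10          -- everything past the 10 most desirable rows per user
ORDER  BY user_id, rn;
```

With the sample data from the question, this should list ids 11, 12, 13, 14 for the first user and 25, 26 for the second.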

true sorts before false in descending order. See:
- PostgreSQL: order by column, with specific NON-NULL value LAST
If there can be null values, you need to do more. Like, first of all clarify your question.
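
As one possible reading, assuming a null is_favorite should count as "not a favorite" and a null join_time should rank as oldest (the question does not say), the sort order could be adjusted like this. In Postgres, nulls sort first in descending order by default, so without a change such rows would rank at the top:

```sql
-- Sketch under the stated assumptions about how nulls should be treated.
SELECT id, row_number() OVER (PARTITION BY user_id
                              ORDER BY is_favorite IS TRUE DESC    -- null counts as false
                                     , join_time DESC NULLS LAST   -- null counts as oldest
                             ) AS rn
FROM   tbl;
```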

Obviously, there would be race conditions with concurrent writes. If there can be concurrent write load, take a write lock on the table in the same transaction first …
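
A minimal sketch of that locking pattern follows. The lock mode is an assumption, not something specified above: SHARE ROW EXCLUSIVE blocks concurrent INSERT, UPDATE and DELETE while still allowing plain reads.

```sql
BEGIN;

-- Assumed lock mode: blocks concurrent writes, still allows SELECT.
LOCK TABLE tbl IN SHARE ROW EXCLUSIVE MODE;

DELETE FROM tbl t
USING (
   SELECT id, row_number() OVER (PARTITION BY user_id
                                 ORDER BY is_favorite DESC, join_time DESC) AS rn
   FROM   tbl
   ) del
WHERE  t.id = del.id
AND    del.rn > 10;

COMMIT;  -- the lock is released at the end of the transaction
```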

If that’s going to delete the majority of rows, it may be cheaper to create a new table of survivors instead …
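
A rough sketch of the "new table of survivors" idea, assuming the table has exactly the four columns from the question and nothing else depends on it. The name tbl_new is hypothetical. Foreign keys, views, privileges, or a sequence owned by the old table would need extra handling that is not shown here:

```sql
BEGIN;

-- Block all other access while the table is being swapped.
LOCK TABLE tbl IN ACCESS EXCLUSIVE MODE;

-- Hypothetical name; copies column definitions, defaults, constraints and indexes.
CREATE TABLE tbl_new (LIKE tbl INCLUDING ALL);

-- Keep only the 10 most desirable rows per user.
INSERT INTO tbl_new (id, user_id, is_favorite, join_time)
SELECT id, user_id, is_favorite, join_time
FROM  (
   SELECT *, row_number() OVER (PARTITION BY user_id
                                ORDER BY is_favorite DESC, join_time DESC) AS rn
   FROM   tbl
   ) sub
WHERE  rn <= 10;

DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

COMMIT;
```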

There are other ways. Like:
- Is there a way to SELECT n ON (like DISTINCT ON, but more than one of each)
Aside: use type uuid for your user_id column. Much better. See:
- Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
- What is the optimal data type for an MD5 field?
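
For illustration, converting the column could look like the sketch below, assuming every existing user_id value is a valid UUID string (the statement fails otherwise). It rewrites the table and holds an exclusive lock while it runs:

```sql
-- Convert the varchar column to uuid in place, casting the existing values.
ALTER TABLE tbl
   ALTER COLUMN user_id TYPE uuid USING user_id::uuid;
```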

- SelVazi
- March 30, 2023 at 2:22 am
- 0 votes
You can use row_number() twice: the first one removes rows ranked above 10 per user, and the second one removes any rows with is_favorite = true ranked above 5.
```
with cte as (
  select id,
         -- overall rank per user: favorites first, then newest join_time
         row_number() over (partition by user_id
                            order by is_favorite desc, join_time desc) as rn,
         -- rank among favorites (newest first); 0 for non-favorites
         case when is_favorite
              then row_number() over (partition by user_id
                                      order by is_favorite desc, join_time desc)
              else 0
         end as fav_rn
  from mytable
)
delete from mytable
where id in (
  select id
  from cte
  where rn > 10 or fav_rn > 5
);
```
