Postgresql - Is there a way to delete / update duplicate values in a time sequence?

SkyWalker
October 26, 2023
204 views
0 votes
2 Answers

I have a table that looks like (never mind the database vendor as I need this in ANSI SQL):

create table edge (
   parent_id int not null,
   child_id int not null,
   value float not null,
   start date not null,
   end date not null
)

I would then have the following CSV input data:

1,2,0,2023-01-01,2023-01-10
1,2,0,2023-01-11,2023-01-20
1,2,0,2023-01-21,NULL
1,3,0,2023-01-01,2023-01-10
1,3,0,2023-01-11,2023-01-20
1,3,1,2023-01-21,NULL

Note the case where the parent is 1 and the child is 2, the same value 0 repeats across multiple date intervals, so it could be collapsed to:

1,2,0,2023-01-01,NULL
1,3,0,2023-01-01,2023-01-20
1,3,1,2023-01-21,NULL

a value of NULL in end means valid until further notice.

Ideally there would be a query that would for table edge (yes graph edges) generate the output of the second excerpt. The goal is to first delete redundant rows and update the remaining last row with start to match the start of the first deleted row.

The closest I can get is the following, here I am finding all the redundant rows to delete:

SELECT *
FROM edge
WHERE (parent, child, value, start) IN (SELECT parent, child, value, end+1
                                        FROM edge)
ORDER BY parent, child, start

I would then need to do the update step to leave the time series consistent. On a separate note, value is being exact matched and it is a float but this works on PostgreSQL at least.

If there is no way using ANSI I would be interested to learn for the PostgreSQL case.

Answers

- Stefanovsm
- October 26, 2023 at 7:40 pm
- 0 votes
0
I would first suggest a query that selects the desired rows as per the business rules above that may be used (as a view?) without deleting. I am using the non-ANSI distinct on PostgreSQL’s clause.
```
select distinct on (parent_id, child_id, value) 
 parent_id, child_id, value, 
 first_value(start_date) over (partition by parent_id, child_id, value order by start_date),
 end_date
from edge;
order by parent_id, child_id, value, start_date desc;
```
If this is not acceptable then shape the query as a CTE and then delete these rows that are not in it.

Here is a SQL standard-compliant version w/o distinct on.
```
select parent_id, child_id, value, sd start_date, end_date
from
(
 select
   row_number() over (partition by parent_id, child_id, value order by start_date desc) rn, 
   parent_id, child_id, value, 
   first_value(start_date) over (partition by parent_id, child_id, value order by start_date) sd,
   end_date
 from edge
) t
where rn = 1;
```
SQL Fiddle demo
Login or Signup to reply.

- maraca
- October 26, 2023 at 10:27 pm
- 0 votes
0
If the date ranges are continuous I think the easiest (but a bit dirty) approach would be as follows:
1. Rename table and set end to a unique date in the future where it is null.
2. Put values in new edge table:
```
 SELECT parent_id, child_id, value, MIN(start), MAX(end)
 FROM edge_old
 GROUP BY parent_id, child_id, value
```
3. Set end to null where it matches the unique date you set before and delete old table.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Is there a way to delete / update duplicate values in a time sequence?

Answers