Is Inner SELF JOIN the right tool for this two-step filtering? - Postgresql

FranciscoRyanTolmaskyI
March 16, 2023
161 views
0 votes
2 Answers

Some context, I am using postgres and I have a table that has items defined something like this:

CREATE TYPE EventType AS ENUM ('published', 'unpublished');

CREATE TABLE items (
    id          integer NOT NULL,
    name        varchar(40) NOT NULL,
    event       EventType NOT NULL,
    date        date
);

The critical thing to realize here is that name is not necessarily unique, so something could be published with a name on date X, then get unpublished on date Y, then republished again on date Z. Given any particular contiguous [published, unpublished) for a given name, these will be associated with each other by sharing the same id. The only constraint is that there won’t be overlapping [published, unpublished) ranges with the same name in the database. You can thus imagine this as representing a list of "in-existence ranges" for a given name. Usually any id will have a published and unpublished entry, except for the most recent published one, which may not yet have an unpublished entry. You can imagine this as representing [published, ?).

As such, the goal is to, given a specified QUERY_DATE and name, return the id that represents the range it falls under (QUERY_DATE >= published && QUERY_DATE < unpublished), and taking into account the special case where (`QUERY_DATE >= published and there is no "unpublished entry").

A separate representation you could imagine for this (which would admittedly make my problem much easier to solve, but I am not in control of this data), is:

CREATE TABLE items (
    id              integer NOT NULL,
    name            varchar(40) NOT NULL,
    published       date NOT NULL,
    unpublished     date
);

Here the above query would be fairly straight-forward, you would just do:

SELECT id FROM items
    WHERE
        name = THE_NAME AND
        published <= QUERY_DATE AND 
        (unpublished = NULL OR unpublished > QUERY_DATE)

However, we don’t have the above schema, we have the separated range one from before. If I break this up into two queries, it is conceptually simple (pseudo-code below). Given that the date ranges are known to not overlap, we can just first ask for the LATEST matching published item, and then we just need to do a quick check at the end to make sure it wasn’t unpublished. But I am still doing two queries:

const lastPublishedAfterDateID = query(`
    SELECT id FROM items
    WHERE
        name = ${name} AND
        event = published AND
        published <= ${date}
    ORDERED DESCENDING
    LIMIT 1`));

const isUnpublished = query(`
    SELECT id FROM items
    WHERE
        id = ${lastPublishedAfterDateID} AND
        event = unpublished AND
        unpublished > ${date}`);

const result = isUnpublished ? null : lastPublishedAfterDateID;

I think there is an alternative way to accomplish this all in the query by doing an INNER SELF JOIN. Something like:

SELECT id
    FROM items AS published
    LEFT JOIN items AS unpublished
    WHERE
        published.name = THE_NAME AND
        published.type = published AND
        published.date <= QUERY_DATE AND
        
        unplublished.id = published.id AND
        unpublished.type = unpublished AND
        unpublished.date > QUERY_DATE
    LIMIT 1

I think they limit 1 might be superfluous here, since there’d only ever be one unless that data in the table is incorrect. The LEFT JOIN handles the case of no end date. Either way, I think the above might get me the equivalent to the two-step process above, but it’s not clear to me whether this is horribly inefficient or not idiomatic/etc. Conceptually this is simple to explain "Give me the LATEST item whose publish date is before the given date, unless it also falls after the unpublished date, in which case just return null". Again, if I was in control of the data, I would just be trying to make my life easier with the second schema I posted, but I am not.

Answers

- Turo
- March 16, 2023 at 5:01 pm
- 0 votes
0
Window functions could be the way:
```
select ID, name, event, date from 
(select ID, name, event, date,
LEAD(date,1) OVER (PARTITION BY ID ORDER BY date) next_date
   from items) e
where QUERY_DATE >= date and ( QUERY_DATE < next_date  or next_date is null)
and name = THE_NAME;     
```
db-fiddle
Login or Signup to reply.

- Zegarek
- March 16, 2023 at 8:08 pm
- 0 votes
0
I like how you can translate this into a not exists pretty much word-for-word: demo
```
select distinct on(name)*     --"Give me the LATEST
from  items p                 -- item
where p.event='published'     -- whose publish
and   p.date<='2023-03-06'    -- date is before the given date,
and not exists (              -- unless
    select true
    from   items u
    where u.name = p.name     -- it
    and u.id   = p.id           
    and u.date>= p.date       -- also 
    and u.date < '2023-03-06' -- falls after 
    and u.event='unpublished' -- the unpublished date"
    )
and   p.name='Matrix'         -- (given a specified QUERY_DATE **and name**)
order by name,date desc;--this selects the the latest as the distinct one to return 
```
1. Determine bound inclusivity: it’s not clear whether a name unpublished exactly on the QUERY_DATE should be considered already unpublished as of the date, or still published until the end of it.
2. If your ID’s are in ascending order following the progression of date and you’re only after the ID, you can swap out distinct on()...order by for a max(id). Your pseudocode SELECT id FROM...ORDERED DESCENDING LIMIT 1 would suggest that would be the case.
3. When there’s nothing matching your criteria, it returns no rows, which is not the same as returning a row with a null.
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Is Inner SELF JOIN the right tool for this two-step filtering? – Postgresql

Answers