Some context, I am using postgres and I have a table that has items defined something like this:
CREATE TYPE EventType AS ENUM ('published', 'unpublished');
CREATE TABLE items (
id integer NOT NULL,
name varchar(40) NOT NULL,
event EventType NOT NULL,
date date
);
The critical thing to realize here is that name is not necessarily unique, so something could be published with a name on date X, then get unpublished on date Y, then republished again on date Z. Given any particular contiguous [published, unpublished)
for a given name
, these will be associated with each other by sharing the same id
. The only constraint is that there won’t be overlapping [published, unpublished)
ranges with the same name
in the database. You can thus imagine this as representing a list of "in-existence ranges" for a given name
. Usually any id
will have a published
and unpublished
entry, except for the most recent published
one, which may not yet have an unpublished
entry. You can imagine this as representing [published, ?)
.
As such, the goal is to, given a specified QUERY_DATE
and name
, return the id
that represents the range it falls under (QUERY_DATE >= published && QUERY_DATE < unpublished
), and taking into account the special case where (`QUERY_DATE >= published and there is no "unpublished entry").
A separate representation you could imagine for this (which would admittedly make my problem much easier to solve, but I am not in control of this data), is:
CREATE TABLE items (
id integer NOT NULL,
name varchar(40) NOT NULL,
published date NOT NULL,
unpublished date
);
Here the above query would be fairly straight-forward, you would just do:
SELECT id FROM items
WHERE
name = THE_NAME AND
published <= QUERY_DATE AND
(unpublished = NULL OR unpublished > QUERY_DATE)
However, we don’t have the above schema, we have the separated range one from before. If I break this up into two queries, it is conceptually simple (pseudo-code below). Given that the date ranges are known to not overlap, we can just first ask for the LATEST matching published item, and then we just need to do a quick check at the end to make sure it wasn’t unpublished. But I am still doing two queries:
const lastPublishedAfterDateID = query(`
SELECT id FROM items
WHERE
name = ${name} AND
event = published AND
published <= ${date}
ORDERED DESCENDING
LIMIT 1`));
const isUnpublished = query(`
SELECT id FROM items
WHERE
id = ${lastPublishedAfterDateID} AND
event = unpublished AND
unpublished > ${date}`);
const result = isUnpublished ? null : lastPublishedAfterDateID;
I think there is an alternative way to accomplish this all in the query by doing an INNER SELF JOIN. Something like:
SELECT id
FROM items AS published
LEFT JOIN items AS unpublished
WHERE
published.name = THE_NAME AND
published.type = published AND
published.date <= QUERY_DATE AND
unplublished.id = published.id AND
unpublished.type = unpublished AND
unpublished.date > QUERY_DATE
LIMIT 1
I think they limit 1 might be superfluous here, since there’d only ever be one unless that data in the table is incorrect. The LEFT JOIN handles the case of no end date. Either way, I think the above might get me the equivalent to the two-step process above, but it’s not clear to me whether this is horribly inefficient or not idiomatic/etc. Conceptually this is simple to explain "Give me the LATEST item whose publish date is before the given date, unless it also falls after the unpublished date, in which case just return null". Again, if I was in control of the data, I would just be trying to make my life easier with the second schema I posted, but I am not.
2
Answers
Window functions could be the way:
db-fiddle
I like how you can translate this into a
not exists
pretty much word-for-word: demoQUERY_DATE
should be considered already unpublished as of the date, or still published until the end of it.date
and you’re only after the ID, you can swap outdistinct on()...order by
for amax(id)
. Your pseudocodeSELECT id FROM...ORDERED DESCENDING LIMIT 1
would suggest that would be the case.null
.