I have a table called cc_index and the table has among others the following columns: length, date, url, [other_colums].
The same url may appear in multiple rows.
For each url, I only want to retain a single row (but all columns).
I want to retain a row that has the longest length, unfortunately, this row may not be unique. In the cases, where multiple rows have the highest value for length for the same url, I want to retain the row that has the latest (highest) date, date values are unique for each row.
To clarify this is an example of a possible input table:
url | length | date | other |
---|---|---|---|
abc.com | 42 | 292 | adsk |
abc.com | 36 | 312 | sdlkf |
abc.com | 42 | 281 | sdjl |
def.com | 12 | 210 | tom |
def.com | 18 | 112 | vkr |
ghi.com | 29 | 292 | tom |
ghi.com | 29 | 103 | tyr |
ghi.com | 29 | 101 | vnv |
The output I desire is as follows:
url | length | date | other |
---|---|---|---|
abc.com | 42 | 292 | adsk |
def.com | 18 | 112 | vkr |
ghi.com | 29 | 292 | tom |
I know that I can run a query like this for deduplication
SELECT t1.*
FROM ccindex t1
INNER JOIN (SELECT url, MAX(length) AS max_length FROM ccindex GROUP BY url) t2
ON t1.url=t2.url AND t1.length = t2.max_length
However, in the case where the criterion is not unique (length) this will return all rows with the maximum length for each URL, and hence multiple rows for the same URL, i.e. the output will be
Undesired output:
url | length | date | other |
---|---|---|---|
abc.com | 42 | 292 | adsk |
abc.com | 42 | 281 | sdjl |
def.com | 18 | 112 | vkr |
ghi.com | 29 | 292 | tom |
ghi.com | 29 | 103 | tyr |
ghi.com | 29 | 101 | vnv |
How can I modify this query to select only a single row for each url that has length=max_length?
Follow-ups: If I don’t care about the date, and want to return a single arbitrary row that has the maximum length can performance be improved significantly?
3
Answers
Check this query.
I’d simply use
ROW_NUMBER() OVER (PARTITION...)
for this:In the 2nd SELECT, you might of course want to replace * with only the rows you need.
We can use
DENSE_RANK
here.The
PARTITION BY
part builds the ranking perurl
, theORDER BY
clause makes sure to begin with highestlength
and if identic with the latestdate
:Add further columns to both sub query and main query if you want them to be shown.
To your question…
…I don’t believe checking for the latest date will have much impact on your performance. But we don’t know how big your table is, what indexes it has etc. So I recommend to simply try out both queries and compare the execution time.
The more important thing about performance is certainly to use an up-to-date version of your RDBMS. So this answer assumes (as your mentioned in your last comment) you already use it or you will upgrade.
Try out above queries on this sample fiddle with your data.
Have a look in the documentation about
DENSE_RANK