Mysql - SQL deduplication with criteria that has non-unique values

mbpaulus
September 29, 2023
191 views
2 votes
3 Answers

I have a table called cc_index and the table has among others the following columns: length, date, url, [other_colums].

The same url may appear in multiple rows.
For each url, I only want to retain a single row (but all columns).
I want to retain a row that has the longest length, unfortunately, this row may not be unique. In the cases, where multiple rows have the highest value for length for the same url, I want to retain the row that has the latest (highest) date, date values are unique for each row.

To clarify this is an example of a possible input table:

url	length	date	other
abc.com	42	292	adsk
abc.com	36	312	sdlkf
abc.com	42	281	sdjl
def.com	12	210	tom
def.com	18	112	vkr
ghi.com	29	292	tom
ghi.com	29	103	tyr
ghi.com	29	101	vnv

The output I desire is as follows:

url	length	date	other
abc.com	42	292	adsk
def.com	18	112	vkr
ghi.com	29	292	tom

I know that I can run a query like this for deduplication

SELECT t1.*
FROM ccindex t1
INNER JOIN (SELECT url, MAX(length) AS max_length FROM ccindex GROUP BY url) t2
ON t1.url=t2.url AND t1.length = t2.max_length

However, in the case where the criterion is not unique (length) this will return all rows with the maximum length for each URL, and hence multiple rows for the same URL, i.e. the output will be

Undesired output:

url	length	date	other
abc.com	42	292	adsk
abc.com	42	281	sdjl
def.com	18	112	vkr
ghi.com	29	292	tom
ghi.com	29	103	tyr
ghi.com	29	101	vnv

How can I modify this query to select only a single row for each url that has length=max_length?

Follow-ups: If I don’t care about the date, and want to return a single arbitrary row that has the maximum length can performance be improved significantly?

Answers

Check this query.

SELECT t1.*
FROM ccindex t1
WHERE (t1.url, t1.length, t1.date) IN (
    SELECT max_lengths.url, max_lengths.max_length, max_date
    FROM (
        SELECT url, MAX(length) AS max_length
        FROM ccindex
        GROUP BY url
    ) AS max_lengths
    JOIN (
        SELECT url, length, MAX(date) AS max_date
        FROM ccindex
        GROUP BY url, length
    ) AS max_dates
    ON max_lengths.url = max_dates.url AND max_lengths.max_length = max_dates.length
);

- Swifty
- September 29, 2023 at 4:04 pm
- 0 votes
0
I’d simply use ROW_NUMBER() OVER (PARTITION...) for this:
```
WITH t1 AS (
    SELECT  *,
            ROW_NUMBER() OVER (PARTITION BY url ORDER BY length DESC, date DESC) AS n_row
    FROM ccindex
    )
SELECT *
FROM t1
WHERE n_row = 1
```
In the 2nd SELECT, you might of course want to replace * with only the rows you need.
Login or Signup to reply.

- JonasMetzler
- September 29, 2023 at 4:21 pm
- 0 votes
0
We can use DENSE_RANK here.

The PARTITION BY part builds the ranking per url, the ORDER BY clause makes sure to begin with highest length and if identic with the latest date:
```
WITH rankedData AS
(SELECT
  url, length, date, other,
  DENSE_RANK() OVER(PARTITION BY url ORDER BY length DESC, date DESC) AS ranking
FROM ccindex)
SELECT
  url, length, date, other
FROM rankedData 
WHERE ranking = 1
ORDER BY url;
```
Add further columns to both sub query and main query if you want them to be shown.

To your question…

If I don’t care about the date, and want to return a single arbitrary row that has the maximum length can performance be improved significantly?

…I don’t believe checking for the latest date will have much impact on your performance. But we don’t know how big your table is, what indexes it has etc. So I recommend to simply try out both queries and compare the execution time.

The more important thing about performance is certainly to use an up-to-date version of your RDBMS. So this answer assumes (as your mentioned in your last comment) you already use it or you will upgrade.

Try out above queries on this sample fiddle with your data.

Have a look in the documentation about DENSE_RANK
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.

Mysql – SQL deduplication with criteria that has non-unique values

Answers