skip to Main Content

Briefly: database imported from foreign source, so I cannot prevent duplicates, I can only prune and clean the database.
Foreign db changes daily, so, I want to automate the pruning process.

It resides on:
MariaDB v10.4.6 managed predominantly by phpMyadmin GUI v4.9.0.1 (both pretty much up to date as of this writing).

This is a radio browsing database.

It has multiple columns, but for me there are only few important:
StationID (it is unique entry number, thus db does not consider new entries as duplicates, all of them are unique because of this primary key)
There are no row numbers.
Name, url, home-page, country, etc
I do want to remove multiple url duplicated entries base on:
duplicate url has country to it, but some country values are NULL (=empty)
so I do want remove all duplicates except one containing country name, if there is one entry with it, if there is none, just one url, regardless of name (names are multilingual, so some duplicated urls have also various names, which I do not care for.

  • StationID (unique number, but not consecutive, also this is primary db key)
  • Name (variable, least important)
  • url (variable, but I do want to remove the duplicates)
  • country (variable, frequently NULL/empty, I want to eliminate those with empty entries as much as possible, if possible)
  • One url has to stay by any means (not to be deleted)

I have tried multitude of queries, some work for SELECT, but do NOT for DELETE, some hang my machine when executed. Here are some queries I tried (remember I use MariaDB, not oracle, or ms-sql)

SELECT * from `radio`.`Station`
WHERE  (`radio`.`Station`.`Url`, `radio`.`Station`.`Name`) IN (
  SELECT `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
  FROM   `radio`.`Station`
  GROUP  BY `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
  HAVING COUNT(*) > 1)

This one should show all entries (not only one grouped), but this query hangs my machine
This query gets me as close as possible:

SELECT *
FROM `radio`.`Station`
WHERE `radio`.`Station`.`StationID` NOT IN (
     SELECT MAX(`radio`.`Station`.`StationID`)
     FROM `radio`.`Station`
     GROUP BY `radio`.`Station`.`Url`,`radio`.`Station`.`Name`,`radio`.`Station`.`Country`)

However this query lists more entries:

SELECT *, COUNT(`radio`.`Station`.`Url`) FROM `radio`.`Station` GROUP BY `radio`.`Station`.`Name`,`radio`.`Station`.`Url` HAVING (COUNT(`radio`.`Station`.`Url`) > 1);

But all of these queries group them and display only one row.
I also tried UNION, INNER JOIN, but failed.
WITH cte AS…, but phpMyadmin does NOT like this query, and mariadb cli also did not like it.

I also tried something of this kind, published at oracle blog, which did not work, and I really had no clue what was what in this function:

select * 
from   (
  select f.*, 
         count(*) over (
           partition by `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
         ) ct
  from `radio`.`Station` f
)
where  ct > 1

I did not know what f.* was, query did not like ct.

2

Answers


  1. Given

    drop table if exists radio;
    create table radio
    (stationid int,name varchar(3),country varchar(3),url varchar(3));
    
    insert into radio values
    (1,'aaa','uk','a/b'),
    (2,'bbb','can','a/b'),
    (3,'bbb',null,'a/b'),
    (4,'bbb',null,'b/b'),
    (5,'bbb',null,'b/b');
    

    You could give the null countries a unique value (using coalesce), fortunately stationid is unique so:

    select t.stationid,t.name,t.country,t.url
    from radio t
    join
    (select url,max(coalesce(country,stationid)) cntry from radio t group by url) s 
    on s.url = t.url and s.cntry= coalesce(t.country,t.stationid);
    

    Yields

    +-----------+------+---------+------+
    | stationid | name | country | url  |
    +-----------+------+---------+------+
    |         1 | aaa  | uk      | a/b  |
    |         5 | bbb  | NULL    | b/b  |
    +-----------+------+---------+------+
    2 rows in set (0.00 sec)
    

    Translated to a delete

    delete t from radio t 
    join
    (select url,max(coalesce(country,stationid)) cntry from radio t group by url) s 
    on s.url = t.url and s.cntry <> coalesce(t.country,t.stationid);
    
    MariaDB [sandbox]> select * from radio;
    +-----------+------+---------+------+
    | stationid | name | country | url  |
    +-----------+------+---------+------+
    |         1 | aaa  | uk      | a/b  |
    |         5 | bbb  | NULL    | b/b  |
    +-----------+------+---------+------+
    2 rows in set (0.00 sec)
    
    Login or Signup to reply.
  2. Fix 2 problems at once:

    • Dup rows already in table
    • Dup rows can still be put in table

    Do this fore each table:

    CREATE TABLE new LIKE real;
    ALTER TABLE new ADD UNIQUE(x,y);  -- will prevent future dups
    INSERT IGNORE INTO new         -- IGNORE dups
        SELECT * FROM real;
    RENAME TABLE real TO old, new TO real;
    DROP TABLE old;
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search