skip to Main Content

I have a database of songs, and I want to find duplicates:

  • With the exact same title, e.g. "itsy bitsy" vs "itsy bitsy"
  • OR with one title the substring of the other, e.g. "itsy bitsy" vs "itsy bitsy spider"

I have tried using LIKE like this:

  scope :duplicate_titles, -> {
    where("(SELECT COUNT(*) FROM songs AS songs1
                            WHERE songs1.title LIKE (songs.title || '%')) > 1")
  }

However it’s very slow, queries that were 40ms jump to 2s with this LIKE statement.

Is there a better way to find duplicates, including where one field is the substring of another?

Edit 1: some examples

id: 1, title: "itsy bitsy spider"
id: 2, title: "itsy bitsy"
id: 3, title: "row, row, row your boat"
id: 4, title: "twinkle twinkle"
id: 5, title: "twinkle twinkle little star"

In this data, I would want to find songs 1,2,4,5 (or at least songs 2 and 4) because they are duplicates.

Edit 2: Another attempt

I tried using substring like this:

  scope :duplicate_titles, -> {
    where("(SELECT COUNT(*) FROM songs AS songs1
                            WHERE SUBSTRING(songs1.title, 0, LENGTH(songs.title)+1) = songs.title) > 1")
  }

But this took even longer, 4.3s to execute a query..

2

Answers


  1. data

    CREATE TABLE mytable(
       id     INTEGER  NOT NULL  
      ,title  VARCHAR(100) NOT NULL
     
    );
    INSERT INTO mytable(id,title) VALUES (1,'itsy bitsy spider');
    INSERT INTO mytable(id,title) VALUES (2,'itsy bitsy');
    INSERT INTO mytable(id,title) VALUES (3,'row row row your boat');
    INSERT INTO mytable(id,title) VALUES (4,'twinkle twinkle');
    INSERT INTO mytable(id,title) VALUES (5,'twinkle twinkle little star');
    

    use join on like

    select * from mytable m1
    join mytable m2
    on  m1.title LIKE '%' || m2.title || '%'
    and m1.id<>m2.id
    

    or use contain as follows

    select * from mytable m1
    join mytable m2
    on  POSITION(m2.title IN m1.title)<>0
    and m1.id<>m2.id
    

    dbfiddle

    Login or Signup to reply.
  2. You can use any text value directly as a POSIX regular expression and look for a case-insensitive match using ~* operator. If it’s a substring or the same whole value, it’ll match. Demo at db<>fiddle:

    id title
    1 itsy bitsy spider
    2 itsy bitsy
    3 itsy bitsy
    4 row row row your boat
    5 twinkle twinkle
    6 twinkle twinkle little star
    select * from song s1 
    inner join song s2 
     on s1.title ~* s2.title;
    
    id title id title
    1 itsy bitsy spider 1 itsy bitsy spider
    1 itsy bitsy spider 2 itsy bitsy
    1 itsy bitsy spider 3 itsy bitsy
    2 itsy bitsy 2 itsy bitsy
    2 itsy bitsy 3 itsy bitsy
    3 itsy bitsy 2 itsy bitsy
    3 itsy bitsy 3 itsy bitsy
    4 row row row your boat 4 row row row your boat
    5 twinkle twinkle 5 twinkle twinkle
    6 twinkle twinkle little star 5 twinkle twinkle
    6 twinkle twinkle little star 6 twinkle twinkle little star

    For more flexibility and performance you’d probably be interested in Full Text Search rather than pattern matching.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search