Postgresql - Finding duplicate values in SQL including substrings

Mirror318
February 28, 2024
161 views
0 votes
2 Answers

I have a database of songs, and I want to find duplicates:

With the exact same title, e.g. "itsy bitsy" vs "itsy bitsy"
OR with one title the substring of the other, e.g. "itsy bitsy" vs "itsy bitsy spider"

I have tried using LIKE like this:

  scope :duplicate_titles, -> {
    where("(SELECT COUNT(*) FROM songs AS songs1
                            WHERE songs1.title LIKE (songs.title || '%')) > 1")
  }

However it’s very slow, queries that were 40ms jump to 2s with this LIKE statement.

Is there a better way to find duplicates, including where one field is the substring of another?

Edit 1: some examples

id: 1, title: "itsy bitsy spider"
id: 2, title: "itsy bitsy"
id: 3, title: "row, row, row your boat"
id: 4, title: "twinkle twinkle"
id: 5, title: "twinkle twinkle little star"

In this data, I would want to find songs 1,2,4,5 (or at least songs 2 and 4) because they are duplicates.

Edit 2: Another attempt

I tried using substring like this:

  scope :duplicate_titles, -> {
    where("(SELECT COUNT(*) FROM songs AS songs1
                            WHERE SUBSTRING(songs1.title, 0, LENGTH(songs.title)+1) = songs.title) > 1")
  }

But this took even longer, 4.3s to execute a query..

Answers

data

CREATE TABLE mytable(
   id     INTEGER  NOT NULL  
  ,title  VARCHAR(100) NOT NULL
 
);
INSERT INTO mytable(id,title) VALUES (1,'itsy bitsy spider');
INSERT INTO mytable(id,title) VALUES (2,'itsy bitsy');
INSERT INTO mytable(id,title) VALUES (3,'row row row your boat');
INSERT INTO mytable(id,title) VALUES (4,'twinkle twinkle');
INSERT INTO mytable(id,title) VALUES (5,'twinkle twinkle little star');

use join on like

select * from mytable m1
join mytable m2
on  m1.title LIKE '%' || m2.title || '%'
and m1.id<>m2.id

or use contain as follows

select * from mytable m1
join mytable m2
on  POSITION(m2.title IN m1.title)<>0
and m1.id<>m2.id

dbfiddle

You can use any text value directly as a POSIX regular expression and look for a case-insensitive match using ~* operator. If it’s a substring or the same whole value, it’ll match. Demo at db<>fiddle:

id	title
1	itsy bitsy spider
2	itsy bitsy
3	itsy bitsy
4	row row row your boat
5	twinkle twinkle
6	twinkle twinkle little star

select * from song s1 
inner join song s2 
 on s1.title ~* s2.title;

id	title	id	title
1	itsy bitsy spider	1	itsy bitsy spider
1	itsy bitsy spider	2	itsy bitsy
1	itsy bitsy spider	3	itsy bitsy
2	itsy bitsy	2	itsy bitsy
2	itsy bitsy	3	itsy bitsy
3	itsy bitsy	2	itsy bitsy
3	itsy bitsy	3	itsy bitsy
4	row row row your boat	4	row row row your boat
5	twinkle twinkle	5	twinkle twinkle
6	twinkle twinkle little star	5	twinkle twinkle
6	twinkle twinkle little star	6	twinkle twinkle little star

For more flexibility and performance you’d probably be interested in Full Text Search rather than pattern matching.

Please signup or login to give your own answer.

Click here to cancel reply.

Postgresql – Finding duplicate values in SQL including substrings

Answers