I have a database of songs, and I want to find duplicates:
- With the exact same title, e.g. "itsy bitsy" vs "itsy bitsy"
- OR with one title the substring of the other, e.g. "itsy bitsy" vs "itsy bitsy spider"
I have tried using LIKE like this:
scope :duplicate_titles, -> {
where("(SELECT COUNT(*) FROM songs AS songs1
WHERE songs1.title LIKE (songs.title || '%')) > 1")
}
However it’s very slow, queries that were 40ms jump to 2s with this LIKE statement.
Is there a better way to find duplicates, including where one field is the substring of another?
Edit 1: some examples
id: 1, title: "itsy bitsy spider"
id: 2, title: "itsy bitsy"
id: 3, title: "row, row, row your boat"
id: 4, title: "twinkle twinkle"
id: 5, title: "twinkle twinkle little star"
In this data, I would want to find songs 1,2,4,5 (or at least songs 2 and 4) because they are duplicates.
Edit 2: Another attempt
I tried using substring like this:
scope :duplicate_titles, -> {
where("(SELECT COUNT(*) FROM songs AS songs1
WHERE SUBSTRING(songs1.title, 0, LENGTH(songs.title)+1) = songs.title) > 1")
}
But this took even longer, 4.3s to execute a query..
2
Answers
data
use join on like
or use contain as follows
dbfiddle
You can use any
text
value directly as a POSIX regular expression and look for a case-insensitive match using~*
operator. If it’s a substring or the same whole value, it’ll match. Demo at db<>fiddle:For more flexibility and performance you’d probably be interested in Full Text Search rather than pattern matching.