skip to Main Content

I am currently trying to implement an algorithm to find anagrams that look like real names. I have a solution that is working but takes too much time for some queries and I am wondering how to improve it.

I am trying to find anagrams composed of a forename and a surname, based on a database holding 50k forenames and 50k surnames. The schema of the database is the following :


CREATE TABLE `forename` (
  `id` int(11) NOT NULL,
  `q` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `label` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels_length` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE `surname` (
  `id` int(11) NOT NULL,
  `q` varchar(32) COLLATE utf8mb4_unicode_ci NOT NULL,
  `label` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
  `labels_length` int(11) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

ALTER TABLE `forename`
  ADD PRIMARY KEY (`id`),
  ADD KEY `idx_length` (`labels_length`);
ALTER TABLE `forename` ADD FULLTEXT KEY `idx_labels` (`labels`);

ALTER TABLE `surname`
  ADD PRIMARY KEY (`id`),
  ADD KEY `idx_length` (`labels_length`),
  ADD KEY `idx_labels` (`labels`);

In each table the meaning of the columns is as follow:

  • label : forename or surename
  • labels : a slugified version of the label : all characters in upper case alphabetically sorted;
  • labels_length : the number of characters in labels;

I am currently querying this database using a query generated in php which, for Ada Lovelace for example, looks like :

select distinct A.label as surname, B.label as forename 
from forename as A, surname as B WHERE (A.labels not like '%B%' and B.labels not like '%B%') AND 
(A.labels not like '%F%' and B.labels not like '%F%') AND 
(A.labels not like '%G%' and B.labels not like '%G%') AND 
(A.labels not like '%H%' and B.labels not like '%H%') AND 
(A.labels not like '%I%' and B.labels not like '%I%') AND 
(A.labels not like '%J%' and B.labels not like '%J%') AND 
(A.labels not like '%K%' and B.labels not like '%K%') AND 
(A.labels not like '%M%' and B.labels not like '%M%') AND 
(A.labels not like '%N%' and B.labels not like '%N%') AND 
(A.labels not like '%P%' and B.labels not like '%P%') AND 
(A.labels not like '%Q%' and B.labels not like '%Q%') AND 
(A.labels not like '%R%' and B.labels not like '%R%') AND 
(A.labels not like '%S%' and B.labels not like '%S%') AND 
(A.labels not like '%T%' and B.labels not like '%T%') AND 
(A.labels not like '%U%' and B.labels not like '%U%') AND 
(A.labels not like '%W%' and B.labels not like '%W%') AND 
(A.labels not like '%X%' and B.labels not like '%X%') AND 
(A.labels not like '%Y%' and B.labels not like '%Y%') AND 
(A.labels not like '%Z%' and B.labels not like '%Z%') AND 
(A.labels like '%A%' or B.labels like '%A%') AND 
(A.labels like '%C%' or B.labels like '%C%') AND 
(A.labels like '%D%' or B.labels like '%D%') AND 
(A.labels like '%E%' or B.labels like '%E%') AND 
(A.labels like '%L%' or B.labels like '%L%') AND 
(A.labels like '%O%' or B.labels like '%O%') AND 
(A.labels like '%V%' or B.labels like '%V%') AND 
(A.labels_length + B.labels_length) = 11

The explanation of this query is that Ada Lovelace slug is AAACDEELLOV so I need to find surnames and forenames that contain these letters and that don’t contains other letters from the alphabet. I am adding a filter on the number of characters to try to limit the number of rows returned.

With this query I get results that needs to be processed using PHP to control that the number of times each character is used is correct (for example that for Ada Lovelace my result contains 3 A).

My current database contains approximately 50k surnames and 50k forenames. When I search for Ada Lovelace I get 458 SQL rows in ~ 0,30 second (11 exact anagrams found if you wonder).

If I change my search for Sylvain Lovelace, I get 1774 rows in more than 10 seconds. 30 times slower and the duration that was acceptable for Ada Lovelace is now out of range. I have tried to remove the filter on the number of characters, and the duration steps down to 8 seconds, still too much.

I am pretty sure that it should be possible to improve, either the indexes of my database, either the way my query is built. If anyone has any idea, I would be more than happy to try them!

In case someone wants to try it on real data, the dump is available on a github repository.

3

Answers


  1. Chosen as BEST ANSWER

    After some months I came to this issue and have now found a way that I find acceptable. The solution was to change my data model by adding 26 columns to the two tables, each containing the number of letter, with an index on each column.

    Based on this data model I am able to build queries like this one :

    select distinct A.label as surname, B.label as forename 
    from forename as A, surname as B 
    WHERE 
    (A.A >= 1 or B.A >= 1) AND 
    (A.B = 0 and B.B = 0) AND 
    (A.C = 1 xor B.C = 1) AND 
    (A.D = 0 and B.D = 0) AND 
    (A.E = 0 and B.E = 0) AND 
    /--/
    (A.Z = 1 xor B.Z = 1) AND 
    (A.labels_length = 4) AND (B.labels_length = 9)
    

    In this sample query, I am searching anagrams for Aaron Schwartz (Letters: AAACHNORRSTWZ) with a surname containing 4 letters. I need results where at least one of the surname and forename contains a A because I need 3 of them, forename and surname both contain no B because I don't want any, and as I only want a C, forename XOR surname might contain one.

    This query won't give me exact results but the number of results returned is satisfying enough for me to process them with PHP afterwards and control if they are real anagrams or not.

    A resulting website has been built as a proof of concept on http://apf.geobib.fr/


  2. The main problem here is your data model. Storing forenames and lastnames in two different tables makes your local slugs useless, as they need to be recomposed into a global slug to be compared to the search slug.

    A (slightly) less verbose method would be to check the number of occurences of each character of the search slug. For

    where 
        char_length(a.label) + char_length(b.label) = char_length('AAACDEELLOV')
        and char_length(concat(a.label, b.label)) 
            - char_length(replace(upper(concat(a.label, b.label)), 'A', '')) = 3
        and char_length(a.label) + char_length(b.label) 
            - char_length(replace(upper(concat(a.label, b.label)), 'C', '')) = 1
        and char_length(a.label) + char_length(b.label)
            - char_length(replace(upper(concat(a.label, b.label)), 'D', '')) = 1
        and char_length(a.label) + char_length(b.label)
            - char_length(replace(upper(concat(a.label, b.label)), 'E', '')) = 2
        and char_length(a.label) + char_length(b.label)
            - char_length(replace(upper(concat(a.label, b.label)), 'L', '')) = 2
        and char_length(a.label) + char_length(b.label)
            - char_length(replace(upper(concat(a.label, b.label)), 'O', '')) = 1
        and char_length(a.label) + char_length(b.label)
            - char_length(replace(upper(concat(a.label, b.label)), 'V', '')) = 1
    

    But bottom line, you would be better off fixing your data model by generating a unique table that would store the full names (first and last names) and the associated slug.

    create table fullnames (
        id int auto_increment primary key
        name varchar(100),
        slug varchar(100)
    );
    

    You can feed the new table from the old tables with a recursive cte that generates the slugs:

    insert into fullnames(name, slug)
    with recursive cte as (
        select 
            concat(f.label, ' ', s.label) name,
            upper(concat(f.label, s.label) slug_name, 
            0 pos, 
            '' char_at_pos,
            char_length(concat(f.label, s.label)) slug_length
        from forename f 
        cross join surname s
        union all
        select 
            name,
            slug_name,
            pos + 1
            substring(slug_name, pos + 1, 1),
            slug_length
        from cte
        where pos + 1 <= slug_length
    )
    select name, group_concat(char_at_pos order by char_at_pos separator '') slug
    from cte
    group by name
    

    Then you can directly query the table:

    select * from fullnames where slug = 'AAACDEELLOV';
    

    Of course, you could also use the results of the recursive cte to search for the target slug, but I would expect that performance would not be good.

    Login or Signup to reply.
  3. Ponder this…

    Suppose X is the least common letter (fewest number of words with X). Let’s say 100 first names have X and 80 last names. Now test all such first names against all last names — 100*50K tests, plus a similar 80*50K the other way. Compute the 180*50K anagrams. Sort. Output any dup anagrams thus found. Then remove the 100 & 80 names; they are of no further use.

    Suppose Q is the most common letter from the remaining words. Repeat, but now with 49900 and 49920 names.

    Probably, as go through the alphabet, the number of names eliminated will increase until near the end. Meanwhile, the 50K decreases.

    Maybe the “effort” is about 26 * 200 * 50K = 260M

    Another technique could be added. This will increase complexity but might decrease the total effort… Start by breaking the 50K+50K into buckets; one bucket for each combined length.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search