skip to Main Content

I have a list of almost 5,000 "fantasy" words, which are written in ASCII text. A few of them are like:

txintoq
txiqbal
txiqfun
txiqwek
txiqyal
txiyton
txonmiq
txoqwul
txoqxik

I would like to come up with an algorithm which checks/verifies that there are no two words in the list that differ by only 1 "similar consonant". So I would define "similar consonant sets" like this (for now):

zs
xj
pb
td
kg

There could be 3+ consonants in a set, but I only show 2 for now. I would need to fine tune this as I learn better what consonants sound similar in real life in the fantasy language tonality.

So words like this would be marked as "needs fixing" (because they are too similar sounding):

txindan
txintan # differs by only d/t

xumaq
jumaq # differs by x/j

dolpar
dolbar # differs by only a b/p

How can I somewhat efficiently find these words in my ~5k word list, where they are off by only one consonant like this?

Here is a big list of the words I currently have, ranging from 3-10 letters.

I am only currently imagining how to solve it in an extremely naive way, like this:

import fs from 'fs'

const terms = fs
  .readFileSync('term.csv', 'utf-8')
  .trim()
  .split(/n+/)
  .map(line => {
    let [term] = line.split(',')
    return term
  })
  .filter(x => x)

const consonantSets = `
zs
xj
pb
td
kg`
  .split(/n+/)
  .map(x => x.split(''))

function computeSimilarTerms(
  term: string,
  consonantSets: Array<Array<string>>,
) {
  const termLetters = term?.split('') ?? []
  const newTerms: Array<string> = []
  for (const consonantSet of consonantSets) {
    for (const letter of consonantSet) {
      for (const letter2 of consonantSet) {
        if (letter === letter2) {
          continue
        }

        let i = 0
        while (i < termLetters.length) {
          const termLetter = termLetters[i]
          if (termLetter === letter) {
            const newTerm = termLetters.concat()
            termLetters[i] = letter2
            newTerms.push(newTerm.join(''))
          }

          i++
        }
      }
    }
  }
  return newTerms
}

for (const term of terms) {
  const similarTerms = computeSimilarTerms(term, consonantSets)
  similarTerms.forEach(similarTerm => {
    if (terms.includes(similarTerm)) {
      console.log(term, similarTerm)
    }
  })
}

How can this be done in a somewhat less brute force way? And this is also incomplete, as it doesn’t construct all combinations of terms which it could be similar to. So somehow it should be able to do that at some point in the algorithm. Any ideas?

2

Answers


  1. Choose one consonant in each group as the "representant" of this group. Then, build a map that groups together words that become the same when their consonants are replaced by representant consonants.

    Important note: this approach only works if the consonant groups form equivalence classes. In particular, similarity of consonants has to be transitive. If 'bp' are similar and 'bv' are similar, but 'pv' are not similar, then this approach won’t work.

    Here is code in python to illustrate; I’ll let you write the javascript code.

    • f is a map that maps each consonant to its representant consonant;
    • d is a map that maps each representant word to the list of words that have this representant.
    bigwordlist = '''dolbar
    dolpar
    jumaq
    txindan
    txintan
    txintoq
    txiqbal
    txiqfun
    txiqwek
    txiqyal
    txinton
    txonmiq
    txoqwul
    txoqxik
    xumaq'''.splitlines()
    
    consonant_groups = '''zs
    xj
    pb
    td
    kg'''.splitlines()
    
    f = {}
    for g in consonant_groups:
        for c in g:
            f[c] = g[0]
    
    print(f)
    # {'z': 'z', 's': 'z', 'x': 'x', 'j': 'x', 'p': 'p', 'b': 'p', 't': 't', 'd': 't', 'k': 'k', 'g': 'k'}
        
    d = {}
    for word in bigwordlist:
        key = ''.join(f.get(c, c) for c in word)
        d.setdefault(key, []).append(word)
    
    print(d)
    # {'tolpar': ['dolbar', 'dolpar'], 'xumaq': ['jumaq', 'xumaq'], 'txintan': ['txindan', 'txintan'], 'txintoq': ['txintoq'], 'txiqpal': ['txiqbal'], 'txiqfun': ['txiqfun'], 'txiqwek': ['txiqwek'], 'txiqyal': ['txiqyal'], 'txinton': ['txinton'], 'txonmiq': ['txonmiq'], 'txoqwul': ['txoqwul'], 'txoqxik': ['txoqxik']}
    

    And finally we can see which words are similar:

    print([g for g in d.values() if len(g) > 1])
    # [['dolbar', 'dolpar'], ['jumaq', 'xumaq'], ['txindan', 'txintan']]
    
    Login or Signup to reply.
  2. I thought about this much the same way Stef did in another answer. But I think about it in JS, so this version may be relevant.

    const group = (fn) => (xs) => Object.values(
      xs.reduce((a, x, k) => ((k = fn(x)), (a[k] = a[k] || []), (a[k].push(x), a)), {})
    )
    
    const tooSimilar = (ws, sc, 
      cs = Object.fromEntries(sc.flatMap(g => [...g].map(c => [c, g[0]])))
    ) => group(s=> [...s].map(c => cs[c] || c) .join('')) (ws)
      .filter(xs => xs .length > 1)
    
    const words = ['dolpar', 'dolbar', 'jumaq', 'tolbar', 'tolpar', 'txintoq', 'txiqbal', 'txiqfun', 'txiqwek', 'txindan', 'txintan', 'txiqyal', 'txiyton', 'txonmiq', 'txoqwul', 'txoqxik', 'xumaq']
    const similarConsonants = ['zs', 'xj', 'pb', 'td', 'kg']
    
    console.log(tooSimilar(words, similarConsonants))
    .as-console-wrapper {max-height: 100% !important; top: 0}

    group is similar to the the groupBy functions in Underscore, lodash, and Ramda (disclaimer: I’m an author), except that it returns a flat array of arrays, ignoring the generated keys.

    tooSimilar creates an index of the similar letters, with each one pointing to the first letter of the string, then calls groupBy on the words input using a function that simply replaces the letters in that index with their group representative, then filters the results to only include the ones containing more than one word.

    Note that my sample data includes ones that match on two different consonant pairs: 'dolpar', 'dolbar', 'tolpar', and 'tolbar'.

    A realistic example of the potential problem with equivalence classes in English would be 'ck' and 'cs'. 'k' and 's' probably shouldn’t overlap.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search