I have a list of almost 5,000 "fantasy" words, which are written in ASCII text. A few of them are like:
txintoq
txiqbal
txiqfun
txiqwek
txiqyal
txiyton
txonmiq
txoqwul
txoqxik
I would like to come up with an algorithm which checks/verifies that there are no two words in the list that differ by only 1 "similar consonant". So I would define "similar consonant sets" like this (for now):
zs
xj
pb
td
kg
There could be 3+ consonants in a set, but I only show 2 for now. I would need to fine tune this as I learn better what consonants sound similar in real life in the fantasy language tonality.
So words like this would be marked as "needs fixing" (because they are too similar sounding):
txindan
txintan # differs by only d/t
xumaq
jumaq # differs by x/j
dolpar
dolbar # differs by only a b/p
How can I somewhat efficiently find these words in my ~5k word list, where they are off by only one consonant like this?
Here is a big list of the words I currently have, ranging from 3-10 letters.
I am only currently imagining how to solve it in an extremely naive way, like this:
import fs from 'fs'
const terms = fs
.readFileSync('term.csv', 'utf-8')
.trim()
.split(/n+/)
.map(line => {
let [term] = line.split(',')
return term
})
.filter(x => x)
const consonantSets = `
zs
xj
pb
td
kg`
.split(/n+/)
.map(x => x.split(''))
function computeSimilarTerms(
term: string,
consonantSets: Array<Array<string>>,
) {
const termLetters = term?.split('') ?? []
const newTerms: Array<string> = []
for (const consonantSet of consonantSets) {
for (const letter of consonantSet) {
for (const letter2 of consonantSet) {
if (letter === letter2) {
continue
}
let i = 0
while (i < termLetters.length) {
const termLetter = termLetters[i]
if (termLetter === letter) {
const newTerm = termLetters.concat()
termLetters[i] = letter2
newTerms.push(newTerm.join(''))
}
i++
}
}
}
}
return newTerms
}
for (const term of terms) {
const similarTerms = computeSimilarTerms(term, consonantSets)
similarTerms.forEach(similarTerm => {
if (terms.includes(similarTerm)) {
console.log(term, similarTerm)
}
})
}
How can this be done in a somewhat less brute force way? And this is also incomplete, as it doesn’t construct all combinations of terms which it could be similar to. So somehow it should be able to do that at some point in the algorithm. Any ideas?
2
Answers
Choose one consonant in each group as the "representant" of this group. Then, build a map that groups together words that become the same when their consonants are replaced by representant consonants.
Important note: this approach only works if the consonant groups form equivalence classes. In particular, similarity of consonants has to be transitive. If
'bp'
are similar and'bv'
are similar, but'pv'
are not similar, then this approach won’t work.Here is code in python to illustrate; I’ll let you write the javascript code.
f
is a map that maps each consonant to its representant consonant;d
is a map that maps each representant word to the list of words that have this representant.And finally we can see which words are similar:
I thought about this much the same way Stef did in another answer. But I think about it in JS, so this version may be relevant.
group
is similar to the thegroupBy
functions in Underscore, lodash, and Ramda (disclaimer: I’m an author), except that it returns a flat array of arrays, ignoring the generated keys.tooSimilar
creates an index of the similar letters, with each one pointing to the first letter of the string, then callsgroupBy
on the words input using a function that simply replaces the letters in that index with their group representative, then filters the results to only include the ones containing more than one word.Note that my sample data includes ones that match on two different consonant pairs:
'dolpar'
,'dolbar'
,'tolpar'
, and'tolbar'
.A realistic example of the potential problem with equivalence classes in English would be
'ck'
and'cs'
.'k'
and's'
probably shouldn’t overlap.