Javascript - How to create perfect hash with ASCII symbols as input, where output hash is always the same for each ASCII sequence, even after adding more later?

LancePollard
January 7, 2025
70 views
0 votes
2 Answers

I have this code so far, which performs some kind of hashing. The goal is to map each ASCII string to a single Hangul Syllable unicode point:

const HANGUL_START = 0xAC00; // Start of Hangul Syllables block
const HANGUL_END = 0xD7A3; // End of Hangul Syllables block
const HANGUL_COUNT = HANGUL_END - HANGUL_START + 1; // 11,172 syllables

// Stable hash function
const hashString = (input) => {
    return Array.from(input).reduce(
        (hash, char) => hash * 31n + BigInt(char.charCodeAt(0)),
        0n
    );
};

// Map input strings to unique Hangul syllables
const mapToHangul = (inputs) => {
    // Compute hashes for all inputs
    const hashes = inputs.map((input) => ({
        input,
        hash: hashString(input),
    }));

    // Sort inputs by their hash values
    hashes.sort((a, b) => (a.hash < b.hash ? -1 : a.hash > b.hash ? 1 : 0));

    // Map each input to a unique Hangul syllable based on its rank
    const result = new Map();
    hashes.forEach((item, index) => {
        const hangulCode = HANGUL_START + (index % HANGUL_COUNT);
        result.set(item.input, String.fromCharCode(hangulCode));
    });

    return result;
};

// Example usage
const demonstrateMapping = () => {
    const inputs = ["Hello", "World", "ASCII", "Test123", "ExtraInput"];
    const mapping = mapToHangul(inputs);

    console.log("Mapping:");
    mapping.forEach((hangul, input) => {
        console.log(`${input} -> ${hangul}`);
    });
};

demonstrateMapping();

But, as I pieced together from various ideas online, I don’t feel like it will handle the case where, at first say I have 1000 ASCII strings (from 1 to 5 characters let’s say, each), and I compute the hashes. I get one set of hash values. Then I add 20 more ASCII strings and position them at arbitrary points in the ASCII strings array, then compute the hashes again. There will either be collisions (I can’t have any collisions), or if I use a used cache (Map), then the hashes will depend on the order of the input, which I also don’t want. A variant which has a sort of cache is here:

const HANGUL_START = 0xAC00; // Start of Hangul Syllables block
const HANGUL_END = 0xD7A3; // End of Hangul Syllables block
const HANGUL_COUNT = HANGUL_END - HANGUL_START + 1n; // 11,172 syllables

// Hash function: Purely maps a string to a bigint
const hashString = (input: string): bigint => {
    return Array.from(input).reduce(
        (hash, char) => hash * 31n + BigInt(char.charCodeAt(0)),
        0n
    );
};

// Maps a hash to a Hangul syllable deterministically
const mapHashToHangul = (hash: bigint): string => {
    const normalizedHash = Number(hash % HANGUL_COUNT);
    return String.fromCharCode(HANGUL_START + normalizedHash);
};

// Main function: Convert multiple strings into unique Hangul syllables
const hashMultipleToHangul = (inputs: string[]): Map<string, string> => {
    const usedCodes = new Set<number>(); // Track used Hangul syllables
    const resultMap = new Map<string, string>();

    for (const input of inputs) {
        const hash = hashString(input);
        let hangulCode = HANGUL_START + Number(hash % HANGUL_COUNT);

        // Resolve collisions by finding the next available Hangul code
        while (usedCodes.has(hangulCode)) {
            hangulCode = (hangulCode + 1 - HANGUL_START) % HANGUL_COUNT + HANGUL_START;
        }

        usedCodes.add(hangulCode);
        resultMap.set(input, String.fromCharCode(hangulCode));
    }

    return resultMap;
};

// Example usage
const inputs = [
    "Hello",
    "World",
    "ASCII",
    "Test123",
    "HelloWorld",
    "AnotherTest",
    "FinalString",
    "ExtraInput"
];

const results = hashMultipleToHangul(inputs);
console.log("Hashed Hangul results:");
for (const [input, hangul] of results.entries()) {
    console.log(`${input} -> ${hangul}`);
}

For reference, my input ASCII strings are the i field from this dynamically generated dataset (and the x: getSingleGlyph() is what I want to replace with something more deterministic):

import st from '@lancejpollard/script-tree'

// https://en.wikipedia.org/wiki/Hangul_Syllables
// first hangul jamo syllable
const Xi = parseInt('AC00', 16)
// last hangul jamo syllable
const Xf = parseInt('D7A3', 16)
let X = Xi

const m = {
  u: {
    grave: 'u0300',
    acute: 'u0301',
    dacute: 'u030B',
    dgrave: 'u030F',
    up: 'u0302',
    down: 'u030C',
    dot: 'u0307',
    ddot: 'u0308',
    ring: 'u030A',
    tilde: 'u0303',
    macron: 'u0304',
    hook: 'u0309',
  },
  d: {
    grave: 'u0316',
    acute: 'u0317',
    ring: 'u0325',
    dot: 'u0323',
    ddot: 'u0324',
    down: 'u032C',
    tilde: 'u0330',
    macron: 'u0331',
    cedilla: 'u0327',
    up: 'u032D',
    hook: 'u0328',
  },
}

const D: Record<string, string> = {
  '--': m.u.dgrave,
  '-': m.u.grave,
  '++': m.u.dacute,
  '+': m.u.acute,
  '//': `${m.d.hook}${m.u.dacute}`, // rising 2 (vietnamese ngã)
  '/': `${m.d.hook}${m.u.acute}`, // rising (vietnamese sắc)
  '\/': m.u.down, // falling rising (vietnamese hỏi)
  '/\': m.u.up, // rising falling
  '\\': `${m.d.hook}${m.u.dgrave}`, // falling 2 (vietnamese nặng)
  '\': `${m.d.hook}${m.u.grave}`, // falling (vietnamese huyền)
  '^': m.u.dot, // accent/stress mark
  $: m.d.ddot,
  '&': m.d.tilde,
  _: m.u.macron, // long vowel
  '@': m.d.grave, // non-syllabic
  '!': m.d.macron, // short vowel
  '': '',
}

const G: Record<string, string> = {
  I: `i${m.d.dot}`,
  E: `e${m.d.dot}`,
  A: `a${m.d.dot}`,
  O: `o${m.d.dot}`,
  U: `u${m.d.dot}`,
  i: `i`,
  e: `e`,
  a: `a`,
  o: `o`,
  u: `u`,
}

export type Take = {
  i: string
  x: string
  o: string
  name?: string
  o2?: string
}

export const VOWELS: Array<Take> = []

export const BASE_VOWEL_GLYPHS = [
  'I',
  'E',
  'A',
  'O',
  'U',
  'i',
  'e',
  'a',
  'o',
  'u',
]
export const TONE_MARKS = [
  '--',
  '-',
  '++',
  '+',
  '/',
  '//',
  '\/',
  '/\',
  '\\',
  '\',
  '',
]
export const VARIANT_MARKS = ['$', '']
export const NASAL_MARKS = ['&', '']
export const DURATION_MARKS = ['_', '!', '']
export const SYLLABIC_MARKS = ['@', '']
export const ACCENT_MARKS = ['^', '']

BASE_VOWEL_GLYPHS.forEach(g => {
  ACCENT_MARKS.forEach(a => {
    DURATION_MARKS.forEach(l => {
      SYLLABIC_MARKS.forEach(s => {
        NASAL_MARKS.forEach(n => {
          VARIANT_MARKS.forEach(v => {
            TONE_MARKS.forEach(t => {
              const i = `${g}${v}${n}${s}${t}${l}${a}`
              const x = g.match(/i/i) && a === '^' ? 'ï' : G[g]
              const y = x === 'ï' ? '' : D[a]
              const x2 = v === '$' && x === 'u' ? 'r' : x
              const v2 = v === '$' && g === 'u' ? '' : `${D[v]}`
              const o =
                l === '!'
                  ? `${x2}${y}${D[l]}${D[n]}${D[s]}${D[t]}${v2}`
                  : `${x2}${D[l]}${D[n]}${D[s]}${D[t]}${v2}${y}`
              VOWELS.push({ i, x: getSingleGlyph(), o })
            })
          })
        })
      })
    })
  })
})

export const SYMBOLS = [
  { i: '=.', x: getSingleGlyph(), o: '.' },
  { i: '=?', x: getSingleGlyph(), o: '?' },
  { i: '=!', x: getSingleGlyph(), o: '!' },
  { i: '=+', x: getSingleGlyph(), o: '+' },
  { i: '=-', x: getSingleGlyph(), o: '-' },
  { i: '>', x: getSingleGlyph(), o: '>' },
  { i: '<', x: getSingleGlyph(), o: '<' },
  { i: '/', x: getSingleGlyph(), o: '/' },
  { i: '\', x: getSingleGlyph(), o: '\' },
  { i: '|', x: getSingleGlyph(), o: '|' },
  { i: '(', x: getSingleGlyph(), o: '(' },
  { i: ')', x: getSingleGlyph(), o: ')' },
  { i: '[', x: getSingleGlyph(), o: '[' },
  { i: ']', x: getSingleGlyph(), o: ']' },
  { i: ' ', x: getSingleGlyph(), o: ' ' },
]

export const NUMERALS = [
  { i: '0', x: getSingleGlyph(), o: '0' },
  { i: '1', x: getSingleGlyph(), o: '1' },
  { i: '2', x: getSingleGlyph(), o: '2' },
  { i: '3', x: getSingleGlyph(), o: '3' },
  { i: '4', x: getSingleGlyph(), o: '4' },
  { i: '5', x: getSingleGlyph(), o: '5' },
  { i: '6', x: getSingleGlyph(), o: '6' },
  { i: '7', x: getSingleGlyph(), o: '7' },
  { i: '8', x: getSingleGlyph(), o: '8' },
  { i: '9', x: getSingleGlyph(), o: '9' },
]

export const CONSONANTS = [
  { i: '@', x: getSingleGlyph(), o: `@` },
  { i: 'h~', x: getSingleGlyph(), o: `ɦ` },
  { i: 'm', x: getSingleGlyph(), o: `m` },
  { i: 'N', x: getSingleGlyph(), o: `n${m.d.dot}` },
  { i: 'n', x: getSingleGlyph(), o: `n` },
  { i: 'q', x: getSingleGlyph(), o: `n${m.u.dot}` },
  { i: 'G~', x: getSingleGlyph(), o: `g${m.u.tilde}` },
  { i: 'G', x: getSingleGlyph(), o: `g${m.u.dot}` },
  { i: 'g?', x: getSingleGlyph(), o: `g${m.u.grave}` },
  { i: 'g', x: getSingleGlyph(), o: `g` },
  { i: "'", x: getSingleGlyph(), o: `'` },
  { i: 'Q', x: getSingleGlyph(), o: `q${m.u.dot}` },
  { i: 'd?', x: getSingleGlyph(), o: `d${m.d.grave}` },
  { i: 'd!', x: getSingleGlyph(), o: `d${m.d.acute}` },
  { i: 'd*', x: getSingleGlyph(), o: `d${m.d.down}` },
  { i: 'd.', x: getSingleGlyph(), o: `d${m.d.macron}` },
  { i: 'D', x: getSingleGlyph(), o: `d${m.d.dot}` },
  { i: 'dQ~', x: getSingleGlyph(), o: `d${m.d.tilde}` },
  { i: 'd', x: getSingleGlyph(), o: `d` },
  { i: 'b?', x: getSingleGlyph(), o: `b${m.d.grave}` },
  { i: 'b!', x: getSingleGlyph(), o: `b${m.d.acute}` },
  { i: 'b', x: getSingleGlyph(), o: `b` },
  { i: 'p!', x: getSingleGlyph(), o: `p${m.u.acute}` },
  { i: 'p*', x: getSingleGlyph(), o: `p${m.u.up}` },
  { i: 'p.', x: getSingleGlyph(), o: `t${m.u.macron}` },
  { i: 'p@', x: getSingleGlyph(), o: `x${m.u.down}` },
  { i: 'p', x: getSingleGlyph(), o: `p` },
  { i: 'T!', x: getSingleGlyph(), o: `t${m.d.dot}${m.d.acute}` },
  { i: 'T', x: getSingleGlyph(), o: `t${m.d.dot}` },
  { i: 't!', x: getSingleGlyph(), o: `t${m.d.acute}` },
  { i: 't*', x: getSingleGlyph(), o: `t${m.d.down}` },
  { i: 'tQ~', x: getSingleGlyph(), o: `t${m.d.tilde}` },
  { i: 't@', x: getSingleGlyph(), o: `t${m.d.up}` },
  { i: 't.', x: getSingleGlyph(), o: `t${m.d.macron}` },
  { i: 't', x: getSingleGlyph(), o: `t` },
  { i: 'k!', x: getSingleGlyph(), o: `k${m.d.acute}` },
  { i: 'k.', x: getSingleGlyph(), o: `k${m.d.macron}` },
  { i: 'k*', x: getSingleGlyph(), o: `k${m.d.down}` },
  { i: 'K!', x: getSingleGlyph(), o: `k${m.d.dot}${m.d.acute}` },
  { i: 'K', x: getSingleGlyph(), o: `k${m.d.dot}` },
  { i: 'k', x: getSingleGlyph(), o: `k` },
  { i: 'H!', x: getSingleGlyph(), o: `h${m.d.dot}${m.d.acute}` },
  { i: 'H', x: getSingleGlyph(), o: `h${m.d.dot}` },
  { i: 'h!', x: getSingleGlyph(), o: `ħ` },
  { i: 'h', x: getSingleGlyph(), o: `h` },
  { i: 'J', x: getSingleGlyph(), o: `ȷ̈` },
  { i: 'j!', x: getSingleGlyph(), o: `j${m.u.acute}` },
  { i: 'j', x: getSingleGlyph(), o: `j` },
  { i: 'S!', x: getSingleGlyph(), o: `s${m.d.dot}${m.u.acute}` },
  { i: 's!', x: getSingleGlyph(), o: `s${m.u.acute}` },
  { i: 'S', x: getSingleGlyph(), o: `s${m.d.dot}` },
  { i: 'sQ~', x: getSingleGlyph(), o: `s${m.d.tilde}` },
  { i: 's@', x: getSingleGlyph(), o: `s${m.d.up}` },
  { i: 's', x: getSingleGlyph(), o: `s` },
  { i: 'F', x: getSingleGlyph(), o: `f${m.d.dot}` },
  { i: 'f!', x: getSingleGlyph(), o: `f${m.d.acute}` },
  { i: 'f', x: getSingleGlyph(), o: `f` },
  { i: 'V', x: getSingleGlyph(), o: `v${m.d.dot}` },
  { i: 'v', x: getSingleGlyph(), o: `v` },
  { i: 'z!', x: getSingleGlyph(), o: `z${m.u.acute}` },
  { i: 'zQ~', x: getSingleGlyph(), o: `z${m.d.tilde}` },
  { i: 'z', x: getSingleGlyph(), o: `z` },
  { i: 'Z!', x: getSingleGlyph(), o: `z${m.d.dot}${m.u.acute}` },
  { i: 'Z', x: getSingleGlyph(), o: `z${m.d.dot}` },
  { i: 'CQ~', x: getSingleGlyph(), o: `c${m.d.dot}${m.u.tilde}` },
  { i: 'C', x: getSingleGlyph(), o: `c${m.d.dot}` },
  { i: 'cQ~', x: getSingleGlyph(), o: `c${m.u.tilde}` },
  { i: 'c', x: getSingleGlyph(), o: `c` },
  { i: 'L', x: getSingleGlyph(), o: `l${m.d.dot}` },
  { i: 'l*', x: getSingleGlyph(), o: `l${m.d.down}` },
  { i: 'lQ~', x: getSingleGlyph(), o: `l${m.d.tilde}` },
  { i: 'l', x: getSingleGlyph(), o: `l` },
  { i: 'R', x: getSingleGlyph(), o: `r${m.d.dot}` },
  { i: 'rQ~', x: getSingleGlyph(), o: `r${m.u.tilde}` },
  { i: 'r', x: getSingleGlyph(), o: `r${m.u.dot}` },
  { i: 'x!', x: getSingleGlyph(), o: `x${m.u.acute}` },
  { i: 'X!', x: getSingleGlyph(), o: `x${m.d.dot}${m.u.acute}` },
  { i: 'X', x: getSingleGlyph(), o: `x${m.d.dot}` },
  { i: 'x@', x: getSingleGlyph(), o: `x${m.d.up}` },
  { i: 'x', x: getSingleGlyph(), o: `x` },
  { i: 'W', x: getSingleGlyph(), o: `w${m.u.dot}` },
  { i: 'w!', x: getSingleGlyph(), o: `w${m.u.acute}` },
  { i: 'w~', x: getSingleGlyph(), o: `w${m.d.dot}` },
  { i: 'w', x: getSingleGlyph(), o: `w` },
  { i: 'y~', x: getSingleGlyph(), o: `y${m.u.dot}` },
  { i: 'y', x: getSingleGlyph(), o: `y` },
]

export const GLYPHS = [
  ...VOWELS,
  ...CONSONANTS,
  ...SYMBOLS,
  ...NUMERALS,
]

// eslint-disable-next-line @typescript-eslint/no-unsafe-assignment, @typescript-eslint/no-unsafe-member-access
const tree = st.fork(GLYPHS)
// eslint-disable-next-line @typescript-eslint/no-unsafe-return, @typescript-eslint/no-unsafe-member-access
const make = (text: string): string => st.form(text, tree)

make.inputs = (text: string): Array<string> =>
  // eslint-disable-next-line @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-return
  st.list(text, tree).map((x: any) => x.i)

make.readableOutput = (text: string): Array<string> =>
  // eslint-disable-next-line @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-return
  st.list(text, tree).map((x: any) => x.o)

make.readable = (text: string): string =>
  make.readableOutput(text).join('')

make.machineOutputs = (text: string): Array<string> =>
  // eslint-disable-next-line @typescript-eslint/no-unsafe-member-access, @typescript-eslint/no-unsafe-return
  st.list(text, tree).map((x: any) => x.x)

make.machine = (text: string): string =>
  make.machineOutputs(text).join('')

export default make

function getSingleGlyph() {
  if (X === Xf) {
    throw new Error(`Too many glyphs created`)
  }

  return String.fromCodePoint(X++)
}

How can I make a hashing function which takes my ASCII strings in my array (GLYPHS array above, in the i field), and maps it to a Hangul Jamo Syllable, in such a way that, if in the future I add more items to the GLYPHS array (in arbitrary spots), the old mappings will stay the same, and the new ones won’t create any conflicts? Is it possible? If so, how to do? If it’s not possible, why is it not possible? In that case I might have to resort to manually selecting Hangul glyphs by hand and setting them, but that will be a pain to do and pain to maintain. Would like a hashing solution so I don’t need to do it by hand, but all the hashing solutions I’ve thought about so far seems to have the problem that if I add new items later, the hashing values will all change. Can that be avoided?

Note: There are about 11,000 Hangul Jamo Syllables in that unicode block. I will have much less than that number of input ASCII strings, so we won’t run out of space.

Answers

- MattTimmermans
- January 5, 2025 at 3:39 pm
- 0 votes
0
Your requirements that:
- hashes are stable against expansion of the key set;
- the order in which keys are added doesn’t affect the hash values; and
- no collisions
… make it impossible to get what you want.

Because any key might be the first one you add to the key set, your function would have to be able to generate a unique hash for any key individually.

There are more possible keys than hashes, so that’s not gonna happen.
Login or Signup to reply.

- Surt
- January 7, 2025 at 10:31 pm
- 0 votes
0
As you stated your universe is smaller than the Hangul universe.

Then you need a perfect hash, a hash with O(1) collision, ie no jumping through a random length chain of links to the result, we also want it to be a minimal perfect hash, where there are no more buckets than entries.

A short blog on this explains a possible implementation of which ill give the quick quick here.

Make a hash function taking 2 parameters first is a number(initially zero) and second that is your value(ie. string).
Create a table of tables the length of your universe(number of unique value). ie. buckets[size].
Append to the table in the hash position with the value of your string.

Sort the buckets after number of entries (collisions).
now run buckets through from most entries to reaching the first with only one or zero.
And try to place all without collisions
change the number until success!!!
once each bucket is successfully placed copy the placements to the final result.

The final result are two lists G with index to a 2nd table values (hangul). In fact we don’t need the final table with values in your case, more on that later, but we need it for a temp result to build the hash.

make a list of free position in the result.
for all entries with only one key just use the next index from the free list. Here we need a flag to indicate we should use the index directly, using the negative value (with a -1 if your zero based).

So when doing a lookup, use your hash(0, string) to find the value in G, the value we want is just the index + HANGUL_START (beware off-by-one errors).

Alternatively there is another method where you explore the bits in the strings, find a minimal number of bits that gives a perfect hash, compress them and use that as an index to the lookup table!

Don’t be too scared, but I had to read the wiki, the paper and the code several times before I saw the light 🙂

Login or Signup to reply.