skip to Main Content

I have been reading a bit about tries, and how they are a good structure for typeahead designs. Aside from the trie, you usually also have a key/value pair for nodes and pre-computed top-n suggestions to improve response times.

Usually, from what I’ve gathered, it is ideal to keep them in memory for fast searches, such as what was suggested in this question: Scrabble word finder: building a trie, storing a trie, using a trie?. However, what if your Trie is too big and you have to shard it somehow? (e.g. perhaps a big e-commerce website).

The key/value pair for pre-computed suggestions can be obviously implemented in a key/value store (either kept in memory, like memcached/redis or in a database, and horizontally scalled as needed), but what is the best way to store a trie if it can’t fit in memory? Should it be done at all, or should distributed systems each hold part of the trie in memory, while also replicating it so that it is not lost?

Alternatively, a search service (e.g. Solr or Elasticsearch) could be used to produce search suggestions/auto-complete, but I’m not sure whether the performance is up to par for this particular use-case. The advantage of the Trie is that you can pre-compute top-N suggestions based on its structure, leaving the search service to handle actual search on the website.

I know there are off-the-shelf solutions for this, but I’m mostly interesting in learning how to re-invent the wheel on this one, or at least catch a glimpse of the best practices if one wants to broach this topic.

What are your thoughts?

Edit: I also saw this post: https://medium.com/@prefixyteam/how-we-built-prefixy-a-scalable-prefix-search-service-for-powering-autocomplete-c20f98e2eff1, which basically covers the use of Redis as primary data store for a skip list with mongodb for LRU prefixes. Seems an OK approach, but I would still want to learn if there are other viable/better approaches.

2

Answers


  1. I was reading ‘System Design Interview: An Insider’s Guide’ by Alex Yu and he briefly covers this topic.

    Trie DB. Trie DB is the persistent storage. Two options are available to store the data:

    1. Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB [4] are good fits for serialized data.
    2. Key-value store: A trie can be represented in a hash table form [4] by applying the following logic:
      • Every prefix in the trie is mapped to a key in a hash table.
      • Data on each trie node is mapped to a value in a hash table.
      Figure 13-10 shows the mapping between the trie and hash table.

    Trie stored in key-value store

    The numbers on each prefix node represent the frequency of searches for that specific prefix/word.

    He then suggests possibly scaling the storage by sharding the data by each alphabet (or groups of alphabets, like a-m, n-z). But this would unevenly distribute the data since there are more words that start with ‘a’ than with ‘z’.

    So he recommends using some type of shard manager where it keeps track of the query frequency and assigns a shard based on that. So if there are twice the amount of queries for ‘s’ as opposed to ‘z’ and ‘x’ combined, two shards can be used, one for ‘s’, and another for ‘z’ and ‘x’.

    Login or Signup to reply.
  2. We can maintain a singular alphabet starting trie and save/update the frequency node everytime it hits ,
    And along side we can create a job that can take up the most frequent search and train a ml model. For picking up the frequent ones side by side , now for the search parameter the auto complete and all can be achieved , at the same time a better suggestions could be given to the user.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search