I have a large JSON file with a dictionary-like structure, containing values and keys. I want to search through the values efficiently and effectively. I’m looking for a search algorithm that is fast and returns the most relevant results.
Currently, I’m using fuzzy search, but it only matches from the beginning of the string. For instance, searching for dimix
doesn’t yield a match if the actual string is ada_dimix
. I need an algorithm that can find the closest substring matches as well. For example, searching for mix
should return ada_dimix
.
Could you suggest a suitable search algorithm?
Here is sample data to try:
{
"tf-checking-feeders.md": "tf-checking-feeders.md",
"rule-setting-your-local-data-directory.md": "rule-setting-your-local-data-directory.md",
"formula-using-conditional-logic.md": "formula-using-conditional-logic.md",
"allocations-feeding-qty-required-kgs.md": "allocations-feeding-qty-required-kgs.md",
"rules-using-db-functions-move-data-between-cubes.md": "rules-using-db-functions-move-data-between-cubes.md",
"feeders-skipcheck.md": "feeders-skipcheck.md",
"rule-components-calculation-statement.md": "rule-components-calculation-statement.md",
"window-inserting-function.md": "window-inserting-function.md",
"market-writing-exchange-rate-rule-statement.md": "market-writing-exchange-rate-rule-statement.md",
"series-hard-coded-feeders.md": "series-hard-coded-feeders.md",
"allocations-calculating-quantities-fish-required-by-fishcake-type.md": "allocations-calculating-quantities-fish-required-by-fishcake-type.md",
"rule-purchase-cost-calculation.md": "rule-purchase-cost-calculation.md",
"options-setting-areas-appearance.md": "options-setting-areas-appearance.md",
"solutions-third-approach-using-dimix-comparisons.md": "solutions-third-approach-using-dimix-comparisons.md",
"process-feeding-first-statement.md": "process-feeding-first-statement.md",
"flows-implementing-depletion-model-using-rules.md": "flows-implementing-depletion-model-using-rules.md",
"rf-miscellaneous-rules-functions.md": "rf-miscellaneous-rules-functions.md",
"formula-external-cube-references.md": "formula-external-cube-references.md",
"costs-calculating-daily-fish-in-inventory-cube.md": "costs-calculating-daily-fish-in-inventory-cube.md",
"rf-consolidation-calculation-rules-functions.md": "rf-consolidation-calculation-rules-functions.md",
"replace-replacing-text.md": "replace-replacing-text.md",
"market-rule.md": "market-rule.md",
"eirf-elementindex.md": "functions/eirf-elementindex.md",
"mrf-round.md": "functions/mrf-round.md",
"dtrf-today.md": "functions/dtrf-today.md",
"eirf-elementtype.md": "functions/eirf-elementtype.md",
"eirf-elparn.md": "functions/eirf-elparn.md",
"trf-char.md": "functions/trf-char.md",
"dtrf-time.md": "functions/dtrf-time.md",
"mrf-cos.md": "functions/mrf-cos.md",
"trfdf-dimix.md": "functions/trfdf-dimix.md",
"ada-dimix.md": "functions/ada-dimix.md",
"trf-char-dimix.md": "functions/trf-char-dimix.md"
}
please keep in mind that the JSON data is large and will get even larger.
2
Answers
You should probably look at suffix trees. These slides explain a how these suffix trees can be formed, and how do they solve your problem. The basic idea is to preprocess each key for faster lookup later on. Preprocessing gives you a suffix tree, which is basically a very efficient data structure to look for substrings.
Looks like a job for a database.
If you’re looking for compact solution (more compact than actual db, but probably less effective), look into what sqlite has to offer.
For example this answer combined with this sqlite3 lib might solve your problem.