fast and effective search algorithm for large json file in python

CSe
October 10, 2024
104 views
0 votes
2 Answers

I have a large JSON file with a dictionary-like structure, containing values and keys. I want to search through the values efficiently and effectively. I’m looking for a search algorithm that is fast and returns the most relevant results.

Currently, I’m using fuzzy search, but it only matches from the beginning of the string. For instance, searching for dimix doesn’t yield a match if the actual string is ada_dimix. I need an algorithm that can find the closest substring matches as well. For example, searching for mix should return ada_dimix.

Could you suggest a suitable search algorithm?

Here is sample data to try:

{
    "tf-checking-feeders.md": "tf-checking-feeders.md",
    "rule-setting-your-local-data-directory.md": "rule-setting-your-local-data-directory.md",
    "formula-using-conditional-logic.md": "formula-using-conditional-logic.md",
    "allocations-feeding-qty-required-kgs.md": "allocations-feeding-qty-required-kgs.md",
    "rules-using-db-functions-move-data-between-cubes.md": "rules-using-db-functions-move-data-between-cubes.md",
    "feeders-skipcheck.md": "feeders-skipcheck.md",
    "rule-components-calculation-statement.md": "rule-components-calculation-statement.md",
    "window-inserting-function.md": "window-inserting-function.md",
    "market-writing-exchange-rate-rule-statement.md": "market-writing-exchange-rate-rule-statement.md",
    "series-hard-coded-feeders.md": "series-hard-coded-feeders.md",
    "allocations-calculating-quantities-fish-required-by-fishcake-type.md": "allocations-calculating-quantities-fish-required-by-fishcake-type.md",
    "rule-purchase-cost-calculation.md": "rule-purchase-cost-calculation.md",
    "options-setting-areas-appearance.md": "options-setting-areas-appearance.md",
    "solutions-third-approach-using-dimix-comparisons.md": "solutions-third-approach-using-dimix-comparisons.md",
    "process-feeding-first-statement.md": "process-feeding-first-statement.md",
    "flows-implementing-depletion-model-using-rules.md": "flows-implementing-depletion-model-using-rules.md",
    "rf-miscellaneous-rules-functions.md": "rf-miscellaneous-rules-functions.md",
    "formula-external-cube-references.md": "formula-external-cube-references.md",
    "costs-calculating-daily-fish-in-inventory-cube.md": "costs-calculating-daily-fish-in-inventory-cube.md",
    "rf-consolidation-calculation-rules-functions.md": "rf-consolidation-calculation-rules-functions.md",
    "replace-replacing-text.md": "replace-replacing-text.md",
    "market-rule.md": "market-rule.md",
    "eirf-elementindex.md": "functions/eirf-elementindex.md",
    "mrf-round.md": "functions/mrf-round.md",
    "dtrf-today.md": "functions/dtrf-today.md",
    "eirf-elementtype.md": "functions/eirf-elementtype.md",
    "eirf-elparn.md": "functions/eirf-elparn.md",
    "trf-char.md": "functions/trf-char.md",
    "dtrf-time.md": "functions/dtrf-time.md",
    "mrf-cos.md": "functions/mrf-cos.md",
    "trfdf-dimix.md": "functions/trfdf-dimix.md",
    "ada-dimix.md": "functions/ada-dimix.md",
    "trf-char-dimix.md": "functions/trf-char-dimix.md"
}

please keep in mind that the JSON data is large and will get even larger.

Answers

- FratKyak
- October 10, 2024 at 5:23 pm
- 0 votes
0
You should probably look at suffix trees. These slides explain a how these suffix trees can be formed, and how do they solve your problem. The basic idea is to preprocess each key for faster lookup later on. Preprocessing gives you a suffix tree, which is basically a very efficient data structure to look for substrings.

Login or Signup to reply.

- AntonLarin
- October 10, 2024 at 5:36 pm
- 0 votes
0
Looks like a job for a database.
If you’re looking for compact solution (more compact than actual db, but probably less effective), look into what sqlite has to offer.
For example this answer combined with this sqlite3 lib might solve your problem.

Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.