
I need to retrieve an object at a specific index from a massive JSON array. The array contains 2,000,000 objects and the file size is around 5GB.

I’ve experimented with various approaches using jq in combination with Python, but performance remains an issue.
Here are some of the methods I’ve tried:

  1. Direct indexing:

    jq -c '.[100000]' Movies.json
    
  2. Slurping and indexing:

    jq --slurp '.[0].[100000]' Movies.json
    
  3. Using nth():

    jq -c 'nth(100000; .[])' Movies.json
    

While these methods seem to work, they are too slow for my requirements. I’ve also tried using streams, which significantly improves performance:

jq -cn --stream 'nth(100000; fromstream(1|truncate_stream(inputs)))' Movies.json

However, as the index increases, so does the retrieval time, which I suspect is because streaming still has to read every event from the start of the file up to the requested index.

I understand that one option is to divide the file into chunks, but I’d rather avoid creating additional files by doing so.

JSON structure example:

[
    {
        "Item": {
            "Name": "Darkest Legend",
            "Year": 1992,
            "Genre": ["War"],
            "Director": "Sherill Eal Eisenberg",
            "Producer": "Arabella Orth",
            "Screenplay": ["Octavia Delmer"],
            "Cast": ["Johanna Azar", "..."],
            "Runtime": 161,
            "Rate": "9.0",
            "Description": "Robin Northrop Cymbre",
            "Reviews": "Gisela Seumas"
        },
        "Similars": [
            {
                "Name": "Smooth of Edge",
                "Year": 1985,
                "Genre": ["Western"],
                "Director": "Vitoria Eustacia",
                "Producer": "Auguste Jamaal Corry",
                "Screenplay": ["Jaquenette Lance Gibe"],
                "Cast": ["Althea Nicole", "..."],
                "Runtime": 96,
                "Rate": "6.5",
                "Description": "Ashlan Grobe",
                "Reviews": "Annnora Vasquez"
            }
        ]
    },
    ...
]

How could I improve the efficiency of object retrieval from such a large array?

2 Answers


  1. Not a tested solution (due to missing data), but I think the expression

    nth(100000; fromstream(1|truncate_stream(inputs)))
    

    reassembles every object up to and including index 100000 and throws all of them away except the last.

    This expression should avoid that overhead and might be faster:

    fromstream(1|truncate_stream( inputs | select(.[0][0] == 100000)))
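
    Untested as well, but following the invocation pattern already used in the question, the full command would presumably be:

    jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == 100000)))' Movies.json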
    
  2. If you want to do this repeatedly, you could pay a one-time cost (roughly that of retrieving the last item) to create a prepared file with one item per line, and then use that file with external tools for the actual retrieval. Those tools can be much faster because they only scan for a row delimiter (the newline, in this case), so the expensive JSON parsing happens only once.

    For example, using sed as the external tool:

    jq -c '.[]' Movies.json > lines    # slow, once only

    sed '100001q;d' lines              # fast, repeatable (0-based index 100000 is 1-based line 100001)
    

    Or awk:

    jq -c '.[]' Movies.json > lines

    awk 'NR == 100001 { print; exit }' lines   # again 1-based: index 100000 is line 100001
    

    After that, you can use jq again to operate on the item(s) retrieved.
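
    For instance, to pull a single field out of the item just retrieved (the field names below come from the question's sample structure, so treat this as a sketch):

    sed '100001q;d' lines | jq -r '.Item.Name'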
