I need to retrieve an object at a specific index from a massive JSON array. The array contains 2,000,000 objects and the file size is around 5GB.
I’ve experimented with various approaches using jq in combination with Python, but performance remains an issue.
Here are some of the methods I’ve tried:
- Direct indexing:
  jq -c '.[100000]' Movies.json
- Slurping and indexing:
  jq --slurp '.[0][100000]' Movies.json
- Using nth():
  jq -c 'nth(100000; .[])' Movies.json
While these methods seem to work, they are too slow for my requirements. I’ve also tried using streams, which significantly improves performance:
jq -cn --stream 'nth(100000; fromstream(1|truncate_stream(inputs)))' Movies.json
However, as the index increases, so does the retrieval time, which I suspect is due to how streaming operates.
I understand that one option is to split the file into chunks, but I’d rather avoid creating additional files.
JSON structure example:
[
  {
    "Item": {
      "Name": "Darkest Legend",
      "Year": 1992,
      "Genre": ["War"],
      "Director": "Sherill Eal Eisenberg",
      "Producer": "Arabella Orth",
      "Screenplay": ["Octavia Delmer"],
      "Cast": ["Johanna Azar", "..."],
      "Runtime": 161,
      "Rate": "9.0",
      "Description": "Robin Northrop Cymbre",
      "Reviews": "Gisela Seumas"
    },
    "Similars": [
      {
        "Name": "Smooth of Edge",
        "Year": 1985,
        "Genre": ["Western"],
        "Director": "Vitoria Eustacia",
        "Producer": "Auguste Jamaal Corry",
        "Screenplay": ["Jaquenette Lance Gibe"],
        "Cast": ["Althea Nicole", "..."],
        "Runtime": 96,
        "Rate": "6.5",
        "Description": "Ashlan Grobe",
        "Reviews": "Annnora Vasquez"
      }
    ]
  },
  ...
]
How could I improve the efficiency of object retrieval from such a large array?
2 Answers
Not a tested solution (due to missing data), but I think the expression

nth(100000; fromstream(1|truncate_stream(inputs)))

creates 100000 objects and throws them away (except the last). The expression below should avoid that overhead and might be faster:
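An untested sketch: select only the stream events whose top-level index is the target, then reconstruct just that one object, so no other objects are ever built:

jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == 100000)))' Movies.json

jq still has to scan the whole file this way, but wrapping the fromstream(...) call in first(...) should let it stop reading as soon as the object has been emitted.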
If you want to do this repeatedly, you could create a prepared file with one item per line, at a one-time cost roughly equal to that of retrieving the last item. External tools can then perform the actual retrieval much faster, since they only scan for occurrences of a row delimiter (the newline here) instead of parsing JSON. This way, the JSON parsing happens only once.
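A sketch of that one-time step, with Movies.jsonl as an assumed output name; -c prints each array element compactly on its own line:

jq -c '.[]' Movies.json > Movies.jsonl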
For example, using sed as the external tool:
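A sketch, assuming the Movies.jsonl file created above. Line numbers are 1-based, so item index 100000 sits on line 100001, and the q quits right after printing so the rest of the file is never read:

sed -n '100001{p;q;}' Movies.jsonl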
Or awk:
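Again a sketch against the assumed Movies.jsonl; NR is awk's 1-based record (line) counter, and exit stops the scan early:

awk 'NR == 100001 { print; exit }' Movies.jsonl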
After that, you can use jq again to operate on the item(s) retrieved.
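For example, to pull a single field out of the retrieved item (again using the hypothetical Movies.jsonl):

sed -n '100001{p;q;}' Movies.jsonl | jq '.Item.Name'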