
I need to retrieve an object at a specific index from a massive JSON array. The array contains 2,000,000 objects and the file size is around 5GB.

I’ve experimented with various approaches using jq in combination with Python, but performance remains an issue.
Here are some of the methods I’ve tried:

  1. Direct indexing:

    jq -c '.[100000]' Movies.json
    
  2. Slurping and indexing:

    jq --slurp '.[0].[100000]' Movies.json
    
  3. Using nth():

    jq -c 'nth(100000; .[])' Movies.json
    

While these methods seem to work, they are too slow for my requirements. I’ve also tried using streams, which significantly improves performance:

jq -cn --stream 'nth(100000; fromstream(1|truncate_stream(inputs)))' Movies.json

However, as the index increases, so does the retrieval time, which I suspect is because streaming still has to read every event from the start of the file up to the requested index.

I understand that one option is to divide the file into chunks, but I’d rather avoid creating additional files by doing so.

JSON structure example:

[
    {
        "Item": {
            "Name": "Darkest Legend",
            "Year": 1992,
            "Genre": ["War"],
            "Director": "Sherill Eal Eisenberg",
            "Producer": "Arabella Orth",
            "Screenplay": ["Octavia Delmer"],
            "Cast": ["Johanna Azar", "..."],
            "Runtime": 161,
            "Rate": "9.0",
            "Description": "Robin Northrop Cymbre",
            "Reviews": "Gisela Seumas"
        },
        "Similars": [
            {
                "Name": "Smooth of Edge",
                "Year": 1985,
                "Genre": ["Western"],
                "Director": "Vitoria Eustacia",
                "Producer": "Auguste Jamaal Corry",
                "Screenplay": ["Jaquenette Lance Gibe"],
                "Cast": ["Althea Nicole", "..."],
                "Runtime": 96,
                "Rate": "6.5",
                "Description": "Ashlan Grobe",
                "Reviews": "Annnora Vasquez"
            }
        ]
    },
    ...
]

How could I improve the efficiency of object retrieval from such a large array?

2 Answers


  1. Not a tested solution (due to missing data), but I think the expression

    nth(100000; fromstream(1|truncate_stream(inputs)))
    

    reassembles every object up to and including index 100000 and throws all of them away except the last.

    This expression should avoid that overhead and might be faster:

    fromstream(1|truncate_stream( inputs | select(.[0][0] == 100000)))
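
    Untested as well, but following the invocation pattern already used in the question, the full command would presumably be:

    jq -cn --stream 'fromstream(1|truncate_stream(inputs | select(.[0][0] == 100000)))' Movies.json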
    
  2. If you want to do this repeatedly, you could pay a one-time cost (roughly that of retrieving the last item) to create a prepared file with one item per line, and then use that file with external tools for the actual retrieval. Those tools can be much faster because they only scan for a row delimiter (the newline, in this case), so the expensive JSON parsing happens only once.

    For example, using sed as the external tool:

    jq -c '.[]' Movies.json > lines    # slow, once only

    sed '100001q;d' lines              # fast, repeatable (0-based index 100000 is 1-based line 100001)
    

    Or awk:

    jq -c '.[]' Movies.json > lines

    awk 'NR == 100001 { print; exit }' lines   # again 1-based: index 100000 is line 100001
    

    After that, you can use jq again to operate on the item(s) retrieved.
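
    For instance, to pull a single field out of the item just retrieved (the field names below come from the question's sample structure, so treat this as a sketch):

    sed '100001q;d' lines | jq -r '.Item.Name'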
