I have been using jq to successfully extract one JSON blob at a time from some relatively large files, writing the output to a file with one JSON object per line for further processing. Here is an example of the JSON format:
{
  "date": "2023-07-30",
  "results1": [
    {
      "data": [
        {"row": [{"key1": "row1", "key2": "row1"}]},
        {"row": [{"key1": "row2", "key2": "row2"}]}
      ]
    },
    {
      "data": [
        {"row": [{"key1": "row3", "key2": "row3"}]},
        {"row": [{"key1": "row4", "key2": "row4"}]}
      ]
    }
  ],
  "results2": [
    {
      "data": [
        {"row": [{"key3": "row1", "key4": "row1"}]},
        {"row": [{"key3": "row2", "key4": "row2"}]}
      ]
    },
    {
      "data": [
        {"row": [{"key3": "row3", "key4": "row3"}]},
        {"row": [{"key3": "row4", "key4": "row4"}]}
      ]
    }
  ]
}
My current approach is to run the following and redirect the stdout to a file:
jq -rc ".results1[]" my_json.json
This works fine; however, it seems that jq reads the entire file into memory in order to extract the chunk I am interested in.
Questions:
- Does jq read the entire file into memory when I execute the above statement?
- Assuming the answer is yes, is there a way that I can extract results1[] and results2[] in the same call to avoid reading the file twice?
I have used the --stream option, but it is very slow. I have also read that it sacrifices speed for memory savings; memory is not an issue at this time, so I would prefer to avoid that option. Basically, what I need is to read the above JSON once and output two files in JSON Lines format.
Edit: (I changed the input data a bit to show the differences in the output)
Output file 1:
{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
Output file 2:
{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}
It seems pretty well known that the streaming option is slow. See the discussion here.
My attempt at implementing it followed the answer here.
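For reference, a streamed extraction of results1 along those lines looks roughly like this (a sketch of the usual fromstream/truncate_stream recipe, not necessarily the exact filter I used):

jq -nc --stream '
  fromstream(2 | truncate_stream(inputs | select(.[0][0] == "results1")))
' my_json.json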
2 Answers
jq doesn't have any file I/O facilities, so you can't write multiple output files from it directly.
You can, however, output each piece of data with its key and post-process it:
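One way is to tag each element with the name of the array it came from, something along these lines (a sketch; the file and item field names are placeholders I chose, not anything jq requires):

jq -c '{file: "results1", item: .results1[]},
       {file: "results2", item: .results2[]}' my_json.json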
outputs
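something like this, one tagged object per line, using the placeholder field names from the sketch above:

{"file":"results1","item":{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}}
{"file":"results1","item":{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}}
{"file":"results2","item":{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}}
{"file":"results2","item":{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}}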
So, to split that tagged stream into the two output files:
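A post-processing loop along these lines should work (a sketch; the output file names are my own choice, and the files are appended to, so they should not already exist):

jq -c '{file: "results1", item: .results1[]},
       {file: "results2", item: .results2[]}' my_json.json |
while IFS= read -r line; do
    # Pull out the tag and the payload, then append the payload
    # to the file named after the tag.
    key=$(jq -r '.file' <<< "$line")
    jq -c '.item' <<< "$line" >> "output_${key}.jsonl"
done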
or, to avoid running jq once for every extracted line:
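A single-pass variant (also a sketch; the "key, space, JSON" line format is my own convention) has jq print the key and the compact JSON on one line and lets the shell split them:

jq -r '(.results1[] | "results1 \(tojson)"),
       (.results2[] | "results2 \(tojson)")' my_json.json |
while read -r key json; do
    # read splits on the first space: key is the tag, json is the rest.
    printf '%s\n' "$json" >> "output_${key}.jsonl"
done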
With Bash ≥ 4, processing bigger chunks could be improved by reading n lines at once using mapfile:
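A sketch of that, building on the key-and-JSON lines from the previous variant (the chunk size of 1000 is arbitrary):

jq -r '(.results1[] | "results1 \(tojson)"),
       (.results2[] | "results2 \(tojson)")' my_json.json |
while mapfile -t -n 1000 chunk && ((${#chunk[@]})); do
    # Process up to 1000 lines per iteration to cut down on loop overhead.
    for line in "${chunk[@]}"; do
        printf '%s\n' "${line#* }" >> "output_${line%% *}.jsonl"
    done
done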