I’ve got the following (100GB+) JSON file which includes the following top level key:
{
"data": {
"<userId1>": {
"things": {
"<thingId1>": {
"subfield1": "blah1",
"subfield3": "foobar",
},
"<thingId2>": {
"subfield2": "blah2"
}
}
},
"<userId2>": {
"things": {
"<thingId1>": {
"subfield4": "blah3"
},
"<thingId2>": {
"subfield3": "blah4"
}
}
}
}
}
I’d like each things
object outputted with its path on a new line (for later newline delimited JSON usage down the pipeline), e.g. with the above it would output:
{ "path": "data/<userId1>/things/<thingId1>", "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": "data/<userId1>/things/<thingId2>", "value": { "subfield2": "blah2" } }
{ "path": "data/<userId2>/things/<thingId1>", "value": { "subfield4": "blah3" } }
{ "path": "data/<userId2>/things/<thingId2>", "value": { "subfield3": "blah4" } }
The separators in path
can be .
if needed but would be most useful if it was an actual string array of keys.
It also needs to be a bit generic as I’d also like the simpler case of:
{
"users": {
"<userId1>": {
"name": "user1",
"email": "[email protected]"
},
"<userId2>": {
"name": "user2"
},
"<userId3>": {
"email": "[email protected]"
}
}
}
To produce the output of:
{ "path": "users/<userId1>", "value": { "name": "user1", "email": "[email protected]" } }
{ "path": "users/<userId2>", "value": { "name": "user2" } }
{ "path": "users/<userId3>", "value": { "email": "[email protected]" } }
For the simpler case, I’ve managed to get JQ to output the key
of each object with:
echo '{"users":{"<userId1>":{"name":"user1"},"<userId2>":{"name":"user2"}}}' | jq -rn --stream 'fromstream(1 | truncate_stream(inputs | select(.[0][0] == "users"))) | to_entries | map("(. | tojson)") | join("n")'
And while that works, I can’t quite work out how to adapt it to output paths for the nested objects in the more complex example.
Also JQ-wise I don’t know if piping the output of fromstream
to more functions is realistic memory wise? We’re talking about 10s of millions of users
/data
sub objects and possibly a similar amount of things
sub objects.
2
Answers
Since your input is very large and monolithic, I would expect your best option will be to use jq’s "streaming parser", though it might seem quite slow. Fortunately, in your case, it’s quite simple, or at least short:
If the solution using jq’s "streaming parser" is too slow, you might might like to combine using "JSON machine" with jq. Using the
jm
interface to JSON machine:See https://github.com/pkoppstein/jm for further details, and a Python-based interface to "JSON Machine".