Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Include path of object when streaming out JSON from JQ

Rich
August 15, 2023
288 views
0 votes
2 Answers

I’ve got the following (100GB+) JSON file which includes the following top level key:

{
   "data": {
      "<userId1>": {
         "things": {
            "<thingId1>": {
               "subfield1": "blah1",
               "subfield3": "foobar",
            },
            "<thingId2>": {
               "subfield2": "blah2"
            }
         }
      },
      "<userId2>": {
         "things": {
            "<thingId1>": {
               "subfield4": "blah3"
            },
            "<thingId2>": {
               "subfield3": "blah4"
            }
         }
      }
   }
}

I’d like each things object outputted with its path on a new line (for later newline delimited JSON usage down the pipeline), e.g. with the above it would output:

{ "path": "data/<userId1>/things/<thingId1>", "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": "data/<userId1>/things/<thingId2>", "value": { "subfield2": "blah2" } }
{ "path": "data/<userId2>/things/<thingId1>", "value": { "subfield4": "blah3" } }
{ "path": "data/<userId2>/things/<thingId2>", "value": { "subfield3": "blah4" } }

The separators in path can be . if needed but would be most useful if it was an actual string array of keys.

It also needs to be a bit generic as I’d also like the simpler case of:

{
   "users": {
      "<userId1>": {
         "name": "user1",
         "email": "[email protected]"
      },
      "<userId2>": {
         "name": "user2"
      },
      "<userId3>": {
         "email": "[email protected]"
      }
   }
}

To produce the output of:

{ "path": "users/<userId1>", "value": { "name": "user1", "email": "[email protected]" } }
{ "path": "users/<userId2>", "value": { "name": "user2" } }
{ "path": "users/<userId3>", "value": { "email": "[email protected]" } }

For the simpler case, I’ve managed to get JQ to output the key of each object with:

echo '{"users":{"<userId1>":{"name":"user1"},"<userId2>":{"name":"user2"}}}' | jq -rn --stream 'fromstream(1 | truncate_stream(inputs | select(.[0][0] == "users"))) | to_entries | map("(. | tojson)") | join("n")'

And while that works, I can’t quite work out how to adapt it to output paths for the nested objects in the more complex example.

Also JQ-wise I don’t know if piping the output of fromstream to more functions is realistic memory wise? We’re talking about 10s of millions of users/data sub objects and possibly a similar amount of things sub objects.

Tags: jq json

Answers

- peak
- August 15, 2023 at 3:30 pm
- 0 votes
0
Since your input is very large and monolithic, I would expect your best option will be to use jq’s "streaming parser", though it might seem quite slow. Fortunately, in your case, it’s quite simple, or at least short:
```
< input.json jq --stream -c '
 select(length==2 and .[0][-1] == "name")
 | {path: (.[0]|join("/")), name: .[1]}
'
```
Login or Signup to reply.

- peak
- August 15, 2023 at 3:42 pm
- 0 votes
0
If the solution using jq’s "streaming parser" is too slow, you might might like to combine using "JSON machine" with jq. Using the jm interface to JSON machine:
```
jm --pointer=/data < input.json | jq -c '
  tostream
  | select(length==2 and .[0][-1] == "name")
  | {path: ( ["data"] + .[0]|join("/")), name: .[1]}
'
```
See https://github.com/pkoppstein/jm for further details, and a Python-based interface to "JSON Machine".
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.