skip to Main Content

I’ve got the following (100GB+) JSON file which includes the following top level key:

{
   "data": {
      "<userId1>": {
         "things": {
            "<thingId1>": {
               "subfield1": "blah1",
               "subfield3": "foobar",
            },
            "<thingId2>": {
               "subfield2": "blah2"
            }
         }
      },
      "<userId2>": {
         "things": {
            "<thingId1>": {
               "subfield4": "blah3"
            },
            "<thingId2>": {
               "subfield3": "blah4"
            }
         }
      }
   }
}

I’d like each things object outputted with its path on a new line (for later newline delimited JSON usage down the pipeline), e.g. with the above it would output:

{ "path": "data/<userId1>/things/<thingId1>", "value": { "subfield1": "blah1", "subfield3": "foobar" } }
{ "path": "data/<userId1>/things/<thingId2>", "value": { "subfield2": "blah2" } }
{ "path": "data/<userId2>/things/<thingId1>", "value": { "subfield4": "blah3" } }
{ "path": "data/<userId2>/things/<thingId2>", "value": { "subfield3": "blah4" } }

The separators in path can be . if needed but would be most useful if it was an actual string array of keys.

It also needs to be a bit generic as I’d also like the simpler case of:

{
   "users": {
      "<userId1>": {
         "name": "user1",
         "email": "[email protected]"
      },
      "<userId2>": {
         "name": "user2"
      },
      "<userId3>": {
         "email": "[email protected]"
      }
   }
}

To produce the output of:

{ "path": "users/<userId1>", "value": { "name": "user1", "email": "[email protected]" } }
{ "path": "users/<userId2>", "value": { "name": "user2" } }
{ "path": "users/<userId3>", "value": { "email": "[email protected]" } }

For the simpler case, I’ve managed to get JQ to output the key of each object with:

echo '{"users":{"<userId1>":{"name":"user1"},"<userId2>":{"name":"user2"}}}' | jq -rn --stream 'fromstream(1 | truncate_stream(inputs | select(.[0][0] == "users"))) | to_entries | map("(. | tojson)") | join("n")'

And while that works, I can’t quite work out how to adapt it to output paths for the nested objects in the more complex example.

Also JQ-wise I don’t know if piping the output of fromstream to more functions is realistic memory wise? We’re talking about 10s of millions of users/data sub objects and possibly a similar amount of things sub objects.

2

Answers


  1. Since your input is very large and monolithic, I would expect your best option will be to use jq’s "streaming parser", though it might seem quite slow. Fortunately, in your case, it’s quite simple, or at least short:

    < input.json jq --stream -c '
     select(length==2 and .[0][-1] == "name")
     | {path: (.[0]|join("/")), name: .[1]}
    '
    
    Login or Signup to reply.
  2. If the solution using jq’s "streaming parser" is too slow, you might might like to combine using "JSON machine" with jq. Using the jm interface to JSON machine:

    jm --pointer=/data < input.json | jq -c '
      tostream
      | select(length==2 and .[0][-1] == "name")
      | {path: ( ["data"] + .[0]|join("/")), name: .[1]}
    '
    

    See https://github.com/pkoppstein/jm for further details, and a Python-based interface to "JSON Machine".

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search