
I have been using jq to successfully extract one JSON blob at a time from some relatively large files and write it out to a file of one JSON object per line for further processing. Here is an example of the JSON format:

{
  "date": "2023-07-30",
  "results1":[
    {
      "data": [    
        {"row": [{"key1": "row1", "key2": "row1"}]},
        {"row": [{"key1": "row2", "key2": "row2"}]}
      ]
    },
    {
      "data": [    
        {"row": [{"key1": "row3", "key2": "row3"}]},
        {"row": [{"key1": "row4", "key2": "row4"}]}
      ]
    }
  ],
  "results2":[
    {
      "data": [    
        {"row": [{"key3": "row1", "key4": "row1"}]},
        {"row": [{"key3": "row2", "key4": "row2"}]}
      ]
    },
    {
      "data": [    
        {"row": [{"key3": "row3", "key4": "row3"}]},
        {"row": [{"key3": "row4", "key4": "row4"}]}
      ]
    }
  ]
}

My current approach is to run the following and redirect its stdout to a file:

jq -rc ".results1[]" my_json.json

This works fine; however, it seems that jq reads the entire file into memory in order to extract the chunk I am interested in.

Questions:

  1. Does jq read the entire file into memory when I execute the above statement?
  2. Assuming the answer is yes, is there a way that I can extract results1[] and results2[] in the same call to avoid reading the file twice? (A rough sketch of what I mean follows this list.)
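
For example, a single call along these lines (just a sketch) pulls both arrays in one pass, but the output lines no longer indicate which array they came from, so they cannot be split into two files afterwards:

jq -rc ".results1[], .results2[]" my_json.json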

I have used the --stream option but it is very slow. I also read that it sacrifices speed for memory savings, but memory is not an issue at this time, so I would prefer to avoid that option. Basically, what I need is to read the above JSON once and output two files in JSON Lines format.

Edit: (I changed the input data a bit to show the differences in the output)

Output file 1:

{"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
{"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}

Output file 2:

{"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
{"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}

It seems pretty well known that the streaming option is slow. See the discussion here.

My attempt at implementing it followed the answer here.
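
For reference, my streaming attempt looked roughly like this (reconstructed from memory, so the exact filter in the linked answer may differ, and the truncation depth of 2 is specific to this layout):

jq -cn --stream 'fromstream(2 | truncate_stream(inputs | select(.[0][0] == "results1")))' my_json.json

with a second, equivalent pass for results2, which is exactly the double read I would like to avoid.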

2 Answers


  1. jq doesn’t have any file IO facilities, so you can’t output multiple files from a single invocation.

    You can output each piece of data with its key and post-process it:

    jq -r '
        to_entries[]
        | select(.key != "date")
        | .key as $k
        | .value[]
        | [$k, @json]
        | @tsv
    ' my_json.json
    

    outputs

    results1    {"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
    results1    {"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
    results2    {"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
    results2    {"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}
    

    So:

    while IFS=$'\t' read -r key json; do
        printf '%s\n' "$json" >> "${key}.jsonl"
    done < <(
        jq -r '...' my_json.json
    )
    

    or

    jq -r '...' my_json.json | awk -F '\t' '{print $2 > ($1 ".jsonl")}'
    
  2. With Bash ≥ 4, you can process bigger chunks at a time by reading n lines at once using mapfile:

    jq -cr '$ARGS.positional[] as $key | .[$key] | $key, length, .[]' input.json \
      --args results1 results2 | while read -r key; read -r len
    do mapfile -t -n "$len"
      printf '%s\n' "${MAPFILE[@]}" > "$key.jsonl"
    done
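
    To see what the loop is consuming: the jq command on its own prints, for each requested key, the key name, the number of elements, and then the elements themselves. For the sample input in the question its output should look like this:

    results1
    2
    {"data":[{"row":[{"key1":"row1","key2":"row1"}]},{"row":[{"key1":"row2","key2":"row2"}]}]}
    {"data":[{"row":[{"key1":"row3","key2":"row3"}]},{"row":[{"key1":"row4","key2":"row4"}]}]}
    results2
    2
    {"data":[{"row":[{"key3":"row1","key4":"row1"}]},{"row":[{"key3":"row2","key4":"row2"}]}]}
    {"data":[{"row":[{"key3":"row3","key4":"row3"}]},{"row":[{"key3":"row4","key4":"row4"}]}]}

    The two read calls consume the key and the count, and mapfile then slurps exactly that many JSON lines, which are written to the key's file in one go.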
    