As part of data preprocessing, I need to remove all empty values from an input JSON document: empty arrays [], empty objects {}, empty strings (""/" "/"\t"), and objects with empty keys ({"":5}), and I need to do that recursively. I also need to trim the whitespace from all strings, including strings that are object keys. I built a solution using jq 1.6 and a custom walk() function. Since I am quite new to advanced jq, I was wondering whether I could improve the performance of my query somehow. Memory is not the problem; I would like it to be less CPU intensive (so I am not considering jq stream). Currently I run it through executeScript on a 10-node, 4-CPU cluster with 16 GB RAM each, and it is hitting the CPU the hardest. Memory is only at around 60%.
jq 'walk(
  if type == "string" then
    (sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "") | if . == "true" then . |= true else . end | if . == "false" then . |= false else . end)
  elif type == "object" then
    with_entries(select(.value | IN("",null, [], {}) | not) | .key |= sub("^[[:space:]]+"; "") | .key |= sub("[[:space:]]+$"; "") | select(.key | IN("") | not))
  elif type == "array" then
    map(select(. | IN("",null, [], {}) | not))
  else . end)'
That is what I have now. I also cast "true" to boolean true and "false" to boolean false. Are there any obvious query improvements?
I tried doing the whole thing in plain JavaScript or Groovy, but I did not feel like reinventing the wheel when recursing through nested JSON objects is already handled so gracefully by jq. I am open to a JavaScript or Groovy implementation if the jq query cannot be improved substantially.
3 Answers
I did end up writing a small Rust binary that does the aforementioned thing:
It is significantly faster:
as opposed to the improved jq:
How to optimize? Since you don’t seem to have shown the customized version of walk you’re using, it’s hard to say, but here is a more efficient version than the one in builtins.jq:
Here are two variants of walk that might be of interest if the main targets of transformation are scalars and/or keys.
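The definitions referred to above are not reproduced in this text. As a minimal sketch of what such functions might look like, the block below defines a walk with an inner recursive helper, plus scalar_walk and atomic_walk. The function bodies, the helper trim_str, and the shell wrapper trim_with are assumptions made for illustration, not the answerer's verbatim code.

```sh
# Sketch only: these jq definitions are assumed, not quoted from the answer.
defs='
# walk: recursion goes through the inner helper w; f is applied to every node
def walk(f):
  def w:
    if type == "object" then map_values(w)
    elif type == "array" then map(w)
    else . end
    | f;
  w;

# scalar_walk: f is applied to scalars only, containers are just traversed
def scalar_walk(f):
  def w:
    if type == "object" then map_values(w)
    elif type == "array" then map(w)
    else f end;
  w;

# atomic_walk: f is applied to scalars and to object keys
def atomic_walk(f):
  def w:
    if type == "object" then with_entries(.key |= f | .value |= w)
    elif type == "array" then map(w)
    else f end;
  w;

# trim_str (hypothetical helper): strip leading and trailing whitespace from strings
def trim_str:
  if type == "string"
  then sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "")
  else . end;
'

# trim_with (hypothetical wrapper): run one expression with the definitions loaded
trim_with() { jq -c "$defs $1"; }

printf '%s' '{" a ": [" x ", 1]}' | trim_with 'scalar_walk(trim_str)'
# prints {" a ":["x",1]}
printf '%s' '{" a ": [" x ", 1]}' | trim_with 'atomic_walk(trim_str)'
# prints {"a":["x",1]}
```

In this sketch scalar_walk leaves keys untouched, while atomic_walk also passes each key through f, which is why only the second call trims " a ".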
Note that scalar_walk should be very fast, whereas with_entries will tend to make atomic_walk relatively slow when processing JSON objects.
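The remark about with_entries suggests one possible direction for the cleanup described in the question: do the trimming, casting, and pruning in a single recursive function that rebuilds objects with reduce instead of with_entries. The following is a sketch under that assumption. The names clean, trim_str, is_empty, and clean_json are hypothetical, and whether this is actually faster on a given workload is not measured here and would need benchmarking.

```sh
# Sketch only: a single-pass variant of the cleanup, assuming jq 1.6 as in the question.
clean_filter='
def clean:
  def trim_str: sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "");
  def is_empty: . == "" or . == null or . == [] or . == {};
  if type == "string" then
    # trim, then cast the strings "true" and "false" to booleans
    trim_str
    | if . == "true" then true elif . == "false" then false else . end
  elif type == "array" then
    # clean each element first, then drop the ones that ended up empty
    map(clean | select(is_empty | not))
  elif type == "object" then
    . as $in
    | reduce keys_unsorted[] as $k ({};
        ($k | trim_str) as $key
        | ($in[$k] | clean) as $v
        # skip entries whose trimmed key or cleaned value is empty
        | if $key == "" or ($v | is_empty) then . else .[$key] = $v end)
  else . end;
clean
'

# -S sorts keys on output so the result is easy to compare
clean_json() { jq -cS "$clean_filter"; }

printf '%s' '{" name ": "  Alice ", "": 5, "tags": [], "flag": " true ", "nested": {"empty": "", "list": ["", " x "]}}' | clean_json
# prints {"flag":true,"name":"Alice","nested":{"list":["x"]}}
```

If two keys become identical after trimming (for example " a" and "a"), the later one overwrites the earlier one in this sketch.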