Javascript - Removing empty values, casting strings to bools and trimming whitespace recursively in JSON with jq - how to optimize?

ChristianM
June 5, 2023
123 views
2 votes
3 Answers

As part of data preprocessing, I need to recursively remove all empty values from an input JSON document: empty arrays [], empty objects {}, empty or whitespace-only strings ("", " ", "\t"), and object entries with empty keys such as {"":5}. I also need to trim leading and trailing whitespace from all strings, including object keys. I built a solution using jq 1.6 and a custom walk() function. Since I am fairly new to advanced jq, I was wondering whether I could improve the performance of my query. Memory is not the problem; I would like it to be less CPU-intensive (so I am not considering jq's streaming mode). Currently I run it through executeScript on a 10-node cluster with 4 CPUs and 16 GB RAM per node, and CPU is the bottleneck; memory usage is only at around 60%.

jq 'walk(
  if type == "string" then
    (sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "") | if . == "true" then . |= true else . end | if . == "false" then . |= false else . end)
  elif type == "object" then
    with_entries(select(.value | IN("",null, [], {}) | not) | .key |= sub("^[[:space:]]+"; "") | .key |= sub("[[:space:]]+$"; "") |select(.key | IN("") | not ))
  elif type == "array" then
      map(select(. | IN("",null, [], {}) | not))
  else . end)'

That is what I have now. I also cast "true" to boolean true and "false" to boolean false. Are there any obvious query improvements?

I tried doing the whole thing in plain JavaScript or Groovy, but I did not feel like reinventing the wheel when recursing through nested JSON objects is already handled so gracefully by jq. I am open to a JavaScript or Groovy implementation if the jq query cannot be improved substantially.

Answers

Chosen as BEST ANSWER

I ended up writing a small Rust binary that does the aforementioned thing:

use std::io::Read;
use serde_json::{Value, Map};

fn clean_value(val: &Value) -> Option<Value> {
    match val {
        Value::Null => None,
        Value::String(s) => {
            let trimmed = s.trim().to_owned();
            match trimmed.to_lowercase().as_str() {
                "true" => Some(Value::Bool(true)),
                "false" => Some(Value::Bool(false)),
                _ => if trimmed.is_empty() { None } else { Some(Value::String(trimmed)) },
            }
        },
        Value::Array(arr) => {
            let cleaned: Vec<Value> = arr.iter()
                .filter_map(clean_value)
                .collect();
            if cleaned.is_empty() { None } else { Some(Value::Array(cleaned)) }
        },
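        Value::Object(map) => {
            // Trim keys, drop entries whose key is empty after trimming,
            // and drop entries whose value cleans to nothing.
            let cleaned: Map<String, Value> = map.iter()
                .filter_map(|(k, v)| {
                    let key = k.trim();
                    if key.is_empty() { return None; }
                    clean_value(v).map(|v| (key.to_owned(), v))
                })
                .collect();
            if cleaned.is_empty() { None } else { Some(Value::Object(cleaned)) }
        },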
        _ => Some(val.clone()),
    }
}

fn clean_json(json: &str) -> Result<String, serde_json::Error> {
    let value: Value = serde_json::from_str(json)?;
    let cleaned = clean_value(&value);
    match cleaned {
        Some(v) => Ok(serde_json::to_string(&v)?),
        None => Ok(String::new()),
    }
}

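fn main() {
    // Read the whole document from stdin; the cleaned JSON goes to stdout.
    let mut buffer = String::new();
    std::io::stdin().read_to_string(&mut buffer).unwrap();
    match clean_json(&buffer) {
        Ok(json) => println!("{}", json),
        Err(e) => {
            eprintln!("Error cleaning json: {}", e);
            // Exit non-zero so callers can detect the failure.
            std::process::exit(1);
        }
    }
}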

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let input = r#"
        {
            "  key  ": "  true  ",
            "  empty array  ": [],
            "  empty object  ": {},
            "  empty string  ": "",
            "  null  ": null,
            "  nested  ": {
                "  key  ": "  false  ",
                "  empty array  ": [],
                "  empty object  ": {},
                "  empty string  ": "",
                "  null  ": null
            }
        }
        "#;
        let expected = r#"{"key":true,"nested":{"key":false}}"#;
        let cleaned = clean_json(input).unwrap();
        assert_eq!(cleaned, expected);
    }
}

It is significantly faster, roughly 16x in wall-clock time on a small test file:

time ./clean-json < ~/test_small.json
real    0m0.022s
user    0m0.021s
sys     0m0.001s

as opposed to the improved jq:

real    0m0.365s
user    0m0.336s
sys     0m0.031s
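
One behavioral difference from the jq filter is worth noting: the Rust code lowercases the trimmed string before comparing, so "TRUE" or "False" also become booleans, whereas the jq filter only matches the exact strings "true" and "false". If the stricter jq behavior is wanted, the string handling could look roughly like the sketch below. The names `Scalar` and `normalize_str` are made up for illustration and are not part of serde_json; the sketch uses only the standard library.

```rust
/// Result of normalizing one JSON string (hypothetical helper type).
#[derive(Debug, PartialEq)]
enum Scalar {
    Bool(bool),
    Text(String),
}

/// Trim the string, drop it if nothing is left, and cast the exact
/// strings "true"/"false" to booleans. The comparison is case-sensitive,
/// mirroring `. == "true"` in the jq filter.
fn normalize_str(s: &str) -> Option<Scalar> {
    match s.trim() {
        "" => None,
        "true" => Some(Scalar::Bool(true)),
        "false" => Some(Scalar::Bool(false)),
        // Only strings that are kept as strings need an allocation.
        other => Some(Scalar::Text(other.to_owned())),
    }
}
```

Matching on the trimmed `&str` directly also avoids allocating a lowercased copy for every string, which may matter a little on string-heavy input.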


- peak
- June 2, 2023 at 12:53 am
- 0 votes
How to optimize? Since you don’t seem to have shown the customized version of walk you’re using, it’s hard to say, but here is a more efficient version than the one in builtins.jq:
```
def walk(f):
  def w:
    if type == "object"
    then . as $in
    | reduce keys_unsorted[] as $key
        ( {}; . + { ($key):  ($in[$key] | w) } ) | f
    elif type == "array" then map( w ) | f
    else f
    end;
  w;
```
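
For readers more at home in the Rust of the accepted answer, the traversal order of this walk can be sketched as follows: children are rebuilt first, then f is applied to the rebuilt node. This is only an illustration of the bottom-up order, not a port of jq internals. The `Json` enum is a hypothetical stand-in (not serde_json's `Value`), and objects are stored as a list of pairs so that key order is kept, similar in spirit to `keys_unsorted`.

```rust
/// Hypothetical stand-in for a JSON value, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// Bottom-up walk: rebuild the children first, then apply `f`
/// to the rebuilt node. Scalars are passed to `f` directly.
fn walk(v: Json, f: &dyn Fn(Json) -> Json) -> Json {
    let rebuilt = match v {
        Json::Arr(items) => {
            Json::Arr(items.into_iter().map(|x| walk(x, f)).collect())
        }
        Json::Obj(entries) => {
            Json::Obj(entries.into_iter().map(|(k, x)| (k, walk(x, f))).collect())
        }
        scalar => scalar,
    };
    f(rebuilt)
}
```

Because f runs after the children have been processed, a container that becomes empty once its children are cleaned can be detected at the parent level, which is what the filter in the question relies on.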

- peak
- June 4, 2023 at 10:12 pm
- 0 votes
Here are two variants of walk that might be of interest if the main targets of transformation are scalars and/or keys.

Note that scalar_walk should be very fast, whereas with_entries will tend to make atomic_walk relatively slow when processing JSON objects.
```
# Apply f to keys and scalars only
# To speed things up, do not apply f to objects or arrays themselves
def atomic_walk(f):
  def w:
    if type == "object"
    then with_entries( .key |= f | .value |= w)
    elif type == "array" then map( w )
    else f
    end;
  w;
```
```
# Apply f to scalars (excluding keys) only
def scalar_walk(f):
  def w:
    if type == "object"
    then map_values(w)
    elif type == "array" then map( w )
    else f
    end;
  w;
```
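
The same idea as scalar_walk, expressed in Rust for comparison with the accepted answer, might look like the sketch below: f is applied to scalars only, and the tree is mutated in place rather than rebuilt. The `Json` enum is again a hypothetical stand-in, not serde_json's `Value`. As with the jq version, keys are not touched and nothing is removed, so on its own this covers trimming and boolean casting but not the removal of empty values.

```rust
/// Hypothetical stand-in for a JSON value, for illustration only.
#[derive(Debug, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>),
}

/// Apply `f` to scalars only, in place. Arrays and objects are
/// traversed but never passed to `f`; object keys are left as they are.
fn scalar_walk(v: &mut Json, f: &dyn Fn(&mut Json)) {
    match v {
        Json::Arr(items) => {
            for item in items.iter_mut() {
                scalar_walk(item, f);
            }
        }
        Json::Obj(entries) => {
            for (_, value) in entries.iter_mut() {
                scalar_walk(value, f);
            }
        }
        // Null, Bool, Num and Str all end up here.
        _ => f(v),
    }
}
```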