
As part of data preprocessing, I need to recursively remove all empty values from an input JSON document: empty arrays [], empty objects {}, empty or whitespace-only strings ""/" "/"\t", and object entries with empty keys such as {"":5}. I also need to trim leading and trailing whitespace from all strings, including object keys. I built a solution using jq 1.6 and a custom walk() function. Since I am quite new to advanced jq, I was wondering whether I could improve the performance of my query. Memory is not the problem; I would like it to be less CPU intensive (so I am not considering jq's streaming mode). Currently I run it through executeScript on a 10-node cluster with 4 CPUs and 16 GB RAM per node, and it mostly hits the CPU the hardest. Memory usage is only around 60%.

jq 'walk(
  if type == "string" then
    (sub("^[[:space:]]+"; "") | sub("[[:space:]]+$"; "") | if . == "true" then . |= true else . end | if . == "false" then . |= false else . end)
  elif type == "object" then
    with_entries(select(.value | IN("",null, [], {}) | not) | .key |= sub("^[[:space:]]+"; "") | .key |= sub("[[:space:]]+$"; "") |select(.key | IN("") | not ))
  elif type == "array" then
      map(select(. | IN("",null, [], {}) | not))
  else . end)'

That is what I have now. I also cast "true" to boolean true and "false" to boolean false. Are there any obvious query improvements?

I tried doing the whole thing in plain JavaScript or Groovy, but I did not feel like reinventing the wheel when recursing through nested JSON objects is already handled so gracefully by jq. I am open to a JavaScript or Groovy implementation if the jq query cannot be improved substantially.

3 Answers


  1. Chosen as BEST ANSWER

    I ended up writing a small Rust binary that does the aforementioned cleaning:

    use std::io::Read;
    use serde_json::{Value, Map};
    
    fn clean_value(val: &Value) -> Option<Value> {
        match val {
            Value::Null => None,
            Value::String(s) => {
                let trimmed = s.trim().to_owned();
                match trimmed.to_lowercase().as_str() {
                    "true" => Some(Value::Bool(true)),
                    "false" => Some(Value::Bool(false)),
                    _ => if trimmed.is_empty() { None } else { Some(Value::String(trimmed)) },
                }
            },
            Value::Array(arr) => {
                let cleaned: Vec<Value> = arr.iter()
                    .filter_map(clean_value)
                    .collect();
                if cleaned.is_empty() { None } else { Some(Value::Array(cleaned)) }
            },
            Value::Object(map) => {
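                let cleaned: Map<String, Value> = map.iter()
                    // Drop entries whose key is empty after trimming, e.g. {"": 5}.
                    .filter(|(k, _)| !k.trim().is_empty())
                    .filter_map(|(k, v)| clean_value(v).map(|v| (k.trim().to_owned(), v)))
                    .collect();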
                if cleaned.is_empty() { None } else { Some(Value::Object(cleaned)) }
            },
            _ => Some(val.clone()),
        }
    }
    
    fn clean_json(json: &str) -> Result<String, serde_json::Error> {
        let value: Value = serde_json::from_str(json)?;
        let cleaned = clean_value(&value);
        match cleaned {
            Some(v) => Ok(serde_json::to_string(&v)?),
            None => Ok(String::new()),
        }
    }
    
    fn main() {
        let mut buffer = String::new();
        std::io::stdin().read_to_string(&mut buffer).unwrap();
        match clean_json(&buffer) {
            Ok(json) => println!("{}", json),
            Err(e) => eprintln!("Error cleaning json: {}", e),
        }
    }
    
    #[cfg(test)]
    mod tests {
        use super::*;
    
        #[test]
        fn it_works() {
            let input = r#"
            {
                "  key  ": "  true  ",
                "  empty array  ": [],
                "  empty object  ": {},
                "  empty string  ": "",
                "  null  ": null,
                "  nested  ": {
                    "  key  ": "  false  ",
                    "  empty array  ": [],
                    "  empty object  ": {},
                    "  empty string  ": "",
                    "  null  ": null
                }
            }
            "#;
            let expected = r#"{"key":true,"nested":{"key":false}}"#;
            let cleaned = clean_json(input).unwrap();
            assert_eq!(cleaned, expected);
        }
    }
    

    It is significantly faster:

    time ./clean-json < ~/test_small.json
    real    0m0.022s
    user    0m0.021s
    sys     0m0.001s
    

    as opposed to the improved jq:

    real    0m0.365s
    user    0m0.336s
    sys     0m0.031s
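
For readers who want to experiment without pulling in serde_json, the same bottom-up recursion can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `Json` enum and the `clean` function are hypothetical names, not part of any crate. The sketch spells out the empty-key rule from the question and matches "true"/"false" exactly, as the jq query does.

```rust
use std::collections::BTreeMap;

// Hypothetical minimal JSON model, used instead of serde_json::Value
// so that the sketch compiles with the standard library alone.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(BTreeMap<String, Json>),
}

// Returns None when the value is "empty" and should be dropped by its parent.
fn clean(v: &Json) -> Option<Json> {
    match v {
        Json::Null => None,
        Json::Str(s) => match s.trim() {
            "" => None,
            "true" => Some(Json::Bool(true)),
            "false" => Some(Json::Bool(false)),
            t => Some(Json::Str(t.to_owned())),
        },
        Json::Arr(items) => {
            // Children are cleaned first, so emptied containers disappear too.
            let kept: Vec<Json> = items.iter().filter_map(clean).collect();
            if kept.is_empty() { None } else { Some(Json::Arr(kept)) }
        }
        Json::Obj(map) => {
            let kept: BTreeMap<String, Json> = map
                .iter()
                .filter_map(|(k, v)| {
                    let key = k.trim();
                    if key.is_empty() {
                        return None; // drops {"": ...} and {"  ": ...}
                    }
                    clean(v).map(|v| (key.to_owned(), v))
                })
                .collect();
            if kept.is_empty() { None } else { Some(Json::Obj(kept)) }
        }
        // Numbers and booleans pass through unchanged.
        other => Some(other.clone()),
    }
}
```

Because every branch returns an `Option`, the parent decides whether a child survives, which mirrors how the jq filter removes values that became empty after their own cleanup.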
    

  2. How to optimize? Since you don’t seem to have shown the customized version of walk you’re using, it’s hard to say, but here is a more efficient version than the one in builtins.jq:

    def walk(f):
      def w:
        if type == "object"
        then . as $in
        | reduce keys_unsorted[] as $key
            ( {}; . + { ($key):  ($in[$key] | w) } ) | f
        elif type == "array" then map( w ) | f
        else f
        end;
      w;
    
  3. Here are two variants of walk that might be of interest
    if the main targets of transformation are scalars and/or keys.

    Note that scalar_walk should be very fast,
    whereas with_entries will tend to make atomic_walk relatively slow
    when processing JSON objects.

    # Apply f to keys and scalars only
    # To speed things up, do not apply f to objects or arrays themselves
    def atomic_walk(f):
      def w:
        if type == "object"
        then with_entries( .key |= f | .value |= w)
        elif type == "array" then map( w )
        else f
        end;
      w;
    
    # Apply f to scalars (excluding keys) only
    def scalar_walk(f):
      def w:
        if type == "object"
        then map_values(w)
        elif type == "array" then map( w )
        else f
        end;
      w;
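
If the Rust route from the accepted answer is taken instead, the idea behind scalar_walk (touch only scalars, never rebuild containers) might look roughly like the sketch below. The `Json` enum, `scalar_walk`, and `trim_and_cast` are hypothetical names introduced for illustration; the sketch mutates in place and, like the jq scalar_walk, leaves object keys untouched and does not remove empty values.

```rust
// Hypothetical minimal JSON model (not serde_json), standard library only.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    Arr(Vec<Json>),
    Obj(Vec<(String, Json)>), // key/value pairs in insertion order
}

// Apply `f` to scalars only, mutating in place. Containers are traversed
// but never rebuilt, and object keys are left as they are.
fn scalar_walk(v: &mut Json, f: &impl Fn(&mut Json)) {
    match v {
        Json::Arr(items) => items.iter_mut().for_each(|x| scalar_walk(x, f)),
        Json::Obj(pairs) => pairs.iter_mut().for_each(|(_, x)| scalar_walk(x, f)),
        _ => f(v),
    }
}

// Example scalar transform: trim strings and cast "true"/"false" to booleans.
fn trim_and_cast(v: &mut Json) {
    if let Json::Str(s) = v {
        let cast = match s.trim() {
            "true" => Json::Bool(true),
            "false" => Json::Bool(false),
            t => Json::Str(t.to_owned()),
        };
        *v = cast;
    }
}
```

Calling `scalar_walk(&mut doc, &trim_and_cast)` rewrites the string leaves of `doc` without allocating new arrays or objects; removing empty values would still need a separate pass.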
    