I have some huge JSON files I need to profile so I can transform them into some tables. I found jq to be really useful in inspecting them, but there are going to be hundreds of these, and I’m pretty new to jq.

I already have some really handy functions in my ~/.jq (big thank you to @mikehwang)

def profile_object:
    to_entries | def parse_entry: {"key": .key, "value": .value | type}; map(parse_entry)
        | sort_by(.key) | from_entries;

def profile_array_objects:
    map(profile_object) | map(to_entries) | reduce .[] as $item ([]; . + $item) | sort_by(.key) | from_entries;
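
For reference, here is a small run showing what these two helpers return. The input values are made up for illustration and are not taken from the real files. Both helpers stop at the first level and report nested values only as "array" or "object", which is the gap the rest of this question is about.

```shell
# The two helpers from ~/.jq, inlined so the snippet runs on its own.
defs='
def profile_object:
    to_entries | def parse_entry: {"key": .key, "value": .value | type}; map(parse_entry)
        | sort_by(.key) | from_entries;
def profile_array_objects:
    map(profile_object) | map(to_entries) | reduce .[] as $item ([]; . + $item) | sort_by(.key) | from_entries;
'

# Made-up sample values (hypothetical, for illustration only).
obj_profile=$(jq -n -c -S "$defs"'{"name": "XYZ", "ids": [1, 2], "market": {"value": "1"}} | profile_object')
arr_profile=$(jq -n -c -S "$defs"'[{"code": "11"}, {"code": "12", "rate": 17.73}] | profile_array_objects')

echo "$obj_profile"   # {"ids":"array","market":"object","name":"string"}
echo "$arr_profile"   # {"code":"string","rate":"number"}
```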

I’m sure I’ll have to modify them after I describe my question.

I’d like a jq line to profile a single object. If a key maps to an array of objects, collect the unique keys across those objects, and keep profiling downward through any nested arrays of objects. If a value is an object, profile that object.

Sorry for the long example, but imagine several GBs of this:

{
    "name": "XYZ Company",
    "type": "Contractors",
    "reporting": [
        {
            "group_id": "660",
            "groups": [
                {
                    "ids": [
                        987654321,
                        987654321,
                        987654321
                    ],   
                    "market": {
                        "name": "Austin, TX",
                        "value": "873275"
                    }
                },
                {
                    "ids": [
                        987654321,
                        987654321,
                        987654321
                    ],   
                    "market": {
                        "name": "Nashville, TN",
                        "value": "2393287"
                    }
                }
            ]
        }
    ],
    "product_agreements": [
        {
            "negotiation_arrangement": "FFVII",
            "code": "84144",
            "type": "DJ",
            "type_version": "V10",
            "description": "DJ in a mask",
            "name": "Claptone",
            "negotiated_rates": [
                {
                    "company_references": [
                        1,
                        5,
                        458
                    ],
                    "negotiated_prices": [
                        {
                            "type": "negotiated",
                            "rate": 17.73,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_modifier_code": [
                                "124"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                },
                {
                    "company_references": [
                        747
                    ],
                    "negotiated_prices": [
                        {
                            "type": "fee",
                            "rate": 28.42,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                }
            ]
        },
        {
            "negotiation_arrangement": "MGS3",
            "name": "David Byrne",
            "type": "Producer",
            "type_version": "V10",
            "code": "654321",
            "description": "Frontman from Talking Heads",
            "negotiated_rates": [
                {
                    "company_references": [
                        1,
                        9,
                        2344,
                        8456
                    ],
                    "negotiated_prices": [
                        {
                            "type": "negotiated",
                            "rate": 68.73,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                },
                {
                    "company_references": [
                        679
                    ],
                    "negotiated_prices": [
                        {
                            "type": "fee",
                            "rate": 89.25,
                            "expiration_date": "9999-12-31",
                            "code": [
                                "11"
                            ],
                            "billing_class": "professional"
                        }
                    ]
                }
            ]
        }
    ],
    "version": "1.3.1",
    "last_updated_on": "2023-02-01"
}

Desired output:

{
    "name": "string",
    "type": "string",
    "reporting": [
      {
        "group_id": "string",
        "groups": [
            {
                "ids": [
                    "number"
                ],
                "market": {
                    "name": "string",
                    "value": "string"
                }
            }
        ]
      }
    ],
    "product_agreements": [
      {
        "negotiation_arrangement": "string",
        "code": "string",
        "type": "string",
        "type_version": "string",
        "description": "string",
        "name": "string",
        "negotiated_rates": [
          {
            "company_references": [
                "number"
            ],
            "negotiated_prices": [
              {
                "type": "string",
                "rate": "number",
                "expiration_date": "string",
                "code": [
                  "string"
                ],
                "billing_modifier_code": [
                  "string"
                ],
                "billing_class": "string"
              }
            ]
          }
        ]        
      }
    ],
    "version": "string",
    "last_updated_on": "string"
}

Really sorry if there are any errors in that, but I tried to make it all consistent and about as simple as I could.

To restate the need: recursively profile each key in a JSON object whenever its value is an object or an array. The solution needs to be independent of key names. Happy to clarify further if needed.

3 Answers


  1. The jq module schema.jq at https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed was designed to produce the kind of structural schema you describe.

    For very large inputs, it might be very slow, so if the JSON is sufficiently regular, it might be possible to use a hybrid strategy – profiling enough of the data to come up with a comprehensive structural schema, and then checking that it does apply.

    For conformance testing of structural schemas such as produced by schema.jq, see https://github.com/pkoppstein/JESS
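
One way to picture the hybrid strategy is sketched below. It is only an illustration under stated assumptions: the three records in data are made up, the inline schema def is a simplified stand-in rather than schema.jq itself, and the check looks at top-level keys only, which is far weaker than real conformance testing with JESS.

```shell
# Made-up records (hypothetical); the third has a key the first two lack.
data='[{"id":1,"tags":["a"]},{"id":2,"tags":[]},{"id":3,"tags":["b"],"note":"x"}]'

# Simplified stand-in profiler (NOT schema.jq): scalars become type names,
# and the object schemas found in an array are merged into one.
defs='
def schema:
  if type == "object" then .[] |= schema
  elif type == "array" then map(schema) | unique
       | if (first | type) == "object" then [add] else . end
  else type
  end;
'

# Step 1: profile only a sample (here, the first two records).
sample=$(printf '%s' "$data" | jq -c -S "$defs"'.[:2] | schema')

# Step 2: scan the remaining records for top-level keys missing from the sample schema.
unknown=$(printf '%s' "$data" | jq -c --argjson s "$sample" '[.[2:][] | keys - ($s[0] | keys) | .[]] | unique')

echo "$sample"    # [{"id":"number","tags":["string"]}]
echo "$unknown"   # ["note"]
```

A non-empty result in the second step suggests the sample was too small to be comprehensive, so more of the data would need to be profiled before trusting the schema.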

  2. Given your input.json, here is a solution:

    jq '
    def schema:
        if   type == "object" then .[] |= schema
        elif type == "array"  then map(schema)|unique
             | if (first | type) == "object" then [add] else . end
        else type
        end;
    schema
    ' input.json
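
As a quick check of how this behaves, the sketch below runs the same def on two small made-up inputs (not the asker's data). The -S flag sorts keys so that the printed result is stable. Because the object schemas inside an array are merged with add, keys from all objects are collected, but when two objects disagree about the type of a key, only one of the types survives.

```shell
# Same def as above, held in a shell variable so it can be reused.
defs='
def schema:
    if   type == "object" then .[] |= schema
    elif type == "array"  then map(schema)|unique
         | if (first | type) == "object" then [add] else . end
    else type
    end;
'

# Keys from both objects in the array end up in one merged profile.
merged=$(jq -n -c -S "$defs"'{"a": [{"b": 1, "c": [1, 2]}, {"b": 2, "d": "x"}]} | schema')

# "b" is a number in one object and a string in the other; only one type is kept.
conflict=$(jq -n -c -S "$defs"'{"a": [{"b": 1}, {"b": "x"}]} | schema')

echo "$merged"     # {"a":[{"b":"number","c":["number"],"d":"string"}]}
echo "$conflict"   # {"a":[{"b":"string"}]}
```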
    
  3. Here’s a variant of @Philippe’s solution: it coalesces objects in map(schema) for arrays in a principled though lossy way. (All these half-solutions trade precision for speed.)

    Note that keys_unsorted is used below; if using gojq, then either this would have to be changed to keys, or a def of keys_unsorted provided.

    # Use "JSON" as the union of two distinct types
    # except combine([]; [ $x ]) => [ $x ]
    def combine($a;$b):
      if $a == $b then $a elif $a == null then $b elif $b == null then $a
      elif ($a == []) and ($b|type) == "array" then $b
      elif ($b == []) and ($a|type) == "array" then $a
      else "JSON"
      end;
    
    # Profile an array by calling mergeTypes(.[] | schema)
    # in order to coalesce objects
    def mergeTypes(s):
        reduce s as $t (null;
           if ($t|type) != "object" then .types = (.types + [$t] | unique)
           else .object as $o
           | .object = reduce ($t | keys_unsorted[]) as $k ($o;
                        .[$k] = combine( $t[$k]; $o[$k] ) 
              )
           end)
           | (if .object then [.object] else null end ) + .types ;
    
    def schema:
        if   type == "object" then .[] |= schema
        elif type == "array"
        then if . == [] then [] else mergeTypes(.[] | schema) end
        else type
        end;
    schema
    

    Example:
    Input:

    {"a": [{"b":[1]}, {"c":[2]}, {"c": []}] }
    

    Output:

    {
      "a": [
        {
          "b": [
            "number"
          ],
          "c": [
            "number"
          ]
        }
      ]
    }
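
The effect of combine on conflicting types can be seen with a second, made-up input in which the key b is a number in one object and a string in the other. The sketch repeats the defs so that it runs on its own. It also includes a one-line keys_unsorted shim of the kind mentioned for gojq; placing it ahead of mergeTypes is an assumption of this sketch, and under jq it merely shadows the builtin and returns sorted keys.

```shell
defs='
# Shim intended for gojq only; harmless under jq, where it shadows the builtin.
def keys_unsorted: keys;

def combine($a;$b):
  if $a == $b then $a elif $a == null then $b elif $b == null then $a
  elif ($a == []) and ($b|type) == "array" then $b
  elif ($b == []) and ($a|type) == "array" then $a
  else "JSON"
  end;

def mergeTypes(s):
    reduce s as $t (null;
       if ($t|type) != "object" then .types = (.types + [$t] | unique)
       else .object as $o
       | .object = reduce ($t | keys_unsorted[]) as $k ($o;
                    .[$k] = combine( $t[$k]; $o[$k] )
          )
       end)
       | (if .object then [.object] else null end ) + .types ;

def schema:
    if   type == "object" then .[] |= schema
    elif type == "array"
    then if . == [] then [] else mergeTypes(.[] | schema) end
    else type
    end;
'

# Made-up input: "b" has two different types, "c" is an empty array.
mixed=$(jq -n -c "$defs"'{"a": [{"b": 1}, {"b": "x"}, {"c": []}]} | schema')

echo "$mixed"   # {"a":[{"b":"JSON","c":[]}]}
```

Here the disagreement over b is reported as "JSON" instead of one type silently replacing the other.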
    