Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Json – How to use `select` within a jq –stream command?

kostr
December 9, 2022
186 views
0 votes
2 Answers

I have a very large json document (~100 GB) that I am trying to use jq to parse out specific objects that meet a given criteria. Because it is so large, I won’t be able to read it into memory, and will need to utilize the --stream option.

I understand how to run a select to extract what I need when I’m not streaming, but could use some assistance in figuring out how to configure my command correctly.

Here’s a sample of my document named example.json.

{
  "reporting_entity_name" : "INSURANCE COMPANY",
  "reporting_entity_type" : "INSURER",
  "last_updated_on" : "2022-12-01",
  "version" : "1.0.0",
  "in_network" : [ {
    "negotiation_arrangement" : "ffs",
    "name" : "ER VISIT",
    "billing_code_type" : "CPT",
    "billing_code_type_version" : "2022",
    "billing_code" : "99285",
    "description" : "HIGHEST LEVEL ER VISIT",
    "negotiated_rates" : [ {
      "provider_groups" : [ {
        "npi" : [ 111111111, 222222222],
        "tin" : {
          "type" : "ein",
          "value" : "99-9999999"
        }
      } ],
      "negotiated_prices" : [ {
        "negotiated_type" : "negotiated",
        "negotiated_rate" : 550.50,
        "expiration_date" : "9999-12-31",
        "service_code" : [ "23" ],
        "billing_class" : "institutional"
      } ]
    } ]
  }
]
}

I am trying to grab the in_network object where billing_code is equal to 99285.

If I was able to do this without streaming, here’s how I would approach it:

jq '.in_network[] | select(.billing_code == "99285")' example.json

Expected output:

{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}

Any help on how I could configure this with the --stream option would be greatly appreciated!

Tags: bigdata jq json select

Answers

If the objects from the .in_network array alone do fit into your memory, truncate at the array items (two levels deep):

jq --stream -n '
  fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
  | select(.billing_code == "99285")
' example.json

{
  "negotiation_arrangement": "ffs",
  "name": "ER VISIT",
  "billing_code_type": "CPT",
  "billing_code_type_version": "2022",
  "billing_code": "99285",
  "description": "HIGHEST LEVEL ER VISIT",
  "negotiated_rates": [
    {
      "provider_groups": [
        {
          "npi": [
            111111111,
            222222222
          ],
          "tin": {
            "type": "ein",
            "value": "99-9999999"
          }
        }
      ],
      "negotiated_prices": [
        {
          "negotiated_type": "negotiated",
          "negotiated_rate": 550.5,
          "expiration_date": "9999-12-31",
          "service_code": [
            "23"
          ],
          "billing_class": "institutional"
        }
      ]
    }
  ]
}

- peak
- December 9, 2022 at 4:32 pm
- 0 votes
0
You will find jq —-stream excruciatingly slow even for 10GB. Since jq is intended to complement other shell tools, I would recommend using jstream (https://github.com/bcicen/jstream), or my own jm or jm.py (https://github.com/pkoppstein/jm), to ”splat” the array, and pipe the result to jq.

E.g. to achieve the same effect as your jq filter:
```
jm —-pointer /in_network example.json | 
  jq 'select(.billing_code == "99285")' 
```
Login or Signup to reply.

Please signup or login to give your own answer.

Click here to cancel reply.