
Edit – I was not clear. My apologies.

  1. My customer uses automation to produce enormous JSON-like files; they can be tens of gigabytes or more, and I cannot control these files in size or in content.

  2. The files aren’t valid JSON; they tend to be sequential JSON records without separators between them. They look sort of like this:

    { "a":1, "b": 2, … }
    { "a":2, "b": 4, … }
    { "a":3, "b": 6, … }

  3. Our software runs at customer site, autonomously, without my team being present after the initial setup.

  4. Customers have many files, and I have many customers. Custom coding is a last resort.

  5. I have jq in my environment. I would prefer to use what I already have.

Given the above set-up, I fear jq -s will load entire multi-gigabyte files into memory.

I need to convert the above semi-json into something valid like:

[
{ 'a':1, 'b': 2, ... },
{ 'a':2, 'b': 4, ... },
{ 'a':3, 'b': 6, ... }
]

and I would like to stream the JSON while I make this conversion, to reduce resource consumption.

Using jq --slurp ".", the files are converted to the desired array-of-records. However, slurping pulls the entire file into memory, and that’s not OK.

Using jq, what’s an alternative "streaming" method?

2 Answers


  1. [Note: the following is a response to the original question, in which the input was described as a sequence of newline-delimited JSON-like records, with the suggestion that each record might be a valid hjson value that was not valid JSON.]

    I need to fix the format before I try to process the file.

    First, please notice that the proposed "fix" is at best incomplete, as in JSON, the keys must be JSON strings (i.e., with double-quotes).

    Second, if your preferred tool requires you to have a single ginormous JSON file, then as others have suggested, the problem is with your preference in that regard. Since you have indicated a willingness to use jq, please note that the C, Go and Rust implementations thereof all handle JSON streams very nicely, without any need to engage in slurpiness (see the sketch after these points).

    Thirdly, there are any number of approaches for converting a quasi-JSON stream such as you have described to a stream of valid JSON entities. I’d focus my attention there.
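
    To illustrate the second point concretely: jq reads, parses and emits one top-level JSON value at a time, so memory stays bounded regardless of file size. A minimal sketch (the file names are placeholders):

    # jq processes each top-level record in turn; nothing is slurped
    jq -c . huge.qjson > huge.ndjson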


    Unfortunately the hjson CLI program does not handle streams, and it would probably be unwise to invoke the program zillions of times, so here’s a simple Python program that could be used on newline-delimited log files:

    import sys
    
    import hjson
    
    def process_json_line_by_line(stream):
        for line in stream:
            line = line.strip()  # Remove leading/trailing whitespace
            if not line:
                continue  # Skip empty lines
            try:
                # Parse the line as Hjson and re-emit it as valid JSON
                obj = hjson.loads(line)
                print(hjson.dumpsJSON(obj))
            except hjson.HjsonDecodeError as e:
                # Report failures on stderr so stdout remains a clean JSON stream
                print("Failed to parse line:", line, file=sys.stderr)
                print("Error:", e, file=sys.stderr)
    
    with open('test.qjson', 'r') as file:
        process_json_line_by_line(file)
    
    

    As @pmf pointed out, once the stream of (hjson?) measurements has been converted to a JSON stream, you can use jq to convert that to a JSON array efficiently (i.e. without slurping) by:

    jq -r '"[", ., (inputs | ",", .), "]"'
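
    For an end-to-end conversion, the two steps can be chained. A sketch, assuming the Python program above is saved as hjson2ndjson.py (a hypothetical name); as written, it reads test.qjson and prints JSON records to stdout:

    # Convert the hjson-ish records to a JSON stream, then wrap that stream in
    # a single array; neither step holds the whole file in memory
    python3 hjson2ndjson.py | jq -r '"[", ., (inputs | ",", .), "]"' > records.json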
    
  2. The files aren’t valid JSON; they tend to be sequential JSON records without separators between them.

    So long as the input consists of “sequential JSON records” (with or without whitespace between them), jq can handle it. To convert such input to ndjson (i.e. a stream of newline-delimited JSON records), you could simply use jq ., but in practice you will probably want the ability to recover from errors, so your jq invocation will look something like

    jq -n 'recurse(try inputs catch .)'
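
    If the records are indeed individually valid JSON, the array-wrapping one-liner from the first answer can also be applied to the raw file directly, with no preprocessing step. A sketch, assuming the input file is named input.qjson:

    # Stream the concatenated records straight into a single JSON array
    jq -r '"[", ., (inputs | ",", .), "]"' input.qjson > output.json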
    

    For files which consist of hjson objects that are not JSON objects: if the individual objects are newline-separated, you could use the Python hjson package as explained elsewhere on this page. Otherwise, you will probably face a difficult choice: try some hackery (e.g. repairing the damage using sed, as sketched below); write your own parser; remonstrate with the customer; or abandon said customer.
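
    Purely to illustrate the "hackery" option, and only under the assumption that the sole deviation from JSON is single quotes used in place of double quotes (and that no string value contains an apostrophe), a fragile sed repair might look like this:

    # DANGER: blind quote substitution; this will mangle any value containing '
    sed "s/'/\"/g" input.qjson | jq -c .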
