Edit – I was not clear. My apologies.
My customer produces enormous JSON-like files using automation; they can be tens of gigabytes or more, and I cannot control their size or content.
The files aren’t valid JSON; they tend to be sequential JSON records without separators between them. They look sort of like this:
{ "a":1, "b": 2, … }
{ "a":2, "b": 4, … }
{ "a":3, "b": 6, … } -
Our software runs autonomously at the customer's site, without my team being present after the initial setup.
Customers have many files, and I have many customers. Custom coding is a last resort.
I have jq in my environment. I would prefer to use what I already have.
Given the above set-up, I fear jq -s will load entire multi-gigabyte files into memory.
I need to convert the above semi-JSON into something valid like:
[
{ 'a':1, 'b': 2, ... },
{ 'a':2, 'b': 4, ... },
{ 'a':3, 'b': 6, ... }
]
and I would like to stream the JSON while making this conversion, to reduce resource consumption.
Using jq --slurp ".", the files are converted to the desired array-of-records. However slurp pulls the entire file into memory and that’s not ok.
Using jq, what’s an alternative "streaming" method?
2 Answers
[Note: the following is a response to the original question, in which the input was described as a sequence of newline-delimited JSON-like records, with the suggestion that each record might be a valid hjson value that was not valid JSON.]
First, please notice that the proposed "fix" is at best incomplete, as in JSON, the keys must be JSON strings (i.e., with double-quotes).
Second, if your preferred tool requires you to have a single ginormous JSON file, then as others have suggested, the problem is with your preference in that regard. Since you have indicated a willingness to use jq, please note that the C, Go and Rust implementations thereof all handle JSON streams very nicely, without any need to engage in slurpiness.
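For instance, jq itself will iterate over a stream of concatenated JSON values one at a time, with no slurping involved (a hypothetical two-record stream):

printf '{"a":1} {"a":2}\n' | jq .a
# output:
# 1
# 2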
Third, there are any number of approaches for converting a quasi-JSON stream such as you have described into a stream of valid JSON entities; I’d focus my attention there.
Unfortunately, the hjson CLI program does not handle streams, and it would probably be unwise to invoke the program zillions of times, so here’s a simple Python program that could be used on newline-delimited log files:
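A minimal sketch of such a program, assuming the Python hjson package is installed and that each input line holds exactly one record:

import json
import sys

import hjson  # third-party package; parses hjson (and plain JSON) text

# Read newline-delimited hjson records on stdin; emit valid ndjson on stdout.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = hjson.loads(line)      # parse one quasi-JSON record
    except Exception:
        print(f"skipping unparseable line: {line!r}", file=sys.stderr)
        continue
    print(json.dumps(record))           # one valid JSON record per line

Each record is parsed and re-serialized independently, so memory use is bounded by the largest single record rather than by the whole file.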
As @pmf pointed out, once the stream of (hjson?) measurements has been converted to a JSON stream, you can use jq to convert that to a JSON array efficiently (i.e. without slurping) by:
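A sketch of that step, assuming the now-valid JSON stream arrives on stdin (out.json is a hypothetical destination):

jq -n '[inputs]' > out.json

With -n, jq does not consume any input automatically; inputs then reads each entity in turn, so [inputs] builds the array without slurping the raw text (the finished array must, of course, still fit in memory).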
So long as the input consists of “sequential JSON records” (with or without whitespace between them), jq can handle it. To convert such input to ndjson (i.e. a stream of newline-delimited JSON records), you could simply use jq ., but in practice you will probably want to have the ability to recover from errors, so your jq invocation will probably look something like the sketch below.

For files which consist of hjson objects that are not JSON objects: if the individual objects are newline-separated, you could use the Python hjson package as explained elsewhere on this page. Otherwise, you will probably face a difficult choice: try some hackery (e.g. repairing the damage using sed); write your own parser; remonstrate with the customer; or abandon said customer.
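A minimal error-tolerant sketch for the JSON case, assuming each record sits on its own line (big.log is a hypothetical input file):

jq -cR 'fromjson?' big.log

Here -R makes jq read each line as a raw string, fromjson? parses that string and silently discards lines that do not parse as JSON, and -c writes one compact record per output line, i.e. ndjson.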