I need to split a very large JSON file (20 GB) into multiple smaller JSON files (say, with a threshold of 100 MB each).
The example file layout looks like this.
file.json
[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]},{"name":"kruger", "Place":"boston",
"phone_number":["980281", "980282", "980283"]},{"name":"Dan", "Place":"Texas","phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]
The output should look like this
file1:
[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]}, {"name":"kruger", "Place":"boston", "phone_number":["980281", "980282", "980283"]}]
file2:
[{"name":"Dan", "Place":"Texas","phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]
What is the best way to achieve this? Should I opt for a shell command or Python?
2 Answers
The Python module json-stream can do this, with a few caveats, which I'll get to later. You'll have to implement the visitor pattern.
This visitor function will get called for each complete JSON element encountered, in a depth-first manner. So each complete JSON element (number, string, array, etc.) will invoke this callback. It is up to you to decide at which point to pause processing and write your partial file out.
Things to look out for: if your input file is a single JSON element (like a single dictionary), you will have to change the output structure if you want the split-up files to also be valid JSON.
An illustrative example of this would be to try to split this JSON file
{ "top" : [1,2,3] }
into two separate files of half the size. You can't without changing the data structure.
As long as the file is structured that way, with one item per line and no item in the main list that is itself a sub-list, you can just do a basic string replacement with sed. This is fragile, but relatively fast and memory-efficient, since sed is designed for streaming text. Here is an example modifying "file.json" in-place:
Then each line can be written to a separate file using a basic bash loop with read. To process the input file without modifying it and write the target files, you can do this:
For the example file, it creates two files: file1 and file2.
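For input that is not laid out one item per line (for instance, the whole array on a single line), the same kind of split can be sketched in Python using only the standard library. This is a minimal sketch under assumptions, not the method of either answer: it assumes the top level is a JSON array and that each individual item fits in memory, and instead of json-stream it decodes one item at a time with json.JSONDecoder.raw_decode. The names iter_array_items and split_json_array, the file1, file2, ... naming, and the 1 MiB read size are illustrative choices. It is also lenient about separators, so it should not be relied on to validate the input.

```python
import json
from pathlib import Path

WHITESPACE = " \t\r\n"


def iter_array_items(fp, chunk_size=1 << 20):
    """Yield the items of a top-level JSON array, reading fp in chunks."""
    decoder = json.JSONDecoder()
    buf, pos = "", 0
    started = False  # True once the opening '[' has been consumed
    eof = False
    while True:
        while pos < len(buf) and buf[pos] in WHITESPACE:
            pos += 1
        if pos < len(buf):
            ch = buf[pos]
            if not started:
                if ch != "[":
                    raise ValueError("expected a top-level JSON array")
                started = True
                pos += 1
                continue
            if ch == "]":
                return
            if ch == ",":
                pos += 1
                continue
            try:
                item, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                if eof:
                    raise  # nothing more to read, so the input is malformed
            else:
                nxt = end
                while nxt < len(buf) and buf[nxt] in WHITESPACE:
                    nxt += 1
                # Trust the value only once its terminator is visible;
                # otherwise a number cut at a chunk boundary could be misread.
                if eof or (nxt < len(buf) and buf[nxt] in ",]"):
                    yield item
                    pos = nxt
                    continue
        if eof:
            raise ValueError("unexpected end of input")
        # Need more data: drop the consumed prefix and append the next chunk.
        chunk = fp.read(chunk_size)
        eof = not chunk
        buf = buf[pos:] + chunk
        pos = 0


def split_json_array(src, out_dir, max_bytes=100 * 1024 * 1024, prefix="file"):
    """Write the array in src as numbered files of at most about max_bytes."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, out, size = [], None, 0
    with open(src, encoding="utf-8") as fp:
        for item in iter_array_items(fp):
            text = json.dumps(item, ensure_ascii=False)
            n = len(text.encode("utf-8"))
            # Close the current file if this item (plus ',' and ']') would
            # push it over the limit.
            if out is not None and size + n + 2 > max_bytes:
                out.write("]")
                out.close()
                out = None
            if out is None:
                path = out_dir / f"{prefix}{len(paths) + 1}"
                paths.append(path)
                out = open(path, "w", encoding="utf-8")
                out.write("[")
                size = 1
            else:
                out.write(",")
                size += 1
            out.write(text)
            size += n
    if out is not None:
        out.write("]")
        out.close()
    return paths
```

A call such as split_json_array("file.json", "parts") would write parts/file1, parts/file2, and so on, each a valid JSON array. An item larger than the threshold still gets a file of its own, so the limit is approximate, and an empty input array produces no files.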