
I need to split a very large JSON file (20 GB) into multiple smaller JSON files (say, with a threshold of 100 MB per file).

The example file layout looks like this.

file.json

[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]}, {"name":"kruger", "Place":"boston", "phone_number":["980281", "980282", "980283"]}, {"name":"Dan", "Place":"Texas", "phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]

The output should look like this:

file1:

[{"name":"Joe", "Place":"Denver", "phone_number":["980283", "980284", "980285"]}, {"name":"kruger", "Place":"boston", "phone_number":["980281", "980282", "980283"]}]

file2:

[{"name":"Dan", "Place":"Texas","phone_number":["980286", "980287", "980286"]}, {"name":"Kyle", "Place":"Newyork", "phone_number":["980282", "980288", "980289"]}]

What is the best way to achieve this? Should I opt for a shell command or Python?

2 Answers


  1. The Python module json-stream can do this, with a few caveats, which I’ll get to later.

    You’ll have to implement the visitor pattern.

    import json_stream
    
    def visitor(item, path):
        print(f"{item} at path {path}")
    
    with open('mylargejsonfile.json','r') as f:
        json_stream.visit(f, visitor)
    

    This visitor function gets called once for each complete JSON element encountered, in a depth-first manner. So every complete JSON element (number, string, array, object, etc.) will invoke the callback, and it is up to you to decide at which point to pause processing and write a partial file out.

    One thing to look out for: if your input file is a single JSON element (such as one large dictionary), you will have to change the output structure if you want the split-up files to also be valid JSON.

    An illustrative example: try to split the JSON file
    { "top" : [1,2,3] } into two separate files of half the size. You can't without changing the data structure.

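    To make the "pause and write a partial file" step concrete, here is a sketch of the chunking decision using only the stdlib `json` module rather than json-stream: `JSONDecoder.raw_decode` pulls one array element at a time out of a string, so you can accumulate elements until a size threshold is reached and then start a new chunk. The function name `split_json_array` and the `max_bytes` parameter are made up for illustration, and a real 20 GB file would also need buffered reads instead of one in-memory string.

    ```python
    import json

    def split_json_array(text, max_bytes):
        """Split the serialized JSON array in `text` into chunks whose
        serialized size stays under max_bytes (illustrative sketch)."""
        decoder = json.JSONDecoder()
        idx = text.index('[') + 1          # step past the opening bracket
        chunks, current, size = [], [], 2  # 2 bytes for "[" and "]"
        while True:
            # skip whitespace and the commas between elements
            while idx < len(text) and text[idx] in ' \t\r\n,':
                idx += 1
            if idx >= len(text) or text[idx] == ']':
                break
            obj, idx = decoder.raw_decode(text, idx)  # parse ONE element
            piece = json.dumps(obj)
            # start a new chunk if adding this element would cross the limit
            if current and size + len(piece) + 2 > max_bytes:
                chunks.append(current)
                current, size = [], 2
            current.append(obj)
            size += len(piece) + 2
        if current:
            chunks.append(current)
        return chunks
    ```

    Each returned chunk is a plain Python list, so writing `json.dump(chunk, out)` per chunk produces the valid-JSON split files shown in the question.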
  2. As long as the file is structured that way, with one item per line and no element of the main list being itself a list, you can just do a basic string replacement with sed. This is fragile, but relatively fast and memory-efficient, since sed is designed for streaming text.

    Here is an example modifying "file.json" in-place:

    sed -e 's/^\[//' -e 's/, *$//' -e 's/\]$//' -i file.json
    

    Then each line can be written to a separate file using a basic bash loop with read.

    To process the input file without modifying it and write the target files, you can do this:

    i=1
    sed -e 's/^\[//' -e 's/, *$//' -e 's/\]$//' file.json | while read -r line; do
        echo "[$line]" > "file$i"
        i=$((i+1))
    done
    

    For the example file, it creates two files: file1 and file2
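    If you would rather stay in Python than depend on sed, the same per-line idea can be sketched in pure stdlib code, under the same assumption of one array element per line. The function name split_per_line is made up for illustration:

    ```python
    import json

    def split_per_line(lines):
        """Turn each line of a one-element-per-line JSON array into its
        own single-element JSON array string (same idea as the sed pipeline)."""
        out = []
        for raw in lines:
            s = raw.strip()
            if s.startswith('['):   # opening bracket of the whole array
                s = s[1:].lstrip()
            if s.endswith(','):     # comma separating elements
                s = s[:-1].rstrip()
            if s.endswith(']'):     # closing bracket of the whole array
                s = s[:-1].rstrip()
            if s:
                out.append('[' + s + ']')
        return out
    ```

    Iterating over the open file instead of a list of lines keeps memory use constant; each returned string can then be written to file1, file2, and so on.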
