
I’m working on a project where I need to process a JSON file that’s over 5.5 GB in size.

I’ve tried using json.load() from the json module, but it loads the entire file into memory, which isn’t practical for this size.

Thank you

2 Answers


  1. You might want to use the ijson library for processing large JSON files in Python without running into memory issues. Here is a detailed article on how to use it. A minimal sketch of typical usage is shown below.

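    As an illustration of the idea (not taken from the linked article), here is a minimal sketch with ijson, assuming the top-level document is a JSON array and using a placeholder filename:

    import ijson

    # Stream the elements of a top-level JSON array one at a time,
    # so the whole multi-gigabyte document is never held in memory.
    # "large_file.json" is a placeholder path.
    with open("large_file.json", "rb") as f:
        for record in ijson.items(f, "item"):  # prefix "item" matches each array element
            print(record)

    ijson also exposes a lower-level event interface, ijson.parse(), which yields (prefix, event, value) tuples if you need to react to individual keys as they arrive.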
  2. Rather than trying to decode the whole document and then work with the data, you need to use a streaming decoder. This treats the JSON as a "stream" of data, so you work on one piece of it at a time.

    One example is json-stream, which has several modes. The simplest is transient mode, which reads the JSON but doesn’t store the whole document. This is useful if you’re reading a large array or dictionary.

    import json_stream

    # JSON: [1, 2, 3, 4, 5, ...]
    # Open the large file lazily; the filename is a placeholder.
    with open("large_file.json") as f:
        nums = json_stream.load(f)  # transient mode: nothing is kept in memory

        for num in nums:
            print(num)

    For more complex data, use the visitor pattern where you pass in a function to handle each piece of data.

    import json_stream

    # JSON: {"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]}

    def visitor(item, path):
        print(f"{item} at path {path}")

    # The filename is a placeholder.
    with open("large_file.json") as f:
        json_stream.visit(f, visitor)
    
    This prints:

    1 at path ('x',)
    {} at path ('y',)
    1 at path ('xxxx', 0)
    2 at path ('xxxx', 1)
    1 at path ('xxxx', 2, 'yyyy')
    z at path ('xxxx', 3)
    1 at path ('xxxx', 4)
    [] at path ('xxxx', 5)
    