I have a JSON file with 1,000,000 entries in it (size: 405 MB). It looks like this:

[
  {
     "orderkey": 1,
     "name": "John",
     "age": 23,
     "email": "[email protected]"
  },
  {
     "orderkey": 2,
     "name": "Mark",
     "age": 33,
     "email": "[email protected]"
  },
...
]

The data is sorted by "orderkey", and I need to shuffle it.

I tried the following Python code. It worked for a smaller JSON file, but not for my 405 MB one.

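import json
import random

# Load the whole array into memory
with open("sorted.json") as f:
    data = json.load(f)

# Shuffle the list in place
random.shuffle(data)

# Reopen in write mode ("w"); a file opened in the default read mode cannot be written to
with open("sorted.json", "w") as f:
    json.dump(data, f, indent=2)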

How can I shuffle a file of this size?

UPDATE:

Initially I got the following error:

~/Desktop/shuffleData$ python3 toShuffle.py 
Traceback (most recent call last):
  File "/home/andrei/Desktop/shuffleData/toShuffle.py", line 5, in <module>
    data = json.load(f)
  File "/usr/lib/python3.10/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 403646259 (char 403646258)

I figured out that the problem was a stray "}" at the end of the JSON file. The file looked like [{...},{...}]}, which is not valid JSON.

Removing "}" fixed the problem.
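For anyone hitting the same "Extra data" error, here is a minimal sketch of how the stray text could be located without scrolling through a 405 MB file. It uses `json.JSONDecoder.raw_decode`, which parses one JSON value from the start of a string and returns the index where that value ended. The helper name `split_trailing_data` and the two-entry sample string are made up for illustration; the real file would be read with `f.read()` instead, and the text is assumed to start directly with the JSON value (no leading whitespace).

```python
import json


def split_trailing_data(text):
    """Parse the first JSON value in text and return it with whatever follows it."""
    decoder = json.JSONDecoder()
    # raw_decode stops at the end of the first complete JSON value
    # and returns (value, index_where_it_ended).
    value, end = decoder.raw_decode(text)
    return value, text[end:].strip()


# Tiny stand-in for the broken file: a valid array followed by a stray "}".
broken = '[{"orderkey": 1}, {"orderkey": 2}]}'

try:
    json.loads(broken)
except json.JSONDecodeError as err:
    # Same kind of error as in the traceback above.
    print(err.msg, "at char", err.pos)

data, trailing = split_trailing_data(broken)
print(len(data), "entries parsed; trailing text:", repr(trailing))
```

If the trailing text turns out to be just a stray bracket, as it was here, the parsed `data` can be shuffled and written back as usual.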

2 Answers

  1. Chosen as BEST ANSWER

    I figured out that the problem was a stray "}" at the end of the JSON file. The file looked like [{...},{...}]}, which is not valid JSON.

    Removing the "}" at the end fixed the problem.

    The Python code from the question works.


  2. This should work unless you are running into memory constraints.

    import random
    random.shuffle(data)
    

    If you are looking for an alternative and would like to benchmark which is faster on a large data set, you can use the shuffle function from the scikit-learn library.

    from sklearn.utils import shuffle
    
    shuffled_data = shuffle(data)
    print(shuffled_data)
    

    Note: this requires installing the additional scikit-learn package (pip install -U scikit-learn).
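
On the memory point in the second answer, one difference between the two approaches is worth spelling out: `random.shuffle` reorders the existing list in place and returns `None`, while scikit-learn's `shuffle` returns a shuffled copy and leaves the input untouched. The sketch below shows the same distinction using only the standard library, with `random.sample` standing in for the copy-returning variant. The five-entry `records` list is illustrative data, not the asker's file.

```python
import random

records = [{"orderkey": i} for i in range(1, 6)]

# In place: the list object itself is reordered, and the call returns None.
result = random.shuffle(records)

# Copy: random.sample with k=len(...) builds a new shuffled list
# and leaves the original order untouched.
original = [{"orderkey": i} for i in range(1, 6)]
shuffled_copy = random.sample(original, k=len(original))

print(result)
print([r["orderkey"] for r in original])
```

The copy is shallow, so it holds references to the same dictionaries rather than duplicating them, but it is still a second list of a million references on a file of the size in the question.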
