I have a JSON file with 1 000 000 entries in it (Size: 405 Mb). It looks like that:
[
{
"orderkey": 1,
"name": "John",
"age": 23,
"email": "[email protected]"
},
{
"orderkey": 2,
"name": "Mark",
"age": 33,
"email": "[email protected]"
},
...
]
The data is sorted by "orderkey", I need to shuffle data.
I tried to apply the following Python code. It worked for smaller JSON file, but did not work for my 405 MB one.
import json
import random
with open("sorted.json") as f:
data = json.load(f)
random.shuffle(data)
with open("sorted.json") as f:
json.dump(data, f, indent=2)
How to do it?
UPDATE:
Initially I got the following error:
~/Desktop/shuffleData$ python3 toShuffle.py
Traceback (most recent call last):
File "/home/andrei/Desktop/shuffleData/toShuffle.py", line 5, in <module>
data = json.load(f)
File "/usr/lib/python3.10/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 403646259 (char 403646258)
Figured out that the problem was that I had "}" in the end of JSON file. I had [{...},{...}]}
that was not valid.
Removing "}" fixed the problem.
2
Answers
Figured out that the problem was that I had "}" in the end of JSON file. I had
[{...},{...}]}
that was not valid format.Removing "}" in the end fixed the problem.
Provided python code works.
Well this should ideally work unless you have memory constraints.
In case you are looking for another way and would like to benchmark which is faster for the huge set, you can use the sci-kit learn libraries shuffle function.
Note: Additional package has to be installed called Scikit learn. (
pip install -U scikit-learn
)