Using pandas version 1.4.3 and python 3.8 on Ubuntu.
I am dumping some training dataframe for a machine learning model using json orient format (which I find most intuitive on a text editor), but cannot read it back using pandas. Here is the example code.
filename='train_data.json'
train_data.to_json(path_or_buf=filename, orient='records', lines=True, indent=4)
pd.read_json(path_or_buf=filename, orient='records') # Throws error
How to read it back as the exact same dataframe (except the index, and I do not want any extra unnamed column) using pandas?
Also, one more constraint is, the lines
and indent
arguments while dumping are meant for visual aid, not to alter the parsing of the file by pandas. Is it possible to not make any assumption about those arguments while reading back, i.e. decouple the read method from visual aspects of the file?
Asking this because in production, likely the reading and writing will happen in different processes.
2
Answers
The issue is that
lines=True
combined withindent=4
creates an invalid JSON.Note that adding
lines=True
to yourread_json
call will have the reader assume one JSON object per line, so this will not work if you keepindent=4
.read_json
has noindent
argument yet, unfortunately.You can’t have it both ways here, you can either have
lines=True
orindent=4
, but not both to still be able to read your JSON.If you open the file, you’ll see it is invalid:
I suggest for legibility you stick with just
indent
:Reading the dataframe works and you can see your JSON file is also now well-formatted:
Finally, to exclude the index, we would ordinarily add argument
index=False
to yourdf.to_json()
, but there is no need for this here as you are using therecords
orient.Your problem involves how pandas writes and reads the json file.
According to their documentation (see here), one must be cautious about which json format (orientation type) will be applied in the write/read pandas methods.
If you use the orient=’records’, there will be several constraints that will cause the problem you presented in your question. They can surely be solved by a logical reading function, though the effort is meaningless since there are smarter and more ready-to-use ways for your problem.
Instead of addressing all the constraints caused by this orientation choice, you can choose another orientation for the read/write pandas functions so that the pandas.DataFrame index and columns are all preserved. For such, I suggest you use the default value (i.e, "index"), which does exactly what you need (even with the indentation that you required for visual inspection).
Below is a snippet that depicts each orientation approach.