I’m trying to find out if Pandas.read_json performs some level of autodetection. For example, I have the following data:
data_records = [
{
"device": "rtr1",
"dc": "London",
"vendor": "Cisco",
},
{
"device": "rtr2",
"dc": "London",
"vendor": "Cisco",
},
{
"device": "rtr3",
"dc": "London",
"vendor": "Cisco",
},
]
data_index = {
"rtr1": {"dc": "London", "vendor": "Cisco"},
"rtr2": {"dc": "London", "vendor": "Cisco"},
"rtr3": {"dc": "London", "vendor": "Cisco"},
}
If I do the following:
import pandas as pd
import json
pd.read_json(json.dumps(data_records))
---
device dc vendor
0 rtr1 London Cisco
1 rtr2 London Cisco
2 rtr3 London Cisco
Though I get the output I wanted, the data is record based. Given that the default orient is columns, I would not have expected this to work.
Is there therefore some level of autodetection going on? With index-based input the behaviour seems more in line with what I expected: the data appears to be parsed with the columns orient by default.
pd.read_json(json.dumps(data_index))
rtr1 rtr2 rtr3
dc London London London
vendor Cisco Cisco Cisco
pd.read_json(json.dumps(data_index), orient="index")
dc vendor
rtr1 London Cisco
rtr2 London Cisco
rtr3 London Cisco
Answers
TL;DR
When using pd.read_json() with orient=None, the representation of the data is automatically determined through pd.DataFrame().
Explanation
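The pandas documentation is a bit misleading here. When orient is not specified, the parser for ‘columns’ is used, which is self.obj = pd.DataFrame(json.loads(json)). So pd.read_json(json.dumps(data_records)) is equivalent to pd.DataFrame(json.loads(json.dumps(data_records))), which again is equivalent to pd.DataFrame(data_records).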
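The equivalence can be checked with a minimal sketch like the one below. It reuses data_records from the question; wrapping the JSON text in StringIO and comparing the frames as plain Python records are choices made for this illustration, not something the answer prescribes.

```python
import json
from io import StringIO

import pandas as pd

data_records = [
    {"device": "rtr1", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr2", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr3", "dc": "London", "vendor": "Cisco"},
]

json_text = json.dumps(data_records)

# StringIO is used because recent pandas versions deprecate passing a raw JSON string.
from_read_json = pd.read_json(StringIO(json_text))
from_loads = pd.DataFrame(json.loads(json_text))
from_constructor = pd.DataFrame(data_records)

# Compare as plain Python records to sidestep dtype and index-class differences.
as_records = [
    df.to_dict(orient="records")
    for df in (from_read_json, from_loads, from_constructor)
]
print(as_records[0] == as_records[1] == as_records[2])
```

All three frames hold the same rows, which is consistent with read_json handing the decoded list of dicts to the DataFrame constructor when no orient is given.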
I.e., you pass a list of dicts to the DataFrame constructor, which then performs the automatic determination of the data representation. Note that this does not mean that orient is auto-detected. Instead, simple heuristics (see below) on how the data should be loaded into a DataFrame are applied.
Loading JSON-like data through pd.DataFrame()
For the 3 most relevant cases of JSON-structured data, the DataFrame construction through pd.DataFrame() is:
We can’t speak of auto-detection, but rather of a nested hierarchical structure determined by a specified orientation or by the default one.
Moreover, not every data structure can be used with a given orientation.
Case 1 : DataFrames to / from a JSON string
read_json : converts a JSON string to a DataFrame with argument typ = 'frame'
to_json : converts a DataFrame to a JSON string
The orient value must be specified explicitly with the pandas to_json and read_json functions for the split, index, records, table and values orientations.
It is not necessary to specify orient for columns, because the orientation is columns by default.
Case 2 : Series to / from a JSON string
read_json : converts a JSON string to a Series with argument typ = 'series'
to_json : converts a Series to a JSON string
If typ = 'series' in read_json, the default value for the orient argument is index (see the pandas documentation).
When converting a Series into a JSON string using to_json, the default orient value is also index.
The allowed orientation values for a Series are split, records, index and table; any value other than index must be specified through the orient argument.
Resources
Some oriented structures are shown in the code comments at this github link (have a look around line 680 in the _json.py file).
Note that there are no examples with orient=columns in the code comments on GitHub.
This is simply because, in the absence of an orientation specification, columns is used by default.
Clearer view of a nested hierarchical structure
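One way to look at the nested structure behind each orientation is to serialize a small DataFrame with every orient value and inspect the decoded JSON. This is only a sketch: the two-row frame is made up for illustration (it borrows names from the question), and the output is truncated when printed.

```python
import json

import pandas as pd

# Illustrative frame; the values are borrowed from the question's data.
df = pd.DataFrame(
    {"dc": ["London", "London"], "vendor": ["Cisco", "Cisco"]},
    index=["rtr1", "rtr2"],
)

# Decode the JSON produced for each orientation back into Python objects.
shapes = {
    orient: json.loads(df.to_json(orient=orient))
    for orient in ("split", "records", "index", "columns", "values", "table")
}
for orient, shape in shapes.items():
    print(orient, "->", json.dumps(shape)[:80])

# With no orient given, a DataFrame uses columns and a Series uses index.
frame_default = json.loads(df.to_json())
series_default = json.loads(df["vendor"].to_json())
print(frame_default == shapes["columns"])
print(series_default)
```

The records and values shapes are lists, while split, index, columns and table are dicts with different key layouts, which is why a payload written with one orientation generally cannot be read back with another.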
Be careful here
A data structure cannot be used with just any value of the orient argument. Otherwise, a builtins.AttributeError exception may be raised (see the github link for the different structures).
Is there an autodetection mechanism ?
After some experimentation, I can now say the answer is no.
On GitHub, the split data shape is presented like the following :
So let’s do an experiment.
We will use read_json to read the data without specifying the orientation and see whether the split shape is recognized.
Then we will read the data while specifying the split orientation.
If there is automatic shape recognition, the result should be the same in both cases.
We print without specifying the orientation :
and now we print while specifying the split orientation.
The DataFrames are different: pandas does not recognize the split shape.
There is no automatic detection mechanism
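The experiment described above can be reproduced with a sketch along these lines. The split-shaped payload is an assumption built from the question's device names, and StringIO is used only because recent pandas versions deprecate passing a raw JSON string.

```python
import json
from io import StringIO

import pandas as pd

# A payload laid out in the split shape: index, columns and data keys.
split_payload = json.dumps(
    {
        "index": ["rtr1", "rtr2"],
        "columns": ["dc", "vendor"],
        "data": [["London", "Cisco"], ["London", "Cisco"]],
    }
)

# Read once with the default orient and once with orient="split".
without_orient = pd.read_json(StringIO(split_payload))
with_split = pd.read_json(StringIO(split_payload), orient="split")

print(without_orient)
print(with_split)
```

Without orient, the three top-level keys simply become columns; only with orient="split" are they interpreted as row labels, column labels and cell values.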
If you want to understand every detail of a function call, I would suggest using VSCode and setting "justMyCode": false in your launch.json for debugging.
That being said, if you follow what’s going on when you call pd.read_json(), you’ll find that it instantiates a JsonReader and reads it, which in turn instantiates a FrameParser that is parsed with _parse_no_numpy:
As you can see, as stated in a previous answer, in terms of orientation:
is equivalent to
and
to
So in the end it all boils down to how pd.DataFrame handles the passed list of dicts.
Going down this hole, you’ll find that the constructor checks whether the data is list-like and then calls nested_data_to_arrays, which in turn calls to_arrays, which finally calls _list_of_dict_to_arrays:
The "autodetection" is actually the hierarchical handling of all possible cases/types.
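The effect of that type-based handling can be observed directly on the constructor, without going through read_json at all. The sketch below uses the two structures from the question; pd.DataFrame.from_dict with orient="index" is shown only as a point of comparison.

```python
import pandas as pd

data_records = [
    {"device": "rtr1", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr2", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr3", "dc": "London", "vendor": "Cisco"},
]
data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
    "rtr3": {"dc": "London", "vendor": "Cisco"},
}

# A list of dicts: each dict becomes a row, its keys become columns.
records_df = pd.DataFrame(data_records)

# A dict of dicts: outer keys become columns, inner keys become the index.
columns_df = pd.DataFrame(data_index)

# The same dict of dicts, with outer keys explicitly treated as row labels.
index_df = pd.DataFrame.from_dict(data_index, orient="index")

print(records_df)
print(columns_df)
print(index_df)
```

The first two results match the outputs in the question for read_json without orient, and the third matches the orient="index" call.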
No, pandas does not perform any autodetection when using the read_json function.
It is entirely determined by the orient parameter, which specifies the format of the input JSON data.
In your first example, you passed the data_records list to the json.dumps function, which converted it to a JSON string. After passing the resulting JSON string to pd.read_json, it is seen as a record orientation.
In your second example, you passed data_index to json.dumps, and the result is then seen as a "column" orientation.
In both cases, the behavior of the read_json function is based entirely on the value of the orient parameter and not on automatic detection by pandas.