Pandas JSON Orient Autodetection

felix001
February 2, 2023
4 Answers

I’m trying to find out whether pd.read_json performs some level of autodetection. For example, I have the following data:

data_records = [
    {
        "device": "rtr1",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr2",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr3",
        "dc": "London",
        "vendor": "Cisco",
    },
]

data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
    "rtr3": {"dc": "London", "vendor": "Cisco"},
}

If I do the following:

import pandas as pd
import json

pd.read_json(json.dumps(data_records))
---
  device      dc vendor
0   rtr1  London  Cisco
1   rtr2  London  Cisco
2   rtr3  London  Cisco

Although I get the output I wanted, the data is record-based. Since the default orient is columns, I would not have expected this to work.

So is there some level of autodetection going on? With index-based input the behaviour is more in line with what I expected: the data appears to have been parsed with the columns orient by default.

pd.read_json(json.dumps(data_index))

          rtr1    rtr2    rtr3
dc      London  London  London
vendor   Cisco   Cisco   Cisco

pd.read_json(json.dumps(data_index), orient="index")

          dc vendor
rtr1  London  Cisco
rtr2  London  Cisco
rtr3  London  Cisco

Answers

- ErikFubel
- February 6, 2023 at 3:47 pm
- 0 votes
TL;DR

When using pd.read_json() with orient=None, the representation of the data is automatically determined through pd.DataFrame().

Explanation

The pandas documentation is a bit misleading here. When orient is not specified, the parser for ‘columns’ is used, which boils down to self.obj = pd.DataFrame(json.loads(json)). So
```
pd.read_json(json.dumps(data_records))
```
is equivalent to
```
pd.DataFrame(json.loads(json.dumps(data_records)))
```
which again is equivalent to
```
pd.DataFrame(data_records)
```
I.e., you pass a list of dicts to the DataFrame constructor, which then performs the automatic determination of the data representation. Note that this does not mean that orient is auto-detected. Instead, simple heuristics (see below) on how the data should be loaded into a DataFrame are applied.
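The equivalence described above can be checked with a minimal sketch. It assumes a shortened copy of data_records from the question, and it wraps the JSON text in StringIO because newer pandas releases deprecate passing a literal JSON string to read_json. The variable names are illustrative only.

```python
import json
from io import StringIO

import pandas as pd

# Shortened copy of the question's record-style data (assumption: two rows suffice).
data_records = [
    {"device": "rtr1", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr2", "dc": "London", "vendor": "Cisco"},
]
json_text = json.dumps(data_records)

# Default orient ("columns"): the decoded JSON is handed to the DataFrame constructor.
via_read_json = pd.read_json(StringIO(json_text))
via_constructor = pd.DataFrame(json.loads(json_text))

# Compare layout and contents rather than dtypes, which read_json may post-process.
same_columns = list(via_read_json.columns) == list(via_constructor.columns)
same_values = via_read_json.values.tolist() == via_constructor.values.tolist()
print(same_columns, same_values)
```

The comparison is deliberately limited to labels and values: read_json also applies dtype and date conversion after construction, so the two frames agree in layout but are not guaranteed to have identical dtypes for every input.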

Loading JSON-like data through pd.DataFrame()

For the 3 most relevant cases of JSON-structured data, the DataFrame construction through pd.DataFrame() is:
1. Dict of lists
```
In[1]: data = {"a": [1, 2, 3], "b": [9, 8, 7]}
  ...: pd.DataFrame(data)
Out[1]: 
   a  b
0  1  9
1  2  8
2  3  7
```
2. Dict of dicts
```
In[2]: data = {"a": {"x": 1, "y": 2, "z": 3}, "b": {"x": 9, "y": 8, "z": 7}}
  ...: pd.DataFrame(data)
Out[2]: 
   a  b
x  1  9
y  2  8
z  3  7
```
3. List of dicts
```
In[3]: data = [{'a': 1, 'b': 9}, {'a': 2, 'b': 8}, {'a': 3, 'b': 7}]
  ...: pd.DataFrame(data)
Out[3]: 
   a  b
0  1  9
1  2  8
2  3  7
```

- LaurentB
- February 7, 2023 at 12:09 am
- 0 votes
We can’t really speak of auto-detection. What matters is the nested hierarchical structure of the data, which is interpreted according to the specified orientation or, if none is given, the default one.

Moreover, not every data structure can be used with every orientation.

Case 1 : Dataframes to / from a JSON string

read_json : Convert a JSON string to Dataframe with argument typ = 'frame'

to_json : Convert a DataFrame to a JSON string

The orient value must be specified explicitly in the pandas to_json and read_json functions for the split, index, records, table and values orientations.

It does not need to be specified for columns, because columns is the default orientation.
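As a rough illustration of Case 1, the sketch below writes a small DataFrame with several orientations and reads each string back with the same orient. The frame contents and the round_trips name are assumptions made for the example, and StringIO is used because newer pandas releases deprecate literal JSON strings in read_json.

```python
from io import StringIO

import pandas as pd

# Illustrative frame (assumption): two devices indexed by name.
df = pd.DataFrame(
    {"dc": ["London", "Paris"], "vendor": ["Cisco", "Juniper"]},
    index=["rtr1", "rtr2"],
)

round_trips = {}
for orient in ["split", "records", "index", "columns", "values"]:
    text = df.to_json(orient=orient)
    # Read back with the same orient that was used for writing.
    round_trips[orient] = pd.read_json(StringIO(text), orient=orient)

for orient, frame in round_trips.items():
    print(orient, list(frame.index), list(frame.columns))
```

The records and values layouts do not store the index, and values does not store column names either, so those round trips come back with default integer labels.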

Case 2 : Series to / from a JSON string

read_json : Convert a JSON string to Series with argument typ = 'series'

to_json : Convert a Series to a JSON string

If typ='series' in read_json, the default value of the orient argument is index (see the pandas documentation).

When converting a Series to a JSON string with to_json, the default orient value is also index.

For the other orientations allowed for a Series (split, records and, for to_json only, table), the orient argument must be specified.
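A small sketch of Case 2, assuming a three-element Series invented for the example. It shows the default index layout produced by to_json and reads it back with typ='series'; StringIO is used because newer pandas releases deprecate literal JSON strings in read_json.

```python
import json
from io import StringIO

import pandas as pd

# Illustrative Series (assumption): integer values with string labels.
s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# No orient given: a Series is written with the "index" layout (label -> value).
default_text = s.to_json()
default_layout = json.loads(default_text)

# An explicit orient changes the layout; "records" keeps only the values.
records_layout = json.loads(s.to_json(orient="records"))

# Reading back with typ="series" also defaults to the "index" layout.
restored = pd.read_json(StringIO(default_text), typ="series")
print(default_layout, records_layout, restored.to_dict())
```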

Resources

The structure expected for each orientation is shown in the code comments at this GitHub link (have a look around line 680 in the _json.py file).

Note that there are no examples with orient=columns in the code comments on GitHub.
This is simply because, in the absence of an orient specification, columns is used by default.

Clearer view of a nested hierarchical structure
```
import pandas as pd
import json

##### BEGINNING : HIERARCHICAL LEVEL #####

# Second Level - Values levels
d21 = {'v1': "value 1", 'v2': "value 3"}
d22 = {'v1': "value 3", 'v2': "value 4"}

# First Level - Rows levels
d1 = {'row1': d21, 'row2': d22}

# 0-Level - Columns Levels
d = {'col1': d1}

##### END : HIERARCHICAL LEVEL #####

print(pd.read_json(json.dumps(d))) # No need for specification : orient is columns by default

#                                     col1
# row1  {'v1': 'value 1', 'v2': 'value 3'}
# row2  {'v1': 'value 3', 'v2': 'value 4'}
```
Be careful here

Not every data structure can be used with every value of the orient argument. If the structure does not match the orientation, an exception such as builtins.AttributeError is raised (see the GitHub link for the different structures).
```
pd.read_json(json.dumps(data_records))
#   device      dc vendor
# 0   rtr1  London  Cisco
# 1   rtr2  London  Cisco
# 2   rtr3  London  Cisco

#### Since orient is columns by default, the previous call is the same as the following
pd.read_json(json.dumps(data_records), orient='columns')
#   device      dc vendor
# 0   rtr1  London  Cisco
# 1   rtr2  London  Cisco
# 2   rtr3  London  Cisco

pd.read_json(json.dumps(data_records), orient='values')
#   device      dc vendor
# 0   rtr1  London  Cisco
# 1   rtr2  London  Cisco
# 2   rtr3  London  Cisco

#### The shape of the data also matters: a mismatched orient raises an exception

pd.read_json(json.dumps(data_records), orient='index')
# ...
# builtins.AttributeError: 'list' object has no attribute 'values'

pd.read_json(json.dumps(data_records), orient='table')
# builtins.KeyError: 'schema'

pd.read_json(json.dumps(data_records), orient='split')
# builtins.AttributeError: 'list' object has no attribute 'items'
```
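The calls above can be gathered into one loop that records which orientations accept the record-style payload. This is only a sketch: the outcomes dictionary is an illustrative name, the data is a shortened copy of data_records, and the exception is caught broadly because the exact exception type may differ between orientations and pandas versions.

```python
import json
from io import StringIO

import pandas as pd

# Shortened copy of the question's record-style data (assumption).
data_records = [
    {"device": "rtr1", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr2", "dc": "London", "vendor": "Cisco"},
]
json_text = json.dumps(data_records)

outcomes = {}
for orient in ["columns", "values", "index", "split", "table"]:
    try:
        pd.read_json(StringIO(json_text), orient=orient)
        outcomes[orient] = "ok"
    except Exception as exc:
        # Only the exception class name is recorded; it is version dependent.
        outcomes[orient] = f"error: {type(exc).__name__}"

for orient, outcome in outcomes.items():
    print(f"{orient:8} {outcome}")
```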
Is there an autodetection mechanism?

After some experimentation, I can now say that the answer is no.

On GitHub, the split data shape is presented as follows:
```
data = {
"columns":["col 1","col 2"],
"index":["row 1","row 2"],
"data":[["a","b"],["c","d"]]
}
```
So let’s do an experiment.

We will use read_json to read the data without specifying the orientation and see whether the split shape is recognized.

Then we will read the data again, this time specifying the split orientation.

If there is automatic shape recognition, the result should be the same in both cases.
```
import pandas as pd
import json

data = {
"columns":["col 1","col 2"],
"index":["row 1","row 2"],
"data":[["a","b"],["c","d"]]
}

json_string = json.dumps(data)
```
First we print without specifying the orientation:
```
>>> pd.read_json(json_string)
  columns  index    data
0   col 1  row 1  [a, b]
1   col 2  row 2  [c, d]
```
And now we print with the split orientation specified:
```
>>> pd.read_json(json_string, orient='split')
      col 1 col 2
row 1     a     b
row 2     c     d
```
The DataFrames are different: pandas does not recognize the split shape. There is no automatic detection mechanism.
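Since read_json does not pick the orientation itself, the caller has to. The sketch below is a hypothetical helper, guess_orient, that inspects the decoded JSON and returns an orient to pass explicitly. It is not part of pandas, and its rules are assumptions made for illustration.

```python
import json
from io import StringIO

import pandas as pd


def guess_orient(payload):
    """Hypothetical helper (not a pandas API): guess an orient from decoded JSON."""
    if isinstance(payload, list):
        # A list of lists looks like "values"; anything else is treated as "records".
        if payload and all(isinstance(item, list) for item in payload):
            return "values"
        return "records"
    if isinstance(payload, dict):
        keys = set(payload)
        if "data" in keys and keys <= {"columns", "index", "data"}:
            return "split"
        if {"schema", "data"} <= keys:
            return "table"
    # "columns" and "index" are both dicts of dicts and cannot be told apart here.
    return "columns"


split_payload = {
    "columns": ["col 1", "col 2"],
    "index": ["row 1", "row 2"],
    "data": [["a", "b"], ["c", "d"]],
}
orient = guess_orient(split_payload)
frame = pd.read_json(StringIO(json.dumps(split_payload)), orient=orient)
print(orient)
print(frame)
```

Such a guess is inherently ambiguous: the same payload is also a valid columns-oriented frame with columns named columns, index and data, which is exactly how read_json treats it when no orient is given.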

If you want to understand every detail of a function call, I would suggest using VSCode and setting "justMyCode": false in your launch.json for debugging.

That being said, if you follow what happens when you call pd.read_json(), you’ll find that it instantiates a JsonReader and reads from it, which in turn instantiates a FrameParser, whose parsing is done by _parse_no_numpy:

def _parse_no_numpy(self):
  json = self.json
  orient = self.orient

  if orient == "columns": # default
      self.obj = DataFrame(
          loads(json, precise_float=self.precise_float), dtype=None
      )
  elif orient == "split":
      decoded = {
          str(k): v
          for k, v in loads(json, precise_float=self.precise_float).items()
      }
      self.check_keys_split(decoded)
      self.obj = DataFrame(dtype=None, **decoded)
  elif orient == "index": # your second case
      self.obj = DataFrame.from_dict(
          loads(json, precise_float=self.precise_float),
          dtype=None,
          orient="index",
      )
  elif orient == "table":
      self.obj = parse_table_schema(json, precise_float=self.precise_float)
  else:
      self.obj = DataFrame(
          loads(json, precise_float=self.precise_float), dtype=None
      )

As you can see, and as stated in a previous answer, in terms of orientation:

pd.read_json(json.dumps(data_records))

is equivalent to

pd.DataFrame(data_records)

and

pd.read_json(json.dumps(data_index), orient='index')

is equivalent to

pd.DataFrame.from_dict(data_index, orient='index')

So in the end it all boils down to how pd.DataFrame handles the list of dicts it is passed.
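The index-oriented equivalence can be checked with a short sketch. It reuses data_index from the question and wraps the JSON text in StringIO because newer pandas releases deprecate passing a literal JSON string to read_json. The variable names are illustrative.

```python
import json
from io import StringIO

import pandas as pd

# data_index as given in the question.
data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
    "rtr3": {"dc": "London", "vendor": "Cisco"},
}

via_read_json = pd.read_json(StringIO(json.dumps(data_index)), orient="index")
via_from_dict = pd.DataFrame.from_dict(data_index, orient="index")

# Compare labels and contents of the two frames.
same_index = list(via_read_json.index) == list(via_from_dict.index)
same_columns = list(via_read_json.columns) == list(via_from_dict.columns)
same_values = via_read_json.values.tolist() == via_from_dict.values.tolist()
print(same_index, same_columns, same_values)
```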

Going down this hole, you’ll find that the constructor checks whether the data is list-like and then calls nested_data_to_arrays, which in turn calls to_arrays, which finally calls _list_of_dict_to_arrays:

def _list_of_dict_to_arrays(
    data: list[dict],
    columns: Index | None,
) -> tuple[np.ndarray, Index]:
    """
    Convert list of dicts to numpy arrays

    if `columns` is not passed, column names are inferred from the records
    - for OrderedDict and dicts, the column names match
      the key insertion-order from the first record to the last.
    - For other kinds of dict-likes, the keys are lexically sorted.

    Parameters
    ----------
    data : iterable
        collection of records (OrderedDict, dict)
    columns: iterables or None

    Returns
    -------
    content : np.ndarray[object, ndim=2]
    columns : Index
    """
    if columns is None:
        gen = (list(x.keys()) for x in data)
        sort = not any(isinstance(d, dict) for d in data)
        pre_cols = lib.fast_unique_multiple_list_gen(gen, sort=sort)
        columns = ensure_index(pre_cols)

    # assure that they are of the base dict class and not of derived
    # classes
    data = [d if type(d) is dict else dict(d) for d in data]

    content = lib.dicts_to_array(data, list(columns))
    return content, columns

The "autodetection" is actually the hierarchical handling of all possible cases/types.
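The column inference described in the docstring above can be observed directly. The sketch below uses made-up records whose keys differ between rows; the values are assumptions chosen only to make the ordering visible.

```python
import pandas as pd

# Illustrative records (assumption): the second row introduces a new key "c".
records = [
    {"b": 1, "a": 2},
    {"c": 3, "a": 4},
]

frame = pd.DataFrame(records)

# Columns follow first-seen key order across the records, not alphabetical order.
print(list(frame.columns))

# Keys missing from a record show up as missing values in that row.
print(frame.isna())
```

Keys are collected in insertion order from the first record to the last, and a row that lacks a key gets a missing value in that column.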

- SergedeGossondeVarennes
- February 9, 2023 at 2:52 pm
- 0 votes
No, Pandas does not perform any autodetection when using the read_json function.

It is entirely determined by the orient parameter, which specifies the format of the input json data.

In your first example, you passed the data_records list to json.dumps, which converted it to a JSON string. That string was then passed to pd.read_json and parsed with the default orient, columns; the result looks record-oriented only because the underlying DataFrame constructor accepts a list of dicts.

In your second example, you passed data_index to json.dumps, and the resulting string was likewise parsed with the default columns orientation.

In both cases, the behavior of read_json is determined entirely by the value of the orient parameter, not by any automatic detection in pandas.
