
I’m trying to find out if Pandas.read_json performs some level of autodetection. For example, I have the following data:

data_records = [
    {
        "device": "rtr1",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr2",
        "dc": "London",
        "vendor": "Cisco",
    },
    {
        "device": "rtr3",
        "dc": "London",
        "vendor": "Cisco",
    },
]

data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
    "rtr3": {"dc": "London", "vendor": "Cisco"},
}

If I do the following:

import pandas as pd
import json

pd.read_json(json.dumps(data_records))
---
  device      dc vendor
0   rtr1  London  Cisco
1   rtr2  London  Cisco
2   rtr3  London  Cisco

Although I get the output I wanted, the data is record based. Since the default orient is columns, I would not have expected this to work.

So is there some level of autodetection going on? With index-based input the behaviour seems more in line with the documentation: as the following shows, the data appears to have been parsed with the columns orient by default.

pd.read_json(json.dumps(data_index))

          rtr1    rtr2    rtr3
dc      London  London  London
vendor   Cisco   Cisco   Cisco

pd.read_json(json.dumps(data_index), orient="index")

          dc vendor
rtr1  London  Cisco
rtr2  London  Cisco
rtr3  London  Cisco

4 Answers


  1. TL;DR

    When using pd.read_json() with orient=None, the representation of the data is automatically determined through pd.DataFrame().

    Explanation

    The pandas documentation is a bit misleading here. When orient is not specified, the parser branch for ‘columns’ is used, which amounts to self.obj = pd.DataFrame(json.loads(json)). So

    pd.read_json(json.dumps(data_records))
    

    is equivalent to

    pd.DataFrame(json.loads(json.dumps(data_records)))
    

    which again is equivalent to

    pd.DataFrame(data_records)
    

    I.e., you pass a list of dicts to the DataFrame constructor, which then performs the automatic determination of the data representation. Note that this does not mean that orient is auto-detected. Instead, simple heuristics (see below) on how the data should be loaded into a DataFrame are applied.
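
The equivalence described above can be checked directly. This is a minimal sketch, assuming a pandas version in which read_json accepts a file-like object; the JSON text is wrapped in io.StringIO because newer pandas releases warn about (or reject) a literal JSON string. The variable names via_read_json and via_constructor are only illustrative.

```python
import io
import json

import pandas as pd

data_records = [
    {"device": "rtr1", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr2", "dc": "London", "vendor": "Cisco"},
    {"device": "rtr3", "dc": "London", "vendor": "Cisco"},
]

json_text = json.dumps(data_records)

# No orient given: the decoded JSON is handed to the DataFrame constructor.
via_read_json = pd.read_json(io.StringIO(json_text))

# Building the frame by hand from the same decoded list of dicts.
via_constructor = pd.DataFrame(json.loads(json_text))

# Compare column order and row contents of both frames.
print(list(via_read_json.columns) == list(via_constructor.columns))
print(via_read_json.to_dict("records") == via_constructor.to_dict("records"))
```

If both comparisons print True, the record-shaped input was handled by the DataFrame constructor rather than by any orient detection.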

    Loading JSON-like data through pd.DataFrame()

    For the 3 most relevant cases of JSON-structured data, the DataFrame construction through pd.DataFrame() is:

    1. Dict of lists
    In[1]: data = {"a": [1, 2, 3], "b": [9, 8, 7]}
      ...: pd.DataFrame(data)
    Out[1]: 
       a  b
    0  1  9
    1  2  8
    2  3  7
    
    2. Dict of dicts
    In[2]: data = {"a": {"x": 1, "y": 2, "z": 3}, "b": {"x": 9, "y": 8, "z": 7}}
      ...: pd.DataFrame(data)
    Out[2]: 
       a  b
    x  1  9
    y  2  8
    z  3  7
    
    3. List of dicts
    In[3]: data = [{'a': 1, 'b': 9}, {'a': 2, 'b': 8}, {'a': 3, 'b': 7}]
      ...: pd.DataFrame(data)
    Out[3]: 
       a  b
    0  1  9
    1  2  8
    2  3  7
    
  2. We can’t really speak of auto-detection, but rather of a nested hierarchical structure that is interpreted according to the specified orientation or, failing that, the default one.

    Moreover, not every data structure can be used with every orientation.

    Case 1 : Dataframes to / from a JSON string

    read_json : Convert a JSON string to Dataframe with argument typ = 'frame'

    to_json : Convert a DataFrame to a JSON string

    The orient value has to be specified explicitly in the Pandas to_json and read_json functions for the split, index, records, table and values orientations.

    It is not necessary to specify orient for columns, because columns is the default orientation.
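
To make the orientations concrete, here is a small sketch that serializes one DataFrame with several orient values and decodes each result with the standard json module. The frame contents and the name layouts are illustrative assumptions, not part of the original answer.

```python
import json

import pandas as pd

df = pd.DataFrame(
    [["a", "b"], ["c", "d"]],
    index=["row 1", "row 2"],
    columns=["col 1", "col 2"],
)

# Decode each JSON string back into plain Python objects to inspect its layout.
layouts = {
    orient: json.loads(df.to_json(orient=orient))
    for orient in ("split", "records", "index", "columns", "values")
}

for orient, decoded in layouts.items():
    print(f"{orient:8} {decoded}")

# Calling to_json() without orient gives the same layout as orient="columns".
print(json.loads(df.to_json()) == layouts["columns"])
```

Only the columns layout is what read_json expects when no orient is passed for a frame; the others need the matching orient on the way back in.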

    Case 2 : Series to / from a JSON string

    read_json : Convert a JSON string to Series with argument typ = 'series'

    to_json : Convert a Series to a JSON string

    If typ = 'series' in read_json, the default value of the orient argument is index (see the Pandas documentation).

    When converting a Series into a JSON string using to_json, the default orient value is also index.

    For the other orientations allowed for a Series (split, records, table), the orient argument must be specified.
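
The Series case can be sketched the same way. This is a minimal round trip under the assumption that read_json is given a file-like object (io.StringIO) rather than a raw string; the sample values are made up for illustration.

```python
import io
import json

import pandas as pd

series = pd.Series([1, 2, 3], index=["a", "b", "c"])

# No orient given: a Series is written with the index orientation,
# i.e. as a flat {label: value} mapping.
json_text = series.to_json()
print(json_text)

# Reading it back with typ="series" also uses the index orientation by default.
restored = pd.read_json(io.StringIO(json_text), typ="series")
print(restored.to_dict())
```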

    Resources

    Some oriented structures are shown in the code comments at this github link (have a look around line 680 in the _json.py file).

    Note that there are no examples with orient=columns in the code comments on GitHub.
    This is simply because columns is used by default when no orientation is specified.

    Clearer view of a nested hierarchical structure

    import pandas as pd
    import json
    
    ##### BEGINNING : HIERARCHICAL LEVEL #####
    
    # Second Level - Values levels
    d21 = {'v1': "value 1", 'v2': "value 3"}
    d22 = {'v1': "value 3", 'v2': "value 4"}
    
    # First Level - Rows levels
    d1 = {'row1': d21, 'row2': d22}
    
    # 0-Level - Columns Levels
    d = {'col1': d1}
    
    ##### END : HIERARCHICAL LEVEL #####
    
    print(pd.read_json(json.dumps(d))) # No need for specification : orient is columns by default
    
    #                                     col1
    # row1  {'v1': 'value 1', 'v2': 'value 3'}
    # row2  {'v1': 'value 3', 'v2': 'value 4'}
    

    Be careful here

    Not every data structure can be used with every value of the orient argument. With a mismatched structure, an exception such as builtins.AttributeError may be raised (see the github link for the different structures).

    pd.read_json(json.dumps(data_records))
    #   device      dc vendor
    # 0   rtr1  London  Cisco
    # 1   rtr2  London  Cisco
    # 2   rtr3  London  Cisco
    
    #### Since orient is columns by default, the previous call is the same as the following
    pd.read_json(json.dumps(data_records), orient='columns')
    #   device      dc vendor
    # 0   rtr1  London  Cisco
    # 1   rtr2  London  Cisco
    # 2   rtr3  London  Cisco
    
    pd.read_json(json.dumps(data_records), orient='values')
    #   device      dc vendor
    # 0   rtr1  London  Cisco
    # 1   rtr2  London  Cisco
    # 2   rtr3  London  Cisco
    
    #### The shape of the data also matters, and an exception can be raised
    
    pd.read_json(json.dumps(data_records), orient='index')
    # ...
    # builtins.AttributeError: 'list' object has no attribute 'values'
    
    pd.read_json(json.dumps(data_records), orient='table')
    # builtins.KeyError: 'schema'
    
    pd.read_json(json.dumps(data_records), orient='split')
    # builtins.AttributeError: 'list' object has no attribute 'items'
    

    Is there an autodetection mechanism?

    After some experiments, I can now say that the answer is no.

    On GitHub, the split data shape is presented as follows:

    data = {
    "columns":["col 1","col 2"],
    "index":["row 1","row 2"],
    "data":[["a","b"],["c","d"]]
    }
    

    So let’s do an experiment.

    We will use read_json to read the data without specifying the orientation and see whether the split shape is recognized.

    Then we will read the same data while specifying the split orientation.

    If there is automatic shape recognition, the result should be the same in both cases.

    import pandas as pd
    import json
    
    data = {
    "columns":["col 1","col 2"],
    "index":["row 1","row 2"],
    "data":[["a","b"],["c","d"]]
    }
    
    json_string = json.dumps(data)
    

    First, we print without specifying the orientation:

    >>> pd.read_json(json_string)
      columns  index    data
    0   col 1  row 1  [a, b]
    1   col 2  row 2  [c, d]
    

    And now we print with the split orientation specified:

    >>> pd.read_json(json_string, orient='split')
          col 1 col 2
    row 1     a     b
    row 2     c     d
    

    The DataFrames are different: Pandas does not recognize the split shape. There is no automatic detection mechanism.

  3. If you want to understand every detail of a function call, I would suggest using VSCode and setting "justMyCode": false in your launch.json for debugging.

    That being said, if you follow what happens when you call pd.read_json(), you’ll find that it instantiates a JsonReader and then reads it, which in turn instantiates a FrameParser whose parsing is done by _parse_no_numpy:

    def _parse_no_numpy(self):
      json = self.json
      orient = self.orient
    
      if orient == "columns": # default
          self.obj = DataFrame(
              loads(json, precise_float=self.precise_float), dtype=None
          )
      elif orient == "split":
          decoded = {
              str(k): v
              for k, v in loads(json, precise_float=self.precise_float).items()
          }
          self.check_keys_split(decoded)
          self.obj = DataFrame(dtype=None, **decoded)
      elif orient == "index": # your second case
          self.obj = DataFrame.from_dict(
              loads(json, precise_float=self.precise_float),
              dtype=None,
              orient="index",
          )
      elif orient == "table":
          self.obj = parse_table_schema(json, precise_float=self.precise_float)
      else:
          self.obj = DataFrame(
              loads(json, precise_float=self.precise_float), dtype=None
          )
    

    As you can see, and as stated in a previous answer, in terms of orientation:

    pd.read_json(json.dumps(data_records))
    

    is equivalent to

    pd.DataFrame(data_records)
    

    and

    pd.read_json(json.dumps(data_index), orient='index')
    

    to

    pd.DataFrame.from_dict(data_index, orient='index')
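
The index-oriented equivalence can be checked with a short sketch. It assumes read_json is given a file-like object (io.StringIO), since newer pandas releases no longer accept a literal JSON string without complaint; the variable names are illustrative.

```python
import io
import json

import pandas as pd

data_index = {
    "rtr1": {"dc": "London", "vendor": "Cisco"},
    "rtr2": {"dc": "London", "vendor": "Cisco"},
    "rtr3": {"dc": "London", "vendor": "Cisco"},
}

# Explicit index orientation: the outer keys become the row labels.
via_read_json = pd.read_json(io.StringIO(json.dumps(data_index)), orient="index")

# The same frame built straight from the dict.
via_from_dict = pd.DataFrame.from_dict(data_index, orient="index")

# Compare row labels, column labels and cell contents.
print(list(via_read_json.index) == list(via_from_dict.index))
print(list(via_read_json.columns) == list(via_from_dict.columns))
print(via_read_json.to_dict("index") == via_from_dict.to_dict("index"))
```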
    

    So in the end it all boils down to how pd.DataFrame handles the list of dicts it is given.

    Going down this hole, you’ll find that the constructor checks whether the data is list-like and then calls nested_data_to_arrays, which in turn calls to_arrays, which finally calls _list_of_dict_to_arrays:

    def _list_of_dict_to_arrays(
        data: list[dict],
        columns: Index | None,
    ) -> tuple[np.ndarray, Index]:
        """
        Convert list of dicts to numpy arrays
    
        if `columns` is not passed, column names are inferred from the records
        - for OrderedDict and dicts, the column names match
          the key insertion-order from the first record to the last.
        - For other kinds of dict-likes, the keys are lexically sorted.
    
        Parameters
        ----------
        data : iterable
            collection of records (OrderedDict, dict)
        columns: iterables or None
    
        Returns
        -------
        content : np.ndarray[object, ndim=2]
        columns : Index
        """
        if columns is None:
            gen = (list(x.keys()) for x in data)
            sort = not any(isinstance(d, dict) for d in data)
            pre_cols = lib.fast_unique_multiple_list_gen(gen, sort=sort)
            columns = ensure_index(pre_cols)
    
        # assure that they are of the base dict class and not of derived
        # classes
        data = [d if type(d) is dict else dict(d) for d in data]
    
        content = lib.dicts_to_array(data, list(columns))
        return content, columns
    

    The "autodetection" is actually the hierarchical handling of all possible cases/types.
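
The column inference described in that docstring can be sketched in plain Python. This is only an illustration of the idea, not the pandas implementation: infer_columns and records_to_rows are hypothetical helpers, and missing keys are filled with None here, whereas pandas would use NaN.

```python
def infer_columns(records):
    """Collect keys in first-seen order, scanning records from first to last."""
    columns = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                columns.append(key)
    return columns


def records_to_rows(records, fill=None):
    """Lay each record out along the inferred columns, filling missing keys."""
    columns = infer_columns(records)
    rows = [[record.get(column, fill) for column in columns] for record in records]
    return columns, rows


# Hypothetical records with differing keys, to show the ordering rule.
records = [
    {"device": "rtr1", "dc": "London"},
    {"vendor": "Cisco", "device": "rtr2"},
]

columns, rows = records_to_rows(records)
print(columns)  # ['device', 'dc', 'vendor']
print(rows)     # [['rtr1', 'London', None], ['rtr2', None, 'Cisco']]
```

Nothing in this step looks at an orient value: the layout follows purely from the type and keys of the objects passed in.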

  4. No, Pandas does not perform any autodetection when using the read_json function.

    It is entirely determined by the orient parameter, which specifies the format of the input json data.

    In your first example, you passed the data_records list to json.dumps, which converted it to a JSON string. When the resulting JSON string is passed to pd.read_json, it is treated as record-oriented data.

    In your second example, you passed data_index to json.dumps, and the result is treated as column-oriented data.

    In both cases, the behavior of read_json is determined entirely by the value of the orient parameter, not by any automatic detection in Pandas.
