
I am writing a function that returns a Pandas DataFrame object. I would like to have some kind of type hint for the columns this DataFrame contains, beyond a mere description in the documentation, as I feel this would make it much easier for the end user to read the data.

Is there a way to type hint DataFrame content that different tools like Visual Studio Code and PyCharm would support, when editing Python files and when editing Jupyter Notebooks?

An example function:


import pandas as pd

def generate_data(bunch, of, inputs) -> pd.DataFrame:
    """Massages the input to a nice and easy DataFrame.

    :return:
        DataFrame with columns a(int), b(float), c(string), d(US dollars as float)
    """


Answers


  1. As far as I am aware, there is no way to do this with just core Python and pandas.

    I would recommend using pandera. It has a broader scope, but type checking dataframe column types is one of its capabilities.

    pandera can also be used in conjunction with pydantic, for which, in turn, dedicated VS Code (via Pylance) and PyCharm plugins are available.
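    For illustration, here is a minimal sketch of the decorator-based workflow pandera enables, assuming pandera is installed (the schema and function names are hypothetical):

    ```python
    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series

    # Hypothetical schema describing the expected columns
    class OutputSchema(pa.DataFrameModel):
        a: Series[int]
        b: Series[float]

    @pa.check_types  # validates the annotated return value at runtime
    def make_data() -> DataFrame[OutputSchema]:
        return pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

    df = make_data()  # raises pandera.errors.SchemaError on mismatch
    ```

    The `DataFrame[OutputSchema]` annotation doubles as documentation for the reader and as a runtime check enforced by the decorator.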

  2. The most powerful project for strong typing of pandas DataFrames as of now (April 2023) is pandera. Unfortunately, what it offers is still quite limited and far from what we might want.

    Here is an example of how you can use pandera in your case:

    import pandas as pd
    import pandera as pa
    from pandera.typing import DataFrame, Series, String
    
    class MySchema(pa.DataFrameModel):
        a: Series[int]
        b: Series[float]
        c: Series[String]
        d: Series[float]    # US dollars
    
    class OtherSchema(pa.DataFrameModel):
        year: Series[int] = pa.Field(ge=1900, le=2050)
    
    
    def generate_data() -> DataFrame[MySchema]:
        df = pd.DataFrame({
            "a": [1, 2, 3],
            "b": [10.0, 20.0, 30.0],
            "c": ["A", "B", "C"],
            "d": [0.1, 0.2, 0.3],
        })
    
        # Runtime verification here, throws on schema mismatch
        strongly_typed_df = DataFrame[MySchema](df)
        return strongly_typed_df
    
    def transform(input: DataFrame[MySchema]) -> DataFrame[OtherSchema]:
        # This demonstrates that you can use strongly
        # typed column names from the schema
        df = input.filter(items=[MySchema.a]).rename(
                columns={MySchema.a: OtherSchema.year}
        )
    
        return DataFrame[OtherSchema](df) # This will throw on range validation!
    
    
    df1 = generate_data()
    df2 = transform(df1)
    transform(df2)   # mypy prints error here - incompatible type!
    

    Running mypy on this code reports a static type error (incompatible type) on the last line.

    Discussion of advantages and limitations

    With pandera we get –

    1. Clear and readable (dataclass style) DataFrame schema definitions and ability to use them as type hints.
    2. Run-time schema verification. A schema can define even more constraints than just types (see year in the example above and the pandera docs for more).
    3. Experimental support for static type checking by mypy.

    What we still miss –

    1. Full static type checking for column level verification.
    2. Any IDE support for column name auto-completion.
    3. Inline syntax for schema declaration, we have to explicitly define each schema as separate class before using it.

    More examples

    Pandera docs – https://pandera.readthedocs.io/en/stable/dataframe_models.html

    Similar question – Type hints for a pandas DataFrame with mixed dtypes

    Other typing projects

    pandas-stubs is an active project providing type declarations for the pandas public API that are richer than the type stubs included in pandas itself. But it doesn’t provide any facilities for column-level schemas.

    There are also quite a few outdated libraries related to this and to pandas typing in general: dataenforce, data-science-types, python-type-stubs.

    pandera provides two different APIs that seem to be equally powerful: an object-based API and a class-based API. The latter is demonstrated above.
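    For comparison, a minimal sketch of what the object-based API equivalent of MySchema could look like, assuming the same four columns (the variable names here are hypothetical):

    ```python
    import pandas as pd
    import pandera as pa

    # Object-based equivalent of the class-based MySchema above
    my_schema = pa.DataFrameSchema({
        "a": pa.Column(int),
        "b": pa.Column(float),
        "c": pa.Column(str),
        "d": pa.Column(float),  # US dollars
    })

    df = pd.DataFrame({"a": [1], "b": [1.0], "c": ["x"], "d": [0.1]})
    validated = my_schema.validate(df)  # raises SchemaError on mismatch
    ```

    The object-based API builds the schema as a plain value rather than a class, which can be more convenient when schemas are constructed dynamically.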

  3. Arne is right, Python’s type hinting does not have any native out-of-the-box support for specifying column types in a Pandas DataFrame.

    You could perhaps document the expected columns with a custom type:

    from typing import NamedTuple
    import pandas as pd
    
    class MyDataFrame(NamedTuple):
        a: int
        b: float
        c: str
        d: float  # US dollars as float
    
    def generate_data(bunch, of, inputs) -> pd.DataFrame:
        """Massages the input to a nice and easy DataFrame.
    
        :return:
            DataFrame with columns a(int), b(float), c(string), d(us dollars as float)
        """
        # Your implementation here
        pass
    

    This is a sample approach you could take. It defines a custom NamedTuple called MyDataFrame. Of course, it’s not strictly type-hinting the DataFrame, and IDEs and type-checking tools won’t enforce it, but it gives the user a hint about the expected structure of the output DataFrame.
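    Building on this idea, a small runtime helper can at least check that a returned DataFrame has the fields declared in such a NamedTuple. This is a sketch; check_columns and MyColumns are hypothetical names:

    ```python
    from typing import NamedTuple, get_type_hints
    import pandas as pd

    class MyColumns(NamedTuple):  # hypothetical column spec
        a: int
        b: float
        c: str
        d: float  # US dollars

    def check_columns(df: pd.DataFrame, spec: type) -> pd.DataFrame:
        """Lightweight runtime check that df has the columns named in spec."""
        expected = list(get_type_hints(spec))
        missing = [col for col in expected if col not in df.columns]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        return df

    df = check_columns(
        pd.DataFrame({"a": [1], "b": [1.0], "c": ["x"], "d": [0.1]}),
        MyColumns,
    )
    ```

    This only checks column names, not dtypes, but it keeps the NamedTuple as the single source of truth for the expected structure.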

    An alternative approach is to use a custom type alias and a docstring:

    from typing import Any
    import pandas as pd
    
    # Define a type alias for better documentation
    DataFrameWithColumns = pd.DataFrame
    
    def generate_data(bunch: Any, of: Any, inputs: Any) -> DataFrameWithColumns:
        """
        Massages the input to a nice and easy DataFrame.
    
        :param bunch: Description of the input parameter 'bunch'
        :param of: Description of the input parameter 'of'
        :param inputs: Description of the input parameter 'inputs'
    
        :return: DataFrame with columns:
            a (int): Description of column 'a'
            b (float): Description of column 'b'
            c (str): Description of column 'c'
            d (float): US dollars as float, description of column 'd'
        """
        # Your implementation here
        pass
    

    Here, you could define a custom type alias for pd.DataFrame to represent the expected output DataFrame, which could be helpful to end users.
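    A variation on the alias approach, assuming Python 3.9+, is typing.Annotated, which lets the column description travel with the type hint itself (the alias name and column text here are hypothetical):

    ```python
    from typing import Annotated
    import pandas as pd

    # Attach the column description to the alias itself so tools that
    # surface annotation metadata can show it alongside the type
    DataFrameWithColumns = Annotated[
        pd.DataFrame,
        "columns: a(int), b(float), c(str), d(US dollars as float)",
    ]

    def generate_data() -> DataFrameWithColumns:
        return pd.DataFrame({"a": [1], "b": [1.0], "c": ["x"], "d": [0.1]})
    ```

    Type checkers treat the alias as a plain pd.DataFrame, so nothing is enforced, but the description is no longer confined to one function's docstring.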

  4. I’m not sure I fully understand what you expect. Isn’t df.info() sufficient to help users?

    >>> df.info()
    <class '__main__.MyDataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 4 columns):
     #   Column  Non-Null Count  Dtype  
    ---  ------  --------------  -----  
     0   a       3 non-null      int64  
     1   b       3 non-null      float64
     2   c       3 non-null      object 
     3   d       3 non-null      float64
    dtypes: float64(2), int64(1), object(1)
    memory usage: 224.0+ bytes
    

    If not, you can subclass DataFrame and override methods like info and __repr__. You can store additional information in the attrs dictionary and use it in these methods. Here is an example:

    class MyDataFrame(pd.DataFrame):

        def info(self):
            super().info()
            s = '\nMore information as footer:\n'
            s += self.attrs.get('more_info')
            print(s)

        def __repr__(self):
            s = 'More information as header:\n'
            s += f"{self.attrs.get('more_info')}\n\n"
            s += super().__repr__()
            return s

        @property
        def _constructor(self):
            return MyDataFrame
    
    def generate_data(bunch, of, inputs) -> pd.DataFrame:
        df = MyDataFrame({'a': [0, 1, 2], 'b': [1.0, 1.1, 1.2],
                          'c': ['A', 'B', 'C'], 'd': [0.99, 2.49, 3.99]})
        df.attrs = {
            'more_info': 'Additional information here'
        }
        return df
    
    df = generate_data('nothing', 'to', 'do')
    

    Usage:

    >>> df
    More information as header:  # <- HERE
    Additional information here  # <- HERE
    
       a    b  c     d
    0  0  1.0  A  0.99
    1  1  1.1  B  2.49
    2  2  1.2  C  3.99
    
    >>> df.info()
    <class '__main__.MyDataFrame'>
    RangeIndex: 3 entries, 0 to 2
    Data columns (total 4 columns):
     #   Column  Non-Null Count  Dtype  
    ---  ------  --------------  -----  
     0   a       3 non-null      int64  
     1   b       3 non-null      float64
     2   c       3 non-null      object 
     3   d       3 non-null      float64
    dtypes: float64(2), int64(1), object(1)
    memory usage: 224.0+ bytes
    
    More information as footer:  # <- HERE
    Additional information here  # <- HERE
    
    >>> df[['a', 'b']]
    More information as header:
    Additional information here
    
       a    b
    0  0  1.0
    1  1  1.1
    2  2  1.2
    

    I just used a simple string, but you can have a more complex attrs structure and a special function to display this dict (checking which columns exist and avoiding the display of useless information). I hope this helps.
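    As a sketch of that idea, attrs could hold a per-column description dict, with a helper that only prints entries for columns still present after slicing (describe_columns and the key name "columns" are hypothetical):

    ```python
    import pandas as pd

    def describe_columns(df: pd.DataFrame) -> str:
        """Render per-column descriptions stored in attrs, skipping
        columns that were dropped by slicing."""
        descriptions = df.attrs.get("columns", {})
        return "\n".join(
            f"{col}: {descriptions[col]}"
            for col in df.columns
            if col in descriptions
        )

    df = pd.DataFrame({"a": [0, 1], "d": [0.99, 2.49]})
    df.attrs["columns"] = {"a": "row id (int)", "d": "price in US dollars (float)"}
    print(describe_columns(df))
    ```

    Note that attrs propagation through pandas operations is still experimental, so descriptions may be lost on some transformations.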
