I am writing a function that returns a Pandas DataFrame object. I would like to have some kind of type hint for what columns this DataFrame contains, beyond a mere specification in the documentation, as I feel this will make it much easier for the end user to read the data.

Is there a way to type hint DataFrame content that different tools like Visual Studio Code and PyCharm would support, both when editing Python files and when editing Jupyter Notebooks?
An example function:
import pandas as pd


def generate_data(bunch, of, inputs) -> pd.DataFrame:
    """Massages the input to a nice and easy DataFrame.

    :return:
        DataFrame with columns a (int), b (float), c (string), d (US dollars as float)
    """
4 Answers
As far as I am aware, there is no way to do this with just core Python and pandas.
I would recommend using pandera. It has a broader scope, but checking DataFrame column types is one of its capabilities.
pandera can also be used in conjunction with pydantic, for which in turn dedicated VS Code (via Pylance) and PyCharm plugins are available.
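For illustration, here is a minimal sketch of that pandera + pydantic combination, assuming recent versions of both libraries; the schema and model names are invented for this example:

import pandas as pd
import pandera as pa
import pydantic
from pandera.typing import DataFrame, Series


class ItemSchema(pa.DataFrameModel):
    # Hypothetical schema: one string column whose values must be unique
    name: Series[str] = pa.Field(unique=True)


class Payload(pydantic.BaseModel):
    # The pandera-typed field is validated when the pydantic model is created
    count: int
    frame: DataFrame[ItemSchema]


payload = Payload(count=2, frame=pd.DataFrame({"name": ["a", "b"]}))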
The most powerful project for strong typing of a pandas DataFrame as of now (Apr 2023) is pandera. Unfortunately, what it offers is quite limited and far from what we might have wanted.

Here is an example of how you can use pandera in your case†:
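Below is a minimal sketch using the class-based API, assuming pandera ≥ 0.14 (where pa.DataFrameModel is available). The name OutputSchema is invented for this sketch; columns a–d come from the question, and the year field with a range check is included to illustrate run-time validation:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series


class OutputSchema(pa.DataFrameModel):
    a: Series[int]
    b: Series[float]
    c: Series[str]
    d: Series[float]  # US dollars
    year: Series[int] = pa.Field(gt=2000, coerce=True)  # checked at run time


@pa.check_types  # validates the returned frame against OutputSchema at run time
def generate_data(bunch, of, inputs) -> DataFrame[OutputSchema]:
    """Massages the input to a nice and easy DataFrame."""
    raw = pd.DataFrame(
        {"a": [1], "b": [1.5], "c": ["x"], "d": [9.99], "year": [2023]}
    )
    return DataFrame[OutputSchema](raw)


result: DataFrame[OutputSchema] = generate_data(1, 2, 3)  # fine for mypy
result = pd.DataFrame()  # plain pd.DataFrame is not a DataFrame[OutputSchema]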
You can see mypy producing a static type check error on the last line.

Discussion of advantages and limitations
With pandera we get –
Dataclass-style DataFrame schema definitions and the ability to use them as type hints.
Run-time data validation (see year in the example above and the pandera docs for more).
What we still miss –
More examples
Pandera docs – https://pandera.readthedocs.io/en/stable/dataframe_models.html
Similar question – Type hints for a pandas DataFrame with mixed dtypes
Other typing projects
pandas-stubs is an active project providing type declarations for the pandas public API that are richer than the type stubs included in pandas itself. But it doesn’t provide any facilities for column-level schemas.
There are quite a few outdated libraries related to this and to pandas typing in general – dataenforce, data-science-types, python-type-stubs.
† pandera provides two different APIs that seem to be equally powerful – an object-based API and a class-based API. I demonstrate the latter here.

Arne is right, Python’s type hinting does not have any native out-of-the-box support for specifying column types in a Pandas DataFrame.
You can perhaps use comments with custom types
This is a sample approach you could take. It defines a custom NamedTuple called MyDataFrame. Of course, it’s not strictly type-hinting the DataFrame, and IDE and type-checking tools won’t enforce it, but it provides a hint to the user about the expected structure of the output DataFrame.
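A minimal sketch of that approach, with the fields mirroring the question’s columns (nothing here is enforced by a type checker; the NamedTuple only documents the intended row layout):

from typing import NamedTuple

import pandas as pd


class MyDataFrame(NamedTuple):
    """Documents one row of the DataFrame returned by generate_data."""
    a: int
    b: float
    c: str
    d: float  # US dollars


def generate_data(bunch, of, inputs) -> pd.DataFrame:  # rows described by MyDataFrame
    """Massages the input to a nice and easy DataFrame.

    :return: DataFrame whose rows follow the MyDataFrame layout.
    """
    return pd.DataFrame(columns=list(MyDataFrame._fields))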
An alternative approach you could take is using a custom type alias and a docstring.
Here, you could define a custom type alias for pd.DataFrame to represent the expected output DataFrame, which could be helpful to end users.
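A sketch of that idea; the alias name ExpectedFrame is made up for this example, and to a type checker it is exactly equivalent to pd.DataFrame:

import pandas as pd

# Purely documentary alias: type checkers treat it exactly like pd.DataFrame
ExpectedFrame = pd.DataFrame


def generate_data(bunch, of, inputs) -> ExpectedFrame:
    """Massages the input to a nice and easy DataFrame.

    :return: ExpectedFrame with columns a (int), b (float), c (string),
        d (US dollars as float).
    """
    return pd.DataFrame(columns=["a", "b", "c", "d"])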
I’m not sure I fully understand what you expect. Isn’t df.info() sufficient to help users?

If not, you can subclass DataFrame and override methods like info and __repr__. You can store additional information in the attrs dictionary and use it in these methods. Here is an example, followed by its usage:
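A minimal sketch of that idea; the class name DataFrameWithInfo and the attrs key "description" are illustrative choices, and _constructor is overridden so pandas operations keep returning the subclass:

import pandas as pd


class DataFrameWithInfo(pd.DataFrame):
    """pd.DataFrame subclass that displays a description stored in .attrs."""

    @property
    def _constructor(self):
        # Keep the subclass when pandas operations return new frames
        return DataFrameWithInfo

    def __repr__(self):
        base = super().__repr__()
        description = self.attrs.get("description", "")
        return f"{description}\n{base}" if description else base

    def info(self, *args, **kwargs):
        description = self.attrs.get("description", "")
        if description:
            print(description)
        return super().info(*args, **kwargs)


# Usage:
df = DataFrameWithInfo({"a": [1], "b": [1.5], "c": ["x"], "d": [9.99]})
df.attrs["description"] = "Columns: a (int), b (float), c (string), d (US dollars as float)"
print(df)   # __repr__ shows the description above the usual output
df.info()   # info() prints the description first, then the normal summary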
I just used a simple string, but you can have a more complex attrs structure and a special function to display this dict (check whether the columns exist and avoid displaying useless information), as sketched below. I hope this helps.
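For instance, here is a sketch of one possible richer layout, with a helper function (names made up) that prints descriptions only for columns actually present:

import pandas as pd


def show_column_docs(df: pd.DataFrame) -> None:
    """Print per-column descriptions stored in df.attrs, skipping missing columns."""
    for name, text in df.attrs.get("column_docs", {}).items():
        if name in df.columns:  # avoid displaying useless information
            print(f"{name}: {text}")


df = pd.DataFrame({"a": [1], "b": [1.5]})
df.attrs["column_docs"] = {
    "a": "int identifier",
    "b": "float measurement",
    "z": "stale entry, not displayed because the column is missing",
}
show_column_docs(df)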