skip to Main Content

I want to read in a polars dataframe from a json string containing dates in the standard iso-format "yyyy-mm-dd".
When I try to read the string in and set the dtype of the date column witheither schema or schema_override this results in only NULL values.

MRE

from datetime import datetime, timedelta
from io import StringIO

import polars as pl

# Generate a list of dates
start_date = datetime.today()
dates = [start_date + timedelta(days=i) for i in range(100)]
date_strings = [date.strftime("%Y-%m-%d") for date in dates]

# Create a Polars DataFrame
df = pl.DataFrame({"dates": date_strings})

df_reread = pl.read_json(
    StringIO(df.write_json()),
    schema_overrides={"dates": pl.Date},
)

output of print(df_reread)

Error

shape: (100, 1)
┌───────┐
│ dates │
│ ---   │
│ date  │
╞═══════╡
│ null  │
│ null  │
│ null  │
│ null  │
│ null  │
│ …     │
│ null  │
│ null  │
│ null  │
│ null  │
│ null  │
└───────┘

Question

Is there anyway to correctly read in the Date dtype from a json string?

2

Answers


  1. pl.read_json() doesn’t automatically infer or correctly handle the Date type when reading from JSON, as JSON doesn’t natively support date types. Dates are typically represented as strings in JSON,you can explicitly cast the column to Date after reading the JSON. try this

    from datetime import datetime, timedelta
    from io import StringIO
    
    import polars as pl
    
    # Generate a list of dates
    start_date = datetime.today()
    dates = [start_date + timedelta(days=i) for i in range(100)]
    date_strings = [date.strftime("%Y-%m-%d") for date in dates]
    
    # Create a Polars DataFrame
    df = pl.DataFrame({"dates": date_strings})
    
    # Write DataFrame to JSON and read it back in
    df_reread = pl.read_json(StringIO(df.write_json()))
    
    # Explicitly cast the 'dates' column to Date type
    df_reread = df_reread.with_columns([
        pl.col("dates").str.strptime(pl.Date, "%Y-%m-%d")
    ])
    
    print(df_reread)
    
    Login or Signup to reply.
  2. After having a bit of a play around, it looks like unfortunately dates being read from a JSON file have a bit of a quirk. It seems to me that currently they must be written in days since the unix epoch (which is how Polars internally represents dates) for things to work as you expect.

    I have raised this feature request on their github to hopefully get that improved.

    In the mean time,

    df = (
        pl.DataFrame({"dates": "2024-01-01"})
        # add this line below
        .select(pl.col("dates").cast(pl.Date).dt.epoch("d"))
    )
    
    df_reread = pl.read_json(
        df.write_json().encode(),
        schema_overrides={"dates": pl.Date},
    )
    print(df_reread)
    # shape: (1, 1)
    # ┌────────────┐
    # │ dates      │
    # │ ---        │
    # │ date       │
    # ╞════════════╡
    # │ 2024-01-01 │
    # └────────────┘
    

    or do as you say with df_reread.with_columns(pl.col("dates").cast(pl.Date)

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search