I'm trying to filter a PySpark DataFrame down to data from the last 3 years, relative to the current date.

The SQL predicate below needs to be converted into PySpark DataFrame syntax:

date >= dateadd(month,-4,current_date)

How do I write the above SQL in PySpark DataFrame syntax?

2 Answers


  1. Let's create a DataFrame to test on:

    from datetime import date
    from dateutil.relativedelta import relativedelta

    # Build (id, date) tuples for the first day of each month in 2023
    start_date = date(2023, 1, 1)
    dates_list = []
    for i in range(12):
        curr_date = start_date + relativedelta(months=i)
        dates_list.append((i + 1, curr_date))

    columns = ["id", "date"]
    dates_df = spark.createDataFrame(dates_list, columns)
    

    This creates a DataFrame with 12 rows, one for the first day of each month in 2023:

    display(dates_df)
    

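    The data is deterministic, so the displayed table should contain these 12 rows (shown here in .show()-style text form, middle rows elided):

    +---+----------+
    | id|      date|
    +---+----------+
    |  1|2023-01-01|
    |  2|2023-02-01|
    |  3|2023-03-01|
    ...
    | 12|2023-12-01|
    +---+----------+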

    Now let's filter the DataFrame into a new one:

    filter_dates_df = dates_df.where("date>=dateadd(month,-4,current_date)")
    

    And the results:

    display(filter_dates_df)
    

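    Which rows survive depends on when you run it: for example, with current_date = 2023-05-15 the cutoff is 2023-01-15, so ids 2 through 12 remain and only the 2023-01-01 row is dropped.

    One caveat: the three-argument dateadd(unit, value, expr) is Databricks SQL syntax. On open-source Spark, a sketch of the same filter uses the built-in add_months function inside the SQL string:

    # add_months(start_date, num_months) is available in open-source Spark SQL,
    # unlike the three-argument dateadd
    filter_dates_df = dates_df.where("date >= add_months(current_date(), -4)")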

  2. You need to use a combination of the filter and add_months functions, like this:

    from pyspark.sql import functions as F

    # Keep rows whose date is on or after four months before today
    df = df.filter(F.col("date") >= F.add_months(F.current_date(), -4))
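
    To sanity-check the cutoff date before filtering, you can compute it on a one-row DataFrame (a minimal sketch; spark is assumed to be an active SparkSession):

    # Show the date four months before today, i.e. the filter's lower bound
    spark.range(1).select(F.add_months(F.current_date(), -4).alias("cutoff")).show()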
    