skip to Main Content
rake_name origin destination created_At bpc_issue_date departure_time journey_distance bpc_balance_distance_km abc rn logic
ARIL-05 a b 09-02-2024 06:37 08-02-2024 18:30 09-02-2024 13:40 173.4600067 5271 5097.539993 1 5271-173.4600067
ARIL-05 a b 11-02-2024 02:58 08-02-2024 18:30 11-02-2024 04:15 1099.369995 5271 3998 2 5097-1099
ARIL-05 a b 14-02-2024 12:46 08-02-2024 18:30 14-02-2024 15:50 1099.369995 5271 2898 3 3998-1099
ARIL-05 a b 16-02-2024 17:29 08-02-2024 18:30 17-02-2024 02:30 175.6600037 5271 2713 4 2898-175
ARIL-05 a b 17-02-2024 10:54 08-02-2024 18:30 18-02-2024 11:30 1178.880005 5271 1544 5 2713-1178
ARIL-05 a b 20-02-2024 11:44 08-02-2024 18:30 20-02-2024 15:20 1179.369995 5271 364 6 1544-1179

Please look at the logic column for substraction i want.

so i want to update abc column with difference between bpc_balance_distance_km and journey_distance once i get that value i want to substract below abc values with the result of abc - respective journey_distance.

i have tried sql, however i am not sure how can i populate row >1, am able to populate 1st row using
case when rn= 1 then bpc_balance_km - journey_distance

i have tried following query however, i awnt to covert this query int databricks sql supported query. this query works

SET @newabc := 0;

WITH cte AS (
 SELECT *,
       CASE WHEN rn=1 THEN @newabc := bpc_balance_distance_km-journey_distance
         ELSE @newabc := @newabc-journey_distance END AS nabc
 FROM derived_abc ORDER BY rn)
 SELECT *, ROUND(nabc) AS final_result
 FROM cte;

any suggastion in python or sql will be helpful.
here is DB fiddle link : https://dbfiddle.uk/JBzaylSb i am using databricks sql

2

Answers


  1. Chosen as BEST ANSWER
    def calculate_nabc(rn, bpc_balance_distance_km, journey_distance, newabc):
        if rn == 1:
            newabc[0] = bpc_balance_distance_km - journey_distance
        else:
            newabc[0] = newabc[0] - journey_distance
        return newabc[0]
    
    def calculate_final_result(df):
        # Add a row number column
        df['rn'] = df.groupby('rake_name').cumcount() + 1
        
        # Initialize the value of newabc
        newabc = [0.0]
    
        # Calculate the nabc column using the UDF
        df['nabc'] = df.apply(lambda row: calculate_nabc(row['rn'], row['bpc_balance_distance_km'], row['journey_distance'], newabc), axis=1)
    
        # Round the nabc values
        df['final_result'] = df['nabc'].round()
    
        return df
    df = calculate_final_result(df)
    df.display()
    

  2. Here, try this:

    SET @newabc := 0;
    
    WITH cte AS (
     SELECT *,
           CASE WHEN rn=1 THEN @newabc := bpc_balance_distance_km-journey_distance
             ELSE @newabc := @newabc-journey_distance END AS nabc
     FROM derived_abc ORDER BY rn)
     SELECT *, ROUND(nabc) AS final_result
     FROM cte;
    

    In the query above, the value of bpc_balance_distance_km-journey_distance for rn=1 is assigned to the @newabc variable. Then for rows where rn > 1, it’ll take the current assigned @newabc value to subtract with the row’s journey_distance value.

    There’ll be a slight different result in which causes by what I already point out in the comment – the inconsistencies of the decimal treatment. You can see the example in this demo fiddle.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search