skip to Main Content

I have a Dataframe containing transactions from an account and I need to categorize them and sum each category. I have 8 categories in total and can have 10+ keywords in each. I have found posts on this site that accomplish something similar but they have their caveats.

df['Contains string'] = (df[columns].apply(lambda x: x.str.contains('myString'))).sum(axis=1)

This is the closest I can find but it doesn’t allow for multiple keywords or categories.

Examples of categories:

Phone payment - contains string "ATT Payment"
Supplies - contains string "EBAY" or "BIZCARD" or "AMAZON" or etc.

The DataFrame can have unlimited rows but a sample looks like:

                                         TransactionID    Debit   Credit
0                                   ATT Payment ID 001     0.00    21.29
1                                      BIZCARD PAYMENT     0.00   124.93
2                                          Check #1867  8755.50     0.00
3                                          Check #1874   110.66     0.00
5                                          EBAY ID 544     0.00    36.00
6                                        AMAZONPAY 788     0.00   373.20

The credit column would be the only that needs values to be summed.

Desired Output:

Phone Payment = 21.29  
Supplies = 534.13
Etc. = 00

New to Python. Pardon the lack or surplus of information. Thanks.

2

Answers


  1. Use pandas.Series.str.contains

    • .contains() is then used for Indexing and selecting data
    • Create a mask for each set of desired constraints
    • The masks can then be used with a ~, which is not, for the etc data.
    import pandas as pd
    
    # create dataframe
    data = {'TransactionID': ['ATT Payment ID 001', 'BIZCARD PAYMENT', 'Check #1867', 'Check #1874', 'EBAY ID 544', 'AMAZONPAY 788'],
            'Debit': [0.0, 0.0, 8755.5, 110.66, 0.0, 0.0],
            'Credit': [21.29, 124.93, 0.0, 0.0, 36.0, 373.2]}
    
    df = pd.DataFrame(data)
    
    # Get rows based on criteria of contains
    phone_payment_mask = df.TransactionID.str.contains('ATT')
    phone_payment = df.Credit[phone_payment_mask]
    print(phone_payment.sum())
    21.29
    
    supplies_mask = df.TransactionID.str.contains('EBAY|BIZCARD|AMAZON')
    supplies = df.Credit[supplies_mask]
    print(supplies.sum())
    534.13
    
    etc = df.Credit[~phone_payment_mask & ~supplies_mask]
    print(etc.sum())
    0.0
    
    Login or Signup to reply.
  2. You can start by adding a category column as described in this answer (use .str.contains instead of .isin)

    Then you can do the sum using a groupby :

    df.groupby('Category_column').Credit.sum()
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search