
I have an item dataframe such as:

item_dict = {
    'index': [18, 24, 25, 26, 30, 31, 37, 38, 61, 62, 63, 67, 68, 69],
    'BarCode_x': ['12345678ABCD', '12345678IJKL', '12345678IJKL', '12345678IJKL', '12345678EFGH', '12345678EFGH', '67890123IJKL', '67890123IJKL', '67890123ABCD', '67890123ABCD', '67890123ABCD', '67890123EFGH', '67890123EFGH', '67890123EFGH'],
    'Extracted_Code': ['12345678', '12345678', '12345678', '12345678', '12345678', '12345678', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123'],
    'Description_x': ['Apples', 'Mangoes', 'Mangoes', 'Mangoes', 'Oranges', 'Oranges', 'Oats', 'Oats', 'Yoghurt', 'Yoghurt', 'Yoghurt', 'Cookies', 'Cookies', 'Cookies'],
    'Unique_Code_x': ['EFG', 'LMO', 'LMO', 'LMO', 'JKL', 'JKL', 'OPZ', 'OPZ', 'YQA', 'YQA', 'YQA', 'CDF', 'CDF', 'CDF'],
    'Category_x': ['M', 'S', 'S', 'S', 'T', 'T', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
    'Code_x': [1, 4, 4, 4, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'Quantity_x': [52, 90, 90, 90, 11, 11, 90, 90, 52, 52, 52, 11, 11, 11],
    'Price_x': [15.6, 67.0, 67.0, 67.0, 12.9, 12.9, 67.0, 67.0, 15.6, 15.6, 15.6, 12.9, 12.9, 12.9],
    'BarCode': ['12345678AAAA', '12345678AAAA', '12345678BBBB', '12345678CCCC', '12345678AAAA', '12345678BBBB', '67890123XXXX', '67890123YYYY', '67890123XXXX', '67890123YYYY', '67890123ZZZZ', '67890123XXXX', '67890123YYYY', '67890123ZZZZ'],
    'Description': ['Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks'],
    'Unique_Code': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ'],
    'Category': ['H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'],
    'Code': [0, 0, 2, 3, 0, 2, 0, 2, 0, 2, 3, 0, 2, 3],
    'Quantity': [99, 99, 77, 10, 99, 77, 99, 77, 99, 77, 10, 99, 77, 10],
    'Price': [12.0, 12.0, 10.5, 11.0, 12.0, 10.5, 12.0, 10.5, 12.0, 10.5, 11.0, 12.0, 10.5, 11.0]
}

item_df = pd.DataFrame(item_dict)

I am trying to group the dataframe by ['BarCode_x', 'Extracted_Code', 'Unique_Code_x'], convert each group's remaining columns into a list of JSON-like records (dicts) and store it in a new column, Grouped. My desired result is:

BarCode_x       Extracted_Code  Unique_Code_x   Grouped
12345678ABCD    12345678        EFG             [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
12345678EFGH    12345678        JKL             [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
12345678IJKL    12345678        LMO             [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123ABCD    67890123        YQA             [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123EFGH    67890123        CDF             [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123IJKL    67890123        OPZ             [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]

This is what I have done:

item_df.groupby(['BarCode_x', 'Extracted_Code', 'Unique_Code_x'])[["BarCode", "Description", "Category", "Code", "Quantity", "Price"]].apply(lambda group: group.to_dict("records")).reset_index(name="Grouped")

The item_df shown above is a small sample of a dataframe that contains over 3 million records. When I apply the above groupby + apply logic to the full dataframe, the process takes about 2 hours to complete, which is not feasible. Is there an optimized way to get the same result in less time, without groupby + apply?

2 Answers


  1. You can try:

    # Group the rows in plain Python:
    # (BarCode_x, Extracted_Code, Unique_Code_x) -> list of the remaining column values per row.
    out = {}
    for a, b, c, *d in zip(
        item_df.BarCode_x,
        item_df.Extracted_Code,
        item_df.Unique_Code_x,
        item_df.BarCode,
        item_df.Description,
        item_df.Category,
        item_df.Code,
        item_df.Quantity,
        item_df.Price,
    ):
        out.setdefault((a, b, c), []).append(d)

    # Keys for the per-row dicts stored in the Grouped column.
    ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]

    # Build the result frame: one row per group, with the records as a list of dicts.
    df = pd.DataFrame(
        ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
        columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
    )
    print(df)
    

    Prints:

          BarCode_x Extracted_Code Unique_Code_x                                                                                                                                                                                                                                                                                                                                              Grouped
    0  12345678ABCD       12345678           EFG                                                                                                                                                                                                                                    [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
    1  12345678IJKL       12345678           LMO  [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
    2  12345678EFGH       12345678           JKL                                                                                                                   [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
    3  67890123IJKL       67890123           OPZ                                                                                                                   [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
    4  67890123ABCD       67890123           YQA  [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
    5  67890123EFGH       67890123           CDF  [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
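
    Note that groupby sorts the group keys by default, which is why these rows come out in a different order than the question's desired output; the dict-based approach keeps the groups in order of first appearance instead. If the sorted order matters, it can be restored afterwards; a minimal sketch (the sort call is my addition, not part of the original answer):

    # Optional: restore the sorted key order that groupby would produce.
    df = df.sort_values(["BarCode_x", "Extracted_Code", "Unique_Code_x"], ignore_index=True)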
    

    Quick benchmark:

    from timeit import timeit
    
    
    def fn1(item_df):
        x = (
            item_df.groupby(["BarCode_x", "Extracted_Code", "Unique_Code_x"])[
                ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
            ]
            .apply(lambda group: group.to_dict("records"))
            .reset_index(name="Grouped")
        )
        return x
    
    
    def fn2(item_df):
        out = {}
        for a, b, c, *d in zip(
            item_df.BarCode_x,
            item_df.Extracted_Code,
            item_df.Unique_Code_x,
            item_df.BarCode,
            item_df.Description,
            item_df.Category,
            item_df.Code,
            item_df.Quantity,
            item_df.Price,
        ):
            out.setdefault((a, b, c), []).append(d)
    
        ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
    
        return pd.DataFrame(
            ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
            columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
        )
    
    t1 = timeit('fn1(x)', 'x=item_df.copy()', number=1000, globals=globals())
    t2 = timeit('fn2(x)', 'x=item_df.copy()', number=1000, globals=globals())
    print(t1)
    print(t2)
    

    Prints on my machine (AMD 5700X, Ubuntu 20.04, Python 3.10.9, Pandas 1.5.3):

    2.507308159954846
    0.1901476769708097
    

    So it should be roughly 13x faster here (2.51 s vs 0.19 s over 1,000 runs), mainly because it avoids constructing a sub-DataFrame and calling .to_dict("records") for every group.
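
    The benchmark runs on the 14-row sample, so the ratio may differ on the real 3-million-row frame. A rough sketch for re-checking on a larger synthetic frame (replicating the sample with pd.concat and the repeat counts are my own assumptions, not part of the original answer):

    # Hypothetical scale-up: repeat the 14-row sample to approximate a larger workload.
    big_df = pd.concat([item_df] * 10_000, ignore_index=True)  # ~140k rows; adjust as needed

    t1 = timeit('fn1(x)', 'x=big_df.copy()', number=5, globals=globals())
    t2 = timeit('fn2(x)', 'x=big_df.copy()', number=5, globals=globals())
    print(t1, t2)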

  2. Here is an alternative approach that avoids GroupBy:

    from collections import OrderedDict

    use_cols = [
        "BarCode_x", "Extracted_Code", "Unique_Code_x", # Groupers
        "BarCode", "Description", "Category", "Code",   # Aggs (p1)
        "Quantity", "Price"                             # Aggs (p2)
    ]

    def gby_records(df):
        # Collect the records of each group in a single pass over the rows.
        data = {}
        for r in df[use_cols].to_dict("records"):
            key = tuple(r[col] for col in use_cols[:3])
            data.setdefault(key, []).append(OrderedDict((k, r[k]) for k in use_cols[3:]))

        # Emit one output row per group: the three keys plus the list of records.
        for k, rs in data.items():
            yield {**dict(zip(use_cols[:3], k)), "Grouped": rs}

    out = pd.DataFrame(gby_records(item_df))
    


    Output:

          BarCode_x Extracted_Code Unique_Code_x                                            Grouped
    0  12345678ABCD       12345678           EFG  [{'BarCode': '12345678AAAA', 'Description': 'F…
    1  12345678IJKL       12345678           LMO  [{'BarCode': '12345678AAAA', 'Description': 'F…
    2  12345678EFGH       12345678           JKL  [{'BarCode': '12345678AAAA', 'Description': 'F…
    3  67890123IJKL       67890123           OPZ  [{'BarCode': '67890123XXXX', 'Description': 'S…
    4  67890123ABCD       67890123           YQA  [{'BarCode': '67890123XXXX', 'Description': 'S…
    5  67890123EFGH       67890123           CDF  [{'BarCode': '67890123XXXX', 'Description': 'S…

    Timings:

    🥉 %%timeit #Animeartist 
    5.84 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    🥈 %%timeit #Timeless
    1.28 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    🥇 %%timeit #Andrej Kesely
    612 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
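
    If literal JSON strings are needed rather than Python dicts (the question mentions "a list of jsons"), the Grouped column from either approach can be serialized afterwards. A minimal sketch, assuming out is the result DataFrame built above and that the record values are plain Python types (numpy scalars would need converting first):

    import json

    # Hypothetical follow-up: one JSON string per row instead of a list of dicts.
    out["Grouped"] = out["Grouped"].map(json.dumps)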
    