Question posted in Json
Our archive of expertly curated questions and answers provides insights and solutions to common problems related to this popular data interchange format. From parsing and manipulating JSON data to integrating it with various programming languages and web services, our archive has got you covered. Start exploring today and take your JSON skills to the next level

Pandas: Convert group into list of jsons without using groupby or apply

Animeartist
June 13, 2023
258 views
0 votes
2 Answers

I have an item dataframe such as:

item_dict = {
    'index': [18, 24, 25, 26, 30, 31, 37, 38, 61, 62, 63, 67, 68, 69],
    'BarCode_x': ['12345678ABCD', '12345678IJKL', '12345678IJKL', '12345678IJKL', '12345678EFGH', '12345678EFGH', '67890123IJKL', '67890123IJKL', '67890123ABCD', '67890123ABCD', '67890123ABCD', '67890123EFGH', '67890123EFGH', '67890123EFGH'],
    'Extracted_Code': ['12345678', '12345678', '12345678', '12345678', '12345678', '12345678', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123', '67890123'],
    'Description_x': ['Apples', 'Mangoes', 'Mangoes', 'Mangoes', 'Oranges', 'Oranges', 'Oats', 'Oats', 'Yoghurt', 'Yoghurt', 'Yoghurt', 'Cookies', 'Cookies', 'Cookies'],
    'Unique_Code_x': ['EFG', 'LMO', 'LMO', 'LMO', 'JKL', 'JKL', 'OPZ', 'OPZ', 'YQA', 'YQA', 'YQA', 'CDF', 'CDF', 'CDF'],
    'Category_x': ['M', 'S', 'S', 'S', 'T', 'T', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M'],
    'Code_x': [1, 4, 4, 4, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'Quantity_x': [52, 90, 90, 90, 11, 11, 90, 90, 52, 52, 52, 11, 11, 11],
    'Price_x': [15.6, 67.0, 67.0, 67.0, 12.9, 12.9, 67.0, 67.0, 15.6, 15.6, 15.6, 12.9, 12.9, 12.9],
    'BarCode': ['12345678AAAA', '12345678AAAA', '12345678BBBB', '12345678CCCC', '12345678AAAA', '12345678BBBB', '67890123XXXX', '67890123YYYY', '67890123XXXX', '67890123YYYY', '67890123ZZZZ', '67890123XXXX', '67890123YYYY', '67890123ZZZZ'],
    'Description': ['Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Fruits', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks', 'Snacks'],
    'Unique_Code': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'XYZ'],
    'Category': ['H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H', 'H'],
    'Code': [0, 0, 2, 3, 0, 2, 0, 2, 0, 2, 3, 0, 2, 3],
    'Quantity': [99, 99, 77, 10, 99, 77, 99, 77, 99, 77, 10, 99, 77, 10],
    'Price': [12.0, 12.0, 10.5, 11.0, 12.0, 10.5, 12.0, 10.5, 12.0, 10.5, 11.0, 12.0, 10.5, 11.0]
}

item_df = pd.DataFrame(item_dict)

I am trying to group the dataframe based on ['BarCode_x', 'Extracted_Code', 'Unique_Code_x'], convert each group into a list of jsons and store it in a new column Grouped. My desired result is:

BarCode_x       Extracted_Code  Unique_Code_x   Grouped
12345678ABCD    12345678        EFG             [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
12345678EFGH    12345678        JKL             [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
12345678IJKL    12345678        LMO             [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123ABCD    67890123        YQA             [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123EFGH    67890123        CDF             [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
67890123IJKL    67890123        OPZ             [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]

This is what I have done:

item_df.groupby(['BarCode_x', 'Extracted_Code', 'Unique_Code_x'])[["BarCode", "Description", "Category", "Code", "Quantity", "Price"]].apply(lambda group: group.to_dict("records")).reset_index(name="Grouped")

The item_df shown above is a small representation of another dataframe that contains over 3 million records. When I apply the above logic using groupby+apply, the process takes 2 hours to complete, which is not feasible. Therefore, is there any way I can achieve the same result in a shorter amount of time using another optimized method instead of using groupby+apply?

Answers

You can try:

out = {}
for a, b, c, *d in zip(
    item_df.BarCode_x,
    item_df.Extracted_Code,
    item_df.Unique_Code_x,
    item_df.BarCode,
    item_df.Description,
    item_df.Category,
    item_df.Code,
    item_df.Quantity,
    item_df.Price,
):
    out.setdefault((a, b, c), []).append(d)

ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]

df = pd.DataFrame(
    ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
    columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
)
print(df)

Prints:

      BarCode_x Extracted_Code Unique_Code_x                                                                                                                                                                                                                                                                                                                                              Grouped
0  12345678ABCD       12345678           EFG                                                                                                                                                                                                                                    [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}]
1  12345678IJKL       12345678           LMO  [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '12345678CCCC', 'Description': 'Fruits', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
2  12345678EFGH       12345678           JKL                                                                                                                   [{'BarCode': '12345678AAAA', 'Description': 'Fruits', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '12345678BBBB', 'Description': 'Fruits', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
3  67890123IJKL       67890123           OPZ                                                                                                                   [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}]
4  67890123ABCD       67890123           YQA  [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]
5  67890123EFGH       67890123           CDF  [{'BarCode': '67890123XXXX', 'Description': 'Snacks', 'Category': 'H', 'Code': 0, 'Quantity': 99, 'Price': 12.0}, {'BarCode': '67890123YYYY', 'Description': 'Snacks', 'Category': 'H', 'Code': 2, 'Quantity': 77, 'Price': 10.5}, {'BarCode': '67890123ZZZZ', 'Description': 'Snacks', 'Category': 'H', 'Code': 3, 'Quantity': 10, 'Price': 11.0}]

Quick benchmark:

from timeit import timeit


def fn1(item_df):
    x = (
        item_df.groupby(["BarCode_x", "Extracted_Code", "Unique_Code_x"])[
            ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]
        ]
        .apply(lambda group: group.to_dict("records"))
        .reset_index(name="Grouped")
    )
    return x


def fn2(item_df):
    out = {}
    for a, b, c, *d in zip(
        item_df.BarCode_x,
        item_df.Extracted_Code,
        item_df.Unique_Code_x,
        item_df.BarCode,
        item_df.Description,
        item_df.Category,
        item_df.Code,
        item_df.Quantity,
        item_df.Price,
    ):
        out.setdefault((a, b, c), []).append(d)

    ks = ["BarCode", "Description", "Category", "Code", "Quantity", "Price"]

    return pd.DataFrame(
        ((*k, [dict(zip(ks, l)) for l in v]) for k, v in out.items()),
        columns=["BarCode_x", "Extracted_Code", "Unique_Code_x", "Grouped"],
    )

t1 = timeit('fn1(x)', 'x=item_df.copy()',number=1000, globals=globals())
t2 = timeit('fn2(x)', 'x=item_df.copy()',number=1000, globals=globals())
print(t1)
print(t2)

Prints on my machine AMD 5700X/Ubuntu 20.04/Python 3.10.9/Pandas 1.5.3:

2.507308159954846
0.1901476769708097

So it should be ~12x faster.

Here is an alternative approach (that avoids using GroupBy) :

from collections import OrderedDict

use_cols = [
    "BarCode_x", "Extracted_Code", "Unique_Code_x", # Groupers
    "BarCode", "Description", "Category", "Code",   # Aggs (p1)
    "Quantity", "Price"                             # Aggs (p2)
]

def gby_records(df):
    data = {}
    for r in df[use_cols].to_dict("records"):
        key = tuple(r[col] for col in use_cols[:3])
        data.setdefault(key, []).append(OrderedDict((k, r[k]) for k in use_cols[3:]))

    for k, rs in data.items():
        yield {**dict(zip(use_cols[:3], k)), "Grouped": rs}
    
out = pd.DataFrame(gby_records(item_df))

Output :

	BarCode_x	Extracted_Code	Unique_Code_x	Grouped
0	12345678ABCD	12345678	EFG	[{‘BarCode’: ‘12345678AAAA’, ‘Description’: ‘F…
1	12345678IJKL	12345678	LMO	[{‘BarCode’: ‘12345678AAAA’, ‘Description’: ‘F…
2	12345678EFGH	12345678	JKL	[{‘BarCode’: ‘12345678AAAA’, ‘Description’: ‘F…
3	67890123IJKL	67890123	OPZ	[{‘BarCode’: ‘67890123XXXX’, ‘Description’: ‘S…
4	67890123ABCD	67890123	YQA	[{‘BarCode’: ‘67890123XXXX’, ‘Description’: ‘S…
5	67890123EFGH	67890123	CDF	[{‘BarCode’: ‘67890123XXXX’, ‘Description’: ‘S…

Timings :

🥉 %%timeit #Animeartist 
5.84 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

🥈 %%timeit #Timeless
1.28 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

🥇 %%timeit #Andrej Kesely
612 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Please signup or login to give your own answer.

Click here to cancel reply.