dataframe to json columns with multiple values read as one string

Betabr
April 28, 2024
69 views
0 votes
2 Answers

I am trying to read a csv file into a dataframe to transform that into nested jsons. The deliminter in the csv file is ‘;’. After loading the data into a dataframe, following is what I get:

name	frequency	frequency_start_date	frequency_count	schedule_type	day_of_week	Comments
Pattern 1	daily	2022-02-01	1	weekdays	Monday	null
Pattern 2	daily	2024-01-01	1, 5	weekdays, weekdays	Monday, Tuesday	null, null
Pattern 3	daily	2021-03-21	1, 2	weekdays, weekdays	Thursday,Friday	null, null

This is the code I wrote to load the csv as a dataframe:

df1_data = pd.read_csv(csv1, delimiter=';', keep_default_na=False, dtype=object)

This is the code I wrote to transform the dataframe into json:

nested_cols = ['frequency_count', 'schedule_type', 'day_of_week']

df1_data['patterns'] = df1_data[nested_cols].to_dict('records')

df1_nested2 = df1_data[['name', 'frequency', 'frequency_start_date', 'patterns', 'comments']].to_json(orient='records', indent=4)

When I run this, I get the following:

[
    {
        "name": "Pattern 1",
        "frequency": "daily",
        "frequency_start_date": "2022-02-01",
        "patterns": {
            "frequency_count": "1",
            "schedule_type": "weekdays",
            "day_of_week": null
        },
        "comments": null
    },
    {
        "name": "Pattern 2",
        "frequency": "daily",
        "frequency_start_date": "2024-01-01",
        "patterns": {
            "frequency_count": "1, 5",
            "schedule_type": "weekdays,weekdays",
            "day_of_week": "null, null"
        },
        "comments": null
    },
    {
        "name": "Pattern 3",
        "frequency": "daily",
        "frequency_start_date": "2021-03-21",
        "patterns": {
            "frequency_count": "1, 2",
            "schedule_type": "weekdays,weekdays",
            "day_of_week": "null, null"
        },
        "comments": null
    }
]

But this is what I want:

[
    {
        "name": "Pattern 1",
        "frequency": "daily",
        "frequency_start_date": "2022-02-01",
        "patterns": {
            "frequency_count": "1",
            "schedule_type": "weekdays",
            "day_of_week": null
        },
        "comments": null
    },
    {
        "name": "Pattern 2",
        "frequency": "daily",
        "frequency_start_date": "2024-01-01",
        "patterns": {
            "frequency_count": 1,
            "schedule_type": "weekdays",
            "day_of_week": null
        },
        {
            "frequency_count": 5,
            "schedule_type": "weekdays",
            "day_of_week": null
        },
        "comments": null
    },
    {
        "name": "Pattern 3",
        "frequency": "daily",
        "frequency_start_date": "2021-03-21",
        "patterns": {
            "frequency_count": 1,
            "schedule_type": "weekdays",
            "day_of_week": null
        },
        {
            "frequency_count": 2,
            "schedule_type": "weekdays",
            "day_of_week": null
        },
        "comments": null
    }
]

This is where I am stuck. Is this an issue with csv into dataframe? Or dataframe into json? It seems that the columns where there are multiple values are read as strings. I read questions posted at StackOverflow but none of them actually had this issue or answers did not help me to solve this.

Any help would be greatly appreciated.

Thanks,

Beta

Answers

Chosen as BEST ANSWER

Thanks to @BeRT2me, the final code for this is the following:

nested_cols = ['frequency_count', 'schedule_type', 'day_of_week']

# splitting the strings in nested columns
for col in nested_cols:
    df1_data[col] = df1_data[col].str.split(',')

#exploding the columns
df1_data.explode(nested_cols)

#changed the frequency_count type to int as a number is always expected
df1_data['frequency_count'] = df1_data['frequency_count'].astype(int)

df1_data['patterns'] = df1_data[nested_cols].to_dict("records")

# Using @BeRT2me response:
output = df.pivot_table(
index=["name", "frequency", "frequency_start_date", "Comments"], 
values="patterns", 
aggfunc=list).reset_index().to_json(orient="records", indent=4)

# replacing the string "null" by null in json
df1_nested = df1_nested.replace('"null"', 'null')

The output is:

[
{
    "pattern_name":"daily_pattern_cadence_start_after_effective_start",
    "cadence":"Daily",
    "cadence_start_date":"2023-02-01",
    "comments":null,
    "patterns":[
        {
            "cadence_count":1,
            "weekend_type":"Western",
            "annual_reset":"No",
            "fixed_day_of_week":null,
            "effective_start_date":"2022-12-31",
            "effective_end_date":"2023-12-31"
        },
        {
            "cadence_count":1,
            "weekend_type":" Western",
            "annual_reset":" No",
            "fixed_day_of_week":" null",
            "effective_start_date":" 2020-12-31",
            "effective_end_date":" 2021-12-31"
        }
    ]
},
{
    "pattern_name":"daily_pattern_effective_start_greater_previous_effective_end",
    "cadence":"Daily",
    "cadence_start_date":"2019-05-31",
    "comments":null,
    "patterns":[
        {
            "cadence_count":1,
            "weekend_type":"Western",
            "annual_reset":"No",
            "fixed_day_of_week":null,
            "effective_start_date":"2019-05-31",
            "effective_end_date":"2021-12-31"
        },
        {
            "cadence_count":1,
            "weekend_type":" Western",
            "annual_reset":" No",
            "fixed_day_of_week":null,
            "effective_start_date":" 2022-03-31",
            "effective_end_date":" null"
        }
    ]
},
{
    "pattern_name":"daily_pattern_success",
    "cadence":"Daily",
    "cadence_start_date":"2022-02-01",
    "comments":"",
    "patterns":[
        {
            "cadence_count":1,
            "weekend_type":"Western",
            "annual_reset":"No",
            "fixed_day_of_week":null,
            "effective_start_date":"2022-02-01",
            "effective_end_date":null
        }
    ]
}

]

(Edit)

Given:

        name frequency frequency_start_date frequency_count       schedule_type      day_of_week    Comments
0  Pattern 1     daily           2022-02-01               1            weekdays           Monday         NaN
1  Pattern 2     daily           2024-01-01            1, 5  weekdays, weekdays  Monday, Tuesday  null, null
2  Pattern 3     daily           2021-03-21            1, 2  weekdays, weekdays  Thursday,Friday  null, null

Doing:

Convert the strings to lists:

multi_cols = ["frequency_count", "schedule_type", "day_of_week", "Comments"]

for col in multi_cols:
    df[col] = df[col].str.split(", ?") # regex

Explode these columns:

df = df.explode(multi_cols)

Make this patterns column and fix the Comments column:

df["patterns"] = df[["frequency_count", "schedule_type", "day_of_week"]].to_dict("records")
df["Comments"] = df["Comments"].fillna("null")

Pivot and export to json:

output = df.pivot_table(
    index=["name", "frequency", "frequency_start_date", "Comments"], 
    values="patterns", 
    aggfunc=list, 
).reset_index().to_json(orient="records", indent=4)
print(output)

Output:

[
    {
        "name":"Pattern 1",
        "frequency":"daily",
        "frequency_start_date":"2022-02-01",
        "Comments":"null",
        "patterns":[
            {
                "frequency_count":"1",
                "schedule_type":"weekdays",
                "day_of_week":"Monday"
            }
        ]
    },
    {
        "name":"Pattern 2",
        "frequency":"daily",
        "frequency_start_date":"2024-01-01",
        "Comments":"null",
        "patterns":[
            {
                "frequency_count":"1",
                "schedule_type":"weekdays",
                "day_of_week":"Monday"
            },
            {
                "frequency_count":"5",
                "schedule_type":"weekdays",
                "day_of_week":"Tuesday"
            }
        ]
    },
    {
        "name":"Pattern 3",
        "frequency":"daily",
        "frequency_start_date":"2021-03-21",
        "Comments":"null",
        "patterns":[
            {
                "frequency_count":"1",
                "schedule_type":"weekdays",
                "day_of_week":"Thursday"
            },
            {
                "frequency_count":"2",
                "schedule_type":"weekdays",
                "day_of_week":"Friday"
            }
        ]
    }
]

Please signup or login to give your own answer.

Click here to cancel reply.