
I have a CSV in the format below and want to convert it to a nested dict.

name,columns,tests
ABC_ESTIMATE_REFINED,cntquota,dbt_expectations.expect_column_to_exist
ABC_ESTIMATE_REFINED,cntquota,not_null
ABC_ESTIMATE_REFINED,is_purged,dbt_expectations.expect_column_to_exist
ABC_ESTIMATE_REFINED,is_purged,not_null

Expected Output

{
    "name": "ABC_ESTIMATE_REFINED",
    "columns": [
        {
            "name": "cntquota",
            "tests": [
                "dbt_expectations.expect_column_to_exist",
                "not_null"
            ]
        },
        {
            "name": "is_purged",
            "tests": [
                "dbt_expectations.expect_column_to_exist",
                "not_null"
            ]
        }
    ]
}

My attempt is below, but it does not come close to that.

import pandas as pd

df = pd.read_csv('data.csv')
print(df)
nested_dict = df.groupby(['name','columns']).apply(lambda x: x[['tests']].to_dict(orient='records')).to_dict()
 
print(nested_dict)

3

Answers


  1. IIUC, you can use nested groupby calls:

    out = [{'name': k1, 'columns': [{'name': k2, 'tests': g2['tests'].tolist()}
                                    for k2, g2 in g1.groupby('columns')]}
           for k1, g1 in df.groupby('name')]
    

    Since the processing works on successive pairs of columns, you could also use a recursive approach:

    def group(df, keys):
        if len(keys) > 1:
            key1, key2 = keys[:2]
            return [{key1: k, key2: group(g, keys[1:])}
                    for k, g in df.groupby(key1)]
        else:
            return df[keys[0]].tolist()
    
    out = group(df, ['name', 'columns', 'tests'])
    

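    Output of the nested groupby version (the recursive version labels the inner dicts with 'columns' instead of 'name'):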

    [{'name': 'ABC_ESTIMATE_REFINED',
      'columns': [{'name': 'cntquota',
                   'tests': ['dbt_expectations.expect_column_to_exist', 'not_null']},
                  {'name': 'is_purged',
                   'tests': ['dbt_expectations.expect_column_to_exist', 'not_null']}],
     }]
    

    Demo of the recursive approach with a different order of the keys:

    group(df, ['name', 'tests', 'columns'])
    
    [{'name': 'ABC_ESTIMATE_REFINED',
      'tests': [{'tests': 'dbt_expectations.expect_column_to_exist',
                 'columns': ['cntquota', 'is_purged']},
                {'tests': 'not_null', 'columns': ['cntquota', 'is_purged']}],
    }]
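
For reference, a minimal self-contained sketch of the nested groupby version from this answer, run against the sample rows from the question. Loading the rows through io.StringIO instead of data.csv is an assumption made only so the snippet runs on its own, and the loop variable names are illustrative. It also passes sort=False, so groups keep their order of first appearance in the CSV rather than being sorted, and takes the first list element to get the single dict the question asks for.

```python
import io
import json

import pandas as pd

# Sample rows copied from the question; StringIO stands in for data.csv.
CSV_TEXT = """name,columns,tests
ABC_ESTIMATE_REFINED,cntquota,dbt_expectations.expect_column_to_exist
ABC_ESTIMATE_REFINED,cntquota,not_null
ABC_ESTIMATE_REFINED,is_purged,dbt_expectations.expect_column_to_exist
ABC_ESTIMATE_REFINED,is_purged,not_null
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# One dict per table name, each holding one dict per column.
# sort=False keeps groups in order of first appearance.
out = [
    {
        "name": table,
        "columns": [
            {"name": column, "tests": by_column["tests"].tolist()}
            for column, by_column in by_table.groupby("columns", sort=False)
        ],
    }
    for table, by_table in df.groupby("name", sort=False)
]

# The sample has a single table name, so the first element is the wanted dict.
result = out[0]
print(json.dumps(result, indent=4))
```

json.dumps with indent=4 prints the dict in the same layout as the expected output in the question.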
    
  2. Something like:

    g = (
        df.groupby(["name", "columns"]).agg({"tests": list})
          .reset_index().rename(columns={"name": "index", "columns": "name"})
          .set_index("index")
    )
    
    result = {
        "name": g.index[0],
        "columns": g.to_dict(orient="records")
    }
    

    Seems to do the job for the expected output in your OP.

    If there are several distinct values of name, you will have to store the result as a list of dictionaries instead:

    import io
    import pandas as pd

    df = pd.read_csv(io.StringIO("""name,columns,tests
    ABC_ESTIMATE_REFINED,cntquota,dbt_expectations.expect_column_to_exist
    ABC_ESTIMATE_REFINED,cntquota,not_null
    ABC_ESTIMATE_REFINED,is_purged,dbt_expectations.expect_column_to_exist
    ABC_ESTIMATE_REFINED,is_purged,not_null
    ABC_ESTIMATE_REFINED2,cntquota,dbt_expectations.expect_column_to_exist
    ABC_ESTIMATE_REFINED2,cntquota,not_null
    ABC_ESTIMATE_REFINED2,is_purged,dbt_expectations.expect_column_to_exist
    ABC_ESTIMATE_REFINED2,is_purged,not_null"""))
    
    g = (
        df.groupby(["name", "columns"]).agg({"tests": list})
          .reset_index().rename(columns={"name": "index", "columns": "name"})
          .set_index("index")
    )
    
    result = []
    for key in g.index.unique():
        result.append({
            "name": key,
            # A list selector always yields a DataFrame, even for a single row
            "columns": g.loc[[key]].to_dict(orient="records")
        })
    

    This gives:

    [{'name': 'ABC_ESTIMATE_REFINED',
      'columns': [{'name': 'cntquota',
        'tests': ['dbt_expectations.expect_column_to_exist', 'not_null']},
       {'name': 'is_purged',
        'tests': ['dbt_expectations.expect_column_to_exist', 'not_null']}]},
     {'name': 'ABC_ESTIMATE_REFINED2',
      'columns': [{'name': 'cntquota',
        'tests': ['dbt_expectations.expect_column_to_exist', 'not_null']},
       {'name': 'is_purged',
        'tests': ['dbt_expectations.expect_column_to_exist', 'not_null']}]}]
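
One thing to watch with the per-key lookup: when a name has only one column, selecting it with a scalar label returns a Series, and Series.to_dict does not take an orient argument. Here is a small sketch of the difference. The TABLE_A / TABLE_B rows are made-up sample data, not from the question.

```python
import io

import pandas as pd

# Hypothetical sample: TABLE_B ends up with a single row after grouping.
CSV_TEXT = """name,columns,tests
TABLE_A,col1,not_null
TABLE_A,col2,not_null
TABLE_B,col1,not_null
TABLE_B,col1,unique
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

g = (
    df.groupby(["name", "columns"])
    .agg({"tests": list})
    .reset_index()
    .rename(columns={"name": "index", "columns": "name"})
    .set_index("index")
)

# A scalar label returns a Series when exactly one row matches ...
single = g.loc["TABLE_B", :]
# ... while a list of labels always returns a DataFrame.
frame = g.loc[["TABLE_B"]]

# Using the list selector keeps to_dict(orient="records") valid for every key.
result = [
    {"name": key, "columns": g.loc[[key]].to_dict(orient="records")}
    for key in g.index.unique()
]
print(result)
```

Selecting with g.loc[[key]] costs nothing when a name has several columns and avoids the failure when it has just one.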
    
  3. Group the tests per column, then build the nested dict in a loop:

    import pandas as pd
    
    # Read the CSV file into a DataFrame
    df = pd.read_csv('data.csv')
    
    # Initialize an empty dictionary to hold the final nested structure
    nested_dict = {}
    
    # Group by 'name' and 'columns'
    grouped = df.groupby(['name', 'columns'])['tests'].apply(list).reset_index()
    
    # Iterate through the grouped DataFrame to construct the nested dictionary
    for _, row in grouped.iterrows():
        name = row['name']
        column = row['columns']
        tests = row['tests']
    
        # If the name is not in the dictionary, add it with an empty 'columns' list
        if name not in nested_dict:
            nested_dict[name] = {'name': name, 'columns': []}
    
        # Append the column and its tests to the 'columns' list
        nested_dict[name]['columns'].append({'name': column, 'tests': tests})
    
    # Since we have only one 'name', we extract the value from the nested dictionary
    result = nested_dict['ABC_ESTIMATE_REFINED']
    
    print(result)
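
If pandas is not a hard requirement, the same loop idea can be sketched with only the standard library. This swaps the groupby for csv.DictReader plus dict.setdefault, so it is a different technique from the answer above. The function name nest and the inline sample text are assumptions for illustration, and it returns a list with one entry per name instead of hard-coding a single name.

```python
import csv
import io
import json

# Sample rows copied from the question; StringIO stands in for data.csv.
CSV_TEXT = """name,columns,tests
ABC_ESTIMATE_REFINED,cntquota,dbt_expectations.expect_column_to_exist
ABC_ESTIMATE_REFINED,cntquota,not_null
ABC_ESTIMATE_REFINED,is_purged,dbt_expectations.expect_column_to_exist
ABC_ESTIMATE_REFINED,is_purged,not_null
"""


def nest(rows):
    """Group row dicts into [{'name': ..., 'columns': [{'name', 'tests'}]}]."""
    tables = {}  # table name -> {column name -> list of tests}
    for row in rows:
        columns = tables.setdefault(row["name"], {})
        columns.setdefault(row["columns"], []).append(row["tests"])
    # Dicts keep insertion order, so output follows the order of the CSV.
    return [
        {
            "name": table,
            "columns": [
                {"name": column, "tests": tests}
                for column, tests in columns.items()
            ],
        }
        for table, columns in tables.items()
    ]


result = nest(csv.DictReader(io.StringIO(CSV_TEXT)))
print(json.dumps(result[0], indent=4))
```

With a real file you would pass csv.DictReader(open('data.csv', newline='')) instead of the StringIO wrapper.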
    