I want to convert the following nested JSON file to a CSV file using Python.
{
"page": {
"page": 1,
"pageSize": 250
},
"dataRows": [
{
"entityId": 349255,
"Id": "41432-95P",
"disabled": false,
"followed": false,
"suggestion": false,
"inactive": false,
"pinned": false,
"highlighted": false,
"columnValues": {
"lastName": [
{
"columnValueType": "ENTITY",
"accessStatus": "OK",
"columnValueType": "ENTITY",
"name": "McBrady",
"Id": "41432-95P",
"unpublished": false
}
],
"gender": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Male"
}
],
"hqCity": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Seattle"
}
],
"prefix": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Dr."
}
],
"lastUpdateDate": [
{
"columnValueType": "DATE",
"accessStatus": "OK",
"columnValueType": "DATE",
"expected": false,
"asOfdate": "2023-06-26"
}
],
"companyName": [
{
"columnValueType": "BUSINESS_ENTITY",
"accessStatus": "OK",
"columnValueType": "BUSINESS_ENTITY",
"name": "Global Partnerships",
"Id": "56347-39",
"unpublished": false,
"profileType": "INVESTOR"
}
],
"roles": [
{
"columnValueType": "INT_COLUMN_VALUE",
"accessStatus": "OK",
"columnValueType": "INT_COLUMN_VALUE",
"marked": false,
"value": 3
}
],
"dailyUpdates": [],
"assetClass": [],
"hqCountry": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "United States"
}
],
"latestNoteAuthor": [],
"primaryPosition": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Chair, Enterprise Risk, Compliance and Audit Committee and Member of the Board of Directors"
}
],
"boardSeats": [
{
"columnValueType": "INT_COLUMN_VALUE",
"accessStatus": "OK",
"columnValueType": "INT_COLUMN_VALUE",
"marked": false,
"value": 2
}
],
"fundRoles": [],
"institution": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Harvard University"
},
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "University of Oxford"
}
],
"latestNote": [],
"Id": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "41432-95P"
}
],
"hqRegion": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Americas"
}
],
"email": [],
"dealRoles": [],
"PrimaryCompanyType": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Not-For-Profit Venture Capital"
}
],
"mgtRoles": [],
"hqStateProvince": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Washington"
}
],
"fullName": [
{
"columnValueType": "ENTITY_WITH_NOTE",
"accessStatus": "OK",
"columnValueType": "ENTITY_WITH_NOTE",
"name": "Matthew McBrady Ph.D",
"Id": "41432-95P",
"unpublished": false
}
],
"hqLocation": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Seattle, WA"
}
],
"biography": [
{
"columnValueType": "DESCRIPTION_WITH_SOURCE",
"accessStatus": "OK",
"columnValueType": "DESCRIPTION_WITH_SOURCE",
"value": "Dr. Matthew McBrady serves as Chair, of the Enterprise Risk, Compliance, and Audit Committee.",
"morningstarSource": true
}
],
"firstName": [
{
"columnValueType": "ENTITY",
"accessStatus": "OK",
"columnValueType": "ENTITY",
"name": "Matthew",
"Id": "41432-95P",
"unpublished": false
}
],
"phone": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "+1 (206) 652-8773"
}
],
"hqSubRegion": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "North America"
}
],
"hqAddressLine2": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "Suite 410"
}
],
"hqAddressLine1": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "1201 Western Avenue"
}
],
"hqFax": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "+1 (206) 456-7877"
}
],
"middleName": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "R."
}
],
"companyWebsite": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "www.globalpartnerships.org"
}
],
"hqZipCode": [
{
"columnValueType": "STRING",
"accessStatus": "OK",
"columnValueType": "STRING",
"value": "98101"
}
],
"weeklyUpdates": []
}
}
]
}
I got the following code after iterating with ChatGPT. However, I couldn't get it to capture the first level of nesting, which includes entityId and Id. Moreover, within each nested field I want to capture all sub-fields in separate columns: for example, in "lastName" I want both "name" and "Id", and in "companyName" I want "name", "Id", and "profileType" each in its own column. As I told ChatGPT, I don't care about "columnValueType", "accessStatus", or "unpublished".
Here is the Python code:
import csv
import json

def extract_field_value(data):
    if isinstance(data, dict):
        if 'value' in data:
            return str(data['value'])
        elif 'columnValueType' in data and data['columnValueType'] == 'ENTITY':
            return str(data['name'])
        else:
            values = []
            for key, value in data.items():
                if key not in ['columnValueType', 'accessStatus', 'unpublished']:
                    values.append(extract_field_value(value))
            return ', '.join(values) if values else ''
    elif isinstance(data, list):
        values = []
        for item in data:
            value = extract_field_value(item)
            if value:
                values.append(value)
        return ', '.join(values) if values else ''
    else:
        return str(data) if data is not None else ''

# Read the JSON data
with open('data.json') as file:
    data = json.load(file)

# Extract the nested data rows
data_rows = data['dataRows']

# Extract the column headers from the first data row
column_headers = list(data_rows[0]['columnValues'].keys())

# Create a CSV file
with open('data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    # Write the column headers as the first row
    writer.writerow(column_headers)
    # Write each data row as a separate row in the CSV file
    for row in data_rows:
        column_values = row['columnValues']
        csv_row = []
        for column_header in column_headers:
            values = column_values.get(column_header, [])
            value = extract_field_value(values)
            csv_row.append(value)
        writer.writerow(csv_row)

print("CSV file created successfully.")
2 Answers
I see you're trying to write a more general program that figures out the structure of the JSON. Since you know this structure, I think it'd be easier, in the beginning at least, to just make your code aware of it: be very explicit, and in this case the code comes out looking much simpler.
This approach also leverages the DictWriter class in the csv module, so the header travels with each row dict; there's no need to track it separately. I also like typing, so I've added type hints for the structure of the JSON I saw (especially around columnValues).
When I run that on your sample JSON, the first row ends up looking like:
I know this deviates from what your code shows, but I only want to give you the rough idea (the approach). I believe you can adapt the code, especially extract_columns, to fit your needs.
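To make the shape of that concrete, here is a rough sketch of such an explicit, DictWriter-based version. The COLUMN_SPEC entries and the helper names (extract_columns, write_csv) are illustrative assumptions, and the spec lists only a few of your columns; extend it with the rest:

```python
# Sketch only: hard-code the columns you care about instead of discovering them.
import csv
from typing import Any

# (output header, key in columnValues, field inside each list item)
COLUMN_SPEC: list[tuple[str, str, str]] = [
    ("lastName",       "lastName",    "name"),
    ("lastName_Id",    "lastName",    "Id"),
    ("gender",         "gender",      "value"),
    ("companyName",    "companyName", "name"),
    ("companyName_Id", "companyName", "Id"),
    ("profileType",    "companyName", "profileType"),
]

def extract_columns(row: dict[str, Any]) -> dict[str, Any]:
    """Flatten one dataRows entry into a plain dict for csv.DictWriter."""
    out: dict[str, Any] = {"entityId": row["entityId"], "Id": row["Id"]}
    for header, key, field in COLUMN_SPEC:
        items = row["columnValues"].get(key, [])
        # Join across repeated items (e.g. the two "institution" entries).
        out[header] = ", ".join(str(i[field]) for i in items if field in i)
    return out

def write_csv(data: dict[str, Any], path: str) -> None:
    rows = [extract_columns(r) for r in data["dataRows"]]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
```

Being explicit like this also solves your entityId/Id problem for free: the first two keys of each output dict come straight from the top level of the row.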
You can try this example of how to parse the JSON into a DataFrame:
Prints:
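A rough sketch of that DataFrame idea: flatten each row into a plain dict, then hand the list of dicts to pandas. The flatten_row helper and the dotted headers such as companyName.name are my own conventions here, not a fixed API:

```python
# Sketch: flatten each dataRows entry, skipping the metadata keys,
# then load the flat dicts into a pandas DataFrame.
import json
import pandas as pd

SKIP = {"columnValueType", "accessStatus", "unpublished"}

def flatten_row(row: dict) -> dict:
    flat = {"entityId": row["entityId"], "Id": row["Id"]}
    for col, items in row["columnValues"].items():
        for item in items:
            for key, value in item.items():
                if key in SKIP:
                    continue
                # Plain "value" fields keep the column name; anything else
                # (name, Id, profileType, ...) gets a dotted header.
                header = col if key == "value" else f"{col}.{key}"
                prev = flat.get(header)
                # Repeated items (e.g. institution) are comma-joined.
                flat[header] = f"{prev}, {value}" if prev is not None else str(value)
    return flat

def to_dataframe(data: dict) -> pd.DataFrame:
    return pd.DataFrame([flatten_row(r) for r in data["dataRows"]])

# Usage, assuming your file is named data.json:
#   with open("data.json") as f:
#       to_dataframe(json.load(f)).to_csv("data.csv", index=False)
```

One caveat with this naming scheme: the "Id" entry inside columnValues produces the bare header "Id" and would be joined onto the top-level Id; rename one of them in the helper if that matters to you.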