
Parse XML file and output JSON with Python

QuanHoangNguyen
December 15, 2022
220 views
0 votes
2 Answers

I am quite new to Python. I'm currently trying to parse XML files, extract some of their information, and print it as JSON.

I have managed to parse the XML files, but I cannot print the results as JSON. In addition, my printjson function does not run through all the results and only prints once. The parse function works and runs through all input files, while printjson does not.
My code is as follows.

from xml.dom import minidom
import os
import json

#input multiple files
def get_files(d):
        return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d,f))]

#parse xml
def parse(files):
    for xml_file in files:
        
        #indentify all xml files
        tree = minidom.parse(xml_file)

        #Get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)

        return NCT_ID,brief_title,official_title

#print result in json
def printjson(results):
        for result in results:
                output_json = json.dumps(result)
                print(output_json)

printjson(parse(get_files('my files path')))

Output when running the file

"NCT ID : NCT00571389"
"brief title : Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products"
"official title : A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"

Expected output

{
"NCT ID" : "NCT00571389",
"brief title" : "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
"official title" : "A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
}

The sample XML files I used come from the COVID-19 Clinical Trials dataset, which can be found on Kaggle.

Answers

I don't know much about the xml.dom library, but you can generate the JSON from a dictionary, because json.dumps only serializes a Python object to a JSON string.
Something like this:


def parse(files):
    results = []
    for xml_file in files:

        # parse the current XML file
        tree = minidom.parse(xml_file)
        dicJson = {}
        dicJson["NCT ID"] = tree.getElementsByTagName("nct_id")[0].firstChild.data
        dicJson["brief title"] = tree.getElementsByTagName("brief_title")[0].firstChild.data
        dicJson["official title"] = tree.getElementsByTagName("official_title")[0].firstChild.data
        # keep one dictionary per file instead of only the last one
        results.append(dicJson)
    return results

And in the printJson function:

def printJson(results):
    # json.dumps returns the parsed data serialized as a JSON string.
    print(json.dumps(results))
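To make the dictionary approach concrete, here is a minimal self-contained sketch. The XML string is a made-up stand-in for one of the real files, and `first_text` is a hypothetical helper (not part of xml.dom) that I added so a missing tag gives None instead of an IndexError. Passing `indent=4` to json.dumps gives the multi-line layout shown in the expected output.

```python
import json
from xml.dom import minidom

# Hypothetical sample standing in for one of the XML files.
SAMPLE_XML = """<clinical_study>
  <nct_id>NCT00571389</nct_id>
  <brief_title>Example brief title</brief_title>
  <official_title>Example official title</official_title>
</clinical_study>"""


def first_text(tree, tag):
    """Return the text of the first <tag> element, or None if absent or empty."""
    nodes = tree.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return None
    return nodes[0].firstChild.data


# parseString works on an in-memory string; minidom.parse takes a file path.
tree = minidom.parseString(SAMPLE_XML)
record = {
    "NCT ID": first_text(tree, "nct_id"),
    "brief title": first_text(tree, "brief_title"),
    "official title": first_text(tree, "official_title"),
}

# indent=4 pretty-prints the JSON string over several lines.
output_json = json.dumps(record, indent=4)
print(output_json)
```

With real files you would call `minidom.parse(xml_file)` as in the question and keep the rest the same.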

- LeeKaiXuan
- December 15, 2022 at 2:12 am
- 0 votes
The issue is that your parse function returns too early: it returns after getting the details from the first XML file. Instead, you should return a list of dictionaries, so that each item in the list represents a different file and each dictionary contains the information from the corresponding XML file.

Here’s the updated code:
```
from xml.dom import minidom

def parse(files):
    xml_information = []
    for xml_file in files:

        # Parse the current XML file
        tree = minidom.parse(xml_file)

        # Get some details (raw values only; the labels become the dictionary keys)
        NCT_ID = tree.getElementsByTagName("nct_id")[0].firstChild.data
        brief_title = tree.getElementsByTagName("brief_title")[0].firstChild.data
        official_title = tree.getElementsByTagName("official_title")[0].firstChild.data

        # One dictionary per file, with keys matching the expected output
        xml_information.append({"NCT ID": NCT_ID, "brief title": brief_title, "official title": official_title})
    return xml_information

def printresults(results):
    for result in results:
        print(result)

# get_files is the helper defined in the question
printresults(parse(get_files('my files path')))
```
If you need the output to be JSON, you can call json.dumps on each dictionary instead of printing it directly.
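As a rough sketch of that, assuming the results are plain dictionaries of strings (the records below are invented placeholders, not real dataset entries), you can either serialize each dictionary on its own or dump the whole list as one JSON array:

```python
import json

# Hypothetical records, shaped like the dictionaries built in parse().
results = [
    {"NCT ID": "NCT00000001", "brief title": "First example"},
    {"NCT ID": "NCT00000002", "brief title": "Second example"},
]

# Option 1: one JSON object per record, one per line.
lines = [json.dumps(result) for result in results]
for line in lines:
    print(line)

# Option 2: a single JSON array containing every record.
combined = json.dumps(results, indent=2)
print(combined)
```

Which one to use depends on what consumes the output: the first is handy for line-by-line processing, the second gives you a single valid JSON document.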

Note: If you have a lot of XML files, I would recommend using yield in the function instead of returning a whole list of dictionaries, so that results are produced one at a time and memory use stays low.
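A generator version could look roughly like this. To keep the sketch runnable without any files, `parse_lazy` is a hypothetical variant that takes XML strings and uses minidom.parseString, and the documents are made up; with real files you would loop over paths and call minidom.parse instead.

```python
import json
from xml.dom import minidom


def parse_lazy(xml_strings):
    """Yield one dictionary per XML document instead of building a full list."""
    for xml_text in xml_strings:
        tree = minidom.parseString(xml_text)
        yield {
            "NCT ID": tree.getElementsByTagName("nct_id")[0].firstChild.data,
            "brief title": tree.getElementsByTagName("brief_title")[0].firstChild.data,
        }


# Hypothetical in-memory documents standing in for the XML files.
docs = [
    "<clinical_study><nct_id>NCT00000001</nct_id>"
    "<brief_title>First example</brief_title></clinical_study>",
    "<clinical_study><nct_id>NCT00000002</nct_id>"
    "<brief_title>Second example</brief_title></clinical_study>",
]

# Each document is parsed only when the loop asks for the next record.
for record in parse_lazy(docs):
    print(json.dumps(record))
```

The calling code barely changes, since a for loop works the same over a generator as over a list. The difference is that only one parsed document needs to be held at a time.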