
How to get values from a JSON file efficiently with Python?

Totoro_D
March 6, 2023
3 Answers

I'm trying to retrieve values from different layers of a JSON file. My current approach is quite clumsy: I loop through one dictionary nested inside another to get the values. I want to collect all the "title" and "question" values and put them in a list or a pandas DataFrame. How can I retrieve these values in a simpler way, and how should JSON files be handled efficiently in general?
Thanks a lot to anyone who answers the question :)

Here is a piece of the JSON:

{
    "contact": "xxx",
    "version": 1.0,
    "data": [
        {
            "title": "anges-musiciens-(national-gallery)",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "answers": [{
                                    "text": "La Vierge aux rochers"
                                }
                            ],
                            
                            "question": "Que concerne principalement les documents ?"
                        }
                    ]
                }
            ]
        }
    ]
}

# data is the dictionary obtained by loading the JSON file (e.g. with json.load)
titles = []
questions = []

for i in data["data"]:
    titles.append(i["title"])

    for p in i["paragraphs"]:
        for q in p["qas"]:
            questions.append(q["question"])
    
print(titles)
print(questions)

Tags: json python

Answers

You can use recursion to perform a depth-first search on the nested structure:

def extract_fields(json_data, fields_of_interest=(), extracted=None):
    # Collect values for the requested keys from any depth of nested dicts/lists.
    if extracted is None:
        extracted = {}
    if isinstance(json_data, dict):
        for field, value in json_data.items():
            if field in fields_of_interest:
                # Matching key: record the value and do not descend into it
                extracted.setdefault(field, []).append(value)
            elif isinstance(value, (dict, list)):
                extract_fields(value, fields_of_interest, extracted)
    elif isinstance(json_data, list):
        for x in json_data:
            extract_fields(x, fields_of_interest, extracted)
    return extracted

j = {'title': 'abc',
     'deep': {'question': 'zyx',
              'deeper': [{'title': 'def',
                          'question': 'wvu',
                          'nothing': 'hahaha'},
                         {'even deeper': [{'title': 'ghi',
                                           'question':'tsr',
                                           'answer': 42},
                                          {'not a title': "ceci n'est pas une pipe"}]}]}}

extracted = extract_fields(j, ('title', 'question'))

print(extracted)
# {'title': ['abc', 'def', 'ghi'], 'question': ['zyx', 'wvu', 'tsr']}
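The same depth-first idea can also be written as a generator, which may be convenient when only one key is needed at a time. This is a minimal sketch, not part of the answer above: the name `iter_values` and the `sample` dictionary are made up for illustration, with `sample` loosely following the shape of the JSON in the question.

```python
def iter_values(node, key):
    """Yield every value stored under `key`, at any depth, in document order."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v                          # matching key: emit, do not descend
            else:
                yield from iter_values(v, key)   # scalars yield nothing
    elif isinstance(node, list):
        for item in node:
            yield from iter_values(item, key)

# Hypothetical sample data shaped like the JSON in the question
sample = {
    "data": [
        {"title": "t1", "paragraphs": [{"qas": [{"question": "q1"}, {"question": "q2"}]}]},
        {"title": "t2", "paragraphs": [{"qas": [{"question": "q3"}]}]},
    ]
}

titles = list(iter_values(sample, "title"))
questions = list(iter_values(sample, "question"))
print(titles)     # ['t1', 't2']
print(questions)  # ['q1', 'q2', 'q3']
```

Because it is a generator, nothing is collected until it is consumed, so it can be wrapped in `list(...)` as here or iterated lazily.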

If the structure is regular (i.e. always the same hierarchy patterns and no missing keys when a dictionary is present), then you can obtain your results with nested list comprehensions:

titles    = [d["title"] for d in data["data"]]
questions = [q["question"] for d in data.get("data",[])
                           for p in d.get("paragraphs",[])
                           for q in p.get("qas",[])]
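If each question should stay associated with the title it belongs to, the same kind of comprehension can produce (title, question) pairs. This is only a sketch under the same "regular structure" assumption; `sample` is invented data shaped like the JSON in the question.

```python
# Hypothetical sample data shaped like the JSON in the question
sample = {
    "data": [
        {"title": "t1", "paragraphs": [{"qas": [{"question": "q1"}, {"question": "q2"}]}]},
        {"title": "t2", "paragraphs": [{"qas": [{"question": "q3"}]}]},
    ]
}

# One (title, question) tuple per question, keeping the parent title alongside
rows = [
    (d["title"], q["question"])
    for d in sample["data"]
    for p in d["paragraphs"]
    for q in p["qas"]
]
print(rows)
# [('t1', 'q1'), ('t1', 'q2'), ('t2', 'q3')]
```

A list of tuples like this can be passed to `pandas.DataFrame(rows, columns=["title", "question"])` if a DataFrame is wanted.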

If the structure is not regular, you will need to keep track of new entries as you progress deeper and deeper in the structure. You can do this with a list (or a queue):

titles    = []
questions = []
more      = [*data.items()]  # start with key/values of first level dictionary
while more:
    key,value = more.pop(0)            # get next key/value pair to process
    if isinstance(value,list):         # if value is a list
        more.extend(enumerate(value))  # add key/values using indexes as keys
    elif isinstance(value,dict):       # if value is a dictionary
        more.extend(value.items())     # add more key/values from its items
    elif key == "title":               # for "title" key, add to titles list
        titles.append(value)
    elif key == "question":            # same for "question" keys
        questions.append(value)

Output:

print(titles)
# ['anges-musiciens-(national-gallery)']

print(questions)
# ['Que concerne principalement les documents ?']
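For large files, `list.pop(0)` shifts every remaining element on each call, so a real queue may be preferable. The sketch below is a variation on the loop above using `collections.deque`; the function name `collect`, its `wanted` parameter, and the `sample` data are assumptions made for illustration.

```python
from collections import deque

def collect(data, wanted=("title", "question")):
    """Breadth-first walk over nested dicts/lists, gathering values of wanted keys."""
    found = {name: [] for name in wanted}
    queue = deque(data.items())            # start with the top-level key/value pairs
    while queue:
        key, value = queue.popleft()       # O(1), unlike list.pop(0)
        if isinstance(value, list):
            queue.extend(enumerate(value)) # list indexes stand in as keys
        elif isinstance(value, dict):
            queue.extend(value.items())
        elif key in found:
            found[key].append(value)
    return found

# Hypothetical sample data shaped like the JSON in the question
sample = {
    "data": [
        {"title": "t1", "paragraphs": [{"qas": [{"question": "q1"}]}]},
        {"title": "t2", "paragraphs": [{"qas": [{"question": "q2"}]}]},
    ]
}
result = collect(sample)
print(result)  # {'title': ['t1', 't2'], 'question': ['q1', 'q2']}
```

Note that a breadth-first walk visits shallower values first, so when the wanted keys appear at different depths the results come out grouped by depth rather than in document order.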

If you want a DataFrame, pd.json_normalize can flatten the nested records directly:

import pandas as pd

data = {
    "contact": "xxx",
    "version": 1.0,
    "data": [
        {
            "title": "anges-musiciens-(national-gallery)",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "answers": [{
                                    "text": "La Vierge aux rochers"
                                }
                            ],
                            
                            "question": "Que concerne principalement les documents ?"
                        }
                    ]
                }
            ]
        }
    ]
}

df = pd.json_normalize(data['data'], ['paragraphs', 'qas'], 'title')[['title', 'question']]
print(df)

                                title                                     question
0  anges-musiciens-(national-gallery)  Que concerne principalement les documents ?
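All of the snippets above assume that `data` is already a Python dictionary. As a rough sketch of the loading step, the standard `json` module can read it from a file; the file name `sample.json` and its contents here are made up, and a temporary directory is used only so the example runs on its own.

```python
import json
import os
import tempfile

# Hypothetical file contents shaped like the JSON in the question
text = '{"data": [{"title": "t1", "paragraphs": [{"qas": [{"question": "q1"}]}]}]}'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")   # assumed file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # json.load parses the whole file into nested dicts and lists
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

print(type(data).__name__)       # dict
print(data["data"][0]["title"])  # t1
```

Passing `encoding="utf-8"` explicitly is a sensible default for JSON containing accented text such as the French strings in the question.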
