
I’m trying to retrieve values from different levels of a JSON file. My current approach feels clumsy: I reach the values of one dictionary nested inside another by looping with nested for loops. I want to collect all the "title" and "question" values and put them in a list or a pandas DataFrame. How can I retrieve these values in a simpler way? And how can JSON files be handled efficiently in general?
Thanks a lot to anyone who answers :)

Here’s a piece of the JSON:

{
    "contact": "xxx",
    "version": 1.0,
    "data": [
        {
            "title": "anges-musiciens-(national-gallery)",
            "paragraphs": [
                {
                    "qas": [
                        {
                            "answers": [{
                                    "text": "La Vierge aux rochers"
                                }
                            ],
                            
                            "question": "Que concerne principalement les documents ?"
                        }
                    ]
                }
            ]
        }
    ]
}

And here’s my current code:

titles = []
questions = []

for i in data["data"]:
    titles.append(i["title"])

    for p in i["paragraphs"]:
        for q in p["qas"]:
            questions.append(q["question"])
    
print(titles)
print(questions)

3 Answers


  1. You can use recursion to perform a depth-first search on the nested structure:

    def extract_fields(json_data, fields_of_interest=(), extracted=None):
        # Default to an empty tuple so the membership test below never hits None
        if extracted is None:
            extracted = {}
        if isinstance(json_data, dict):
            for field, value in json_data.items():
                if field in fields_of_interest:
                    # Collect the value under its key name
                    extracted.setdefault(field, []).append(value)
                elif isinstance(value, (dict, list)):
                    # Descend into nested containers
                    extract_fields(value, fields_of_interest, extracted)
        elif isinstance(json_data, list):
            for x in json_data:
                extract_fields(x, fields_of_interest, extracted)
        return extracted
    
    j = {'title': 'abc',
         'deep': {'question': 'zyx',
                  'deeper': [{'title': 'def',
                              'question': 'wvu',
                              'nothing': 'hahaha'},
                             {'even deeper': [{'title': 'ghi',
                                               'question':'tsr',
                                               'answer': 42},
                                              {'not a title': "ceci n'est pas une pipe"}]}]}}
    
    extracted = extract_fields(j, ('title', 'question'))
    
    print(extracted)
    # {'title': ['abc', 'def', 'ghi'], 'question': ['zyx', 'wvu', 'tsr']}
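
A possible variant of the same recursive idea, written as a generator. This is only a sketch: the name iter_fields and the small sample dictionary are made up for illustration. Unlike the function above, it keeps descending even when a key matches, which may matter if a matched value is itself a nested container.

```python
def iter_fields(node, wanted):
    """Yield (key, value) pairs for every wanted key, depth-first."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted:
                yield key, value
            # Keep descending even when the key matched, in case the
            # matched value is itself a container holding wanted keys.
            yield from iter_fields(value, wanted)
    elif isinstance(node, list):
        for item in node:
            yield from iter_fields(item, wanted)


# Hypothetical sample with the same shape as the JSON in the question
sample = {
    "data": [
        {
            "title": "t1",
            "paragraphs": [
                {"qas": [{"question": "q1"}, {"question": "q2"}]}
            ],
        }
    ]
}

pairs = list(iter_fields(sample, {"title", "question"}))
print(pairs)
# [('title', 't1'), ('question', 'q1'), ('question', 'q2')]
```

Because it yields pairs lazily, the caller can decide whether to build lists, a dictionary, or rows for a DataFrame.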
    
  2. If the structure is regular (i.e. always the same hierarchy patterns and no missing keys when a dictionary is present), then you can obtain your results with nested list comprehensions:

    titles    = [d["title"] for d in data["data"]]
    questions = [q["question"] for d in data.get("data",[])
                               for p in d.get("paragraphs",[])
                               for q in p.get("qas",[])]
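
If each question should stay attached to the title it belongs to (for example, to build table rows later), the same nested comprehension can emit pairs instead of two separate lists. A minimal sketch, assuming the regular structure from the question; the sample values below are invented for illustration.

```python
# Hypothetical sample with the same hierarchy as the JSON in the question
data = {
    "data": [
        {"title": "t1", "paragraphs": [{"qas": [{"question": "q1"},
                                                {"question": "q2"}]}]},
        {"title": "t2", "paragraphs": [{"qas": [{"question": "q3"}]}]},
    ]
}

# One (title, question) tuple per question, in document order
rows = [
    (d["title"], q["question"])
    for d in data.get("data", [])
    for p in d.get("paragraphs", [])
    for q in p.get("qas", [])
]

print(rows)
# [('t1', 'q1'), ('t1', 'q2'), ('t2', 'q3')]
```

A list of tuples like this can be passed to pandas.DataFrame together with a columns argument such as ["title", "question"].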
    

    If the structure is not regular, you will need to keep track of new entries as you progress deeper and deeper in the structure. You can do this with a list (or a queue):

    titles    = []
    questions = []
    more      = [*data.items()]  # start with key/values of first level dictionary
    while more:
        key,value = more.pop(0)            # get next key/value pair to process
        if isinstance(value,list):         # if value is a list
            more.extend(enumerate(value))  # add key/values using indexes as keys
        elif isinstance(value,dict):       # if value is a dictionary
            more.extend(value.items())     # add more key/values from its items
        elif key == "title":               # for "title" key, add to titles list
            titles.append(value)
        elif key == "question":            # same for "question" keys
            questions.append(value)
    

    output:

    print(titles)
    # ['anges-musiciens-(national-gallery)']
    
    print(questions)
    # ['Que concerne principalement les documents ?']
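
On large inputs, list.pop(0) shifts every remaining element on each call, so a collections.deque may be a better fit for the queue. The following is a sketch of the same traversal with a deque; the function name collect, its wanted parameter, and the sample data are assumptions made for illustration.

```python
from collections import deque


def collect(data, wanted=("title", "question")):
    """Breadth-first walk that gathers scalar values stored under wanted keys."""
    found = {name: [] for name in wanted}
    pending = deque(data.items())          # key/value pairs still to visit
    while pending:
        key, value = pending.popleft()     # O(1), unlike list.pop(0)
        if isinstance(value, list):
            pending.extend(enumerate(value))   # list indexes act as keys
        elif isinstance(value, dict):
            pending.extend(value.items())
        elif key in found:
            found[key].append(value)
    return found


# Hypothetical sample with the same shape as the JSON in the question
sample = {
    "data": [
        {
            "title": "t1",
            "paragraphs": [
                {"qas": [{"question": "q1"}, {"question": "q2"}]}
            ],
        }
    ]
}

result = collect(sample)
print(result)
# {'title': ['t1'], 'question': ['q1', 'q2']}
```

Since the queue is processed level by level, values come out in breadth-first order, which can differ from the order produced by a recursive depth-first walk.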
    
  3. If you want to return a DataFrame, you can use pd.json_normalize:

    import pandas as pd

    data = {
        "contact": "xxx",
        "version": 1.0,
        "data": [
            {
                "title": "anges-musiciens-(national-gallery)",
                "paragraphs": [
                    {
                        "qas": [
                            {
                                "answers": [{
                                        "text": "La Vierge aux rochers"
                                    }
                                ],
                                
                                "question": "Que concerne principalement les documents ?"
                            }
                        ]
                    }
                ]
            }
        ]
    }
    
    df = pd.json_normalize(data['data'], ['paragraphs', 'qas'], 'title')[['title', 'question']]
    print(df)
    
                                    title                                     question
    0  anges-musiciens-(national-gallery)  Que concerne principalement les documents ?
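
The snippet above defines data inline. When the JSON lives in a file, it would typically be loaded with the standard json module first. A minimal sketch, assuming a UTF-8 encoded file; the file name sample.json and its contents are placeholders created only for this example.

```python
import json
import tempfile
from pathlib import Path

# Placeholder content standing in for the real JSON file
sample = {"version": 1.0, "data": [{"title": "t1", "paragraphs": []}]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.json"   # hypothetical file name
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

    # json.load parses the file into plain dicts and lists
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)

titles = [d["title"] for d in data["data"]]
print(titles)
# ['t1']
```

The resulting dictionary can then be handed to pd.json_normalize exactly as in the answer above.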
    