
I have a script that collects JSON data from Twitter’s API. It runs every minute, parses the response with jq, and appends the result to a single file that ends up looking like the following:

[
  {"text": "Tweet 01",
   "id": "001"
  },
  {"text": "Tweet 02",
   "id": "002"
  },
  {"text": "Tweet 03",
   "id": "003"
  }
]
[
  {"text": "Tweet 04",
   "id": "004"
  },
  {"text": "Tweet 05",
   "id": "005"
  },
  {"text": "Tweet 06",
   "id": "006"
  },
  {"text": "Tweet 07",
   "id": "007"
  },
  {"text": "Tweet 08",
   "id": "008"
  }
]
[
  {"text": "Tweet 09",
   "id": "009"
  },
  {"text": "Tweet 10",
   "id": "010"
  }
]

I’ve previously had a single JSON list per file, which Pandas handles easily. But how can I efficiently iterate over these multiple lists, which are NOT comma-separated and NOT necessarily the same length?

My ultimate goal is to aggregate ALL the JSON data from this one file and convert it to a CSV file, where each column is a key in the JSON data. It should end up looking like:

text, id
Tweet 01, 001
Tweet 02, 002
Tweet 03, 003
Tweet 04, 004
Tweet 05, 005
Tweet 06, 006
Tweet 07, 007
Tweet 08, 008
Tweet 09, 009
Tweet 10, 010

If I try to read the file as-is, the following occurs:

>>> import pandas as pd
>>> df = pd.read_json("sample.json")
>>> df.head()
Traceback (most recent call last):
  File "lists.py", line 3, in <module>
    df = pd.read_json("sample.json")
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/util/_decorators.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 608, in read_json
    result = json_reader.read()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 731, in read
    obj = self._get_object_parser(self.data)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 857, in parse
    self._parse_no_numpy()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Trailing data

3 Answers


    • The file contents need to be converted to a standard list by reading the file in and combining the individual lists into a single list.
    • .readlines reads each row of the file in as a list of strings.
      • Use a list comprehension to iterate through each row, removing whitespace and newlines from the front and tail of each string with str.strip.
    • str.join combines the items in a list into a single string.
    • Use str.replace to replace '][' with ','.
    • Use ast.literal_eval to convert the string back to a list.
    from ast import literal_eval
    import pandas as pd
    
    # open and clean the contents of the file
    with open('test.json', 'r') as f:
        data = literal_eval(''.join([row.strip() for row in f.readlines()]).replace('][', ','))
    
    # print(data)
    [{'text': 'Tweet 01', 'id': '001'},
     {'text': 'Tweet 02', 'id': '002'},
     {'text': 'Tweet 03', 'id': '003'},
     {'text': 'Tweet 04', 'id': '004'},
     {'text': 'Tweet 05', 'id': '005'},
     {'text': 'Tweet 06', 'id': '006'},
     {'text': 'Tweet 07', 'id': '007'},
     {'text': 'Tweet 08', 'id': '008'},
     {'text': 'Tweet 09', 'id': '009'},
     {'text': 'Tweet 10', 'id': '010'}]
    
    # load into pandas
    df = pd.json_normalize(data)
    
    # display(df)
           text   id
    0  Tweet 01  001
    1  Tweet 02  002
    2  Tweet 03  003
    3  Tweet 04  004
    4  Tweet 05  005
    5  Tweet 06  006
    6  Tweet 07  007
    7  Tweet 08  008
    8  Tweet 09  009
    9  Tweet 10  010
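
    To carry this through to the CSV file described in the question, the normalized DataFrame can be written out with to_csv — a minimal sketch, using a two-record subset of the cleaned data and a placeholder filename tweets.csv:

    ```python
    import pandas as pd

    # a small subset of the cleaned records produced by the steps above
    data = [{'text': 'Tweet 01', 'id': '001'},
            {'text': 'Tweet 02', 'id': '002'}]

    df = pd.json_normalize(data)

    # index=False drops the 0..n row labels so only the data columns are written
    df.to_csv('tweets.csv', index=False)
    ```

    Because the ids stay as strings all the way through, the leading zeros survive into the CSV.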
    
  1. As @Trenton McKinney said, you need to clean the data. You could use f.read() to get the file contents as a single string, then str.replace() to replace ']\n[' with ',' (those ']\n[' boundaries between the lists are what cause the ValueError), and finally
     try pd.read_json:

    with open('data.json') as f:
        data = f.read()
    data = data.replace(']\n[', ',')

    df = pd.read_json(data)
    

    Output:

    df
           text  id
    0  Tweet 01   1
    1  Tweet 02   2
    2  Tweet 03   3
    3  Tweet 04   4
    4  Tweet 05   5
    5  Tweet 06   6
    6  Tweet 07   7
    7  Tweet 08   8
    8  Tweet 09   9
    9  Tweet 10  10
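
    One caveat, visible in the output above: read_json infers numeric dtypes, so the "001" strings come out as the integers 1..10. If the leading zeros matter for the final CSV, the dtype parameter can pin the id column to strings — a sketch using a hypothetical two-record cleaned string (StringIO is used because newer pandas deprecates passing a literal JSON string):

    ```python
    import pandas as pd
    from io import StringIO

    # hypothetical cleaned string, as produced by the replace() step above
    cleaned = '[{"text": "Tweet 01", "id": "001"}, {"text": "Tweet 02", "id": "002"}]'

    # dtype={'id': str} keeps the id column as strings, preserving leading zeros
    df = pd.read_json(StringIO(cleaned), dtype={'id': str})
    ```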
    
  2. As long as the original json file fits in memory, you can use JSONDecoder.raw_decode to pull the JSON lists out, one at a time. raw_decode returns the length of data consumed by the parse, so you can trim the string and continue until no data remains.

    import csv
    import json
    with open('data.json') as f:
        rawdata = f.read()
    
    decoder = json.JSONDecoder().raw_decode
    
    with open('data.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=["text", "id"])
        writer.writeheader()
        while rawdata:
            data, idx = decoder(rawdata)
            writer.writerows(data)
            rawdata = rawdata[idx:].lstrip()
    
    print(open('data.csv').read())
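
    The mechanism that makes the loop work is that raw_decode returns a (object, end_index) pair instead of requiring the whole string to be a single JSON document — a minimal sketch on a hypothetical two-list string:

    ```python
    import json

    raw = '[{"id": "001"}]\n[{"id": "002"}]'
    decode = json.JSONDecoder().raw_decode

    chunks = []
    while raw:
        # raw_decode returns (parsed_object, index_where_parsing_stopped)
        obj, end = decode(raw)
        chunks.append(obj)
        # drop the consumed text plus any whitespace before the next list
        raw = raw[end:].lstrip()

    # chunks is now [[{'id': '001'}], [{'id': '002'}]]
    ```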
    