How to deal with quotecharacters within quotecharacters in CSV files? - Twitter API

RoanMartin
April 1, 2019

I am building a Twitter scraper in Python that reads my home timeline and writes a readable CSV file with the tweet ID, tweet creator, timestamp, and tweet content. Tweets often contain commas (the delimiter I am using), which is not an issue when the tweet content column is wrapped in single quotes (the quotechar I am using). However, due to the limitations of the Twitter API, some tweets contain both single quotes and commas, which confuses the CSV reader into treating commas within tweets as delimiters.

I have tried using regular expressions to remove or replace the single quotes inside the tweet while keeping the outer quote characters, but I have not found a way to do so.

Here is what tweets.txt looks like:

ID,Creator,Timestamp,Tweet
1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,'At Adobe's summit, 'experience' was everywhere'

Here is my python script:

import csv

with open ('tweets.txt','r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter = ',', quotechar="'")
    for line in csv_reader:
        print(line)

I would like to receive an output like this:

['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobe^s summit, ^experience^ was everywhere']

But currently, because the tweet content itself contains single quotes, the csv reader treats the commas inside the tweet as delimiters and gives this output:

['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobes summit', " 'experience' was everywhere'"]

Answers

- double_j
- April 1, 2019 at 10:06 pm
- 0 votes
One solution is to use a regex. It’s not the best solution, but it’s a start. I think there are other choices that would avoid this problem to begin with, for example writing these records to a database, or properly escaping quotes when writing to the file.
```
import re

# Three comma-free columns, then the rest of the line as the tweet
line_pattern = r'([^,]*),([^,]*),([^,]*),(.*)'

with open('tweets.txt', 'r') as csv_file:
    for line in csv_file:
        match_obj = re.match(line_pattern, line)
        if match_obj is None:  # skip blank or malformed lines
            continue

        id_ = match_obj.group(1)
        creator = match_obj.group(2)
        timestamp = match_obj.group(3)
        tweet = match_obj.group(4).strip("'")  # clean quotes off ends

        print([id_, creator, timestamp, tweet])
```
Please note this solution is not flexible in any way. It also works only if the first three columns don’t contain commas. But like I said, there are improvements to be made before getting to this point that would avoid this problem to begin with.
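The "properly escaping quotes when writing" option mentioned in this answer can be sketched as follows. This is a minimal illustration, assuming the scraper writes rows through Python's `csv.writer` rather than joining strings by hand; `write_rows` and `read_rows` are hypothetical helper names, and the sample row is the one from the question. With the writer's default `doublequote` behaviour, each single quote inside the tweet is written twice, and a reader configured with the same `quotechar` turns it back into one.

```python
import csv
import io

# Sample data taken from the question; the helper names are hypothetical.
ROWS = [
    ["ID", "Creator", "Timestamp", "Tweet"],
    ["1112783967302844417", "twitteruser",
     "Mon Apr 01 18:29:06 +0000 2019",
     "At Adobe's summit, 'experience' was everywhere"],
]


def write_rows(rows, quotechar="'"):
    """Serialize rows to CSV text; embedded quote characters get doubled."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=",", quotechar=quotechar)
    writer.writerows(rows)
    return buf.getvalue()


def read_rows(text, quotechar="'"):
    """Parse CSV text written by write_rows back into lists of strings."""
    buf = io.StringIO(text, newline="")
    return list(csv.reader(buf, delimiter=",", quotechar=quotechar))


csv_text = write_rows(ROWS)
print(csv_text)             # the tweet field is quoted, inner quotes doubled
print(read_rows(csv_text))  # the original rows come back unchanged
```

When writing to a real file instead of an in-memory buffer, the `csv` documentation recommends opening it with `newline=""`.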

- LaurentLAPORTE
- April 1, 2019 at 10:07 pm
- 0 votes
Since you have a non-standard input format, you should use your own parser. For instance, you can use a simple RegEx to parse the records.

The RegEx "([^,]+),([^,]+),([^,]+),'?(.*?)'?$" can parse both the header and the tweets. The tweet field can be quoted or not.

Here is the code:
```
import re

# Three comma-free columns, then an optionally quoted tweet
match_record = re.compile(r"([^,]+),([^,]+),([^,]+),'?(.*?)'?$").match

with open('tweets.txt', mode='r', encoding="utf-8") as csv_file:
    for line in csv_file:
        line = line.strip()
        mo = match_record(line)
        if mo is None:  # skip blank or malformed lines
            continue
        record = mo.groups()
        print(record)
```
Don’t forget to specify the file encoding (I made the assumption it is “utf-8”)…

- FabienP
- April 1, 2019 at 10:09 pm
- 0 votes
In this case where you know the number of columns in your CSV, and where only the last is free text containing commas, you could use Python’s string methods:
```
with open('tweets.txt', 'r') as file:
    for line in file:
        fields = (line.strip()                  # Get rid of newlines
                      .split(",", 3))           # Get four columns
        fields[-1] = (fields[-1].strip("'")          # Remove flanking single quotes
                                .replace("'", "^"))  # Replace inner single quotes if required
        print(fields)
```
This code has many limitations, and will fit your current case only.
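One such limitation is that `strip("'")` removes every leading and trailing single quote, so a tweet that itself starts or ends with a quote loses characters. A minimal sketch of a slightly stricter variant is below; `parse_line` is a hypothetical helper, and it still assumes that only the last column can contain commas. It removes exactly one pair of flanking quotes and leaves unquoted fields (such as the header) untouched.

```python
def parse_line(line, n_columns=4, quotechar="'"):
    """Split a line into n_columns fields; only the last may contain commas.

    Hypothetical helper: assumes the first n_columns - 1 fields are
    comma-free and that the last field is wrapped in at most one pair
    of quote characters.
    """
    fields = line.rstrip("\r\n").split(",", n_columns - 1)
    last = fields[-1]
    # Remove exactly one pair of flanking quotes, not every quote at the ends.
    if len(last) >= 2 and last.startswith(quotechar) and last.endswith(quotechar):
        fields[-1] = last[1:-1]
    return fields


sample = ("1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,"
          "'At Adobe's summit, 'experience' was everywhere'\n")
print(parse_line(sample))
```

Inner quotes are kept as they are here; the `.replace("'", "^")` step from the answer can still be applied to the last field afterwards if that output is wanted.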