I am building a Twitter scraper in Python that scrapes my home timeline and creates a readable CSV file with the tweet ID, tweet creator, timestamp, and tweet content. Tweets often contain commas (the delimiter I am using), which is not an issue when the tweet content column is wrapped in single quotes (the quotechar I am using). However, due to the limitations of the Twitter API, some tweets contain both single quotes and commas, which confuses the CSV reader into treating commas within tweets as delimiters.
I have attempted to use regular expressions to remove or replace the single quotes that fall inside the outer quote characters I would like to keep, but I have not found a way to do so.
Here is what tweets.txt looks like:
ID,Creator,Timestamp,Tweet
1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,'At Adobe's summit, 'experience' was everywhere'
Here is my python script:
import csv

with open('tweets.txt', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',', quotechar="'")
    for line in csv_reader:
        print(line)
I would like to receive an output like this:
['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobe^s summit, ^experience^ was everywhere']
But currently, because the tweet content itself contains single quotes, the csv reader treats the commas inside it as delimiters and gives this output:
['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobes summit', " 'experience' was everywhere'"]
3 Answers
A solution is using regex. It’s not the best solution but it’s a start. I think there are other choices that could be made to avoid this problem to begin with, for example writing these records to a database, or properly escaping quotes when writing to the file.
Please note this solution is not flexible in any way. It also works only if the first three columns don’t have commas. But like I said, there are improvements to be made before getting to this point that would avoid this problem to begin with.
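The "properly escaping quotes when writing" option could look roughly like the sketch below. It assumes you control the code that writes tweets.txt; the sample row and the in-memory buffer are illustrative stand-ins, not part of the original script. By default, csv.writer doubles any embedded quote character, and csv.reader with the same settings undoes that.

```python
import csv
import io

# Illustrative sample row; the tweet text contains both commas and single quotes.
rows = [
    ["ID", "Creator", "Timestamp", "Tweet"],
    ["1112783967302844417", "twitteruser", "Mon Apr 01 18:29:06 +0000 2019",
     "At Adobe's summit, 'experience' was everywhere"],
]

# An in-memory buffer stands in for tweets.txt so the sketch is self-contained.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, delimiter=",", quotechar="'", quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

# With the default doublequote=True, each embedded quotechar is written twice.
written = buffer.getvalue()

# Reading back with the same delimiter and quotechar recovers the original fields.
buffer.seek(0)
parsed = list(csv.reader(buffer, delimiter=",", quotechar="'"))
for row in parsed:
    print(row)
```

If the file is written this way, the reader in the question should work unchanged, because the inner quotes no longer look like the end of the field.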
Since you have a non-standard input format, you should use your own parser. A simple RegEx is enough to parse the records.
For instance, the RegEx
"([^,]+),([^,]+),([^,]+),'?(.*?)'?$"
can parse the header and the tweets. A tweet can be quoted or not. Here is the code:
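A minimal sketch of how that pattern might be applied, one line at a time. The names RECORD and parse_records are hypothetical, and the sample text is copied from the question so the example runs without tweets.txt.

```python
import io
import re

# Pattern from the answer: three comma-free fields, then an optionally quoted tweet.
RECORD = re.compile(r"([^,]+),([^,]+),([^,]+),'?(.*?)'?$")


def parse_records(lines):
    """Yield a list of four fields for every line that matches RECORD."""
    for line in lines:
        match = RECORD.match(line.rstrip("\r\n"))
        if match:  # blank or malformed lines are skipped silently
            yield list(match.groups())


# Sample data copied from the question; a real script would instead iterate over
# open("tweets.txt", "r", encoding="utf-8").
sample = io.StringIO(
    "ID,Creator,Timestamp,Tweet\n"
    "1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,"
    "'At Adobe's summit, 'experience' was everywhere'\n"
)

records = list(parse_records(sample))
for record in records:
    print(record)
```

The lazy `(.*?)` followed by `'?$` means only the outermost pair of quotes is stripped, so quotes and commas inside the tweet are kept as-is.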
Don’t forget to specify the file encoding (I made the assumption it is “utf-8”)…
In this case where you know the number of columns in your CSV, and where only the last is free text containing commas, you could use Python’s string methods:
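One way this might be written is with str.split and its maxsplit argument, so that only the first three commas act as delimiters. The helper name split_record and its columns parameter are hypothetical, chosen for illustration.

```python
def split_record(line, columns=4):
    """Split on the first columns-1 commas; the remainder is the free-text field."""
    fields = line.rstrip("\r\n").split(",", columns - 1)
    tweet = fields[-1]
    # Drop one wrapping single quote from each end, if both are present.
    if len(tweet) >= 2 and tweet.startswith("'") and tweet.endswith("'"):
        fields[-1] = tweet[1:-1]
    return fields


# Sample lines copied from the question.
header = "ID,Creator,Timestamp,Tweet"
line = ("1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,"
        "'At Adobe's summit, 'experience' was everywhere'")

header_fields = split_record(header)
fields = split_record(line)
print(header_fields)
print(fields)
```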
This code has many limitations, and will fit your current case only.