
I am building a Twitter scraper in Python that scrapes my home timeline and creates a readable CSV file with the tweet ID, tweet creator, timestamp, and tweet content. Tweets often contain commas (the delimiter I am using), which is not an issue when the tweet content column is wrapped in single quotes (the quotechar I am using). However, due to the limitations of the Twitter API, some tweets contain single quotes and commas, which confuses the CSV reader into treating commas within tweets as delimiters.

I have attempted to use regular expressions to remove or replace the single quotes inside the quote characters I would like to keep, but I have not found a way to do so.

Here is what tweets.txt looks like:

ID,Creator,Timestamp,Tweet
1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,'At Adobe's summit, 'experience' was everywhere'

Here is my python script:

import csv

with open('tweets.txt', 'r') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',', quotechar="'")
    for line in csv_reader:
        print(line)

I would like to receive an output like this:

['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobe^s summit, ^experience^ was everywhere']

But currently, because the tweet content itself contains single quotes, the CSV reader treats the commas inside the tweet as delimiters and gives this output:

['ID', 'Creator', 'Timestamp', 'Tweet']
['1112783967302844417', 'twitteruser', 'Mon Apr 01 18:29:06 +0000 2019', 'At Adobes summit', " 'experience' was everywhere'"]

3 Answers


  1. A solution is to use a regex. It’s not the best solution, but it’s a start. There are other choices that could be made to avoid this problem to begin with, for example writing these records to a database, or properly escaping the quotes when writing the file.

    import re
    
    line_pattern = r'([^,]*),([^,]*),([^,]*),(.*)'  # three comma-free fields, then everything else
    
    with open('tweets.txt', 'r') as csv_file:
        for line in csv_file:
    
            match_obj = re.match(line_pattern, line)
    
            id_ = match_obj.group(1)
            creator = match_obj.group(2)
            timestamp = match_obj.group(3)
            tweet = match_obj.group(4).strip("'")  # clean quotes off ends
    
            print([id_, creator, timestamp, tweet])
    

    Please note this solution is not flexible in any way. It also works only if the first three columns don’t contain commas. But as I said, there are improvements to be made before this point that would avoid the problem to begin with; the sketch below shows one of them.
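
    As a rough illustration of the escaping route (a minimal sketch, assuming you control the code that writes the file; the name tweets_escaped.txt is made up for this example), Python’s csv.writer with doublequote=True (the default) doubles any quotechar that appears inside a field, so the matching csv.reader can round-trip the data:

    import csv

    row = ['1112783967302844417', 'twitteruser',
           'Mon Apr 01 18:29:06 +0000 2019',
           "At Adobe's summit, 'experience' was everywhere"]

    with open('tweets_escaped.txt', 'w', newline='') as f:
        # inner ' characters are written as '' so they no longer
        # terminate the quoted field
        writer = csv.writer(f, delimiter=',', quotechar="'")
        writer.writerow(['ID', 'Creator', 'Timestamp', 'Tweet'])
        writer.writerow(row)

    with open('tweets_escaped.txt', 'r', newline='') as f:
        for record in csv.reader(f, delimiter=',', quotechar="'"):
            print(record)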

  2. Since you have a non-standard input format, you should use your own parser. For instance, you can use a simple RegEx to parse the records.

    For instance, the RegEx "([^,]+),([^,]+),([^,]+),'?(.*?)'?$" can parse the header and the tweets. A tweet can be quoted or not (see the sketch at the end of this answer).

    Here is the code:

    import re
    
    match_record = re.compile(r"([^,]+),([^,]+),([^,]+),'?(.*?)'?$").match  # quotes around the tweet are optional
    
    with open('tweets.txt', mode='r', encoding="utf-8") as csv_file:
        for line in csv_file:
            line = line.strip()
            mo = match_record(line)
            record = mo.groups()
            print(record)
    

    Don’t forget to specify the file encoding (I assumed it is “utf-8”)…
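
    To illustrate the “quoted or not” point, here is a quick sketch (the second record is a made-up example) showing that the same pattern handles both forms:

    import re

    match_record = re.compile(r"([^,]+),([^,]+),([^,]+),'?(.*?)'?$").match

    samples = [
        "1112783967302844417,twitteruser,Mon Apr 01 18:29:06 +0000 2019,"
        "'At Adobe's summit, 'experience' was everywhere'",
        # hypothetical unquoted tweet, commas and all
        "42,someuser,Tue Apr 02 09:00:00 +0000 2019,no quotes here, just commas",
    ]
    for line in samples:
        print(match_record(line).groups())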

  3. In this case where you know the number of columns in your CSV, and where only the last is free text containing commas, you could use Python’s string methods:

    with open('tweets.txt', 'r') as file:
        for line in file:
            l = (line.strip()                  # Get rid of newlines
                     .split(",", 3))           # Get four columns
            l[-1] = (l[-1].strip("'")          # Remove flanking single quotes
                          .replace("'", "^"))  # Replace inner single quotes if required
            print(l)
    

    This code has many limitations and will fit your current case only; the example below shows one way it breaks.
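
    For instance (a hypothetical record, only to show the failure mode), a comma in any of the first three columns shifts every field after it:

    line = "1,surname, firstname,Mon Apr 01 18:29:06 +0000 2019,'a tweet'"
    print(line.split(",", 3))
    # ['1', 'surname', ' firstname', "Mon Apr 01 18:29:06 +0000 2019,'a tweet'"]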
