
I am working with Twitter data in Notepad++ (XML files) and I’m trying to remove the retweets. Each RT starts with ‘<tweet id’, contains ‘>RT @’, and ends with ‘</tweet’. Unfortunately, due to Twitter’s API terms and conditions, I can’t share examples from the data, so hopefully this gives you enough info to help.

The problem I’m having is that sometimes the metadata between ‘<tweet id’ and ‘>RT @’ spans multiple lines, and I can’t find a regex that captures RTs on both single and multiple lines.

This is the regex I have, which captures single-line RTs:

(<tweet id).+?(>RT @).+?(/tweet>)

Does anyone have any ideas on what I can add to it so that it also picks up RTs (and their accompanying metadata) that span multiple lines?

Example RTs. I’ve altered some of the names and the content, but the format remains the same. Note that there are two examples below; the second one contains an emoji:

```
<tweet id='827364918734' createdAt='2011-01-16T18:13:02.000Z' language='en' authorId='673829' authorUsername='exampleuser' authorName='example' authorVerified='TRUE' authorDescription='example description' authorLocation='example location' authorCreatedAt='2009-05-10T05:02:51.000Z' authorFollowersCount='830211' authorFollowingCount='1763' authorTweetCount='34209' authorListedCount='7589' referencedTweetId='26690653563912192' referencedTweetCreatedAt='2011-01-16T17:22:02.000Z' referencedTweetText='example reference tweet text' referencedTweetRetweetCount='9' referencedTweetReplyCount='0' referencedTweetLikeCount='2' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example' referencedTweetAuthorVerified='TRUE' referencedTweetAuthorDescription='example description

Check out @example, our new example' referencedTweetAuthorLocation='example' referencedTweetAuthorCreatedAt='2008-08-27T15:24:02.000Z' referencedTweetAuthorFollowersCount='1380523' referencedTweetAuthorFollowingCount='1035' referencedTweetAuthorTweetCount='402492' referencedTweetAuthorListedCount='22425' retweetCount='9' replyCount='0' likeCount='0' quoteCount='0' >RT @example this is an example RT </tweet>
```



This is an example with an emoji:

```
<tweet id='1783646' createdAt='2010-01-26T19:38:13.000Z' language='en' authorId='djsjchk' authorUsername='example' authorName='example' authorVerified='FALSE' authorDescription='example' authorLocation='example' authorCreatedAt='2009-06-26T19:50:16.000Z' authorFollowersCount='647' authorFollowingCount='204' authorTweetCount='6045' authorListedCount='31' referencedTweetId='8247516385' referencedTweetCreatedAt='2010-01-26T19:36:15.000Z' referencedTweetText='example' referencedTweetRetweetCount='1' referencedTweetReplyCount='0' referencedTweetLikeCount='0' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example 😷' referencedTweetAuthorVerified='FALSE' referencedTweetAuthorDescription='examples. #TCSC' referencedTweetAuthorLocation='Find me at' referencedTweetAuthorCreatedAt='2010-01-23T20:05:52.000Z' referencedTweetAuthorFollowersCount='25803' referencedTweetAuthorFollowingCount='3176' referencedTweetAuthorTweetCount='58883' referencedTweetAuthorListedCount='0' retweetCount='1' replyCount='0' likeCount='0' quoteCount='0' >RT @example: this is an example RT </tweet>
```

2 Answers


  1. How about using Python instead of Notepad++?

    This code uses the xml.etree.ElementTree library, with the tweet embedded in the code.
    It reads each attribute’s value and the RT text.

    #1 No installation needed

    xml.etree.ElementTree is part of the Python standard library, so there is nothing to install with pip.
    

    #2 Save the following as get-tweet.py.

    import xml.etree.ElementTree as ET
    
    xml = """
    <tweet id='827364918734' createdAt='2011-01-16T18:13:02.000Z' language='en' authorId='673829' authorUsername='exampleuser' authorName='example' authorVerified='TRUE' authorDescription='example description' authorLocation='example location' authorCreatedAt='2009-05-10T05:02:51.000Z' authorFollowersCount='830211' authorFollowingCount='1763' authorTweetCount='34209' authorListedCount='7589' referencedTweetId='26690653563912192' referencedTweetCreatedAt='2011-01-16T17:22:02.000Z' referencedTweetText='example reference tweet text' referencedTweetRetweetCount='9' referencedTweetReplyCount='0' referencedTweetLikeCount='2' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example' referencedTweetAuthorVerified='TRUE' referencedTweetAuthorDescription='example description
    
    Check out @example, our new example' referencedTweetAuthorLocation='example' referencedTweetAuthorCreatedAt='2008-08-27T15:24:02.000Z' referencedTweetAuthorFollowersCount='1380523' referencedTweetAuthorFollowingCount='1035' referencedTweetAuthorTweetCount='402492' referencedTweetAuthorListedCount='22425' retweetCount='9' replyCount='0' likeCount='0' quoteCount='0' >RT @example this is an example RT </tweet>
    """
    
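    # Parse the string; root is the single <tweet> element
    root = ET.fromstring(xml)
    print("root: " + str(root))
    print("root.tag: " + str(root.tag))
    # root.attrib is a dict mapping attribute names to their values
    print("root.attrib: " + str(root.attrib))
    print(type(root.attrib))
    for key in root.attrib.keys():
        print(key + ': ' + root.attrib[key])
    # root.text is the tweet body, here the "RT @..." text
    print("text: " + str(root.text))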
    

    #3 Run it

    python get-tweet.py
    

    #4 Result

    $ python get-tweet.py
    root: <Element 'tweet' at 0x000001593FED13F0>
    root.tag: tweet
    root.attrib: {'id': '827364918734', 'createdAt': '2011-01-16T18:13:02.000Z', 'language': 'en', 'authorId': '673829', 'authorUsername': 'exampleuser', 'authorName': 'example', 'authorVerified': 'TRUE', 'authorDescription': 'example description', 'authorLocation': 'example location', 'authorCreatedAt': '2009-05-10T05:02:51.000Z', 'authorFollowersCount': '830211', 'authorFollowingCount': '1763', 'authorTweetCount': '34209', 'authorListedCount': '7589', 'referencedTweetId': '26690653563912192', 'referencedTweetCreatedAt': '2011-01-16T17:22:02.000Z', 'referencedTweetText': 'example reference tweet text', 'referencedTweetRetweetCount': '9', 'referencedTweetReplyCount': '0', 'referencedTweetLikeCount': '2', 'referencedTweetQuoteCount': '0', 'referencedTweetAuthorUsername': 'example', 'referencedTweetAuthorName': 'example', 'referencedTweetAuthorVerified': 'TRUE', 'referencedTweetAuthorDescription': 'example description  Check out @example, our new example', 'referencedTweetAuthorLocation': 'example', 'referencedTweetAuthorCreatedAt': '2008-08-27T15:24:02.000Z', 'referencedTweetAuthorFollowersCount': '1380523', 'referencedTweetAuthorFollowingCount': '1035', 'referencedTweetAuthorTweetCount': '402492', 'referencedTweetAuthorListedCount': '22425', 'retweetCount': '9', 'replyCount': '0', 'likeCount': '0', 'quoteCount': '0'}
    <class 'dict'>
    id: 827364918734
    createdAt: 2011-01-16T18:13:02.000Z
    language: en
    authorId: 673829
    authorUsername: exampleuser
    authorName: example
    authorVerified: TRUE
    authorDescription: example description
    authorLocation: example location
    authorCreatedAt: 2009-05-10T05:02:51.000Z
    authorFollowersCount: 830211
    authorFollowingCount: 1763
    authorTweetCount: 34209
    authorListedCount: 7589
    referencedTweetId: 26690653563912192
    referencedTweetCreatedAt: 2011-01-16T17:22:02.000Z
    referencedTweetText: example reference tweet text
    referencedTweetRetweetCount: 9
    referencedTweetReplyCount: 0
    referencedTweetLikeCount: 2
    referencedTweetQuoteCount: 0
    referencedTweetAuthorUsername: example
    referencedTweetAuthorName: example
    referencedTweetAuthorVerified: TRUE
    referencedTweetAuthorDescription: example description  Check out @example, our new example
    referencedTweetAuthorLocation: example
    referencedTweetAuthorCreatedAt: 2008-08-27T15:24:02.000Z
    referencedTweetAuthorFollowersCount: 1380523
    referencedTweetAuthorFollowingCount: 1035
    referencedTweetAuthorTweetCount: 402492
    referencedTweetAuthorListedCount: 22425
    retweetCount: 9
    replyCount: 0
    likeCount: 0
    quoteCount: 0
    text: RT @example this is an example RT
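
One detail worth noticing in the result above: the blank line inside referencedTweetAuthorDescription came out as two spaces. XML parsers normalize line breaks in attribute values to spaces, so a round trip through ElementTree will not preserve them. A minimal sketch, using a made-up attribute name (`note`) purely for illustration:

```python
import xml.etree.ElementTree as ET

# 'note' is a made-up attribute that holds a blank line,
# like the description in the sample tweet.
elem = ET.fromstring("<tweet note='line one\n\nline two'>RT @example text</tweet>")

note = elem.get("note")
print(repr(note))       # the two newlines are now two spaces
print(repr(elem.text))  # element text is left as written
```

If the original line breaks inside attributes matter for later analysis, the Notepad++ regex approach leaves the remaining tweets byte-for-byte unchanged, which this approach does not.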
    

    #Note: reading the tweet from a file

    If you want to read from an XML file instead, use the code below.
    It gives the same result.

    import xml.etree.ElementTree as ET
    
    tree = ET.parse('tweet_data.xml')
    root = tree.getroot()
    print("root: " + str(root))
    print("root.tag: " + str(root.tag))
    print("root.attrib: " + str(root.attrib))
    print(type(root.attrib))
    for key in root.attrib.keys():
        print(key +': '+root.attrib[key])
    print("text: " + str(root.text))
    

    You can also modify the tree and write it back to an XML file with ElementTree.
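
To actually remove the retweets with this approach, one option is to parse the data, drop every tweet whose text starts with "RT @", and keep the rest. The sketch below assumes all tweets sit inside a single wrapper element (called `tweets` here, a hypothetical name, since the real file layout was not shared). `remove_retweets` is likewise an illustrative helper, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Assumed layout: every <tweet> is a direct child of one root element.
# The wrapper name "tweets" is hypothetical; the real file was not shared.
SAMPLE = """<tweets>
<tweet id='1' authorDescription='line one

line two' >RT @example this is an example RT </tweet>
<tweet id='2' authorUsername='example' >an original tweet </tweet>
<tweet id='3' authorName='example' >RT @example: another RT </tweet>
</tweets>"""


def remove_retweets(root):
    """Remove every direct <tweet> child whose text starts with 'RT @'.

    Returns the number of tweets removed.
    """
    # Collect first, then remove, so the tree is not modified while iterating.
    retweets = [
        tweet
        for tweet in root.findall("tweet")
        if (tweet.text or "").lstrip().startswith("RT @")
    ]
    for tweet in retweets:
        root.remove(tweet)
    return len(retweets)


root = ET.fromstring(SAMPLE)
removed = remove_retweets(root)
remaining_ids = [tweet.get("id") for tweet in root.findall("tweet")]
print("removed:", removed)
print("remaining ids:", remaining_ids)
```

For a real file, `ET.parse('tweet_data.xml')` and `tree.getroot()` would take the place of `ET.fromstring`, and `tree.write('no_retweets.xml', encoding='utf-8')` would save the result; the output file name is a placeholder.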

  2. Your regex is not bad; you just forgot to enable the “. matches newline” option.

    • Ctrl+H
    • Find what: <tweet id=(?:(?!</tweet>).)+?RT @.+?</tweet>
    • Replace with: LEAVE EMPTY
    • TICK Match case
    • TICK Wrap around
    • SELECT Regular expression
    • TICK . matches newline
    • Replace all

    Explanation:

    <tweet id=      # literally
        (?:             # non capture group
            (?!             # negative lookahead, make sure what follows is not:
                </tweet>        # literally
            )               # end lookahead
            .               # any character
        )+?             # end group, may appear 1 or more times, not greedy
    RT @            # literally
    .+?             # 1 or more any character, not greedy
    </tweet>        # literally
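
For anyone who wants to try the same replacement outside Notepad++, here is a minimal Python sketch. It assumes Python's `re` module, where `re.DOTALL` plays the role of ". matches newline"; the three sample tweets are made up for illustration. It also shows why the lookahead is there: with only ". matches newline" added to the pattern from the question, a match can start at an ordinary tweet and run on into the next retweet, deleting both.

```python
import re

# Same pattern as in "Find what"; re.DOTALL corresponds to ". matches newline".
TEMPERED = re.compile(r"<tweet id=(?:(?!</tweet>).)+?RT @.+?</tweet>", re.DOTALL)

# For comparison: the pattern from the question with only DOTALL added.
NAIVE = re.compile(r"(<tweet id).+?(>RT @).+?(/tweet>)", re.DOTALL)

# Made-up sample: an original tweet, a multi-line retweet, another original tweet.
SAMPLE = (
    "<tweet id='1' authorName='a' >an original tweet </tweet>\n"
    "<tweet id='2' authorDescription='line one\n\nline two' >RT @example an RT </tweet>\n"
    "<tweet id='3' authorName='c' >another original tweet </tweet>\n"
)

naive_result = NAIVE.sub("", SAMPLE)
tempered_result = TEMPERED.sub("", SAMPLE)

# The naive pattern starts at tweet 1 and runs into tweet 2, so tweet 1 is lost.
print("naive keeps tweet 1:   ", "id='1'" in naive_result)
# The lookahead stops a match from crossing a </tweet>, so only tweet 2 goes.
print("tempered keeps tweet 1:", "id='1'" in tempered_result)
print(tempered_result)
```

Note that, like the Notepad++ version, this pattern looks for `RT @` anywhere inside a tweet, including attribute values; searching for `>RT @` instead would restrict it to the tweet text.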
    
