I am working with Twitter data in Notepad++ (XML files) and I’m trying to remove the retweets. Each RT starts with ‘<tweet id’, contains ‘>RT @’, and ends with ‘</tweet’. Unfortunately, due to Twitter’s API terms and conditions, I can’t share real examples from the data, so hopefully this gives you enough info to help.
The problem I’m having is that the metadata in between ‘<tweet id’ and ‘>RT @’ sometimes spans multiple lines, and I can’t find a regex that captures RTs on both single and multiple lines.
This is the regex I have, which captures single-line RTs:
(<tweet id).+?(>RT @).+?(/tweet>)
Does anyone have any ideas on what I can add to it so that it also picks up RTs (and their accompanying metadata) that span multiple lines?
Example RT. I’ve altered some of the names and the content of the RT, but the format remains the same. Note that there are two examples below; the second one, which contains an emoji, has its own introductory line:
```<tweet id='827364918734' createdAt='2011-01-16T18:13:02.000Z' language='en' authorId='673829' authorUsername='exampleuser' authorName='example' authorVerified='TRUE' authorDescription='example description' authorLocation='example location' authorCreatedAt='2009-05-10T05:02:51.000Z' authorFollowersCount='830211' authorFollowingCount='1763' authorTweetCount='34209' authorListedCount='7589' referencedTweetId='26690653563912192' referencedTweetCreatedAt='2011-01-16T17:22:02.000Z' referencedTweetText='example reference tweet text' referencedTweetRetweetCount='9' referencedTweetReplyCount='0' referencedTweetLikeCount='2' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example' referencedTweetAuthorVerified='TRUE' referencedTweetAuthorDescription='example description
Check out @example, our new example' referencedTweetAuthorLocation='example' referencedTweetAuthorCreatedAt='2008-08-27T15:24:02.000Z' referencedTweetAuthorFollowersCount='1380523' referencedTweetAuthorFollowingCount='1035' referencedTweetAuthorTweetCount='402492' referencedTweetAuthorListedCount='22425' retweetCount='9' replyCount='0' likeCount='0' quoteCount='0' >RT @example this is an example RT </tweet>```
This is an example with an emoji:
```<tweet id='1783646' createdAt='2010-01-26T19:38:13.000Z' language='en' authorId='djsjchk' authorUsername='example' authorName='example' authorVerified='FALSE' authorDescription='example' authorLocation='example' authorCreatedAt='2009-06-26T19:50:16.000Z' authorFollowersCount='647' authorFollowingCount='204' authorTweetCount='6045' authorListedCount='31' referencedTweetId='8247516385' referencedTweetCreatedAt='2010-01-26T19:36:15.000Z' referencedTweetText='example' referencedTweetRetweetCount='1' referencedTweetReplyCount='0' referencedTweetLikeCount='0' referencedTweetQuoteCount='0' referencedTweetAuthorUsername='example' referencedTweetAuthorName='example 😷' referencedTweetAuthorVerified='FALSE' referencedTweetAuthorDescription='examples. #TCSC' referencedTweetAuthorLocation='Find me at' referencedTweetAuthorCreatedAt='2010-01-23T20:05:52.000Z' referencedTweetAuthorFollowersCount='25803' referencedTweetAuthorFollowingCount='3176' referencedTweetAuthorTweetCount='58883' referencedTweetAuthorListedCount='0' retweetCount='1' replyCount='0' likeCount='0' quoteCount='0' >RT @example: this is an example RT </tweet>```
2 Answers
How about using Python instead of Notepad++? This code uses the xml.etree.ElementTree library, with the tweet embedded in the code. It gets each attribute’s value and the RT text.
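#1 Install the ElementTree library (xml.etree.ElementTree ships with Python’s standard library, so a standard Python installation already includes it).
#2 Save the code as a get-tweet.py file.
#3 Run it.
#4 Check the result.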
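The steps above can be sketched roughly as follows. This is a minimal illustration, not the answerer's original code: the `<tweets>` root element, the shortened attribute set, and the helper name `find_retweets` are assumptions made for the example, and it assumes the text and attribute values are properly XML-escaped.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the same shape as the question's data.
# A single root element (<tweets>) is assumed, since XML needs one.
SAMPLE = """<tweets>
<tweet id='1' authorUsername='alice' authorDescription='line one
line two' retweetCount='0' >an original tweet </tweet>
<tweet id='2' authorUsername='bob' referencedTweetId='1' retweetCount='9' >RT @alice: an original tweet </tweet>
</tweets>"""


def find_retweets(root):
    """Return the <tweet> elements whose text starts with 'RT @'."""
    return [
        tweet
        for tweet in root.iter("tweet")
        if (tweet.text or "").lstrip().startswith("RT @")
    ]


root = ET.fromstring(SAMPLE)
retweets = find_retweets(root)

for tweet in retweets:
    # .get() reads one attribute; .attrib is the full attribute dict.
    print(tweet.get("id"), tweet.get("authorUsername"), tweet.text.strip())
```

Because the parser works on elements rather than lines, it does not matter whether a tweet's metadata sits on one line or several.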
#Note: reading tweets from a file
If you want to read the tweets from an XML file instead, you will get the same result.
You can also use ElementTree to modify the XML and write it back to a file.
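As a hedged sketch of the read, modify, and write round trip: the function name `remove_retweets`, the file names, and the `<tweets>` root element below are assumptions for illustration, and the code assumes every `<tweet>` is a direct child of the root.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def remove_retweets(in_path, out_path):
    """Drop every <tweet> whose text starts with 'RT @'; return how many were removed."""
    tree = ET.parse(in_path)
    root = tree.getroot()
    removed = 0
    # Iterate over a copy of the list, because we remove children as we go.
    for tweet in list(root.findall("tweet")):
        if (tweet.text or "").lstrip().startswith("RT @"):
            root.remove(tweet)
            removed += 1
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
    return removed


# Small demonstration on a temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "tweets.xml")
    dst = os.path.join(tmp, "tweets-no-rt.xml")
    with open(src, "w", encoding="utf-8") as f:
        f.write(
            "<tweets>\n"
            "<tweet id='1' >an original tweet </tweet>\n"
            "<tweet id='2' authorDescription='line one\nline two' >RT @example: a retweet </tweet>\n"
            "<tweet id='3' >another original </tweet>\n"
            "</tweets>"
        )
    removed_count = remove_retweets(src, dst)
    remaining_ids = [t.get("id") for t in ET.parse(dst).getroot().findall("tweet")]

print(removed_count, remaining_ids)
```

Writing to a separate output path keeps the original file untouched, which makes it easy to compare the two afterwards.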
Your regex is not so bad; you just forgot the flag ‘. matches newline’.
Find what: <tweet id=(?:(?!</tweet>).)+?RT @.+?</tweet>
Replace with: LEAVE EMPTY
Tick: . matches newline
Explanation:
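One way to see how the pattern behaves is to run the same regex in Python, where `re.DOTALL` plays the role of Notepad++'s ‘. matches newline’ option. This is only a sketch; the sample tweets and the name `strip_retweets` are made up for the example.

```python
import re

# <tweet id=           literal start of a tweet
# (?:(?!</tweet>).)+?  any characters, lazily, but never stepping over a
#                      closing </tweet>, so a match cannot start in one
#                      tweet and find "RT @" in the next one
# RT @                 the retweet marker
# .+?</tweet>          the rest of the tweet up to the first closing tag
# re.DOTALL lets "." match newlines, so multi-line metadata is covered.
RT_PATTERN = re.compile(
    r"<tweet id=(?:(?!</tweet>).)+?RT @.+?</tweet>",
    re.DOTALL,
)


def strip_retweets(xml_text):
    """Return xml_text with every retweet element removed."""
    return RT_PATTERN.sub("", xml_text)


sample = (
    "<tweet id='1' authorDescription='line one\nline two' >an original tweet </tweet>\n"
    "<tweet id='2' authorDescription='line one\nline two' >RT @example: a retweet </tweet>\n"
    "<tweet id='3' >another original </tweet>\n"
)

cleaned = strip_retweets(sample)
print(cleaned)
```

Note that the pattern looks for ‘RT @’ anywhere inside a tweet, so a tweet whose metadata happens to contain that text would also be removed; searching for ‘>RT @’ instead would be stricter.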