I have extracted a lot of data from Telegram, but I was not able to isolate the channel_id. I now have a long string that contains the channel_id among a lot of other information. How do I remove everything apart from the channel_id, i.e. the digits that follow "channel_id=" in "channel_id=XXXXXXXXXX)"?
Subset of my data.frame
df <- structure(list(channel_id = c("MessageFwdHeader(date=datetime.datetime(2021, 5, 13, 20, 50, 47, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1292436059), from_name=None, channel_post=1404, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)",
"MessageFwdHeader(date=datetime.datetime(2021, 5, 4, 9, 24, 16, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1480423705), from_name=None, channel_post=224, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)",
"MessageFwdHeader(date=datetime.datetime(2021, 3, 25, 14, 9, 38, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1489900933), from_name=None, channel_post=627, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)",
"MessageFwdHeader(date=datetime.datetime(2021, 3, 12, 22, 10, 3, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1455689590), from_name=None, channel_post=1457, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)",
"MessageFwdHeader(date=datetime.datetime(2021, 3, 9, 12, 52, 5, tzinfo=datetime.timezone.utc), imported=False, from_id=PeerChannel(channel_id=1348575245), from_name=None, channel_post=None, post_author=None, saved_from_peer=None, saved_from_msg_id=None, psa_type=None)"
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
Desired result
channel_id <- structure(list(channel_id = c("1292436059",
"1480423705",
"1489900933",
"1455689590",
"1348575245"
)), row.names = c(NA, -5L), class = c("data.table", "data.frame"))
2 Answers
You can try regexpr with a lookbehind for (channel_id= using (?<=\(channel_id=), then match the digit(s) with \d+, add a lookahead for ) using (?=\)), and extract the matches using regmatches. Alternatively, combine two sub calls.
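The code for this answer did not survive, so here is a minimal sketch of both base R approaches. The vector x and the result names are made up for illustration, and the strings are shortened versions of the ones in the question. The exact patterns of the original answer are unknown; the sub patterns in particular are an assumption.

```r
# Shortened sample strings in the same layout as the question's data
# (hypothetical vector name; in the question this would be df$channel_id)
x <- c(
  "MessageFwdHeader(imported=False, from_id=PeerChannel(channel_id=1292436059), from_name=None, channel_post=1404)",
  "MessageFwdHeader(imported=False, from_id=PeerChannel(channel_id=1480423705), from_name=None, channel_post=224)"
)

# Approach 1: lookbehind for "(channel_id=", one or more digits, lookahead for ")".
# Lookarounds require perl = TRUE in base R.
m <- regexpr("(?<=\\(channel_id=)\\d+(?=\\))", x, perl = TRUE)
ids_regmatches <- regmatches(x, m)

# Approach 2: two sub() calls. The inner call strips everything up to and
# including "(channel_id=", the outer call strips the first ")" and the rest.
ids_sub <- sub("\\).*", "", sub(".*\\(channel_id=", "", x))

print(ids_regmatches)
print(ids_sub)
```

Note that regmatches drops elements without a match, so the result can be shorter than the input if some rows have no channel_id. The sub version keeps the length but returns the unmatched string unchanged in that case.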
We may use str_extract.
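The original code for this answer is also missing. A sketch of what it might look like, assuming str_extract from the stringr package and a lookbehind pattern (the pattern and the vector name x are assumptions, not the answerer's exact code):

```r
library(stringr)

# Shortened sample strings in the same layout as the question's data
# (hypothetical vector name; in the question this would be df$channel_id)
x <- c(
  "MessageFwdHeader(imported=False, from_id=PeerChannel(channel_id=1292436059), from_name=None, channel_post=1404)",
  "MessageFwdHeader(imported=False, from_id=PeerChannel(channel_id=1480423705), from_name=None, channel_post=224)"
)

# Extract the run of digits that directly follows "(channel_id="
ids <- str_extract(x, "(?<=\\(channel_id=)\\d+")

print(ids)
```

Unlike regmatches, str_extract returns NA for elements without a match, so the result always has the same length as the input and can be assigned straight back to the column.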