I want to extract information from different sentences, so I'm using NLTK to split each sentence into words. I'm using this code:
import nltk

# Tokenize every sentence into a list of words
words = []
for sentence in sentences:
    words.append(nltk.word_tokenize(sentence))
words
It works pretty well, but I want something a little different. For example, I have this sentence:
'['Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"']'
I want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"
to be one word and not split into several single words.
UPDATE:
I want something like this:
[
'Jan',
'31',
'19:28:14',
'nginx',
'10.0.0.0',
'31/Jan/2019:19:28:14',
'+0100',
'POST',
'/test/itf/',
'HTTP/x.x',
'404',
'146',
'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']
Any idea how to make this possible?
Thank you in advance.
3 Answers
First you need to choose between " and ', because mixing both is unusual and can cause strange behavior. After that, it is just string formatting:
You could do that using partition() with a space delimiter, regex, and recursion, as below. I have to say, though, that this solution is strictly tied to the string format you provided.
Output:
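The code for this answer did not survive, so here is a minimal sketch of the partition-and-recursion idea, not the answerer's original code. It leaves out the regex step and assumes every line follows the sample layout: "-" placeholders are dropped, a trailing ":" is stripped, bracketed and quoted fields are split recursively, and only a quoted field that ends the line (the user agent) is kept whole. The function name tokenize is hypothetical.

```python
def tokenize(text):
    """Recursively split a log line with str.partition (format-specific sketch)."""
    text = text.strip()
    if not text:
        return []
    if text[0] == '[':
        # Bracketed timestamp: tokenize its contents, then the remainder
        inner, _, rest = text[1:].partition(']')
        return tokenize(inner) + tokenize(rest)
    if text[0] == '"':
        inner, _, rest = text[1:].partition('"')
        # Assumption: a quoted field that ends the line is the user agent
        # and stays one token; any other quoted field is split further.
        head = [inner] if not rest.strip() else tokenize(inner)
        return head + tokenize(rest)
    # Plain field: cut at the first space
    token, _, rest = text.partition(' ')
    token = token.rstrip(':')                      # 'nginx:' -> 'nginx'
    head = [] if token in ('', '-') else [token]   # drop '-' placeholders
    return head + tokenize(rest)


line = ('Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] '
        '"POST /test/itf/ HTTP/x.x" 404 146 "-" '
        '"Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"')
tokens = tokenize(line)
print(tokens)
```

Because each rule is tied to the sample's layout, a line with a different field order or an unquoted user agent would need its own rules.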
You can import re and parse the log line (which is not a natural-language sentence) with a regex. See the Python demo.
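The regex from this answer is missing here, so the following is only a sketch of the approach, assuming every line has the same layout as the sample (a syslog-style prefix followed by an nginx access-log entry). The pattern and the names LOG_PATTERN and parse_log_line are illustrative, not the answerer's original code.

```python
import re

# One capture group per wanted field; the "-" placeholders and the
# referrer are matched but not captured.
LOG_PATTERN = re.compile(
    r'^(\w{3}) (\d{1,2}) (\d{2}:\d{2}:\d{2}) (\w+): (\S+) \S+ \S+ '
    r'\[([^\s\]]+) ([+-]\d{4})\] '
    r'"(\S+) (\S+) (\S+)" (\d{3}) (\d+) "[^"]*" "([^"]*)"$'
)


def parse_log_line(line):
    """Return the captured fields as a list, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return list(match.groups()) if match else None


line = ('Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] '
        '"POST /test/itf/ HTTP/x.x" 404 146 "-" '
        '"Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"')
fields = parse_log_line(line)
print(fields)
```

The user agent is captured by the final "([^"]*)" group, so it stays a single item regardless of the spaces and brackets inside it.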
The output will look like