
I want to extract information from different sentences, so I'm using NLTK to split each sentence into words. I'm using this code:

import nltk

words = []
for sentence in sentences:
    words.append(nltk.word_tokenize(sentence))

It works pretty well, but I want something a little bit different. For example, I have this sentence:
['Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"']
I want "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)" to be one word and not divided into several single words.

UPDATE:
I want something like this:

[
 'Jan',
 '31',
 '19:28:14',
 'nginx',
 '10.0.0.0',
 '31/Jan/2019:19:28:14',
 '+0100',
 'POST',
 '/test/itf/',
 'HTTP/x.x',
 '404',
 '146',
 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']

Any idea how to make this possible?
Thank you in advance.
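
For reference, one quote-aware alternative to plain tokenization is the standard library's shlex module, which keeps double-quoted spans together and strips the quotes. This is a minimal sketch, not from the original question; note that it also merges the quoted request line "POST /test/itf/ HTTP/x.x" into one token, so the result is close to, but not identical to, the target list above:

import shlex

log = 'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"'

# shlex honors shell-style quoting: quoted spans stay whole, quotes are stripped
tokens = shlex.split(log)
# ['Jan', '31', '19:28:14', 'nginx:', '10.0.0.0', '-', '-', '[31/Jan/2019:19:28:14',
#  '+0100]', 'POST /test/itf/ HTTP/x.x', '404', '146', '-',
#  'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']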

3 Answers


  1. First you need to choose between " and ' as your quote character, because mixing both as in your example is unusual and can cause strange behavior. After that it is just string formatting:

    s = 'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"'
    
    words = s.split(' ')  # break the sentence on spaces
    # ['Jan', '31', '19:28:14', 'nginx:', '10.0.0.0', '-', '-', '[31/Jan/2019:19:28:14', '+0100]', '"POST', '/test/itf/', 'HTTP/x.x"', '404', '146', '"-"', '"Mozilla/5.2', '[en]', '(X11,', 'U;', 'OpenVAS-XX', '9.2.7)"']
    
    # then access your data list
    words[0]  # 'Jan'
    words[1]  # '31'
    words[2]  # '19:28:14'
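
    If you go this route, a small cleanup pass can strip the leftover quotes and brackets from each token (a sketch; note it still will not merge the user-agent string back into one token):
    
    # optional cleanup: strip leading/trailing quotes and brackets, drop bare '-' fields
    cleaned = [w.strip('"[]') for w in words if w.strip('"[]-')]
    # e.g. '"POST' -> 'POST', '+0100]' -> '+0100', 'HTTP/x.x"' -> 'HTTP/x.x'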
    
  2. You could do that using partition() with a space delimiter, regex, and recursion, as below. I have to say, though, that this solution is strictly tied to the string format you provided.

    import re
    
    s_list = []
    
    def str_partition(text):
        parts = text.partition(" ")
        # strip brackets, quotes and dashes from the current token
        part = re.sub(r'[\[\]"\'-]', '', parts[0])
        
        if part.startswith("nginx"):
            s_list.append(part.replace(":", ''))
        elif part != "":
            s_list.append(part)
        
        if not parts[2].startswith('"Moz'):
            str_partition(parts[2])
        else:
            # keep the quoted user-agent as a single token, minus the quotes
            part = re.sub(r'["\']', '', parts[2])
            part = part[:-1]  # drop the trailing ]
            s_list.append(part)
            return
    
    s = '[\'Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"\']'
    str_partition(s)
    print(s_list)
    

    Output:

    ['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100',
    'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']
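
    A side note on the recursion: each call consumes one stack frame per token, and CPython's default recursion limit is roughly 1000, so an iterative rewrite may be safer for very long lines. A hypothetical sketch of the same idea (assumes re is imported as above):
    
    # iterative variant of the same partition loop
    def str_partition_iter(text):
        result = []
        while text and not text.startswith('"Moz'):
            head, _, text = text.partition(" ")
            head = re.sub(r'[\[\]"\'-]', '', head)
            if head.startswith("nginx"):
                result.append(head.replace(":", ""))
            elif head:
                result.append(head)
        if text:
            # the quoted user-agent: strip quotes, drop the trailing ]
            result.append(re.sub(r'["\']', '', text)[:-1])
        return result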
    
  3. You can import re and parse the log line (which is not a natural language sentence) with a regex:

    import re
    import nltk
    
    sentences = ['Jan 31 19:28:14 nginx: 10.0.0.0 - - [31/Jan/2019:19:28:14 +0100] "POST /test/itf/ HTTP/x.x" 404 146 "-" "Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)"']
    
    rx = re.compile(r'\b(\w{3})\s+(\d{1,2})\s+(\d{1,2}:\d{1,2}:\d{2})\s+(\w+)\W+(\d{1,3}(?:\.\d{1,3}){3})(?:\s+\S+){2}\s+\[([^][\s]+)\s+([+\d]+)]\s+"([A-Z]+)\s+(\S+)\s+(\S+)"\s+(\d+)\s+(\d+)\s+\S+\s+"([^"]*)"')
    
    words = []
    for sent in sentences:
        m = rx.search(sent)
        if m:
            words.append(list(m.groups()))
        else:
            # fall back to plain tokenization for lines the regex does not match
            words.append(nltk.word_tokenize(sent))
    
    print(words)
    

    The output will look like this:

    [['Jan', '31', '19:28:14', 'nginx', '10.0.0.0', '31/Jan/2019:19:28:14', '+0100', 'POST', '/test/itf/', 'HTTP/x.x', '404', '146', 'Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)']]
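
    If you need the fields by name rather than by position, you can zip the groups with a list of labels. The field names below are my own, not part of the original answer, and the snippet reuses rx and sentences from above:
    
    # hypothetical labels for the 13 capture groups
    fields = ['month', 'day', 'time', 'process', 'ip', 'date', 'tz',
              'method', 'path', 'protocol', 'status', 'size', 'user_agent']
    
    for sent in sentences:
        m = rx.search(sent)
        if m:
            record = dict(zip(fields, m.groups()))
            print(record['user_agent'])  # Mozilla/5.2 [en] (X11, U; OpenVAS-XX 9.2.7)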
    