
I am currently working with topic modelling and I have a dictionary with the information of each topic and the movies that correspond to that topic (like below):

{'Topic 49': ['0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
  array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
         'Green Lantern', 'Men in Black II',
         'Final Fantasy: The Spirits Within', 'Treasure Planet',
         'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
         'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
         'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
         'My Favorite Martian', 'I Am Number Four'], dtype=object)],...}

In the topics, each word comes with its probability attached, because that is how I extracted them from LDA.

What I want to do is keep only the pertinent words from those topics, achieving something like this:

{'Topic 49': ['alien science_fiction adventure action 2000',
  array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
         'Green Lantern', 'Men in Black II',
         'Final Fantasy: The Spirits Within', 'Treasure Planet',
         'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
         'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
         'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
         'My Favorite Martian', 'I Am Number Four'], dtype=object)],...}

I have tried several things but I can’t seem to make it work.

I’ve tried things like keeping only the letters, but then I also lose terms like 2000, which describe the year of the movies.

Is there any way I could select only the words (or numbers, in the case of the years) that come after the * and are separated by the + sign?

Hope this is clear!

3 Answers


  1. You can use a regex to extract only the words between the double quotes (") in each topic.

    Try something like this: ".*?"

  2. You can use the re module:

    import re
    
    ss = {'Topic 49': ['0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
      array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
             'Green Lantern', 'Men in Black II',
             'Final Fantasy: The Spirits Within', 'Treasure Planet',
             'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
             'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
             'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
             'My Favorite Martian', 'I Am Number Four'], dtype=object)],...}
    # \w matches letters, digits and underscores, so '2000' and 'science_fiction' are kept
    s = [re.search(r'"\w*"', s).group(0).strip('"') for s in ss['Topic 49'][0].split('+')]
    # print(s)
    # ['alien', 'science_fiction', 'adventure', 'action', '2000']
    
  3. Assuming that the format of the string is very strict, this is possible with Python's built-in string and list operations:

    my_string = '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"'
    
    sanitized_string = my_string.split('"')[1::2]
    

    Result:

    ['alien', 'science_fiction', 'adventure', 'action', '2000']
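
To get from a list of terms to the dictionary shape asked for in the question, the extraction can be applied to every topic and the terms joined with spaces. The following is only a minimal sketch: the helper names `topic_terms` and `clean_topics` are made up for illustration, it assumes every term is wrapped in double quotes as in the example, and a plain list stands in for the NumPy array of movie titles.

```python
import re


def topic_terms(topic_string):
    """Return the quoted terms of an LDA topic string, joined by spaces."""
    # Non-greedy capture of whatever sits between each pair of double quotes.
    return " ".join(re.findall(r'"(.*?)"', topic_string))


def clean_topics(topics):
    """Build a new dict whose first element per topic holds only the bare terms."""
    return {
        name: [topic_terms(value[0])] + list(value[1:])
        for name, value in topics.items()
    }


# Shortened sample data (assumption: same layout as the dictionary in the question).
topics = {
    "Topic 49": [
        '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
        ["Avatar", "Men in Black 3"],
    ]
}

cleaned = clean_topics(topics)
print(cleaned["Topic 49"][0])  # alien science_fiction adventure action 2000
```

The movie titles are passed through untouched, and the original dictionary is not modified, so the weighted strings remain available if they are needed later.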
    