How to select certain characters from a string with a certain condition - Artificial Intelligence

JDoe
August 9, 2019
204 views
2 votes
3 Answers

I am currently working with topic modelling and I have a dictionary with the information of each topic and the movies that correspond to that topic (like below):

{'Topic 49': ['0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
  array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
         'Green Lantern', 'Men in Black II',
         'Final Fantasy: The Spirits Within', 'Treasure Planet',
         'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
         'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
         'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
         'My Favorite Martian', 'I Am Number Four'], dtype=object)],...}

In the topics, each word is prefixed with its probability, because that is how I extracted them from LDA.

What I want is to select only the pertinent words from those topics, achieving something like this:

{'Topic 49': ['alien science_fiction adventure action 2000',
  array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
         'Green Lantern', 'Men in Black II',
         'Final Fantasy: The Spirits Within', 'Treasure Planet',
         'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
         'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
         'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
         'My Favorite Martian', 'I Am Number Four'], dtype=object)],...}

I have tried several things but I can’t seem to make it work.

I’ve tried something like keeping only alphabetic characters, but then I also lose terms like 2000, which describe the year of the movies.

Is there any way I could select only the words (or numbers, in the case of the years) that come after * and are separated by the + sign?

Hope this is clear!

Answers

- ModS
- August 9, 2019 at 11:41 am
- 0 votes
You can use a regex to extract only the words between double quotes (") in the topics.

Try something like this: ".*?"
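As a rough sketch of how that pattern could be applied (the variable names `topic` and `words` are mine, and adding a capture group is one way to drop the quotes, not necessarily what was meant above):

```python
import re

topic = '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"'

# "(.*?)" matches the shortest run of characters between two double quotes.
# Because the pattern has one capture group, findall returns only the
# captured text, i.e. the words without their surrounding quotes.
words = re.findall(r'"(.*?)"', topic)
print(words)
```

This assumes the words themselves never contain a double quote.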


You can use the re module:

import re
from numpy import array  # the dictionary below holds NumPy arrays

ss = {'Topic 49': ['0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
  array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
         'Green Lantern', 'Men in Black II',
         'Final Fantasy: The Spirits Within', 'Treasure Planet',
         'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
         'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
         'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
         'My Favorite Martian', 'I Am Number Four'], dtype=object)],
  # ... (remaining topics elided)
}
# Split the topic string on '+', then capture the word characters
# (letters, digits, underscore) between the double quotes of each term.
words = [re.search(r'"(\w*)"', term).group(1) for term in ss['Topic 49'][0].split('+')]
# print(words)
# ['alien', 'science_fiction', 'adventure', 'action', '2000']
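
If you would rather follow the structure described in the question literally (terms separated by + and the word coming after *), a regex-free sketch could look like this. The helper name `split_terms` is made up for illustration, and it assumes no word contains a + or * character:

```python
topic = '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"'


def split_terms(topic_string):
    """Return the quoted word from each weight*"word" term."""
    words = []
    for term in topic_string.split('+'):
        # partition('*') gives (weight, '*', quoted_word) for each term.
        _weight, _sep, quoted = term.strip().partition('*')
        # Drop the double quotes around the word.
        words.append(quoted.strip('"'))
    return words


words = split_terms(topic)
print(words)
```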

- sobek
- August 9, 2019 at 12:01 pm
- 0 votes
Assuming that the format of the string is very strict, this is possible with Python's built-in string splitting and list slicing:
```
my_string = '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"'

sanitized_string = my_string.split('"')[1::2]
```
Result:
```
['alien', 'science_fiction', 'adventure', 'action', '2000']
```
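To get the dictionary layout asked for in the question, the same split could be applied to every topic and the words joined with spaces. This is only a sketch: `clean_topics` is a hypothetical helper, and the movie list is shortened and written as a plain list instead of a NumPy array so the example runs on its own.

```python
def clean_topics(topics):
    """Return a new dict with each topic string reduced to its quoted words."""
    cleaned = {}
    for name, (topic_string, movies) in topics.items():
        # Splitting on '"' puts the quoted words at the odd indices.
        words = topic_string.split('"')[1::2]
        # Keep the movies untouched, replace only the topic string.
        cleaned[name] = [' '.join(words), movies]
    return cleaned


topics = {
    'Topic 49': [
        '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
        ['Avatar', 'Men in Black 3'],  # shortened stand-in for the array of titles
    ],
}

result = clean_topics(topics)
print(result['Topic 49'][0])
```

The original dictionary is left unchanged; a new one is returned.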