I am currently working with topic modelling and I have a dictionary with the information of each topic and the movies that correspond to that topic (like below):
{'Topic 49': ['0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" + 0.020*"action" + 0.017*"2000"',
array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
'Green Lantern', 'Men in Black II',
'Final Fantasy: The Spirits Within', 'Treasure Planet',
'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
'My Favorite Martian', 'I Am Number Four'], dtype=object)],...}
In the topics, each word is prefixed with its word probability because that’s how I extracted them from LDA.
What I want is to keep only the pertinent words from those topics, achieving something like this:
{'Topic 49': ['alien science_fiction adventure action 2000',
array(['Avatar', 'Men in Black 3', 'Transformers: Age of Extinction',
'Green Lantern', 'Men in Black II',
'Final Fantasy: The Spirits Within', 'Treasure Planet',
'Men in Black', 'A.I. Artificial Intelligence', 'Mission to Mars',
'Independence Day', 'Titan A.E.', 'Sphere', 'Signs',
'AVP: Alien vs. Predator', 'Zathura: A Space Adventure',
'My Favorite Martian', 'I Am Number Four'], dtype=object)],...}
I have tried several things but I can’t seem to make it work.
I’ve tried something like keeping only the alphabetic characters, but then I also lose terms like 2000, which describe the year of the movies.
Is there any way I could select only the words (or numbers, in the case of the years) that come after the * and are separated by the + sign?
Hope this is clear!
3 Answers
You can use a regex to extract only the words between double quotes (") in the topics.
Try something like this: ".*?"
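As a minimal sketch of that regex idea, assuming every term is wrapped in double quotes exactly as in the question: the helper name strip_weights is hypothetical, a capture group is added to the pattern so the quotes themselves are dropped, and a plain list stands in for the NumPy array of movie titles to keep the example self-contained.

```python
import re

def strip_weights(topic_string):
    # Non-greedy capture of whatever sits between each pair of double quotes,
    # so digits such as "2000" are kept along with ordinary words.
    return " ".join(re.findall(r'"(.*?)"', topic_string))

# Shortened stand-in for the dictionary in the question (assumption:
# each value is a two-element list of [topic string, movie titles]).
topics = {
    "Topic 49": [
        '0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" '
        '+ 0.020*"action" + 0.017*"2000"',
        ["Avatar", "Men in Black 3"],
    ],
}

# Build a new dictionary; the movie titles are passed through untouched.
cleaned = {
    name: [strip_weights(words), movies]
    for name, (words, movies) in topics.items()
}

print(cleaned["Topic 49"][0])  # alien science_fiction adventure action 2000
```

Because the pattern only looks at the quotes, it does not care how the weights are formatted.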
You can use the re module.
Assuming that the format of the string is very strict, this is possible with Python's built-in string and list manipulation functions:
Result:
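The string-manipulation approach can be sketched roughly as follows. This is only an illustration under the strict-format assumption stated above (terms joined by " + ", each written as weight*"word"); the function name extract_terms is hypothetical and not taken from the original answer.

```python
def extract_terms(topic_string):
    terms = []
    # Each part looks like 0.039*"alien" once the string is split on " + ".
    for part in topic_string.split(" + "):
        # partition splits on the first "*": weight on the left, quoted term on the right.
        _weight, _sep, quoted = part.partition("*")
        terms.append(quoted.strip('"'))
    return " ".join(terms)

topic = ('0.039*"alien" + 0.038*"science_fiction" + 0.020*"adventure" '
         '+ 0.020*"action" + 0.017*"2000"')

print(extract_terms(topic))  # alien science_fiction adventure action 2000
```

This avoids regular expressions entirely, but it would need adjusting if the spacing around the + signs ever differs.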