I have a list of lowercase keywords, say:

    keywords = ['machine learning', 'data science', 'artificial intelligence']

and a list of lowercase texts, say:

    texts = [
        'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
        'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
    ]
I need to transform the texts into:
    [[['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is',
       'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'],
      ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
     [['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different',
       'fields', 'although', 'they', 'are', 'interconnected'],
      ['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed',
       'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']]]
What I do right now is check whether each keyword occurs in a text and, if it does, replace it with its underscore-joined version, as sketched below. But that is O(m*n), and it is really slow when, as in my case, you have 700 long texts and 2M keywords.
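Roughly, it looks like this:

    for i, text in enumerate(texts):
        for kw in keywords:  # one pass over all 2M keywords, per text
            if kw in text:
                text = text.replace(kw, kw.replace(' ', '_'))
        texts[i] = text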
I was trying to use gensim's Phraser, but I can't manage to build one from only my keywords.

Could someone suggest a more optimized way of doing this?
2 Answers
This is probably not the most Pythonic way to do it, but it works in three steps, sketched below with its output.
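Something along these lines (the keyword_map and merge helpers are just illustrative):

    import re

    # step 1: map each keyword's token tuple to its underscore-joined form
    keyword_map = {tuple(kw.split()): kw.replace(' ', '_') for kw in keywords}
    max_len = max(len(k) for k in keyword_map)

    # step 2: split each text into sentences, and each sentence into word tokens
    tokenized = [[re.findall(r'\w+', sent) for sent in text.split('.') if sent.strip()]
                 for text in texts]

    # step 3: scan each sentence once, merging the longest keyword match at each position
    def merge(tokens):
        out, i = [], 0
        while i < len(tokens):
            n = min(max_len, len(tokens) - i)
            while n > 1 and tuple(tokens[i:i + n]) not in keyword_map:
                n -= 1
            if n > 1:
                out.append(keyword_map[tuple(tokens[i:i + n])])
                i += n
            else:
                out.append(tokens[i])
                i += 1
        return out

    result = [[merge(sent) for sent in text] for text in tokenized]

Running this on the example data produces the nested list shown in the question.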
The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of which word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser into doing what you want by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)

You could, however, mimic their general approach: (1) operate on lists of tokens rather than raw strings; (2) learn & remember the token pairs that should be combined; and (3) perform the combination in a single pass. That should work far more efficiently than anything based on repeated search-and-replace on a string – which it sounds like you've already tried and found wanting.
For example, let's first create a dictionary where the keys are tuples of word pairs that should be combined, and the values are tuples holding both the designated combination-token and, as a 2nd item, an empty tuple. (The reason for this will become clear later.)
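A sketch of that construction, reusing the question's keywords:

    combinations_dict = {tuple(kwsplit): ('_'.join(kwsplit), ())
                         for kwsplit in [kw.split() for kw in keywords]}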
After this step, combinations_dict is:
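    {('machine', 'learning'): ('machine_learning', ()),
     ('data', 'science'): ('data_science', ()),
     ('artificial', 'intelligence'): ('artificial_intelligence', ())}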
Now we can use a Python generator function to create an iterable transformation of any other sequence of tokens: it takes the original tokens one by one, but before emitting any, adds the latest to a buffered candidate pair. If that pair is one that should be combined, a single combined token is yielded; if not, just the 1st token is emitted, leaving the 2nd to be combined with the next token in a new candidate pair. For example:
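A sketch of such a generator (the name combining_generator is illustrative):

    def combining_generator(tokens, comb_dict):
        buff = ()  # the buffered candidate pair starts empty
        for in_tok in tokens:
            buff += (in_tok,)   # add the latest token to the buffer
            if len(buff) < 2:   # wait until we have a full candidate pair
                continue
            # on a match: emit the combined token & reset the buffer to ();
            # otherwise: emit the 1st token & keep the 2nd buffered
            (emit_tok, buff) = comb_dict.get(buff, (buff[0], (buff[1],)))
            yield emit_tok
        if buff:
            yield buff[0]  # flush a final unpaired token, if any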
Here we see the reason for the earlier () empty-tuples: that's the desired state of buff after a successful replacement. And driving both the emitted result & the next buffer state this way lets us use the dict.get(key, default) form, which supplies a specific value to be used when the key isn't found.

Now the designated combinations can be applied via:
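One way to wire it up, with a dirt-simple .split() tokenization:

    tokenized_texts = [list(combining_generator(text.split(), combinations_dict))
                       for text in texts]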
…which reports tokenized_texts as:
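    [['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is',
      'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial',
      'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'],
     ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different',
      'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from',
      'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation',
      'that', 'could', 'be', 'found', 'on', 'our', 'page.']]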
Note that the tokens ('artificial', 'intelligence.') aren't combined here, as the dirt-simple .split() tokenization has left the punctuation attached, preventing an exact match to the rule.

Real projects will want a more sophisticated tokenization – one that either strips the punctuation, retains punctuation as standalone tokens, or does other preprocessing – and as a result would properly pass 'artificial' as a token without the attached '.'.
For example, a simple tokenization that just retains runs of word characters, discarding punctuation, is sketched first below; another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens follows it.
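Both sketches assume the standard re module and reuse combining_generator & combinations_dict from above; the exact patterns are illustrative:

    import re

    # 1) keep only runs of word characters, discarding punctuation entirely
    tokenized_texts = [list(combining_generator(re.findall(r'\w+', text), combinations_dict))
                       for text in texts]

    # 2) also keep runs of non-word, non-space characters as standalone tokens
    tokenized_texts = [list(combining_generator(re.findall(r'\w+|[^\w\s]+', text), combinations_dict))
                       for text in texts]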
Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.