
So, I have a list of lowercase keywords. Let’s say

keywords = ['machine learning', 'data science', 'artificial intelligence']

and a list of texts in lowercase. Let’s say

texts = [
  'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
  'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

I need to transform the texts into:

[[['the', 'new',
   'machine_learning',
   'model',
   'built',
   'by',
   'google',
   'is',
   'revolutionary',
   'for',
   'the',
   'current',
   'state',
   'of',
   'artificial_intelligence'],
  ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
 [['data_science',
   'and',
   'artificial_intelligence',
   'are',
   'two',
   'different',
   'fields',
   'although',
   'they',
   'are',
   'interconnected'],
  ['scientists',
   'from',
   'harvard',
   'are',
   'explaining',
   'it',
   'in',
   'a',
   'detailed',
   'presentation',
   'that',
   'could',
   'be',
   'found',
   'on',
   'our',
   'page']]]

What I do right now is check whether each keyword appears in a text and, if it does, replace it with the keyword joined by underscores. But this is of complexity m*n, and it is really slow when you have 700 long texts and 2M keywords, as in my case.
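
Roughly, my current approach is a nested loop over texts and keywords (a simplified sketch of what I'm doing, variable names are just for illustration):

for idx, text in enumerate(texts):
    for keyword in keywords:  # every keyword is scanned against every text -> m*n checks
        if keyword in text:
            texts[idx] = texts[idx].replace(keyword, keyword.replace(' ', '_'))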

I was trying to use gensim’s Phraser, but I can’t manage to build one from only my keywords.

Could someone suggest a more optimized way of doing this?

2 Answers


  1. This is probably not the most Pythonic way to do it, but it works in 3 steps.

    import re

    keywords = ['machine learning', 'data science', 'artificial intelligence']
    
    texts = [
        'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
        'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
    ]
    
    # Step 1: add underscores to each keyword found in the texts
    for idx, text in enumerate(texts):
      for keyword in keywords:
        if keyword in text:
          texts[idx] = texts[idx].replace(keyword, keyword.replace(" ", "_"))
    
    # Step 2: split each text into sentences on "."
    for idx, text in enumerate(texts):
      texts[idx] = list(filter(None, text.split(".")))
    
    # Step 3: split each sentence into words, stripping any leftover punctuation
    for idx, text in enumerate(texts):
      for idx_s, sentence in enumerate(text):
        texts[idx][idx_s] = list(map(lambda x: re.sub("[,.!?]", "", x), sentence.split()))
    
    print(texts)
    

    Output

    [
        [
            ['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'], 
            ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
        ], 
        [
            ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'], 
            ['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
        ]
    ]
    
  2. The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of what word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser to do what you want, by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)

    You could, however, mimic their general approach: (1) operate on lists-of-tokens rather than raw strings; (2) learn & remember token-pairs that should be combined; & (3) perform combination in a single pass. That should work far more efficiently than anything based on doing repeated search-and-replace on a string – which it sounds like you’ve already tried and found wanting.

    For example, let’s first create a dictionary, where the keys are tuples of word-pairs that should be combined, and the values are tuples that include both their designated combination-token, and a 2nd item that’s just an empty-tuple. (The reason for this will become clear later.)

    keywords = ['machine learning', 'data science', 'artificial intelligence']
    texts = [
        'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
        'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
    ]
    
    combinations_dict = {tuple(kwsplit):('_'.join(kwsplit), ()) 
                         for kwsplit in [kwstr.split() for kwstr in keywords]}
    combinations_dict
    

    After this step, combinations_dict is:

    {('machine', 'learning'): ('machine_learning', ()),
     ('data', 'science'): ('data_science', ()),
     ('artificial', 'intelligence'): ('artificial_intelligence', ())}
    

    Now, we can use a Python generator function to create an iterable transformation of any other sequence-of-tokens, that takes original tokens one-by-one – but before emitting any, adds the next to a buffered candidate pair-of-tokens. If that pair is one that should be combined, a single combined token is yielded – but if not, just the 1st token is emitted, leaving the 2nd to be combined with the next token in a new candidate pair.

    For example:

    def combining_generator(tokens, comb_dict):
        buff = ()  # start with empty buffer
        for in_tok in tokens:
            buff += (in_tok,)  # add latest to buffer
            if len(buff) < 2:  # grow buffer to 2 tokens if possible
                continue
            # lookup what to do for current pair... 
            # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
            out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
            yield out_tok 
        if buff:
            yield buff[0]  # last solo token if any
    

    Here we see the reason for the earlier () empty-tuples: that’s the preferred state of the buff after a successful replacement. And driving the result & next-state this way helps us use the form of dict.get(key, default) that supplies a specific value to be used if the key isn’t found.
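
    To make that lookup concrete, here is roughly how the two cases play out, using illustrative pairs from the texts above:

    buff = ('machine', 'learning')
    combinations_dict.get(buff, (buff[0], (buff[1],)))
    # -> ('machine_learning', ())     # known pair: emit the combined token, reset buffer to ()

    buff = ('new', 'machine')
    combinations_dict.get(buff, (buff[0], (buff[1],)))
    # -> ('new', ('machine',))        # unknown pair: emit 'new', keep 'machine' buffered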

    Now designated combinations can be applied via:

    tokenized_texts = [text.split() for text in texts]
    retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
    retokenized_texts
    

    …which reports retokenized_texts as:

    [
      ['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'], 
      ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
    ]
    

    Note that the tokens ('artificial', 'intelligence.') aren’t combined here, as the dirt-simple .split() tokenization used has left the punctuation attached, preventing an exact match to the rule.

    Real projects will want to use a more-sophisticated tokenization, that might either strip the punctuation, or retain punctuation as tokens, or do other preprocessing – and as a result would properly pass 'artificial' as a token without the attached '.'. For example a simple tokenization that just retains runs-of-word-characters discarding punctuation would be:

    import re
    tokenized_texts = [re.findall(r'\w+', text) for text in texts]
    tokenized_texts
    

    Another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens would be:

    tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
    tokenized_texts
    

    Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.
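
    If you also want the per-sentence nesting shown in the question, one possible way (just a sketch, assuming a naive split on '.' is good enough for sentence boundaries) is to split each text into sentences first, then tokenize and combine each sentence:

    import re

    retokenized_texts = [
        [list(combining_generator(re.findall(r'\w+', sentence), combinations_dict))
         for sentence in text.split('.') if sentence.strip()]
        for text in texts
    ]
    retokenized_texts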
