
So, I have a list of lowercase keywords. Let’s say

keywords = ['machine learning', 'data science', 'artificial intelligence']

and a list of texts in lowercase. Let’s say

texts = [
  'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
  'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]

I need to transform the texts into:

[[['the', 'new',
   'machine_learning',
   'model',
   'built',
   'by',
   'google',
   'is',
   'revolutionary',
   'for',
   'the',
   'current',
   'state',
   'of',
   'artificial_intelligence'],
  ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
 [['data_science',
   'and',
   'artificial_intelligence',
   'are',
   'two',
   'different',
   'fields',
   'although',
   'they',
   'are',
   'interconnected'],
  ['scientists',
   'from',
   'harvard',
   'are',
   'explaining',
   'it',
   'in',
   'a',
   'detailed',
   'presentation',
   'that',
   'could',
   'be',
   'found',
   'on',
   'our',
   'page']]]

What I do right now is check whether each keyword appears in a text and, if it does, replace it with the keyword joined by underscores. But this is of complexity m*n, and it is really slow when you have 700 long texts and 2M keywords, as in my case.
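
Roughly, my current approach is a nested loop over texts and keywords (a simplified sketch of what I'm doing, variable names are just for illustration):

for idx, text in enumerate(texts):
    for keyword in keywords:  # every keyword is scanned against every text -> m*n checks
        if keyword in text:
            texts[idx] = texts[idx].replace(keyword, keyword.replace(' ', '_'))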

I was trying to use gensim’s Phraser, but I can’t manage to build one from only my keywords.

Could someone suggest a more optimized way of doing this?

2 Answers


  1. This is probably not the most Pythonic way to do it, but it works in 3 steps.

    import re

    keywords = ['machine learning', 'data science', 'artificial intelligence']
    
    texts = [
        'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
        'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
    ]
    
    # Step 1: add underscores to each keyword found in the texts
    for idx, text in enumerate(texts):
      for keyword in keywords:
        if keyword in text:
          texts[idx] = texts[idx].replace(keyword, keyword.replace(" ", "_"))
    
    # Step 2: split each text into sentences on "."
    for idx, text in enumerate(texts):
      texts[idx] = list(filter(None, text.split(".")))
    
    # Step 3: split each sentence into words, stripping any leftover punctuation
    for idx, text in enumerate(texts):
      for idx_s, sentence in enumerate(text):
        texts[idx][idx_s] = list(map(lambda x: re.sub("[,.!?]", "", x), sentence.split()))
    
    print(texts)
    

    Output

    [
        [
            ['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'], 
            ['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
        ], 
        [
            ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'], 
            ['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
        ]
    ]
    
  2. The Phrases/Phraser classes of gensim are designed to use their internal, statistically-derived records of what word pairs should be promoted to phrases – not user-supplied pairings. (You could probably poke & prod a Phraser to do what you want, by synthesizing scores/thresholds, but that would be somewhat awkward & kludgey.)

    You could, however, mimic their general approach: (1) operate on lists-of-tokens rather than raw strings; (2) learn & remember token-pairs that should be combined; & (3) perform combination in a single pass. That should work far more efficiently than anything based on doing repeated search-and-replace on a string – which it sounds like you’ve already tried and found wanting.

    For example, let’s first create a dictionary, where the keys are tuples of word-pairs that should be combined, and the values are tuples that include both their designated combination-token, and a 2nd item that’s just an empty-tuple. (The reason for this will become clear later.)

    keywords = ['machine learning', 'data science', 'artificial intelligence']
    texts = [
        'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 
        'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
    ]
    
    combinations_dict = {tuple(kwsplit):('_'.join(kwsplit), ()) 
                         for kwsplit in [kwstr.split() for kwstr in keywords]}
    combinations_dict
    

    After this step, combinations_dict is:

    {('machine', 'learning'): ('machine_learning', ()),
     ('data', 'science'): ('data_science', ()),
     ('artificial', 'intelligence'): ('artificial_intelligence', ())}
    

    Now, we can use a Python generator function to create an iterable transformation of any other sequence-of-tokens, that takes original tokens one-by-one – but before emitting any, adds the next to a buffered candidate pair-of-tokens. If that pair is one that should be combined, a single combined token is yielded – but if not, just the 1st token is emitted, leaving the 2nd to be combined with the next token in a new candidate pair.

    For example:

    def combining_generator(tokens, comb_dict):
        buff = ()  # start with empty buffer
        for in_tok in tokens:
            buff += (in_tok,)  # add latest to buffer
            if len(buff) < 2:  # grow buffer to 2 tokens if possible
                continue
            # lookup what to do for current pair... 
            # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
            out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
            yield out_tok 
        if buff:
            yield buff[0]  # last solo token if any
    

    Here we see the reason for the earlier () empty-tuples: that’s the preferred state of the buff after a successful replacement. And driving the result & next-state this way helps us use the form of dict.get(key, default) that supplies a specific value to be used if the key isn’t found.
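
    To make that lookup concrete, here is roughly how the two cases play out, using illustrative pairs from the texts above:

    buff = ('machine', 'learning')
    combinations_dict.get(buff, (buff[0], (buff[1],)))
    # -> ('machine_learning', ())     # known pair: emit the combined token, reset buffer to ()

    buff = ('new', 'machine')
    combinations_dict.get(buff, (buff[0], (buff[1],)))
    # -> ('new', ('machine',))        # unknown pair: emit 'new', keep 'machine' buffered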

    Now designated combinations can be applied via:

    tokenized_texts = [text.split() for text in texts]
    retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
    retokenized_texts
    

    …which reports retokenized_texts as:

    [
      ['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'], 
      ['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
    ]
    

    Note that the tokens ('artificial', 'intelligence.') aren’t combined here, as the dirt-simple .split() tokenization used has left the punctuation attached, preventing an exact match to the rule.

    Real projects will want to use a more-sophisticated tokenization, that might either strip the punctuation, or retain punctuation as tokens, or do other preprocessing – and as a result would properly pass 'artificial' as a token without the attached '.'. For example a simple tokenization that just retains runs-of-word-characters discarding punctuation would be:

    import re
    tokenized_texts = [re.findall(r'\w+', text) for text in texts]
    tokenized_texts
    

    Another that also keeps any stray non-word/non-space characters (punctuation) as standalone tokens would be:

    tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
    tokenized_texts
    

    Either of these alternatives to a simple .split() would ensure your 1st text presents the necessary ('artificial', 'intelligence') pair for combination.
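
    If you also want the per-sentence nesting shown in the question, one possible way (just a sketch, assuming a naive split on '.' is good enough for sentence boundaries) is to split each text into sentences first, then tokenize and combine each sentence:

    import re

    retokenized_texts = [
        [list(combining_generator(re.findall(r'\w+', sentence), combinations_dict))
         for sentence in text.split('.') if sentence.strip()]
        for text in texts
    ]
    retokenized_texts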
