
I'm trying to identify user similarities by comparing the keywords used in their profiles (from a website). For example, Alice = pizza, music, movies; Bob = cooking, guitar, movie; and Eve = knitting, running, gym. Ideally, Alice and Bob are the most similar. I put down some simple code to calculate the similarity. To account for possible plural/singular versions of the keywords, I use something like:

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
wnl = WordNetLemmatizer()
w1 = ["movies", "movie"] 
tokens = [token.lower() for token in word_tokenize(" ".join(w1))]
lemmatized_words = [wnl.lemmatize(token) for token in tokens]

So that lemmatized_words = ["movie", "movie"].
Afterwards, I do some pairwise keyword comparisons using spaCy, such as:

import spacy

# 'en' is the old spaCy model shortcut; newer spaCy versions need a named
# model with word vectors, e.g. spacy.load('en_core_web_md')
nlp = spacy.load('en')
t1 = nlp(u"pizza")
t2 = nlp(u"food")
sim = t1.similarity(t2)
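
For completeness, a rough sketch of the kind of aggregation I have in mind (the user_similarity helper and the max-then-average strategy are just one possible choice):

def user_similarity(keywords1, keywords2):
    # uses the nlp object loaded above: for each keyword of the first user,
    # take its best match among the second user's keywords, then average
    docs1 = [nlp(k) for k in keywords1]
    docs2 = [nlp(k) for k in keywords2]
    best = [max(d1.similarity(d2) for d2 in docs2) for d1 in docs1]
    return sum(best) / len(best)

alice = ["pizza", "music", "movies"]
bob = ["cooking", "guitar", "movie"]
eve = ["knitting", "running", "gym"]
print(user_similarity(alice, bob), user_similarity(alice, eve))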

Now, the problem starts when I have to deal with compound words such as artificial intelligence, data science, whole food, etc. By tokenizing, I would simply split those terms in two (e.g. artificial and intelligence), and this would affect my similarity measure. What would be the best approach to take these kinds of words into account?

2 Answers


  1. There are many ways to achieve this. One way would be to create the embeddings (vectors) yourself. This has two advantages: first, you can use bi-, tri-, and beyond (n-)grams as your tokens, and second, you can define the space that is best suited for your needs. Wikipedia data is general, but, say, children’s stories would be a more niche dataset (and more appropriate / “accurate” if you were solving problems to do with children and/or stories). There are several methods, word2vec of course being the most popular, and several packages to help you (e.g. gensim).
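
    If you go down that route, a minimal sketch with gensim might look like this (the toy corpus and the Phrases/Word2Vec parameters are placeholders to adapt to your data; vector_size and epochs are the gensim 4.x parameter names, older versions use size and iter):

    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser

    # toy corpus of tokenized sentences, for illustration only
    sentences = [
        ["data", "science", "uses", "artificial", "intelligence"],
        ["artificial", "intelligence", "and", "data", "science", "overlap"],
        ["whole", "food", "stores", "sell", "whole", "food"],
    ]

    # detect frequent word pairs and merge them into single tokens
    # such as "artificial_intelligence"
    bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
    merged = [bigram[sentence] for sentence in sentences]

    # train embeddings on the merged corpus
    model = Word2Vec(merged, vector_size=50, min_count=1, epochs=50)
    print(model.wv.similarity("artificial_intelligence", "data_science"))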

    However, my guess is you would like something that’s already out there. The best word embeddings right now are:

    • Numberbatch (‘classic’ best-in-class ensemble);
    • fastText, by Facebook Research (trained at the character level, so some out-of-vocabulary words can still be “understood”);
    • sense2vec, by the same people behind spaCy (created using part-of-speech (POS) tags as additional information, with the objective of disambiguating word senses).

    The one we are interested in for a quick solution to your problem is sense2vec. You should read the paper, but essentially these word embeddings were created from Reddit comments with additional POS information, and are thus able to discriminate entities (e.g. nouns) that span multiple words. This blog post describes sense2vec very well. Here’s some code to help you get started (taken from the prior links):

    Install:

    git clone https://github.com/explosion/sense2vec
    cd sense2vec
    pip install -r requirements.txt
    pip install -e .
    sputnik --name sense2vec --repository-url http://index.spacy.io install reddit_vectors
    

    Example usage:

    import sense2vec

    # load the pre-trained Reddit vectors installed above
    model = sense2vec.load()

    # multi-word entities are looked up by key: lower-cased words joined
    # with underscores, plus the POS tag
    freq, query_vector = model["onion_rings|NOUN"]
    freq2, query_vector2 = model["chicken_nuggets|NOUN"]

    print(model.most_similar(query_vector, n=5)[0])
    print(model.data.similarity(query_vector, query_vector2))
    

    Important note: sense2vec requires spacy>=0.100,<0.101, which means it will downgrade your current spaCy install. This is not too much of a problem if you are only loading the en model. Also, here are the POS tags used:

    ADJ ADP ADV AUX CONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
    

    You could use spaCy for POS and dependency tagging, and then sense2vec to determine the similarity of the resulting entities. Or, if your dataset is not too large, you could extract n-grams in descending order of n and sequentially check whether each one is an entity in the sense2vec model, as in the sketch below.
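
    A rough sketch of that second idea, with a plain membership check standing in for the actual sense2vec lookup (the merge_ngrams helper, the toy vocabulary, and max_n are placeholders):

    def merge_ngrams(tokens, in_vocab, max_n=3):
        # greedily take the longest n-gram (n descending) that the
        # vocabulary check recognises, otherwise keep the single token
        merged, i = [], 0
        while i < len(tokens):
            for n in range(max_n, 0, -1):
                candidate = " ".join(tokens[i:i + n])
                if n == 1 or in_vocab(candidate):
                    merged.append(candidate)
                    i += n
                    break
        return merged

    # toy vocabulary standing in for the sense2vec entity keys
    vocab = {"data science", "artificial intelligence"}
    tokens = "I love data science and artificial intelligence".split()
    print(merge_ngrams(tokens, vocab.__contains__))
    # ['I', 'love', 'data science', 'and', 'artificial intelligence']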

    Hope this helps!

  2. There is an approach using nltk’s MWETokenizer (multi-word expression tokenizer):

    from nltk.tokenize import MWETokenizer
    
    tokenizer = MWETokenizer([("artificial","intelligence"), ("data","science")], separator=' ')
    
    tokens = tokenizer.tokenize("I am really interested in data science and artificial intelligence".split())
    print(tokens)
    

    The output is given as:

    ['I', 'am', 'really', 'interested', 'in', 'data science', 'and', 'artificial intelligence']
    

    For more detail, see the nltk documentation for MWETokenizer.
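
    To connect this back to the question: spaCy’s Doc.similarity also works on multi-word texts (the document vector is the average of the token vectors), so the merged keywords can be compared directly. A small sketch, assuming a model that ships with word vectors such as en_core_web_md:

    import spacy

    # a model with word vectors, e.g. en_core_web_md or en_core_web_lg
    nlp = spacy.load("en_core_web_md")

    t1 = nlp("artificial intelligence")
    t2 = nlp("data science")
    print(t1.similarity(t2))  # similarity of the averaged token vectors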
