I’m trying to identify user similarities by comparing the keywords used in their profiles (from a website). For example, Alice = pizza, music, movies; Bob = cooking, guitar, movie; and Eve = knitting, running, gym. Ideally, Alice and Bob should come out as the most similar. I put together some simple code to calculate the similarity. To account for possible plural/singular versions of the keywords, I use something like:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
wnl = WordNetLemmatizer()
w1 = ["movies", "movie"]
tokens = [token.lower() for token in word_tokenize(" ".join(w1))]
lemmatized_words = [wnl.lemmatize(token) for token in tokens]
So that lemmatized_words = ["movie", "movie"].
Afterwards, I do some pairwise keyword comparisons using spacy, such as:
import spacy
nlp = spacy.load('en')
t1 = nlp(u"pizza")
t2 = nlp(u"food")
sim = t1.similarity(t2)
Now, the problem starts when I have to deal with compound words such as artificial intelligence, data science, whole food, etc. By tokenizing, I would simply split those terms in two (e.g. artificial and intelligence), but this would affect my similarity measure. What would be the best approach to take those types of words into account?
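For instance, with the same model as above, the compound gets split into separate tokens and the document vector is then just the average of the individual word vectors, so the concept itself is never looked up as a whole:

import spacy
nlp = spacy.load('en')  # same model as above

doc = nlp(u"artificial intelligence")
print([token.text for token in doc])              # ['artificial', 'intelligence']
print(doc.similarity(nlp(u"machine learning")))   # similarity of the averaged token vectors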
2 Answers
There are many ways to achieve this. One way would be to create the embeddings (vectors) yourself. This would have two advantages: first, you would be able to use bi-, tri-, and beyond (n-) grams as your tokens, and secondly, you would be able to define the space that is best suited for your needs: Wikipedia data is general, but, say, children’s stories would be a more niche dataset (and more appropriate/“accurate” if you were solving problems to do with children and/or stories). There are several methods, of course, word2vec being the most popular, and several packages to help you (e.g. gensim).
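For instance, a minimal sketch of that route, assuming gensim 4.x and a toy corpus that only serves to show the pipeline (with real data you would tune min_count and threshold):

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# toy corpus of tokenized documents; replace with your own data
sentences = [
    ["artificial", "intelligence", "is", "a", "branch", "of", "computer", "science"],
    ["data", "science", "uses", "artificial", "intelligence", "and", "statistics"],
    ["whole", "food", "stores", "sell", "organic", "food"],
]

# learn frequent bigrams so that e.g. "artificial intelligence" becomes the single
# token "artificial_intelligence" (thresholds are low only because the corpus is tiny)
bigram = Phrases(sentences, min_count=1, threshold=1)
bigram_sentences = [bigram[s] for s in sentences]

# train embeddings in which multi-word terms are single tokens
model = Word2Vec(bigram_sentences, vector_size=50, min_count=1, workers=1)
print(model.wv.most_similar("artificial_intelligence", topn=3))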
However, my guess is you would like something that’s already out there. Of the pre-trained word embeddings available, the one we are interested in for a quick resolution of your problem is sense2vec. You should read the paper, but essentially these word embeddings were created from Reddit comments with additional POS information, and are (thus) able to discriminate entities (e.g. nouns) that span multiple words. This blog post describes sense2vec very well. Here’s some code to help you get started (taken from the prior links):

Install:
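Roughly (the pinned version is an assumption, chosen to match the spacy requirement noted below; newer sense2vec releases expose a different API):

pip install sense2vec==0.6.0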
Example usage:
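Roughly as in the blog post; the load call, the "phrase|TAG" key format, and most_similar follow the old 0.6.x API, the cosine comparison at the end is just an illustration, and the example keys are assumed to exist in the Reddit vectors:

import sense2vec
import numpy as np

# load the pre-trained Reddit vectors (old 0.6.x API; newer releases differ)
model = sense2vec.load()

# keys have the form "phrase|TAG", so a multi-word concept is a single entry
freq, query_vector = model["natural_language_processing|NOUN"]

# nearest neighbours of that entry
print(model.most_similar(query_vector, n=3))

# or compare two entries directly with cosine similarity
freq2, other_vector = model["machine_learning|NOUN"]
print(np.dot(query_vector, other_vector) /
      (np.linalg.norm(query_vector) * np.linalg.norm(other_vector)))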
Important note: sense2vec requires spacy>=0.100,<0.101, meaning it will downgrade your current spacy install. That is not too much of a problem if you are only loading the en model. Also, note the POS tags used in the keys: coarse part-of-speech labels such as NOUN, VERB, and ADJ, plus spaCy’s named-entity labels.

You could use spacy for POS and dependency tagging, and then sense2vec to determine the similarity of the resulting entities; a rough sketch of that is below. Or, depending on the frequency of your dataset (if it is not too large), you could grab n-grams in descending (n) order and sequentially check whether each one is an entity in the sense2vec model.

Hope this helps!
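For instance, combining the two; the calls follow the old 0.6.x sense2vec interface, and the keys built from the noun chunks may not all exist in the Reddit vectors, so treat this as a sketch:

import spacy
import sense2vec

nlp = spacy.load('en')        # the model used in the question
s2v = sense2vec.load()        # pre-trained Reddit vectors (old 0.6.x API)

doc = nlp(u"I am interested in artificial intelligence and data science")
for chunk in doc.noun_chunks:
    key = chunk.text.lower().replace(" ", "_") + "|NOUN"   # sense2vec keys look like "data_science|NOUN"
    if key in s2v:                                          # skip chunks the model has never seen
        freq, vector = s2v[key]
        print(chunk.text, "->", s2v.most_similar(vector, n=3))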
There is also an approach using nltk:
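One option is nltk’s MWETokenizer, which re-tokenizes text so that listed multi-word expressions stay together as single tokens; a minimal sketch (the compound list is just an example):

from nltk.tokenize import MWETokenizer, word_tokenize

# compounds we want to keep as single tokens
mwe = MWETokenizer([("artificial", "intelligence"), ("data", "science"), ("whole", "food")],
                   separator=" ")

text = "I am interested in artificial intelligence and data science"
tokens = mwe.tokenize(word_tokenize(text.lower()))
print(tokens)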
The output is given as:
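['i', 'am', 'interested', 'in', 'artificial intelligence', 'and', 'data science']

The listed compounds survive tokenization as single tokens and can then be lemmatized or compared as before.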
For more reference you can read here.