
I am trying to run gensim's WMD similarity faster. This is what the docs typically show:
Example corpus:

    my_corpus = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    my_query = 'Human and artificial intelligence software programs'
    my_tokenized_query = ['human', 'artificial', 'intelligence', 'software', 'programs']

    from gensim.models import Word2Vec
    from gensim.similarities import WmdSimilarity

    # model: a Word2Vec model trained on about 100,000 documents similar to my_corpus
    model = Word2Vec.load(word2vec_model)

    def init_instance(my_corpus, model, num_best):
        instance = WmdSimilarity(my_corpus, model, num_best=num_best)
        return instance

    instance = init_instance(my_corpus, model, 1)
    instance[my_tokenized_query]

The best-matched document is "Human machine interface for lab abc computer applications", which is great.

However, the query above takes an extremely long time. So I thought of breaking the corpus into N parts and running WMD on each with num_best = 1; at the end, the part with the max score should contain the most similar document.

    import operator
    import gensim
    from multiprocessing import Process, Queue, Manager

    def main(my_query, global_jobs, process_tmp):
        process_query = gensim.utils.simple_preprocess(my_query)

        def worker(num, process_query, return_dict):
            instance = init_instance(my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
            x = instance[process_query][0][0]
            y = instance[process_query][0][1]
            return_dict[x] = y

        manager = Manager()
        return_dict = manager.dict()

        for num in range(num_workers):
            process_tmp = Process(target=worker, args=(num, process_query, return_dict))
            global_jobs.append(process_tmp)
            process_tmp.start()
        for proc in global_jobs:
            proc.join()

        return_dict = dict(return_dict)
        ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
        print my_corpus[ind]
        # prints: "Graph minors A survey"

The problem I have with this is that, even though it outputs something, it doesn't give me a good match from my corpus, even though it does take the max similarity across all the parts.

Am I doing something wrong?

2 Answers


  1. Comment: chunk is a static variable: e.g. chunk = 600 …

    If you define chunk statically, then you have to compute num_workers from it:

    10001 / 600 = 16.6683333333 -> 17 num_workers
    

    It’s common to use no more processes than you have cores.
    If you have 17 cores, that’s fine.

    The number of cores is fixed, so you should instead start from it:

    num_workers = os.cpu_count()
    chunk = chunksize(my_corpus, num_workers)
    

    1. Preprocessing the query with simple_preprocess does not reproduce
      my_tokenized_query (it keeps the stopword "and"), so it does not give
      the same result. Changed to:

      #process_query = gensim.utils.simple_preprocess(my_query)
      process_query = my_tokenized_query
      
    2. Every worker reports indices 0..n relative to its own chunk.
      Therefore, return_dict[x] can be overwritten by a later worker reporting
      the same index with a lower value, and the index in return_dict is NOT
      the same as the index in my_corpus. Changed to:

      #return_dict[x] = y
      return_dict[ (num * chunk)+x ] = y
      
    3. Using +1 when computing the slice skips the first document of each chunk.
      I don’t know how you compute chunk; consider this example (a combined
      sketch of all three fixes follows after this list):

      def chunksize(iterable, num_workers):
          c_size, extra = divmod(len(iterable), num_workers)
          if extra:
              c_size += 1
          if len(iterable) == 0:
              c_size = 0
          return c_size
      
      #Usage
      chunk = chunksize(my_corpus, num_workers)
      ...
      #my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
      my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
      
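    Putting the three fixes together, a minimal sketch of the corrected
    pipeline (it assumes fork-based multiprocessing, so the module-level
    my_corpus and model are inherited by the workers; init_instance and
    chunksize() are the functions defined above):

      import os
      import operator
      from multiprocessing import Process, Manager

      def worker(num, chunk, process_query, return_dict):
          # Slice without the +1, so no document is skipped (fix 3).
          instance = init_instance(my_corpus[num * chunk:(num + 1) * chunk], model, 1)
          results = instance[process_query]
          if results:  # the last chunk can be empty if the corpus doesn't divide evenly
              local_index, score = results[0]
              # Map the chunk-local index back to the global corpus index (fix 2).
              return_dict[num * chunk + local_index] = score

      def main(my_tokenized_query):
          num_workers = os.cpu_count()
          chunk = chunksize(my_corpus, num_workers)
          manager = Manager()
          return_dict = manager.dict()
          jobs = []
          for num in range(num_workers):
              # Pass the already tokenized query directly (fix 1).
              p = Process(target=worker,
                          args=(num, chunk, my_tokenized_query, return_dict))
              jobs.append(p)
              p.start()
          for p in jobs:
              p.join()
          # The key with the highest similarity is the global corpus index.
          best = max(dict(return_dict).items(), key=operator.itemgetter(1))[0]
          print(my_corpus[best])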

    Results over 10 runs; each tuple is (best index from worker num=0, best index from worker num=1):

    With multiprocessing, with chunk=5:
    02,09:(3, 8), 01,03:(3, 5):
    System and human system engineering testing of EPS
    04,06,07:(0, 8), 05,08:(0, 5), 10:(0, 7):
    Human machine interface for lab abc computer applications

    Without multiprocessing, with chunk=5:
    01:(3, 6), 02:(3, 5), 05,08,10:(3, 7), 07,09:(3, 8):
    System and human system engineering testing of EPS
    03,04,06:(0, 5):
    Human machine interface for lab abc computer applications

    Without multiprocessing, without chunking:
    01,02,03,04,06,07,08:(3, -1):
    System and human system engineering testing of EPS
    05,09,10:(0, -1):
    Human machine interface for lab abc computer applications

    Tested with Python: 3.4.2

  2. Using Python 2.7:
    I used threading instead of multiprocessing.
    In the WMD-instance creation thread, I do something like this:

        wmd_instances = []
        if wmd_instance_count > len(wmd_corpus):
            wmd_instance_count = len(wmd_corpus)
        chunk_size = int(len(wmd_corpus) / wmd_instance_count)
        for i in range(0, wmd_instance_count):
            if i == wmd_instance_count - 1:
                # the last chunk also takes any remainder
                wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:], wmd_model, num_results)
            else:
                wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:(i+1)*chunk_size], wmd_model, num_results)
            wmd_instances.append(wmd_instance)
        wmd_logic.setWMDInstances(wmd_instances, chunk_size)
    

    ‘wmd_instance_count’ is the number of threads to use for searching. I also remember the chunk size. Then, when I want to search for something, I start ‘wmd_instance_count’ threads that each search their own chunk and return the found sims (a sketch of the dispatch follows the function below):

    def perform_query_for_job_on_instance(wmd_logic, wmd_instances, query, jobID, instance):
        wmd_instance = wmd_instances[instance]
        sims = wmd_instance[query]
        wmd_logic.set_mt_thread_result(jobID, instance, sims)
    
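    The dispatch itself isn’t shown above; a minimal sketch of how the threads
    might be started, using the standard threading module (the run_query name
    is hypothetical, the rest follows the signatures above):

        import threading

        def run_query(wmd_logic, wmd_instances, query, jobID):
            # One thread per WMD instance; each searches its own corpus chunk.
            threads = []
            for instance in range(len(wmd_instances)):
                t = threading.Thread(target=perform_query_for_job_on_instance,
                                     args=(wmd_logic, wmd_instances, query,
                                           jobID, instance))
                threads.append(t)
                t.start()
            # Wait until every chunk has reported its sims.
            for t in threads:
                t.join()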

    ‘wmd_logic’ is the instance of a class that then does this:

    def set_mt_thread_result(self, jobID, instance, sims):
        res = []
        #
        # We need to scale the found ids back to our complete corpus size...
        #
        for sim in sims:
            aSim = (int(sim[0] + (instance * self.chunk_size)), sim[1])
            res.append(aSim)
        # res is then stored per (jobID, instance) for later aggregation (elided here)
    

    I know the code isn’t nice, but it works. It uses ‘wmd_instance_count’ threads to find results; I aggregate them and then choose the top 10 or so, as in the sketch below.
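
    The aggregation step could look like this minimal sketch, assuming the
    rescaled per-thread result lists have been collected (per_thread_results
    is a hypothetical name; each element is a list of (global_index,
    similarity) tuples as built in set_mt_thread_result above):

        # Merge the per-thread result lists and keep the overall best matches.
        all_sims = []
        for thread_results in per_thread_results:
            all_sims.extend(thread_results)
        # Sort by similarity, highest first, and take the top 10.
        top_10 = sorted(all_sims, key=lambda s: s[1], reverse=True)[:10]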

    Hope this helps.
