
I am trying to run gensim's WMD similarity faster. This is what the docs typically show:
Example corpus:

    my_corpus = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

    my_query = 'Human and artificial intelligence software programs'
    my_tokenized_query = ['human', 'artificial', 'intelligence', 'software', 'programs']

    from gensim.models import Word2Vec
    from gensim.similarities import WmdSimilarity

    # model: a Word2Vec model trained on about 100,000 documents similar to my_corpus
    model = Word2Vec.load(word2vec_model)

    def init_instance(my_corpus, model, num_best):
        instance = WmdSimilarity(my_corpus, model, num_best=num_best)
        return instance

    instance = init_instance(my_corpus, model, 1)
    instance[my_tokenized_query]

The best-matched document is "Human machine interface for lab abc computer applications", which is great.

However, the query above takes an extremely long time. So I thought of breaking the corpus into N parts and running WMD on each with num_best = 1; at the end, the part with the max score should contain the most similar document.

    import operator
    import gensim
    from multiprocessing import Process, Queue, Manager

    def main(my_query, global_jobs, process_tmp):
        process_query = gensim.utils.simple_preprocess(my_query)

        def worker(num, process_query, return_dict):
            instance = init_instance(my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
            x = instance[process_query][0][0]
            y = instance[process_query][0][1]
            return_dict[x] = y

        manager = Manager()
        return_dict = manager.dict()

        for num in range(num_workers):
            process_tmp = Process(target=worker, args=(num, process_query, return_dict))
            global_jobs.append(process_tmp)
            process_tmp.start()
        for proc in global_jobs:
            proc.join()

        return_dict = dict(return_dict)
        ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
        print my_corpus[ind]
        # prints: "Graph minors A survey"

The problem I have with this is that, even though it outputs something, it doesn't give me a good match from my corpus, even though it does take the max similarity across all the parts.

Am I doing something wrong?

2 Answers


  1. Comment: chunk is a static variable: e.g. chunk = 600 …

    If you define chunk statically, then you have to compute num_workers from it:

    10001 / 600 = 16.6683333333 -> 17 num_workers
    

    It’s common to use no more processes than you have cores.
    If you have 17 cores, that’s fine.

    The number of cores is fixed, so you should instead start from it:

    num_workers = os.cpu_count()
    chunk = chunksize(my_corpus, num_workers)
    

    1. Preprocessing the query with simple_preprocess does not reproduce
      my_tokenized_query (it keeps the stopword "and"), so it does not give
      the same result. Changed to:

      #process_query = gensim.utils.simple_preprocess(my_query)
      process_query = my_tokenized_query
      
    2. Every worker reports indices 0..n relative to its own chunk.
      Therefore, return_dict[x] can be overwritten by a later worker reporting
      the same index with a lower value, and the index in return_dict is NOT
      the same as the index in my_corpus. Changed to:

      #return_dict[x] = y
      return_dict[ (num * chunk)+x ] = y
      
    3. Using +1 when computing the slice skips the first document of each chunk.
      I don’t know how you compute chunk; consider this example (a combined
      sketch of all three fixes follows after this list):

      def chunksize(iterable, num_workers):
          c_size, extra = divmod(len(iterable), num_workers)
          if extra:
              c_size += 1
          if len(iterable) == 0:
              c_size = 0
          return c_size
      
      #Usage
      chunk = chunksize(my_corpus, num_workers)
      ...
      #my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
      my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
      
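    Putting the three fixes together, a minimal sketch of the corrected
    pipeline (it assumes fork-based multiprocessing, so the module-level
    my_corpus and model are inherited by the workers; init_instance and
    chunksize() are the functions defined above):

      import os
      import operator
      from multiprocessing import Process, Manager

      def worker(num, chunk, process_query, return_dict):
          # Slice without the +1, so no document is skipped (fix 3).
          instance = init_instance(my_corpus[num * chunk:(num + 1) * chunk], model, 1)
          results = instance[process_query]
          if results:  # the last chunk can be empty if the corpus doesn't divide evenly
              local_index, score = results[0]
              # Map the chunk-local index back to the global corpus index (fix 2).
              return_dict[num * chunk + local_index] = score

      def main(my_tokenized_query):
          num_workers = os.cpu_count()
          chunk = chunksize(my_corpus, num_workers)
          manager = Manager()
          return_dict = manager.dict()
          jobs = []
          for num in range(num_workers):
              # Pass the already tokenized query directly (fix 1).
              p = Process(target=worker,
                          args=(num, chunk, my_tokenized_query, return_dict))
              jobs.append(p)
              p.start()
          for p in jobs:
              p.join()
          # The key with the highest similarity is the global corpus index.
          best = max(dict(return_dict).items(), key=operator.itemgetter(1))[0]
          print(my_corpus[best])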

    Results over 10 runs; each tuple is (best index from worker num=0, best index from worker num=1):

    With multiprocessing, with chunk=5:
    02,09:(3, 8), 01,03:(3, 5):
    System and human system engineering testing of EPS
    04,06,07:(0, 8), 05,08:(0, 5), 10:(0, 7):
    Human machine interface for lab abc computer applications

    Without multiprocessing, with chunk=5:
    01:(3, 6), 02:(3, 5), 05,08,10:(3, 7), 07,09:(3, 8):
    System and human system engineering testing of EPS
    03,04,06:(0, 5):
    Human machine interface for lab abc computer applications

    Without multiprocessing, without chunking:
    01,02,03,04,06,07,08:(3, -1):
    System and human system engineering testing of EPS
    05,09,10:(0, -1):
    Human machine interface for lab abc computer applications

    Tested with Python: 3.4.2

  2. Using Python 2.7:
    I used threading instead of multiprocessing.
    In the WMD-instance creation thread, I do something like this:

        wmd_instances = []
        if wmd_instance_count > len(wmd_corpus):
            wmd_instance_count = len(wmd_corpus)
        chunk_size = int(len(wmd_corpus) / wmd_instance_count)
        for i in range(0, wmd_instance_count):
            if i == wmd_instance_count - 1:
                # the last chunk also takes any remainder
                wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:], wmd_model, num_results)
            else:
                wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:(i+1)*chunk_size], wmd_model, num_results)
            wmd_instances.append(wmd_instance)
        wmd_logic.setWMDInstances(wmd_instances, chunk_size)
    

    ‘wmd_instance_count’ is the number of threads to use for searching. I also remember the chunk size. Then, when I want to search for something, I start ‘wmd_instance_count’ threads that each search their own chunk and return the found sims (a sketch of the dispatch follows the function below):

    def perform_query_for_job_on_instance(wmd_logic, wmd_instances, query, jobID, instance):
        wmd_instance = wmd_instances[instance]
        sims = wmd_instance[query]
        wmd_logic.set_mt_thread_result(jobID, instance, sims)
    
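    The dispatch itself isn’t shown above; a minimal sketch of how the threads
    might be started, using the standard threading module (the run_query name
    is hypothetical, the rest follows the signatures above):

        import threading

        def run_query(wmd_logic, wmd_instances, query, jobID):
            # One thread per WMD instance; each searches its own corpus chunk.
            threads = []
            for instance in range(len(wmd_instances)):
                t = threading.Thread(target=perform_query_for_job_on_instance,
                                     args=(wmd_logic, wmd_instances, query,
                                           jobID, instance))
                threads.append(t)
                t.start()
            # Wait until every chunk has reported its sims.
            for t in threads:
                t.join()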

    ‘wmd_logic’ is the instance of a class that then does this:

    def set_mt_thread_result(self, jobID, instance, sims):
        res = []
        #
        # We need to scale the found ids back to our complete corpus size...
        #
        for sim in sims:
            aSim = (int(sim[0] + (instance * self.chunk_size)), sim[1])
            res.append(aSim)
        # res is then stored per (jobID, instance) for later aggregation (elided here)
    

    I know the code isn’t nice, but it works. It uses ‘wmd_instance_count’ threads to find results; I aggregate them and then choose the top 10 or so, as in the sketch below.
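
    The aggregation step could look like this minimal sketch, assuming the
    rescaled per-thread result lists have been collected (per_thread_results
    is a hypothetical name; each element is a list of (global_index,
    similarity) tuples as built in set_mt_thread_result above):

        # Merge the per-thread result lists and keep the overall best matches.
        all_sims = []
        for thread_results in per_thread_results:
            all_sims.extend(thread_results)
        # Sort by similarity, highest first, and take the top 10.
        top_10 = sorted(all_sims, key=lambda s: s[1], reverse=True)[:10]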

    Hope this helps.
