
I am executing the two code snippets below to compute the cosine similarity of two vectors. The vectors are the same for both runs, and the second snippet is essentially the code SciPy runs internally (see the SciPy cosine implementation).

The thing is that SciPy runs slightly faster (~0.55 ms vs ~0.69 ms) and I don't understand why, since my implementation is SciPy's code with some checks removed, which if anything I would expect to make it faster.

Why is SciPy’s function faster?


import time
import math

import numpy as np
from scipy.spatial import distance

SIZE = 6400000
EXECUTIONS = 10000

path = "" # From https://github.com/joseprupi/cosine-similarity-comparison/blob/master/tools/vectors.csv
file_data = np.genfromtxt(path, delimiter=',')
A,B = np.moveaxis(file_data, 1, 0).astype('f')


start_time = time.time()
for _ in range(EXECUTIONS):    
    cos_sim = distance.cosine(A,B)

print(" %s ms" % (((time.time() - start_time) * 1000)/EXECUTIONS))
cos_sim_scipy = cos_sim

def cosine(u, v, w=None):
    
    uv = np.dot(u, v)
    uu = np.dot(u, u)
    vv = np.dot(v, v)
    dist = 1.0 - uv / math.sqrt(uu * vv)
    # Clip the result to avoid rounding error
    return np.clip(dist, 0.0, 2.0)


start_time = time.time()
for _ in range(EXECUTIONS):
    cos_sim = cosine(A,B)

print(" %s ms" % (((time.time() - start_time) * 1000)/EXECUTIONS))
cos_sim_manual = cos_sim

print(np.isclose(cos_sim_scipy, cos_sim_manual))
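A sketch of the same benchmark using `timeit` instead of manual `time.time()` bookkeeping (random stand-in vectors here, since vectors.csv isn't loaded; `np.stack` plus `np.moveaxis` produce A and B the same way the question's loading code does):

```python
import timeit

import numpy as np
from scipy.spatial import distance

# Random stand-in vectors (the question loads them from vectors.csv instead),
# stacked and split the same way as in the question.
rng = np.random.default_rng(0)
data = np.stack([rng.normal(loc=1.5, size=640000),
                 rng.normal(loc=-1.5, scale=2.0, size=640000)], axis=1)
A, B = np.moveaxis(data, 1, 0).astype('f')

runs = 100
ms = timeit.timeit(lambda: distance.cosine(A, B), number=runs) * 1000 / runs
print(f"{ms:.3f} ms per call")
```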

EDIT:

The code used to generate A and B is below, and the exact file I am using can be found at:

https://github.com/joseprupi/cosine-similarity-comparison/blob/master/tools/vectors.csv

def generate_random_vector(size):
    """
    Generate 2 random vectors with the provided size
    and save them in a text file
    """
    A = np.random.normal(loc=1.5, size=(size,))
    B = np.random.normal(loc=-1.5, scale=2.0, size=(size,))
    vectors = np.stack([A, B], axis=1)
    np.savetxt('vectors.csv', vectors, fmt='%f,%f')

generate_random_vector(640000)

Setup:

  • AMD Ryzen 9 3900X 12-Core Processor
  • 64GB RAM
  • Debian 12
  • Python 3.11.2
  • scipy 1.13.0
  • numpy 1.26.4


Answers


  1. It seems SciPy does this at the beginning of its correlation() function (which cosine() delegates to), and that practically means:

    u = np.asarray(u, dtype=None, order="c")
    v = np.asarray(v, dtype=None, order="c")
    

    This ensures that the arrays are C_CONTIGUOUS (you can check this by printing u.flags and/or v.flags).

    I presume NumPy uses different implementations of np.dot for contiguous and non-contiguous arrays.
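As a quick illustration (toy data, not the question's vectors), the flags can be checked like this:

```python
import numpy as np

# np.moveaxis returns a strided view, so the rows pulled out of it
# are not C-contiguous, just like A and B in the question.
data = np.arange(12, dtype='f').reshape(6, 2)
a, b = np.moveaxis(data, 1, 0)
print(a.flags['C_CONTIGUOUS'])    # False: `a` strides over every other element

# np.asarray with order="c" copies only when necessary
a_c = np.asarray(a, order="c")
print(a_c.flags['C_CONTIGUOUS'])  # True
```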


    If you change your function to:

    def cosine(u, v, w=None):
        u = np.asarray(u, dtype=None, order="c")  # <-- Ensure C_CONTIGUOUS True
        v = np.asarray(v, dtype=None, order="c")  # <-- ditto.
    
        uv = np.dot(u, v)
        uu = np.dot(u, u)
        vv = np.dot(v, v)
        dist = 1.0 - uv / math.sqrt(uu * vv)
        # Clip the result to avoid rounding error
        return np.clip(dist, 0.0, 2.0)
    

    With that change I get the same timing for both, 0.45 ms vs 0.45 ms, on my AMD 5700X.
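For completeness, a sketch with random stand-in data checking that the contiguous copies change the speed, not the result:

```python
import math

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
data = rng.normal(size=(10000, 2))
u, v = np.moveaxis(data, 1, 0).astype('f')  # non-contiguous, as in the question

def cosine(u, v):
    # Same arithmetic as the question's function
    uv = np.dot(u, v)
    uu = np.dot(u, u)
    vv = np.dot(v, v)
    return np.clip(1.0 - uv / math.sqrt(uu * vv), 0.0, 2.0)

manual = cosine(u, v)                         # non-contiguous inputs
contiguous = cosine(np.ascontiguousarray(u),  # contiguous copies
                    np.ascontiguousarray(v))
print(np.isclose(manual, contiguous), np.isclose(manual, distance.cosine(u, v)))
```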

  2. I would point out that if you’re looking for the fastest implementation, SimSIMD is faster than either SciPy or your manual implementation.

    Example of how to use this:

    import simsimd
    
    
    def cosine(u, v, w=None):
        # Note: simsimd requires contiguous input
        u = np.asarray(u, dtype=None, order="c")
        v = np.asarray(v, dtype=None, order="c")
        return simsimd.cosine(u, v)
    

    On my system, this is about 13% faster (1.60 ms vs 1.40 ms).
