
I have a Django app that I'm currently working on. This app computes Euclidean distances for 2,000+ items.

I'm using this data to build a recommendation system with content-based filtering. Content-based filtering works like this: when you click an item, the system finds the other items whose features are closest to it. I have already figured out the features. What I need is: when a person clicks an item, I calculate the Euclidean distance between its features and everything else's and use the result. So I will use the Euclidean distances of all possible combinations. Because I run the recommendation every X hours, I need to store the distances for all combinations.

Computing that much data while the site is under heavy load could bring it down, so I've thought about several solutions, but I don't know whether things behave differently once the app is deployed.

  1. The first idea is to compute all the distances and put them in a hardcoded variable in some_file.py. The file would look like this:

    data = [[1,2,..],[3,4,..],[5,6,..],[7,8,..],...]

    and would be accessed like data[0][2] == 2.

    This file is 60 MB.

  2. The second idea is the basic one: I create a table with three columns, A, B, and euclidean_distance(A, B). But this solution will create 4,000,000+ records.
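For the second idea, a schema along these lines (table and column names are illustrative, not from the question) keeps storage at one row per unordered pair by enforcing A < B, which halves the row count compared with storing both (A, B) and (B, A):

```sql
-- Hypothetical PostgreSQL schema: one row per unordered pair (item_a < item_b).
CREATE TABLE item_distance (
    item_a   integer NOT NULL,
    item_b   integer NOT NULL,
    distance double precision NOT NULL,
    PRIMARY KEY (item_a, item_b),
    CHECK (item_a < item_b)
);

-- The primary key already indexes lookups by item_a; add one for item_b
-- so "all distances for a clicked item" is fast in both orientations:
CREATE INDEX item_distance_b_idx ON item_distance (item_b);

-- Fetch the 10 nearest neighbours of a clicked item (id 42 as an example):
-- SELECT * FROM item_distance
--  WHERE item_a = 42 OR item_b = 42
--  ORDER BY distance ASC LIMIT 10;
```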

*NOTES

I'm using PostgreSQL for my database.
I'm only comparing two items at a time, so it will be a 2D Euclidean distance.
I have several features, but I've only posted one so that I can apply the same approach to the other features once it works.
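As a concrete baseline, the all-pairs computation described above can be sketched in plain Python (math.dist requires Python 3.8+); the feature values here are made up for illustration:

```python
import math

# Hypothetical 2D feature vectors, one per item (values are made up).
features = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]

# Full symmetric distance matrix, matching the data[i][j] access
# pattern from idea 1. For n items this holds n*n floats, so it
# grows quadratically with the catalogue size.
n = len(features)
data = [[math.dist(features[i], features[j]) for j in range(n)]
        for i in range(n)]
```

Since the matrix is symmetric with a zero diagonal, only the entries above the diagonal actually need to be stored.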

My questions are:

  1. Which is the better solution for storing all the distances once the app is deployed?
  2. I'm planning to add more data in the future; my calculation is that it will take (n² − n)/2 rows in the database. At what point does the database become so big that every query against it slows down noticeably, say by 10-20 seconds?
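For sizing estimates: the number of unordered pairs of n items, excluding self-pairs, is n(n − 1)/2, so the row count grows quadratically:

```python
def pair_count(n: int) -> int:
    """Rows needed to store one distance per unordered item pair."""
    return n * (n - 1) // 2

# 2000 items already need about two million rows;
# 10000 items need about fifty million.
print(pair_count(2000))   # 1999000
print(pair_count(10000))  # 49995000
```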

I'm open to solutions other than the two above.

3 Answers


  1. You should really take a look at this article about optimising pairwise Euclidean distance calculations from Towards Data Science, which is pretty much about your situation.

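    The article's core idea, vectorizing the pairwise computation instead of looping in Python, can be sketched with SciPy (assuming scipy and numpy are available; the feature values are made up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 2D feature vectors, one row per item (values are made up).
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# pdist returns only the n*(n-1)/2 unique pair distances ("condensed"
# form), which is also the minimum you would need to persist.
condensed = pdist(features, metric="euclidean")

# squareform expands the condensed vector to the full symmetric
# matrix for data[i][j]-style access.
data = squareform(condensed)
```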
  2. If you're familiar with Pandas, you can do all of this with Vaex. It's exactly what it's designed for. By building the index correctly, you can easily slice and dice the data and retrieve what you're looking for. https://vaex.io/

  3. I tend to prefer using a database, especially when you envision that the system will grow, because that way you can scale the web tier and the database independently. Overall, that seems to be the method most people go with as well.

    Some other considerations
