
I have a Django app that I'm currently working on. This app computes Euclidean distances for 2,000+ items.

I'm using this data to build a recommendation system with content-based filtering. Content-based filtering works like this: when you click an item, the system finds the other items whose features are closest to it. I have already figured out the features. What I need is: when a person clicks an item, I calculate the Euclidean distance between its features and everything else's and use the result. So I will use the Euclidean distances of all possible combinations. Because I run the recommendation every X hours, I need to store the distances for all combinations.

Computing that much data while the site is under heavy load could bring it down, so I've thought about several solutions, but I don't know whether things behave differently once the app is deployed.

  1. The first idea is to compute all the distances and put them in a hardcoded variable in some_file.py. The file would look like this:

    data = [[1,2,..],[3,4,..],[5,6,..],[7,8,..],...]

    and would be accessed like data[0][2] == 2.

    This file is 60 MB.

  2. The second idea is the basic one: I create a table with three columns, A, B, and euclidean_distance(A, B). But this solution will create 4,000,000+ records.
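For the second idea, a schema along these lines (table and column names are illustrative, not from the question) keeps storage at one row per unordered pair by enforcing A < B, which halves the row count compared with storing both (A, B) and (B, A):

```sql
-- Hypothetical PostgreSQL schema: one row per unordered pair (item_a < item_b).
CREATE TABLE item_distance (
    item_a   integer NOT NULL,
    item_b   integer NOT NULL,
    distance double precision NOT NULL,
    PRIMARY KEY (item_a, item_b),
    CHECK (item_a < item_b)
);

-- The primary key already indexes lookups by item_a; add one for item_b
-- so "all distances for a clicked item" is fast in both orientations:
CREATE INDEX item_distance_b_idx ON item_distance (item_b);

-- Fetch the 10 nearest neighbours of a clicked item (id 42 as an example):
-- SELECT * FROM item_distance
--  WHERE item_a = 42 OR item_b = 42
--  ORDER BY distance ASC LIMIT 10;
```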

*NOTES

I'm using PostgreSQL for my database.
I'm only comparing two items at a time, so it will be a 2D Euclidean distance.
I have several features, but I've only posted one so that I can apply the same approach to the other features once it works.
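As a concrete baseline, the all-pairs computation described above can be sketched in plain Python (math.dist requires Python 3.8+); the feature values here are made up for illustration:

```python
import math

# Hypothetical 2D feature vectors, one per item (values are made up).
features = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]

# Full symmetric distance matrix, matching the data[i][j] access
# pattern from idea 1. For n items this holds n*n floats, so it
# grows quadratically with the catalogue size.
n = len(features)
data = [[math.dist(features[i], features[j]) for j in range(n)]
        for i in range(n)]
```

Since the matrix is symmetric with a zero diagonal, only the entries above the diagonal actually need to be stored.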

My questions are:

  1. Which is the better solution for storing all the distances once the app is deployed?
  2. I'm planning to add more data in the future; my calculation is that it will take (n² − n)/2 rows in the database. At what point does the database become so big that every query against it slows down noticeably, say by 10-20 seconds?
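For sizing estimates: the number of unordered pairs of n items, excluding self-pairs, is n(n − 1)/2, so the row count grows quadratically:

```python
def pair_count(n: int) -> int:
    """Rows needed to store one distance per unordered item pair."""
    return n * (n - 1) // 2

# 2000 items already need about two million rows;
# 10000 items need about fifty million.
print(pair_count(2000))   # 1999000
print(pair_count(10000))  # 49995000
```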

I'm open to solutions other than the two above.

3 Answers


  1. You should really take a look at this article about optimising pairwise Euclidean distance calculations from Towards Data Science, which is pretty much about your situation.

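    The article's core idea, vectorizing the pairwise computation instead of looping in Python, can be sketched with SciPy (assuming scipy and numpy are available; the feature values are made up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 2D feature vectors, one row per item (values are made up).
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])

# pdist returns only the n*(n-1)/2 unique pair distances ("condensed"
# form), which is also the minimum you would need to persist.
condensed = pdist(features, metric="euclidean")

# squareform expands the condensed vector to the full symmetric
# matrix for data[i][j]-style access.
data = squareform(condensed)
```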
  2. If you're familiar with Pandas, you can do all of this with Vaex. It's exactly what it's designed for. By building the index correctly, you can easily slice and dice the data and retrieve what you're looking for. https://vaex.io/

  3. I tend to prefer using a database, especially when you envision that the system will grow, because that way you can scale the web tier and the database independently. Overall, that seems to be the method most people go with as well.

    Some other considerations
