
I am currently working on a Django-based service that will gather users’ data from their Shopify shops and provide some insights. I use PostgreSQL as my DB server. The problem I am facing is whether it is more efficient to create a separate table for each user and store their data there, or to keep everything in one big table. The approximate data volume is about 100k–1M rows per user. Data is unrelated between users. Furthermore, I will need to store the user-specific results of my analysis. Any thoughts?

2 Answers


  1. I suggest you have one big table (Table1) for the data of all users (one column would indicate which user each row belongs to).
    For example, to store all the pages a user visited on one of your sites, you could use something like this:

    Table1
        pk   # maybe automatic id
        user_id
        datetime
        page_url
        query_string
        some_session_data
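
    A minimal Django model sketch of such a table could look like this (the model name PageVisit and the exact field types are just assumptions for illustration):

    from django.contrib.auth.models import User
    from django.db import models


    class PageVisit(models.Model):
        # one row per visited page, for all users in a single table;
        # the user foreign key plays the role of the user_id column above
        user = models.ForeignKey(User, on_delete=models.CASCADE)
        datetime = models.DateTimeField(db_index=True)
        page_url = models.URLField()
        query_string = models.TextField(blank=True)
        some_session_data = models.TextField(blank=True)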
    

    If Table1 gets too big (as you indicated may happen), you could create periodic backups by date range (for example, one backup file per month of Table1 rows) and later delete the older data from Table1.

    backup_table1_2018_01.sql
    backup_table1_2018_02.sql
    backup_table1_2018_03.sql
    

    Furthermore, you could have a second table (Table2) that stores the results of analysing the data from Table1. Table2 would also have a column indicating which user each row belongs to, and maybe another column indicating which date range it covers.

    Continuing with the example, you could create a ranking to see the most visited page for each user in a single month:

    Table2
        pk   # maybe automatic id
        user_id
        year
        month
        page_url
        number_of_visits
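
    A rough sketch of how that ranking could be computed with the Django ORM, assuming Table1 is mapped to the hypothetical PageVisit model sketched above:

    from django.db.models import Count

    def monthly_page_ranking(year, month):
        # count visits per (user, page) for one month, most visited first;
        # the result rows map onto the Table2 columns above
        return (
            PageVisit.objects
            .filter(datetime__year=year, datetime__month=month)
            .values('user_id', 'page_url')
            .annotate(number_of_visits=Count('pk'))
            .order_by('user_id', '-number_of_visits')
        )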
    

    After calculating the results for Table2, you could remove the corresponding rows from Table1 to save some space.


    This may be a simple example, but I think it could be a valid approach for large amounts of data. I hope it gives you some pointers to think about.

  2. @Gleb: you commented on my other answer:

    But may we consider a more specific example: I have a user table which stores all user-related data, and from every user I gather additional data to analyse (transactions). So I was thinking of creating a new table for every user and storing the transaction data there; the name of the table would be stored in table1 (where I keep info about my users) so it can be located when needed. Should I keep transactions of all users in one table?

    So here goes a more concrete example.


    For starters, you have your User model (maybe even Django's default User model) where you store attributes like names and email addresses.

    from django.contrib.auth.models import User
    

    Second, you have transactions (I made up a few basic fields):

    class Transaction(models.Model):
        # default AutoField is used as the pk
        datetime = models.DateTimeField()
        user_from = models.ForeignKey(User, on_delete=models.CASCADE,
                                      related_name='transactions_from')
        user_to = models.ForeignKey(User, on_delete=models.CASCADE,
                                    related_name='transactions_to')
        amount = models.IntegerField()
    

    And third, you could have a monthly summary of transactions to sum up the received and sent amounts of each user:

    class Summary(models.Model):
        # default AutoField is used as the pk
        user = models.ForeignKey(User, on_delete=models.CASCADE)
        year = models.IntegerField()
        month = models.IntegerField()
        amount_sent = models.IntegerField()
        amount_received = models.IntegerField()
    

    Now, the Transaction model will have a lot of records (you say it could have up to a million per user), but the Summary model will only have one record per month per user, which is quite manageable.
    Still, I don’t recommend using a separate model/table of transactions for each user.

    If you don’t wish to update the Summary instance each time a new transaction is made, you could instead run a monthly process that calculates the summaries for each user (or any other data analysis you wish to run).
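
    A rough sketch of what that monthly process could look like with the models above (the helper name and the exact aggregation are just an assumption):

    from django.db.models import Sum

    def build_monthly_summaries(year, month):
        # aggregate sent and received amounts per user for one month
        # and store (or refresh) the corresponding Summary rows
        month_qs = Transaction.objects.filter(datetime__year=year, datetime__month=month)
        for user in User.objects.all():
            sent = month_qs.filter(user_from=user).aggregate(total=Sum('amount'))['total'] or 0
            received = month_qs.filter(user_to=user).aggregate(total=Sum('amount'))['total'] or 0
            Summary.objects.update_or_create(
                user=user, year=year, month=month,
                defaults={'amount_sent': sent, 'amount_received': received},
            )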

    Afterwards you are free to archive the old Transactions (maybe everything older than 3 months or a year).
    Once a month you could run a process to export a given set of Transaction instances to a file (CSV, JSON or an SQL dump) and remove them from the Transaction table to keep it from growing too much. You could keep backup files per month for easier lookup if you ever need the Transaction data again:

    backup_transaction_2018_01.sql
    backup_transaction_2018_02.sql
    backup_transaction_2018_03.sql
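
    A sketch of what such an export-and-delete step could look like with the Django ORM, writing a CSV per month (the file name and field list are just an assumption):

    import csv

    def archive_transactions(year, month):
        # dump one month of transactions to a CSV file, then delete them
        month_qs = Transaction.objects.filter(datetime__year=year, datetime__month=month)
        filename = 'backup_transaction_%04d_%02d.csv' % (year, month)
        with open(filename, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['id', 'datetime', 'user_from_id', 'user_to_id', 'amount'])
            for t in month_qs.iterator():
                writer.writerow([t.pk, t.datetime.isoformat(),
                                 t.user_from_id, t.user_to_id, t.amount])
        month_qs.delete()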
    

    I hope this gives you a few ideas to think about.
