skip to Main Content

Situation:
I have Model have a relation 1-1, sample:

class User(models.Model):
    user_namme = models.CharField(max_length=40)
    type = models.CharField(max_length=255)
    created_at = models.DatetimeField()
    ...

class Book(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)

And I have a around 200,000 records.

  • Languague: Python
  • Framework: Django
  • Database: Postgres

Question:

  • How can I delete 200,000 records above with minimal cost?

Solution I have tried:

user_ids = Users.objects.filter(type='sample', created_date__gte='2022-11-15 08:00', created_date__lt="2022-11-15 08:30").values_list('id',flat=True)[:200000] # Fetch 200,000 user ids. 
for i, _ in enumerate(user_ids[:: 1000]):
    with transaction.atomic():
        batch_start = i * self.batch_size
        batch_end = batch_start + self.batch_size
        _, deleted = Users.objects.filter(id__in=user_ids[batch_start,batch_end]
    

With this solution, my server use arround:

  • 600MB CPU
  • 300MB RAM
  • Take more 15 minutes to finish workload.

I wonder do anyone have a better solution?

3

Answers


  1. Chosen as BEST ANSWER

    Thanks, everyone. I have tried the solution with RawQuery

    user_ids = Users.objects.filter(type='sample', created_date__gte='2022-11-15 08:00', created_date__lt="2022-11-15 08:30").values_list('id',flat=True)[:200000] # Fetch 200,000 user ids. 
    
    for i in range(0, 3):
        user_ids_str = ""
        for user_id in user_ids.iterator(chunk_size=5000):
            user_ids_str += f"{user_id},"
        query = f"""
                DELETE FROM "user" WHERE "user"."id" IN ({user_ids_str});
                DELETE FROM "book" WHERE "user"."id" IN ({user_ids_str});
        """
        with transaction.atomic():
            with connection.cursor() as c:
                c.execute("SET statement_timeout = '10min';")
                c.execute(query)
    

    This one can remove 600000 records and take around 10 minutes. And the server used around:

    • CPU: 50MB
    • RAM: 200MB

  2. By first principles, nothing beats raw(Django query) SQL in terms of speed because it operates closest to the database!

    cursor.execute(“DELETE FROM DB WHERE Column = %s”)

    Or else you can do it by:

    Variable = Model.objects.filter(variable=variable)

    if Variable.exists():
    
        Variable.delete()
    
    Login or Signup to reply.
  3. If you are using straight sql why not do a join on the user table with the date criteria to delete the books and then delete all the users using the created_date criteria? Let the database do all the work!

    Even without writing the join

    DELETE FROM "book" WHERE "user"."id" IN (select id from user where created_date >= '2022-11-15 08:00' and...)
    DELETE FROM "user" WHERE created_date >= '2022-11-15 08:00' and...
    

    would be better than what you have.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search