
I’ll try to give as much information as I can here. Although a full solution would be great, I mainly want guidance on how to tackle the problem and how to view more useful log files, since I’m new to server maintenance. Any advice is welcome.

Here’s what’s happening, in chronological order:

  • I’m running two DigitalOcean droplets (Ubuntu 14.04 VPS)
  • Droplet #1 running django, nginx, gunicorn
  • Droplet #2 running postgres
  • Everything ran fine for a month, then the Postgres droplet’s CPU usage
    suddenly spiked to 100%
  • You can see the htop output when this happens; I’ve attached a screenshot
  • Another screenshot shows the nginx error.log; you can see that the problem
    started at 15:56:14, which I highlighted with a red box
  • Powering off the Postgres droplet (sudo poweroff) and restarting it
    doesn’t fix the problem
  • Restoring the Postgres droplet to my last backup (20 hours ago) solves the problem, but it keeps happening again. This is the 7th time in 2 days

I’ll continue to do research and give more information. Meanwhile any opinions are welcome.

Thank you.

Postgres server htop log when problem occurs
nginx error.log when problem occurs

Update 20 May 2016

  • Enabled slow query logging on Postgres server as recommended by e4c5
  • 6 hours later, the server froze (100% CPU usage) again at 8:07 AM. I’ve attached all related screenshots
  • The browser displays a 502 error if you try to access the site during the freeze
  • sudo service postgresql restart (and gunicorn, nginx on the django server) does NOT fix
    the freeze (I think this is a very interesting point)
  • However, restoring the Postgres server to my previous backup (now 2 days old) does fix the freeze
  • The culprit Postgres log message is could not send data to client: Broken pipe
  • The culprit Nginx log message is a simple django-rest-framework
    API call which returns only 20 items (each with some foreign-key data
    queries)

Update#2 20 May 2016
When the freeze occurs, I tried the following, in chronological order (turn everything off and turn it back on one by one):

  • sudo service postgresql stop –> CPU usage falls to 0-10%
  • sudo service gunicorn stop –> CPU usage stays at 0-10%
  • sudo service nginx stop –> CPU usage stays at 0-10%
  • sudo service postgresql restart –> CPU usage stays at 0-10%
  • sudo service gunicorn restart –> CPU usage stays at 0-10%
  • sudo service nginx restart –> CPU usage rises to 100% and stays
    there

So this is not about server load or long query times, then?

This is very confusing, since if I restore the database to my latest backup (2 days ago), everything comes back online without my even touching the nginx/gunicorn/django server…


Update 8 June 2016
I turned on slow query logging, set to log queries that take longer than 1000 ms.
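
For reference, this is controlled by a single setting in postgresql.conf (the threshold is in milliseconds):

```
# postgresql.conf: log any statement that runs longer than 1000 ms
log_min_duration_statement = 1000
```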

This one query shows up in the log many times:

SELECT
     "products_product"."id",
     "products_product"."seller_id",
     "products_product"."priority",
     "products_product"."media",
     "products_product"."active",
     "products_product"."title",
     "products_product"."slug",
     "products_product"."description",
     "products_product"."price",
     "products_product"."sale_active",
     "products_product"."sale_price",
     "products_product"."timestamp",
     "products_product"."updated",
     "products_product"."draft",
     "products_product"."hitcount",
     "products_product"."finished",
     "products_product"."is_marang_offline",
     "products_product"."is_seller_beta_program",
     COUNT("products_video"."id") AS "num_video"
 FROM "products_product"
 LEFT OUTER JOIN "products_video" ON ( "products_product"."id" = "products_video"."product_id" )
 WHERE ("products_product"."draft" = false AND "products_product"."finished" = true)
 GROUP BY
     "products_product"."id",
     "products_product"."seller_id",
     "products_product"."priority",
     "products_product"."media",
     "products_product"."active",
     "products_product"."title",
     "products_product"."slug",
     "products_product"."description",
     "products_product"."price",
     "products_product"."sale_active",
     "products_product"."sale_price",
     "products_product"."timestamp",
     "products_product"."updated",
     "products_product"."draft",
     "products_product"."hitcount",
     "products_product"."finished",
     "products_product"."is_marang_offline",
     "products_product"."is_seller_beta_program"
 HAVING COUNT("products_video"."id") >= 8
 ORDER BY "products_product"."priority" DESC, "products_product"."hitcount" DESC
 LIMIT 100

I know it’s such an ugly query (generated by django aggregation). In English, this query just means “give me a list of products that have at least 8 videos”.

And here is the EXPLAIN output of this query:

                  QUERY PLAN                                                                                                                                                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=351.90..358.40 rows=100 width=933)
   ->  GroupAggregate  (cost=351.90..364.06 rows=187 width=933)
         Filter: (count(products_video.id) >= 8)
         ->  Sort  (cost=351.90..352.37 rows=187 width=933)
               Sort Key: products_product.priority, products_product.hitcount, products_product.id, products_product.seller_id, products_product.media, products_product.active, products_product.title, products_product.slug, products_product.description, products_product.price, products_product.sale_active, products_product.sale_price, products_product."timestamp", products_product.updated, products_product.draft, products_product.finished, products_product.is_marang_offline, products_product.is_seller_beta_program
               ->  Hash Right Join  (cost=88.79..344.84 rows=187 width=933)
                     Hash Cond: (products_video.product_id = products_product.id)
                     ->  Seq Scan on products_video  (cost=0.00..245.41 rows=2341 width=8)
                     ->  Hash  (cost=88.26..88.26 rows=42 width=929)
                           ->  Seq Scan on products_product  (cost=0.00..88.26 rows=42 width=929)
                                 Filter: ((NOT draft) AND finished)

(11 rows)

— Update 8 June 2016 #2 —
Since there are many suggestions from many people, I’ll apply the fixes one by one and report back periodically.

@e4c5
Here’s the information you need:

You can think of my site as somewhat like Udemy, an online course marketplace. There are Products (courses). Each product contains a number of videos. Users can comment on both the Product page itself and on each Video.

In many cases, I need to query a list of products ordered by the TOTAL number of comments each one got (the sum of comments on the Product AND comments on each Video of that Product).

The django query that corresponds to the EXPLAIN output above:

all_products_exclude_draft = Product.objects.filter(draft=False)
products_that_contain_more_than_8_videos = (
    all_products_exclude_draft
    .annotate(num_video=Count('video'))
    .filter(finished=True, num_video__gte=8)
    .order_by('timestamp')[:30]
)

I just noticed that I (or some other dev on my team) split this across 2 python lines — though since django querysets are lazy, chaining them like this should still issue only a single query when the final queryset is evaluated.

Here are the django models for Product and Video:

from django_model_changes import ChangesMixin

class Product(ChangesMixin, models.Model):
    class Meta:
        ordering = ['-priority', '-hitcount']
    seller = models.ForeignKey(SellerAccount)
    priority = models.PositiveSmallIntegerField(default=1)
    media = models.ImageField(blank=True, 
            null=True, 
            upload_to=download_media_location,
            default=settings.MEDIA_ROOT + '/images/default_icon.png',
            storage=FileSystemStorage(location=settings.MEDIA_ROOT))
    active = models.BooleanField(default=True)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, blank=True, unique=True)
    description = models.TextField()
    product_coin_price = models.IntegerField(default=0)
    sale_active = models.BooleanField(default=False)
    sale_price = models.IntegerField(default=0, null=True, blank=True) #100.00
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)
    finished = models.BooleanField(default=False)
    is_marang_offline = models.BooleanField(default=False)
    is_seller_beta_program = models.BooleanField(default=False)

    def __unicode__(self):
        return self.title

    def get_avg_rating(self):
        rating_avg = self.productrating_set.aggregate(Avg("rating"), Count("rating"))
        return rating_avg

    def get_total_comment_count(self):
        comment_count = self.video_set.aggregate(Count("comment"))
        comment_count['comment__count'] += self.comment_set.count()
        return comment_count

    def get_total_hitcount(self):
        amount = self.hitcount
        for video in self.video_set.all():
            amount += video.hitcount
        return amount

    def get_absolute_url(self):
        view_name = "products:detail_slug"
        return reverse(view_name, kwargs={"slug": self.slug})

    def get_product_share_link(self):
        full_url = "%s%s" %(settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url

    def get_edit_url(self):
        view_name = "sellers:product_edit"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_list_url(self):
        view_name = "sellers:video_list"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_product_delete_url(self):
        view_name = "products:product_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    @property
    def get_price(self):
        if self.sale_price and self.sale_active:
            return self.sale_price
        return self.product_coin_price

    @property
    def video_count(self):
        videoCount = self.video_set.count()
        return videoCount

class Video(models.Model):
    seller = models.ForeignKey(SellerAccount)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, null=True, blank=True)
    story = models.TextField(default=" ")
    chapter_number = models.PositiveSmallIntegerField(default=1)
    active = models.BooleanField(default=True)
    featured = models.BooleanField(default=False)
    product = models.ForeignKey(Product, null=True)
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)
    objects = VideoManager()

    class Meta:
        unique_together = ('slug', 'product')
        ordering = ['chapter_number', 'timestamp']

    def __unicode__(self):
        return self.title

    def get_comment_count(self):
        comment_count = self.comment_set.all_jing_jing().count()
        return comment_count

    def get_create_chapter_url(self):
        return reverse("sellers:video_create", kwargs={"pk": self.id})

    def get_edit_url(self):
        view_name = "sellers:video_update"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_delete_url(self):
        view_name = "products:video_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_absolute_url(self):
        try:
            return reverse("products:video_detail", kwargs={"product_slug": self.product.slug, "pk": self.id})
        except:
            return "/"

    def get_video_share_link(self):
        full_url = "%s%s" %(settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url      

    def get_next_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__gt=self.chapter_number)
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid

    def get_previous_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__lt=self.chapter_number).reverse()
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid

And here are the indexes of the Product and Video tables, which I got from the command:

my_database_name=# \di

Note: this is photoshopped and includes some other models as well.
Indexes of Product and Video models


— Update 8 June 2016 #3 —
@Jerzyk
As you suspected: after inspecting all my code again, I found that I did indeed do ‘slicing in memory’. I tried to shuffle the first 10 results by doing this:

def get_queryset(self):
    all_product_list = Product.objects.filter(draft=False).annotate(
        num_video=Count(
            Case(
                When(
                    video__draft=False,
                    then=1,
                )
            )
        )
    ).order_by('-priority', '-num_video', '-hitcount')
    the_first_10_products = list(all_product_list[:10])
    the_11th_product_onwards = list(all_product_list[10:])
    random.shuffle(the_first_10_products)
    final_list = the_first_10_products + the_11th_product_onwards
    return final_list

Note: in the code above I need to count only the Videos that are not in draft status.

So this will be one of the things I need to fix as well. Thanks. >_<
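
The intended fix — materialize only the rows that must be shuffled and leave the tail alone — can be sketched framework-free, with a plain list standing in for the queryset (`shuffle_head` is a hypothetical helper name, not part of the codebase above):

```python
import random

def shuffle_head(items, head_size=10, seed=None):
    """Shuffle only the first `head_size` items; keep the tail in its original order."""
    rng = random.Random(seed)
    head = list(items[:head_size])   # only the head is materialized and copied
    rng.shuffle(head)
    return head + list(items[head_size:])
```

With a real queryset, only the `items[:head_size]` slice would need to be evaluated eagerly; the tail slice can remain a lazy queryset until iteration.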


— Here are the related screenshots —

Postgres log when the freeze occurs (log_min_duration_statement = 500 ms)
Postgres log 20 May 2016

Postgres log (continued from the above screenshot)
Postgres log 20 May 2016 (page2)

Nginx error.log in the same time period
Nginx log 20 May 2016

DigitalOcean CPU usage graph just before freezing
DigitalOcean graph 20 May 2016 (1)

DigitalOcean CPU usage graph just after freezing
DigitalOcean graph 20 May 2016 (2)

2 Answers


  1. Chosen as BEST ANSWER

    *** Update 7 June 2016: The issue occurred again. CPU hit 100% and stayed there. This answer helps with performance, but unfortunately it is not the solution to this problem.


    Thanks to a recommendation from the DigitalOcean support team, I tried the configuration suggested by this tool:

    http://pgtune.leopard.in.ua/

    It recommended the following values for my droplet with 1 CPU core and 1GB RAM:

    in postgresql.conf:

    max_connections = 200
    shared_buffers = 256MB
    effective_cache_size = 768MB
    work_mem = 1310kB
    maintenance_work_mem = 64MB
    checkpoint_segments = 32
    checkpoint_completion_target = 0.7
    wal_buffers = 7864kB
    default_statistics_target = 100
    

    /etc/sysctl.conf

    kernel.shmmax=536870912
    kernel.shmall=131072
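
    These settings take effect only after reloading sysctl and restarting Postgres; on Ubuntu 14.04 that would be roughly (a config-application note, not part of the original answer):

```
sudo sysctl -p                   # reload /etc/sysctl.conf (kernel.shmmax, kernel.shmall)
sudo service postgresql restart  # pick up the new postgresql.conf values
```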
    

    My postgres server has now been running fine for 3-4 days, so I assume this is the solution. Thanks everyone!


  2. It’s reasonable to conclude that your problems are caused by the slow query in question. By itself, each run of the query does not appear slow enough to cause timeouts; however, several of these queries executed concurrently could lead to the meltdown. There are two things you can do to speed things up.

    1) Cache the result

    The result of a long-running query can be cached.

    from django.core.cache import cache

    def get_8x_videos():
        cache_key = 'products_videos_join'
        result = cache.get(cache_key, None)
        if result is None:
            queryset = (Product.objects.filter(draft=False, finished=True)
                        .annotate(num_video=Count('video'))
                        .filter(num_video__gte=8)
                        .order_by('timestamp')[:30])
            # Evaluate now so the cached value is a list of rows,
            # not a lazy queryset
            result = list(queryset)
            cache.set(cache_key, result, 60 * 15)  # keep for 15 minutes
        return result
    

    This result now comes from memcache (or whatever cache backend you use), which means that if you get two hits for the page in quick succession, the second one has no impact on the database. You can control how long the object stays cached via the timeout argument to cache.set().

    2) Optimize the Query

    The first thing that leaps out of the EXPLAIN is that you are doing a sequential scan on both the products_product and products_video tables. Usually sequential scans are less desirable than index scans. However, an index scan may not be usable for this query because of the COUNT() and HAVING COUNT() clauses, as well as the massive GROUP BY clause.

    update:

    Your query has a LEFT OUTER JOIN. It’s possible that an INNER JOIN or a subquery might be faster. To do that, we need to recognize that grouping the Video table on product_id gives us the set of products that have at least 8 videos.

    from django.db.models.expressions import RawSQL

    inner = RawSQL(
        'SELECT product_id FROM products_video '
        'GROUP BY product_id HAVING COUNT(id) >= 8',
        params=[],
    )
    products = Product.objects.filter(id__in=inner)
    

    The above eliminates the LEFT OUTER JOIN and introduces a subquery. However, it doesn’t give easy access to the actual number of videos for each product, so this query in its present form may not be fully usable.

    3) Improving indexes

    While it may be tempting to create an index on the draft and finished columns, this would be futile, as those columns do not have enough cardinality to be good index candidates. However, it may still be possible to create a partial (conditional) index. Again, a conclusion can only be drawn after seeing your tables.
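
    For illustration, a partial index matching the query’s WHERE clause and ORDER BY might look like the following (a sketch only; the index name is made up, and whether the planner actually uses it depends on the data distribution):

```sql
-- Index only the rows the query can return, pre-sorted for the ORDER BY
CREATE INDEX products_product_live_idx
    ON products_product (priority DESC, hitcount DESC)
    WHERE NOT draft AND finished;
```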
