I’ll try to give as much information as I can here. Although a solution would be great, I mainly want guidance on how to tackle the problem — e.g. how to find more useful log files — since I’m new to server maintenance. Any advice is welcome.
Here’s what happened, in chronological order:
- I’m running 2 DigitalOcean droplets (Ubuntu 14.04 VPS).
- Droplet #1 runs Django, nginx, and gunicorn.
- Droplet #2 runs Postgres.
- Everything ran fine for a month, then the Postgres droplet’s CPU usage suddenly spiked to 100%. You can see the htop log when this happens; I’ve attached a screenshot.
- Another screenshot shows nginx’s error.log; you can see that the problem started at 15:56:14, which I highlighted with a red box.
- Running sudo poweroff on the Postgres droplet and restarting it doesn’t fix the problem.
- Restoring the Postgres droplet to my last backup (20 hours old) solves the problem, but it keeps happening again. This is the 7th time in 2 days.
I’ll continue to research and post more information. Meanwhile, any opinions are welcome.
Thank you.
Update 20 May 2016
- Enabled slow query logging on the Postgres server, as recommended by e4c5.
- 6 hours later, the server froze (100% CPU usage) again, at 8:07 AM. I’ve attached all related screenshots.
- The browser displays a 502 error if I try to access the site during the freeze.
- sudo service postgresql restart (and restarting gunicorn and nginx on the Django server) does NOT fix the freeze (I think this is a very interesting point).
- However, restoring the Postgres server to my previous backup (now 2 days old) does fix the freeze.
- The culprit Postgres log message is: could not send data to client: Broken pipe.
- The culprit nginx log message is a simple django-rest-framework API call that returns only 20 items (each with some foreign-key data queries).
Update #2 20 May 2016
When the freeze occurred, I tried the following, in chronological order (turn everything off, then turn it back on one by one):

- sudo service postgresql stop –> CPU usage falls to 0-10%
- sudo service gunicorn stop –> CPU usage stays at 0-10%
- sudo service nginx stop –> CPU usage stays at 0-10%
- sudo service postgresql restart –> CPU usage stays at 0-10%
- sudo service gunicorn restart –> CPU usage stays at 0-10%
- sudo service nginx restart –> CPU usage rises to 100% and stays there
So this is not about server load or long query times, then?
This is very confusing, since restoring the database to my latest backup (2 days old) brings everything back online, even without touching the nginx/gunicorn/Django server…
Update 8 June 2016
I turned on slow query logging, set to log queries that take longer than 1000 ms.
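For reference, this is controlled by a single setting in postgresql.conf (a configuration reload is enough to apply it):

# postgresql.conf — log any statement that runs for more than 1000 ms
log_min_duration_statement = 1000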
One query shows up in the log many times:
SELECT
"products_product"."id",
"products_product"."seller_id",
"products_product"."priority",
"products_product"."media",
"products_product"."active",
"products_product"."title",
"products_product"."slug",
"products_product"."description",
"products_product"."price",
"products_product"."sale_active",
"products_product"."sale_price",
"products_product"."timestamp",
"products_product"."updated",
"products_product"."draft",
"products_product"."hitcount",
"products_product"."finished",
"products_product"."is_marang_offline",
"products_product"."is_seller_beta_program",
COUNT("products_video"."id") AS "num_video"
FROM "products_product"
LEFT OUTER JOIN "products_video" ON ( "products_product"."id" = "products_video"."product_id" )
WHERE ("products_product"."draft" = false AND "products_product"."finished" = true)
GROUP BY
"products_product"."id",
"products_product"."seller_id",
"products_product"."priority",
"products_product"."media",
"products_product"."active",
"products_product"."title",
"products_product"."slug",
"products_product"."description",
"products_product"."price",
"products_product"."sale_active",
"products_product"."sale_price",
"products_product"."timestamp",
"products_product"."updated",
"products_product"."draft",
"products_product"."hitcount",
"products_product"."finished",
"products_product"."is_marang_offline",
"products_product"."is_seller_beta_program"
HAVING COUNT("products_video"."id") >= 8
ORDER BY "products_product"."priority" DESC, "products_product"."hitcount" DESC
LIMIT 100
I know it’s an ugly query (generated by Django aggregation). In English, it just means “give me a list of products that have at least 8 videos”.
And here is the EXPLAIN output of this query:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=351.90..358.40 rows=100 width=933)
-> GroupAggregate (cost=351.90..364.06 rows=187 width=933)
Filter: (count(products_video.id) >= 8)
-> Sort (cost=351.90..352.37 rows=187 width=933)
Sort Key: products_product.priority, products_product.hitcount, products_product.id, products_product.seller_id, products_product.media, products_product.active, products_product.title, products_product.slug, products_product.description, products_product.price, products_product.sale_active, products_product.sale_price, products_product."timestamp", products_product.updated, products_product.draft, products_product.finished, products_product.is_marang_offline, products_product.is_seller_beta_program
-> Hash Right Join (cost=88.79..344.84 rows=187 width=933)
Hash Cond: (products_video.product_id = products_product.id)
-> Seq Scan on products_video (cost=0.00..245.41 rows=2341 width=8)
-> Hash (cost=88.26..88.26 rows=42 width=929)
-> Seq Scan on products_product (cost=0.00..88.26 rows=42 width=929)
Filter: ((NOT draft) AND finished)
(11 rows)
— Update 8 June 2016 #2 —
Since there are many suggestions from many people, I’ll apply the fixes one by one and report back periodically.
@e4c5
Here’s the information you need:
You can think of my site somewhat like Udemy, an online course marketplace. There are Products (courses), and each Product contains a number of Videos. Users can comment on both the Product page itself and on each Video.
In many cases, I need to query a list of products ordered by the TOTAL number of comments they got (the sum of comments on the Product plus comments on each Video of that Product).
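As a hedged sketch of how that ordering might be expressed with annotations (the reverse relation names 'comment' and 'video__comment' are assumptions based on the models below; distinct=True guards against inflated counts from the double join):

from django.db.models import Count

products_by_total_comments = (
    Product.objects.filter(draft=False)  # Product as defined below
    .annotate(
        total_comments=Count('comment', distinct=True)
        + Count('video__comment', distinct=True)
    )
    .order_by('-total_comments')
)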
The Django query that corresponds to the EXPLAIN output above:
all_products_exclude_draft = Product.objects.all().filter(draft=False)
products_that_contain_more_than_8_videos = all_products_exclude_draft.annotate(
    num_video=Count('video')
).filter(finished=True, num_video__gte=8).order_by('timestamp')[:30]
I just noticed that I (or some other dev on my team) hit the database twice with these two Python lines.
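(As an aside, since Django querysets are lazy, chaining the two steps should still collapse into a single SQL query when the slice is finally evaluated; a combined one-queryset sketch, using the same names as above:)

products_that_contain_more_than_8_videos = (
    Product.objects.filter(draft=False, finished=True)
    .annotate(num_video=Count('video'))
    .filter(num_video__gte=8)
    .order_by('timestamp')[:30]
)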
Here are the Django models for Product and Video:
from django_model_changes import ChangesMixin

class Product(ChangesMixin, models.Model):

    class Meta:
        ordering = ['-priority', '-hitcount']

    seller = models.ForeignKey(SellerAccount)
    priority = models.PositiveSmallIntegerField(default=1)
    media = models.ImageField(blank=True,
                              null=True,
                              upload_to=download_media_location,
                              default=settings.MEDIA_ROOT + '/images/default_icon.png',
                              storage=FileSystemStorage(location=settings.MEDIA_ROOT))
    active = models.BooleanField(default=True)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, blank=True, unique=True)
    description = models.TextField()
    product_coin_price = models.IntegerField(default=0)
    sale_active = models.BooleanField(default=False)
    sale_price = models.IntegerField(default=0, null=True, blank=True)  # 100.00
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)
    finished = models.BooleanField(default=False)
    is_marang_offline = models.BooleanField(default=False)
    is_seller_beta_program = models.BooleanField(default=False)

    def __unicode__(self):
        return self.title

    def get_avg_rating(self):
        rating_avg = self.productrating_set.aggregate(Avg("rating"), Count("rating"))
        return rating_avg

    def get_total_comment_count(self):
        comment_count = self.video_set.aggregate(Count("comment"))
        comment_count['comment__count'] += self.comment_set.count()
        return comment_count

    def get_total_hitcount(self):
        amount = self.hitcount
        for video in self.video_set.all():
            amount += video.hitcount
        return amount

    def get_absolute_url(self):
        view_name = "products:detail_slug"
        return reverse(view_name, kwargs={"slug": self.slug})

    def get_product_share_link(self):
        full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url

    def get_edit_url(self):
        view_name = "sellers:product_edit"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_list_url(self):
        view_name = "sellers:video_list"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_product_delete_url(self):
        view_name = "products:product_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    @property
    def get_price(self):
        if self.sale_price and self.sale_active:
            return self.sale_price
        return self.product_coin_price

    @property
    def video_count(self):
        videoCount = self.video_set.count()
        return videoCount
class Video(models.Model):
    seller = models.ForeignKey(SellerAccount)
    title = models.CharField(max_length=500)
    slug = models.SlugField(max_length=200, null=True, blank=True)
    story = models.TextField(default=" ")
    chapter_number = models.PositiveSmallIntegerField(default=1)
    active = models.BooleanField(default=True)
    featured = models.BooleanField(default=False)
    product = models.ForeignKey(Product, null=True)
    timestamp = models.DateTimeField(auto_now_add=True, auto_now=False, null=True)
    updated = models.DateTimeField(auto_now_add=False, auto_now=True, null=True)
    draft = models.BooleanField(default=True)
    hitcount = models.IntegerField(default=0)

    objects = VideoManager()

    class Meta:
        unique_together = ('slug', 'product')
        ordering = ['chapter_number', 'timestamp']

    def __unicode__(self):
        return self.title

    def get_comment_count(self):
        comment_count = self.comment_set.all_jing_jing().count()
        return comment_count

    def get_create_chapter_url(self):
        return reverse("sellers:video_create", kwargs={"pk": self.id})

    def get_edit_url(self):
        view_name = "sellers:video_update"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_video_delete_url(self):
        view_name = "products:video_delete"
        return reverse(view_name, kwargs={"pk": self.id})

    def get_absolute_url(self):
        try:
            return reverse("products:video_detail", kwargs={"product_slug": self.product.slug, "pk": self.id})
        except:
            return "/"

    def get_video_share_link(self):
        full_url = "%s%s" % (settings.FULL_DOMAIN_NAME, self.get_absolute_url())
        return full_url

    def get_next_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__gt=self.chapter_number)
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid

    def get_previous_url(self):
        current_product = self.product
        videos = current_product.video_set.all().filter(chapter_number__lt=self.chapter_number).reverse()
        next_vid = None
        if len(videos) >= 1:
            try:
                next_vid = videos[0].get_absolute_url()
            except IndexError:
                next_vid = None
        return next_vid
And here are the indexes of the Product and Video tables, which I got from this psql command:
my_database_name=# \di
Note: the screenshot is photoshopped and includes some other models as well.
— Update 8 June 2016 #3 —
@Jerzyk
As you suspected, after inspecting all my code again, I found that I had indeed done some “slicing in memory”: I tried to shuffle the first 10 results by doing this:
import random

from django.db.models import Case, Count, When

def get_queryset(self):
    all_product_list = Product.objects.all().filter(draft=False).annotate(
        num_video=Count(
            Case(
                When(
                    video__draft=False,
                    then=1,
                )
            )
        )
    ).order_by('-priority', '-num_video', '-hitcount')
    the_first_10_products = list(all_product_list[:10])
    # This slice materializes every remaining product in memory:
    the_11th_product_onwards = list(all_product_list[10:])
    random.shuffle(the_first_10_products)
    finalList = the_first_10_products + the_11th_product_onwards
    return finalList
Note: in the code above I need to count only the Videos that are not in draft status.
So this will be one of the things I need to fix as well. Thanks. >_<
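A possible direction for the fix, as a sketch (PAGE_SIZE is a hypothetical constant, not from our codebase; the idea is to keep both slices bounded so the ORM issues LIMIT/OFFSET queries instead of materializing the whole table):

import random

from django.db.models import Case, Count, When

PAGE_SIZE = 20  # hypothetical page size

def get_queryset(self):
    qs = Product.objects.filter(draft=False).annotate(
        num_video=Count(Case(When(video__draft=False, then=1)))
    ).order_by('-priority', '-num_video', '-hitcount')
    first_10 = list(qs[:10])            # SELECT ... LIMIT 10
    random.shuffle(first_10)
    rest = list(qs[10:10 + PAGE_SIZE])  # SELECT ... LIMIT 20 OFFSET 10
    return first_10 + rest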
— Here are the related screenshots —
Postgres log when the freeze occurs (log_min_duration_statement = 500 ms)
Postgres log (continued from the above screenshot)
Nginx error.log in the same time period
2 Answers
*** Update 7 June 2016: The issue occurred again. CPU hit 100% and stayed there. This answer helps with performance, but unfortunately it is not the solution to this problem.
Thanks to a recommendation from the DigitalOcean support team, I tried the configuration suggested by this tool:
http://pgtune.leopard.in.ua/
It recommended the following values for my droplet with 1 CPU core and 1 GB of RAM, in postgresql.conf and /etc/sysctl.conf:
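As a rough illustration of what that profile looks like (every number below is an assumption, not my actual configuration; pgtune’s exact output depends on the inputs you give it):

# postgresql.conf
max_connections = 200
shared_buffers = 256MB
effective_cache_size = 768MB
work_mem = 1310kB
maintenance_work_mem = 64MB
checkpoint_completion_target = 0.7
wal_buffers = 7864kB
default_statistics_target = 100

# /etc/sysctl.conf — shared-memory limits to allow the larger shared_buffers (assumed values)
kernel.shmmax = 536870912
kernel.shmall = 131072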
My Postgres server has now been running fine for 3-4 days, so I assume this is the solution. Thanks, everyone!
We can tentatively conclude that your problems are caused by the slow query in question. By itself, each run of the query does not appear slow enough to cause timeouts. However, it’s possible that several of these queries are executed concurrently, and that could lead to the meltdown. There are two things you can do to speed things up.
1) Cache the result
The result of a long-running query can be cached:
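A minimal sketch using Django’s low-level cache API (the helper name and the products.models import path are assumptions, not from the original code):

from django.core.cache import cache
from django.db.models import Count

from products.models import Product  # assumed import path

def get_products_with_8_videos():
    # Serve from the cache when possible; fall back to the database on a miss.
    products = cache.get('products_with_8_videos')
    if products is None:
        products = list(
            Product.objects.filter(draft=False, finished=True)
            .annotate(num_video=Count('video'))
            .filter(num_video__gte=8)
            .order_by('timestamp')[:30]
        )
        cache.set('products_with_8_videos', products, 60 * 15)  # keep for 15 minutes
    return products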
This query now comes from memcached (or whatever you use for caching), which means that if the page using it is hit twice in quick succession, the second request will have no impact on the database. You can control how long the object stays cached in memory.
2) Optimize the Query
The first thing that leaps out from the EXPLAIN output is that you are doing sequential scans on both the products_product and products_video tables. Usually, sequential scans are less desirable than index scans. However, an index scan may not be usable for this query because of the COUNT() and HAVING COUNT() clauses, as well as the massive GROUP BY clause.

Update:
Your query has a LEFT OUTER JOIN. It’s possible that an INNER JOIN or a subquery might be faster; to see why, note that grouping the products_video table on product_id can give us the set of products that have at least 8 videos.
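A sketch of what that rewrite could look like (my reconstruction, not tested against your schema; the column list is abbreviated to p.*):

SELECT p.*
FROM products_product p
INNER JOIN (
    SELECT product_id
    FROM products_video
    GROUP BY product_id
    HAVING COUNT(id) >= 8
) v ON v.product_id = p.id
WHERE p.draft = false AND p.finished = true
ORDER BY p.priority DESC, p.hitcount DESC
LIMIT 100;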
The above eliminates the LEFT OUTER JOIN and introduces a subquery. However, the subquery doesn’t give easy access to the actual number of videos for each product, so this query in its present form may not be fully usable.
3) Improving indexes
While it may be tempting to create an index on the draft and finished columns, that would be futile, as those columns do not have sufficient cardinality to be good candidates for indexing. However, it may still be possible to create a conditional (partial) index. Again, that conclusion can only be drawn after seeing your tables.
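For example, a partial index restricted to the rows this query actually touches might look like this (hypothetical; whether the planner uses it depends on your data distribution):

CREATE INDEX products_product_live_idx
    ON products_product (priority DESC, hitcount DESC)
    WHERE NOT draft AND finished;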