We’ve recently run into an issue with our Kubernetes deployment running on GKE.
Our NGINX front-end containers, which serve our front-end application, seem to die at random. This has caused quite a commotion, as nginx-ingress would just tell us that there was an HTTP/2 protocol error. After about a week of turmoil, we finally noticed output in the FE container logs that appeared every time we hit the HTTP/2 protocol error (in Chrome).
Once we switched the nginx-ingress down to HTTP/1.1, we get an ERR_CONTENT_LENGTH_MISMATCH 200 error instead, but this is still a misleading error.
Here is a gist of all of our configs, for those who are interested:
gist
As for the nginx version, I tried the following:
stable-alpine
stable
mainline-alpine
1.17.10-alpine
All result in the same set of logs.
Things I have tried:
- changed the nginx version for the FE
- told the nginx-ingress to use HTTP/1
- told the nginx-ingress not to use gzip
- tried everything from this Tencent blog post on high availability in nginx
- turned proxy buffering on and off, both for the nginx-ingress as a whole and for each individual child ingress
- set max-temp-file-size to 0 in the nginx-ingress
- set max-temp-file-size to 10M in the nginx-ingress
- removed the Accept-Encoding, Content-Length, and Content-Type headers from the request to the upstream
- turned on gzip for the FE container
- set worker_processes to auto, then to 1, in the FE container
- set keepalive_timeout to 65, then to 15, in the FE container
- updated the lifecycle preStop on the FE deployment
- set terminationGracePeriodSeconds to 60 (then removed it) on the FE deployment
Before anyone asks: all of the configuration done to the nginx-ingress has thus far been an attempt to solve the HTTP/2 protocol error. Obviously none of it works, because if the upstream server is down, none of it matters.
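For anyone trying to follow along, the ingress-side changes above were applied roughly like this. The ConfigMap name, namespace, host, and service names below are placeholders, and the exact keys can vary between nginx-ingress versions:

```yaml
# Controller-wide settings go in the nginx-ingress ConfigMap (placeholder name/namespace).
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  use-http2: "false"   # fall back to HTTP/1.1 at the edge
  use-gzip: "false"    # stop the ingress from gzipping responses
---
# Per-ingress overrides go on each child ingress as annotations (placeholder names/host).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: fe-dev-1
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-max-temp-file-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: dev-1.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: fe-dev-1
                port:
                  number: 80
```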
What I can deduce is that while NGINX is shutting down (why, I still don’t know), the container itself is not restarting, so that pod is effectively a zombie. How do I either (A) force a restart or (B) force the pod to die and respawn?
Of course, if someone has an answer as to why the nginx container is told to shut down in the first place, that would also be helpful.
Another potentially related issue is that sometimes the replicas of this deployment do not start: the container reports ready, but there are no logs or connections.
Killing the pods manually seems to fix this, but that is not a real solution.
The cluster is running n1-standard-2 nodes with autoscaling enabled, so CPU/memory/storage are not (should not be, never say never) an issue.
Thanks in advance! Leave a comment if I can improve this question in any way.
Edit #1: Included that we are on GKE.
Edit #2: I’ve added readiness and liveness probes, and I’ve updated the nginx FE server with a health check route. This seems to be working as a failsafe: if the internal nginx process stops or never starts, the container will restart. However, if anyone has better alternatives or root causes, I’d love to know! Perhaps I should set specific CPU and memory requests for each pod?
2 Answers
Okay! After much intensive trial and error, this is what ultimately worked for us.
1. We set up liveness & readiness probes in our Front End/upstream nginx containers and in the deployment.
nginx.conf
deployment.yaml
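The key pieces were a lightweight health-check location in the FE nginx.conf (something like `location = /healthz { return 200; }`) plus the probes on the deployment. As a rough sketch, with the path, port, names, and thresholds here being illustrative rather than our exact values:

```yaml
# Sketch of the probe setup on the FE nginx container; names and numbers are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: fe-dev-1
spec:
  replicas: 3
  selector:
    matchLabels:
      app: fe-dev-1
  template:
    metadata:
      labels:
        app: fe-dev-1
    spec:
      containers:
        - name: nginx
          image: nginx:stable-alpine
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /healthz   # assumes this route exists in nginx.conf
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 80
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 1   # strict on purpose; see Final Notes
```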
This ensured that, without a doubt, when our nginx containers failed, for whatever reason, they would be restarted.
2. We set up a smooth shutdown for the nginx containers
deployment.yaml
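Roughly, the lifecycle hook looked like the snippet below; the exact command is the cobbled-together one described in the Final Notes, so treat this as a sketch rather than a drop-in:

```yaml
# Inside spec.template.spec.containers of the FE deployment.
containers:
  - name: nginx
    image: nginx:stable-alpine
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            # give the ingress a moment, ask nginx to quit, then wait for the workers to exit
            - sleep 3; /usr/sbin/nginx -s quit; while pgrep -x nginx; do sleep 1; done
```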
This ensured that when the restarts happened, the nginx containers actually shut down without losing connections (though to be fair this was rarer than not).
3. Minimum CPU and Memory Requests
With the above two changes in place, it felt like we were in a good place. Until we scaled out to 3 replicas per Front End deployment. After just one night, a seemingly random few of the replicas across a couple of the FE deployments (we had set up several for testing, dev-1 through dev-20) had restarted over 800 times! Insanity!
We didn't know what could be causing this, but on a whim we added some minimum CPU and memory requests, and that ended up solving the restart/failure cycle.
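The requests themselves are tiny. Something along these lines, with the exact numbers padded a bit above the measured means discussed in the Final Notes:

```yaml
# Inside spec.template.spec.containers of each FE deployment; values are illustrative.
containers:
  - name: nginx
    image: nginx:stable-alpine
    resources:
      requests:
        cpu: 10m       # comfortably above the ~0.001 CPU mean we saw under load
        memory: 16Mi   # comfortably above the ~4M memory mean we saw under load
```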
Final Notes
As you can see from the above, we really ended up adding a lot of configuration and work to our existing deployment setup. But none of it was Ingress related; it was all in our upstream nginx containers.
The liveness probe is pretty strict, because once the containers began acting up it was game over for that container. We never witnessed the containers fix themselves, so once the liveness probe caught the health check failing, we were fine with forcing a restart.
The preStop command is roughly what the Kubernetes documentation on preStop commands recommends as the default for nginx, as shown here. The sleep 3 came from another Stack Overflow question, while the pgrep came from this Medium article. It's definitely a cobbled-together stop command, but it seems to be working for us.
Finally, the minimum CPU and memory requests we came upon were based on the metrics for all of our FE deployments: 0.001 CPU was the mean usage during higher load, and 4M was the mean memory usage during higher load. We figured it is best to hedge our bets here and request slightly more CPU/memory than we need (though obviously this CPU usage is tiny).
I am not entirely sure why setting the requests is what finally sorted out the restart issue. If I had to guess, the requests allowed Kubernetes to schedule these pods onto better nodes. Perhaps the pods that were constantly failing were crammed up against the limits of one node or another, so any actual usage caused a failure.
I hope this helps someone in the future, or perhaps this was a monster of our own making and everyone already knows to do all of the above.
Good luck to whoever finds this!
What I would suggest is to change the image itself from Alpine to Debian, or to update nginx from 1.17 to a newer version. I remember a lot of people struggling with networking issues in the Alpine images, especially with React and PHP applications. One other thing I suggest, if you are running nginx under supervisord, is to check the stopsignal on that child process, or stopasgroup, etc., because there are a couple of configs for that.
One other thing that could be improved is the graceful shutdown, and you can do that in the deployment.yml.
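For example, something along these lines in the deployment spec; the image tag, grace period, and quit command here are just examples:

```yaml
# Inside spec.template.spec of the front-end deployment.
terminationGracePeriodSeconds: 30   # give nginx time to drain connections
containers:
  - name: nginx
    image: nginx:stable   # Debian-based image instead of the alpine variant
    lifecycle:
      preStop:
        exec:
          command: ["/usr/sbin/nginx", "-s", "quit"]   # ask nginx to shut down gracefully
```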
Also, check the Kubernetes events to see whether you are getting system OOMs.