For many years I have been running two websites without any problems, using several Docker containers on a virtual server that was originally set up with CoreOS. In all that time I never encountered a situation I did not understand.
Until now. Since last week I have been struggling with phenomena that I can neither understand nor get under control.
Prerequisite
For some reason I had to restart the machine, and the automatic process that starts the containers failed. I hadn't changed anything on the machine, so this was unexpected and I had no clue what was going on.
I therefore suspended the automatic process in order to investigate. To begin with, I made sure that the machine at least starts the Docker daemon itself properly and without any errors:
# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-07-14 19:05:13 CEST; 7s ago
     Docs: https://docs.docker.com
 Main PID: 123469 (dockerd)
    Tasks: 8
   Memory: 80.4M
   CGroup: /system.slice/docker.service
           └─123469 /usr/bin/dockerd -H fd://
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.067795763+02:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068075039+02:00" level=warning msg="Your kernel does not support cgroup blkio weight"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068092922+02:00" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068447780+02:00" level=info msg="Loading containers: start."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.278561566+02:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to>
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.370232284+02:00" level=info msg="Loading containers: done."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390172816+02:00" level=info msg="Docker daemon" commit=4c52b90 graphdriver(s)=overlay2 version=18.09.1
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390223822+02:00" level=info msg="Daemon has completed initialization"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.455692794+02:00" level=info msg="API listen on /var/run/docker.sock"
Jul 14 19:05:13 IONOS-1 systemd[1]: Started Docker Application Container Engine.
My investigation regarding the warnings about blkio showed that these can be neglected.
My original stack
When I trigger my start process with docker stack deploy -c /root/external.net/wp/docker-compose.yml wp, I notice that all containers appear in the overview with the status created, but none of them ever changes to the status running as would be normal:
Creating network wp_back_ntw
Creating service wp_adm
Creating service wp_joe
Creating service wp_wp
Creating service wp_master
Instead, all containers are restarted after a while, and this repeats indefinitely, piling up created containers without any of them ever reaching running. I made sure that none of the services in my .yml file has a restart instruction, so I am sure I am not restarting them myself.
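For reference, the task states and any startup errors can also be queried from Swarm itself. A sketch using the standard commands, assuming the stack name wp and the service wp_wp from above:
docker stack ps wp --no-trunc
docker service ps wp_wp --no-trunc
docker service logs wp_wp
The ERROR column of the first two commands usually names the reason why a task never reaches the running state.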
I first tried to remove the garbage with my universal clear command:
docker ps -a | grep 'ted' | awk '{print $1}' | xargs docker rm -v; docker ps -a | grep 'ead' | awk '{print $1}' | xargs docker rm -v
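The same cleanup can be written more robustly with Docker's own status filters instead of grepping for the substrings 'ted' and 'ead'; a sketch that should be equivalent in effect:
docker ps -aq --filter status=created --filter status=exited --filter status=dead | xargs -r docker rm -v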
But this does not stop the replay process; it just starts again. So, without further ado, I resorted to a series of commands that I had copied from somewhere else (without understanding the implications) but had used successfully several times before:
systemctl stop docker
rm -rf /var/lib/docker
systemctl start docker
This procedure went fine, as expected.
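For the record, deleting /var/lib/docker wipes all images, containers, volumes and networks. A less drastic cleanup I could have tried first is Docker's built-in prune (a sketch; still destructive, so use with care):
docker system prune
docker system prune -a --volumes
The first command removes stopped containers, unused networks, dangling images and the build cache; the second additionally removes all unused images and, with --volumes, unused volumes.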
Stepping back
To isolate the problems and gain more understanding, I switched to using the run command and the usual test routines, which should definitely work as expected:
docker run -d --name loop-demo alpine sh -c "while true; do sleep 1; done"
docker run -d --name sleep-demo alpine sleep infinity
docker run -d --name tail-demo alpine tail -f /dev/null
docker run -dt --name tty-demo alpine
I expected these containers to run indefinitely, but they were reliably terminated by docker after 5 minutes:
# docker ps -a
CONTAINER ID   IMAGE     COMMAND                  CREATED              STATUS                        PORTS     NAMES
5666ba05baf1   alpine    "sh -c 'while true; …"   About a minute ago   Exited (137) 34 seconds ago             loop-demo
cef06c31d246   alpine    "sleep infinity"         2 minutes ago        Exited (137) 34 seconds ago             sleep-demo
cd813e81f3c6   alpine    "/bin/sh"                3 minutes ago        Exited (137) 34 seconds ago             tty-demo
8aa49ec219cd   alpine    "tail -f /dev/null"      5 minutes ago        Exited (137) 33 seconds ago             tail-demo
This is not expected. Furthermore, the log is incomprehensible to me, for example:
# docker logs cd813e81f3c6
/ #
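For anyone who wants to dig deeper into what is killing the containers, the kill and die events can be watched live via the Docker event stream; a sketch using the documented filters:
docker events --filter type=container --filter event=kill --filter event=die
Each die event carries the container's exit code, so the timing and exit status of the kills are visible in real time.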
I tried the same thing with a container from my stack, with the same result: it only runs for 5 minutes. Well, at least it runs that long and does not stay forever in the status created, in contrast to the deployment as a stack. This is all very unfamiliar and incomprehensible to me. I have finally run out of ideas and humbly ask for help.
Any ideas or insights?
Now my questions are:
- Has anybody ever experienced this kind of behavior?
- What am I doing wrong?
- What can I learn from this setup?
- How can I investigate this scenario further?
- How can I make the whole thing run as reliably as before?
- And lastly, how could this happen in the first place?
Thank you for reading and your effort.
2 Answers
I put a lot of effort into solving the problem and finally managed it: it was simply and solely my fault, and a very stupid one at that.
I should have taken the regular execution every 5 minutes as a hint to look at my cronjob right away. How come? On this machine I had increasing problems with disk space running short, and the machine became more and more cluttered. I diagnosed Docker to be the cause, so I took several measures to reclaim disk space.
As a result of these measures, I was deleting the containers myself every 5 minutes. Bingo! Congratulations!
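For illustration, a cleanup job of that kind could look like the following crontab entry (a hypothetical reconstruction, not the exact line from my machine):
*/5 * * * * docker ps -aq | xargs -r docker rm -f
A safer periodic cleanup would only touch stopped containers, for example docker container prune -f, instead of force-removing everything.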
However, by reinstalling I have gained a lot of free space, so this problem should not occur again in the future.
Many thanks to everyone who has tried to solve my problem. I take this story as a lesson to look in the right place first.
If Docker itself is healthy (see @kade-youn's comment for how to check that), you can investigate why Docker kills an otherwise healthy container with docker inspect <container_id>:
Find the ID of a killed container, e.g. with
docker container ls --all
Inspecting the stopped container tells you why Docker stopped it – usually a failing health check (if one is set) or an out-of-memory kill.
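A sketch of what that inspection can look like, using the documented --format flag and the .State fields (exit code 137 means the process received SIGKILL, and OOMKilled shows whether the kernel's OOM killer was responsible):
docker inspect --format '{{json .State}}' <container_id>
docker inspect --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}}' <container_id>
If a HEALTHCHECK is defined, {{json .State.Health}} additionally shows the failed health probes.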