
For many years I have been running two websites without any problems, using several Docker containers on a virtual server that was originally set up with CoreOS. In all that time I never encountered a situation I did not understand.

Until now. Since last week I have been struggling with behavior that I can neither understand nor get under control.

Background

For some reason I had to restart the machine, and afterwards the automatic process that starts the containers failed. I had not changed anything on the machine, so this was unexpected and I had no clue why.

I therefore suspended the automatic process so that I could investigate. To begin with, I made sure that the machine at least starts the Docker daemon itself properly and without any errors:

# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-07-14 19:05:13 CEST; 7s ago
     Docs: https://docs.docker.com
 Main PID: 123469 (dockerd)
    Tasks: 8
   Memory: 80.4M
   CGroup: /system.slice/docker.service
           └─123469 /usr/bin/dockerd -H fd://

Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.067795763+02:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068075039+02:00" level=warning msg="Your kernel does not support cgroup blkio weight"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068092922+02:00" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068447780+02:00" level=info msg="Loading containers: start."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.278561566+02:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to>
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.370232284+02:00" level=info msg="Loading containers: done."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390172816+02:00" level=info msg="Docker daemon" commit=4c52b90 graphdriver(s)=overlay2 version=18.09.1
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390223822+02:00" level=info msg="Daemon has completed initialization"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.455692794+02:00" level=info msg="API listen on /var/run/docker.sock"
Jul 14 19:05:13 IONOS-1 systemd[1]: Started Docker Application Container Engine.

My research into the blkio warnings showed that they can safely be ignored.

My original stack

When I trigger my start process with docker stack deploy -c /root/external.net/wp/docker-compose.yml wp, all containers appear in the overview with the status created, but none of them ever changes to the status running as it normally would:

Creating network wp_back_ntw
Creating service wp_adm
Creating service wp_joe
Creating service wp_wp
Creating service wp_master

Instead, all containers are restarted after a while, and this repeats indefinitely, piling up created containers without any of them ever running. I made sure that none of the services in my .yml file has a restart instruction, so I was confident the restarts were not coming from my own configuration.
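One detail that is easy to overlook here: even without a restart key in the compose file, swarm services created by docker stack deploy default to restarting failed tasks (restart condition "any"). The policies actually in effect can be queried directly; the container ID below is a placeholder, and wp_wp is one of the service names from the deploy output above:

```shell
# Restart policy Docker recorded for a plain container ("no" means none):
docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' <container_id>

# For a swarm service, the policy lives in the service spec; if nothing was
# specified, swarm falls back to its default condition "any":
docker service inspect --format '{{.Spec.TaskTemplate.RestartPolicy.Condition}}' wp_wp
```

So for a stack, "no restart instruction in the yml" does not by itself rule out swarm re-creating tasks.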

I first tried to remove the garbage with my universal clear command:

docker ps -a | grep 'ted' | awk '{print $1}' | xargs docker rm -v; docker ps -a | grep 'ead' | awk '{print $1}' | xargs docker rm -v

But this did not stop the restart loop; it just started again. So without further ado, I resorted to a series of commands I had copied from somewhere else (without understanding the implications) but had used successfully several times before:

systemctl stop docker
rm -rf /var/lib/docker
systemctl start docker

This procedure went fine, as expected.
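For the record, wiping /var/lib/docker also destroys all images, volumes and networks. A less drastic cleanup, and a simpler replacement for the grep/awk pipeline above, is Docker's status filters and built-in prune commands (a sketch; -r is a GNU xargs extension):

```shell
# Remove every container in status "created" or "exited" without grepping;
# multiple --filter values for the same key are ORed together:
docker ps -aq --filter status=created --filter status=exited | xargs -r docker rm -v

# Or let Docker do it: remove all stopped containers...
docker container prune -f
# ...and optionally unused images, networks and (with --volumes) volumes too:
docker system prune -af --volumes
```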

Stepping back

To isolate the problem and gain more understanding, I switched to the plain run command and the usual test routines, which should definitely work as expected:

docker run -d --name loop-demo alpine sh -c "while true; do sleep 1; done"
docker run -d --name sleep-demo alpine sleep infinity
docker run -d --name tail-demo alpine tail -f /dev/null
docker run -dt --name tty-demo alpine

I expected these containers to run indefinitely, but they were reliably terminated by Docker after 5 minutes:

# docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS                        PORTS               NAMES
5666ba05baf1        alpine              "sh -c 'while true; …"   About a minute ago   Exited (137) 34 seconds ago                       loop-demo
cef06c31d246        alpine              "sleep infinity"         2 minutes ago        Exited (137) 34 seconds ago                       sleep-demo
cd813e81f3c6        alpine              "/bin/sh"                3 minutes ago        Exited (137) 34 seconds ago                       tty-demo
8aa49ec219cd        alpine              "tail -f /dev/null"      5 minutes ago        Exited (137) 33 seconds ago                       tail-demo

This is not expected. Furthermore, the log output is incomprehensible to me; for example:

# docker logs cd813e81f3c6
/ #

I tried the same thing with a container from my stack, with the same result: it only runs for 5 minutes. At least it runs at all and does not stay in created forever, in contrast to the deployment as a stack. All of this is very unfamiliar and incomprehensible to me. I have finally run out of ideas and humbly ask for help.
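Given the strict 5-minute cadence, a generic checklist (not specific to this machine) is to look for scheduled jobs that touch Docker — cron entries and systemd timers are the usual suspects:

```shell
# Any cron entry that mentions docker?
crontab -l 2>/dev/null | grep -i docker
grep -ri docker /etc/cron* 2>/dev/null

# systemd timers can fire on the same cadence:
systemctl list-timers --all
```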

Any ideas or insights?

Now my questions are:

  • Did anybody ever experience this kind of behavior?
  • What am I doing wrong?
  • What can I learn from this setup?
  • How can I investigate this scenario further?
  • How can I make the whole thing run as reliably as before?
  • And lastly, how could this happen in the first place?

Thank you for reading and your effort.

2 Answers


  1. Chosen as BEST ANSWER

    I put a lot of effort into solving the problem and finally managed it: it was solely my own fault, and a very stupid one at that.

    I should have taken the regular 5-minute interval as a hint to look at my cron jobs right away. How did this come about?

    On this machine I had growing problems with disk space shortages, and it became increasingly cluttered. I diagnosed Docker as the cause, so I took several measures to reclaim disk space.

    As a result of these measures, I was deleting the containers myself every 5 minutes via a cron job. Bingo! Congratulations!

    However, the reinstall freed up a lot of space, so this problem should not occur again in the future.

    Many thanks to everyone who tried to solve my problem. I take this story as a lesson to look in the right place first.


  2. If Docker itself is healthy (see @kade-youn’s comment for how to check that), then to investigate why Docker would kill an otherwise healthy container, use docker inspect <container_id>:

    First find the ID of a killed container, e.g. with docker container ls --all.

    Inspecting a stopped container can tell you why Docker stopped it – usually a failing health check (if one is set) or an out-of-memory kill:

    e.g.

    ❯ docker container ls --all
    CONTAINER ID   IMAGE                             COMMAND                  CREATED       STATUS                     PORTS      NAMES
    28aa07338440   gcr.io/cadvisor/cadvisor:latest   "/usr/bin/cadvisor -…"   5 days ago    Exited (255) 2 days ago    8080/tcp   prometheus_cadvisor.vks4pi2inixb3kpm0ivc3gynt.n9uvbj1ujxhfv4v13cbtsp0ff
    ❯ docker container inspect 28aa --format '{{json .State}}' | jq
    {
      "Status": "exited",
      "Running": false,
      "Paused": false,
      "Restarting": false,
      "OOMKilled": false,
      "Dead": false,
      "Pid": 0,
      "ExitCode": 255,
      "Error": "",
      "StartedAt": "2024-07-10T06:22:52.676158847Z",
      "FinishedAt": "2024-07-12T10:21:08.633161044Z",
      "Health": {
        "Status": "healthy",
        "FailingStreak": 0,
        "Log": [
          {
            "Start": "2024-07-12T10:14:02.255315062Z",
            "End": "2024-07-12T10:14:02.293230789Z",
            "ExitCode": 0,
            "Output": ""
          },
      ...
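    To scan all stopped containers in one go instead of inspecting them one by one, the same format-string approach can be driven by a status filter (a sketch, assuming a POSIX shell with GNU xargs):

    ```shell
    # ID, exit code and OOM flag for every exited container:
    docker ps -aq --filter status=exited \
      | xargs -r docker inspect --format '{{.Id}} {{.State.ExitCode}} {{.State.OOMKilled}}'
    ```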
    