
For many years I have been running two websites without any problems, using several Docker containers on a virtual server that was originally set up with CoreOS. In all that time I never encountered a situation I did not understand.

Until now. Since last week I have been struggling with behavior that I can neither understand nor get under control.

Background

For some reason I had to restart the machine, and afterwards the automatic process that starts the containers failed. I had not changed anything on the machine, so this was unexpected and I had no clue why.

I therefore suspended the automatic process so that I could investigate. To begin with, I made sure that the machine at least starts the Docker daemon itself properly and without any errors:

# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-07-14 19:05:13 CEST; 7s ago
     Docs: https://docs.docker.com
 Main PID: 123469 (dockerd)
    Tasks: 8
   Memory: 80.4M
   CGroup: /system.slice/docker.service
           └─123469 /usr/bin/dockerd -H fd://

Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.067795763+02:00" level=info msg="Graph migration to content-addressability took 0.00 seconds"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068075039+02:00" level=warning msg="Your kernel does not support cgroup blkio weight"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068092922+02:00" level=warning msg="Your kernel does not support cgroup blkio weight_device"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.068447780+02:00" level=info msg="Loading containers: start."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.278561566+02:00" level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to>
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.370232284+02:00" level=info msg="Loading containers: done."
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390172816+02:00" level=info msg="Docker daemon" commit=4c52b90 graphdriver(s)=overlay2 version=18.09.1
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.390223822+02:00" level=info msg="Daemon has completed initialization"
Jul 14 19:05:13 IONOS-1 dockerd[123469]: time="2024-07-14T19:05:13.455692794+02:00" level=info msg="API listen on /var/run/docker.sock"
Jul 14 19:05:13 IONOS-1 systemd[1]: Started Docker Application Container Engine.

My research into the blkio warnings showed that they can safely be ignored.

My original stack

When I trigger my start process with docker stack deploy -c /root/external.net/wp/docker-compose.yml wp, all containers appear in the overview with the status created, but none of them ever changes to the status running as it normally would:

Creating network wp_back_ntw
Creating service wp_adm
Creating service wp_joe
Creating service wp_wp
Creating service wp_master

Instead, all containers are restarted after a while, and this repeats indefinitely, piling up created containers without any of them ever running. I made sure that none of the services in my .yml file has a restart instruction, so I was confident the restarts were not coming from my own configuration.
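One detail that is easy to overlook here: even without a restart key in the compose file, swarm services created by docker stack deploy default to restarting failed tasks (restart condition "any"). The policies actually in effect can be queried directly; the container ID below is a placeholder, and wp_wp is one of the service names from the deploy output above:

```shell
# Restart policy Docker recorded for a plain container ("no" means none):
docker inspect --format '{{.HostConfig.RestartPolicy.Name}}' <container_id>

# For a swarm service, the policy lives in the service spec; if nothing was
# specified, swarm falls back to its default condition "any":
docker service inspect --format '{{.Spec.TaskTemplate.RestartPolicy.Condition}}' wp_wp
```

So for a stack, "no restart instruction in the yml" does not by itself rule out swarm re-creating tasks.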

I first tried to remove the garbage with my universal clear command:

docker ps -a | grep 'ted' | awk '{print $1}' | xargs docker rm -v; docker ps -a | grep 'ead' | awk '{print $1}' | xargs docker rm -v

But this did not stop the restart loop; it just started again. So without further ado, I resorted to a series of commands I had copied from somewhere else (without understanding the implications) but had used successfully several times before:

systemctl stop docker
rm -rf /var/lib/docker
systemctl start docker

This procedure went fine, as expected.
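For the record, wiping /var/lib/docker also destroys all images, volumes and networks. A less drastic cleanup, and a simpler replacement for the grep/awk pipeline above, is Docker's status filters and built-in prune commands (a sketch; -r is a GNU xargs extension):

```shell
# Remove every container in status "created" or "exited" without grepping;
# multiple --filter values for the same key are ORed together:
docker ps -aq --filter status=created --filter status=exited | xargs -r docker rm -v

# Or let Docker do it: remove all stopped containers...
docker container prune -f
# ...and optionally unused images, networks and (with --volumes) volumes too:
docker system prune -af --volumes
```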

Stepping back

To isolate the problem and gain more understanding, I switched to the plain run command and the usual test routines, which should definitely work as expected:

docker run -d --name loop-demo alpine sh -c "while true; do sleep 1; done"
docker run -d --name sleep-demo alpine sleep infinity
docker run -d --name tail-demo alpine tail -f /dev/null
docker run -dt --name tty-demo alpine

I expected these containers to run indefinitely, but they were reliably terminated by Docker after 5 minutes:

# docker ps -a
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS                        PORTS               NAMES
5666ba05baf1        alpine              "sh -c 'while true; …"   About a minute ago   Exited (137) 34 seconds ago                       loop-demo
cef06c31d246        alpine              "sleep infinity"         2 minutes ago        Exited (137) 34 seconds ago                       sleep-demo
cd813e81f3c6        alpine              "/bin/sh"                3 minutes ago        Exited (137) 34 seconds ago                       tty-demo
8aa49ec219cd        alpine              "tail -f /dev/null"      5 minutes ago        Exited (137) 33 seconds ago                       tail-demo

This is not expected. Furthermore, the log output is incomprehensible to me; for example:

# docker logs cd813e81f3c6
/ #

I tried the same thing with a container from my stack, with the same result: it only runs for 5 minutes. At least it runs at all and does not stay in created forever, in contrast to the deployment as a stack. All of this is very unfamiliar and incomprehensible to me. I have finally run out of ideas and humbly ask for help.
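Given the strict 5-minute cadence, a generic checklist (not specific to this machine) is to look for scheduled jobs that touch Docker — cron entries and systemd timers are the usual suspects:

```shell
# Any cron entry that mentions docker?
crontab -l 2>/dev/null | grep -i docker
grep -ri docker /etc/cron* 2>/dev/null

# systemd timers can fire on the same cadence:
systemctl list-timers --all
```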

Any ideas or insights?

Now my questions are:

  • Did anybody ever experience this kind of behavior?
  • What am I doing wrong?
  • What can I learn from this setup?
  • How can I investigate this scenario further?
  • How can I make the whole thing run as reliably as before?
  • And lastly, how could this happen in the first place?

Thank you for reading and your effort.

2 Answers


  1. Chosen as BEST ANSWER

    I put a lot of effort into solving the problem and finally managed it: it was solely my own fault, and a very stupid one at that.

    I should have taken the regular 5-minute interval as a hint to look at my cron jobs right away. How did this come about?

    On this machine I had growing problems with disk space shortages, and it became increasingly cluttered. I diagnosed Docker as the cause, so I took several measures to reclaim disk space.

    As a result of these measures, I was deleting the containers myself every 5 minutes via a cron job. Bingo! Congratulations!

    However, the reinstall freed up a lot of space, so this problem should not occur again in the future.

    Many thanks to everyone who tried to solve my problem. I take this story as a lesson to look in the right place first.


  2. If Docker itself is healthy (see @kade-youn’s comment for how to check that), then to investigate why Docker would kill an otherwise healthy container, use docker inspect <container_id>:

    First find the ID of a killed container, e.g. with docker container ls --all.

    Inspecting a stopped container can tell you why Docker stopped it – usually a failing health check (if one is set) or an out-of-memory kill:

    e.g.

    ❯ docker container ls --all
    CONTAINER ID   IMAGE                             COMMAND                  CREATED       STATUS                     PORTS      NAMES
    28aa07338440   gcr.io/cadvisor/cadvisor:latest   "/usr/bin/cadvisor -…"   5 days ago    Exited (255) 2 days ago    8080/tcp   prometheus_cadvisor.vks4pi2inixb3kpm0ivc3gynt.n9uvbj1ujxhfv4v13cbtsp0ff
    ❯ docker container inspect 28aa --format '{{json .State}}' | jq
    {
      "Status": "exited",
      "Running": false,
      "Paused": false,
      "Restarting": false,
      "OOMKilled": false,
      "Dead": false,
      "Pid": 0,
      "ExitCode": 255,
      "Error": "",
      "StartedAt": "2024-07-10T06:22:52.676158847Z",
      "FinishedAt": "2024-07-12T10:21:08.633161044Z",
      "Health": {
        "Status": "healthy",
        "FailingStreak": 0,
        "Log": [
          {
            "Start": "2024-07-12T10:14:02.255315062Z",
            "End": "2024-07-12T10:14:02.293230789Z",
            "ExitCode": 0,
            "Output": ""
          },
      ...
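    To scan all stopped containers in one go instead of inspecting them one by one, the same format-string approach can be driven by a status filter (a sketch, assuming a POSIX shell with GNU xargs):

    ```shell
    # ID, exit code and OOM flag for every exited container:
    docker ps -aq --filter status=exited \
      | xargs -r docker inspect --format '{{.Id}} {{.State.ExitCode}} {{.State.OOMKilled}}'
    ```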
    