
Very recently I ran an Online Migration through YaST to upgrade SUSE Linux Enterprise Server (SLES) from 15.1 to 15.2, and ended up with the following versions afterwards:

SLES 15.2
Apache 2.4.43
MariaDB 10.4.17
PHP 7.4.6
Varnish 6.2.1

My main Linux architecture is now as follows:

[Image: architecture diagram]

The preliminary tests showed no conflicts or issues prior to the upgrade and it rebooted and came up just fine when it all completed. Upon checking everything afterwards, I noticed that the varnish.service (varnishd) had failed to start. I’ve never had an issue with Varnish not starting, whether it was SUSE Linux, CentOS, Ubuntu, etc. I thought at first my custom vcl file was causing issues so I went with the default configuration file that it comes with (/etc/varnish/vcl.conf) just to start fresh with the basics but to no avail. The exact same issue happened.

Then I decided to take a shot and compile Varnish from source. Through YaST, I removed the varnish package and all of its configuration and service files, and then I downloaded the latest tar archive (varnish-6.6.0.tgz) directly from https://varnish-cache.org/. After building and installing Varnish this way, the same issue occurs when I try to start it.

With either build, compiled from source (v6.6.0) or installed from the package (v6.2.1), I get exactly the same errors:

Varnish Screenshot 1

Varnish Screenshot 2

The output first reports "Child not responding to CLI, killed it.", then "CLI communication error (hdr)", and finally "Child died signal=6".

What’s most puzzling is that Varnish fails in exactly the same way regardless of how it was installed. I suppose this indicates that Varnish itself isn’t the issue, but rather something in the server configuration? I’ve been through every forum on Varnish that I could find and have found nothing this specific. I have even tried different CLI parameters (timeout settings, pool delays, etc.), but it still won’t start. Again, this is with only the most basic default configuration file loaded and nothing else.

# Marker to tell the VCL compiler that this VCL has been adapted to the
# new 4.0 format.
vcl 4.0;

# Default backend definition. Set this to point to your content server.
backend default {
    .host = "127.0.0.1";
    .port = "80";
}

Now here’s the ultimate kicker… I took another (Development) server, slicked it bare, and installed SLES 15.2 from scratch and everything, including Varnish, works! So something with the in-place upgrade is stopping Varnish somehow. I can’t take the main (Production) SLES 15.2 server and start over with it like that, however, because of so many other things that are currently installed and configured on it.

I’m trying to get Varnish back up and started within the current upgraded environment, but nothing seems to be working. Also, there is nothing in the Varnish logs (/var/log/varnish/varnish.log) to give me any clue either.

I’m at a loss as to what to try or where to go next. I’ve even tried starting Varnish in Debug Mode (-d) and then trying to get a child to start that way and it’s the exact same error.


And ultimately, I can’t check for any panics because Varnish won’t even start in the first place.


So to recap, literally all I did was run the in-place upgrade from SLES 15.1 to 15.2, rebooted when it was all done, and now all other services start fine except for Varnish (which worked perfectly on 15.1).


UPDATE #1: I tried to start varnish with no vcl file and no backend (varnishd -b none) but it errored out. Then I simply substituted "none" with "localhost" and I’m right back to the same error as before.



UPDATE #2: Here is the output of the "strace -f varnishd" command.

StraceOutput.txt

2 Answers


  1. VCL loop

    This is a long shot, but can you please change the .port property in your backend to 8080 instead of 80? Just for testing.

    Because if you start varnishd without an explicit -a, the standard listening port will be 80. But since your VCL file already connects to port 80 on localhost for its backend, you might end up in a loop.

    I’m not saying the assert() that is triggered on your system is caused by this, but it’s worth the attempt.

    In older versions of Varnish, the standard port was 6081, but this has changed in recent versions.

    What I am sure of is that the error is caused by a file descriptor that is not available, possibly one that has already been closed.

    Please give it a shot, and let me know.

    Debug mode

    It is also possible to enable debug mode by adding the -d runtime parameter to your varnishd command.

    Please give it a try to increase the verbosity of the debug output.

    Checking panics

    Another thing you can do is run the following command to see if any panics occurred:

    varnishadm panic.show
    

    Trying out various runtime options

    Apparently the error refers to the fact that it cannot load the VCL file.

    Let’s try running varnishd without a VCL file to see whether or not that’s the problem.

    Just try starting varnishd using the following command:

    varnishd -b none
    

    This command will start Varnish without a VCL file and without a backend. When you then try to access Varnish via HTTP, you should be getting an HTTP 503 error. That’s not perfect, but at least we know that Varnish is capable of not crashing all the time.

    • Once that works, you can remove -b and add your -f parameter that refers to the VCL file
    • If that also works, try playing around with the -s setting
    • And so on, and so forth ..

    Use packages

    Other than that, the only advice I can give you is to install Varnish using the official packages on a supported operating system (Debian, Ubuntu, Fedora, CentOS, RHEL).

  2. When checking the output of the requested strace command, I found this:

    [pid  1129] mkdir("vcl_boot.1621874391.008263", 0755) = 0
    [pid  1129] chown("vcl_boot.1621874391.008263", 465, 463) = 0
    [pid  1129] setresuid(-1, 465, -1)      = 0
    [pid  1129] openat(AT_FDCWD, "vcl_boot.1621874391.008263/vgc.c", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 5
    [pid  1129] fchown(5, 0, 0)             = -1 EPERM (Operation not permitted)
    [pid  1129] geteuid()                   = 465
    [pid  1129] close(5)                    = 0
    [pid  1129] openat(AT_FDCWD, "vcl_boot.1621874391.008263/vgc.so", O_WRONLY|O_CREAT|O_TRUNC, 0640) = 5
    [pid  1129] fchown(5, 0, 0)             = -1 EPERM (Operation not permitted)
    

    Varnishd tries to change the owner of at least two files, but isn’t allowed to do so. I’m not sure about the details, but as a next step you could try to find these files (probably below /var/cache/varnish) and check the current permissions. Maybe they belong to a user which is not the user you’re running varnishd with.
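To follow up on this, it may help to filter the strace log down to failed calls and to check who owns the working directory. The helpers below are a minimal sketch: the function names are made up for illustration, the path /var/cache/varnish is only the guess from the paragraph above, and the numeric IDs 465 and 463 are the ones visible in the trace. Many failed calls in an strace log (ENOENT during library lookups, for example) are normal, so the EPERM lines are the ones worth reading first.

```shell
#!/bin/sh
# Print only the failed syscalls from an strace log read on stdin.
# strace marks failures as "= -1 ERRNAME (description)".
failed_syscalls() {
    grep -E '= -1 E[A-Z]+'
}

# Show owner, group, and octal mode of a path (GNU stat syntax).
show_owner() {
    stat -c '%U:%G %a %n' "$1"
}

# Possible usage (paths and IDs are assumptions, see above):
#   failed_syscalls < StraceOutput.txt | grep EPERM
#   show_owner /var/cache/varnish
#   getent passwd 465    # which user the child switched to
#   getent group 463     # which group the directory was given
```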

    AFAIK the daemon is started as user root and then the process switches to an unprivileged user. This assumption brings us back to my previous question: Are you running AppArmor or SELinux?
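One way to answer that question from a shell is sketched below. It assumes the standard aa-status and getenforce tools are what the system provides; if neither is installed, the function simply says so. The function name is hypothetical, and some systems may require root for aa-status to report accurately.

```shell
#!/bin/sh
# Report whether AppArmor or SELinux appears to be active on this host.
mac_status() {
    if command -v aa-status >/dev/null 2>&1; then
        # "aa-status --enabled" exits 0 only when AppArmor is enabled.
        if aa-status --enabled 2>/dev/null; then
            echo "AppArmor: enabled"
        else
            echo "AppArmor: installed but not enabled"
        fi
    else
        echo "AppArmor: aa-status not found"
    fi

    if command -v getenforce >/dev/null 2>&1; then
        # getenforce prints Enforcing, Permissive, or Disabled.
        echo "SELinux: $(getenforce)"
    else
        echo "SELinux: getenforce not found"
    fi
}

mac_status
```

If AppArmor turns out to be enabled, the kernel log is the usual place to look for denials that mention varnishd.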
