
Can someone explain to me in simple terms what

docker run --rm --privileged multiarch/qemu-user-static --reset -p yes -c yes

does when it is called right before building a container from a Dockerfile with docker build?

I have the notion that it permits running containers built for other architectures on an x86 host, but I am not sure I fully understand the explanations I found on some sites.

Does the presence of the above instruction (docker run) imply that the Dockerfile of the build stage is for another architecture?

2 Answers


  1. I too had this question recently, and I don’t have a complete answer, but here is what I do know, or at least believe:

    Setup & Test

    The magic setup, required once per reboot of the system, is just this:

    # start root's docker (not via any `-rootless` scripts, obviously)
    sudo systemctl start docker
    # set up QEMU static executable formats
    sudo docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
    # test
    docker run --rm -t arm64v8/ubuntu uname -m
    # should expect:
    # >> aarch64
    # optional: shutdown root's docker
    sudo systemctl stop docker
    

    Note that the test example assumes that you are running your own personal "rootless" docker, that is, as yourself, not as root (nor via sudo), and it works just dandy.
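Since the registration has to be redone after every reboot, a build script may want to check for it before calling docker build. The following is only a minimal sketch: `binfmt_ready` is a hypothetical helper name (it is not part of qemu-user-static), and it assumes that entries live under /proc/sys/fs/binfmt_misc and start with a line reading "enabled", as in the listings further down.

```shell
# binfmt_ready NAME [DIR]
# Succeeds if DIR/NAME exists and its first line is "enabled".
# DIR defaults to the usual binfmt_misc mount point (an assumption
# about the host; pass another directory to test against a copy).
binfmt_ready() {
    entry="${2:-/proc/sys/fs/binfmt_misc}/$1"
    [ -r "$entry" ] || return 1
    state=$(head -n 1 "$entry")
    [ "$state" = "enabled" ]
}

# Typical use before a cross-architecture build (kept as a comment so
# the sketch itself has no side effects):
#   binfmt_ready qemu-aarch64 || echo "run the qemu-user-static setup first" >&2
```

The optional directory argument exists only to make the check easy to try against a scratch directory instead of the live kernel interface.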

    Gory Details

    … which are important if you want to understand how/why this works.

    The main sources for this info:

    The fundamental trick to making this work is to register new "magic" strings with the kernel (via binfmt_misc) so that when an (ARM) executable is run inside a docker container, the kernel recognizes the binary format and uses the QEMU interpreter (from the multiarch/* docker image) to execute it. Before we set up the bin formats, the contents look like this:

    root@odysseus # mount | grep binfmt_misc
    systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=35,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=45170)
    binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
    root@odysseus # ls /proc/sys/fs/binfmt_misc/
    jar  llvm-6.0-runtime.binfmt  python2.7  python3.6  python3.7  python3.8  register  sbcl  status
    

    After we start (root’s) dockerd and set up the formats:

    root@odysseus # systemctl start docker
    root@odysseus # docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
    Setting /usr/bin/qemu-alpha-static as binfmt interpreter for alpha
    Setting /usr/bin/qemu-arm-static as binfmt interpreter for arm
    [...]
    root@odysseus # ls /proc/sys/fs/binfmt_misc/
    jar                      python3.8        qemu-armeb       qemu-microblazeel  qemu-mipsn32    qemu-ppc64le  qemu-sh4eb        qemu-xtensaeb
    llvm-6.0-runtime.binfmt  qemu-aarch64     qemu-hexagon     qemu-mips          qemu-mipsn32el  qemu-riscv32  qemu-sparc        register
    python2.7                qemu-aarch64_be  qemu-hppa        qemu-mips64        qemu-or1k       qemu-riscv64  qemu-sparc32plus  sbcl
    python3.6                qemu-alpha       qemu-m68k        qemu-mips64el      qemu-ppc        qemu-s390x    qemu-sparc64      status
    python3.7                qemu-arm         qemu-microblaze  qemu-mipsel        qemu-ppc64      qemu-sh4      qemu-xtensa
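Each qemu-* entry above is created by writing one registration line to the register file in that directory. The kernel's binfmt_misc documentation gives the line format as :name:type:offset:magic:mask:interpreter:flags. As a rough sketch of what such a line looks like for qemu-aarch64: `hex_to_escapes` and `binfmt_register_line` are made-up helpers for illustration (not functions shipped in the image), and the magic and mask values are the ones the kernel reports for qemu-aarch64 further down.

```shell
# hex_to_escapes HEX
# Turns "7f45" into "\x7f\x45", the escaped byte form that the
# register file accepts for magic and mask.
hex_to_escapes() {
    printf '%s' "$1" | sed 's/../\\x&/g'
}

# binfmt_register_line NAME MAGIC_HEX MASK_HEX INTERPRETER FLAGS
# Only prints the line; it does not write anything to the kernel.
# Type "M" means "match on magic bytes"; the empty offset field means 0.
binfmt_register_line() {
    printf ':%s:M::%s:%s:%s:%s\n' \
        "$1" "$(hex_to_escapes "$2")" "$(hex_to_escapes "$3")" "$4" "$5"
}

binfmt_register_line qemu-aarch64 \
    7f454c460201010000000000000000000200b700 \
    ffffffffffffff00fffffffffffffffffeffffff \
    /usr/bin/qemu-aarch64-static F
```

Writing a line like this to /proc/sys/fs/binfmt_misc/register as root is what performs the registration, which is presumably why the container has to run with --privileged.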
    

    Now we can run an ARM version of ubuntu:

    root@odysseus # docker run --rm -t arm64v8/ubuntu uname -m
    WARNING: The requested image's platform (linux/arm64/v8) does not match the detected host platform (linux/amd64) and no specific platform was requested
    aarch64
    

    The warning is to be expected since the host platform is linux/amd64 (x86-64) while the image is linux/arm64, and it can be silenced by specifying the platform to docker:

    root@odysseus # docker run --rm --platform linux/arm64 -t arm64v8/ubuntu uname -m
    aarch64
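The mismatch in that warning comes down to comparing the image platform with the host platform. A small sketch of the comparison, under simplifying assumptions: `docker_platform_for` and `needs_emulation` are hypothetical helper names, only a few common `uname -m` values are mapped, and any platform other than the host's own is treated as needing emulation (which ignores cases such as 32-bit binaries that a 64-bit CPU can run natively).

```shell
# docker_platform_for MACHINE
# Maps a `uname -m` style machine name to the platform string Docker uses.
docker_platform_for() {
    case "$1" in
        x86_64|amd64)  echo "linux/amd64" ;;
        aarch64|arm64) echo "linux/arm64" ;;
        armv7l)        echo "linux/arm/v7" ;;
        *)             return 1 ;;
    esac
}

# needs_emulation TARGET_PLATFORM [HOST_MACHINE]
# Succeeds if TARGET_PLATFORM differs from the host platform.
# HOST_MACHINE defaults to the output of `uname -m`.
# Returns 2 if the host machine name is not in the table above.
needs_emulation() {
    target=$1
    host=$(docker_platform_for "${2:-$(uname -m)}") || return 2
    [ "$target" != "$host" ]
}

# Example (comment only):
#   needs_emulation linux/arm64 && echo "binfmt/QEMU setup required"
```

Note that Docker prints the arm64 image platform as linux/arm64/v8 in the warning but accepts linux/arm64 for --platform, which is the form used here.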
    

    How does this really work?

    At the base of it is just QEMU’s ability to interpose a DBM (dynamic binary modification, also known as dynamic binary translation) interpreter that translates the instruction set of one system to that of the underlying platform.

    The only trick is to tell the underlying system where to find those interpreters. That’s what the qemu-user-static image does when it registers the binary format magic strings / interpreters. So, what’s in those binfmts?

    root@odysseus # cat /proc/sys/fs/binfmt_misc/qemu-aarch64
    enabled
    interpreter /usr/bin/qemu-aarch64-static
    flags: F
    offset 0
    magic 7f454c460201010000000000000000000200b700
    mask ffffffffffffff00fffffffffffffffffeffffff
    

    Huh – that’s interesting, especially because on the host system there is no /usr/bin/qemu-aarch64-static, and it’s not in the target image either, so where does this thing live? It’s in the qemu-user-static image itself, with the appropriate tag of the form: <HOST-ARCH>-<GUEST-ARCH>, as in multiarch/qemu-user-static:x86_64-aarch64.

    # Not on the local system
    odysseus % ls /usr/bin/qemu*
    ls: cannot access '/usr/bin/qemu*': No such file or directory
    
    # Not in the target image
    odysseus % docker run --rm --platform linux/arm64 -t arm64v8/ubuntu bash -c 'ls /usr/bin/qemu*'
    /usr/bin/ls: cannot access '/usr/bin/qemu*': No such file or directory
    
    # where is it?
    odysseus % docker run --rm multiarch/qemu-user-static:x86_64-aarch64 sh -c 'ls /usr/bin/qemu*'
    docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "sh": executable file not found in $PATH: unknown.
    # Hmm, no `sh` in that image - let's try directly...
    odysseus % docker run --rm multiarch/qemu-user-static:x86_64-aarch64 /usr/bin/qemu-aarch64-static --version
    qemu-aarch64 version 7.0.0 (Debian 1:7.0+dfsg-7)
    Copyright (c) 2003-2022 Fabrice Bellard and the QEMU Project developers
    
    # AHA - there it is.
    

    That’s the real magic that I don’t yet quite understand. Somehow docker is, I believe, using that image to spin up the QEMU interpreter, and then feeding it the code from the actual image/container you want to run, as in the uname example from earlier. Some web-searching left me unsatisfied as to how this magic is achieved, but I’m guessing that if I kept following links from here I might find the true source of that sleight of hand.
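One part that can be made concrete is how the kernel decides that a file belongs to qemu-aarch64: it compares the first bytes of the file with the magic from the entry shown earlier, looking only at bits where the mask is 1. The sketch below reimplements that comparison in shell on hex strings, purely as an illustration; `binfmt_matches` is a made-up name, and the two sample headers are hand-written ELF header prefixes rather than bytes read from real files.

```shell
# binfmt_matches HEADER_HEX MAGIC_HEX MASK_HEX
# Succeeds if every byte of the header agrees with the magic in all
# bit positions where the mask has a 1.
binfmt_matches() {
    header=$1 magic=$2 mask=$3
    [ "${#header}" -ge "${#magic}" ] || return 1
    [ "${#magic}" -eq "${#mask}" ] || return 1
    [ $(( ${#magic} % 2 )) -eq 0 ] || return 1
    while [ -n "$magic" ]; do
        # Peel one byte (two hex digits) off the front of each string.
        h=${header%"${header#??}"}; header=${header#??}
        m=${magic%"${magic#??}"};   magic=${magic#??}
        k=${mask%"${mask#??}"};     mask=${mask#??}
        [ $(( (0x$h ^ 0x$m) & 0x$k )) -eq 0 ] || return 1
    done
    return 0
}

# Values copied from the qemu-aarch64 entry shown earlier.
MAGIC=7f454c460201010000000000000000000200b700
MASK=ffffffffffffff00fffffffffffffffffeffffff

# First 20 bytes of two ELF headers, hex-encoded (padding bytes are zero).
pad=0000000000000000
arm64_pie="7f454c4602010100${pad}0300b700"   # e_type 3 (PIE), e_machine 0xb7
x86_64_exe="7f454c4602010100${pad}02003e00"  # e_type 2, e_machine 0x3e

binfmt_matches "$arm64_pie" "$MAGIC" "$MASK" && echo "arm64 binary: handed to qemu-aarch64"
binfmt_matches "$x86_64_exe" "$MAGIC" "$MASK" || echo "x86-64 binary: no match, runs natively"
```

The 00 byte in the mask makes the kernel ignore the OS ABI field, and the fe byte lets both e_type 2 (fixed-address executable) and e_type 3 (position-independent executable) match the same rule.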

  2. To complement @crimson-egret’s answer: the fix-binary flag in binfmt_misc is what makes the statically compiled qemu emulator work across different namespaces/chroots/containers.

    In the doc for binfmt_misc you can find the explanation of the fix-binary flag:

    F – fix binary

    The usual behaviour of binfmt_misc is to spawn the binary lazily when the misc format file is invoked. However, this doesn’t work very well in the face of mount namespaces and changeroots, so the F mode opens the binary as soon as the emulation is installed and uses the opened image to spawn the emulator, meaning it is always available once installed, regardless of how the environment changes.

    This bug report also explains:


    The fix-binary flag of binfmt is meant to specifically deal with this. The interpreter file (e.g. qemu-arm-static) is loaded when its binfmt rule is installed instead of when a file that requires it is encountered. When the kernel then encounters a file which requires that interpreter it simply execs the already open file descriptor instead of opening a new one (IOW: the kernel already has the correct file descriptor open, so possibly divergent roots no longer play into finding the interpreter thus allowing namespaces/containers/chroots of a foreign architecture to be run like native ones).

    If you use the qemu-user-static image without the -p yes option, the fix-binary flag won’t be added, and running the arm64 container won’t work because now the kernel will actually try to open the qemu emulator in the container’s root filesystem:

    $ docker run --rm --privileged multiarch/qemu-user-static --reset
    [...]
    
    $ cat /proc/sys/fs/binfmt_misc/qemu-aarch64
    enabled
    interpreter /usr/bin/qemu-aarch64-static
    flags:
    offset 0
    magic 7f454c460201010000000000000000000200b700
    mask ffffffffffffff00fffffffffffffffffeffffff
    
    $ docker run --rm -t arm64v8/ubuntu uname -m
    WARNING: The requested image's platform (linux/arm64/v8) does not match the detected host platform (linux/amd64) and no specific platform was requested
    exec /usr/bin/uname: no such file or directory
    failed to resize tty, using default size
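Given the two outputs above, a quick way to tell the working setup from the broken one is to look at the flags line of the entry. A minimal sketch, assuming the entry format shown in this thread; `binfmt_flags` and `has_fix_binary` are hypothetical helper names, and both read the entry text from standard input so they can be tried on saved output as well as on the live file.

```shell
# binfmt_flags
# Reads a binfmt_misc entry on stdin and prints the value of its
# "flags:" line (which may be empty).
binfmt_flags() {
    sed -n 's/^flags:[[:space:]]*//p'
}

# has_fix_binary
# Succeeds if the entry on stdin carries the F (fix binary) flag.
has_fix_binary() {
    case "$(binfmt_flags)" in
        *F*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Typical use on a live system (comment only, so the sketch has no
# side effects):
#   has_fix_binary < /proc/sys/fs/binfmt_misc/qemu-aarch64 \
#       || echo "F flag missing: re-run the setup with -p yes" >&2
```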
    