Docker version 19.03.12, build 48a66213fe
So in a Dockerfile, if I have the following lines:

    RUN yum install aaa \
        bbb \
        ccc && \
        <some cmd> && \
        <etc> && \
        <some cleanup>
Is that a best practice? Should I keep the yum part separate from where I call other <commands/scripts>?
If I want a cleaner (vs. traceable) Dockerfile, what if I put those lines in a .sh script and just call that script (i.e. a COPY followed by a RUN statement)? Will the build step run each time, even though nothing changes inside the .sh script? Looking for some gotchas here.
I’m thinking: whatever packages are stable, install them with a separate RUN <those packages>, i.e. in one layer, and keep the lines which depend on things that change frequently (e.g. user-defined docker build-time CLI args) in a separate RUN layer, so I can use the layer cache effectively (as sketched below).
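For illustration, a rough sketch of that split (the package names, the ARG, and install-app.sh are made-up placeholders; the script is assumed to have been copied in earlier):

```dockerfile
# Stable OS packages: this layer rarely changes, so the cache usually hits.
RUN yum install -y gcc make curl && \
    yum clean all

# Frequently changing part, driven by a build-time argument: a change to
# APP_VERSION only invalidates the layers from here on.
ARG APP_VERSION=latest
RUN ./install-app.sh "${APP_VERSION}"
```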
Wondering if you think keeping a cleaner Dockerfile (calling RUN some.sh) would be less efficient than a traceable Dockerfile (where everything that makes the image is listed in the Dockerfile).
Thanks.
2 Answers
I guess the question is somewhat opinion-based.
It depends on what you are after. It’s ultimately a tradeoff between development experience and an optimized image.
If you put everything in one RUN instruction, you are reducing the number of layers and therefore the image size to some degree. Also, each layer is stored in the registry, so more layers make pushing and pulling more time-consuming and expensive.
On the other hand, it means that each small change causes everything in the RUN instruction to run again, as it invalidates the cache for that single layer.
If you are creating temporary files with a RUN instruction that are removed by a later RUN instruction, then it would be better to run both commands in a single instruction to not create a layer with temporary files.
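For instance, a minimal sketch of that difference (the URL and file names are only illustrative):

```dockerfile
# Two instructions: the archive is deleted in the second layer, but its
# bytes are still committed in the first layer and shipped with the image.
RUN curl -fsSLO https://example.com/pkg.tar.gz && tar -xzf pkg.tar.gz
RUN rm pkg.tar.gz

# One instruction: the archive never ends up in any committed layer.
RUN curl -fsSLO https://example.com/pkg.tar.gz && \
    tar -xzf pkg.tar.gz && \
    rm pkg.tar.gz
```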
For a production image, I would opt for a single RUN instruction, as optimization is more important than build speed and caching, IMO. If you can, you could also use a multi-stage build, where the first stage uses individual RUN instructions to make use of layer caching. In the second stage, some artefacts from the first stage are taken over and the number of layers is aggressively kept to a minimum. Only the final stage is pushed to and pulled from a registry.
For example, in the sketch below, the builder stage uses more instructions than strictly required in order to get better caching. The template file is even copied into the first stage, although it's not used at all there, since it's only read at runtime. But this way the final stage can get the output binary and the template with a single COPY instruction.
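A rough sketch of such a builder/final split, assuming a Go application (go.mod, main.go, and index.html.tmpl are placeholder names):

```dockerfile
# --- builder stage: many instructions, tuned for layer caching ---
FROM golang:1.19 AS builder
WORKDIR /src
# Dependencies change rarely, so downloading them gets its own cached layer.
COPY go.mod go.sum ./
RUN go mod download
# Application source changes often; only these layers rebuild on a change.
COPY main.go ./
RUN CGO_ENABLED=0 go build -o /out/app .
# The template is only read at runtime, but copying it into /out here lets
# the final stage pick up binary and template with a single COPY.
COPY index.html.tmpl /out/

# --- final stage: as few layers as possible; only this stage is pushed ---
FROM alpine:3.18
COPY --from=builder /out/ /app/
CMD ["/app/app"]
```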
In terms of script vs. RUN instruction, I think it is more idiomatic to use a RUN instruction and concatenate multiple commands with the double ampersand (&&). If things get very complex, it may be better to use a dedicated script to make better use of shell syntax/features; it depends on what you are doing there.

The build step would only run once and then be cached. As long as the content of the script does not change, Docker will use the cached layer. You need to get the file into the image somehow before you can run it, so the real cache invalidation would already happen at the COPY instruction if the file has changed.

As mentioned in the previous paragraph, using a script costs you at minimum one extra COPY or ADD instruction, introducing an additional layer that could have been avoided if a plain RUN instruction had been used.
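A minimal sketch of the script variant (install.sh is a placeholder name):

```dockerfile
# The COPY layer (and everything after it) is only rebuilt when
# install.sh itself changes; otherwise the cached layers are reused.
COPY install.sh /tmp/install.sh
RUN sh /tmp/install.sh && rm /tmp/install.sh
```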
In terms of the final image filesystem, you will notice no difference whether you RUN the commands directly, RUN a script, or have multiple RUN commands. The number of layers and the size of the command string don't really make any difference at all.

What can you observe?
Particularly on the "classic" Docker build system, each RUN command becomes an image layer. In your example, you RUN yum install && ... && <some cleanup>; if this were split into multiple RUN commands, the un-cleaned-up content would be committed as part of the image and would take up space even though it's removed in a later layer.

"More layers" isn't necessarily bad on its own, unless you have so many layers that you hit an internal limit. The only real downside here is creating a layer with content that you're planning to delete, in which case its space will still be in the final image.
As a more specific example of this, there’s an occasional pattern where an image installs some development-only packages, runs an installation step, and uninstalls the packages. An Alpine-based example might look like
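A sketch of that pattern, assuming a Python application whose dependencies need a C toolchain to build (the package names are illustrative, not necessarily the original snippet):

```dockerfile
FROM python:3.11-alpine
WORKDIR /app
COPY requirements.txt ./
# Install build-only packages, build the dependencies, then remove the
# packages, all in one RUN so the compilers never land in a committed layer.
RUN apk add --no-cache --virtual .build-deps gcc musl-dev libffi-dev && \
    pip install --no-cache-dir -r requirements.txt && \
    apk del .build-deps
```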
In this case you must run the "install" and "uninstall" in the same RUN command; otherwise Docker will create a layer that includes the build-only packages.

(A multi-stage build may be an easier way to accomplish the same goal of needing build-only tools, but not including them in the final image.)
The actual text of the RUN command is visible in docker history and similar inspection commands.

And… that's kind of it. If you think it's more maintainable to keep the installation steps in a separate script (maybe you have some way to use the same script in a non-Docker context) then go for it. I'd generally default to keeping the steps spelled out in RUN commands, and in general try to keep those setup steps as light-weight as possible.
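As a quick way to check which command produced each layer of a built image (the image name is a placeholder):

```sh
# Show every layer with the full, untruncated command string behind it
docker history --no-trunc my-image:latest
```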