I am trying to understand how the underlying storage works for Docker Hub. For context, JFrog states that they use checksum-based storage, which ensures not only that each image is stored only once, but also that each individual layer composing an image is stored only once, even if that layer is reused in another image.
This may have side effects when cleaning up and removing old artifacts and images from JFrog (and potentially Docker Hub), and I’m trying to understand them. I would like to know whether Docker Hub functions in a similar way, but I cannot find a clear answer in the documentation.
2 Answers
There seem to be two questions here: one for Docker Hub and one for Artifactory.
Let me address the Artifactory side. Your understanding is correct: Artifactory is checksum-based and stores every layer only once.
Use case 1:
We publish two images that have a few layers in common. Even if we delete one image, the common layers will not be deleted, because a reference to them still exists.
Use case 2:
Suppose we pull two images from Docker Hub that have a layer in common. When we pull, Artifactory saves a copy in the remote cache and the binary store, and only unique items are saved. When we delete an image, only the unreferenced layers are deleted. The deletion is local to Artifactory; it does not delete anything from the remote endpoint, Docker Hub.
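The two use cases above can be modelled with a small reference-counted store. This is only a rough sketch of the idea, not Artifactory's actual implementation: the `LayerStore` class, its methods, and the image names are all made up for illustration.

```python
import hashlib
from collections import Counter


class LayerStore:
    """Toy checksum-based store (hypothetical): each layer is kept once and reference-counted."""

    def __init__(self):
        self.blobs = {}        # checksum -> layer bytes
        self.refs = Counter()  # checksum -> number of image references
        self.images = {}       # image name -> list of layer checksums

    def push(self, name, layers):
        if name in self.images:  # re-pushing replaces the old image
            self.delete(name)
        checksums = []
        for data in layers:
            checksum = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(checksum, data)  # stored only the first time it is seen
            self.refs[checksum] += 1
            checksums.append(checksum)
        self.images[name] = checksums

    def delete(self, name):
        for checksum in self.images.pop(name):
            self.refs[checksum] -= 1
            if self.refs[checksum] == 0:  # no remaining image uses this layer
                del self.refs[checksum]
                del self.blobs[checksum]


store = LayerStore()
store.push("app-a", [b"base layer", b"app-a code"])
store.push("app-b", [b"base layer", b"app-b code"])
print(len(store.blobs))  # 3: the shared base layer is stored once

store.delete("app-a")
print(len(store.blobs))  # 2: the base layer survives because app-b still references it
```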
Container registries are implemented as a combination of a Directed Acyclic Graph (DAG) and Content Addressable Storage (CAS). Each image has a manifest, in JSON, that lists the blobs for the layers and the image config. Those blobs are referenced by their digest, and the API to push and fetch those blobs includes that digest. So two different images in the same repository whose manifests reference the same blob will use the same API request to pull that blob. There’s no way to tell the difference between the requests, so there’s no need to store the same blob twice.
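To make the content-addressing concrete, here is a minimal in-memory sketch. The media types and the `/v2/<name>/blobs/<digest>` path follow the OCI specs, but the helper functions, the `example/app` repository name, and the byte strings standing in for real layer tarballs are assumptions made for illustration.

```python
import hashlib
import json

blobs = {}  # digest -> content; stands in for the registry's blob store


def put_blob(data: bytes) -> str:
    """Store a blob under its sha256 digest and return that digest."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    blobs[digest] = data  # identical content always lands on the same key
    return digest


def descriptor(media_type: str, data: bytes) -> dict:
    return {"mediaType": media_type, "digest": put_blob(data), "size": len(data)}


def build_manifest(config: bytes, layers: list) -> dict:
    """Build an OCI-style image manifest that lists its blobs by digest."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor("application/vnd.oci.image.config.v1+json", config),
        "layers": [
            descriptor("application/vnd.oci.image.layer.v1.tar+gzip", layer)
            for layer in layers
        ],
    }


def blob_path(repo: str, digest: str) -> str:
    """Path used to fetch a blob; it depends only on the repository and the digest."""
    return f"/v2/{repo}/blobs/{digest}"


image_a = build_manifest(b'{"cfg": "a"}', [b"base layer", b"app-a code"])
image_b = build_manifest(b'{"cfg": "b"}', [b"base layer", b"app-b code"])

shared_a = image_a["layers"][0]["digest"]
shared_b = image_b["layers"][0]["digest"]
print(blob_path("example/app", shared_a) == blob_path("example/app", shared_b))  # True
print(len(blobs))  # 5: two configs, one shared base layer, two unique layers
print(json.dumps(image_a, indent=2))
```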
When deleting content, you shouldn’t delete the blobs. Instead, delete the manifests you no longer want, and rely on the registry to garbage collect those blobs.
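Conceptually, that garbage collection is a mark-and-sweep pass: every blob referenced by a remaining manifest is kept, and the rest is removed. The sketch below assumes that simplified model. The function names and the short placeholder digests are invented, and real registries differ in when and how they run this.

```python
def referenced_blobs(manifest: dict) -> list:
    """Digests of the config and layer blobs a manifest points at."""
    digests = [layer["digest"] for layer in manifest.get("layers", [])]
    if "config" in manifest:
        digests.append(manifest["config"]["digest"])
    return digests


def garbage_collect(manifests: dict, blobs: dict) -> list:
    """Delete every blob that no remaining manifest references; return what was removed."""
    live = set()
    for manifest in manifests.values():  # mark phase
        live.update(referenced_blobs(manifest))
    removed = sorted(d for d in blobs if d not in live)
    for digest in removed:  # sweep phase
        del blobs[digest]
    return removed


# Placeholder digests for readability; real ones are full sha256 hashes.
blobs = {d: b"" for d in ["sha256:cfg-a", "sha256:cfg-b", "sha256:base", "sha256:a", "sha256:b"]}
manifests = {
    "sha256:m-a": {"config": {"digest": "sha256:cfg-a"},
                   "layers": [{"digest": "sha256:base"}, {"digest": "sha256:a"}]},
    "sha256:m-b": {"config": {"digest": "sha256:cfg-b"},
                   "layers": [{"digest": "sha256:base"}, {"digest": "sha256:b"}]},
}

del manifests["sha256:m-a"]               # the user deletes only the manifest
print(garbage_collect(manifests, blobs))  # ['sha256:a', 'sha256:cfg-a']
print(sorted(blobs))                      # the shared base layer is still there
```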
However, manifests are deleted by digest, and multiple tags can point to the same manifest. You can also have a manifest list, used for multi-platform images, that points to other manifests. While there is an API to delete tags, most registries don’t implement it, so before deleting a manifest you need to make sure that no tag you want to keep still references it. To minimize that risk, I delete a tag by pushing an empty manifest to the tag I want to delete, and then deleting the digest of that empty manifest.
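The hazard and the workaround can be illustrated with a toy model. Everything here is hypothetical: the dictionaries, the function names, and the placeholder digests are made up, and the model assumes that deleting a manifest by digest also drops every tag pointing at it.

```python
def referencing_tags(target: str, tags: dict, manifests: dict) -> list:
    """Tags that resolve to `target` directly or through a manifest list."""
    hits = []
    for tag, digest in tags.items():
        children = [m["digest"] for m in manifests.get(digest, {}).get("manifests", [])]
        if digest == target or target in children:
            hits.append(tag)
    return sorted(hits)


def delete_manifest(digest: str, tags: dict, manifests: dict) -> None:
    """Delete by digest; in this model, every tag pointing at the digest goes with it."""
    manifests.pop(digest, None)
    for tag in [t for t, d in tags.items() if d == digest]:
        del tags[tag]


def delete_tag(tag: str, tags: dict, manifests: dict) -> None:
    """Workaround: retag to a throwaway empty manifest, then delete that digest."""
    empty = "sha256:empty"  # placeholder digest for the empty manifest
    manifests[empty] = {}
    tags[tag] = empty  # the tag no longer points at the real image
    delete_manifest(empty, tags, manifests)


manifests = {
    "sha256:aaa": {"layers": []},
    "sha256:idx": {"manifests": [{"digest": "sha256:aaa"}]},  # manifest list
}
tags = {"v1": "sha256:aaa", "latest": "sha256:aaa", "multi": "sha256:idx"}

# Deleting sha256:aaa directly would affect all three of these tags.
print(referencing_tags("sha256:aaa", tags, manifests))  # ['latest', 'multi', 'v1']

delete_tag("v1", tags, manifests)
print(sorted(tags))               # ['latest', 'multi']
print("sha256:aaa" in manifests)  # True
```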
For more details on how registries work, see the OCI distribution-spec.