Outline:
- Know the ZFS filesystem and Docker's zfs storage driver
- How Docker's zfs driver works and why it is slow
- Why we can't directly use overlayfs on ZFS
  - Why overlayfs doesn't accept remote filesystems
  - Why ZFS is identified as a remote filesystem
- How to actually solve the problem
We will dive into the source code of moby, OpenZFS, and the Linux kernel to find out.
Note: I admit this blog is not exactly beginner-friendly and requires some prerequisites; without them, you may have a hard time following. I have listed some questions and concepts alongside each prerequisite so you can check whether your understanding of that topic is sufficient.
- general computer/unix concepts (block devices, copy-on-write, mount points)
- basic filesystem concepts (difference between block devices and filesystems, common filesystems, basic understanding of Linux Virtual Filesystem)
- the basics of the ZFS filesystem (terminology like datasets and snapshots, what rollbacks are)
- Docker images (what image layers are, when they are created/deleted, how they work with UnionFS)
Background
What is ZFS
Described as The last word in filesystems, ZFS is scalable, and includes extensive protection against data corruption, support for high storage capacities, efficient data compression, integration of the concepts of filesystem and volume management, snapshots and copy-on-write clones, continuous integrity checking and automatic repair, RAID-Z, native NFSv4 ACLs, and can be very precisely configured. 1
By ZFS, I am referring to OpenZFS on Linux and FreeBSD: OpenZFS Documentation
ZFS is a great and sophisticated filesystem, really robust and stable. It has never failed my expectations. I use ZFS on my personal devices whenever possible, e.g. laptops (Ubuntu Desktop, for its built-in ZFS support), NAS (TrueNAS SCALE), and servers (Proxmox VE and Ubuntu Server).
What is a Docker storage driver
Docker uses storage drivers to store image layers, and to store data in the writable layer of a container. The container’s writable layer does not persist after the container is deleted, but is suitable for storing ephemeral data that is generated at runtime. Storage drivers are optimized for space efficiency, but (depending on the storage driver) write speeds are lower than native file system performance, especially for storage drivers that use a copy-on-write filesystem. Write-intensive applications, such as database storage, are impacted by a performance overhead, particularly if pre-existing data exists in the read-only layer. 2
By default, Docker will use overlay2 whenever possible for all Linux distributions.
ZFS and Docker storage driver?
The Docker Engine provides a zfs storage driver on Linux. It requires a ZFS filesystem and allows for advanced options, such as creating snapshots, but requires more maintenance and setup.
According to the Docker docs, the zfs storage driver has the following advantages 3:
- Avoids the container's writable layer growing too large in write-heavy workloads.
- Performs better for write-heavy workloads (though not as well as Docker volumes).
- A good choice for high-density workloads such as PaaS.
Hmm, sounds good, right? Well, keep reading. If it were that good, this blog wouldn't exist in the first place.
In this blog, zfs mostly refers to Docker's zfs storage driver, but it may also refer to the ZFS filesystem. You should be able to distinguish them by context.
What’s the problem?
There is one single problem with ZFS that has bothered me since the very beginning: Docker.
When using Docker on ZFS, one can only use its zfs driver (no, you cannot use overlay2 directly; we will see why later), even though the Docker docs proudly advertise the zfs driver as something high-performance:
zfs is a good choice for high-density workloads such as PaaS. 3
But in practice, this thing is painfully slow, specifically when creating image layers. Build times can go from a fraction of a second on overlay2 to several minutes on zfs! It is several orders of magnitude slower, a complete disaster.
Let’s take the Dockerfile in kube-trigger (a project that I worked on recently) as an example.
We will focus on lines 39-46 of that Dockerfile, which do little more than declare a handful of build args.
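For context, the section in question is essentially a series of build-arg declarations along these lines (an illustrative sketch, not the actual kube-trigger Dockerfile):

```dockerfile
# Illustrative only; not the real kube-trigger Dockerfile.
# Each of these instructions produces a practically empty image layer.
ARG GOPROXY
ARG GOFLAGS
ARG VERSION
ARG GIT_COMMIT
```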
You might be thinking: this is just some build args, so what? Exactly. This part does almost nothing (it only creates a few image layers) and should finish immediately. That's exactly the case on overlay2, but not on zfs, where it takes minutes!
Such slow build times are driving me crazy.
Why is the zfs driver so slow?
How the zfs storage driver works
When using Docker on a ZFS dataset, the only option is Docker's zfs driver, which uses ZFS dataset operations to create layered filesystems. The zfs storage driver stores each layer of each image as a separate legacy ZFS dataset. Even just a handful of images can result in a huge number of layers, each layer corresponding to a legacy dataset. As a result, hundreds of datasets are created when running only a dozen containers.
The base layer of an image is a ZFS filesystem. Each child layer is a ZFS clone based on a ZFS snapshot of the layer below it. A container is a ZFS clone based on a ZFS Snapshot of the top layer of the image it’s created from. 4
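Conceptually, each layer maps onto plain ZFS operations, roughly like this (a hand-written sketch with made-up dataset names, not what Docker literally executes):

```sh
# Base layer: a plain ZFS filesystem with a legacy mountpoint
zfs create -o mountpoint=legacy rpool/docker/base-layer

# Child layer: snapshot the parent, then clone that snapshot
zfs snapshot rpool/docker/base-layer@child-layer
zfs clone -o mountpoint=legacy rpool/docker/base-layer@child-layer rpool/docker/child-layer

# Container: clone a snapshot of the image's top layer
zfs snapshot rpool/docker/child-layer@container
zfs clone -o mountpoint=legacy rpool/docker/child-layer@container rpool/docker/container
```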
Where’s the bottleneck?
Although building an image does not involve quite that many datasets, Docker still spends a fair amount of time mounting and unmounting these datasets (as can be seen in the Docker debug logs).
We can take a look at the code from the Docker daemon (moby/moby).
Mounting happens (if necessary) whenever Get is called, and unmounting happens whenever Put is called. Although Docker will not mount a filesystem twice, mounts and unmounts still pile up as consecutive Get/Put calls happen.
I am not an OpenZFS developer, but it seems to me that ZFS has a bottleneck with such frequent mount/unmount actions (when a large number of datasets and snapshots are present).
As you can see, Docker already optimizes this situation by using the mount syscall directly, instead of calling the user-space mount command, which would, after crossing from user space into kernel space, require the kernel to call the user-space zfs mount helper again, due to ZFS's license issues with vfs_mount in the kernel.
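To make this concrete, here is a minimal standalone sketch (not moby's actual code; the dataset and mountpoint names are made up) of what Get and Put boil down to: issuing the mount(2)/umount2(2) syscalls directly via golang.org/x/sys/unix rather than shelling out to the mount command:

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical layer dataset and graphdriver mountpoint.
	const (
		dataset    = "rpool/docker/some-layer-id"
		mountpoint = "/var/lib/docker/zfs/graph/some-layer-id"
	)

	// Get: mount the layer's legacy dataset onto its graph directory.
	if err := unix.Mount(dataset, mountpoint, "zfs", 0, ""); err != nil {
		log.Fatalf("mount: %v", err)
	}

	// ... the layer's files are read/written here during the build ...

	// Put: lazily unmount the dataset when the layer is released.
	if err := unix.Unmount(mountpoint, unix.MNT_DETACH); err != nil {
		log.Fatalf("unmount: %v", err)
	}
}
```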
One possible solution
So there is not much left to optimize in the zfs storage driver itself; it is the actual ZFS mount process that is slowing image builds down. Now that the problem with Docker's zfs storage driver is clear, there are two options left:
- optimize ZFS mount times
- just get rid of the zfs storage driver
Of course, you could also grab another disk formatted as ext4 and use overlayfs on top of it. But I only have ZFS-formatted disks, so the above two options are all I have.
The first one, "optimize ZFS mount times", isn't really an option: I currently don't have the expertise or the time to work on OpenZFS.
With that out of the way, we only have one option left: do not use the zfs storage driver, i.e., use overlay2 on ZFS datasets.
But it doesn’t work
Now, the problem is: how do we use the overlay2 storage driver on a ZFS filesystem (dataset)?
Simply put, that's not possible (directly). ZFS makes use of d_revalidate, and having d_revalidate set to something other than NULL makes overlayfs refuse to work.
But why? To understand, we need to analyze some source code from OpenZFS and Linux kernel.
What’s d_revalidate
?
d_revalidate is defined in the Linux kernel's include/linux/dcache.h:
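An abridged excerpt (the struct has many more members, and the exact prototype can differ between kernel versions):

```c
/* include/linux/dcache.h (abridged) */
struct dentry_operations {
	int (*d_revalidate)(struct dentry *, unsigned int);
	int (*d_weak_revalidate)(struct dentry *, unsigned int);
	/* ... many more operations ... */
};
```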
The kernel documentation has a nice description of d_revalidate.
TL;DR: d_revalidate is typically used by network filesystems. It is called when the VFS needs to revalidate a dentry, i.e., to check whether the dentry is still valid, so that things cannot change on the server without the client being aware of it.
d_revalidate is called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the dcache. Most local filesystems leave this as NULL, because all their dentries in the dcache are valid. Network filesystems are different since things can change on the server without the client necessarily being aware of it. This function should return a positive value if the dentry is still valid, and zero or a negative error code if it isn't.
d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). If in rcu-walk mode, the filesystem must revalidate the dentry without blocking or storing to the dentry, d_parent and d_inode should not be used without care (because they can change and, in d_inode case, even become NULL under us).
If a situation is encountered that rcu-walk cannot handle, return -ECHILD and it will be called again in ref-walk mode.
Excerpt from: Overview of the Linux Virtual File System — The Linux Kernel documentation
Why use d_revalidate?
ZFS makes use of d_revalidate to invalidate the dcache after a rollback.
See? This is the same situation the kernel doc describes. When a rollback happens, the underlying files change, but the dcache is not updated, so ZFS uses d_revalidate to mark the affected dentries as invalid.
In OpenZFS, d_revalidate is set to a function called zpl_revalidate, which performs exactly this check.
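The wiring on the OpenZFS side looks roughly like this (paraphrased; the exact file and surrounding fields differ between OpenZFS releases). The point is simply that ZFS installs a non-NULL d_revalidate:

```c
/* Paraphrased from OpenZFS's Linux ZPL layer. */
const struct dentry_operations zpl_dentry_operations = {
	.d_revalidate	= zpl_revalidate,
};
```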
But why does having d_revalidate set to something other than NULL make overlayfs refuse to work?
How overlayfs refuses d_revalidate-enabled fs
To understand how and why overlayfs refuses a d_revalidate-enabled filesystem, let's turn our focus to the Linux kernel.
DCACHE_OP_REVALIDATE flag is set
If a dentry has d_revalidate set to something other than NULL, which is the case with ZFS, the kernel will set DCACHE_OP_REVALIDATE in its d_flags. The d_flags field is simply a set of flags describing which operations this dentry supports, and DCACHE_OP_REVALIDATE means it supports the d_revalidate operation.
Now, in our case, ZFS uses d_revalidate, so our d_flags has DCACHE_OP_REVALIDATE set.
Keep this in mind: this flag is what causes overlayfs to identify the filesystem as a remote fs. You will see why later.
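In the kernel, d_set_d_op() in fs/dcache.c translates each non-NULL dentry operation into a DCACHE_OP_* flag (abridged excerpt; only the relevant lines are shown):

```c
/* fs/dcache.c (abridged) */
void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
{
	/* ... */
	if (op->d_revalidate)
		dentry->d_flags |= DCACHE_OP_REVALIDATE;
	if (op->d_weak_revalidate)
		dentry->d_flags |= DCACHE_OP_WEAK_REVALIDATE;
	/* ... */
}
```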
Mounting process of overlayfs
To understand how overlayfs rejects a d_revalidate-enabled fs, we need to look at the code that mounts overlayfs.
When we mount an overlayfs, ovl_mount() in the kernel's fs/overlayfs is called.
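In the kernel versions this post is concerned with (before overlayfs was converted to the newer fs_context mount API), ovl_mount() is a thin wrapper that hands the real work to ovl_fill_super() (abridged):

```c
/* fs/overlayfs/super.c (abridged) */
static struct dentry *ovl_mount(struct file_system_type *fs_type, int flags,
				const char *dev_name, void *raw_data)
{
	return mount_nodev(fs_type, flags, raw_data, ovl_fill_super);
}
```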
ovl_fill_super() is then called to set up the superblock and the directories overlayfs works with (including the workdir); it in turn calls ovl_get_workdir().
Noticed something called lowerdir and upperdir? Here's what they mean in overlayfs:
An overlay filesystem combines two filesystems - an ‘upper’ filesystem and a ’lower’ filesystem. When a name exists in both filesystems, the object in the ‘upper’ filesystem is visible while the object in the ’lower’ filesystem is either hidden or, in the case of directories, merged with the ‘upper’ object.
It would be more correct to refer to an upper and lower ‘directory tree’ rather than ‘filesystem’ as it is quite possible for both directory trees to be in the same filesystem and there is no requirement that the root of a filesystem be given for either upper or lower.
The lower filesystem can be any filesystem supported by Linux and does not need to be writable. The lower filesystem can even be another overlayfs. The upper filesystem will normally be writable and if it is it must support the creation of trusted.* extended attributes, and must provide valid d_type in readdir responses, so NFS is not suitable.
A read-only overlay of two read-only filesystems may use any filesystem type.
Excerpt from Overlay Filesystem — The Linux Kernel documentation
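To make the upper/lower/work terminology concrete, here is a minimal hand-rolled example of mounting an overlay (paths are arbitrary; requires root):

```sh
mkdir -p /tmp/ovl/{lower,upper,work,merged}
echo "hello from lower" > /tmp/ovl/lower/hello.txt

# lowerdir is read-only, writes go to upperdir, workdir is internal scratch
# space, and merged is the combined view.
mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged

cat /tmp/ovl/merged/hello.txt               # file from the lower layer is visible
echo "edited" > /tmp/ovl/merged/hello.txt   # copy-up: the change lands in upper/
```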
And ovl_get_workdir() figures out where the workdir is and calls ovl_make_workdir() to create it.
ZFS is identified as a remote fs
Now comes the good bit: the workdir setup is where overlayfs rejects remote filesystems. (We will see why ZFS is identified as a remote fs in a moment.)
During that setup, overlayfs calls ovl_dentry_remote(), which treats any dentry that has the DCACHE_OP_REVALIDATE flag (remember what we said before? ZFS sets this flag) as remote, and overlayfs then refuses to use such a filesystem.
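The check itself is tiny (abridged and lightly paraphrased; the exact set of flags tested varies a little across kernel versions):

```c
/* fs/overlayfs (abridged) */
static bool ovl_dentry_remote(struct dentry *dentry)
{
	/* A ZFS dentry always has DCACHE_OP_REVALIDATE set, so it counts as "remote". */
	return dentry->d_flags &
	       (DCACHE_OP_REVALIDATE | DCACHE_OP_WEAK_REVALIDATE);
}
```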
Now everything comes together. Ah, so this is why having d_revalidate set to something other than NULL leads to Linux treating ZFS as a remote filesystem (like NFS), and thus overlayfs won't work with ZFS.
There are PRs in OpenZFS that try to fix this problem: https://github.com/openzfs/zfs/pull/9600 , https://github.com/openzfs/zfs/pull/9414 . But they are currently stalled, and I don't have the expertise or time to work on them either. Hopefully I can pick one of those PRs up and finish it someday (if possible).
Final solution
So, the one option we had left turns out not to be possible, at least not directly. Is there something we can do?
As I said earlier, "Simply put, that's not possible (directly)". Well, it turns out there is still an indirect way: ZFS Volumes. Let's see what Oracle says:
A ZFS volume is a dataset that represents a block device.
Excerpt from: https://docs.oracle.com/cd/E19253-01/819-5461/gaypf/index.html
Note that it is a block device. This is really important: it means we can treat it like a conventional hard drive and do whatever we want with it, while the data still lives on ZFS!
Since it is a block device, we can use it as a swap device, an iSCSI target, or, in this case, a block device holding an ext4 filesystem to put overlayfs on.
Solving the problem
Finally! We have decided to use a ZFS Volume (zvol) to hold our overlayfs, i.e., overlayfs on top of ext4 on top of a zvol on top of ZFS. (Well, it is a bit convoluted. But trust me, even with this many filesystem layers, the performance is still way better than Docker's zfs driver.)
Let’s fix this now.
Stop Docker:
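On a systemd-based system, for example:

```sh
sudo systemctl stop docker.service docker.socket
```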
Destroy the dataset that Docker previously used. You can use zfs list to find all datasets. In our case, it is rpool/ROOT/ubuntu_uzcb39/var/lib/docker.
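Careful: this wipes all existing images, containers, and volumes under /var/lib/docker. With the dataset name from above (adjust it to your own pool layout):

```sh
# Double-check the dataset name with `zfs list` before destroying anything.
sudo zfs destroy -r rpool/ROOT/ubuntu_uzcb39/var/lib/docker
```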
Create a ZFS Volume.
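For example, a sparse (thin-provisioned) 100 GiB volume; the name rpool/docker and the size are arbitrary choices for this sketch:

```sh
# -V sets the volume size; -s makes it sparse, so space is only
# allocated as it is actually used.
sudo zfs create -s -V 100G rpool/docker
```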
Format the zvol. On Linux, ZFS Volumes show up as block devices under /dev/zvol/<pool>/<volume> (the Oracle docs describe the Solaris-style /dev/zvol/{dsk,rdsk}/<pool> paths). Since we created a block device, let's format it as ext4.
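Continuing with the hypothetical rpool/docker volume from the previous step:

```sh
sudo mkfs.ext4 /dev/zvol/rpool/docker
```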
Mount the ext4 filesystem at /var/lib/docker.
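For example:

```sh
sudo mkdir -p /var/lib/docker
sudo mount /dev/zvol/rpool/docker /var/lib/docker
```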
Check if it is successfully mounted.
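For example:

```sh
findmnt /var/lib/docker
# or
df -h /var/lib/docker
```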
Start Docker back up and check status.
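For example (docker info should now report overlay2 as the storage driver):

```sh
sudo systemctl start docker
sudo systemctl status docker
docker info | grep -i 'storage driver'
```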
Make the changes persistent: make sure the zvol is automatically mounted at /var/lib/docker after the system reboots.
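One way to do this is an /etc/fstab entry (again using the hypothetical rpool/docker volume; nofail keeps the boot from hanging if the zvol device shows up late):

```sh
echo '/dev/zvol/rpool/docker /var/lib/docker ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
```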
Hooray! Build times are several orders of magnitude faster now!