The bug that was very occasionally corrupting data on file copies in OpenZFS 2.2.0 has been identified and fixed, and there’s a fix for the previous OpenZFS release too.
The OpenZFS development team have put out not one but two new releases of the open-source cross-platform filesystem for Linux and FreeBSD. Version 2.2.2 fixes the problem that showed up in the latest version, which is included in FreeBSD 14 as well as several Linux distros, including Ubuntu 23.10. There’s also a new release in the previous OpenZFS series: version 2.1.14, which applies to FreeBSD back to version 12.
This was necessary because while, as we reported a week ago, it was OpenZFS 2.2.0 that brought the issue to light, it didn’t actually cause the problem. It merely exposed an underlying bug which had been around for years: OpenZFS 2.2.0’s new, faster copy function simply made the existing issue much more likely to happen. The FreeBSD project has published an errata notice, and made fixes available for FreeBSD 12, 13 and 14.
The investigation that’s been going on since then has revealed more. For instance, the bug was also confirmed in Illumos, the open-source fork of OpenSolaris which has continued development since Oracle killed off the open source project in 2010. Illumos is itself the basis of several OpenSolaris-based distributions.
As amendments in the release notes for both these versions clarify, it’s also slightly worse than it looked last week, when we wrote that:
“For Linux users, an additional condition seems to be that the OS has a recent version of the coreutils package – above version 9.x.”
This looked to be the case because the cp command in Coreutils 9 was updated to look for ways to speed up file copies, such as checking for “holes” in files (long stretches of zeroes), a technique called the SEEK_HOLE optimization. Unfortunately, it looks like Red Hat backported this functionality from Coreutils 9.x to 8.x, and it’s been identified in CentOS Stream 9 as well as in the OpenELA source code. As the code comment dryly says:
“I’d link to the corresponding RHEL code, but sadly they no longer publish it.”
RHEL doesn’t include OpenZFS, so this data-loss issue will not affect it. Indeed, RHEL doesn’t even include Btrfs… but Oracle Linux does, although that’s no cause for concern here: Btrfs itself is immune from the bug.
What this illustrates, though, is the problem with trying to pin down affected versions. As we described back in June, Red Hat puts a lot of engineering time and effort into backporting features from newer kernels into its very-long-term supported enterprise kernels. Sometimes, these backports may not be limited to the kernel: they may extend to non-kernel system utilities. These optimizations are perfectly safe on the Big Purple Hat’s own distro, and indeed its RHELatives such as Oracle and Alma and so on. However, such changes can get picked up by other distros, or even by people hand-building complex bespoke installations. The result is that it’s not safe to simply say “this only affects systems with coreutils 9 or above”.
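For readers who haven't met the hole-detection mechanism mentioned above, here is a minimal sketch of how a program can ask the filesystem which parts of a file hold data and which are holes. It uses the `SEEK_DATA` and `SEEK_HOLE` options to `lseek`, exposed in Python as `os.SEEK_DATA` and `os.SEEK_HOLE` on platforms that support them, such as Linux and FreeBSD. The function name `data_extents` is made up for illustration, and this is not the code that cp itself uses.

```python
import errno
import os


def data_extents(path):
    """Return (start, end) byte ranges that the filesystem reports as data.

    Illustrative helper only: the name and structure are assumptions,
    not taken from coreutils or OpenZFS.
    """
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Jump to the next byte the filesystem says is real data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:
                    break  # nothing but a hole from here to end of file
                raise
            # Find where that run of data ends (end of file counts as a hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

A copy tool that trusts these answers only needs to read the reported ranges and can skip everything else. The catch is that it depends entirely on the filesystem answering correctly: if a populated range were wrongly reported as a hole, a tool working this way would never read it.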
At any rate, for now, the issue is fixed. There’s a newer overview of the issue on GitHub, but the investigation as to when the bug first appeared is still underway, as the comments there show (along with a link to our earlier story).
The bug might go back as far as 2006. Although bug fix #15571 in these two new OpenZFS releases does resolve the issue, another, newer attempt to fix the issue in a cleaner way is also under investigation as bug fix #15615. ZFS is a complex filesystem, and this is a complex bug that may have remained hidden for 17 years. If there is a simpler, cleaner way to fix the issue, that would be a good thing.