* RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? @ 2017-08-02 8:38 Brendan Hide 2017-08-02 9:11 ` Wang Shilong ` (5 more replies) 0 siblings, 6 replies; 63+ messages in thread From: Brendan Hide @ 2017-08-02 8:38 UTC (permalink / raw) To: linux-btrfs The title seems alarmist to me - and I suspect it is going to be misconstrued. :-/ From the release notes at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html "Btrfs has been deprecated The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux. The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature. Red Hat will continue to invest in future technologies to address the use cases of our customers, specifically those related to snapshots, compression, NVRAM, and ease of use. We encourage feedback through your Red Hat representative on features and requirements you have for file systems and storage technology." ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 8:38 RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? Brendan Hide @ 2017-08-02 9:11 ` Wang Shilong 2017-08-03 19:18 ` Chris Murphy 2017-08-02 11:25 ` Austin S. Hemmelgarn ` (4 subsequent siblings) 5 siblings, 1 reply; 63+ messages in thread From: Wang Shilong @ 2017-08-02 9:11 UTC (permalink / raw) To: Brendan Hide, linux-btrfs

I haven't seen active btrfs developers from Red Hat for some time; they look to have put most of their effort into XFS. It is time to switch to SLES/openSUSE!

On Wed, Aug 2, 2017 at 4:38 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > The title seems alarmist to me - and I suspect it is going to be > misconstrued. :-/ > > From the release notes at > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html > > "Btrfs has been deprecated > > The Btrfs file system has been in Technology Preview state since the initial > release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a > fully supported feature and it will be removed in a future major release of > Red Hat Enterprise Linux. > > The Btrfs file system did receive numerous updates from the upstream in Red > Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise > Linux 7 series. However, this is the last planned update to this feature. > > Red Hat will continue to invest in future technologies to address the use > cases of our customers, specifically those related to snapshots, > compression, NVRAM, and ease of use. We encourage feedback through your Red > Hat representative on features and requirements you have for file systems > and storage technology."
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 9:11 ` Wang Shilong @ 2017-08-03 19:18 ` Chris Murphy 0 siblings, 0 replies; 63+ messages in thread From: Chris Murphy @ 2017-08-03 19:18 UTC (permalink / raw) To: Wang Shilong; +Cc: Brendan Hide, linux-btrfs

On Wed, Aug 2, 2017 at 3:11 AM, Wang Shilong <wangshilong1991@gmail.com> wrote: > I haven't seen active btrfs developers from some time, Redhat looks > put most of their efforts on XFS, It is time to switch to SLES/opensuse!

I disagree. We need one or more Btrfs developers involved in Fedora. Fedora runs fairly unmodified upstream kernels, which are kept up to date. By default, Fedora 24, 25, 26 users today are on kernel 4.11.11 or 4.11.12. Fedora 25, 26 will soon be rebased to probably 4.12.5. That's the stable repo. You can optionally get newer non-rc ones from the testing repo. And nightly Rawhide kernels are built as well with the latest patchset in between rc's.

Both Btrfs and Fedora are heavily developing in containerized deployments, so it seems like a good fit for both camps. The problem is the Fedora kernel team has no one sufficiently familiar with Btrfs, nor anyone at Red Hat to fall back on. But they do have this with ext4, XFS, device-mapper, and LVM developers. So they're not going to take on a burden like Btrfs by default without a knowledgeable pair of eyes to triage issues as they come up. And instead they're moving to XFS + overlayfs.

There's more opportunity for Btrfs than just as a default file system.
I like the idea of using Btrfs on install media to eliminate the monolithic isomd5sum most users skip to test their USB install media; eliminate the device-mapper based persistent overlay for the install media and use Btrfs seed/sprout instead (which would help the Sugar on a Stick project as well); and at least for nightly composes eliminate squashfs xz based images in favor of Btrfs compression (faster compression and decompression, bigger file sizes, but these are daily throwaways so I think time is more important).

Anyway, the point is that converging on SUSE doesn't help. If anything I think it'll shrink the market for Btrfs as a general purpose file system, rather than grow it.

-- Chris Murphy
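The seed/sprout flow Chris mentions can be sketched roughly as below. This is a hedged sketch of the generic btrfs seed-device workflow, not Fedora tooling; device names are hypothetical, and the `run` helper only prints the commands instead of executing them.

```shell
# Dry-run sketch of a btrfs seed/sprout workflow.
# Swap the body of run() for "$@" to actually execute; device names are examples.
run() { printf '+ %s\n' "$*"; }

run btrfstune -S 1 /dev/sdb1          # mark the install-media filesystem as a seed
run mount /dev/sdb1 /mnt              # a seed device mounts read-only
run btrfs device add /dev/sdc1 /mnt   # "sprout": attach a writable device
run mount -o remount,rw /mnt          # new writes now land on /dev/sdc1
```

Once sprouted, the seed can even be detached with `btrfs device delete`, migrating all data onto the writable device.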
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 8:38 RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? Brendan Hide 2017-08-02 9:11 ` Wang Shilong @ 2017-08-02 11:25 ` Austin S. Hemmelgarn 2017-08-02 12:55 ` Lutz Vieweg 2017-08-02 18:44 ` Chris Mason ` (3 subsequent siblings) 5 siblings, 1 reply; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-02 11:25 UTC (permalink / raw) To: Brendan Hide, linux-btrfs On 2017-08-02 04:38, Brendan Hide wrote: > The title seems alarmist to me - and I suspect it is going to be > misconstrued. :-/ > > From the release notes at > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html > > > "Btrfs has been deprecated > > The Btrfs file system has been in Technology Preview state since the > initial release of Red Hat Enterprise Linux 6. Red Hat will not be > moving Btrfs to a fully supported feature and it will be removed in a > future major release of Red Hat Enterprise Linux. > > The Btrfs file system did receive numerous updates from the upstream in > Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat > Enterprise Linux 7 series. However, this is the last planned update to > this feature. > > Red Hat will continue to invest in future technologies to address the > use cases of our customers, specifically those related to snapshots, > compression, NVRAM, and ease of use. We encourage feedback through your > Red Hat representative on features and requirements you have for file > systems and storage technology." And this is a worst-case result of the fact that most distros added BTRFS support long before it was ready. I'm betting some RH customer lost a lot of data because they didn't pay attention to the warnings and didn't do their research and were using raid5/6, and thus RH is considering it not worth investing in. 
That, or they got fed up with the grandiose plans with no realistic timeline. There have been a number of cases of mishandled patches (chunk-level degraded check, anyone?), and a lot of important (from an enterprise usage sense) features that have been proposed but to a naive outsider appear to have seen little to no progress (hot-spare support, device failure detection and handling, higher-order replication, working erasure coding (raid56), etc), and from both aspects, I can understand them not wanting to deal with it.
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 11:25 ` Austin S. Hemmelgarn @ 2017-08-02 12:55 ` Lutz Vieweg 2017-08-02 13:47 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 63+ messages in thread From: Lutz Vieweg @ 2017-08-02 12:55 UTC (permalink / raw) To: linux-btrfs

On 08/02/2017 01:25 PM, Austin S. Hemmelgarn wrote: > And this is a worst-case result of the fact that most > distros added BTRFS support long before it was ready.

RedHat still advertises "Ceph", and given Ceph initially recommended btrfs as the filesystem to use for its nodes, it is interesting to read how clearly they recommend against btrfs now:

http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/

> We recommand against using btrfs due to the lack of a stable version > to test against and frequent bugs in the ENOSPC handling.

German IT magazine "Golem" speculates that RedHat's decision is influenced by its recent acquisition of Permabit.

But I don't really see how XFS or Permabit tackle the problem that if you need to create consistent backups of file systems while they are in use, block-device level snapshots damage the write performance big time.

(That backup topic is the one reason we use btrfs for a lot of /home/ directories.)

I understand that XFS is expected to get some COW-features in the future as well - but it remains to be seen what performance and robustness implications that will have on XFS.

Regards, Lutz Vieweg
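The consistent-backup pattern Lutz alludes to relies on btrfs snapshots being atomic at the filesystem level. A hedged sketch (paths are hypothetical; the `run` helper only prints the commands rather than executing them):

```shell
# Dry-run sketch: a read-only snapshot is atomic, so it can be backed up
# consistently while /home stays in use. Paths are examples only.
run() { printf '+ %s\n' "$*"; }

run btrfs subvolume snapshot -r /home /home/.snap-today   # atomic, read-only
run sync                                                  # ensure the snapshot hits disk
run btrfs send -f /backup/home.stream /home/.snap-today   # serialize it (or pipe to btrfs receive)
run btrfs subvolume delete /home/.snap-today              # drop it once the backup is done
```

Unlike an LVM block-level snapshot, no copy-on-write happens at the block layer underneath the live filesystem, which is the write-performance point Lutz is making.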
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 12:55 ` Lutz Vieweg @ 2017-08-02 13:47 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-02 13:47 UTC (permalink / raw) To: Lutz Vieweg, linux-btrfs

On 2017-08-02 08:55, Lutz Vieweg wrote: > On 08/02/2017 01:25 PM, Austin S. Hemmelgarn wrote: >> And this is a worst-case result of the fact that most >> distros added BTRFS support long before it was ready. > > RedHat still advertises "Ceph", and given Ceph initially recommended > btrfs as > the filesystem to use for its nodes, it is interesting to read how clearly > they recommend against btrfs now: > > http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/ > >> We recommand against using btrfs due to the lack of a stable version >> to test against and frequent bugs in the ENOSPC handling.

Yes, and the one thing they don't mention there is that Ceph is already doing most of the same things that BTRFS is, so you end up having performance issues due to duplicated work too. What they specifically call out though is first the reason that it should not be supported yet in RHEL, OEL, and many other distros (I'm explicitly leaving SLES/OpenSUSE off of that list, because while I disagree with their choices of default behavior WRT BTRFS, they are actively involved in its development, unlike most of the other distros that 'support' it), and then second one of the biggest issues for regular usage.

> > German IT magazine "Golem" speculates that RedHat's decision > is influenced by its recent acquisition of Permabit. > > But I don't really see how XFS or Permabit tackle the problem > that if you need to create consistent backups of file systems while they > are > in use, block-device level snapshots damage the write performance > big time.
When you're talking about data safety though, most people are willing to sacrifice write performance in favor of significantly lowering perceived risk. The misguided early support of BTRFS by many distros, without sufficient explanation of exactly how 'in-development' it is, means that there are a lot more stories of issues and failures with BTRFS than of success (partly also because the filesystem is one of those things that people tend to complain about if it breaks, and not praise all that much if it works), and as a result, the general perception outside of people who use it actively is that it's pretty risky to use (which is absolutely accurate if you don't do routine maintenance on it).

> > (That backup topic is the one reason we use btrfs for a lot of > /home/ directories.) > > I understand that XFS is expected to get some COW-features in the future > as well - but it remains to be seen what performance and robustness > implications that will have on XFS.

I believe basic reflink functionality is already upstream, and I wasn't aware of any other specific development for XFS.
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 8:38 RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? Brendan Hide 2017-08-02 9:11 ` Wang Shilong 2017-08-02 11:25 ` Austin S. Hemmelgarn @ 2017-08-02 18:44 ` Chris Mason 2017-08-02 22:12 ` Fajar A. Nugraha 2017-08-02 22:22 ` Chris Murphy ` (2 subsequent siblings) 5 siblings, 1 reply; 63+ messages in thread From: Chris Mason @ 2017-08-02 18:44 UTC (permalink / raw) To: Brendan Hide, linux-btrfs

On 08/02/2017 04:38 AM, Brendan Hide wrote: > The title seems alarmist to me - and I suspect it is going to be > misconstrued. :-/

Supporting any filesystem is a huge amount of work. I don't have a problem with Redhat or any distro picking and choosing the projects they want to support.

At least inside of FB, our own internal btrfs usage is continuing to grow. Btrfs is becoming a big part of how we ship containers and other workloads where snapshots improve performance. We also heavily use XFS, so I'm happy to see RH's long standing investment there continue.

-chris
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 18:44 ` Chris Mason @ 2017-08-02 22:12 ` Fajar A. Nugraha 0 siblings, 0 replies; 63+ messages in thread From: Fajar A. Nugraha @ 2017-08-02 22:12 UTC (permalink / raw) To: linux-btrfs

On Thu, Aug 3, 2017 at 1:44 AM, Chris Mason <clm@fb.com> wrote: > > On 08/02/2017 04:38 AM, Brendan Hide wrote: >> >> The title seems alarmist to me - and I suspect it is going to be misconstrued. :-/ > > > Supporting any filesystem is a huge amount of work. I don't have a problem with Redhat or any distro picking and choosing the projects they want to support. >

It'd help a lot of people if things like https://btrfs.wiki.kernel.org/index.php/Status were kept up-to-date and 'promoted', so at least users are more informed about what they're getting into and can choose which features (stable/still in dev/likely to destroy your data) they want to use.

For example, https://btrfs.wiki.kernel.org/index.php/Status says compression is 'mostly OK' ('auto-repair and compression may crash' looks pretty scary, as from a newcomer's perspective it might be interpreted as 'potential data loss'), while https://en.opensuse.org/SDB:BTRFS#Compressed_btrfs_filesystems says they support compression on newer opensuse versions.

> > At least inside of FB, our own internal btrfs usage is continuing to grow. Btrfs is becoming a big part of how we ship containers and other workloads where snapshots improve performance. >

Ubuntu also supports btrfs as part of their container implementation (lxd), and (reading the lxd mailing list) some people use lxd+btrfs in their production environment. IIRC the last problem posted on the lxd list about btrfs was about how 'btrfs send/receive (used by lxd copy) is slower than rsync for full/initial copy'.

-- Fajar
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 8:38 RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? Brendan Hide ` (2 preceding siblings ...) 2017-08-02 18:44 ` Chris Mason @ 2017-08-02 22:22 ` Chris Murphy 2017-08-03 9:59 ` Lutz Vieweg 2017-08-03 18:08 ` waxhead 2017-08-04 14:05 ` Qu Wenruo 5 siblings, 1 reply; 63+ messages in thread From: Chris Murphy @ 2017-08-02 22:22 UTC (permalink / raw) To: Brendan Hide; +Cc: Btrfs BTRFS

On Wed, Aug 2, 2017 at 2:38 AM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > The title seems alarmist to me - and I suspect it is going to be > misconstrued. :-/

Josef pushed back on the HN thread with very sound reasoning about why this is totally unsurprising. RHEL runs old kernels, and they have no upstream Btrfs developers. So it's a huge PITA to backport the tons of changes Btrfs has been going through (thousands of line changes per kernel cycle).

What's more interesting to me is whether this means

- CONFIG_BTRFS_FS=m
+ # CONFIG_BTRFS_FS is not set

In particular in elrepo.org kernels.

Also more interesting is this Stratis project that started up a few months ago: https://github.com/stratis-storage/stratisd

Which also includes this design document: https://stratis-storage.github.io/StratisSoftwareDesign.pdf

Basically they're creating a file system manager manifesting as a daemon, new CLI tools, and new metadata formats for the volume manager. So it's going to use existing device mapper, md, some LVM stuff, XFS, in a layered approach abstracted from the user.

-- Chris Murphy
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 22:22 ` Chris Murphy @ 2017-08-03 9:59 ` Lutz Vieweg 0 siblings, 0 replies; 63+ messages in thread From: Lutz Vieweg @ 2017-08-03 9:59 UTC (permalink / raw) To: linux-btrfs

On 08/03/2017 12:22 AM, Chris Murphy wrote: > Also more interesting is this Stratis project that started up a few months ago: > https://github.com/stratis-storage/stratisd > > Which also includes this design document: > https://stratis-storage.github.io/StratisSoftwareDesign.pdf

This concept, if successfully implemented, does not seem to achieve anything beyond "hide the complexity of its implementation from the user". No actual new functionality, no reason to assume any additional robustness or stability, and certainly not a new filesystem, just yet-another-wrapper. Keeping users from understanding the complexity of a storage system they use is not a benefit for all but the most trivial use cases. And I find it symptomatic that the section "D-Bus Access Control" in StratisSoftwareDesign.pdf is empty.

> So it's going to use existing device mapper, md, some LVM > stuff, XFS

That is the only part of the Stratis concept that looks reasonable to me.
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 8:38 RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? Brendan Hide ` (3 preceding siblings ...) 2017-08-02 22:22 ` Chris Murphy @ 2017-08-03 18:08 ` waxhead 2017-08-03 18:29 ` Christoph Anton Mitterer ` (2 more replies) 2017-08-04 14:05 ` Qu Wenruo 5 siblings, 3 replies; 63+ messages in thread From: waxhead @ 2017-08-03 18:08 UTC (permalink / raw) To: Brendan Hide, linux-btrfs

Brendan Hide wrote: > The title seems alarmist to me - and I suspect it is going to be > misconstrued. :-/ > > From the release notes at > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html > > "Btrfs has been deprecated > > The Btrfs file system has been in Technology Preview state since the > initial release of Red Hat Enterprise Linux 6. Red Hat will not be > moving Btrfs to a fully supported feature and it will be removed in a > future major release of Red Hat Enterprise Linux. > > The Btrfs file system did receive numerous updates from the upstream > in Red Hat Enterprise Linux 7.4 and will remain available in the Red > Hat Enterprise Linux 7 series. However, this is the last planned > update to this feature. > > Red Hat will continue to invest in future technologies to address the > use cases of our customers, specifically those related to snapshots, > compression, NVRAM, and ease of use. We encourage feedback through > your Red Hat representative on features and requirements you have for > file systems and storage technology."

First of all I am not a BTRFS dev, but I use it for various projects and have high hopes for what it can become.
Now, the fact that Red Hat deprecates BTRFS does not mean that BTRFS is deprecated. It is not removed from the kernel, and so far BTRFS offers features that other filesystems don't have. ZFS is something that people brag about all the time as a viable alternative, but to me it seems to be a pain to manage properly. E.g. grow, add/remove devices, shrink etc... good luck doing that right!

BTRFS' biggest problem is not that there are some bits and pieces that are thoroughly screwed up (raid5/6 (which just got some fixes by the way)), but the fact that the documentation is rather dated. There is a simple status page here https://btrfs.wiki.kernel.org/index.php/Status

As others have pointed out already, the explanations on the status page are not exactly good. For example compression (that was also mentioned) is as of this writing marked as 'Mostly ok' '(needs verification and source) - auto repair and compression may crash'. Now, I am aware that many use compression without trouble. I am not sure how many use compression on disks with issues and don't have trouble, but I would at least expect to see more people yelling on the mailing list if that were the case. The problem here is that this message is rather scary and certainly does NOT sound like 'mostly ok' to most people. What exactly needs verification and source? The 'mostly ok' statement, or something else?! A more detailed explanation would be required here to avoid scaring people away.

Same thing with the trim feature that is marked OK. It clearly says that it has performance implications. It is marked OK so one would expect it to not cause the filesystem to fail, but if the performance becomes so slow that the filesystem gets practically unusable it is of course not "OK". The relevant information is missing for people to make a decent choice, and I certainly don't know how serious these performance implications are, if they are at all relevant...
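For reference, the compression feature being debated above is exercised through a few well-known knobs. A hedged sketch (mount point and paths are hypothetical; the `run` helper only prints the commands):

```shell
# Dry-run sketch of the usual ways to enable btrfs compression.
run() { printf '+ %s\n' "$*"; }

run mount -o compress=lzo /dev/sdb1 /mnt        # whole filesystem; applies to new writes only
run chattr +c /mnt/logs                         # per-file/directory compression attribute
run btrfs filesystem defragment -r -clzo /mnt   # recompress data that already exists
```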
Most people interested in BTRFS are probably a bit more paranoid and concerned about their data than the average computer user. What people tend to forget is that other filesystems have NONE of the redundancy, auto-repair and other fancy features that BTRFS has. So for the compression example above... if you run compressed files on ext4 and your disk gets some corruption, you are in no better a state than you would be with btrfs (in fact probably worse). Also nothing is stopping you from putting btrfs DUP on a mdadm raid5 or 6, which means you should be VERY safe.

Simple documentation is the key so HERE ARE MY DEMANDS!!!..... ehhh.... so here is what I think should be done:

1. The documentation needs to either be improved (or old non-relevant stuff simply removed / archived somewhere)
2. The status page MUST always be up to date for the latest kernel release (It's ok so far, let's hope nobody sleeps here)
3. Proper explanations must be given so the layman and reasonably technical people understand the risks / issues for non-ok stuff.
4. There should be links to roadmaps for each feature on the status page that clearly state what is being worked on for the NEXT kernel release
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 18:08 ` waxhead @ 2017-08-03 18:29 ` Christoph Anton Mitterer 2017-08-03 19:22 ` Austin S. Hemmelgarn 2017-08-03 19:03 ` Austin S. Hemmelgarn 2017-08-16 18:07 ` David Sterba 2 siblings, 1 reply; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-03 18:29 UTC (permalink / raw) To: linux-btrfs

On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote: > Brendan Hide wrote: > > The title seems alarmist to me - and I suspect it is going to be > > misconstrued. :-/ > > > > From the release notes at > > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Li > > nux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux- > > 7.4_Release_Notes-Deprecated_Functionality.html > > "Btrfs has been deprecated > >

Wow... not that this would have any direct effect... it's still quite alarming, isn't it?

This is not meant as criticism, but I often wonder myself where btrfs is going!? :-/

It's in the kernel now since when? 2009? And while the extremely basic things (snapshots, etc.) seem to work quite stably... other things seem to be rather stuck (RAID?)... not to talk about many things that have been kinda "promised" (fancy different compression algos, n-parity-raid). There are no higher-level management tools (e.g. RAID management/monitoring, etc.)... there are still some kinda serious issues (the attacks/corruptions likely possible via UUID collisions)...

One thing that I have missed for a long time would be checksumming with nodatacow.

Also it has always been said that the actual performance tuning would still lie ahead?!

I really like btrfs and use it on all my personal systems... and I haven't had any data loss since then (only a number of serious-looking false positives due to bugs in btrfs check ;-) )... but one still reads every now and then from people here on the list who seem to suffer from more serious losses.
So is there any concrete roadmap? Or priority tasks? Is there a lack of developers?

Cheers, Chris.
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 18:29 ` Christoph Anton Mitterer @ 2017-08-03 19:22 ` Austin S. Hemmelgarn 2017-08-03 20:45 ` Brendan Hide 0 siblings, 1 reply; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-03 19:22 UTC (permalink / raw) To: Christoph Anton Mitterer, linux-btrfs On 2017-08-03 14:29, Christoph Anton Mitterer wrote: > On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote: >> Brendan Hide wrote: >>> The title seems alarmist to me - and I suspect it is going to be >>> misconstrued. :-/ >>> >>> From the release notes at >>> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Li >>> nux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux- >>> 7.4_Release_Notes-Deprecated_Functionality.html >>> "Btrfs has been deprecated >>> > > Wow... not that this would have any direct effect... it's still quite > alarming, isn't it? > > This is not meant as criticism, but I often wonder myself where the > btrfs is going to!? :-/ > > It's in the kernel now since when? 2009? And while the extremely basic > things (snapshots, etc.) seem to work quite stable... other things seem > to be rather stuck (RAID?)... not to talk about many things that have > been kinda "promised" (fancy different compression algos, n-parity- > raid). I assume you mean the erasure coding the devs and docs call raid56 when you're talking about stuck features, and you're right, it has been stuck, but it arguably should have been better tested and verified before being merged at all. As far as other 'raid' profiles, raid1 and raid0 work fine, and raid10 is mostly fine once you wrap your head around the implications of the inconsistent component device ordering. > There are no higher-level management tools (e.g. RAID > management/monitoring, etc.)... there are still some kinda serious > issues (the attacks/corruptions likely possible via UUID collisions)... 
The UUID collision issue is present in almost all volume managers and filesystems, it just does more damage in BTRFS, and is exacerbated by the brain-dead 'scan everything for BTRFS' policy in udev. As far as 'higher-level' management tools, you're using your system wrong if you _need_ them. There is no need for there to be a GUI, or a web interface, or a DBus interface, or any other such bloat in the main management tools, they work just fine as is and are mostly on par with the interfaces provided by LVM, MD, and ZFS (other than the lack of machine parseable output). I'd also argue that if you can't reassemble your storage stack by hand without using 'higher-level' tools, you should not be using that storage stack as you don't properly understand it. On the subject of monitoring specifically, part of the issue there is kernel side, any monitoring system currently needs to be polling-based, not event-based, and as a result monitoring tends to be a very system specific affair based on how much overhead you're willing to tolerate. The limited stuff that does exist is also trivial to integrate with many pieces of existing monitoring infrastructure (like Nagios or monit), and therefore the people who care about it a lot (like me) are either monitoring by hand, or are just using the tools with their existing infrastructure (for example, I use monit already on all my systems, so I just make sure to have entries in the config for that to check error counters and scrub results), so there's not much in the way of incentive for the concerned parties to reinvent the wheel. > One thing that I miss since long would be the checksumming with > nodatacow. It has been stated multiple times on the list that this is not possible without making nodatacow prone to data loss. > Also it has always been said that the actual performance tunning would > still lay ahead?! 
While there hasn't been anything touted specifically as performance tuning, performance has improved slightly since I started using BTRFS.

> > I really like btrfs and use it on all my personal systems... and I > haven't had any data loss since then (only a number of seriously > looking false positives due to bugs in btrfs check ;-) )... but one > still reads every now and then from people here on the list who seem to > suffer from more serious losses.

And this brings up part of the issue with uptake. People are quick to post about issues, but not successes. I've been running BTRFS on almost everything (I don't use it in VM's because of the performance implications of having multiple CoW layers) since around kernel 3.9, have had no critical issues (ones resulting in data loss) since about 3.16, and have actually survived quite a few pieces of marginal or failed hardware as a result of BTRFS.

> > > > So is there any concrete roadmap? Or priority tasks? Is there a lack of > developers?

In order, no, in theory yes but not in practice, and somewhat. As a general rule, all FOSS projects are short on developers. Most of the work that is occurring on BTRFS is being sponsored by SUSE, Facebook, or Fujitsu (at least, I'm pretty sure those are the primary sponsors), and their priorities will not necessarily coincide with normal end-user priorities. I'd say though that testing and review are just as much short on manpower as development.
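The polling-based checks Austin describes (error counters and scrub results) typically boil down to parsing `btrfs device stats` output from cron or monit. A minimal sketch, with sample output inlined so the parsing logic is visible without an actual btrfs mount:

```shell
# In real use: stats="$(btrfs device stats /mnt)". The sample below mimics
# that output format; the corruption counter is deliberately non-zero.
stats='[/dev/sda].write_io_errs    0
[/dev/sda].read_io_errs     0
[/dev/sda].flush_io_errs    0
[/dev/sda].corruption_errs  2
[/dev/sda].generation_errs  0'

# Sum every per-device error counter; alert if anything is non-zero.
errors=$(printf '%s\n' "$stats" | awk '{sum += $2} END {print sum + 0}')
if [ "$errors" -gt 0 ]; then
    echo "WARNING: btrfs reports $errors device errors"
fi
```

A monit or Nagios check wrapping exactly this logic is the kind of "trivial integration with existing infrastructure" being described.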
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 19:22 ` Austin S. Hemmelgarn @ 2017-08-03 20:45 ` Brendan Hide 2017-08-03 22:00 ` Chris Murphy 2017-08-04 11:26 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 63+ messages in thread From: Brendan Hide @ 2017-08-03 20:45 UTC (permalink / raw) To: Austin S. Hemmelgarn, Christoph Anton Mitterer, linux-btrfs On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote: > On 2017-08-03 14:29, Christoph Anton Mitterer wrote: >> On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote: >> There are no higher-level management tools (e.g. RAID >> management/monitoring, etc.)... [snip] > As far as 'higher-level' management tools, you're using your system > wrong if you _need_ them. There is no need for there to be a GUI, or a > web interface, or a DBus interface, or any other such bloat in the main > management tools, they work just fine as is and are mostly on par with > the interfaces provided by LVM, MD, and ZFS (other than the lack of > machine parseable output). I'd also argue that if you can't reassemble > your storage stack by hand without using 'higher-level' tools, you > should not be using that storage stack as you don't properly understand it. > > On the subject of monitoring specifically, part of the issue there is > kernel side, any monitoring system currently needs to be polling-based, > not event-based, and as a result monitoring tends to be a very system > specific affair based on how much overhead you're willing to tolerate. 
> The limited stuff that does exist is also trivial to integrate with many > pieces of existing monitoring infrastructure (like Nagios or monit), and > therefore the people who care about it a lot (like me) are either > monitoring by hand, or are just using the tools with their existing > infrastructure (for example, I use monit already on all my systems, so I > just make sure to have entries in the config for that to check error > counters and scrub results), so there's not much in the way of incentive > for the concerned parties to reinvent the wheel. To counter, I think this is a big problem with btrfs, especially in terms of user attrition. We don't need "GUI" tools. At all. But we do need btrfs to be self-sufficient enough that regular users don't get burnt by what they would view as unexpected behaviour. We currently have a situation where btrfs is too demanding of inexperienced users. I feel we need better worst-case behaviours. For example, if *I* have a btrfs on its second-to-last-available chunk, it means I'm not micro-managing properly. But users shouldn't have to micro-manage in the first place. Btrfs (or a management tool) should just know to balance the least-used chunk and/or delete the lowest-priority snapshot, etc. It shouldn't cause my services/apps to give diskspace errors when, clearly, there is free space available. The other "high-level" aspect would be along the lines of better guidance and standardisation for distros on how best to configure btrfs. This would include guidance/best practices for things like appropriate subvolume mountpoints and snapshot paths, and sensible schedules or logic (or perhaps even example tools/scripts) for balancing and scrubbing the filesystem. I don't have all the answers. But I also don't want to have to tell people they can't adopt it because a) they don't (or never will) understand it; and b) they're going to resent me when they irresponsibly lose their own data. 
-- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97 ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 20:45 ` Brendan Hide @ 2017-08-03 22:00 ` Chris Murphy 2017-08-04 11:26 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 63+ messages in thread From: Chris Murphy @ 2017-08-03 22:00 UTC (permalink / raw) To: Brendan Hide; +Cc: Austin S. Hemmelgarn, Christoph Anton Mitterer, Btrfs BTRFS On Thu, Aug 3, 2017 at 2:45 PM, Brendan Hide <brendan@swiftspirit.co.za> wrote: > > To counter, I think this is a big problem with btrfs, especially in terms of > user attrition. We don't need "GUI" tools. At all. But we do need that btrfs > is self-sufficient enough that regular users don't get burnt by what they > would view as unexpected behaviour. We have currently a situation where > btrfs is too demanding on inexperienced users. I think the top complaint is the manual nature of balancing to avoid enospc when there's free space, followed by balancing needed to avoid/reduce free space fragmentation and thus maintain consistent performance. Obviously the kernel needs more intelligent code to free up partially full block groups, or more correctly to free up contiguous space for it to write to. That solves both problems. But in the meantime, btrfs-progs should ship a policy to do some minimal balance to totally obviate this and find the edge cases. Maybe it's dusage=3 every day. And dusage=10 once a week. And dusage=20 musage=20 once a month. I don't know, but some iteration in this area is better than saying PUNT! and putting it in the user's lap. Better would be a trigger that is statistics-based rather than time-based. The metric might be some combination of workload (i.e. idle) and ratios found in sysfs. Anyway, the first step is for people on this list to stop micromanaging their own volumes, and try to center on a sane one-size-fits-all solution. And then iterate better and better solutions as we determine the edge cases where one size doesn't fit all. 
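[Editorial aside: the time-based strawman schedule Chris sketches above could be expressed as a cron.d fragment like the following. The dusage/musage values are his illustrative guesses, not a tested policy, and the mount point /data is a placeholder:]

```shell
# /etc/cron.d/btrfs-balance -- sketch of the strawman schedule above.
# Filter values are illustrative only; tune per workload.
#
# Daily: repack block groups that are <=3% used.
30 3 * * *   root   btrfs balance start -dusage=3 /data
# Weekly: also catch data block groups up to 10% used.
45 3 * * 0   root   btrfs balance start -dusage=10 /data
# Monthly: data and metadata block groups up to 20% used.
0 4 1 * *    root   btrfs balance start -dusage=20 -musage=20 /data
```

[A usage filter like `-dusage=N` only rewrites block groups at or below N% utilisation, so these runs stay cheap compared to a full balance.]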
We're throwing hammers at the problem by default because it's a learned behavior. We all need to just stop balancing and act like regular users. And then figure out how to automatically optimize. > > I feel we need better worst-case behaviours. For example, if *I* have a > btrfs on its second-to-last-available chunk, it means I'm not micro-managing > properly. But users shouldn't have to micro-manage in the first place. Btrfs > (or a management tool) should just know to balance the least-used chunk > and/or delete the lowest-priority snapshot, etc. It shouldn't cause my > services/apps to give diskspace errors when, clearly, there is free space > available. Ideally the kernel code needs to do a better job freeing up partial block groups. But in the meantime, this can be set as an optimization policy in user space. And it should be in btrfs-progs so it's consistent across distros. SUSE has a distro-specific balancer, on a systemd timer, but I don't think it's enabled by default and I also think it's weirdly too aggressive. If it could be made smarter, with a trigger other than a timer, that'd be even better. But doing nothing isn't an option: the manual balancing needed to maintain performance has been one of the most consistently negative user responses about Btrfs. > The other "high-level" aspect would be along the lines of better guidance > and standardisation for distros on how best to configure btrfs. This would > include guidance/best practices for things like appropriate subvolume > mountpoints and snapshot paths, sensible schedules or logic (or perhaps even > example tools/scripts) for balancing and scrubbing the filesystem. Would they listen? My experience with openSUSE is, nope. > I don't have all the answers. But I also don't want to have to tell people > they can't adopt it because a) they don't (or never will) understand it; and > b) they're going to resent me for their irresponsibly losing their own data. Sure. 
You can read on the linux-raid@ list where there are still constant problems with users doing crazy things like mdadm --create to fix a raid assembly problem, obliterating their data by doing this and then getting pissed. It's like, where the hell do people keep getting the idea of doing that? There are six ways from Sunday of fixing a Btrfs volume. It reads like a choose your own adventure book. No, actually it's worse, because at least the choose your own adventure book tells you what page to go to next, and Btrfs gives you zero advice on what order to try things in. -- Chris Murphy ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 20:45 ` Brendan Hide 2017-08-03 22:00 ` Chris Murphy @ 2017-08-04 11:26 ` Austin S. Hemmelgarn 1 sibling, 0 replies; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-04 11:26 UTC (permalink / raw) To: Brendan Hide, Christoph Anton Mitterer, linux-btrfs On 2017-08-03 16:45, Brendan Hide wrote: > > > On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote: >> On 2017-08-03 14:29, Christoph Anton Mitterer wrote: >>> On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote: >>> There are no higher-level management tools (e.g. RAID >>> management/monitoring, etc.)... > [snip] > >> As far as 'higher-level' management tools, you're using your system >> wrong if you _need_ them. There is no need for there to be a GUI, or >> a web interface, or a DBus interface, or any other such bloat in the >> main management tools, they work just fine as is and are mostly on par >> with the interfaces provided by LVM, MD, and ZFS (other than the lack >> of machine parseable output). I'd also argue that if you can't >> reassemble your storage stack by hand without using 'higher-level' >> tools, you should not be using that storage stack as you don't >> properly understand it. >> >> On the subject of monitoring specifically, part of the issue there is >> kernel side, any monitoring system currently needs to be >> polling-based, not event-based, and as a result monitoring tends to be >> a very system specific affair based on how much overhead you're >> willing to tolerate. 
The limited stuff that does exist is also trivial >> to integrate with many pieces of existing monitoring infrastructure >> (like Nagios or monit), and therefore the people who care about it a >> lot (like me) are either monitoring by hand, or are just using the >> tools with their existing infrastructure (for example, I use monit >> already on all my systems, so I just make sure to have entries in the >> config for that to check error counters and scrub results), so there's >> not much in the way of incentive for the concerned parties to reinvent >> the wheel. > > To counter, I think this is a big problem with btrfs, especially in > terms of user attrition. We don't need "GUI" tools. At all. But we do > need that btrfs is self-sufficient enough that regular users don't get > burnt by what they would view as unexpected behaviour. We have > currently a situation where btrfs is too demanding on inexperienced users. > > I feel we need better worst-case behaviours. For example, if *I* have a > btrfs on its second-to-last-available chunk, it means I'm not > micro-managing properly. But users shouldn't have to micro-manage in the > first place. Btrfs (or a management tool) should just know to balance > the least-used chunk and/or delete the lowest-priority snapshot, etc. It > shouldn't cause my services/apps to give diskspace errors when, clearly, > there is free space available. That's not just an issue with BTRFS, it's an issue with the distros too. The only one that ships any kind of scheduled regular maintenance as far as I know is SUSE. We don't need some daemon, or even special handling in the kernel, we just need to provide people with standard maintenance tools, and proper advice for monitoring. I've been meaning to write up some wrappers and a couple of cron files to handle this a bit better, but just haven't had time. I may look at getting that done either today or early next week. 
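[Editorial aside: for anyone wiring the error-counter check Austin describes into monit or Nagios by hand, the check itself can be a tiny filter over `btrfs device stats` output. The function name and wrapper are ours, not an existing btrfs-progs tool; it reads stdin so the parsing can be exercised with canned input:]

```shell
# Sketch of a monitoring check over `btrfs device stats` output.
# btrfs_stats_ok is a hypothetical helper name, not a btrfs-progs tool.
# Returns 0 if every error counter is zero, 1 otherwise.
btrfs_stats_ok() {
    # stats lines look like: [/dev/sda1].write_io_errs   0
    # (counters: write_io/read_io/flush_io/corruption/generation _errs)
    awk '/_errs[ \t]/ { if ($2 + 0 != 0) bad = 1 } END { exit bad ? 1 : 0 }'
}

# Intended usage (requires a btrfs mount; this line is illustrative):
#   btrfs device stats /mnt | btrfs_stats_ok || echo "btrfs errors on /mnt"
```

[A monit or Nagios entry would then just invoke the wrapper and alert on a non-zero exit status, which is roughly the integration Austin describes doing by hand.]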
> > The other "high-level" aspect would be along the lines of better > guidance and standardisation for distros on how best to configure btrfs. > This would include guidance/best practices for things like appropriate > subvolume mountpoints and snapshot paths, sensible schedules or logic > (or perhaps even example tools/scripts) for balancing and scrubbing the > filesystem. There are currently three standards for this: 1. The snapper way, used by at least SUSE and Ubuntu, which IMO ends up being way too complicated for not much benefit. 2. The traditional filesystem way, used by most other distros, which doesn't use subvolumes at all. 3. The user choice way, used by stuff like Arch and Gentoo, which pretty much says the rest of the OS could care less how the filesystems and subvolumes are organized, as long as things work. Overall, other than the first one, this is no different than with regular filesystems. > > I don't have all the answers. But I also don't want to have to tell > people they can't adopt it because a) they don't (or never will) > understand it; and b) they're going to resent me for their irresponsibly > losing their own data. > ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 18:08 ` waxhead 2017-08-03 18:29 ` Christoph Anton Mitterer @ 2017-08-03 19:03 ` Austin S. Hemmelgarn 2017-08-04 9:48 ` Duncan 2017-08-16 18:07 ` David Sterba 2 siblings, 1 reply; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-03 19:03 UTC (permalink / raw) To: waxhead, Brendan Hide, linux-btrfs On 2017-08-03 14:08, waxhead wrote: > Brendan Hide wrote: >> The title seems alarmist to me - and I suspect it is going to be >> misconstrued. :-/ >> >> From the release notes at >> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html >> >> >> "Btrfs has been deprecated >> >> The Btrfs file system has been in Technology Preview state since the >> initial release of Red Hat Enterprise Linux 6. Red Hat will not be >> moving Btrfs to a fully supported feature and it will be removed in a >> future major release of Red Hat Enterprise Linux. >> >> The Btrfs file system did receive numerous updates from the upstream >> in Red Hat Enterprise Linux 7.4 and will remain available in the Red >> Hat Enterprise Linux 7 series. However, this is the last planned >> update to this feature. >> >> Red Hat will continue to invest in future technologies to address the >> use cases of our customers, specifically those related to snapshots, >> compression, NVRAM, and ease of use. We encourage feedback through >> your Red Hat representative on features and requirements you have for >> file systems and storage technology." >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > First of all I am not a BTRFS dev, but I use it for various projects and > have high hopes for what it can become. 
> > Now, the fact that Red Hat deprecates BTRFS does not mean that BTRFS is > deprecated. It is not removed from the kernel and so far BTRFS offers > features that other filesystems don't have. ZFS is something that people > brag about all the time as a viable alternative, but for me it seems to > be a pain to manage properly. E.g. grow, add/remove devices, shrink > etc... good luck doing that right! > > BTRFS's biggest problem is not that there are some bits and pieces that > are thoroughly screwed up (raid5/6 (which just got some fixes by the > way)), but the fact that the documentation is rather dated. > > There is a simple status page here > https://btrfs.wiki.kernel.org/index.php/Status > > As others have pointed out already, the explanations on the status page > are not exactly good. For example compression (that was also mentioned) > is as of writing this marked as 'Mostly ok' '(needs verification and > source) - auto repair and compression may crash' > > Now, I am aware that many use compression without trouble. I am not sure > how many use compression on disks with issues and don't have trouble, > but I would at least expect to see more people yelling on the mailing > list if that were the case. The problem here is that this message is > rather scary and certainly does NOT sound like 'mostly ok' to most people. > > What exactly needs verification and a source? The 'mostly ok' statement or > something else?! A more detailed explanation would be required here to > avoid scaring people away. Not certain what was meant here, but there were (a while back) some known issues with compressed extents; I thought those had been fixed. > > Same thing with the trim feature that is marked OK. It clearly says > that it has performance implications. It is marked OK so one would > expect it to not cause the filesystem to fail, but if the performance > becomes so slow that the filesystem gets practically unusable it is of > course not "OK". 
The relevant information is missing for people to make > a decent choice and I certainly don't know how serious these performance > implications are, if they are at all relevant... The performance implications bit shouldn't be listed; that's a given for any filesystem with discard (TRIM is the ATA and eMMC command, UNMAP is the SCSI one, and ERASE is the name on SD cards; discard is the generic kernel term) support. The issue arises from devices that don't have support for queuing such commands, which is quite rare for SSDs these days. > > Most people interested in BTRFS are probably a bit more paranoid and > concerned about their data than the average computer user. What people > tend to forget is that other filesystems have NO redundancy, > auto-repair, or other fancy features that BTRFS has. So for the > compression example above... if you run compressed files on ext4 and > your disk gets some corruption you are in no better a state than > you would be with btrfs (in fact probably worse). Also nothing is > stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means you > should be VERY safe. > > Simple documentation is the key so HERE ARE MY DEMANDS!!!..... ehhh.... > so here is what I think should be done: > > 1. The documentation needs to either be improved (or old non-relevant > stuff simply removed / archived somewhere) > 2. The status page MUST always be up to date for the latest kernel > release (It's ok so far, let's hope nobody sleeps here) > 3. Proper explanations must be given so the layman and reasonably > technical people understand the risks / issues for non-ok stuff. > 4. There should be links to roadmaps for each feature on the status page > that clearly state what is being worked on for the NEXT kernel release I entirely agree with all of this, but there is a severe lack of people willing to maintain it (I for example do not have the patience to maintain it, let alone the time). 
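[Editorial aside on the discard point above: the usual way to get trim without the per-operation discard cost (or running into queued-trim firmware quirks, as discussed below) is periodic batched trimming rather than the `discard` mount option. A sketch using the stock util-linux tooling:]

```shell
# Periodic trim instead of 'mount -o discard': fstrim batches the
# discards and runs them on a schedule, sidestepping per-I/O overhead.
# The fstrim.timer unit ships with util-linux on systemd distros.
systemctl enable --now fstrim.timer

# Or run it manually / from cron against a specific btrfs mount
# (the mount point is a placeholder):
fstrim -v /mnt/data
```

[With this approach the filesystem never issues queued trim commands in the hot path at all, which is why it is commonly recommended over the mount option.]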
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 19:03 ` Austin S. Hemmelgarn @ 2017-08-04 9:48 ` Duncan 0 siblings, 0 replies; 63+ messages in thread From: Duncan @ 2017-08-04 9:48 UTC (permalink / raw) To: linux-btrfs Austin S. Hemmelgarn posted on Thu, 03 Aug 2017 15:03:53 -0400 as excerpted: >> Same thing with the trim feature that is marked OK. It clearly says >> that it has performance implications. It is marked OK so one would >> expect it to not cause the filesystem to fail, but if the performance >> becomes so slow that the filesystem gets practically unusable it is of >> course not "OK". The relevant information is missing for people to make >> a decent choice and I certainly don't know how serious these >> performance implications are, if they are at all relevant... > The performance implications bit shouldn't be listed, that's a given for > any filesystem with discard (TRIM is the ATA and eMMC command, UNMAP is > the SCSI one, and ERASE is the name on SD cards, discard is the generic > kernel term) support. The issue arises from devices that don't have > support for queuing such commands, which is quite rare for SSDs these > days. Not so entirely rare. The generally well-regarded Samsung EVO/Pro 850 ssd series don't support queued-trim, and indeed, due to a fiasco where new firmware lied about such support[1], the kernel now blacklists queued-trim on all samsung ssds. (I actually bought a pair of samsung evo 1TB ssds after seeing them well recommended both on this list and in various reviews. Only AFTER I had them and was wondering if I could now add discard to my btrfs mount options and therefore googling for samsung evo queued trim specifically, did I find out about this fiasco and samsung not supporting linux because anyone can write the code, or I'd have certainly reconsidered and would have very likely spent my money elsewhere. 
I did actually check the current kernel's blacklisting code and verified it. I also noted it whitelists samsung ssds for actually honoring flush directives, where the code treats non-whitelisted ssds as not honoring them, apparently because too many claim to do so while not actually doing so, to get better performance. So it's a mixed bag: one whitelisting for actually flushing when it claims to, one blacklisting for not reliably handling queued-trim despite some firmware claiming to do so. But the worst IMO is samsung support blackballing linux because anyone can write the code. =:^ That's worth blackballing samsung for, in my book; I just wish I'd found out before the purchase instead of after, tho the linux devs have at least made sure samsung ssd users don't lose data on linux due to samsung's lies, despite samsung's horrible support policy blackballing linux, at least at the time.) --- [1] The firmware said it supported a new ata standard where it's apparently mandatory, but the result was repeatedly corrupted data, with samsung support repeatedly saying they don't support Linux because anyone can write code to execute. They weren't seeing the problem on MS yet simply because MS hadn't issued a release that supported the new standard, and had queued-trim disabled by default with the older standards due to such problems when it was enabled. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-03 18:08 ` waxhead 2017-08-03 18:29 ` Christoph Anton Mitterer 2017-08-03 19:03 ` Austin S. Hemmelgarn @ 2017-08-16 18:07 ` David Sterba 2 siblings, 0 replies; 63+ messages in thread From: David Sterba @ 2017-08-16 18:07 UTC (permalink / raw) To: waxhead; +Cc: Brendan Hide, linux-btrfs On Thu, Aug 03, 2017 at 08:08:59PM +0200, waxhead wrote: > BTRFS's biggest problem is not that there are some bits and pieces that > are thoroughly screwed up (raid5/6 (which just got some fixes by the > way)), but the fact that the documentation is rather dated. > > There is a simple status page here > https://btrfs.wiki.kernel.org/index.php/Status > > As others have pointed out already, the explanations on the status page > are not exactly good. For example compression (that was also mentioned) > is as of writing this marked as 'Mostly ok' '(needs verification and > source) - auto repair and compression may crash' > > Now, I am aware that many use compression without trouble. I am not sure > how many use compression on disks with issues and don't have trouble, > but I would at least expect to see more people yelling on the mailing > list if that were the case. The problem here is that this message is > rather scary and certainly does NOT sound like 'mostly ok' to most people. > > What exactly needs verification and a source? The 'mostly ok' statement or > something else?! A more detailed explanation would be required here to > avoid scaring people away. > > Same thing with the trim feature that is marked OK. It clearly says > that it has performance implications. It is marked OK so one would > expect it to not cause the filesystem to fail, but if the performance > becomes so slow that the filesystem gets practically unusable it is of > course not "OK". 
The relevant information is missing for people to make > a decent choice and I certainly don't know how serious these performance > implications are, if they are at all relevant... I'll try to restructure the page so it reflects the status of the features from more aspects, like overall/performance/"known bad scenarios". The in-row notes are probably a bad idea as they are short on details; the section under the table will be better for that. > Most people interested in BTRFS are probably a bit more paranoid and > concerned about their data than the average computer user. What people > tend to forget is that other filesystems have NO redundancy, > auto-repair, or other fancy features that BTRFS has. So for the > compression example above... if you run compressed files on ext4 and > your disk gets some corruption you are in no better a state than > you would be with btrfs (in fact probably worse). Also nothing is > stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means you > should be VERY safe. > > Simple documentation is the key so HERE ARE MY DEMANDS!!!..... ehhh.... > so here is what I think should be done: > > 1. The documentation needs to either be improved (or old non-relevant > stuff simply removed / archived somewhere) Agreed, this happens from time to time. > 2. The status page MUST always be up to date for the latest kernel > release (It's ok so far, let's hope nobody sleeps here) I'm watching over the page. It's been locked from edits so there's a mandatory review of the new contents; the update process is documented on the page. > 3. Proper explanations must be given so the layman and reasonably > technical people understand the risks / issues for non-ok stuff. This can be hard; the audience includes both technical and non-technical users. The page is supposed to give a quick overview; the more detailed information is either in the notes or on separate pages linked from there. 
I believe this structure should be able to cover what you need, but the actual contents haven't been written and there are not enough people willing/capable of writing it. > 4. There should be links to roadmaps for each feature on the status page > that clearly state what is being worked on for the NEXT kernel release We've tried something like that in the past; the page got out of sync with reality over time and was deleted. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-02 8:38 RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? Brendan Hide ` (4 preceding siblings ...) 2017-08-03 18:08 ` waxhead @ 2017-08-04 14:05 ` Qu Wenruo 2017-08-04 23:55 ` Wang Shilong 2017-08-07 15:27 ` Chris Murphy 5 siblings, 2 replies; 63+ messages in thread From: Qu Wenruo @ 2017-08-04 14:05 UTC (permalink / raw) To: Brendan Hide, linux-btrfs On 2017年08月02日 16:38, Brendan Hide wrote: > The title seems alarmist to me - and I suspect it is going to be > misconstrued. :-/ > > From the release notes at > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html > > > "Btrfs has been deprecated > > The Btrfs file system has been in Technology Preview state since the > initial release of Red Hat Enterprise Linux 6. Red Hat will not be > moving Btrfs to a fully supported feature and it will be removed in a > future major release of Red Hat Enterprise Linux. > > The Btrfs file system did receive numerous updates from the upstream in > Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat > Enterprise Linux 7 series. However, this is the last planned update to > this feature. > > Red Hat will continue to invest in future technologies to address the > use cases of our customers, specifically those related to snapshots, > compression, NVRAM, and ease of use. We encourage feedback through your > Red Hat representative on features and requirements you have for file > systems and storage technology." Personally speaking, unlike most of the btrfs supporters, I think Red Hat is doing the correct thing for their enterprise use case. 
(To clarify, I'm not going to Red Hat, just in case anyone wonders why I'm not supporting btrfs) [Good things of btrfs] Btrfs is indeed a technical pioneer in a lot of aspects (at least in the linux world): 1) Metadata CoW instead of a traditional journal 2) Snapshot and delta-backup I think this is the killer feature of Btrfs, and why SUSE is using it for the root fs. 3) Default data CoW 4) Data checksum and scrubbing 5) Multi-device management 6) Online resize/balancing And a lot more. [Bad things of btrfs] But for enterprise usage, it's too advanced and has several problems preventing it from being widely applied: 1) Low performance from metadata/data CoW This is a somewhat complicated dilemma. Although Btrfs can disable data CoW, nodatacow also disables data checksums, which are another main feature of btrfs. So Btrfs can't default to nodatacow, unlike XFS. And metadata CoW causes extra metadata writes along with superblock updates (FUA), further degrading performance. Such a pioneering design makes traditional performance-intensive use cases very unhappy. Especially almost all kinds of databases. (Note that nodatacow can't always solve the performance problem.) Most performance-intensive usage is still based on traditional fs design (journal with no CoW). 2) Low concurrency caused by tree design. Unlike the traditional one-tree-for-one-inode design, btrfs uses one-tree-for-one-subvolume. The design makes snapshot implementation very easy, while making the tree very hot when a lot of modifiers are trying to modify any metadata. Btrfs has a lot of different ways to mitigate this. For the extent tree (the busiest tree), we are using delayed-refs to speed up extent tree updates. For fs tree fsync, we have the log tree to speed things up. These approaches work, at the cost of complexity and bugs, and we still have slow fs tree modification speed. 3) Low code reuse of device-mapper. 
I totally understand that: due to the unique support for data csums, btrfs can't use device-mapper directly, as we must verify the data read from the device before passing it to the higher level. So Btrfs uses its own device-mapper-like implementation to handle multi-device management. The result is mixed. For easy-to-handle cases like RAID0/1/10, btrfs is doing well. While for RAID5/6, everyone knows the result. Such a btrfs *enhanced* re-implementation not only makes btrfs larger but also more complex and bug-prone. In short, btrfs is too advanced for generic use cases (performance) and developers (bugs), unfortunately. And even SUSE is just pushing btrfs as the root fs, mainly for the snapshot feature. But still ext4/xfs for data or performance-intensive use cases. [Other solution on the table] On the other hand, I think RedHat is pushing storage technology based on LVM (thin) and XFS. Traditional LVM is stable, but its snapshot design is old-fashioned and low-performance. The new thin-provisioned LVM solves the problem using a method much like Btrfs's, but at the block level. And XFS is still traditionally designed, journal-based, one-tree-for-one-inode. But with fancy new features like data CoW. Even though XFS + LVM-thin lacks the ability to shrink the fs, scrub data, or do delta backups, it can do a lot of things just like Btrfs. From snapshots to multi-device management. And more importantly, it has better performance for things like DBs. So, for old use cases, the performance stays almost the same. For developers, people are still focusing on their old fields, with less to worry about and more focused debugging. The old UNIX method still works here: do one thing and do it well. It provides some of the fancy features of btrfs, but not too fancy. It's a compromise, but a good move for enterprise usage. 
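[Editorial aside: the LVM-thin snapshot workflow referred to above looks roughly like this. Device, volume group, and size values are placeholders, and this is only a sketch of the stack, not a recommended configuration:]

```shell
# Rough sketch of the LVM-thin + XFS stack discussed above
# (device and volume names are placeholders).
pvcreate /dev/sdb
vgcreate vg0 /dev/sdb
lvcreate --type thin-pool -L 100G -n pool0 vg0    # thin pool
lvcreate --thin -V 200G -n data vg0/pool0         # overprovisioned thin LV
mkfs.xfs /dev/vg0/data

# Snapshots are copy-on-write at the block level -- like btrfs,
# but below the filesystem rather than inside it:
lvcreate --snapshot -n data-snap vg0/data
```

[The trade-off is visible right in the commands: snapshots and overprovisioning come from the block layer, so the filesystem above knows nothing about them and cannot offer checksum-verified self-healing or send/receive-style delta backup.]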
[The future] When btrfs is almost as good as traditional solutions for both performance and stability, I think it will be widely applied no matter whether RedHat uses it or not, especially since btrfs still has features which LVM-thin + XFS can't provide. But the future is still full of challenges. 1) Complexity of btrfs makes development slow. Developers are already doing their work well, but the line count is about twice that of a traditional fs. 2) New device-mapper-based solutions may come out fast. Dm-thin is already here, and I won't be surprised if one day there are hooks/an API for device-mapper to communicate with higher levels. For example, if one day there is some dm-csum to verify csums of given ranges (and skip unrelated ones specified by higher levels), btrfs support for data csums is no longer an exclusive feature. Thanks, Qu ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-04 14:05 ` Qu Wenruo @ 2017-08-04 23:55 ` Wang Shilong 2017-08-07 15:27 ` Chris Murphy 1 sibling, 0 replies; 63+ messages in thread From: Wang Shilong @ 2017-08-04 23:55 UTC (permalink / raw) To: Qu Wenruo; +Cc: Brendan Hide, linux-btrfs Hi Qu, On Fri, Aug 4, 2017 at 10:05 PM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > > > On 2017年08月02日 16:38, Brendan Hide wrote: >> >> The title seems alarmist to me - and I suspect it is going to be >> misconstrued. :-/ >> >> From the release notes at >> https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html >> >> "Btrfs has been deprecated >> >> The Btrfs file system has been in Technology Preview state since the >> initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving >> Btrfs to a fully supported feature and it will be removed in a future major >> release of Red Hat Enterprise Linux. >> >> The Btrfs file system did receive numerous updates from the upstream in >> Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat >> Enterprise Linux 7 series. However, this is the last planned update to this >> feature. >> >> Red Hat will continue to invest in future technologies to address the use >> cases of our customers, specifically those related to snapshots, >> compression, NVRAM, and ease of use. We encourage feedback through your Red >> Hat representative on features and requirements you have for file systems >> and storage technology." > > > Personally speaking, unlike most of the btrfs supporters, I think Red Hat is > doing the correct thing for their enterprise use case. 
>
> (To clarify, I'm not going to Red Hat, just in case anyone wonders why
> I'm not supporting btrfs.)
>
> [Good things about btrfs]
> Btrfs is indeed a technical pioneer in a lot of aspects (at least in the
> Linux world):
>
> 1) Metadata CoW instead of a traditional journal
> 2) Snapshots and delta backup
>    I think this is the killer feature of btrfs, and why SUSE is using it
>    for the root fs.
> 3) Default data CoW
> 4) Data checksums and scrubbing
> 5) Multi-device management
> 6) Online resize/balancing
> And a lot more.
>
> [Bad things about btrfs]
> But for enterprise usage it's too advanced, and it has several problems
> preventing it from being widely adopted:
>
> 1) Low performance from metadata/data CoW
>    This is a somewhat complicated dilemma.
>    Although btrfs can disable data CoW, nodatacow also disables data
>    checksums, which are another main feature of btrfs.
>    So btrfs can't default to nodatacow, unlike XFS.
>
>    And metadata CoW causes extra metadata writes along with the
>    superblock update (FUA), further degrading performance.
>
>    Such a pioneering design makes traditional performance-intensive use
>    cases very unhappy, especially almost all kinds of databases. (Note
>    that nodatacow can't always solve the performance problem.)
>    Most performance-intensive usage is still built on traditional fs
>    design (journaling with no CoW).
>
> 2) Low concurrency caused by the tree design
>    Unlike the traditional one-tree-per-inode design, btrfs uses
>    one-tree-per-subvolume.
>    This design makes snapshot implementation very easy, while making the
>    tree very hot when a lot of modifiers are trying to change any
>    metadata.
>
>    Btrfs has a lot of different ways to address this.
>    For the extent tree (the busiest tree), we are using delayed refs to
>    speed up extent tree updates.
>    For fs tree fsync, we have the log tree to speed things up.
>    These approaches work, at the cost of complexity and bugs, and fs
>    tree modification is still slow.
>
> 3) Low code reuse of device-mapper
>    I totally understand that, due to the unique support for data csums,
>    btrfs can't use device-mapper directly, as we must verify the data
>    read from the device before passing it to higher levels.
>    So btrfs uses its own device-mapper-like implementation to handle
>    multi-device management.
>
>    The result is mixed. For easy-to-handle cases like RAID0/1/10, btrfs
>    does well. For RAID5/6, everyone knows the result.
>
>    Such a btrfs *enhanced* re-implementation not only makes btrfs larger
>    but also more complex and bug-prone.
>
> In short, btrfs is too advanced for generic use cases (performance) and
> for developers (bugs), unfortunately.
>
> And even SUSE is just pushing btrfs as the root fs, mainly for the
> snapshot feature, while still using ext4/xfs for data or
> performance-intensive use cases.
>
> [Other solutions on the table]
> On the other hand, I think Red Hat is pushing storage technology based
> on LVM (thin) and XFS.
>
> Traditional LVM is stable, but its snapshot design is old-fashioned and
> low-performance.
> New thin-provisioned LVM solves that problem using a method much like
> btrfs's, but at the block level.
>
> And XFS is still traditionally designed, journal-based,
> one-tree-per-inode, but with fancy new features like data CoW.
>
> Even though XFS + LVM-thin lacks the ability to shrink the fs, scrub
> data, or do delta backup, it can do a lot of things just like btrfs,
> from snapshots to multi-device management.
>
> And more importantly, it has better performance for things like
> databases.
>
> So for old use cases the performance stays almost the same, and
> developers keep focusing on their old fields, with less to worry about
> and a tighter focus when debugging. The old UNIX method still works
> here: do one thing and do it well.
>
> It provides some of the fancy features of btrfs, but not too fancy.
> It's a compromise, but a good move for enterprise usage.
>
> [The future]
> When btrfs is almost as good as traditional solutions for both
> performance and stability, I think it will be widely applied no matter
> whether RedHat uses it or not, especially since btrfs still has features
> which LVM-thin + XFS can't provide.
>
> But the future is still full of challenges.
> 1) Complexity of btrfs makes development slow.
>    Developers are already doing their work well, but the numbers of
>    lines are twice of traditional fs.
>
> 2) New device-mapper based solution may come out fast
>    Dm-thin is already here, and I won't be surprised that one day there
>    will be hooks/API for device-mapper to communicate with higher levels.
>
>    For example, if one day there is some dm-csum to support verify csum
>    of given ranges (and skip unrelated ones specified by higher levels),
>    btrfs support for data csum is no longer an exclusive feature.

Fair enough, and a good conclusion. I think most of the reasons come
down to btrfs stability, and Red Hat does not have good btrfs developers
like you.

Looking ahead, I think the most difficult thing for btrfs is
performance: btrfs does not scale as the CPU count increases, which is
bad for metadata-heavy loads and even for small random read/write I/O.

Thanks,
Shilong

>
> Thanks,
> Qu
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-04 14:05 ` Qu Wenruo
  2017-08-04 23:55 ` Wang Shilong
@ 2017-08-07 15:27 ` Chris Murphy
  2017-08-10  0:35 ` Qu Wenruo
  1 sibling, 1 reply; 63+ messages in thread
From: Chris Murphy @ 2017-08-07 15:27 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Brendan Hide, Btrfs BTRFS

On Fri, Aug 4, 2017 at 8:05 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> For example, if one day there is some dm-csum to support verify csum of
> given ranges (and skip unrelated ones specified by higher levels), btrfs
> support for data csum is no longer an exclusive feature.

How would dm-csum differ from dm-integrity?
https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt

By that description it uses a journal to guarantee atomicity. With
multiqueue, maybe the performance implications are neutral. But
certainly on spinning drives that would slow things down, especially if
the file system is also journaling and the workload is metadata heavy.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-07 15:27 ` Chris Murphy
@ 2017-08-10  0:35 ` Qu Wenruo
  2017-08-12  0:10 ` Christoph Anton Mitterer
  0 siblings, 1 reply; 63+ messages in thread
From: Qu Wenruo @ 2017-08-10 0:35 UTC (permalink / raw)
To: Chris Murphy; +Cc: Brendan Hide, Btrfs BTRFS

On 2017年08月07日 23:27, Chris Murphy wrote:
> On Fri, Aug 4, 2017 at 8:05 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>> For example, if one day there is some dm-csum to support verify csum of
>> given ranges (and skip unrelated ones specified by higher levels), btrfs
>> support for data csum is no longer an exclusive feature.
>
> How would dm-csum differ from dm-integrity?
> https://www.kernel.org/doc/Documentation/device-mapper/dm-integrity.txt

Well, pretty much the same as what I want, although my idea is to do an
n-way buffered csum update (n=2 should be the most common case).

That is to say, for the CRC32 (4 bytes) of a 4K write, the csum space
will be reserved as 4 bytes * n.
Even if a crash happens, one can check all the csum slots to see whether
the data is either old or new.

That's just a degraded journal anyway, and it may still lose data on
power loss if the data itself was only updated half way.

> By that description it uses a journal to guarantee atomicity. If
> multiqueue maybe the performance implications are neutral. But
> certainly on spinning drives that would slow things down, especially
> if the file system is also journaling, and the workload is metadata
> heavy.

That's what btrfs is good at: better co-operation between the different
layers. But this doesn't mean a traditional dm solution can't find its
own way.

[No double csum for btrfs metadata]
Btrfs will not calculate csums twice for metadata, which has its own
csum in its header. And nodatacow data does not trigger csum calculation
either.

But if we had extra bio flag bits to coordinate the fs and the dm/block
driver, this could still be solved.
(Maybe even easily.)

For example, if there were an extra bio flag to tell dm-integrity, or
any supporting block device driver, not to calculate a csum for the
specified bio, then we could avoid such useless double csums for
metadata or nodatacow writes.

[A good solution for data CoW and csums]
Besides that possible performance improvement, btrfs also solves the
problem of asynchronous data and csum writes by disabling csums
completely for nocow contents, so there is no need to journal csum
writes and data. (Journaling data is super slow.)

However, nowadays filesystems like XFS also have their own extent
backref trees to know whether a given write is a new (or CoWed) write or
a rewrite.

So, following the method above, if we had another flag to inform
dm-integrity whether a given bio is a rewrite, it could be handled much
better.

For example, if a bio is rewriting data and dm-integrity is configured
for better performance, just let dm-integrity mark that bio range as
nocsum and ignore the existing csum. That way dm-integrity can avoid
most of its data and csum journaling. (New or CoWed writes won't need to
be journaled, just as in btrfs.)

In short, there is always a method to do more or less the same things
btrfs can do. So I will not be surprised if one day there is a solution
that does everything current btrfs can do, with a robust code base and
fewer modifications to the current kernel.

Thanks,
Qu

^ permalink raw reply	[flat|nested] 63+ messages in thread
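Qu's n-way buffered csum scheme above (n = 2 as the common case) can be sketched outside the kernel. This is an illustrative Python model only, not btrfs or device-mapper code; the `Block` class and slot layout are invented for the example:

```python
import zlib

def csum(data: bytes) -> int:
    return zlib.crc32(data)

class Block:
    """A 'disk block' with two checksum slots (the n = 2 case)."""

    def __init__(self, data: bytes):
        self.data = data
        self.slots = [csum(data), None]  # slot 0 holds the current csum

    def update(self, new_data: bytes, crash_after_csum: bool = False):
        # Step 1: write the new csum into the spare slot.
        self.slots[1] = csum(new_data)
        if crash_after_csum:
            return  # simulated power loss before the data write lands
        # Step 2: overwrite the data in place.
        self.data = new_data

    def verify(self) -> bool:
        # After a crash the data is acceptable if it matches *either* slot,
        # i.e. it is a clean old copy or a clean new copy.
        return csum(self.data) in [s for s in self.slots if s is not None]
```

A torn write, where the on-disk data matches neither slot, still fails verification; that is the residual data-loss case Qu notes for data updated only half way.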
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-10  0:35 ` Qu Wenruo
@ 2017-08-12  0:10 ` Christoph Anton Mitterer
  2017-08-12  7:42 ` Christoph Hellwig
  0 siblings, 1 reply; 63+ messages in thread
From: Christoph Anton Mitterer @ 2017-08-12 0:10 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1996 bytes --]

Qu Wenruo wrote:
> Although Btrfs can disable data CoW, nodatacow also disables data
> checksum, which is another main feature for btrfs.

Then the two should probably be decoupled, and support for
nodatacow+checksumming implemented?!

I'm not an expert, but I don't see why this shouldn't be possible
(especially since metadata is AFAICT anyway *always* CoWed +
checksummed). Nearly a year ago I had some off-list mails exchanged with
CM, and AFAIU he said it would technically be possible...

What's the worst thing that can happen? IMO, that noCoWed data was
correctly written before a crash, but the checksum was not, so the
(stale) checksum would invalidate actually good data.

How likely is that compared to the other way round? I'd guess not so
much. And even then, it's IMO still better to have false positives
(which the higher application layers should handle anyway) than to not
notice silent data corruption at all.

Of course checksumming could impact performance, but one could still use
nodatacow+nochecksum (or any other fs) if one cares more about
performance than data integrity. But all those who focus on integrity
would get it, even in the nodatacow case.

IIRC, CM argued that some people would rather get the bad data than
nothing at all (i.e. EIO)... but for those, btrfs is probably a bad
choice anyway (at least in the normal non-nodatacow case)... also, any
application should properly deal with EIO...
And last but not least, one could still provide a special tool that,
after a crash (with possibly non-matching data/csums), allows a user to
find such cases and decide what to do. A user/admin who would rather
take the bad data and try forensic recovery could be given a tool like
btrfs csum --recompute-invalid-csums (or some better name), in which all
(or just some paths') csums are rewritten in case they don't match.

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-12 0:10 ` Christoph Anton Mitterer @ 2017-08-12 7:42 ` Christoph Hellwig 2017-08-12 11:51 ` Christoph Anton Mitterer 2017-08-14 6:36 ` Qu Wenruo 0 siblings, 2 replies; 63+ messages in thread From: Christoph Hellwig @ 2017-08-12 7:42 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Qu Wenruo, Btrfs BTRFS On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote: > Qu Wenruo wrote: > >Although Btrfs can disable data CoW, nodatacow also disables data > >checksum, which is another main feature for btrfs. > > Then decoupling of the two should probably decoupled and support for > notdatacow+checksumming be implemented?! And how are you going to write your data and checksum atomically when doing in-place updates? ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-12  7:42 ` Christoph Hellwig
@ 2017-08-12 11:51 ` Christoph Anton Mitterer
  2017-08-12 12:12 ` Hugo Mills
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Anton Mitterer @ 2017-08-12 11:51 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 1275 bytes --]

On Sat, 2017-08-12 at 00:42 -0700, Christoph Hellwig wrote:
> And how are you going to write your data and checksum atomically when
> doing in-place updates?

Maybe I misunderstand something, but what's the big deal with not doing
it atomically (I assume you mean in terms of actually writing to the
physical medium)? Isn't that already a problem anyway in case of a
crash? And isn't that also the case with all forms of e.g. software
RAID (when not having a journal)?

And as I've said, what's the worst thing that can happen? Either way,
the data would not have been completely written, with or without
checksumming. So what's the difference in attempting the checksumming
(and doing it successfully in all non-crash cases)?

My understanding was (but that may be wrong of course, I'm not a
filesystem expert at all) that the worst that can happen is that data
and csum aren't *both* fully written (in all possible combinations), so
we'd have four cases in total:

data=good csum=good => fine
data=bad  csum=bad  => doesn't matter whether csummed or not, atomic or not
data=bad  csum=good => the csum will tell us that the data is bad
data=good csum=bad  => the only real problem: the data is actually good,
                       but the csum is not

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread
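The four cases above can be enumerated mechanically. As an illustrative sketch (plain Python, with an invented helper name), model a nodatacow update as two independent writes, the data block and the csum, that a crash may separate:

```python
import itertools
import zlib

def crash_cases(old: bytes, new: bytes):
    """For each combination of 'data write landed' x 'csum write landed',
    report whether the stored csum still matches the stored data."""
    results = {}
    for data_landed, csum_landed in itertools.product([False, True], repeat=2):
        stored_data = new if data_landed else old
        stored_csum = zlib.crc32(new) if csum_landed else zlib.crc32(old)
        results[(data_landed, csum_landed)] = (zlib.crc32(stored_data) == stored_csum)
    return results
```

Only the "both landed" and "neither landed" cases verify; the two mixed cases read back as csum failures even though the data on disk is a clean old or a clean new copy.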
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-12 11:51 ` Christoph Anton Mitterer @ 2017-08-12 12:12 ` Hugo Mills 2017-08-13 14:08 ` Goffredo Baroncelli 0 siblings, 1 reply; 63+ messages in thread From: Hugo Mills @ 2017-08-12 12:12 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Christoph Hellwig, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 3069 bytes --] On Sat, Aug 12, 2017 at 01:51:46PM +0200, Christoph Anton Mitterer wrote: > On Sat, 2017-08-12 at 00:42 -0700, Christoph Hellwig wrote: > > And how are you going to write your data and checksum atomically when > > doing in-place updates? > > Maybe I misunderstand something, but what's the big deal with not doing > it atomically (I assume you mean in terms of actually writing to the > pyhsical medium)? Isn't that anyway already a problem in case of a > crash? With normal CoW operations, the atomicity is achieved by constructing a completely new metadata tree containing both changes (references to the data, and the csum metadata), and then atomically changing the superblock to point to the new tree, so it really is atomic. With nodatacow, that approach doesn't work, because the new data replaces the old on the physical medium, so you'd have to make the data write atomic with the superblock write -- which can't be done, because it's (at least) two distinct writes. > And isn't that the case also with all forms of e.g. software RAID (when > not having a journal)? > > And as I've said, what's the worst thing that can happen? Either the > data would not have been completely written - with or without > checksumming. Then what's the difference to try the checksumming (and > do it successfully in all non crash cases)? 
> My understanding was (but that may be wrong of course, I'm not a
> filesystem expert at all), that worst that can happen is that data an
> csum aren't *both* fully written (in all possible combinations), so
> we'd have four cases in total:
>
> data=good csum=good => fine
> data=bad  csum=bad  => doesn't matter whether csum or not and whether
>                        atomic or not
> data=bad  csum=good => the csum will tell us, that the data is bad
> data=good csum=bad  => the only real problem, data would be actually
>                        good, but csum is not

I don't think this is a particularly good description of the problem.
I'd say it's more like this:

If you write data and metadata separately (which you have to do in the
nodatacow case), and the system halts between the two writes, then you
either have the new data with the old csum, or the old data with the
new csum. Both data and csum are "good", but good from different states
of the FS. In both cases (data first or metadata first), the csum
doesn't match the data, and so you now have an I/O error reported when
trying to read that data.

You can't easily fix this, because when the data and csum don't match,
you need to know the _reason_ they don't match -- is it because the
machine was interrupted during the write (in which case you can fix
it), or is it because the hard disk has had someone write data to it
directly, and the data is now toast (in which case you shouldn't fix
the I/O error)?

Basically, nodatacow bypasses the very mechanisms that are meant to
provide consistency in the filesystem.

   Hugo.

-- 
Hugo Mills             | vi vi vi: the Editor of the Beast.
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread
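Hugo's description of CoW atomicity, building a complete new tree and then flipping a single superblock pointer, can be modeled in a few lines. Illustrative Python only; the class and field names are invented:

```python
import zlib

class CowToyFs:
    """Toy model: every write builds a new 'tree' holding data and csum
    together; commit is one pointer update, so readers always see a
    fully old or fully new state, never a mix."""

    def __init__(self, data: bytes):
        self.trees = {0: (data, zlib.crc32(data))}
        self.superblock = 0  # which tree is current

    def write(self, new_data: bytes, crash_before_commit: bool = False):
        tree_id = max(self.trees) + 1
        # Data reference and csum land in the same new tree.
        self.trees[tree_id] = (new_data, zlib.crc32(new_data))
        if crash_before_commit:
            return  # orphaned tree; superblock still points at the old state
        self.superblock = tree_id  # the single atomic step

    def read(self) -> bytes:
        data, stored = self.trees[self.superblock]
        assert zlib.crc32(data) == stored  # the visible tree is always consistent
        return data
```

With nodatacow there is no new tree: the data overwrite and the csum update become two separate in-place writes, which is exactly the window described above.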
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-12 12:12 ` Hugo Mills @ 2017-08-13 14:08 ` Goffredo Baroncelli 2017-08-14 7:08 ` Qu Wenruo 0 siblings, 1 reply; 63+ messages in thread From: Goffredo Baroncelli @ 2017-08-13 14:08 UTC (permalink / raw) To: Hugo Mills, Christoph Anton Mitterer, Christoph Hellwig, Btrfs BTRFS On 08/12/2017 02:12 PM, Hugo Mills wrote: > On Sat, Aug 12, 2017 at 01:51:46PM +0200, Christoph Anton Mitterer wrote: >> On Sat, 2017-08-12 at 00:42 -0700, Christoph Hellwig wrote: [...] >> good, but csum is not > > I don't think this is a particularly good description of the > problem. I'd say it's more like this: > > If you write data and metadata separately (which you have to do in > the nodatacow case), and the system halts between the two writes, then > you either have the new data with the old csum, or the old csum with > the new data. Both data and csum are "good", but good from different > states of the FS. In both cases (data first or metadata first), the > csum doesn't match the data, and so you now have an I/O error reported > when trying to read that data. > > You can't easily fix this, because when the data and csum don't > match, you need to know the _reason_ they don't match -- is it because > the machine was interrupted during write (in which case you can fix > it), or is it because the hard disk has had someone write data to it > directly, and the data is now toast (in which case you shouldn't fix > the I/O error)? 
I am still inclined to think that this kind of problem could be solved
using a journal: if you track which blocks are updated in the
transaction, and their checksums, then if the transaction is interrupted
you can always rebuild the data/checksum pair.

In case a transaction is interrupted:
- all COW data is trashed
- some NOCOW data might be written
- all metadata (which is COW) is trashed

Suppose btrfs logs, for each transaction, which "NOCOW data blocks" will
be updated, and their checksums; then, in case a transaction is
interrupted, you know which blocks have to be checked, and you are able
to verify whether the checksum matches and correct the mismatch. Logging
the checksum could also help to identify whether:
- the data is old
- the data is updated
- the updated data is correct

The same approach could also be used to solve the issue related to the
infamous RAID5/6 write hole: by logging which blocks are updated, in
case a transaction is aborted you can check which parity has to be
rebuilt.

>
> Basically, nodatacow bypasses the very mechanisms that are meant to
> provide consistency in the filesystem.
>
> Hugo.
>

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 63+ messages in thread
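Goffredo's proposal, logging (old csum, new csum) for each NOCOW block touched by a transaction and classifying the blocks after a crash, might look like this in outline. Illustrative Python; the journal layout is an assumption for the example, not an actual btrfs structure:

```python
import zlib

def csum(data: bytes) -> int:
    return zlib.crc32(data)

def replay_nocow_journal(journal, disk):
    """journal: {block_id: (old_csum, new_csum)} logged for the interrupted
    transaction. disk: {block_id: bytes} as found after the crash.
    Classify each logged block so the csum metadata can be repaired."""
    verdicts = {}
    for blk, (old_csum, new_csum) in journal.items():
        found = csum(disk[blk])
        if found == new_csum:
            verdicts[blk] = "updated"  # write completed: commit the new csum
        elif found == old_csum:
            verdicts[blk] = "old"      # write never landed: keep the old csum
        else:
            verdicts[blk] = "torn"     # matches neither: real corruption, EIO
    return verdicts
```

This supplies exactly the information Hugo said is missing: with both csums logged, an interrupted write (cleanly old or new data) can be told apart from genuine corruption.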
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-13 14:08 ` Goffredo Baroncelli @ 2017-08-14 7:08 ` Qu Wenruo 2017-08-14 14:23 ` Goffredo Baroncelli 0 siblings, 1 reply; 63+ messages in thread From: Qu Wenruo @ 2017-08-14 7:08 UTC (permalink / raw) To: kreijack, Hugo Mills, Christoph Anton Mitterer, Christoph Hellwig, Btrfs BTRFS On 2017年08月13日 22:08, Goffredo Baroncelli wrote: > On 08/12/2017 02:12 PM, Hugo Mills wrote: >> On Sat, Aug 12, 2017 at 01:51:46PM +0200, Christoph Anton Mitterer wrote: >>> On Sat, 2017-08-12 at 00:42 -0700, Christoph Hellwig wrote: > [...] >>> good, but csum is not >> >> I don't think this is a particularly good description of the >> problem. I'd say it's more like this: >> >> If you write data and metadata separately (which you have to do in >> the nodatacow case), and the system halts between the two writes, then >> you either have the new data with the old csum, or the old csum with >> the new data. Both data and csum are "good", but good from different >> states of the FS. In both cases (data first or metadata first), the >> csum doesn't match the data, and so you now have an I/O error reported >> when trying to read that data. >> >> You can't easily fix this, because when the data and csum don't >> match, you need to know the _reason_ they don't match -- is it because >> the machine was interrupted during write (in which case you can fix >> it), or is it because the hard disk has had someone write data to it >> directly, and the data is now toast (in which case you shouldn't fix >> the I/O error)? 
>
> I am still inclined to think that this kind of problems could be solved
> using a journal: if you track which blocks are updated in the
> transaction and their checksum, if the transaction are interrupted, you
> can always rebuild the pair data/checksum:
> in case of interruption of a transaction:
> - all COW data are trashed
> - some NOCOW data might be written
> - all metadata (which are COW) are trashed

The idea itself sounds good; however, btrfs doesn't use a journal (yet),
which means we would need to introduce one, while btrfs already uses
metadata CoW to handle most of the work a journal does.

> Supposing to log for each transaction BTRFS which "data NOCOW blocks"
> will be updated and their checksum, in case a transaction is
> interrupted you know which blocks have to be checked and are able to
> verify if the checksum matches and correct the mismatch. Logging also
> the checksum could help to identify if:
> - the data is old
> - the data is updated
> - the updated data is correct
>
> The same approach could be used also to solving also the issue related
> to the infamous RAID5/6 hole: logging which block are updated, in case
> of transaction aborted you can check the parity which have to be
> rebuild.

Indeed, Liu is using a journal to solve the RAID5/6 write hole.

But to address the lack-of-journal nature of btrfs, he introduced a
journal device to handle it: since btrfs metadata is either fully
written or trashed, we can't rely on the existing btrfs metadata to
handle a journal.

PS: This reminds me why ZFS still uses a journal (the ZFS intent log)
rather than the mandatory metadata CoW of btrfs.

Thanks,
Qu

>
>>
>> Basically, nodatacow bypasses the very mechanisms that are meant to
>> provide consistency in the filesystem.
>>
>> Hugo.
>>
>
>

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-14  7:08 ` Qu Wenruo
@ 2017-08-14 14:23 ` Goffredo Baroncelli
  2017-08-14 19:08 ` Chris Murphy
  0 siblings, 1 reply; 63+ messages in thread
From: Goffredo Baroncelli @ 2017-08-14 14:23 UTC (permalink / raw)
To: Qu Wenruo, Hugo Mills, Christoph Anton Mitterer, Christoph Hellwig,
	Btrfs BTRFS, Liu Bo

On 08/14/2017 09:08 AM, Qu Wenruo wrote:
>
>> Supposing to log for each transaction BTRFS which "data NOCOW blocks"
>> will be updated and their checksum, in case a transaction is
>> interrupted you know which blocks have to be checked and are able to
>> verify if the checksum matches and correct the mismatch. Logging also
>> the checksum could help to identify if:
>> - the data is old
>> - the data is updated
>> - the updated data is correct
>>
>> The same approach could be used also to solving also the issue related
>> to the infamous RAID5/6 hole: logging which block are updated, in case
>> of transaction aborted you can check the parity which have to be
>> rebuild.
> Indeed Liu is using journal to solve RAID5/6 write hole.
>
> But to address the lack-of-journal nature of btrfs, he introduced a
> journal device to handle it, since btrfs metadata is either written or
> trashed, we can't rely existing btrfs metadata to handle journal.

Liu's solution is a lot heavier: with it, you need to write both the
data and the parity twice. I am only suggesting tracking which blocks to
update, and this would only be needed for the stripes involved in a RMW
cycle. That is a lot less data to write (8 bytes vs 4 KiB).

> PS: This reminds me why ZFS is still using journal (called ZFS intent
> log) but not mandatory metadata CoW of btrfs.

From a theoretical point of view, if you have a "pure" COW filesystem,
you don't need a journal. Unfortunately a RAID5/6 stripe update is a RMW
cycle, so you need a journal to keep it in sync.
The same is true for NOCOW files (and their checksums).

>
> Thanks,
> Qu

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 63+ messages in thread
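The RMW cycle and the resulting write hole are easy to demonstrate with XOR parity. An illustrative Python sketch of a 3-device RAID5 stripe (two data blocks plus parity); this is not btrfs code:

```python
from functools import reduce

def xor(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

def demo_write_hole():
    d0, d1 = b"\x01" * 8, b"\x02" * 8
    parity = xor(d0, d1)  # the stripe starts out consistent

    # RMW update of d0: new parity = old parity ^ old d0 ^ new d0.
    new_d0 = b"\x07" * 8
    new_parity = xor(parity, d0, new_d0)

    # Crash window: the new data block was written, the parity write was lost.
    consistent_with_old_parity = (xor(new_d0, d1) == parity)
    consistent_with_new_parity = (xor(new_d0, d1) == new_parity)
    return consistent_with_old_parity, consistent_with_new_parity
```

If the crash lands between the data write and the parity write, the stale parity no longer matches the stripe (the write hole). Logging merely "this stripe is mid-RMW", a few bytes per stripe, is enough to know which parity must be rebuilt, which is the cheap variant suggested above.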
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 14:23 ` Goffredo Baroncelli @ 2017-08-14 19:08 ` Chris Murphy 2017-08-14 20:27 ` Goffredo Baroncelli 0 siblings, 1 reply; 63+ messages in thread From: Chris Murphy @ 2017-08-14 19:08 UTC (permalink / raw) To: Goffredo Baroncelli Cc: Qu Wenruo, Hugo Mills, Christoph Anton Mitterer, Christoph Hellwig, Btrfs BTRFS, Liu Bo On Mon, Aug 14, 2017 at 8:23 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote: > Form a theoretical point of view, if you have a "PURE" COW file-system, you don't need a journal. Unfortunately a RAID5/6 stripe update is a RMW cycle, so you need a journal to keep it in sync. The same is true for the NOCOW file (and their checksums) > I'm pretty sure the raid56 rmw is in memory only, I don't think we have a case where a stripe is getting partial writes (a block in a stripe is being overwritten). Partial stripe updates with rmw *on disk* would mean Btrfs raid56 is not CoW. -- Chris Murphy ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-14 19:08 ` Chris Murphy
@ 2017-08-14 20:27 ` Goffredo Baroncelli
  0 siblings, 0 replies; 63+ messages in thread
From: Goffredo Baroncelli @ 2017-08-14 20:27 UTC (permalink / raw)
To: Chris Murphy
Cc: Qu Wenruo, Hugo Mills, Christoph Anton Mitterer, Christoph Hellwig,
	Btrfs BTRFS, Liu Bo

On 08/14/2017 09:08 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:23 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
>
>> Form a theoretical point of view, if you have a "PURE" COW file-system,
>> you don't need a journal. Unfortunately a RAID5/6 stripe update is a RMW
>> cycle, so you need a journal to keep it in sync. The same is true for
>> the NOCOW file (and their checksums)
>>
>
> I'm pretty sure the raid56 rmw is in memory only, I don't think we
> have a case where a stripe is getting partial writes (a block in a
> stripe is being overwritten). Partial stripe updates with rmw *on
> disk* would mean Btrfs raid56 is not CoW.
>

I am not sure about that. Consider the following cases:
- What if we have to write less than a stripe?
- Suppose we remove a file of length 4K. If you don't allow a RMW cycle,
  that space would be lost forever...

Note that the size of a stripe could (theoretically) be quite big: with
an (insanely large) RAID composed of 20 disks, the stripe would be about
20 * 64K = ~1.2 MB...

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-12  7:42 ` Christoph Hellwig
  2017-08-12 11:51 ` Christoph Anton Mitterer
@ 2017-08-14  6:36 ` Qu Wenruo
  2017-08-14  7:43 ` Paul Jones
  2017-08-14 12:24 ` Christoph Anton Mitterer
  1 sibling, 2 replies; 63+ messages in thread
From: Qu Wenruo @ 2017-08-14 6:36 UTC (permalink / raw)
To: Christoph Hellwig, Christoph Anton Mitterer; +Cc: Btrfs BTRFS

On 2017年08月12日 15:42, Christoph Hellwig wrote:
> On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:
>> Qu Wenruo wrote:
>>> Although Btrfs can disable data CoW, nodatacow also disables data
>>> checksum, which is another main feature for btrfs.
>>
>> Then decoupling of the two should probably decoupled and support for
>> notdatacow+checksumming be implemented?!
>
> And how are you going to write your data and checksum atomically when
> doing in-place updates?

Exactly, and that's the main reason I can see for why btrfs disables
checksums for nodatacow.

Thanks,
Qu

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 63+ messages in thread
* RE: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
  2017-08-14  6:36 ` Qu Wenruo
@ 2017-08-14  7:43 ` Paul Jones
  2017-08-14  7:46 ` Qu Wenruo
  2017-08-14 12:24 ` Christoph Anton Mitterer
  1 sibling, 1 reply; 63+ messages in thread
From: Paul Jones @ 2017-08-14 7:43 UTC (permalink / raw)
To: Qu Wenruo, Christoph Hellwig, Christoph Anton Mitterer; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 1418 bytes --]

> -----Original Message-----
> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs-
> owner@vger.kernel.org] On Behalf Of Qu Wenruo
> Sent: Monday, 14 August 2017 4:37 PM
> To: Christoph Hellwig <hch@infradead.org>; Christoph Anton Mitterer
> <calestyo@scientia.net>
> Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
> Subject: Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
>
> On 2017年08月12日 15:42, Christoph Hellwig wrote:
> > On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:
> >> Qu Wenruo wrote:
> >>> Although Btrfs can disable data CoW, nodatacow also disables data
> >>> checksum, which is another main feature for btrfs.
> >>
> >> Then decoupling of the two should probably decoupled and support for
> >> notdatacow+checksumming be implemented?!
> >
> > And how are you going to write your data and checksum atomically when
> > doing in-place updates?
>
> Exactly, that's the main reason I can figure out why btrfs disables
> checksum for nodatacow.

But does it matter if it's not strictly atomic? By turning off COW you
imply that you accept the risk of an ill-timed failure. Although, from
my point of view, any reason that would require COW to be disabled
implies you're using the wrong filesystem anyway.

Paul.

^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 7:43 ` Paul Jones @ 2017-08-14 7:46 ` Qu Wenruo 2017-08-14 12:32 ` Christoph Anton Mitterer 0 siblings, 1 reply; 63+ messages in thread From: Qu Wenruo @ 2017-08-14 7:46 UTC (permalink / raw) To: Paul Jones, Christoph Hellwig, Christoph Anton Mitterer; +Cc: Btrfs BTRFS On 2017年08月14日 15:43, Paul Jones wrote: >> -----Original Message----- >> From: linux-btrfs-owner@vger.kernel.org [mailto:linux-btrfs- >> owner@vger.kernel.org] On Behalf Of Qu Wenruo >> Sent: Monday, 14 August 2017 4:37 PM >> To: Christoph Hellwig <hch@infradead.org>; Christoph Anton Mitterer >> <calestyo@scientia.net> >> Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org> >> Subject: Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? >> >> >> >> On 2017年08月12日 15:42, Christoph Hellwig wrote: >>> On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote: >>>> Qu Wenruo wrote: >>>>> Although Btrfs can disable data CoW, nodatacow also disables data >>>>> checksum, which is another main feature for btrfs. >>>> >>>> Then decoupling of the two should probably decoupled and support for >>>> notdatacow+checksumming be implemented?! >>> >>> And how are you going to write your data and checksum atomically when >>> doing in-place updates? >> >> Exactly, that's the main reason I can figure out why btrfs disables checksum >> for nodatacow. > > But does it matter if it's not strictly atomic? By turning off COW it implies you accept the risk of an ill-timed failure. The problem here is: if you enable csum, and the data is updated correctly but only the metadata is trashed, then you can't even read out the correct data, as the btrfs csum checker will just prevent you from reading out any data which doesn't match its csum. Now it's not just data corruption, but data loss. Thanks, Qu > Although from my point of view any reason that would require COW to be disabled implies you're using the wrong filesystem anyway. > > Paul. 
^ permalink raw reply [flat|nested] 63+ messages in thread
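Qu's point above, that a stale checksum makes correctly written data unreadable, can be illustrated with a toy sketch in plain Python. Nothing here is btrfs code: `ToyStore`, its `strict` flag, and the use of CRC-32 are all illustrative assumptions standing in for the csum tree and a hypothetical csum-bypass mode.

```python
import zlib

class ToyStore:
    """Toy model: a data block plus a separately stored checksum,
    mimicking data blocks vs. the metadata csum tree."""
    def __init__(self, data: bytes):
        self.data = data
        self.csum = zlib.crc32(data)

    def nodatacow_overwrite(self, new: bytes, crash_before_csum_commit: bool):
        # In-place write: the data block is updated first...
        self.data = new
        if not crash_before_csum_commit:
            # ...and only a later metadata commit updates the checksum.
            self.csum = zlib.crc32(new)

    def read(self, strict: bool = True) -> bytes:
        if zlib.crc32(self.data) != self.csum:
            if strict:
                raise OSError(5, "checksum mismatch")  # -EIO: data unreadable
            print("warning: checksum mismatch, returning data anyway")
        return self.data

store = ToyStore(b"old data")
# Power loss right after the in-place data write, before the csum commit:
store.nodatacow_overwrite(b"new data", crash_before_csum_commit=True)

try:
    store.read(strict=True)       # good data, stale csum -> -EIO ("data loss")
except OSError:
    print("EIO")
print(store.read(strict=False))   # a bypass mode would still see the data
```

The `strict=False` path corresponds to the tool or mount option Christoph proposes further down the thread; btrfs itself offers no such bypass.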
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 7:46 ` Qu Wenruo @ 2017-08-14 12:32 ` Christoph Anton Mitterer 2017-08-14 12:58 ` Qu Wenruo 0 siblings, 1 reply; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-14 12:32 UTC (permalink / raw) To: Qu Wenruo; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 1705 bytes --] On Mon, 2017-08-14 at 15:46 +0800, Qu Wenruo wrote: > The problem here is, if you enable csum and even data is updated > correctly, only metadata is trashed, then you can't even read out > the > correct data. So what? This problem occurs anyway *only* in the case of a crash or similar... and *only* if nodatacow+checksumming were used. A case in which, currently, the user can only hope that his data is fine (unless higher levels provide some checksumming means[0]), or needs to recover from a backup anyway. Intuitively I'd also say it's much less likely that the data (which is more in terms of space) is written correctly while the checksum is not. Or is it? [0] And when I investigated back when this discussion first came up, and some list members claimed that most typical cases (DBs, VM images) would do their own checksumming anyway... I came to the conclusion that most did not even support it, and even if they did, it's not enabled by default and not really *full* checksumming in most cases. > As btrfs csum checker will just prevent you from reading out any > data > which doesn't match with csum. As I've said before, a tool could be provided that re-computes the checksums (making the data accessible again)... or one could simply mount the fs with nochecksum or some other special option which allows bypassing any checks. > Now it's not just data corruption, but data loss then. I think the former is worse than the latter. 
The latter gives you a chance of noticing it, and either recovering from a backup, regenerating the data (if possible), or manually marking the data as "good" (though corrupted) again. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 12:32 ` Christoph Anton Mitterer @ 2017-08-14 12:58 ` Qu Wenruo 0 siblings, 0 replies; 63+ messages in thread From: Qu Wenruo @ 2017-08-14 12:58 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Btrfs BTRFS On 2017年08月14日 20:32, Christoph Anton Mitterer wrote: > On Mon, 2017-08-14 at 15:46 +0800, Qu Wenruo wrote: >> The problem here is, if you enable csum and even data is updated >> correctly, only metadata is trashed, then you can't even read out >> the >> correct data. > > So what? > This problem occurs anyway *only* in case of a crash,.. and *only* if > notdatacow+checksumung would be used. > A case in which currently, the user can either only hope that his data > is fine (unless higher levels provide some checksumming means[0]), or > anyway needs to recover from a backup. Let's make the combinations and their results in the power-loss case clear:

Datacow + datasum: good old data.
Datacow + nodatasum: good old data.
Nodatacow + datasum: good old data (data not committed yet) or -EIO (data updated). Not supported yet, so I just assume it would use the current csum checking behaviour.
Nodatacow + nodatasum: good old data (data not committed yet) or uncertain data.

The uncertain part is how it should behave when the data has been updated. If we really need to implement nodatacow + datasum, I prefer to make it consistent with the nodatacow + nodatasum behaviour: at least read out the data and give some csum warning, instead of refusing the read and returning -EIO. > > Intuitively I'd also say it's much less likely that the data (which is > more in terms of space) is written correctly while the checksum is not. > Or is it? Checksums are protected by mandatory metadata CoW, so metadata updates are always atomic. A checksum will either be updated correctly or trashed entirely, unlike data. And this is quite likely to happen. 
When synchronising a filesystem, we write data first, then metadata (data and metadata may be cached by the disk controller, but at least we submit those requests to the disk), then flush all data and metadata to disk, and finally update the superblock. Since metadata is updated via CoW, until the superblock is written to disk we are always reading the old metadata trees (including the csum tree). So if a power loss happens between the data being written to disk and the final superblock update, it is quite likely to hit the problem. And considering the data/metadata ratio, we spend more time flushing data than metadata, which increases the probability even further. > > [0] And when I've investigated back when discussion rose up the first > time and some list member claimed that most typical cases (DBs, VM > images) would anyway do their own checksuming,... I came to the > conclusion that most did not even support it and even if they would > it's no enabled per default and not really a *full* checksumming in > most cases. > > > >> As btrfs csum checker will just prevent you from reading out any >> data >> which doesn't match with csum. > As I've said before, a tool could be provided, that re-computes the > checksums then (making the data accessible again)... or one could > simply mount the fs with nochecksum or some other special option, which > allows bypassing any checks. Just as you pointed out, such csum bypassing would be a prerequisite for nodatacow+datasum. And unfortunately, we don't have such a facility yet. > >> Now it's not just data corruption, but data loss then. > I think the former is worse than the later. The later gives you a > chance of noting it, and either recover from a backup, regenerate the > data (if possible) or manually mark the data as being "good" (though > corrupted) again. This depends. 
If the upper layer has its own error detection mechanism, like keeping a special file fsynced before each write (call it a journal), then allowing the corrupted data to be read out gives it a chance to find that the data is actually good and continue, while just returning -EIO kills that chance entirely. BTW, normal user-space programs can handle a csum mismatch better than -EIO. Zip files, for example, have their own checksums, but zip tools can't handle -EIO at all. Thanks, Qu > > > Cheers, > Chris. > ^ permalink raw reply [flat|nested] 63+ messages in thread
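Qu's zip example can be made concrete: an archive tool verifies a per-member CRC-32 itself, so if the filesystem hands back the (possibly corrupted) bytes instead of -EIO, the application still detects the damage. A small sketch using Python's stdlib `zipfile`; the member name `f.txt` and the flipped byte are illustrative.

```python
import io
import zipfile

payload = b"hello world, stored uncompressed"
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    z.writestr("f.txt", payload)      # a CRC-32 is recorded in the archive

# Simulate silent at-rest corruption: flip one byte of the stored member data.
raw = bytearray(buf.getvalue())
raw[raw.find(payload)] ^= 0xFF

with zipfile.ZipFile(io.BytesIO(bytes(raw))) as z:
    # testzip() re-reads every member and returns the name of the first one
    # whose CRC no longer matches -- application-level checksumming at work.
    assert z.testzip() == "f.txt"
```

Had the filesystem returned -EIO instead, the tool could not even have reached its own CRC check, which is exactly Qu's point.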
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 6:36 ` Qu Wenruo 2017-08-14 7:43 ` Paul Jones @ 2017-08-14 12:24 ` Christoph Anton Mitterer 2017-08-14 14:23 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-14 12:24 UTC (permalink / raw) To: Qu Wenruo; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 993 bytes --] On Mon, 2017-08-14 at 14:36 +0800, Qu Wenruo wrote: > > And how are you going to write your data and checksum atomically > > when > > doing in-place updates? > > Exactly, that's the main reason I can figure out why btrfs disables > checksum for nodatacow. Still, I don't get the problem here... Yes, it cannot be done atomically (without workarounds like a journal), but this should only be an issue in the case of a crash or similar. And in this case nodatacow+nochecksum is already bad anyway; it's also not atomic, so data may be complete garbage (e.g. half written)... just that no one will ever notice. The only problem that nodatacow + checksumming + non-atomicity should give is when the data was actually correctly written at a crash but the checksum was not, in which case the bogus checksum would invalidate the good data on the next read. Or am I missing something? To me that still sounds much better than having no protection at all. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
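The write-ordering window Qu described earlier (data overwritten in place, the new csum tree only taking effect at the superblock update) can be modelled in a few lines. This is a simplified sketch of that commit sequence, not btrfs's actual on-disk logic; the step numbering and the `disk` dict are assumptions for illustration.

```python
import zlib

# Toy commit sequence for nodatacow + hypothetical datasum:
#   step 1: overwrite data blocks in place
#   step 2: CoW-write a new csum tree (pending, not yet visible)
#   step 3: flush everything (changes no logical state in this model)
#   step 4: atomically point the superblock at the new trees
def commit(disk: dict, new_data: bytes, crash_after_step: int) -> None:
    if crash_after_step >= 1:
        disk["data"] = new_data
    if crash_after_step >= 2:
        disk["pending_csum"] = zlib.crc32(new_data)
    if crash_after_step >= 4:
        disk["csum"] = disk["pending_csum"]  # the superblock flip

def csum_matches(disk: dict) -> bool:
    return zlib.crc32(disk["data"]) == disk["csum"]

def fresh_disk() -> dict:
    return {"data": b"old", "csum": zlib.crc32(b"old"), "pending_csum": None}

disk = fresh_disk()
commit(disk, b"new", crash_after_step=1)  # power loss right after data lands
assert not csum_matches(disk)             # good new data, stale csum -> -EIO

disk = fresh_disk()
commit(disk, b"new", crash_after_step=4)  # clean commit
assert csum_matches(disk)
```

Because steps 1-3 take far longer than the superblock flip (the data/metadata ratio Qu mentions), a crash is much more likely to land inside the window than after it.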
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 12:24 ` Christoph Anton Mitterer @ 2017-08-14 14:23 ` Austin S. Hemmelgarn 2017-08-14 15:13 ` Graham Cobb 2017-08-14 19:39 ` Christoph Anton Mitterer 0 siblings, 2 replies; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-14 14:23 UTC (permalink / raw) To: Christoph Anton Mitterer, Qu Wenruo; +Cc: Btrfs BTRFS On 2017-08-14 08:24, Christoph Anton Mitterer wrote: > On Mon, 2017-08-14 at 14:36 +0800, Qu Wenruo wrote: >>> And how are you going to write your data and checksum atomically >>> when >>> doing in-place updates? >> >> Exactly, that's the main reason I can figure out why btrfs disables >> checksum for nodatacow. > > Still, I don't get the problem here... > > Yes it cannot be done atomically (without workarounds like a journal or > so), but this should be only an issue in case of a crash or similar. > > And in this case nodatacow+nochecksum is anyway already bad, it's also > not atomic, so data may be completely garbage (e.g. half written)... > just that no one will ever notice. > > The only problem that nodatacow + checksuming + nonatomic should give > is when the data was actually correctly written at a crash, but the > cheksum was not, in which case the bogus checksum would invalidate the > good data on next read. > > Or do I miss something? > > > To me that sounds still much better than having no protection at all. Assume you have higher level verification. Would you rather not be able to read the data regardless of if it's correct or not, or be able to read it and determine yourself if it's correct or not? For almost anybody, the answer is going to be the second case, because the application knows better than the OS if the data is correct (and 'correct' may be a threshold, not some binary determination). At that point, you need to make the checksum error a warning instead of returning -EIO. How do you intend to communicate that warning back to the application? 
The kernel log won't work, because on any reasonably secure system it's not visible to anyone but root. There's also no side channel for the read() system calls that you can utilize. That then means that the checksums end up just being a means for the administrator to know some data wasn't written correctly, but they should know that anyway because the system crashed. As a result, the whole thing ends up reduced to some extra work for a pointless notification that some people may not even see. Looking at this from a different angle: Without background, what would you assume the behavior to be for this? For most people, the assumption would be that this provides the same degree of data safety that the checksums do when the data is CoW. We already have enough issues with people misunderstanding how things work and losing data as a result (keep in mind that the average user doesn't read documentation and will often blindly follow any random advice they see online), and we don't need more that are liable to cause data loss. ^ permalink raw reply [flat|nested] 63+ messages in thread
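Austin's objection, that `read()` has no side channel for a "data present but suspect" signal, can be sketched by showing what such a channel would have to look like at the application boundary. The `ReadResult` type and `read_with_status` function below are purely hypothetical; POSIX `read()` returns only bytes or an error, with no room for a validity flag.

```python
import zlib
from typing import NamedTuple

class ReadResult(NamedTuple):
    data: bytes
    csum_ok: bool  # the extra channel that plain read(2) does not have

def read_with_status(data: bytes, stored_csum: int) -> ReadResult:
    # Instead of failing with -EIO, return the data plus a validity flag.
    # An application using ordinary read() could never see this flag, so
    # the warning would have to travel through some other interface.
    return ReadResult(data, zlib.crc32(data) == stored_csum)

good = read_with_status(b"payload", zlib.crc32(b"payload"))
assert good.csum_ok
bad = read_with_status(b"payload", 0xDEADBEEF)
assert bad.data == b"payload" and not bad.csum_ok
```

The sketch makes the trade-off explicit: delivering the data with a flag requires a new API, while the existing API can only choose between the data and -EIO.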
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 14:23 ` Austin S. Hemmelgarn @ 2017-08-14 15:13 ` Graham Cobb 2017-08-14 15:53 ` Austin S. Hemmelgarn 2017-08-14 19:39 ` Christoph Anton Mitterer 1 sibling, 1 reply; 63+ messages in thread From: Graham Cobb @ 2017-08-14 15:13 UTC (permalink / raw) To: Btrfs BTRFS On 14/08/17 15:23, Austin S. Hemmelgarn wrote: > Assume you have higher level verification. But almost no applications do. In real life, the decision making/correction process will be manual and labour-intensive (for example, running fsck on a virtual disk or restoring a file from backup). > Would you rather not be able > to read the data regardless of if it's correct or not, or be able to > read it and determine yourself if it's correct or not? It must be controllable on a per-file basis, of course. For the tiny number of files where the app can both spot the problem and correct it (for example if it has a journal) the current behaviour could be used. But, on MY system, I absolutely would **always** select the first option (-EIO). I need to know that a potential problem may have occurred and will take manual action to decide what to do. Of course, this also needs a special utility (as Christoph proposed) to be able to force the read (to allow me to examine the data) and to be able to reset the checksum (although that is presumably as simple as rewriting the data). This is what happens normally with any filesystem when a disk block goes bad, but with the additional benefit of being able to examine a "possibly valid" version of the data block before overwriting it. > Looking at this from a different angle: Without background, what would > you assume the behavior to be for this? For most people, the assumption > would be that this provides the same degree of data safety that the > checksums do when the data is CoW. Exactly. The naive expectation is that turning off datacow does not prevent the bitrot checking from working. 
Also, the naive expectation (for any filesystem operation) is that if there is any doubt about the reliability of the data, the error is reported for the user to deal with. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 15:13 ` Graham Cobb @ 2017-08-14 15:53 ` Austin S. Hemmelgarn 2017-08-14 16:42 ` Graham Cobb 2017-08-14 19:54 ` Christoph Anton Mitterer 0 siblings, 2 replies; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-14 15:53 UTC (permalink / raw) To: Graham Cobb, Btrfs BTRFS On 2017-08-14 11:13, Graham Cobb wrote: > On 14/08/17 15:23, Austin S. Hemmelgarn wrote: >> Assume you have higher level verification. > > But almost no applications do. In real life, the decision > making/correction process will be manual and labour-intensive (for > example, running fsck on a virtual disk or restoring a file from backup). Quite a few applications actually _do_ have some degree of secondary verification or protection from a crash. Go look at almost any database software. It usually will not have checksumming, but it will almost always have support for a journal, which is enough to cover the particular data loss scenario we're talking about (unexpected unclean shutdown). > >> Would you rather not be able >> to read the data regardless of if it's correct or not, or be able to >> read it and determine yourself if it's correct or not? > > It must be controllable on a per-file basis, of course. For the tiny > number of files where the app can both spot the problem and correct it > (for example if it has a journal) the current behaviour could be used. In my own experience, the things that use nodatacow fall into one of 4 categories: 1. Cases where the data is non-critical, and data loss will be inconvenient but not fatal. Systemd journal files are a good example of this, as are web browser profiles when you're using profile sync. 2. Cases where the upper level can reasonably be expected to have some degree of handling, even if it's not correction. VM disk images and most database applications fall into this category. 3. Cases where data corruption will take out the application anyway. 
Poorly written database software is the primary example of this. 4. Things that shouldn't be using nodatacow because data safety is the most important aspect of the system. The first two cases work perfectly fine with the current behavior and are arguably no better off either way. The third is functionally fine with the current behavior provided that the crash doesn't change state (which isn't a guarantee), but could theoretically benefit from the determinism of knowing the app will die if the data is bad. The fourth is what most people seem to want this for, and don't realize that even if this is implemented, they will be no better off on average. > > But, on MY system, I absolutely would **always** select the first option > (-EIO). I need to know that a potential problem may have occurred and > will take manual action to decide what to do. Of course, this also needs > a special utility (as Christoph proposed) to be able to force the read > (to allow me to examine the data) and to be able to reset the checksum > (although that is presumably as simple as rewriting the data). And I and most other sysadmins I know would prefer the opposite with the addition of a secondary notification method. You can still hook the notification to stop the application, but you don't have to if you don't want to (and in cases 1 and 2 I listed above, you probably don't want to). > > This is what happens normally with any filesystem when a disk block goes > bad, but with the additional benefit of being able to examine a > "possibly valid" version of the data block before overwriting it. > >> Looking at this from a different angle: Without background, what would >> you assume the behavior to be for this? For most people, the assumption >> would be that this provides the same degree of data safety that the >> checksums do when the data is CoW. > > Exactly. The naive expectation is that turning off datacow does not > prevent the bitrot checking from working. 
Also, the naive expectation > (for any filesystem operation) is that if there is any doubt about the > reliability of the data, the error is reported for the user to deal with. The problem is that the naive expectation about data safety appears to be that adding checksumming support for nodatacow will improve safety, which it WILL NOT do. All it will do is add some reporting that will have a 50%+ rate of false positives (there is the very real possibility that the unexpected power loss will corrupt the checksum or the data if you're on anything but a traditional hard drive). If you have something that you need data safety for and can't be arsed to pay attention to whether or not your system had an unclean shutdown, then you have two practical options: 1. Don't use nodatacow. 2. Do some form of higher-level verification. Nothing about that is going to magically change because you suddenly have checksums telling you the data might be bad. Now, you _might_ be better off in a situation where the data got corrupted for some other reason (say, a media error for example), but even then you should have higher level verification, and it won't provide much benefit unless you're using replication or parity in BTRFS. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 15:53 ` Austin S. Hemmelgarn @ 2017-08-14 16:42 ` Graham Cobb 2017-08-14 19:54 ` Christoph Anton Mitterer 1 sibling, 0 replies; 63+ messages in thread From: Graham Cobb @ 2017-08-14 16:42 UTC (permalink / raw) To: Btrfs BTRFS On 14/08/17 16:53, Austin S. Hemmelgarn wrote: > Quite a few applications actually _do_ have some degree of secondary > verification or protection from a crash. I am glad your applications do and you have no need of this feature. You are welcome not to use it. I, on the other hand, definitely want this feature and would have it enabled by default on all my systems despite the need for manual actions after some unclean shutdowns. > Go look at almost any database > software. It usually will not have checksumming, but it will almost > always have support for a journal, which is enough to cover the > particular data loss scenario we're talking about (unexpected unclean > shutdown). No, the problem we are talking about is the data-at-rest corruption that checksumming is designed to deal with. That is why I want it. The unclean shutdown is a side issue that means there is a trade-off to using it. No one is suggesting that checksums are any significant help with the unclean shutdown case, just that the existence of that atomicity issue does not **prevent** them being very useful for the function for which they were designed. The degree to which any particular sysadmin will choose to enable or disable checksums on nodatacow files will depend on how much they value the checksum protection vs. the impact of manually fixing problems after some unclean shutdowns. In my particular case, many of these nodatacow files are large, very long-lived and only in use intermittently. I would like my monthly "btrfs scrub" to know they haven't gone bad but they are extremely unlikely to be in the middle of a write during an unclean shutdown so I am likely to have very few false errors. 
They are all backed up, but without checksumming I don't know that the backup needs to be restored (or even that I am not backing up now-bad data). Graham ^ permalink raw reply [flat|nested] 63+ messages in thread
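The higher-level verification Graham describes, knowing whether a long-lived file has rotted and a backup restore is needed, can be done entirely in user space with a digest manifest. A sketch along those lines; the file name `vm-image.raw` and the monthly-scrub framing are illustrative assumptions.

```python
import hashlib
import pathlib
import tempfile

def digest(p: pathlib.Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks to handle large images."""
    h = hashlib.sha256()
    with p.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def scrub(paths, manifest: dict) -> list:
    """Return the paths whose current digest no longer matches the manifest."""
    return [p for p in paths if digest(p) != manifest.get(str(p))]

with tempfile.TemporaryDirectory() as d:
    img = pathlib.Path(d) / "vm-image.raw"
    img.write_bytes(b"disk image contents")
    manifest = {str(img): digest(img)}        # recorded at backup time

    assert scrub([img], manifest) == []       # still clean
    img.write_bytes(b"bit-rotted contents")   # simulate at-rest corruption
    assert scrub([img], manifest) == [img]    # restore from backup needed
```

Unlike an in-filesystem csum on a nodatacow file, this check runs only when the file is quiescent, so it cannot produce the unclean-shutdown false positives debated above.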
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 15:53 ` Austin S. Hemmelgarn 2017-08-14 16:42 ` Graham Cobb @ 2017-08-14 19:54 ` Christoph Anton Mitterer 2017-08-15 11:37 ` Austin S. Hemmelgarn 2017-08-16 13:12 ` Chris Mason 1 sibling, 2 replies; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-14 19:54 UTC (permalink / raw) To: Austin S. Hemmelgarn, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 2409 bytes --] On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote: > Quite a few applications actually _do_ have some degree of secondary > verification or protection from a crash. Go look at almost any > database > software. Then please give proper references for this! This is from 2015, where you claimed this already and I looked up all the bigger DBs, and they either couldn't do it at all, didn't do it by default, or it required application support (i.e. from the programs using the DB): https://www.spinics.net/lists/linux-btrfs/msg50258.html > It usually will not have checksumming, but it will almost > always have support for a journal, which is enough to cover the > particular data loss scenario we're talking about (unexpected > unclean > shutdown). I don't think that's what we're talking about: we're talking about people wanting checksumming to notice e.g. silent data corruption. The crash case is only the corner case of what happens if the data is written correctly but the csums are not. > In my own experience, the things that use nodatacow fall into one of > 4 > categories: > 1. Cases where the data is non-critical, and data loss will be > inconvenient but not fatal. Systemd journal files are a good example > of > this, as are web browser profiles when you're using profile sync. I'd guess many people would want to have their log files valid and complete. Same for their profiles (especially since people concerned about their integrity might not want to have these synced to Mozilla/Google etc.) > 2. 
Cases where the upper level can reasonably be expected to have > some > degree of handling, even if it's not correction. VM disk images and > most database applications fall into this category. No. Wrong. Or prove me that I'm wrong ;-) And these two (VMs, DBs) are actually *the* main cases for nodatacow. > And I and most other sysadmins I know would prefer the opposite with > the > addition of a secondary notification method. You can still hook the > notification to stop the application, but you don't have to if you > don't > want to (and in cases 1 and 2 I listed above, you probably don't want > to). Then I guess btrfs is generally not the right thing for such people, as in the CoW case it will also give them EIO on any corruptions and their programs will fail. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 19:54 ` Christoph Anton Mitterer @ 2017-08-15 11:37 ` Austin S. Hemmelgarn 2017-08-15 14:41 ` Christoph Anton Mitterer 2017-08-16 13:12 ` Chris Mason 1 sibling, 1 reply; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-15 11:37 UTC (permalink / raw) To: Christoph Anton Mitterer, Btrfs BTRFS On 2017-08-14 15:54, Christoph Anton Mitterer wrote: > On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote: >> Quite a few applications actually _do_ have some degree of secondary >> verification or protection from a crash. Go look at almost any >> database >> software. > Then please give proper references for this! > > This is from 2015, where you claimed this already and I looked up all > the bigger DBs and they either couldn't do it at all, didn't to it per > default, or it required application support (i.e. from the programs > using the DB) > https://www.spinics.net/lists/linux-btrfs/msg50258.html Go look at Chrome, or Firefox, or Opera, or any other major web browser. At minimum, they will safely bail out if they detect corruption in the user profile and can trivially resync the profile from another system if the user has profile sync set up. Go take a look at any enterprise database application from a reasonable company, it will almost always support replication across systems and validate data it reads. Note that in both cases this isn't the same as BTRFS checking block checksums, and I never said that the application had to work without issue, even BTRFS and ZFS can only provide that guarantee with multiple devices or dup profiles on a single disk, but I can count on one hand the software I've used in the last few years that didn't at least fail gracefully when fed bad data (and sending -EIO when a checksum fails is essentially the same thing). 
> >> It usually will not have checksumming, but it will almost >> always have support for a journal, which is enough to cover the >> particular data loss scenario we're talking about (unexpected >> unclean >> shutdown). > > I don't think we talk about this: > We talk about people wanting checksuming to notice e.g. silent data > corruption. > > The crash case is only the corner case about what happens then if data > is written correctly but csums not. > > >> In my own experience, the things that use nodatacow fall into one of >> 4 >> categories: >> 1. Cases where the data is non-critical, and data loss will be >> inconvenient but not fatal. Systemd journal files are a good example >> of >> this, as are web browser profiles when you're using profile sync. > > I'd guess many people would want to have their log files valid and > complete. Same for their profiles (especially since people concerned > about their integrity might not want to have these synced to > Mozilla/Google etc.) Agreed, but there's also the counter argument for log files that most people who are not running servers rarely (if ever) look at old logs, and it's the old logs that are the most likely to have at rest corruption (the longer something sits idle on media, the more likely it will suffer from a media error). > > >> 2. Cases where the upper level can reasonably be expected to have >> some >> degree of handling, even if it's not correction. VM disk images and >> most database applications fall into this category. > > No. Wrong. Or prove me that I'm wrong ;-) > And these two (VMs, DBs) are actually *the* main cases for nodatacow. Go install OpenSUSE in a VM. Look at what filesystem it uses. Go install Solaris in a VM, lo and behold it uses ZFS _with no option for anything else_ as it's root filesystem. Go install a recent version of Windows server in a VM, notice that it also has the option of a properly checked filesystem (ReFS). 
Go install FreeBSD in a VM, notice that it provides the option (which is actively recommended by many people who use FreeBSD) to install with root on ZFS. Install Android or Chrome OS (or AOSP or Chromium OS) in a VM. Root the system and take a look at the storage stack, both of them use dm-verity, and Android (and possibly Chrome OS too, not 100% certain) uses per-file AEAD through the VFS encryption API on encrypted devices. The fact that some OS'es blindly trust the underlying storage hardware is not our issue, it's their issue, and it shouldn't be 'fixed' by BTRFS because it doesn't just affect their customers who run the OS in a VM on BTRFS. As far as databases, I know of only one piece of enterprise level database software that doesn't have some kind of handling for this type of thing, and it's a a horribly designed piece of software other than that too. Most enterprise database apps offer support for replication, and quite a few do their own data validation when reading from the database. And if you care about non-enterprise database apps, then you need to worry about the edge case caused by unclean shutdown. > > >> And I and most other sysadmins I know would prefer the opposite with >> the >> addition of a secondary notification method. You can still hook the >> notification to stop the application, but you don't have to if you >> don't >> want to (and in cases 1 and 2 I listed above, you probably don't want >> to). > > Then I guess btrfs is generally not the right thing for such people, as > in the CoW case it will also give them EIO on any corruptions and their > programs will fail. For a single disk? Yes, I'd agree that BTRFS isn't the correct answer unless you're running dup for all profiles on said single disk when you care about data safety. Once you add another though, it's far superior to regular RAID because it knows inherently which copy is wrong. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-15 11:37 ` Austin S. Hemmelgarn @ 2017-08-15 14:41 ` Christoph Anton Mitterer 2017-08-15 15:43 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-15 14:41 UTC (permalink / raw) To: Austin S. Hemmelgarn, Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 3959 bytes --] On Tue, 2017-08-15 at 07:37 -0400, Austin S. Hemmelgarn wrote: > Go look at Chrome, or Firefox, or Opera, or any other major web > browser. > At minimum, they will safely bail out if they detect corruption in > the > user profile and can trivially resync the profile from another system > if > the user has profile sync set up. Aha... I'd rather see a concrete reference to some white paper or code where one can really see that these programs actually *do* their own checksumming. But even from what you claim here now (that they'd only detect the corruption and then resync from another system, which is nothing other than recovering from a backup), I wouldn't see the big problem with EIO. > Go take a look at any enterprise > database application from a reasonable company, it will almost > always > support replication across systems and validate data it reads. Okay, I already showed you that PostgreSQL, MySQL, BDB and sqlite either can't do this or don't by default... so which one do you mean by "the enterprise DB" (Oracle?), and where's the reference that shows that they really do general checksumming? And that EIO would be a problem for their recovery strategies? And again, we're not talking about the WALs (or whatever these programs call them), which are there to handle a crash... we are talking about silent data corruption. 
> Agreed, but there's also the counter argument for log files that > most > people who are not running servers rarely (if ever) look at old > logs, > and it's the old logs that are the most likely to have at rest > corruption (the longer something sits idle on media, the more likely > it > will suffer from a media error). I wouldn't have any valid proof that it's really the "idle" data, which is the most likely one to have silent corruptions (at least not for all types of storage medium), but even if this is the case as you say... then it's probably more likely to hit the /usr/ /lib/ and so on stuff on stable distros... logs are typically rotated and then at least once re-written (when compressed). > Go install OpenSUSE in a VM. Look at what filesystem it uses. Go > install Solaris in a VM, lo and behold it uses ZFS _with no option > for > anything else_ as its root filesystem. Go install a recent version > of > Windows server in a VM, notice that it also has the option of a > properly > checked filesystem (ReFS). Go install FreeBSD in a VM, notice that > it > provides the option (which is actively recommended by many people > who > use FreeBSD) to install with root on ZFS. Install Android or Chrome > OS > (or AOSP or Chromium OS) in a VM. Root the system and take a look > at > the storage stack, both of them use dm-verity, and Android (and > possibly > Chrome OS too, not 100% certain) uses per-file AEAD through the VFS > encryption API on encrypted devices. So your argument for not adding support for this is basically: People don't or shouldn't use btrfs for this? o.O > The fact that some OS'es blindly > trust the underlying storage hardware is not our issue, it's their > issue, and it shouldn't be 'fixed' by BTRFS because it doesn't just > affect their customers who run the OS in a VM on BTRFS. Then you can probably drop checksumming from btrfs altogether. And with the same "argument" any other advanced feature. 
For resilience there is hardware RAID or Linux' MD raid... so no need to keep it in btrfs o.O > Most enterprise database apps offer support for > replication, > and quite a few do their own data validation when reading from the > database. First of all,... replication != the capability to detect silent data corruption. You still haven't named a single one which does checksumming per default. At least those which are quite popular in the FLOSS world don't seem to. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-15 14:41 ` Christoph Anton Mitterer @ 2017-08-15 15:43 ` Austin S. Hemmelgarn 0 siblings, 0 replies; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-15 15:43 UTC (permalink / raw) To: Christoph Anton Mitterer, Btrfs BTRFS On 2017-08-15 10:41, Christoph Anton Mitterer wrote: > On Tue, 2017-08-15 at 07:37 -0400, Austin S. Hemmelgarn wrote: >> Go look at Chrome, or Firefox, or Opera, or any other major web >> browser. >> At minimum, they will safely bail out if they detect corruption in >> the >> user profile and can trivially resync the profile from another system >> if >> the user has profile sync set up. > > Aha,... I'd rather see a concrete reference to some white paper or > code, where one can really see that these programs actually *do* their > own checksumming. > But even from what you claim here now (that they'd only detect the > corruption and then resync from another system - which is nothing else > than recovering from a backup), I wouldn't see the big problem with > EIO. It isn't a problem when the EIO is genuine. It is a problem when the EIO is a false positive and the data is actually intact. This breaks from current behavior on BTRFS in a not insignificant way. As things stand right now, -EIO on BTRFS means one of two things: * The underlying device returned an IO error. * The data there is incorrect. While it technically is possible for there to be a false positive with CoW, it is vanishingly unlikely even at Google and Facebook scale (I will comment that I've had this happen (exactly once), but it resulted from severe widespread media issues in the storage device that should have caused catastrophic failure of the device). There is no way to avoid false positives without CoW or journaling. We have CoW, and people aren't using it for performance reasons. 
Adding journaling instead will make performance worse (and brings up the important question of whether or not the journal is CoW) for NOCOW, and has the potential to make performance worse than without NOCOW. > > >> Go take a look at any enterprise >> database application from a reasonable company, it will almost >> always >> support replication across systems and validate data it reads. > > Okay, I already showed you that PostgreSQL, MySQL, BDB, sqlite can't > or don't do so by default... so which do you mean with the enterprise DB > (Oracle?) and where's the reference that shows that they really do > general checksumming? And that EIO would be a problem for their recovery > strategies? Again, I never said it had to be checksumming. Type and range checking and validation of the metadata (not through checksumming, but through verifying that the metadata makes sense, essentially the equivalent of fsck on older filesystems) _is_ done by almost everything dealing with databases these days except for trivial one-off stuff. As far as EIO, see my reply above. > > And again, we're not talking about the WALs (or whatever these programs > call it) which are there to handle a crash... we are talking about > silent data corruption. Reread what I said. Database _APPLICATION_ is not the same as database system. PGSQL, MySQL, BDB, SQLite, MSSQL, Oracle, etc, are all database systems, they provide a database that an application can build on top of, and yes, none of them provide any significant protection (except possibly MSSQL, but I'm not sure about that and it's not hugely relevant to this particular discussion). Things like MythTV, Bugzilla, Kodi, and other stuff that utilize the database for back-end storage (including things like many media players and web browsers) are database applications. The distinction here is no different from Linux applications versus Linux systems. 
In the context of actual applications using the database, it's still not rigorous verification like you seem to think I'm talking about, but most of them do enough sanity checking that most stuff beyond single-bit errors in numeric and string types will be caught and at least reported. > > >> Agreed, but there's also the counter argument for log files that >> most >> people who are not running servers rarely (if ever) look at old >> logs, >> and it's the old logs that are the most likely to have at rest >> corruption (the longer something sits idle on media, the more likely >> it >> will suffer from a media error). > > I wouldn't have any valid proof that it's really the "idle" data, which > is the most likely one to have silent corruptions (at least not for all > types of storage medium), but even if this is the case as you say... > then it's probably more likely to hit the /usr/ /lib/ and so on stuff > on stable distros... logs are typically rotated and then at least once > re-written (when compressed). Except that /usr and /lib are trivial to validate on any modern Linux or BSD system because the package manager almost certainly has file validation built in. At minimum, emerge, Entropy, DNF, yum, FreeBSD pkg-ng, pkgin, Zypper, YaST2, Nix, and Alpine APK all have this functionality, and there is at least one readily available piece of software (debsigs) for dpkg based systems. Sensibly security-minded individuals generally already have this type of validation in a cron job or systemd timer. > > >> Go install OpenSUSE in a VM. Look at what filesystem it uses. Go >> install Solaris in a VM, lo and behold it uses ZFS _with no option >> for >> anything else_ as its root filesystem. Go install a recent version >> of >> Windows server in a VM, notice that it also has the option of a >> properly >> checked filesystem (ReFS). 
Go install FreeBSD in a VM, notice that >> it >> provides the option (which is actively recommended by many people >> who >> use FreeBSD) to install with root on ZFS. Install Android or Chrome >> OS >> (or AOSP or Chromium OS) in a VM. Root the system and take a look >> at >> the storage stack, both of them use dm-verity, and Android (and >> possibly >> Chrome OS too, not 100% certain) uses per-file AEAD through the VFS >> encryption API on encrypted devices. > > So your argument for not adding support for this is basically: > People don't or shouldn't use btrfs for this? o.O No, you shouldn't be using a CoW filesystem directly for VM image storage if you care at all about performance, and especially not BTRFS. Even with NOCOW, performance of this on BTRFS is absolutely horrendous. This goes double if you're using QCOW2 or other allocate-on-demand formats. In decreasing order of preference, if you care about performance: * Native block devices * SAN devices * LVM or ZFS ZVols (believe it or not, ZVols actually get remarkably good performance despite being on a CoW backend) * Simple filesystems like ext4 or XFS that don't do CoW or use log structures for data * Files on ZFS or F2FS * Most other CoW or log structured filesystems * BTRFS BTRFS should literally be your last resort for VM image storage if you care about performance. > > > >> The fact that some OS'es blindly >> trust the underlying storage hardware is not our issue, it's their >> issue, and it shouldn't be 'fixed' by BTRFS because it doesn't just >> affect their customers who run the OS in a VM on BTRFS. > > Then you can probably drop checksumming from btrfs altogether. And with > the same "argument" any other advanced feature. > For resilience there is hardware RAID or Linux' MD raid... so no need > to keep it in btrfs o.O **NO**. That is not what I'm arguing. That would be regressing BTRFS to a state that I'm arguing needs to be _FIXED_ in other systems. 
My complaint is that operating systems (and by extension, VM's) should be doing the checking themselves because they inherently can't rely on the underlying storage in almost all cases, in particular in the ones in which they are almost always used. Notice in particular that I mentioned OpenSUSE, which has this validation _because_ it uses BTRFS by default for the root filesystem. I would have thought that that would not need to be explained here, but apparently I was wrong. > > >> Most enterprise database apps offer support for >> replication, >> and quite a few do their own data validation when reading from the >> database. > First of all,... replication != the capability to detect silent data > corruption. So how is proper verified replication not able to detect silent data corruption exactly? I mean, that's what RAID1 is and it does provide the ability to detect such things (unless your RAID implementation is brain dead), it just doesn't fix it reliably by itself. > > You still haven't named a single one which does checksumming per > default. At least those which are quite popular in the FLOSS world all > don't seem to do. Again, checksumming is not the only way to detect data corruption. Comparison to other copies, metadata validation (databases aren't just a jumble of data, there is required structure that can be validated), and type and range checking are all ways of detecting silent corruption. Are they perfect? No. Is checksumming better? In some circumstances. Are they sufficient for most use cases? Absolutely. ^ permalink raw reply [flat|nested] 63+ messages in thread
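The non-checksum detection methods mentioned above (type checks, range checks, structural validation) can be sketched as follows. This is an illustrative toy, not code from any database discussed in this thread; the field names and ranges are invented:

```python
# Hypothetical record validator: catches many silent corruptions without
# any checksums, purely by checking that the decoded data still "makes sense".
def validate_record(rec):
    # Type check: a corrupted offset often decodes to the wrong type entirely.
    if not isinstance(rec.get("id"), int):
        return False
    # Range check: a single high bit flip turns a small id into a huge number.
    if not (0 <= rec["id"] < 2**31):
        return False
    # Structural check: embedded NUL bytes are implausible in a name field.
    name = rec.get("name")
    if not isinstance(name, str) or "\x00" in name:
        return False
    return True

good = {"id": 42, "name": "alice"}
corrupt = {"id": 42 | (1 << 62), "name": "alice"}  # one high bit flipped
print(validate_record(good), validate_record(corrupt))  # True False
```

As the thread notes, this is weaker than checksumming (a flipped bit that still lands in a plausible value passes), but it catches most multi-bit and structural damage with no extra storage.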
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 19:54 ` Christoph Anton Mitterer 2017-08-15 11:37 ` Austin S. Hemmelgarn @ 2017-08-16 13:12 ` Chris Mason 2017-08-16 13:31 ` Christoph Anton Mitterer ` (3 more replies) 1 sibling, 4 replies; 63+ messages in thread From: Chris Mason @ 2017-08-16 13:12 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote: >On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote: >> Quite a few applications actually _do_ have some degree of secondary >> verification or protection from a crash. Go look at almost any >> database >> software. >Then please give proper references for this! > >This is from 2015, where you claimed this already and I looked up all >the bigger DBs and they either couldn't do it at all, didn't do it per >default, or it required application support (i.e. from the programs >using the DB) >https://www.spinics.net/lists/linux-btrfs/msg50258.html > > >> It usually will not have checksumming, but it will almost >> always have support for a journal, which is enough to cover the >> particular data loss scenario we're talking about (unexpected >> unclean >> shutdown). > >I don't think we talk about this: >We talk about people wanting checksumming to notice e.g. silent data >corruption. > >The crash case is only the corner case about what happens then if data >is written correctly but csums not. We use the crcs to catch storage gone wrong, both in terms of simple things like cabling, bus errors, drives gone crazy or exotic problems like every time I reboot the box a handful of sectors return EFI partition table headers instead of the data I wrote. You don't need data center scale for this to happen, but it does help... So, we do catch crc errors in prod and they do keep us from replicating bad data over good data. 
Some databases also crc, and all drives have correction bits of some kind. There's nothing wrong with crcs happening at lots of layers. Btrfs couples the crcs with COW because it's the least complicated way to protect against: * bits flipping * IO getting lost on the way to the drive, leaving stale but valid data in place * IO from sector A going to sector B instead, overwriting valid data with other valid data. It's possible to protect against all three without COW, but all solutions have their own tradeoffs and this is the setup we chose. It's easy to trust and easy to debug and at scale that really helps. In general, production storage environments prefer clearly defined errors when the storage has the wrong data. EIOs happen often, and you want to be able to quickly pitch the bad data and replicate in good data. My real goal is to make COW fast enough that we can leave it on for the database applications too. Obviously I haven't quite finished that one yet ;) But I'd rather keep the building block of all the other btrfs features in place than try to do crcs differently. -chris ^ permalink raw reply [flat|nested] 63+ messages in thread
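The second failure mode above (a lost write leaving stale but valid data in place) is exactly what an out-of-band crc catches. A minimal sketch of the idea, with the checksum stored separately from the data the way btrfs keeps it in its checksum tree; the dicts and the lost-write simulation are mine, not btrfs code:

```python
import zlib

data_blocks = {}  # stands in for the data area on disk
csum_tree = {}    # stands in for the separately stored checksum tree

def write_block(addr, payload, lose_data_write=False):
    csum_tree[addr] = zlib.crc32(payload)
    if not lose_data_write:        # a lost write leaves the old data in place
        data_blocks[addr] = payload

def read_block(addr):
    payload = data_blocks[addr]
    if zlib.crc32(payload) != csum_tree[addr]:
        raise OSError("crc mismatch")   # what the application sees as EIO
    return payload

write_block(0, b"version 1")
assert read_block(0) == b"version 1"
# The drive acknowledges the write but never persists the data:
write_block(0, b"version 2", lose_data_write=True)
# The stale bytes are perfectly plausible data, yet the read is rejected:
try:
    read_block(0)
except OSError as e:
    print(e)  # crc mismatch
```

Note that the checksum must live outside the block it covers: a crc appended to the block itself would still match the stale data and the lost write would go undetected.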
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:12 ` Chris Mason @ 2017-08-16 13:31 ` Christoph Anton Mitterer 2017-08-16 13:53 ` Austin S. Hemmelgarn 2017-08-16 16:54 ` Peter Grandi 2017-08-16 13:56 ` Austin S. Hemmelgarn ` (2 subsequent siblings) 3 siblings, 2 replies; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-16 13:31 UTC (permalink / raw) To: Chris Mason; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 2536 bytes --] Just out of curiosity: On Wed, 2017-08-16 at 09:12 -0400, Chris Mason wrote: > Btrfs couples the crcs with COW because this (which sounds like you want it to stay coupled that way)... plus > It's possible to protect against all three without COW, but all > solutions have their own tradeoffs and this is the setup we > chose. It's > easy to trust and easy to debug and at scale that really helps. ... this (which sounds more like you think the checksumming is so helpful that it would be nice in the nodatacow case as well). What does that mean now? Will things stay as they are... or may it become a goal to get checksumming for nodatacow (while of course still retaining the possibility to disable both, datacow AND checksumming)? > In general, production storage environments prefer clearly defined > errors when the storage has the wrong data. EIOs happen often, and > you > want to be able to quickly pitch the bad data and replicate in good > data. Which would also rather point towards getting clear EIOs (and thus checksumming) in the nodatacow case. > My real goal is to make COW fast enough that we can leave it on for > the > database applications too. Obviously I haven't quite finished that > one > yet ;) Well, the question is, even if you manage that sooner or later, will everyone be fully satisfied by this?! I've mentioned earlier on the list that I manage one of the many big data/computing centres for LHC. 
Our use case is typically big plain storage servers connected via some higher level storage management system (http://dcache.org/)... with mostly write once/read many. So apart from some central DBs for the storage management system itself, CoW is mostly no issue for us. But I've talked to a friend at the local super computing centre and they have rather general issues with CoW at their virtualisation cluster. Like SUSE's snapper making many snapshots, leading the storage images of VMs apparently to explode (in terms of space usage). For some of their storage backends there simply seems to be no de-duplication available (or other reasons that prevent its usage). From that I'd guess there would still be people who want the nice features of btrfs (snapshots, checksumming, etc.), while still being able to nodatacow in specific cases. > But I'd rather keep the building block of all the other btrfs > features in place than try to do crcs differently. Mhh I see, what a pity. Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:31 ` Christoph Anton Mitterer @ 2017-08-16 13:53 ` Austin S. Hemmelgarn 2017-08-16 14:11 ` Christoph Anton Mitterer 2017-08-16 18:19 ` David Sterba 2017-08-16 16:54 ` Peter Grandi 1 sibling, 2 replies; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-16 13:53 UTC (permalink / raw) To: Christoph Anton Mitterer, Chris Mason; +Cc: Btrfs BTRFS On 2017-08-16 09:31, Christoph Anton Mitterer wrote: > Just out of curiosity: > > > On Wed, 2017-08-16 at 09:12 -0400, Chris Mason wrote: >> Btrfs couples the crcs with COW because > > this (which sounds like you want it to stay coupled that way)... > > plus > > >> It's possible to protect against all three without COW, but all >> solutions have their own tradeoffs and this is the setup we >> chose. It's >> easy to trust and easy to debug and at scale that really helps. > > ... this (which sounds more you think the checksumming is so helpful, > that it would be nice in the nodatacow as well). > > What does that mean now? Things will stay as they are... or it may > become a goal to get checksumming for nodatacow (while of course still > retaining the possibility to disable both, datacow AND checksumming)? It means that you have other options if you want this so badly that you need to keep pestering the developers about it but can't be arsed to try to code it yourself. Go try BTRFS on top of dm-integrity, or on a system with T10-DIF or T13-EPP support (which you should have access to given the amount of funding CERN gets), or even on a ZFS zvol if you're crazy enough. It works wonderfully in the first two cases, and reliably (but not efficiently) in the third, and all of them provide exactly what you want, plus the bonus that they do a slightly better job of differentiating between media and memory errors. > > >> In general, production storage environments prefer clearly defined >> errors when the storage has the wrong data. 
EIOs happen often, and >> you >> want to be able to quickly pitch the bad data and replicate in good >> data. > > Which would also rather point towards getting clear EIOs (and thus > checksumming) in the nodatacow case. Except it isn't clear with nodatacow, because it might be a false positive. > > > >> My real goal is to make COW fast enough that we can leave it on for >> the >> database applications too. Obviously I haven't quite finished that >> one >> yet ;) > > Well the question is, even if you manage that sooner or later, will > everyone be fully satisfied by this?! > I've mentioned earlier on the list that I manage one of the many big > data/computing centres for LHC. > Our use case is typically big plain storage servers connected via some > higher level storage management system (http://dcache.org/)... with > mostly write once/read many. > > So apart from some central DBs for the storage management system > itself, CoW is mostly no issue for us. > But I've talked to some friend at the local super computing centre and > they have rather general issues with CoW at their virtualisation > cluster. > Like SUSE's snapper making many snapshots leading the storage images of > VMs apparently to explode (in terms of space usage). SUSE is a pathological case of brain-dead defaults. Snapper needs to either die or have some serious sense beat into it. When you turn off the automatic snapshot generation for everything but updates and set the retention policy to not keep almost everything, it's actually not bad at all. > For some of their storage backends there simply seems to be no > de-duplication available (or other reasons that prevent its usage). If the snapshots are being CoW'ed, then dedupe won't save them any space. Also, nodatacow is inherently at odds with reflinks used for dedupe. > > From that I'd guess there would be still people who want the nice > features of btrfs (snapshots, checksumming, etc.), while still being > able to nodatacow in specific cases. 
Snapshots work fine with nodatacow: each block gets CoW'ed once when it's first written to, and then goes back to being NOCOW. The only caveat is that you probably want to defrag either once everything has been rewritten, or right after the snapshot. ^ permalink raw reply [flat|nested] 63+ messages in thread
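The NOCOW-plus-snapshot workflow described at the end of this message looks roughly like the following command sequence. This is an illustrative sketch only (paths are hypothetical; root privileges, an existing btrfs mount, and /mnt/data being a subvolume are assumed), not a tested recipe:

```
# Mark a directory NOCOW; the attribute is inherited by files created in it
# afterwards (it has no effect on files that already contain data).
chattr +C /mnt/data/vm-images

# Take a snapshot; each NOCOW block will be CoW'ed exactly once on its
# next write, then revert to NOCOW behavior.
btrfs subvolume snapshot /mnt/data /mnt/data-snap

# Once everything has been rewritten (or right after the snapshot),
# defragment to unshare the extents and restore an unfragmented layout.
btrfs filesystem defragment -r /mnt/data/vm-images
```

The defragment step trades space for performance, exactly as discussed above: unsharing the reflinks means the snapshot keeps its own full copy.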
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:53 ` Austin S. Hemmelgarn @ 2017-08-16 14:11 ` Christoph Anton Mitterer 2017-08-16 15:07 ` Austin S. Hemmelgarn 2017-08-16 18:19 ` David Sterba 1 sibling, 1 reply; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-16 14:11 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 1761 bytes --] On Wed, 2017-08-16 at 09:53 -0400, Austin S. Hemmelgarn wrote: > Go try BTRFS on top of dm-integrity, or on a > system with T10-DIF or T13-EPP support When dm-integrity is used... would that be enough for btrfs to do a proper repair in the RAID+nodatacow case? I assume it can't do repairs now there, because how should it know which copy is valid. > (which you should have access to > given the amount of funding CERN gets) Hehe, CERN may get that funding (I don't know),... but the universities rather don't ;-) > Except it isn't clear with nodatacow, because it might be a false > positive. Sure, never claimed the opposite... just that I'd expect this to be less likely than the other way round, and less of a problem in practice. > SUSE is a pathological case of brain-dead defaults. Snapper needs to > either die or have some serious sense beat into it. When you turn > off > the automatic snapshot generation for everything but updates and set > the > retention policy to not keep almost everything, it's actually not bad > at > all. Well, still, with CoW (unless you have some form of deduplication, which in e.g. their use case would have to be on the layers below btrfs), your storage usage will grow probably more significantly than without. And as you've mentioned yourself in the other mail, there's still the issue with fragmentation. > Snapshots work fine with nodatacow, each block gets CoW'ed once when > it's first written to, and then goes back to being NOCOW. 
The only > caveat is that you probably want to defrag either once everything > has > been rewritten, or right after the snapshot. I thought defrag would unshare the reflinks? Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5930 bytes --] ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 14:11 ` Christoph Anton Mitterer @ 2017-08-16 15:07 ` Austin S. Hemmelgarn 2017-08-16 17:26 ` Peter Grandi 0 siblings, 1 reply; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-16 15:07 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: Btrfs BTRFS On 2017-08-16 10:11, Christoph Anton Mitterer wrote: > On Wed, 2017-08-16 at 09:53 -0400, Austin S. Hemmelgarn wrote: >> Go try BTRFS on top of dm-integrity, or on a >> system with T10-DIF or T13-EPP support > > When dm-integrity is used... would that be enough for btrfs to do a > proper repair in the RAID+nodatacow case? I assume it can't do repairs > now there, because how should it know which copy is valid. dm-integrity is functionally a 1:1 mapping target (it uses a secondary device for storing the integrity info, but it requires one table per target). It takes one backing device, and gives one mapped device. The setup I'm suggesting would involve putting that on each device that you have BTRFS configured to use. When the checksum there fails, you get a read error (AFAIK at least), which will trigger the regular BTRFS recovery code just like a failed checksum. So in this case, it should recover just fine if one copy is bogus (assuming it's a media issue and not something between the block device and the filesystem). In all honesty, putting BTRFS on dm-integrity is going to be slow. If you can find some T10 DIF or T13 EPP hardware, that will almost certainly be faster. > > >> (which you should have access to >> given the amount of funding CERN gets) > Hehe, CERN may get that funding (I don't know),... but the universities > rather don't ;-) Point taken, I often forget that funding isn't exactly distributed in the most obvious ways. > > >> Except it isn't clear with nodatacow, because it might be a false >> positive. > > Sure, never claimed the opposite... 
just that I'd expect this to be > less likely than the other way round, and less of a problem in > practice. Any number of hardware failures or errors can cause the same net effect as an unclean shutdown, and even some much more complicated issues (a loose data cable to a storage device is probably one of the best examples, as it's trivial to explain and not as rare as most people think). > > > >> SUSE is a pathological case of brain-dead defaults. Snapper needs to >> either die or have some serious sense beat into it. When you turn >> off >> the automatic snapshot generation for everything but updates and set >> the >> retention policy to not keep almost everything, it's actually not bad >> at >> all. > > Well, still, with CoW (unless you have some form of deduplication, > which in e.g. their use case would have to be on the layers below > btrfs), your storage usage will grow probably more significantly than > without. Yes, and for most VM use cases I would advocate not using BTRFS snapshots inside the VM and instead using snapshot functionality in the VM software itself. That still has performance issues in some cases, but at least it's easier to see where the data is actually being used. > > And as you've mentioned yourself in the other mail, there's still the > issue with fragmentation. > > >> Snapshots work fine with nodatacow, each block gets CoW'ed once when >> it's first written to, and then goes back to being NOCOW. The only >> caveat is that you probably want to defrag either once everything >> has >> been rewritten, or right after the snapshot. > > I thought defrag would unshare the reflinks? Which is exactly why you might want to do it. It will get rid of the overhead of the single CoW operation, and it will make sure there is minimal fragmentation. IOW, when mixing NOCOW and snapshots, you either have to use extra space, or you deal with performance issues. 
Aside from that though, it works just fine and has no special issues as compared to snapshots without NOCOW. ^ permalink raw reply [flat|nested] 63+ messages in thread
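The BTRFS-on-dm-integrity stacking suggested earlier in this message would be assembled roughly like this with the integritysetup tool from the cryptsetup project. Illustrative only: the device names are hypothetical, formatting destroys existing data, and standalone dm-integrity needs a reasonably recent kernel (4.12+):

```
# Add an integrity layer to each backing device (one mapping per device).
integritysetup format /dev/sdb1
integritysetup open /dev/sdb1 int-b
integritysetup format /dev/sdc1
integritysetup open /dev/sdc1 int-c

# Build BTRFS raid1 across the mapped devices; a dm-integrity checksum
# failure surfaces as a read error, which triggers the normal BTRFS
# repair path using the good copy on the other device.
mkfs.btrfs -m raid1 -d raid1 /dev/mapper/int-b /dev/mapper/int-c
```

This gets per-sector integrity checking underneath nodatacow files, at the cost of the extra read-modify-write overhead the thread warns about.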
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 15:07 ` Austin S. Hemmelgarn @ 2017-08-16 17:26 ` Peter Grandi 0 siblings, 0 replies; 63+ messages in thread From: Peter Grandi @ 2017-08-16 17:26 UTC (permalink / raw) To: Linux fs Btrfs [ ... ] >>> Snapshots work fine with nodatacow, each block gets CoW'ed >>> once when it's first written to, and then goes back to being >>> NOCOW. >>> The only caveat is that you probably want to defrag either >>> once everything has been rewritten, or right after the >>> snapshot. >> I thought defrag would unshare the reflinks? > Which is exactly why you might want to do it. It will get rid > of the overhead of the single CoW operation, and it will make > sure there is minimal fragmentation. > IOW, when mixing NOCOW and snapshots, you either have to use > extra space, or you deal with performance issues. Aside from > that though, it works just fine and has no special issues as > compared to snapshots without NOCOW. The above illustrates my guess as to why RHEL 7.4 dropped Btrfs support, which is: * RHEL is sold to managers who want to minimize the cost of upgrades and sysadm skills. * Every time a customer creates a ticket, RH profits fall. * RH had adopted 'ext3' because it was an in-place upgrade from 'ext2' and "just worked", 'ext4' because it was an in-place upgrade from 'ext3' and was supposed to "just work", and then was looking at Btrfs as an in-place upgrade from 'ext4', and presumably also a replacement for MD RAID, that would "just work". * 'ext4' (and XFS before that) already created trouble a few years ago because of the 'O_PONIES' controversy. 
* Not only Btrfs still has "challenges" as to multi-device functionality, and in-place upgrades from 'ext4' have "challenges" too, it has many "special cases" that need skill and discretion to handle, because it tries to cover so many different cases, and the first thing many a RH customer would do is to create a ticket to ask what to do, or how to fix a choice already made. Try to imagine the impact on the RH ticketing system of a switch from 'ext4' to Btrfs, with explanations like the above, about NOCOW, defrag, snapshots, balance, reflinks, and the exact order in which they have to be performed for best results. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:53 ` Austin S. Hemmelgarn 2017-08-16 14:11 ` Christoph Anton Mitterer @ 2017-08-16 18:19 ` David Sterba 1 sibling, 0 replies; 63+ messages in thread From: David Sterba @ 2017-08-16 18:19 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Christoph Anton Mitterer, Chris Mason, Btrfs BTRFS On Wed, Aug 16, 2017 at 09:53:57AM -0400, Austin S. Hemmelgarn wrote: > > So apart from some central DBs for the storage management system > > itself, CoW is mostly no issue for us. > > But I've talked to some friend at the local super computing centre and > > they have rather general issues with CoW at their virtualisation > > cluster. > > Like SUSE's snapper making many snapshots leading the storage images of > > VMs apparently to explode (in terms of space usage). > SUSE is a pathological case of brain-dead defaults. Snapper needs to > either die or have some serious sense beat into it. When you turn off > the automatic snapshot generation for everything but updates and set the > retention policy to not keep almost everything, it's actually not bad at > all. The defaults for timeline are really bad, the partition is almost never big enough to hold 10 months' worth of data updates, not to say 10 years. A rolling distro can fill the space even with the daily or weekly settings set to low numbers. But certain people had a different opinion and I was not successful in changing that. The least I did was to document some of the use cases and the hints that could allow one to have a bit more understanding of the effects. https://github.com/kdave/btrfsmaintenance#tuning-periodic-snapshotting > > For some of their storage backends there simply seems to be no > > de-duplication available (or other reasons that prevent its usage). > If the snapshots are being CoW'ed, then dedupe won't save them any > space. Also, nodatacow is inherently at odds with reflinks used for dedupe. 
> > > > From that I'd guess there would be still people who want the nice > > features of btrfs (snapshots, checksumming, etc.), while still being > > able to nodatacow in specific cases. > Snapshots work fine with nodatacow, each block gets CoW'ed once when > it's first written to, and then goes back to being NOCOW. The only > caveat is that you probably want to defrag either once everything has > been rewritten, or right after the snapshot. ^ permalink raw reply [flat|nested] 63+ messages in thread
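Austin's point above, that with NOCOW "each block gets CoW'ed once when it's first written to, and then goes back to being NOCOW", can be sketched with a toy model. The integer block "locations" and the bookkeeping here are purely illustrative, not btrfs's actual extent handling:

```python
# Toy model of NOCOW semantics across a snapshot: a NOCOW block is
# normally overwritten in place, but after a snapshot the first write
# must CoW it once (the snapshot still references the old location),
# after which writes go back to being in place. Block "locations" are
# just integers here; this is not how btrfs tracks extents.

class NocowFile:
    def __init__(self, nblocks):
        self.loc = list(range(nblocks))   # current location of each block
        self.next_free = nblocks          # next unused location
        self.shared = [False] * nblocks   # still referenced by a snapshot?

    def snapshot(self):
        # Every block is now shared with the snapshot.
        self.shared = [True] * len(self.loc)

    def write(self, block):
        if self.shared[block]:
            # First write after the snapshot: CoW this block once.
            self.loc[block] = self.next_free
            self.next_free += 1
            self.shared[block] = False
        # Otherwise: NOCOW, overwrite in place (location unchanged).
        return self.loc[block]

f = NocowFile(4)
f.snapshot()
first = f.write(0)    # relocated: CoW'ed exactly once
second = f.write(0)   # back to NOCOW: same location as the first write
print(first != 0, second == first)  # → True True
```

This also shows why Austin suggests defragmenting after everything has been rewritten: each CoW'ed-once block lands wherever space is free, so the file's layout scatters even though subsequent writes stay in place.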
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:31 ` Christoph Anton Mitterer 2017-08-16 13:53 ` Austin S. Hemmelgarn @ 2017-08-16 16:54 ` Peter Grandi 1 sibling, 0 replies; 63+ messages in thread From: Peter Grandi @ 2017-08-16 16:54 UTC (permalink / raw) To: Linux fs Btrfs [ ... ] > But I've talked to some friend at the local super computing > centre and they have rather general issues with CoW at their > virtualisation cluster. Amazing news! :-) > Like SUSE's snapper making many snapshots leading the storage > images of VMs apparently to explode (in terms of space usage). Well, this could be an argument that some of your friends are being "challenged" by running the storage systems of a "super computing centre" and that they could become "more prepared" about system administration, for example as to the principle "know which tool to use for which workload". Or else it could be an argument that they expect Btrfs to do their job while they watch cat videos from the intertubes. :-) ^ permalink raw reply [flat|nested] 63+ messages in thread
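The "snapshots make VM image space explode" complaint above has a simple back-of-envelope form: each retained snapshot pins the blocks it references, so retained space grows with the churn between snapshots. All numbers and the accounting below are made up for illustration (real usage depends on how churn overlaps between snapshots); the "10 of each tier" retention mirrors the timeline defaults criticized elsewhere in this thread:

```python
# Rough estimate of space pinned by a timeline snapshot policy.
# Assumption: churn is spread evenly and every changed block stays
# referenced by at least one retained snapshot (worst case).

def retained_gib(daily_churn_gib, hourly=10, daily=10, monthly=10, yearly=10):
    hourly_churn = daily_churn_gib / 24
    return (hourly * hourly_churn              # last N hourly snapshots
            + daily * daily_churn_gib          # last N daily snapshots
            + monthly * 30 * daily_churn_gib   # last N monthly snapshots
            + yearly * 365 * daily_churn_gib)  # last N yearly snapshots

# With 10 snapshots kept per tier and a modest 1 GiB/day of rewritten
# data, the yearly tier alone pins ~3.6 TiB:
print(round(retained_gib(1.0)))  # → 3960 (GiB, i.e. close to 4 TiB)
```

Even with invented numbers, the shape of the result explains the complaint: the long-horizon tiers dominate, so a partition sized for the live data alone fills up regardless of how small the per-day churn looks.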
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:12 ` Chris Mason 2017-08-16 13:31 ` Christoph Anton Mitterer @ 2017-08-16 13:56 ` Austin S. Hemmelgarn 2017-08-16 14:01 ` Qu Wenruo 2017-08-16 16:44 ` Peter Grandi 3 siblings, 0 replies; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-16 13:56 UTC (permalink / raw) To: Chris Mason, Btrfs BTRFS; +Cc: Christoph Anton Mitterer On 2017-08-16 09:12, Chris Mason wrote: > My real goal is to make COW fast enough that we can leave it on for the > database applications too. Obviously I haven't quite finished that one > yet ;) But I'd rather keep the building block of all the other btrfs > features in place than try to do crcs differently. In general, the performance issue isn't the time it takes to CoW the blocks; it's the fragmentation that CoW introduces. That fragmentation could in theory be mitigated by making CoW happen at a larger chunk size, but that would push the issue more towards being one of CoW performance, not fragmentation. ^ permalink raw reply [flat|nested] 63+ messages in thread
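The fragmentation effect Austin describes can be shown with a toy simulation: under CoW, every rewritten block lands in a new location, so an initially contiguous file decays into many extents, while overwrite-in-place would keep a single extent. Nothing here is btrfs-specific; blocks are just integers:

```python
# Toy illustration of CoW-induced fragmentation: random in-place
# updates relocate each rewritten block to a fresh location, breaking
# up an initially contiguous run of blocks.
import random

def count_extents(locations):
    """Number of runs of consecutive block locations."""
    return 1 + sum(1 for a, b in zip(locations, locations[1:]) if b != a + 1)

random.seed(42)
nblocks = 1000
cow = list(range(nblocks))          # contiguous file, CoW semantics
next_free = nblocks
for _ in range(500):                # 500 random "in-place" updates
    i = random.randrange(nblocks)
    cow[i] = next_free              # CoW: block relocated on every write
    next_free += 1

print(count_extents(list(range(nblocks))))  # untouched file: 1 extent
print(count_extents(cow) > 100)             # CoW'ed file: heavily fragmented
```

This is also why the pain is workload-dependent: sequential rewrites or append-only files stay largely contiguous under CoW, while the random small updates typical of databases and VM images produce exactly this scatter.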
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:12 ` Chris Mason 2017-08-16 13:31 ` Christoph Anton Mitterer 2017-08-16 13:56 ` Austin S. Hemmelgarn @ 2017-08-16 14:01 ` Qu Wenruo 2017-08-16 19:52 ` Chris Murphy 2017-08-16 16:44 ` Peter Grandi 3 siblings, 1 reply; 63+ messages in thread From: Qu Wenruo @ 2017-08-16 14:01 UTC (permalink / raw) To: Chris Mason, Christoph Anton Mitterer, Austin S. Hemmelgarn, Btrfs BTRFS On 2017年08月16日 21:12, Chris Mason wrote: > On Mon, Aug 14, 2017 at 09:54:48PM +0200, Christoph Anton Mitterer wrote: >> On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote: >>> Quite a few applications actually _do_ have some degree of secondary >>> verification or protection from a crash. Go look at almost any >>> database >>> software. >> Then please give proper references for this! >> >> This is from 2015, where you claimed this already and I looked up all >> the bigger DBs and they either couldn't do it at all, didn't to it per >> default, or it required application support (i.e. from the programs >> using the DB) >> https://www.spinics.net/lists/linux-btrfs/msg50258.html >> >> >>> It usually will not have checksumming, but it will almost >>> always have support for a journal, which is enough to cover the >>> particular data loss scenario we're talking about (unexpected >>> unclean >>> shutdown). >> >> I don't think we talk about this: >> We talk about people wanting checksuming to notice e.g. silent data >> corruption. >> >> The crash case is only the corner case about what happens then if data >> is written correctly but csums not. > > We use the crcs to catch storage gone wrong, both in terms of simple > things like cabling, bus errors, drives gone crazy or exotic problems > like every time I reboot the box a handful of sectors return EFI > partition table headers instead of the data I wrote. You don't need > data center scale for this to happen, but it does help... 
> > So, we do catch crc errors in prod and they do keep us from replicating > bad data over good data. Some databases also crc, and all drives have > correction bits of some kind. There's nothing wrong with crcs > happening at lots of layers. > > Btrfs couples the crcs with COW because it's the least complicated way > to protect against: > > * bits flipping > * IO getting lost on the way to the drive, leaving stale but valid data > in place > * IO from sector A going to sector B instead, overwriting valid data > with other valid data. > > It's possible to protect against all three without COW, but all > solutions have their own tradeoffs and this is the setup we chose. It's > easy to trust and easy to debug and at scale that really helps. > > In general, production storage environments prefer clearly defined > errors when the storage has the wrong data. EIOs happen often, and you > want to be able to quickly pitch the bad data and replicate in good data. Btrfs csum is really good, especially for cases like RAID1/5/6 where the csum can provide extra info about which mirror/stripe/parity can be trusted, with minimal space wasted. The DM layer should really have the ability to verify its data the same way btrfs does. > > My real goal is to make COW fast enough that we can leave it on for the > database applications too. Yes, most of the complexity of nodatasum/nodatacow comes from those special workloads. BTW, when Fujitsu tested the postgresql workload on btrfs, the results were quite interesting. For HDD, when the number of clients is low, btrfs shows an obvious performance drop. The problem seems to be mandatory metadata COW, which leads to superblock FUA updates. As the number of clients grows, the difference between btrfs and other fses gets much smaller; the bottleneck is the HDD itself. For SSD, when the number of clients is low, btrfs has almost the same performance as other fses, and nodatacow/nodatasum provides only a marginal difference. 
But when the number of clients grows, btrfs falls far behind other fses. The reason seems to be related to how postgresql commits its transactions: it always fsyncs its journal sequentially, without concurrency. Btrfs needs to wait for its data writes before updating its log tree, so most of its time is wasted waiting on data IO. In that case, nodatacow does improve the performance, by allowing btrfs to update its log tree without waiting on data IO. But in both cases, CoW itself (allocating new extents, calculating csums) is not the main cause of the slowdown. That is to say, nodatacow is not as important as we used to think. If we can get rid of nodatacow/nodatasum, there will be much less for us developers to consider, and fewer related bugs. Thanks, Qu > Obviously I haven't quite finished that one > yet ;) But I'd rather keep the building block of all the other btrfs > features in place than try to do crcs differently. > > -chris > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
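The sequential-fsync effect Qu describes can be put into a back-of-envelope model. The latencies below are invented; the only point is that when journal commits cannot overlap, any extra wait per commit (such as flushing data extents before the log tree update) multiplies by the commit count:

```python
# Back-of-envelope model of sequential journal commits: each fsync'ed
# commit must finish before the next starts, so per-commit latencies
# add up. If the filesystem must wait for data IO before updating its
# log tree, every commit pays data_io + log_io; if it can update the
# log tree without waiting on data IO (the nodatacow case described
# above), only log_io is on the critical path. All numbers are made up.

def total_commit_time(n_commits, data_io_ms, log_io_ms, wait_for_data):
    per_commit = (data_io_ms if wait_for_data else 0) + log_io_ms
    return n_commits * per_commit  # strictly sequential: no overlap

waiting = total_commit_time(1000, data_io_ms=2.0, log_io_ms=0.5,
                            wait_for_data=True)
not_waiting = total_commit_time(1000, data_io_ms=2.0, log_io_ms=0.5,
                                wait_for_data=False)
print(waiting, not_waiting)  # → 2500.0 500.0
```

The same arithmetic explains why the gap only shows up on fast media: on an HDD the drive itself dominates both terms, while on an SSD the data-wait term becomes the visible bottleneck.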
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 14:01 ` Qu Wenruo @ 2017-08-16 19:52 ` Chris Murphy 2017-08-17 6:25 ` GWB 0 siblings, 1 reply; 63+ messages in thread From: Chris Murphy @ 2017-08-16 19:52 UTC (permalink / raw) To: Qu Wenruo Cc: Chris Mason, Christoph Anton Mitterer, Austin S. Hemmelgarn, Btrfs BTRFS On Wed, Aug 16, 2017 at 8:01 AM, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote: > BTW, when Fujitsu tested the postgresql workload on btrfs, the result is > quite interesting. > > For HDD, when number of clients is low, btrfs shows obvious performance > drop. > And the problem seems to be mandatory metadata COW, which leads to > superblock FUA updates. > And when number of clients grow, difference between btrfs and other fses > gets much smaller, the bottleneck is the HDD itself. > > While for SSD, when number of clients is low, btrfs is almost the same > performance as other fses, nodatacow/nodatasum only provides marginal > difference. > But when number of clients grows, btrfs falls far behind other fses. > The reason seems to be related to how postgresql commit its transaction, > which always fsync its journal sequentially without concurrency. I wonder to what degree fsync is used as a hammer for a problem that needs more granular indicators to solve, like fadvise() and even extending it? But I'm also curious how the behaviors you report above change when combining SSD and HDD via either dm-cache or bcache. Do the worst aspects of SSD and HDD get muted in that case? Or do the worst aspects become even worse across the board? -- Chris Murphy ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 19:52 ` Chris Murphy @ 2017-08-17 6:25 ` GWB 2017-08-17 11:47 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 63+ messages in thread From: GWB @ 2017-08-17 6:25 UTC (permalink / raw) To: Peter Grandi, Linux fs Btrfs << Or else it could be an argument that they expect Btrfs to do their job while they watch cat videos from the intertubes. :-) >> My favourite quote from the list this week, and, well, obviously, that is the main selling point of file systems like btrfs, zfs, and various other lvm and raid setups. The need to free up time to watch cat videos on the intertubes (whilst at work) has driven most technological innovations, going back at least to the time of the Roman Empire. So, sure, I'll be happy to admit that I like it very much when a file system or some other software or hardware component makes my job easier (which gives me more time to watch cat videos). But if hours on hours of cat videos have taught me one thing, it is that catastrophe (pun intended) awaits those who assume that btrfs (or zfs or nilfs or whatever) will magically work well in all use cases. That may be what their customers assumed about btrfs, but did Red Hat make that claim implicitly or explicitly? I don't know, but it seems unlikely, and all the things mentioned in this thread make sense to me. It looks like Red Hat is pushing "GFS" (Red Hat Global File System) for its clustered file system: https://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf XFS is now the standard "on disk" fs for Red Hat, but I can't tell if XFS is the DMU (backing file system or Data Management Unit) for GFS (zfs is the dmu for lustre). Probably, but why does GFS still have a file size limit of 100TB, while XFS has a 500TB limit, according to Red Hat? https://access.redhat.com/articles/rhel-limits And btrfs is gone from that list. 
So does this mean that Red Hat deprecating btrfs will have a tangible effect on its development, future improvements, and adoption? It doesn't help, but maybe it's not too bad. From reading the list, my impression is that the typical Red Hat customer with large data arrays might do fine running xfs over lvm2 over hardware raid (or at least the customers who are paying attention to the monitor stats between cat videos). That's not for me, because I prefer mirrors, not stripes, and "hot spares" that I can pull out of the enclosure, place in another machine, and get running again (which points me back to btrfs and zfs). But it must work great for a lot of data silos. On the plus side, btrfs is one of the backing file systems in ceph; on the minus side, with Red Hat out, btrfs might lose some developers and support: http://www.h-online.com/open/features/Kernel-Log-Coming-in-2-6-37-Part-2-File-systems-1148305.html%3Fpage=2 As long as FaceBook keeps using btrfs, I wouldn't worry too much about large firm adoption. Chris (from facebook, post above) points out that Facebook runs both xfs and btrfs as backing file systems for Gluster: https://www.linux.com/news/learn/intro-to-linux/how-facebook-uses-linux-and-btrfs-interview-chris-mason And Gluster is... owned by Red Hat (since 2011), which now advertises its "Red Hat Global File System", which would be... Gluster? Chris, is that right? So Facebook runs Gluster (which might be Red Hat Global File System) with both xfs and btrfs as the backing fs, and Red Hat... advertises Red Hat GFS as a platform for Oracle RAC Database Clustering. But not (presumably) running with btrfs as the backing fs, but rather xfs. So could one Gluster "grid" run over two file systems, xfs for the applications, and btrfs for the primary data storage? So Oracle still supports btrfs. Facebook still uses it. 
And it would be very funny if Red Hat GFS does use btrfs (eventually, at some point in the future) as the backing fs, but their customers probably won't notice the difference. I'm not too worried. I'll keep using btrfs as it is now, within the limits of what it can consistently do, and do what I can to help support the effort. I'm not a file system coder, but I very much appreciate the enormous amount of work that goes into btrfs. Steady on, ButterFS people. Back now to cat videos. Gordon Aug 16, 2017 at 11:54 AM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote: > [ ... ] > >> But I've talked to some friend at the local super computing >> centre and they have rather general issues with CoW at their >> virtualisation cluster. > > Amazing news! :-) > >> Like SUSE's snapper making many snapshots leading the storage >> images of VMs apparently to explode (in terms of space usage). > > Well, this could be an argument that some of your friends are being > "challenged" by running the storage systems of a "super computing > centre" and that they could become "more prepared" about system > administration, for example as to the principle "know which tool to > use for which workload". Or else it could be an argument that they > expect Btrfs to do their job while they watch cat videos from the > intertubes. :-) > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-17 6:25 ` GWB @ 2017-08-17 11:47 ` Austin S. Hemmelgarn 2017-08-17 19:00 ` Chris Murphy 0 siblings, 1 reply; 63+ messages in thread From: Austin S. Hemmelgarn @ 2017-08-17 11:47 UTC (permalink / raw) To: GWB, Peter Grandi, Linux fs Btrfs On 2017-08-17 02:25, GWB wrote: > << > Or else it could be an argument that they > expect Btrfs to do their job while they watch cat videos from the > intertubes. :-) >>> > > My favourite quote from the list this week, and, well, obviously, that > is the main selling point of file systems like btrfs, zfs, and various > other lvm and raid set ups. The need to free up time to watch cat > videos on the intertubes (whilst at work) has driven most > technological innovations, going back at least to the time of the > Roman Empire. > > So, sure, I'll be happy to admit that I like it very much when a file > system or some other software or hardware component makes my job > easier (which gives me more time to watch cat videos). But if hours > on hours of cat videos have taught me one thing, it is that > catastrophe (pun intended) awaits those who assume that btrfs (or zfs > or nilfs or whatever) will magically work well in all use cases. Agreed, and I will comment that there are far more catastrophes caused by sysadmin complacency or not properly understanding what they're administering than almost anything else. > > That may be what their customers assumed about btrfs, but did Red Hat > make that claim implicitly or explicitly? I don't know, but it seems > unlikely, and all the things mentioned in this thread make sense to > me. It looks like Red Hat is pushing "GFS" (Red Hat Global File > System) for its clustered file system: > > https://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf Huh, I could have sworn they were pushing Gluster... 
> > XFS is now the standard "on disk" fs for Red Hat, but I can't tell if > XFS is the DMU (backing file system or Data Management Unit) for GFS > (zfs is the dmu for lustre). Probably, but why does GFS still has a > file size limit of 100TB, while XFS has a 500TB limit, according to > Red Hat? GFS2 (which is what I think they're talking about) has its own on-disk format, and actually works as a single-node filesystem. It's a lot closer to OCFS2 in terms of design than it is to Lustre, though I'm not sure if it needs shared storage or not. Also, both of those file size 'limits' are customer support limits from what I can tell. XFS supports files (in theory at least) up to 8 EB minus one byte, and I'm not able to find any other documentation on this regarding GFS2, but I seriously doubt that it has a 100TB file size limit. > > https://access.redhat.com/articles/rhel-limits > > And btrfs is gone from that list. > > So does this mean that Red Hat deprecating btrfs have a tangible > effect on its development, future improvements, and adoption? It > doesn't help, but maybe its not too bad. From reading the list, my > impression is that the typical Red Hat customer with large data arrays > might do fine running xfs over lvm2 over hardware raid (or at least > the customers who are paying attention to the monitor stats between > cat videos). That's not for me, because I prefer mirrors, not > stripes, and "hot spares" that I can pull out of the enclosure, place > in another machine, and get running again (which points me back to > btrfs and zfs). But it must work great for a lot of data silos. > > On the plus side, btrfs is one of the backing file systems in ceph; on > the minus side, with Red Hat out, btrfs might lose some developers and > support: > > http://www.h-online.com/open/features/Kernel-Log-Coming-in-2-6-37-Part-2-File-systems-1148305.html%3Fpage=2 I'm pretty certain that Ceph has officially stopped recommending BTRFS as a backend filesystem. 
TBH, it was never that amazing of an idea to begin with: Ceph does a lot of the same things that BTRFS does, so you're replicating a not insignificant amount of work, and the big thing was really snapshot support anyway. Also, I don't think I've ever seen any patches posted from a Red Hat address on the ML, so I don't think they were really all that involved in development to begin with. > > As long as FaceBook keeps using btrfs, I wouldn't worry too much about > large firm adoption. Chris (from facebook, post above) points out > that Facebook runs both xfs and btrfs as backing file systems for > Gluster: > > https://www.linux.com/news/learn/intro-to-linux/how-facebook-uses-linux-and-btrfs-interview-chris-mason > > And Gluster is... owned by Red Hat (since 2011), which now advertises > its "Red Hat Global File System", which would be... Gluster? Chris, > is that right? So Facebook runs Gluster (which might be Red Hat > Global File System) with both xfs and btrfs as the backing fs, and Red > Hat... advertises Red Hat GFS as a platform for Oracle RAC Database > Clustering. But not (presumably) running with btrfs as the backing > fs, but rather xfs. So could one Gluster "grid" run over two file > systems, xfs for the applications, and btrfs for the primary data > storage? GFS and GlusterFS are different technologies, unless Red Hat's marketing department is trying to be actively deceptive. GFS is a traditional cluster filesystem which requires fencing hardware and has its own on-disk format. It originated on IRIX, got ported to Linux, got updated to GFS2 to add splice() support and a few other things, and hasn't seen much development from what I can tell since that happened in 2009 (at least, not much beyond standard bug fixes and maintenance). 
GlusterFS is a more modern cluster filesystem design, uses separate backing storage (like Lustre and Ceph do), has the rather nice advantage that the layout on the back-end storage exactly replicates the layout in the GlusterFS volume (assuming you're just using replication), and doesn't require any special hardware. It also runs reasonably well on top of BTRFS, other than some scalability issues with directories with thousands of files in them (both Gluster and BTRFS have issues there, and they compound when used in a stack like this). It doesn't directly use any special functionality of BTRFS, although it in theory could make use of the snapshotting functionality (the current snapshot support in Gluster assumes the use of a backing FS that supports freezefs on top of LVM2). > > So Oracle still supports btrfs. Facebook still uses it. And it would > be very funny if Red Hat GFS does use btrfs (eventually, at some point > in the future) as the backing fs, but their customers probably won't > notice the difference. SUSE is also pretty actively involved in the development too, and I think Fujitsu is as well. > > I'm not too worried. I'll keep using btrfs as it is now, within the > limits of what it can consistently do, and do what I can to help > support the effort. I'm not a file system coder, but I very much > appreciate the enormous amount of work that goes into btrfs. > > Steady on, ButterFS people. Back now to cat videos. ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-17 11:47 ` Austin S. Hemmelgarn @ 2017-08-17 19:00 ` Chris Murphy 2017-08-17 20:34 ` GWB 0 siblings, 1 reply; 63+ messages in thread From: Chris Murphy @ 2017-08-17 19:00 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: GWB, Peter Grandi, Linux fs Btrfs On Thu, Aug 17, 2017 at 5:47 AM, Austin S. Hemmelgarn <ahferroin7@gmail.com> wrote: > Also, I don't think I've ever seen any patches posted from a Red Hat address > on the ML, so I don't think they were really all that involved in > development to begin with. Unfortunately the email domain doesn't tell the whole story about who's backing development, the company or the individual. [chris@f26s linux]$ git log --since="2016-01-01" --pretty=format:"%an %ae" --no-merges -- fs/btrfs | sort -u | grep redhat Andreas Gruenbacher agruenba@redhat.com David Howells dhowells@redhat.com Eric Sandeen sandeen@redhat.com Jeff Layton jlayton@redhat.com Mike Christie mchristi@redhat.com Miklos Szeredi mszeredi@redhat.com $ > GFS and GlusterFS are different technologies, unless Red Hat's marketing > department is trying to be actively deceptive. https://www.redhat.com/en/technologies/storage Seems very clear. I don't even see GFS or GFS2 on here. It's Gluster and Ceph. > > SUSE is also pretty actively involved in the development too, and I think > Fujitsu is as well. >> >> >> I'm not too worried. I'll keep using btrfs as it is now, within the >> limits of what it can consistently do, and do what I can to help >> support the effort. I'm not a file system coder, but I very much >> appreciate the enormous amount of work that goes into btrfs. >> >> Steady on, ButterFS people. Back now to cat videos. 
> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html Big bunch of SUSE contributions (yes David Sterba is counted three times here), and Fujitsu. [chris@f26s linux]$ git log --since="2016-01-01" --pretty=format:"%an %ae" --no-merges -- fs/btrfs | sort -u | grep suse Borislav Petkov bp@suse.de David Sterba dsterba@suse.com David Sterba DSterba@suse.com David Sterba dsterba@suse.cz Edmund Nadolski enadolski@suse.com Filipe Manana fdmanana@suse.com Goldwyn Rodrigues rgoldwyn@suse.com Guoqing Jiang gqjiang@suse.com Jan Kara jack@suse.cz Jeff Mahoney jeffm@suse.com Jiri Kosina jkosina@suse.cz Mark Fasheh mfasheh@suse.de Michal Hocko mhocko@suse.com NeilBrown neilb@suse.com Nikolay Borisov nborisov@suse.com Petr Mladek pmladek@suse.com [chris@f26s linux]$ git log --since="2016-01-01" --pretty=format:"%an %ae" --no-merges -- fs/btrfs | sort -u | grep fujitsu Lu Fengqi lufq.fnst@cn.fujitsu.com Qu Wenruo quwenruo@cn.fujitsu.com Satoru Takeuchi takeuchi_satoru@jp.fujitsu.com Su Yue suy.fnst@cn.fujitsu.com Tsutomu Itoh t-itoh@jp.fujitsu.com Wang Xiaoguang wangxg.fnst@cn.fujitsu.com Xiaoguang Wang wangxg.fnst@cn.fujitsu.com Zhao Lei zhaolei@cn.fujitsu.com Over the past 18 months, it's about 100 Btrfs contributors, 71 ext4, 63 XFS. So all three have many contributors. That of course does not tell the whole story by any means. -- Chris Murphy ^ permalink raw reply [flat|nested] 63+ messages in thread
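As Chris notes, `sort -u` over name+email pairs counts David Sterba three times because his commits carry three different addresses. Git's own `.mailmap` facility is the canonical fix, but collapsing duplicates by normalized author name shows the idea; the entries below are a subset of the list above:

```python
# Collapse "Name email" contributor lines that differ only by email
# address or name casing, so one person is counted once.

entries = [
    "David Sterba dsterba@suse.com",
    "David Sterba DSterba@suse.com",
    "David Sterba dsterba@suse.cz",
    "Filipe Manana fdmanana@suse.com",
    "Qu Wenruo quwenruo@cn.fujitsu.com",
]

def unique_authors(lines):
    # Drop the trailing email token and case-fold the remaining name.
    return {" ".join(line.split()[:-1]).lower() for line in lines}

print(len(entries), len(unique_authors(entries)))  # → 5 3
```

Keying on the name alone has the opposite failure mode (two different people with the same name merge), which is exactly why `.mailmap` lets a project declare the mapping explicitly.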
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-17 19:00 ` Chris Murphy @ 2017-08-17 20:34 ` GWB 0 siblings, 0 replies; 63+ messages in thread From: GWB @ 2017-08-17 20:34 UTC (permalink / raw) To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Peter Grandi, Linux fs Btrfs Yep, and thank you to Suse, Fujitsu, and all the contributors. I suppose we can all be charitable when reading this from the Red Hat Whitepaper at: https://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf: << Red Hat GFS is the world's leading cluster file system for Linux. >> If that is GFS2, it is a different use case than Gluster (https://www.redhat.com/en/technologies/storage). So perhaps marketing might tweak that a little bit, maybe: << Red Hat GFS is the world's leading cluster file system for Linux for Oracle RAC Database Clustering. >> But you can see how Oracle might quibble with that. So Red Hat goes as far as it can in the Whitepaper: << Red Hat GFS simplifies the installation, configuration, and on-going maintenance of the SAN infrastructure necessary for Oracle RAC clustering. Oracle tables, log files, program files, and archive information can all be stored in GFS files, avoiding the complexity and difficulties of managing raw storage devices on a SAN while achieving excellent performance. >> Which avoids a comparison between, say, an Oracle SPARC server (probably made by Fujitsu) hosting Oracle RAC clusters on Solaris. Given the price of Oracle's SPARC servers, Red Hat may be as good as an Oracle RAC DB server can get for a price less than the annual budget of a small country. Well, great news, Austin and Chris, that clears it up for me, and now I know of yet another use case for btrfs as the dmu for Gluster. So, again, I'm not too worried about Red Hat deprecating btrfs, given the number of supporters and developers. If Oracle or Suse drops out, then I would worry. 
Gordon On Thu, Aug 17, 2017 at 2:00 PM, Chris Murphy <lists@colorremedies.com> wrote: > On Thu, Aug 17, 2017 at 5:47 AM, Austin S. Hemmelgarn > <ahferroin7@gmail.com> wrote: > >> Also, I don't think I've ever seen any patches posted from a Red Hat address >> on the ML, so I don't think they were really all that involved in >> development to begin with. > > Unfortunately the email domain doesn't tell the whole story who's > backing development, the company or the individual. > > [chris@f26s linux]$ git log --since=”2016-01-01” --pretty=format:"%an > %ae" --no-merges -- fs/btrfs | sort -u | grep redhat > Andreas Gruenbacher agruenba@redhat.com > David Howells dhowells@redhat.com > Eric Sandeen sandeen@redhat.com > Jeff Layton jlayton@redhat.com > Mike Christie mchristi@redhat.com > Miklos Szeredi mszeredi@redhat.com > $ > > > >> GFS and GlusterFS are different technologies, unless Red Hat's marketing >> department is trying to be actively deceptive. > > https://www.redhat.com/en/technologies/storage > > Seems very clear. I don't even see GFS or GFS2 on here. It's Gluster and Ceph. > > >> >> SUSE is also pretty actively involved in the development too, and I think >> Fujitsu is as well. > > > >>> >>> >>> I'm not too worried. I'll keep using btrfs as it is now, within the >>> limits of what it can consistently do, and do what I can to help >>> support the effort. I'm not a file system coder, but I very much >>> appreciate the enormous amount of work that goes into btrfs. >>> >>> Steady on, ButterFS people. Back now to cat videos. >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > Big bunch of SUSE contributions (yes David Sterba is counted three > times here), and Fujitsu. 
> > [chris@f26s linux]$ git log --since=”2016-01-01” --pretty=format:"%an > %ae" --no-merges -- fs/btrfs | sort -u | grep suse > Borislav Petkov bp@suse.de > David Sterba dsterba@suse.com > David Sterba DSterba@suse.com > David Sterba dsterba@suse.cz > Edmund Nadolski enadolski@suse.com > Filipe Manana fdmanana@suse.com > Goldwyn Rodrigues rgoldwyn@suse.com > Guoqing Jiang gqjiang@suse.com > Jan Kara jack@suse.cz > Jeff Mahoney jeffm@suse.com > Jiri Kosina jkosina@suse.cz > Mark Fasheh mfasheh@suse.de > Michal Hocko mhocko@suse.com > NeilBrown neilb@suse.com > Nikolay Borisov nborisov@suse.com > Petr Mladek pmladek@suse.com > > [chris@f26s linux]$ git log --since=”2016-01-01” --pretty=format:"%an > %ae" --no-merges -- fs/btrfs | sort -u | grep fujitsu > Lu Fengqi lufq.fnst@cn.fujitsu.com > Qu Wenruo quwenruo@cn.fujitsu.com > Satoru Takeuchi takeuchi_satoru@jp.fujitsu.com > Su Yue suy.fnst@cn.fujitsu.com > Tsutomu Itoh t-itoh@jp.fujitsu.com > Wang Xiaoguang wangxg.fnst@cn.fujitsu.com > Xiaoguang Wang wangxg.fnst@cn.fujitsu.com > Zhao Lei zhaolei@cn.fujitsu.com > > > Over the past 18 months, it's about 100 Btrfs contributors, 71 ext4, > 63 XFS. So all three have many contributors. That of course does not > tell the whole story by any means. > > -- > Chris Murphy ^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-16 13:12 ` Chris Mason ` (2 preceding siblings ...) 2017-08-16 14:01 ` Qu Wenruo @ 2017-08-16 16:44 ` Peter Grandi 3 siblings, 0 replies; 63+ messages in thread From: Peter Grandi @ 2017-08-16 16:44 UTC (permalink / raw) To: Linux fs Btrfs > We use the crcs to catch storage gone wrong, [ ... ] And that's an opportunistically feasible idea given that current CPUs can do that in real-time. > [ ... ] It's possible to protect against all three without COW, > but all solutions have their own tradeoffs and this is the setup > we chose. It's easy to trust and easy to debug and at scale that > really helps. Indeed all filesystem designs have pathological workloads, and system administrators and applications developers who are "more prepared" know which one is best for which workload, or try to figure it out. > Some databases also crc, and all drives have correction bits of > some kind. There's nothing wrong with crcs happening at lots > of layers. Well, there is: in theory checksumming should be end-to-end, that is entirely application level, so applications that don't need it don't pay the price, but having it done at other layers can help the very many applications that don't do it and should do it, and it is cheap, and can help when troubleshooting exactly where the problem is. It is an opportunistic thing to do. > [ ... ] My real goal is to make COW fast enough that we can > leave it on for the database applications too. Obviously I > haven't quite finished that one yet ;) [ ... ] And this worries me because it portends the usual "marketing" goal of making Btrfs all things to all workloads, the "OpenStack of filesystems", with little consideration for complexity, maintainability, or even sometimes reality. 
The reality is that all known storage media have hugely anisotropic performance envelopes, as to functionality, cost, speed, and reliability, and there is no way to have an automagic filesystem that "just works" in all cases, despite the constant demands for one from "less prepared" storage administrators and application developers. The reality is also that if one such filesystem could automagically adapt to cover optimally the performance envelopes of every possible device and workload, it would be so complex as to be unmaintainable in practice.

So Btrfs, in its base "Rodeh" functionality, with COW, checksums, subvolumes, snapshots, *on a single device*, works pretty well and reliably, and is already very useful for most workloads. Some people also like some of its exotic complexities, like in-place compression and defragmentation, but they come at a high cost.

For workloads that inflict lots of small random in-place updates on storage, like tablespaces for DBMSes etc., perhaps simpler, less featureful storage abstraction layers are more appropriate, from OCFS2 to simple DM/LVM2 LVs, and Btrfs NOCOW approximates them well.

BTW, as to the specifics of DBMSes and filesystems, there is a classic paper making eminently reasonable, practical suggestions that have been ignored for some 35 years:

%A M. R. Stonebraker
%T Operating system support for database management
%J CACM
%V 24
%D JUL 1981
%P 412-418

^ permalink raw reply [flat|nested] 63+ messages in thread
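Chris Mason's point quoted in this message, that CRCs are cheap enough for current CPUs to verify inline, can be made concrete with a minimal sketch. This is illustrative Python using zlib's CRC-32 (btrfs itself uses crc32c in the kernel, and the function names here are hypothetical): checksum a block on write, re-verify on read, and surface a mismatch as an I/O error, which is what an application sees as EIO:

```python
import zlib

BLOCK_SIZE = 4096

def write_block(data: bytes) -> tuple[bytes, int]:
    # Keep the CRC alongside the block, much as a filesystem
    # keeps the csum in metadata separate from the data extent.
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_crc: int) -> bytes:
    # Re-verify on every read; a mismatch means the storage
    # returned something other than what was written.
    if zlib.crc32(data) != stored_crc:
        raise IOError("checksum mismatch (would surface as EIO)")
    return data

block, crc = write_block(b"\x00" * BLOCK_SIZE)
assert read_block(block, crc) == block   # clean read passes

corrupted = b"\x01" + block[1:]          # single flipped byte
try:
    read_block(corrupted, crc)
except IOError:
    print("corruption detected")         # prints "corruption detected"
```

The cost is one pass over the buffer per read, which modern CPUs do at memory bandwidth; this is why checksumming "at lots of layers" is affordable even when it is redundant.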
* Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? 2017-08-14 14:23 ` Austin S. Hemmelgarn 2017-08-14 15:13 ` Graham Cobb @ 2017-08-14 19:39 ` Christoph Anton Mitterer 1 sibling, 0 replies; 63+ messages in thread From: Christoph Anton Mitterer @ 2017-08-14 19:39 UTC (permalink / raw) To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 3244 bytes --]

On Mon, 2017-08-14 at 10:23 -0400, Austin S. Hemmelgarn wrote:
> Assume you have higher level verification. Would you rather not be
> able to read the data regardless of if it's correct or not, or be
> able to read it and determine yourself if it's correct or not?

What would be the difference here, then, to the CoW + checksumming + some-data-corruption case?! btrfs would also give EIO, and all these applications you mention would fail then. As I've said previously, one could provide end users with the means to still access the faulty data. Or they could simply mount with nochecksum.

> For almost anybody, the answer is going to be the second case,
> because the application knows better than the OS if the data is
> correct (and 'correct' may be a threshold, not some binary
> determination).

You've made that claim already once with VMs and DBs, and your claim proved simply wrong. Most applications don't do this kind of verification. And those that do probably rather just check whether the data is valid and, if not, give an error or at best fall back to some automatic backups (e.g. what package managers do).

I know of only a few programs that would really be capable of using data they know is bogus and recovering from that automagically... the only examples I know of are some archive formats which include error-correcting codes. And I really mean using the blocks for recovery for which the csum wouldn't verify (i.e. the ones that give an EIO)... without ECCs, how would a program know what to do with such data?

I cannot imagine that many people would choose the second option, to be honest.
Working with bogus data?! What would be the benefit of this?

> At that point, you need to make the checksum error a warning instead
> of returning -EIO. How do you intend to communicate that warning
> back to the application? The kernel log won't work, because on any
> reasonably secure system it's not visible to anyone but root.

Still the same problem as with CoW + any data corruption...

> There's also no side channel for the read() system calls that you
> can utilize. That then means that the checksums end up just being a
> means for the administrator to know some data wasn't written
> correctly, but they should know that anyway because the system
> crashed.

No, they'd have no idea if any / which data was written during the crash.

> Looking at this from a different angle: Without background, what
> would you assume the behavior to be for this? For most people, the
> assumption would be that this provides the same degree of data
> safety that the checksums do when the data is CoW.

I don't think the average user would have any such assumption. Most people likely don't even know that there is implicitly no checksumming if nodatacow is enabled. What people may however have heard is that btrfs does do checksumming, and they'd assume that their filesystem always gives them just valid data (or an error)... and IMO that's actually what every modern fs should do by default. Relying on higher levels to provide such means is simply not realistic.

Cheers,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5930 bytes --]

^ permalink raw reply [flat|nested] 63+ messages in thread
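The "higher level verification" being argued about in this exchange can be made concrete: an application that stores a digest next to its payload and refuses to hand back bytes that fail verification is doing at application level exactly what btrfs csums do below it, and without error-correcting codes, refusing (the EIO behaviour) is indeed all it can do. A minimal illustrative sketch, with hypothetical function names:

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes

def seal(payload: bytes) -> bytes:
    # Prepend a SHA-256 digest so a reader can verify end-to-end.
    return hashlib.sha256(payload).digest() + payload

def unseal(record: bytes) -> bytes:
    digest, payload = record[:DIGEST_LEN], record[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        # A plain checksum can only detect, not repair: without an
        # ECC there is nothing useful to do with bogus data except
        # refuse it -- the application-level equivalent of EIO.
        raise IOError("payload failed verification")
    return payload

record = seal(b"important data")
assert unseal(record) == b"important data"

tampered = record[:-1] + bytes([record[-1] ^ 0xFF])
try:
    unseal(tampered)
except IOError:
    print("refused bogus data")   # prints "refused bogus data"
```

Recovering rather than refusing would require adding redundancy (parity blocks, Reed-Solomon, etc.) on top of the digest, which is what the archive formats mentioned above do and most applications do not.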
end of thread, other threads:[~2017-08-17 20:34 UTC | newest] Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-08-02 8:38 RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut? Brendan Hide 2017-08-02 9:11 ` Wang Shilong 2017-08-03 19:18 ` Chris Murphy 2017-08-02 11:25 ` Austin S. Hemmelgarn 2017-08-02 12:55 ` Lutz Vieweg 2017-08-02 13:47 ` Austin S. Hemmelgarn 2017-08-02 18:44 ` Chris Mason 2017-08-02 22:12 ` Fajar A. Nugraha 2017-08-02 22:22 ` Chris Murphy 2017-08-03 9:59 ` Lutz Vieweg 2017-08-03 18:08 ` waxhead 2017-08-03 18:29 ` Christoph Anton Mitterer 2017-08-03 19:22 ` Austin S. Hemmelgarn 2017-08-03 20:45 ` Brendan Hide 2017-08-03 22:00 ` Chris Murphy 2017-08-04 11:26 ` Austin S. Hemmelgarn 2017-08-03 19:03 ` Austin S. Hemmelgarn 2017-08-04 9:48 ` Duncan 2017-08-16 18:07 ` David Sterba 2017-08-04 14:05 ` Qu Wenruo 2017-08-04 23:55 ` Wang Shilong 2017-08-07 15:27 ` Chris Murphy 2017-08-10 0:35 ` Qu Wenruo 2017-08-12 0:10 ` Christoph Anton Mitterer 2017-08-12 7:42 ` Christoph Hellwig 2017-08-12 11:51 ` Christoph Anton Mitterer 2017-08-12 12:12 ` Hugo Mills 2017-08-13 14:08 ` Goffredo Baroncelli 2017-08-14 7:08 ` Qu Wenruo 2017-08-14 14:23 ` Goffredo Baroncelli 2017-08-14 19:08 ` Chris Murphy 2017-08-14 20:27 ` Goffredo Baroncelli 2017-08-14 6:36 ` Qu Wenruo 2017-08-14 7:43 ` Paul Jones 2017-08-14 7:46 ` Qu Wenruo 2017-08-14 12:32 ` Christoph Anton Mitterer 2017-08-14 12:58 ` Qu Wenruo 2017-08-14 12:24 ` Christoph Anton Mitterer 2017-08-14 14:23 ` Austin S. Hemmelgarn 2017-08-14 15:13 ` Graham Cobb 2017-08-14 15:53 ` Austin S. Hemmelgarn 2017-08-14 16:42 ` Graham Cobb 2017-08-14 19:54 ` Christoph Anton Mitterer 2017-08-15 11:37 ` Austin S. Hemmelgarn 2017-08-15 14:41 ` Christoph Anton Mitterer 2017-08-15 15:43 ` Austin S. Hemmelgarn 2017-08-16 13:12 ` Chris Mason 2017-08-16 13:31 ` Christoph Anton Mitterer 2017-08-16 13:53 ` Austin S. 
Hemmelgarn 2017-08-16 14:11 ` Christoph Anton Mitterer 2017-08-16 15:07 ` Austin S. Hemmelgarn 2017-08-16 17:26 ` Peter Grandi 2017-08-16 18:19 ` David Sterba 2017-08-16 16:54 ` Peter Grandi 2017-08-16 13:56 ` Austin S. Hemmelgarn 2017-08-16 14:01 ` Qu Wenruo 2017-08-16 19:52 ` Chris Murphy 2017-08-17 6:25 ` GWB 2017-08-17 11:47 ` Austin S. Hemmelgarn 2017-08-17 19:00 ` Chris Murphy 2017-08-17 20:34 ` GWB 2017-08-16 16:44 ` Peter Grandi 2017-08-14 19:39 ` Christoph Anton Mitterer