From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: Feature Req: "mkfs.btrfs -d dup" option on single device
Date: Wed, 11 Dec 2013 10:56:12 +0000 (UTC)

Chris Murphy posted on Tue, 10 Dec 2013 17:33:59 -0700 as excerpted:

> On Dec 10, 2013, at 5:14 PM, Imran Geriskovan wrote:
>
>>> Current btrfs-progs is v3.12. 0.19 is a bit old. But yes, looks like
>>> the wiki also needs updating.
>>
>>> Anyway I just tried it on an 8GB stick and it works, but -M (mixed
>>> data+metadata) is required, which documentation also says incurs a
>>> performance hit, although I'm uncertain of the significance.
>>
>> btrfs-tools 0.19+20130705 is the most recent one on Debian's leading
>> edge Sid/Unstable.

[I was debating where to reply, and chose here.]

To be fair, that's a snapshot date tag: 0.19 plus 2013-07-05, the date the git snapshot was taken, which isn't /that/ old, particularly for something like Debian. There was a 0.20-rc1 about this time last year (Nov/Dec-ish 2012), but I guess Debian's date tags don't take rcs into account.

That said, as the wiki states, btrfs is still under /heavy/ development, and anyone using it at this point is by definition a development filesystem tester. Such testers are strongly recommended to keep current with both the kernel and the btrfs-progs userspace, for two reasons: not doing so unnecessarily exposes whatever they're testing to already known and fixed bugs, and if things /do/ go wrong, reports from outdated versions are little more than distracting noise if the bug is already fixed, and simply aren't as useful if it remains unfixed.

Since btrfs-progs repo policy is that the master branch is always kept release-ready (development happens on other branches and is merged to master only when it's considered release-ready), ideally all testers would run a current git build -- either built themselves, or, for distros that choose to package a development/testing product like btrfs, built and updated by the distro on a weekly or monthly basis. Of course that flies in the face of normal distro stabilization policies, but the point is, btrfs is NOT a normal distro-stable package. Distros that choose to ship it are by definition choosing to package a development package for their users to test /as/ a development package, and should update it accordingly.

And Debian or not Debian, a development-status package last updated in July, when it's now December and there have been significant changes since July... might not be /that/ old in Debian terms, but it certainly isn't current, either!
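For anyone wanting to do that, building current btrfs-progs from git is only a few commands. The below is just a sketch from memory (check the wiki for the canonical repo URL, and IIRC you need the usual toolchain plus the uuid/blkid/lzo dev headers):

  $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
  $ cd btrfs-progs
  $ make
  $ sudo make install   # check the Makefile's prefix first so you don't clobber the distro package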
>> Given the state of the docs probably very few or no people ever used
>> '-d dup'. As being the lead developer, is it possible for you to
>> provide some insights for the reliability of this option?
>
> I'm not a developer, I'm just an ape who wears pants. Chris Mason is
> the lead developer. All I can say about it is that it's been working
> for me OK so far.

Lest anyone finding this thread in google or the like think otherwise, it's probably worthwhile to emphasize that with a separate post... which I just did.

>> Can '-M' requirement be an indication of code which has not been
>> ironed out, or is it simply a constraint of the internal machinery?
>
> I think it's just how chunks are allocated it becomes space inefficient
> to have two separate metadata and data chunks, hence the requirement to
> mix them if -d dup is used. But I'm not really sure.

AFAIK, duplicated data without RAID simply wasn't considered a reasonable use-case. I'd certainly consider it one here, in particular because I *DO* use the data-integrity and scrub features, but I'm actually using dual physical devices (SSDs in my case) in raid1 mode instead. The fact that mixed data/metadata mode allows it is thus somewhat of an accident, more than a planned feature.

FWIW I had tried btrfs some time ago, then left as I decided it wasn't mature enough for my use-case at the time, and came back just in time to see mixed-mode going in. Until mixed-mode, btrfs had quite some issues on 1-gig or smaller partitions, as the pre-allocated separate data and metadata chunks simply didn't tend to balance out that well, and one or the other would be used up very fast, leaving the filesystem more or less useless for further writes. Mixed data/metadata mode was added as an effective way of countering that problem, and in fact I've been quite pleased with how it has worked here on my smaller partitions.

My /boot is 256 MiB. I have one of those in dup mode (meaning both data and metadata dup, since it's mixed-mode) on each of my otherwise btrfs raid1-mode SSDs, allowing me to select and boot either one from the BIOS. That gives me an effective backup of what would otherwise not be easily and effectively backup-able, since bootloaders tend to allow pointing at only one such location. (Tho with grub2 on GPT-partitioned devices with a BIOS-reserved partition for grub2, that's not the issue it tended to be on MBR: grub2 should still come up with its rescue-mode shell even if it can't find the /boot it'd normally load the normal shell from, and the rescue shell can be pointed at a different /boot. But then the same question applies to the grub installed in that BIOS partition, for which a second device with its own grub2 installed to its own BIOS partition is still the best backup.)

My /var/log is mixed-mode too, but btrfs raid1, 640 MiB -- about half a GiB, rounded up a bit and positioned so that all later partitions on the device start at an even GiB boundary.

In fact, I only recently realized the DUP-mode implications of mixed-mode on the /boot partitions myself, when I went to scrub them and thought "Oh, but they're not raid1, so scrub won't work on the data." Except that it did, because mixed-mode made the data as well as the metadata DUP.
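For anyone wanting to try the same thing, creating such a filesystem looks roughly like the below. A sketch only -- the device and mountpoint are placeholders, mkfs wipes the device, and exact option handling may vary between btrfs-progs versions:

  # mixed data+metadata chunks (-M), both kept as DUP, on one small device
  mkfs.btrfs -M -d dup -m dup /dev/sdX3
  mount /dev/sdX3 /mnt/boot
  btrfs filesystem df /mnt/boot   # should show a combined "Data+Metadata, DUP" line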
>> How well does the main idea of "Guaranteed Data Integrity for extra
>> reliability" and the option "-d dup" in its current state match?
>
> Well given that Btrfs is still flagged as experimental, most notably
> when creating any Btrfs file system, I'd say that doesn't apply here.

Good point! =:^)

Tho the "experimental" level was recently officially toned down a notch, with a change to the kernel's btrfs option description: it now says the btrfs on-device format is considered reasonably stable and will change only if absolutely necessary, and then only in such a way that new kernels will remain able to mount filesystems in the old on-device format. But it's still not a fully complete and well-tested filesystem, and it remains under very heavy development, with fixes in every kernel series.

Meanwhile, it can be pointed out that there's currently no built-in way to access data that fails its checksum. If there's no valid second copy around due to raid1 or dup mode, or if that second copy fails its checksum as well, you're SOL. (And note there's ONLY one additional copy ATM, no way to add further redundancy, tho N-way mirroring is planned after raid5/6, the currently focused in-development feature, is completed.)

That's guaranteed data integrity: if the data can be accessed, its integrity is guaranteed by the checksums; if they fail, the data simply can no longer be accessed. (Of course there are the nodatasum and nodatacow mount options which turn that off, and the NOCOW file attribute, which I believe turns off checksumming as well. Those are recommended for large-and-frequently-internally-rewritten-file use-cases such as VM images, but they aren't the defaults.)

But while that might be guaranteed integrity, it's definitely NOT "extra reliability", at least in the actually-accessible-data sense, if you can't access the data AT ALL without a second copy around -- which isn't available on a single device without data-dup mode.

That was one reason I went multi-device and btrfs raid1 mode. And I'd be much more comfortable if that N-way-mirroring feature were available and working as well. I'd probably limit it to three-way, but I would definitely rest more comfortably with that three-way! But given btrfs' development status, and thus the limits to trusting any such feature ATM, I think we're talking future-stable-btrfs as much as current-development btrfs. Three-way is definitely planned, and I agree with the premise of the original post as well: there's a definite use-case for DUP mode (and even TRIPL mode, etc.) on a single partition, too.

> If the case you're trying to mitigate is some kind of corruption that
> can only be repaired if you have at least one other copy of data, then
> -d dup is useful. But obviously this ignores the statistically greater
> chance of a more significant hardware failure, as this is still single
> device.

I'd take issue with the "statistically greater" assumption you make. Perhaps in theory, and arguably in UPS-backed always-on scenarios as well, but I've had personal experience with failed checksums and subsequent scrubs here on my raid1-mode btrfs that were NOT hardware faults, on quite new SSDs that I'd be VERY unhappy with if they /did/ actually start generating hardware faults.
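Those scrubs, for reference, are just the standard btrfs scrub run against the mounted filesystem, something like the below (mountpoint is a placeholder); where a raid1 or dup second copy exists, a block that fails checksum verification gets rewritten from the good copy:

  btrfs scrub start -B /mnt    # -B stays in the foreground and prints a summary when done
  btrfs scrub status /mnt      # corrected vs. uncorrectable error counts afterward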
In my case it's a variant of the unsafe-shutdown scenario. In particular, my SSDs take a bit to stabilize after first power-on, and one typically appears and is ready to take commands some seconds before the other.

Now, the kernel does have the rootwait commandline option to wait for devices to appear, and between that and the initramfs I have to use in order for a btrfs raid1-mode rootfs to mount properly (apparently rootflags=device=/dev/whatever,device=/dev/whatever2 doesn't parse properly, I'd guess due to splitting at the wrong equals sign, and without an initramfs I have to mount degraded -- or at least I did a few kernels ago when I set things up), actual bootup works fine.

But suspend2ram apparently doesn't use the same rootwait mechanism, and if I leave my system in suspend2ram for more than a few hours (I'd guess however long it takes for the SSDs' caps to drain enough that they take too long to stabilize again), then on resume one of the devices appears first and the system tries to resume with only it, without the other device having shown up yet.

Unfortunately, btrfs raid1 mode doesn't yet cope particularly well with runtime (well, here, resume-time) device loss, and open-for-write files such as ~/.xsession-errors and /var/log/* start triggering errors almost immediately after resume, forcing the filesystems read-only and forcing an only semi-graceful reboot without properly closing those still-open-for-writing-but-can't-be-written files.

Fortunately, those files are on independent btrfs non-root filesystems, and my also-btrfs rootfs remains mounted read-only in normal operation, so there's very little chance of damage to the core system on the rootfs. Only /home and /var/log are normally mounted writable (plus the tmpfs-based /tmp, /run... of course; /var/lib and a few other things that need to be writable and retained over a reboot are symlinked to subdirs in /home).

And the writable filesystems have always remained mountable; they just have errors due to the abrupt device drop and the subsequent forced-read-only of the remaining device with open-for-write files. A scrub normally fixes them, altho in one case recently it "fixed" both my user's .xsession-errors and .bash_history files to entire unreadability -- any attempt to read either one, even with cat, would lock up userspace (magic-sysrq would still work, so the kernel wasn't entirely dead, but no userspace output). So scrub didn't save the files that time, even if it did apparently fix the metadata. I couldn't log in, even on a non-X VT, as that user until I deleted .bash_history, and once I had, I could log in non-X, but attempting to startx would again give me an unresponsive userspace until I deleted .xsession-errors as well.
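For anyone hitting the same sort of resume-time device drop, the cleanup here after rebooting is basically a remount and a scrub. Roughly (device and mountpoint are placeholders; degraded mode is only for when a member device really is still missing):

  btrfs device scan              # make sure the kernel knows about all member devices
  mount /dev/sdY5 /home          # or: mount -o degraded /dev/sdY5 /home
  btrfs scrub start -B /home     # rewrite stale/bad copies from the good mirror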
Needless to say, I've quit using suspend2ram for anything that might be longer than say four hours. Unfortunately, suspend2disk aka hibernate didn't work on this machine last I tried it (it hibernated, but resume would fail -- tho that was when I first set up the machine over a year ago now, and I really should try it again with a current kernel), so all I have is reboot. Tho with SSDs for the main system that's not so bad. And with it being winter here, the heat from a running system isn't entirely wasted, so for now I can leave the system on when I'd normally suspend2ram it -- unlike the 8-9 months out of the year when I'm paying for any computer energy used twice, once to use it and again to pump it outside with the AC, here in Phoenix.

So the point of all that... data corruption isn't necessarily rarer than single-device hardware failure at all. (Obviously in my case the fact that it's dual devices in btrfs raid1 mode was a big part of the trigger; that wouldn't apply in single-device cases.)

But there are other real-world corruption cases too, including simple ungraceful shutdowns that could well trigger the same sort of issues on a single device, and that for a LOT of people are far more likely than storage-device hardware failure. So there's a definite use-case for single-device DUP/TRIPL/... mode, particularly since that's what's required to actually make practical use of scrub, and thus of the actual available reliability side of the btrfs data-integrity feature.

> Not only could the entire single device fail, but it's possible
> that erase blocks individually fail. And since the FTL decides where
> pages are stored, the duplicate data/metadata copies could be stored in
> the same erase block. So there is a failure vector other than full
> failure where some data can still be lost on a single device even with
> duplicate, or triplicate copies.

That's actually the reason btrfs defaults to SINGLE metadata mode on single-device SSD-backed filesystems, as well.

But as Imran points out, SSDs aren't all there is. There's still spinning rust around. And defaults aside, even on SSDs it should be /possible/ to specify data-dup mode, because there are enough different SSD variants and enough different use-cases that it's surely going to be useful some of the time, to someone. =:^)

And btrfs still being in development means it's a good time to make the request, before it's stabilized without data-dup mode, and possibly without the ability to easily add it, because nobody thought the case was viable and thus it wasn't planned for before btrfs went stable. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman