* btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT @ 2014-01-03 22:28 Jim Salter 2014-01-03 22:42 ` Emil Karlson 2014-01-03 22:43 ` Joshua Schüler 0 siblings, 2 replies; 40+ messages in thread From: Jim Salter @ 2014-01-03 22:28 UTC (permalink / raw) To: linux-btrfs I'm using Ubuntu 12.04.3 with an up-to-date 3.11 kernel, and the btrfs-progs from Debian Sid (since the ones from Ubuntu are ancient). I discovered to my horror during testing today that neither raid1 nor raid10 arrays are fault tolerant of losing an actual disk. mkfs.btrfs -d raid10 -m raid10 /dev/vdb /dev/vdc /dev/vdd /dev/vde mkdir /test mount /dev/vdb /test echo "test" > /test/test btrfs filesystem sync /test shutdown -hP now After shutting down the VM, I can remove ANY of the drives from the btrfs raid10 array, and be unable to mount the array. In this case, I removed the drive that was at /dev/vde, then restarted the VM. btrfs fi show Label: none uuid: 94af1f5d-6ad2-4582-ab4a-5410c410c455 Total devices 4 FS bytes used 156.00KB devid 3 size 1.00GB used 212.75MB path /dev/vdd devid 2 size 1.00GB used 212.75MB path /dev/vdc devid 1 size 1.00GB used 232.75MB path /dev/vdb *** Some devices missing OK, we have three of four raid10 devices present. Should be fine. Let's mount it: mount -t btrfs /dev/vdb /test mount: wrong fs type, bad option, bad superblock on /dev/vdb, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so What's the kernel log got to say about it? dmesg | tail -n 4 [ 536.694363] device fsid 94af1f5d-6ad2-4582-ab4a-5410c410c455 devid 1 transid 7 /dev/vdb [ 536.700515] btrfs: disk space caching is enabled [ 536.703491] btrfs: failed to read the system array on vdd [ 536.708337] btrfs: open_ctree failed Same behavior persists whether I create a raid1 or raid10 array, and whether I create it as that raid level using mkfs.btrfs or convert it afterwards using btrfs balance start -dconvert=raidn -mconvert=raidn. 
Also persists even if I both scrub AND sync the array before shutting the machine down and removing one of the disks. What's up with this? This is a MASSIVE bug, and I haven't seen anybody else talking about it... has nobody tried actually failing out a disk yet, or what? ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 22:28 btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT Jim Salter @ 2014-01-03 22:42 ` Emil Karlson 2014-01-03 22:43 ` Joshua Schüler 1 sibling, 0 replies; 40+ messages in thread From: Emil Karlson @ 2014-01-03 22:42 UTC (permalink / raw) To: Jim Salter; +Cc: Linux Btrfs > mount -t btrfs /dev/vdb /test > mount: wrong fs type, bad option, bad superblock on /dev/vdb, > missing codepage or helper program, or other error > In some cases useful info is found in syslog - try > dmesg | tail or so IIRC you need mount option degraded here. ^ permalink raw reply [flat|nested] 40+ messages in thread
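Emil's suggestion, spelled out against the example above (a sketch; device names and mountpoint are taken from Jim's report, and the commands need root):

```shell
# A normal mount of an array with a missing member fails with
# "open_ctree failed"; -o degraded tells btrfs to proceed anyway.
mount -t btrfs -o degraded /dev/vdb /test

# Confirm which device is absent before doing anything else:
btrfs fi show
```

With the filesystem mounted degraded, the data is accessible again, but redundancy is not restored until the missing device is dealt with (see the device-replacement steps later in the thread).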
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 22:28 btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT Jim Salter 2014-01-03 22:42 ` Emil Karlson @ 2014-01-03 22:43 ` Joshua Schüler 2014-01-03 22:56 ` Jim Salter 1 sibling, 1 reply; 40+ messages in thread From: Joshua Schüler @ 2014-01-03 22:43 UTC (permalink / raw) To: jim; +Cc: linux-btrfs Am 03.01.2014 23:28, schrieb Jim Salter: > I'm using Ubuntu 12.04.3 with an up-to-date 3.11 kernel, and the > btrfs-progs from Debian Sid (since the ones from Ubuntu are ancient). > > I discovered to my horror during testing today that neither raid1 nor > raid10 arrays are fault tolerant of losing an actual disk. > [snip] > What's up with this? This is a MASSIVE bug, and I haven't seen anybody > else talking about it... has nobody tried actually failing out a disk > yet, or what? Hey Jim, keep calm and read the wiki ;) https://btrfs.wiki.kernel.org/ You need to mount with -o degraded to tell btrfs a disk is missing. Joshua ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 22:43 ` Joshua Schüler @ 2014-01-03 22:56 ` Jim Salter 2014-01-03 23:04 ` Hugo Mills ` (3 more replies) 0 siblings, 4 replies; 40+ messages in thread From: Jim Salter @ 2014-01-03 22:56 UTC (permalink / raw) To: Joshua Schüler; +Cc: linux-btrfs I actually read the wiki pretty obsessively before blasting the list - I could not find anything answering the question by scanning the FAQ or by Googling. You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine. HOWEVER - this won't allow a root filesystem to mount. How do you deal with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root filesystem? Few things are scarier than seeing the "cannot find init" message in GRUB and being faced with a BusyBox prompt... which is actually how I initially got my scare: I was trying to do a walkthrough for setting up a raid1 / for an article in a major online magazine, and it wouldn't boot at all after removing a device; I backed off and tested with a non-root filesystem before hitting the list. I did find the -o degraded argument in the wiki now that you mentioned it - but it's not prominent enough if you ask me. =) On 01/03/2014 05:43 PM, Joshua Schüler wrote: > Am 03.01.2014 23:28, schrieb Jim Salter: >> [snip] > Hey Jim, > > keep calm and read the wiki ;) > https://btrfs.wiki.kernel.org/ > > You need to mount with -o degraded to tell btrfs a disk is missing. > > Joshua ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 22:56 ` Jim Salter @ 2014-01-03 23:04 ` Hugo Mills 2014-01-03 23:04 ` Joshua Schüler ` (2 subsequent siblings) 3 siblings, 0 replies; 40+ messages in thread From: Hugo Mills @ 2014-01-03 23:04 UTC (permalink / raw) To: Jim Salter; +Cc: Joshua Schüler, linux-btrfs On Fri, Jan 03, 2014 at 05:56:42PM -0500, Jim Salter wrote: > I actually read the wiki pretty obsessively before blasting the list > - could not successfully find anything answering the question, by > scanning the FAQ or by Googling. > > You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine. > > HOWEVER - this won't allow a root filesystem to mount. How do you > deal with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your > root filesystem? Few things are scarier than seeing the "cannot find > init" message in GRUB and being faced with a BusyBox prompt... Use grub's command-line editing to add rootflags=degraded to it. Hugo. > [snip] -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Eighth Army Push Bottles Up Germans -- WWII newspaper --- headline (possibly apocryphal) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 22:56 ` Jim Salter 2014-01-03 23:04 ` Hugo Mills @ 2014-01-03 23:04 ` Joshua Schüler 2014-01-03 23:13 ` Jim Salter 2014-01-03 23:19 ` Chris Murphy [not found] ` <CAOjFWZ7zC3=4oH6=SBZA+PhZMrSK1KjxoRN6L2vqd=GTBKKTQA@mail.gmail.com> 3 siblings, 1 reply; 40+ messages in thread From: Joshua Schüler @ 2014-01-03 23:04 UTC (permalink / raw) To: Jim Salter; +Cc: linux-btrfs Am 03.01.2014 23:56, schrieb Jim Salter: > I actually read the wiki pretty obsessively before blasting the list - > could not successfully find anything answering the question, by scanning > the FAQ or by Googling. > > You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine. don't forget to btrfs device delete missing <path> See https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices > > HOWEVER - this won't allow a root filesystem to mount. How do you deal > with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root > filesystem? Few things are scarier than seeing the "cannot find init" > message in GRUB and being faced with a BusyBox prompt... which is > actually how I initially got my scare; I was trying to do a walkthrough > for setting up a raid1 / for an article in a major online magazine and > it wouldn't boot at all after removing a device; I backed off and tested > with a non root filesystem before hitting the list. Add -o degraded to the boot-options in GRUB. If your filesystem is more heavily corrupted then you either need the btrfs tools in your initrd or a rescue cd > > I did find the -o degraded argument in the wiki now that you mentioned > it - but it's not prominent enough if you ask me. =) > > [snip] Joshua ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:04 ` Joshua Schüler @ 2014-01-03 23:13 ` Jim Salter 2014-01-03 23:18 ` Hugo Mills 2014-01-03 23:22 ` Chris Murphy 0 siblings, 2 replies; 40+ messages in thread From: Jim Salter @ 2014-01-03 23:13 UTC (permalink / raw) To: Joshua Schüler; +Cc: linux-btrfs Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still kinda black magic to me, and I don't think I'm supposed to be editing it directly at all anymore anyway, if I remember correctly... >> HOWEVER - this won't allow a root filesystem to mount. How do you deal >> with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root >> filesystem? Few things are scarier than seeing the "cannot find init" >> message in GRUB and being faced with a BusyBox prompt... which is >> actually how I initially got my scare; I was trying to do a walkthrough >> for setting up a raid1 / for an article in a major online magazine and >> it wouldn't boot at all after removing a device; I backed off and tested >> with a non root filesystem before hitting the list. > Add -o degraded to the boot-options in GRUB. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:13 ` Jim Salter @ 2014-01-03 23:18 ` Hugo Mills 2014-01-03 23:25 ` Jim Salter 2014-01-03 23:22 ` Chris Murphy 1 sibling, 1 reply; 40+ messages in thread From: Hugo Mills @ 2014-01-03 23:18 UTC (permalink / raw) To: Jim Salter; +Cc: Joshua Schüler, linux-btrfs On Fri, Jan 03, 2014 at 06:13:25PM -0500, Jim Salter wrote: > Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still > kinda black magic to me, and I don't think I'm supposed to be > editing it directly at all anymore anyway, if I remember > correctly... You don't need to edit grub.cfg -- when you boot, grub has an edit option, so you can do it at boot time without having to use a rescue disk. Regardless, the thing you need to edit is the line starting "linux", which will look something like this: linux /vmlinuz-3.11.0-rc2-dirty root=UUID=1b6ec419-211a-445e-b762-ae7da27b6e8a ro single rootflags=subvol=fs-root If there's a rootflags= option already (as above), add ",degraded" to the end. If there isn't, add "rootflags=degraded". Hugo. > [snip] -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Eighth Army Push Bottles Up Germans -- WWII newspaper --- headline (possibly apocryphal) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:18 ` Hugo Mills @ 2014-01-03 23:25 ` Jim Salter 2014-01-03 23:32 ` Chris Murphy 0 siblings, 1 reply; 40+ messages in thread From: Jim Salter @ 2014-01-03 23:25 UTC (permalink / raw) To: Hugo Mills, Joshua Schüler, linux-btrfs Yep - had just figured that out and successfully booted with it, and was in the process of typing up instructions for the list (and posterity). One thing that concerns me is that edits made directly to grub.cfg will get wiped out with every kernel upgrade when update-grub is run - any idea where I'd put this in /etc/grub.d to have a persistent change? I have to tell you, I'm not real thrilled with this behavior either way - it means I can't have the option to automatically mount degraded filesystems without the filesystems in question ALWAYS showing as being mounted degraded, whether the disks are all present and working fine or not. That's kind of blecchy. =\ On 01/03/2014 06:18 PM, Hugo Mills wrote: > On Fri, Jan 03, 2014 at 06:13:25PM -0500, Jim Salter wrote: >> Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still >> kinda black magic to me, and I don't think I'm supposed to be >> editing it directly at all anymore anyway, if I remember >> correctly... > You don't need to edit grub.cfg -- when you boot, grub has an edit > option, so you can do it at boot time without having to use a rescue > disk. > > Regardless, the thing you need to edit is the line starting > "linux", and will look something like this: > > linux /vmlinuz-3.11.0-rc2-dirty root=UUID=1b6ec419-211a-445e-b762-ae7da27b6e8a ro single rootflags=subvol=fs-root > > If there's a rootflags= option already (as above), add ",degraded" > to the end. If there isn't, add "rootflags=degraded". > > Hugo. > >>>> HOWEVER - this won't allow a root filesystem to mount. How do you deal >>>> with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root >>>> filesystem? 
Few things are scarier than seeing the "cannot find init" >>>> message in GRUB and being faced with a BusyBox prompt... which is >>>> actually how I initially got my scare; I was trying to do a walkthrough >>>> for setting up a raid1 / for an article in a major online magazine and >>>> it wouldn't boot at all after removing a device; I backed off and tested >>>> with a non root filesystem before hitting the list. >>> Add -o degraded to the boot-options in GRUB. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:25 ` Jim Salter @ 2014-01-03 23:32 ` Chris Murphy 0 siblings, 0 replies; 40+ messages in thread From: Chris Murphy @ 2014-01-03 23:32 UTC (permalink / raw) To: Jim Salter; +Cc: Hugo Mills, Joshua Schüler, linux-btrfs On Jan 3, 2014, at 4:25 PM, Jim Salter <jim@jrs-s.net> wrote: > > One thing that concerns me is that edits made directly to grub.cfg will get wiped out with every kernel upgrade when update-grub is run - any idea where I'd put this in /etc/grub.d to have a persistent change? /etc/default/grub I don't recommend making it persistent. At this stage of development, a disk failure should cause mount failure so you're alerted to the problem. > I have to tell you, I'm not real thrilled with this behavior either way - it means I can't have the option to automatically mount degraded filesystems without the filesystems in question ALWAYS showing as being mounted degraded, whether the disks are all present and working fine or not. That's kind of blecchy. =\ If you need something that comes up degraded automatically by design as a supported use case, use md (or possibly LVM which uses different user space tools and monitoring but uses the md kernel driver code and supports raid 0,1,5,6 - quite nifty). I haven't tried this yet, but I think that's also supported with the thin provisioning work, which even if you don't use thin provisioning gets you the significantly more efficient snapshot behavior. Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
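For the record, making the flag persistent on a Debian/Ubuntu-style system would mean editing /etc/default/grub, as Chris says, rather than grub.cfg itself (shown only as a sketch; per the caveats above this is probably unwise):

```shell
# In /etc/default/grub, append the flag to the kernel command line, e.g.:
#   GRUB_CMDLINE_LINUX="rootflags=degraded"
# then regenerate grub.cfg so the change survives kernel upgrades:
update-grub    # Debian/Ubuntu wrapper around grub-mkconfig -o /boot/grub/grub.cfg
```

Because update-grub rebuilds grub.cfg from /etc/default/grub and the /etc/grub.d scripts, anything placed there survives kernel package upgrades, which is exactly what direct grub.cfg edits do not.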
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:13 ` Jim Salter 2014-01-03 23:18 ` Hugo Mills @ 2014-01-03 23:22 ` Chris Murphy 2014-01-04 6:10 ` Duncan 1 sibling, 1 reply; 40+ messages in thread From: Chris Murphy @ 2014-01-03 23:22 UTC (permalink / raw) To: Btrfs BTRFS On Jan 3, 2014, at 4:13 PM, Jim Salter <jim@jrs-s.net> wrote: > Sorry - where do I put this in GRUB? /boot/grub/grub.cfg is still kinda black magic to me, and I don't think I'm supposed to be editing it directly at all anymore anyway, if I remember correctly… Don't edit the grub.cfg directly. At the grub menu, only highlight the entry you want to boot, then hit 'e', and then edit the existing linux/linuxefi line. If you already have rootfs on a subvolume, you'll have an existing parameter on that line rootflags=subvol=<rootname> and you can change this to rootflags=subvol=<rootname>,degraded I would not make this option persistent by putting it permanently in the grub.cfg; although I don't know the consequence of always mounting with degraded even if not necessary it could have some negative effects (?) Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:22 ` Chris Murphy @ 2014-01-04 6:10 ` Duncan 2014-01-04 11:20 ` Chris Samuel ` (2 more replies) 0 siblings, 3 replies; 40+ messages in thread From: Duncan @ 2014-01-04 6:10 UTC (permalink / raw) To: linux-btrfs Chris Murphy posted on Fri, 03 Jan 2014 16:22:44 -0700 as excerpted: > I would not make this option persistent by putting it permanently in the > grub.cfg; although I don't know the consequence of always mounting with > degraded even if not necessary it could have some negative effects (?) Degraded only actually does anything if it's actually needed. On a normal array it'll be a NOOP, so should be entirely safe for /normal/ operation, but that doesn't mean I'd /recommend/ it for normal operation, since it bypasses checks that are there for a reason, thus silently bypassing information that an admin needs to know before he boots it anyway, in order to recover. However, I've some other comments to add: 1) Like you, I'm uncomfortable with the whole idea of adding degraded permanently at this point. Mention was made of having to drive down to the data center and actually stand in front of the box if something goes wrong, otherwise. At the moment, for btrfs' development state at this point, fine. Btrfs remains under development and there are clear warnings about using it without backups one hasn't tested recovery from or are not otherwise prepared to actually use. It's stated in multiple locations on the wiki; it's stated on the kernel btrfs config option, and it's stated in mkfs.btrfs output when you create the filesystem. If after all that people are using it in a remote situation where they're not prepared to drive down to the data center and stab at the keys if they have to, they're using possibly the right filesystem, but at too early a point in its development for their needs at this moment. 
2) As the wiki explains, certain configurations require at least a minimum number of devices in order to work "undegraded". The example given in the OP was of a 4-device raid10, already the minimum number to work undegraded, with one device dropped out, to below the minimum required number to mount undegraded, so of /course/ it wouldn't mount without that option. If five or six devices had been used, a device could have been dropped and the remaining number of devices would still be greater than or equal to the minimum number of devices to run an undegraded raid10, and the result would likely have been different, since there are still enough devices to mount writable with proper redundancy, even if existing information doesn't have that redundancy until a rebalance is done to take care of the missing device. Similarly with a raid1 and its minimum two devices. Configure with three, then drop one, and it should still work as it's above the two-device minimum for a raid1 configuration. Configure with two and drop one, and you'll have to mount degraded (and it'll drop to read-only if it happens in operation) since there's no second device to write the second copy to, as required by raid1. 3) Frankly, this whole thread smells of going off half-cocked, posting before doing the proper research. I know when I took a look at btrfs here, I read up on the wiki, reading the multiple devices stuff, the faq, the problem faq, the gotchas, the use cases, the sysadmin guide, the getting started and mount options... loading the pages multiple times as I followed links back and forth between them. Because I care about my data and want to understand what I'm doing with it before I do it! And even now I often reread specific parts as I'm trying to help others with questions on this list.... 
Then I still had some questions about how it worked that I couldn't find answers for on the wiki, and as traditional with mailing lists and newsgroups before them, I read several weeks worth of posts (on an archive for lists) before actually posting my questions, to see if they were FAQs already answered on the list. Then and only then did I post the questions to the list, and when I did, it was, "Questions I haven't found answers for on the wiki or list", not "THE WORLD IS GOING TO END, OH NOS!!111!!111111!!!!!111!!!" Now later on I did post some behavior that had me rather upset, but that was AFTER I had already engaged the list in general, and was pretty sure by that point that what I was seeing was NOT covered on the wiki, and was reasonably new information for at least SOME list users. 4) As a matter of fact, AFAIK that behavior remains relevant today, and may well be of interest to the OP. FWIW my background was Linux kernel md/raid, so I approached the btrfs raid expecting similar behavior. What I found in my testing (and NOT covered on the WIKI or in the various documentation other than in a few threads on list to this day, AFAIK) , however... Test: a) Create a two device btrfs raid1. b) Mount it and write some data to it. c) Unmount it, unplug one device, mount degraded the remaining device. d) Write some data to a test file on it, noting the path/filename and data. e) Unmount again, switch plugged devices so the formerly unplugged one is now the plugged one, and again mount degraded. f) Write some DIFFERENT data to the SAME path/file as in (d), so the two versions each on its own device have now incompatibly forked. g) Unmount, plug both devices in and mount, now undegraded. What I discovered back then, and to my knowledge the same behavior exists today, is that entirely unexpectedly from and in contrast to my mdraid experience, THE FILESYSTEM MOUNTED WITHOUT PROTEST!! h) I checked the file and one variant as written was returned. STILL NO WARNING! 
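Duncan's steps (a)-(h) condensed into commands (a sketch assuming a VM where each of the two raid1 members can be detached between the degraded mounts; device names are illustrative):

```shell
# (a, b) two-device raid1 with some initial data
mkfs.btrfs -d raid1 -m raid1 /dev/vdb /dev/vdc
mount /dev/vdb /mnt; echo original > /mnt/testfile; umount /mnt

# (c, d) detach vdc, mount degraded, fork the file one way
mount -o degraded /dev/vdb /mnt; echo fork-A > /mnt/testfile; umount /mnt

# (e, f) reattach vdc, detach vdb, fork it the other way
mount -o degraded /dev/vdc /mnt; echo fork-B > /mnt/testfile; umount /mnt

# (g, h) both disks back: the mount succeeds without protest, and
# which fork is returned is effectively arbitrary
mount /dev/vdb /mnt; cat /mnt/testfile
```

Per Duncan's observation, the final mount raises no error even though the two devices now hold incompatible versions of the same file; both copies pass their own checksums, so nothing flags the divergence until the copies are compared.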
While I didn't test it, I'm assuming based on the PID-based round-robin read-assignment that I now know btrfs uses, that which copy I got would depend on whether the PID of the reading thread was even or odd, as that's what determines what device of the pair is read. (There has actually been some discussion of that as it's not a particularly intelligent balancing scheme and it's on the list to change, but the current even/odd works well enough for an initial implementation while the filesystem remains under development.) i) Were I rerunning the test today, I'd try a scrub and see what it did with the difference. But I was early enough in my btrfs learning that I didn't know to run it at that point, so didn't do so. I'd still be interested in how it handled that, tho based on what I know of btrfs behavior in general, I can /predict/ that which copy it'd scrub out and which it would keep, would again depend on the PID of the scrub thread, since both copies would appear valid (would verify against their checksum on the same device) when read, and it's only when matched against the other that a problem, presumably with the other copy, would be detected. My conclusions were two: x) Make *VERY* sure I don't actually do that in practice! If for some reason I mount degraded, make sure I consistently use the same device, so I don't get incompatible divergence. y) If which version of the data you keep really matters, in the event of a device dropout and would-be re-add, it may be worthwhile to discard/ trim/wipe the entire to-be-re-added device and btrfs device add it, then balance, as if it were an entirely new device addition, since that's the only way I know of to be sure that the wrong copy isn't picked. This is VERY VERY different behavior than mdraid would exhibit. But the purpose and use-cases for btrfs raid1 are different as well. 
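Conclusion (y) above, as a sketch of commands (hypothetical: /dev/vdc is the disk being re-added after a divergent degraded period, /dev/vdb the member whose data should win):

```shell
# Treat the returning disk as brand new so its stale copies can't win:
wipefs -a /dev/vdc                # clear the old btrfs superblock signature
mount -o degraded /dev/vdb /mnt   # mount the surviving member
btrfs device add /dev/vdc /mnt    # add the wiped disk as a fresh device
btrfs device delete missing /mnt  # drop the phantom old member
btrfs balance start /mnt          # rewrite chunks so both disks hold a copy
```

The wipe guarantees btrfs cannot recognize the returning disk as an existing (and diverged) member, so the rebuild copies everything from the device whose version you chose to keep.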
For my particular use-case of checksummed file integrity and ensuring /some/ copy of the data survived, and since I had tested and found this behavior BEFORE actual deployment, I accepted it, not entirely happily. I'm not happy with it, but at least I found out about it in my pre-testing, and could adapt my recovery practices accordingly. But that /does/ mean one can't simply pull a device from a running raid, then plug it back in and re-add, and expect everything to just work, as one could do (and I tested!) with mdraid. One must be rather more careful with btrfs raid, at least at this point, unless of course the object is to test full restore procedures as well! OTOH, from a more philosophical perspective, multi-device mdraid handling has been around for rather longer than multi-device btrfs, and I did see mdraid markedly improve in the years I used it. I expect btrfs raid handling will be rather more robust and mature in another decade or so, too, and I've already seen reasonable improvement in the six or eight months I've been using it (and the 6-8 months before that too, since when I first looked at btrfs I decided it simply wasn't mature enough for me to run, yet, so I kicked back for a few months and came at it again). =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 6:10 ` Duncan @ 2014-01-04 11:20 ` Chris Samuel 2014-01-04 13:03 ` Duncan 2014-01-04 14:51 ` Chris Mason 2014-01-04 21:22 ` Jim Salter 2 siblings, 1 reply; 40+ messages in thread From: Chris Samuel @ 2014-01-04 11:20 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1101 bytes --] On Sat, 4 Jan 2014 06:10:14 AM Duncan wrote: > Btrfs remains under development and there are clear warnings > about using it without backups one hasn't tested recovery from > or are not otherwise prepared to actually use. It's stated in > multiple locations on the wiki; it's stated on the kernel btrfs > config option, and it's stated in mkfs.btrfs output when you > create the filesystem. Actually the scary warnings are gone from the Kconfig file for what will be the 3.13 kernel. Removed by this commit: commit 4204617d142c0887e45fda2562cb5c58097b918e Author: David Sterba <dsterba@suse.cz> Date: Wed Nov 20 14:32:34 2013 +0100 btrfs: update kconfig help text Reflect the current status. Portions of the text taken from the wiki pages. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com> -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 482 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 11:20 ` Chris Samuel @ 2014-01-04 13:03 ` Duncan 0 siblings, 0 replies; 40+ messages in thread From: Duncan @ 2014-01-04 13:03 UTC (permalink / raw) To: linux-btrfs Chris Samuel posted on Sat, 04 Jan 2014 22:20:20 +1100 as excerpted: > On Sat, 4 Jan 2014 06:10:14 AM Duncan wrote: > >> Btrfs remains under development and there are clear warnings about >> using it without backups one hasn't tested recovery from or are not >> otherwise prepared to actually use. It's stated in multiple locations >> on the wiki; it's stated on the kernel btrfs config option, and it's >> stated in mkfs.btrfs output when you create the filesystem. > > Actually the scary warnings are gone from the Kconfig file for what will > be the 3.13 kernel. Removed by this commit: > > commit 4204617d142c0887e45fda2562cb5c58097b918e FWIW, I'd characterize that as toned down somewhat, not /gone/. You don't see ext4 or other "mature" filesystems saying "The filesystem disk format is no longer unstable, and it's not expected to change unless" ..., do you? "Not expected to change" and etc is definitely toned down from what it was, no argument there, but it still isn't exactly what one would expect in a description from a stable filesystem. If there's still some chance of the disk format changing, what does that say about the code /dealing/ with that disk format? That doesn't sound exactly like something I'd be comfortable staking my reputation as a sysadmin on as judged fully reliable and ready for my mission-critical data, for sure! Tho agreed, one certainly has to read between the lines a bit more for the kernel option now than they did. But the real kicker for me was when I redid several of my btrfs partitions to take advantage of newer features, 16 KiB nodes, etc, and saw the warning it's giving, yes, in btrfs-progs 3.12 after all the recent documentation changes, etc. 
Not everybody builds their own kernel, but it's kind of hard to get a btrfs filesystem without making one! (Yes, I know the installers make the filesystem for many people, and may well hide the output, but if so and the distros don't provide a similar warning when people choose btrfs, that's entirely on the distros at that point. Not much btrfs as upstream can do about that.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 6:10 ` Duncan 2014-01-04 11:20 ` Chris Samuel @ 2014-01-04 14:51 ` Chris Mason 2014-01-04 15:23 ` Goffredo Baroncelli 2014-01-04 20:08 ` Duncan 2014-01-04 21:22 ` Jim Salter 2 siblings, 2 replies; 40+ messages in thread From: Chris Mason @ 2014-01-04 14:51 UTC (permalink / raw) To: 1i5t5.duncan; +Cc: linux-btrfs On Sat, 2014-01-04 at 06:10 +0000, Duncan wrote: > Chris Murphy posted on Fri, 03 Jan 2014 16:22:44 -0700 as excerpted: > > > I would not make this option persistent by putting it permanently in the > > grub.cfg; although I don't know the consequence of always mounting with > > degraded even if not necessary it could have some negative effects (?) > > Degraded only actually does anything if it's actually needed. On a > normal array it'll be a NOOP, so should be entirely safe for /normal/ > operation, but that doesn't mean I'd /recommend/ it for normal operation, > since it bypasses checks that are there for a reason, thus silently > bypassing information that an admin needs to know before he boots it > anyway, in order to recover. > > However, I've some other comments to add: > > 1) Like you, I'm uncomfortable with the whole idea of adding degraded > permanently at this point. > I added mount -o degraded just because I wanted the admin to be notified of failures. Right now it's still the most reliable way to notify them, but I definitely agree we can do better. Leaving it on all the time? I don't think this is a great long-term solution, unless you are actively monitoring the system to make sure there are no failures. Also, as Neil Brown pointed out, it does put you at risk of transient device detection failures getting things out of sync. > Test: > > a) Create a two device btrfs raid1. > > b) Mount it and write some data to it. > > c) Unmount it, unplug one device, mount degraded the remaining device. > > d) Write some data to a test file on it, noting the path/filename and > data. 
> > e) Unmount again, switch plugged devices so the formerly unplugged one is > now the plugged one, and again mount degraded. > > f) Write some DIFFERENT data to the SAME path/file as in (d), so the two > versions each on its own device have now incompatibly forked. > > g) Unmount, plug both devices in and mount, now undegraded. > > What I discovered back then, and to my knowledge the same behavior exists > today, is that entirely unexpectedly from and in contrast to my mdraid > experience, THE FILESYSTEM MOUNTED WITHOUT PROTEST!! > > h) I checked the file and one variant as written was returned. STILL NO > WARNING! While I didn't test it, I'm assuming based on the PID-based > round-robin read-assignment that I now know btrfs uses, that which copy I > got would depend on whether the PID of the reading thread was even or > odd, as that's what determines what device of the pair is read. (There > has actually been some discussion of that as it's not a particularly > intelligent balancing scheme and it's on the list to change, but the > current even/odd works well enough for an initial implementation while > the filesystem remains under development.) > > i) Were I rerunning the test today, I'd try a scrub and see what it did > with the difference. But I was early enough in my btrfs learning that I > didn't know to run it at that point, so didn't do so. I'd still be > interested in how it handled that, tho based on what I know of btrfs > behavior in general, I can /predict/ that which copy it'd scrub out and > which it would keep, would again depend on the PID of the scrub thread, > since both copies would appear valid (would verify against their checksum > on the same device) when read, and it's only when matched against the > other that a problem, presumably with the other copy, would be detected. > It'll pick the latest generation number and use that one as the one true source. For the others you'll get crc errors which make it fall back to the latest one. 
If the two have exactly the same generation number, we'll have a hard time picking the best one. Ilya has a series of changes from this year's GSOC that we need to clean up and integrate. It detects offline devices and brings them up to date automatically. He targeted the pull-one-drive use case explicitly. -chris ^ permalink raw reply [flat|nested] 40+ messages in thread
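The generation-number rule Chris Mason describes can be sketched as follows; `pick_authoritative` is a hypothetical helper for illustration, not btrfs code:

```python
# Sketch of the reconciliation rule described above: when two copies of
# a raid1 filesystem have diverged, the copy with the higher transaction
# generation number is treated as the one true source; equal generations
# leave no safe automatic choice. Hypothetical illustration only.
from typing import Optional

def pick_authoritative(gen_a: int, gen_b: int) -> Optional[str]:
    """Return which copy wins ("a" or "b"), or None on a tie."""
    if gen_a > gen_b:
        return "a"
    if gen_b > gen_a:
        return "b"
    return None  # same generation on both: the hard case Mason mentions

assert pick_authoritative(9, 7) == "a"
assert pick_authoritative(7, 7) is None
```

The `None` branch is exactly the divergence scenario from Duncan's test, where both devices were written to independently under degraded mounts.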
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 14:51 ` Chris Mason @ 2014-01-04 15:23 ` Goffredo Baroncelli 2014-01-04 20:08 ` Duncan 1 sibling, 0 replies; 40+ messages in thread From: Goffredo Baroncelli @ 2014-01-04 15:23 UTC (permalink / raw) To: Chris Mason; +Cc: 1i5t5.duncan, linux-btrfs On 2014-01-04 15:51, Chris Mason wrote: > I added mount -o degraded just because I wanted the admin to be notified > of failures. Right now it's still the most reliable way to notify them, > but I definitely agree we can do better. I think we should align with what the other raid subsystems (md and dm) do in these cases. Reading the mdadm man page, it seems that an array is assembled even with some disks missing; the only requirement is that the disks present have to be valid (i.e. not out of sync). > Leaving it on all the time? I > don't think this is a great long term solution, unless you are actively > monitoring the system to make sure there are no failures. Anyway mdadm has the "monitor" mode, which reports this kind of error. From the mdadm man page: "Follow or Monitor Monitor one or more md devices and act on any state changes. This is only meaningful for RAID1, 4, 5, 6, 10 or multipath arrays, as only these have interesting state. RAID0 or Linear never have missing, spare, or failed drives, so there is nothing to monitor. " Best regards GB -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 14:51 ` Chris Mason 2014-01-04 15:23 ` Goffredo Baroncelli @ 2014-01-04 20:08 ` Duncan 1 sibling, 0 replies; 40+ messages in thread From: Duncan @ 2014-01-04 20:08 UTC (permalink / raw) To: linux-btrfs Chris Mason posted on Sat, 04 Jan 2014 14:51:23 +0000 as excerpted: > It'll pick the latest generation number and use that one as the one true > source. For the others you'll get crc errors which make it fall back to > the latest one. If the two have exactly the same generation number, > we'll have a hard time picking the best one. > > Ilya has a series of changes from this year's GSOC that we need to clean > up and integrate. It detects offline devices and brings them up to date > automatically. > > He targeted the pull-one-drive use case explicitly. Thanks for the explanation and bits to look forward to. I'll be looking forward to seeing that GSOC stuff then, as having dropouts and re-adds auto-handled would be a sweet feature to add to the raid featureset, improving things from a sysadmin's prepared-to-deal-with- recovery perspective quite a bit. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 6:10 ` Duncan 2014-01-04 11:20 ` Chris Samuel 2014-01-04 14:51 ` Chris Mason @ 2014-01-04 21:22 ` Jim Salter 2014-01-05 11:01 ` Duncan 2 siblings, 1 reply; 40+ messages in thread From: Jim Salter @ 2014-01-04 21:22 UTC (permalink / raw) To: Duncan, linux-btrfs On 01/04/2014 01:10 AM, Duncan wrote: > The example given in the OP was of a 4-device raid10, already the > minimum number to work undegraded, with one device dropped out, to > below the minimum required number to mount undegraded, so of /course/ > it wouldn't mount without that option. The issue was not realizing that a degraded fault-tolerant array would refuse to mount without being passed an -o degraded option. Yes, it's on the wiki - but it's on the wiki under *replacing* a device, not in the FAQ, not in the head of the "multiple devices" section, etc; and no coherent message is thrown either on the console or in the kernel log when you do attempt to mount a degraded array without the correct argument. IMO that's a bug. =) ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 21:22 ` Jim Salter @ 2014-01-05 11:01 ` Duncan 0 siblings, 0 replies; 40+ messages in thread From: Duncan @ 2014-01-05 11:01 UTC (permalink / raw) To: linux-btrfs Jim Salter posted on Sat, 04 Jan 2014 16:22:53 -0500 as excerpted: > On 01/04/2014 01:10 AM, Duncan wrote: >> The example given in the OP was of a 4-device raid10, already the >> minimum number to work undegraded, with one device dropped out, to >> below the minimum required number to mount undegraded, so of /course/ >> it wouldn't mount without that option. > > The issue was not realizing that a degraded fault-tolerant array would > refuse to mount without being passed an -o degraded option. Yes, it's on > the wiki - but it's on the wiki under *replacing* a device, not in the > FAQ, not in the head of the "multiple devices" section, etc; and no > coherent message is thrown either on the console or in the kernel log > when you do attempt to mount a degraded array without the correct > argument. > > IMO that's a bug. =) I'd agree: a usability bug, one of the many rough "it works, but it's not easy to work with" edges that still need smoothing out. FWIW I'm seeing progress in that area, now. The rush of functional bugs and fixes for them has finally slowed down to the point where there's beginning to be time to focus on the usability and rough-edges bugs. I believe I saw a post in October or November from Chris Mason, where he said yes, the maturing of btrfs has been predicted before, but it really does seem like the functional bugs are slowing down to the point where the usability bugs can finally be addressed, and 2014 really does look like the year that btrfs will finally start shaping up into a mature looking and acting filesystem, including in usability, etc. And Chris mentioned the GSoC project that worked on one angle of this specific issue, too. 
Getting that code integrated and having btrfs finally be able to recognize a dropped and re-added device and automatically trigger a resync... that'd be a pretty sweet improvement to get. =:^) While they're working on that they may well take a look at at least giving the admin more information on a degraded-needed mount failure, too, tweaking the kernel log messages, etc, and possibly taking a second look at whether refusing to mount entirely is the best behavior then, or not. Actually, I wonder... what about mounting in such a situation, but read-only, refusing to go writable unless degraded is added too? That would preserve the "first, do no harm, don't make the problem worse" ideal, while being not /quite/ as drastic as refusing to mount entirely unless degraded is added. I actually think that, plus some better logging (saying, in effect: we don't have enough devices to write at the requested raid level, so remount rw,degraded and either add another device or reconfigure the raid mode to something suitable for the number of devices), would be the way to go. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 22:56 ` Jim Salter 2014-01-03 23:04 ` Hugo Mills 2014-01-03 23:04 ` Joshua Schüler @ 2014-01-03 23:19 ` Chris Murphy [not found] ` <CAOjFWZ7zC3=4oH6=SBZA+PhZMrSK1KjxoRN6L2vqd=GTBKKTQA@mail.gmail.com> 3 siblings, 0 replies; 40+ messages in thread From: Chris Murphy @ 2014-01-03 23:19 UTC (permalink / raw) To: Btrfs BTRFS On Jan 3, 2014, at 3:56 PM, Jim Salter <jim@jrs-s.net> wrote: > I actually read the wiki pretty obsessively before blasting the list - could not successfully find anything answering the question, by scanning the FAQ or by Googling. > > You're right - mount -t btrfs -o degraded /dev/vdb /test worked fine. > > HOWEVER - this won't allow a root filesystem to mount. How do you deal with this if you'd set up a btrfs-raid1 or btrfs-raid10 as your root filesystem? I'd say that it's not ready for unattended/auto degraded mounting, and that this is intended to be a red-flag show stopper to get the attention of the user. Before automatic degraded mounts, which md and LVM raid do now, there probably needs to be notification support in desktops; e.g. GNOME will report degraded state for at least md arrays (maybe LVM too, not sure). There's also a list of other multiple-device work on the to-do list, some of which maybe should be done before auto degraded mount, for example the hot spare work. https://btrfs.wiki.kernel.org/index.php/Project_ideas#Multiple_Devices Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
[parent not found: <CAOjFWZ7zC3=4oH6=SBZA+PhZMrSK1KjxoRN6L2vqd=GTBKKTQA@mail.gmail.com>]
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT [not found] ` <CAOjFWZ7zC3=4oH6=SBZA+PhZMrSK1KjxoRN6L2vqd=GTBKKTQA@mail.gmail.com> @ 2014-01-03 23:42 ` Jim Salter 2014-01-03 23:45 ` Jim Salter 2014-01-04 0:27 ` Chris Murphy 0 siblings, 2 replies; 40+ messages in thread From: Jim Salter @ 2014-01-03 23:42 UTC (permalink / raw) To: Freddie Cash; +Cc: Joshua Schüler, linux-btrfs, Chris Murphy For anybody else interested, if you want your system to automatically boot a degraded btrfs array, here are my crib notes, verified working: ***************************** boot degraded 1. edit /etc/grub.d/10_linux, add degraded to the rootflags GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} ${GRUB_CMDLINE_LINUX} 2. add degraded to options in /etc/fstab also UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 / btrfs defaults,degraded,subvol=@ 0 1 3. Update and reinstall GRUB to all boot disks update-grub grub-install /dev/vda grub-install /dev/vdb Now you have a system which will automatically start a degraded array. ****************************************************** Side note: sorry, but I absolutely don't buy the argument that "the system won't boot without you driving down to its physical location, standing in front of it, and hammering panickily at a BusyBox prompt" is the best way to find out your array is degraded. I'll set up a Nagios module to check for degraded arrays using btrfs fi list instead, thanks... On 01/03/2014 06:06 PM, Freddie Cash wrote: > Why is manual intervention even needed? Why isn't the filesystem > "smart" enough to mount in a degraded mode automatically? > > -- > Freddie Cash > fjwcash@gmail.com <mailto:fjwcash@gmail.com> ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:42 ` Jim Salter @ 2014-01-03 23:45 ` Jim Salter 2014-01-04 0:27 ` Chris Murphy 1 sibling, 0 replies; 40+ messages in thread From: Jim Salter @ 2014-01-03 23:45 UTC (permalink / raw) To: Freddie Cash; +Cc: Joshua Schüler, linux-btrfs, Chris Murphy Minor correction: you need to close the double-quotes at the end of the GRUB_CMDLINE_LINUX line: GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} ${GRUB_CMDLINE_LINUX}" On 01/03/2014 06:42 PM, Jim Salter wrote: > For anybody else interested, if you want your system to automatically > boot a degraded btrfs array, here are my crib notes, verified working: > > ***************************** boot degraded > > 1. edit /etc/grub.d/10_linux, add degraded to the rootflags > > GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} > ${GRUB_CMDLINE_LINUX} > > > 2. add degraded to options in /etc/fstab also > > UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 / btrfs > defaults,degraded,subvol=@ 0 1 > > > 3. Update and reinstall GRUB to all boot disks > > update-grub > grub-install /dev/vda > grub-install /dev/vdb > > Now you have a system which will automatically start a degraded array. > > > ****************************************************** > > Side note: sorry, but I absolutely don't buy the argument that "the > system won't boot without you driving down to its physical location, > standing in front of it, and hammering panickily at a BusyBox prompt" > is the best way to find out your array is degraded. I'll set up a > Nagios module to check for degraded arrays using btrfs fi list > instead, thanks... > > > On 01/03/2014 06:06 PM, Freddie Cash wrote: >> Why is manual intervention even needed? Why isn't the filesystem >> "smart" enough to mount in a degraded mode automatically? 
>> >> -- >> Freddie Cash >> fjwcash@gmail.com <mailto:fjwcash@gmail.com> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-03 23:42 ` Jim Salter 2014-01-03 23:45 ` Jim Salter @ 2014-01-04 0:27 ` Chris Murphy 2014-01-04 2:59 ` Jim Salter 1 sibling, 1 reply; 40+ messages in thread From: Chris Murphy @ 2014-01-04 0:27 UTC (permalink / raw) To: Jim Salter; +Cc: linux-btrfs, Hugo Mills On Jan 3, 2014, at 4:42 PM, Jim Salter <jim@jrs-s.net> wrote: > For anybody else interested, if you want your system to automatically boot a degraded btrfs array, here are my crib notes, verified working: > > ***************************** boot degraded > > 1. edit /etc/grub.d/10_linux, add degraded to the rootflags > > GRUB_CMDLINE_LINUX="rootflags=degraded,subvol=${rootsubvol} ${GRUB_CMDLINE_LINUX} This is the wrong way to solve this. /etc/grub.d/10_linux is subject to being replaced on updates. It is not recommended it be edited, same as for grub.cfg. The correct way is as I already stated, which is to edit the GRUB_CMDLINE_LINUX= line in /etc/default/grub. > 2. add degraded to options in /etc/fstab also > > UUID=bf9ea9b9-54a7-4efc-8003-6ac0b344c6b5 / btrfs defaults,degraded,subvol=@ 0 1 I think it's bad advice to recommend always persistently mounting a good volume with this option. There's a reason why degraded is not the default mount option, and why there isn't yet automatic degraded mount functionality. That fstab contains other errors. The correct way to automate this before Btrfs developers get around to it is to create a systemd unit that checks for the mount failure, determines that there's a missing device, and generates a modified sysroot.mount job that includes degraded. > Side note: sorry, but I absolutely don't buy the argument that "the system won't boot without you driving down to its physical location, standing in front of it, and hammering panickily at a BusyBox prompt" is the best way to find out your array is degraded. 
You're simply dissatisfied with the state of Btrfs development and are suggesting bad hacks as a work around. That's my argument. Again, if your use case requires automatic degraded mounts, use a technology that's mature and well tested for that use case. Don't expect a lot of sympathy if these bad hacks cause you problems later. > I'll set up a Nagios module to check for degraded arrays using btrfs fi list instead, thanks… That's a good idea, except that it's show rather than list. Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 0:27 ` Chris Murphy @ 2014-01-04 2:59 ` Jim Salter 2014-01-04 5:57 ` Dave 2014-01-04 19:18 ` Chris Murphy 0 siblings, 2 replies; 40+ messages in thread From: Jim Salter @ 2014-01-04 2:59 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-btrfs On 01/03/2014 07:27 PM, Chris Murphy wrote: > This is the wrong way to solve this. /etc/grub.d/10_linux is subject > to being replaced on updates. It is not recommended it be edited, same > as for grub.cfg. The correct way is as I already stated, which is to > edit the GRUB_CMDLINE_LINUX= line in /etc/default/grub. Fair enough - though since I already have to monkey-patch 00_header, I kind of already have an eye on grub.d so it doesn't seem as onerous as it otherwise would. There is definitely a lot of work that needs to be done on the boot sequence for btrfs IMO. > I think it's bad advice to recommend always persistently mounting a > good volume with this option. There's a reason why degraded is not the > default mount option, and why there isn't yet automatic degraded mount > functionality. That fstab contains other errors. What other errors does it contain? Aside from adding the "degraded" option, that's a bone-stock fstab entry from an Ubuntu Server installation. > The correct way to automate this before Btrfs developers get around to > it is to create a systemd unit that checks for the mount failure, > determines that there's a missing device, and generates a modified > sysroot.mount job that includes degraded. Systemd is not the boot system in use for my distribution, and using it would require me to build a custom kernel, among other things. We're going to have to agree to disagree that that's an appropriate workaround, I think. > You're simply dissatisfied with the state of Btrfs development and are > suggesting bad hacks as a work around. That's my argument. 
Again, if > your use case requires automatic degraded mounts, use a technology > that's mature and well tested for that use case. Don't expect a lot of > sympathy if these bad hacks cause you problems later. You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they don't provide the features that I need or are accustomed to (true snapshots, copy on write, self-correcting redundant arrays, and on down the line). If you're going to shoo me off, the correct way to do it is to wave me in the direction of ZFS, in which case I can tell you I've been a happy user of ZFS for 5+ years now on hundreds of systems. ZFS and btrfs are literally the *only* options available that do what I want to do, and have been doing for years now. (At least aside from six-figure-and-up proprietary systems, which I have neither the budget nor the inclination for.) I'm testing btrfs heavily in throwaway virtual environments and in a few small, heavily-monitored "test production" instances because ZFS on Linux has its own set of problems, both technical and licensing, and I think it's clear btrfs is going to take the lead in the very near future - in many ways, it does already. >> I'll set up a Nagios module to check for degraded arrays using btrfs fi list instead, thanks… > That's a good idea, except that it's show rather than list. Yup, that's what I meant all right. I frequently still get the syntax backwards between btrfs fi show and btrfs subv list. ^ permalink raw reply [flat|nested] 40+ messages in thread
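A monitoring check of the kind Jim proposes could parse `btrfs fi show` for the `*** Some devices missing` marker seen earlier in the thread. A hedged sketch: the output format is assumed from the example in this thread and may differ across btrfs-progs versions, and the function names are hypothetical:

```python
import subprocess

def missing_device_uuids(show_output: str) -> list:
    """Return UUIDs of filesystems whose `btrfs fi show` entry
    reports missing devices."""
    uuids, current = [], None
    for raw in show_output.splitlines():
        line = raw.strip()
        if line.startswith("Label:") and "uuid:" in line:
            current = line.split("uuid:", 1)[1].strip()
        elif "Some devices missing" in line and current is not None:
            uuids.append(current)
    return uuids

def nagios_check() -> int:
    """Nagios plugin convention: exit 0 = OK, 2 = CRITICAL."""
    out = subprocess.run(["btrfs", "fi", "show"],
                         capture_output=True, text=True).stdout
    degraded = missing_device_uuids(out)
    if degraded:
        print("CRITICAL: degraded btrfs:", ", ".join(degraded))
        return 2
    print("OK: no btrfs filesystems report missing devices")
    return 0
```

Run against the output shown at the top of this thread, `missing_device_uuids` would flag uuid 94af1f5d-6ad2-4582-ab4a-5410c410c455.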
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 2:59 ` Jim Salter @ 2014-01-04 5:57 ` Dave 2014-01-04 11:28 ` Chris Samuel 2014-01-04 19:18 ` Chris Murphy 1 sibling, 1 reply; 40+ messages in thread From: Dave @ 2014-01-04 5:57 UTC (permalink / raw) To: Jim Salter; +Cc: linux-btrfs On Fri, Jan 3, 2014 at 9:59 PM, Jim Salter <jim@jrs-s.net> wrote: > You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they > don't provide the features that I need or are accustomed to (true snapshots, > copy on write, self-correcting redundant arrays, and on down the line). If > you're going to shoo me off, the correct way to do it is to wave me in the > direction of ZFS, in which case I can tell you I've been a happy user of ZFS > for 5+ years now on hundreds of systems. ZFS and btrfs are literally the > *only* options available that do what I want to do, and have been doing for > years now. (At least aside from six-figure-and-up proprietary systems, which > I have neither the budget nor the inclination for.) Jim, there's nothing stopping you from creating a Btrfs filesystem on top of an mdraid array. I'm currently running three WD Red 3TB drives in a raid5 configuration under a Btrfs filesystem. This configuration works pretty well and fills the feature gap you're describing. I will say, though, that the whole tone of your email chain leaves a bad taste in my mouth; kind of like a poorly adjusted relative who shows up once a year for Thanksgiving and makes everyone feel uncomfortable. I find myself annoyed by the constant disclaimers I read on this list, about the experimental status of Btrfs, but it's apparent that this hasn't sunk in for everyone. Your poor budget doesn't a production filesystem make. I, and many others on this list who have been using Btrfs, will tell you with no hesitation that, given the current maturity of the code, Btrfs should be making NO assumptions in the event of a failure, and everything should come to a screeching halt. 
I've seen it all: the infamous 120 second process hangs, csum errors, multiple separate catastrophic failures (search me on this list). Things are MOSTLY stable but you simply have to glance at a few weeks of history on this list to see the experimental status is fully justified. I use Btrfs because of its intoxicating feature set. As an IT director though, I'd never subject my company to these rigors. If Btrfs on mdraid isn't an acceptable solution for you, then ZFS is the only responsible alternative. -- -=[dave]=- Entropy isn't what it used to be. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 5:57 ` Dave @ 2014-01-04 11:28 ` Chris Samuel 2014-01-04 14:56 ` Chris Mason 0 siblings, 1 reply; 40+ messages in thread From: Chris Samuel @ 2014-01-04 11:28 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 609 bytes --] On Sat, 4 Jan 2014 12:57:02 AM Dave wrote: > I find myself annoyed by the constant disclaimers I > read on this list, about the experimental status of Btrfs, but it's > apparent that this hasn't sunk in for everyone. Btrfs will no longer be marked as experimental in the kernel as of 3.13. Unless someone submits a patch to fix it first. :-) Can we also keep things polite here, please. thanks, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 482 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 11:28 ` Chris Samuel @ 2014-01-04 14:56 ` Chris Mason 2014-01-05 9:20 ` Chris Samuel 0 siblings, 1 reply; 40+ messages in thread From: Chris Mason @ 2014-01-04 14:56 UTC (permalink / raw) To: chris; +Cc: linux-btrfs On Sat, 2014-01-04 at 22:28 +1100, Chris Samuel wrote: > On Sat, 4 Jan 2014 12:57:02 AM Dave wrote: > > > I find myself annoyed by the constant disclaimers I > > read on this list, about the experimental status of Btrfs, but it's > > apparent that this hasn't sunk in for everyone. > > Btrfs will no longer be marked as experimental in the kernel as of 3.13. > > Unless someone submits a patch to fix it first. :-) > > Can we also keep things polite here please. Seconded ;) We're really focused on nailing down these problems instead of hiding behind the experimental flag. I know we won't be perfect overnight, but it's time to focus on production workloads. -chris ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 14:56 ` Chris Mason @ 2014-01-05 9:20 ` Chris Samuel 2014-01-05 11:16 ` Duncan 0 siblings, 1 reply; 40+ messages in thread From: Chris Samuel @ 2014-01-05 9:20 UTC (permalink / raw) To: linux-btrfs; +Cc: Chris Mason [-- Attachment #1: Type: text/plain, Size: 933 bytes --] On Sat, 4 Jan 2014 02:56:39 PM Chris Mason wrote: > Seconded ;) We're really focused on nailing down these problems instead > of hiding behind the experimental flag. I know we won't be perfect > overnight, but it's time to focus on production workloads. Perhaps an option here is to remove the need to specify the degraded flag, but if the filesystem notices that it is mounting a RAID array and would otherwise fail, have it set the degraded flag itself and carry on? That way the fact it was degraded would be visible in /proc/mounts and could be detected with health-check scripts like NRPE for icinga/nagios. Looking at the code, this would be in read_one_dev() in fs/btrfs/volumes.c ? All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP [-- Attachment #2: This is a digitally signed message part. --] [-- Type: application/pgp-signature, Size: 482 bytes --] ^ permalink raw reply [flat|nested] 40+ messages in thread
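The decision Chris Samuel sketches (auto-set degraded when devices are missing but redundancy still permits the mount, so monitoring can see it in /proc/mounts) might look like this in outline. A hypothetical model of the policy, not the read_one_dev() code:

```python
# Model of the proposed mount policy: with all devices present, mount as
# requested; with some missing but enough left for the raid level, add
# the degraded flag automatically (and visibly); with too few, fail.
# Hypothetical sketch; names and logic are not from fs/btrfs/volumes.c.
def effective_mount_opts(present, total, min_needed, opts):
    if present == total:
        return set(opts)                 # healthy: mount as requested
    if present >= min_needed:
        return set(opts) | {"degraded"}  # auto-degrade, but make it visible
    raise OSError("too few devices even for a degraded mount")

# A 4-device raid10 missing one device (3 present, 3 needed for a
# degraded mount) would come up with degraded added automatically:
assert "degraded" in effective_mount_opts(3, 4, 3, {"rw"})
```

Surfacing the flag in the mount options, rather than mounting silently, is what would let the NRPE-style health checks he mentions catch the degraded state.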
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-05 9:20 ` Chris Samuel @ 2014-01-05 11:16 ` Duncan 0 siblings, 0 replies; 40+ messages in thread From: Duncan @ 2014-01-05 11:16 UTC (permalink / raw) To: linux-btrfs Chris Samuel posted on Sun, 05 Jan 2014 20:20:26 +1100 as excerpted: > On Sat, 4 Jan 2014 02:56:39 PM Chris Mason wrote: > >> Seconded ;) We're really focused on nailing down these problems >> instead of hiding behind the experimental flag. I know we won't be >> perfect overnight, but it's time to focus on production workloads. > > Perhaps an option here is to remove the need to specify the degraded > flag: if the filesystem notices that it is mounting a RAID array that > would otherwise fail to mount, it sets the degraded flag itself and > carries on? > > That way the fact it was degraded would be visible in /proc/mounts and > could be detected with health check scripts like NRPE for icinga/nagios. > > Looking at the code this would be in read_one_dev() in > fs/btrfs/volumes.c ? The idea I came up with elsewhere was to mount read-only, with a dmesg to the effect that the filesystem was configured for a raid level that the current number of devices couldn't support, so mount rw,degraded to accept that temporarily and to make changes, either by adding a new device to fill out the required number for the configured raid level, or by reducing the configured raid level to match reality. The read-only mount would be better than not mounting at all, while preserving the "first, do no further harm" ideal, since mounted read-only, the existing situation should at least remain stable. It would also alert the admin to problems, with a reasonable log message saying how to fix them, while letting the admin at least access the filesystem in read-only mode, thereby giving him access to the tools needed to manage whatever maintenance tasks are necessary, should it be the rootfs. 
The admin could then take the action they deemed appropriate, whether that was getting the data backed up, or mounting degraded,rw in order to either add a device and bring it back to functional or to rebalance to a lower data/metadata redundancy level due to lack of devices. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 40+ messages in thread
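[Editorial note: the "alert and refuse" policy Duncan describes can be approximated from userspace today by inspecting `btrfs fi show` before mounting. A rough sketch, assuming the "*** Some devices missing" marker shown in the output at the top of this thread; devices_missing and mount_policy are hypothetical helpers that only print the mount command they would run:]

```shell
#!/bin/sh
# Hypothetical pre-mount guard: given `btrfs fi show` output on stdin,
# detect the "Some devices missing" marker and pick a mount policy:
# read-only (plus degraded) when devices are missing, normal rw otherwise.
devices_missing() {
    grep -q 'Some devices missing'
}

mount_policy() {
    if devices_missing; then
        echo "mount -o ro,degraded"  # degraded: read-only, do no further harm
    else
        echo "mount"                 # healthy: normal read-write mount
    fi
}
```

[Usage would be along the lines of `btrfs fi show | mount_policy`; the read-only result leaves the rw,degraded decision to the admin, exactly as proposed above.]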
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 2:59 ` Jim Salter 2014-01-04 5:57 ` Dave @ 2014-01-04 19:18 ` Chris Murphy 2014-01-04 21:16 ` Jim Salter 1 sibling, 1 reply; 40+ messages in thread From: Chris Murphy @ 2014-01-04 19:18 UTC (permalink / raw) To: linux-btrfs On Jan 3, 2014, at 7:59 PM, Jim Salter <jim@jrs-s.net> wrote: > > On 01/03/2014 07:27 PM, Chris Murphy wrote: >> This is the wrong way to solve this. /etc/grub.d/10_linux is subject to being replaced on updates. It is not recommended it be edited, same as for grub.cfg. The correct way is as I already stated, which is to edit the GRUB_CMDLINE_LINUX= line in /etc/default/grub. > Fair enough - though since I already have to monkey-patch 00_header, I kind of already have an eye on grub.d so it doesn't seem as onerous as it otherwise would. There is definitely a lot of work that needs to be done on the boot sequence for btrfs IMO. Most of this work has been done for a while now in current versions of GRUB 2.00. There are a few fixes due in 2.02. There are some logical challenges in making snapshots bootable in a coherent way. But a major advantage of Btrfs is that functionality is contained in one place, so once the kernel is booted things usually just work - I'm not sure what else you're referring to? >> I think it's bad advice to recommend always persistently mounting a good volume with this option. There's a reason why degraded is not the default mount option, and why there isn't yet automatic degraded mount functionality. That fstab contains other errors. > What other errors does it contain? Aside from adding the "degraded" option, that's a bone-stock fstab entry from an Ubuntu Server installation. fs_passno is 1, which doesn't apply to Btrfs. >> You're simply dissatisfied with the state of Btrfs development and are suggesting bad hacks as a workaround. That's my argument. 
Again, if your use case requires automatic degraded mounts, use a technology that's mature and well tested for that use case. Don't expect a lot of sympathy if these bad hacks cause you problems later. > You're suggesting the wrong alternatives here (mdraid, LVM, etc) - they don't provide the features that I need or are accustomed to (true snapshots, copy on write, self-correcting redundant arrays, and on down the line). Well actually LVM thinp does have fast snapshots without requiring preallocation, and uses COW. I'm not sure what you mean by self-correcting, but if the drive reports a read error md, lvm, and Btrfs raid1+ all will get missing data from mirror/parity reconstruction, and write corrected data back to the bad sector. All offer scrubbing (except Btrfs raid5/6). If you mean an independent means of verifying data via checksumming, true you're looking at Btrfs, ZFS, or PI. > If you're going to shoo me off, the correct way to do it is to wave me in the direction of ZFS There's no shooing, I'm just making observations. Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 19:18 ` Chris Murphy @ 2014-01-04 21:16 ` Jim Salter 2014-01-05 20:25 ` Chris Murphy 0 siblings, 1 reply; 40+ messages in thread From: Jim Salter @ 2014-01-04 21:16 UTC (permalink / raw) To: Chris Murphy, linux-btrfs On 01/04/2014 02:18 PM, Chris Murphy wrote: > I'm not sure what else you're referring to?(working on boot > environment of btrfs) Just the string of caveats regarding mounting at boot time - needing to monkeypatch 00_header to avoid the bogus sparse file error (which, worse, tells you to press a key when pressing a key does nothing) followed by this, in my opinion completely unexpected, behavior when missing a disk in a fault-tolerant array, which also requires monkey-patching in fstab and now elsewhere in GRUB to avoid. Please keep in mind - I think we got off on the wrong foot here, and I'm sorry for my part in that, it was unintentional. I *love* btrfs, and think the devs are doing incredible work. I'm excited about it. I'm aware it's not intended for production yet. However, it's just on the cusp, with distributions not only including it in their installers but a couple teetering on the fence with declaring it their next default FS (Oracle Unbreakable, OpenSuse, hell even RedHat was flirting with the idea) that it seems to me some extra testing with an eye towards production isn't a bad thing. That's why I'm here. Not to crap on anybody, but to get involved, hopefully helpfully. > fs_passno is 1 which doesn't apply to Btrfs. Again, that's the distribution's default, so the argument should be with them, not me... with that said, I'd respectfully argue that fs_passno 1 is correct for any root file system; if the file system itself declines to run an fsck that's up to the filesystem, but it's correct to specify fs_passno 1 if the filesystem is to be mounted as root in the first place. I'm open to hearing why that's a bad idea, if you have a specific reason? 
> Well actually LVM thinp does have fast snapshots without requiring > preallocation, and uses COW. LVM's snapshots aren't very useful for me - there's a performance penalty while you have them in place, so they're best used as a transient use-then-immediately-delete feature, for instance for rsync'ing off a database binary. Until recently, there also wasn't a good way to roll back an LV to a snapshot, and even now, that can be pretty problematic. Finally, there's no way to get a partial copy of an LV snapshot out of the snapshot and back into production, so if eg you have virtual machines of significant size, you could be looking at *hours* of file copy operations to restore an individual VM out of a snapshot (if you even have the drive space available for it), as compared to btrfs' cp --reflink=always operation, which allows you to do the same thing instantaneously. FWIW, I think the ability to do cp --reflink=always is one of the big killer features that makes btrfs more attractive than zfs (which, again FWIW, I have 5+ years of experience with, and is my current primary storage system). > I'm not sure what you mean by self-correcting, but if the drive > reports a read error md, lvm, and Btrfs raid1+ all will get missing > data from mirror/parity reconstruction, and write corrected data back > to the bad sector. You're assuming that the drive will actually *report* a read error, which is frequently not the case. I have a production ZFS array right now that I need to replace an Intel SSD on - the SSD has thrown > 10K checksum errors in six months. Zero read or write errors. Neither hardware RAID nor mdraid nor LVM would have helped me there. Since running filesystems that do block-level checksumming, I have become aware that bitrot happens without hardware errors getting thrown FAR more frequently than I would have thought before having the tools to spot it. ZFS, and now btrfs, are the only tools at hand that can actually prevent it. 
^ permalink raw reply [flat|nested] 40+ messages in thread
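[Editorial note: Jim's point that bitrot arrives without any reported I/O error is easy to demonstrate in miniature with coreutils: flip one byte in place and nothing complains except a previously recorded checksum. This toy sketch shows the detection half of what btrfs/ZFS automate per block; repair from the redundant copy is the other half:]

```shell
#!/bin/sh
# Toy demonstration of silent corruption: overwrite one byte of a file.
# No read or write error is ever raised by any layer; only a checksum
# recorded beforehand catches the damage.
tmp=$(mktemp)
printf 'important data' > "$tmp"
good=$(sha256sum "$tmp" | cut -d' ' -f1)

# Simulate bitrot: flip the 4th byte in place, without truncating.
printf 'X' | dd of="$tmp" bs=1 seek=3 conv=notrunc status=none

now=$(sha256sum "$tmp" | cut -d' ' -f1)
if [ "$now" != "$good" ]; then
    echo "corruption detected"   # prints: corruption detected
fi
rm -f "$tmp"
```

[A checksumming filesystem runs this comparison on every read, which is why it surfaces corruption that hardware RAID, mdraid, and LVM never see.]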
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-04 21:16 ` Jim Salter @ 2014-01-05 20:25 ` Chris Murphy 2014-01-06 10:20 ` Chris Samuel 0 siblings, 1 reply; 40+ messages in thread From: Chris Murphy @ 2014-01-05 20:25 UTC (permalink / raw) To: linux-btrfs On Jan 4, 2014, at 2:16 PM, Jim Salter <jim@jrs-s.net> wrote: > > On 01/04/2014 02:18 PM, Chris Murphy wrote: >> I'm not sure what else you're referring to?(working on boot environment of btrfs) > > Just the string of caveats regarding mounting at boot time - needing to monkeypatch 00_header to avoid the bogus sparse file error I don't know what "bogus sparse file error" refers to. What version of GRUB? I'm seeing Ubuntu 12.03 precise-updates listing GRUB 1.99 which is rather old. > (which, worse, tells you to press a key when pressing a key does nothing) followed by this, in my opinion completely unexpected, behavior when missing a disk in a fault-tolerant array, which also requires monkey-patching in fstab and now elsewhere in GRUB to avoid. and… > I'm aware it's not intended for production yet. On the one hand you say you're aware, yet on the other hand you say the missing disk behavior is completely unexpected. Some parts of Btrfs, in certain contexts, are production ready. But the developmental state of Btrfs places a burden on the user to know more details about that state than he might otherwise be expected to know with more stable/mature file systems. My opinion is that it's inappropriate for degraded mounts to be made automatic when there's no method of notifying user space of this state change. Gnome-shell via udisks will inform users of a degraded md array. Something equivalent to that is needed before Btrfs should enable a scenario where a user boots a computer in degraded state without being informed as if there's nothing wrong at all. 
That's demonstrably far worse than "scary" boot failure, during which one copy of data is still likely safe, unlike permitting uninformed degraded rw operation. > However, it's just on the cusp, with distributions not only including it in their installers but a couple teetering on the fence with declaring it their next default FS (Oracle Unbreakable, OpenSuse, hell even RedHat was flirting with the idea) that it seems to me some extra testing with an eye towards production isn't a bad thing. Does the Ubuntu 12.03 LTS installer let you create sysroot on a Btrfs raid1 volume? > That's why I'm here. Not to crap on anybody, but to get involved, hopefully helpfully. I think you're better off using something more developmental, it necessarily needs to exist in the first place there, before it can trickle down to an LTS release. > >> fs_passno is 1 which doesn't apply to Btrfs. > Again, that's the distribution's default, so the argument should be with them, not me… Yes so you'd want to file a bug? That's how you get involved. > with that said, I'd respectfully argue that fs_passno 1 is correct for any root file system; if the file system itself declines to run an fsck that's up to the filesystem, but it's correct to specify fs_passno 1 if the filesystem is to be mounted as root in the first place. > > I'm open to hearing why that's a bad idea, if you have a specific reason? It's a minor point, but it shows that fs_passno has become quaint, like grandma's iron cozy. It's not applicable for either XFS or Btrfs. It's arguably inapplicable for ext3/4 but its fsck program has an optimization to skip fully checking the file system if the journal replay succeeds. There is no unattended fsck for either XFS or Btrfs. On systemd systems, it reads fstab, and if fs_passno is non-zero it checks for the existence of /sbin/fsck.<fs> and if it doesn't exist, then it doesn't run fsck for that entry. This topic was recently brought up and is in the archives. 
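[Editorial note: concretely, the fstab fix Chris is describing is just setting the last field (fs_passno) to 0 for a Btrfs root. An illustrative entry, reusing the filesystem UUID from the `btrfs fi show` output at the top of the thread:]

```
# <file system>                            <mount point> <type> <options> <dump> <pass>
UUID=94af1f5d-6ad2-4582-ab4a-5410c410c455  /             btrfs  defaults  0      0
```

[With pass 0, neither the traditional boot-time fsck pass nor systemd's fstab-driven fsck scheduling will attempt to check the entry.]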
>> Well actually LVM thinp does have fast snapshots without requiring preallocation, and uses COW. > > LVM's snapshots aren't very useful for me - there's a performance penalty while you have them in place, so they're best used as a transient use-then-immediately-delete feature, for instance for rsync'ing off a database binary. Until recently, there also wasn't a good way to roll back an LV to a snapshot, and even now, that can be pretty problematic. This describes old LVM snapshots, not LVM thinp snapshots. > Finally, there's no way to get a partial copy of an LV snapshot out of the snapshot and back into production, so if eg you have virtual machines of significant size, you could be looking at *hours* of file copy operations to restore an individual VM out of a snapshot (if you even have the drive space available for it), as compared to btrfs' cp --reflink=always operation, which allows you to do the same thing instantaneously. LVM isn't a file system, so limitations compared to Btrfs are expected. > >> I'm not sure what you mean by self-correcting, but if the drive reports a read error md, lvm, and Btrfs raid1+ all will get missing data from mirror/parity reconstruction, and write corrected data back to the bad sector. > > You're assuming that the drive will actually *report* a read error, which is frequently not the case. This is discussed in significant detail in the linux-raid@ list archives. I'm not aware of data that explicitly concludes or proposes a ratio between ECC error detection with non-correction (resulting in a read error) vs silent data corruption. I've seen quite a few read errors from drives compared to what I think was SDC - but that's not a scientific sample. Polluting a lot of the data is a mismatch between default drive ERC timeouts compared to SCSI block layer timeouts, so when a drive ECC isn't able to produce a result within the SCSI block layer timeout time, we get a link reset. 
Now we don't know what the drive would have reported, a read error? Or bogus data? > I have a production ZFS array right now that I need to replace an Intel SSD on - the SSD has thrown > 10K checksum errors in six months. Zero read or write errors. Neither hardware RAID nor mdraid nor LVM would have helped me there. Of course, that's not their design goal. But I don't think the Btrfs devs are suggesting a design goal is to compensate for spectacular failure of the drive's ECC, because if all drives in your Btrfs volume behaved the way this one SSD you're reporting behaves, you'd inevitably still lose data. Btrfs checksumming isn't a substitute for drive ECC. What you're reporting is a significant ECC fail. > Since running filesystems that do block-level checksumming, I have become aware that bitrot happens without hardware errors getting thrown FAR more frequently than I would have thought before having the tools to spot it. ZFS, and now btrfs, are the only tools at hand that can actually prevent it. There are other tools than ZFS and Btrfs, they just aren't open source. 10K checksum errors in six months without a single read error is not bitrot, it's a more significant failure. Bitrot is one kind of silent data corruption, not all SDC is due to bit rot, there are a lot of other sources for data corruption in the storage stack. Yes it's good we have ZFS and Btrfs for additional protection, but I don't see these file systems as getting manufacturers off the hook with respect to ECC. That needs to get better, they know it needs to get better and that's one of the major reasons why spinning drives have moved to 4K physical sectors. So moving to checksumming file systems isn't the only way to prevent these problems. Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-05 20:25 ` Chris Murphy @ 2014-01-06 10:20 ` Chris Samuel 2014-01-06 18:30 ` Chris Murphy 0 siblings, 1 reply; 40+ messages in thread From: Chris Samuel @ 2014-01-06 10:20 UTC (permalink / raw) To: linux-btrfs On Sun, 5 Jan 2014 01:25:19 PM Chris Murphy wrote: > Does the Ubuntu 12.03 LTS installer let you create sysroot on a Btrfs raid1 > volume? I doubt it, given the alpha for 14.04 doesn't seem to have the concept yet. :-) https://bugs.launchpad.net/ubuntu/+source/grub-installer/+bug/1266200 All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC This email may come with a PGP signature as a file. Do not panic. For more info see: http://en.wikipedia.org/wiki/OpenPGP ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-06 10:20 ` Chris Samuel @ 2014-01-06 18:30 ` Chris Murphy 2014-01-06 19:25 ` Jim Salter 2014-01-06 19:31 ` correct way to rollback a root filesystem? Jim Salter 0 siblings, 2 replies; 40+ messages in thread From: Chris Murphy @ 2014-01-06 18:30 UTC (permalink / raw) To: Chris Samuel; +Cc: linux-btrfs On Jan 6, 2014, at 3:20 AM, Chris Samuel <chris@csamuel.org> wrote: > On Sun, 5 Jan 2014 01:25:19 PM Chris Murphy wrote: > >> Does the Ubuntu 12.03 LTS installer let you create sysroot on a Btrfs raid1 >> volume? > > I doubt it, given the alpha for 14.04 doesn't seem to have the concept yet. > :-) > > https://bugs.launchpad.net/ubuntu/+source/grub-installer/+bug/1266200 Color me surprised. Fedora 20 lets you create Btrfs raid1/raid0 for rootfs, but due to a long standing grubby bug [1] /boot can't be on Btrfs, so it's only ext4. That means only one of your disks will get grub.cfg, which means if it dies, you won't boot without user intervention that also requires esoteric grub knowledge. /boot needs to be on Btrfs or it gets messy. The messy alternative, where each drive has an ext4 boot partition, means kernel updates have to be written to each drive, and each drive's separate /boot/grub/grub.cfg needs to be updated. That's kinda ick x2. Yes they could be made md raid1 to solve part of this. It gets slightly more amusing on UEFI, where the installer needs to be smart enough to create (or reuse) the EFI System partition on each device [2] for the bootloader but NOT for the grub.cfg [3], otherwise we have separate grub.cfgs on each ESP to update when there are kernel updates. And if a disk fails and is replaced, while grub-install works on BIOS, it doesn't work on UEFI because it'll only install a bootloader if the ESP is mounted in the right location. So until every duck is in a row, I think we can hardly point one finger when it comes to making a degraded system bootable without any human intervention. 
[1] grubby fatal error updating grub.cfg when /boot is btrfs https://bugzilla.redhat.com/show_bug.cgi?id=864198 [2] RFE: always create required bootloader partitions in custom partitioning https://bugzilla.redhat.com/show_bug.cgi?id=1022316 [3] On EFI, grub.cfg should be in /boot/grub not /boot/efi/EFI/fedora https://bugzilla.redhat.com/show_bug.cgi?id=1048999 Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-06 18:30 ` Chris Murphy @ 2014-01-06 19:25 ` Jim Salter 2014-01-06 22:05 ` Chris Murphy 2014-01-07 5:43 ` Chris Samuel 2014-01-06 19:31 ` correct way to rollback a root filesystem? Jim Salter 1 sibling, 2 replies; 40+ messages in thread From: Jim Salter @ 2014-01-06 19:25 UTC (permalink / raw) To: Chris Murphy, Chris Samuel; +Cc: linux-btrfs FWIW, Ubuntu (and I presume Debian) will work just fine with a single / on btrfs, single or multi disk. I currently have two machines booting to a btrfs-raid10 / with no separate /boot, one booting to a btrfs single disk / with no /boot, and one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot. On 01/06/2014 01:30 PM, Chris Murphy wrote: > Color me surprised. Fedora 20 lets you create Btrfs raid1/raid0 for > rootfs, but due to a long standing grubby bug [1] /boot can't be on > Btrfs, so it's only ext4. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-06 19:25 ` Jim Salter @ 2014-01-06 22:05 ` Chris Murphy 2014-01-06 22:24 ` Jim Salter 2014-01-07 5:43 ` Chris Samuel 1 sibling, 1 reply; 40+ messages in thread From: Chris Murphy @ 2014-01-06 22:05 UTC (permalink / raw) To: Jim Salter; +Cc: Chris Samuel, linux-btrfs On Jan 6, 2014, at 12:25 PM, Jim Salter <jim@jrs-s.net> wrote: > FWIW, Ubuntu (and I presume Debian) will work just fine with a single / on btrfs, single or multi disk. > > I currently have two machines booting to a btrfs-raid10 / with no separate /boot, one booting to a btrfs single disk / with no /boot, and one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot. Did you create the multiple device layouts outside of the installer first? What I'm seeing in the Ubuntu 12.03.04 installer is a choice of which disk to put the bootloader. If that's reliable UI, then it won't put it on both disks which means a single point of failure in which case -o degraded not being automatic with Btrfs is essentially pointless if we don't have a bootloader. I also see no way in the UI to even create Btrfs raid of any sort. Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-06 22:05 ` Chris Murphy @ 2014-01-06 22:24 ` Jim Salter 0 siblings, 0 replies; 40+ messages in thread From: Jim Salter @ 2014-01-06 22:24 UTC (permalink / raw) To: Chris Murphy; +Cc: Chris Samuel, linux-btrfs No, the installer is completely unaware. What I was getting at is that rebalancing (and installing the bootloader) is dead easy, so it doesn't bug me personally much. It'd be nice to eventually get something in the installer to make it obvious to the oblivious that it can be done and how, but in the meantime, it's frankly easier to set up btrfs-raid WITHOUT installer support than it is to set up mdraid WITH installer support.

Install process for 4-drive btrfs-raid10 root on Ubuntu (desktop or server):

1. do single-disk install on first disk, default all the way through except picking btrfs instead of ext4 for /
2. sfdisk -d /dev/sda | sfdisk /dev/sdb ; sfdisk -d /dev/sda | sfdisk /dev/sdc ; sfdisk -d /dev/sda | sfdisk /dev/sdd
3. btrfs dev add /dev/sdb1 /dev/sdc1 /dev/sdd1 /
4. btrfs balance start -dconvert=raid10 -mconvert=raid10 /
5. grub-install /dev/sdb ; grub-install /dev/sdc ; grub-install /dev/sdd

Done. The rebalancing takes less than a minute, and the system's responsive while it happens. Once you've done the grub-install on the additional drives, you're good to go - Ubuntu already uses the UUID instead of a device ID for GRUB and fstab, so the btrfs mount will scan all drives and find any that are there. The only hitch is the need to mount degraded that I Chicken Littled about earlier so loudly. =) On 01/06/2014 05:05 PM, Chris Murphy wrote: > On Jan 6, 2014, at 12:25 PM, Jim Salter <jim@jrs-s.net> wrote: > >> FWIW, Ubuntu (and I presume Debian) will work just fine with a single / on btrfs, single or multi disk. 
>> >> I currently have two machines booting to a btrfs-raid10 / with no separate /boot, one booting to a btrfs single disk / with no /boot, and one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot. > Did you create the multiple device layouts outside of the installer first? > > What I'm seeing in the Ubuntu 12.03.04 installer is a choice of which disk to put the bootloader. If that's reliable UI, then it won't put it on both disks which means a single point of failure in which case -o degraded not being automatic with Btrfs is essentially pointless if we don't have a bootloader. I also see no way in the UI to even create Btrfs raid of any sort. > > Chris Murphy ^ permalink raw reply [flat|nested] 40+ messages in thread
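[Editorial note: after step 4 of Jim's recipe, it's worth confirming the balance really converted every chunk before trusting the redundancy. A hedged sketch that parses `btrfs fi df /`-style output (lines such as "Data, RAID10: total=..., used=..." as printed by btrfs-progs of this era); check_profile is a made-up helper, not a btrfs-progs command:]

```shell
#!/bin/sh
# Hypothetical post-balance check: read `btrfs fi df <mountpoint>` output
# on stdin and verify that every Data and Metadata line mentions the
# wanted profile. Exits 0 on success, 1 if any chunk has another profile.
check_profile() {
    want="$1"
    awk -v want="$want" '
        /^(Data|Metadata)/ && index($0, want) == 0 { bad = 1 }
        END { exit bad }'
}
```

[Intended usage: `btrfs fi df / | check_profile RAID10` - a nonzero exit means some chunks still carry the old profile and the balance should be re-run or allowed to finish.]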
* Re: btrfs raid1 and btrfs raid10 arrays NOT REDUNDANT 2014-01-06 19:25 ` Jim Salter 2014-01-06 22:05 ` Chris Murphy @ 2014-01-07 5:43 ` Chris Samuel 1 sibling, 0 replies; 40+ messages in thread From: Chris Samuel @ 2014-01-07 5:43 UTC (permalink / raw) To: linux-btrfs On 07/01/14 06:25, Jim Salter wrote: > FWIW, Ubuntu (and I presume Debian) will work just fine with a single / > on btrfs, single or multi disk. > > I currently have two machines booting to a btrfs-raid10 / with no > separate /boot, one booting to a btrfs single disk / with no /boot, and > one booting to a btrfs-raid10 / with an ext4-on-mdraid1 /boot. Actually I've run into a problem with grub where a fresh install cannot boot from a btrfs /boot if your first partition is not 1MB aligned (sector 2048), as there is then not enough space for it to store its btrfs code. :-( https://bugs.launchpad.net/ubuntu/+source/grub-installer/+bug/1266195 I don't want to move my first partition as it's a Dell special (type 'de') and I'm not sure what the impact would be, so I just created an ext4 /boot and the install then worked. Regarding RAID, yes I realise it's easy to do after the fact; indeed on the same test system I added an external USB2 drive to the root filesystem and rebalanced as RAID-1, worked nicely. I'm planning on adding dual SSDs as my OS disks to my desktop and this experiment was to learn whether the Kubuntu installer handled it yet and if not to do a quick practice of setting it up by hand. :-) All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC ^ permalink raw reply [flat|nested] 40+ messages in thread
* correct way to rollback a root filesystem? 2014-01-06 18:30 ` Chris Murphy 2014-01-06 19:25 ` Jim Salter @ 2014-01-06 19:31 ` Jim Salter 2014-01-07 11:55 ` Sander 1 sibling, 1 reply; 40+ messages in thread From: Jim Salter @ 2014-01-06 19:31 UTC (permalink / raw) To: linux-btrfs Hi list - I tried a kernel upgrade with moderately disastrous (non-btrfs-related) results this morning; after the kernel upgrade Xorg was completely borked beyond my ability to get it working properly again through any normal means. I do have hourly snapshots being taken by cron, though, so I'm successfully X'ing again on the machine in question right now. It was quite a fight getting back to where I started even so, though - I'm embarrassed to admit I finally ended up just doing a cp --reflink=all /mnt/@/.snapshots/snapshotname /mnt/@/ from the initramfs BusyBox prompt. Which WORKED well enough, but obviously isn't ideal. I tried the btrfs sub set-default command - again from BusyBox - and it didn't seem to want to work for me; I got an inappropriate ioctl error (which may be because I tried to use / instead of /mnt, where the root volume was CURRENTLY mounted, as an argument?). Before that, I'd tried setting subvol=@root (which is the writeable snapshot I created from the original read-only hourly snapshot I had) in GRUB and in fstab... but that's what landed me in BusyBox to begin with. When I DID mount the filesystem in BusyBox on /mnt, I saw that @ and @home were listed under /mnt, but no other "directories" were - which explains why mounting -o subvol=@root didn't work. I guess the question is, WHY couldn't I see @root in there, since I had a working, readable, writeable snapshot which showed its own name as "root" when doing a btrfs sub show /.snapshots/root ? Thanks. ^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: correct way to rollback a root filesystem? 2014-01-06 19:31 ` correct way to rollback a root filesystem? Jim Salter @ 2014-01-07 11:55 ` Sander 0 siblings, 0 replies; 40+ messages in thread From: Sander @ 2014-01-07 11:55 UTC (permalink / raw) To: Jim Salter; +Cc: linux-btrfs Jim Salter wrote (ao): > I tried a kernel upgrade with moderately disastrous > (non-btrfs-related) results this morning; after the kernel upgrade > Xorg was completely borked beyond my ability to get it working > properly again through any normal means. I do have hourly snapshots > being taken by cron, though, so I'm successfully X'ing again on the > machine in question right now. > > It was quite a fight getting back to where I started even so, though > - I'm embarrassed to admit I finally ended up just doing a cp > --reflink=all /mnt/@/.snapshots/snapshotname /mnt/@/ from the > initramfs BusyBox prompt. Which WORKED well enough, but obviously > isn't ideal. > > I tried the btrfs sub set-default command - again from BusyBox - and > it didn't seem to want to work for me; I got an inappropriate ioctl > error (which may be because I tried to use / instead of /mnt, where > the root volume was CURRENTLY mounted, as an argument?). Before > that, I'd tried setting subvol=@root (which is the writeable > snapshot I created from the original read-only hourly snapshot I > had) in GRUB and in fstab... but that's what landed me in BusyBox to > begin with. > > When I DID mount the filesystem in BusyBox on /mnt, I saw that @ and > @home were listed under /mnt, but no other "directories" were - > which explains why mounting -o subvol=@root didn't work. I guess the > question is, WHY couldn't I see @root in there, since I had a > working, readable, writeable snapshot which showed its own name as > "root" when doing a btrfs sub show /.snapshots/root ? I don't quite get how your setup is laid out. 
In my setup, all subvolumes and snapshots are under /.root/

# cat /etc/fstab
LABEL=panda /         btrfs subvol=rootvolume,space_cache,inode_cache,compress=lzo,ssd 0 0
LABEL=panda /home     btrfs subvol=home 0 0
LABEL=panda /root     btrfs subvol=root 0 0
LABEL=panda /var      btrfs subvol=var 0 0
LABEL=panda /holding  btrfs subvol=.holding 0 0
LABEL=panda /.root    btrfs subvolid=0 0 0
/Varlib     /var/lib  none  bind 0 0

In case of an OS upgrade gone wrong, I would mount subvolid=0, move subvolume 'rootvolume' out of the way, and move (rename) the last known good snapshot to 'rootvolume'. Not sure if that works though. Never tried. Sander ^ permalink raw reply [flat|nested] 40+ messages in thread