* Unexpected raid1 behaviour
@ 2017-12-16 19:50 Dark Penguin
  2017-12-17 11:58 ` Duncan
                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Dark Penguin @ 2017-12-16 19:50 UTC (permalink / raw)
  To: linux-btrfs

Could someone please point me towards something to read about how btrfs
handles multiple devices? Namely, kicking faulty devices and re-adding them.

I've been using btrfs on single devices for a while, but now I want to
start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
tried to see how it handles various situations. The experience left
me very surprised; I've tried a number of things, all of which produced
unexpected results.

I create a btrfs raid1 filesystem on two hard drives and mount it.
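Roughly the sequence I used (device names and mount point are just
examples):

  mkfs.btrfs -f -d raid1 -m raid1 /dev/sdb /dev/sdc   # mirror data and metadata
  mount /dev/sdb /mnt
  btrfs filesystem df /mnt   # shows Data, RAID1 and Metadata, RAID1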

- When I pull one of the drives out (simulating a simple cable failure,
which happens pretty often to me), the filesystem sometimes goes
read-only. ???
- But only after a while, and not always. ???
- When I fix the cable problem (plug the device back), it's immediately
"re-added" back. But I see no replication of the data I've written onto
a degraded filesystem... Nothing shows any problems, so "my filesystem
must be ok". ???
- If I unmount the filesystem and then mount it back, I see all my
recent changes lost (everything I wrote during the "degraded" period).
- If I continue working with a degraded raid1 filesystem (even without
damaging it further by re-adding the faulty device), after a while it
won't mount at all, even with "-o degraded".

I can't wrap my head around all this. Either the kicked device should not
be re-added, or it should be re-added "properly", or it should at least
show some errors and not pretend nothing happened, right?..

I must be missing something. Is there an explanation somewhere about
what's really going on during those situations? Also, do I understand
correctly that upon detecting a faulty device (a write error), nothing
is done about it except logging an error into the 'btrfs device stats'
report? No device kicking, no notification?.. And what about degraded
filesystems - is it absolutely forbidden to work with them without
converting them to a "single" filesystem first?..

On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1.


-- 
darkpenguin


* Re: Unexpected raid1 behaviour
  2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin
@ 2017-12-17 11:58 ` Duncan
  2017-12-17 15:48   ` Peter Grandi
  2017-12-18  5:11   ` Anand Jain
  2017-12-18  1:20 ` Qu Wenruo
  2017-12-18 13:31 ` Austin S. Hemmelgarn
  2 siblings, 2 replies; 61+ messages in thread
From: Duncan @ 2017-12-17 11:58 UTC (permalink / raw)
  To: linux-btrfs

Dark Penguin posted on Sat, 16 Dec 2017 22:50:33 +0300 as excerpted:

> Could someone please point me towards some read about how btrfs handles
> multiple devices? Namely, kicking faulty devices and re-adding them.
> 
> I've been using btrfs on single devices for a while, but now I want to
> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
> tried to see how does it handle various situations. The experience left
> me very surprised; I've tried a number of things, all of which produced
> unexpected results.
> 
> I create a btrfs raid1 filesystem on two hard drives and mount it.
> 
> - When I pull one of the drives out (simulating a simple cable failure,
> which happens pretty often to me), the filesystem sometimes goes
> read-only. ???
> - But only after a while, and not always. ???
> - When I fix the cable problem (plug the device back), it's immediately
> "re-added" back. But I see no replication of the data I've written onto
> a degraded filesystem... Nothing shows any problems, so "my filesystem
> must be ok". ???
> - If I unmount the filesystem and then mount it back, I see all my
> recent changes lost (everything I wrote during the "degraded" period). -
> If I continue working with a degraded raid1 filesystem (even without
> damaging it further by re-adding the faulty device), after a while it
> won't mount at all, even with "-o degraded".
> 
> I can't wrap my head about all this. Either the kicked device should not
> be re-added, or it should be re-added "properly", or it should at least
> show some errors and not pretend nothing happened, right?..
> 
> I must be missing something. Is there an explanation somewhere about
> what's really going on during those situations? Also, do I understand
> correctly that upon detecting a faulty device (a write error), nothing
> is done about it except logging an error into the 'btrfs device stats'
> report? No device kicking, no notification?.. And what about degraded
> filesystems - is it absolutely forbidden to work with them without
> converting them to a "single" filesystem first?..
> 
> On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 .

Btrfs device handling at this point is still "development level" and very 
rough, but there's a patch set in active review ATM that should improve 
things dramatically, perhaps as soon as 4.16 (4.15 is already well on the 
way).

Basically, at this point btrfs doesn't have "dynamic" device handling.  
That is, if a device disappears, it doesn't know it.  So it continues 
attempting to write to (and read from, but the reads are redirected) the 
missing device until things go bad enough it kicks to read-only for 
safety.

If a device is added back, the kernel normally shuffles device names and 
assigns a new one.  Btrfs will see it and list the new device, but it's 
still trying to use the old one internally.  =:^(

Thus, if a device disappears, to get it back you really have to reboot, 
or at least unload/reload the btrfs kernel module, in order to clear 
the stale device state and have btrfs rescan and reassociate devices with 
the matching filesystems.

Meanwhile, once a device goes stale -- other devices in the filesystem 
have data that should have been written to the stale one but it was gone 
so the data couldn't get to it -- once you do the module unload/reload or 
reboot cycle and btrfs picks up the device again, you should immediately 
do a btrfs scrub, which will detect and "catch up" the differences.

Btrfs tracks atomic filesystem updates via a monotonically increasing 
generation number, aka transaction-id (transid).  When a device goes 
offline, its generation number of course gets stuck at the point it went 
offline, while the other devices continue to update their generation 
numbers.

When a stale device is readded, btrfs should automatically find and use 
the device with the latest generation, but the old one isn't 
automatically caught up -- a scrub is the mechanism by which you do this.
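
As a rough sketch of that recovery sequence (device name and mount point
are just examples; the module can of course only be unloaded while no
btrfs filesystem is mounted):

  umount /mnt
  modprobe -r btrfs && modprobe btrfs   # or simply reboot
  btrfs device scan                     # reassociate devices with filesystems
  mount /dev/sdb /mnt
  btrfs scrub start -Bd /mnt            # -B waits, -d shows per-device stats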

One thing you do **NOT** want to do is degraded-writable mount one 
device, then the other device, of a raid1 pair, because that'll diverge 
the two with new data on each, and that's no longer simple to correct.  
If you /have/ to degraded-writable mount a raid1, always make sure it's 
the same one mounted writable if you want to combine them again.  If you 
/do/ need to recombine two diverged raid1 devices, the only safe way to 
do so is to wipe the one so btrfs has only the one copy of the data to go 
on, and add the wiped device back as a new device.
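
A sketch of that recombination, assuming /dev/sdb is the copy being kept
and /dev/sdc is the diverged device being sacrificed (names are just
examples):

  wipefs -a /dev/sdc                 # destroy the stale btrfs signature
  mount -o degraded /dev/sdb /mnt
  btrfs device add /dev/sdc /mnt     # comes back as a brand-new device
  btrfs device remove missing /mnt   # drop the old, now-missing member

Using "btrfs replace start <devid-of-missing> /dev/sdc /mnt" instead of
the add/remove pair is usually faster, where replacing a missing device
by devid is supported.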

Meanwhile, until /very/ recently... 4.13 may not be current enough... if 
you mounted a two-device raid1 degraded-writable, btrfs would try to 
write and note that it couldn't do raid1 because there wasn't a second 
device, so it would create single chunks to write into.

And the older filesystem safe-mount mechanism would see those single 
chunks on a raid1 and decide it wasn't safe to mount the filesystem 
writable at all after that, even if all the single chunks were actually 
present on the remaining device.

The effect was that if a device died, you had exactly one degraded-
writable mount to replace it successfully.  If you didn't complete the 
replace in that single chance writable mount, the filesystem would refuse 
to mount writable again, and thus it was impossible to repair the 
filesystem since that required a writable mount and that was no longer 
possible!  Fortunately the filesystem could still be mounted degraded-
readonly (unless there was some other problem), allowing people to at 
least get at the read-only data to copy it elsewhere.

With a new enough btrfs, while btrfs will still create those single 
chunks on a degraded-writable mount of a raid1, it's at least smart 
enough to do per-chunk checks to see if they're all available on existing 
devices (none only on the missing device), and will continue to allow 
degraded-writable mounting if so.

But once the filesystem is back to multi-device (with writable space on 
at least two devices), a balance-convert of those single chunks to raid1 
should be done, otherwise if the device with them on it goes...
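
That balance-convert is typically something like this (mount point is
just an example; the "soft" filter skips chunks that are already raid1):

  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt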

And there's work on allowing it to do only single-copy, thus incomplete-
raid1, chunk writes as well.  This should prevent the single mode chunks 
entirely, thus eliminating the need for the balance-convert, tho a scrub 
would still be needed to fully sync back up.  But I'm not sure what the 
status is on that.

Meanwhile, as mentioned above, there's active work on proper dynamic 
btrfs device tracking and management.  It may or may not be ready for 
4.16, but once it goes in, btrfs should properly detect a device going 
away and react accordingly, and it should detect a device coming back as 
a different device too.  As I write this it occurs to me that I've not 
read closely enough to know if it actually initiates scrub/resync on its 
own in the current patch set, but that's obviously an eventual goal if 
not.

Longer term, there's further patches that will provide a hot-spare 
functionality, automatically bringing in a device pre-configured as a hot-
spare if a device disappears, but that of course requires that btrfs 
properly recognize devices disappearing and coming back first, so one 
thing at a time.  Tho as originally presented, that hot-spare 
functionality was a bit limited -- it was a global hot-spare list, and 
with multiple btrfs of different sizes and multiple hot-spare devices 
also of different sizes, it would always just pick the first spare on the 
list for the first btrfs needing one, regardless of whether the size was 
appropriate for that filesystem or not.  By the time the feature actually 
gets merged it may have changed some, and regardless, it should 
eventually get less limited, but that's _eventually_, with a target time 
likely still in years, so don't hold your breath.


I think that answers most of your questions.  Basically, you have to be 
quite careful with btrfs raid1 today, as btrfs simply doesn't have the 
automated functionality to handle it yet.  It's still possible to do two-
device-only raid1 and replace a failed device when you're down to one, 
but it's not as easy or automated as more mature raid options such as 
mdraid, and you do have to keep on top of it as a result.  But it can and 
does work reasonably well for those (like me) who use btrfs raid1 as 
their "daily driver", as long as you /do/ keep on top of it... and don't 
try to use raid1 as a replacement for real backups, because it's *not* a 
backup! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Unexpected raid1 behaviour
  2017-12-17 11:58 ` Duncan
@ 2017-12-17 15:48   ` Peter Grandi
  2017-12-17 20:42     ` Chris Murphy
                       ` (2 more replies)
  2017-12-18  5:11   ` Anand Jain
  1 sibling, 3 replies; 61+ messages in thread
From: Peter Grandi @ 2017-12-17 15:48 UTC (permalink / raw)
  To: Linux fs Btrfs

"Duncan"'s reply is slightly optimistic in parts, so some
further information...

[ ... ]

> Basically, at this point btrfs doesn't have "dynamic" device
> handling.  That is, if a device disappears, it doesn't know
> it.

That's just the consequence of what is a completely broken
conceptual model: the current way most multi-device profiles are
designed is that block-devices can only be "added" or "removed",
and cannot be "broken"/"missing". Therefore if IO fails, that is
just one IO failing, not the entire block-device going away.
The time when a block-device is noticed as sort-of missing is
when it is not available for "add"-ing at start.

Put another way, the multi-device design is/was based on the
demented idea that block-devices that are missing are/should be
"remove"d, so that a 2-device volume with a 'raid1' profile
becomes a 1-device volume with a 'single'/'dup' profile, and not
a 2-device volume with a missing block-device and an incomplete
'raid1' profile, even if things have been awkwardly moving in
that direction in recent years.

Note the above is not totally accurate today because various
hacks have been introduced to work around the various issues.

> Thus, if a device disappears, to get it back you really have
> to reboot, or at least unload/reload the btrfs kernel module,
> in ordered to clear the stale device state and have btrfs
> rescan and reassociate devices with the matching filesystems.

IIRC that is not quite accurate: a "missing" device can be
nowadays "replace"d (by "devid") or "remove"d, the latter
possibly implying profile changes:

  https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete

Terrible tricks like this also work:

  https://www.spinics.net/lists/linux-btrfs/msg48394.html

> Meanwhile, as mentioned above, there's active work on proper
> dynamic btrfs device tracking and management. It may or may
> not be ready for 4.16, but once it goes in, btrfs should
> properly detect a device going away and react accordingly,

I haven't seen that, but I doubt that it is the radical redesign
of the multi-device layer of Btrfs that is needed to give it
operational semantics similar to those of MD RAID, and that I
have vaguely described previously.

> and it should detect a device coming back as a different
> device too.

That is disagreeable because of poor terminology: I guess that
what was intended that it should be able to detect a previous
member block-device becoming available again as a different
device inode, which currently is very dangerous in some vital
situations.

> Longer term, there's further patches that will provide a
> hot-spare functionality, automatically bringing in a device
> pre-configured as a hot- spare if a device disappears, but
> that of course requires that btrfs properly recognize devices
> disappearing and coming back first, so one thing at a time.

That would be trivial if the complete redesign of block-device
states of the Btrfs multi-device layer happened, adding an
"active" flag to an "accessible" flag to describe new member
states, for example.

My guess is that while logically consistent, the current
multi-device logic is fundamentally broken from an operational
point of view, and needs a complete replacement instead of
fixes.


* Re: Unexpected raid1 behaviour
  2017-12-17 15:48   ` Peter Grandi
@ 2017-12-17 20:42     ` Chris Murphy
  2017-12-18  8:49       ` Anand Jain
  2017-12-18  8:49     ` Anand Jain
  2017-12-18 13:06     ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 61+ messages in thread
From: Chris Murphy @ 2017-12-17 20:42 UTC (permalink / raw)
  To: Peter Grandi; +Cc: Linux fs Btrfs

On Sun, Dec 17, 2017 at 8:48 AM, Peter Grandi <pg@btrfs.list.sabi.co.uk> wrote:
> "Duncan"'s reply is slightly optimistic in parts, so some
> further information...

>> and it should detect a device coming back as a different
>> device too.
>
> That is disagreeable because of poor terminology: I guess that
> what was intended that it should be able to detect a previous
> member block-device becoming available again as a different
> device inode, which currently is very dangerous in some vital
> situations.

Duncan probably means if the device reappears with different
enumeration (was /dev/sdb1 but comes back as /dev/sde1), that Btrfs
can recover from this by using the Btrfs specific dev.uuid to
recognize the device. Also, by knowing the generation, it in effect has
a virtual write-intent bitmap it can use to catch up that device for
missing commits, which is something that doesn't currently happen
automatically; it requires either a scrub or balance to catch up a
formerly missing device - a very big penalty because the whole array
has to be done to catch it up for what might be only a few minutes of
missing time.



-- 
Chris Murphy


* Re: Unexpected raid1 behaviour
  2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin
  2017-12-17 11:58 ` Duncan
@ 2017-12-18  1:20 ` Qu Wenruo
  2017-12-18 13:31 ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 61+ messages in thread
From: Qu Wenruo @ 2017-12-18  1:20 UTC (permalink / raw)
  To: Dark Penguin, linux-btrfs





On 2017年12月17日 03:50, Dark Penguin wrote:
> Could someone please point me towards some read about how btrfs handles
> multiple devices? Namely, kicking faulty devices and re-adding them.
> 
> I've been using btrfs on single devices for a while, but now I want to
> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
> tried to see how does it handle various situations.
> The experience left
> me very surprised; I've tried a number of things, all of which produced
> unexpected results.
> 
> I create a btrfs raid1 filesystem on two hard drives and mount it.

Initial info like "btrfs fi df" output will help us dig into this further.
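
For example, output of something like the following (mount point is just
an example):

  btrfs filesystem show
  btrfs filesystem df /mnt
  btrfs device stats /mnt
  dmesg | grep -i btrfs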

> 
> - When I pull one of the drives out (simulating a simple cable failure,
> which happens pretty often to me), the filesystem sometimes goes
> read-only. ???

Please provide the kernel message.

> - But only after a while, and not always. ???
> - When I fix the cable problem (plug the device back), it's immediately
> "re-added" back. But I see no replication of the data I've written onto
> a degraded filesystem... Nothing shows any problems, so "my filesystem
> must be ok". ???

Needs extra info like "btrfs fi df"

> - If I unmount the filesystem and then mount it back, I see all my
> recent changes lost (everything I wrote during the "degraded" period).
> - If I continue working with a degraded raid1 filesystem (even without
> damaging it further by re-adding the faulty device), after a while it
> won't mount at all, even with "-o degraded".

Please provide kernel message too.
Although I have doubts about its usefulness, it's still better than nothing.

Thanks,
Qu

> 
> I can't wrap my head about all this. Either the kicked device should not
> be re-added, or it should be re-added "properly", or it should at least
> show some errors and not pretend nothing happened, right?..
> 
> I must be missing something. Is there an explanation somewhere about
> what's really going on during those situations? Also, do I understand
> correctly that upon detecting a faulty device (a write error), nothing
> is done about it except logging an error into the 'btrfs device stats'
> report? No device kicking, no notification?.. And what about degraded
> filesystems - is it absolutely forbidden to work with them without
> converting them to a "single" filesystem first?..
> 
> On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 .
> 
> 




* Re: Unexpected raid1 behaviour
  2017-12-17 11:58 ` Duncan
  2017-12-17 15:48   ` Peter Grandi
@ 2017-12-18  5:11   ` Anand Jain
  1 sibling, 0 replies; 61+ messages in thread
From: Anand Jain @ 2017-12-18  5:11 UTC (permalink / raw)
  To: Duncan, linux-btrfs



  Nice status update about the btrfs volume manager. Thanks.

  Below I have added the names of the patches (on the ML or WIP) addressing
  the current limitations.

On 12/17/2017 07:58 PM, Duncan wrote:
> Dark Penguin posted on Sat, 16 Dec 2017 22:50:33 +0300 as excerpted:
> 
>> Could someone please point me towards some read about how btrfs handles
>> multiple devices? Namely, kicking faulty devices and re-adding them.
>>
>> I've been using btrfs on single devices for a while, but now I want to
>> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
>> tried to see how does it handle various situations. The experience left
>> me very surprised; I've tried a number of things, all of which produced
>> unexpected results.
>>
>> I create a btrfs raid1 filesystem on two hard drives and mount it.
>>
>> - When I pull one of the drives out (simulating a simple cable failure,
>> which happens pretty often to me), the filesystem sometimes goes
>> read-only. ???
>> - But only after a while, and not always. ???
>> - When I fix the cable problem (plug the device back), it's immediately
>> "re-added" back. But I see no replication of the data I've written onto
>> a degraded filesystem... Nothing shows any problems, so "my filesystem
>> must be ok". ???
>> - If I unmount the filesystem and then mount it back, I see all my
>> recent changes lost (everything I wrote during the "degraded" period). -
>> If I continue working with a degraded raid1 filesystem (even without
>> damaging it further by re-adding the faulty device), after a while it
>> won't mount at all, even with "-o degraded".
>>
>> I can't wrap my head about all this. Either the kicked device should not
>> be re-added, or it should be re-added "properly", or it should at least
>> show some errors and not pretend nothing happened, right?..
>>
>> I must be missing something. Is there an explanation somewhere about
>> what's really going on during those situations? Also, do I understand
>> correctly that upon detecting a faulty device (a write error), nothing
>> is done about it except logging an error into the 'btrfs device stats'
>> report? No device kicking, no notification?.. And what about degraded
>> filesystems - is it absolutely forbidden to work with them without
>> converting them to a "single" filesystem first?..
>>
>> On Ubuntu 17.10, there's Linux 4.13.0-16 and btrfs-progs 4.12-1 .
> 
> Btrfs device handling at this point is still "development level" and very
> rough, but there's a patch set in active review ATM that should improve
> things dramatically, perhaps as soon as 4.16 (4.15 is already well on the
> way).
> 
> Basically, at this point btrfs doesn't have "dynamic" device handling.
> That is, if a device disappears, it doesn't know it.  So it continues
> attempting to write to (and read from, but the reads are redirected) the
> missing device until things go bad enough it kicks to read-only for
> safety.

   btrfs: introduce device dynamic state transition to failed

> If a device is added back, the kernel normally shuffles device names and
> assigns a new one.  Btrfs will see it and list the new device, but it's
> still trying to use the old one internally.  =:^(

   btrfs: handle dynamically reappearing missing device

> Thus, if a device disappears, to get it back you really have to reboot,
> or at least unload/reload the btrfs kernel module, in ordered to clear
> the stale device state and have btrfs rescan and reassociate devices with
> the matching filesystems.
> 
> Meanwhile, once a device goes stale -- other devices in the filesystem
> have data that should have been written to the stale one but it was gone
> so the data couldn't get to it -- once you do the module unload/reload or
> reboot cycle and btrfs picks up the device again, you should immediately
> do a btrfs scrub, which will detect and "catch up" the differences.
 >
> Btrfs tracks atomic filesystem updates via a monotonically increasing
> generation number, aka transaction-id (transid).  When a device goes
> offline, its generation number of course gets stuck at the point it went
> offline, while the other devices continue to update their generation
> numbers.
 >
> When a stale device is readded, btrfs should automatically find and use
> the device with the latest generation, but the old one isn't
> automatically caught up -- a scrub is the mechanism by which you do this.
>
> One thing you do **NOT** want to do is degraded-writable mount one
> device, then the other device, of a raid1 pair, because that'll diverge
> the two with new data on each, and that's no longer simple to correct.
> If you /have/ to degraded-writable mount a raid1, always make sure it's
> the same one mounted writable if you want to combine them again.  If you
> /do/ need to recombine two diverged raid1 devices, the only safe way to
> do so is to wipe the one so btrfs has only the one copy of the data to go
> on, and add the wiped device back as a new device.

   btrfs: handle volume split brain scenario

> Meanwhile, until /very/ recently... 4.13 may not be current enough... if
> you mounted a two-device raid1 degraded-writable, btrfs would try to
> write and note that it couldn't do raid1 because there wasn't a second
> device, so it would create single chunks to write into.
 >
> And the older filesystem safe-mount mechanism would see those single
> chunks on a raid1 and decide it wasn't safe to mount the filesystem
> writable at all after that, even if all the single chunks were actually
> present on the remaining device.
 >
> The effect was that if a device died, you had exactly one degraded-
> writable mount to replace it successfully.  If you didn't complete the
> replace in that single chance writable mount, the filesystem would refuse
> to mount writable again, and thus it was impossible to repair the
> filesystem since that required a writable mount and that was no longer
> possible!  Fortunately the filesystem could still be mounted degraded-
> readonly (unless there was some other problem), allowing people to at
> least get at the read-only data to copy it elsewhere.
 >
> With a new enough btrfs, while btrfs will still create those single
> chunks on a degraded-writable mount of a raid1, it's at least smart
> enough to do per-chunk checks to see if they're all available on existing
> devices (none only on the missing device), and will continue to allow
> degraded-writable mounting if so.

   (v4.14)
   btrfs: Introduce a function to check if all chunks a OK for degraded rw mount

> But once the filesystem is back to multi-device (with writable space on
> at least two devices), a balance-convert of those single chunks to raid1
> should be done, otherwise if the device with them on it goes...
 >
> And there's work on allowing it to do only single-copy, thus incomplete-
> raid1, chunk writes as well.  This should prevent the single mode chunks
> entirely, thus eliminating the need for the balance-convert, tho a scrub
> would still be needed to fully sync back up.  But I'm not sure what the
> status is on that.

   btrfs: create degraded-RAID1 chunks
   (Patch is still WIP. There is a good workaround.)

> Meanwhile, as mentioned above, there's active work on proper dynamic
> btrfs device tracking and management. 

   btrfs: Introduce device pool sysfs attributes
   (needs revival)

> It may or may not be ready for
> 4.16, but once it goes in, btrfs should properly detect a device going
> away and react accordingly, and it should detect a device coming back as
> a different device too.  As I write this it occurs to me that I've not
> read close enough to know if it actually initiates scrub/resync on its
> own in the current patch set, but that's obviously an eventual goal if
> not.

   Right. It doesn't as of now; it's on my list of things to fix.

> Longer term, there's further patches that will provide a hot-spare
> functionality, automatically bringing in a device pre-configured as a hot-
> spare if a device disappears, but that of course requires that btrfs
> properly recognize devices disappearing and coming back first, so one
> thing at a time.  Tho as originally presented, that hot-spare
> functionality was a bit limited -- it was a global hot-spare list, and
> with multiple btrfs of different sizes and multiple hot-spare devices
> also of different sizes, it would always just pick the first spare on the
> list for the first btrfs needing one, regardless of whether the size was
> appropriate for that filesystem or not.  By the time the feature actually
> gets merged it may have changed some, and regardless, it should
> eventually get less limited, but that's _eventually_, with a target time
> likely still in years, so don't hold your breath.

   hah.

   - It's not that difficult to pick up a suitably sized disk from the
     global hot spare list.
   - A CLI can show which fsid/volume a global hot spare is a candidate
     replacement for.
   - An auto-replace priority can be set at the fsid/volume end, or we could
     still dedicate a global hot spare device to a fsid/volume.

  Related patches (needs revival):
   btrfs: block incompatible optional features at scan
   btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
   btrfs: add check not to mount a spare device
   btrfs: support btrfs dev scan for spare device
   btrfs: provide framework to get and put a spare device
   btrfs: introduce helper functions to perform hot replace
   btrfs: check for failed device and hot replace

> I think that answers most of your questions.  Basically, you have to be
> quite careful with btrfs raid1 today, as btrfs simply doesn't have the
> automated functionality to handle it yet.  It's still possible to do two-
> device-only raid1 and replace a failed device when you're down to one,
> but it's not as easy or automated as more mature raid options such as
> mdraid, and you do have to keep on top of it as a result.  But it can and
> does work reasonably well for those (like me) who use btrfs raid1 as
> their "daily driver", as long as you /do/ keep on top of it... and don't
> try to use raid1 as a replacement for real backups, because it's *not* a
> backup! =:^)
> 

Thanks, Anand


* Re: Unexpected raid1 behaviour
  2017-12-17 15:48   ` Peter Grandi
  2017-12-17 20:42     ` Chris Murphy
@ 2017-12-18  8:49     ` Anand Jain
  2017-12-18 10:36       ` Peter Grandi
                         ` (2 more replies)
  2017-12-18 13:06     ` Austin S. Hemmelgarn
  2 siblings, 3 replies; 61+ messages in thread
From: Anand Jain @ 2017-12-18  8:49 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs



> Put another way, the multi-device design is/was based on the
> demented idea that block-devices that are missing are/should be
> "remove"d, so that a 2-device volume with a 'raid1' profile
> becomes a 1-device volume with a 'single'/'dup' profile, and not
> a 2-device volume with a missing block-device and an incomplete
> 'raid1' profile, 

  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
  caused by [1], which we should revert, since:
    - balance (to raid1 chunks) may fail if the FS is near full
    - recovery (to raid1 chunks) will take more writes compared
      to recovery under degraded raid1 chunks

  [1]
  commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
  Btrfs: don't consider the missing device when allocating new chunks

  There is an attempt to fix it [2], but it will certainly take time as
  there are many things to fix around this.

  [2]
  [PATCH RFC] btrfs: create degraded-RAID1 chunks

 > even if things have been awkwardly moving in
 > that direction in recent years.
> Note the above is not totally accurate today because various
> hacks have been introduced to work around the various issues.

  Maybe you are talking about [3]. Please note it's a workaround
  patch (as I mentioned in its original submission). It's nice that
  we fixed the availability issue through this patch, and the
  helper function it added also helps other development.
  But for the long term we need to work on [2].

  [3]
  btrfs: Introduce a function to check if all chunks a OK for degraded rw mount

>> Thus, if a device disappears, to get it back you really have
>> to reboot, or at least unload/reload the btrfs kernel module,
>> in ordered to clear the stale device state and have btrfs
>> rescan and reassociate devices with the matching filesystems.
> 
> IIRC that is not quite accurate: a "missing" device can be
> nowadays "replace"d (by "devid") or "remove"d, the latter
> possibly implying profile changes:
 >
>    https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete
> 
> Terrible tricks like this also work:
> 
>    https://www.spinics.net/lists/linux-btrfs/msg48394.html

  It's replace, which isn't about bringing back a missing disk.


>> Meanwhile, as mentioned above, there's active work on proper
>> dynamic btrfs device tracking and management. It may or may
>> not be ready for 4.16, but once it goes in, btrfs should
>> properly detect a device going away and react accordingly,
> 
> I haven't seen that, but I doubt that it is the radical redesign
> of the multi-device layer of Btrfs that is needed to give it
> operational semantics similar to those of MD RAID, and that I
> have vaguely described previously.

  I agree that the btrfs volume manager is incomplete in view of
  data center RAS requisites; there are a couple of critical
  bugs and inconsistent design between raid profiles, but I
  doubt it needs a radical redesign.

  Please take a look at [4]; comments are appreciated as usual.
  I have experimented with two approaches, and both are reasonable:
  - There isn't any harm in leaving the failed disk open (but stopping
    any new IO to it), and there will be a udev
    'btrfs dev forget --mounted <dev>' call when the device disappears
    so that we can close the device.
  - In the 2nd approach, close the failed device right away when a disk
    write fails, so that we continue to have only two device states.
  I like the latter.

>> and it should detect a device coming back as a different
>> device too.
> 
> That is disagreeable because of poor terminology: I guess that
> what was intended that it should be able to detect a previous
> member block-device becoming available again as a different
> device inode, which currently is very dangerous in some vital
> situations.

  If a device disappears, patch [4] will completely take the
  device out of btrfs and continue RW in degraded mode.
  When it reappears, [5] will bring it back into the RW list.

   [4]
   btrfs: introduce device dynamic state transition to failed
   [5]
   btrfs: handle dynamically reappearing missing device

  From the original btrfs design, it always depends on the device SB
  fsid:uuid:devid, so the device path, device inode, and device transport
  layer do not matter. For example, you can dynamically bring a device
  up under a different transport and it will work without any downtime.


 > That would be trivial if the complete redesign of block-device
 > states of the Btrfs multi-device layer happened, adding an
 > "active" flag to an "accessible" flag to describe new member
 > states, for example.

  I think you are talking about BTRFS_DEV_STATE. But I think
  Duncan is talking about the patches I included in my
  reply.

Thanks, Anand



* Re: Unexpected raid1 behaviour
  2017-12-17 20:42     ` Chris Murphy
@ 2017-12-18  8:49       ` Anand Jain
  0 siblings, 0 replies; 61+ messages in thread
From: Anand Jain @ 2017-12-18  8:49 UTC (permalink / raw)
  To: Chris Murphy, Peter Grandi; +Cc: Linux fs Btrfs


> formerly missing device - a very big penalty because the whole array
> has to be done to catch it up for what might be only a few minutes of
> missing time.

  For raid1, the [1] CLI will pick only the new chunks.
   [1]
   btrfs bal start -dprofiles=single -mprofiles=single <mnt>

Thanks, Anand


* Re: Unexpected raid1 behaviour
  2017-12-18  8:49     ` Anand Jain
@ 2017-12-18 10:36       ` Peter Grandi
  2017-12-18 12:10       ` Nikolay Borisov
  2017-12-18 22:28       ` Chris Murphy
  2 siblings, 0 replies; 61+ messages in thread
From: Peter Grandi @ 2017-12-18 10:36 UTC (permalink / raw)
  To: Linux fs Btrfs

>> I haven't seen that, but I doubt that it is the radical
>> redesign of the multi-device layer of Btrfs that is needed to
>> give it operational semantics similar to those of MD RAID,
>> and that I have vaguely described previously.

> I agree that btrfs volume manager is incomplete in view of
> data center RAS requisites, there are couple of critical
> bugs and inconsistent design between raid profiles, but I
> doubt if it needs a radical redesign.

Well it needs a radical redesign because the original design was
based on an entirely consistent and logical concept that was
quite different from that required for sensible operations, and
then special-case code was added (and keeps being added) to
fix the consequences.

But I suspect that it does not need a radical *recoding*,
because most if not all of the needed code is already there.
All that needs changing most likely is the member state-machine;
that's the bit that needs a radical redesign, and it is a
relatively small part of the whole.

The closer the member state-machine design is to the MD RAID one
the better as it is a very workable, proven model.

Sometimes I suspect that the design needs to be changed to also
add a formal notion of "stripe" to the Btrfs internals, where a
"stripe" is a collection of chunks that are "related" (and
something like that is already part of the 'raid10' profile),
but I think that needs not be user-visible.


* Re: Unexpected raid1 behaviour
  2017-12-18  8:49     ` Anand Jain
  2017-12-18 10:36       ` Peter Grandi
@ 2017-12-18 12:10       ` Nikolay Borisov
  2017-12-18 13:43         ` Anand Jain
  2017-12-18 22:28       ` Chris Murphy
  2 siblings, 1 reply; 61+ messages in thread
From: Nikolay Borisov @ 2017-12-18 12:10 UTC (permalink / raw)
  To: Anand Jain, Peter Grandi, Linux fs Btrfs



On 18.12.2017 10:49, Anand Jain wrote:
> 
> 
>> Put another way, the multi-device design is/was based on the
>> demented idea that block-devices that are missing are/should be
>> "remove"d, so that a 2-device volume with a 'raid1' profile
>> becomes a 1-device volume with a 'single'/'dup' profile, and not
>> a 2-device volume with a missing block-device and an incomplete
>> 'raid1' profile, 
> 
>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>  caused by [1], which we should revert back, since..
>    - balance (to raid1 chunk) may fail if FS is near full
>    - recovery (to raid1 chunk) will take more writes as compared
>      to recovery under degraded raid1 chunks
> 
>  [1]
>  commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
>  Btrfs: don't consider the missing device when allocating new chunks
> 
>  There is an attempt to fix it [2], but will certainly takes time as
>  there are many things to fix around this.
> 
>  [2]
>  [PATCH RFC] btrfs: create degraded-RAID1 chunks
> 
>> even if things have been awkwardly moving in
>> that direction in recent years.
>> Note the above is not totally accurate today because various
>> hacks have been introduced to work around the various issues.
>  May be you are talking about [3]. Pls note its a workaround
>  patch (which I mentioned in its original patch). Its nice that
>  we fixed the availability issue through this patch and the
>  helper function it added also helps the other developments.
>  But for long term we need to work on [2].
> 
>  [3]
>  btrfs: Introduce a function to check if all chunks a OK for degraded rw
> mount
> 
>>> Thus, if a device disappears, to get it back you really have
>>> to reboot, or at least unload/reload the btrfs kernel module,
>>> in ordered to clear the stale device state and have btrfs
>>> rescan and reassociate devices with the matching filesystems.
>>
>> IIRC that is not quite accurate: a "missing" device can be
>> nowadays "replace"d (by "devid") or "remove"d, the latter
>> possibly implying profile changes:
>>
>>   
>> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete
>>
>>
>> Terrible tricks like this also work:
>>
>>    https://www.spinics.net/lists/linux-btrfs/msg48394.html
> 
>  Its replace, which isn't about bringing back a missing disk.
> 
> 
>>> Meanwhile, as mentioned above, there's active work on proper
>>> dynamic btrfs device tracking and management. It may or may
>>> not be ready for 4.16, but once it goes in, btrfs should
>>> properly detect a device going away and react accordingly,
>>
>> I haven't seen that, but I doubt that it is the radical redesign
>> of the multi-device layer of Btrfs that is needed to give it
>> operational semantics similar to those of MD RAID, and that I
>> have vaguely described previously.
> 
>  I agree that btrfs volume manager is incomplete in view of
>  data center RAS requisites, there are couple of critical
>  bugs and inconsistent design between raid profiles, but I
>  doubt if it needs a radical redesign.
> 
>  Pls take a look at [4], comments are appreciated as usual.
>  I have experimented with two approaches and both are reasonable. -
>  There isn't any harm to leave failed disk opened (but stop any
>  new IO to it). And there will be udev
>  'btrfs dev forget --mounted <dev>' call when device disappears
>  so that we can close the device.
>  In the 2nd approach, close the failed device right away when disk
>  write fails, so that we continue to have only two device states.
>  I like the latter.
> 
>>> and it should detect a device coming back as a different
>>> device too.
>>
>> That is disagreeable because of poor terminology: I guess that
>> what was intended that it should be able to detect a previous
>> member block-device becoming available again as a different
>> device inode, which currently is very dangerous in some vital
>> situations.
> 
>  If device disappears, the patch [4] will completely take out the
>  device from btrfs, and continues to RW in degraded mode.
>  When it reappears then [5] will bring it back to the RW list.

but [5] relies on someone from userspace (presumably udev) actually
invoking BTRFS_IOC_SCAN_DEV/BTRFS_IOC_DEVICES_READY, no? Because
device_list_add is only ever called from btrfs_scan_one_device, which in
turn is called by either of the aforementioned IOCTLS or during mount
(which is not at play here).

> 
>   [4]
>   btrfs: introduce device dynamic state transition to failed
>   [5]
>   btrfs: handle dynamically reappearing missing device
> 
>  From the btrfs original design, it always depends on device SB
>  fsid:uuid:devid so it does not matter about the device
>  path or device inode or device transport layer. For eg. Dynamically
>  you can bring a device under different transport and it will work
>  without any down time.
> 
> 
>> That would be trivial if the complete redesign of block-device
>> states of the Btrfs multi-device layer happened, adding an
>> "active" flag to an "accessible" flag to describe new member
>> states, for example.
> 
>  I think you are talking about BTRFS_DEV_STATE.. But I think
>  Duncan is talking about the patches which I included in my
>  reply.
> 
> Thanks, Anand


* Re: Unexpected raid1 behaviour
  2017-12-17 15:48   ` Peter Grandi
  2017-12-17 20:42     ` Chris Murphy
  2017-12-18  8:49     ` Anand Jain
@ 2017-12-18 13:06     ` Austin S. Hemmelgarn
  2017-12-18 19:43       ` Tomasz Pala
  2 siblings, 1 reply; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-18 13:06 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs

On 2017-12-17 10:48, Peter Grandi wrote:
> "Duncan"'s reply is slightly optimistic in parts, so some
> further information...
> 
> [ ... ]
> 
>> Basically, at this point btrfs doesn't have "dynamic" device
>> handling.  That is, if a device disappears, it doesn't know
>> it.
> 
> That's just the consequence of what is a completely broken
> conceptual model: the current way most multi-device profiles are
> designed is that block-devices and only be "added" or "removed",
> and cannot be "broken"/"missing". Therefore if IO fails, that is
> just one IO failing, not the entire block-device going away.
> The time when a block-device is noticed as sort-of missing is
> when it is not available for "add"-ing at start.
> 
> Put another way, the multi-device design is/was based on the
> demented idea that block-devices that are missing are/should be
> "remove"d, so that a 2-device volume with a 'raid1' profile
> becomes a 1-device volume with a 'single'/'dup' profile, and not
> a 2-device volume with a missing block-device and an incomplete
> 'raid1' profile, even if things have been awkwardly moving in
> that direction in recent years.
> 
> Note the above is not totally accurate today because various
> hacks have been introduced to work around the various issues.
You do realize you just restated exactly what Duncan said, just in a 
much more verbose (and aggressively negative) manner...
> 
>> Thus, if a device disappears, to get it back you really have
>> to reboot, or at least unload/reload the btrfs kernel module,
>> in ordered to clear the stale device state and have btrfs
>> rescan and reassociate devices with the matching filesystems.
> 
> IIRC that is not quite accurate: a "missing" device can be
> nowadays "replace"d (by "devid") or "remove"d, the latter
> possibly implying profile changes:
> 
>    https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete
> 
> Terrible tricks like this also work:
> 
>    https://www.spinics.net/lists/linux-btrfs/msg48394.html
While that is all true, none of that _fixes_ the issue of a device 
disappearing and then being reconnected.  In theory, you can use `btrfs 
device replace` to force BTRFS to acknowledge the new name (by 
'replacing' the missing device with the now returned device), but doing 
so is so horribly inefficient as to not be worth it unless you have no 
other choice.
> 
>> Meanwhile, as mentioned above, there's active work on proper
>> dynamic btrfs device tracking and management. It may or may
>> not be ready for 4.16, but once it goes in, btrfs should
>> properly detect a device going away and react accordingly,
> 
> I haven't seen that, but I doubt that it is the radical redesign
> of the multi-device layer of Btrfs that is needed to give it
> operational semantics similar to those of MD RAID, and that I
> have vaguely described previously.
Anand has been working on hot spare support, and as part of that has 
done some work on handling of missing devices.
> 
>> and it should detect a device coming back as a different
>> device too.
> 
> That is disagreeable because of poor terminology: I guess that
> what was intended that it should be able to detect a previous
> member block-device becoming available again as a different
> device inode, which currently is very dangerous in some vital
> situations.
How exactly is this dangerous?  The only situation I can think of is if 
a bogus device is hot-plugged and happens to perfectly match all the 
required identifiers, and at that point you've either got someone 
attacking your system who already has sufficient access to do whatever 
the hell they want with it, or you did something exceedingly stupid, and 
both cases are dangerous by themselves.
> 
>> Longer term, there's further patches that will provide a
>> hot-spare functionality, automatically bringing in a device
>> pre-configured as a hot- spare if a device disappears, but
>> that of course requires that btrfs properly recognize devices
>> disappearing and coming back first, so one thing at a time.
> 
> That would be trivial if the complete redesign of block-device
> states of the Btrfs multi-device layer happened, adding an
> "active" flag to an "accessible" flag to describe new member
> states, for example.
No, it wouldn't be trivial, because a complete redesign of part of the 
filesystem would be needed.
> 
> My guess that while logically consistent, the current
> multi-device logic is fundamentally broken from an operational
> point of view, and needs a complete replacement instead of
> fixes.
Then why don't you go write up some patches yourself if you feel so 
strongly about it?

The fact is, the only cases where this is really an issue are if you've 
either got intermittently bad hardware, or are dealing with external 
storage devices.  For the majority of people who are using multi-device 
setups, the common case is internally connected fixed storage devices 
with properly working hardware, and for that use case, it works 
perfectly fine.  In fact, the only people I've seen any reports of 
issues from are either:

1. Testing the behavior of device management (such as the OP), in which 
case, yes it doesn't work if you do things that aren't reasonably 
expected of working hardware.
2. Trying to do multi-device on USB, which is a bad idea regardless of 
what you're using to create a single volume, because USB has pretty 
serious reliability issues.

Neither case is 'normal' usage of a multi-device volume though.  Yes, 
the second case could be better supported, but that's likely going to 
require some help from the block layer, and verification of writes.  As 
far as handling of other marginal hardware, I'm very inclined to say 
that BTRFS should not care.  At the point at which a device is dropping 
off the bus and reappearing with enough regularity for this to be an 
issue, you have absolutely no idea how else it's corrupting your data, 
and support of such a situation is beyond any filesystem (including ZFS).


* Re: Unexpected raid1 behaviour
  2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin
  2017-12-17 11:58 ` Duncan
  2017-12-18  1:20 ` Qu Wenruo
@ 2017-12-18 13:31 ` Austin S. Hemmelgarn
  2018-01-12 12:26   ` Dark Penguin
  2 siblings, 1 reply; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-18 13:31 UTC (permalink / raw)
  To: Dark Penguin, linux-btrfs

On 2017-12-16 14:50, Dark Penguin wrote:
> Could someone please point me towards some read about how btrfs handles
> multiple devices? Namely, kicking faulty devices and re-adding them.
> 
> I've been using btrfs on single devices for a while, but now I want to
> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
> tried to see how does it handle various situations. The experience left
> me very surprised; I've tried a number of things, all of which produced
> unexpected results.
Expounding a bit on Duncan's answer with some more specific info.
> 
> I create a btrfs raid1 filesystem on two hard drives and mount it.
> 
> - When I pull one of the drives out (simulating a simple cable failure,
> which happens pretty often to me), the filesystem sometimes goes
> read-only. ???
> - But only after a while, and not always. ???
The filesystem won't go read-only until it hits an I/O error, and it's 
non-deterministic how long that will take on an idle filesystem that only 
sees read access (because all of the files being read may already be in 
the page cache).
> - When I fix the cable problem (plug the device back), it's immediately
> "re-added" back. But I see no replication of the data I've written onto
> a degraded filesystem... Nothing shows any problems, so "my filesystem
> must be ok". ???
One of two things happens in this case, and why there is no re-sync is 
dependent on which happens, but both ultimately have to do with the fact 
that BTRFS assumes I/O errors are from device failures, and are at worst 
transient.  Either:

1. The device reappears with the same name. This happens if the time it 
was disconnected is less than the kernel's command timeout (30 seconds 
by default).  In this case, BTRFS may not even notice that the device 
was gone (and if it doesn't, then a re-sync isn't necessary, since it 
will retry all the writes it needs to).  In this case, BTRFS assumes the 
I/O errors were temporary, and keeps using the device after logging the 
errors.  If this happens, then you need to manually re-sync things by 
scrubbing the filesystem (or balancing, but scrubbing is preferred as it 
should run quicker and will only re-write what is actually needed).
2. The device reappears with a different name.  In this case, the device 
was gone long enough that the block layer is certain it was 
disconnected, and thus when it reappears and BTRFS still holds open 
references to the old device node, it gets a new device node.  In this 
case, if the 'new' device is scanned, BTRFS will recognize it as part of 
the FS, but will keep using the old device node.  The correct fix here 
is to unmount the filesystem, re-scan all devices, and then remount the 
filesystem and manually re-sync with a scrub.
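
In command form, that fix is roughly (device name and mount point are just 
examples):

  umount /mnt
  btrfs device scan            # pick up the returned device under its new node
  mount /dev/sdb /mnt
  btrfs scrub start -Bd /mnt   # re-sync the formerly missing device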

> - If I unmount the filesystem and then mount it back, I see all my
> recent changes lost (everything I wrote during the "degraded" period).
I'm not quite sure about this, but I think BTRFS is rolling back to the 
last common generation number for some reason.

> - If I continue working with a degraded raid1 filesystem (even without
> damaging it further by re-adding the faulty device), after a while it
> won't mount at all, even with "-o degraded".
This is (probably) a known bug relating to chunk handling.  In a two 
device volume using a raid1 profile with a missing device, older kernels 
(I don't remember when the fix went in, but I could have sworn it was in 
4.13) will (erroneously) generate single-profile chunks when they need 
to allocate new chunks.  When you then go to mount the filesystem, the 
check for the degraded mount-ability of the FS fails because there is a 
device missing and single profile chunks.

Now, even without that bug, it's never a good idea to run a storage 
array degraded for any extended period of time, regardless of what type 
of array it is (BTRFS, ZFS, MD, LVM, or even hardware RAID).  By keeping 
it in 'degraded' mode, you're essentially telling the system that the 
array will be fixed in a reasonably short time-frame, which impacts how 
it handles the array.  If you're not going to fix it almost immediately, 
you should almost always reshape the array to account for the missing 
device if at all possible, as that will improve relative data safety and 
generally get you better performance than running degraded will.
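
For a two-device raid1 that has lost one device, that reshape is roughly 
(device name and mount point are just examples):

  mount -o degraded /dev/sdb /mnt
  btrfs balance start -dconvert=single -mconvert=dup /mnt   # profiles one device can hold
  btrfs device remove missing /mnt                          # forget the dead member
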
> 
> I can't wrap my head about all this. Either the kicked device should not
> be re-added, or it should be re-added "properly", or it should at least
> show some errors and not pretend nothing happened, right?..
BTRFS is not the best at error reporting at the moment.  If you check 
the output of `btrfs device stats` for that filesystem though, it should 
show non-zero values in the error counters (note that these counters are 
cumulative, so they are counts since the last time they were reset (or 
when the FS was created if they have never been reset).  Similarly, 
scrub should report errors, there should be error messages in the kernel 
log, and switching the FS to read-only mode _is_ technically reporting 
an error, as that's standard error behavior for most sensible 
filesystems (ext[234] being the notable exception, they just continue as 
if nothing happened).
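
For example (mount point is just an example):

  btrfs device stats /mnt      # per-device write/read/flush/corruption/generation error counters
  btrfs device stats -z /mnt   # print the counters, then reset them to zero
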
> 
> I must be missing something. Is there an explanation somewhere about
> what's really going on during those situations? Also, do I understand
> correctly that upon detecting a faulty device (a write error), nothing
> is done about it except logging an error into the 'btrfs device stats'
> report? No device kicking, no notification?.. And what about degraded
> filesystems - is it absolutely forbidden to work with them without
> converting them to a "single" filesystem first?..
As mentioned above, going read-only _is_ a notification that something 
is wrong.  Translating that (and the error counter increase, and the 
kernel log messages) into a user visible notification is not really the 
job of BTRFS, especially considering that no other filesystem or device 
manager does so either (yes, you can get nice notifications from LVM, 
but they aren't _from_ LVM itself, they're from other software that 
watches for errors, and the same type of software works just fine for 
BTRFS too).  If you're this worried about it and don't want to keep on 
top of it yourself by monitoring things manually, you really need to 
look into a tool like monit [1] that can handle this for you.
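
As a minimal sketch of that kind of monitoring (the script path, mount
point and service name are assumptions, not anything btrfs-progs or
monit ship):

    #!/bin/sh
    # /usr/local/bin/check-btrfs-errors.sh
    # exit non-zero if any btrfs error counter on /mnt is non-zero
    btrfs device stats /mnt | awk '$2 != 0 { bad = 1 } END { exit bad }'

    # monitrc fragment calling the helper above
    check program btrfs_errors with path "/usr/local/bin/check-btrfs-errors.sh"
        if status != 0 then alert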


[1] https://mmonit.com/monit/

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 12:10       ` Nikolay Borisov
@ 2017-12-18 13:43         ` Anand Jain
  0 siblings, 0 replies; 61+ messages in thread
From: Anand Jain @ 2017-12-18 13:43 UTC (permalink / raw)
  To: Nikolay Borisov, Peter Grandi, Linux fs Btrfs



>>> what was intended that it should be able to detect a previous
>>> member block-device becoming available again as a different
>>> device inode, which currently is very dangerous in some vital
>>> situations.

  Peter, What's the dangerous part here ?

>>   If device disappears, the patch [4] will completely take out the
>>   device from btrfs, and continues to RW in degraded mode.
>>   When it reappears then [5] will bring it back to the RW list.
> 
> but [5] relies on someone from userspace (presumably udev) actually
> invoking BTRFS_IOC_SCAN_DEV/IOSC_DEVICES_READY, no ?

  Nikolay, yes. Most of the distro udev rules already do that: udev
  calls btrfs dev scan when the SB is overwritten from userland or when
  a device whose primary SB is btrfs (re)appears.
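
  For reference, the rule involved is roughly the systemd
  64-btrfs.rules shipped by most distributions (quoted from memory,
  details vary by version):

    # approximate contents of /usr/lib/udev/rules.d/64-btrfs.rules
    SUBSYSTEM!="block", GOTO="btrfs_end"
    ACTION=="remove", GOTO="btrfs_end"
    ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"
    # registers the device with the kernel and asks whether the volume
    # is complete (BTRFS_IOC_DEVICES_READY under the hood)
    IMPORT{builtin}="btrfs ready $devnode"
    # tell systemd not to treat the device as usable until then
    ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
    LABEL="btrfs_end"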

> Because
> device_list_add is only ever called from btrfs_scan_one_device, which in
> turn is called by either of the aforementioned IOCTLS or during mount
> (which is not at play here).

  Hm. as above.

Thanks Anand


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 13:06     ` Austin S. Hemmelgarn
@ 2017-12-18 19:43       ` Tomasz Pala
  2017-12-18 22:01         ` Peter Grandi
  2017-12-19 12:25         ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 61+ messages in thread
From: Tomasz Pala @ 2017-12-18 19:43 UTC (permalink / raw)
  To: Linux fs Btrfs

On Mon, Dec 18, 2017 at 08:06:57 -0500, Austin S. Hemmelgarn wrote:

> The fact is, the only cases where this is really an issue is if you've 
> either got intermittently bad hardware, or are dealing with external 

Well, the RAID1+ is all about the failing hardware.

> storage devices.  For the majority of people who are using multi-device 
> setups, the common case is internally connected fixed storage devices 
> with properly working hardware, and for that use case, it works 
> perfectly fine.

If you're talking about "RAID"-0 or storage pools (volume management)
that is true.
But if you imply that RAID1+ "works perfectly fine as long as hardware
works fine", this is fundamentally wrong. If the hardware needs to work
properly for the RAID to work properly, no one would need this RAID in
the first place.

> that BTRFS should not care.  At the point at which a device is dropping 
> off the bus and reappearing with enough regularity for this to be an 
> issue, you have absolutely no idea how else it's corrupting your data, 
> and support of such a situation is beyond any filesystem (including ZFS).

Support for such a situation is exactly what RAID provides. So don't
blame people for expecting this to be handled as long as you call the
filesystem feature 'RAID'.

If this feature is not going to mitigate hardware hiccups by design (as
opposed to "not implemented yet, needs some time", which is perfectly
understandable), just don't call it 'RAID'.

All the features currently working, like bit-rot mitigation for
duplicated data (dup/raid*) using checksums, are something different
than RAID itself. RAID means "survive failure of N devices/controllers"
- I got one "RAID1" stuck in r/o after degraded mount, not nice... Not
_expected_ to happen after single disk failure (without any reappearing).

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 19:43       ` Tomasz Pala
@ 2017-12-18 22:01         ` Peter Grandi
  2017-12-19 12:46           ` Austin S. Hemmelgarn
  2017-12-19 12:25         ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 61+ messages in thread
From: Peter Grandi @ 2017-12-18 22:01 UTC (permalink / raw)
  To: Linux fs Btrfs

>> The fact is, the only cases where this is really an issue is
>> if you've either got intermittently bad hardware, or are
>> dealing with external

> Well, the RAID1+ is all about the failing hardware.

>> storage devices. For the majority of people who are using
>> multi-device setups, the common case is internally connected
>> fixed storage devices with properly working hardware, and for
>> that use case, it works perfectly fine.

> If you're talking about "RAID"-0 or storage pools (volume
> management) that is true. But if you imply, that RAID1+ "works
> perfectly fine as long as hardware works fine" this is
> fundamentally wrong.

I really agree with this, the argument about "properly working
hardware" is utterly ridiculous. I'll to this: apparently I am
not the first one to discover the "anomalies" in the "RAID"
profiles, but I may have been the first to document some of
them, e.g. the famous issues with the 'raid1' profile. How did I
discover them? Well, I had used Btrfs in single device mode for
a bit, and wanted to try multi-device, and the docs seemed
"strange", so I did tests before trying it out.

The tests were simply on a spare PC with a bunch of old disks to
create two block devices (partitions), put them in 'raid1' first
natively, then by adding a new member to an existing partition,
and then 'remove' one, or simply unplug it (actually 'echo 1 >
/sys/block/.../device/delete') initially. I wanted to check
exactly what happened, resync times, speed, behaviour and speed
when degraded, just ordinary operational tasks.

Well I found significant problems after less than one hour. I
can't imagine anyone with some experience of hw or sw RAID
(especially hw RAID, as hw RAID firmware is often fantastically
buggy especially as to RAID operations) that wouldn't have done
the same tests before operational use, and would not have found
the same issues too straight away. The only guess I could draw
is that whoever designed the "RAID" profile had zero operational
system administration experience.

> If the hardware needs to work properly for the RAID to work
> properly, noone would need this RAID in the first place.

It is not just that, but some maintenance operations are needed
even if the hardware works properly: for example preventive
maintenance, replacing drives that are becoming too old,
expanding capacity, testing periodically hardware bits. Systems
engineers don't just say "it works, let's assume it continues to
work properly, why worry".

My impression is that multi-device and "chunks" were designed in
one way by someone, and someone else did not understand the
intent, and confused them with "RAID", and based the 'raid'
profiles on that confusion. For example the 'raid10' profile
seems the least confused to me, and that's I think because the
"RAID" aspect is kept more distinct from the "multi-device"
aspect. But perhaps I am an optimist...

To simplify a longer discussion to have "RAID" one needs an
explicit design concept of "stripe", which in Btrfs needs to be
quite different from that of "set of member devices" and
"chunks", so that for example adding/removing to a "stripe" is
not quite the same thing as adding/removing members to a volume,
plus to make a distinction between online and offline members,
not just added and removed ones, and well-defined state machine
transitions (e.g. in response to hardware problems) among all
those, like in MD RAID. But the importance of such distinctions
may not be apparent to everybody.

But I may have read comments in which "block device" (a data
container on some medium), "block device inode" (a descriptor
for that) and "block device name" (a path to a "block device
inode") were hopelessly confused, so I don't hold a lot of
hope. :-(

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18  8:49     ` Anand Jain
  2017-12-18 10:36       ` Peter Grandi
  2017-12-18 12:10       ` Nikolay Borisov
@ 2017-12-18 22:28       ` Chris Murphy
  2017-12-18 22:29         ` Chris Murphy
                           ` (3 more replies)
  2 siblings, 4 replies; 61+ messages in thread
From: Chris Murphy @ 2017-12-18 22:28 UTC (permalink / raw)
  To: Anand Jain; +Cc: Peter Grandi, Linux fs Btrfs

On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote:

>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>  caused by [1], which we should revert back, since..
>    - balance (to raid1 chunk) may fail if FS is near full
>    - recovery (to raid1 chunk) will take more writes as compared
>      to recovery under degraded raid1 chunks


The advantage of writing single chunks when degraded, is in the case
where a missing device returns (is readded, intact). Catching up that
device with the first drive, is a manual but simple invocation of
'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'   The
alternative is a full balance or full scrub. It's pretty tedious for
big arrays.

mdadm uses bitmap=internal for any array larger than 100GB for this
reason, avoiding full resync.

'btrfs sub find' will list all *added* files since an arbitrarily
specified generation; but not deletions.
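
For the record, the full invocations are roughly as follows (mount
point, device and generation number are placeholders):

    # the generation the stale device was left at can be read from
    # its superblock
    btrfs inspect-internal dump-super /dev/sdb | grep generation
    # list files added/changed since that generation
    btrfs subvolume find-new /mnt 123456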


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 22:28       ` Chris Murphy
@ 2017-12-18 22:29         ` Chris Murphy
  2017-12-19 12:30         ` Adam Borowski
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 61+ messages in thread
From: Chris Murphy @ 2017-12-18 22:29 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Anand Jain, Peter Grandi, Linux fs Btrfs

On Mon, Dec 18, 2017 at 3:28 PM, Chris Murphy <lists@colorremedies.com> wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote:
>
>>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>>  caused by [1], which we should revert back, since..
>>    - balance (to raid1 chunk) may fail if FS is near full
>>    - recovery (to raid1 chunk) will take more writes as compared
>>      to recovery under degraded raid1 chunks
>
>
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'   The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.
>
> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.
>
> 'btrfs sub find' will list all *added* files since an arbitrarily
> specified generation; but not deletions.

Looks like LVM raid types (the non-legacy ones that use md driver)
also use a bitmap by default.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 19:43       ` Tomasz Pala
  2017-12-18 22:01         ` Peter Grandi
@ 2017-12-19 12:25         ` Austin S. Hemmelgarn
  2017-12-19 14:46           ` Tomasz Pala
  1 sibling, 1 reply; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-19 12:25 UTC (permalink / raw)
  To: Tomasz Pala, Linux fs Btrfs

On 2017-12-18 14:43, Tomasz Pala wrote:
> On Mon, Dec 18, 2017 at 08:06:57 -0500, Austin S. Hemmelgarn wrote:
> 
>> The fact is, the only cases where this is really an issue is if you've
>> either got intermittently bad hardware, or are dealing with external
> 
> Well, the RAID1+ is all about the failing hardware.
About catastrophically failing hardware, not intermittent failure.
> 
>> storage devices.  For the majority of people who are using multi-device
>> setups, the common case is internally connected fixed storage devices
>> with properly working hardware, and for that use case, it works
>> perfectly fine.
> 
> If you're talking about "RAID"-0 or storage pools (volume management)
> that is true.
> But if you imply, that RAID1+ "works perfectly fine as long as hardware
> works fine" this is fundamentally wrong. If the hardware needs to work
> properly for the RAID to work properly, noone would need this RAID in
> the first place.
I never said the hardware needed to not fail, just that it needed to 
fail in a consistent manner.  BTRFS handles catastrophic failures of 
storage devices just fine right now.  It has issues with intermittent 
failures, but so does hardware RAID, and so do MD and LVM to a lesser 
degree.
> 
>> that BTRFS should not care.  At the point at which a device is dropping
>> off the bus and reappearing with enough regularity for this to be an
>> issue, you have absolutely no idea how else it's corrupting your data,
>> and support of such a situation is beyond any filesystem (including ZFS).
> 
> Support for such situation is exactly what RAID performs. So don't blame
> people for expecting this to be handled as long as you call the
> filesystem feature a 'RAID'.
No, classical RAID (other than RAID0) is supposed to handle catastrophic 
failure of component devices.  That is the entirety of the original 
design purpose, and that is the entirety of what you should be using it 
for in production.  The point at which you are getting random corruption 
on a disk and you're using anything but BTRFS for replication, you 
_NEED_ to replace that disk, and if you don't you risk it causing 
corruption on the other disk.  As of right now, BTRFS is no different in 
that respect, but I agree that it _should_ be able to handle such a 
situation eventually.
> 
> If this feature is not going to mitigate hardware hiccups by design (as
> opposed to "not implemented yet, needs some time", which is perfectly
> understandable), just don't call it 'RAID'.
It shouldn't have been called RAID in the first place, that we can agree 
on (even if for different reasons).
> 
> All the features currently working, like bit-rot mitigation for
> duplicated data (dup/raid*) using checksums, are something different
> than RAID itself. RAID means "survive failure of N devices/controllers"
> - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not
> _expected_ to happen after single disk failure (without any reappearing).
And that's a known bug on older kernels (not to mention that you should 
not be mounting writable and degraded for any purpose other than fixing 
the volume).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 22:28       ` Chris Murphy
  2017-12-18 22:29         ` Chris Murphy
@ 2017-12-19 12:30         ` Adam Borowski
  2017-12-19 12:54         ` Andrei Borzenkov
  2017-12-19 12:59         ` Peter Grandi
  3 siblings, 0 replies; 61+ messages in thread
From: Adam Borowski @ 2017-12-19 12:30 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Anand Jain, Peter Grandi, Linux fs Btrfs

On Mon, Dec 18, 2017 at 03:28:14PM -0700, Chris Murphy wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote:
> >  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
> >  caused by [1], which we should revert back, since..
> >    - balance (to raid1 chunk) may fail if FS is near full
> >    - recovery (to raid1 chunk) will take more writes as compared
> >      to recovery under degraded raid1 chunks
> 
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'   The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.
> 
> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.
> 
> 'btrfs sub find' will list all *added* files since an arbitrarily
> specified generation; but not deletions.

This is fine, as scrub cares about extents, not files.  The newer generation
of metadata doesn't have a reference to the deleted extent anymore.

Selective scrub hasn't been implemented, but it should be pretty
straightforward -- unless nocow is involved.  Correct me if I'm wrong, but I
believe there's no way to tell which copy of a nocow extent is the good one.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 22:01         ` Peter Grandi
@ 2017-12-19 12:46           ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-19 12:46 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs

On 2017-12-18 17:01, Peter Grandi wrote:
>>> The fact is, the only cases where this is really an issue is
>>> if you've either got intermittently bad hardware, or are
>>> dealing with external
> 
>> Well, the RAID1+ is all about the failing hardware.
> 
>>> storage devices. For the majority of people who are using
>>> multi-device setups, the common case is internally connected
>>> fixed storage devices with properly working hardware, and for
>>> that use case, it works perfectly fine.
> 
>> If you're talking about "RAID"-0 or storage pools (volume
>> management) that is true. But if you imply, that RAID1+ "works
>> perfectly fine as long as hardware works fine" this is
>> fundamentally wrong.
> 
> I really agree with this, the argument about "properly working
> hardware" is utterly ridiculous. I'll to this: apparently I am
> not the first one to discover the "anomalies" in the "RAID"
> profiles, but I may have been the first to document some of
> them, e.g. the famous issues with the 'raid1' profile. How did I
> discover them? Well, I had used Btrfs in single device mode for
> a bit, and wanted to try multi-device, and the docs seemed
> "strange", so I did tests before trying it out.
> 
> The tests were simply on a spare PC with a bunch of old disks to
> create two block devices (partitions), put them in 'raid1' first
> natively, then by adding a new member to an existing partition,
> and then 'remove' one, or simply unplug it (actually 'echo 1 >
> /sys/block/.../device/delete') initially. I wanted to check
> exactly what happened, resync times, speed, behaviour and speed
> when degraded, just ordinary operational tasks.
> 
> Well I found significant problems after less than one hour. I
> can't imagine anyone with some experience of hw or sw RAID
> (especially hw RAID, as hw RAID firmware is often fantastically
> buggy especially as to RAID operations) that wouldn't have done
> the same tests before operational use, and would not have found
> the same issues too straight away. The only guess I could draw
> is that whover designed the "RAID" profile had zero operational
> system administration experience.
Or possibly that you didn't read the documentation thoroughly at all, 
which any reasonable system administrator would do before even starting 
to test something.  Unless you were doing stupid stuff like running for 
extended periods of time with half an array or not trying at all to 
repair things after the device reappeared, then none of what you 
described should have caused any issues.
> 
>> If the hardware needs to work properly for the RAID to work
>> properly, noone would need this RAID in the first place.
> 
> It is not just that, but some maintenance operations are needed
> even if the hardware works properly: for example preventive
> maintenance, replacing drives that are becoming too old,
> expanding capacity, testing periodically hardware bits. Systems
> engineers don't just say "it works, let's assume it continues to
> work properly, why worry".
Really?  So replacing hard drives just doesn't work on BTRFS?

Hmm...

Then that means that all the testing I do regularly of reshaping arrays 
and replacing devices that is consistently working (except for raid5 and 
raid6, but those have other issues too right now) must be a complete 
fluke.  I guess I have to go check my hardware and the QEMU sources to 
figure out how those are broken such that all of this is working 
successfully...

Seriously though, did you even _test_ replacing devices using the 
procedures described in the documentation, or did you just see that 
things didn't work in the couple of cases you thought were most 
important and assume nothing else worked?
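
For reference, the documented replacement path I'm referring to is
roughly (devid, device names and mount point are placeholders):

    # one-step replacement; the devid of a missing device can be used
    # as the source
    btrfs replace start 2 /dev/sdd /mnt
    btrfs replace status /mnt
    # or the older two-step variant
    btrfs device add /dev/sdd /mnt
    btrfs device remove missing /mnt
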
> 
> My impression is that multi-device and "chunks" were designed in
> one way by someone, and someone else did not understand the
> intent, and confused them with "RAID", and based the 'raid'
> profiles on that confusion. For example the 'raid10' profile
> seems the least confused to me, and that's I think because the
> "RAID" aspect is kept more distinct from the "multi-device"
> aspect. But perhaps I am an optimist...
The names were a stupid choice intended to convey the basic behavior in 
a way that idiots who have no business being sysadmins could understand 
(and yes, the raid1 profiles do behave as someone with a naive 
understanding of RAID1 as simple replication would expect). 
Unfortunately, we're stuck with them now, and there's no point in 
complaining beyond just acknowledging that the names were a poor choice.
> 
> To simplify a longer discussion to have "RAID" one needs an
> explicit design concept of "stripe", which in Btrfs needs to be
> quite different from that of "set of member devices" and
> "chunks", so that for example adding/removing to a "stripe" is
> not quite the same thing as adding/removing members to a volume,
> plus to make a distinction between online and offline members,
> not just added and removed ones, and well-defined state machine
> transitions (e.g. in response to hardware problems) among all
> those, like in MD RAID. But the importance of such distinctions
> may not be apparent to everybody.
Or maybe people are sensible and don't care about such distinctions as 
long as things work within the defined parameters?  It's only engineers 
and scientists that care about how and why (or stuffy bureaucrats who 
want control over things).  Regular users, and even some developers 
don't care about the exact implementation provided it works how they 
need it to work.
> 
[Obviously intentionally inflammatory comment removed]

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 22:28       ` Chris Murphy
  2017-12-18 22:29         ` Chris Murphy
  2017-12-19 12:30         ` Adam Borowski
@ 2017-12-19 12:54         ` Andrei Borzenkov
  2017-12-19 12:59         ` Peter Grandi
  3 siblings, 0 replies; 61+ messages in thread
From: Andrei Borzenkov @ 2017-12-19 12:54 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Anand Jain, Peter Grandi, Linux fs Btrfs

On Tue, Dec 19, 2017 at 1:28 AM, Chris Murphy <lists@colorremedies.com> wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain <anand.jain@oracle.com> wrote:
>
>>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>>  caused by [1], which we should revert back, since..
>>    - balance (to raid1 chunk) may fail if FS is near full
>>    - recovery (to raid1 chunk) will take more writes as compared
>>      to recovery under degraded raid1 chunks
>
>
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'   The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.
>

The alternative would be to introduce new "resilver" operation that
would allocate second copy for every degraded chunk. And it could even
be started automatically when enough redundancy is present again.

> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.
>

ZFS manages to avoid full sync in this case quite efficiently.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 22:28       ` Chris Murphy
                           ` (2 preceding siblings ...)
  2017-12-19 12:54         ` Andrei Borzenkov
@ 2017-12-19 12:59         ` Peter Grandi
  3 siblings, 0 replies; 61+ messages in thread
From: Peter Grandi @ 2017-12-19 12:59 UTC (permalink / raw)
  To: Linux fs Btrfs

[ ... ]

> The advantage of writing single chunks when degraded, is in
> the case where a missing device returns (is readded,
> intact).  Catching up that device with the first drive, is a
> manual but simple invocation of 'btrfs balance start
> -dconvert=raid1,soft -mconvert=raid1,soft' The alternative is
> a full balance or full scrub. It's pretty tedious for big
> arrays.

That is merely an after-the-fact rationalization for a design
that is at the same time entirely logical and quite broken: that
the intended replication factor is the same as the current
number of members of the volume, so if a volume has (currently)
only one member, then only "single" chunks get created.

A design that would work better for operations would be to have
"profiles" to be a concept entirely independent of number of
members, or perhaps more precisely to have the "desired" profile
of a chunk be distinct from the "actual" profile (dependent on
the actual number of members of a volume) of that chunk, so that
if a volume has only one member chunks could be created that
have "desired" profile 'raid1' but "actual" profile 'single', or
perhaps more sensibly 'raid1-with-missing-mirror', with checks
that "actual" profile be usable else the volume is not
mountable.

Note: ideally every chunk would have both a static desired
profile and a desired stripe width, and a computed actual
profile and a actual stripe width. Or perhaps the desired
profile and width would be properties of the volume (for each of
the three types of data).

For example in MD RAID it is perfectly legitimate to create a
RAID6 set with "desired" width of 6 and "actual" width of 4 (in
which case it can be activated as degraded) or a RAID5 set with
"desired" width of 5 and actual width of 3 (in which case it
cannot be activated at all until at least another member is
added).

The difference with MD RAID is that in MD RAID there is (except
in one case , during conversion) an exact match between
"desired" profile stripe width and number of members, while at
least in principle a Btrfs volume can have any number of chunks
of any profile of any desired stripe size (except that current
implementation is not so flexible in most profiles).

That would require scanning all chunks to determine whether a
volume is mountable at all or mountable only as degraded, while
MD RAID can just count the members. Apparently recent versions
of the Btrfs 'raid1' profile do just that.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 12:25         ` Austin S. Hemmelgarn
@ 2017-12-19 14:46           ` Tomasz Pala
  2017-12-19 16:35             ` Austin S. Hemmelgarn
                               ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Tomasz Pala @ 2017-12-19 14:46 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 07:25:49 -0500, Austin S. Hemmelgarn wrote:

>> Well, the RAID1+ is all about the failing hardware.
> About catastrophically failing hardware, not intermittent failure.

It shouldn't matter - as long as a disk that fails once is kicked out of the
array *if possible*. Or reattached in write-only mode as a best effort,
meaning "will try to keep your *redundancy* copy, but won't trust it to
be read from".
As you can see, the "failure level handled" is a matter not of definition, but of implementation.

*if possible* == when there are other volume members having the same
data /or/ there are spare members that could take over the failing ones.

> I never said the hardware needed to not fail, just that it needed to 
> fail in a consistent manner.  BTRFS handles catastrophic failures of 
> storage devices just fine right now.  It has issues with intermittent 
> failures, but so does hardware RAID, and so do MD and LVM to a lesser 
> degree.

When planning hardware failovers/backups I can't predict the failing
pattern. So first of all - every *known* shortcoming should be
documented somehow. Secondly - permanent failures are not handled "just
fine", as there is (1) no automatic mount as degraded, so the machine
won't reboot properly, and (2) the r/w degraded mount is[*] a one-timer.
Again, this should be:
1. documented in manpage, as a comment to profiles, not wiki page or
linux-btrfs archives,
2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
3. blown into one's face when doing r/w degraded mount (by kernel).

[*] yes, I know the recent kernels handle this, but the last LTS (4.14)
is just too young.

I'm not aware of the issues with MD you're referring to - I got drives
kicked off many times and they *never* caused any problems despite
being visible in the system. Moreover, since 4.10 there is FAILFAST,
which would do this even faster. There is also no problem with mounting
a degraded MD array automatically, so saying that btrfs is doing "just
fine" is, well... not even theoretically close. And in my practice it
never saved the day, but has already ruined a few... It's not right for
the protection to cause more problems than it solves.

> No, classical RAID (other than RAID0) is supposed to handle catastrophic 
> failure of component devices.  That is the entirety of the original 
> design purpose, and that is the entirety of what you should be using it 
> for in production. 

1. no, it's not: https://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf

2. even if it were, a single I/O failure (e.g. one bad block) might
   be interpreted as "catastrophic", and then the entire drive would
   have to be kicked off.

3. if the sysadmin doesn't request any kind of device autobinding, the
device that has already failed doesn't matter anymore - regardless of
its current state or reappearances.

> The point at which you are getting random corruption 
> on a disk and you're using anything but BTRFS for replication, you 
> _NEED_ to replace that disk, and if you don't you risk it causing 
> corruption on the other disk. 

Not only BTRFS - there are hardware solutions like T10 PI/DIF.
Guess what a RAID controller should do in such a situation? Fail
the drive immediately after the first CRC mismatch?

BTW do you consider "random corruption" as a catastrophic failure?

> As of right now, BTRFS is no different in 
> that respect, but I agree that it _should_ be able to handle such a 
> situation eventually.

The first step should be to realize that there are some tunables
required if you want to handle many different situations.

Having said that, let's get back to reality:


The classical RAID is about keeping the system functional - trashing a
single drive from RAID1 should be fully-ignorable by sysadmin. The
system must reboot properly, work properly, and there MUST NOT be ANY
functional differences compared to non-degraded mode except for slower
read rate (and having no more redundancy obviously).


- not having this == not having RAID1.

> It shouldn't have been called RAID in the first place, that we can agree 
> on (even if for different reasons).

The misnaming would be much less of a problem if it were documented
properly (man page, btrfs-progs and finally kernel screaming).

>> - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not
>> _expected_ to happen after single disk failure (without any reappearing).
> And that's a known bug on older kernels (not to mention that you should 
> not be mounting writable and degraded for any purpose other than fixing 
> the volume).

Yes, ...but:

1. "known" only to the people that already stepped into it, meaning too
   late - it should be "COMMONLY known", i.e. documented,
2. "older kernels" are not so old, the newest mature LTS (4.9) is still
   affected,
3. I was about to fix the volume when the machine accidentally rebooted.
   Which should have done no harm if I had a RAID1.
4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
   as long as you accept "no more redundancy"...
4a. ...or had an N-way mirror and there is still some redundancy if N>2.


Since we agree that btrfs RAID != common RAID, as there are/were
different design principles and some features are in WIP state at best,
the current behaviour should be better documented. That's it.


-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 14:46           ` Tomasz Pala
@ 2017-12-19 16:35             ` Austin S. Hemmelgarn
  2017-12-19 17:56               ` Tomasz Pala
  2017-12-19 18:31             ` George Mitchell
  2017-12-19 19:35             ` Chris Murphy
  2 siblings, 1 reply; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-19 16:35 UTC (permalink / raw)
  To: Tomasz Pala, Linux fs Btrfs

On 2017-12-19 09:46, Tomasz Pala wrote:
> On Tue, Dec 19, 2017 at 07:25:49 -0500, Austin S. Hemmelgarn wrote:
> 
>>> Well, the RAID1+ is all about the failing hardware.
>> About catastrophically failing hardware, not intermittent failure.
> 
> It shouldn't matter - as long as disk failing once is kicked out of the
> array *if possible*. Or reattached in write-only mode as a best effort,
> meaning "will try to keep your *redundancy* copy, but won't trust it to
> be read from".
> As you see, the "failure level handled" is not by definition, but by implementation.
> 
> *if possible* == when there are other volume members having the same
> data /or/ there are spare members that could take over the failing ones.
Actually, it very much does matter, at least with hardware RAID.  The 
exact failure mode that causes issues for BTRFS (intermittent 
disconnects at the bus level) causes just as many issues with most 
hardware RAID controllers (though the exact issues are not quite the 
same), and is in and of itself an indicator that something else is wrong.
> 
>> I never said the hardware needed to not fail, just that it needed to
>> fail in a consistent manner.  BTRFS handles catastrophic failures of
>> storage devices just fine right now.  It has issues with intermittent
>> failures, but so does hardware RAID, and so do MD and LVM to a lesser
>> degree.
> 
> When planning hardware failovers/backups I can't predict the failing
> pattern. So first of all - every *known* shortcoming should be
> documented somehow. Secondly - permanent failures are not handled "just
> fine", as there is (1) no automatic mount as degraded, so the machine
> won't reboot properly and (2) the r/w degraded mount is[*] one-timer.
> Again, this should be:
> 1. documented in manpage, as a comment to profiles, not wiki page or
> linux-btrfs archives,
Agreed, our documentation needs to be consolidated in general (I would 
absolutely love to see it just be the man pages, and have those up on 
the wiki like some other software does).
> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
I don't agree on this one.  It is in no way unreasonable to expect that 
someone has read the documentation _before_ trying to use something.
> 3. blown into one's face when doing r/w degraded mount (by kernel).
Agreed here though.
> 
> [*] yes, I know the recent kernels handle this, but the last LTS (4.14)
> is just too young.
4.14 should have gotten that patch last I checked.
> 
> I'm now aware of issues with MD you're referring to - I got drives
> kicked off many times and they were *never* causing any problems despite
> being visible in the system. Moreover, since 4.10 there is FAILFAST
> which would do this even faster. There is also no problem with mounting
> degraded MD array automatically, so telling that btrfs is doing "just
> fine" is, well... not even theoretically close. And in my practice it
> never saved the day, but already ruined a few ones... It's not right for
> the protection to make more problems than it solves.
Regarding handling of degraded mounts, BTRFS _is_ working just fine, we 
just chose a different default behavior from MD and LVM (we make certain 
the user knows about the issue without having to look through syslog).
> 
>> No, classical RAID (other than RAID0) is supposed to handle catastrophic
>> failure of component devices.  That is the entirety of the original
>> design purpose, and that is the entirety of what you should be using it
>> for in production.
> 
> 1. no, it's not: https://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf
OK, so I see here performance as a motivation, but listed secondarily to 
reliability, and all the discussion of reliability assumes that either:
1. Disks fail catastrophically.
or:
2. Disks return read or write errors when there is a problem.

Following just those constraints, RAID is not designed to handle devices 
that randomly drop off the bus and reappear or exhibit silent data 
corruption, so my original statement largely was accurate, the primary 
design intent was handling of catastrophic failures.
> 
> 2. even if there was, the single I/O failure (e.g. one bad block) might
>     be interpreted as "catastrophic" and the entire drive should be kicked off then.
This I will agree with, given that it's common behavior in many RAID 
implementations.  As people are quick to point out BTRFS _IS NOT_ RAID, 
the devs just made a poor choice in the original naming of the 2-way 
replication implementation, and it stuck.
> 
> 3. if sysadmin doesn't request any kind of device autobinding, the
> device that were already failed doesn't matter anymore - regardless of
> it's current state or reappearences.
You have to explicitly disable automatic binding of drivers to 
hot-plugged devices though, so that's rather irrelevant.  Yes, you can 
do so yourself if you want, and it will mitigate one of the issues with 
BTRFS to a limited degree (we still don't 'kick-out' old devices, even 
if we should).
> 
>> The point at which you are getting random corruption
>> on a disk and you're using anything but BTRFS for replication, you
>> _NEED_ to replace that disk, and if you don't you risk it causing
>> corruption on the other disk.
> 
> Not only BTRFS, there are hardware solutions like T10 PI/DIF.
> Guess what should RAID controller do in such situation? Fail
> drive immediately after the first CRC mismatch?
If it's more than single errors, yes, it should fail the drive.  If 
you're getting any kind of recurring corruption, it's time to replace 
the drive, whether the error gets corrected or not.
> 
> BTW do you consider "random corruption" as a catastrophic failure?
No, catastrophic failure in reference to hard drives is (usually) 
mechanical failure rendering the drive unusable (such as a head crash 
for example), or a complete controller failure (for example, the drive 
won't enumerate at all).

To use a (possibly strained) analogy:  Catastrophic failure is like a 
handgun blowing up when you try to fire it, you won't be able to use it 
ever again.  Random corruption is equivalent to not consistently feeding 
new rounds from the magazine properly, it still technically works, and 
can (theoretically) be fixed, but it's usually just simpler (and 
significantly safer) to replace the gun than it is to try and jury rig 
things so that it works reliably.
> 
>> As of right now, BTRFS is no different in
>> that respect, but I agree that it _should_ be able to handle such a
>> situation eventually.
> 
> The first step should be to realize, that there are some tunables
> required if you want to handle many different situation.
> 
> Having said that, let's back to reallity:
> 
> 
> The classical RAID is about keeping the system functional - trashing a
> single drive from RAID1 should be fully-ignorable by sysadmin. The
> system must reboot properly, work properly and there MUST NOT by ANY
> functional differences compared to non-degraded mode except for slower
> read rate (and having no more redundancy obviously).
'No functional differences' isn't even a standard that MD or LVM 
achieve, and it's definitely not one that most hardware RAID controllers 
have.
> 
> - not having this == not having RAID1.
Again, BTRFS _IS NOT_ RAID.
> 
>> It shouldn't have been called RAID in the first place, that we can agree
>> on (even if for different reasons).
> 
> The misnaming would be much less of a problem if it were documented
> properly (man page, btrfs-progs and finally kernel screaming).
Yes, our documentation could be significantly better.
> 
>>> - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not
>>> _expected_ to happen after single disk failure (without any reappearing).
>> And that's a known bug on older kernels (not to mention that you should
>> not be mounting writable and degraded for any purpose other than fixing
>> the volume).
> 
> Yes, ...but:
> 
> 1. "known" only to the people that already stepped into it, meaning too
>     late - it should be "COMMONLY known", i.e. documented,
And also known to people who have done proper research.

> 2. "older kernels" are not so old, the newest mature LTS (4.9) is still
>     affected,
I really don't see this as a valid excuse.  It's pretty well documented 
that you absolutely should be running the most recent kernel if you're 
using BTRFS.

> 3. I was about to fix the volume, accidentally the machine has rebooted.
>     Which should do no harm if I had a RAID1.
Agreed.

> 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
>     as long as you accept "no more redundancy"...
This is a matter of opinion.  I still contend that running half a two 
device array for an extended period of time without reshaping it to be a 
single device is a bad idea for cases other than BTRFS.  The fewer 
layers of code you're going through, the safer you are.

> 4a. ...or had an N-way mirror and there is still some redundancy if N>2.
N-way mirroring is still on the list of things to implement, believe me, 
many people want it.
> 
> 
> Since we agree, that btrfs RAID != common RAID, as there are/were
> different design principles and some features are in WIP state at best,
> the current behaviour should be better documented. That's it.
Patches would be gratefully accepted.  It's really not hard to update 
the documentation, it's just that nobody has had the time to do it.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 16:35             ` Austin S. Hemmelgarn
@ 2017-12-19 17:56               ` Tomasz Pala
  2017-12-19 19:47                 ` Chris Murphy
  2017-12-19 20:11                 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 61+ messages in thread
From: Tomasz Pala @ 2017-12-19 17:56 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote:

>> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
> I don't agree on this one.  It is in no way unreasonable to expect that 
> someone has read the documentation _before_ trying to use something.

Provided there are:
- decent documentation AND
- an appropriate[*] level of "common knowledge" AND
- stable behaviour and mature code (kernel, tools etc.)

BTRFS lacks all of these - there are major functional changes in
current kernels, and it reaches far beyond LTS. All the knowledge YOU
have here, on this mailing list, should be 'engraved' into btrfs-progs,
as there are people still using kernels with serious malfunctions.
btrfs-progs could easily check the kernel version and print an
appropriate warning - consider this a "software quirk".

[*] by 'appropriate' I mean knowledge so common, as the real word usage
itself.

Moreover, the fact that I've read the documentation and did
comprehensive[**] research today doesn't mean I should have to do it
again after a kernel change, for example.

[**] apparently what I thought was comprehensive wasn't at all. Most of
the btrfs quirks I've found HERE. As a regular user, not an fs
developer, I shouldn't even be looking at this list.

BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
this distro to research every component used?

>> [*] yes, I know the recent kernels handle this, but the last LTS (4.14)
>> is just too young.
> 4.14 should have gotten that patch last I checked.

I meant too young to be widely adopted yet. This requires some
countermeasures in the toolkit that is easier to upgrade, like userspace.

> Regarding handling of degraded mounts, BTRFS _is_ working just fine, we 
> just chose a different default behavior from MD and LVM (we make certain 
> the user knows about the issue without having to look through syslog).

I'm not arguing about the behaviour - apparently there were some
technical reasons. But IF the reasons are not technical, but
philosophical, I'd like to have either mount option (allow_degraded) or
even kernel-level configuration knob for this to happen RAID-style.

Now, if current kernels won't force a degraded RAID1 to ro, can I
safely add "degraded" to the mount options? My primary concern is
machine UPTIME. I care less about the data, as they are backed up to
some remote location and losing a day or a week of changes is
acceptable, split-brain as well, while every hour of downtime costs me
real money.
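
(Purely to illustrate what I'm asking about - a hypothetical fstab
line, with a placeholder UUID:

    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /srv  btrfs  defaults,degraded  0  0

- whether doing that permanently is wise is exactly my question.)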


Meanwhile I can't fix a broken server using 'remote hands' - mounting a
degraded volume means using a physical keyboard or KVM, which might not
be available at a site. Current btrfs behaviour requires physical
presence AND downtime (if a machine rebooted) for fixing things that
could be fixed remotely and on-line.

Anyway, users shouldn't have to look through syslog; device status
should be reported by some monitoring tool.

A deviation this big (relative to common RAID1 scenarios) deserves to
be documented. Or renamed...

> reliability, and all the discussion of reliability assumes that either:
> 1. Disks fail catastrophically.
> or:
> 2. Disks return read or write errors when there is a problem.
> 
> Following just those constraints, RAID is not designed to handle devices 
> that randomly drop off the bus and reappear

If it drops, there would be I/O errors eventually. Without the errors - agreed.

> implementations.  As people are quick to point out BTRFS _IS NOT_ RAID, 
> the devs just made a poor choice in the original naming of the 2-way 
> replication implementation, and it stuck.

Well, the question is: either it is not raid YET, or maybe it's time to consider renaming?

>> 3. if sysadmin doesn't request any kind of device autobinding, the
>> device that were already failed doesn't matter anymore - regardless of
>> it's current state or reappearences.
> You have to explicitly disable automatic binding of drivers to 
> hot-plugged devices though, so that's rather irrelevant.  Yes, you can 

Ha! I got this disabled on every bus (although for different reasons)
after boot completes. Lucky me:)

>> 1. "known" only to the people that already stepped into it, meaning too
>>     late - it should be "COMMONLY known", i.e. documented,
> And also known to people who have done proper research.

All the OpenSUSE userbase? ;)

>> 2. "older kernels" are not so old, the newest mature LTS (4.9) is still
>>     affected,
> I really don't see this as a valid excuse.  It's pretty well documented 
> that you absolutely should be running the most recent kernel if you're 
> using BTRFS.

Good point.

>> 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
>>     as long as you accept "no more redundancy"...
> This is a matter of opinion.

Sure! And the particular opinion depends on the system being affected.
I'd rather not have any split-brain scenario under my database servers,
but also won't mind data loss on a BGP router as long as it keeps
running and is fully operational.

> I still contend that running half a two 
> device array for an extended period of time without reshaping it to be a 
> single device is a bad idea for cases other than BTRFS.  The fewer 
> layers of code you're going through, the safer you are.

I create a single-device degraded MD RAID1 when I attach one disk for
deployment (usually test machines) which is going to be converted into
a dual-disk (production) setup in the future - attaching a second disk
to the array is much easier and faster than messing with device nodes
(or labels or anything). The same applies to LVM; it's better to have
it even when not used at the moment. In the case of btrfs there is no
need for such preparations, as devices are added without renaming.
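
(The kind of invocation I mean, with placeholder device names:

    # create a raid1 with one real member and one intentionally
    # missing slot
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 missing
    # later, when the second (production) disk is attached
    mdadm /dev/md0 --add /dev/sdb2

so the array is already in place and only needs the second member.)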

However, sometimes the systems end up without a second disk attached -
either because of their low importance, sometimes because of power
usage; others need to be quiet.


One might ask why I don't attach the second disk before initial system
creation - the answer is simple: I usually use the same drive models in
RAID1, but it happens that drives bought from the same production lot
fail simultaneously, so this approach mitigates the problem and gives
more time to react.

> Patches would be gratefully accepted.  It's really not hard to update 
> the documentation, it's just that nobody has had the time to do it.

Writing accurate documentation requires a deep understanding of the
internals. Me, for example - I know some of the results: "don't do
this", "if X happens, Y should be done", "Z doesn't work yet, but there
were some patches", "V was fixed in some recent kernel, but no idea
which commit it was exactly", "W was severely broken in kernel I.J.K"
etc. Not the hard data that could be posted without creating the
impression that it's all about compiling a list of complaints. Not to
mention I'm absolutely not familiar with current patches, WIP and many
other corner cases or usage scenarios. In fact, not only the internals,
but also the motivation and design principles must be well understood
to write a piece of documentation.

Otherwise some "fake news" propaganda is being created, just like
https://suckless.org/sucks/systemd or other systemd-haters that haven't
spent a day in their life for writing SysV init scripts or managing a
bunch of mission critical machines with handcrafted supervisors.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 14:46           ` Tomasz Pala
  2017-12-19 16:35             ` Austin S. Hemmelgarn
@ 2017-12-19 18:31             ` George Mitchell
  2017-12-19 20:28               ` Tomasz Pala
  2017-12-19 19:35             ` Chris Murphy
  2 siblings, 1 reply; 61+ messages in thread
From: George Mitchell @ 2017-12-19 18:31 UTC (permalink / raw)
  To: linux-btrfs

On 12/19/2017 06:46 AM, Tomasz Pala wrote:
> On Tue, Dec 19, 2017 at 07:25:49 -0500, Austin S. Hemmelgarn wrote:
>
>>> Well, the RAID1+ is all about the failing hardware.
>> About catastrophically failing hardware, not intermittent failure.
> It shouldn't matter - as long as disk failing once is kicked out of the
> array *if possible*. Or reattached in write-only mode as a best effort,
> meaning "will try to keep your *redundancy* copy, but won't trust it to
> be read from".
> As you see, the "failure level handled" is not by definition, but by implementation.
>
> *if possible* == when there are other volume members having the same
> data /or/ there are spare members that could take over the failing ones.
>
>> I never said the hardware needed to not fail, just that it needed to
>> fail in a consistent manner.  BTRFS handles catastrophic failures of
>> storage devices just fine right now.  It has issues with intermittent
>> failures, but so does hardware RAID, and so do MD and LVM to a lesser
>> degree.
> When planning hardware failovers/backups I can't predict the failing
> pattern. So first of all - every *known* shortcoming should be
> documented somehow. Secondly - permanent failures are not handled "just
> fine", as there is (1) no automatic mount as degraded, so the machine
> won't reboot properly and (2) the r/w degraded mount is[*] one-timer.
> Again, this should be:
> 1. documented in manpage, as a comment to profiles, not wiki page or
> linux-btrfs archives,
> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
> 3. blown into one's face when doing r/w degraded mount (by kernel).
>
> [*] yes, I know the recent kernels handle this, but the last LTS (4.14)
> is just too young.
>
> I'm now aware of issues with MD you're referring to - I got drives
> kicked off many times and they were *never* causing any problems despite
> being visible in the system. Moreover, since 4.10 there is FAILFAST
> which would do this even faster. There is also no problem with mounting
> degraded MD array automatically, so telling that btrfs is doing "just
> fine" is, well... not even theoretically close. And in my practice it
> never saved the day, but already ruined a few ones... It's not right for
> the protection to make more problems than it solves.
>
>> No, classical RAID (other than RAID0) is supposed to handle catastrophic
>> failure of component devices.  That is the entirety of the original
>> design purpose, and that is the entirety of what you should be using it
>> for in production.
> 1. no, it's not: https://www.cs.cmu.edu/~garth/RAIDpaper/Patterson88.pdf
>
> 2. even if there was, the single I/O failure (e.g. one bad block) might
>     be interpreted as "catastrophic" and the entire drive should be kicked off then.
>
> 3. if sysadmin doesn't request any kind of device autobinding, the
> device that were already failed doesn't matter anymore - regardless of
> it's current state or reappearences.
>
>> The point at which you are getting random corruption
>> on a disk and you're using anything but BTRFS for replication, you
>> _NEED_ to replace that disk, and if you don't you risk it causing
>> corruption on the other disk.
> Not only BTRFS, there are hardware solutions like T10 PI/DIF.
> Guess what should RAID controller do in such situation? Fail
> drive immediately after the first CRC mismatch?
>
> BTW do you consider "random corruption" as a catastrophic failure?
>
>> As of right now, BTRFS is no different in
>> that respect, but I agree that it _should_ be able to handle such a
>> situation eventually.
> The first step should be to realize, that there are some tunables
> required if you want to handle many different situation.
>
> Having said that, let's back to reallity:
>
>
> The classical RAID is about keeping the system functional - trashing a
> single drive from RAID1 should be fully-ignorable by sysadmin. The
> system must reboot properly, work properly and there MUST NOT by ANY
> functional differences compared to non-degraded mode except for slower
> read rate (and having no more redundancy obviously).
>
>
> - not having this == not having RAID1.
>
>> It shouldn't have been called RAID in the first place, that we can agree
>> on (even if for different reasons).
> The misnaming would be much less of a problem if it were documented
> properly (man page, btrfs-progs and finally kernel screaming).
>
>>> - I got one "RAID1" stuck in r/o after degraded mount, not nice... Not
>>> _expected_ to happen after single disk failure (without any reappearing).
>> And that's a known bug on older kernels (not to mention that you should
>> not be mounting writable and degraded for any purpose other than fixing
>> the volume).
> Yes, ...but:
>
> 1. "known" only to the people that already stepped into it, meaning too
>     late - it should be "COMMONLY known", i.e. documented,
> 2. "older kernels" are not so old, the newest mature LTS (4.9) is still
>     affected,
> 3. I was about to fix the volume, accidentally the machine has rebooted.
>     Which should do no harm if I had a RAID1.
> 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
>     as long as you accept "no more redundancy"...
> 4a. ...or had an N-way mirror and there is still some redundancy if N>2.
>
>
> Since we agree, that btrfs RAID != common RAID, as there are/were
> different design principles and some features are in WIP state at best,
> the current behaviour should be better documented. That's it.
>
>
I have significant experience as a user of raid1. I spent years using 
software raid1 and then more years using hardware (3ware) raid1 and now 
around 3 years using btrfs raid1. I have not found btrfs raid1 to be 
less reliable than any of the previous implementations of raid.  I have 
found that any implementation of raid whether it be software, hardware, 
or filesystem, is not infallible.  I have also found that when you have 
a failure, you don't just plug things back in and expect it to be fixed 
without seriously investigating what has gone wrong and potential 
unexpected consequences.  I have found that even with hardware raid you 
can find ways to screw things up to the point that you lose your data.  
I have had situations where I reconnected a drive on hardware raid1 only 
to find that the array would not sync and from there on I ended up 
having to directly attach one of the drives and recover the partition 
table with TestDisk in order to regain access to my data.  So NO FORM 
of raid is a replacement for backups and NO FORM of raid is a 
replacement for due diligence in recovery from failure mode.  Raid gives 
you a second chance when things go wrong, it does not make failures 
transparent which is seemingly what we sometimes expect from raid.  And 
I doubt that we will ever achieve that goal no matter how much effort we 
put into making it happen. Even with hardware raid things can happen 
that were not foreseen by the designers.  So I think we have to be 
careful when we compare various raid (or "raid like") implementations.  
There is no such thing as "fool proof" raid and likely never will be. 
And with that I will end my rant.




^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 14:46           ` Tomasz Pala
  2017-12-19 16:35             ` Austin S. Hemmelgarn
  2017-12-19 18:31             ` George Mitchell
@ 2017-12-19 19:35             ` Chris Murphy
  2017-12-19 20:41               ` Tomasz Pala
  2 siblings, 1 reply; 61+ messages in thread
From: Chris Murphy @ 2017-12-19 19:35 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Linux fs Btrfs

On Tue, Dec 19, 2017 at 7:46 AM, Tomasz Pala <gotar@polanet.pl> wrote:

>Secondly - permanent failures are not handled "just
> fine", as there is (1) no automatic mount as degraded, so the machine
> won't reboot properly and (2) the r/w degraded mount is[*] one-timer.
> Again, this should be:

One of the reasons for problem 1 is problem 2. If we had automatic
degraded mount, people would run into problem 2 and now they're stuck
with a read only file system. Another reason is the kernel code and
udev rule for device "readiness" means the volume is not "ready" until
all member devices are present. And while the volume is not "ready"
systemd will not even attempt to mount. Solving this requires kernel
and udev work, or possibly a helper, to wait an appropriate amount of
time. I also think it's a bad idea to implement automatic degraded
mounts unless there's an API for user space to receive either a push
or request notification for degradedness state, so desktop
environments can inform the user of degradedness.

There is no amount of documentation that makes up for these
deficiencies enough to enable automatic degraded mounts by default. I
would consider it a high order betrayal of user trust to do it.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 17:56               ` Tomasz Pala
@ 2017-12-19 19:47                 ` Chris Murphy
  2017-12-19 21:17                   ` Tomasz Pala
  2017-12-20 16:53                   ` Andrei Borzenkov
  2017-12-19 20:11                 ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 61+ messages in thread
From: Chris Murphy @ 2017-12-19 19:47 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Linux fs Btrfs

On Tue, Dec 19, 2017 at 10:56 AM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote:
>
>>> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
>> I don't agree on this one.  It is in no way unreasonable to expect that
>> someone has read the documentation _before_ trying to use something.
>
> Provided there are:
> - a decent documentation AND
> - appropriate[*] level of "common knowledge" AND
> - stable behaviour and mature code (kernel, tools etc.)
>
> BTRFS lacks all of these - there are major functional changes in current
> kernels and it reaches far beyond LTS. All the knowledge YOU have here,
> on this maillist, should be 'engraved' into btrfs-progs, as there are
> people still using kernels with serious malfunctions. btrfs-progs could
> easily check kernel version and print appropriate warning - consider
> this a "software quirks".

The more verbose man pages are, the more likely it is that information
gets stale. We already see this with the Btrfs Wiki. So are you
volunteering to do the btrfs-progs work to easily check kernel
versions and print appropriate warnings? Or is this a case of
complaining about what other people aren't doing with their time?

>
> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
> this distro to research every component used?

As far as I'm aware, only the Btrfs single-device stuff is "supported".
The multiple-device stuff is definitely not supported on openSUSE, but
I have no idea to what degree they support it with an enterprise
license; no doubt that support must come with caveats.


>
>>> [*] yes, I know the recent kernels handle this, but the last LTS (4.14)
>>> is just too young.
>> 4.14 should have gotten that patch last I checked.
>
> I meant too young to be widely adopted yet. This requires some
> countermeasures in the toolkit that is easier to upgrade, like userspace.
>
>> Regarding handling of degraded mounts, BTRFS _is_ working just fine, we
>> just chose a different default behavior from MD and LVM (we make certain
>> the user knows about the issue without having to look through syslog).
>
> I'm not arguing about the behaviour - apparently there were some
> technical reasons. But IF the reasons are not technical, but
> philosophical, I'd like to have either mount option (allow_degraded) or
> even kernel-level configuration knob for this to happen RAID-style.


They are technical, which then runs into the philosophical. Giving
users a hurt me button is not ethical programming.


>
> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
> safely add "degraded" to the mount options? My primary concern is the
> machine UPTIME. I care less about the data, as they are backed up to
> some remote location and loosing day or week of changes is acceptable,
> brain-split as well, while every hour of downtime costs me a real money.

Btrfs simply is not ready for this use case. If you need to depend on
degraded raid1 booting, you need to use mdadm or LVM or hardware raid.
Complaining about the lack of maturity in this area? Get in line. Or
propose a design and scope of work that needs to be completed to
enable it.



> Meanwhile I can't fix broken server using 'remote hands' - mounting degraded
> volume means using physical keyboard or KVM which might be not available
> at a site. Current btrfs behavious requires physical presence AND downtime
> (if a machine rebooted) for fixing things, that could be fixed remotely
> an on-line.

Right. It's not ready for this use case. Complaining about this fact
isn't going to make it ready for this use case. What will make it
ready for the use case is a design, a lot of work, and testing.



> Anyway, users shouldn't look through syslog, device status should be
> reported by some monitoring tool.

Yes. And it doesn't exist yet.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 17:56               ` Tomasz Pala
  2017-12-19 19:47                 ` Chris Murphy
@ 2017-12-19 20:11                 ` Austin S. Hemmelgarn
  2017-12-19 21:58                   ` Tomasz Pala
  2017-12-19 23:53                   ` Chris Murphy
  1 sibling, 2 replies; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-19 20:11 UTC (permalink / raw)
  To: Tomasz Pala, Linux fs Btrfs

On 2017-12-19 12:56, Tomasz Pala wrote:
> On Tue, Dec 19, 2017 at 11:35:02 -0500, Austin S. Hemmelgarn wrote:
> 
>>> 2. printed on screen when creating/converting "RAID1" profile (by btrfs tools),
>> I don't agree on this one.  It is in no way unreasonable to expect that
>> someone has read the documentation _before_ trying to use something.
> 
> Provided there are:
> - a decent documentation AND
> - appropriate[*] level of "common knowledge" AND
> - stable behaviour and mature code (kernel, tools etc.)
> 
> BTRFS lacks all of these - there are major functional changes in current
> kernels and it reaches far beyond LTS. All the knowledge YOU have here,
> on this maillist, should be 'engraved' into btrfs-progs, as there are
> people still using kernels with serious malfunctions. btrfs-progs could
> easily check kernel version and print appropriate warning - consider
> this a "software quirks".
Except the systems running on those ancient kernel versions are not 
necessarily using a recent version of btrfs-progs.

It might be possible to write up a script to check the kernel version 
and report known issues with it, but I don't think having it tightly 
integrated will be much help, at least not for quite some time.
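
A very rough sketch of what such a check could look like, purely as an 
illustration (the 4.14 baseline below is an arbitrary example, not an 
official cut-off):

  #!/bin/sh
  # warn when the running kernel is older than a chosen baseline
  min_major=4 min_minor=14
  kver=$(uname -r)
  major=${kver%%.*}; rest=${kver#*.}; minor=${rest%%.*}
  if [ "$major" -lt "$min_major" ] || \
     { [ "$major" -eq "$min_major" ] && [ "$minor" -lt "$min_minor" ]; }; then
      echo "WARNING: kernel $kver predates $min_major.$min_minor;" \
           "some known btrfs multi-device fixes may be missing" >&2
  fi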
> 
> [*] by 'appropriate' I mean knowledge so common, as the real word usage
> itself.
> 
> Moreover, the fact that I've read the documentation and did a
> comprehensive[**] reseach today, doesn't mean I should do this again
> after kernel change for example.
> 
> [**] apparently what I thought was comprehensive, wasn't at all. Most of
> the btrfs quirks I've found HERE. As a regular user, not fs developer, I
> shouldn't be even looking at this list.
That last bit is debatable.  BTRFS doesn't have separate developer and 
user lists, so this list serves both purposes (though IRC also serves 
some of the function of a user list).  I'll agree that searching the 
archives shouldn't be needed to get a baseline of knowledge.
> 
> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
> this distro to research every component used?
SuSE also provides very good support by themselves.
> 
>>> [*] yes, I know the recent kernels handle this, but the last LTS (4.14)
>>> is just too young.
>> 4.14 should have gotten that patch last I checked.
> 
> I meant too young to be widely adopted yet. This requires some
> countermeasures in the toolkit that is easier to upgrade, like userspace.
So in other words, spend the time to write up code for btrfs-progs that 
will then be run by a significant minority of users because people using 
old kernels usually use old userspace, and people using new kernels 
won't have to care, instead of working on other bugs that are still 
affecting people?
> 
>> Regarding handling of degraded mounts, BTRFS _is_ working just fine, we
>> just chose a different default behavior from MD and LVM (we make certain
>> the user knows about the issue without having to look through syslog).
> 
> I'm not arguing about the behaviour - apparently there were some
> technical reasons. But IF the reasons are not technical, but
> philosophical, I'd like to have either mount option (allow_degraded) or
> even kernel-level configuration knob for this to happen RAID-style.
> 
> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
> safely add "degraded" to the mount options? My primary concern is the
> machine UPTIME. I care less about the data, as they are backed up to
> some remote location and loosing day or week of changes is acceptable,
> brain-split as well, while every hour of downtime costs me a real money.
In which case you shouldn't be relying on _ANY_ kind of RAID by itself, 
let alone BTRFS.  If you care that much about uptime, you should be 
investing in a HA setup and going from there.  If downtime costs you 
money, you need to be accounting for kernel updates and similar things, 
and therefore should have things set up such that you can reboot a 
system with no issues.
> 
> 
> Meanwhile I can't fix broken server using 'remote hands' - mounting degraded
> volume means using physical keyboard or KVM which might be not available
> at a site. Current btrfs behavious requires physical presence AND downtime
> (if a machine rebooted) for fixing things, that could be fixed remotely
> an on-line.
Assuming you have a sensibly designed system and are able to do remote 
management, physical presence should only be required for handling an 
issue with the root filesystem, and downtime should only be needed long 
enough to get the other filesystems into a sensible enough state that 
you can repair them the rest of the way online.  There's not really 
anything you can do about the root filesystem, but sensible organization 
of application data can mitigate the issues for other filesystems.
> 
> Anyway, users shouldn't look through syslog, device status should be
> reported by some monitoring tool.
This is a common complaint, and based on developer response, I think the 
consensus is that it's out of scope for the time being.  There have been 
some people starting work on such things, but nobody really got anywhere 
because most of the users who care enough about monitoring to be 
interested are already using some external monitoring tool that it's 
easy to hook into.

TBH, you essentially need external monitoring in most RAID situations 
anyway unless you've got some pre-built purpose specific system that 
already includes it (see FreeNAS for an example).
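
For what it's worth, such a hook can be as small as a cron job wrapped 
around 'btrfs device stats'; the mountpoint and recipient below are 
made-up examples, not a recommendation:

  #!/bin/sh
  # mail any non-zero btrfs device error counters (illustrative only)
  errs=$(btrfs device stats /mnt/data | awk '$2 != 0')
  [ -n "$errs" ] && echo "$errs" | mail -s "btrfs errors on $(hostname)" root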
> 
> Deviation so big (respectively to common RAID1 scenarios) deserves being documented.
> Or renamed...
Really? Some examples of where MD and LVM provide direct monitoring 
without needing third-party software, please.  LVM technically has the 
ability to handle it through dmeventd, but it's decidedly non-trivial to 
monitor state with that directly, and as a result almost everyone uses 
third-party software there.  MD I don't have as much background with (I 
prefer the flexibility LVM offers), but anything I've seen regarding it 
requires manual setup of some external software as well.
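
Even on the MD side, the usual answer is a separately configured monitor 
daemon, something along these lines (the address and paths are examples):

  # /etc/mdadm.conf (location varies by distro)
  MAILADDR root@example.com

  # started from an init script or service unit
  mdadm --monitor --scan --daemonise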
> 
>> reliability, and all the discussion of reliability assumes that either:
>> 1. Disks fail catastrophically.
>> or:
>> 2. Disks return read or write errors when there is a problem.
>>
>> Following just those constraints, RAID is not designed to handle devices
>> that randomly drop off the bus and reappear
> 
> If it drops, there would be I/O errors eventually. Without the errors - agreed.
Classical hardware RAID will kick the device when it drops and never 
re-add it automatically, which is functionally what BTRFS does too.  The 
only difference is how they then treat the 'failed' disk: hardware RAID 
will stop using it, while BTRFS will keep trying to use it.
> 
>> implementations.  As people are quick to point out BTRFS _IS NOT_ RAID,
>> the devs just made a poor choice in the original naming of the 2-way
>> replication implementation, and it stuck.
> 
> Well, the question is: either it is not raid YET, or maybe it's time to consider renaming?
Again, the naming is too ingrained.  At a minimum, you will have to keep 
the old naming, and at that point you're just wasting time and making 
things _more_ confusing because some documentation will use the old 
naming and some will use the new (keep in mind that third-party 
documentation rarely gets updated).
> 
>>> 3. if the sysadmin doesn't request any kind of device autobinding, a
>>>      device that has already failed doesn't matter anymore - regardless of
>>>      its current state or reappearances.
>> You have to explicitly disable automatic binding of drivers to
>> hot-plugged devices though, so that's rather irrelevant.  Yes, you can
> 
> Ha! I got this disabled on every bus (although for different reasons)
> after boot completes. Lucky me:)
Security I'm guessing (my laptop behaves like that for USB devices for 
that exact reason)?  It's a viable option on systems that are tightly 
controlled.  Once you look at consumer devices though, it's just 
impractical.  People expect hardware to just work when they plug it in 
these days.
> 
>>> 1. "known" only to the people that already stepped into it, meaning too
>>>      late - it should be "COMMONLY known", i.e. documented,
>> And also known to people who have done proper research.
> 
> All the OpenSUSE userbase? ;)
I don't think you quite understand what the SuSE business model is. 
SuSE does the research, and then provides support for customers so they 
don't have to.  Red Hat has a similar model.  Most normal distros 
however, do not, and those people using them need to be doing proper 
research.
> 
>>> 2. "older kernels" are not so old, the newest mature LTS (4.9) is still
>>>      affected,
>> I really don't see this as a valid excuse.  It's pretty well documented
>> that you absolutely should be running the most recent kernel if you're
>> using BTRFS.
> 
> Good point.
> 
>>> 4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
>>>      as long as you accept "no more redundancy"...
>> This is a matter of opinion.
> 
> Sure! And the particular opinion depends on system being affected. I'd
> rather not have any brain-split scenario under my database servers, but
> also won't mind data loss on BGP router as long as it keeps running and
> is fully operational.
> 
>> I still contend that running half a two
>> device array for an extended period of time without reshaping it to be a
>> single device is a bad idea for cases other than BTRFS.  The fewer
>> layers of code you're going through, the safer you are.
> 
> I create single-device degraded MD RAID1 when I attach one disk for
> deployment (usually test machines), which are going to be converted into
> dual (production) in a future - attaching second disk to array is much
> easier and faster than messing with device nodes (or labels or
> anything). The same applies to LVM, it's better to have it even when not
> used at a moment. In case of btrfs there is no need for such
> preparations, as the devices are added without renaming.
Unless you're pulling some complex black magic, you're not running 
degraded, you're running both in single device mode (which is not the 
same as a degraded two device RAID1 array) and converting to two device 
RAID1 later, which is a perfectly normal use case I have absolutely no 
issues with.
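
For reference, the BTRFS side of that workflow is just two commands once 
the second disk shows up (device name and mountpoint are hypothetical):

  # grow a single-device filesystem to two devices, then convert profiles
  btrfs device add /dev/sdb /mnt
  btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt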
> 
> However, sometimes the systems end up without second disk attached.
> Either due to their low importance, sometimes power usage, others
> need to be quiet.
> 
> One might ask, why don't I attach second disk before initial system
> creation - the answer is simple: I usually use the same drive models in
> RAID1, but it happens that drives bought from the same production lot
> fail simultaneously, so this approach mitigates the problem and gives
> more time to react.
You appear to be misunderstanding me here.  I'm not saying I think 
running with a single disk is bad, I'm saying that I feel that running 
with a single disk and not telling the storage stack that the other one 
isn't coming back any time soon is bad.

IOW, if I lose a disk in a two device BTRFS volume set up for 
replication, I'll mount it degraded, and convert it from the raid1 
profile to the single profile and then remove the missing disk from the 
volume.  Similarly, for a 2-device LVM RAID1 LV, I would use lvconvert 
to turn it into a regular linear LV.  Going through the multi-device 
code in BTRFS or the DM-RAID code in LVM when you've only got one actual 
device is a waste of processing power, and adds another layer where 
things can go wrong.
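
Spelled out, that looks roughly like the following; device names and the 
mountpoint are examples, and the exact steps depend on the situation:

  # BTRFS: stop pretending to be raid1, then drop the dead device
  mount -o degraded /dev/sda /mnt
  btrfs balance start -dconvert=single -mconvert=single /mnt   # dup is another common metadata choice
  btrfs device remove missing /mnt

  # LVM counterpart: reduce the RAID1 LV back to a plain linear LV
  lvconvert -m0 vg/lv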
> 
>> Patches would be gratefully accepted.  It's really not hard to update
>> the documentation, it's just that nobody has had the time to do it.
> 
> Writing accurate documentation requires deep understanding of internals.
> Me - for example, I know some of the results: "don't do this", "if X happens, Y
> should be done", "Z doesn't work yet, but there were some patches", "V
> was fixed in some recent kernel, but no idea which commit it was
> exactly", "W was severely broken in kernel I.J.K" etc. Not the hard data
> that could be posted without creating the impression that it's all
> about creating a complaint list. Not to mention I'm absolutely not familiar
> with current patches, WIP and many, many other corner cases or usage
> scenarios. In fact, not only the internals, but the motivation and design
> principles must be well understood to write a piece of documentation.
Writing up something like that is near useless, it would only be valid 
for upstream kernels (And if you're using upstream kernels and following 
the advice of keeping up to date, what does it matter anyway?  The 
moment a new btrfs-progs gets released, you're already going to be on a 
kernel that fixes the issues it reports.), because distros do whatever 
the hell they want with version numbers (RHEL for example is notorious 
for using _ancient_ version numbers but having bunches of stuff 
back-ported, and most other big distros that aren't Arch, Gentoo, or 
Slackware derived do so too to a lesser degree), and it would require 
constant curation to keep up to date.  Only for long-term known issues 
does it make sense, but those absolutely should be documented in the 
regular documentation, and doing that really isn't that hard if you just 
go for current issues.
> 
> Otherwise some "fake news" propaganda is being created, just like
> https://suckless.org/sucks/systemd or other systemd-haters that haven't
> spent a day in their life for writing SysV init scripts or managing a
> bunch of mission critical machines with handcrafted supervisors.
I hate to tell you that:
1. This type of thing happens regardless.  Systemd has just garnered a 
lot of hatred because it redesigned everything from the ground up and 
was then functionally forced on most of the Linux community.
2. There are quite a few of us who dislike systemd who have had to 
handle actual systems administration before (and quite a few such 
individuals are primarily complaining about other aspects of systemd, 
like the journal crap or how it handles manually mounted filesystems for 
which mount units exist (namely, if it thinks the underlying device 
isn't ready, it will unmount them immediately, even if the user just 
manually mounted them), not the service files replacing init scripts).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 18:31             ` George Mitchell
@ 2017-12-19 20:28               ` Tomasz Pala
  0 siblings, 0 replies; 61+ messages in thread
From: Tomasz Pala @ 2017-12-19 20:28 UTC (permalink / raw)
  To: George Mitchell; +Cc: linux-btrfs

On Tue, Dec 19, 2017 at 10:31:40 -0800, George Mitchell wrote:

> I have significant experience as a user of raid1. I spent years using 
> software raid1 and then more years using hardware (3ware) raid1 and now 
> around 3 years using btrfs raid1. I have not found btrfs raid1 to be 
> less reliable than any of the previous implementations of raid.  I have 

You are aware that in order to prove something, one needs only one
example? Degraded r/o is such an example, QED.
It doesn't matter how long you rode on top of any RAID implementation
unless you saw it in action, i.e. had an actual drive malfunction. Did you
have a broken drive under btrfs raid?

> a failure, you don't just plug things back in and expect it to be fixed 
> without seriously investigating what has gone wrong and potential 
> unexpected consequences.  I have found that even with hardware raid you 
> can find ways to screw things up to the point that you lose your data.  

Everything can be screwed up beyond comprehension, but we're talking
about PRIMARY objectives. In the case of RAID1+ it seems obvious:

https://en.oxforddictionaries.com/definition/redundancy

- unplugging ANY SINGLE drive MUST NOT render the system unusable.
It is really as simple as that.

> I have had situations where I reconnected a drive on hardware raid1 only 
> to find that the array would not sync and from there on I ended up 
> having to directly attach one of the drives and recover the partition 

I had a situation where replugging a drive started a sync of older data
over the newer. So what? This doesn't change a thing - drive
reappearance or resync is the RECOVERY part. RECOVERY scenarios are an
entirely different thing from REDUNDANCY itself. The RECOVERY phase in
some implementations could be an entirely off-line process and it would
still be RAID. Remove the REDUNDANCY part and it's not RAID anymore.

If one names a thing an apple, one shouldn't be surprised if others
compare it to apples, not oranges.

> table with TestDisk in order to regain access to my data.  So NO FORM 
> of raid is a replacement for backups and NO FORM of raid is a 
> replacement for due diligence in recovery from failure mode.  Raid gives 

And who said it is?

> you a second chance when things go wrong, it does not make failures 
> transparent which is seemingly what we sometimes expect from raid.  And 

I wouldn't want to worry you, but properly managed RAIDs make I/J-of-K
trivial failures transparent, just like ECC protects N/M bits transparently.

Investigating the causes is the sysadmin's job, just like other
maintenance, including restoring the protection level.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 19:35             ` Chris Murphy
@ 2017-12-19 20:41               ` Tomasz Pala
  2017-12-19 20:47                 ` Austin S. Hemmelgarn
  2017-12-19 23:59                 ` Chris Murphy
  0 siblings, 2 replies; 61+ messages in thread
From: Tomasz Pala @ 2017-12-19 20:41 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:

> with a read only file system. Another reason is the kernel code and
> udev rule for device "readiness" means the volume is not "ready" until
> all member devices are present. And while the volume is not "ready"
> systemd will not even attempt to mount. Solving this requires kernel
> and udev work, or possibly a helper, to wait an appropriate amount of

Sth like this? I got such problem a few months ago, my solution was
accepted upstream:
https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

Rationale is in referred ticket, udev would not support any more btrfs
logic, so unless btrfs handles this itself on kernel level (daemon?),
that is all that can be done.

> time. I also think it's a bad idea to implement automatic degraded
> mounts unless there's an API for user space to receive either a push
[...]
> There is no amount of documentation that makes up for these
> deficiencies enough to enable automatic degraded mounts by default. I
> would consider it a high order betrayal of user trust to do it.

It doesn't have to be default, might be kernel compile-time knob, module
parameter or anything else to make the *R*aid work.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 20:41               ` Tomasz Pala
@ 2017-12-19 20:47                 ` Austin S. Hemmelgarn
  2017-12-19 22:23                   ` Tomasz Pala
  2017-12-21 11:44                   ` Andrei Borzenkov
  2017-12-19 23:59                 ` Chris Murphy
  1 sibling, 2 replies; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-19 20:47 UTC (permalink / raw)
  To: Tomasz Pala, Linux fs Btrfs

On 2017-12-19 15:41, Tomasz Pala wrote:
> On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:
> 
>> with a read only file system. Another reason is the kernel code and
>> udev rule for device "readiness" means the volume is not "ready" until
>> all member devices are present. And while the volume is not "ready"
>> systemd will not even attempt to mount. Solving this requires kernel
>> and udev work, or possibly a helper, to wait an appropriate amount of
> 
> Sth like this? I got such problem a few months ago, my solution was
> accepted upstream:
> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
> 
> Rationale is in referred ticket, udev would not support any more btrfs
> logic, so unless btrfs handles this itself on kernel level (daemon?),
> that is all that can be done.
Or maybe systemd can quit trying to treat BTRFS like a volume manager 
(which it isn't) and just try to mount the requested filesystem with the 
requested options?  Then you would just be able to specify 'degraded' in 
your mount options, and you don't have to care that the kernel refuses 
to mount degraded filesystems without being explicitly asked to.
> 
>> time. I also think it's a bad idea to implement automatic degraded
>> mounts unless there's an API for user space to receive either a push
> [...]
>> There is no amount of documentation that makes up for these
>> deficiencies enough to enable automatic degraded mounts by default. I
>> would consider it a high order betrayal of user trust to do it.
> 
> It doesn't have to be default, might be kernel compile-time knob, module
> parameter or anything else to make the *R*aid work.
There's a mount option for it per-filesystem.  Just add that to all your 
mount calls, and you get exactly the same effect.
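
For example (device name and mountpoint are illustrative):

  mount -o degraded /dev/sda2 /srv
  # or persistently, in the options field of /etc/fstab:
  #   UUID=<fs uuid>  /srv  btrfs  defaults,degraded  0  0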

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 19:47                 ` Chris Murphy
@ 2017-12-19 21:17                   ` Tomasz Pala
  2017-12-20  0:08                     ` Chris Murphy
  2017-12-20 16:53                   ` Andrei Borzenkov
  1 sibling, 1 reply; 61+ messages in thread
From: Tomasz Pala @ 2017-12-19 21:17 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 12:47:33 -0700, Chris Murphy wrote:

> The more verbose man pages are, the more likely it is that information
> gets stale. We already see this with the Btrfs Wiki. So are you

True. The same applies to git documentation (3rd paragraph):

https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/

Fortunately this CAN be done properly; one of the greatest sets of
documentation I've seen is systemd's.

What I don't like about documentation is lack of objectivity:

$ zgrep -i bugs /usr/share/man/man8/*btrfs*.8.gz | grep -v bugs.debian.org

Nothing. The old-school manuals all had BUGS section even if it was
empty. Seriously, nothing appropriate to be put in there? Documentation
must be symmetric - if it mentions feature X, it must mention at least the
most common caveats.

> volunteering to do the btrfs-progs work to easily check kernel
> versions and print appropriate warnings? Or is this a case of
> complaining about what other people aren't doing with their time?

This is definitely the second case. You see, I've had my issues with btrfs; I
already know where to use it and where not. I've learned the HARD way and still
haven't fully recovered (some dangling r/o, some ENOSPC due to
fragmentation etc). What I /MIGHT/ do to help the community is share my
opinions and suggestions. And it's all up to you what you do with
this. Either you blame me for complaining or you ignore me - you
should realize that _I_do_not_care_, because I already know the things that
I write. At least some other guy, some other day, will read this thread and my
opinions might save HIS day. After all, using btrfs should be preceded
by research.

No offence, just trying to be honest with you. Because the other thing
that I've learned the hard way in my life is to listen to the regular users of my
products and appreciate any feedback, even if it doesn't suit me.

>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>> safely add "degraded" to the mount options? My primary concern is the
[...]
> Btrfs simply is not ready for this use case. If you need to depend on
> degraded raid1 booting, you need to use mdadm or LVM or hardware raid.
> Complaining about the lack of maturity in this area? Get in line. Or
> propose a design and scope of work that needs to be completed to
> enable it.

I thought the work was already done if current kernel handles degraded RAID1
without switching to r/o, doesn't it? Or something else is missing?

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 20:11                 ` Austin S. Hemmelgarn
@ 2017-12-19 21:58                   ` Tomasz Pala
  2017-12-20 13:10                     ` Austin S. Hemmelgarn
  2017-12-19 23:53                   ` Chris Murphy
  1 sibling, 1 reply; 61+ messages in thread
From: Tomasz Pala @ 2017-12-19 21:58 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 15:11:22 -0500, Austin S. Hemmelgarn wrote:

> Except the systems running on those ancient kernel versions are not 
> necessarily using a recent version of btrfs-progs.

Still, it is much easier to update userspace tools than the kernel (consider
binary drivers for various hardware).

> So in other words, spend the time to write up code for btrfs-progs that 
> will then be run by a significant minority of users because people using 
> old kernels usually use old userspace, and people using new kernels 
> won't have to care, instead of working on other bugs that are still 
> affecting people?

I am aware of the dilemma, and the answer is: that depends.
It depends on the expected usefulness of such infrastructure regarding _future_
changes and possible bugs.
In the case of stable/mature/frozen projects this doesn't make much sense,
as the possible incompatibilities would be very rare.
Whether this makes sense for btrfs? I don't know - it's not mature, but if the quirk rate
is too high to track appropriate kernel versions, it might be
really better to officially state "DO USE 4.14+ kernel, REALLY".

This might be accomplished very easily - when releasing a new btrfs-progs,
check the currently available LTS kernel and use it as a base reference for
the warning.

After all, "giving users a hurt me button is not ethical programming."

>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>> safely add "degraded" to the mount options? My primary concern is the
>> machine UPTIME. I care less about the data, as they are backed up to
>> some remote location and loosing day or week of changes is acceptable,
>> brain-split as well, while every hour of downtime costs me a real money.
> In which case you shouldn't be relying on _ANY_ kind of RAID by itself, 
> let alone BTRFS.  If you care that much about uptime, you should be 
> investing in a HA setup and going from there.  If downtime costs you 

I got this handled and don't use btrfs there - the question remains:
in a situation as described above, is it safe now to add "degraded"?

To rephrase the question: can degraded RAID1 run permanently as rw
without some *internal* damage?

>> Anyway, users shouldn't look through syslog, device status should be
>> reported by some monitoring tool.
> This is a common complaint, and based on developer response, I think the 
> consensus is that it's out of scope for the time being.  There have been 
> some people starting work on such things, but nobody really got anywhere 
> because most of the users who care enough about monitoring to be 
> interested are already using some external monitoring tool that it's 
> easy to hook into.

I agree, the btrfs code should only emit events, so
SomeUserspaceGUIWhatever could display blinking exclamation mark.

>> Well, the question is: either it is not raid YET, or maybe it's time to consider renaming?
> Again, the naming is too ingrained.  At a minimum, you will have to keep 
> the old naming, and at that point you're just wasting time and making 
> things _more_ confusing because some documentation will use the old 

True, but realizing that documentation is already flawed it gets easier.
But I still don't know if it is going to be RAID some day? Or won't be
"by design"?

>> Ha! I got this disabled on every bus (although for different reasons)
>> after boot completes. Lucky me:)
> Security I'm guessing (my laptop behaves like that for USB devices for 
> that exact reason)?  It's a viable option on systems that are tightly 

Yes, machines are locked and only authorized devices are allowed during
boot.

> IOW, if I lose a disk in a two device BTRFS volume set up for 
> replication, I'll mount it degraded, and convert it from the raid1 
> profile to the single profile and then remove the missing disk from the 
> volume.

I was about to do the same with my r/o-stuck btrfs system, but unfortunately
I unplugged the wrong cable...

>> Writing accurate documentation requires deep understanding of internals.
[...]
> Writing up something like that is near useless, it would only be valid 
> for upstream kernels (And if you're using upstream kernels and following 
> the advice of keeping up to date, what does it matter anyway?  The 
[...]
> kernel that fixes the issues it reports.), because distros do whatever 
> the hell they want with version numbers (RHEL for example is notorious 
> for using _ancient_ version numbers but having bunches of stuff 
> back-ported, and most other big distros that aren't Arch, Gentoo, or 
> Slackware derived do so too to a lesser degree), and it would require 
> constant curation to keep up to date.  Only for long-term known issues 

OK, you've convinced me that a kernel-vs-feature list is too much overhead.

So maybe another approach: just like systemd sets the system time (when no
time source is available) to its own release date, maybe btrfs-progs
should assume the version of the kernel on which it was built?

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 20:47                 ` Austin S. Hemmelgarn
@ 2017-12-19 22:23                   ` Tomasz Pala
  2017-12-20 13:33                     ` Austin S. Hemmelgarn
  2017-12-21 11:44                   ` Andrei Borzenkov
  1 sibling, 1 reply; 61+ messages in thread
From: Tomasz Pala @ 2017-12-19 22:23 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote:

>> Sth like this? I got such problem a few months ago, my solution was
>> accepted upstream:
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>> 
>> Rationale is in referred ticket, udev would not support any more btrfs
>> logic, so unless btrfs handles this itself on kernel level (daemon?),
>> that is all that can be done.
> Or maybe systemd can quit trying to treat BTRFS like a volume manager 
> (which it isn't) and just try to mount the requested filesystem with the 
> requested options?

I tried that before ("just mount my filesystem, stupid"); it is a no-go.
The source of the problem is not systemd treating BTRFS differently, but
the btrfs kernel logic that it relies on. Just to show it:

1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool),
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works

THIS readiness is exposed via udev to systemd. And it must be used for
multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc).

In short: until *something* scans all the btrfs components, so that the
kernel marks the volume ready, systemd won't even try to mount it.

> Then you would just be able to specify 'degraded' in 
> your mount options, and you don't have to care that the kernel refuses 
> to mount degraded filesystems without being explicitly asked to.

Exactly. But since LP refused to try mounting despite the kernel's "not-ready"
state, it is the kernel that must emit 'ready'. So the
question is: how can I make the kernel mark a degraded array as "ready"?

The obvious answer is: do it via kernel command line, just like mdadm
does:
rootflags=device=/dev/sda,device=/dev/sdb
rootflags=device=/dev/sda,device=missing
rootflags=device=/dev/sda,device=/dev/sdb,degraded

If only btrfs.ko recognized this, the kernel would be able to assemble a
multi-volume btrfs itself. Not only would this allow automated degraded
mounts, it would also allow using initrd-less kernels on such volumes.
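
For what it's worth, a regular userspace mount can already take that list 
via the device= mount option, which scans the named devices before the 
mount proceeds; whether the same works through rootflags= at early boot 
is exactly the open question (device names are examples):

  mount -o device=/dev/sdb,device=/dev/sdc /dev/sdb /mnt
  # explicitly opting in to a degraded mount:
  mount -o degraded,device=/dev/sdb /dev/sdb /mnt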

>> It doesn't have to be default, might be kernel compile-time knob, module
>> parameter or anything else to make the *R*aid work.
> There's a mount option for it per-filesystem.  Just add that to all your 
> mount calls, and you get exactly the same effect.

If only they were passed...

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 20:11                 ` Austin S. Hemmelgarn
  2017-12-19 21:58                   ` Tomasz Pala
@ 2017-12-19 23:53                   ` Chris Murphy
  2017-12-20 13:12                     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 61+ messages in thread
From: Chris Murphy @ 2017-12-19 23:53 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Tomasz Pala, Linux fs Btrfs

On Tue, Dec 19, 2017 at 1:11 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-12-19 12:56, Tomasz Pala wrote:

>> BTRFS lacks all of these - there are major functional changes in current
>> kernels and it reaches far beyond LTS. All the knowledge YOU have here,
>> on this maillist, should be 'engraved' into btrfs-progs, as there are
>> people still using kernels with serious malfunctions. btrfs-progs could
>> easily check kernel version and print appropriate warning - consider
>> this a "software quirks".
>
> Except the systems running on those ancient kernel versions are not
> necessarily using a recent version of btrfs-progs.

Indeed, it is much more common to find old userspace tools, for
whatever reason, compared to the kernel version in use.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 20:41               ` Tomasz Pala
  2017-12-19 20:47                 ` Austin S. Hemmelgarn
@ 2017-12-19 23:59                 ` Chris Murphy
  2017-12-20  8:34                   ` Tomasz Pala
  1 sibling, 1 reply; 61+ messages in thread
From: Chris Murphy @ 2017-12-19 23:59 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Linux fs Btrfs

On Tue, Dec 19, 2017 at 1:41 PM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:
>
>> with a read only file system. Another reason is the kernel code and
>> udev rule for device "readiness" means the volume is not "ready" until
>> all member devices are present. And while the volume is not "ready"
>> systemd will not even attempt to mount. Solving this requires kernel
>> and udev work, or possibly a helper, to wait an appropriate amount of
>
> Sth like this? I got such problem a few months ago, my solution was
> accepted upstream:
> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

I can't parse this commit. In particular I can't tell how long it
waits, or what triggers the end to waiting.



>
> Rationale is in referred ticket, udev would not support any more btrfs
> logic, so unless btrfs handles this itself on kernel level (daemon?),
> that is all that can be done.
>
>> time. I also think it's a bad idea to implement automatic degraded
>> mounts unless there's an API for user space to receive either a push
> [...]
>> There is no amount of documentation that makes up for these
>> deficiencies enough to enable automatic degraded mounts by default. I
>> would consider it a high order betrayal of user trust to do it.
>
> It doesn't have to be default, might be kernel compile-time knob, module
> parameter or anything else to make the *R*aid work.

OK, but that's putting the cart before the horse. The horse is proper recovery
behavior once a delayed/missing drive reappears, i.e. resync.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 21:17                   ` Tomasz Pala
@ 2017-12-20  0:08                     ` Chris Murphy
  2017-12-23  4:08                       ` Tomasz Pala
  0 siblings, 1 reply; 61+ messages in thread
From: Chris Murphy @ 2017-12-20  0:08 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Linux fs Btrfs

On Tue, Dec 19, 2017 at 2:17 PM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Tue, Dec 19, 2017 at 12:47:33 -0700, Chris Murphy wrote:
>
>> The more verbose man pages are, the more likely it is that information
>> gets stale. We already see this with the Btrfs Wiki. So are you
>
> True. The same applies to git documentation (3rd paragraph):
>
> https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/
>
> Fortunately this CAN be done properly; one of the greatest sets of
> documentation I've seen is systemd's.
>
> What I don't like about documentation is lack of objectivity:
>
> $ zgrep -i bugs /usr/share/man/man8/*btrfs*.8.gz | grep -v bugs.debian.org
>
> Nothing. The old-school manuals all had BUGS section even if it was
> empty. Seriously, nothing appropriate to be put in there? Documentation
> must be symmetric - if it mentions feature X, it must mention at least the
> most common caveats.

It's reasonable to have a known bugs section in the man page, so long
as people are willing to do the work adding to it and deleting it when
bugs are fixed.





>
>> volunteering to do the btrfs-progs work to easily check kernel
>> versions and print appropriate warnings? Or is this a case of
>> complaining about what other people aren't doing with their time?
>
> This is definitely the second case. You see, I've had my issues with btrfs; I
> already know where to use it and where not. I've learned the HARD way and still
> haven't fully recovered (some dangling r/o, some ENOSPC due to
> fragmentation etc). What I /MIGHT/ do to help the community is share my
> opinions and suggestions. And it's all up to you what you do with
> this. Either you blame me for complaining or you ignore me - you
> should realize that _I_do_not_care_, because I already know the things that
> I write. At least some other guy, some other day, will read this thread and my
> opinions might save HIS day. After all, using btrfs should be preceded
> by research.
>
> No offence, just trying to be honest with you. Because the other thing
> that I've learned the hard way in my life is to listen to the regular users of my
> products and appreciate any feedback, even if it doesn't suit me.

Btrfs development has definitely been a lot more fractured and wild
west than perhaps other Linux file systems, with the people who do the
work getting to dictate the direction, and that's certainly true
compared to ZFS, which has a small team with a very clear ideology and
direction established from the outset.


>
>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>> safely add "degraded" to the mount options? My primary concern is the
> [...]
>> Btrfs simply is not ready for this use case. If you need to depend on
>> degraded raid1 booting, you need to use mdadm or LVM or hardware raid.
>> Complaining about the lack of maturity in this area? Get in line. Or
>> propose a design and scope of work that needs to be completed to
>> enable it.
>
> I thought the work was already done if current kernel handles degraded RAID1
> without switching to r/o, doesn't it? Or something else is missing?

Well, it only mounts degraded rw once; the next degraded mount is ro -
there are patches dealing with this better, but I don't know their
state. And there's no resync code that I'm aware of; it's absolutely not
good enough to just kick off a full scrub - that has huge performance
implications and I'd consider it a regression compared to the default
functionality in LVM and mdadm RAID with the write-intent bitmap.
Without some equivalent shortcut, automatic degraded mounts mean a
decently likely scenario where a slightly late assembly at boot time
ends up requiring a full scrub. That's not an improvement over manual
degraded, so people aren't hit with even more silent consequences of
their unfortunate situation.
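
For context, the only resync mechanism available today is that full 
scrub, e.g. (mountpoint is an example):

  # reads every copy and repairs bad ones from a good mirror where possible
  btrfs scrub start -Bd /mnt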



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 23:59                 ` Chris Murphy
@ 2017-12-20  8:34                   ` Tomasz Pala
  2017-12-20  8:51                     ` Tomasz Pala
  2017-12-20 19:49                     ` Chris Murphy
  0 siblings, 2 replies; 61+ messages in thread
From: Tomasz Pala @ 2017-12-20  8:34 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote:

>> Sth like this? I got such problem a few months ago, my solution was
>> accepted upstream:
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
> 
> I can't parse this commit. In particular I can't tell how long it
> waits, or what triggers the end to waiting.

The point is - it doesn't wait at all. Instead, every 'ready' btrfs
device triggers an event on all the pending devices. Consider a 3-device
filesystem consisting of /dev/sd[abd], with /dev/sdc being a different,
standalone btrfs:

/dev/sda -> 'not ready'
/dev/sdb -> 'not ready'
/dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready'
/dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready'

This way all the parts of a volume are marked as ready, so systemd won't
refuse mounting using legacy device nodes like /dev/sda.
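
In effect (this is only an illustration of the behaviour, not the actual 
rule text from that commit), each device that becomes ready re-triggers 
the members still marked pending, roughly as if running:

  udevadm trigger --action=change --subsystem-match=block \
          --property-match=ID_BTRFS_READY=0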


This particular solution depends on the kernel reporting 'btrfs ready',
which would obviously not work for degraded arrays unless btrfs.ko
handled some 'missing' or 'mount_degraded' kernel cmdline option
_before_ actually _trying_ to mount it with -o degraded.

And there is a logical problem with this - _which_ array components
should be ignored? Consider:

volume1: /dev/sda /dev/sdb
volume2: /dev/sdc /dev/sdd-broken

If /dev/sdd is missing from the system, it would never be scanned, so
/dev/sdc would stay pending. It cannot be assembled at scan time
alone, because the same would happen with /dev/sda, and there would be
a desync with /dev/sdb, which IS available - just a few moments later.

This is the place for the timeout you've mentioned - there should be
*some* decent timeout allowing all the devices to show up (udev waits
for 90 seconds by default or x-systemd.device-timeout=N from fstab).

After such timeout, I'd like to tell the kernel: "no more devices, give
me all the remaining btrfs volumes in degraded mode if possible". By
"give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
fire it's rules. And if there would be anything for udev to distinguish
'ready' from 'ready-degraded' one could easily compose some notification
scripting on top of it, including sending e-mail to sysadmin.

Is there anything that would make the kernel do the above?

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20  8:34                   ` Tomasz Pala
@ 2017-12-20  8:51                     ` Tomasz Pala
  2017-12-20 19:49                     ` Chris Murphy
  1 sibling, 0 replies; 61+ messages in thread
From: Tomasz Pala @ 2017-12-20  8:51 UTC (permalink / raw)
  To: Linux fs Btrfs

Errata:

On Wed, Dec 20, 2017 at 09:34:48 +0100, Tomasz Pala wrote:

> /dev/sda -> 'not ready'
> /dev/sdb -> 'not ready'
> /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready'
> /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready'

The last line should start with /dev/sdd.

> After such timeout, I'd like to tell the kernel: "no more devices, give
> me all the remaining btrfs volumes in degraded mode if possible". By

Actually "if possible" means both:
- if technically possible (i.e. required data is available, like half of RAID1),
- AND if allowed for specific volume as there might be different policies.

For example - one might allow rootfs to be started in degraded-rw mode in
order for the system to boot up, /home in degraded read-only for the
users to have access to their files and do not mount /srv degraded at all.
The failed mount can be non-critical with 'nofail' fstab flag.
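
In fstab terms such a policy could look something like the sketch below 
(UUIDs are placeholders; as far as I know there is currently no way to 
express "read-only only when degraded" per volume):

  # illustrative /etc/fstab policy
  UUID=<root-uuid>  /     btrfs  degraded         0  0  # may come up with a member missing
  UUID=<srv-uuid>   /srv  btrfs  defaults,nofail  0  0  # never degraded; failure not fatal to boot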

> "give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
> fire it's rules. And if there would be anything for udev to distinguish
> 'ready' from 'ready-degraded' one could easily compose some notification
> scripting on top of it, including sending e-mail to sysadmin.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 21:58                   ` Tomasz Pala
@ 2017-12-20 13:10                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-20 13:10 UTC (permalink / raw)
  To: Tomasz Pala, Linux fs Btrfs

On 2017-12-19 16:58, Tomasz Pala wrote:
> On Tue, Dec 19, 2017 at 15:11:22 -0500, Austin S. Hemmelgarn wrote:
> 
>> Except the systems running on those ancient kernel versions are not
>> necessarily using a recent version of btrfs-progs.
> 
> Still, it is much easier to update userspace tools than the kernel (consider
> binary drivers for various hardware).
OK, let's look at this objectively:

Current version of btrfs-progs is 4.14, released last month, and current 
kernel is 4.14.8 (or a 4.15 RC release).

In various distributions:
* Arch Linux:
	btrfs-progs version is 4.14-2
	kernel version is 4.14.6-1
* Alpine Linux:
	btrfs-progs version is 4.10.2-r0
	kernel version is 4.9.32-0
* Debian Sid:
	btrfs-progs version is 4.13.3-1
	kernel version is 4.14.0-1
* Debian 9.3:
	btrfs-progs version is 4.7.3-1
	kernel version is 4.9.0-4
* Fedora 27:
	btrfs-progs version is 4.11.3-3
	kernel version is 4.14.6-300
* Gentoo ~amd64 (equivalent of Debian Sid or Fedora Rawhide):
	btrfs-progs version is 4.14
	kernel version is 4.14.7
* Gentoo stable:
	btrfs-progs version is 4.10.2
	kernel version is 4.14.7
* Manjaro (a somewhat popular Arch Linux derivative):
	btrfs-progs version is 4.14-1
	kernel version is 4.11.12-1-rt16
* OpenSUSE Leap 42.3:
	btrfs-progs version is 4.5.3+20160729
	kernel version is 4.4.103-36
* OpenSUSE Tumbleweed:
	btrfs-progs version is 4.13.3
	kernel version is 4.14.6-1
* Ubuntu 17.10:
	btrfs-progs version is 4.12-1
	kernel version is 4.13.0-19
* Ubuntu 16.04.3:
	btrfs-progs version is 4.4-1ubuntu1
	kernel version is 4.4.0-104

Based on this, it looks like Alpine, Manjaro, and OpenSUSE Leap are the 
only distros for which it was easier to upgrade the userspace than the 
kernel, and Alpine and Manjaro are the only two where it even makes sense 
for that to be the case, given that they use GRSecurity and RT patches 
respectively.

The fact is that most people use whatever version their distro packages, 
and don't install software themselves through other means, so for most 
people, it is easier to upgrade the kernel.

Even as a 'power user' using Gentoo (where it's really easy to install 
stuff from external sources because you have all the development tools 
pre-installed), I almost never pull anything that's beyond the main 
repositories or the small handful of user repositories that I've got 
enabled, and that's only for stuff I can't get in a repository.
> 
>> So in other words, spend the time to write up code for btrfs-progs that
>> will then be run by a significant minority of users because people using
>> old kernels usually use old userspace, and people using new kernels
>> won't have to care, instead of working on other bugs that are still
>> affecting people?
> 
> I am aware of the dilemma, and the answer is: that depends.
> It depends on the expected usefulness of such infrastructure regarding _future_
> changes and possible bugs.
> In the case of stable/mature/frozen projects this doesn't make much sense,
> as the possible incompatibilities would be very rare.
> Whether this makes sense for btrfs? I don't know - it's not mature, but if the quirk rate
> is too high to track appropriate kernel versions, it might be
> really better to officially state "DO USE 4.14+ kernel, REALLY".
>
> This might be accomplished very easily - when releasing a new btrfs-progs,
> check the currently available LTS kernel and use it as a base reference for
> the warning.
>
> After all, "giving users a hurt me button is not ethical programming."
Scaring users needlessly is also not ethical programming.  As an example:

4.9 is the current LTS release (4.9.71 as of right now).  Dozens of bugs 
have been fixed since then.  If we were to start doing as you propose, 
then we'd be spitting out potentially bogus warnings for everything up 
through current kernels.
> 
>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>> safely add "degraded" to the mount options? My primary concern is the
>>> machine UPTIME. I care less about the data, as they are backed up to
>>> some remote location and losing a day or week of changes is acceptable,
>>> brain-split as well, while every hour of downtime costs me real money.
>> In which case you shouldn't be relying on _ANY_ kind of RAID by itself,
>> let alone BTRFS.  If you care that much about uptime, you should be
>> investing in a HA setup and going from there.  If downtime costs you
> 
> I got this handled and don't use btrfs there - the question remains:
> in a situation as described above, is it safe now to add "degraded"?
> 
> To rephrase the question: can degraded RAID1 run permanently as rw
> without some *internal* damage?
Not on kernels that don't have the patch that's been mentioned a couple 
of times in this thread, with the caveat that 'internal damage' means 
that it won't mount on such kernels after the first time (but will mount 
on newer kernels that have been patched).
> 
>>> Anyway, users shouldn't look through syslog, device status should be
>>> reported by some monitoring tool.
>> This is a common complaint, and based on developer response, I think the
>> consensus is that it's out of scope for the time being.  There have been
>> some people starting work on such things, but nobody really got anywhere
>> because most of the users who care enough about monitoring to be
>> interested are already using some external monitoring tool that it's
>> easy to hook into.
> 
> I agree, the btrfs code should only emit events, so
> SomeUserspaceGUIWhatever could display a blinking exclamation mark.
No, it shouldn't _only_ emit events.  It damn well should be logging to 
the kernel log even if it's emitting events; LVM does so, MD does so, 
ZFS does so, so why the hell should BTRFS _NOT_ do so?
> 
>>> Well, the question is: either it is not raid YET, or maybe it's time to consider renaming?
>> Again, the naming is too ingrained.  At a minimum, you will have to keep
>> the old naming, and at that point you're just wasting time and making
>> things _more_ confusing because some documentation will use the old
> 
> True, but once you realize the documentation is already flawed, it gets easier.
> But I still don't know: is it going to be RAID some day? Or won't it be,
> "by design"?
> 
>>> Ha! I got this disabled on every bus (although for different reasons)
>>> after boot completes. Lucky me:)
>> Security I'm guessing (my laptop behaves like that for USB devices for
>> that exact reason)?  It's a viable option on systems that are tightly
> 
> Yes, machines are locked and only authorized devices are allowed during
> boot.
> 
>> IOW, if I lose a disk in a two device BTRFS volume set up for
>> replication, I'll mount it degraded, and convert it from the raid1
>> profile to the single profile and then remove the missing disk from the
>> volume.
> 
> I was about to do the same with my r/o-stuck btrfs system, but unfortunately
> unplugged the wrong cable...
> 
>>> Writing accurate documentation requires deep understanding of internals.
> [...]
>> Writing up something like that is near useless, it would only be valid
>> for upstream kernels (And if you're using upstream kernels and following
>> the advice of keeping up to date, what does it matter anyway?  The
> [...]
>> kernel that fixes the issues it reports.), because distros do whatever
>> the hell they want with version numbers (RHEL for example is notorious
>> for using _ancient_ version numbers but having bunches of stuff
>> back-ported, and most other big distros that aren't Arch, Gentoo, or
>> Slackware derived do so too to a lesser degree), and it would require
>> constant curation to keep up to date.  Only for long-term known issues
> 
> OK, you've convinced me that kernel-vs-feature list is overhead.
> 
> So maybe another approach: just like systemd sets the system time (when no
> time source is available) to its own release date, maybe btrfs-progs
> should take the version of the kernel on which it was built?
The systemd thing works because it knows the current time can't be older 
than when it was built (short of time-travel, but that's probably 
irrelevant right now).  Grabbing the kernel version of the build system 
and then using that as our own version absolutely does not because:
1. The kernel on the build system has (or should have) zero impact on 
how btrfs-progs works on the target system.  The only bit that matters is 
the UAPI headers that are installed, and if those mismatch then 
btrfs-progs won't run at all.
2. While the code in btrfs-progs is developed in concert with kernel 
code, it is not directly dependent on it for most of its operation.  As 
a couple of people are apt to point out, kernel version matters mostly 
for regular operation, btrfs-progs version matters mostly for recovery 
and repair.  However, _both_ do matter, so just displaying one is a bad 
idea.

As of right now, the versioning of btrfs-progs is largely linked to 
whatever the current stable kernel version is at the time of release. 
That provides a good enough indication of the vintage that most people 
have no issues just running:

	btrfs --version
	uname -r

to figure out what they have, though of course `uname -r` is essentially 
useless for outsiders on RHEL or OEL systems (and there is nothing the 
btrfs community can do about that).

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 23:53                   ` Chris Murphy
@ 2017-12-20 13:12                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-20 13:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Tomasz Pala, Linux fs Btrfs

On 2017-12-19 18:53, Chris Murphy wrote:
> On Tue, Dec 19, 2017 at 1:11 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2017-12-19 12:56, Tomasz Pala wrote:
> 
>>> BTRFS lacks all of these - there are major functional changes in current
>>> kernels and it reaches far beyond LTS. All the knowledge YOU have here,
>>> on this maillist, should be 'engraved' into btrfs-progs, as there are
>>> people still using kernels with serious malfunctions. btrfs-progs could
>>> easily check kernel version and print appropriate warning - consider
>>> this a "software quirks".
>>
>> Except the systems running on those ancient kernel versions are not
>> necessarily using a recent version of btrfs-progs.
> 
> Indeed it is much more common to find old user space tools, for
> whatever reason, compared to the kernel version.
Most distros have infrastructure in place to handle quick updates to the 
kernel, and tend to keep the kernel up to date to fix hardware issues 
that affect people who may not be using BTRFS.

In contrast, btrfs-progs updates generally aren't high priority, because 
they benefit a much smaller user base (unless you're SuSE).


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 22:23                   ` Tomasz Pala
@ 2017-12-20 13:33                     ` Austin S. Hemmelgarn
  2017-12-20 17:28                       ` Duncan
  0 siblings, 1 reply; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-20 13:33 UTC (permalink / raw)
  To: Tomasz Pala, Linux fs Btrfs

On 2017-12-19 17:23, Tomasz Pala wrote:
> On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote:
> 
>>> Sth like this? I got such problem a few months ago, my solution was
>>> accepted upstream:
>>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>>>
>>> Rationale is in referred ticket, udev would not support any more btrfs
>>> logic, so unless btrfs handles this itself on kernel level (daemon?),
>>> that is all that can be done.
>> Or maybe systemd can quit trying to treat BTRFS like a volume manager
>> (which it isn't) and just try to mount the requested filesystem with the
>> requested options?
> 
> Tried that before ("just mount my filesystem, stupid"), it is a no-go.
> The problem source is not within systemd treating BTRFS differently, but
> in btrfs kernel logic that it uses. Just to show it:
> 
> 1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
> 2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool),
> 3. try
> mount /dev/sda /test - fails
> mount /dev/sdb /test - works
> 4. reboot again and try in reversed order
> mount /dev/sdb /test - fails
> mount /dev/sda /test - works
> 
> THIS readiness is exposed via udev to systemd. And it must be used for
> multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc).
Except BTRFS _IS NOT MULTIPLE LAYERS_.  It's one layer at the filesystem 
layer, and handles the other 'layers' internally.
> 
> In short: until *something* scans all the btrfs components, so the
> kernel makes it ready, systemd won't even try to mount it.
Which is the problem here.  Systemd needs to treat BTRFS differently, 
even if the ioctl it's using gets 'fixed'.  Currently it's treating it 
like LVM or MD, when it needs to be treated as just a filesystem with an 
extra wait condition prior to mount (and needs to trust that the user 
knows what they are doing when they mount something by hand).  The IOCTL 
systemd is using was poorly named; what it really does is say that the 
FS is ready to mount normally (that is, without needing 'device=' or 
'degraded' mount options).  Aside from this being problematic with 
degraded volumes, it's got an inherent TOCTOU race condition (so do the 
checks with all the other block layers you mentioned FWIW).  If systemd 
would just treat BTRFS like a filesystem instead of a volume manager, 
and try to mount the volume with the specified options (after waiting 
for udev to report that it's done scanning everything) instead of asking 
the kernel if it's ready, none of this would be an issue.

Put slightly differently:  I use OpenRC and sysv init.  I have a script 
that runs right after udev starts and directly scans all fixed disks for 
BTRFS signatures, and that's _all_ that I need to do to get multi-device 
BTRFS working properly with the standard local filesystem mount script 
in Gentoo.  I don't have to deal with any of this crap that systemd 
users do because Gentoo's OpenRC script for mounting local filesystems 
treats BTRFS like any other filesystem, and (sensibly) assumes that if 
the call to mount succeeds, things are ready and working.
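
The core of that script is nothing more than the equivalent of this (a
minimal sketch of the approach, not my actual script):

	#!/bin/sh
	# Register every btrfs member device with the kernel before the
	# normal localmount script runs.  'btrfs device scan' probes the
	# block devices listed in /proc/partitions for btrfs signatures.
	btrfs device scan
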
> 
>> Then you would just be able to specify 'degraded' in
>> your mount options, and you don't have to care that the kernel refuses
>> to mount degraded filesystems without being explicitly asked to.
> 
> Exactly. But since LP refused to try mounting despite kernel "not-ready"
> state - it is the kernel that must emit 'ready'. So the
> question is: how can I make kernel to mark degraded array as "ready"?
You can't, because the DEVICE_READY IOCTL is coded to mark the volume 
ready when all component devices are ready.  IOW, it's there to say 
'this mount will work without needing -o degraded or specifying any 
devices in the mount options'.

The issue is the interaction here, not the kernel behavior by itself, 
since the kernel behavior produces no issues whatsoever for other init 
systems (though I will acknowledge that the ioctl itself is really only 
used by systemd, but I contend that that's because everything else is 
sensible enough to understand that the ioctl is functionally useless and 
just avoid it).
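
For what it's worth, you can poke that ioctl from the command line too;
`btrfs device ready` is just a thin wrapper around it (using /dev/sda as
a stand-in for any member device):

	# Exit status 0: all member devices are known to the kernel, so the
	# volume is mountable without -o degraded.  Non-zero: at least one
	# member is still missing.
	btrfs device ready /dev/sda; echo $?
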
> 
> The obvious answer is: do it via kernel command line, just like mdadm
> does:
> rootflags=device=/dev/sda,device=/dev/sdb
> rootflags=device=/dev/sda,device=missing
> rootflags=device=/dev/sda,device=/dev/sdb,degraded
> 
> If only btrfs.ko recognized this, kernel would be able to assemble
> multivolume btrfs itself. Not only this would allow automated degraded
> mounts, it would also allow using initrd-less kernels on such volumes.
Last I checked, the 'device=' options work on upstream kernels just 
fine, though I've never tried the degraded option.  Of course, I'm also 
not using systemd, so it may be some interaction with systemd that's 
causing them to not work (and yes, I understand that I'm inclined to 
blame systemd most of the time based on significant past experience with 
systemd creating issues that never existed before).
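
Concretely, using the 'device=' options is nothing more exotic than
something like this (made-up device names, and with the caveat that I've
not personally tested the degraded part):

	# device= tells the kernel about every member up front, and
	# degraded permits mounting with one member missing
	mount -o device=/dev/sda,device=/dev/sdb,degraded /dev/sda /mnt/data
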
> 
>>> It doesn't have to be default, might be kernel compile-time knob, module
>>> parameter or anything else to make the *R*aid work.
>> There's a mount option for it per-filesystem.  Just add that to all your
>> mount calls, and you get exactly the same effect.
> 
> If only they were passed...
> 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 19:47                 ` Chris Murphy
  2017-12-19 21:17                   ` Tomasz Pala
@ 2017-12-20 16:53                   ` Andrei Borzenkov
  2017-12-20 16:57                     ` Austin S. Hemmelgarn
  2017-12-20 20:02                     ` Chris Murphy
  1 sibling, 2 replies; 61+ messages in thread
From: Andrei Borzenkov @ 2017-12-20 16:53 UTC (permalink / raw)
  To: Chris Murphy, Tomasz Pala; +Cc: Linux fs Btrfs

19.12.2017 22:47, Chris Murphy wrote:
> 
>>
>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>> this distro to research every component used?
> 
> As far as I'm aware, only Btrfs single device stuff is "supported".
> The multiple device stuff is definitely not supported on openSUSE, but
> I have no idea to what degree they support it with enterprise license,
> no doubt that support must come with caveats.
> 

I was rather surprised seeing RAID1 and RAID10 listed as supported in
SLES 12.x release notes, especially as there is no support for
multi-device btrfs in YaST and hence no way to even install on such
filesystem.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20 16:53                   ` Andrei Borzenkov
@ 2017-12-20 16:57                     ` Austin S. Hemmelgarn
  2017-12-20 20:02                     ` Chris Murphy
  1 sibling, 0 replies; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-20 16:57 UTC (permalink / raw)
  To: Andrei Borzenkov, Chris Murphy, Tomasz Pala; +Cc: Linux fs Btrfs

On 2017-12-20 11:53, Andrei Borzenkov wrote:
> 19.12.2017 22:47, Chris Murphy wrote:
>>
>>>
>>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>>> this distro to research every component used?
>>
>> As far as I'm aware, only Btrfs single device stuff is "supported".
>> The multiple device stuff is definitely not supported on openSUSE, but
>> I have no idea to what degree they support it with enterprise license,
>> no doubt that support must come with caveats.
>>
> 
> I was rather surprised seeing RAID1 and RAID10 listed as supported in
> SLES 12.x release notes, especially as there is no support for
> multi-device btrfs in YaST and hence no way to even install on such
> filesystem.
That's the beauty of it all though: you don't need to install on such a 
setup directly like you would need to with hardware RAID.  You can 
install in single-device mode and then convert the system on-line to use 
multiple devices, and that will (usually) be faster than a direct 
install if you're using replication (unless you're using RAID10 and have 
a _lot_ of disks).
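
The conversion itself is just something like this (a sketch, with
/dev/sdb standing in for the second disk and / as the target):

	# add the second disk to the mounted filesystem
	btrfs device add /dev/sdb /
	# rewrite existing data and metadata chunks with the raid1 profile
	btrfs balance start -dconvert=raid1 -mconvert=raid1 /
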

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20 13:33                     ` Austin S. Hemmelgarn
@ 2017-12-20 17:28                       ` Duncan
  0 siblings, 0 replies; 61+ messages in thread
From: Duncan @ 2017-12-20 17:28 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Wed, 20 Dec 2017 08:33:03 -0500 as
excerpted:

>> The obvious answer is: do it via kernel command line, just like mdadm
>> does:
>> rootflags=device=/dev/sda,device=/dev/sdb
>> rootflags=device=/dev/sda,device=missing
>> rootflags=device=/dev/sda,device=/dev/sdb,degraded
>> 
>> If only btrfs.ko recognized this, kernel would be able to assemble
>> multivolume btrfs itself. Not only this would allow automated degraded
>> mounts, it would also allow using initrd-less kernels on such volumes.
> Last I checked, the 'device=' options work on upstream kernels just
> fine, though I've never tried the degraded option.  Of course, I'm also
> not using systemd, so it may be some interaction with systemd that's
> causing them to not work (and yes, I understand that I'm inclined to
> blame systemd most of the time based on significant past experience with
> systemd creating issues that never existed before).

Has the bug where rootflags=device=/dev/sda1,device=/dev/sdb1 failed 
been fixed?  Last I knew (which was ancient history in btrfs terms, but 
I've not seen mention of a patch for it in all that time either), device= 
on the userspace commandline worked, and device= on the kernel commandline 
worked if there was just one device, but it would fail for more than one 
device.  Mounting degraded (on a pair-device raid1) would then of course 
work, since it would just use the one device=, but that's simply 
dangerous for routine use regardless of whether it actually assembled or 
not, thus effectively forcing an initr* for multi-device btrfs root in 
order to get it mounted properly.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20  8:34                   ` Tomasz Pala
  2017-12-20  8:51                     ` Tomasz Pala
@ 2017-12-20 19:49                     ` Chris Murphy
  1 sibling, 0 replies; 61+ messages in thread
From: Chris Murphy @ 2017-12-20 19:49 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Linux fs Btrfs

On Wed, Dec 20, 2017 at 1:34 AM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote:
>
>>> Sth like this? I got such problem a few months ago, my solution was
>>> accepted upstream:
>>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>>
>> I can't parse this commit. In particular I can't tell how long it
>> waits, or what triggers the end to waiting.
>
> The point is - it doesn't wait at all. Instead, every 'ready' btrfs
> device triggers event on all the pending devices. Consider 3-device
> filesystem consisting of /dev/sd[abd] with /dev/sdc being different,
> standalone btrfs:
>
> /dev/sda -> 'not ready'
> /dev/sdb -> 'not ready'
> /dev/sdc -> 'ready', triggers /dev/sda -> still 'not ready' and /dev/sdb -> still 'not ready'
> /dev/sdd -> kernel says 'ready', triggers /dev/sda -> 'ready' and /dev/sdb -> 'ready'
>
> This way all the parts of a volume are marked as ready, so systemd won't
> refuse mounting using legacy device nodes like /dev/sda.
>
>
> This particular solution depends on kernel returning 'btrfs ready',
> which would obviously not work for degraded arrays unless the btrfs.ko
> handles some 'missing' or 'mount_degraded' kernel cmdline options
> _before_ actually _trying_ to mount it with -o degraded.


The thing that evaluates a Btrfs volume's "readiness" is udev. The kernel
doesn't care; it still instantiates a volume UUID. And if you pass -o
degraded to a mount of a non-ready Btrfs volume, the kernel code will try
to mount that volume in degraded mode (assuming it passes tests for
the minimum number of devices, can find all the supers it needs, and
bootstrap the chunk tree, etc.).


If the udev rule were smarter, it could infer "non-ready" Btrfs volume
to mean it should wait (and complaining might be nice so we know why
it's waiting) for some period of time, and then, if it's still not
ready, try to mount with -o degraded. I don't know where teaching the
system about degraded attempts belongs - whether the udev rule can tell
systemd to add that mount option if a volume is still not ready, or if
systemd needs hard-coded understanding of this mount option for Btrfs.
There is no risk of using -o degraded on a Btrfs volume if it's
missing too many devices, such a degraded mount will simply fail.
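
For context, the rule in question (systemd's 64-btrfs.rules, quoted
roughly from memory, so treat it as a sketch) boils down to:

	SUBSYSTEM!="block", GOTO="btrfs_end"
	ACTION=="remove", GOTO="btrfs_end"
	ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"
	# ask the kernel whether this btrfs filesystem has all of its members
	IMPORT{builtin}="btrfs ready $devnode"
	# if not, mark the device as not ready so systemd defers the mount
	ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
	LABEL="btrfs_end"

so any waiting or degraded fallback would have to be taught either to
that udev builtin or to systemd itself.
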



> After such timeout, I'd like to tell the kernel: "no more devices, give
> me all the remaining btrfs volumes in degraded mode if possible". By
> "give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
> fire it's rules. And if there would be anything for udev to distinguish
> 'ready' from 'ready-degraded' one could easily compose some notification
> scripting on top of it, including sending e-mail to sysadmin.

I think the linguistics of "btrfs devices ready" is confusing because
what we really care about is whether the volume/array can be mounted
normally (not degraded). The BTRFS_IOC_DEVICES_READY ioctl is pointed
to any one of the volume's devices, and you get a pass/fail. If it
passes (ready), all other devices are present. If it fails (not
ready), one or more devices are missing. It's not necessary to hit
every device with this ioctl to understand what's going on.

If the question can be answered with: ready, ready-degraded - it's
highly likely that you always get ready-degraded as the answer for all
btrfs multiple-device volumes. So if udev were to get ready-degraded,
will it still wait to see if the state goes to ready? How long does it
wait? Seems like it still should wait 90 seconds. In which case it's
going to try to mount with -o degraded.

So I see zero advantage and multiple disadvantages to having the
kernel do a degradedness test well before the mount will be attempted.
I think this is asking for a race condition.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20 16:53                   ` Andrei Borzenkov
  2017-12-20 16:57                     ` Austin S. Hemmelgarn
@ 2017-12-20 20:02                     ` Chris Murphy
  2017-12-20 20:07                       ` Chris Murphy
  1 sibling, 1 reply; 61+ messages in thread
From: Chris Murphy @ 2017-12-20 20:02 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Chris Murphy, Tomasz Pala, Linux fs Btrfs

On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
> 19.12.2017 22:47, Chris Murphy wrote:
>>
>>>
>>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>>> this distro to research every component used?
>>
>> As far as I'm aware, only Btrfs single device stuff is "supported".
>> The multiple device stuff is definitely not supported on openSUSE, but
>> I have no idea to what degree they support it with enterprise license,
>> no doubt that support must come with caveats.
>>
>
> I was rather surprised seeing RAID1 and RAID10 listed as supported in
> SLES 12.x release notes, especially as there is no support for
> multi-device btrfs in YaST and hence no way to even install on such
> filesystem.

Haha. OK well I'm at a loss then. And they use systemd which is going
to run into the udev rule that prevents systemd from even attempting
to mount rootfs if one or more devices are missing. So I don't know
how it really gets supported. At the dracut prompt, manually mount
using -o degraded to /sysroot and then exit? I guess?


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20 20:02                     ` Chris Murphy
@ 2017-12-20 20:07                       ` Chris Murphy
  2017-12-20 20:14                         ` Austin S. Hemmelgarn
  2017-12-21 11:49                         ` Andrei Borzenkov
  0 siblings, 2 replies; 61+ messages in thread
From: Chris Murphy @ 2017-12-20 20:07 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Andrei Borzenkov, Tomasz Pala, Linux fs Btrfs

On Wed, Dec 20, 2017 at 1:02 PM, Chris Murphy <lists@colorremedies.com> wrote:
> On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>> 19.12.2017 22:47, Chris Murphy wrote:
>>>
>>>>
>>>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>>>> this distro to research every component used?
>>>
>>> As far as I'm aware, only Btrfs single device stuff is "supported".
>>> The multiple device stuff is definitely not supported on openSUSE, but
>>> I have no idea to what degree they support it with enterprise license,
>>> no doubt that support must come with caveats.
>>>
>>
>> I was rather surprised seeing RAID1 and RAID10 listed as supported in
>> SLES 12.x release notes, especially as there is no support for
>> multi-device btrfs in YaST and hence no way to even install on such
>> filesystem.
>
> Haha. OK well I'm at a loss then. And they use systemd which is going
> to run into the udev rule that prevents systemd from even attempting
> to mount rootfs if one or more devices are missing. So I don't know
> how it really gets supported. At the dracut prompt, manually mount
> using -o degraded to /sysroot and then exit? I guess?


There is an irony here:

YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
encrypted root with Btrfs either because YaST enforces LVM to do LUKS
encryption for some weird reason; and it also enforces NOT putting
Btrfs on LVM.

Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of
these use cases for something like 5 years (does support Btrfs raid1
and raid10 layouts; and also supports Btrfs directly on dmcrypt
without LVM) - with the caveat that it enforces /boot to be on ext4.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20 20:07                       ` Chris Murphy
@ 2017-12-20 20:14                         ` Austin S. Hemmelgarn
  2017-12-21  1:34                           ` Chris Murphy
  2017-12-21 11:49                         ` Andrei Borzenkov
  1 sibling, 1 reply; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-20 20:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Andrei Borzenkov, Tomasz Pala, Linux fs Btrfs

On 2017-12-20 15:07, Chris Murphy wrote:
> On Wed, Dec 20, 2017 at 1:02 PM, Chris Murphy <lists@colorremedies.com> wrote:
>> On Wed, Dec 20, 2017 at 9:53 AM, Andrei Borzenkov <arvidjaar@gmail.com> wrote:
>>> 19.12.2017 22:47, Chris Murphy wrote:
>>>>
>>>>>
>>>>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>>>>> this distro to research every component used?
>>>>
>>>> As far as I'm aware, only Btrfs single device stuff is "supported".
>>>> The multiple device stuff is definitely not supported on openSUSE, but
>>>> I have no idea to what degree they support it with enterprise license,
>>>> no doubt that support must come with caveats.
>>>>
>>>
>>> I was rather surprised seeing RAID1 and RAID10 listed as supported in
>>> SLES 12.x release notes, especially as there is no support for
>>> multi-device btrfs in YaST and hence no way to even install on such
>>> filesystem.
>>
>> Haha. OK well I'm at a loss then. And they use systemd which is going
>> to run into the udev rule that prevents systemd from even attempting
>> to mount rootfs if one or more devices are missing. So I don't know
>> how it really gets supported. At the dracut prompt, manually mount
>> using -o degraded to /sysroot and then exit? I guess?
> 
> 
> There is an irony here:
> 
> YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
> encrypted root with Btrfs either because YaST enforces LVM to do LUKS
> encryption for some weird reason; and it also enforces NOT putting
> Btrfs on LVM.
The 'LUKS must use LVM' thing is likely historical.  The BCP for using 
LUKS is that it's at the bottom level (so you leak absolutely nothing 
about how your storage stack is structured), and if that's the case you 
need something on top to support separate filesystems, which up until 
BTRFS came around has solely been LVM.

The 'No BTRFS on LVM' thing is likely for sanity reasons.  Using BTRFS 
on SuSE means allocating /boot and swap, and the entire rest of the disk 
is BTRFS.  They only support a single PV or a single BTRFS volume at the 
bottom level per-disk for /.
> 
> Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of
> these use cases for something like 5 years (does support Btrfs raid1
> and raid10 layouts; and also supports Btrfs directly on dmcrypt
> without LVM) - with the caveat that it enforces /boot to be on ext4.
And this caveat is because for some reason Fedora has chosen not to 
integrate BTRFS support into their version of GRUB.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20 20:14                         ` Austin S. Hemmelgarn
@ 2017-12-21  1:34                           ` Chris Murphy
  0 siblings, 0 replies; 61+ messages in thread
From: Chris Murphy @ 2017-12-21  1:34 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: Chris Murphy, Andrei Borzenkov, Tomasz Pala, Linux fs Btrfs

On Wed, Dec 20, 2017 at 1:14 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-12-20 15:07, Chris Murphy wrote:

>> There is an irony here:
>>
>> YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
>> encrypted root with Btrfs either because YaST enforces LVM to do LUKS
>> encryption for some weird reason; and it also enforces NOT putting
>> Btrfs on LVM.
>
> The 'LUKS must use LVM' thing is likely historical.  The BCP for using LUKS
> is that it's at the bottom level (so you leak absolutely nothing about how
> your storage stack is structured), and if that's the case you need something
> on top to support separate filesystems, which up until BTRFS came around has
> solely been LVM.

*shrug* Anaconda has supported plain partition LUKS without
device-mapper for ext3/4 and XFS since forever, even before the
rewrite.


>> Meanwhile, Fedora/Red Hat's Anaconda installer has supported both of
>> these use cases for something like 5 years (does support Btrfs raid1
>> and raid10 layouts; and also supports Btrfs directly on dmcrypt
>> without LVM) - with the caveat that it enforces /boot to be on ext4.
>
> And this caveat is because for some reason Fedora has chosen not to
> integrate BTRFS support into their version of GRUB.

No. The Fedora patchset for upstream GRUB doesn't remove Btrfs
support. However, they don't use grub-mkconfig to rewrite the grub.cfg
when a new kernel is installed. Instead, they use an unrelated project
called grubby, which modifies the existing grub.cfg (and also supports
most all other configs like syslinux/extlinux, yaboot, uboot, lilo,
and others). And grubby gets confused [1] if the grub.cfg is on a
subvolume (other than ID 5). If the grub.cfg is in the ID 5 subvolume,
in a normal directory structure, it works fine.

Chris Murphy



[1] Gory details

The central part of the confusion appears to be this sequence of
comments in this insanely long bug:
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c3
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c5
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c6
https://bugzilla.redhat.com/show_bug.cgi?id=864198#c7

The comments from Gene Czarcinski (now deceased, that's how old this
bug is) try to negotiate an understanding of the problem; he had a fix,
but it didn't meet some upstream grubby requirement, and so the patch
wasn't accepted. Grubby is sufficiently messy that near as I can tell
no other distribution uses it, and no one really cares to maintain it
until something in RHEL breaks and then *that* gets attention.

Upstream bug
https://github.com/rhboot/grubby/issues/22

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-19 20:47                 ` Austin S. Hemmelgarn
  2017-12-19 22:23                   ` Tomasz Pala
@ 2017-12-21 11:44                   ` Andrei Borzenkov
  2017-12-21 12:27                     ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 61+ messages in thread
From: Andrei Borzenkov @ 2017-12-21 11:44 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Tomasz Pala, Linux fs Btrfs

On Tue, Dec 19, 2017 at 11:47 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2017-12-19 15:41, Tomasz Pala wrote:
>>
>> On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:
>>
>>> with a read only file system. Another reason is the kernel code and
>>> udev rule for device "readiness" means the volume is not "ready" until
>>> all member devices are present. And while the volume is not "ready"
>>> systemd will not even attempt to mount. Solving this requires kernel
>>> and udev work, or possibly a helper, to wait an appropriate amount of
>>
>>
>> Sth like this? I got such problem a few months ago, my solution was
>> accepted upstream:
>>
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>>
>> Rationale is in referred ticket, udev would not support any more btrfs
>> logic, so unless btrfs handles this itself on kernel level (daemon?),
>> that is all that can be done.
>
> Or maybe systemd can quit trying to treat BTRFS like a volume manager (which
> it isn't) and just try to mount the requested filesystem with the requested
> options?

You can't mount a filesystem until a sufficient number of devices is
present, and not waiting (or at least attempting to wait) for them opens
you up to races on startup. So far systemd's position has been: it is up to
the filesystem to give it something to wait on. And while apparently
everyone agrees that the current "btrfs device ready" does not fit the
bill, it is the only thing we have.

This integration issue was so far silently ignored both by btrfs and
systemd developers.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20 20:07                       ` Chris Murphy
  2017-12-20 20:14                         ` Austin S. Hemmelgarn
@ 2017-12-21 11:49                         ` Andrei Borzenkov
  1 sibling, 0 replies; 61+ messages in thread
From: Andrei Borzenkov @ 2017-12-21 11:49 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Tomasz Pala, Linux fs Btrfs

On Wed, Dec 20, 2017 at 11:07 PM, Chris Murphy <lists@colorremedies.com> wrote:
>
> YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
> encrypted root with Btrfs either because YaST enforces LVM to do LUKS
> encryption for some weird reason; and it also enforces NOT putting
> Btrfs on LVM.
>

That's incorrect - btrfs on LVM is the default on some SLES flavors and one
of the three standard proposals (where you do not need to go into expert
mode) - normal partitions, LVM, encrypted LVM - even on openSUSE.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-21 11:44                   ` Andrei Borzenkov
@ 2017-12-21 12:27                     ` Austin S. Hemmelgarn
  2017-12-22 16:05                       ` Tomasz Pala
  0 siblings, 1 reply; 61+ messages in thread
From: Austin S. Hemmelgarn @ 2017-12-21 12:27 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: Tomasz Pala, Linux fs Btrfs

On 2017-12-21 06:44, Andrei Borzenkov wrote:
> On Tue, Dec 19, 2017 at 11:47 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2017-12-19 15:41, Tomasz Pala wrote:
>>>
>>> On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:
>>>
>>>> with a read only file system. Another reason is the kernel code and
>>>> udev rule for device "readiness" means the volume is not "ready" until
>>>> all member devices are present. And while the volume is not "ready"
>>>> systemd will not even attempt to mount. Solving this requires kernel
>>>> and udev work, or possibly a helper, to wait an appropriate amount of
>>>
>>>
>>> Sth like this? I got such problem a few months ago, my solution was
>>> accepted upstream:
>>>
>>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>>>
>>> Rationale is in referred ticket, udev would not support any more btrfs
>>> logic, so unless btrfs handles this itself on kernel level (daemon?),
>>> that is all that can be done.
>>
>> Or maybe systemd can quit trying to treat BTRFS like a volume manager (which
>> it isn't) and just try to mount the requested filesystem with the requested
>> options?
> 
> You can't mount filesystem until sufficient number of devices are
> present and not waiting (at least attempting to wait) for them opens
> you to races on startup. So far systemd position was - it is up to
> filesystem to give it something to wait on. And while apparently
> everyone agrees that current "btrfs device ready" does not fit the
> bill, this is the only thing we have.
No, it isn't.  You can just make the damn mount call with the supplied 
options.  If it succeeds, the volume was ready; if it fails, it wasn't.  
It's that simple, and there's absolutely no reason that systemd can't 
just do that in a loop until it succeeds or a timeout is reached.  That 
isn't any more racy than waiting on them is (waiting on them to be ready 
and then mounting them is a TOCTOU race condition), and it doesn't have 
any of these issues with the volume being completely unusable in a 
degraded state.
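
In shell terms, the behaviour I'm describing is nothing more complicated
than this (a sketch, with made-up device and mount point names and a 90
second cap):

	#!/bin/sh
	# try the mount exactly as specified; retry once per second until it
	# succeeds or we give up after roughly 90 seconds
	tries=0
	until mount /dev/sda /mnt/data; do
		tries=$((tries + 1))
		[ "$tries" -ge 90 ] && exit 1
		sleep 1
	done
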

Also, it's not 'up to the filesystem', it's 'up to the underlying 
device'.  LUKS, LVM, MD, and everything else that's an actual device 
layer is what systemd waits on.  XFS, ext4, and any other filesystem 
except BTRFS (and possibly ZFS, but I'm not 100% sure about that) 
provides absolutely _NOTHING_ to wait on.  Systemd just chose to handle 
BTRFS like a device layer, and not a filesystem, so we have this crap to 
deal with, as well as the fact that it makes it impossible to manually 
mount a BTRFS volume with missing or failed devices in degraded mode 
under systemd (because it unmounts it damn near instantly because it 
somehow thinks it knows better than the user what the user wants to do).
> 
> This integration issue was so far silently ignored both by btrfs and
> systemd developers. 
It's been ignored by BTRFS devs because there is _nothing_ wrong on this 
side other than the naming choice for the ioctl.  Systemd is _THE ONLY_ 
init system which has this issue, every other one works just fine.

As far as the systemd side, I have no idea why they are ignoring it, 
though I suspect it's the usual spoiled brat mentality that seems to be 
present about everything that people complain about regarding systemd.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-21 12:27                     ` Austin S. Hemmelgarn
@ 2017-12-22 16:05                       ` Tomasz Pala
  2017-12-22 21:04                         ` Chris Murphy
  0 siblings, 1 reply; 61+ messages in thread
From: Tomasz Pala @ 2017-12-22 16:05 UTC (permalink / raw)
  To: Linux fs Btrfs

On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote:

> No, it isn't.  You can just make the damn mount call with the supplied 
> options.  If it succeeds, the volume was ready, if it fails, it wasn't, 
> it's that simple, and there's absolutely no reason that systemd can't 
> just do that in a loop until it succeeds or a timeout is reached.  That 

There is no such loop, so if the mount happened before all the required
devices showed up, it would either fail outright or, with 'degraded'
in fstab, just start degraded.

> any of these issues with the volume being completely unusable in a 
> degraded state.
> 
> Also, it's not 'up to the filesystem', it's 'up to the underlying 
> device'.  LUKS, LVM, MD, and everything else that's an actual device 
> layer is what systemd waits on.  XFS, ext4, and any other filesystem 
> except BTRFS (and possibly ZFS, but I'm not 100% sure about that) 
> provides absolutely _NOTHING_ to wait on.  Systemd just chose to handle 

You wait for all the devices to settle. One might have a dozen drives,
including some attached via network, and it might take time for them to
become available. Since systemd knows nothing about the underlying
components, it simply waits for the btrfs itself to announce it's ready.

> BTRFS like a device layer, and not a filesystem, so we have this crap to 

As btrfs handles multiple devices in its "lower part", it effectively is a
device layer. Mounting /dev/sda happens to mount various other /dev/sd*
devices that are _not_ explicitly exposed, so there is really no
alternative - except for the 'mount loop', which is a no-go.

> deal with, as well as the fact that it makes it impossible to manually 
> mount a BTRFS volume with missing or failed devices in degraded mode 
> under systemd (because it unmounts it damn near instantly because it 
> somehow thinks it knows better than the user what the user wants to do).

This seems to be some distro-specific misconfiguration; it didn't happen to
me on plain systemd/udev. What is the scenario to reproduce it?

>> This integration issue was so far silently ignored both by btrfs and
>> systemd developers. 
> It's been ignored by BTRFS devs because there is _nothing_ wrong on this 
> side other than the naming choice for the ioctl.  Systemd is _THE ONLY_ 
> init system which has this issue, every other one works just fine.

Not true - mounting btrfs without "btrfs device scan" doesn't work at
all without udev rules (which mimic the behaviour of that command). Let me
repeat the example from Dec 19th:

1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool),
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works
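
For completeness, a sketch of the two workarounds that make both orders
above work - which is exactly the logic the udev rules replicate:

	# either register all btrfs members with the kernel first...
	btrfs device scan
	mount /dev/sda /test
	# ...or name every member explicitly on the mount itself
	mount -o device=/dev/sda,device=/dev/sdb /dev/sda /test
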

> As far as the systemd side, I have no idea why they are ignoring it, 
> though I suspect it's the usual spoiled brat mentality that seems to be 
> present about everything that people complain about regarding systemd.

Explanation above. This is the point where _you_ need to stop ignoring
the fact that you simply cannot just try mounting devices in a loop, as
this would render any NAS/FC/iSCSI-backed or more complicated systems
unusable, or hide problems in case of temporary connection issues.

systemd waits for the _underlying_ devices - unless btrfs exposes them as
a list of _actual_ devices to wait for, there is nothing systemd can do
except wait for btrfs itself.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-22 16:05                       ` Tomasz Pala
@ 2017-12-22 21:04                         ` Chris Murphy
  2017-12-23  2:52                           ` Tomasz Pala
  0 siblings, 1 reply; 61+ messages in thread
From: Chris Murphy @ 2017-12-22 21:04 UTC (permalink / raw)
  To: Tomasz Pala; +Cc: Linux fs Btrfs

On Fri, Dec 22, 2017 at 9:05 AM, Tomasz Pala <gotar@polanet.pl> wrote:
> On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote:
>
>> Also, it's not 'up to the filesystem', it's 'up to the underlying
>> device'.  LUKS, LVM, MD, and everything else that's an actual device
>> layer is what systemd waits on.  XFS, ext4, and any other filesystem
>> except BTRFS (and possibly ZFS, but I'm not 100% sure about that)
>> provides absolutely _NOTHING_ to wait on.  Systemd just chose to handle
>
> You wait for all the devices to settle. One might have a dozen drives,
> including some attached via network, and it might take time for them to
> become available. Since systemd knows nothing about the underlying
> components, it simply waits for the btrfs itself to announce it's ready.


I'm pretty sure degraded boot timeout policy is handled by dracut. The
kernel doesn't just automatically assemble an md array as soon as it's
possible (degraded) and then switch to normal operation as other
devices appear. I have no idea how LVM manages the delay policy for
multiple devices.

I don't think the delay policy belongs in the kernel.

It's pie in the sky, and unicorns, but it sure would be nice to have
standardization rather than everyone rolling their own solution. The
Red Hat Stratis folks will need something to do this for their
solution so yet another one is about to be developed...


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-22 21:04                         ` Chris Murphy
@ 2017-12-23  2:52                           ` Tomasz Pala
  2017-12-23  5:40                             ` Duncan
  0 siblings, 1 reply; 61+ messages in thread
From: Tomasz Pala @ 2017-12-23  2:52 UTC (permalink / raw)
  To: Linux fs Btrfs

On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote:

> I'm pretty sure degraded boot timeout policy is handled by dracut. The

Well, last time I checked, dracut on a systemd system couldn't even
generate a systemd-less image.

> kernel doesn't just automatically assemble an md array as soon as it's
> possible (degraded) and then switch to normal operation as other

MD devices are explicitly listed in mdadm.conf (for mdadm --assemble
--scan), on the kernel command line, or in the metadata of autodetected partitions (type fd).

> devices appear. I have no idea how LVM manages the delay policy for
> multiple devices.

I *guess* it's not about waiting, but simply being executed after the
devices are ready.

And there is a VERY long history of various init systems having problems
booting systems using multi-layer setups (LVM/MD under or above LUKS,
not to mention remote ones that need networking to be set up).

All of this works reasonably well under systemd - except for btrfs,
which uses a single device node to match an entire group of devices. That is
convenient for a human (no need to switch between /dev/mdX and
/dev/sdX), but impossible for userspace tools to guess automatically.
There is only the probe ioctl, which doesn't handle degraded mode.

> I don't think the delay policy belongs in the kernel.

That is exactly why the systemd waits for appropriate udev state.

> It's pie in the sky, and unicorns, but it sure would be nice to have
> standardization rather than everyone rolling their own solution. The

There was a de facto standard, I think - expose the component devices or
require them to be specified. Apparently there is no such thing in btrfs, so
it must be handled in a btrfs-specific way.

Also note that MD can be assembled by the kernel itself, while btrfs cannot
(so initrd is required for rootfs).

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-20  0:08                     ` Chris Murphy
@ 2017-12-23  4:08                       ` Tomasz Pala
  2017-12-23  5:23                         ` Duncan
  0 siblings, 1 reply; 61+ messages in thread
From: Tomasz Pala @ 2017-12-23  4:08 UTC (permalink / raw)
  To: Linux fs Btrfs

On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote:

>>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>>> safely add "degraded" to the mount options? My primary concern is the
>> [...]
> 
> Well it only does rw once, then the next degraded is ro - there are
> patches dealing with this better but I don't know the state. And
> there's no resync code that I'm aware of, absolutely it's not good
> enough to just kick off a full scrub - that has huge performance
> implications and I'd consider it a regression compared to
> functionality in LVM and mdadm RAID by default with the write intent
> bitmap.  Without some equivalent short cut, automatic degraded means a

I read about the 'scrub' all the time here, so let me ask this
directly, as this is also not documented clearly:

1. is the full scrub required after ANY desync? (like: degraded mount
followed by re-adding the old device)?

2. if the scrub is omitted - is it possible that btrfs returns invalid data (from the
desynced and re-added drive)?

3. is the scrub required to be scheduled on a regular basis? By 'required'
I mean by design/implementation issues/quirks, _not_ related to possible
hardware malfunctions.

-- 
Tomasz Pala <gotar@pld-linux.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-23  4:08                       ` Tomasz Pala
@ 2017-12-23  5:23                         ` Duncan
  0 siblings, 0 replies; 61+ messages in thread
From: Duncan @ 2017-12-23  5:23 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Pala posted on Sat, 23 Dec 2017 05:08:16 +0100 as excerpted:

> On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote:
> 
>>>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>>>> safely add "degraded" to the mount options? My primary concern is
>>>>> the
>>> [...]
>> 
>> Well it only does rw once, then the next degraded is ro - there are
>> patches dealing with this better but I don't know the state. And
>> there's no resync code that I'm aware of, absolutely it's not good
>> enough to just kick off a full scrub - that has huge performance
>> implications and I'd consider it a regression compared to functionality
>> in LVM and mdadm RAID by default with the write intent bitmap.  Without
>> some equivalent short cut, automatic degraded means a
> 
> I read about the 'scrub' all the time here, so let me ask this
> directly, as this is also not documented clearly:
> 
> 1. is the full scrub required after ANY desync? (like: degraded mount
> followed by re-adding the old device)?

It is very strongly recommended.

> 2. if the scrub is omitted - is it possible that btrfs returns invalid
> data (from the desynced and re-added drive)?

Were invalid data returned it would be a bug.  However, a reasonably 
common refrain here is that btrfs is "still stabilizing, not yet fully 
stable and mature", so occasional bugs can be expected, tho both the 
ideal and experience suggest that they're gradually reducing in 
frequency and severity as time goes on and we get closer to "fully stable 
and mature".

Which of course is why both having usable and tested backups, and keeping 
current with the kernel, are strongly recommended as well, the first in 
case one of those bugs does hit and it's severe enough to take out your 
working btrfs, the second because later kernels have fewer known bugs in 
the first place.

Functioning as designed and as intent-coded, in the case of a desync, 
btrfs will use the copy with the latest generation/transid serial, and 
thus should never return older data from the desynced device.  Further, 
btrfs is designed to be self-healing and will transparently rewrite the 
out-of-sync copy, syncing it in the process, as it comes across each 
stale block.

But the only way to be sure everything's consistent again is that scrub, 
and of course if something should happen to the only current copy while 
the desync still has the other copy stale, /then/ you lose data.

And as I said, that's functioning as designed and intent-coded, assuming 
no bugs, an explicitly unsafe assumption given btrfs' "still stabilizing" 
state.

So... "strongly recommended" indeed, tho in theory it shouldn't be 
absolutely required as long as unlucky fate doesn't strike before the 
data is transparently synced in normal usage.  YMMV, but I definitely do 
those scrubs here.
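
FWIW, "those scrubs" are nothing fancy, just something along these lines
(with /mnt standing in for the actual mountpoint):

	# re-sync and verify both copies after running degraded; -B stays in
	# the foreground, -d prints per-device statistics when it finishes
	btrfs scrub start -Bd /mnt
	# and check the cumulative error counters afterward
	btrfs device stats /mnt
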

> 3. is the scrub required to be scheduled on a regular basis? By 'required'
> I mean by design/implementation issues/quirks, _not_ related to possible
> hardware malfunctions.

Perhaps I'm tempting fate, but I don't do scheduled/regular scrubs here.  
Only if I have an ungraceful shutdown or see complaints in the log (which 
I tail to a system status dashboard so I'd be likely to notice a problem 
one way or the other pretty quickly).

But I do keep those backups, and while it has been quite some time (over 
a year, I'd say about 18 months to two years, and I was actually able to 
use btrfs restore and avoid having to use the backups themselves the last 
time it happened even 18 months or whatever ago) now since I had to use 
them, I /did/ actually spend some significant money upgrading my backups 
to all-SSD in order to make updating those backups easier and encourage 
me to keep them much more current than I had been (btrfs restore saved me 
more trouble than I'm comfortable admitting, given that I /did/ have 
backups, but they weren't the freshest at the time).

If as some people I had my backups offsite and would have to download 
them if I actually needed them, I'd potentially be rather stricter and 
schedule regular scrubs.

So by design and intention-coding, no, regularly scheduled scrubs aren't 
"required".  But I'd treat them the same as I would on non-btrfs raid, or 
a bit stricter given the above discussed btrfs stability status.  If 
you'd be uncomfortable not scheduling regular scrubs on your non-btrfs 
raid, you better be uncomfortable not scheduling them on btrfs as well!

And as always, btrfs or no btrfs, scrub or no scrub, have your backups or 
you are literally defining your data as not worth the time/trouble/
resources necessary to do them, and some day, maybe 10 minutes from now, 
maybe 10 years from now, fate's going to call you on that definition!

(Yes, I know /you/ know that or we'd not have this thread, which 
demonstrates that you /do/ care about your data.  But it's as much about 
the lurkers and googlers coming across the thread later as it is the 
direct participants...)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-23  2:52                           ` Tomasz Pala
@ 2017-12-23  5:40                             ` Duncan
  0 siblings, 0 replies; 61+ messages in thread
From: Duncan @ 2017-12-23  5:40 UTC (permalink / raw)
  To: linux-btrfs

Tomasz Pala posted on Sat, 23 Dec 2017 03:52:47 +0100 as excerpted:

> On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote:
> 
>> I'm pretty sure degraded boot timeout policy is handled by dracut. The
> 
> Well, last time I've checked dracut on systemd-system couldn't even
> generate systemd-less image.

??

Unless it changed recently (I /chose/ a systemd-based dracut setup here, 
so I'd not be aware if it did), dracut can indeed do systemd-less initr* 
images.  Dracut is modular, and systemd is one of the modules, enabled by 
default on a systemd system, but not required, as I know because I had 
dracut set up without the systemd module for some time after I switched to 
systemd for my main sysinit, and I verified it didn't install systemd in 
the initr* until I activated the systemd module.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: Unexpected raid1 behaviour
  2017-12-18 13:31 ` Austin S. Hemmelgarn
@ 2018-01-12 12:26   ` Dark Penguin
  0 siblings, 0 replies; 61+ messages in thread
From: Dark Penguin @ 2018-01-12 12:26 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs



On 18/12/17 16:31, Austin S. Hemmelgarn wrote:
> On 2017-12-16 14:50, Dark Penguin wrote:
>> Could someone please point me towards some read about how btrfs handles
>> multiple devices? Namely, kicking faulty devices and re-adding them.
>>
>> I've been using btrfs on single devices for a while, but now I want to
>> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
>> tried to see how does it handle various situations. The experience left
>> me very surprised; I've tried a number of things, all of which produced
>> unexpected results.
> Expounding a bit on Duncan's answer with some more specific info.
>>
>> I create a btrfs raid1 filesystem on two hard drives and mount it.
>>
>> - When I pull one of the drives out (simulating a simple cable failure,
>> which happens pretty often to me), the filesystem sometimes goes
>> read-only. ???
>> - But only after a while, and not always. ???
> The filesystem won't go read-only until it hits an I/O error, and it's
> non-deterministic how long it will be before that happens on an idle
> filesystem that only sees read access (because if all the files that are
> being read are in the page cache).
>> - When I fix the cable problem (plug the device back), it's immediately
>> "re-added" back. But I see no replication of the data I've written onto
>> a degraded filesystem... Nothing shows any problems, so "my filesystem
>> must be ok". ???
> One of two things happens in this case, and why there is no re-sync is
> dependent on which happens, but both ultimately have to do with the fact
> that BTRFS assumes I/O errors are from device failures, and are at worst
> transient.  Either:
>
> 1. The device reappears with the same name. This happens if the time it
> was disconnected is less than the kernel's command timeout (30 seconds
> by default).  In this case, BTRFS may not even notice that the device
> was gone (and if it doesn't, then a re-sync isn't necessary, since it
> will retry all the writes it needs to).  In this case, BTRFS assumes the
> I/O errors were temporary, and keeps using the device after logging the
> errors.  If this happens, then you need to manually re-sync things by
> scrubbing the filesystem (or balancing, but scrubbing is preferred as it
> should run quicker and will only re-write what is actually needed).
> 2. The device reappears with a different name.  In this case, the device
> was gone long enough that the block layer is certain it was
> disconnected, and thus when it reappears and BTRFS still holds open
> references to the old device node, it gets a new device node.  In this
> case, if the 'new' device is scanned, BTRFS will recognize it as part of
> the FS, but will keep using the old device node.  The correct fix here
> is to unmount the filesystem, re-scan all devices, and then remount the
> filesystem and manually re-sync with a scrub.
>
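
(So, if I follow, the manual recovery in the two cases would look
roughly like this; /mnt/data and /dev/sdb stand in for my mountpoint
and the reappeared device:

    # case 1: device came back under the same name -> just re-sync
    btrfs scrub start -Bd /mnt/data

    # case 2: device came back under a new name
    umount /mnt/data
    btrfs device scan
    mount /dev/sdb /mnt/data
    btrfs scrub start -Bd /mnt/data

...and the scrub rewrites whatever is missing or stale on the
reconnected device.)
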
>> - If I unmount the filesystem and then mount it back, I see all my
>> recent changes lost (everything I wrote during the "degraded" period).
> I'm not quite sure about this, but I think BTRFS is rolling back to the
> last common generation number for some reason.
>
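
(Presumably one could check whether the two copies have diverged by
comparing the generation field in each device's superblock, something
like the following, with /dev/sdb and /dev/sdc standing in for the two
members:

    btrfs inspect-internal dump-super /dev/sdb | grep -w generation
    btrfs inspect-internal dump-super /dev/sdc | grep -w generation
)
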
>> - If I continue working with a degraded raid1 filesystem (even without
>> damaging it further by re-adding the faulty device), after a while it
>> won't mount at all, even with "-o degraded".
> This is (probably) a known bug relating to chunk handling.  In a two
> device volume using a raid1 profile with a missing device, older kernels
> (I don't remember when the fix went in, but I could have sworn it was in
> 4.13) will (erroneously) generate single-profile chunks when they need
> to allocate new chunks.  When you then go to mount the filesystem, the
> degraded-mountability check fails because the FS has both a missing
> device and single-profile chunks.
>
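
(To see whether such single-profile chunks have appeared, something
like this should list the allocated profiles; /mnt/data is a
placeholder:

    btrfs filesystem df /mnt/data
    # or, with reasonably recent progs:
    btrfs filesystem usage /mnt/data
)
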
> Now, even without that bug, it's never a good idea to run a storage
> array degraded for any extended period of time, regardless of what type
> of array it is (BTRFS, ZFS, MD, LVM, or even hardware RAID).  By keeping
> it in 'degraded' mode, you're essentially telling the system that the
> array will be fixed in a reasonably short time-frame, which impacts how
> it handles the array.  If you're not going to fix it almost immediately,
> you should almost always reshape the array to account for the missing
> device if at all possible, as that will improve relative data safety and
> generally get you better performance than running degraded will.
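
(If I read this right, the "reshape" for a two-device raid1 that lost a
member would be a conversion plus dropping the missing device, roughly:

    mount -o degraded /dev/sdb /mnt/data
    btrfs balance start -f -dconvert=single -mconvert=dup /mnt/data
    btrfs device remove missing /mnt/data

Device and mountpoint names are placeholders; -f allows the metadata
redundancy to be reduced during the conversion.)
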
>>
>> I can't wrap my head about all this. Either the kicked device should not
>> be re-added, or it should be re-added "properly", or it should at least
>> show some errors and not pretend nothing happened, right?..
> BTRFS is not the best at error reporting at the moment.  If you check
> the output of `btrfs device stats` for that filesystem though, it should
> show non-zero values in the error counters (note that these counters are
> cumulative, so they are counts since the last time they were reset, or
> since the FS was created if they have never been reset).  Similarly,
> scrub should report errors, there should be error messages in the kernel
> log, and switching the FS to read-only mode _is_ technically reporting
> an error, as that's standard error behavior for most sensible
> filesystems (ext[234] being the notable exception, they just continue as
> if nothing happened).
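
(For the record, the counters can be read and reset like this, with
/mnt/data as a placeholder:

    # show cumulative per-device error counters
    btrfs device stats /mnt/data
    # reset them after dealing with the cause
    btrfs device stats -z /mnt/data
)
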
>>
>> I must be missing something. Is there an explanation somewhere about
>> what's really going on during those situations? Also, do I understand
>> correctly that upon detecting a faulty device (a write error), nothing
>> is done about it except logging an error into the 'btrfs device stats'
>> report? No device kicking, no notification?.. And what about degraded
>> filesystems - is it absolutely forbidden to work with them without
>> converting them to a "single" filesystem first?..
> As mentioned above, going read-only _is_ a notification that something
> is wrong.  Translating that (and the error counter increase, and the
> kernel log messages) into a user visible notification is not really the
> job of BTRFS, especially considering that no other filesystem or device
> manager does so either (yes, you can get nice notifications from LVM,
> but they aren't _from_ LVM itself, they're from other software that
> watches for errors, and the same type of software works just fine for
> BTRFS too).  If you're this worried about it and don't want to keep on
> top of it yourself by monitoring things manually, you really need to
> look into a tool like monit [1] that can handle this for you.
>
>
> [1] https://mmonit.com/monit/


Thank you! That was a really detailed explanation!

I had been using MD for a long time, so I was expecting much the same
behaviour - refusing to re-add a failed device without resyncing,
kicking faulty devices from the array, sending email warnings, being
able to use the array in degraded mode without problems (in the RAID1
case) and so on. But I guess a few things are different in the btrfs
mindset. It behaves more like a filesystem, so it doesn't enforce data
integrity for you; noticing errors and fixing them is up to you, as
with any normal filesystem.

The test I did was a "try to break btrfs and see if it survives" test,
which mdadm would have passed (probably), but now I understand that
btrfs was not made for that. However, with some error-reporting tools,
it's probably possible to make it reasonably reliable.
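
Even something as simple as a cron job that checks the error counters
would probably do for a start (a rough, untested sketch; the mountpoint
and mail recipient are placeholders):

    #!/bin/sh
    # warn if any btrfs error counter on /mnt/data is non-zero
    STATS="$(btrfs device stats /mnt/data)"
    if echo "$STATS" | grep -vq ' 0$'; then
        echo "$STATS" | mail -s "btrfs errors on $(hostname)" root
    fi

or the same check wired into monit as a "check program" entry.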


-- 
darkpenguin


^ permalink raw reply	[flat|nested] 61+ messages in thread

Thread overview: 61+ messages
2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin
2017-12-17 11:58 ` Duncan
2017-12-17 15:48   ` Peter Grandi
2017-12-17 20:42     ` Chris Murphy
2017-12-18  8:49       ` Anand Jain
2017-12-18  8:49     ` Anand Jain
2017-12-18 10:36       ` Peter Grandi
2017-12-18 12:10       ` Nikolay Borisov
2017-12-18 13:43         ` Anand Jain
2017-12-18 22:28       ` Chris Murphy
2017-12-18 22:29         ` Chris Murphy
2017-12-19 12:30         ` Adam Borowski
2017-12-19 12:54         ` Andrei Borzenkov
2017-12-19 12:59         ` Peter Grandi
2017-12-18 13:06     ` Austin S. Hemmelgarn
2017-12-18 19:43       ` Tomasz Pala
2017-12-18 22:01         ` Peter Grandi
2017-12-19 12:46           ` Austin S. Hemmelgarn
2017-12-19 12:25         ` Austin S. Hemmelgarn
2017-12-19 14:46           ` Tomasz Pala
2017-12-19 16:35             ` Austin S. Hemmelgarn
2017-12-19 17:56               ` Tomasz Pala
2017-12-19 19:47                 ` Chris Murphy
2017-12-19 21:17                   ` Tomasz Pala
2017-12-20  0:08                     ` Chris Murphy
2017-12-23  4:08                       ` Tomasz Pala
2017-12-23  5:23                         ` Duncan
2017-12-20 16:53                   ` Andrei Borzenkov
2017-12-20 16:57                     ` Austin S. Hemmelgarn
2017-12-20 20:02                     ` Chris Murphy
2017-12-20 20:07                       ` Chris Murphy
2017-12-20 20:14                         ` Austin S. Hemmelgarn
2017-12-21  1:34                           ` Chris Murphy
2017-12-21 11:49                         ` Andrei Borzenkov
2017-12-19 20:11                 ` Austin S. Hemmelgarn
2017-12-19 21:58                   ` Tomasz Pala
2017-12-20 13:10                     ` Austin S. Hemmelgarn
2017-12-19 23:53                   ` Chris Murphy
2017-12-20 13:12                     ` Austin S. Hemmelgarn
2017-12-19 18:31             ` George Mitchell
2017-12-19 20:28               ` Tomasz Pala
2017-12-19 19:35             ` Chris Murphy
2017-12-19 20:41               ` Tomasz Pala
2017-12-19 20:47                 ` Austin S. Hemmelgarn
2017-12-19 22:23                   ` Tomasz Pala
2017-12-20 13:33                     ` Austin S. Hemmelgarn
2017-12-20 17:28                       ` Duncan
2017-12-21 11:44                   ` Andrei Borzenkov
2017-12-21 12:27                     ` Austin S. Hemmelgarn
2017-12-22 16:05                       ` Tomasz Pala
2017-12-22 21:04                         ` Chris Murphy
2017-12-23  2:52                           ` Tomasz Pala
2017-12-23  5:40                             ` Duncan
2017-12-19 23:59                 ` Chris Murphy
2017-12-20  8:34                   ` Tomasz Pala
2017-12-20  8:51                     ` Tomasz Pala
2017-12-20 19:49                     ` Chris Murphy
2017-12-18  5:11   ` Anand Jain
2017-12-18  1:20 ` Qu Wenruo
2017-12-18 13:31 ` Austin S. Hemmelgarn
2018-01-12 12:26   ` Dark Penguin
