* Re: BTRFS state on kernel 5.2
2019-09-02 17:21 BTRFS state on kernel 5.2 waxhead
@ 2019-09-03 0:10 ` Remi Gauvin
2019-09-03 1:59 ` Christoph Anton Mitterer
2019-09-03 3:30 ` Chris Murphy
1 sibling, 1 reply; 5+ messages in thread
From: Remi Gauvin @ 2019-09-03 0:10 UTC
To: waxhead, Linux BTRFS Mailinglist
On 2019-09-02 1:21 p.m., waxhead wrote:
> 5. DEVICE REPLACE: (Using_Btrfs_with_Multiple_Devices page)
> It is not clear what to do to recover from a device failure on BTRFS.
> If a device is partly working then you can run the replace functionality
> and hopefully you're good to go afterwards. Ok fine , if this however
> does not work or you have a completely failed device it is a different
> story. My understanding of it is:
> If not enough free space (or devices) is available to restore redundancy
> you first need to add a new device, and then you need to A: first run
> metadata balance (to ensure that the filesystem structures is redundant)
> and then B: run a data balance to restore redundancy for your data.
> Is there any filters that can be applied to only restore chunks which
> are having a missing mirror / stripe member?
If you are adding a new device of the same size or larger than the device
you are replacing, do not run any balances; you can still just do a device
replace. The only difference is that, if the failed device is missing
entirely, you have to specify the device ID of the missing device
(rather than a /dev/sd? path).
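As a rough illustration of the two cases, the sketch below only builds the
argument list for `btrfs replace start`, which accepts either a source device
path or a numeric devid. The helper name `replace_command`, the devid, the
device node and the mount point are all made up for illustration; nothing is
executed.

```python
from typing import List, Union


def replace_command(source: Union[int, str], target: str, mount_point: str) -> List[str]:
    """Build (but do not run) a `btrfs replace start` command line.

    `source` is a device path while the old device is still present, or the
    numeric devid (as listed by `btrfs filesystem show`) when it is missing.
    """
    if isinstance(source, int):
        if source < 1:
            raise ValueError("btrfs devids start at 1")
        src = str(source)
    else:
        src = source
    return ["btrfs", "replace", "start", src, target, mount_point]


# Hypothetical values: devid 2 is missing, /dev/sdd is the new disk.
print(" ".join(replace_command(2, "/dev/sdd", "/mnt/pool")))
```

With the old device missing entirely, the filesystem would typically have to
be mounted with `-o degraded` before the replace can be started.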
>
> 6. RAID56 (status page)
> The RAID56 have had the write hole problem for a long time now, but it
> is not well explained what the consequence of it is for data -
> especially if you have metadata stored in raid1/10.
> If you encounter a powerloss / kernel panic during write - what will
> actually happen?
> Will a fresh file simply be missing or corrupted (as in partly written).
> If you overwrite/append to a existing file - what is the consequence
> then? will you end up with... A: The old data, B: Corrupted or zeroed
> data?! This is not made clear in the comment and it would be great if
> we, the BTRFS users would understand what the risk of hitting the write
> hole actually is.
The parity data from an interrupted write will be missing or corrupt. This
will in turn affect old data, not just the data you were writing. The
write hole is only of consequence when the array is read degraded
(i.e., a drive has failed or is missing) or, though unlikely, when you
happen to hit a bad sector in the same range of data as the corrupt
parity.
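A toy model may make this concrete. The sketch below is not Btrfs code: it
assumes a single stripe with two data strips and one XOR parity strip, and
the byte strings are made-up placeholders. It shows that stale parity is
harmless while all strips are readable, and corrupts the untouched strip
once that strip has to be rebuilt.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length strips."""
    return bytes(x ^ y for x, y in zip(a, b))


# One stripe on a three-device, raid5-like layout: two data strips + parity.
old_a = b"AAAA"          # previous contents of the strip being written
old_b = b"BBBB"          # old, untouched data sharing the same stripe
parity = xor(old_a, old_b)

# Interrupted write: the new data strip reaches disk, the parity update does not.
new_a = b"aaaa"
stale_parity = parity

# All devices present: both strips are read directly, stale parity is never used.
healthy_read = (new_a, old_b)

# Degraded: the device holding strip B is gone and B is rebuilt from parity.
rebuilt_b = xor(new_a, stale_parity)
print(rebuilt_b == old_b)   # False: data that was never rewritten is now wrong
```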
If the corrupted data affects metadata, the consequences can be anything
from minor to completely unreadable filesystem.
If it affects data blocks, some files will be unreadable, but they can
simply be deleted/restored from backup.
As you noted, metadata can be made Raid1, which will at least prevent
complete filesystem meltdown from the write hole. But until the patches
that increase the number of devices in a mirrored Raid are merged, there
is no way to make the pool tolerant of 2 device failures, so Raid 6 is
mostly useless. (Arguably, Raid 6 data would be much more likely to recover
from an unreadable sector while recovering from a missing device.)
It's also important to understand that unlike most other (all other?)
raid implementations, BTRFS will not, by itself, fix parity when it
restarts after an unclean shutdown. It's up to the administrator to run
a scrub manually. Otherwise, parity errors will accumulate with each
unclean shutdown, and in turn, result in unrecoverable data if the array
is later degraded.
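To illustrate why the scrub has to happen before a device is lost, here is a
small model of the idea (not of the actual scrub implementation): parity is
recomputed from data strips that are assumed to be good, so a later
reconstruction works again. On a real system the corresponding step would be
`btrfs scrub start <mountpoint>`; the function names and byte strings below
are invented for the example.

```python
from functools import reduce
from typing import List


def parity_of(strips: List[bytes]) -> bytes:
    """XOR parity over equal-length strips."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*strips))


def scrub(data_strips: List[bytes]) -> bytes:
    """Toy scrub: return the parity that should be on disk, assuming the
    data strips themselves are intact (e.g. their checksums verified)."""
    return parity_of(data_strips)


data = [b"aaaa", b"BBBB"]
stale = parity_of([b"AAAA", b"BBBB"])   # parity left over from before the crash
fixed = scrub(data)

# Simulate losing the second device afterwards and rebuilding its strip.
print(parity_of([data[0], stale]) == data[1])   # False: no scrub was run
print(parity_of([data[0], fixed]) == data[1])   # True: parity was rewritten first
```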
>
> 13. NODATACOW:
> As far as I can remember there was some issues regarding NOCOW
> files/directories on the mailing list a while ago. I can't find any
> issues related to nocow on the wiki (I might not have searched enough)
> but I don't think they are fixed so maybe someone can verify that.
> And by the way ...are NOCOW files still not checksummed? If yes, are
> there plans to add that (it would be especially nice to know if a nocow
> file is correct or not)
>
AFAIK, checksumming NOCOW files is technically not possible, so no plans
exist to even try adding that functionality. If you think about it,
any unclean filesystem stop while NOCOW data is being written would
result in an inconsistent checksum, so it would be self-defeating.
As for the Nocow problems, it has to do with Mirrored Raid. Without COW
or checksums, BTRFS has no method whatsoever of keeping Raid mirrored
data consistent. In the case of unclean stop while data is being
written, the two copies will be different, and which data gets read at
any time is entirely up to the fates. Not only will BTRFS not
synchronize the mirrored copies by itself on next boot, it won't even
fix it in a scrub.
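The difference a checksum makes can be sketched with a toy model. This is
only an illustration of the reasoning, not of Btrfs internals: `pick_copy` is
a hypothetical helper, and `zlib.crc32` stands in for the filesystem's real
checksum. Returning `None` here means "nothing can tell the copies apart"; a
real read would simply return whichever mirror happened to be chosen.

```python
import zlib
from typing import List, Optional


def pick_copy(copies: List[bytes], csum: Optional[int]) -> Optional[bytes]:
    """Return the mirror copy that can be trusted, or None if undecidable.

    With a stored checksum, the copy matching it wins. Without one (the
    NOCOW case), diverged copies cannot be arbitrated at all.
    """
    if csum is not None:
        for copy in copies:
            if zlib.crc32(copy) == csum:
                return copy
        return None
    return copies[0] if all(c == copies[0] for c in copies) else None


# Unclean stop: one mirror received the write, the other did not.
mirrors = [b"new data", b"old data"]

print(pick_copy(mirrors, zlib.crc32(b"old data")))  # checksummed: a valid copy is found
print(pick_copy(mirrors, None))                     # NOCOW: None, nothing to go on
```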
This behaviour, as you noted, is still undocumented after my little
outburst a few months back. IMO, it's pretty bad.
* Re: BTRFS state on kernel 5.2
From: Chris Murphy @ 2019-09-03 3:30 UTC
To: Linux BTRFS Mailinglist
On Mon, Sep 2, 2019 at 11:21 AM waxhead <waxhead@dirtcellar.net> wrote:
> 2. DEFRAG: (status page)
> The status page marks defrag as "mostly ok" for stability and "ok" for
> performance. While I understand that extents gets unshared I don't see
> how this will affect stability. Performance (as in space efficiency) on
> the other hand is more likely to be affected. Also it is not (perfectly)
> clear what the difference is in consequence by using the autodefrag
> mount option vs "btrfs filesystem defrag" Can someone please consider
> rewriting this?
It needs "OK - see Gotchas" because shared extents becoming unshared
could be hugely problematic if you're not expecting it.
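To put rough numbers on that, here is a simplified space model. It assumes
that defragmenting a file rewrites it completely into private extents, which
real defrag may not do, and the file size and snapshot count are invented for
the example.

```python
GIB = 1024 ** 3


def allocated_bytes(file_size: int, copies: int, defragmented: int) -> int:
    """Space allocated for `copies` snapshotted views of one file after
    `defragmented` of them were fully rewritten by defrag.

    Simplified model: every rewritten view gets private extents, and all
    remaining views keep sharing the single original set of extents.
    """
    if not 0 <= defragmented <= copies:
        raise ValueError("defragmented must be between 0 and copies")
    still_shared = 1 if defragmented < copies else 0
    return file_size * (defragmented + still_shared)


# Hypothetical 10 GiB file visible in a live subvolume plus 4 snapshots.
for n in (0, 1, 5):
    print(n, allocated_bytes(10 * GIB, copies=5, defragmented=n) // GIB, "GiB")
```

In this model, defragmenting only the live copy already doubles the space
used (10 GiB to 20 GiB), and defragmenting every snapshot as well takes it to
50 GiB.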
> 3. SCRUB + RAID56: (status page)
> The status page says it is mostly ok for both stability and performance.
> It is not stated what the problem is with stability, does this have to
> do with the write-hole ?
I think concerns need to be split out for metadata and data. The main
gotcha is if there's a crash you need to do a scrub, and there are no
partial scrubs.
In the case of data, at least there's still a warning on bad
reconstruction (from corrupt strip), because of data csums not
matching.
> 5. DEVICE REPLACE: (Using_Btrfs_with_Multiple_Devices page)
> It is not clear what to do to recover from a device failure on BTRFS.
> If a device is partly working then you can run the replace functionality
> and hopefully you're good to go afterwards. Ok fine , if this however
> does not work or you have a completely failed device it is a different
> story. My understanding of it is:
> If not enough free space (or devices) is available to restore redundancy
> you first need to add a new device, and then you need to A: first run
> metadata balance (to ensure that the filesystem structures is redundant)
> and then B: run a data balance to restore redundancy for your data.
> Is there any filters that can be applied to only restore chunks which
> are having a missing mirror / stripe member?
It is a bit boolean in that it depends on several variables, and is
another reason why a btrfsd service to help do smarter things that
depend on policy decisions would be a very useful future addition. But
sorta what you're getting to is we're not sure what the medium, long
term plans are.
>
> 6. RAID56 (status page)
> The RAID56 have had the write hole problem for a long time now, but it
> is not well explained what the consequence of it is for data -
> especially if you have metadata stored in raid1/10.
> If you encounter a powerloss / kernel panic during write - what will
> actually happen?
> Will a fresh file simply be missing or corrupted (as in partly written).
> If you overwrite/append to a existing file - what is the consequence
> then? will you end up with... A: The old data, B: Corrupted or zeroed
> data?! This is not made clear in the comment and it would be great if
> we, the BTRFS users would understand what the risk of hitting the write
> hole actually is.
If you do an immediate scrub, any corruption should be detected and
fixed by reconstruction, before there are any device failures. If a
device fails before scrub, it's possible data is corrupt, but last
time I tested this I got EIO with csum mismatches for affected files,
not corrupt data returned to user space. It is worse if metadata is
affected, because nothing can be done if a device has failed and
there's corruption in raid5 metadata.
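The EIO behaviour described above can be modelled in a few lines. This is a
sketch of the principle only: `read_block` is a hypothetical helper, and
`zlib.crc32` is a stand-in for the filesystem's data checksum.

```python
import errno
import zlib


def read_block(payload: bytes, stored_csum: int) -> bytes:
    """Return the payload only if it matches the stored checksum;
    otherwise fail with EIO instead of handing back bad data."""
    if zlib.crc32(payload) != stored_csum:
        raise OSError(errno.EIO, "checksum mismatch")
    return payload


good = b"file contents"
csum = zlib.crc32(good)

print(read_block(good, csum))
try:
    # A bad reconstruction from stale parity would look like this to the reader.
    read_block(b"bad reconstruction", csum)
except OSError as exc:
    print(exc.errno == errno.EIO)
```

The point of the design is that the caller sees an error for the affected
file rather than silently receiving wrong bytes.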
I'm not entirely clear on the COW guarantees between metadata and
data, even in the idealized case where hardware doesn't lie, does what
the file system expects, and all devices complete commits at the same
time. And then when any of those things isn't true, what are the
consequences. It's probably its own separate grid that's needed. But
if someone understood it clearly, someone else could make the
explanation pretty.
> 7. QUOTAS, QGROUPS (status page)
> Again marked as "mostly ok" on the stability. Is there any risk of
> dataloss or irrecoverable failure? If not I think it should be marked as
> stable - The only note seems to be performance related.
Pretty sure all the performance issues are supposed to be fixed by
kernel 5.2 or 5.3. But that probably needs testing to confirm it.
>
> 8. PER SUBVOLUME REDUNDANCY LEVEL:
> What is the state / plan for per subvolume (or object level) redundancy
> levels - is that on the agenda somewhere?
No one has started that work as far as I'm aware.
>
> 9. ADDING EXISTING FILESYSTEM TO THE POOL?:
> Is it somehow, or will it ever be possible to add a existing BTRFS
> filesystem to a pool?
I haven't heard of anything like this, so I suspect no one is working on
it. Btrfs subvolumes are just file trees. A subvolume is not a self contained
file system. All subvolumes share the extent, csum, chunk and dev
trees. So this would need some way to import it. Not sure.
> 10. PURE BTRFS BOOTLOADER?
> This probably belongs somewhere else, but has someone considered the
> very idea of a pure BTRFS bootloader which only supports booting up a
> BTRFS filesystem in a (as failsafe as possible) way. It is a pain to
> ensure that grub is installed on all devices and update as you
> add/remove devices from the pool and a "butterboot"-loader would be
> fantastic
Bootloaders are f'n hard. I don't see the advantage of starting
something from scratch that's this narrow purposed.
Realistically, as ugly as it is, we're better off with every drive
having a large EFI system partition or plain boot volume if BIOS, and
a daemon that keeps them all in sync. And use a simple bootloader like
sd-boot, to locate, load, and execute the kernel and let kernel code
worry about all the complex Btrfs device discovery, and how to handle
degradedness.
By the way, GRUB 2.04 should have Btrfs raid5/6 support. And I'm
guessing it supports degraded operation similar to mdadm raid 5/6,
which GRUB has supported for a long time.
> 12. SPACE CACHE: (Manpage/btrfs(5) page):
> I have been using space cache v2 for a long time. No issues (that I know
> about) yet. That page states that the safe default space cache is v1.
> What is the current recommended default?
v2 has been expected to become the default for a long time now. It'd be useful if someone
could benchmark v2 versus no space cache: run time performance with
various loads, mount time, and memory usage.
> 13. NODATACOW:
> As far as I can remember there was some issues regarding NOCOW
> files/directories on the mailing list a while ago. I can't find any
> issues related to nocow on the wiki (I might not have searched enough)
> but I don't think they are fixed so maybe someone can verify that.
> And by the way ...are NOCOW files still not checksummed? If yes, are
> there plans to add that (it would be especially nice to know if a nocow
> file is correct or not)
I think we're better off optimizing COW and getting rid of nocow. It's
really a workaround for things becoming slow due to massive
fragmentation. There's a bug (or unexpected behavior) where NOCOW
files can become compressed when they are defragmented and the compress
mount option is used. There's a fix that prevents this, I think in 5.2 or
5.3.
--
Chris Murphy