linux-btrfs.vger.kernel.org archive mirror
* BTRFS state on kernel 5.2
@ 2019-09-02 17:21 waxhead
  2019-09-03  0:10 ` Remi Gauvin
  2019-09-03  3:30 ` Chris Murphy
  0 siblings, 2 replies; 5+ messages in thread
From: waxhead @ 2019-09-02 17:21 UTC (permalink / raw)
  To: Linux BTRFS Mailinglist

Being a long-time BTRFS user and frequent reader of the mailing list, I
have some (hopefully practical) questions / requests. Some have perhaps
been asked before, but I think it is about time for an update. So without
further ado... here we go:

1. THE STATUS PAGE:
The status page has not been updated with information for the latest
stable kernel, which is 5.2 as of this writing. Can someone please update it?

2. DEFRAG: (status page)
The status page marks defrag as "mostly ok" for stability and "ok" for
performance. While I understand that extents get unshared, I don't see
how this will affect stability. Performance (as in space efficiency), on
the other hand, is more likely to be affected. Also, it is not (perfectly)
clear what the difference in consequence is between using the autodefrag
mount option and "btrfs filesystem defrag". Can someone please consider
rewriting this?
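
For reference, the two things I am comparing are roughly the following
(device name and mount point are just examples):

  mount -o autodefrag /dev/sdX /mnt      # acts on files as they are written
  btrfs filesystem defragment -r /mnt    # walks existing files on request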

3. SCRUB + RAID56: (status page)
The status page says it is mostly ok for both stability and performance.
It is not stated what the problem with stability is; does this have to
do with the write hole?

4. DEVICE REPLACE: (status page)
This is also marked mostly ok for stability. As I understand it, BTRFS
has no issue recovering from a failed device if it is completely
removed. If the device is still (partly) working, you may get "stuck"
during a replace operation because BTRFS keeps trying to read from the
failed device.
From my point of view it is important to clear this up a bit so
that people will understand that it is not the ability to replace a
device that is "mostly ok", but the "online replace" functionality that
might be problematic (though it will not damage data).
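
If I read the btrfs-replace manpage correctly, the -r option is meant to
help with exactly this case; something like the following, with made-up
device names:

  # -r: avoid reading from the (failing) source device whenever another
  # good copy of the data exists
  btrfs replace start -r /dev/sdb /dev/sdd /mnt
  btrfs replace status /mnt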

5. DEVICE REPLACE: (Using_Btrfs_with_Multiple_Devices page)
It is not clear what to do to recover from a device failure on BTRFS.
If a device is partly working, you can run the replace functionality
and hopefully you're good to go afterwards. OK, fine; if this however
does not work, or you have a completely failed device, it is a different
story. My understanding of it is:
If not enough free space (or devices) is available to restore redundancy,
you first need to add a new device, and then you need to A: first run a
metadata balance (to ensure that the filesystem structures are redundant)
and then B: run a data balance to restore redundancy for your data.
Are there any filters that can be applied to restore only chunks that
are missing a mirror / stripe member?
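
To make the question concrete, the sequence I have in mind looks roughly
like this (device names and the devid are made up; the devid balance
filter is the closest thing I know of to an "only touch chunks on that
device" knob, though I am not sure it helps with a missing device):

  btrfs device add /dev/sdd /mnt
  btrfs balance start -m /mnt        # A: metadata first
  btrfs balance start -d /mnt        # B: then data
  # possibly narrower, if it works against the failed device's id:
  btrfs balance start -mdevid=3 -ddevid=3 /mnt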

6. RAID56 (status page)
RAID56 has had the write hole problem for a long time now, but it
is not well explained what the consequence of it is for data,
especially if you have metadata stored in raid1/10.
If you encounter a power loss / kernel panic during a write, what will
actually happen?
Will a fresh file simply be missing or corrupted (as in partly written)?
If you overwrite/append to an existing file, what is the consequence
then? Will you end up with... A: the old data, or B: corrupted or zeroed
data?! This is not made clear in the comment, and it would be great if
we, the BTRFS users, understood what the risk of hitting the write
hole actually is.

7. QUOTAS, QGROUPS (status page)
Again marked as "mostly ok" for stability. Is there any risk of
data loss or irrecoverable failure? If not, I think it should be marked as
stable; the only note seems to be performance-related.

8. PER SUBVOLUME REDUNDANCY LEVEL:
What is the state / plan for per-subvolume (or object-level) redundancy
levels? Is that on the agenda somewhere? One use case is to flag the
main filesystem as RAID1/10 and another subvolume as RAID5/6. That way
you could be fairly sure that the server comes up, while being prepared
to tolerate some issues (depending on the answer to question #6) on the
subvolume that (for now) is prone to the write hole.

9. ADDING EXISTING FILESYSTEM TO THE POOL?:
Is it somehow, or will it ever be, possible to add an existing BTRFS
filesystem to a pool? It would be a wet dream come true to be able to
add a device containing an existing BTRFS filesystem and have it show
up as a subvolume in the main pool.

10. PURE BTRFS BOOTLOADER?
This probably belongs somewhere else, but has someone considered the
very idea of a pure BTRFS bootloader which only supports booting a
BTRFS filesystem in as failsafe a way as possible? It is a pain to
ensure that GRUB is installed on all devices and updated as you
add/remove devices from the pool, and a "butterboot" loader would be
fantastic.

11. DEDUPLICATION:
Is deduplication planned to be part of the btrfs management tool? E.g.
btrfs filesystem[/subvolume?] deduplicate /mnt
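
As far as I know, out-of-band deduplication today goes through external
tools built on the kernel dedupe ioctl, for example (the path is just an
example):

  duperemove -dr /mnt/subvolume    # -d actually submits the dedupe requests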

12. SPACE CACHE: (Manpage/btrfs(5) page):
I have been using space cache v2 for a long time. No issues (that I know 
about) yet. That page states that the safe default space cache is v1. 
What is the current recommended default?
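
For the record, switching an existing filesystem over seems to be a
one-time mount option (device name is an example):

  mount -o space_cache=v2 /dev/sdX /mnt   # builds the free space tree, persists afterwards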

13. NODATACOW:
As far as I can remember, there were some issues regarding NOCOW
files/directories on the mailing list a while ago. I can't find any
issues related to nocow on the wiki (I might not have searched enough),
but I don't think they are fixed, so maybe someone can verify that.
And by the way... are NOCOW files still not checksummed? If yes, are
there plans to add that? (It would be especially nice to know whether a
nocow file is correct or not.)
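
For completeness, the way I set nocow today is per file/directory with
chattr (paths are examples; the flag only takes effect for empty or newly
created files):

  chattr +C /mnt/vm-images     # new files created here become nodatacow
  lsattr -d /mnt/vm-images     # a 'C' in the output confirms the flag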

14. VIRTUAL BLOCK DEVICE EXPORT
Are there plans to allow BTRFS to export virtual block devices from the 
BTRFS pool? E.g. so it would be possible to run other filesystems on top 
of a "protected" BTRFS layer (much like LVM / mdraid).


* Re: BTRFS state on kernel 5.2
  2019-09-02 17:21 BTRFS state on kernel 5.2 waxhead
@ 2019-09-03  0:10 ` Remi Gauvin
  2019-09-03  1:59   ` Christoph Anton Mitterer
  2019-09-03  3:30 ` Chris Murphy
  1 sibling, 1 reply; 5+ messages in thread
From: Remi Gauvin @ 2019-09-03  0:10 UTC (permalink / raw)
  To: waxhead, Linux BTRFS Mailinglist



On 2019-09-02 1:21 p.m., waxhead wrote:

> 5. DEVICE REPLACE: (Using_Btrfs_with_Multiple_Devices page)
> It is not clear what to do to recover from a device failure on BTRFS.
> If a device is partly working, you can run the replace functionality
> and hopefully you're good to go afterwards. OK, fine; if this however
> does not work, or you have a completely failed device, it is a different
> story. My understanding of it is:
> If not enough free space (or devices) is available to restore redundancy,
> you first need to add a new device, and then you need to A: first run a
> metadata balance (to ensure that the filesystem structures are redundant)
> and then B: run a data balance to restore redundancy for your data.
> Are there any filters that can be applied to restore only chunks that
> are missing a mirror / stripe member?

If you are adding a new device of the same size as, or larger than, the
device you are replacing, do not do balances; you can still just do a
device replace. The only difference is that if the failed device is
missing entirely, you have to specify the device id of the missing
device (rather than a /dev/sd?).
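
Something along these lines, with made-up device names and devid ("btrfs
filesystem show" lists the id of the missing device):

  mount -o degraded /dev/sdb /mnt       # if the pool no longer mounts normally
  btrfs replace start 2 /dev/sdd /mnt   # 2 = devid of the missing device
  btrfs replace status /mnt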


> 
> 6. RAID56 (status page)
> RAID56 has had the write hole problem for a long time now, but it
> is not well explained what the consequence of it is for data,
> especially if you have metadata stored in raid1/10.
> If you encounter a power loss / kernel panic during a write, what will
> actually happen?
> Will a fresh file simply be missing or corrupted (as in partly written)?
> If you overwrite/append to an existing file, what is the consequence
> then? Will you end up with... A: the old data, or B: corrupted or zeroed
> data?! This is not made clear in the comment, and it would be great if
> we, the BTRFS users, understood what the risk of hitting the write
> hole actually is.

The parity data from an interrupted write will be missing/corrupt. This
will in turn affect old data, not just the data you were writing. The
write hole is only of consequence if you are reading the array
degraded (i.e. a drive is failed/missing; and, though unlikely, it would
also be a problem if you just happen to suddenly have a bad sector in
the same range of data as the corrupt parity).

If the corrupted data affects metadata, the consequences can be anything
from minor to a completely unreadable filesystem.

If it affects data blocks, some files will be unreadable, but they can
simply be deleted/restored from backup.

As you noted, metadata can be made raid1, which will at least prevent
complete filesystem meltdown from the write hole. But until the patches
that increase the number of copies in mirrored raid land, there is no
way to make the pool tolerant of 2 device failures, so raid6 is mostly
useless. (Arguably, raid6 data would be much more likely to recover
from an unreadable sector while recovering from a missing device.)

It's also important to understand that, unlike most other (all other?)
raid implementations, BTRFS will not, by itself, fix parity when it
restarts after an unclean shutdown. It's up to the administrator to run
a scrub manually. Otherwise, parity errors will accumulate with each
unclean shutdown and, in turn, result in unrecoverable data if the array
is later degraded.
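
In other words, something like this after every unclean shutdown (mount
point is an example):

  btrfs scrub start /mnt    # re-verifies csums and rewrites bad parity/copies
  btrfs scrub status /mnt   # progress and error counters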

> 
> 13. NODATACOW:
> As far as I can remember, there were some issues regarding NOCOW
> files/directories on the mailing list a while ago. I can't find any
> issues related to nocow on the wiki (I might not have searched enough),
> but I don't think they are fixed, so maybe someone can verify that.
> And by the way... are NOCOW files still not checksummed? If yes, are
> there plans to add that? (It would be especially nice to know whether a
> nocow file is correct or not.)
>

AFAIK, checksums on nocow files are technically not possible, so no plans
exist to even try adding that functionality. If you think about it,
any unclean filesystem stop while nocow data is being written would
result in an inconsistent checksum, so it would be self-defeating.

As for the nocow problems, they have to do with mirrored raid. Without COW
or checksums, BTRFS has no method whatsoever of keeping raid-mirrored
data consistent. In the case of an unclean stop while data is being
written, the two copies will be different, and which copy gets read at
any time is entirely up to the fates. Not only will BTRFS not
synchronize the mirrored copies by itself on the next boot, it won't even
fix them in a scrub.

This behaviour, as you noted, is still undocumented after my little
outburst a few months back.  IMO, it's pretty bad.








* Re: BTRFS state on kernel 5.2
  2019-09-03  0:10 ` Remi Gauvin
@ 2019-09-03  1:59   ` Christoph Anton Mitterer
  0 siblings, 0 replies; 5+ messages in thread
From: Christoph Anton Mitterer @ 2019-09-03  1:59 UTC (permalink / raw)
  To: linux-btrfs

On Mon, 2019-09-02 at 20:10 -0400, Remi Gauvin wrote:
> AFAIK, checksums on nocow files are technically not possible

While this has been claimed numerous times, I still don't see any
reason why it should be true.
I even had an off-the-list conversation with Chris Mason about
just that:

me asking:
>> - nodatacow => no checksumming
>>   Breaks IMO one of the big core features of btrfs (i.e. data is either
>>   valid or one gets an error).
>>   I brought that up 1-2 times on the list, but none of the core
>>   developers ever responded, and just a few list regulars said it
>>   wouldn't be possible, though I don't quite understand why not...
>>   The metadata is still CoWed anyway... and the worst that could
>>   happen is that in case of a crash, the data is actually valid on disk,
>>   but the checksums aren't yet... which is IMO far less likely than the
>>   other cases.

and him replying:
>The reason why is because we need a way to atomically update both the
>data block and the crc.  You're right that after a crash the valid data 
>on disk wouldn't match checksums, which for new file data would be 
>unexpected, but not the end of the world.  Still, XFS caught a lot of 
>heat for this because they would allow new files to be zero filled after 
>a crash.
>
>In our case it would be much worse.  You could have a file that was 10 
>years old, write 4K in the middle, crash, and those 4K would give
>EIOs instead of either the new or old data.


So yes, it's quite clear one could not atomically update checksums and
data at the same time... but so what?
This problem only matters in the case of a crash... and in that case
it's pretty likely that the data is garbage anyway.
Only in the case where the data had been properly written before the
crash, but not the checksum, would valid data be considered invalid.

People would however at least have a chance to notice and recover from
this.


Cheers,
Chris.



* Re: BTRFS state on kernel 5.2
  2019-09-02 17:21 BTRFS state on kernel 5.2 waxhead
  2019-09-03  0:10 ` Remi Gauvin
@ 2019-09-03  3:30 ` Chris Murphy
  2019-09-04  1:12   ` Chris Murphy
  1 sibling, 1 reply; 5+ messages in thread
From: Chris Murphy @ 2019-09-03  3:30 UTC (permalink / raw)
  To: Linux BTRFS Mailinglist

On Mon, Sep 2, 2019 at 11:21 AM waxhead <waxhead@dirtcellar.net> wrote:
> 2. DEFRAG: (status page)
> The status page marks defrag as "mostly ok" for stability and "ok" for
> performance. While I understand that extents get unshared, I don't see
> how this will affect stability. Performance (as in space efficiency), on
> the other hand, is more likely to be affected. Also, it is not (perfectly)
> clear what the difference in consequence is between using the autodefrag
> mount option and "btrfs filesystem defrag". Can someone please consider
> rewriting this?

It needs "OK - see Gotchas" because shared extents becoming unshared
could be hugely problematic if you're not expecting it.

> 3. SCRUB + RAID56: (status page)
> The status page says it is mostly ok for both stability and performance.
> It is not stated what the problem with stability is; does this have to
> do with the write hole?

I think the concerns need to be split out for metadata and data. The main
gotcha is that if there's a crash you need to do a scrub, and there are no
partial scrubs.

In the case of data, at least there's still a warning on a bad
reconstruction (from a corrupt strip), because the data csums won't
match.


> 5. DEVICE REPLACE: (Using_Btrfs_with_Multiple_Devices page)
> It is not clear what to do to recover from a device failure on BTRFS.
> If a device is partly working, you can run the replace functionality
> and hopefully you're good to go afterwards. OK, fine; if this however
> does not work, or you have a completely failed device, it is a different
> story. My understanding of it is:
> If not enough free space (or devices) is available to restore redundancy,
> you first need to add a new device, and then you need to A: first run a
> metadata balance (to ensure that the filesystem structures are redundant)
> and then B: run a data balance to restore redundancy for your data.
> Are there any filters that can be applied to restore only chunks that
> are missing a mirror / stripe member?

It's a bit of an "it depends" answer, in that it hinges on several
variables, and is another reason why a btrfsd service to help do smarter
things that depend on policy decisions would be a very useful future
addition. But sorta what you're getting at is that we're not sure what
the medium- and long-term plans are.


>
> 6. RAID56 (status page)
> RAID56 has had the write hole problem for a long time now, but it
> is not well explained what the consequence of it is for data,
> especially if you have metadata stored in raid1/10.
> If you encounter a power loss / kernel panic during a write, what will
> actually happen?
> Will a fresh file simply be missing or corrupted (as in partly written)?
> If you overwrite/append to an existing file, what is the consequence
> then? Will you end up with... A: the old data, or B: corrupted or zeroed
> data?! This is not made clear in the comment, and it would be great if
> we, the BTRFS users, understood what the risk of hitting the write
> hole actually is.

If you do an immediate scrub, any corruption should be detected and
fixed by reconstruction before there are any device failures. If a
device fails before a scrub, it's possible data is corrupt, but the last
time I tested this I got EIO with csum mismatches for the affected files,
not corrupt data returned to user space. Worse is if metadata is
affected, because nothing can be done if a device has failed and
there's corruption in raid5 metadata.

I'm not entirely clear on the COW guarantees between metadata and
data, even in the idealized case where hardware doesn't lie, does what
the file system expects, and all devices complete commits at the same
time. And then, when any of those things isn't true, what are the
consequences? It probably needs its own separate grid. But if someone
understood it clearly, someone else could make the explanation pretty.


> 7. QUOTAS, QGROUPS (status page)
> Again marked as "mostly ok" for stability. Is there any risk of
> data loss or irrecoverable failure? If not, I think it should be marked as
> stable; the only note seems to be performance-related.

Pretty sure all the performance issues are supposed to be fixed by
kernel 5.2 or 5.3. But that probably needs testing to confirm it.
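
For anyone who wants to run such a test, a minimal way to put qgroups
through their paces on a 5.2/5.3 box might be (mount point is an example):

  btrfs quota enable /mnt
  btrfs quota rescan -w /mnt    # wait for the initial rescan to finish
  btrfs qgroup show /mnt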

>
> 8. PER SUBVOLUME REDUNDANCY LEVEL:
> What is the state / plan for per-subvolume (or object-level) redundancy
> levels? Is that on the agenda somewhere?

No one has started that work as far as I'm aware.

>
> 9. ADDING EXISTING FILESYSTEM TO THE POOL?:
> Is it somehow, or will it ever be, possible to add an existing BTRFS
> filesystem to a pool?

I haven't heard anything like this, so I suspect no one is working on
it. Btrfs subvolumes are just file trees; a subvolume is not a
self-contained file system. All subvolumes share the extent, csum, chunk
and dev trees. So this would need some way to import it. Not sure.
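
The closest thing today is probably copying the data over with
send/receive, which of course is not the same as importing in place
(paths are examples):

  btrfs subvolume snapshot -r /old/data /old/data-ro
  btrfs send /old/data-ro | btrfs receive /pool/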

> 10. PURE BTRFS BOOTLOADER?
> This probably belongs somewhere else, but has someone considered the
> very idea of a pure BTRFS bootloader which only supports booting a
> BTRFS filesystem in as failsafe a way as possible? It is a pain to
> ensure that GRUB is installed on all devices and updated as you
> add/remove devices from the pool, and a "butterboot" loader would be
> fantastic.

Bootloaders are f'n hard. I don't see the advantage of starting
something from scratch that's this narrowly purposed.

Realistically, as ugly as it is, we're better off with every drive
having a large EFI system partition (or plain boot volume if BIOS) and
a daemon that keeps them all in sync. And use a simple bootloader like
sd-boot to locate, load, and execute the kernel, and let kernel code
worry about all the complex Btrfs device discovery and how to handle
degradedness.
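
A crude sketch of the "keep them all in sync" part, here just a periodic
copy with made-up mount points:

  # mirror the primary ESP to the ESPs on the other drives
  for esp in /boot/efi2 /boot/efi3; do
      rsync -a --delete /boot/efi/ "$esp"/
  done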

By the way, GRUB 2.04 should have Btrfs raid5/6 support. And I'm
guessing it supports degraded operation similarly to mdadm raid5/6,
which GRUB has supported for a long time.

> 12. SPACE CACHE: (Manpage/btrfs(5) page):
> I have been using space cache v2 for a long time. No issues (that I know
> about) yet. That page states that the safe default space cache is v1.
> What is the current recommended default?

v2 has been expected to become the default for a long time now. It'd be
useful if someone could benchmark v2 versus no space cache: run-time
performance with various loads, mount time, and memory usage.
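
For such a benchmark, the two configurations would be mounted roughly
like this (device name is an example):

  mount -o space_cache=v2 /dev/sdX /mnt   # free space tree
  mount -o nospace_cache /dev/sdX /mnt    # no cache at all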

> 13. NODATACOW:
> As far as I can remember, there were some issues regarding NOCOW
> files/directories on the mailing list a while ago. I can't find any
> issues related to nocow on the wiki (I might not have searched enough),
> but I don't think they are fixed, so maybe someone can verify that.
> And by the way... are NOCOW files still not checksummed? If yes, are
> there plans to add that? (It would be especially nice to know whether a
> nocow file is correct or not.)

I think we're better off optimizing COW and getting rid of nocow. It's
really a workaround for things becoming slow due to massive
fragmentation. There's a bug (or unexpected behavior) where NOCOW
files can become compressed when defragmented while the compress mount
option is used. There's a fix that prevents this, I think in 5.2 or
5.3.


-- 
Chris Murphy


* Re: BTRFS state on kernel 5.2
  2019-09-03  3:30 ` Chris Murphy
@ 2019-09-04  1:12   ` Chris Murphy
  0 siblings, 0 replies; 5+ messages in thread
From: Chris Murphy @ 2019-09-04  1:12 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux BTRFS Mailinglist

The raid56 status grid says Unstable, but the body text below it says
"Feature marked as mostly OK for now."

Both might be true. The feature, with this version of the feature flag,
might be mostly OK, write hole and all. But if an intent log will be
needed to fix it, and if that implies a new feature flag, it could mean
it's unstable in the sense that a future incompatible change is expected.


--
Chris Murphy

