* Status of RAID5/6
@ 2018-03-21 16:50 Menion
  2018-03-21 17:24 ` Liu Bo
  0 siblings, 1 reply; 32+ messages in thread
From: Menion @ 2018-03-21 16:50 UTC (permalink / raw)
  To: linux-btrfs

Hi all
I am trying to understand the status of RAID5/6 in BTRFS.
I know that there is some discussion ongoing about the RFC patch
proposed by Liu Bo, but it seems that everything stopped last summer.
It also mentioned a "separate disk for journal"; does that mean the
final implementation of RAID5/6 will require a dedicated HDD for
journaling?
Bye

* Re: Status of RAID5/6
  2018-03-21 16:50 Status of RAID5/6 Menion
@ 2018-03-21 17:24 ` Liu Bo
  2018-03-21 20:02   ` Christoph Anton Mitterer
                     ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Liu Bo @ 2018-03-21 17:24 UTC (permalink / raw)
  To: Menion; +Cc: linux-btrfs

On Wed, Mar 21, 2018 at 9:50 AM, Menion <menion@gmail.com> wrote:
> Hi all
> I am trying to understand the status of RAID5/6 in BTRFS.
> I know that there is some discussion ongoing about the RFC patch
> proposed by Liu Bo, but it seems that everything stopped last summer.
> It also mentioned a "separate disk for journal"; does that mean the
> final implementation of RAID5/6 will require a dedicated HDD for
> journaling?

Thanks for the interest in btrfs and raid56.

The patch set is meant to plug the write hole, which is very rare in
practice, to be honest.  The feedback was to use existing space instead
of a dedicated "fast device" as the journal, in order to get some degree
of raid protection.  I'd need some time to pick it up again.

That being said, we have several data reconstruction fixes for raid56
(especially raid6) in 4.15, so please deploy btrfs with an upstream
kernel or a distro that updates its kernel frequently.  The most
important one is

8810f7517a3b Btrfs: make raid6 rebuild retry more
https://patchwork.kernel.org/patch/10091755/

AFAIK, no other data corruption has shown up.

thanks,
liubo

* Re: Status of RAID5/6
  2018-03-21 17:24 ` Liu Bo
@ 2018-03-21 20:02   ` Christoph Anton Mitterer
  2018-03-22 12:01     ` Austin S. Hemmelgarn
  2018-03-29 21:50     ` Zygo Blaxell
  2018-03-21 20:27   ` Menion
  2018-03-22 21:13   ` waxhead
  2 siblings, 2 replies; 32+ messages in thread
From: Christoph Anton Mitterer @ 2018-03-21 20:02 UTC (permalink / raw)
  To: linux-btrfs

Hey.

Some things would IMO be nice to get done/clarified (i.e. documented in
the Wiki and manpages) from a user's/admin's POV:

Some basic questions:
- Starting with which kernels (including stable kernel versions) are
the fixes for the bigger issues from some time ago included?

- Exactly what does not work yet (only the write hole?)?
  What's the roadmap for such non-working things?

- Ideally some explicit confirmations of what's considered to work,
  like:
  - compression+raid?
  - rebuild / replace of devices?
  - changing raid lvls?
  - repairing data (i.e. picking the right block according to csums in
    case of silent data corruption)?
  - scrub (and scrub+repair)?
  - anything to consider with raid when doing snapshots, send/receive
    or defrag?
  => and for each of these: for which raid levels?

  Perhaps also confirmation for previous issues:
  - I vaguely remember there were issues with either device delete or
    replace.... and that one of them was possibly super-slow?
  - I also remember there were cases in which a fs could end up in
    permanent read-only state?


- Clarifying questions on what is expected to work and how things are
  expected to behave, e.g.:
  - Can one pull a device (without deleting/removing it first) during
    operation, and will btrfs survive it?
  - If an error is found (e.g. silent data corruption based on csums),
    when will it repair&fix (fix = write the repaired data) the data?
    On the read that finds the bad data?
    Only on scrub (i.e. do users need to regularly run scrubs)? 
  - What happens if error cannot be repaired, e.g. no csum information
    or all blocks bad?
    EIO? Or are there cases where it gives no EIO (I guess at least in
    nodatacow case)
  - What happens if data cannot be fixed (i.e. trying to write the
    repaired block again fails)?
    And if the repaired block is written, will it be immediately
    checked again (to find cases of blocks that give different results
    again)?
  - Will a scrub check only the data on "one" device... or will it
    check all the copies (or parity blocks) on all devices in the raid?
  - Does a fsck check all devices or just one?
  - Does a balance implicitly contain a scrub?
  - If a rebuild/repair/reshape is performed... can these be
    interrupted? What if they are forcibly interrupted (power loss)?


- Explaining common workflows:
  - Replacing a faulty or simply an old disk.
    How to stop btrfs from using a device (without bricking the fs)?
    How to do the rebuild.
  - Best practices, like: should one do regular balances (and if so, as
    asked above, do these include the scrubs, so basically: is it
    enough to do one of them)
  - How to grow/shrink raid btrfs... and if this is done... how to
    replicate the data already on the fs to the newly added disks (or
    is this done automatically - and if so, how to see that it's
    finished)?
  - What will actually trigger repairs? (i.e. one wants to get silent
    block errors fixed ASAP and not only when the data is read - and
    when it's possibly too late)
  - In the rebuild/repair phase (e.g. one replaces a device): Can one
    somehow give priority to the rebuild/repair? (e.g. in case of a
    degraded raid, one may want to get that solved ASAP and rather slow
    down other reads or stop them completely.)
  - Is there anything to notice when btrfs raid is placed above dm-
    crypt from a security PoV?
    With MD raid that wasn't much of a problem as it's typically placed
    below dm-crypt... but btrfs raid would need to be placed above it.
    So maybe there are some known attacks against crypto modes, if
    equal (RAID 1 / 10) or similar/equal (RAID 5/6) data is written
    above multiple crypto devices? (Probably something one would need
    to ask their experts).


- Maintenance tools
  - How to get the status of the RAID? (Querying kernel logs is IMO
    rather a bad way for this)
    This includes:
    - Is the raid degraded or not?
    - Are scrubs/repairs/rebuilds/reshapes in progress and how far are
      they? (Reshape would be: if the raid level is changed or the raid
      grown/shrunk: has all data been replicated enough to be
      "complete" for the desired raid lvl/number of devices/size?)
   - What should one regularly do? scrubs? balance? How often?
     Do we get any automatic (but configurable) tools for this?
   - There should be support in commonly used tools, e.g. Icinga/Nagios
     check_raid
   - Ideally there should also be some desktop notification tool, which
     tells about raid (and btrfs errors in general) as small
     installations with raids typically run no Icinga/Nagios but rely
     on e.g. email or gui notifications.

I think especially for such tools it's important that these are
maintained by upstream (and yes, I know you guys are rather fs
developers, not tool developers)... but since these tools are so vital,
having them done by a 3rd party can easily lead to the situation where
something changes in btrfs, the tools don't notice and errors remain
undetected.


- Future?
  What about things like hot-spare support? E.g. a good userland tool
  could be configured so that one disk is a hot spare... and if there's
  a failure it could automatically power it up and replace the faulty
  drive with it.
  It could go further: not only would completely failed devices be
  replaced, but a replace could also be triggered if a configurable
  number of csum / read / write / etc. errors is found.
  Maybe such a tool could even look at SMART data and proactively
  replace disks.

  What about features that were "announced/suggested/etc." earlier?
  E.g. n-parity-raid ... or n-way-mirrored-raid?


- Real world test?
  Is there already any bigger user of current btrfs raid5/6? I.e. where
  hundreds of raids, devices, etc. are massively used? Where many
  devices failed (because of age) or were pulled, etc. (all the
  typical things that happen in computing centres)?
  So that one could get a feeling whether it's actually stable.


Cheers,
Chris.

* Re: Status of RAID5/6
  2018-03-21 17:24 ` Liu Bo
  2018-03-21 20:02   ` Christoph Anton Mitterer
@ 2018-03-21 20:27   ` Menion
  2018-03-22 21:13   ` waxhead
  2 siblings, 0 replies; 32+ messages in thread
From: Menion @ 2018-03-21 20:27 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

I am on 4.15.5 :)
Yes, I agree that journaling is better on the same array; it should
still be tolerant of a unit failure, so maybe it should go in a RAID1
scheme.
Will a raid56 array built with an older kernel be compatible with the
new forthcoming code?
Bye

2018-03-21 18:24 GMT+01:00 Liu Bo <obuil.liubo@gmail.com>:
> On Wed, Mar 21, 2018 at 9:50 AM, Menion <menion@gmail.com> wrote:
>> Hi all
>> I am trying to understand the status of RAID5/6 in BTRFS.
>> I know that there is some discussion ongoing about the RFC patch
>> proposed by Liu Bo, but it seems that everything stopped last summer.
>> It also mentioned a "separate disk for journal"; does that mean the
>> final implementation of RAID5/6 will require a dedicated HDD for
>> journaling?
>
> Thanks for the interest in btrfs and raid56.
>
> The patch set is meant to plug the write hole, which is very rare in
> practice, to be honest.  The feedback was to use existing space
> instead of a dedicated "fast device" as the journal, in order to get
> some degree of raid protection.  I'd need some time to pick it up
> again.
>
> That being said, we have several data reconstruction fixes for raid56
> (especially raid6) in 4.15, so please deploy btrfs with an upstream
> kernel or a distro that updates its kernel frequently.  The most
> important one is
>
> 8810f7517a3b Btrfs: make raid6 rebuild retry more
> https://patchwork.kernel.org/patch/10091755/
>
> AFAIK, no other data corruption has shown up.
>
> thanks,
> liubo

* Re: Status of RAID5/6
  2018-03-21 20:02   ` Christoph Anton Mitterer
@ 2018-03-22 12:01     ` Austin S. Hemmelgarn
  2018-03-29 21:50     ` Zygo Blaxell
  1 sibling, 0 replies; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2018-03-22 12:01 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 2018-03-21 16:02, Christoph Anton Mitterer wrote:
On the note of maintenance specifically:
> - Maintenance tools
>    - How to get the status of the RAID? (Querying kernel logs is IMO
>      rather a bad way for this)
>      This includes:
>      - Is the raid degraded or not?
Check for the 'degraded' flag in the mount options.  Assuming you're 
doing things sensibly and not specifying it on mount, it gets added when 
the array goes degraded.
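
For example, a minimal check that could run from cron (the mount point
and the mail notification are just placeholders, not a recommendation):

    if findmnt -no OPTIONS /mnt/array | grep -qw degraded; then
        echo "btrfs filesystem at /mnt/array is mounted degraded" \
            | mail -s "btrfs degraded" root
    fi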

>      - Are scrubs/repairs/rebuilds/reshapes in progress and how far are
>        they? (Reshape would be: if the raid level is changed or the raid
>        grown/shrunk: has all data been replicated enough to be
>        "complete" for the desired raid lvl/number of devices/size?)
A bit trickier, but still not hard, just check the output of `btrfs
scrub status`, `btrfs balance status`, and `btrfs replace status` for
the volume.  It won't check automatic spot-repairs (that is, repairing
individual blocks that fail checksums), but most people really don't care.
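
Rough sketch of such a check, assuming the filesystem is mounted at
/mnt/array:

    for sub in scrub balance replace; do
        echo "== btrfs $sub status =="
        btrfs "$sub" status /mnt/array
    done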

>     - What should one regularly do? scrubs? balance? How often?
>       Do we get any automatic (but configurable) tools for this?
There aren't any such tools that I know of currently.  storaged might
have some, but I've never really looked at it so I can't comment (I'm
kind of averse to having hundreds of background services running to do
stuff that can just as easily be done in a polling manner from cron
without compromising their utility).  Right now though, it's _trivial_
to automate things with cron, or systemd timers, or even third-party
tools like monit (which has the bonus that if the maintenance fails, you
get an e-mail about it).
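
A minimal cron sketch of that idea (the schedule, path and filter value
are only examples, not recommendations):

    # /etc/cron.d/btrfs-maintenance
    # monthly scrub, weekly filtered data balance of /mnt/array
    0 3 1 * *  root  /usr/bin/btrfs scrub start -Bd /mnt/array
    0 4 * * 0  root  /usr/bin/btrfs balance start -dusage=50 /mnt/array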

>     - There should be support in commonly used tools, e.g. Icinga/Nagios
>       check_raid
Agreed.  I think there might already be a Nagios plugin for the basic 
checks, not sure about anything else though.

Netdata has had basic monitoring support for a while now, but it only 
looks at allocations, not error counters, so while it will help catch 
impending ENOSPC issues, it can't really help much with data corruption 
issues.

>     - Ideally there should also be some desktop notification tool, which
>       tells about raid (and btrfs errors in general) as small
>       installations with raids typically run no Icinga/Nagios but rely
>       on e.g. email or gui notifications.
Desktop notifications would be nice, but are out of scope for the main
btrfs-progs.  Not even LVM, MDADM, or ZFS ship desktop notification
support from upstream.  You don't need Icinga or Nagios for monitoring
either.  Netdata works pretty well for covering the allocation checks
(and I'm planning to have something soon), and it's trivial to set up
e-mail notifications with cron or systemd timers or even tools like monit.

On the note of generic monitoring though, I've been working on a Python 
3 script (with no dependencies beyond the Python standard library) to do 
the same checks that Netdata does regarding allocations, as well as 
checking device error counters and mount options that should be 
reasonable as a simple warning tool run from cron or a systemd timer. 
I'm hoping to get it included in the upstream btrfs-progs, but I don't 
have it in a state yet that it's ready to be posted (the checks are 
working, but I'm still having issues reliably mapping between mount 
points and filesystem UUIDs).
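
Until that's ready, a rough shell approximation of two of those checks
(the mount point is a placeholder) could be:

    # flag any non-zero per-device error counter
    btrfs device stats /mnt/array \
        | awk '$NF != 0 { bad = 1; print } END { exit bad }' \
        || echo "non-zero btrfs error counters on /mnt/array"
    # eyeball allocation (unallocated space, metadata headroom)
    btrfs filesystem usage /mnt/array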

> I think especially for such tools it's important that these are
> maintained by upstream (and yes, I know you guys are rather fs
> developers, not tool developers)... but since these tools are so vital,
> having them done by a 3rd party can easily lead to the situation where
> something changes in btrfs, the tools don't notice and errors remain
> undetected.
It depends on what they look at.  All the stuff under /sys/fs/btrfs 
should never change (new things might get added, but none of the old 
stuff is likely to ever change because /sys is classified as part of the 
userspace ABI, and any changes would get shot down by Linus), so 
anything that just uses those will likely have no issues (Netdata falls 
into this category for example).  Same goes for anything using ioctls 
directly, as those are also userspace ABI.

* Re: Status of RAID5/6
  2018-03-21 17:24 ` Liu Bo
  2018-03-21 20:02   ` Christoph Anton Mitterer
  2018-03-21 20:27   ` Menion
@ 2018-03-22 21:13   ` waxhead
  2 siblings, 0 replies; 32+ messages in thread
From: waxhead @ 2018-03-22 21:13 UTC (permalink / raw)
  To: Liu Bo, Menion; +Cc: linux-btrfs

Liu Bo wrote:
> On Wed, Mar 21, 2018 at 9:50 AM, Menion <menion@gmail.com> wrote:
>> Hi all
>> I am trying to understand the status of RAID5/6 in BTRFS.
>> I know that there is some discussion ongoing about the RFC patch
>> proposed by Liu Bo, but it seems that everything stopped last summer.
>> It also mentioned a "separate disk for journal"; does that mean the
>> final implementation of RAID5/6 will require a dedicated HDD for
>> journaling?
> 
> Thanks for the interest in btrfs and raid56.
> 
> The patch set is meant to plug the write hole, which is very rare in
> practice, to be honest.  The feedback was to use existing space
> instead of a dedicated "fast device" as the journal, in order to get
> some degree of raid protection.  I'd need some time to pick it up
> again.
> 
> That being said, we have several data reconstruction fixes for raid56
> (especially raid6) in 4.15, so please deploy btrfs with an upstream
> kernel or a distro that updates its kernel frequently.  The most
> important one is
> 
> 8810f7517a3b Btrfs: make raid6 rebuild retry more
> https://patchwork.kernel.org/patch/10091755/
> 
> AFAIK, no other data corruption has shown up.
> 
I am very interested in the "raid"5/6-like behavior myself.  Calling it
RAID in the past may have had its benefits, but these days continuing
to use the RAID term is not helping.  Even technically minded people
seem to get confused.

For example: it was suggested that "raid"5/6 should have hot-spare
support.  In BTRFS terms a hot-spare device sounds wrong to me, but
reserving extra space for a "hot-space" so any "raid"5/6-like system can
(auto?) rebalance the missing blocks to the rest of the pool sounds
sensible enough (as long as the number of devices allows separating the
different bits and pieces).

Anyway, I got carried away a bit there.  Sorry about that.
What I really wanted to comment on is the usability of "raid"5/6:
how would a metadata "raid"1 + data "raid"5 or 6 setup really compare
to, say, mdraid 5 or 6 from a reliability point of view?

Sure, mdraid has the advantage, but even with the write hole and the risk
of corruption of data (not the filesystem), would BTRFS not in "theory"
be safer than at least mdraid 5 if run with metadata "raid"5?!
You have to run scrub on both mdraid and BTRFS to ensure data is
not corrupted.

PS! It might be worth mentioning that I am slightly affected by a 
Glenfarclas 105 Whisky while writing this, so please bear with me in case
something is too far off :)

* Re: Status of RAID5/6
  2018-03-21 20:02   ` Christoph Anton Mitterer
  2018-03-22 12:01     ` Austin S. Hemmelgarn
@ 2018-03-29 21:50     ` Zygo Blaxell
  2018-03-30  7:21       ` Menion
  2018-03-30 16:14       ` Goffredo Baroncelli
  1 sibling, 2 replies; 32+ messages in thread
From: Zygo Blaxell @ 2018-03-29 21:50 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
> Hey.
> 
> Some things would IMO be nice to get done/clarified (i.e. documented in
> the Wiki and manpages) from users'/admin's  POV:
> 
> Some basic questions:

I can answer some easy ones:

>   - compression+raid?

There is no interaction between compression and raid.  They happen on
different data trees at different levels of the stack.  So if the raid
works, compression does too.

>   - rebuild / replace of devices?

"replace" needs raid-level-specific support.  If the raid level doesn't
support replace, then users have to do device add followed by device
delete, which is considerably (orders of magnitude) slower.

>   - changing raid lvls?

btrfs uses a brute-force RAID conversion algorithm which always works, but
takes zero short cuts.  e.g. there is no speed optimization implemented
for cases like "convert 2-disk raid1 to 1-disk single" which can be
very fast in theory.  The worst-case running time is the only running
time available in btrfs.

Also, users have to understand how the different raid allocators work
to understand their behavior in specific situations.  Without this
understanding, the set of restrictions that pop up in practice can seem
capricious and arbitrary.  e.g. after adding 1 disk to a nearly-full
raid1, full balance is required to make the new space available, but
adding 2 disks makes all the free space available immediately.

Generally it always works if you repeatedly run full-balances in a loop
until you stop running out of space, but again, this is the worst case.
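
A sketch of that loop (the mount point is a placeholder, and a real
script would want to distinguish ENOSPC from other failures):

    until btrfs balance start --full-balance /mnt/array; do
        # --full-balance just skips the warning delay newer btrfs-progs
        # add for unfiltered balances
        echo "balance failed (probably ENOSPC), retrying..."
        sleep 10
    done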

>   - anything to consider with raid when doing snapshots, send/receive
>     or defrag?

Snapshot deletes cannot run at the same time as RAID convert/device
delete/device shrink/resize.  If one is started while the other is
running, it will be blocked until the other finishes.  Internally these
operations block each other on a mutex.

I don't know if snapshot deletes interact with device replace (the case
has never come up for me).  I wouldn't expect it to as device replace
is more similar to scrub than balance, and scrub has no such interaction.

Also note you can only run one balance, device shrink, or device delete
at a time.  If you start one of these three operations while another is
already running, the new request is rejected immediately.

As far as I know there are no other restrictions.

>   => and for each of these: for which raid levels?

Most of those features don't interact with anything specific to a raid
layer, so they work on all raid levels.

Device replace is the exception: all RAID levels in use on the filesystem
must support it, or the user must use device add and device delete instead.

[Aside:  I don't know if any RAID levels that do not support device
replace still exist, which makes my answer longer than it otherwise
would be]

>   Perhaps also confirmation for previous issues:
>   - I vaguely remember there were issues with either device delete or
>     replace.... and that one of them was possibly super-slow?

Device replace is faster than device delete.  Replace does not modify
any metadata, while delete rewrites all the metadata referring to the
removed device.

Delete can be orders of magnitude slower than expected because of the
metadata modifications required.

>   - I also remember there were cases in which a fs could end up in
>     permanent read-only state?

Any unrecovered metadata error 1 bit or larger will do that.  RAID level
is relevant only in terms of how well it can recover corrupted or
unreadable metadata blocks.

> - Clarifying questions on what is expected to work and how things are
>   expected to behave, e.g.:
>   - Can one pull a device (without deleting/removing it first) during
>     operation, and will btrfs survive it?

On raid1 and raid10, yes.  On raid5/6 you will be at risk of write hole
problems if the filesystem is modified while the device is unplugged.

If the device is later reconnected, you should immediately scrub to
bring the metadata on the devices back in sync.  Data written to the
filesystem while the device was offline will be corrected if the csum
is different on the removed device.  If there is no csum, the data will be
silently corrupted.  If the csum is correct, but the data is not (this
occurs with 2^-32 probability on random data where the CRC happens to
be identical) then the data will be silently corrupted.

A full replace of the removed device would be better than a scrub,
as that will get a known good copy of the data.

If the device is offline for a long time, it should be wiped before being
reintroduced to the rest of the array to avoid data integrity issues.

It may be necessary to specify a different device name when mounting
a filesystem that has had a disk removed and later reinserted until
the scrub or replace action above is completed.
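
One possible recovery sequence after the disk comes back, assuming the
filesystem is mounted at /mnt/array (all device names are placeholders):

    # resync csums/parity across all devices
    btrfs scrub start -Bd /mnt/array
    # or, better, the full-replace approach described above, onto a wiped disk:
    #   wipefs -a /dev/sdX
    #   btrfs replace start /dev/sd_stale /dev/sdX /mnt/array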

btrfs has no optimization like mdadm write-intent bitmaps; recovery
is always a full-device operation.  In theory btrfs could track
modifications at the chunk level but this isn't even specified in the
on-disk format, much less implemented.

>   - If an error is found (e.g. silent data corruption based on csums),
>     when will it repair&fix (fix = write the repaired data) the data?
>     On the read that finds the bad data?
>     Only on scrub (i.e. do users need to regularly run scrubs)? 

Both cases.  All RAID levels with redundancy are supposed to support it.

I'm not sure if current raid5/6 do (but I think they do as of 4.15?).

>   - What happens if error cannot be repaired, e.g. no csum information
>     or all blocks bad?
>     EIO? Or are there cases where it gives no EIO (I guess at least in
>     nodatacow case)

If the operation can be completed with redundant devices, then btrfs
continues without reporting the error to userspace.  Device statistics
counters are incremented and kernel log messages are emitted.

If the operation cannot be completed (no redundancy or all redundant
disks also fail), then the following happens:

If there's no csum, the data is not checked, so it can be corrupted
without detection.  If there's a read error it's EIO regardless of csums.

If the unreadable block is data, userspace gets EIO.  If it's metadata,
userspace gets EIO on reads, and the FS goes read-only on writes.

Note that metadata writes usually imply dependent metadata reads (e.g.
to find free space to perform a write), so a metadata read error can
also make the filesystem go read-only if it occurs during a userspace
write operation.

>     And if the repaired block is written, will it be immediately
>     checked again (to find cases of blocks that give different results
>     again)?

No, just one write.  The write error is not reported to userspace in the
repair case (only to kernel log and device stats counters).  The repair
write would be expected to fail in some cases (e.g. total disk failure)
and the original read/write operation can continue with the other device.

>   - Will a scrub check only the data on "one" device... or will it
>     check all the copies (or parity blocks) on all devices in the raid?

Scrub checks all devices if you give it a mount point.  If you give
scrub a device it checks only that device.
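
For example (paths are placeholders):

    btrfs scrub start -Bd /mnt/array    # scrubs every device in the filesystem
    btrfs scrub start -B /dev/sdc       # scrubs only this one device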

>   - Does a balance implicitly contain a scrub?

No, balance and scrub are separate tools with separate purposes.

Balance would read only enough drives to be able to read all data, and
also writes all blocks and does metadata updates.  This makes it orders
of magnitude slower than a scrub, and also puts heavy write and seek
stress on the devices.  Balance also aborts on the first unrecoverable
read error.

Scrub reads data from every drive and doesn't write anything except as
required to repair.  Scrub continues until all data is processed and gives
statistics on failure counts.  Scrub runs at close to hardware speeds
because it reads data sequentially and writes minimally.  Scrub is also
well-behaved wrt ionice.

Balance may be equivalent to "resilvering" except that balance moves
data around the disk while resilvering just overwrites the data in the
original location.

>   - If a rebuild/repair/reshape is performed... can these be
>     interrupted? What if they are forcibly interrupted (power loss)?

Device delete and device shrink can only be interrupted by rebooting.
They do not restart on reboot, and the filesystem size will revert to
its original value on reboot.  If the operation is restarted manually,
it will not have to repeat data relocation work that was already done.

RAID conversion by balance will resume automatically on boot unless
skip_balance mount option is used.

If the balance is not resumed, or it is cancelled part way through a RAID
conversion, the RAID profile used to write new data will be one of the
existing RAID profiles on the disk chosen at random.  e.g. if you convert
data from raid5 to raid0 and metadata from raid1 to dup, and cancel the
balance part way through, future data will be either raid5 or raid0,
and future metadata will be either raid1 or dup.  If the conversion is
completed (i.e. there is only one RAID level present on the filesystem)
then only one profile is used.

Device replace is aborted by reboots and you have to start over (I
think...I've never interrupted one myself, so I'm not sure on this point).

All of these will make the next mount take some extra minutes to complete
if they were interrupted by a reboot.  Exception:  'balance' can be
'paused' before reboot, and does not trigger the extra delay on the
next mount.

>   - Best practices, like: should one do regular balances (and if so, as
>     asked above, do these include the scrubs, so basically: is it
>     enough to do one of them)

Both need to be done on different schedules.

Balance needs to be done when unallocated space gets low on the minimum
number of disks for the raid profile (e.g. for raid6, you need unallocated
space on at least 3 disks).  Once unallocated space is available (at least
1GB per disk, possibly more if the filesystem is very active), the balance
can be cancelled, or a script can loop with a small value of 'limit'
and simply stop looping when unallocated space is available.
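
A hedged sketch of such a loop (the mount point, the 1 GiB threshold and
the limit value are arbitrary examples, and only data is balanced):

    # relocate a few data chunks at a time until unallocated space reappears
    while [ "$(btrfs filesystem usage -b /mnt/array \
                | awk '/Device unallocated:/ { print $3; exit }')" -lt 1073741824 ]; do
        btrfs balance start -dlimit=5 /mnt/array
    done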

Normally it is not necessary to balance metadata, only data.  Any space
that becomes allocated to metadata should be left alone and not reclaimed.
If the filesystem runs out of metadata space _and_ there is no unallocated
space available, the filesystem will become read-only.

If you are using the 'ssd_spread' option, and you don't have a very good
reason why, stop using the ssd_spread option.  If you do have a good
reason, you'll need to run balances much more often, and possibly balance
metadata as well as data.

Unallocated space is not free space.  Free space is space in the
filesystem you can write data to.  Unallocated space is space on a disk
you can make RAID block groups out of.

Scrub needs to be done after every unclean shutdown and also at periodic
intervals to detect latent faults.  The exact schedule depends on the
fault detection latency required (once a month is a good start, once a
week is paranoid, once a day is overkill).

>   - How to grow/shrink raid btrfs... and if this is done... how to
>     replicate the data already on the fs to the newly added disks (or
>     is this done automatically - and if so, how to see that it's
>     finished)?

btrfs dev add/del to add or remove entire devices.

btrfs fi resize grows or shrinks individual devices ('device delete'
is really 'resize <dev>:0' followed by 'remove empty device' internally).
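
For example (all names and sizes are placeholders):

    btrfs device add /dev/sde /mnt/array        # grow by adding a whole device
    btrfs device delete /dev/sdb /mnt/array     # shrink by removing one
    btrfs filesystem resize 1:max /mnt/array    # grow devid 1 to fill its disk
    # a full (or converting) balance spreads existing data over newly added devices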

I generally run resize with small negative increments in a loop until
the device I want to delete has only a few GB of data left, then run
delete, rather than running delete on a full device.  This presents more
opportunities to abort without rebooting.
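
Roughly (devid 2, the step size and the device/mount names are all just
examples):

    # shrink devid 2 in small steps; the loop exits when resize starts failing
    while btrfs filesystem resize 2:-50G /mnt/array; do
        btrfs device usage /mnt/array      # watch how much is left on devid 2
    done
    # then remove it for real (assuming /dev/sdc is devid 2)
    btrfs device delete /dev/sdc /mnt/array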

btrfs balance 'convert' option changes RAID levels.

btrfs fi usage and btrfs dev usage will indirectly report on the progress
of deletes and resizes (in that they will show the amount of space still
occupied on the deleted disks).  They report how much data is stored at
each RAID level so they effectively report on the progress of RAID
level conversions too.

btrfs balance status will report on the progress of raid conversions.

>   - What will actually trigger repairs? (i.e. one wants to get silent
>     block errors fixed ASAP and not only when the data is read - and
>     when it's possibly too late)

Reading bad blocks triggers repairs.  Scrub is an efficient way to read
all the blocks on all devices in the filesystem.

>   - Is there anything to notice when btrfs raid is placed above dm-
>     crypt from a security PoV?
>     With MD raid that wasn't much of a problem as it's typically placed
>     below dm-crypt... but btrfs raid would need to be placed above it.
>     So maybe there are some known attacks against crypto modes, if
>     equal (RAID 1 / 10) or similar/equal (RAID 5/6) data is written
>     above multiple crypto devices? (Probably something one would need
>     to ask their experts).

It's probably OK (i.e. no more or less vulnerable than a single dm-crypt
filesystem) to set up N dm-crypt devices with the same passphrase but
different LUKS master keys, i.e. run luksFormat N times, then run btrfs
raid on top of that.

Setting up one dm-crypt device and replicating its header (so the master
key is the same) is probably vulnerable to attacks that a single-disk
filesystem is not.
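
A sketch of the first setup, with placeholder device names and an
arbitrary choice of profiles:

    # independent LUKS headers (and therefore master keys) per member disk
    for dev in /dev/sdb /dev/sdc /dev/sdd; do
        cryptsetup luksFormat "$dev"                 # same passphrase each time is fine
        cryptsetup open "$dev" "crypt-${dev##*/}"
    done
    mkfs.btrfs -m raid1 -d raid5 \
        /dev/mapper/crypt-sdb /dev/mapper/crypt-sdc /dev/mapper/crypt-sdd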

> - Maintenance tools
>   - How to get the status of the RAID? (Querying kernel logs is IMO
>     rather a bad way for this)
>     This includes:
>     - Is the raid degraded or not?

Various tools will report "missing" drives.

You can't mount a degraded array without a special mount option.
The option is ignored for non-degraded arrays, so you can use the option
for root filesystems while not using it for data filesystems with a
higher standard of integrity required.

An array with broken drives will likely have an extremely high number
of errors reported in 'btrfs dev stats.'
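
For example (device and mount point names are placeholders):

    btrfs filesystem show /mnt/array         # a dead member shows up as "missing"
    mount -o degraded /dev/sdb /mnt/array    # the special option needed while degraded
    btrfs device stats /mnt/array            # per-device error counters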

>     - Are scrubs/repairs/rebuilds/reshapes in progress and how far are
>       they? (Reshape would be: if the raid level is changed or the raid
>       grown/shrunk: has all data been replicated enough to be
>       "complete" for the desired raid lvl/number of devices/size?)

scrub and balance have detailed status subcommands.

device delete and filesystem resize do not.  The progress can be inferred
by examining per-device space usage.

>    - What should one regularly do? scrubs? balance? How often?

Scrub frequency depends on your site's fault detection latency
requirements.  If you don't have those, do a scrub every month on
NAS/enterprise drives, every week on desktop/green/cheap drives.

See above for balance recommendation.

Read 'btrfs dev stats' output regularly and assess the health of the
hardware when any counter changes.

>      Do we get any automatic (but configurable) tools for this?

'cron' is sufficient in most cases.

[questions I can't answer removed]
> 
> Cheers,
> Chris.

* Re: Status of RAID5/6
  2018-03-29 21:50     ` Zygo Blaxell
@ 2018-03-30  7:21       ` Menion
  2018-03-31  4:53         ` Zygo Blaxell
  2018-03-30 16:14       ` Goffredo Baroncelli
  1 sibling, 1 reply; 32+ messages in thread
From: Menion @ 2018-03-30  7:21 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, linux-btrfs

 Thanks for the detailed explanation. I think that a summary of this
should go in the btrfs raid56 wiki status page, because right now it is
completely inconsistent, and if a user comes there, he may get the
impression that raid56 is just broken.
Still, I have the one billion dollar question: from your words I understand
that even in RAID56 the metadata are spread on the devices in a complex
way, but shall I assume that the array can survive the sudden death
of one (two for raid6) HDD in the array?
Bye

* Re: Status of RAID5/6
  2018-03-29 21:50     ` Zygo Blaxell
  2018-03-30  7:21       ` Menion
@ 2018-03-30 16:14       ` Goffredo Baroncelli
  2018-03-31  5:03         ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2018-03-30 16:14 UTC (permalink / raw)
  To: Zygo Blaxell, Christoph Anton Mitterer; +Cc: linux-btrfs

On 03/29/2018 11:50 PM, Zygo Blaxell wrote:
> On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
>> Hey.
>>
>> Some things would IMO be nice to get done/clarified (i.e. documented in
>> the Wiki and manpages) from users'/admin's  POV:
[...]
> 
>>   - changing raid lvls?
> 
> btrfs uses a brute-force RAID conversion algorithm which always works, but
> takes zero short cuts.  e.g. there is no speed optimization implemented
> for cases like "convert 2-disk raid1 to 1-disk single" which can be
> very fast in theory.  The worst-case running time is the only running
> time available in btrfs.

[...]
What Zygo reported is an excellent source of information.  However, I have to point out that BTRFS has a little optimization: scrub/balance only works on allocated chunks, so a partially filled filesystem requires less time than a nearly full one.

> 
> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> is always a full-device operation.  In theory btrfs could track
> modifications at the chunk level but this isn't even specified in the
> on-disk format, much less implemented.

It could go even further; it would be sufficient to track which *partial* stripe updates will be performed before a commit, in one of the btrfs logs.  Then, when mounting an unclean filesystem, a scrub of just these stripes would be sufficient.

BR
G.Baroncelli

[...]


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

* Re: Status of RAID5/6
  2018-03-30  7:21       ` Menion
@ 2018-03-31  4:53         ` Zygo Blaxell
  0 siblings, 0 replies; 32+ messages in thread
From: Zygo Blaxell @ 2018-03-31  4:53 UTC (permalink / raw)
  To: Menion; +Cc: Christoph Anton Mitterer, linux-btrfs

On Fri, Mar 30, 2018 at 09:21:00AM +0200, Menion wrote:
>  Thanks for the detailed explanation. I think that a summary of this
> should go in the btrfs raid56 wiki status page, because right now it
> is completely inconsistent, and if a user comes there, he may get the
> impression that raid56 is just broken.
> Still, I have the one billion dollar question: from your words I
> understand that even in RAID56 the metadata are spread on the devices
> in a complex way, but shall I assume that the array can survive the
> sudden death of one (two for raid6) HDD in the array?

I wouldn't assume that.  There is still the write hole, and while there
is a small probability of having a write hole failure, it's a probability
that applies on *every* write in degraded mode, and since disks can fail
at any time, the array can enter degraded mode at any time.

It's similar to lottery tickets--buy one ticket, you probably won't win,
but if you buy millions of tickets, you'll claim the prize eventually.
The "prize" in this case is a severely damaged, possibly unrecoverable
filesystem.

If the data is raid5 and the metadata is raid1, the filesystem can
survive a single disk failure easily; however, some of the data may be
lost if writes to the remaining disks are interrupted by a system crash
or power failure and the write hole issue occurs.  Note that the damage
is not necessarily limited to recently written data--it's any random
data that is merely located adjacent to written data on the filesystem.

I wouldn't use raid6 until the write hole issue is resolved.  There is
no configuration where two disks can fail and metadata can still be
updated reliably.

Some users use the 'ssd_spread' mount option to reduce the probability
of write hole failure, which happens to be helpful by accident on some
array configurations, but it has a fairly high cost when the array is
not degraded due to all the extra balancing required.



> Bye

* Re: Status of RAID5/6
  2018-03-30 16:14       ` Goffredo Baroncelli
@ 2018-03-31  5:03         ` Zygo Blaxell
  2018-03-31  6:57           ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-03-31  5:03 UTC (permalink / raw)
  To: kreijack; +Cc: Christoph Anton Mitterer, linux-btrfs

On Fri, Mar 30, 2018 at 06:14:52PM +0200, Goffredo Baroncelli wrote:
> On 03/29/2018 11:50 PM, Zygo Blaxell wrote:
> > On Wed, Mar 21, 2018 at 09:02:36PM +0100, Christoph Anton Mitterer wrote:
> >> Hey.
> >>
> >> Some things would IMO be nice to get done/clarified (i.e. documented in
> >> the Wiki and manpages) from users'/admin's  POV:
> [...]
> > 
> > btrfs has no optimization like mdadm write-intent bitmaps; recovery
> > is always a full-device operation.  In theory btrfs could track
> > modifications at the chunk level but this isn't even specified in the
> > on-disk format, much less implemented.
> 
> It could go even further; it would be sufficient to track which
> *partial* stripe updates will be performed before a commit, in one
> of the btrfs logs.  Then, when mounting an unclean filesystem, a
> scrub of just these stripes would be sufficient.

A scrub cannot fix a raid56 write hole--the data is already lost.
The damaged stripe updates must be replayed from the log.

A scrub could fix raid1/raid10 partial updates but only if the filesystem
can reliably track which blocks failed to be updated by the disconnected
disks.

It would be nice if scrub could be filtered the same way balance is, e.g.
only certain block ranges, or only metadata blocks; however, this is not
presently implemented.

> BR
> G.Baroncelli
> 
> [...]
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

* Re: Status of RAID5/6
  2018-03-31  5:03         ` Zygo Blaxell
@ 2018-03-31  6:57           ` Goffredo Baroncelli
  2018-03-31  7:43             ` Zygo Blaxell
  2018-03-31 22:34             ` Chris Murphy
  0 siblings, 2 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2018-03-31  6:57 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, linux-btrfs

On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
>>> is always a full-device operation.  In theory btrfs could track
>>> modifications at the chunk level but this isn't even specified in the
>>> on-disk format, much less implemented.
>> It could go even further; it would be sufficient to track which
>> *partial* stripes update will be performed before a commit, in one
>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
>> a scrub on these stripes would be sufficient.

> A scrub cannot fix a raid56 write hole--the data is already lost.
> The damaged stripe updates must be replayed from the log.

Your statement is correct, but you don't consider the COW nature of btrfs.

The key is that if a data write is interrupted, the whole transaction is interrupted and aborted, and due to the COW nature of btrfs the "old state" is restored at the next reboot.

What is needed in any case is a rebuild of the parity to avoid the "write-hole" bug, and this is needed only for a partial stripe write.  For a full stripe write, since the commit is not flushed, no scrub is needed at all.

Of course for NODATACOW files this is not entirely true; but I don't see the gain in switching from the cost of COW to the cost of a log.

The above sentences are correct (IMHO) if we don't consider the power failure + missing device case.  However, in that case even logging the "new data" would not be sufficient.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

* Re: Status of RAID5/6
  2018-03-31  6:57           ` Goffredo Baroncelli
@ 2018-03-31  7:43             ` Zygo Blaxell
  2018-03-31  8:16               ` Goffredo Baroncelli
  2018-03-31 22:34             ` Chris Murphy
  1 sibling, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-03-31  7:43 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Christoph Anton Mitterer, linux-btrfs

On Sat, Mar 31, 2018 at 08:57:18AM +0200, Goffredo Baroncelli wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> >>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> >>> is always a full-device operation.  In theory btrfs could track
> >>> modifications at the chunk level but this isn't even specified in the
> >>> on-disk format, much less implemented.
> >> It could go even further; it would be sufficient to track which
> >> *partial* stripes update will be performed before a commit, in one
> >> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> >> a scrub on these stripes would be sufficient.
> 
> > A scrub cannot fix a raid56 write hole--the data is already lost.
> > The damaged stripe updates must be replayed from the log.
> 
> Your statement is correct, but you don't consider the COW nature of btrfs.
> 
> The key is that if a data write is interrupted, the whole transaction
> is interrupted and aborted, and due to the COW nature of btrfs the
> "old state" is restored at the next reboot.

This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
RMW operations which are not COW and don't provide any data integrity
guarantee.  Old data (i.e. data from very old transactions that are not
part of the currently written transaction) can be destroyed by this.

> What is needed in any case is rebuild of parity to avoid the
> "write-hole" bug. And this is needed only for a partial stripe
> write. For a full stripe write, due to the fact that the commit is
> not flushed, it is not needed the scrub at all.
> 
> Of course for the NODATACOW file this is not entirely true; but I
> don't see the gain to switch from the cost of COW to the cost of a log.
> 
> The above sentences are correct (IMHO) if we don't consider a power
> failure+device missing case. However in this case even logging the
> "new data" would be not sufficient.
> 
> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

* Re: Status of RAID5/6
  2018-03-31  7:43             ` Zygo Blaxell
@ 2018-03-31  8:16               ` Goffredo Baroncelli
       [not found]                 ` <28a574db-0f74-b12c-ab5f-400205fd80c8@gmail.com>
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2018-03-31  8:16 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Christoph Anton Mitterer, linux-btrfs

On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
>> The key is that if a data write is interrupted, all the transaction
>> is interrupted and aborted. And due to the COW nature of btrfs, the
>> "old state" is restored at the next reboot.

> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> RMW operations which are not COW and don't provide any data integrity
> guarantee.  Old data (i.e. data from very old transactions that are not
> part of the currently written transaction) can be destroyed by this.

Could you elaborate a bit?

Generally speaking, updating part of a stripe requires an RMW cycle, because
- you need to read all the data stripes (plus the parity in case of a problem)
- then you have to write
	- the new data
	- the new parity (calculated from the first read and the new data)

However, the "old" data should be untouched; or are you saying that the "old" data is rewritten with the same data?

BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

* Re: Status of RAID5/6
       [not found]                 ` <28a574db-0f74-b12c-ab5f-400205fd80c8@gmail.com>
@ 2018-03-31 14:40                   ` Zygo Blaxell
  0 siblings, 0 replies; 32+ messages in thread
From: Zygo Blaxell @ 2018-03-31 14:40 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: kreijack, Christoph Anton Mitterer, linux-btrfs

On Sat, Mar 31, 2018 at 11:36:50AM +0300, Andrei Borzenkov wrote:
> On 31.03.2018 11:16, Goffredo Baroncelli wrote:
> > On 03/31/2018 09:43 AM, Zygo Blaxell wrote:
> >>> The key is that if a data write is interrupted, all the transaction
> >>> is interrupted and aborted. And due to the COW nature of btrfs, the
> >>> "old state" is restored at the next reboot.
> > 
> >> This is not presently true with raid56 and btrfs.  RAID56 on btrfs uses
> >> RMW operations which are not COW and don't provide any data integrity
> >> guarantee.  Old data (i.e. data from very old transactions that are not
> >> part of the currently written transaction) can be destroyed by this.
> > 
> > Could you elaborate a bit ?
> > 
> > Generally speaking, updating a part of a stripe require a RMW cycle, because
> > - you need to read all data stripe (with parity in case of a problem)
> > - then you should write
> > 	- the new data
> > 	- the new parity (calculated on the basis of the first read, and the new data)
> > 
> > However the "old" data should be untouched; or you are saying that the "old" data is rewritten with the same data ? 
> > 
> 
> If an old data block becomes unavailable, it can no longer be
> reconstructed, because the old contents of the "new data" and "new
> parity" blocks are lost.  Fortunately, if checksums are in use this
> does not cause silent data corruption, but it effectively means data loss.
> 
> Writing data belonging to an unrelated transaction affects previous
> transactions precisely because of the RMW cycle.  This fundamentally
> violates btrfs's claim of always having either the old or the new
> consistent state.

Correct.

To fix this, any RMW stripe update on raid56 has to be written to a
log first.  All RMW updates must be logged because a disk failure could
happen at any time.

Full stripe writes don't need to be logged because all the data in the
stripe belongs to the same transaction, so if a disk fails the entire
stripe is either committed or it is not.

One way to avoid the logging is to change the btrfs allocation parameters
so that the filesystem doesn't allocate data in RAID stripes that are
already occupied by data from older transactions.  This is similar to
what 'ssd_spread' does, although the ssd_spread option wasn't designed
for this and won't be effective on large arrays.  This avoids modifying
stripes that contain old committed data, but it also means the free space
on the filesystem will become heavily fragmented over time.  Users will
have to run balance *much* more often to defragment the free space.


* Re: Status of RAID5/6
  2018-03-31  6:57           ` Goffredo Baroncelli
  2018-03-31  7:43             ` Zygo Blaxell
@ 2018-03-31 22:34             ` Chris Murphy
  2018-04-01  3:45               ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2018-03-31 22:34 UTC (permalink / raw)
  To: Goffredo Baroncelli; +Cc: Zygo Blaxell, Christoph Anton Mitterer, Btrfs BTRFS

On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
<kreijack@inwind.it> wrote:
> On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
>>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
>>>> is always a full-device operation.  In theory btrfs could track
>>>> modifications at the chunk level but this isn't even specified in the
>>>> on-disk format, much less implemented.
>>> It could go even further; it would be sufficient to track which
>>> *partial* stripes update will be performed before a commit, in one
>>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
>>> a scrub on these stripes would be sufficient.
>
>> A scrub cannot fix a raid56 write hole--the data is already lost.
>> The damaged stripe updates must be replayed from the log.
>
> Your statement is correct, but you don't consider the COW nature of btrfs.
>
> The key is that if a data write is interrupted, the whole transaction is interrupted and aborted, and due to the COW nature of btrfs the "old state" is restored at the next reboot.
>
> What is needed in any case is a rebuild of the parity to avoid the "write-hole" bug.

Write hole happens on disk in Btrfs, but the ensuing corruption on
rebuild is detected. Corrupt data never propagates. The problem is
that Btrfs gives up when it's detected.

If it assumes just a bit flip - not always a correct assumption, but
perhaps reasonable most of the time - it could iterate very quickly:
flip a bit, then recompute and compare the checksum. It doesn't have to
iterate across 64KiB times the number of devices. It really only has
to iterate bit flips on the particular 4KiB block that failed its csum
(or, in the case of metadata, 16KiB for the default leaf size, up to a
max of 64KiB).

That's a maximum of 32,768 single-bit iterations and comparisons for a
4KiB block. It'd be quite fast. And going for two bit flips, while a lot
slower, is probably not all that bad either.

Now if it's the kind of corruption you get from a torn or misdirected
write, there's enough corruption that now you're trying to find a
collision on crc32c with a partial match as a guide. That'd take a
while, and who knows, you might actually get corrupted data anyway since
crc32c isn't cryptographically secure.


-- 
Chris Murphy

* Re: Status of RAID5/6
  2018-03-31 22:34             ` Chris Murphy
@ 2018-04-01  3:45               ` Zygo Blaxell
  2018-04-01 20:51                 ` Chris Murphy
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-01  3:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Goffredo Baroncelli, Christoph Anton Mitterer, Btrfs BTRFS

On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:
> On Sat, Mar 31, 2018 at 12:57 AM, Goffredo Baroncelli
> <kreijack@inwind.it> wrote:
> > On 03/31/2018 07:03 AM, Zygo Blaxell wrote:
> >>>> btrfs has no optimization like mdadm write-intent bitmaps; recovery
> >>>> is always a full-device operation.  In theory btrfs could track
> >>>> modifications at the chunk level but this isn't even specified in the
> >>>> on-disk format, much less implemented.
> >>> It could go even further; it would be sufficient to track which
> >>> *partial* stripes update will be performed before a commit, in one
> >>> of the btrfs logs. Then in case of a mount of an unclean filesystem,
> >>> a scrub on these stripes would be sufficient.
> >
> >> A scrub cannot fix a raid56 write hole--the data is already lost.
> >> The damaged stripe updates must be replayed from the log.
> >
> > Your statement is correct, but you doesn't consider the COW nature of btrfs.
> >
> > The key is that if a data write is interrupted, all the transaction is interrupted and aborted. And due to the COW nature of btrfs, the "old state" is restored at the next reboot.
> >
> > What is needed in any case is rebuild of parity to avoid the "write-hole" bug.
> 
> Write hole happens on disk in Btrfs, but the ensuing corruption on
> rebuild is detected. Corrupt data never propagates. 

Data written with nodatasum or nodatacow is corrupted without detection
(same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
journal device).

Metadata always has csums, and files have checksums if they are created
with default attributes and mount options.  Those cases are covered:
any corrupted data will give EIO on reads (except once per 4 billion
blocks, where the corrupted CRC matches at random).

> The problem is that Btrfs gives up when it's detected.

Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
combinations of recovery blocks for raid6, and earlier kernels than
those would not recover correctly for raid5 either.  I think this has
all been fixed in recent kernels but I haven't tested these myself so
don't quote me on that.

Other than that, btrfs doesn't give up in the write hole case.
It rebuilds the data according to the raid5/6 parity algorithm, but
the algorithm doesn't produce correct data for interrupted RMW writes
when there is no stripe update journal.  There is nothing else to try
at that point.  By the time the error is detected the opportunity to
recover the data has long passed.

The data that comes out of the recovery algorithm is a mixture of old
and new data from the filesystem.  The "new" data is something that
was written just before a failure, but the "old" data could be data
of any age, even a block of free space, that previously existed on the
filesystem.  If you bypass the EIO from the failing csums (e.g. by using
btrfs rescue) it will appear as though someone took the XOR of pairs of
random blocks from the disk and wrote it over one of the data blocks
at random.  When this happens to btrfs metadata, it is effectively a
fuzz tester for tools like 'btrfs check' which will often splat after
a write hole failure happens.

> If it assumes just a bit flip - not always a correct assumption but
> might be reasonable most of the time, it could iterate very quickly.

That is not how write hole works (or csum recovery for that matter).
Write hole producing a single bit flip would occur extremely rarely
outside of contrived test cases.

Recall that in a write hole, one or more 4K blocks are updated on some
of the disks in a stripe, but other blocks retain their original values
from prior to the update.  This is OK as long as all disks are online,
since the parity can be ignored or recomputed from the data blocks.  It is
also OK if the writes on all disks are completed without interruption,
since the data and parity eventually become consistent when all writes
complete as intended.  It is also OK if the entire stripe is written at
once, since then there is only one transaction referring to the stripe,
and if that transaction is not committed then the content of the stripe
is irrelevant.

The write hole error event is when all of the following occur:

	- a stripe containing committed data from one or more btrfs
	transactions is modified by raid5/6 RMW update in a new
	transaction.  This is the usual case on a btrfs filesystem
	with the default, 'nossd' or 'ssd' mount options.

	- the write is not completed (due to crash, power failure, disk
	failure, bad sector, SCSI timeout, bad cable, firmware bug, etc),
	so the parity block is out of sync with modified data blocks
	(before or after, order doesn't matter).

	- the array is already degraded, or later becomes degraded before
	the parity block can be recomputed by a scrub.

Users can run scrub immediately after _every_ unclean shutdown to
reduce the risk of inconsistent parity and unrecoverable data should
a disk fail later, but this can only prevent future write hole events,
not recover data lost during past events.

If one of the data blocks is not available, its content cannot be
recomputed from parity due to the inconsistency within the stripe.
This will likely be detected as a csum failure (unless the data block
is part of a nodatacow/nodatasum file, in which case corruption occurs
but is not detected) except for the one time out of 4 billion when
two CRC32s on random data match at random.
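
To make that concrete, here is a toy Python sketch (an invented 3-device
RAID5 model with one 2+1 stripe, not btrfs code) of an interrupted RMW
followed by a degraded rebuild:

    # Two data blocks D0, D1 and parity P = D0 xor D1, one per device.
    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    d0_old = bytes([0xAA]) * 4096   # committed in an old transaction
    d1_old = bytes([0xBB]) * 4096   # committed in an old transaction
    parity = xor(d0_old, d1_old)    # stripe is consistent

    # RMW update of D1 is interrupted: D1 reaches its disk, parity does not.
    d1_new = bytes([0xCC]) * 4096

    # The device holding D0 later fails; rebuild D0 from D1 and stale parity:
    d0_rebuilt = xor(d1_new, parity)    # == d1_new ^ d0_old ^ d1_old
    assert d0_rebuilt != d0_old         # the committed block is unrecoverable
    # The stored csum still describes d0_old, so the bad rebuild is caught
    # as a csum failure (EIO) rather than returned as valid data.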

If a damaged block contains btrfs metadata, the filesystem will be
severely affected:  read-only, up to 100% of data inaccessible, only
recovery methods involving brute force search will work.

> Flip bit, and recompute and compare checksum. It doesn't have to
> iterate across 64KiB times the number of devices. It really only has
> to iterate bit flips on the particular 4KiB block that has failed csum
> (or in the case of metadata, 16KiB for the default leaf size, up to a
> max of 64KiB).

Write hole is effectively 32768 possible bit flips in a 4K block--assuming
only one block is affected, which is not very likely.  Each disk in an
array can have dozens of block updates in flight when an interruption
occurs, so there can be millions of bits corrupted in a single write
interruption event (and dozens of opportunities to encounter the nominally
rare write hole itself).

An experienced forensic analyst armed with specialized tools, a database
of file formats, and a recent backup of the filesystem might be able to
recover the damaged data or deduce what it was.  btrfs, being only mere
software running in the kernel, cannot.

There are two ways to solve the write hole problem and this is not one
of them.

> That's a maximum of 4096 iterations and comparisons. It'd be quite
> fast. And going for two bit flips while a lot slower is probably not
> all that bad either.

You could use that approach to fix a corrupted parity or data block
on a degraded array, but not a stripe that has data blocks destroyed
by an update with a write hole event.  Also this approach assumes that
whatever is flipping bits in RAM is not in and of itself corrupting data
or damaging the filesystem in unrecoverable ways, but most RAM-corrupting
agents in the real world do not limit themselves only to detectable and
recoverable mischief.

Aside:  As a best practice, if you see one-bit corruptions on your
btrfs filesystem, it is time to start replacing hardware, possibly also
finding a new hardware vendor or model (assuming the corruption is coming
from hardware, not a kernel memory corruption bug in some random device
driver).  Healthy hardware doesn't do bit flips.  So many things can go
wrong on unhealthy hardware, and they aren't all detectable or fixable.
It's one of the few IT risks that can be mitigated by merely spending
money until the problem goes away.

> Now if it's the kind of corruption you get from a torn or misdirected
> write, there's enough corruption that now you're trying to find a
> collision on crc32c with a partial match as a guide. That'd take a
> while and who knows you might actually get corrupted data anyway since
> crc32c isn't cryptographically secure.

All the CRC32 does is reduce the search space for data recovery
from 32768 bits to 32736 bits per 4K block.  It is not possible to
brute-force search a 32736-bit space (that's two to the power of 32736
possible combinations), and even if it was, there would be no way to
distinguish which of billions of billions of billions of billions...[over
4000 "billions of" deleted]...of billions of possible data blocks that
have a matching CRC is the right one.  A SHA256 as block csum would only
reduce the search space to 32512 bits.
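
For scale, the arithmetic behind those numbers (illustration only):

    block_bits = 4096 * 8                      # 32768 bits in a 4K block
    crc_bits = 32
    candidates = 2 ** (block_bits - crc_bits)  # blocks sharing any one CRC32
    print(block_bits - crc_bits)               # 32736
    print(len(str(candidates)))                # ~9855 decimal digits of candidates
    print(block_bits - 256)                    # 32512 bits left even with SHA256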

Our forensic analyst above could reduce the search space to a manageable
size for a data-specific recovery tool, but we can't put one of those
in the kernel.

Getting corrupted data out of a brute force search of multiple bit
flips against a checksum is not just likely--it's certain, if you can
even run the search long enough to get a result.  The number of corrupt
4K blocks with correct CRC outnumbers the number of correct blocks by 
ten thousand orders of magnitude.

It would work with a small number of bit flips because one of the
properties of the CRC32 function is that it reliably detects errors
with length shorter than the polynomial.

> 
> -- 
> Chris Murphy
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-01  3:45               ` Zygo Blaxell
@ 2018-04-01 20:51                 ` Chris Murphy
  2018-04-01 21:11                   ` Chris Murphy
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2018-04-01 20:51 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Chris Murphy, Goffredo Baroncelli, Christoph Anton Mitterer, Btrfs BTRFS

On Sat, Mar 31, 2018 at 9:45 PM, Zygo Blaxell
<ce3g8jdj@umail.furryterror.org> wrote:
> On Sat, Mar 31, 2018 at 04:34:58PM -0600, Chris Murphy wrote:

>> Write hole happens on disk in Btrfs, but the ensuing corruption on
>> rebuild is detected. Corrupt data never propagates.
>
> Data written with nodatasum or nodatacow is corrupted without detection
> (same as running ext3/ext4/xfs on top of mdadm raid5 without a parity
> journal device).

Yeah I guess I'm not very worried about nodatasum/nodatacow if the
user isn't. Perhaps it's not a fair bias, but bias nonetheless.


>
> Metadata always has csums, and files have checksums if they are created
> with default attributes and mount options.  Those cases are covered,
> any corrupted data will give EIO on reads (except once per 4 billion
> blocks, where the corrupted CRC matches at random).
>
>> The problem is that Btrfs gives up when it's detected.
>
> Before recent kernels (4.14 or 4.15) btrfs would not attempt all possible
> combinations of recovery blocks for raid6, and earlier kernels than
> those would not recover correctly for raid5 either.  I think this has
> all been fixed in recent kernels but I haven't tested these myself so
> don't quote me on that.

Looks like 4.15
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.15&id2=v4.14

And those parts aren't yet backported to 4.14
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/diff/fs/btrfs/raid56.c?id=v4.15.15&id2=v4.14.32

And more in 4.16
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/fs/btrfs/raid56.c?id=v4.16-rc7&id2=v4.15


>
>> If it assumes just a bit flip - not always a correct assumption but
>> might be reasonable most of the time, it could iterate very quickly.
>
> That is not how write hole works (or csum recovery for that matter).
> Write hole producing a single bit flip would occur extremely rarely
> outside of contrived test cases.

Yes, what I wrote is definitely wrong, and I know better. I guess I
had a torn write in my brain!



> Users can run scrub immediately after _every_ unclean shutdown to
> reduce the risk of inconsistent parity and unrecoverable data should
> a disk fail later, but this can only prevent future write hole events,
> not recover data lost during past events.

Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
such a leaf containing EXTENT_CSUM means that EXTENT_CSUM




>
> If one of the data blocks is not available, its content cannot be
> recomputed from parity due to the inconsistency within the stripe.
> This will likely be detected as a csum failure (unless the data block
> is part of a nodatacow/nodatasum file, in which case corruption occurs
> but is not detected) except for the one time out of 4 billion when
> two CRC32s on random data match at random.
>
> If a damaged block contains btrfs metadata, the filesystem will be
> severely affected:  read-only, up to 100% of data inaccessible, only
> recovery methods involving brute force search will work.
>
>> Flip bit, and recompute and compare checksum. It doesn't have to
>> iterate across 64KiB times the number of devices. It really only has
>> to iterate bit flips on the particular 4KiB block that has failed csum
>> (or in the case of metadata, 16KiB for the default leaf size, up to a
>> max of 64KiB).
>
> Write hole is effectively 32768 possible bit flips in a 4K block--assuming
> only one block is affected, which is not very likely.  Each disk in an
> array can have dozens of block updates in flight when an interruption
> occurs, so there can be millions of bits corrupted in a single write
> interruption event (and dozens of opportunities to encounter the nominally
> rare write hole itself).
>
> An experienced forensic analyst armed with specialized tools, a database
> of file formats, and a recent backup of the filesystem might be able to
> recover the damaged data or deduce what it was.  btrfs, being only mere
> software running in the kernel, cannot.
>
> There are two ways to solve the write hole problem and this is not one
> of them.
>
>> That's a maximum of 4096 iterations and comparisons. It'd be quite
>> fast. And going for two bit flips while a lot slower is probably not
>> all that bad either.
>
> You could use that approach to fix a corrupted parity or data block
> on a degraded array, but not a stripe that has data blocks destroyed
> by an update with a write hole event.  Also this approach assumes that
> whatever is flipping bits in RAM is not in and of itself corrupting data
> or damaging the filesystem in unrecoverable ways, but most RAM-corrupting
> agents in the real world do not limit themselves only to detectable and
> recoverable mischief.
>
> Aside:  As a best practice, if you see one-bit corruptions on your
> btrfs filesystem, it is time to start replacing hardware, possibly also
> finding a new hardware vendor or model (assuming the corruption is coming
> from hardware, not a kernel memory corruption bug in some random device
> driver).  Healthy hardware doesn't do bit flips.  So many things can go
> wrong on unhealthy hardware, and they aren't all detectable or fixable.
> It's one of the few IT risks that can be mitigated by merely spending
> money until the problem goes away.
>
>> Now if it's the kind of corruption you get from a torn or misdirected
>> write, there's enough corruption that now you're trying to find a
>> collision on crc32c with a partial match as a guide. That'd take a
>> while and who knows you might actually get corrupted data anyway since
>> crc32c isn't cryptographically secure.
>
> All the CRC32 does is reduce the search space for data recovery
> from 32768 bits to 32736 bits per 4K block.  It is not possible to
> brute-force search a 32736-bit space (that's two to the power of 32736
> possible combinations), and even if it was, there would be no way to
> distinguish which of billions of billions of billions of billions...[over
> 4000 "billions of" deleted]...of billions of possible data blocks that
> have a matching CRC is the right one.  A SHA256 as block csum would only
> reduce the search space to 32512 bits.
>
> Our forensic analyst above could reduce the search space to a manageable
> size for a data-specific recovery tool, but we can't put one of those
> in the kernel.
>
> Getting corrupted data out of a brute force search of multiple bit
> flips against a checksum is not just likely--it's certain, if you can
> even run the search long enough to get a result.  The number of corrupt
> 4K blocks with correct CRC outnumbers the number of correct blocks by
> ten thousand orders of magnitude.
>
> It would work with a small number of bit flips because one of the
> properties of the CRC32 function is that it reliably detects errors
> with length shorter than the polynomial.
>
>>
>> --
>> Chris Murphy
>>



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-01 20:51                 ` Chris Murphy
@ 2018-04-01 21:11                   ` Chris Murphy
  2018-04-02  5:45                     ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2018-04-01 21:11 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Zygo Blaxell, Goffredo Baroncelli, Christoph Anton Mitterer, Btrfs BTRFS

(I hate it when my palm rubs the trackpad and hits send prematurely...)


On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy <lists@colorremedies.com> wrote:

>> Users can run scrub immediately after _every_ unclean shutdown to
>> reduce the risk of inconsistent parity and unrecoverable data should
>> a disk fail later, but this can only prevent future write hole events,
>> not recover data lost during past events.
>
> Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> such a leaf containing EXTENT_CSUM means that EXTENT_CSUM

means that EXTENT_CSUM is assumed to be correct. But in fact it could
be stale. It's just as possible the metadata and superblock update is
what's missing due to the interruption, while both data and parity
stripe writes succeeded. The window for either the data or parity write
to fail is a much shorter time interval than that of the numerous
metadata writes followed by the superblock update. In such a case, the
old metadata is what's pointed to, including EXTENT_CSUM. Therefore
your scrub would always show csum error, even if both data and parity
are correct. You'd have to init-csum in this case, I suppose.

Pretty much it's RMW with a (partial) stripe overwrite upending COW,
and therefore upending the atomicity, and thus consistency of Btrfs in
the raid56 case where any portion of the transaction is interrupted.

And this is amplified if metadata is also raid56.

ZFS avoids the problem at the expense of probably a ton of
fragmentation, by taking e.g. 4KiB RMW and writing a full length
stripe of 8KiB fully COW, rather than doing stripe modification with
an overwrite. And that's because it has dynamic stripe lengths. For
Btrfs to always do COW would mean that 4KiB change goes into a new
full stripe, 64KiB * num devices, assuming no other changes are ready
at commit time.
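
Rough arithmetic for that worst case (assuming a 6-device raid6, so 4
data devices and 64KiB stripe elements; numbers are only for scale):

    STRIPE_ELEM = 64 * 1024
    data_devices = 4                      # 6 devices minus 2 parity
    full_stripe_data = STRIPE_ELEM * data_devices
    print(full_stripe_data // 1024)       # 256 KiB of data capacity consumed...
    print(full_stripe_data // 4096)       # ...to carry one 4 KiB change: 64 blocks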

So yeah, avoiding the problem is best. But if it's going to be a
journal, it's going to make things pretty damn slow I'd think, unless
the journal can be explicitly placed on something faster than the array,
like an SSD/NVMe device. And that's what mdadm allows and expects.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-01 21:11                   ` Chris Murphy
@ 2018-04-02  5:45                     ` Zygo Blaxell
  2018-04-02 15:18                       ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-02  5:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Goffredo Baroncelli, Christoph Anton Mitterer, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 8410 bytes --]

On Sun, Apr 01, 2018 at 03:11:04PM -0600, Chris Murphy wrote:
> (I hate it when my palm rubs the trackpad and hits send prematurely...)
> 
> 
> On Sun, Apr 1, 2018 at 2:51 PM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> >> Users can run scrub immediately after _every_ unclean shutdown to
> >> reduce the risk of inconsistent parity and unrecoverable data should
> >> a disk fail later, but this can only prevent future write hole events,
> >> not recover data lost during past events.
> >
> > Problem is, Btrfs assumes a leaf is correct if it passes checksum. And
> > such a leaf containing EXTENT_CSUM means that EXTENT_CSUM
> 
> means that EXTENT_CSUM is assumed to be correct. But in fact it could
> be stale. It's just as possible the metadata and superblock update is
> what's missing due to the interruption, while both data and parity
> stripe writes succeeded. The window for either the data or parity write
> to fail is a much shorter time interval than that of the numerous
> metadata writes followed by the superblock update. 

csums cannot be wrong due to write interruption.  The data and metadata
blocks are written first, then barrier, then superblock updates pointing
to the data and csums previously written in the same transaction.
Unflushed data is not included in the metadata.  If there is a write
interruption then the superblock update doesn't occur and btrfs reverts
to the previous unmodified data+csum trees.

This works on non-raid5/6 because all the writes that make up a
single transaction are ordered and independent, and no data from older
transactions is modified during any tree update.
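
A minimal sketch of that ordering (conceptual only, not the actual btrfs
code paths):

    log = []
    def write(what): log.append(("write", what))
    def barrier():   log.append(("barrier",))

    def commit_transaction():
        write("new data blocks")        # COW: old blocks are never overwritten
        write("new metadata + csums")   # csum tree pages for the data above
        barrier()                       # all of the above must be durable first
        write("superblock")             # single pointer flip to the new trees

    commit_transaction()
    # Interrupt anywhere before the superblock write and the old superblock
    # still points at the old, self-consistent trees; the partially written
    # new blocks are simply unreferenced and harmless.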

On raid5/6 every RMW operation modifies data from old transactions
by creating data/parity inconsistency.  If there was no data in the
stripe from an old transaction, the operation would be just a write,
no read and modify.  In the write hole case, the csum *is* correct,
it is the data that is wrong.

> In such a case, the
> old metadata is what's pointed to, including EXTENT_CSUM. Therefore
> your scrub would always show csum error, even if both data and parity
> are correct. You'd have to init-csum in this case, I suppose.

No, the csums are correct.  The data does not match the csum because the
data is corrupted.  Assuming barriers work on your disk, and you're not
having some kind of direct IO data consistency bug, and you can read the
csum tree at all, then the csums are correct, even with write hole.

When write holes and other write interruption patterns affect the csum
tree itself, this results in parent transid verify failures, csum tree
page csum failures, or both.  This forces the filesystem read-only so
it's easy to spot when it happens.

Note that the data blocks with wrong csum from raid5/6 reconstruction
after a write hole event always belong to _old_ transactions damaged
by the write hole.  If the writes are interrupted, the new data blocks
in a RMW stripe will not be committed and will have no csums to verify,
so they can't have _wrong_ csums.  The old data blocks do not have their
csum changed by the write hole (the csum is stored on a separate tree
in a different block group) so the csums are intact.  When a write hole
event corrupts the data reconstruction on a degraded array, the csum
doesn't match because the csum is correct and the data is not.

> Pretty much it's RMW with a (partial) stripe overwrite upending COW,
> and therefore upending the atomicity, and thus consistency of Btrfs in
> the raid56 case where any portion of the transaction is interrupted.

Not any portion, only the RMW stripe update can produce data loss due
to write interruption (well, that, and fsync() log-tree replay bugs).

If any other part of the transaction is interrupted then btrfs recovers
just fine with its COW tree update algorithm and write barriers.

> And this is amplified if metadata is also raid56.

Data and metadata are mangled the same way.  The difference is the impact.

btrfs tolerates exactly 0 bits of damaged metadata after RAID recovery,
and enforces this intolerance with metadata transids and csums, so write
hole on metadata _always_ breaks the filesystem.

> ZFS avoids the problem at the expense of probably a ton of
> fragmentation, by taking e.g. 4KiB RMW and writing a full length
> stripe of 8KiB fully COW, rather than doing stripe modification with
> an overwrite. And that's because it has dynamic stripe lengths. 

I think that's technically correct but could be clearer.

ZFS never does RMW.  It doesn't need to.  Parity blocks are allocated
at the extent level and RAID stripes are built *inside* the extents (or
"groups of contiguous blocks written in a single transaction" which
seems to be the closest ZFS equivalent of the btrfs extent concept).

Since every ZFS RAID stripe is bespoke sized to exactly fit a single
write operation, no two ZFS transactions can ever share a RAID stripe.
No transactions sharing a stripe means no write hole.

There is no impact on fragmentation on ZFS--space is allocated and
deallocated contiguously on ZFS the same way in the RAID-Z and other
profiles.  The _amount_ of space allocated is different but it is the
same number of file fragments and free space holes created.

The tradeoff is that short writes consume more space in ZFS because
the stripe width depends on contiguous write size.  There is an impact
on the data:parity ratio because every short write reduces the average
ratio across the filesystem.  Really short writes degenerate to RAID1
(1:1 data and parity blocks).
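
A rough illustration of that tradeoff, assuming 4K blocks and a single
parity block per bespoke stripe (simplified; real RAID-Z padding rules
are ignored):

    def on_disk_blocks(data_blocks, parity_blocks=1):
        return data_blocks + parity_blocks   # one bespoke stripe per write

    for n in (1, 2, 4, 16):
        print(f"{n} data blocks -> {on_disk_blocks(n)} on disk "
              f"(data:parity {n}:1)")
    # A 1-block write stores as much parity as data (the 1:1 RAID1-like case);
    # the nominal ratio is only approached by long contiguous writes.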

> For
> Btrfs to always do COW would mean that 4KiB change goes into a new
> full stripe, 64KiB * num devices, assuming no other changes are ready
> at commit time.

In btrfs the higher layers know nothing about block group structure.
btrfs extents are allocated in virtual address space with the RAID5/6
layer underneath.  This was a straight copy of the mdadm approach and
has all the same pitfalls and workarounds.

It is possible to combine writes from a single transaction into full
RMW stripes, but this *does* have an impact on fragmentation in btrfs.
Any partially-filled stripe is effectively read-only and the space within
it is inaccessible until all data within the stripe is overwritten,
deleted, or relocated by balance.

btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
update, but that has a significant write magnification effect (and before
kernel 4.14, non-trivial CPU load as well).

btrfs could also just allocate the full stripe to an extent, but emit
only extent ref items for the blocks that are in use.  No fragmentation
but lots of extra disk space used.  Also doesn't quite work the same
way for metadata pages.

If btrfs adopted the ZFS approach, the extent allocator and all higher
layers of the filesystem would have to know about--and skip over--the
parity blocks embedded inside extents.  Making this change would mean
that some btrfs RAID profiles start interacting with stuff like balance
and compression which they currently do not.  It would create a new
block group type and require an incompatible on-disk format change for
both reads and writes.

So the current front-runner compromise seems to be RMW stripe update
logging, which is slow and requires an incompatible on-disk format change,
but minimizes code churn within btrfs.  Stripe update logs also handle
nodatacow files which none of the other proposals do.

> So yeah, avoiding the problem is best. But if it's going to be a
> journal, it's going to make things pretty damn slow I'd think, unless
> the journal can be explicitly placed on something faster than the array,
> like an SSD/NVMe device. And that's what mdadm allows and expects.

The journal isn't required for full stripe writes, so it should only
cause overhead on short writes (i.e. 4K followed by fsync(), or any
leftover blocks before a transaction commit, or writes to a nearly full
filesystem with free space fragmentation).  Those are already slow due to
the seeks that are required to implement these.  The stripe log can be
combined with the fsync log and transaction commit, so the extra IO may
not cause a significant drop in performance (making a lot of assumptions
about how it gets implemented).

> 
> 
> -- 
> Chris Murphy
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-02  5:45                     ` Zygo Blaxell
@ 2018-04-02 15:18                       ` Goffredo Baroncelli
  2018-04-02 15:49                         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2018-04-02 15:18 UTC (permalink / raw)
  To: Zygo Blaxell, Chris Murphy; +Cc: Christoph Anton Mitterer, Btrfs BTRFS

On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
[...]
> It is possible to combine writes from a single transaction into full
> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> Any partially-filled stripe is effectively read-only and the space within
> it is inaccessible until all data within the stripe is overwritten,
> deleted, or relocated by balance.
>
> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> update, but that has a significant write magnification effect (and before
> kernel 4.14, non-trivial CPU load as well).
> 
> btrfs could also just allocate the full stripe to an extent, but emit
> only extent ref items for the blocks that are in use.  No fragmentation
> but lots of extra disk space used.  Also doesn't quite work the same
> way for metadata pages.
> 
> If btrfs adopted the ZFS approach, the extent allocator and all higher
> layers of the filesystem would have to know about--and skip over--the
> parity blocks embedded inside extents.  Making this change would mean
> that some btrfs RAID profiles start interacting with stuff like balance
> and compression which they currently do not.  It would create a new
> block group type and require an incompatible on-disk format change for
> both reads and writes.

I thought that a possible solution is to create BGs with different numbers of data disks. E.g. supposing we have a raid6 system with 6 disks, where 2 are parity disks, we would allocate 3 BGs:

BG #1: 1 data disk, 2 parity disks
BG #2: 2 data disks, 2 parity disks,
BG #3: 4 data disks, 2 parity disks

For simplicity, the disk-stripe length is assumed = 4K.

So if you have a write with a length of 4KB, it should be placed in BG#1; if you have a write with a length of 4KB*3 (12KB), the first 8KB should be placed in BG#2, then the remaining 4KB in BG#1.

This would avoid wasting space, even if fragmentation will increase (but does fragmentation matter with modern solid state disks?).

From time to time, a re-balance should be performed to empty BG #1 and #2. Otherwise a new BG should be allocated.

The cost should be comparable to the logging/journaling (each write shorter than a full stripe has to be written two times); the implementation should be quite easy, because btrfs already supports BGs with different sets of disks.
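
A toy sketch of the splitting rule described above, assuming the 6-disk
raid6 layout with a 4K disk-stripe (the function and its structure are
made up purely for illustration):

    def split_write(nblocks):
        """Split a write of nblocks 4K blocks across BG#3, BG#2, BG#1."""
        placement = []
        for width in (4, 2, 1):          # data disks in BG#3, BG#2, BG#1
            while nblocks >= width:
                placement.append((width, width * 4))   # (BG width, KB placed)
                nblocks -= width
        return placement

    print(split_write(1))   # [(1, 4)]          -> 4KB in BG#1
    print(split_write(3))   # [(2, 8), (1, 4)]  -> 8KB in BG#2, 4KB in BG#1
    print(split_write(6))   # [(4, 16), (2, 8)] -> split across BG#3 and BG#2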

BR 
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-02 15:18                       ` Goffredo Baroncelli
@ 2018-04-02 15:49                         ` Austin S. Hemmelgarn
  2018-04-02 22:23                           ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Austin S. Hemmelgarn @ 2018-04-02 15:49 UTC (permalink / raw)
  To: kreijack, Zygo Blaxell, Chris Murphy
  Cc: Christoph Anton Mitterer, Btrfs BTRFS

On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
> [...]
>> It is possible to combine writes from a single transaction into full
>> RMW stripes, but this *does* have an impact on fragmentation in btrfs.
>> Any partially-filled stripe is effectively read-only and the space within
>> it is inaccessible until all data within the stripe is overwritten,
>> deleted, or relocated by balance.
>>
>> btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
>> update, but that has a significant write magnification effect (and before
>> kernel 4.14, non-trivial CPU load as well).
>>
>> btrfs could also just allocate the full stripe to an extent, but emit
>> only extent ref items for the blocks that are in use.  No fragmentation
>> but lots of extra disk space used.  Also doesn't quite work the same
>> way for metadata pages.
>>
>> If btrfs adopted the ZFS approach, the extent allocator and all higher
>> layers of the filesystem would have to know about--and skip over--the
>> parity blocks embedded inside extents.  Making this change would mean
>> that some btrfs RAID profiles start interacting with stuff like balance
>> and compression which they currently do not.  It would create a new
>> block group type and require an incompatible on-disk format change for
>> both reads and writes.
> 
> I thought that a possible solution is to create BG with different number of data disks. E.g. supposing to have a raid 6 system with 6 disks, where 2 are parity disk; we should allocate 3 BG
> 
> BG #1: 1 data disk, 2 parity disks
> BG #2: 2 data disks, 2 parity disks,
> BG #3: 4 data disks, 2 parity disks
> 
> For simplicity, the disk-stripe length is assumed = 4K.
> 
> So If you have a write with a length of 4 KB, this should be placed in BG#1; if you have a write with a length of 4*3KB, the first 8KB, should be placed in in BG#2, then in BG#1.
> 
> This would avoid space wasting, even if the fragmentation will increase (but shall the fragmentation matters with the modern solid state disks ?).
Yes, fragmentation _does_ matter even with storage devices that have a 
uniform seek latency (such as SSD's), because less fragmentation means 
fewer I/O requests have to be made to load the same amount of data. 
Contrary to popular belief, uniform seek-time devices do still perform
better doing purely sequential I/O than random I/O because larger requests
can be made; the difference is just small enough that it only matters if
you're constantly using all the disk bandwidth.

Also, you're still going to be wasting space, it's just that less space 
will be wasted, and it will be wasted at the chunk level instead of the 
block level, which opens up a whole new set of issues to deal with, most 
significantly that it becomes functionally impossible without 
brute-force search techniques to determine when you will hit the 
common-case of -ENOSPC due to being unable to allocate a new chunk.
> 
> Time to time, a re-balance should be performed to empty the BG #1, and #2. Otherwise a new BG should be allocated.
> 
> The cost should be comparable to the logging/journaling (each data shorter than a full-stripe, has to be written two times); the implementation should be quite easy, because already NOW btrfs support BG with different set of disks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-02 15:49                         ` Austin S. Hemmelgarn
@ 2018-04-02 22:23                           ` Zygo Blaxell
  2018-04-03  0:31                             ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-02 22:23 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: kreijack, Chris Murphy, Christoph Anton Mitterer, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 4728 bytes --]

On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > On 04/02/2018 07:45 AM, Zygo Blaxell wrote:
> > [...]
> > > It is possible to combine writes from a single transaction into full
> > > RMW stripes, but this *does* have an impact on fragmentation in btrfs.
> > > Any partially-filled stripe is effectively read-only and the space within
> > > it is inaccessible until all data within the stripe is overwritten,
> > > deleted, or relocated by balance.
> > > 
> > > btrfs could do a mini-balance on one RAID stripe instead of a RMW stripe
> > > update, but that has a significant write magnification effect (and before
> > > kernel 4.14, non-trivial CPU load as well).
> > > 
> > > btrfs could also just allocate the full stripe to an extent, but emit
> > > only extent ref items for the blocks that are in use.  No fragmentation
> > > but lots of extra disk space used.  Also doesn't quite work the same
> > > way for metadata pages.
> > > 
> > > If btrfs adopted the ZFS approach, the extent allocator and all higher
> > > layers of the filesystem would have to know about--and skip over--the
> > > parity blocks embedded inside extents.  Making this change would mean
> > > that some btrfs RAID profiles start interacting with stuff like balance
> > > and compression which they currently do not.  It would create a new
> > > block group type and require an incompatible on-disk format change for
> > > both reads and writes.
> > 
> > I thought that a possible solution is to create BG with different
> number of data disks. E.g. supposing to have a raid 6 system with 6
> disks, where 2 are parity disk; we should allocate 3 BG
> > 
> > BG #1: 1 data disk, 2 parity disks
> > BG #2: 2 data disks, 2 parity disks,
> > BG #3: 4 data disks, 2 parity disks
> > 
> > For simplicity, the disk-stripe length is assumed = 4K.
> > 
> > So If you have a write with a length of 4 KB, this should be placed
> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> should be placed in in BG#2, then in BG#1.
> > 
> > This would avoid space wasting, even if the fragmentation will
> increase (but shall the fragmentation matters with the modern solid
> state disks ?).

I don't really see why this would increase fragmentation or waste space.
The extent size is determined before allocation anyway, all that changes
in this proposal is where those small extents ultimately land on the disk.

If anything, it might _reduce_ fragmentation since everything in BG #1
and BG #2 will be of uniform size.

It does solve write hole (one transaction per RAID stripe).

> Also, you're still going to be wasting space, it's just that less space will
> be wasted, and it will be wasted at the chunk level instead of the block
> level, which opens up a whole new set of issues to deal with, most
> significantly that it becomes functionally impossible without brute-force
> search techniques to determine when you will hit the common-case of -ENOSPC
> due to being unable to allocate a new chunk.

Hopefully the allocator only keeps one of each size of small block groups
around at a time.  The allocator can take significant short cuts because
the size of every extent in the small block groups is known (they are
all the same size by definition).

When a small block group fills up, the next one should occupy the
most-empty subset of disks--which is the opposite of the usual RAID5/6
allocation policy.  This will probably lead to "interesting" imbalances
since there are now two allocators on the filesystem with different goals
(though it is no worse than -draid5 -mraid1, and I had no problems with
free space when I was running that).

There will be an increase in the amount of allocated but not usable space,
though, because now the amount of free space depends on how much data
is batched up before fsync() or sync().  Probably best to just not count
any space in the small block groups as 'free' in statvfs terms at all.

There are a lot of variables implied there.  Without running some
simulations I have no idea if this is a good idea or not.

> > Time to time, a re-balance should be performed to empty the BG #1,
> and #2. Otherwise a new BG should be allocated.

That shouldn't be _necessary_ (the filesystem should just allocate
whatever BGs it needs), though it will improve storage efficiency if it
is done.

> > The cost should be comparable to the logging/journaling (each
> data shorter than a full-stripe, has to be written two times); the
> implementation should be quite easy, because already NOW btrfs support
> BG with different set of disks.


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-02 22:23                           ` Zygo Blaxell
@ 2018-04-03  0:31                             ` Zygo Blaxell
  2018-04-03 17:03                               ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-03  0:31 UTC (permalink / raw)
  To: Austin S. Hemmelgarn
  Cc: kreijack, Chris Murphy, Christoph Anton Mitterer, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 4135 bytes --]

On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> > On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > > I thought that a possible solution is to create BG with different
> > number of data disks. E.g. supposing to have a raid 6 system with 6
> > disks, where 2 are parity disk; we should allocate 3 BG
> > > 
> > > BG #1: 1 data disk, 2 parity disks
> > > BG #2: 2 data disks, 2 parity disks,
> > > BG #3: 4 data disks, 2 parity disks
> > > 
> > > For simplicity, the disk-stripe length is assumed = 4K.
> > > 
> > > So If you have a write with a length of 4 KB, this should be placed
> > in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> > should be placed in in BG#2, then in BG#1.
> > > 
> > > This would avoid space wasting, even if the fragmentation will
> > increase (but shall the fragmentation matters with the modern solid
> > state disks ?).
> 
> I don't really see why this would increase fragmentation or waste space.

Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
remaining 2 blocks).  It also flips the usual order of "determine size
of extent, then allocate space for it" which might require major surgery
on the btrfs allocator to implement.

If we round that write up to 8 blocks (so we can put both pieces in
BG #3), it degenerates into the "pretend partially filled RAID stripes
are completely full" case, something like what ssd_spread already does.
That trades less file fragmentation for more free space fragmentation.

> The extent size is determined before allocation anyway, all that changes
> in this proposal is where those small extents ultimately land on the disk.
> 
> If anything, it might _reduce_ fragmentation since everything in BG #1
> and BG #2 will be of uniform size.
> 
> It does solve write hole (one transaction per RAID stripe).
> 
> > Also, you're still going to be wasting space, it's just that less space will
> > be wasted, and it will be wasted at the chunk level instead of the block
> > level, which opens up a whole new set of issues to deal with, most
> > significantly that it becomes functionally impossible without brute-force
> > search techniques to determine when you will hit the common-case of -ENOSPC
> > due to being unable to allocate a new chunk.
> 
> Hopefully the allocator only keeps one of each size of small block groups
> around at a time.  The allocator can take significant short cuts because
> the size of every extent in the small block groups is known (they are
> all the same size by definition).
> 
> When a small block group fills up, the next one should occupy the
> most-empty subset of disks--which is the opposite of the usual RAID5/6
> allocation policy.  This will probably lead to "interesting" imbalances
> since there are now two allocators on the filesystem with different goals
> (though it is no worse than -draid5 -mraid1, and I had no problems with
> free space when I was running that).
> 
> There will be an increase in the amount of allocated but not usable space,
> though, because now the amount of free space depends on how much data
> is batched up before fsync() or sync().  Probably best to just not count
> any space in the small block groups as 'free' in statvfs terms at all.
> 
> There are a lot of variables implied there.  Without running some
> simulations I have no idea if this is a good idea or not.
> 
> > > Time to time, a re-balance should be performed to empty the BG #1,
> > and #2. Otherwise a new BG should be allocated.
> 
> That shouldn't be _necessary_ (the filesystem should just allocate
> whatever BGs it needs), though it will improve storage efficiency if it
> is done.
> 
> > > The cost should be comparable to the logging/journaling (each
> > data shorter than a full-stripe, has to be written two times); the
> > implementation should be quite easy, because already NOW btrfs support
> > BG with different set of disks.
> 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-03  0:31                             ` Zygo Blaxell
@ 2018-04-03 17:03                               ` Goffredo Baroncelli
  2018-04-03 22:57                                 ` Zygo Blaxell
  2018-04-04  3:08                                 ` Chris Murphy
  0 siblings, 2 replies; 32+ messages in thread
From: Goffredo Baroncelli @ 2018-04-03 17:03 UTC (permalink / raw)
  To: Zygo Blaxell, Austin S. Hemmelgarn
  Cc: Chris Murphy, Christoph Anton Mitterer, Btrfs BTRFS

On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
>>>> I thought that a possible solution is to create BG with different
>>> number of data disks. E.g. supposing to have a raid 6 system with 6
>>> disks, where 2 are parity disk; we should allocate 3 BG
>>>> BG #1: 1 data disk, 2 parity disks
>>>> BG #2: 2 data disks, 2 parity disks,
>>>> BG #3: 4 data disks, 2 parity disks
>>>>
>>>> For simplicity, the disk-stripe length is assumed = 4K.
>>>>
>>>> So If you have a write with a length of 4 KB, this should be placed
>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
>>> should be placed in in BG#2, then in BG#1.
>>>> This would avoid space wasting, even if the fragmentation will
>>> increase (but shall the fragmentation matters with the modern solid
>>> state disks ?).
>> I don't really see why this would increase fragmentation or waste space.

> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> remaining 2 blocks).  It also flips the usual order of "determine size
> of extent, then allocate space for it" which might require major surgery
> on the btrfs allocator to implement.

I have to point out that in any case the extent is physically interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, the first half is written on the first disk, the other on the 2nd disk.  If you want to write 96KB, the first 64KB are written on the first disk, the last part on the 2nd, only on a different BG.
So yes, there is fragmentation from a logical point of view; from a physical point of view the data is spread across the disks in any case.

In any case, you are right, we should gather some data, because the performance impact is not so clear.

I am not worried about having different BGs; we have problems with these because we never developed tools to handle this issue properly (i.e. a daemon which starts a balance when needed). But I hope that this will be solved in the future.

In any case, all the solutions proposed have their trade-offs:

- a) as is: write hole bug
- b) variable stripe size (like ZFS): big impact on how btrfs handles the extent; limited waste of space
- c) logging data before writing: we write the data two times in a short time window. Moreover, the log area is written several orders of magnitude more than the other area; there was some patch around
- d) rounding the writing to the stripe size: waste of space; simple to implement
- e) different BG with different stripe size: limited waste of space; logical fragmentation.


* c), d), e) are applied only for the tail of the extent, in case the size is less than the stripe size.
* for b), d), e), the wasting of space may be reduced with a balance

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-03 17:03                               ` Goffredo Baroncelli
@ 2018-04-03 22:57                                 ` Zygo Blaxell
  2018-04-04  5:15                                   ` Goffredo Baroncelli
  2018-04-04  3:08                                 ` Chris Murphy
  1 sibling, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-03 22:57 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: Austin S. Hemmelgarn, Chris Murphy, Christoph Anton Mitterer,
	Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 6421 bytes --]

On Tue, Apr 03, 2018 at 07:03:06PM +0200, Goffredo Baroncelli wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> > On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>> I thought that a possible solution is to create BG with different
> >>> number of data disks. E.g. supposing to have a raid 6 system with 6
> >>> disks, where 2 are parity disk; we should allocate 3 BG
> >>>> BG #1: 1 data disk, 2 parity disks
> >>>> BG #2: 2 data disks, 2 parity disks,
> >>>> BG #3: 4 data disks, 2 parity disks
> >>>>
> >>>> For simplicity, the disk-stripe length is assumed = 4K.
> >>>>
> >>>> So If you have a write with a length of 4 KB, this should be placed
> >>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> >>> should be placed in in BG#2, then in BG#1.
> >>>> This would avoid space wasting, even if the fragmentation will
> >>> increase (but shall the fragmentation matters with the modern solid
> >>> state disks ?).
> >> I don't really see why this would increase fragmentation or waste space.
> 
> > Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> > to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> > remaining 2 blocks).  It also flips the usual order of "determine size
> > of extent, then allocate space for it" which might require major surgery
> > on the btrfs allocator to implement.
> 
> I have to point out that in any case the extent is physically
> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> you want to write 128KB, the first half is written in the first disk,
> the other in the 2nd disk.  If you want to write 96kb, the first 64
> are written in the first disk, the last part in the 2nd, only on a
> different BG.

The "only on a different BG" part implies something expensive, either
a seek or a new erase page depending on the hardware.  Without that,
nearby logical blocks are nearby physical blocks as well.

> So yes there is a fragmentation from a logical point of view; from a
> physical point of view the data is spread on the disks in any case.

What matters is the extent-tree point of view.  There is (currently)
no fragmentation there, even for RAID5/6.  The extent tree is unaware
of RAID5/6 (to its peril).

ZFS makes its thing-like-the-extent-tree aware of RAID5/6, and it can
put a stripe of any size anywhere.  If we're going to do that in btrfs,
you might as well just do what ZFS does.

OTOH, variable-size block groups give us read-compatibility with old
kernel versions (and write-compatibility for that matter--a kernel that
didn't know about the BG separation would just work but have write hole).

If an application does a loop writing 68K then fsync(), the multiple-BG
solution adds two seeks to read every 68K.  That's expensive if sequential
read bandwidth is more scarce than free space.

> In any case, you are right, we should gather some data, because the
> performance impact are no so clear.
> 
> I am not worried abut having different BG; we have problem with these
> because we never developed tool to handle this issue properly (i.e. a
> daemon which starts a balance when needed). But I hope that this will
> be solved in future.

Balance daemons are easy to the point of being trivial to write in Python.
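
For what it's worth, a minimal sketch of such a daemon (the mount point,
usage threshold and interval are made up; it just shells out to
btrfs-progs):

    #!/usr/bin/env python3
    import subprocess, time

    MOUNTPOINT = "/mnt/raid"     # hypothetical filesystem to keep balanced
    USAGE = "-dusage=20"         # only rewrite data block groups <20% used
    INTERVAL = 24 * 3600         # once a day

    while True:
        # btrfs prints its own errors; a failed run is simply retried later
        subprocess.run(["btrfs", "balance", "start", USAGE, MOUNTPOINT])
        time.sleep(INTERVAL)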

The balancing itself is quite expensive and invasive:  can't usefully
ionice it, can only abort it on block group boundaries, can't delete
snapshots while it's running.

If balance could be given a vrange that was the size of one extent...then
we could talk about daemons.

> In any case, the all solutions proposed have their trade off:
> 
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handle
> the extent. limited waste of space
> - c) logging data before writing: we wrote the data two times in a
> short time window. Moreover the log area is written several order of
> magnitude more than the other area; there was some patch around
> - d) rounding the writing to the stripe size: waste of space; simple
> to implement;
> - e) different BG with different stripe size: limited waste of space;
> logical fragmentation.

Also:

  - f) avoiding writes to partially filled stripes: free space
  fragmentation; simple to implement (ssd_spread does it accidentally)

The difference between d) and f) is that d) allocates the space to the
extent while f) leaves the space unallocated, but skips any free space
fragments smaller than the stripe size when allocating.  f) gets the
space back with a balance (i.e. it is exactly as space-efficient as (a)
after balance).

> * c),d),e) are applied only for the tail of the extent, in case the
> size is less than the stripe size.

It's only necessary to split an extent if there are no other writes
in the same transaction that could be combined with the extent tail
into a single RAID stripe.  As long as everything in the RAID stripe
belongs to a single transaction, there is no write hole.

> * for b),d), e), the wasting of space may be reduced with a balance 

Not for d.  Balance doesn't know how to get rid of unreachable blocks
in extents (it just moves the entire extent around) so after a balance
the writes would still be rounded up to the stripe size.  Balance would
never be able to free the rounded-up space.  That space would just be
gone until the file was overwritten, deleted, or defragged.

Possibly not for b either, for the same reason.  Defrag is the existing
btrfs tool to fix extents with unused space attached to them.  Or some
new thing designed explicitly to handle these cases.

And also not for e, but it's a little different there.  In e the wasted
space is the extra metadata extent refs due to discontiguous extent
allocation.  You don't get that back with balance, you need defrag
here too.  e) also effectively can't claim that unused space in BG's is
"free" since there are non-trivial restrictions on whether it can be
allocated for any given write.  So even if you have free space, 'df'
has to tell you that you don't.

> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-03 17:03                               ` Goffredo Baroncelli
  2018-04-03 22:57                                 ` Zygo Blaxell
@ 2018-04-04  3:08                                 ` Chris Murphy
  2018-04-04  6:20                                   ` Zygo Blaxell
  1 sibling, 1 reply; 32+ messages in thread
From: Chris Murphy @ 2018-04-04  3:08 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: Zygo Blaxell, Austin S. Hemmelgarn, Chris Murphy,
	Christoph Anton Mitterer, Btrfs BTRFS

On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
>> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
>>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
>>>>> I thought that a possible solution is to create BG with different
>>>> number of data disks. E.g. supposing to have a raid 6 system with 6
>>>> disks, where 2 are parity disk; we should allocate 3 BG
>>>>> BG #1: 1 data disk, 2 parity disks
>>>>> BG #2: 2 data disks, 2 parity disks,
>>>>> BG #3: 4 data disks, 2 parity disks
>>>>>
>>>>> For simplicity, the disk-stripe length is assumed = 4K.
>>>>>
>>>>> So If you have a write with a length of 4 KB, this should be placed
>>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
>>>> should be placed in in BG#2, then in BG#1.
>>>>> This would avoid space wasting, even if the fragmentation will
>>>> increase (but shall the fragmentation matters with the modern solid
>>>> state disks ?).
>>> I don't really see why this would increase fragmentation or waste space.
>
>> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
>> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
>> remaining 2 blocks).  It also flips the usual order of "determine size
>> of extent, then allocate space for it" which might require major surgery
>> on the btrfs allocator to implement.
>
> I have to point out that in any case the extent is physically interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, the first half is written in the first disk, the other in the 2nd disk.  If you want to write 96kb, the first 64 are written in the first disk, the last part in the 2nd, only on a different BG.
> So yes there is a fragmentation from a logical point of view; from a physical point of view the data is spread on the disks in any case.
>
> In any case, you are right, we should gather some data, because the performance impact are no so clear.

They're pretty clear, and there's a lot written about small file size
and parity raid performance being shit, no matter the implementation
(md, ZFS, Btrfs, hardware maybe less so just because of all the
caching and extra processing hardware that's dedicated to the task).

The linux-raid@ list is full of optimizations for this that are use
case specific. One of those that often comes up is how badly suited
raid56 is for e.g. mail servers, tons of small file reads and writes,
and all the disk contention that comes up, and it's even worse when
you lose a disk, and even if you're running raid6 and lose two disks
it's really god awful. It can unexpectedly be a disqualifying setup
without prior testing in that condition: can your workload really be
usable for two or three days in a double degraded state on that raid6?
*shrug*

Parity raid is well suited for full stripe reads and writes, lots of
sequential writes. Ergo a small file is anything less than a full
stripe write. Of course, delayed allocation can end up making for more
full stripe writes. But now you have more RMW which is the real
performance killer, again no matter the raid.


>
> I am not worried abut having different BG; we have problem with these because we never developed tool to handle this issue properly (i.e. a daemon which starts a balance when needed). But I hope that this will be solved in future.
>
> In any case, the all solutions proposed have their trade off:
>
> - a) as is: write hole bug
> - b) variable stripe size (like ZFS): big impact on how btrfs handle the extent. limited waste of space
> - c) logging data before writing: we wrote the data two times in a short time window. Moreover the log area is written several order of magnitude more than the other area; there was some patch around
> - d) rounding the writing to the stripe size: waste of space; simple to implement;
> - e) different BG with different stripe size: limited waste of space; logical fragmentation.

I'd say for sure you're worse off with metadata raid5 vs metadata
raid1. And if there are many devices you might be better off with
metadata raid1 even on a raid6: it's not an absolute certainty that you
lose the file system on a 2nd drive failure - it depends on which
devices fail and what chunk copies happen to be on them. But at the
least, if you have a script or some warning, you can relatively easily
rebalance ... HMMM

Actually that should be a test: single-drive-degraded raid6 with
metadata raid1 - can you do a metadata-only balance to force the
missing copy of the metadata to be replicated again? In theory this
should be quite fast.
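For what it's worth, the core of that test could be scripted like this
(a rough sketch; the mountpoint is made up, and it assumes the
filesystem is data raid6 / metadata raid1 and already mounted with
-o degraded):

import subprocess

MNT = "/mnt/degraded-raid6"   # hypothetical mountpoint of the degraded fs

def rebalance_metadata(mnt: str) -> None:
    # 'btrfs balance start -m <mnt>' restricts the balance to metadata
    # block groups; rewriting them should recreate the raid1 copies that
    # lived on the missing device, on the surviving devices.
    subprocess.run(["btrfs", "balance", "start", "-m", mnt], check=True)

rebalance_metadata(MNT)

Afterwards 'btrfs device usage' should show no metadata still allocated
on the missing device, and timing the balance answers the "quite fast?"
question.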



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-03 22:57                                 ` Zygo Blaxell
@ 2018-04-04  5:15                                   ` Goffredo Baroncelli
  2018-04-04  6:01                                     ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2018-04-04  5:15 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Austin S. Hemmelgarn, Chris Murphy, Christoph Anton Mitterer,
	Btrfs BTRFS

On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
>> I have to point out that in any case the extent is physically
>> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
>> you want to write 128KB, the first half is written in the first disk,
>> the other in the 2nd disk.  If you want to write 96kb, the first 64
>> are written in the first disk, the last part in the 2nd, only on a
>> different BG.
> The "only on a different BG" part implies something expensive, either
> a seek or a new erase page depending on the hardware.  Without that,
> nearby logical blocks are nearby physical blocks as well.

In any case it happens on a different disk

> 
>> So yes there is a fragmentation from a logical point of view; from a
>> physical point of view the data is spread on the disks in any case.

> What matters is the extent-tree point of view.  There is (currently)
> no fragmentation there, even for RAID5/6.  The extent tree is unaware
> of RAID5/6 (to its peril).

Earlier you pointed out that the non-contiguous blocks written have an impact on performance. My reply is that the switch to a different BG happens at the disk-stripe boundary, so the write is physically interrupted and moved to another disk in any case.

However, yes: from an extent-tree point of view there will be an increase in the number of extents, because the tail of the write is allocated to another BG (if the size is not stripe-aligned).

> If an application does a loop writing 68K then fsync(), the multiple-BG
> solution adds two seeks to read every 68K.  That's expensive if sequential
> read bandwidth is more scarce than free space.

Why do you talk about additional seeks? In any case (even without the additional BG) the read happens from another disk.

>> * c),d),e) are applied only for the tail of the extent, in case the
> size is less than the stripe size.
> 
> It's only necessary to split an extent if there are no other writes
> in the same transaction that could be combined with the extent tail
> into a single RAID stripe.  As long as everything in the RAID stripe
> belongs to a single transaction, there is no write hole

Maybe a simpler optimization would be to close the transaction when the data reaches the stripe boundary... But I suspect that it is not so simple to implement.

> Not for d.  Balance doesn't know how to get rid of unreachable blocks
> in extents (it just moves the entire extent around) so after a balance
> the writes would still be rounded up to the stripe size.  Balance would
> never be able to free the rounded-up space.  That space would just be
> gone until the file was overwritten, deleted, or defragged.

If balance is capable of moving the extent, why not place one near the other during a balance? The goal is not to limit writing the end of an extent, but to avoid writing the end of an extent without further data (i.e. the gap up to the stripe boundary has to be filled in the same transaction).

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-04  5:15                                   ` Goffredo Baroncelli
@ 2018-04-04  6:01                                     ` Zygo Blaxell
  2018-04-04 21:31                                       ` Goffredo Baroncelli
  0 siblings, 1 reply; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-04  6:01 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: Austin S. Hemmelgarn, Chris Murphy, Christoph Anton Mitterer,
	Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 4958 bytes --]

On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written in the first disk,
> >> the other in the 2nd disk.  If you want to write 96kb, the first 64
> >> are written in the first disk, the last part in the 2nd, only on a
> >> different BG.
> > The "only on a different BG" part implies something expensive, either
> > a seek or a new erase page depending on the hardware.  Without that,
> > nearby logical blocks are nearby physical blocks as well.
> 
> In any case it happens on a different disk

No it doesn't.  The small-BG could be on the same disk(s) as the big-BG.

> >> So yes there is a fragmentation from a logical point of view; from a
> >> physical point of view the data is spread on the disks in any case.
> 
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is unaware
> > of RAID5/6 (to its peril).
> 
> Before you pointed out that the non-contiguous block written has
> an impact on performance. I am replaying  that the switching from a
> different BG happens at the stripe-disk boundary, so in any case the
> block is physically interrupted and switched to another disk

The difference is that the write is switched to a different local address
on the disk.

It's not "another" disk if it's a different BG.  Recall in this plan
there is a full-width BG that is on _every_ disk, which means every
small-width BG shares a disk with the full-width BG.  Every extent tail
write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is stripe-width minus one will hit
N - 1 disks twice in an N-disk array.

> However yes: from an extent-tree point of view there will be an increase
> of number extents, because the end of the writing is allocated to
> another BG (if the size is not stripe-boundary)
> 
> > If an application does a loop writing 68K then fsync(), the multiple-BG
> > solution adds two seeks to read every 68K.  That's expensive if sequential
> > read bandwidth is more scarce than free space.
> 
> Why you talk about an additional seeks? In any case (even without the
> additional BG) the read happens from another disks

See above:  not another disk, usually a different location on two or
more of the same disks.

> >> * c),d),e) are applied only for the tail of the extent, in case the
> > size is less than the stripe size.
> > 
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID stripe
> > belongs to a single transaction, there is no write hole
> 
> May be that a more "simpler" optimization would be close the transaction
> when the data reach the stripe boundary... But I suspect that it is
> not so simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous extents
already.  The trick is to _not_ do that when one transaction ends and
the next begins, i.e. leave a space at the end of the partially-filled
stripe so that the next transaction begins in an empty stripe.

This does mean that there will only be extra seeks during transaction
commit and fsync()--which were already very seeky to begin with.  It's
not necessary to write a partial stripe when there are other extents to
combine.

So there will be double the amount of seeking, but depending on the
workload, it could double a very small percentage of writes.
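In other words the allocator only needs to pad at commit/fsync time.
A toy sketch of that rule (the names and geometry are mine, this is
not the btrfs allocator):

FULL_STRIPE = 64 * 1024 * 4   # assumed: 64 KiB strip, 4 data disks

def cursor_for_next_transaction(cursor: int) -> int:
    """At commit, skip to the next full-stripe boundary so the next
    transaction never shares (and never RMWs) a stripe with this one."""
    rem = cursor % FULL_STRIPE
    return cursor if rem == 0 else cursor + (FULL_STRIPE - rem)

print(cursor_for_next_transaction(5 * 64 * 1024))  # mid-stripe -> skip ahead
print(cursor_for_next_transaction(8 * 64 * 1024))  # aligned -> unchanged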

> > Not for d.  Balance doesn't know how to get rid of unreachable blocks
> > in extents (it just moves the entire extent around) so after a balance
> > the writes would still be rounded up to the stripe size.  Balance would
> > never be able to free the rounded-up space.  That space would just be
> > gone until the file was overwritten, deleted, or defragged.
> 
> If balance is capable to move the extent, why not place one near the
> other during a balance ? The goal is not to limit the the writing of
> the end of a extent, but avoid writing the end of an extent without
> further data (e.g. the gap to the stripe has to be filled in the
> same transaction)

That's plan f (leave gaps in RAID stripes empty).  Balance will repack
short extents into RAID stripes nicely.

Plan d can't do that because plan d overallocates the extent so that
the extent fills the stripe (only some of the extent is used for data).
Small but important difference.

> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-04  3:08                                 ` Chris Murphy
@ 2018-04-04  6:20                                   ` Zygo Blaxell
  0 siblings, 0 replies; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-04  6:20 UTC (permalink / raw)
  To: Chris Murphy
  Cc: Goffredo Baroncelli, Austin S. Hemmelgarn,
	Christoph Anton Mitterer, Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 5964 bytes --]

On Tue, Apr 03, 2018 at 09:08:01PM -0600, Chris Murphy wrote:
> On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli <kreijack@inwind.it> wrote:
> > On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
> >>>> On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> >>>>> I thought that a possible solution is to create BG with different
> >>>> number of data disks. E.g. supposing to have a raid 6 system with 6
> >>>> disks, where 2 are parity disk; we should allocate 3 BG
> >>>>> BG #1: 1 data disk, 2 parity disks
> >>>>> BG #2: 2 data disks, 2 parity disks,
> >>>>> BG #3: 4 data disks, 2 parity disks
> >>>>>
> >>>>> For simplicity, the disk-stripe length is assumed = 4K.
> >>>>>
> >>>>> So If you have a write with a length of 4 KB, this should be placed
> >>>> in BG#1; if you have a write with a length of 4*3KB, the first 8KB,
> >>>> should be placed in in BG#2, then in BG#1.
> >>>>> This would avoid space wasting, even if the fragmentation will
> >>>> increase (but shall the fragmentation matters with the modern solid
> >>>> state disks ?).
> >>> I don't really see why this would increase fragmentation or waste space.
> >
> >> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> >> remaining 2 blocks).  It also flips the usual order of "determine size
> >> of extent, then allocate space for it" which might require major surgery
> >> on the btrfs allocator to implement.
> >
> > I have to point out that in any case the extent is physically interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 128KB, the first half is written in the first disk, the other in the 2nd disk.  If you want to write 96kb, the first 64 are written in the first disk, the last part in the 2nd, only on a different BG.
> > So yes there is a fragmentation from a logical point of view; from a physical point of view the data is spread on the disks in any case.
> >
> > In any case, you are right, we should gather some data, because the performance impact are no so clear.
> 
> They're pretty clear, and there's a lot written about small file size
> and parity raid performance being shit, no matter the implementation
> (md, ZFS, Btrfs, hardware maybe less so just because of all the
> caching and extra processing hardware that's dedicated to the task).

Pretty much everything goes fast if you put a faster non-volatile cache
in front of it.

> The linux-raid@ list is full of optimizations for this that are use
> case specific. One of those that often comes up is how badly suited
> raid56 are for e.g. mail servers, tons of small file reads and writes,
> and all the disk contention that comes up, and it's even worse when
> you lose a disk, and even if you're running raid 6 and lose two disk
> it's really god awful. It can be unexpectedly a disqualifying setup
> without prior testing in that condition: can your workload really be
> usable for two or three days in a double degraded state on that raid6?
> *shrug*
> 
> Parity raid is well suited for full stripe reads and writes, lots of
> sequential writes. Ergo a small file is anything less than a full
> stripe write. Of course, delayed allocation can end up making for more
> full stripe writes. But now you have more RMW which is the real
> performance killer, again no matter the raid.

RMW isn't necessary if you have properly configured COW on top.
ZFS doesn't do RMW at all.  OTOH for some workloads COW is a step in a
different wrong direction--the btrfs raid5 problems with nodatacow
files can be solved by stripe logging and nothing else.

Some equivalent of autodefrag that repacks your small RAID stripes
into bigger ones will burn 3x your write IOPS eventually--it just
lets you defer the inevitable until a hopefully more convenient time.
A continuously loaded server never has a more convenient time, so it
needs a different solution.

> > I am not worried abut having different BG; we have problem with these because we never developed tool to handle this issue properly (i.e. a daemon which starts a balance when needed). But I hope that this will be solved in future.
> >
> > In any case, the all solutions proposed have their trade off:
> >
> > - a) as is: write hole bug
> > - b) variable stripe size (like ZFS): big impact on how btrfs handle the extent. limited waste of space
> > - c) logging data before writing: we wrote the data two times in a short time window. Moreover the log area is written several order of magnitude more than the other area; there was some patch around
> > - d) rounding the writing to the stripe size: waste of space; simple to implement;
> > - e) different BG with different stripe size: limited waste of space; logical fragmentation.
> 
> I'd say for sure you're worse off with metadata raid5 vs metadata
> raid1. And if there are many devices you might be better off with
> metadata raid1 even on a raid6, it's not an absolute certainty you
> lose the file system with a 2nd drive failure - it depends on the
> device and what chunk copies happen to be on it. But at the least if
> you have a script or some warning you can relatively easily rebalance
> ... HMMM
> 
> Actually that should be a test. Single drive degraded raid6 with
> metadata raid1, can you do a metadata only balance to force the
> missing copy of metadata to be replicated again? In theory this should
> be quite fast.

I've done it, but it's not as fast as you might hope.  Metadata balances
95% slower than data, and seeks pretty hard (stressing the surviving
drives and sucking performance) while it does so.  Also you're likely to
have to fix or work around a couple of btrfs bugs while you do it.

> 
> 
> -- 
> Chris Murphy

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-04  6:01                                     ` Zygo Blaxell
@ 2018-04-04 21:31                                       ` Goffredo Baroncelli
  2018-04-04 22:38                                         ` Zygo Blaxell
  0 siblings, 1 reply; 32+ messages in thread
From: Goffredo Baroncelli @ 2018-04-04 21:31 UTC (permalink / raw)
  To: Zygo Blaxell
  Cc: Austin S. Hemmelgarn, Chris Murphy, Christoph Anton Mitterer,
	Btrfs BTRFS

On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
>> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
[...]
>> Before you pointed out that the non-contiguous block written has
>> an impact on performance. I am replaying  that the switching from a
>> different BG happens at the stripe-disk boundary, so in any case the
>> block is physically interrupted and switched to another disk
> 
> The difference is that the write is switched to a different local address
> on the disk.
> 
> It's not "another" disk if it's a different BG.  Recall in this plan
> there is a full-width BG that is on _every_ disk, which means every
> small-width BG shares a disk with the full-width BG.  Every extent tail
> write requires a seek on a minimum of two disks in the array for raid5,
> three disks for raid6.  A tail that is strip-width minus one will hit
> N - 1 disks twice in an N-disk array.

Below I made a little simulation; my results tell me another thing:

Current BTRFS (w/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case A.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks (3data + 2 parity)

Case A.2.2): extent size = 256kb: (optimistic case: contiguous space available)
5 writes of 64kb spread on 5 disks (3 data + 2 parity)
2 reads of 64 kb spread on 2 disks (two old data of the stripe) [**]
3 writes of 64 kb spread on 3 disks (data + 2 parity)

Note that the two reads are contiguous to the 5 writes both in terms of space and time. The three writes are contiguous only in terms of space, not in terms of time, because they can happen only after the 2 reads and the consequent parity computations. So we should consider that between these two events some other disk activity happens; this means seeks between the 2 reads and the 3 writes.


BTRFS with multiple BG (wo/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case B.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks

Case B.2): extent size = 256kb:
5 writes of 64kb spread on 5 disks in BG#1
3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)

So if I count correctly:
- case B1 vs A1: these are equivalent
- case B2 vs A2.1/A2.2:
	8 writes vs 8 writes
	3 seeks vs 3 seeks
	0 reads vs 2 reads

So to me it seems that the cost of doing an RMW cycle is worse than the cost of seeking to another BG.
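For reference, the same counting as a small Python sketch (same assumed
geometry: 5-disk raid6, 64 KiB disk stripe, 3 data + 2 parity; the
reconstruct-write reads match the two reads in case A.2.2):

STRIP = 64                  # KiB per disk
N_DATA, N_PARITY = 3, 2

def io_current_btrfs(extent_kib):
    """RMW the tail stripe, as btrfs does today (case A)."""
    full, tail = divmod(extent_kib, STRIP * N_DATA)
    reads, writes = 0, full * (N_DATA + N_PARITY)
    if tail:
        touched = -(-tail // STRIP)       # ceil(tail / strip)
        reads += N_DATA - touched         # old data strips of the stripe
        writes += touched + N_PARITY
    return reads, writes

def io_multi_bg(extent_kib):
    """Put the tail in a narrower BG instead of doing RMW (case B)."""
    full, tail = divmod(extent_kib, STRIP * N_DATA)
    writes = full * (N_DATA + N_PARITY)
    if tail:
        writes += -(-tail // STRIP) + N_PARITY
    return 0, writes

for size in (192, 256):
    print(size, "KiB:", io_current_btrfs(size), "vs", io_multi_bg(size))
# 192 KiB: (0, 5) vs (0, 5)   -- A.1 and B.1 are equivalent
# 256 KiB: (2, 8) vs (0, 8)   -- same writes, two fewer reads without RMW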

Anyway I am reaching the conclusion, also thanks to this discussion, that this is not enough. Even if we solve the problem of the "extent smaller than stripe" write, we still face this issue again when part of a file is changed.
In this case the file update breaks the old extent and creates three extents: the first part, the new part, the last part. Up to that point everything is OK. However, the "old" part of the file would be marked as free space, and reusing that space could require an RMW cycle....

I am concluding that the only two reliable solutions are
a) variable stripe size (like ZFS does)
or b) logging the RMW cycle of a stripe


[**] Does someone know whether the checksums are checked during this read?
[...]
 
BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Status of RAID5/6
  2018-04-04 21:31                                       ` Goffredo Baroncelli
@ 2018-04-04 22:38                                         ` Zygo Blaxell
  0 siblings, 0 replies; 32+ messages in thread
From: Zygo Blaxell @ 2018-04-04 22:38 UTC (permalink / raw)
  To: Goffredo Baroncelli
  Cc: Austin S. Hemmelgarn, Chris Murphy, Christoph Anton Mitterer,
	Btrfs BTRFS

[-- Attachment #1: Type: text/plain, Size: 4894 bytes --]

On Wed, Apr 04, 2018 at 11:31:33PM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> > On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> >> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> [...]
> >> Before you pointed out that the non-contiguous block written has
> >> an impact on performance. I am replaying  that the switching from a
> >> different BG happens at the stripe-disk boundary, so in any case the
> >> block is physically interrupted and switched to another disk
> > 
> > The difference is that the write is switched to a different local address
> > on the disk.
> > 
> > It's not "another" disk if it's a different BG.  Recall in this plan
> > there is a full-width BG that is on _every_ disk, which means every
> > small-width BG shares a disk with the full-width BG.  Every extent tail
> > write requires a seek on a minimum of two disks in the array for raid5,
> > three disks for raid6.  A tail that is strip-width minus one will hit
> > N - 1 disks twice in an N-disk array.
> 
> Below I made a little simulation; my results telling me another thing:
> 
> Current BTRFS (w/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case A.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks (3data + 2 parity)
> 
> Case A.2.2): extent size = 256kb: (optimistic case: contiguous space available)
> 5 writes of 64kb spread on 5 disks (3 data + 2 parity)
> 2 reads of 64 kb spread on 2 disks (two old data of the stripe) [**]
> 3 writes of 64 kb spread on 3 disks (data + 2 parity)
> 
> Note that the two reads are contiguous to the 5 writes both in term of
> space and time. The three writes are contiguous only in terms of space,
> but not in terms of time, because these could happen only after the 2
> reads and the consequent parities computations. So we should consider
> that between these two events, some disk activities happen; this means
> seeks between the 2 reads and the 3 writes
> 
> 
> BTRFS with multiple BG (wo/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case B.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks
> 
> Case B.2): extent size = 256kb:
> 5 writes of 64kb spread on 5 disks in BG#1
> 3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)
> 
> So if I count correctly:
> - case B1 vs A1: these are equivalent
> - case B2 vs A2.1/A2.2:
> 	8 writes vs 8 writes
> 	3 seeks vs 3 seeks
> 	0 reads vs 2 reads
> 
> So to me it seems that the cost of doing a RMW cycle is worse than
> seeking to another BG.

Well, RMW cycles are dangerous, so being slow as well is just a second
reason never to do them.

> Anyway I am reaching the conclusion, also thanks of this discussion,
> that this is not enough. Even if we had solve the problem of the
> "extent smaller than stripe" write, we still face gain this issue when
> part of the file is changed.
> In this case the file update breaks the old extent and will create a
> three extents: the first part, the new part, the last part. Until that
> everything is OK. However the "old" part of the file would be marked
> as free space. But using this part could require a RMW cycle....

You cannot use that free space within RAID stripes because it would
require RMW, and RMW causes the write hole.  The space would have to be
kept unavailable until the rest of the RAID stripe was deleted.

OTOH, if you can solve that free space management problem, you don't
have to do anything else to solve the write hole.  If you never RMW then
you never have the write hole in the first place.
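Put differently, a stripe-aware free-space policy would only hand out
holes from stripes that are entirely free. A toy sketch of the rule
(my own names, nothing like the real free-space cache):

def usable_free_blocks(block_used):
    """Free space in one RAID stripe under a no-RMW rule: either the
    whole stripe is free, or none of its holes may be reused."""
    if any(block_used):
        return 0                    # live data present: reuse would mean RMW
    return len(block_used)          # empty stripe: every block is fair game

print(usable_free_blocks([False, False, False]))   # 3 -> whole stripe reusable
print(usable_free_blocks([True, False, False]))    # 0 -> holes stay off-limits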

> I am concluding that the only two reliable solution are 
> a) variable stripe size (like ZFS does) 
> or b) logging the RMW cycle of a stripe 

Those are the only solutions that don't require a special process for
reclaiming unused space in RAID stripes.  If you have that, you have a
few more options; however, they all involve making a second copy of the
data at a later time (as opposed to option b, which makes a second
copy of the data during the original write).

a) also doesn't support nodatacow files (AFAIK ZFS doesn't have those)
and it would require defrag to get the inefficiently used space back.

b) is the best of the terrible options.  It minimizes the impact on the
rest of the filesystem since it can fix RMW inconsistency without having
to eliminate the RMW cases.  It doesn't require rewriting the allocator
nor does it require users to run defrag or balance periodically.

> [**] Does someone know if the checksum are checked during this read ?
> [...]
>  
> BR
> G.Baroncelli
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-04-04 22:38 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-21 16:50 Status of RAID5/6 Menion
2018-03-21 17:24 ` Liu Bo
2018-03-21 20:02   ` Christoph Anton Mitterer
2018-03-22 12:01     ` Austin S. Hemmelgarn
2018-03-29 21:50     ` Zygo Blaxell
2018-03-30  7:21       ` Menion
2018-03-31  4:53         ` Zygo Blaxell
2018-03-30 16:14       ` Goffredo Baroncelli
2018-03-31  5:03         ` Zygo Blaxell
2018-03-31  6:57           ` Goffredo Baroncelli
2018-03-31  7:43             ` Zygo Blaxell
2018-03-31  8:16               ` Goffredo Baroncelli
     [not found]                 ` <28a574db-0f74-b12c-ab5f-400205fd80c8@gmail.com>
2018-03-31 14:40                   ` Zygo Blaxell
2018-03-31 22:34             ` Chris Murphy
2018-04-01  3:45               ` Zygo Blaxell
2018-04-01 20:51                 ` Chris Murphy
2018-04-01 21:11                   ` Chris Murphy
2018-04-02  5:45                     ` Zygo Blaxell
2018-04-02 15:18                       ` Goffredo Baroncelli
2018-04-02 15:49                         ` Austin S. Hemmelgarn
2018-04-02 22:23                           ` Zygo Blaxell
2018-04-03  0:31                             ` Zygo Blaxell
2018-04-03 17:03                               ` Goffredo Baroncelli
2018-04-03 22:57                                 ` Zygo Blaxell
2018-04-04  5:15                                   ` Goffredo Baroncelli
2018-04-04  6:01                                     ` Zygo Blaxell
2018-04-04 21:31                                       ` Goffredo Baroncelli
2018-04-04 22:38                                         ` Zygo Blaxell
2018-04-04  3:08                                 ` Chris Murphy
2018-04-04  6:20                                   ` Zygo Blaxell
2018-03-21 20:27   ` Menion
2018-03-22 21:13   ` waxhead
