linux-btrfs.vger.kernel.org archive mirror
* raid 5/6 - is this implemented?
@ 2020-11-06 19:41 Hendrik Friedel
  2020-11-07  1:43 ` Zygo Blaxell
  0 siblings, 1 reply; 3+ messages in thread
From: Hendrik Friedel @ 2020-11-06 19:41 UTC (permalink / raw)
  To: Btrfs BTRFS

Hello,

I stumbled upon this:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg91938.html

<<This is why scrub after a crash or power loss with raid56 is important,
while the array is still whole (not degraded). The two problems with
that are:

a) the scrub isn't initiated automatically, nor is it obvious to the
user it's necessary
b) the scrub can take a long time; Btrfs has no partial scrubbing.

Whereas mdadm arrays offer a write-intent bitmap to know which blocks to
partially scrub, and to trigger it automatically following a crash or
power loss.

It seems Btrfs already has enough on-disk metadata to infer a
functional equivalent to the write intent bitmap, via transid. Just
scrub the last ~50 generations the next time it's mounted. Either do
this every time a Btrfs raid56 is mounted, or create some flag that
allows Btrfs to know if the filesystem was not cleanly shut down. >>

Has this been implemented in the meantime? If not: Are there any plans 
to?

Regards,
Hendrik



* Re: raid 5/6 - is this implemented?
  2020-11-06 19:41 raid 5/6 - is this implemented? Hendrik Friedel
@ 2020-11-07  1:43 ` Zygo Blaxell
  2020-11-11 17:06   ` Re[2]: " Hendrik Friedel
  0 siblings, 1 reply; 3+ messages in thread
From: Zygo Blaxell @ 2020-11-07  1:43 UTC (permalink / raw)
  To: Hendrik Friedel; +Cc: Btrfs BTRFS

On Fri, Nov 06, 2020 at 07:41:11PM +0000, Hendrik Friedel wrote:
> Hello,
> 
> I stumbled upon this:
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg91938.html
> 
> <<This is why scrub after a crash or power loss with raid56 is important,
> while the array is still whole (not degraded). The two problems with
> that are:
> 
> a) the scrub isn't initiated automatically, nor is it obvious to the
> user it's necessary

Properly implemented, raid56 should not require scrubs after an unclean
shutdown.  Notice that none of the other btrfs raid profiles require a
scrub after unclean shutdown (and even mdadm doesn't require one with a
journal device or PPL).  It's only btrfs raid56 that needs this workaround.

Scrub is a tool for finding drive failures (especially silent data
corruption) every month--not for working around filesystem bugs every reboot.

> b) the scrub can take a long time; Btrfs has no partial scrubbing.

btrfs does have partial scrubbing.  The userspace utilities use it to
implement the pause and resume features.  It should be easy to write a
partial scrubber in userspace with e.g. python-btrfs, at least at block
group resolution (maybe it can't scrub individual stripes).
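
As a rough illustration (this assumes the python-btrfs API: btrfs.FileSystem,
FileSystem.chunks() and FileSystem.block_group(); it is a sketch, not a
tested tool), the block group enumeration such a scrubber would start from
could look like the following.  Actually issuing a ranged scrub would still
need the scrub ioctl, which is not shown:

    # Sketch only: enumerate block groups and their usage via python-btrfs.
    # A partial scrubber would track these ranges and scrub them selectively.
    import sys
    import btrfs

    fs = btrfs.FileSystem(sys.argv[1])
    for chunk in fs.chunks():
        bg = fs.block_group(chunk.vaddr, chunk.length)
        print("vaddr={} length={} used={}".format(bg.vaddr, bg.length, bg.used))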

> Whereas mdadm arrays offer a write-intent bitmap to know which blocks to
> partially scrub, and to trigger it automatically following a crash or
> power loss.

A write-intent bitmap updates metadata on the disk _before_ data writes,
which is the opposite of the btrfs transaction mechanism (which writes
data first, then metadata).  btrfs would have to write something to the
disk that indicates which block groups will be touched, flush it to disk
_before_ any other writes to the filesystem occur, and have that portion
of the disk maintained separately from the rest of btrfs (something like
the free space cache).
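
To make the ordering concrete, here is a minimal sketch in generic terms
(plain file I/O with a hypothetical region numbering--none of this is a
real btrfs or mdadm interface): the intent record is made durable before
the data it covers, the reverse of btrfs's data-first commit ordering.

    # Hypothetical write-intent ordering, sketched with plain file I/O.
    import os

    def write_with_intent(bitmap_fd, data_fd, region, payload):
        # 1. Persist the intent first: mark the region dirty and flush.
        os.pwrite(bitmap_fd, b'\x01', region)
        os.fsync(bitmap_fd)
        # 2. Only then write the data.  A crash after this point leaves
        #    the bit set, telling recovery which region to resync.
        os.pwrite(data_fd, payload, region * 4096)
        os.fsync(data_fd)
        # 3. Clearing the bit can be lazy; a lost clear is harmless.
        os.pwrite(bitmap_fd, b'\x00', region)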

It's tricky to implement as a bitmap, since new block groups can be
created during a transaction.  It's not impossible, but it requires an
on-disk format change, and whoever is working on it could be spending
their effort better by fixing raid5/6 bugs to make hacks like write-intent
bitmaps unnecessary.

> It seems Btrfs already has enough on-disk metadata to infer a
> functional equivalent to the write intent bitmap, via transid. Just
> scrub the last ~50 generations the next time it's mounted. Either do
> this every time a Btrfs raid56 is mounted. Or create some flag that
> allows Btrfs to know if the filesystem was not cleanly shutdown. >>

It's not the last 50 generations that need to be scrubbed.  Any committed
transaction since mkfs can be affected by write hole.

What needs to be scrubbed is the set of raid5/6 stripes touched by the
last incomplete transaction before the crash.  Those are tricky to find,
because no record of the last transaction exists on the disk until after
the transaction is complete.

i.e. the only thing you can't find by looking at transaction history is
the one thing you need to be looking at here.

After an unclean shutdown, we could scrub all stripes that contain at
least one used and at least one free block--those are the only stripes
that can be corrupted by raid5/6 updates.  All other stripes cannot be
corrupted that way, because completely full stripes cannot be updated,
and completely empty stripes don't contain any data to corrupt.

The portion of the disk that falls into this category depends on average
file size and overall free space fragmentation--it could be less than 1%,
or more than 90%, of the disk.  And it's another time-wasting hack, or
an exercise using python-btrfs in userspace.
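
As a hedged sketch of that heuristic (again assuming the python-btrfs API
and the flag constants in btrfs.ctree; it works at block group granularity
only--per-stripe accounting would need an extent tree walk on top of it):

    # Sketch only: select partially-filled raid5/6 block groups as the
    # scrub candidates described above.  Assumes python-btrfs.
    import sys
    import btrfs

    RAID56 = btrfs.ctree.BLOCK_GROUP_RAID5 | btrfs.ctree.BLOCK_GROUP_RAID6

    fs = btrfs.FileSystem(sys.argv[1])
    for chunk in fs.chunks():
        if not chunk.type & RAID56:
            continue
        bg = fs.block_group(chunk.vaddr, chunk.length)
        # Full block groups see no partial-stripe updates, and empty ones
        # hold nothing to corrupt; everything in between is a candidate.
        if 0 < bg.used < bg.length:
            print("scrub candidate: vaddr={} used={}/{}".format(
                bg.vaddr, bg.used, bg.length))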

> Has this been implemented in the meantime? 

Not much has changed for raid5/6 since 2014, other than the introduction
of raid1c3 for metadata in 2019 to make filesystems with raid6 data usable.
Almost all of the bugs from 2014 still exist today.  Developers have
been fixing more severe and less avoidable bugs in the meantime.

> If not: Are there any plans to?

There are a few solutions to the raid5/6 write hole problem:

  - deprecate the current raid5/6 profile and start over with something
    better,
  - adjust the allocator to pack allocations into full stripes to close
    the write hole, and/or
  - implement raid5/6 stripe journalling in raid1/raid1c3/raid1c4
    metadata.

Any of those ideas would work, all three can be implemented at once, and
they all have various cost/benefit trade-offs (like whether they work for
nodatacow files, and whether they require on-disk format changes).

The solutions also have one thing in common:  nobody has been working
on them.  There are several other raid5/6 bugs that are much worse than
write hole, and they aren't getting developer attention either.

See this more up-to-date list, which puts write hole right at the bottom:

	https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/

> Regards,
> Hendrik
> 


* Re[2]: raid 5/6 - is this implemented?
  2020-11-07  1:43 ` Zygo Blaxell
@ 2020-11-11 17:06   ` Hendrik Friedel
  0 siblings, 0 replies; 3+ messages in thread
From: Hendrik Friedel @ 2020-11-11 17:06 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Btrfs BTRFS

Hello Zygo,

thanks for your reply.

>It's not the last 50 generations that need to be scrubbed.  Any committed
>transaction since mkfs can be affected by write hole.
Reading your link below, I would say that raid5/6 should not be used
currently.  But if it is used, then as long as the state remains as it
is, I think a full scrub should be done after any unclean shutdown.
>
>See this more up-to-date list, which puts write hole right at the bottom:
>
>https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@hungrycats.org/
>
That is a really good post.
I think it should be made prominent on the btrfs wiki?!

Regards,
Hendrik




