* btrfs-scrub: slow scrub speed (raid5)
@ 2020-02-06 17:32 Sebastian Döring
  2020-02-06 17:46 ` Matt Zagrabelny
  2020-02-06 20:51 ` Chris Murphy
  0 siblings, 2 replies; 6+ messages in thread
From: Sebastian Döring @ 2020-02-06 17:32 UTC (permalink / raw)
  To: linux-btrfs

Hi everyone,

when I run a scrub on my 5 disk raid5 array (data: raid5, metadata:
raid6) I notice very slow scrubbing speed: max. 5MB/s per device,
about 23-24 MB/s in sum (according to btrfs scrub status).

What's interesting is that at the same time the gross read speed across the
involved devices (according to iostat) is about 71 MB/s in sum (14-15
MB/s per device). Where are the remaining 47 MB/s going? I expect
there would be some overhead because it's a raid5, but it shouldn't be
much more than a factor of (n-1) / n , no? At the moment it appears to
be only scrubbing 1/3 of all data that is being read and the rest is
thrown out (and probably re-read again at a different time).
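
(Putting rough numbers on that: with n = 5 the overhead factor should be
about n / (n-1) = 5/4, so ~24 MB/s of scrubbed data ought to cost roughly
30 MB/s of raw reads, yet iostat reports about 71 MB/s, i.e. close to
three times the net rate.)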

Surely this can't be right? Are iostat or possibly btrfs scrub status
lying to me? What am I seeing here? I've never seen this problem with
scrubbing a raid1, so maybe there's a bug in how scrub is reading data
from raid5 data profile?

Just to be clear: I can read data from the array in regular file
system usage much faster - it's just the scrub that's very slow for
some reason:

ionice -c idle dd if=/mnt/raid5/testfile.mkv bs=1M of=/dev/null
7876+1 records in
7876+1 records out
8258797247 bytes (8.3 GB, 7.7 GiB) copied, 63.2118 s, 131 MB/s

It seems to me that I could perform a much faster scrub by rsyncing
the whole fs into /dev/null... btrfs is comparing the checksums anyway
when reading data, no?


Best regards,

Sebastian


~ » btrfs --version
btrfs-progs v5.4.1

kernel version: 5.5.2

* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 17:32 btrfs-scrub: slow scrub speed (raid5) Sebastian Döring
@ 2020-02-06 17:46 ` Matt Zagrabelny
  2020-02-06 18:13   ` Sebastian Döring
  2020-02-06 20:51 ` Chris Murphy
  1 sibling, 1 reply; 6+ messages in thread
From: Matt Zagrabelny @ 2020-02-06 17:46 UTC (permalink / raw)
  To: Sebastian Döring; +Cc: linux-btrfs

Slightly offtopic...

On Thu, Feb 6, 2020 at 11:33 AM Sebastian Döring <moralapostel@gmail.com> wrote:
>
> Hi everyone,
>
> when I run a scrub on my 5 disk raid5 array (data: raid5, metadata:
> raid6) I notice very slow scrubbing speed: max. 5MB/s per device,
> about 23-24 MB/s in sum (according to btrfs scrub status).

Is RAID5 stable? I was under the impression that it wasn't.

-m

* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 17:46 ` Matt Zagrabelny
@ 2020-02-06 18:13   ` Sebastian Döring
  2020-02-07  4:58     ` Zygo Blaxell
  0 siblings, 1 reply; 6+ messages in thread
From: Sebastian Döring @ 2020-02-06 18:13 UTC (permalink / raw)
  To: Matt Zagrabelny; +Cc: linux-btrfs

(oops, forgot to reply all the first time)

> Is RAID5 stable? I was under the impression that it wasn't.
>
> -m

Not sure, but AFAIK most of the known issues have been addressed.

I did some informal testing with a bunch of usb devices, ripping them
out during writes, remounting the array with a device missing in
degraded mode, then replacing the device with a fresh one, etc. Always
worked fine. Good enough for me. The scary write hole seems hard to
hit; power outages are rare, and if one happens I will just run a scrub
immediately.

* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 17:32 btrfs-scrub: slow scrub speed (raid5) Sebastian Döring
  2020-02-06 17:46 ` Matt Zagrabelny
@ 2020-02-06 20:51 ` Chris Murphy
  2020-02-07  0:58   ` Zygo Blaxell
  1 sibling, 1 reply; 6+ messages in thread
From: Chris Murphy @ 2020-02-06 20:51 UTC (permalink / raw)
  To: Sebastian Döring; +Cc: Btrfs BTRFS

On Thu, Feb 6, 2020 at 10:33 AM Sebastian Döring <moralapostel@gmail.com> wrote:
>
> Hi everyone,
>
> when I run a scrub on my 5 disk raid5 array (data: raid5, metadata:
> raid6) I notice very slow scrubbing speed: max. 5MB/s per device,
> about 23-24 MB/s in sum (according to btrfs scrub status).

raid56 is not recommended for metadata. With raid5 data, it's
recommended to use raid1 metadata. It's possible to convert from raid6
to raid1 metadata, but you'll need to use the -f flag due to the reduced
redundancy.
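
For reference, the conversion would be something along these lines (the
/mnt/raid5 mountpoint is taken from your dd example; adjust as needed):

	btrfs balance start -f -mconvert=raid1 /mnt/raid5

where -f is what permits the reduction in metadata redundancy.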

If you can consistently depend on kernel 5.5+ you can use raid1c3 or
raid1c4 for metadata; however, even though the file system itself can
then survive a two or three device failure, most of your data won't
survive. It would allow getting some fraction of the files smaller
than 64KiB (the raid5 strip size) off the volume.

I'm not sure this accounts for the slow scrub though. It could be some
combination of heavy block group fragmentation (i.e. a lot of free space
in both metadata and data block groups) and raid6 on top of it. But I'm
not convinced. It'd be useful to see
IO and utilization during the scrub from iostat 5, to see if any one
of the drives is ever getting close to 100% utilization.
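
Something like the following would show that (assuming sysstat's iostat;
-x adds the per-device %util column):

	iostat -dmx 5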

>
> What's interesting is that at the same time the gross read speed across the
> involved devices (according to iostat) is about 71 MB/s in sum (14-15
> MB/s per device). Where are the remaining 47 MB/s going? I expect
> there would be some overhead because it's a raid5, but it shouldn't be
> much more than a factor of (n-1) / n , no? At the moment it appears to
> be only scrubbing 1/3 of all data that is being read and the rest is
> thrown out (and probably re-read again at a different time).

What do you get for
btrfs fi df /mountpoint/
btrfs fi us /mountpoint/

Is it consistently this slow or does it vary a lot?

>
> Surely this can't be right? Are iostat or possibly btrfs scrub status
> lying to me? What am I seeing here? I've never seen this problem with
> scrubbing a raid1, so maybe there's a bug in how scrub is reading data
> from raid5 data profile?

I'd say more likely it's a lack of optimization for the moderate to
high fragmentation case. Both LVM and mdadm raid have no idea what the
layout is, there's no fs metadata to take into account, so every scrub
read is a full stripe read. However, that means it reads unused
portions of the array too, where Btrfs won't because every read is
deliberate. But that means performance can be impacted by disk
contention.


> It seems to me that I could perform a much faster scrub by rsyncing
> the whole fs into /dev/null... btrfs is comparing the checksums anyway
> when reading data, no?

Yes.

-- 
Chris Murphy

* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 20:51 ` Chris Murphy
@ 2020-02-07  0:58   ` Zygo Blaxell
  0 siblings, 0 replies; 6+ messages in thread
From: Zygo Blaxell @ 2020-02-07  0:58 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Sebastian Döring, Btrfs BTRFS

On Thu, Feb 06, 2020 at 01:51:06PM -0700, Chris Murphy wrote:
> On Thu, Feb 6, 2020 at 10:33 AM Sebastian Döring <moralapostel@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > when I run a scrub on my 5 disk raid5 array (data: raid5, metadata:
> > raid6) I notice very slow scrubbing speed: max. 5MB/s per device,
> > about 23-24 MB/s in sum (according to btrfs scrub status).
> 
> raid56 is not recommended for metadata. With raid5 data, it's
> recommended to use raid1 metadata. It's possible to convert from raid6
> to raid1 metadata, but you'll need to use the -f flag due to the reduced
> redundancy.

Definitely do not use raid5 or raid6 for metadata.  All test runs end
in total filesystem loss.  Use raid1 or raid1c3 metadata for raid5 or
raid6 data respectively.
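
For a filesystem built from scratch that would be something like this
(device names below are only placeholders for the five array members):

	mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

An existing filesystem can be converted with 'btrfs balance start
-mconvert', as Chris notes above.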

> If you can consistently depend on kernel 5.5+ you can use raid1c3 or
> raid1c4 for metadata; however, even though the file system itself can
> then survive a two or three device failure, most of your data won't
> survive. It would allow getting some fraction of the files smaller
> than 64KiB (the raid5 strip size) off the volume.
> 
> I'm not sure this accounts for the slow scrub though. It could be some
> combination of heavy block group fragmentation (i.e. a lot of free space
> in both metadata and data block groups) and raid6 on top of it. But I'm
> not convinced. It'd be useful to see
> IO and utilization during the scrub from iostat 5, to see if any one
> of the drives is ever getting close to 100% utilization.

When you run the scrub userspace utility, it creates one thread for
every drive to run the scrub ioctl.  This works well for the other RAID
levels as each drive can be read independently of all others; however,
it's a terrible idea for raid5/6 as each thread has to read all the
drives to recompute parity.  Patches welcome!

(e.g. a patch to make the btrfs scrub userspace utility detect and handle
raid5/6 differently, or to fix the kernel's raid5/6 implementation, or
to add a new scrub ioctl interface that is block-group based instead of
device based).

A workaround is to run scrub on each disk sequentially.  This will take
N times longer than necessary for N disks, but that's better than 20-100
times longer than necessary with all the thrashing of disk heads between
threads on spinning disks.  'btrfs scrub' on a filesystem mountpoint is
exactly equivalent to running several independent 'btrfs scrub' ioctls
on each disk at the same time, so there's no change in behavior when you
scrub the disks one at a time on raid5/6, except for the massive speedup.
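
A rough sketch of that sequential workaround (device names are only
placeholders for the members of this array):

	for dev in /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
		btrfs scrub start -B "$dev"
	done

-B keeps each scrub in the foreground, so the next device doesn't start
until the previous one has finished.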

Currently btrfs raid5/6 csum error correction works roughly 99.999% of
the time on corrupted data blocks, and utterly fails the other 0.001%.
Recovery will work for things like total-loss disk failures (as long as
you're not writing to the filesystem during recovery).  If you have a
more minor failure, e.g. a disk being offline for a few minutes and then
reconnecting, or a drive silently corrupting data but otherwise healthy,
then the effort to recover and the amount of data lost go *up*.

I just updated this bug report today with some details:

	https://www.spinics.net/lists/linux-btrfs/msg94594.html

> > What's interesting is that at the same time the gross read speed across the
> > involved devices (according to iostat) is about 71 MB/s in sum (14-15
> > MB/s per device). Where are the remaining 47 MB/s going? I expect
> > there would be some overhead because it's a raid5, but it shouldn't be
> > much more than a factor of (n-1) / n , no? At the moment it appears to
> > be only scrubbing 1/3 of all data that is being read and the rest is
> > thrown out (and probably re-read again at a different time).
> 
> What do you get for
> btrfs fi df /mountpoint/
> btrfs fi us /mountpoint/
> 
> Is it consistently this slow or does it vary a lot?
> 
> >
> > Surely this can't be right? Are iostat or possibly btrfs scrub status
> > lying to me? What am I seeing here? I've never seen this problem with
> > scrubbing a raid1, so maybe there's a bug in how scrub is reading data
> > from raid5 data profile?
> 
> I'd say more likely it's a lack of optimization for the moderate to
> high fragmentation case. Both LVM and mdadm raid have no idea what the
> layout is, there's no fs metadata to take into account, so every scrub
> read is a full stripe read. However, that means it reads unused
> portions of the array too, where Btrfs won't because every read is
> deliberate. But that means performance can be impacted by disk
> contention.
> 
> 
> > It seems to me that I could perform a much faster scrub by rsyncing
> > the whole fs into /dev/null... btrfs is comparing the checksums anyway
> > when reading data, no?
> 
> Yes.

No.  Reads will not verify or update the parity unless a csum error
is detected.  Scrub reads the entire stripe if any portion of the
stripe contains data.

> -- 
> Chris Murphy

* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 18:13   ` Sebastian Döring
@ 2020-02-07  4:58     ` Zygo Blaxell
  0 siblings, 0 replies; 6+ messages in thread
From: Zygo Blaxell @ 2020-02-07  4:58 UTC (permalink / raw)
  To: Sebastian Döring; +Cc: Matt Zagrabelny, linux-btrfs

On Thu, Feb 06, 2020 at 07:13:41PM +0100, Sebastian Döring wrote:
> (oops, forgot to reply all the first time)
> 
> > Is RAID5 stable? I was under the impression that it wasn't.
> >
> > -m
> 
> Not sure, but AFAIK most of the known issues have been addressed.
> 
> I did some informal testing with a bunch of usb devices, ripping them
> out during writes, remounting the array with a device missing in
> degraded mode, then replacing the device with a fresh one, etc. Always
> worked fine. Good enough for me. The scary write hole seems hard to
> hit; power outages are rare, and if one happens I will just run a scrub
> immediately.

It's quite hard to hit the write hole for data.  A bunch of stuff has
to happen at the same time:

	- Writes have to be small.  btrfs actively tries to prevent this,
	but can be defeated by a workload that uses fsync().  Big writes
	will get their own complete RAID stripes, therefore no write hole.
	Writes smaller than a RAID stripe (64K * (number_of_disks -
	1)) will be packed into smaller gaps in the free space map,
	more of which will be colocated in RAID stripes with previously
	committed data.  This is good for collections of big media files,
	and bad for databases, VM images, and build trees.  (A concrete
	number for this array is worked out just after this list.)

	- The filesystem has to have partially filled RAID stripes, as
	the write hole cannot occur in an empty or full RAID stripe. [1]
	A heavily fragmented filesystem has more of these.  An empty
	filesystem (or a recently balanced one) has fewer.

	- Power needs to fail (or host crash) *during* a write
	that meets the other criteria.	Hosts that spend only 1%
	of their time writing will have write hole failures at 10x
	lower rates than hosts that spend 10% of their time writing.
	The vulnerable interval for write hole is very short--typically
	less than a millisecond--but if you are writing to thousands of
	raid stripes per second, then there are thousands of write hole
	windows per second.

	- Write hole can only affect a system in degraded mode, so
	after all the above, you're still only _at risk_ of a write
	hole failure--you also need a disk fault to occur before you
	can repair parity with a scrub.
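
(Worked out for this 5-disk array: a full data stripe is 64K * (5 - 1) =
256K, so writes of at least 256K get their own complete stripes, while
anything smaller is a candidate for packing into a partially filled
stripe.)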

It is harder to meet these conditions for data, but it's the common
case for metadata.  Metadata is all 16K writes and the 'nossd' allocator
perversely prefers partially-filled RAID stripes.  btrfs spends a _lot_
of its time doing metadata writes.  This maximizes all the prerequisites
above.  btrfs has zero tolerance for uncorrectable metadata loss, so
raid5 and raid6 should never be used for metadata.  Adding the 'nossd'
mount option will make total filesystem loss even faster.

In practice, btrfs raid5 data with raid1 metadata fails (at least one
block unreadable) about once per 30 power failures or crashes while
running a write stress test designed to maximize the conditions listed
above.  I'm not sure if all of that failure rate is due to write hole
or other currently active btrfs raid5 bugs--we'd have to fix the other
bugs and measure the change in failure rate to know.

If your workload doesn't meet the above criteria then the failure rates
will be lower.  A light-duty SOHO file sharing server will probably
last 5 years between data losses with a disk fault every 2.5 years;
however, if you put a database or VM host on that server then it might
have losses on almost every power failure.  The rate you will experience
depends on your workload.  As long as metadata is raid1, 10, 1c3 or 1c4,
the data losses which do occur due to write hole will be small, and can
be recovered by deleting and replacing the damaged files from backups.


[1] except in nodatacow and prealloc files; however, nodatacow is
basically a flag that says "please allow my data to be corrupted as much
as possible without intentionally destroying it," so this is expected.
Prealloc prevents the allocator from avoiding partially filled RAID
stripes because it forces logically consecutive writes to be physically
consecutive as well.
