* btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 17:32 UTC
From: Sebastian Döring <moralapostel@gmail.com>
To: linux-btrfs

Hi everyone,

when I run a scrub on my 5-disk raid5 array (data: raid5, metadata:
raid6) I notice very slow scrubbing speed: at most 5 MB/s per device,
about 23-24 MB/s in total (according to btrfs scrub status).

What's interesting is that at the same time the gross read speed across
the involved devices (according to iostat) is about 71 MB/s in total
(14-15 MB/s per device). Where are the remaining 47 MB/s going? I
expect there would be some overhead because it's a raid5, but it
shouldn't be much more than a factor of (n-1)/n, no? At the moment it
appears that only a third of all the data being read is actually
scrubbed and the rest is thrown away (and probably re-read at a
different time).

Surely this can't be right? Are iostat or possibly btrfs scrub status
lying to me? What am I seeing here? I've never seen this problem when
scrubbing a raid1, so maybe there's a bug in how scrub reads data from
the raid5 data profile?

Just to be clear: I can read data from the array much faster in regular
file system usage - it's only the scrub that is very slow for some
reason:

    ionice -c idle dd if=/mnt/raid5/testfile.mkv bs=1M of=/dev/null
    7876+1 records in
    7876+1 records out
    8258797247 bytes (8.3 GB, 7.7 GiB) copied, 63.2118 s, 131 MB/s

It seems to me that I could perform a much faster scrub by rsyncing the
whole fs into /dev/null... btrfs compares the checksums anyway when
reading data, no?

Best regards,
Sebastian

~ » btrfs --version
btrfs-progs v5.4.1
kernel version: 5.5.2
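[A quick back-of-envelope check of the figures quoted above. This is
only a sketch working from the numbers in the mail; it does not inspect
a real filesystem.]

```shell
# Numbers taken from the report above (MB/s).
n=5            # disks in the raid5 array
gross=71       # total read rate across devices, per iostat
scrubbed=23.5  # total scrub rate, per btrfs scrub status

# Expected data rate if the only overhead were parity: (n-1)/n of gross.
expected=$(awk -v g=$gross -v n=$n 'BEGIN { printf "%.1f", g * (n - 1) / n }')
ratio=$(awk -v s=$scrubbed -v g=$gross 'BEGIN { printf "%.0f", 100 * s / g }')
echo "expected ~${expected} MB/s of scrubbed data, saw ${scrubbed} MB/s (${ratio}% of bytes read)"
```

With pure (n-1)/n overhead one would expect roughly 56.8 MB/s of
scrubbed data out of 71 MB/s read, yet only about a third of the bytes
read end up counted by scrub status, which is the discrepancy being
asked about.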
* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 17:46 UTC
From: Matt Zagrabelny
To: Sebastian Döring; Cc: linux-btrfs

Slightly off-topic...

On Thu, Feb 6, 2020 at 11:33 AM Sebastian Döring
<moralapostel@gmail.com> wrote:
>
> Hi everyone,
>
> when I run a scrub on my 5 disk raid5 array (data: raid5, metadata:
> raid6) I notice very slow scrubbing speed: max. 5MB/s per device,
> about 23-24 MB/s in sum (according to btrfs scrub status).

Is RAID5 stable? I was under the impression that it wasn't.

-m
* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 18:13 UTC
From: Sebastian Döring
To: Matt Zagrabelny; Cc: linux-btrfs

(oops, forgot to reply-all the first time)

> Is RAID5 stable? I was under the impression that it wasn't.
>
> -m

Not sure, but AFAIK most of the known issues have been addressed.

I did some informal testing with a bunch of USB devices: ripping them
out during writes, remounting the array with a device missing in
degraded mode, then replacing the missing device with a fresh one, etc.
It always worked fine. Good enough for me. The scary write hole seems
hard to hit, power outages are rare, and if one happens I will just run
a scrub immediately.
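[For reference, the pull-a-disk recovery flow described above looks
roughly like the following. All device names and the devid are
hypothetical placeholders, and run() only echoes each command as a dry
run; drop the echo to execute for real.]

```shell
# Dry-run wrapper: prints each command instead of executing it.
run() { echo "+ $*"; }

run mount -o degraded /dev/sdb /mnt/raid5      # mount with one member missing
run btrfs replace start 3 /dev/sdf /mnt/raid5  # rebuild missing devid 3 onto a fresh disk
run btrfs replace status /mnt/raid5            # watch rebuild progress
run btrfs scrub start -B /mnt/raid5            # verify checksums once the replace finishes
```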
* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-07 4:58 UTC
From: Zygo Blaxell
To: Sebastian Döring; Cc: Matt Zagrabelny, linux-btrfs

On Thu, Feb 06, 2020 at 07:13:41PM +0100, Sebastian Döring wrote:
> (oops, forgot to reply all the first time)
>
> > Is RAID5 stable? I was under the impression that it wasn't.
> >
> > -m
>
> Not sure, but AFAIK most of the known issues have been addressed.
>
> I did some informal testing with a bunch of usb devices, ripping them
> out during writes, remounting the array with a device missing in
> degraded mode, then replacing the device with a fresh one, etc. Always
> worked fine. Good enough for me. The scary write hole seems hard to
> hit, power outages are rare and if they happen I will just run a scrub
> immediately.

It's quite hard to hit the write hole for data. A bunch of stuff has to
happen at the same time:

  - Writes have to be small. btrfs actively tries to prevent this, but
    it can be defeated by a workload that uses fsync(). Big writes will
    get their own complete RAID stripes, therefore no write hole.
    Writes smaller than a RAID stripe (64K * (number_of_disks - 1))
    will be packed into smaller gaps in the free space map, more of
    which will be colocated in RAID stripes with previously committed
    data. This is good for collections of big media files, and bad for
    databases, VM images, and build trees.

  - The filesystem has to have partially filled RAID stripes, as the
    write hole cannot occur in an empty or full RAID stripe. [1] A
    heavily fragmented filesystem has more of these; an empty
    filesystem (or a recently balanced one) has fewer.

  - Power needs to fail (or the host crash) *during* a write that meets
    the other criteria. Hosts that spend only 1% of their time writing
    will have write hole failures at 10x lower rates than hosts that
    spend 10% of their time writing. The vulnerable interval for the
    write hole is very short--typically less than a millisecond--but if
    you are writing to thousands of RAID stripes per second, then there
    are thousands of write hole windows per second.

  - The write hole can only affect a system in degraded mode, so after
    all of the above you're still only _at risk_ of a write hole
    failure--you also need a disk fault to occur before you repair the
    parity with a scrub.

It is harder to meet these conditions for data, but it's the common
case for metadata. Metadata is all 16K writes, and the 'nossd'
allocator perversely prefers partially filled RAID stripes. btrfs
spends a _lot_ of its time doing metadata writes. This maximizes all
the prerequisites above. btrfs has zero tolerance for uncorrectable
metadata loss, so raid5 and raid6 should never be used for metadata.
Adding the 'nossd' mount option will make total filesystem loss even
faster.

In practice, btrfs raid5 data with raid1 metadata fails (at least one
block unreadable) about once per 30 power failures or crashes while
running a write stress test designed to maximize the conditions listed
above. I'm not sure whether all of that failure rate is due to the
write hole or to other currently active btrfs raid5 bugs--we'd have to
fix the other bugs and measure the change in failure rate to know.

If your workload doesn't meet the above criteria, the failure rates
will be lower. A light-duty SOHO file sharing server will probably last
5 years between data losses with a disk fault every 2.5 years; however,
if you put a database or VM host on that server, then it might have
losses on almost every power failure. The rate you will experience
depends on your workload.

As long as metadata is raid1, raid10, raid1c3, or raid1c4, the data
losses which do occur due to the write hole will be small, and can be
recovered by deleting the damaged files and replacing them from
backups.

[1] Except in nodatacow and prealloc files; however, nodatacow is
basically a flag that says "please allow my data to be corrupted as
much as possible without intentionally destroying it," so this is
expected. Prealloc prevents the allocator from avoiding partially
filled RAID stripes because it forces logically consecutive writes to
be physically consecutive as well.
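[To make the stripe-size arithmetic above concrete for this particular
5-disk array: a full raid5 stripe holds 64K of data per non-parity
member, so writes at or above that total get their own complete stripes
while smaller writes can be packed into the partially filled,
write-hole-prone ones. A minimal sketch of that formula:]

```shell
# Full data stripe = 64K * (number_of_disks - 1), per the formula above.
strip_kib=64   # btrfs raid5/6 strip size per device, in KiB
n_disks=5      # Sebastian's array
full_stripe_kib=$(( strip_kib * (n_disks - 1) ))
echo "full data stripe: ${full_stripe_kib} KiB"
```

On this array, anything below 256 KiB per write is in the "small write"
category described in the first bullet point.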
* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-06 20:51 UTC
From: Chris Murphy
To: Sebastian Döring; Cc: Btrfs BTRFS

On Thu, Feb 6, 2020 at 10:33 AM Sebastian Döring
<moralapostel@gmail.com> wrote:
>
> Hi everyone,
>
> when I run a scrub on my 5 disk raid5 array (data: raid5, metadata:
> raid6) I notice very slow scrubbing speed: max. 5MB/s per device,
> about 23-24 MB/s in sum (according to btrfs scrub status).

raid56 is not recommended for metadata. With raid5 data, it's
recommended to use raid1 metadata. It's possible to convert the
metadata from raid6 to raid1, but you'll need to use the -f flag due to
the reduced redundancy.

If you can consistently depend on kernel 5.5+, you can use raid1c3 or
raid1c4 for metadata. Even though the file system itself can then
survive a two- or three-device failure, most of your data won't
survive; it would only allow getting some fraction of the files smaller
than 64KiB (the raid5 strip size) off the volume.

I'm not sure this accounts for the slow scrub, though. It could be some
combination of heavy block group fragmentation, i.e. a lot of free
space in block groups, in both metadata and data block groups, and then
raid6 on top of it. But I'm not convinced. It'd be useful to see IO and
utilization during the scrub from 'iostat 5', to see if any one of the
drives is ever getting close to 100% utilization.

> What's interesting is at the same time the gross read speed across the
> involved devices (according to iostat) is about ~71 MB/s in sum (14-15
> MB/s per device). Where are the remaining 47 MB/s going? I expect
> there would be some overhead because it's a raid5, but it shouldn't be
> much more than a factor of (n-1) / n , no? At the moment it appears to
> be only scrubbing 1/3 of all data that is being read and the rest is
> thrown out (and probably re-read again at a different time).

What do you get for

    btrfs fi df /mountpoint/
    btrfs fi us /mountpoint/

Is it consistently this slow or does it vary a lot?

> Surely this can't be right? Are iostat or possibly btrfs scrub status
> lying to me? What am I seeing here? I've never seen this problem with
> scrubbing a raid1, so maybe there's a bug in how scrub is reading data
> from raid5 data profile?

I'd say it's more likely a lack of optimization for the moderate to
high fragmentation case. LVM and mdadm raid have no idea what the
layout is, there's no fs metadata to take into account, so every scrub
read is a full stripe read. However, that means they read unused
portions of the array too, where Btrfs won't, because every read is
deliberate. But that means performance can be impacted by disk
contention.

> It seems to me that I could perform a much faster scrub by rsyncing
> the whole fs into /dev/null... btrfs is comparing the checksums anyway
> when reading data, no?

Yes.

--
Chris Murphy
* Re: btrfs-scrub: slow scrub speed (raid5)
  2020-02-07 0:58 UTC
From: Zygo Blaxell
To: Chris Murphy; Cc: Sebastian Döring, Btrfs BTRFS

On Thu, Feb 06, 2020 at 01:51:06PM -0700, Chris Murphy wrote:
> On Thu, Feb 6, 2020 at 10:33 AM Sebastian Döring
> <moralapostel@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > when I run a scrub on my 5 disk raid5 array (data: raid5, metadata:
> > raid6) I notice very slow scrubbing speed: max. 5MB/s per device,
> > about 23-24 MB/s in sum (according to btrfs scrub status).
>
> raid56 is not recommended for metadata. With raid5 data, it's
> recommended to use raid1 metadata. It's possible to convert from raid6
> to raid1 metadata, but you'll need to use -f flag due to the reduced
> redundancy.

Definitely do not use raid5 or raid6 for metadata. All test runs end in
total filesystem loss. Use raid1 or raid1c3 metadata for raid5 or raid6
data, respectively.

> If you can consistently depend on kernel 5.5+ you can use raid1c3 or
> raid1c4 for metadata, although even though the file system itself can
> survive a two or three device failure, most of your data won't
> survive. It would allow getting some fraction of the files smaller
> than 64KiB (raid5 strip size) off the volume.
>
> I'm not sure this accounts for the slow scrub though. It could be some
> combination of heavy block group fragmentation, i.e. a lot of free
> space in block groups, in both metadata and data block groups, and
> then raid6 on top of it. But, I'm not convinced. It'd be useful to see
> IO and utilization during the scrub from iostat 5, to see if any one
> of the drives is ever getting close to 100% utilization.

When you run the scrub userspace utility, it creates one thread for
every drive to run the scrub ioctl. This works well for the other RAID
levels, as each drive can be read independently of the others; however,
it's a terrible idea for raid5/6, where each thread has to read all the
drives to recompute parity.

Patches welcome! (e.g. a patch to make the btrfs scrub userspace
utility detect and handle raid5/6 differently, or to fix the kernel's
raid5/6 implementation, or to add a new scrub ioctl interface that is
block-group based instead of device based.)

A workaround is to run scrub on each disk sequentially. This will take
N times longer than necessary for N disks, but that's better than the
20-100 times longer than necessary you get from the threads thrashing
the disk heads against each other on spinning disks. 'btrfs scrub' on a
filesystem mountpoint is exactly equivalent to running several
independent 'btrfs scrub' ioctls on each disk at the same time, so
scrubbing separate disks one at a time on raid5/6 changes no behavior,
except for the massive speedup.

Currently btrfs raid5/6 csum error correction works roughly 99.999% of
the time on corrupted data blocks, and utterly fails the other 0.001%.
Recovery will work for things like total-loss disk failures (as long as
you're not writing to the filesystem during recovery). If you have a
more minor failure, e.g. a disk that goes offline for a few minutes and
then reconnects, or a drive that silently corrupts data but is
otherwise healthy, then the effort to recover and the amount of data
lost go *up*. I just updated this bug report today with some details:

    https://www.spinics.net/lists/linux-btrfs/msg94594.html

> > What's interesting is at the same time the gross read speed across the
> > involved devices (according to iostat) is about ~71 MB/s in sum (14-15
> > MB/s per device). Where are the remaining 47 MB/s going? I expect
> > there would be some overhead because it's a raid5, but it shouldn't be
> > much more than a factor of (n-1) / n , no? At the moment it appears to
> > be only scrubbing 1/3 of all data that is being read and the rest is
> > thrown out (and probably re-read again at a different time).
>
> What do you get for
> btrfs fi df /mountpoint/
> btrfs fi us /mountpoint/
>
> Is it consistently this slow or does it vary a lot?
>
> > Surely this can't be right? Are iostat or possibly btrfs scrub status
> > lying to me? What am I seeing here? I've never seen this problem with
> > scrubbing a raid1, so maybe there's a bug in how scrub is reading data
> > from raid5 data profile?
>
> I'd say more likely it's a lack of optimization for the moderate to
> high fragmentation case. Both LVM and mdadm raid have no idea what the
> layout is, there's no fs metadata to take into account, so every scrub
> read is a full stripe read. However, that means it reads unused
> portions of the array too, where Btrfs won't because every read is
> deliberate. But that means performance can be impacted by disk
> contention.
>
> > It seems to me that I could perform a much faster scrub by rsyncing
> > the whole fs into /dev/null... btrfs is comparing the checksums anyway
> > when reading data, no?
>
> Yes.

No. Reads will not verify or update the parity unless a csum error is
detected. Scrub reads the entire stripe if any portion of the stripe
contains data.
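[The sequential per-device scrub workaround Zygo describes can be
sketched as below. The five device names are hypothetical placeholders,
and the commands are only collected and printed as a dry run; remove
the echo indirection to actually scrub. -B keeps each scrub in the
foreground so the next device starts only after the previous one
finishes.]

```shell
# Scrub each member device of the raid5 array one at a time.
cmds=()
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    cmd="btrfs scrub start -B $dev"   # -B: don't background; wait for completion
    cmds+=("$cmd")
    echo "+ $cmd"                     # dry run: print instead of executing
done
```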