* Deleting a failing drive from RAID6 fails
@ 2019-12-25 19:25 Martin
From: Martin
To: linux-btrfs

Hi,

I have a drive that started failing (uncorrectable errors & lots of
relocated sectors) in a RAID6 (12 devices / 70TB total, with 30TB of
data). btrfs scrub started showing corrected errors as well (seemingly
no big deal, since it's RAID6). I decided to remove the drive from the
array with:

    btrfs device delete /dev/sdg /mount_point

After about 20 hours, having rebalanced 90% of the data off the drive,
the operation failed with an I/O error. dmesg was showing csum errors:

    BTRFS warning (device sdf): csum failed root -9 ino 2526 off 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
    BTRFS warning (device sdf): csum failed root -9 ino 2526 off 10673852416 csum 0x8941f998 expected csum 0x8a9a53fe mirror 2
    . . .

I pulled the drive out of the system and attempted the device deletion
again, but I get the same error.

Looking back through the logs of the previous scrubs, they showed the
file paths where errors were detected, so I deleted those files and
tried removing the failing drive again. It moved along some more; now
it's down to only 13GiB of data remaining on the missing drive. Is
there any way to track the above errors to specific files so I can
delete them and finish the removal? Is there a better way to finish
the device deletion?

Scrubbing with the device missing just racks up uncorrectable errors
right off the bat, so scrub seemingly doesn't tolerate a missing
device - I assume it's not actually doing anything useful, right?

I'm currently traveling and away from the system physically. Is there
any way to complete the device removal without reconnecting the
failing drive? Otherwise, I'll have a replacement drive in a couple of
weeks when I'm back, and can try anything involving reconnecting the
drive.

Thanks,
Martin
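[Editorial note: for csum warnings in regular subvolumes (positive root
IDs), the root/inode pair can be mapped back to a path with
`btrfs inspect-internal inode-resolve`; negative root IDs like the -9
above are internal btrfs trees with no file path, as the replies below
explain. A hedged sketch - the log format is taken from the warnings
above, and `/mount_point` is the mount point from the original command:]

```shell
# Pull "root <id> ino <num>" pairs out of btrfs csum-failure warnings.
parse_csum_errors() {
    sed -n 's/.*csum failed root \(-\{0,1\}[0-9][0-9]*\) ino \([0-9][0-9]*\).*/\1 \2/p' | sort -u
}

# Feed the kernel log through it, then resolve positive roots to paths.
dmesg 2>/dev/null | grep 'csum failed' | parse_csum_errors |
while read -r root ino; do
    if [ "$root" -gt 0 ]; then
        # inode-resolve prints the path(s) of that inode within the fs
        btrfs inspect-internal inode-resolve "$ino" /mount_point
    else
        echo "root $root ino $ino: internal btrfs tree, no file path"
    fi
done
```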
* Re: Deleting a failing drive from RAID6 fails
@ 2019-12-26  5:03 Qu Wenruo
From: Qu Wenruo
To: Martin, linux-btrfs

On 2019/12/26 上午3:25, Martin wrote:
> [...]
> After about 20 hours, having rebalanced 90% of the data off the drive,
> the operation failed with an I/O error. dmesg was showing csum errors:
>     BTRFS warning (device sdf): csum failed root -9 ino 2526 off 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
>     . . .

This means the data reloc tree had a csum mismatch.
The strange part is that we shouldn't hit a csum error here: if some
data were corrupted, the csum error should be reported at read time,
not at this point.

This looks like something reported before.

> Is there any way to track the above errors to specific files so I can
> delete them and finish the removal? Is there a better way to finish
> the device deletion?

As the message shows, it's the data reloc tree, which stores the newly
relocated data, so it doesn't map back to a file path.

> Scrubbing with the device missing just racks up uncorrectable errors
> right off the bat [...]

Which kernel are you using?

IIRC older kernels don't retry all possible device combinations, so
scrub can report uncorrectable errors even when they should be
correctable.
Another possible cause is the write hole, which reduces the tolerance
of RAID6 stripe by stripe.

You can also try replacing the missing device. In that case the data
doesn't go through the regular relocation path but through the dev
replace path (more like scrub) - but you need physical access for that.

Thanks,
Qu
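[Editorial note: the replace path Qu suggests looks roughly like the
following. This is a sketch only - the commands are printed rather than
executed, the devid (assumed to be 7 here) must be read off
`btrfs filesystem show` for the real missing drive, and `/dev/sdx` is a
hypothetical replacement device:]

```shell
# Hypothetical values: find the real devid of the missing drive with
# `btrfs filesystem show /mount_point` before running anything.
devid=7            # devid of the failing/missing device (assumed)
new_dev=/dev/sdx   # the replacement drive (assumed)
mnt=/mount_point   # mount point from the original report

# `-r` makes replace read only from the remaining devices, never the
# failing source; `status` reports progress of a running replace.
cmd_start="btrfs replace start -r $devid $new_dev $mnt"
cmd_status="btrfs replace status $mnt"
printf '%s\n%s\n' "$cmd_start" "$cmd_status"
```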
* Re: Deleting a failing drive from RAID6 fails
@ 2019-12-26  5:40 Zygo Blaxell
From: Zygo Blaxell
To: Qu Wenruo; +Cc: Martin, linux-btrfs

On Thu, Dec 26, 2019 at 01:03:47PM +0800, Qu Wenruo wrote:
> On 2019/12/26 上午3:25, Martin wrote:
> > [...]
> > dmesg was showing csum errors:
> >     BTRFS warning (device sdf): csum failed root -9 ino 2526 off 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
> >     . . .
>
> This means the data reloc tree had a csum mismatch.
> [...]
> Another possible cause is the write hole, which reduces the tolerance
> of RAID6 stripe by stripe.

Did you find a fix for

    https://www.spinics.net/lists/linux-btrfs/msg94634.html

If that bug is happening in this case, it can abort a device delete
on raid5/6 due to corrupted data every few block groups.
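[Editorial note: both the retry behavior Qu asks about and the bug Zygo
links are kernel-side, so the first datum to collect for a report like
this is the running kernel version - a trivial sketch; which kernel
versions contain the relevant fixes is not established in this thread:]

```shell
# The scrub retry behavior and the linked raid5/6 bug are both
# kernel-side; record the running kernel before drawing conclusions.
kver=$(uname -r)
echo "running kernel: $kver"
```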
* Re: Deleting a failing drive from RAID6 fails
@ 2019-12-26  6:50 Qu Wenruo
From: Qu Wenruo
To: Zygo Blaxell; +Cc: Martin, linux-btrfs

On 2019/12/26 下午1:40, Zygo Blaxell wrote:
> [...]
> Did you find a fix for
>
>     https://www.spinics.net/lists/linux-btrfs/msg94634.html
>
> If that bug is happening in this case, it can abort a device delete
> on raid5/6 due to corrupted data every few block groups.

My bad - I always lose track of my to-do list.

It does look like one possible cause indeed.

Thanks for reminding me of that bug,
Qu
* Re: Deleting a failing drive from RAID6 fails
@ 2019-12-26 19:37 Martin
From: Martin
To: Qu Wenruo; +Cc: Zygo Blaxell, linux-btrfs

I appreciate the replies. As a general update: I ended up cleaning out
a large amount of unneeded files, hoping the corruption would be in
one of those, and retried the device deletion - it completed
successfully.

I'm not really sure why those files were ever unrecoverably corrupted -
the system has never crashed or lost power since this filesystem was
created. It's a Fedora server, somewhat regularly updated, and this
btrfs filesystem was created about 2 years ago. I'm not sure which
kernel version it started on, but it was most recently running kernel
5.3.16 when I noticed the hard drive failing, and I'm not really sure
when it first started having problems.

Thanks,
Martin

On Thu, Dec 26, 2019 at 1:50 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> [...]
> My bad - I always lose track of my to-do list.
>
> It does look like one possible cause indeed.
>
> Thanks for reminding me of that bug,
> Qu