On Thu, Dec 26, 2019 at 01:03:47PM +0800, Qu Wenruo wrote:
>
>
> On 2019/12/26 3:25 AM, Martin wrote:
> > Hi,
> >
> > I have a drive that started failing (uncorrectable errors & lots of
> > relocated sectors) in a RAID6 (12 device/70TB total with 30TB of
> > data), btrfs scrub started showing corrected errors as well (seemingly
> > no big deal since its RAID6). I decided to remove the drive from the
> > array with:
> >     btrfs device delete /dev/sdg /mount_point
> >
> > After about 20 hours and having rebalanced 90% of the data off the
> > drive, the operation failed with an I/O error. dmesg was showing csum
> > errors:
> > BTRFS warning (device sdf): csum failed root -9 ino 2526 off
> > 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
> > BTRFS warning (device sdf): csum failed root -9 ino 2526 off
> > 10673852416 csum 0x8941f998 expected csum 0x8a9a53fe mirror 2
> > . . .
>
> This means some data reloc tree had csum mismatch.
> The strange part is, we shouldn't hit csum error here, as if it's some
> data corrupted, it should report csum error at read time, other than
> reporting the error at this timing.
>
> This looks like something reported before.
>
> >
> > I pulled the drive out of the system and attempted the device deletion
> > again, but getting the same error.
> >
> > Looking back through the logs to the previous scrubs, it showed the
> > file paths where errors were detected, so I deleted those files, and
> > tried removing the failing drive again. It moved along some more. Now
> > its down to only 13GiB of data remaining on the missing drive. Is
> > there any way to track the above errors to specific files so I can
> > delete them and finish the removal. Is there is a better way to finish
> > the device deletion?
>
> As the message shows, it's the data reloc tree, which store the newly
> relocated data.
> So it doesn't contain the file path.
>
> >
> > Scrubbing with the device missing just racks up uncorrectable errors
> > right off the bat, so it seemingly doesn't like missing a device - I
> > assume it's not actually doing anything useful, right?
>
> Which kernel are you using?
>
> IIRC older kernel doesn't retry all possible device combinations, thus
> it can report uncorrectable errors even if it should be correctable.
> Another possible cause is write-hole, which reduced the tolerance of
> RAID6 stripes by stripes.

Did you find a fix for

	https://www.spinics.net/lists/linux-btrfs/msg94634.html

If that bug is happening in this case, it can abort a device delete
on raid5/6 due to corrupted data every few block groups.

> You can also try replace the missing device.
> In that case, it doesn't go through the regular relocation path, but dev
> replace path (more like scrub), but you need physical access then.
>
> Thanks,
> Qu
>
> >
> > I'm currently traveling and away from the system physically. Is there
> > any way to complete the device removal without reconnecting the
> > failing drive? Otherwise, I'll have a replacement drive in a couple of
> > weeks when I'm back, and can try anything involving reconnecting the
> > drive.
> >
> > Thanks,
> > Martin
> >
>
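
For reference, the "root -9" in the csum warnings quoted above is the
objectid the kernel prints for the data relocation tree, which is why
those lines carry no file path. Below is a minimal Python sketch for
pulling the fields out of such warnings. It assumes each warning sits on
one unwrapped line in exactly the format quoted above; the names
CsumFailure and parse_csum_failure are made up for illustration and are
not part of btrfs-progs or the kernel.

```python
import re
from typing import NamedTuple, Optional

# Objectid the kernel prints for the data relocation tree.
DATA_RELOC_TREE_OBJECTID = -9

# Assumed line format, taken from the dmesg excerpt quoted in this thread.
CSUM_RE = re.compile(
    r"BTRFS warning \(device (?P<dev>[^)]+)\): csum failed "
    r"root (?P<root>-?\d+) ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>0x[0-9a-fA-F]+) "
    r"expected csum (?P<expected>0x[0-9a-fA-F]+) "
    r"mirror (?P<mirror>\d+)"
)


class CsumFailure(NamedTuple):
    """One parsed 'csum failed' warning (hypothetical helper type)."""
    device: str
    root: int
    ino: int
    offset: int
    csum: int
    expected: int
    mirror: int

    @property
    def in_data_reloc_tree(self) -> bool:
        # True when the inode lives in the data reloc tree, so the
        # inode number does not correspond to a user-visible file.
        return self.root == DATA_RELOC_TREE_OBJECTID


def parse_csum_failure(line: str) -> Optional[CsumFailure]:
    """Return the parsed warning, or None if the line does not match."""
    m = CSUM_RE.search(line)
    if m is None:
        return None
    return CsumFailure(
        device=m["dev"],
        root=int(m["root"]),
        ino=int(m["ino"]),
        offset=int(m["off"]),
        csum=int(m["csum"], 16),
        expected=int(m["expected"], 16),
        mirror=int(m["mirror"]),
    )
```

Feeding it the output of "dmesg" line by line and checking
in_data_reloc_tree separates warnings raised during relocation from
ones that name a regular subvolume root, where the inode number can
still be resolved to a path.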