On 2019/12/26 3:25 AM, Martin wrote:
> Hi,
>
> I have a drive that started failing (uncorrectable errors & lots of
> relocated sectors) in a RAID6 (12 device/70TB total with 30TB of
> data), btrfs scrub started showing corrected errors as well (seemingly
> no big deal since its RAID6). I decided to remove the drive from the
> array with:
> btrfs device delete /dev/sdg /mount_point
>
> After about 20 hours and having rebalanced 90% of the data off the
> drive, the operation failed with an I/O error. dmesg was showing csum
> errors:
> BTRFS warning (device sdf): csum failed root -9 ino 2526 off
> 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
> BTRFS warning (device sdf): csum failed root -9 ino 2526 off
> 10673852416 csum 0x8941f998 expected csum 0x8a9a53fe mirror 2
> . . .

This means some data in the data reloc tree had a csum mismatch.

The strange part is that we shouldn't hit a csum error here: if some
data were corrupted, the csum error should be reported at read time,
rather than at this point.

This looks like something reported before.

>
> I pulled the drive out of the system and attempted the device deletion
> again, but getting the same error.
>
> Looking back through the logs to the previous scrubs, it showed the
> file paths where errors were detected, so I deleted those files, and
> tried removing the failing drive again. It moved along some more. Now
> its down to only 13GiB of data remaining on the missing drive. Is
> there any way to track the above errors to specific files so I can
> delete them and finish the removal. Is there is a better way to finish
> the device deletion?

As the message shows, it's the data reloc tree, which stores the newly
relocated data. So it doesn't contain the file path.

>
> Scrubbing with the device missing just racks up uncorrectable errors
> right off the bat, so it seemingly doesn't like missing a device - I
> assume it's not actually doing anything useful, right?

Which kernel are you using?

IIRC older kernels don't retry all possible device combinations, so
they can report uncorrectable errors even when the data should be
correctable.

Another possible cause is the write hole, which reduces the fault
tolerance of RAID6 stripe by stripe.

You can also try replacing the missing device.
In that case it doesn't go through the regular relocation path but the
dev-replace path (more like scrub), though you would need physical
access for that.

Thanks,
Qu

>
> I'm currently traveling and away from the system physically. Is there
> any way to complete the device removal without reconnecting the
> failing drive? Otherwise, I'll have a replacement drive in a couple of
> weeks when I'm back, and can try anything involving reconnecting the
> drive.
>
> Thanks,
> Martin
>
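
A footnote on the csum warnings discussed above: the `root` field in each warning shows which tree the failing inode belongs to, and root -9 is the data reloc tree mentioned in the reply. The following is only a rough sketch, not an official tool. `classify_csum_line` is a hypothetical helper name, and the pattern assumes the warning appears on one unwrapped line in the same format as the two quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of btrfs-progs): extract root, inode and
# offset from a btrfs "csum failed" warning and report whether the root
# is the data reloc tree (-9), where no file path exists to look up.
classify_csum_line() {
    # $1: a single, unwrapped dmesg line
    printf '%s\n' "$1" |
    sed -nE 's/.*csum failed root (-?[0-9]+) ino ([0-9]+) off ([0-9]+) .*/\1 \2 \3/p' |
    while read -r root ino off; do
        if [ "$root" = "-9" ]; then
            echo "root=$root ino=$ino off=$off data-reloc-tree (no file path)"
        else
            echo "root=$root ino=$ino off=$off regular-subvolume"
        fi
    done
}

# Example: feed it one of the warnings from this thread.
classify_csum_line "BTRFS warning (device sdf): csum failed root -9 ino 2526 off 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2"
```

Lines that are not csum warnings produce no output. For a warning whose root is an ordinary subvolume, the inode number can usually be mapped back to a path with `btrfs inspect-internal inode-resolve <ino> <path>`. That lookup does not apply to the root -9 entries here.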