From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Cc: Martin <mbakiev@gmail.com>, linux-btrfs@vger.kernel.org
Subject: Re: Deleting a failing drive from RAID6 fails
Date: Thu, 26 Dec 2019 14:50:30 +0800	[thread overview]
Message-ID: <50661176-b04c-882b-d87c-ee5c0395c3f6@gmx.com> (raw)
In-Reply-To: <20191226054058.GC13306@hungrycats.org>



On 2019/12/26 1:40 PM, Zygo Blaxell wrote:
> On Thu, Dec 26, 2019 at 01:03:47PM +0800, Qu Wenruo wrote:
>>
>>
>> On 2019/12/26 3:25 AM, Martin wrote:
>>> Hi,
>>>
>>> I have a drive that started failing (uncorrectable errors & lots of
>>> relocated sectors) in a RAID6 (12 device/70TB total with 30TB of
>>> data), btrfs scrub started showing corrected errors as well (seemingly
>>> no big deal since its RAID6). I decided to remove the drive from the
>>> array with:
>>>     btrfs device delete /dev/sdg /mount_point
>>>
>>> After about 20 hours and having rebalanced 90% of the data off the
>>> drive, the operation failed with an I/O error. dmesg was showing csum
>>> errors:
>>>     BTRFS warning (device sdf): csum failed root -9 ino 2526 off
>>> 10673848320 csum 0x8941f998 expected csum 0x253c8e4b mirror 2
>>>     BTRFS warning (device sdf): csum failed root -9 ino 2526 off
>>> 10673852416 csum 0x8941f998 expected csum 0x8a9a53fe mirror 2
>>>     . . .
>>
>> This means some data reloc tree item had a csum mismatch.
>> The strange part is that we shouldn't hit a csum error here: if some
>> data were corrupted, the csum error should have been reported at read
>> time, not at this point in the relocation.
>>
>> This looks like something reported before.
>>
>>>
>>> I pulled the drive out of the system and attempted the device deletion
>>> again, but getting the same error.
>>>
>>> Looking back through the logs to the previous scrubs, it showed the
>>> file paths where errors were detected, so I deleted those files, and
>>> tried removing the failing drive again. It moved along some more. Now
>>> its down to only 13GiB of data remaining on the missing drive. Is
>>> there any way to track the above errors to specific files so I can
>>> delete them and finish the removal. Is there is a better way to finish
>>> the device deletion?
>>
>> As the message shows, it's the data reloc tree, which stores the newly
>> relocated data, so it doesn't map back to any file path.
>>
>>>
>>> Scrubbing with the device missing just racks up uncorrectable errors
>>> right off the bat, so it seemingly doesn't like missing a device - I
>>> assume it's not actually doing anything useful, right?
>>
>> Which kernel are you using?
>>
>> IIRC older kernels don't retry all possible device combinations, so
>> they can report uncorrectable errors even when the data should be
>> correctable.
> 
>> Another possible cause is the write hole, which reduces the tolerance
>> of RAID6 stripe by stripe.
> 
> Did you find a fix for
> 
> 	https://www.spinics.net/lists/linux-btrfs/msg94634.html
> 
> If that bug is happening in this case, it can abort a device delete
> on raid5/6 due to corrupted data every few block groups.

My bad, I keep losing track of my to-do items.

It does look like one possible cause indeed.

Thanks for reminding me of that bug,
Qu

> 
>> You can also try replacing the missing device.
>> In that case, it doesn't go through the regular relocation path but the
>> dev replace path (more like scrub), though you need physical access then.
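For reference, the replace path suggested above would look roughly like the sketch below. The devid (7) and target device (/dev/sdx) are hypothetical placeholders for illustration; the real devid of the missing device comes from `btrfs filesystem show`.

```shell
# Sketch only: devid 7 and /dev/sdx are hypothetical placeholders.

# 1. Find the devid of the missing device:
btrfs filesystem show /mount_point

# 2. Start the replace, addressing the missing device by its devid.
#    -r avoids reading from the source device where other good copies
#    exist (useful when the old drive is failing or already gone):
btrfs replace start -r 7 /dev/sdx /mount_point

# 3. The replace runs in the background; check progress with:
btrfs replace status /mount_point
```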
>>
>> Thanks,
>> Qu
>>
>>>
>>> I'm currently traveling and away from the system physically. Is there
>>> any way to complete the device removal without reconnecting the
>>> failing drive? Otherwise, I'll have a replacement drive in a couple of
>>> weeks when I'm back, and can try anything involving reconnecting the
>>> drive.
>>>
>>> Thanks,
>>> Martin
>>>
>>
> 
> 
> 



Thread overview: 5+ messages
2019-12-25 19:25 Deleting a failing drive from RAID6 fails Martin
2019-12-26  5:03 ` Qu Wenruo
2019-12-26  5:40   ` Zygo Blaxell
2019-12-26  6:50     ` Qu Wenruo [this message]
2019-12-26 19:37       ` Martin
