Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10

From: "Tkaczyk, Mariusz" <mariusz.tkaczyk@linux.intel.com>
To: Song Liu <song@kernel.org>
Cc: linux-raid <linux-raid@vger.kernel.org>
Subject: Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
Date: Mon, 27 Sep 2021 16:54:27 +0200	[thread overview]
Message-ID: <83f06776-891f-dffa-7449-4128c419ada9@linux.intel.com> (raw)
In-Reply-To: <CAPhsuW4HZFaPgKx68mDeWE0F7SAjpnXmHL0TzN1SuZD6_Kds-w@mail.gmail.com>

Hi Song,
Thank for review.

On 24.09.2021 23:15, Song Liu wrote:
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index c322841d4edc..ac20eb2ddff7 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -926,8 +926,9 @@ static void super_written(struct bio *bio)
>>>                  pr_err("md: %s gets error=%d\n", __func__,
>>>                         blk_status_to_errno(bio->bi_status));
>>>                  md_error(mddev, rdev);
>>> -               if (!test_bit(Faulty, &rdev->flags)
>>> -                   && (bio->bi_opf & MD_FAILFAST)) {
>>> +               if (!test_bit(Faulty, &rdev->flags) &&
>>> +                    !test_bit(MD_BROKEN, &mddev->flags) &&
>>> +                    (bio->bi_opf & MD_FAILFAST)) {
>>
>> So with MD_BROKEN, we will not try to update the SB?
Array is dead, is there added value in writing failed state?

For external arrays failed state is not written, because drive is
not removed from MD device and metadata manager cannot detect array
failure. This is how it was originally implemented (expect raid5 but I
aligned it around two years ago[1]). I tried to make it consistent for
everyone but I failed. Second patch restores possibility to remove
drive by kernel for raid5 so failed state will be detected (for external)
again.
So, maybe I should drop this change for native. What do you think?

>>
>>>                          set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>                          set_bit(LastDev, &rdev->flags);
>>>                  }
>>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>>>          int err = -EINVAL;
>>>          if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>>>                  md_error(rdev->mddev, rdev);
>>> -               if (test_bit(Faulty, &rdev->flags))
>>> +
>>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
>>
>> I don't think this makes much sense. EBUSY for already failed array
>> sounds weird.
>> Also, shall we also set MD_BROKEN here?
>>
> Actually, we just called md_error above, so we don't need to set MD_BROKEN here.
> But we shouldn't return EBUSY in such cases, right?
>
About EBUSY:
This is how it is implemented in mdadm, we are expecting it in
case of failure. See my fix[2].
I agree that it can be confusing, but this is how it is working.
Do you want to change it across mdadm and md?
This will break compatibility.

About MD_BROKEN:
As you see we are determining failure by checking rdev state, if "Faulty"
in flags after md_error() is not set, then it assumes that array is
failed and EBUSY is returned to userspace.

I changed verification to checking MD_BROKEN flag instead. It is
potentially harmful for raid1 and raid10 as old way is working well but
main target was to unify it across levels. Current verification method
is not flexible enough for raid5 (second patch).
Small benefit is that new IO will be failed faster, in md_submit_io().

Thanks,
Mariusz

[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?id=fb73b3
[2] https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=cb8f5371