* [PATCH 0/2] Use MD_BROKEN for redundant arrays
@ 2021-09-17 15:34 Mariusz Tkaczyk
  2021-09-17 15:34 ` [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
  2021-09-17 15:34 ` [PATCH 2/2] raid5: introduce MD_BROKEN Mariusz Tkaczyk
  0 siblings, 2 replies; 14+ messages in thread
From: Mariusz Tkaczyk @ 2021-09-17 15:34 UTC (permalink / raw)
  To: song; +Cc: linux-raid

Hi Song,
This patchset uses MD_BROKEN for every redundant level. It should
simplify the IO failure handling when an md device is failed, and it
fixes a raid456 bug.

Mariusz Tkaczyk (2):
  md: Set MD_BROKEN for RAID1 and RAID10
  raid5: introduce MD_BROKEN

 drivers/md/md.c     | 16 ++++++++++------
 drivers/md/md.h     |  4 ++--
 drivers/md/raid1.c  |  1 +
 drivers/md/raid10.c |  1 +
 drivers/md/raid5.c  | 34 ++++++++++++++++------------------
 5 files changed, 30 insertions(+), 26 deletions(-)

-- 
2.26.2



* [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-17 15:34 [PATCH 0/2] Use MD_BROKEN for redundant arrays Mariusz Tkaczyk
@ 2021-09-17 15:34 ` Mariusz Tkaczyk
  2021-09-24 21:09   ` Song Liu
  2021-09-17 15:34 ` [PATCH 2/2] raid5: introduce MD_BROKEN Mariusz Tkaczyk
  1 sibling, 1 reply; 14+ messages in thread
From: Mariusz Tkaczyk @ 2021-09-17 15:34 UTC (permalink / raw)
  To: song; +Cc: linux-raid

The idea of stopping all writes when the array is failed, introduced by
62f7b1989c0 ("md raid0/linear: Mark array as 'broken' and fail BIOs if
a member is gone"), seems reasonable, so use MD_BROKEN for RAID1 and
RAID10 as well. Support in RAID456 is added in the next commit.
If userspace (mdadm) forces md to fail a device (Failure state written
via sysfs), then EBUSY is expected if the array would become failed.
To achieve that, check for MD_BROKEN and, if it is set, return EBUSY to
stay compliant with userspace.
For the faulty state, handled via ioctl, let the error_handler decide.
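
To illustrate the expected userspace-visible contract, a minimal C
sketch (the sysfs path and device names here are only illustrative,
not taken from this series):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        /* illustrative member state file of an md array */
        int fd = open("/sys/block/md0/md/dev-sda/state", O_WRONLY);

        if (fd < 0)
                return 1;
        /* ask md to mark this member as faulty */
        if (write(fd, "faulty", strlen("faulty")) < 0 && errno == EBUSY) {
                /* with this patch: failing the member would fail the
                 * whole array, so md refuses with EBUSY */
                fprintf(stderr, "array would become failed\n");
        }
        close(fd);
        return 0;
}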

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/md/md.c     | 16 ++++++++++------
 drivers/md/md.h     |  4 ++--
 drivers/md/raid1.c  |  1 +
 drivers/md/raid10.c |  1 +
 4 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c322841d4edc..ac20eb2ddff7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -926,8 +926,9 @@ static void super_written(struct bio *bio)
 		pr_err("md: %s gets error=%d\n", __func__,
 		       blk_status_to_errno(bio->bi_status));
 		md_error(mddev, rdev);
-		if (!test_bit(Faulty, &rdev->flags)
-		    && (bio->bi_opf & MD_FAILFAST)) {
+		if (!test_bit(Faulty, &rdev->flags) &&
+		     !test_bit(MD_BROKEN, &mddev->flags) &&
+		     (bio->bi_opf & MD_FAILFAST)) {
 			set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
 			set_bit(LastDev, &rdev->flags);
 		}
@@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 	int err = -EINVAL;
 	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
 		md_error(rdev->mddev, rdev);
-		if (test_bit(Faulty, &rdev->flags))
+
+		if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
 			err = 0;
 		else
 			err = -EBUSY;
@@ -7974,12 +7976,14 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (!mddev->pers || !mddev->pers->error_handler)
 		return;
 	mddev->pers->error_handler(mddev,rdev);
-	if (mddev->degraded)
+	if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags))
 		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 	sysfs_notify_dirent_safe(rdev->sysfs_state);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-	md_wakeup_thread(mddev->thread);
+	if (!test_bit(MD_BROKEN, &mddev->flags)) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
 	if (mddev->event_work.func)
 		queue_work(md_misc_wq, &mddev->event_work);
 	md_new_event(mddev);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4c96c36bd01a..e01433f3b46a 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -259,8 +259,8 @@ enum mddev_flags {
 	MD_NOT_READY,		/* do_md_run() is active, so 'array_state'
 				 * must not report that array is ready yet
 				 */
-	MD_BROKEN,              /* This is used in RAID-0/LINEAR only, to stop
-				 * I/O in case an array member is gone/failed.
+	MD_BROKEN,              /* This is used to stop I/O and mark device as
+				 * dead in case an array becomes failed.
 				 */
 };
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 19598bd38939..79462d860177 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1639,6 +1639,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 		 */
 		conf->recovery_disabled = mddev->recovery_disabled;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
+		set_bit(MD_BROKEN, &mddev->flags);
 		return;
 	}
 	set_bit(Blocked, &rdev->flags);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index aa2636582841..02a4d84b4d2e 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1964,6 +1964,7 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 		 * Don't fail the drive, just return an IO error.
 		 */
 		spin_unlock_irqrestore(&conf->device_lock, flags);
+		set_bit(MD_BROKEN, &mddev->flags);
 		return;
 	}
 	if (test_and_clear_bit(In_sync, &rdev->flags))
-- 
2.26.2



* [PATCH 2/2] raid5: introduce MD_BROKEN
  2021-09-17 15:34 [PATCH 0/2] Use MD_BROKEN for redundant arrays Mariusz Tkaczyk
  2021-09-17 15:34 ` [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
@ 2021-09-17 15:34 ` Mariusz Tkaczyk
  2021-09-24 21:18   ` Song Liu
  1 sibling, 1 reply; 14+ messages in thread
From: Mariusz Tkaczyk @ 2021-09-17 15:34 UTC (permalink / raw)
  To: song; +Cc: linux-raid

The raid456 module used to allow the array to reach a failed state,
unlike the other redundant levels. This was fixed by fb73b357fb9
("raid5: block failing device if raid will be failed").
That fix introduced a bug: now if raid5 fails during IO, it may result
in a hung task that never completes. The Faulty flag on the device is
necessary to process all requests and is checked many times, mainly in
analyse_stripe().
Allow the Faulty flag to be set on the drive again, and set MD_BROKEN
if the raid becomes failed.
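
For context, the dependency looks roughly like this (a simplified
fragment, not a verbatim quote of raid5.c):

/* analyse_stripe(): a Faulty rdev is treated as missing, so stripes
 * that reference it complete with an error instead of waiting forever
 * for a device that cannot come back */
rdev = rcu_dereference(conf->disks[i].rdev);
if (rdev && test_bit(Faulty, &rdev->flags))
        rdev = NULL;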

Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/md/raid5.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 02ed53b20654..43e1ff43a222 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -690,6 +690,9 @@ static int has_failed(struct r5conf *conf)
 {
 	int degraded;
 
+	if (test_bit(MD_BROKEN, &conf->mddev->flags))
+		return 1;
+
 	if (conf->mddev->reshape_position == MaxSector)
 		return conf->mddev->degraded > conf->max_degraded;
 
@@ -2877,34 +2880,29 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 	unsigned long flags;
 	pr_debug("raid456: error called\n");
 
-	spin_lock_irqsave(&conf->device_lock, flags);
-
-	if (test_bit(In_sync, &rdev->flags) &&
-	    mddev->degraded == conf->max_degraded) {
-		/*
-		 * Don't allow to achieve failed state
-		 * Don't try to recover this device
-		 */
-		conf->recovery_disabled = mddev->recovery_disabled;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		return;
-	}
+	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
+		mdname(mddev), bdevname(rdev->bdev, b));
 
+	spin_lock_irqsave(&conf->device_lock, flags);
 	set_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	mddev->degraded = raid5_calc_degraded(conf);
+
+	if (has_failed(conf)) {
+		set_bit(MD_BROKEN, &mddev->flags);
+		conf->recovery_disabled = mddev->recovery_disabled;
+		pr_crit("md/raid:%s: Cannot continue on %d devices.\n",
+			mdname(mddev), conf->raid_disks - mddev->degraded);
+	} else
+		pr_crit("md/raid:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), conf->raid_disks - mddev->degraded);
+
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
-		"md/raid:%s: Operation continuing on %d devices.\n",
-		mdname(mddev),
-		bdevname(rdev->bdev, b),
-		mdname(mddev),
-		conf->raid_disks - mddev->degraded);
 	r5c_update_on_rdev_error(mddev, rdev);
 }
 
-- 
2.26.2



* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-17 15:34 ` [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
@ 2021-09-24 21:09   ` Song Liu
  2021-09-24 21:15     ` Song Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Song Liu @ 2021-09-24 21:09 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: linux-raid

On Fri, Sep 17, 2021 at 8:35 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> The idea of stopping all writes when the array is failed, introduced by
> 62f7b1989c0 ("md raid0/linear: Mark array as 'broken' and fail BIOs if
> a member is gone"), seems reasonable, so use MD_BROKEN for RAID1 and
> RAID10 as well. Support in RAID456 is added in the next commit.
> If userspace (mdadm) forces md to fail a device (Failure state written
> via sysfs), then EBUSY is expected if the array would become failed.
> To achieve that, check for MD_BROKEN and, if it is set, return EBUSY to
> stay compliant with userspace.
> For the faulty state, handled via ioctl, let the error_handler decide.
>
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

Thanks for the patch and sorry for the late reply.

> ---
>  drivers/md/md.c     | 16 ++++++++++------
>  drivers/md/md.h     |  4 ++--
>  drivers/md/raid1.c  |  1 +
>  drivers/md/raid10.c |  1 +
>  4 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c322841d4edc..ac20eb2ddff7 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -926,8 +926,9 @@ static void super_written(struct bio *bio)
>                 pr_err("md: %s gets error=%d\n", __func__,
>                        blk_status_to_errno(bio->bi_status));
>                 md_error(mddev, rdev);
> -               if (!test_bit(Faulty, &rdev->flags)
> -                   && (bio->bi_opf & MD_FAILFAST)) {
> +               if (!test_bit(Faulty, &rdev->flags) &&
> +                    !test_bit(MD_BROKEN, &mddev->flags) &&
> +                    (bio->bi_opf & MD_FAILFAST)) {

So with MD_BROKEN, we will not try to update the SB?

>                         set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>                         set_bit(LastDev, &rdev->flags);
>                 }
> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>         int err = -EINVAL;
>         if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>                 md_error(rdev->mddev, rdev);
> -               if (test_bit(Faulty, &rdev->flags))
> +
> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))

I don't think this makes much sense. EBUSY for already failed array
sounds weird.
Also, shall we also set MD_BROKEN here?

>                         err = 0;
>                 else
>                         err = -EBUSY;
> @@ -7974,12 +7976,14 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>         if (!mddev->pers || !mddev->pers->error_handler)
>                 return;
>         mddev->pers->error_handler(mddev,rdev);
> -       if (mddev->degraded)
> +       if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags))
>                 set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>         sysfs_notify_dirent_safe(rdev->sysfs_state);
>         set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -       set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> -       md_wakeup_thread(mddev->thread);
> +       if (!test_bit(MD_BROKEN, &mddev->flags)) {
> +               set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +               md_wakeup_thread(mddev->thread);
> +       }
>         if (mddev->event_work.func)
>                 queue_work(md_misc_wq, &mddev->event_work);
>         md_new_event(mddev);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 4c96c36bd01a..e01433f3b46a 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -259,8 +259,8 @@ enum mddev_flags {
>         MD_NOT_READY,           /* do_md_run() is active, so 'array_state'
>                                  * must not report that array is ready yet
>                                  */
> -       MD_BROKEN,              /* This is used in RAID-0/LINEAR only, to stop
> -                                * I/O in case an array member is gone/failed.
> +       MD_BROKEN,              /* This is used to stop I/O and mark device as
> +                                * dead in case an array becomes failed.
>                                  */
>  };
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 19598bd38939..79462d860177 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1639,6 +1639,7 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>                  */
>                 conf->recovery_disabled = mddev->recovery_disabled;
>                 spin_unlock_irqrestore(&conf->device_lock, flags);
> +               set_bit(MD_BROKEN, &mddev->flags);
>                 return;
>         }
>         set_bit(Blocked, &rdev->flags);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index aa2636582841..02a4d84b4d2e 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1964,6 +1964,7 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>                  * Don't fail the drive, just return an IO error.
>                  */
>                 spin_unlock_irqrestore(&conf->device_lock, flags);
> +               set_bit(MD_BROKEN, &mddev->flags);
>                 return;
>         }
>         if (test_and_clear_bit(In_sync, &rdev->flags))
> --
> 2.26.2
>


* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-24 21:09   ` Song Liu
@ 2021-09-24 21:15     ` Song Liu
  2021-09-27 14:54       ` Tkaczyk, Mariusz
  0 siblings, 1 reply; 14+ messages in thread
From: Song Liu @ 2021-09-24 21:15 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: linux-raid

On Fri, Sep 24, 2021 at 2:09 PM Song Liu <song@kernel.org> wrote:
>
> On Fri, Sep 17, 2021 at 8:35 AM Mariusz Tkaczyk
> <mariusz.tkaczyk@linux.intel.com> wrote:
> >
> > The idea of stopping all writes when the array is failed, introduced by
> > 62f7b1989c0 ("md raid0/linear: Mark array as 'broken' and fail BIOs if
> > a member is gone"), seems reasonable, so use MD_BROKEN for RAID1 and
> > RAID10 as well. Support in RAID456 is added in the next commit.
> > If userspace (mdadm) forces md to fail a device (Failure state written
> > via sysfs), then EBUSY is expected if the array would become failed.
> > To achieve that, check for MD_BROKEN and, if it is set, return EBUSY to
> > stay compliant with userspace.
> > For the faulty state, handled via ioctl, let the error_handler decide.
> >
> > Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
>
> Thanks for the patch and sorry for the late reply.
>
> > ---
> >  drivers/md/md.c     | 16 ++++++++++------
> >  drivers/md/md.h     |  4 ++--
> >  drivers/md/raid1.c  |  1 +
> >  drivers/md/raid10.c |  1 +
> >  4 files changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c322841d4edc..ac20eb2ddff7 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -926,8 +926,9 @@ static void super_written(struct bio *bio)
> >                 pr_err("md: %s gets error=%d\n", __func__,
> >                        blk_status_to_errno(bio->bi_status));
> >                 md_error(mddev, rdev);
> > -               if (!test_bit(Faulty, &rdev->flags)
> > -                   && (bio->bi_opf & MD_FAILFAST)) {
> > +               if (!test_bit(Faulty, &rdev->flags) &&
> > +                    !test_bit(MD_BROKEN, &mddev->flags) &&
> > +                    (bio->bi_opf & MD_FAILFAST)) {
>
> So with MD_BROKEN, we will not try to update the SB?
>
> >                         set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> >                         set_bit(LastDev, &rdev->flags);
> >                 }
> > @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
> >         int err = -EINVAL;
> >         if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
> >                 md_error(rdev->mddev, rdev);
> > -               if (test_bit(Faulty, &rdev->flags))
> > +
> > +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
>
> I don't think this makes much sense. EBUSY for already failed array
> sounds weird.
> Also, shall we also set MD_BROKEN here?
>
Actually, we just called md_error above, so we don't need to set MD_BROKEN here.
But we shouldn't return EBUSY in such cases, right?


* Re: [PATCH 2/2] raid5: introduce MD_BROKEN
  2021-09-17 15:34 ` [PATCH 2/2] raid5: introduce MD_BROKEN Mariusz Tkaczyk
@ 2021-09-24 21:18   ` Song Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Song Liu @ 2021-09-24 21:18 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: linux-raid

On Fri, Sep 17, 2021 at 8:35 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> The raid456 module used to allow the array to reach a failed state,
> unlike the other redundant levels. This was fixed by fb73b357fb9
> ("raid5: block failing device if raid will be failed").
> That fix introduced a bug: now if raid5 fails during IO, it may result
> in a hung task that never completes.

It would be great if we could add repro steps and/or the hung-task
stack trace here.

> The Faulty flag on the device is
> necessary to process all requests and is checked many times, mainly in
> analyse_stripe().
> Allow the Faulty flag to be set on the drive again, and set MD_BROKEN
> if the raid becomes failed.
>
> Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>

For both patches, please provide more information about what is being
fixed and describe the behavior.

Thanks,
Song


* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-24 21:15     ` Song Liu
@ 2021-09-27 14:54       ` Tkaczyk, Mariusz
  2021-09-27 22:59         ` Song Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Tkaczyk, Mariusz @ 2021-09-27 14:54 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

Hi Song,
Thanks for the review.

On 24.09.2021 23:15, Song Liu wrote:
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index c322841d4edc..ac20eb2ddff7 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -926,8 +926,9 @@ static void super_written(struct bio *bio)
>>>                  pr_err("md: %s gets error=%d\n", __func__,
>>>                         blk_status_to_errno(bio->bi_status));
>>>                  md_error(mddev, rdev);
>>> -               if (!test_bit(Faulty, &rdev->flags)
>>> -                   && (bio->bi_opf & MD_FAILFAST)) {
>>> +               if (!test_bit(Faulty, &rdev->flags) &&
>>> +                    !test_bit(MD_BROKEN, &mddev->flags) &&
>>> +                    (bio->bi_opf & MD_FAILFAST)) {
>>
>> So with MD_BROKEN, we will not try to update the SB?
The array is dead; is there added value in writing the failed state?

For external arrays the failed state is not written, because the drive
is not removed from the MD device and the metadata manager cannot
detect the array failure. This is how it was originally implemented
(except for raid5, but I aligned it around two years ago [1]). I tried
to make it consistent for everyone, but I failed. The second patch
restores the kernel's ability to remove a drive for raid5, so the
failed state will be detected (for external) again.
So, maybe I should drop this change for native metadata. What do you
think?

>>
>>>                          set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>                          set_bit(LastDev, &rdev->flags);
>>>                  }
>>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>>>          int err = -EINVAL;
>>>          if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>>>                  md_error(rdev->mddev, rdev);
>>> -               if (test_bit(Faulty, &rdev->flags))
>>> +
>>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
>>
>> I don't think this makes much sense. EBUSY for already failed array
>> sounds weird.
>> Also, shall we also set MD_BROKEN here?
>>
> Actually, we just called md_error above, so we don't need to set MD_BROKEN here.
> But we shouldn't return EBUSY in such cases, right?
>
About EBUSY:
This is how it is implemented in mdadm; we expect it in case of
failure. See my fix [2].
I agree that it can be confusing, but this is how it works.
Do you want to change it across mdadm and md?
That would break compatibility.

About MD_BROKEN:
As you can see, we determine failure by checking the rdev state: if
"Faulty" is not set in flags after md_error(), then the array is
assumed to be failed and EBUSY is returned to userspace.

I changed the verification to check the MD_BROKEN flag instead. It is
potentially harmful for raid1 and raid10, as the old way works well,
but the main goal was to unify it across levels. The current
verification method is not flexible enough for raid5 (second patch).
A small benefit is that new IO will be failed faster, in
md_submit_bio().
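
(For reference, the fast path meant here is the check introduced by
62f7b1989c0; quoted from memory, so treat it as a sketch:)

/* md_submit_bio(): writes to a broken array are failed immediately,
 * before they ever reach the personality */
const int rw = bio_data_dir(bio);

if (unlikely(test_bit(MD_BROKEN, &mddev->flags)) && (rw == WRITE)) {
        bio_io_error(bio);
        return BLK_QC_T_NONE;
}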

Thanks,
Mariusz

[1] https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/commit/?id=fb73b3
[2] https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/commit/?id=cb8f5371


* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-27 14:54       ` Tkaczyk, Mariusz
@ 2021-09-27 22:59         ` Song Liu
  2021-09-28  7:33           ` Tkaczyk, Mariusz
  0 siblings, 1 reply; 14+ messages in thread
From: Song Liu @ 2021-09-27 22:59 UTC (permalink / raw)
  To: Tkaczyk, Mariusz; +Cc: linux-raid

On Mon, Sep 27, 2021 at 7:54 AM Tkaczyk, Mariusz
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> Hi Song,
> Thanks for the review.
>
> On 24.09.2021 23:15, Song Liu wrote:
> >>> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >>> index c322841d4edc..ac20eb2ddff7 100644
> >>> --- a/drivers/md/md.c
> >>> +++ b/drivers/md/md.c
> >>> @@ -926,8 +926,9 @@ static void super_written(struct bio *bio)
> >>>                  pr_err("md: %s gets error=%d\n", __func__,
> >>>                         blk_status_to_errno(bio->bi_status));
> >>>                  md_error(mddev, rdev);
> >>> -               if (!test_bit(Faulty, &rdev->flags)
> >>> -                   && (bio->bi_opf & MD_FAILFAST)) {
> >>> +               if (!test_bit(Faulty, &rdev->flags) &&
> >>> +                    !test_bit(MD_BROKEN, &mddev->flags) &&
> >>> +                    (bio->bi_opf & MD_FAILFAST)) {
> >>
> >> So with MD_BROKEN, we will not try to update the SB?
> The array is dead; is there added value in writing the failed state?

I think there is still value in remembering this. Say a raid1 with 2
drives, A and B. If B is unplugged from the system, we would like to
update the SB on A to remember that. Otherwise, if B is plugged back
later (maybe after a system reset), we can't tell which one has the
latest data.

Does this make sense?

>
> For external arrays the failed state is not written, because the drive
> is not removed from the MD device and the metadata manager cannot
> detect the array failure. This is how it was originally implemented
> (except for raid5, but I aligned it around two years ago [1]). I tried
> to make it consistent for everyone, but I failed. The second patch
> restores the kernel's ability to remove a drive for raid5, so the
> failed state will be detected (for external) again.
> So, maybe I should drop this change for native metadata. What do you
> think?
>
> >>
> >>>                          set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> >>>                          set_bit(LastDev, &rdev->flags);
> >>>                  }
> >>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
> >>>          int err = -EINVAL;
> >>>          if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
> >>>                  md_error(rdev->mddev, rdev);
> >>> -               if (test_bit(Faulty, &rdev->flags))
> >>> +
> >>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
> >>
> >> I don't think this makes much sense. EBUSY for already failed array
> >> sounds weird.
> >> Also, shall we also set MD_BROKEN here?
> >>
> > Actually, we just called md_error above, so we don't need to set MD_BROKEN here.
> > But we shouldn't return EBUSY in such cases, right?
> >
> About EBUSY:
> This is how it is implemented in mdadm; we expect it in case of
> failure. See my fix [2].
> I agree that it can be confusing, but this is how it works.
> Do you want to change it across mdadm and md?
> That would break compatibility.
>
> About MD_BROKEN:
> As you can see, we determine failure by checking the rdev state: if
> "Faulty" is not set in flags after md_error(), then the array is
> assumed to be failed and EBUSY is returned to userspace.

This changed the behavior for raid0, no?

W/o the change mdadm --fail on raid0 will get EBUSY. W/ this change,
it will get 0, and the device is NOT marked as faulty, right?

Song


* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-27 22:59         ` Song Liu
@ 2021-09-28  7:33           ` Tkaczyk, Mariusz
  2021-09-28  7:55             ` Tkaczyk, Mariusz
  0 siblings, 1 reply; 14+ messages in thread
From: Tkaczyk, Mariusz @ 2021-09-28  7:33 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

Hi Song,

On 28.09.2021 00:59, Song Liu wrote:
>>>>> +               if (!test_bit(Faulty, &rdev->flags) &&
>>>>> +                    !test_bit(MD_BROKEN, &mddev->flags) &&
>>>>> +                    (bio->bi_opf & MD_FAILFAST)) {
>>>>
>>>> So with MD_BROKEN, we will not try to update the SB?
>> The array is dead; is there added value in writing the failed state?
> 
> I think there is still value in remembering this. Say a raid1 with 2
> drives, A and B. If B is unplugged from the system, we would like to
> update the SB on A to remember that. Otherwise, if B is plugged back
> later (maybe after a system reset), we can't tell which one has the
> latest data.
> 
> Does this make sense?

Removing one drive from a raid1 array doesn't cause a raid failure.
So, removing B will be recorded on A.
Raid1 is not a good example because to fail the array we need to remove
all members, so MD_BROKEN doesn't matter because
!test_bit(Faulty, &rdev->flags) is true.

Let's say we have raid10 with members A, B, C, D. Member A is removed,
and it is recorded correctly (we are degraded now). Next, member B is
removed, which causes an array failure.
W/ my patch:
member B's failure is not saved on members C and D. The raid is failed,
but it is not recorded in the metadata.
W/o my patch:
member B's failure is saved on C and D, and the metadata is in the
failed state.
>>>>
>>>>>                           set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>>>                           set_bit(LastDev, &rdev->flags);
>>>>>                   }
>>>>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>>>>>           int err = -EINVAL;
>>>>>           if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>>>>>                   md_error(rdev->mddev, rdev);
>>>>> -               if (test_bit(Faulty, &rdev->flags))
>>>>> +
>>>>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
>>>>
>>>> I don't think this makes much sense. EBUSY for already failed array
>>>> sounds weird.
>>>> Also, shall we also set MD_BROKEN here?
>>>>
>>> Actually, we just called md_error above, so we don't need to set MD_BROKEN here.
>>> But we shouldn't return EBUSY in such cases, right?
>>>
>> About EBUSY:
>> This is how it is implemented in mdadm; we expect it in case of
>> failure. See my fix [2].
>> I agree that it can be confusing, but this is how it works.
>> Do you want to change it across mdadm and md?
>> That would break compatibility.
>>
>> About MD_BROKEN:
>> As you can see, we determine failure by checking the rdev state: if
>> "Faulty" is not set in flags after md_error(), then the array is
>> assumed to be failed and EBUSY is returned to userspace.
> 
> This changed the behavior for raid0, no?
> 
> W/o the change mdadm --fail on raid0 will get EBUSY. W/ this change,
> it will get 0, and the device is NOT marked as faulty, right?
> 
See commit mentioned in description. MD_BROKEN is used for raid0,
so EBUSY is returned, same as w/o patch.

Thanks,
Mariusz


* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-28  7:33           ` Tkaczyk, Mariusz
@ 2021-09-28  7:55             ` Tkaczyk, Mariusz
  2021-09-29  1:29               ` Song Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Tkaczyk, Mariusz @ 2021-09-28  7:55 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

On 28.09.2021 09:33, Tkaczyk, Mariusz wrote:
> Hi Song,
> 
> On 28.09.2021 00:59, Song Liu wrote:
>>>>>> +               if (!test_bit(Faulty, &rdev->flags) &&
>>>>>> +                    !test_bit(MD_BROKEN, &mddev->flags) &&
>>>>>> +                    (bio->bi_opf & MD_FAILFAST)) {
>>>>>
>>>>> So with MD_BROKEN, we will not try to update the SB?
>>> The array is dead; is there added value in writing the failed state?
>>
>> I think there is still value in remembering this. Say a raid1 with 2
>> drives, A and B. If B is unplugged from the system, we would like to
>> update the SB on A to remember that. Otherwise, if B is plugged back
>> later (maybe after a system reset), we can't tell which one has the
>> latest data.
>>
>> Does this make sense?
> 
> Removing one drive from a raid1 array doesn't cause a raid failure.
> So, removing B will be recorded on A.
> Raid1 is not a good example because to fail the array we need to remove
> all members, so MD_BROKEN doesn't matter because
> !test_bit(Faulty, &rdev->flags) is true.
is false.

Oh, I messed it up. There is no Faulty flag in this case; we cannot
remove the last drive. Since the member is (physically) gone, there is
no chance to update the metadata, even if it is requested.

> 
> Let's say we have raid10 with members A, B, C, D. Member A is removed,
> and it is recorded correctly (we are degraded now). Next, member B is
> removed, which causes an array failure.
> W/ my patch:
> member B's failure is not saved on members C and D. The raid is failed,
> but it is not recorded in the metadata.
> W/o my patch:
> member B's failure is saved on C and D, and the metadata is in the
> failed state.
>>>>>
>>>>>>                           set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>>>>                           set_bit(LastDev, &rdev->flags);
>>>>>>                   }
>>>>>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf, 
>>>>>> size_t len)
>>>>>>           int err = -EINVAL;
>>>>>>           if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>>>>>>                   md_error(rdev->mddev, rdev);
>>>>>> -               if (test_bit(Faulty, &rdev->flags))
>>>>>> +
>>>>>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
>>>>>
>>>>> I don't think this makes much sense. EBUSY for already failed array
>>>>> sounds weird.
>>>>> Also, shall we also set MD_BROKEN here?
>>>>>
>>>> Actually, we just called md_error above, so we don't need to set MD_BROKEN 
>>>> here.
>>>> But we shouldn't return EBUSY in such cases, right?
>>>>
>>> About EBUSY:
>>> This is how it is implemented in mdadm; we expect it in case of
>>> failure. See my fix [2].
>>> I agree that it can be confusing, but this is how it works.
>>> Do you want to change it across mdadm and md?
>>> That would break compatibility.
>>>
>>> About MD_BROKEN:
>>> As you can see, we determine failure by checking the rdev state: if
>>> "Faulty" is not set in flags after md_error(), then the array is
>>> assumed to be failed and EBUSY is returned to userspace.
>>
>> This changed the behavior for raid0, no?
>>
>> W/o the change mdadm --fail on raid0 will get EBUSY. W/ this change,
>> it will get 0, and the device is NOT marked as faulty, right?
>>
> See commit mentioned in description. MD_BROKEN is used for raid0,
> so EBUSY is returned, same as w/o patch.
> 
> Thanks,
> Mariusz



* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-28  7:55             ` Tkaczyk, Mariusz
@ 2021-09-29  1:29               ` Song Liu
  2021-09-30 11:23                 ` Tkaczyk, Mariusz
  0 siblings, 1 reply; 14+ messages in thread
From: Song Liu @ 2021-09-29  1:29 UTC (permalink / raw)
  To: Tkaczyk, Mariusz; +Cc: linux-raid

On Tue, Sep 28, 2021 at 12:55 AM Tkaczyk, Mariusz
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> On 28.09.2021 09:33, Tkaczyk, Mariusz wrote:
> > Hi Song,
> >
> > On 28.09.2021 00:59, Song Liu wrote:
> >>>>>> +               if (!test_bit(Faulty, &rdev->flags) &&
> >>>>>> +                    !test_bit(MD_BROKEN, &mddev->flags) &&
> >>>>>> +                    (bio->bi_opf & MD_FAILFAST)) {
> >>>>>
> >>>>> So with MD_BROKEN, we will not try to update the SB?
> >>> The array is dead; is there added value in writing the failed state?
> >>
> >> I think there is still value in remembering this. Say a raid1 with 2
> >> drives, A and B. If B is unplugged from the system, we would like to
> >> update the SB on A to remember that. Otherwise, if B is plugged back
> >> later (maybe after a system reset), we can't tell which one has the
> >> latest data.
> >>
> >> Does this make sense?
> >
> > Removing one drive from a raid1 array doesn't cause a raid failure.
> > So, removing B will be recorded on A.
> > Raid1 is not a good example because to fail the array we need to remove
> > all members, so MD_BROKEN doesn't matter because
> > !test_bit(Faulty, &rdev->flags) is true.
> is false.
>
> Oh, I messed it up. There is no Faulty flag in this case; we cannot
> remove the last drive. Since the member is (physically) gone, there is
> no chance to update the metadata, even if it is requested.
>
> >
> > Let's say we have raid10 with members A, B, C, D. Member A is removed,
> > and it is recorded correctly (we are degraded now). Next, member B is
> > removed, which causes an array failure.
> > W/ my patch:
> > member B's failure is not saved on members C and D. The raid is failed,
> > but it is not recorded in the metadata.
> > W/o my patch:
> > member B's failure is saved on C and D, and the metadata is in the
> > failed state.
> >>>>>
> >>>>>>                           set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> >>>>>>                           set_bit(LastDev, &rdev->flags);
> >>>>>>                   }
> >>>>>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf,
> >>>>>> size_t len)
> >>>>>>           int err = -EINVAL;
> >>>>>>           if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
> >>>>>>                   md_error(rdev->mddev, rdev);
> >>>>>> -               if (test_bit(Faulty, &rdev->flags))
> >>>>>> +
> >>>>>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
> >>>>>
> >>>>> I don't think this makes much sense. EBUSY for already failed array
> >>>>> sounds weird.
> >>>>> Also, shall we also set MD_BROKEN here?
> >>>>>
> >>>> Actually, we just called md_error above, so we don't need to set MD_BROKEN
> >>>> here.
> >>>> But we shouldn't return EBUSY in such cases, right?
> >>>>
> >>> About EBUSY:
> >>> This is how it is implemented in mdadm; we expect it in case of
> >>> failure. See my fix [2].
> >>> I agree that it can be confusing, but this is how it works.
> >>> Do you want to change it across mdadm and md?
> >>> That would break compatibility.
> >>>
> >>> About MD_BROKEN:
> >>> As you can see, we determine failure by checking the rdev state: if
> >>> "Faulty" is not set in flags after md_error(), then the array is
> >>> assumed to be failed and EBUSY is returned to userspace.
> >>
> >> This changed the behavior for raid0, no?
> >>
> >> W/o the change mdadm --fail on raid0 will get EBUSY. W/ this change,
> >> it will get 0, and the device is NOT marked as faulty, right?
> >>
> > See commit mentioned in description. MD_BROKEN is used for raid0,
> > so EBUSY is returned, same as w/o patch.

Hmm... I am still confused. In state_store(), md_error is a no-op for raid0,
which will not set Faulty or MD_BROKEN. So we will get -EBUSY w/o
the patch and 0 w/ the patch, no? It is probably not a serious issue though.
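
(For reference, the guard at the top of md_error(), visible in the
hunk above, is what makes it a no-op here:)

/* raid0 and linear register no ->error_handler, so md_error()
 * returns before any Faulty/MD_BROKEN handling runs */
if (!mddev->pers || !mddev->pers->error_handler)
        return;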

Thanks,
Song


* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-29  1:29               ` Song Liu
@ 2021-09-30 11:23                 ` Tkaczyk, Mariusz
  2021-09-30 15:56                   ` Song Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Tkaczyk, Mariusz @ 2021-09-30 11:23 UTC (permalink / raw)
  To: Song Liu; +Cc: linux-raid

Hi Song,
On 29.09.2021 03:29, Song Liu wrote:
>>>>>>>
>>>>>>>>                            set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>>>>>>                            set_bit(LastDev, &rdev->flags);
>>>>>>>>                    }
>>>>>>>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf,
>>>>>>>> size_t len)
>>>>>>>>            int err = -EINVAL;
>>>>>>>>            if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>>>>>>>>                    md_error(rdev->mddev, rdev);
>>>>>>>> -               if (test_bit(Faulty, &rdev->flags))
>>>>>>>> +
>>>>>>>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
>>>>>>>
>>>>>>> I don't think this makes much sense. EBUSY for already failed array
>>>>>>> sounds weird.
>>>>>>> Also, shall we also set MD_BROKEN here?
>>>>>>>
>>>>>> Actually, we just called md_error above, so we don't need to set MD_BROKEN
>>>>>> here.
>>>>>> But we shouldn't return EBUSY in such cases, right?
>>>>>>
>>>>> About EBUSY:
>>>>> This is how it is implemented in mdadm; we expect it in case of
>>>>> failure. See my fix [2].
>>>>> I agree that it can be confusing, but this is how it works.
>>>>> Do you want to change it across mdadm and md?
>>>>> That would break compatibility.
>>>>>
>>>>> About MD_BROKEN:
>>>>> As you can see, we determine failure by checking the rdev state: if
>>>>> "Faulty" is not set in flags after md_error(), then the array is
>>>>> assumed to be failed and EBUSY is returned to userspace.
>>>>
>>>> This changed the behavior for raid0, no?
>>>>
>>>> W/o the change mdadm --fail on raid0 will get EBUSY. W/ this change,
>>>> it will get 0, and the device is NOT marked as faulty, right?
>>>>
>>> See commit mentioned in description. MD_BROKEN is used for raid0,
>>> so EBUSY is returned, same as w/o patch.
> 
> Hmm... I am still confused. In state_store(), md_error is a no-op for raid0,
> which will not set Faulty or MD_BROKEN. So we will get -EBUSY w/o
> the patch and 0 w/ the patch, no? It is probably not a serious issue though.
> 
>
Yeah, you are right. There is no error_handler. I missed that.
Now, I have reviewed raid0 again.

With my change the result won't be clear-cut for raid0; it is
correlated with IO. When the drive disappears and there is IO, it could
return -EBUSY, because raid0_make_request() may set MD_BROKEN first. If
there is no IO, then 0 will be returned. I need to close this gap.
Thanks for catching this.

So, please tell me, are you ok with the idea of this patch? I will send
the requested details with v2, but I want to know whether I chose a
good way to close the raid5 issue, which is the main problem.

Thanks,
Mariusz




* Re: [PATCH 1/2] md, raid1, raid10: Set MD_BROKEN for RAID1 and RAID10
  2021-09-30 11:23                 ` Tkaczyk, Mariusz
@ 2021-09-30 15:56                   ` Song Liu
  0 siblings, 0 replies; 14+ messages in thread
From: Song Liu @ 2021-09-30 15:56 UTC (permalink / raw)
  To: Tkaczyk, Mariusz; +Cc: linux-raid

On Thu, Sep 30, 2021 at 4:23 AM Tkaczyk, Mariusz
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> Hi Song,
> On 29.09.2021 03:29, Song Liu wrote:
> >>>>>>>
> >>>>>>>>                            set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> >>>>>>>>                            set_bit(LastDev, &rdev->flags);
> >>>>>>>>                    }
> >>>>>>>> @@ -2979,7 +2980,8 @@ state_store(struct md_rdev *rdev, const char *buf,
> >>>>>>>> size_t len)
> >>>>>>>>            int err = -EINVAL;
> >>>>>>>>            if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
> >>>>>>>>                    md_error(rdev->mddev, rdev);
> >>>>>>>> -               if (test_bit(Faulty, &rdev->flags))
> >>>>>>>> +
> >>>>>>>> +               if (!test_bit(MD_BROKEN, &rdev->mddev->flags))
> >>>>>>>
> >>>>>>> I don't think this makes much sense. EBUSY for already failed array
> >>>>>>> sounds weird.
> >>>>>>> Also, shall we also set MD_BROKEN here?
> >>>>>>>
> >>>>>> Actually, we just called md_error above, so we don't need to set MD_BROKEN
> >>>>>> here.
> >>>>>> But we shouldn't return EBUSY in such cases, right?
> >>>>>>
> >>>>> About EBUSY:
> >>>>> This is how it is implemented in mdadm; we expect it in case of
> >>>>> failure. See my fix [2].
> >>>>> I agree that it can be confusing, but this is how it works.
> >>>>> Do you want to change it across mdadm and md?
> >>>>> That would break compatibility.
> >>>>>
> >>>>> About MD_BROKEN:
> >>>>> As you can see, we determine failure by checking the rdev state: if
> >>>>> "Faulty" is not set in flags after md_error(), then the array is
> >>>>> assumed to be failed and EBUSY is returned to userspace.
> >>>>
> >>>> This changed the behavior for raid0, no?
> >>>>
> >>>> W/o the change mdadm --fail on raid0 will get EBUSY. W/ this change,
> >>>> it will get 0, and the device is NOT marked as faulty, right?
> >>>>
> >>> See commit mentioned in description. MD_BROKEN is used for raid0,
> >>> so EBUSY is returned, same as w/o patch.
> >
> > Hmm... I am still confused. In state_store(), md_error is a no-op for raid0,
> > which will not set Faulty or MD_BROKEN. So we will get -EBUSY w/o
> > the patch and 0 w/ the patch, no? It is probably not a serious issue though.
> >
> >
> Yeah, you are right. There is no error_handler. I missed that.
> Now, I have reviewed raid0 again.
>
> With my change the result won't be clear-cut for raid0; it is
> correlated with IO. When the drive disappears and there is IO, it could
> return -EBUSY, because raid0_make_request() may set MD_BROKEN first. If
> there is no IO, then 0 will be returned. I need to close this gap.
> Thanks for catching this.
>
> So, please tell me, are you ok with the idea of this patch? I will send
> the requested details with v2, but I want to know whether I chose a
> good way to close the raid5 issue, which is the main problem.

I think the idea is good. Maybe we can just add an error_handler for
raid0 so it is more consistent?
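
Something along these lines, perhaps (an untested sketch; the function
name and log message are illustrative, not from this series):

static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
{
        char b[BDEVNAME_SIZE];

        /* the first reported device failure marks the array as broken */
        if (!test_and_set_bit(MD_BROKEN, &mddev->flags))
                pr_crit("md/raid0:%s: Disk failure on %s, failing array.\n",
                        mdname(mddev), bdevname(rdev->bdev, b));
}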

Besides that, please also add more explanations about MD_BROKEN in the
next version.

Thanks,
Song


