* [PATCH v3 0/3] Improve failed arrays handling.
@ 2022-01-27 15:39 Mariusz Tkaczyk
  2022-01-27 15:39 ` [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear Mariusz Tkaczyk
                   ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-01-27 15:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid

Hi Song,
I've made changes as discussed in v2[1]. I did manual testing with IMSM
metadata.

Patch 1:
- "%pg" propagated and raid0/linear_error refactored as Suggested by Guoqing.
- missed dm-event, suggested by Guoqing- verified. IMO the behavior is same as
  before.

Patch 2:
- Commit id fixed, suggested by Guoqing.
- Description reworked, suggested by Xiao (discussed offline).
- fail_last_dev handling added (and verified).
- MD_BROKEN description modified, suggested by Guoqing.
- Descriptions for raid1_error and raid10_error added, redundant comments
  removed.

Patch 3:
- Error message for failed array changed, suggested by you.
- MD_BROKEN setter moved to has_failed(), suggested by Guoqing.
- has_failed() refactored.

Other notes:
I followed the kernel-doc style guidelines when editing or adding new
descriptions. Please let me know if you consider it unnecessary or messy.

I also noticed a potential issue related to MD_FAILFAST_SUPPORTED during the
refactor; please see the flag definition. I'm wondering whether the
fail_last_dev functionality conflicts with failfast. Should I start a
separate thread for that?
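
For reference, the kernel-doc description added for that flag in patch 2
reads:

 * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
 *			    calls to md_error() will never cause the array to
 *			    become failed.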

[1] https://lore.kernel.org/linux-raid/CAPhsuW43QfDNGEu72o2_eqDZ5vGq3tbFvdZ-W+dxVqcEhHmJ5w@mail.gmail.com/T/#t

Mariusz Tkaczyk (3):
  raid0, linear, md: add error_handlers for raid0 and linear
  md: Set MD_BROKEN for RAID1 and RAID10
  raid5: introduce MD_BROKEN

 drivers/md/md-linear.c | 15 ++++++++-
 drivers/md/md.c        | 23 +++++++++-----
 drivers/md/md.h        | 72 ++++++++++++++++++++++--------------------
 drivers/md/raid0.c     | 15 ++++++++-
 drivers/md/raid1.c     | 42 ++++++++++++++----------
 drivers/md/raid10.c    | 33 ++++++++++++-------
 drivers/md/raid5.c     | 49 ++++++++++++++--------------
 7 files changed, 151 insertions(+), 98 deletions(-)

-- 
2.26.2



* [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-01-27 15:39 [PATCH v3 0/3] Improve failed arrays handling Mariusz Tkaczyk
@ 2022-01-27 15:39 ` Mariusz Tkaczyk
  2022-02-12  1:12   ` Guoqing Jiang
  2022-01-27 15:39 ` [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-01-27 15:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid

Commit 62f7b1989c0 ("md raid0/linear: Mark array as 'broken' and fail BIOs
if a member is gone") allowed writes to be finished earlier (before the
level-dependent actions) for non-redundant arrays.

To achieve that, MD_BROKEN is added to mddev->flags if a drive disappearance
is detected. This is done in is_mddev_broken(), which is confusing and not
consistent with the other levels, where error_handler() is used.
This patch adds an appropriate error_handler for raid0 and linear.

It also adjusts md_error(): for these levels we only want to call
.error_handler. mddev->pers->sync_request is additionally checked; its
existence implies a level with redundancy.
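
Roughly, the resulting md_error() flow is (simplified sketch; the exact
change is in the diff below):

void md_error(struct mddev *mddev, struct md_rdev *rdev)
{
	if (!mddev->pers || !mddev->pers->error_handler)
		return;

	/* raid0/linear now provide .error_handler too; it sets MD_BROKEN */
	mddev->pers->error_handler(mddev, rdev);

	/* no ->sync_request means no redundancy, so nothing to recover */
	if (!mddev->pers->sync_request)
		return;

	/* redundant levels fall through to the existing recovery handling */
	[ ... ]
}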

Usage of error_handler means that a disk failure can be requested from
userspace. A user can fail the array via the #mdadm --set-faulty command.
This is not safe and will be fixed in mdadm. It is correctable because the
failed state is not recorded in the metadata; after the next assembly the
array will be read-write again. For safety reasons it is better to keep
MD_BROKEN runtime only.

Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/md/md-linear.c | 15 ++++++++++++++-
 drivers/md/md.c        |  6 +++++-
 drivers/md/md.h        | 10 ++--------
 drivers/md/raid0.c     | 15 ++++++++++++++-
 4 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index 1ff51647a682..3c368e3e4641 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -233,7 +233,8 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
 		     bio_sector < start_sector))
 		goto out_of_bounds;
 
-	if (unlikely(is_mddev_broken(tmp_dev->rdev, "linear"))) {
+	if (unlikely(is_rdev_broken(tmp_dev->rdev))) {
+		md_error(mddev, tmp_dev->rdev);
 		bio_io_error(bio);
 		return true;
 	}
@@ -281,6 +282,17 @@ static void linear_status (struct seq_file *seq, struct mddev *mddev)
 	seq_printf(seq, " %dk rounding", mddev->chunk_sectors / 2);
 }
 
+static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
+{
+	if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags)) {
+		char *md_name = mdname(mddev);
+
+		pr_crit("md/linear%s: Disk failure on %pg detected.\n"
+			"md/linear:%s: Cannot continue, failing array.\n",
+			md_name, rdev->bdev, md_name);
+	}
+}
+
 static void linear_quiesce(struct mddev *mddev, int state)
 {
 }
@@ -297,6 +309,7 @@ static struct md_personality linear_personality =
 	.hot_add_disk	= linear_add,
 	.size		= linear_size,
 	.quiesce	= linear_quiesce,
+	.error_handler	= linear_error,
 };
 
 static int __init linear_init (void)
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e8666bdc0d28..f888ef197765 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7982,7 +7982,11 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 
 	if (!mddev->pers || !mddev->pers->error_handler)
 		return;
-	mddev->pers->error_handler(mddev,rdev);
+	mddev->pers->error_handler(mddev, rdev);
+
+	if (!mddev->pers->sync_request)
+		return;
+
 	if (mddev->degraded)
 		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 	sysfs_notify_dirent_safe(rdev->sysfs_state);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 53ea7a6961de..bc3f2094d0b6 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -762,15 +762,9 @@ extern void mddev_destroy_serial_pool(struct mddev *mddev, struct md_rdev *rdev,
 struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
 struct md_rdev *md_find_rdev_rcu(struct mddev *mddev, dev_t dev);
 
-static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type)
+static inline bool is_rdev_broken(struct md_rdev *rdev)
 {
-	if (!disk_live(rdev->bdev->bd_disk)) {
-		if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags))
-			pr_warn("md: %s: %s array has a missing/failed member\n",
-				mdname(rdev->mddev), md_type);
-		return true;
-	}
-	return false;
+	return !disk_live(rdev->bdev->bd_disk);
 }
 
 static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 62c8b6adac70..707efb3825f0 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -564,8 +564,9 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 		return true;
 	}
 
-	if (unlikely(is_mddev_broken(tmp_dev, "raid0"))) {
+	if (unlikely(is_rdev_broken(tmp_dev))) {
 		bio_io_error(bio);
+		md_error(mddev, tmp_dev);
 		return true;
 	}
 
@@ -588,6 +589,17 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
 	return;
 }
 
+static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
+{
+	if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags)) {
+		char *md_name = mdname(mddev);
+
+		pr_crit("md/raid0%s: Disk failure on %pg detected.\n"
+			"md/raid0:%s: Cannot continue, failing array.\n",
+			md_name, rdev->bdev, md_name);
+	}
+}
+
 static void *raid0_takeover_raid45(struct mddev *mddev)
 {
 	struct md_rdev *rdev;
@@ -763,6 +775,7 @@ static struct md_personality raid0_personality=
 	.size		= raid0_size,
 	.takeover	= raid0_takeover,
 	.quiesce	= raid0_quiesce,
+	.error_handler	= raid0_error,
 };
 
 static int __init raid0_init (void)
-- 
2.26.2



* [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10
  2022-01-27 15:39 [PATCH v3 0/3] Improve failed arrays handling Mariusz Tkaczyk
  2022-01-27 15:39 ` [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear Mariusz Tkaczyk
@ 2022-01-27 15:39 ` Mariusz Tkaczyk
  2022-01-31  8:29   ` Xiao Ni
  2022-02-12  1:17   ` Guoqing Jiang
  2022-01-27 15:39 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
  2022-02-08  7:18 ` [PATCH v3 0/3] Improve failed arrays handling Song Liu
  3 siblings, 2 replies; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-01-27 15:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid

There is no direct mechanism to determine raid failure outside the
personality. It is done by checking rdev->flags after executing md_error().
If the "faulty" flag is not set, then -EBUSY is returned to userspace.
-EBUSY means that the array will be failed after the drive removal.

Mdadm has a special routine to handle array failure, and it is executed
if -EBUSY is returned by md.

There are at least two known reasons not to consider this mechanism
correct:
1. A drive can be removed even if the array will be failed[1].
2. -EBUSY seems to be the wrong status. The array is not busy, but the
   removal process cannot proceed safely.

The -EBUSY expectation cannot be removed without breaking compatibility
with userspace. In this patch the first issue is resolved by adding support
for the MD_BROKEN flag to RAID1 and RAID10. Support for RAID456 is added in
the next commit.

The idea is to set MD_BROKEN if we are sure that the raid is now in a failed
state. This is done in each error_handler(). In md_error() the MD_BROKEN
flag is checked; if it is set, then -EBUSY is returned to userspace.
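
A minimal sketch of the resulting contract (the exact hunks are in the diff
below; "array_is_now_failed" is only a placeholder for the level-specific
check):

	/* in a personality's error_handler(): */
	if (array_is_now_failed)
		set_bit(MD_BROKEN, &mddev->flags);

	/* in state_store()/set_disk_faulty(), after md_error(): */
	if (test_bit(MD_BROKEN, &mddev->flags))
		err = -EBUSY;
	else
		err = 0;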

As in the previous commit, this means that #mdadm --set-faulty is able to
fail the array. The previously proposed workaround is valid if the optional
functionality[1] is disabled.

[1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
    RAID1/RAID10.")
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/md/md.c     | 17 ++++++++-----
 drivers/md/md.h     | 62 +++++++++++++++++++++++++--------------------
 drivers/md/raid1.c  | 41 +++++++++++++++++-------------
 drivers/md/raid10.c | 33 ++++++++++++++++--------
 4 files changed, 91 insertions(+), 62 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f888ef197765..fda8473f96b8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2983,10 +2983,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 
 	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
 		md_error(rdev->mddev, rdev);
-		if (test_bit(Faulty, &rdev->flags))
-			err = 0;
-		else
+
+		if (test_bit(MD_BROKEN, &rdev->mddev->flags))
 			err = -EBUSY;
+		else
+			err = 0;
 	} else if (cmd_match(buf, "remove")) {
 		if (rdev->mddev->pers) {
 			clear_bit(Blocked, &rdev->flags);
@@ -7441,7 +7442,7 @@ static int set_disk_faulty(struct mddev *mddev, dev_t dev)
 		err =  -ENODEV;
 	else {
 		md_error(mddev, rdev);
-		if (!test_bit(Faulty, &rdev->flags))
+		if (test_bit(MD_BROKEN, &mddev->flags))
 			err = -EBUSY;
 	}
 	rcu_read_unlock();
@@ -7987,12 +7988,14 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
 	if (!mddev->pers->sync_request)
 		return;
 
-	if (mddev->degraded)
+	if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags))
 		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
 	sysfs_notify_dirent_safe(rdev->sysfs_state);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
-	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
-	md_wakeup_thread(mddev->thread);
+	if (!test_bit(MD_BROKEN, &mddev->flags)) {
+		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
+		md_wakeup_thread(mddev->thread);
+	}
 	if (mddev->event_work.func)
 		queue_work(md_misc_wq, &mddev->event_work);
 	md_new_event();
diff --git a/drivers/md/md.h b/drivers/md/md.h
index bc3f2094d0b6..1eb7d0e88cb2 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -234,34 +234,42 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
 				int is_new);
 struct md_cluster_info;
 
-/* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added */
+/**
+ * enum mddev_flags - md device flags.
+ * @MD_ARRAY_FIRST_USE: First use of array, needs initialization.
+ * @MD_CLOSING: If set, we are closing the array, do not open it then.
+ * @MD_JOURNAL_CLEAN: A raid with journal is already clean.
+ * @MD_HAS_JOURNAL: The raid array has journal feature set.
+ * @MD_CLUSTER_RESYNC_LOCKED: cluster raid only, which means node, already took
+ *			       resync lock, need to release the lock.
+ * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
+ *			    calls to md_error() will never cause the array to
+ *			    become failed.
+ * @MD_HAS_PPL:  The raid array has PPL feature set.
+ * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
+ * @MD_ALLOW_SB_UPDATE: md_check_recovery is allowed to update the metadata
+ *			 without taking reconfig_mutex.
+ * @MD_UPDATING_SB: md_check_recovery is updating the metadata without
+ *		     explicitly holding reconfig_mutex.
+ * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
+ *		   array is ready yet.
+ * @MD_BROKEN: This is used to stop writes and mark array as failed.
+ *
+ * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
+ */
 enum mddev_flags {
-	MD_ARRAY_FIRST_USE,	/* First use of array, needs initialization */
-	MD_CLOSING,		/* If set, we are closing the array, do not open
-				 * it then */
-	MD_JOURNAL_CLEAN,	/* A raid with journal is already clean */
-	MD_HAS_JOURNAL,		/* The raid array has journal feature set */
-	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
-				   * already took resync lock, need to
-				   * release the lock */
-	MD_FAILFAST_SUPPORTED,	/* Using MD_FAILFAST on metadata writes is
-				 * supported as calls to md_error() will
-				 * never cause the array to become failed.
-				 */
-	MD_HAS_PPL,		/* The raid array has PPL feature set */
-	MD_HAS_MULTIPLE_PPLS,	/* The raid array has multiple PPLs feature set */
-	MD_ALLOW_SB_UPDATE,	/* md_check_recovery is allowed to update
-				 * the metadata without taking reconfig_mutex.
-				 */
-	MD_UPDATING_SB,		/* md_check_recovery is updating the metadata
-				 * without explicitly holding reconfig_mutex.
-				 */
-	MD_NOT_READY,		/* do_md_run() is active, so 'array_state'
-				 * must not report that array is ready yet
-				 */
-	MD_BROKEN,              /* This is used in RAID-0/LINEAR only, to stop
-				 * I/O in case an array member is gone/failed.
-				 */
+	MD_ARRAY_FIRST_USE,
+	MD_CLOSING,
+	MD_JOURNAL_CLEAN,
+	MD_HAS_JOURNAL,
+	MD_CLUSTER_RESYNC_LOCKED,
+	MD_FAILFAST_SUPPORTED,
+	MD_HAS_PPL,
+	MD_HAS_MULTIPLE_PPLS,
+	MD_ALLOW_SB_UPDATE,
+	MD_UPDATING_SB,
+	MD_NOT_READY,
+	MD_BROKEN,
 };
 
 enum mddev_sb_flags {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7dc8026cf6ee..b222bafa1196 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1615,30 +1615,37 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
 	seq_printf(seq, "]");
 }
 
+/**
+ * raid1_error() - RAID1 error handler.
+ * @mddev: affected md device.
+ * @rdev: member device to remove.
+ *
+ * Error on @rdev is raised and it is needed to be removed from @mddev.
+ * If there are more than one active member, @rdev is always removed.
+ *
+ * If it is the last active member, it depends on &mddev->fail_last_dev:
+ * - if is on @rdev is removed.
+ * - if is off, @rdev is not removed, but recovery from it is disabled (@rdev is
+ *   very likely to fail).
+ * In both cases, &MD_BROKEN will be set in &mddev->flags.
+ */
 static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
 {
 	char b[BDEVNAME_SIZE];
 	struct r1conf *conf = mddev->private;
 	unsigned long flags;
 
-	/*
-	 * If it is not operational, then we have already marked it as dead
-	 * else if it is the last working disks with "fail_last_dev == false",
-	 * ignore the error, let the next level up know.
-	 * else mark the drive as failed
-	 */
 	spin_lock_irqsave(&conf->device_lock, flags);
-	if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
-	    && (conf->raid_disks - mddev->degraded) == 1) {
-		/*
-		 * Don't fail the drive, act as though we were just a
-		 * normal single drive.
-		 * However don't try a recovery from this drive as
-		 * it is very likely to fail.
-		 */
-		conf->recovery_disabled = mddev->recovery_disabled;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		return;
+
+	if (test_bit(In_sync, &rdev->flags) &&
+	    (conf->raid_disks - mddev->degraded) == 1) {
+		set_bit(MD_BROKEN, &mddev->flags);
+
+		if (!mddev->fail_last_dev) {
+			conf->recovery_disabled = mddev->recovery_disabled;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			return;
+		}
 	}
 	set_bit(Blocked, &rdev->flags);
 	if (test_and_clear_bit(In_sync, &rdev->flags))
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index dde98f65bd04..7471e20d7cd6 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1945,26 +1945,37 @@ static int enough(struct r10conf *conf, int ignore)
 		_enough(conf, 1, ignore);
 }
 
+/**
+ * raid10_error() - RAID10 error handler.
+ * @mddev: affected md device.
+ * @rdev: member device to remove.
+ *
+ * Error on @rdev is raised and it is needed to be removed from @mddev.
+ * If there is (excluding @rdev) enough members to operate, @rdev is always
+ * removed.
+ *
+ * Otherwise, it depends on &mddev->fail_last_dev:
+ * - if is on @rdev is removed.
+ * - if is off, @rdev is not removed.
+ *
+ * In both cases, &MD_BROKEN will be set in &mddev->flags.
+ */
 static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
 {
 	char b[BDEVNAME_SIZE];
 	struct r10conf *conf = mddev->private;
 	unsigned long flags;
 
-	/*
-	 * If it is not operational, then we have already marked it as dead
-	 * else if it is the last working disks with "fail_last_dev == false",
-	 * ignore the error, let the next level up know.
-	 * else mark the drive as failed
-	 */
 	spin_lock_irqsave(&conf->device_lock, flags);
+
 	if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
 	    && !enough(conf, rdev->raid_disk)) {
-		/*
-		 * Don't fail the drive, just return an IO error.
-		 */
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		return;
+		set_bit(MD_BROKEN, &mddev->flags);
+
+		if (!mddev->fail_last_dev) {
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+			return;
+		}
 	}
 	if (test_and_clear_bit(In_sync, &rdev->flags))
 		mddev->degraded++;
-- 
2.26.2



* [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-01-27 15:39 [PATCH v3 0/3] Improve failed arrays handling Mariusz Tkaczyk
  2022-01-27 15:39 ` [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear Mariusz Tkaczyk
  2022-01-27 15:39 ` [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
@ 2022-01-27 15:39 ` Mariusz Tkaczyk
  2022-01-31  8:58   ` Xiao Ni
  2022-02-12  1:47   ` Guoqing Jiang
  2022-02-08  7:18 ` [PATCH v3 0/3] Improve failed arrays handling Song Liu
  3 siblings, 2 replies; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-01-27 15:39 UTC (permalink / raw)
  To: song; +Cc: linux-raid

The raid456 module used to allow reaching the failed state. This was changed
by fb73b357fb9 ("raid5: block failing device if raid will be failed").
That fix introduced a bug: now, if raid5 fails during IO, it may result in a
hung task that never completes. The Faulty flag on the device is necessary
to process all requests and is checked many times, mainly in
analyze_stripe().
Allow setting Faulty on a drive again and set MD_BROKEN if the raid is
failed.

As a result, this level is allowed to reach the failed state again, but
communication with userspace (via the -EBUSY status) is preserved.

This restores the possibility to fail the array via the #mdadm --set-faulty
command; it will be fixed by additional verification on the mdadm side.
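
With MD_BROKEN the refactored has_failed() boils down to roughly the
following (sketch; see the raid5.c hunk below):

static bool has_failed(struct r5conf *conf)
{
	int degraded = conf->mddev->degraded;

	if (test_bit(MD_BROKEN, &conf->mddev->flags))
		return true;

	if (conf->mddev->reshape_position != MaxSector)
		degraded = raid5_calc_degraded(conf);

	if (degraded > conf->max_degraded) {
		set_bit(MD_BROKEN, &conf->mddev->flags);
		return true;
	}
	return false;
}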

Reproduction steps:
 mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
 mkfs.xfs /dev/md126 -f
 mount /dev/md126 /mnt/root/

 fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
--bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
--time_based --group_reporting --name=throughput-test-job
--eta-newline=1 &

 echo 1 > /sys/block/nvme2n1/device/device/remove
 echo 1 > /sys/block/nvme1n1/device/device/remove

 [ 1475.787779] Call Trace:
 [ 1475.793111] __schedule+0x2a6/0x700
 [ 1475.799460] schedule+0x38/0xa0
 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
 [ 1475.813856] ? finish_wait+0x80/0x80
 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
 [ 1475.828281] ? finish_wait+0x80/0x80
 [ 1475.834727] ? finish_wait+0x80/0x80
 [ 1475.841127] ? finish_wait+0x80/0x80
 [ 1475.847480] md_handle_request+0x119/0x190
 [ 1475.854390] md_make_request+0x8a/0x190
 [ 1475.861041] generic_make_request+0xcf/0x310
 [ 1475.868145] submit_bio+0x3c/0x160
 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
 [ 1475.889149] iomap_apply+0xff/0x310
 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.909974] iomap_dio_rw+0x2f2/0x490
 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.923680] ? atime_needs_update+0x77/0xe0
 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
 [ 1475.953403] aio_read+0xd5/0x180
 [ 1475.959395] ? _cond_resched+0x15/0x30
 [ 1475.965907] io_submit_one+0x20b/0x3c0
 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
 [ 1475.979335] ? do_io_getevents+0x7c/0xc0
 [ 1475.986009] do_syscall_64+0x5b/0x1a0
 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
 [ 1476.000255] RIP: 0033:0x7f11fc27978d
 [ 1476.006631] Code: Bad RIP value.
 [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.

Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/md/raid1.c |  1 +
 drivers/md/raid5.c | 49 +++++++++++++++++++++++-----------------------
 2 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index b222bafa1196..58c8eddb0f55 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1627,6 +1627,7 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
  * - if is on @rdev is removed.
  * - if is off, @rdev is not removed, but recovery from it is disabled (@rdev is
  *   very likely to fail).
+ *
  * In both cases, &MD_BROKEN will be set in &mddev->flags.
  */
 static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1240a5c16af8..bee953c8007f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -686,17 +686,21 @@ int raid5_calc_degraded(struct r5conf *conf)
 	return degraded;
 }
 
-static int has_failed(struct r5conf *conf)
+static bool has_failed(struct r5conf *conf)
 {
-	int degraded;
+	int degraded = conf->mddev->degraded;
 
-	if (conf->mddev->reshape_position == MaxSector)
-		return conf->mddev->degraded > conf->max_degraded;
+	if (test_bit(MD_BROKEN, &conf->mddev->flags))
+		return true;
 
-	degraded = raid5_calc_degraded(conf);
-	if (degraded > conf->max_degraded)
-		return 1;
-	return 0;
+	if (conf->mddev->reshape_position != MaxSector)
+		degraded = raid5_calc_degraded(conf);
+
+	if (degraded > conf->max_degraded) {
+		set_bit(MD_BROKEN, &conf->mddev->flags);
+		return true;
+	}
+	return false;
 }
 
 struct stripe_head *
@@ -2877,34 +2881,29 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 	unsigned long flags;
 	pr_debug("raid456: error called\n");
 
+	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
+		mdname(mddev), bdevname(rdev->bdev, b));
+
 	spin_lock_irqsave(&conf->device_lock, flags);
+	set_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	mddev->degraded = raid5_calc_degraded(conf);
 
-	if (test_bit(In_sync, &rdev->flags) &&
-	    mddev->degraded == conf->max_degraded) {
-		/*
-		 * Don't allow to achieve failed state
-		 * Don't try to recover this device
-		 */
+	if (has_failed(conf)) {
 		conf->recovery_disabled = mddev->recovery_disabled;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		return;
+		pr_crit("md/raid:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), mddev->degraded, conf->raid_disks);
+	} else {
+		pr_crit("md/raid:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), conf->raid_disks - mddev->degraded);
 	}
 
-	set_bit(Faulty, &rdev->flags);
-	clear_bit(In_sync, &rdev->flags);
-	mddev->degraded = raid5_calc_degraded(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
-		"md/raid:%s: Operation continuing on %d devices.\n",
-		mdname(mddev),
-		bdevname(rdev->bdev, b),
-		mdname(mddev),
-		conf->raid_disks - mddev->degraded);
 	r5c_update_on_rdev_error(mddev, rdev);
 }
 
-- 
2.26.2



* Re: [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10
  2022-01-27 15:39 ` [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
@ 2022-01-31  8:29   ` Xiao Ni
  2022-01-31  9:06     ` Mariusz Tkaczyk
  2022-01-31 12:23     ` Wols Lists
  2022-02-12  1:17   ` Guoqing Jiang
  1 sibling, 2 replies; 29+ messages in thread
From: Xiao Ni @ 2022-01-31  8:29 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: Song Liu, linux-raid

On Thu, Jan 27, 2022 at 11:39 PM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> There is no direct mechanism to determine raid failure outside
> personality. It is done by checking rdev->flags after executing
> md_error(). If "faulty" flag is not set then -EBUSY is returned to
> userspace. -EBUSY means that array will be failed after drive removal.
>
> Mdadm has special routine to handle the array failure and it is executed
> if -EBUSY is returned by md.
>
> There are at least two known reasons to not consider this mechanism
> as correct:
> 1. drive can be removed even if array will be failed[1].
> 2. -EBUSY seems to be wrong status. Array is not busy, but removal
>    process cannot proceed safe.
>
> -EBUSY expectation cannot be removed without breaking compatibility
> with userspace. In this patch first issue is resolved by adding support
> for MD_BROKEN flag for RAID1 and RAID10. Support for RAID456 is added in
> next commit.
>
> The idea is to set the MD_BROKEN if we are sure that raid is in failed
> state now. This is done in each error_handler(). In md_error() MD_BROKEN
> flag is checked. If is set, then -EBUSY is returned to userspace.
>
> As in previous commit, it causes that #mdadm --set-faulty is able to
> fail array. Previously proposed workaround is valid if optional
> functionality[1] is disabled.
>
> [1] commit 9a567843f7ce("md: allow last device to be forcibly removed from
>     RAID1/RAID10.")
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> ---
>  drivers/md/md.c     | 17 ++++++++-----
>  drivers/md/md.h     | 62 +++++++++++++++++++++++++--------------------
>  drivers/md/raid1.c  | 41 +++++++++++++++++-------------
>  drivers/md/raid10.c | 33 ++++++++++++++++--------
>  4 files changed, 91 insertions(+), 62 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f888ef197765..fda8473f96b8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2983,10 +2983,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>
>         if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>                 md_error(rdev->mddev, rdev);
> -               if (test_bit(Faulty, &rdev->flags))
> -                       err = 0;
> -               else
> +
> +               if (test_bit(MD_BROKEN, &rdev->mddev->flags))
>                         err = -EBUSY;
> +               else
> +                       err = 0;
>         } else if (cmd_match(buf, "remove")) {
>                 if (rdev->mddev->pers) {
>                         clear_bit(Blocked, &rdev->flags);
> @@ -7441,7 +7442,7 @@ static int set_disk_faulty(struct mddev *mddev, dev_t dev)
>                 err =  -ENODEV;
>         else {
>                 md_error(mddev, rdev);
> -               if (!test_bit(Faulty, &rdev->flags))
> +               if (test_bit(MD_BROKEN, &mddev->flags))
>                         err = -EBUSY;
>         }
>         rcu_read_unlock();
> @@ -7987,12 +7988,14 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>         if (!mddev->pers->sync_request)
>                 return;
>
> -       if (mddev->degraded)
> +       if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags))
>                 set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>         sysfs_notify_dirent_safe(rdev->sysfs_state);
>         set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -       set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> -       md_wakeup_thread(mddev->thread);
> +       if (!test_bit(MD_BROKEN, &mddev->flags)) {
> +               set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +               md_wakeup_thread(mddev->thread);
> +       }
>         if (mddev->event_work.func)
>                 queue_work(md_misc_wq, &mddev->event_work);
>         md_new_event();
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index bc3f2094d0b6..1eb7d0e88cb2 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -234,34 +234,42 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
>                                 int is_new);
>  struct md_cluster_info;
>
> -/* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added */
> +/**
> + * enum mddev_flags - md device flags.
> + * @MD_ARRAY_FIRST_USE: First use of array, needs initialization.
> + * @MD_CLOSING: If set, we are closing the array, do not open it then.
> + * @MD_JOURNAL_CLEAN: A raid with journal is already clean.
> + * @MD_HAS_JOURNAL: The raid array has journal feature set.
> + * @MD_CLUSTER_RESYNC_LOCKED: cluster raid only, which means node, already took
> + *                            resync lock, need to release the lock.
> + * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
> + *                         calls to md_error() will never cause the array to
> + *                         become failed.
> + * @MD_HAS_PPL:  The raid array has PPL feature set.
> + * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
> + * @MD_ALLOW_SB_UPDATE: md_check_recovery is allowed to update the metadata
> + *                      without taking reconfig_mutex.
> + * @MD_UPDATING_SB: md_check_recovery is updating the metadata without
> + *                  explicitly holding reconfig_mutex.
> + * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
> + *                array is ready yet.
> + * @MD_BROKEN: This is used to stop writes and mark array as failed.
> + *
> + * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
> + */
>  enum mddev_flags {
> -       MD_ARRAY_FIRST_USE,     /* First use of array, needs initialization */
> -       MD_CLOSING,             /* If set, we are closing the array, do not open
> -                                * it then */
> -       MD_JOURNAL_CLEAN,       /* A raid with journal is already clean */
> -       MD_HAS_JOURNAL,         /* The raid array has journal feature set */
> -       MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
> -                                  * already took resync lock, need to
> -                                  * release the lock */
> -       MD_FAILFAST_SUPPORTED,  /* Using MD_FAILFAST on metadata writes is
> -                                * supported as calls to md_error() will
> -                                * never cause the array to become failed.
> -                                */
> -       MD_HAS_PPL,             /* The raid array has PPL feature set */
> -       MD_HAS_MULTIPLE_PPLS,   /* The raid array has multiple PPLs feature set */
> -       MD_ALLOW_SB_UPDATE,     /* md_check_recovery is allowed to update
> -                                * the metadata without taking reconfig_mutex.
> -                                */
> -       MD_UPDATING_SB,         /* md_check_recovery is updating the metadata
> -                                * without explicitly holding reconfig_mutex.
> -                                */
> -       MD_NOT_READY,           /* do_md_run() is active, so 'array_state'
> -                                * must not report that array is ready yet
> -                                */
> -       MD_BROKEN,              /* This is used in RAID-0/LINEAR only, to stop
> -                                * I/O in case an array member is gone/failed.
> -                                */
> +       MD_ARRAY_FIRST_USE,
> +       MD_CLOSING,
> +       MD_JOURNAL_CLEAN,
> +       MD_HAS_JOURNAL,
> +       MD_CLUSTER_RESYNC_LOCKED,
> +       MD_FAILFAST_SUPPORTED,
> +       MD_HAS_PPL,
> +       MD_HAS_MULTIPLE_PPLS,
> +       MD_ALLOW_SB_UPDATE,
> +       MD_UPDATING_SB,
> +       MD_NOT_READY,
> +       MD_BROKEN,
>  };
>
>  enum mddev_sb_flags {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7dc8026cf6ee..b222bafa1196 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1615,30 +1615,37 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>         seq_printf(seq, "]");
>  }
>
> +/**
> + * raid1_error() - RAID1 error handler.
> + * @mddev: affected md device.
> + * @rdev: member device to remove.
> + *
> + * Error on @rdev is raised and it is needed to be removed from @mddev.
> + * If there are more than one active member, @rdev is always removed.
> + *
> + * If it is the last active member, it depends on &mddev->fail_last_dev:
> + * - if is on @rdev is removed.
> + * - if is off, @rdev is not removed, but recovery from it is disabled (@rdev is
> + *   very likely to fail).
> + * In both cases, &MD_BROKEN will be set in &mddev->flags.
> + */
>  static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>  {
>         char b[BDEVNAME_SIZE];
>         struct r1conf *conf = mddev->private;
>         unsigned long flags;
>
> -       /*
> -        * If it is not operational, then we have already marked it as dead
> -        * else if it is the last working disks with "fail_last_dev == false",
> -        * ignore the error, let the next level up know.
> -        * else mark the drive as failed
> -        */
>         spin_lock_irqsave(&conf->device_lock, flags);
> -       if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
> -           && (conf->raid_disks - mddev->degraded) == 1) {
> -               /*
> -                * Don't fail the drive, act as though we were just a
> -                * normal single drive.
> -                * However don't try a recovery from this drive as
> -                * it is very likely to fail.
> -                */

Hi Mariusz

From my point of view, it's better to keep those comments even though you add
some comments before the function. They are helpful for understanding it.

> -               conf->recovery_disabled = mddev->recovery_disabled;
> -               spin_unlock_irqrestore(&conf->device_lock, flags);
> -               return;
> +
> +       if (test_bit(In_sync, &rdev->flags) &&
> +           (conf->raid_disks - mddev->degraded) == 1) {
> +               set_bit(MD_BROKEN, &mddev->flags);
> +
> +               if (!mddev->fail_last_dev) {
> +                       conf->recovery_disabled = mddev->recovery_disabled;
> +                       spin_unlock_irqrestore(&conf->device_lock, flags);
> +                       return;
> +               }
>         }
>         set_bit(Blocked, &rdev->flags);
>         if (test_and_clear_bit(In_sync, &rdev->flags))
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index dde98f65bd04..7471e20d7cd6 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1945,26 +1945,37 @@ static int enough(struct r10conf *conf, int ignore)
>                 _enough(conf, 1, ignore);
>  }
>
> +/**
> + * raid10_error() - RAID10 error handler.
> + * @mddev: affected md device.
> + * @rdev: member device to remove.
> + *
> + * Error on @rdev is raised and it is needed to be removed from @mddev.
> + * If there is (excluding @rdev) enough members to operate, @rdev is always
s/is/are/g

> + * removed.
> + *
> + * Otherwise, it depends on &mddev->fail_last_dev:
> + * - if is on @rdev is removed.
> + * - if is off, @rdev is not removed.
> + *
> + * In both cases, &MD_BROKEN will be set in &mddev->flags.
> + */
>  static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>  {
>         char b[BDEVNAME_SIZE];
>         struct r10conf *conf = mddev->private;
>         unsigned long flags;
>
> -       /*
> -        * If it is not operational, then we have already marked it as dead
> -        * else if it is the last working disks with "fail_last_dev == false",
> -        * ignore the error, let the next level up know.
> -        * else mark the drive as failed
> -        */
>         spin_lock_irqsave(&conf->device_lock, flags);
> +
>         if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
>             && !enough(conf, rdev->raid_disk)) {

The check of mddev->fail_last_dev should be removed here.

> -               /*
> -                * Don't fail the drive, just return an IO error.
> -                */

It's the same here. These comments give readers a direct note: raid10 will
return the bio here with an error. Is it better to keep them here?

Best Regards
Xiao

> -               spin_unlock_irqrestore(&conf->device_lock, flags);
> -               return;
> +               set_bit(MD_BROKEN, &mddev->flags);
> +
> +               if (!mddev->fail_last_dev) {
> +                       spin_unlock_irqrestore(&conf->device_lock, flags);
> +                       return;
> +               }
>         }
>         if (test_and_clear_bit(In_sync, &rdev->flags))
>                 mddev->degraded++;
> --
> 2.26.2
>



* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-01-27 15:39 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
@ 2022-01-31  8:58   ` Xiao Ni
  2022-02-12  1:47   ` Guoqing Jiang
  1 sibling, 0 replies; 29+ messages in thread
From: Xiao Ni @ 2022-01-31  8:58 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: Song Liu, linux-raid

On Thu, Jan 27, 2022 at 11:39 PM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> Raid456 module had allowed to achieve failed state. It was fixed by
> fb73b357fb9 ("raid5: block failing device if raid will be failed").
> This fix introduces a bug, now if raid5 fails during IO, it may result
> with a hung task without completion. Faulty flag on the device is
> necessary to process all requests and is checked many times, mainly in
> analyze_stripe().
> Allow to set faulty on drive again and set MD_BROKEN if raid is failed.
>
> As a result, this level is allowed to achieve failed state again, but
> communication with userspace (via -EBUSY status) will be preserved.
>
> This restores possibility to fail array via #mdadm --set-faulty command
> and will be fixed by additional verification on mdadm side.
>
> Reproduction steps:
>  mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
>  mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
>  mkfs.xfs /dev/md126 -f
>  mount /dev/md126 /mnt/root/
>
>  fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
> --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
> --time_based --group_reporting --name=throughput-test-job
> --eta-newline=1 &
>
>  echo 1 > /sys/block/nvme2n1/device/device/remove
>  echo 1 > /sys/block/nvme1n1/device/device/remove
>
>  [ 1475.787779] Call Trace:
>  [ 1475.793111] __schedule+0x2a6/0x700
>  [ 1475.799460] schedule+0x38/0xa0
>  [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
>  [ 1475.813856] ? finish_wait+0x80/0x80
>  [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
>  [ 1475.828281] ? finish_wait+0x80/0x80
>  [ 1475.834727] ? finish_wait+0x80/0x80
>  [ 1475.841127] ? finish_wait+0x80/0x80
>  [ 1475.847480] md_handle_request+0x119/0x190
>  [ 1475.854390] md_make_request+0x8a/0x190
>  [ 1475.861041] generic_make_request+0xcf/0x310
>  [ 1475.868145] submit_bio+0x3c/0x160
>  [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
>  [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
>  [ 1475.889149] iomap_apply+0xff/0x310
>  [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
>  [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
>  [ 1475.909974] iomap_dio_rw+0x2f2/0x490
>  [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
>  [ 1475.923680] ? atime_needs_update+0x77/0xe0
>  [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
>  [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
>  [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
>  [ 1475.953403] aio_read+0xd5/0x180
>  [ 1475.959395] ? _cond_resched+0x15/0x30
>  [ 1475.965907] io_submit_one+0x20b/0x3c0
>  [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
>  [ 1475.979335] ? do_io_getevents+0x7c/0xc0
>  [ 1475.986009] do_syscall_64+0x5b/0x1a0
>  [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
>  [ 1476.000255] RIP: 0033:0x7f11fc27978d
>  [ 1476.006631] Code: Bad RIP value.
>  [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
>
> Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> ---
>  drivers/md/raid1.c |  1 +
>  drivers/md/raid5.c | 49 +++++++++++++++++++++++-----------------------
>  2 files changed, 25 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index b222bafa1196..58c8eddb0f55 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1627,6 +1627,7 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>   * - if is on @rdev is removed.
>   * - if is off, @rdev is not removed, but recovery from it is disabled (@rdev is
>   *   very likely to fail).
> + *

Typo? If not, it would be better placed in the second patch.

Regards
Xiao

>   * In both cases, &MD_BROKEN will be set in &mddev->flags.
>   */
>  static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 1240a5c16af8..bee953c8007f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -686,17 +686,21 @@ int raid5_calc_degraded(struct r5conf *conf)
>         return degraded;
>  }
>
> -static int has_failed(struct r5conf *conf)
> +static bool has_failed(struct r5conf *conf)
>  {
> -       int degraded;
> +       int degraded = conf->mddev->degraded;
>
> -       if (conf->mddev->reshape_position == MaxSector)
> -               return conf->mddev->degraded > conf->max_degraded;
> +       if (test_bit(MD_BROKEN, &conf->mddev->flags))
> +               return true;
>
> -       degraded = raid5_calc_degraded(conf);
> -       if (degraded > conf->max_degraded)
> -               return 1;
> -       return 0;
> +       if (conf->mddev->reshape_position != MaxSector)
> +               degraded = raid5_calc_degraded(conf);
> +
> +       if (degraded > conf->max_degraded) {
> +               set_bit(MD_BROKEN, &conf->mddev->flags);
> +               return true;
> +       }
> +       return false;
>  }
>
>  struct stripe_head *
> @@ -2877,34 +2881,29 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>         unsigned long flags;
>         pr_debug("raid456: error called\n");
>
> +       pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
> +               mdname(mddev), bdevname(rdev->bdev, b));
> +
>         spin_lock_irqsave(&conf->device_lock, flags);
> +       set_bit(Faulty, &rdev->flags);
> +       clear_bit(In_sync, &rdev->flags);
> +       mddev->degraded = raid5_calc_degraded(conf);
>
> -       if (test_bit(In_sync, &rdev->flags) &&
> -           mddev->degraded == conf->max_degraded) {
> -               /*
> -                * Don't allow to achieve failed state
> -                * Don't try to recover this device
> -                */
> +       if (has_failed(conf)) {
>                 conf->recovery_disabled = mddev->recovery_disabled;
> -               spin_unlock_irqrestore(&conf->device_lock, flags);
> -               return;
> +               pr_crit("md/raid:%s: Cannot continue operation (%d/%d failed).\n",
> +                       mdname(mddev), mddev->degraded, conf->raid_disks);
> +       } else {
> +               pr_crit("md/raid:%s: Operation continuing on %d devices.\n",
> +                       mdname(mddev), conf->raid_disks - mddev->degraded);
>         }
>
> -       set_bit(Faulty, &rdev->flags);
> -       clear_bit(In_sync, &rdev->flags);
> -       mddev->degraded = raid5_calc_degraded(conf);
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>         set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>
>         set_bit(Blocked, &rdev->flags);
>         set_mask_bits(&mddev->sb_flags, 0,
>                       BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
> -       pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
> -               "md/raid:%s: Operation continuing on %d devices.\n",
> -               mdname(mddev),
> -               bdevname(rdev->bdev, b),
> -               mdname(mddev),
> -               conf->raid_disks - mddev->degraded);
>         r5c_update_on_rdev_error(mddev, rdev);
>  }
>
> --
> 2.26.2
>



* Re: [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10
  2022-01-31  8:29   ` Xiao Ni
@ 2022-01-31  9:06     ` Mariusz Tkaczyk
  2022-02-08  7:13       ` Song Liu
  2022-01-31 12:23     ` Wols Lists
  1 sibling, 1 reply; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-01-31  9:06 UTC (permalink / raw)
  To: Xiao Ni; +Cc: Song Liu, linux-raid

Hi Xiao,
Thanks for the review.

On Mon, 31 Jan 2022 16:29:27 +0800
Xiao Ni <xni@redhat.com> wrote:

> > +
> >         if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
> >             && !enough(conf, rdev->raid_disk)) {
> 
> The check of mddev->fail_last_dev should be removed here.

Ohh, my bad. I mainly tested it on RAID1 and didn't notice it.
Thanks!

> 
> > -               /*
> > -                * Don't fail the drive, just return an IO error.
> > -                */
> 
> It's the same. These comments can directly give people notes. raid10
> will return bio here with an error. Is it better to keep them here?

Sure, let's wait for Song's opinion first and then I will send v4.

Mariusz





* Re: [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10
  2022-01-31  8:29   ` Xiao Ni
  2022-01-31  9:06     ` Mariusz Tkaczyk
@ 2022-01-31 12:23     ` Wols Lists
  1 sibling, 0 replies; 29+ messages in thread
From: Wols Lists @ 2022-01-31 12:23 UTC (permalink / raw)
  To: Xiao Ni, Mariusz Tkaczyk; +Cc: Song Liu, linux-raid

On 31/01/2022 08:29, Xiao Ni wrote:
>> + * Error on @rdev is raised and it is needed to be removed from @mddev.
>> + * If there is (excluding @rdev) enough members to operate, @rdev is always
> s/is/are/g
> 

If there *are* (excluding @rdev) enough members to operate, @rdev *is* always

Cheers,
Wol


* Re: [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10
  2022-01-31  9:06     ` Mariusz Tkaczyk
@ 2022-02-08  7:13       ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-02-08  7:13 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: Xiao Ni, linux-raid

On Mon, Jan 31, 2022 at 1:06 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> Hi Xiao,
> Thanks for review.
>
> On Mon, 31 Jan 2022 16:29:27 +0800
> Xiao Ni <xni@redhat.com> wrote:
>
> > > +
> > >         if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
> > >             && !enough(conf, rdev->raid_disk)) {
> >
> > The check of mddev->fail_last_dev should be removed here.

I folded this change in.

>
> Ohh, my bad. I mainly tested it on RAID1 and didn't notice it.
> Thanks!
>
> >
> > > -               /*
> > > -                * Don't fail the drive, just return an IO error.
> > > -                */
> >
> > It's the same. These comments can directly give people notes. raid10
> > will return bio here with an error. Is it better to keep them here?
>
> Sure, let wait for Song opinion first and then I will send v4.

I think the current comment (before the function) is good, so this is no
longer needed.

Thanks,
Song


* Re: [PATCH v3 0/3] Improve failed arrays handling.
  2022-01-27 15:39 [PATCH v3 0/3] Improve failed arrays handling Mariusz Tkaczyk
                   ` (2 preceding siblings ...)
  2022-01-27 15:39 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
@ 2022-02-08  7:18 ` Song Liu
  3 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-02-08  7:18 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: linux-raid

On Thu, Jan 27, 2022 at 7:39 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> Hi Song,
> I've made changes as discussed in v2[1]. I did manual testing with IMSM
> metadata.
>
> Patch 1:
> - "%pg" propagated and raid0/linear_error refactored as Suggested by Guoqing.
> - missed dm-event, suggested by Guoqing- verified. IMO the behavior is same as
>   before.
>
> Patch 2:
> - Commit id fixed, suggested by Gouqing.
> - Description rework, suggested by Xiao (discussed offline).
> - fail_last_dev handling added (and verified).
> - MD_BROKEN description modified, suggested by Gouqing.
> - Descriptions for raid1_error and raid10_error are added, redundant comments
>   are removed.
>
> Patch3:
> - Error message for failed array changed, suggested by you.
> - MD_BROKEN setter moved to has_failed(), suggested by Gouqing.
> - has_failed() refactored
>
> Other notes:
> I followed kernel-doc style guidelines when editing or adding new descriptions.
> Please let me know if you consider it as unnecessary and messy.
>
> I also noticed potential issue during refactor related to MD_FAILFAST_SUPPORTED,
> please see the flag definition. I'm wondering if fail_last_dev functionality is
> not against failfast. Should I start separate thread for that?
>
> [1] https://lore.kernel.org/linux-raid/CAPhsuW43QfDNGEu72o2_eqDZ5vGq3tbFvdZ-W+dxVqcEhHmJ5w@mail.gmail.com/T/#t
>
> Mariusz Tkaczyk (3):
>   raid0, linear, md: add error_handlers for raid0 and linear
>   md: Set MD_BROKEN for RAID1 and RAID10
>   raid5: introduce MD_BROKEN

With some changes (mostly from feedback), applied to md-next.

Thanks,
Song


* Re: [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-01-27 15:39 ` [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear Mariusz Tkaczyk
@ 2022-02-12  1:12   ` Guoqing Jiang
  2022-02-14  9:37     ` Mariusz Tkaczyk
  0 siblings, 1 reply; 29+ messages in thread
From: Guoqing Jiang @ 2022-02-12  1:12 UTC (permalink / raw)
  To: Mariusz Tkaczyk, song; +Cc: linux-raid



On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
> Patch 62f7b1989c0 ("md raid0/linear: Mark array as 'broken' and fail BIOs
> if a member is gone") allowed to finish writes earlier (before level
> dependent actions) for non-redundant arrays.
>
> To achieve that MD_BROKEN is added to mddev->flags if drive disappearance
> is detected. This is done in is_mddev_broken() which is confusing and not
> consistent with other levels where error_handler() is used.
> This patch adds appropriate error_handler for raid0 and linear.

I think their purposes are quite different. As said before, error_handler
mostly acts on the rdev, while is_mddev_broken is for the mddev, though it
needs to test the rdev first.

> It also adopts md_error(), we only want to call .error_handler for those
> levels. mddev->pers->sync_request is additionally checked, its existence
> implies a level with redundancy.
>
> Usage of error_handler causes that disk failure can be requested from
> userspace. User can fail the array via #mdadm --set-faulty command. This
> is not safe and will be fixed in mdadm.

What is the safety issue here? It would be better to post the mdadm fix
together.

> It is correctable because failed
> state is not recorded in the metadata. After next assembly array will be
> read-write again.

I don't think it is a problem. Care to explain why it can't be RW again?

> For safety reason is better to keep MD_BROKEN in runtime only.

Isn't MD_BROKEN runtime-only already? It is in mddev_flags, not
mddev_sb_flags.

> Signed-off-by: Mariusz Tkaczyk<mariusz.tkaczyk@linux.intel.com>
> ---
>   drivers/md/md-linear.c | 15 ++++++++++++++-
>   drivers/md/md.c        |  6 +++++-
>   drivers/md/md.h        | 10 ++--------
>   drivers/md/raid0.c     | 15 ++++++++++++++-
>   4 files changed, 35 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
> index 1ff51647a682..3c368e3e4641 100644
> --- a/drivers/md/md-linear.c
> +++ b/drivers/md/md-linear.c
> @@ -233,7 +233,8 @@ static bool linear_make_request(struct mddev *mddev, struct bio *bio)
>   		     bio_sector < start_sector))
>   		goto out_of_bounds;
>   
> -	if (unlikely(is_mddev_broken(tmp_dev->rdev, "linear"))) {
> +	if (unlikely(is_rdev_broken(tmp_dev->rdev))) {
> +		md_error(mddev, tmp_dev->rdev);

[ ... ]

>   
> +static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
> +{
> +	if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags)) {

s/rdev->mddev/mddev/

> +		char *md_name = mdname(mddev);
> +
> +		pr_crit("md/linear%s: Disk failure on %pg detected.\n"
> +			"md/linear:%s: Cannot continue, failing array.\n",
> +			md_name, rdev->bdev, md_name);

The second md_name is not needed.

> +	}
> +}
> +
>   static void linear_quiesce(struct mddev *mddev, int state)
>   {
>   }
> @@ -297,6 +309,7 @@ static struct md_personality linear_personality =
>   	.hot_add_disk	= linear_add,
>   	.size		= linear_size,
>   	.quiesce	= linear_quiesce,
> +	.error_handler	= linear_error,
>   };
>   
>   static int __init linear_init (void)
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e8666bdc0d28..f888ef197765 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7982,7 +7982,11 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>   
>   	if (!mddev->pers || !mddev->pers->error_handler)
>   		return;
> -	mddev->pers->error_handler(mddev,rdev);
> +	mddev->pers->error_handler(mddev, rdev);
> +
> +	if (!mddev->pers->sync_request)
> +		return;

The above is only valid for raid0 and linear. I guess it is fine as long as
DM doesn't create an LV on top of them, but the new check deserves a comment
above it.

> +
>   	if (mddev->degraded)
>   		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>   	sysfs_notify_dirent_safe(rdev->sysfs_state);

[ ... ]

> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
> +{
> +	if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags)) {
> +		char *md_name = mdname(mddev);
> +
> +		pr_crit("md/raid0%s: Disk failure on %pg detected.\n"
> +			"md/raid0:%s: Cannot continue, failing array.\n",
> +			md_name, rdev->bdev, md_name);

The comments for linear_error are also valid here.

Thanks,
Guoqing


* Re: [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10
  2022-01-27 15:39 ` [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
  2022-01-31  8:29   ` Xiao Ni
@ 2022-02-12  1:17   ` Guoqing Jiang
  2022-02-14  8:55     ` Mariusz Tkaczyk
  1 sibling, 1 reply; 29+ messages in thread
From: Guoqing Jiang @ 2022-02-12  1:17 UTC (permalink / raw)
  To: Mariusz Tkaczyk, song; +Cc: linux-raid



On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f888ef197765..fda8473f96b8 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2983,10 +2983,11 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>   
>   	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
>   		md_error(rdev->mddev, rdev);
> -		if (test_bit(Faulty, &rdev->flags))
> -			err = 0;
> -		else
> +
> +		if (test_bit(MD_BROKEN, &rdev->mddev->flags))
>   			err = -EBUSY;
> +		else
> +			err = 0;

Again, from my understanding it is weird to check MD_BROKEN, which is for the
mddev, in an rdev-specific function.

>   	} else if (cmd_match(buf, "remove")) {
>   		if (rdev->mddev->pers) {
>   			clear_bit(Blocked, &rdev->flags);
> @@ -7441,7 +7442,7 @@ static int set_disk_faulty(struct mddev *mddev, dev_t dev)
>   		err =  -ENODEV;
>   	else {
>   		md_error(mddev, rdev);
> -		if (!test_bit(Faulty, &rdev->flags))
> +		if (test_bit(MD_BROKEN, &mddev->flags))
>   			err = -EBUSY;

Ditto.

>   	}
>   	rcu_read_unlock();
> @@ -7987,12 +7988,14 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>   	if (!mddev->pers->sync_request)
>   		return;
>   
> -	if (mddev->degraded)
> +	if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags))
>   		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>   	sysfs_notify_dirent_safe(rdev->sysfs_state);
>   	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> -	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> -	md_wakeup_thread(mddev->thread);
> +	if (!test_bit(MD_BROKEN, &mddev->flags)) {
> +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> +		md_wakeup_thread(mddev->thread);
> +	}
>   	if (mddev->event_work.func)
>   		queue_work(md_misc_wq, &mddev->event_work);
>   	md_new_event();
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index bc3f2094d0b6..1eb7d0e88cb2 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -234,34 +234,42 @@ extern int rdev_clear_badblocks(struct md_rdev *rdev, sector_t s, int sectors,
>   				int is_new);
>   struct md_cluster_info;
>   
> -/* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added */
> +/**
> + * enum mddev_flags - md device flags.
> + * @MD_ARRAY_FIRST_USE: First use of array, needs initialization.
> + * @MD_CLOSING: If set, we are closing the array, do not open it then.
> + * @MD_JOURNAL_CLEAN: A raid with journal is already clean.
> + * @MD_HAS_JOURNAL: The raid array has journal feature set.
> + * @MD_CLUSTER_RESYNC_LOCKED: cluster raid only, which means node, already took
> + *			       resync lock, need to release the lock.
> + * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
> + *			    calls to md_error() will never cause the array to
> + *			    become failed.
> + * @MD_HAS_PPL:  The raid array has PPL feature set.
> + * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
> + * @MD_ALLOW_SB_UPDATE: md_check_recovery is allowed to update the metadata
> + *			 without taking reconfig_mutex.
> + * @MD_UPDATING_SB: md_check_recovery is updating the metadata without
> + *		     explicitly holding reconfig_mutex.
> + * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
> + *		   array is ready yet.
> + * @MD_BROKEN: This is used to stop writes and mark array as failed.
> + *
> + * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag is added
> + */
>   enum mddev_flags {
> -	MD_ARRAY_FIRST_USE,	/* First use of array, needs initialization */
> -	MD_CLOSING,		/* If set, we are closing the array, do not open
> -				 * it then */
> -	MD_JOURNAL_CLEAN,	/* A raid with journal is already clean */
> -	MD_HAS_JOURNAL,		/* The raid array has journal feature set */
> -	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which means node
> -				   * already took resync lock, need to
> -				   * release the lock */
> -	MD_FAILFAST_SUPPORTED,	/* Using MD_FAILFAST on metadata writes is
> -				 * supported as calls to md_error() will
> -				 * never cause the array to become failed.
> -				 */
> -	MD_HAS_PPL,		/* The raid array has PPL feature set */
> -	MD_HAS_MULTIPLE_PPLS,	/* The raid array has multiple PPLs feature set */
> -	MD_ALLOW_SB_UPDATE,	/* md_check_recovery is allowed to update
> -				 * the metadata without taking reconfig_mutex.
> -				 */
> -	MD_UPDATING_SB,		/* md_check_recovery is updating the metadata
> -				 * without explicitly holding reconfig_mutex.
> -				 */
> -	MD_NOT_READY,		/* do_md_run() is active, so 'array_state'
> -				 * must not report that array is ready yet
> -				 */
> -	MD_BROKEN,              /* This is used in RAID-0/LINEAR only, to stop
> -				 * I/O in case an array member is gone/failed.
> -				 */

With this change people have to scroll up to see the meaning of each flag. I
would suggest moving these comments on top of each flag if you really dislike
the previous style, but that is just my personal preference.
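For example, kernel-doc also accepts inline member comments, roughly like
this (illustrative only, showing just two of the flags):

	enum mddev_flags {
		/**
		 * @MD_ARRAY_FIRST_USE: First use of array, needs initialization.
		 */
		MD_ARRAY_FIRST_USE,
		/**
		 * @MD_BROKEN: This is used to stop writes and mark array as failed.
		 */
		MD_BROKEN,
	};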

> +	MD_ARRAY_FIRST_USE,
> +	MD_CLOSING,
> +	MD_JOURNAL_CLEAN,
> +	MD_HAS_JOURNAL,
> +	MD_CLUSTER_RESYNC_LOCKED,
> +	MD_FAILFAST_SUPPORTED,
> +	MD_HAS_PPL,
> +	MD_HAS_MULTIPLE_PPLS,
> +	MD_ALLOW_SB_UPDATE,
> +	MD_UPDATING_SB,
> +	MD_NOT_READY,
> +	MD_BROKEN,
>   };
>   
>   enum mddev_sb_flags {
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 7dc8026cf6ee..b222bafa1196 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1615,30 +1615,37 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>   	seq_printf(seq, "]");
>   }
>   
> +/**
> + * raid1_error() - RAID1 error handler.
> + * @mddev: affected md device.
> + * @rdev: member device to remove.

It is confusing; the rdev is removed only in raid1_remove_disk().

> + *
> + * Error on @rdev is raised and it is needed to be removed from @mddev.
> + * If there are more than one active member, @rdev is always removed.
> + *
> + * If it is the last active member, it depends on &mddev->fail_last_dev:
> + * - if is on @rdev is removed.
> + * - if is off, @rdev is not removed, but recovery from it is disabled (@rdev is
> + *   very likely to fail).
> + * In both cases, &MD_BROKEN will be set in &mddev->flags.
> + */
>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>   {
>   	char b[BDEVNAME_SIZE];
>   	struct r1conf *conf = mddev->private;
>   	unsigned long flags;
>   
> -	/*
> -	 * If it is not operational, then we have already marked it as dead
> -	 * else if it is the last working disks with "fail_last_dev == false",
> -	 * ignore the error, let the next level up know.
> -	 * else mark the drive as failed
> -	 */
>   	spin_lock_irqsave(&conf->device_lock, flags);
> -	if (test_bit(In_sync, &rdev->flags) && !mddev->fail_last_dev
> -	    && (conf->raid_disks - mddev->degraded) == 1) {
> -		/*
> -		 * Don't fail the drive, act as though we were just a
> -		 * normal single drive.
> -		 * However don't try a recovery from this drive as
> -		 * it is very likely to fail.
> -		 */
> -		conf->recovery_disabled = mddev->recovery_disabled;
> -		spin_unlock_irqrestore(&conf->device_lock, flags);
> -		return;
> +
> +	if (test_bit(In_sync, &rdev->flags) &&
> +	    (conf->raid_disks - mddev->degraded) == 1) {
> +		set_bit(MD_BROKEN, &mddev->flags);
> +
> +		if (!mddev->fail_last_dev) {
> +			conf->recovery_disabled = mddev->recovery_disabled;
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			return;
> +		}
>   	}
>   	set_bit(Blocked, &rdev->flags);
>   	if (test_and_clear_bit(In_sync, &rdev->flags))
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index dde98f65bd04..7471e20d7cd6 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1945,26 +1945,37 @@ static int enough(struct r10conf *conf, int ignore)
>   		_enough(conf, 1, ignore);
>   }
>   
> +/**
> + * raid10_error() - RAID10 error handler.
> + * @mddev: affected md device.
> + * @rdev: member device to remove.

Ditto.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-01-27 15:39 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
  2022-01-31  8:58   ` Xiao Ni
@ 2022-02-12  1:47   ` Guoqing Jiang
  2022-02-22 14:18     ` Mariusz Tkaczyk
  1 sibling, 1 reply; 29+ messages in thread
From: Guoqing Jiang @ 2022-02-12  1:47 UTC (permalink / raw)
  To: Mariusz Tkaczyk, song; +Cc: linux-raid



On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
> Raid456 module had allowed to achieve failed state. It was fixed by
> fb73b357fb9 ("raid5: block failing device if raid will be failed").
> This fix introduces a bug, now if raid5 fails during IO, it may result
> with a hung task without completion. Faulty flag on the device is
> necessary to process all requests and is checked many times, mainly in
> analyze_stripe().
> Allow to set faulty on drive again and set MD_BROKEN if raid is failed.
>
> As a result, this level is allowed to achieve failed state again, but
> communication with userspace (via -EBUSY status) will be preserved.
>
> This restores possibility to fail array via #mdadm --set-faulty command
> and will be fixed by additional verification on mdadm side.

Again, you had better send the mdadm change along with the series.

> Reproduction steps:
>   mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
>   mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
>   mkfs.xfs /dev/md126 -f
>   mount /dev/md126 /mnt/root/
>
>   fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
> --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
> --time_based --group_reporting --name=throughput-test-job
> --eta-newline=1 &
>
>   echo 1 > /sys/block/nvme2n1/device/device/remove
>   echo 1 > /sys/block/nvme1n1/device/device/remove
>
>   [ 1475.787779] Call Trace:
>   [ 1475.793111] __schedule+0x2a6/0x700
>   [ 1475.799460] schedule+0x38/0xa0
>   [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
>   [ 1475.813856] ? finish_wait+0x80/0x80
>   [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
>   [ 1475.828281] ? finish_wait+0x80/0x80
>   [ 1475.834727] ? finish_wait+0x80/0x80
>   [ 1475.841127] ? finish_wait+0x80/0x80
>   [ 1475.847480] md_handle_request+0x119/0x190
>   [ 1475.854390] md_make_request+0x8a/0x190
>   [ 1475.861041] generic_make_request+0xcf/0x310
>   [ 1475.868145] submit_bio+0x3c/0x160
>   [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
>   [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
>   [ 1475.889149] iomap_apply+0xff/0x310
>   [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
>   [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
>   [ 1475.909974] iomap_dio_rw+0x2f2/0x490
>   [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
>   [ 1475.923680] ? atime_needs_update+0x77/0xe0
>   [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
>   [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
>   [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
>   [ 1475.953403] aio_read+0xd5/0x180
>   [ 1475.959395] ? _cond_resched+0x15/0x30
>   [ 1475.965907] io_submit_one+0x20b/0x3c0
>   [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
>   [ 1475.979335] ? do_io_getevents+0x7c/0xc0
>   [ 1475.986009] do_syscall_64+0x5b/0x1a0
>   [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
>   [ 1476.000255] RIP: 0033:0x7f11fc27978d
>   [ 1476.006631] Code: Bad RIP value.
>   [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.

Does it also happen on a non-IMSM array? And did you try to reproduce it with
fb73b357fb reverted? I suppose fb73b357fb9 introduced the regression, given
that it is fixed by this one.

> Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
> Signed-off-by: Mariusz Tkaczyk<mariusz.tkaczyk@linux.intel.com>
> ---
>   drivers/md/raid1.c |  1 +
>   drivers/md/raid5.c | 49 +++++++++++++++++++++++-----------------------
>   2 files changed, 25 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index b222bafa1196..58c8eddb0f55 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1627,6 +1627,7 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>    * - if is on @rdev is removed.
>    * - if is off, @rdev is not removed, but recovery from it is disabled (@rdev is
>    *   very likely to fail).
> + *

You probably don't want to touch raid1 file in this patch.

>    * In both cases, &MD_BROKEN will be set in &mddev->flags.
>    */
>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 1240a5c16af8..bee953c8007f 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -686,17 +686,21 @@ int raid5_calc_degraded(struct r5conf *conf)
>   	return degraded;
>   }
>   
> -static int has_failed(struct r5conf *conf)
> +static bool has_failed(struct r5conf *conf)
>   {
> -	int degraded;
> +	int degraded = conf->mddev->degraded;
>   
> -	if (conf->mddev->reshape_position == MaxSector)
> -		return conf->mddev->degraded > conf->max_degraded;
> +	if (test_bit(MD_BROKEN, &conf->mddev->flags))
> +		return true;

If one member disk was set Faulty, which caused MD_BROKEN to be set, is it
possible to re-add the same member disk again?

[root@vm ~]# echo faulty > /sys/block/md0/md/dev-loop1/state
[root@vm ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 loop2[2] loop1[0](F)
       1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [2/1] [_U]
       bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
[root@vm ~]# echo re-add > /sys/block/md0/md/dev-loop1/state
[root@vm ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 loop2[2] loop1[0]
       1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [2/2] [UU]
       bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

And have you run mdadm test against the series?

> -	degraded = raid5_calc_degraded(conf);
> -	if (degraded > conf->max_degraded)
> -		return 1;
> -	return 0;
> +	if (conf->mddev->reshape_position != MaxSector)
> +		degraded = raid5_calc_degraded(conf);
> +
> +	if (degraded > conf->max_degraded) {
> +		set_bit(MD_BROKEN, &conf->mddev->flags);

Why not set the MD_BROKEN flag in the error handler to align with the other
levels? Or do it in md_error() only.

> +		return true;
> +	}
> +	return false;
>   }
>   
>   struct stripe_head *
> @@ -2877,34 +2881,29 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>   	unsigned long flags;
>   	pr_debug("raid456: error called\n");
>   
> +	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
> +		mdname(mddev), bdevname(rdev->bdev, b));
> +
>   	spin_lock_irqsave(&conf->device_lock, flags);
> +	set_bit(Faulty, &rdev->flags);
> +	clear_bit(In_sync, &rdev->flags);
> +	mddev->degraded = raid5_calc_degraded(conf);
>   
> -	if (test_bit(In_sync, &rdev->flags) &&
> -	    mddev->degraded == conf->max_degraded) {
> -		/*
> -		 * Don't allow to achieve failed state
> -		 * Don't try to recover this device
> -		 */
> +	if (has_failed(conf)) {
>   		conf->recovery_disabled = mddev->recovery_disabled;
> -		spin_unlock_irqrestore(&conf->device_lock, flags);
> -		return;

Ok, I think commit fb73b357fb985cc652a72a41541d25915c7f9635 is effectively
reverted by this hunk, so I would prefer to separate the revert part from
this patch; just my $0.02.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10
  2022-02-12  1:17   ` Guoqing Jiang
@ 2022-02-14  8:55     ` Mariusz Tkaczyk
  0 siblings, 0 replies; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-02-14  8:55 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: song, linux-raid

Hi Guoqing,
Thanks for the review. Song already merged this, so I will address the issues
in a new patch.

On Sat, 12 Feb 2022 09:17:57 +0800
Guoqing Jiang <guoqing.jiang@linux.dev> wrote:

> On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index f888ef197765..fda8473f96b8 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -2983,10 +2983,11 @@ state_store(struct md_rdev *rdev, const
> > char *buf, size_t len) 
> >   	if (cmd_match(buf, "faulty") && rdev->mddev->pers) {
> >   		md_error(rdev->mddev, rdev);
> > -		if (test_bit(Faulty, &rdev->flags))
> > -			err = 0;
> > -		else
> > +
> > +		if (test_bit(MD_BROKEN, &rdev->mddev->flags))
> >   			err = -EBUSY;
> > +		else
> > +			err = 0;  
> 
> Again, it is weird to check MD_BROKEN which is for mddev in rdev 
> specific function from
> my understanding.

As I wrote in description:

"-EBUSY expectation cannot be removed without breaking compatibility
with userspace."

I agree with your opinion, but I would like to avoid introducing a regression
here.
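For illustration, userspace roughly relies on something like the following
(a hypothetical snippet, not mdadm's actual code; the sysfs path is made up):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical member device path, for illustration only */
		int fd = open("/sys/block/md126/md/dev-nvme0n1/state", O_WRONLY);

		if (fd < 0)
			return 1;
		/* userspace keys off EBUSY returned by the "faulty" write */
		if (write(fd, "faulty", 6) < 0 && errno == EBUSY)
			fprintf(stderr, "setting faulty refused (EBUSY)\n");
		close(fd);
		return 0;
	}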
> 
> >   	} else if (cmd_match(buf, "remove")) {
> >   		if (rdev->mddev->pers) {
> >   			clear_bit(Blocked, &rdev->flags);
> > @@ -7441,7 +7442,7 @@ static int set_disk_faulty(struct mddev
> > *mddev, dev_t dev) err =  -ENODEV;
> >   	else {
> >   		md_error(mddev, rdev);
> > -		if (!test_bit(Faulty, &rdev->flags))
> > +		if (test_bit(MD_BROKEN, &mddev->flags))
> >   			err = -EBUSY;  
> 
> Ditto.
> 
> >   	}
> >   	rcu_read_unlock();
> > @@ -7987,12 +7988,14 @@ void md_error(struct mddev *mddev, struct
> > md_rdev *rdev) if (!mddev->pers->sync_request)
> >   		return;
> >   
> > -	if (mddev->degraded)
> > +	if (mddev->degraded && !test_bit(MD_BROKEN, &mddev->flags))
> >   		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >   	sysfs_notify_dirent_safe(rdev->sysfs_state);
> >   	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
> > -	set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > -	md_wakeup_thread(mddev->thread);
> > +	if (!test_bit(MD_BROKEN, &mddev->flags)) {
> > +		set_bit(MD_RECOVERY_NEEDED, &mddev->recovery);
> > +		md_wakeup_thread(mddev->thread);
> > +	}
> >   	if (mddev->event_work.func)
> >   		queue_work(md_misc_wq, &mddev->event_work);
> >   	md_new_event();
> > diff --git a/drivers/md/md.h b/drivers/md/md.h
> > index bc3f2094d0b6..1eb7d0e88cb2 100644
> > --- a/drivers/md/md.h
> > +++ b/drivers/md/md.h
> > @@ -234,34 +234,42 @@ extern int rdev_clear_badblocks(struct
> > md_rdev *rdev, sector_t s, int sectors, int is_new);
> >   struct md_cluster_info;
> >   
> > -/* change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag
> > is added */ +/**
> > + * enum mddev_flags - md device flags.
> > + * @MD_ARRAY_FIRST_USE: First use of array, needs initialization.
> > + * @MD_CLOSING: If set, we are closing the array, do not open it
> > then.
> > + * @MD_JOURNAL_CLEAN: A raid with journal is already clean.
> > + * @MD_HAS_JOURNAL: The raid array has journal feature set.
> > + * @MD_CLUSTER_RESYNC_LOCKED: cluster raid only, which means node,
> > already took
> > + *			       resync lock, need to release the
> > lock.
> > + * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is
> > supported as
> > + *			    calls to md_error() will never cause
> > the array to
> > + *			    become failed.
> > + * @MD_HAS_PPL:  The raid array has PPL feature set.
> > + * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature
> > set.
> > + * @MD_ALLOW_SB_UPDATE: md_check_recovery is allowed to update the
> > metadata
> > + *			 without taking reconfig_mutex.
> > + * @MD_UPDATING_SB: md_check_recovery is updating the metadata
> > without
> > + *		     explicitly holding reconfig_mutex.
> > + * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not
> > report that
> > + *		   array is ready yet.
> > + * @MD_BROKEN: This is used to stop writes and mark array as
> > failed.
> > + *
> > + * change UNSUPPORTED_MDDEV_FLAGS for each array type if new flag
> > is added
> > + */
> >   enum mddev_flags {
> > -	MD_ARRAY_FIRST_USE,	/* First use of array, needs
> > initialization */
> > -	MD_CLOSING,		/* If set, we are closing the
> > array, do not open
> > -				 * it then */
> > -	MD_JOURNAL_CLEAN,	/* A raid with journal is already
> > clean */
> > -	MD_HAS_JOURNAL,		/* The raid array has
> > journal feature set */
> > -	MD_CLUSTER_RESYNC_LOCKED, /* cluster raid only, which
> > means node
> > -				   * already took resync lock,
> > need to
> > -				   * release the lock */
> > -	MD_FAILFAST_SUPPORTED,	/* Using MD_FAILFAST on
> > metadata writes is
> > -				 * supported as calls to
> > md_error() will
> > -				 * never cause the array to become
> > failed.
> > -				 */
> > -	MD_HAS_PPL,		/* The raid array has PPL
> > feature set */
> > -	MD_HAS_MULTIPLE_PPLS,	/* The raid array has
> > multiple PPLs feature set */
> > -	MD_ALLOW_SB_UPDATE,	/* md_check_recovery is allowed
> > to update
> > -				 * the metadata without taking
> > reconfig_mutex.
> > -				 */
> > -	MD_UPDATING_SB,		/* md_check_recovery is
> > updating the metadata
> > -				 * without explicitly holding
> > reconfig_mutex.
> > -				 */
> > -	MD_NOT_READY,		/* do_md_run() is active, so
> > 'array_state'
> > -				 * must not report that array is
> > ready yet
> > -				 */
> > -	MD_BROKEN,              /* This is used in RAID-0/LINEAR
> > only, to stop
> > -				 * I/O in case an array member is
> > gone/failed.
> > -				 */  
> 
> People have to scroll up to see the meaning of each flag with the 
> change, I would
> suggest move these comments on top of each flag if you really hate 
> previous style,
> but just my personal flavor.

The kernel has well-defined definition and comment rules, so I followed them:
https://www.kernel.org/doc/html/latest/doc-guide/kernel-doc.html
Personally, I think we should follow them. It gives us a chance to generate
documentation.

> 
> > +	MD_ARRAY_FIRST_USE,
> > +	MD_CLOSING,
> > +	MD_JOURNAL_CLEAN,
> > +	MD_HAS_JOURNAL,
> > +	MD_CLUSTER_RESYNC_LOCKED,
> > +	MD_FAILFAST_SUPPORTED,
> > +	MD_HAS_PPL,
> > +	MD_HAS_MULTIPLE_PPLS,
> > +	MD_ALLOW_SB_UPDATE,
> > +	MD_UPDATING_SB,
> > +	MD_NOT_READY,
> > +	MD_BROKEN,
> >   };
> >   
> >   enum mddev_sb_flags {
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 7dc8026cf6ee..b222bafa1196 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -1615,30 +1615,37 @@ static void raid1_status(struct seq_file
> > *seq, struct mddev *mddev) seq_printf(seq, "]");
> >   }
> >   
> > +/**
> > + * raid1_error() - RAID1 error handler.
> > + * @mddev: affected md device.
> > + * @rdev: member device to remove.
> 
> It is confusing, rdev is removed in raid1_remove_disk only.
> 

Yes, that makes sense. I will address it.

> > + *
> > + * Error on @rdev is raised and it is needed to be removed from
> > @mddev.
> > + * If there are more than one active member, @rdev is always
> > removed.
> > + *
> > + * If it is the last active member, it depends on
> > &mddev->fail_last_dev:
> > + * - if is on @rdev is removed.
> > + * - if is off, @rdev is not removed, but recovery from it is
> > disabled (@rdev is
> > + *   very likely to fail).
> > + * In both cases, &MD_BROKEN will be set in &mddev->flags.
> > + */
> >   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> >   {
> >   	char b[BDEVNAME_SIZE];
> >   	struct r1conf *conf = mddev->private;
> >   	unsigned long flags;
> >   
> > -	/*
> > -	 * If it is not operational, then we have already marked
> > it as dead
> > -	 * else if it is the last working disks with
> > "fail_last_dev == false",
> > -	 * ignore the error, let the next level up know.
> > -	 * else mark the drive as failed
> > -	 */
> >   	spin_lock_irqsave(&conf->device_lock, flags);
> > -	if (test_bit(In_sync, &rdev->flags) &&
> > !mddev->fail_last_dev
> > -	    && (conf->raid_disks - mddev->degraded) == 1) {
> > -		/*
> > -		 * Don't fail the drive, act as though we were
> > just a
> > -		 * normal single drive.
> > -		 * However don't try a recovery from this drive as
> > -		 * it is very likely to fail.
> > -		 */
> > -		conf->recovery_disabled = mddev->recovery_disabled;
> > -		spin_unlock_irqrestore(&conf->device_lock, flags);
> > -		return;
> > +
> > +	if (test_bit(In_sync, &rdev->flags) &&
> > +	    (conf->raid_disks - mddev->degraded) == 1) {
> > +		set_bit(MD_BROKEN, &mddev->flags);
> > +
> > +		if (!mddev->fail_last_dev) {
> > +			conf->recovery_disabled =
> > mddev->recovery_disabled;
> > +			spin_unlock_irqrestore(&conf->device_lock,
> > flags);
> > +			return;
> > +		}
> >   	}
> >   	set_bit(Blocked, &rdev->flags);
> >   	if (test_and_clear_bit(In_sync, &rdev->flags))
> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > index dde98f65bd04..7471e20d7cd6 100644
> > --- a/drivers/md/raid10.c
> > +++ b/drivers/md/raid10.c
> > @@ -1945,26 +1945,37 @@ static int enough(struct r10conf *conf, int
> > ignore) _enough(conf, 1, ignore);
> >   }
> >   
> > +/**
> > + * raid10_error() - RAID10 error handler.
> > + * @mddev: affected md device.
> > + * @rdev: member device to remove.  
> 
> Ditto.
>
Noted.

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-02-12  1:12   ` Guoqing Jiang
@ 2022-02-14  9:37     ` Mariusz Tkaczyk
  2022-02-15  3:43       ` Guoqing Jiang
  0 siblings, 1 reply; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-02-14  9:37 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: song, linux-raid

On Sat, 12 Feb 2022 09:12:00 +0800
Guoqing Jiang <guoqing.jiang@linux.dev> wrote:

> On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
> > Patch 62f7b1989c0 ("md raid0/linear: Mark array as 'broken' and
> > fail BIOs if a member is gone") allowed to finish writes earlier
> > (before level dependent actions) for non-redundant arrays.
> >
> > To achieve that MD_BROKEN is added to mddev->flags if drive
> > disappearance is detected. This is done in is_mddev_broken() which
> > is confusing and not consistent with other levels where
> > error_handler() is used. This patch adds appropriate error_handler
> > for raid0 and linear.  
> 
> I think the purpose of them are quite different, as said before, 
> error_handler
> is mostly against rdev while is_mddev_broken is for mddev though it
> needs to test rdev first.

I changed is_mddev_broken to is_rdev_broken because it now only checks the
device. On error it calls md_error(), which in turn calls the error_handler.
I unified the error handling for each level. Do you consider that wrong?
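For reference, the helper after the change boils down to roughly this (a
sketch from memory, it may not match the merged patch exactly):

	static inline bool is_rdev_broken(struct md_rdev *rdev)
	{
		/* the member is considered broken once its disk has gone away */
		return !disk_live(rdev->bdev->bd_disk);
	}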

> 
> > It also adopts md_error(), we only want to call .error_handler for
> > those levels. mddev->pers->sync_request is additionally checked,
> > its existence implies a level with redundancy.
> >
> > Usage of error_handler causes that disk failure can be requested
> > from userspace. User can fail the array via #mdadm --set-faulty
> > command. This is not safe and will be fixed in mdadm.  
> 
> What is the safe issue here? It would betterr to post mdadm fix
> together.

We can and should block the user from damaging the raid even if the damage is
recoverable. It is a regression.
I will fix mdadm. I don't consider it a big risk (because it is recoverable),
so I focused on the kernel part first.

> 
> > It is correctable because failed
> > state is not recorded in the metadata. After next assembly array
> > will be read-write again.  
> 
> I don't think it is a problem, care to explain why it can't be RW
> again?

The failed state is not recoverable at runtime, so you need to recreate the
array.

> 
> > For safety reason is better to keep MD_BROKEN in runtime only.  
> 
> Isn't MD_BROKEN runtime already? It is mddev_flags not mddev_sb_flags.

Yes, and this is why I didn't propagate it.
> 
> > Signed-off-by: Mariusz Tkaczyk<mariusz.tkaczyk@linux.intel.com>
> > ---
> >   drivers/md/md-linear.c | 15 ++++++++++++++-
> >   drivers/md/md.c        |  6 +++++-
> >   drivers/md/md.h        | 10 ++--------
> >   drivers/md/raid0.c     | 15 ++++++++++++++-
> >   4 files changed, 35 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
> > index 1ff51647a682..3c368e3e4641 100644
> > --- a/drivers/md/md-linear.c
> > +++ b/drivers/md/md-linear.c
> > @@ -233,7 +233,8 @@ static bool linear_make_request(struct mddev
> > *mddev, struct bio *bio) bio_sector < start_sector))
> >   		goto out_of_bounds;
> >   
> > -	if (unlikely(is_mddev_broken(tmp_dev->rdev, "linear"))) {
> > +	if (unlikely(is_rdev_broken(tmp_dev->rdev))) {
> > +		md_error(mddev, tmp_dev->rdev);  
> 
> [ ... ]
> 
> >   
> > +static void linear_error(struct mddev *mddev, struct md_rdev *rdev)
> > +{
> > +	if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags)) {  
> 
> s/rdev->mddev/mddev/

Noted.
> 
> > +		char *md_name = mdname(mddev);
> > +
> > +		pr_crit("md/linear%s: Disk failure on %pg
> > detected.\n"
> > +			"md/linear:%s: Cannot continue, failing
> > array.\n",
> > +			md_name, rdev->bdev, md_name);  
> 
> The second md_name is not needed.
Could you elaborate here more? Do you want to skip device name in
second message?

> 
> > +	}
> > +}
> > +
> >   static void linear_quiesce(struct mddev *mddev, int state)
> >   {
> >   }
> > @@ -297,6 +309,7 @@ static struct md_personality linear_personality
> > = .hot_add_disk	= linear_add,
> >   	.size		= linear_size,
> >   	.quiesce	= linear_quiesce,
> > +	.error_handler	= linear_error,
> >   };
> >   
> >   static int __init linear_init (void)
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index e8666bdc0d28..f888ef197765 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -7982,7 +7982,11 @@ void md_error(struct mddev *mddev, struct
> > md_rdev *rdev) 
> >   	if (!mddev->pers || !mddev->pers->error_handler)
> >   		return;
> > -	mddev->pers->error_handler(mddev,rdev);
> > +	mddev->pers->error_handler(mddev, rdev);
> > +
> > +	if (!mddev->pers->sync_request)
> > +		return;  
> 
> The above only valid for raid0 and linear, I guess it is fine if DM 
> don't create LV on top
> of them. But the new checking deserves some comment above.
Will do, could you propose comment?

> 
> > +
> >   	if (mddev->degraded)
> >   		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
> >   	sysfs_notify_dirent_safe(rdev->sysfs_state);  
> 
> [ ... ]
> 
> > +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
> > +{
> > +	if (!test_and_set_bit(MD_BROKEN, &rdev->mddev->flags)) {
> > +		char *md_name = mdname(mddev);
> > +
> > +		pr_crit("md/raid0%s: Disk failure on %pg
> > detected.\n"
> > +			"md/raid0:%s: Cannot continue, failing
> > array.\n",
> > +			md_name, rdev->bdev, md_name);  
> 
> The comments for linear_error also valid here.
> 
Noted.

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-02-14  9:37     ` Mariusz Tkaczyk
@ 2022-02-15  3:43       ` Guoqing Jiang
  2022-02-15 14:06         ` Mariusz Tkaczyk
  0 siblings, 1 reply; 29+ messages in thread
From: Guoqing Jiang @ 2022-02-15  3:43 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: song, linux-raid



On 2/14/22 5:37 PM, Mariusz Tkaczyk wrote:
> On Sat, 12 Feb 2022 09:12:00 +0800
> Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>> On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
>>> Patch 62f7b1989c0 ("md raid0/linear: Mark array as 'broken' and
>>> fail BIOs if a member is gone") allowed to finish writes earlier
>>> (before level dependent actions) for non-redundant arrays.
>>>
>>> To achieve that MD_BROKEN is added to mddev->flags if drive
>>> disappearance is detected. This is done in is_mddev_broken() which
>>> is confusing and not consistent with other levels where
>>> error_handler() is used. This patch adds appropriate error_handler
>>> for raid0 and linear.
>> I think the purpose of them are quite different, as said before,
>> error_handler
>> is mostly against rdev while is_mddev_broken is for mddev though it
>> needs to test rdev first.
> I changed is_mddev_broken to is_rdev_broken, because it checks the
> device now. On error it calls md_error and later error_handler.
> I unified error handling for each level. Do you consider it as wrong?

I am neutral about the change.

>> It also adopts md_error(), we only want to call .error_handler for
>> those levels. mddev->pers->sync_request is additionally checked,
>> its existence implies a level with redundancy.
>>
>> Usage of error_handler causes that disk failure can be requested
>> from userspace. User can fail the array via #mdadm --set-faulty
>> command. This is not safe and will be fixed in mdadm.
>> What is the safe issue here? It would betterr to post mdadm fix
>> together.
> We can and should block user from damaging raid even if it is
> recoverable. It is a regression.

I don't follow; did you mean that --set-faulty from mdadm could end up
"damaging raid"?

> I will fix mdadm. I don't consider it as a big risk (because it is
> recoverable) so I focused on kernel part first.
>
>>> It is correctable because failed
>>> state is not recorded in the metadata. After next assembly array
>>> will be read-write again.
>> I don't think it is a problem, care to explain why it can't be RW
>> again?
> failed state is not recoverable in runtime, so you need to recreate
> array.

IIUC, the failfast flag is supposed to be set during a transient error, not a
permanent failure; the rdev (marked as failfast) needs to be revalidated and
re-added to the array.


[ ... ]

>>> +		char *md_name = mdname(mddev);
>>> +
>>> +		pr_crit("md/linear%s: Disk failure on %pg
>>> detected.\n"
>>> +			"md/linear:%s: Cannot continue, failing
>>> array.\n",
>>> +			md_name, rdev->bdev, md_name);
>> The second md_name is not needed.
> Could you elaborate here more? Do you want to skip device name in
> second message?

Yes, we printed two md_name here, seems unnecessary.

[ ... ]

>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -7982,7 +7982,11 @@ void md_error(struct mddev *mddev, struct
>>> md_rdev *rdev)
>>>    	if (!mddev->pers || !mddev->pers->error_handler)
>>>    		return;
>>> -	mddev->pers->error_handler(mddev,rdev);
>>> +	mddev->pers->error_handler(mddev, rdev);
>>> +
>>> +	if (!mddev->pers->sync_request)
>>> +		return;
>> The above only valid for raid0 and linear, I guess it is fine if DM
>> don't create LV on top
>> of them. But the new checking deserves some comment above.
> Will do, could you propose comment?

Or just check directly whether it is raid0 or linear, instead of implying a
level with redundancy.
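Something along these lines, just a sketch of the idea:

	/* raid0 and linear are the only levels without redundancy here */
	if (mddev->pers->level == 0 || mddev->pers->level == LEVEL_LINEAR)
		return;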

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-02-15  3:43       ` Guoqing Jiang
@ 2022-02-15 14:06         ` Mariusz Tkaczyk
  2022-02-16  9:47           ` Xiao Ni
  2022-02-22  6:34           ` Song Liu
  0 siblings, 2 replies; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-02-15 14:06 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: song, linux-raid

On Tue, 15 Feb 2022 11:43:34 +0800
Guoqing Jiang <guoqing.jiang@linux.dev> wrote:

> >> It also adopts md_error(), we only want to call .error_handler for
> >> those levels. mddev->pers->sync_request is additionally checked,
> >> its existence implies a level with redundancy.
> >>
> >> Usage of error_handler causes that disk failure can be requested
> >> from userspace. User can fail the array via #mdadm --set-faulty
> >> command. This is not safe and will be fixed in mdadm.
> >> What is the safe issue here? It would betterr to post mdadm fix
> >> together.  
> > We can and should block user from damaging raid even if it is
> > recoverable. It is a regression.  
> 
> I don't follow, did you mean --set-fault from mdadm could "damaging
> raid"?

Yes, it is now possible to impose the failed state. This is a regression I
caused, and I'm aware of that.
> 
> > I will fix mdadm. I don't consider it as a big risk (because it is
> > recoverable) so I focused on kernel part first.
> >  
> >>> It is correctable because failed
> >>> state is not recorded in the metadata. After next assembly array
> >>> will be read-write again.  
> >> I don't think it is a problem, care to explain why it can't be RW
> >> again?  
> > failed state is not recoverable in runtime, so you need to recreate
> > array.  
> 
> IIUC, the failfast flag is supposed to be set during transient error
> not permanent failure, the rdev (marked as failfast) need to be
> revalidated and readded to array.
> 

The device is never marked as failed. I checked the write paths (because I
introduced a more aggressive policy for writes), and if we are in a critical
array the failfast flag is not added to the bio, for both raid1 and raid10;
see raid1_write_request().
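The relevant raid1 write-path check is roughly the following (paraphrased
from memory, not the exact kernel code):

	/* only tag the write as failfast while the array still has redundancy */
	if (test_bit(FailFast, &rdev->flags) &&
	    conf->raid_disks - mddev->degraded > 1)
		mbio->bi_opf |= MD_FAILFAST;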
> 
> [ ... ] 
> 
> >>> +		char *md_name = mdname(mddev);
> >>> +
> >>> +		pr_crit("md/linear%s: Disk failure on %pg
> >>> detected.\n"
> >>> +			"md/linear:%s: Cannot continue, failing
> >>> array.\n",
> >>> +			md_name, rdev->bdev, md_name);  
> >> The second md_name is not needed.  
> > Could you elaborate here more? Do you want to skip device name in
> > second message?  
> 
> Yes, we printed two md_name here, seems unnecessary.
> 
I will merge the errors into one message.
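Something like this, for example (just a sketch, exact wording to be decided):

	pr_crit("md/linear:%s: Disk failure on %pg detected, failing array.\n",
		mdname(mddev), rdev->bdev);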

> 
> >>> --- a/drivers/md/md.c
> >>> +++ b/drivers/md/md.c
> >>> @@ -7982,7 +7982,11 @@ void md_error(struct mddev *mddev, struct
> >>> md_rdev *rdev)
> >>>    	if (!mddev->pers || !mddev->pers->error_handler)
> >>>    		return;
> >>> -	mddev->pers->error_handler(mddev,rdev);
> >>> +	mddev->pers->error_handler(mddev, rdev);
> >>> +
> >>> +	if (!mddev->pers->sync_request)
> >>> +		return;  
> >> The above only valid for raid0 and linear, I guess it is fine if DM
> >> don't create LV on top
> >> of them. But the new checking deserves some comment above.  
> > Will do, could you propose comment?  
> 
> Or, just check if it is raid0 or linear directly instead of implies 
> level with
> redundancy.

Got it.

Thanks,
Mariusz


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-02-15 14:06         ` Mariusz Tkaczyk
@ 2022-02-16  9:47           ` Xiao Ni
  2022-02-22  6:34           ` Song Liu
  1 sibling, 0 replies; 29+ messages in thread
From: Xiao Ni @ 2022-02-16  9:47 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: Guoqing Jiang, Song Liu, linux-raid

On Tue, Feb 15, 2022 at 10:07 PM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> On Tue, 15 Feb 2022 11:43:34 +0800
> Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> > >> It also adopts md_error(), we only want to call .error_handler for
> > >> those levels. mddev->pers->sync_request is additionally checked,
> > >> its existence implies a level with redundancy.
> > >>
> > >> Usage of error_handler causes that disk failure can be requested
> > >> from userspace. User can fail the array via #mdadm --set-faulty
> > >> command. This is not safe and will be fixed in mdadm.
> > >> What is the safe issue here? It would betterr to post mdadm fix
> > >> together.
> > > We can and should block user from damaging raid even if it is
> > > recoverable. It is a regression.
> >
> > I don't follow, did you mean --set-fault from mdadm could "damaging
> > raid"?
>
> Yes, now it will be able to impose failed state. This is a regression I
> caused and I'm aware of that.
> >
> > > I will fix mdadm. I don't consider it as a big risk (because it is
> > > recoverable) so I focused on kernel part first.
> > >
> > >>> It is correctable because failed
> > >>> state is not recorded in the metadata. After next assembly array
> > >>> will be read-write again.
> > >> I don't think it is a problem, care to explain why it can't be RW
> > >> again?
> > > failed state is not recoverable in runtime, so you need to recreate
> > > array.
> >
> > IIUC, the failfast flag is supposed to be set during transient error
> > not permanent failure, the rdev (marked as failfast) need to be
> > revalidated and readded to array.
> >
>
> The device is never marked as failed. I checked write paths (because I
> introduced more aggressive policy for writes) and if we are in a
> critical array, failfast flag is not added to bio for both raid1 and
> radi10, see raid1_write_request().

Hi all

The "failed state" in the patch should mean MD_BROKEN, right? If so, isn't
this discussion heading in the wrong direction?

> >
> > [ ... ]
> >
> > >>> +         char *md_name = mdname(mddev);
> > >>> +
> > >>> +         pr_crit("md/linear%s: Disk failure on %pg
> > >>> detected.\n"
> > >>> +                 "md/linear:%s: Cannot continue, failing
> > >>> array.\n",
> > >>> +                 md_name, rdev->bdev, md_name);
> > >> The second md_name is not needed.
> > > Could you elaborate here more? Do you want to skip device name in
> > > second message?
> >
> > Yes, we printed two md_name here, seems unnecessary.
> >
> I will merge errors to one message.
>
> >
> > >>> --- a/drivers/md/md.c
> > >>> +++ b/drivers/md/md.c
> > >>> @@ -7982,7 +7982,11 @@ void md_error(struct mddev *mddev, struct
> > >>> md_rdev *rdev)
> > >>>           if (!mddev->pers || !mddev->pers->error_handler)
> > >>>                   return;
> > >>> - mddev->pers->error_handler(mddev,rdev);
> > >>> + mddev->pers->error_handler(mddev, rdev);
> > >>> +
> > >>> + if (!mddev->pers->sync_request)
> > >>> +         return;
> > >> The above only valid for raid0 and linear, I guess it is fine if DM
> > >> don't create LV on top

Hi Guoqing, could you explain this a bit more? If we create an LV on top of
the raid device, does this check not work?

Regards
Xiao


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-02-15 14:06         ` Mariusz Tkaczyk
  2022-02-16  9:47           ` Xiao Ni
@ 2022-02-22  6:34           ` Song Liu
  2022-02-22 13:02             ` Mariusz Tkaczyk
  1 sibling, 1 reply; 29+ messages in thread
From: Song Liu @ 2022-02-22  6:34 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: Guoqing Jiang, linux-raid

Hi Mariusz,

On Tue, Feb 15, 2022 at 6:06 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> On Tue, 15 Feb 2022 11:43:34 +0800
> Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
> > >> It also adopts md_error(), we only want to call .error_handler for
> > >> those levels. mddev->pers->sync_request is additionally checked,
> > >> its existence implies a level with redundancy.
> > >>
> > >> Usage of error_handler causes that disk failure can be requested
> > >> from userspace. User can fail the array via #mdadm --set-faulty
> > >> command. This is not safe and will be fixed in mdadm.
> > >> What is the safe issue here? It would betterr to post mdadm fix
> > >> together.
> > > We can and should block user from damaging raid even if it is
> > > recoverable. It is a regression.
> >
> > I don't follow, did you mean --set-fault from mdadm could "damaging
> > raid"?
>
> Yes, now it will be able to impose failed state. This is a regression I
> caused and I'm aware of that.
> >
> > > I will fix mdadm. I don't consider it as a big risk (because it is
> > > recoverable) so I focused on kernel part first.
> > >
> > >>> It is correctable because failed
> > >>> state is not recorded in the metadata. After next assembly array
> > >>> will be read-write again.
> > >> I don't think it is a problem, care to explain why it can't be RW
> > >> again?
> > > failed state is not recoverable in runtime, so you need to recreate
> > > array.
> >
> > IIUC, the failfast flag is supposed to be set during transient error
> > not permanent failure, the rdev (marked as failfast) need to be
> > revalidated and readded to array.
> >
>
> The device is never marked as failed. I checked write paths (because I
> introduced more aggressive policy for writes) and if we are in a
> critical array, failfast flag is not added to bio for both raid1 and
> radi10, see raid1_write_request().
> >
> > [ ... ]
> >
> > >>> +         char *md_name = mdname(mddev);
> > >>> +
> > >>> +         pr_crit("md/linear%s: Disk failure on %pg
> > >>> detected.\n"
> > >>> +                 "md/linear:%s: Cannot continue, failing
> > >>> array.\n",
> > >>> +                 md_name, rdev->bdev, md_name);
> > >> The second md_name is not needed.
> > > Could you elaborate here more? Do you want to skip device name in
> > > second message?
> >
> > Yes, we printed two md_name here, seems unnecessary.
> >
> I will merge errors to one message.
>
> >
> > >>> --- a/drivers/md/md.c
> > >>> +++ b/drivers/md/md.c
> > >>> @@ -7982,7 +7982,11 @@ void md_error(struct mddev *mddev, struct
> > >>> md_rdev *rdev)
> > >>>           if (!mddev->pers || !mddev->pers->error_handler)
> > >>>                   return;
> > >>> - mddev->pers->error_handler(mddev,rdev);
> > >>> + mddev->pers->error_handler(mddev, rdev);
> > >>> +
> > >>> + if (!mddev->pers->sync_request)
> > >>> +         return;
> > >> The above only valid for raid0 and linear, I guess it is fine if DM
> > >> don't create LV on top
> > >> of them. But the new checking deserves some comment above.
> > > Will do, could you propose comment?
> >
> > Or, just check if it is raid0 or linear directly instead of implies
> > level with
> > redundancy.
>
> Got it.

Could you please resend the patchset with feedback addressed? I can
apply the newer version and force push to md-next.

Thanks,
Song

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear
  2022-02-22  6:34           ` Song Liu
@ 2022-02-22 13:02             ` Mariusz Tkaczyk
  0 siblings, 0 replies; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-02-22 13:02 UTC (permalink / raw)
  To: Song Liu; +Cc: Guoqing Jiang, linux-raid

Hi Song,
On Mon, 21 Feb 2022 22:34:02 -0800
Song Liu <song@kernel.org> wrote:

> Could you please resend the patchset with feedback addressed? I can
> apply the newer version and force push to md-next.

Yes, I'm working on it. I cannot estimate when it will be ready; there is an
ongoing discussion. If you prefer, you can take it off md-next for now.

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-02-12  1:47   ` Guoqing Jiang
@ 2022-02-22 14:18     ` Mariusz Tkaczyk
  2022-02-25  7:22       ` Guoqing Jiang
  0 siblings, 1 reply; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-02-22 14:18 UTC (permalink / raw)
  To: Guoqing Jiang, song; +Cc: linux-raid

Hi Guoqing,
Sorry for the delay, I missed it.

On Sat, 12 Feb 2022 09:47:38 +0800
Guoqing Jiang <guoqing.jiang@linux.dev> wrote:

> On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote:
> > Raid456 module had allowed to achieve failed state. It was fixed by
> > fb73b357fb9 ("raid5: block failing device if raid will be failed").
> > This fix introduces a bug, now if raid5 fails during IO, it may
> > result with a hung task without completion. Faulty flag on the
> > device is necessary to process all requests and is checked many
> > times, mainly in analyze_stripe().
> > Allow to set faulty on drive again and set MD_BROKEN if raid is
> > failed.
> >
> > As a result, this level is allowed to achieve failed state again,
> > but communication with userspace (via -EBUSY status) will be
> > preserved.
> >
> > This restores possibility to fail array via #mdadm --set-faulty
> > command and will be fixed by additional verification on mdadm side.
> >  
> 
> Again, you better to send mdadm change along with the series.

I understand your objections. Unfortunately, I have not been able to handle
it yet. I focused on the kernel part first because mdadm was in a freeze.
In mdadm, I need to block manual removal, which should be simple; there is
already an enough() function defined. It is at the top of my TODO list.
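The mdadm-side guard could be as simple as something like this (a purely
hypothetical sketch, not mdadm's actual code or API):

	/* hypothetical helper: refuse --set-faulty if it would fail the array */
	static int would_fail_array(int level, int failed_disks)
	{
		/* raid4/raid5 tolerate one missing member, raid6 two */
		int max_degraded = (level == 6) ? 2 : 1;

		return failed_disks + 1 > max_degraded;
	}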
  
> 
> > Reproduction steps:
> >   mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
> >   mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
> >   mkfs.xfs /dev/md126 -f
> >   mount /dev/md126 /mnt/root/
> >
> >   fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw
> > --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4
> > --time_based --group_reporting --name=throughput-test-job
> > --eta-newline=1 &
> >
> >   echo 1 > /sys/block/nvme2n1/device/device/remove
> >   echo 1 > /sys/block/nvme1n1/device/device/remove
> >
> >   [ 1475.787779] Call Trace:
> >   [ 1475.793111] __schedule+0x2a6/0x700
> >   [ 1475.799460] schedule+0x38/0xa0
> >   [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
> >   [ 1475.813856] ? finish_wait+0x80/0x80
> >   [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
> >   [ 1475.828281] ? finish_wait+0x80/0x80
> >   [ 1475.834727] ? finish_wait+0x80/0x80
> >   [ 1475.841127] ? finish_wait+0x80/0x80
> >   [ 1475.847480] md_handle_request+0x119/0x190
> >   [ 1475.854390] md_make_request+0x8a/0x190
> >   [ 1475.861041] generic_make_request+0xcf/0x310
> >   [ 1475.868145] submit_bio+0x3c/0x160
> >   [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
> >   [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
> >   [ 1475.889149] iomap_apply+0xff/0x310
> >   [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
> >   [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
> >   [ 1475.909974] iomap_dio_rw+0x2f2/0x490
> >   [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
> >   [ 1475.923680] ? atime_needs_update+0x77/0xe0
> >   [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
> >   [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
> >   [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
> >   [ 1475.953403] aio_read+0xd5/0x180
> >   [ 1475.959395] ? _cond_resched+0x15/0x30
> >   [ 1475.965907] io_submit_one+0x20b/0x3c0
> >   [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
> >   [ 1475.979335] ? do_io_getevents+0x7c/0xc0
> >   [ 1475.986009] do_syscall_64+0x5b/0x1a0
> >   [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
> >   [ 1476.000255] RIP: 0033:0x7f11fc27978d
> >   [ 1476.006631] Code: Bad RIP value.
> >   [ 1476.073251] INFO: task fio:3877 blocked for more than 120
> > seconds.  
> 
> Does it also happen to non imsm array? And did you try to reproduce
> it with revert fb73b357fb? I suppose fb73b357fb9 introduced the
> regression given it is fixed by this one.
> 
It is reproducible on both native and IMSM arrays. Yes, we could fix it by
reverting that commit.

> 
> >    * In both cases, &MD_BROKEN will be set in &mddev->flags.
> >    */
> >   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 1240a5c16af8..bee953c8007f 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -686,17 +686,21 @@ int raid5_calc_degraded(struct r5conf *conf)
> >   	return degraded;
> >   }
> >   
> > -static int has_failed(struct r5conf *conf)
> > +static bool has_failed(struct r5conf *conf)
> >   {
> > -	int degraded;
> > +	int degraded = conf->mddev->degraded;
> >   
> > -	if (conf->mddev->reshape_position == MaxSector)
> > -		return conf->mddev->degraded > conf->max_degraded;
> > +	if (test_bit(MD_BROKEN, &conf->mddev->flags))
> > +		return true;  
> 
> If one member disk was set Faulty which caused BROKEN was set, is it
> possible to re-add the same member disk again?
> 
Is it possible to re-add a drive to a failed raid5 array now? From my
understanding of raid5_add_disk() it is not possible.

> [root@vm ~]# echo faulty > /sys/block/md0/md/dev-loop1/state
> [root@vm ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 loop2[2] loop1[0](F)
>        1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2
> [2/1] [_U] bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> [root@vm ~]# echo re-add > /sys/block/md0/md/dev-loop1/state
> [root@vm ~]# cat /proc/mdstat
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 loop2[2] loop1[0]
>        1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2
> [2/2] [UU] bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> 
> And have you run mdadm test against the series?
> 
I ran the IMSM test suite and our internal IMSM scope. I will take the
challenge and verify with native arrays as well. Thanks for the suggestion.

> > -	degraded = raid5_calc_degraded(conf);
> > -	if (degraded > conf->max_degraded)
> > -		return 1;
> > -	return 0;
> > +	if (conf->mddev->reshape_position != MaxSector)
> > +		degraded = raid5_calc_degraded(conf);
> > +
> > +	if (degraded > conf->max_degraded) {
> > +		set_bit(MD_BROKEN, &conf->mddev->flags);  
> 
> Why not set BROKEN flags in err handler to align with other levels? Or
> do it in md_error only.

https://lore.kernel.org/linux-raid/3da9324e-01e7-2a07-4bcd-14245db56693@linux.dev/

You suggested that.
The other levels don't have dedicated has_failed() routines. For raid5 it is
reasonable to set it in has_failed().

I can't do that in md_error() because I don't have that information in all
cases; the !test_bit(Faulty, &rdev->flags) result varies.

> 
> > +		return true;
> > +	}
> > +	return false;
> >   }
> >   
> >   struct stripe_head *
> > @@ -2877,34 +2881,29 @@ static void raid5_error(struct mddev
> > *mddev, struct md_rdev *rdev) unsigned long flags;
> >   	pr_debug("raid456: error called\n");
> >   
> > +	pr_crit("md/raid:%s: Disk failure on %s, disabling
> > device.\n",
> > +		mdname(mddev), bdevname(rdev->bdev, b));
> > +
> >   	spin_lock_irqsave(&conf->device_lock, flags);
> > +	set_bit(Faulty, &rdev->flags);
> > +	clear_bit(In_sync, &rdev->flags);
> > +	mddev->degraded = raid5_calc_degraded(conf);
> >   
> > -	if (test_bit(In_sync, &rdev->flags) &&
> > -	    mddev->degraded == conf->max_degraded) {
> > -		/*
> > -		 * Don't allow to achieve failed state
> > -		 * Don't try to recover this device
> > -		 */
> > +	if (has_failed(conf)) {
> >   		conf->recovery_disabled =
> > mddev->recovery_disabled;
> > -		spin_unlock_irqrestore(&conf->device_lock, flags);
> > -		return;  
> 
> Ok, I think commit fb73b357fb985cc652a72a41541d25915c7f9635 is
> effectively reverted by this hunk. So I would prefer to separate the
> revert part from this patch, just my 0.02$.

Song,
do you want revert?

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-02-22 14:18     ` Mariusz Tkaczyk
@ 2022-02-25  7:22       ` Guoqing Jiang
  2022-03-03 16:21         ` Mariusz Tkaczyk
  0 siblings, 1 reply; 29+ messages in thread
From: Guoqing Jiang @ 2022-02-25  7:22 UTC (permalink / raw)
  To: Mariusz Tkaczyk, song; +Cc: linux-raid

Hi Mariusz,

On 2/22/22 10:18 PM, Mariusz Tkaczyk wrote:
>
>>>    
>>> -static int has_failed(struct r5conf *conf)
>>> +static bool has_failed(struct r5conf *conf)
>>>    {
>>> -	int degraded;
>>> +	int degraded = conf->mddev->degraded;
>>>    
>>> -	if (conf->mddev->reshape_position == MaxSector)
>>> -		return conf->mddev->degraded > conf->max_degraded;
>>> +	if (test_bit(MD_BROKEN, &conf->mddev->flags))
>>> +		return true;
>> If one member disk was set Faulty which caused BROKEN was set, is it
>> possible to re-add the same member disk again?
>>
> Is possible to re-add drive to failed raid5 array now? From my
> understanding of raid5_add_disk it is not possible.

I mean the below steps, it works as you can see.

>> [root@vm ~]# echo faulty > /sys/block/md0/md/dev-loop1/state
>> [root@vm ~]# cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid5 loop2[2] loop1[0](F)
>>         1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2
>> [2/1] [_U] bitmap: 0/1 pages [0KB], 65536KB chunk
>>
>> unused devices: <none>
>> [root@vm ~]# echo re-add > /sys/block/md0/md/dev-loop1/state
>> [root@vm ~]# cat /proc/mdstat
>> Personalities : [raid6] [raid5] [raid4]
>> md0 : active raid5 loop2[2] loop1[0]
>>         1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2
>> [2/2] [UU] bitmap: 0/1 pages [0KB], 65536KB chunk
>>
>> unused devices: <none>
>>
>> And have you run mdadm test against the series?
>>
> I run imsm test suite and our internal IMSM scope. I will take the
> challenge and will verify with native. Thanks for suggestion.

Cool, thank you.

BTW, I know the mdadm test suite is somewhat broken, at least this test that
I am aware of.

https://lore.kernel.org/all/20220119055501.GD27703@xsang-OptiPlex-9020/

And given the complexity of md, the more we test, the more bugs we can
avoid.


>>> -	degraded = raid5_calc_degraded(conf);
>>> -	if (degraded > conf->max_degraded)
>>> -		return 1;
>>> -	return 0;
>>> +	if (conf->mddev->reshape_position != MaxSector)
>>> +		degraded = raid5_calc_degraded(conf);
>>> +
>>> +	if (degraded > conf->max_degraded) {
>>> +		set_bit(MD_BROKEN, &conf->mddev->flags);
>> Why not set BROKEN flags in err handler to align with other levels? Or
>> do it in md_error only.
> https://lore.kernel.org/linux-raid/3da9324e-01e7-2a07-4bcd-14245db56693@linux.dev/
>
> You suggested that.
> Other levels doesn't have dedicates has_failed() routines. For raid5 it
> is reasonable to set it in has_failed().

When has_failed() returns true it means MD_BROKEN should be set; if so, it
makes sense to set it in raid5_error().
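I.e. roughly this, as a sketch of the idea (inside raid5_error(), after
mddev->degraded has been updated):

	if (has_failed(conf)) {
		set_bit(MD_BROKEN, &mddev->flags);
		conf->recovery_disabled = mddev->recovery_disabled;
	}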


> I can't do that in md_error because I don't have such information in
> all cases. !test_bit("Faulty", rdev->flags) result varies.

Fair enough.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-02-25  7:22       ` Guoqing Jiang
@ 2022-03-03 16:21         ` Mariusz Tkaczyk
  0 siblings, 0 replies; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-03-03 16:21 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: song, linux-raid

On Fri, 25 Feb 2022 15:22:00 +0800
Guoqing Jiang <guoqing.jiang@linux.dev> wrote:

> >> If one member disk was set Faulty, which caused MD_BROKEN to be set,
> >> is it possible to re-add the same member disk again?
> >>  
> > Is it possible to re-add a drive to a failed raid5 array now? From my
> > understanding of raid5_add_disk, it is not possible.  
> 
> I mean the steps below; it works as you can see.
> 
> >> [root@vm ~]# echo faulty > /sys/block/md0/md/dev-loop1/state
> >> [root@vm ~]# cat /proc/mdstat
> >> Personalities : [raid6] [raid5] [raid4]
> >> md0 : active raid5 loop2[2] loop1[0](F)
> >>       1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [2/1] [_U]
> >>       bitmap: 0/1 pages [0KB], 65536KB chunk
> >>
> >> unused devices: <none>
> >> [root@vm ~]# echo re-add > /sys/block/md0/md/dev-loop1/state
> >> [root@vm ~]# cat /proc/mdstat
> >> Personalities : [raid6] [raid5] [raid4]
> >> md0 : active raid5 loop2[2] loop1[0]
> >>       1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2 [2/2] [UU]
> >>       bitmap: 0/1 pages [0KB], 65536KB chunk
> >>
> >> unused devices: <none>


In this case the array is not failed (it is degraded). For that reason I
think my changes are not related here.
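
In other words (a rough sketch, not code from the patch; the literal values
simply mirror the two-device raid5 shown in the mdstat output below, where
raid5 parity gives conf->max_degraded == 1):

	int degraded = 1;      /* one member (nvme4n1) is marked Faulty  */
	int max_degraded = 1;  /* raid5 parity tolerates one lost member */
	int failed = degraded > max_degraded;  /* 1 > 1 == 0: degraded, not failed */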

# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1] [raid10]
md127 : active raid5 nvme5n1[1] nvme4n1[0](F)
      5242880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [2/1] [_U]

unused devices: <none>
# cat /sys/block/md127/md/array_state
clean

# mdadm -D /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Thu Mar  3 18:49:53 2022
        Raid Level : raid5
        Array Size : 5242880 (5.00 GiB 5.37 GB)
     Used Dev Size : 5242880 (5.00 GiB 5.37 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Thu Mar  3 18:52:46 2022
             State : clean, degraded
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : resync

              Name : gklab-localhost:vol  (local to host gklab-localhost)
              UUID : 711594e8:73ef988c:87a85085:b30c838d
            Events : 8

    Number   Major   Minor   RaidDevice State
       -       0        0        0      removed
       1     259        9        1      active sync   /dev/nvme5n1

       0     259        5        -      faulty   /dev/nvme4n1


Am I missing something?

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-03-22 15:23 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
@ 2022-04-08  0:29   ` Song Liu
  0 siblings, 0 replies; 29+ messages in thread
From: Song Liu @ 2022-04-08  0:29 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: linux-raid, Guoqing Jiang

On Tue, Mar 22, 2022 at 8:24 AM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> The raid456 module used to allow the array to reach the failed state. This
> was fixed by fb73b357fb9 ("raid5: block failing device if raid will be
> failed"). That fix introduced a bug: now, if raid5 fails during IO, it may
> result in a hung task that never completes. The Faulty flag on the device
> is necessary to process all requests and is checked many times, mainly in
> analyze_stripe().
[...]

> Cc: stable@vger.kernel.org
> Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> ---
>  drivers/md/raid5.c | 48 +++++++++++++++++++++++-----------------------
>  1 file changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7c119208a214..4d76e3a89aa5 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -686,17 +686,20 @@ int raid5_calc_degraded(struct r5conf *conf)
>         return degraded;
>  }
>
> -static int has_failed(struct r5conf *conf)
> +static bool has_failed(struct r5conf *conf)
>  {
> -       int degraded;
> +       int degraded = conf->mddev->degraded;
>
> -       if (conf->mddev->reshape_position == MaxSector)
> -               return conf->mddev->degraded > conf->max_degraded;
> +       if (test_bit(MD_BROKEN, &conf->mddev->flags))
> +               return true;
> +
> +       if (conf->mddev->reshape_position != MaxSector)
> +               degraded = raid5_calc_degraded(conf);
>
> -       degraded = raid5_calc_degraded(conf);
>         if (degraded > conf->max_degraded)

nit: we can just do
   return degraded > conf->max_degraded;
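
With that nit applied, the whole helper would read roughly as below. This is
only a sketch of this particular revision (here MD_BROKEN is set by
raid5_error(), not by has_failed() itself), not a posted patch:

static bool has_failed(struct r5conf *conf)
{
	int degraded = conf->mddev->degraded;

	/* A broken array stays failed. */
	if (test_bit(MD_BROKEN, &conf->mddev->flags))
		return true;

	/* While a reshape is in progress, recompute the degraded count. */
	if (conf->mddev->reshape_position != MaxSector)
		degraded = raid5_calc_degraded(conf);

	/* Failed: more members are lost than the parity can cover. */
	return degraded > conf->max_degraded;
}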

[...]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 3/3] raid5: introduce MD_BROKEN
  2022-03-22 15:23 [PATCH 0/3] Failed array handling improvements Mariusz Tkaczyk
@ 2022-03-22 15:23 ` Mariusz Tkaczyk
  2022-04-08  0:29   ` Song Liu
  0 siblings, 1 reply; 29+ messages in thread
From: Mariusz Tkaczyk @ 2022-03-22 15:23 UTC (permalink / raw)
  To: song; +Cc: linux-raid, guoqing.jiang

The raid456 module used to allow the array to reach the failed state. This
was fixed by fb73b357fb9 ("raid5: block failing device if raid will be
failed"). That fix introduced a bug: now, if raid5 fails during IO, it may
result in a hung task that never completes. The Faulty flag on the device
is necessary to process all requests and is checked many times, mainly in
analyze_stripe().
Allow setting Faulty on a drive again, and set MD_BROKEN if the raid is
failed.

As a result, this level is allowed to reach the failed state again, but
communication with userspace (via -EBUSY status) is preserved.

This restores the possibility to fail an array via the mdadm --set-faulty
command; additional verification will be added on the mdadm side.

Reproduction steps:
 mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
 mkfs.xfs /dev/md126 -f
 mount /dev/md126 /mnt/root/

 fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw \
   --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 \
   --time_based --group_reporting --name=throughput-test-job \
   --eta-newline=1 &

 echo 1 > /sys/block/nvme2n1/device/device/remove
 echo 1 > /sys/block/nvme1n1/device/device/remove

 [ 1475.787779] Call Trace:
 [ 1475.793111] __schedule+0x2a6/0x700
 [ 1475.799460] schedule+0x38/0xa0
 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
 [ 1475.813856] ? finish_wait+0x80/0x80
 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
 [ 1475.828281] ? finish_wait+0x80/0x80
 [ 1475.834727] ? finish_wait+0x80/0x80
 [ 1475.841127] ? finish_wait+0x80/0x80
 [ 1475.847480] md_handle_request+0x119/0x190
 [ 1475.854390] md_make_request+0x8a/0x190
 [ 1475.861041] generic_make_request+0xcf/0x310
 [ 1475.868145] submit_bio+0x3c/0x160
 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
 [ 1475.889149] iomap_apply+0xff/0x310
 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.909974] iomap_dio_rw+0x2f2/0x490
 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.923680] ? atime_needs_update+0x77/0xe0
 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
 [ 1475.953403] aio_read+0xd5/0x180
 [ 1475.959395] ? _cond_resched+0x15/0x30
 [ 1475.965907] io_submit_one+0x20b/0x3c0
 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
 [ 1475.979335] ? do_io_getevents+0x7c/0xc0
 [ 1475.986009] do_syscall_64+0x5b/0x1a0
 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
 [ 1476.000255] RIP: 0033:0x7f11fc27978d
 [ 1476.006631] Code: Bad RIP value.
 [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.

Cc: stable@vger.kernel.org
Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/md/raid5.c | 48 +++++++++++++++++++++++-----------------------
 1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7c119208a214..4d76e3a89aa5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -686,17 +686,20 @@ int raid5_calc_degraded(struct r5conf *conf)
 	return degraded;
 }
 
-static int has_failed(struct r5conf *conf)
+static bool has_failed(struct r5conf *conf)
 {
-	int degraded;
+	int degraded = conf->mddev->degraded;
 
-	if (conf->mddev->reshape_position == MaxSector)
-		return conf->mddev->degraded > conf->max_degraded;
+	if (test_bit(MD_BROKEN, &conf->mddev->flags))
+		return true;
+
+	if (conf->mddev->reshape_position != MaxSector)
+		degraded = raid5_calc_degraded(conf);
 
-	degraded = raid5_calc_degraded(conf);
 	if (degraded > conf->max_degraded)
-		return 1;
-	return 0;
+		return true;
+
+	return false;
 }
 
 struct stripe_head *
@@ -2877,34 +2880,31 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 	unsigned long flags;
 	pr_debug("raid456: error called\n");
 
+	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
+		mdname(mddev), bdevname(rdev->bdev, b));
+
 	spin_lock_irqsave(&conf->device_lock, flags);
+	set_bit(Faulty, &rdev->flags);
+	clear_bit(In_sync, &rdev->flags);
+	mddev->degraded = raid5_calc_degraded(conf);
 
-	if (test_bit(In_sync, &rdev->flags) &&
-	    mddev->degraded == conf->max_degraded) {
-		/*
-		 * Don't allow to achieve failed state
-		 * Don't try to recover this device
-		 */
+	if (has_failed(conf)) {
+		set_bit(MD_BROKEN, &conf->mddev->flags);
 		conf->recovery_disabled = mddev->recovery_disabled;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		return;
+
+		pr_crit("md/raid:%s: Cannot continue operation (%d/%d failed).\n",
+			mdname(mddev), mddev->degraded, conf->raid_disks);
+	} else {
+		pr_crit("md/raid:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), conf->raid_disks - mddev->degraded);
 	}
 
-	set_bit(Faulty, &rdev->flags);
-	clear_bit(In_sync, &rdev->flags);
-	mddev->degraded = raid5_calc_degraded(conf);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
-		"md/raid:%s: Operation continuing on %d devices.\n",
-		mdname(mddev),
-		bdevname(rdev->bdev, b),
-		mdname(mddev),
-		conf->raid_disks - mddev->degraded);
 	r5c_update_on_rdev_error(mddev, rdev);
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2021-12-17  8:37     ` Mariusz Tkaczyk
@ 2021-12-22  1:46       ` Guoqing Jiang
  0 siblings, 0 replies; 29+ messages in thread
From: Guoqing Jiang @ 2021-12-22  1:46 UTC (permalink / raw)
  To: Mariusz Tkaczyk; +Cc: song, linux-raid



On 12/17/21 4:37 PM, Mariusz Tkaczyk wrote:
> Hi Guoqing,
>
> On Fri, 17 Dec 2021 10:26:27 +0800
> Guoqing Jiang <guoqing.jiang@linux.dev> wrote:
>
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index 1240a5c16af8..8b5561811431 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -690,6 +690,9 @@ static int has_failed(struct r5conf *conf)
>>>    {
>>>    	int degraded;
>>>    
>>> +	if (test_bit(MD_BROKEN, &conf->mddev->flags))
>>> +		return 1;
>>> +
>>>    	if (conf->mddev->reshape_position == MaxSector)
>>>    		return conf->mddev->degraded > conf->max_degraded;
>>>    
>>> @@ -2877,34 +2880,29 @@ static void raid5_error(struct mddev
>>> *mddev, struct md_rdev *rdev) unsigned long flags;
>>>    	pr_debug("raid456: error called\n");
>>>    
>>> -	spin_lock_irqsave(&conf->device_lock, flags);
>>> -
>>> -	if (test_bit(In_sync, &rdev->flags) &&
>>> -	    mddev->degraded == conf->max_degraded) {
>>> -		/*
>>> -		 * Don't allow to achieve failed state
>>> -		 * Don't try to recover this device
>>> -		 */
>>> -		conf->recovery_disabled = mddev->recovery_disabled;
>>> -		spin_unlock_irqrestore(&conf->device_lock, flags);
>>> -		return;
>>> -	}
>>> +	pr_crit("md/raid:%s: Disk failure on %s, disabling
>>> device.\n",
>>> +		mdname(mddev), bdevname(rdev->bdev, b));
>>>    
>>> +	spin_lock_irqsave(&conf->device_lock, flags);
>>>    	set_bit(Faulty, &rdev->flags);
>>>    	clear_bit(In_sync, &rdev->flags);
>>>    	mddev->degraded = raid5_calc_degraded(conf);
>>> +
>>> +	if (has_failed(conf)) {
>>> +		set_bit(MD_BROKEN, &mddev->flags);
>> What about other callers of has_failed? Do they need to set BROKEN
>> flag? Or set the flag in has_failed if it returns true, just FYI.
>>
> The function checks rdev->state for Faulty. There are two places
> where it can be set:
> - raid5_error (handled here)
> - raid5_spare_active (not a case IMO).
>
> I left it in raid5_error to be consistent with other levels.
> I think that moving it to has_failed is safe but I don't see any
> additional value in it.

From raid5's point of view, it is broken when has_failed() returns 1, so
it is kind of inconsistent, I think.

> I see that in raid5_error we hold device_lock. It is not true for
> all has_failed calls.

I didn't go into the details of each caller; I'm not sure the lock is
needed for setting/clearing mddev->flags.

> Do you have any recommendations?

I don't have a strong opinion on it.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2021-12-17  2:26   ` Guoqing Jiang
@ 2021-12-17  8:37     ` Mariusz Tkaczyk
  2021-12-22  1:46       ` Guoqing Jiang
  0 siblings, 1 reply; 29+ messages in thread
From: Mariusz Tkaczyk @ 2021-12-17  8:37 UTC (permalink / raw)
  To: Guoqing Jiang; +Cc: song, linux-raid

Hi Guoqing,

On Fri, 17 Dec 2021 10:26:27 +0800
Guoqing Jiang <guoqing.jiang@linux.dev> wrote:

> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 1240a5c16af8..8b5561811431 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -690,6 +690,9 @@ static int has_failed(struct r5conf *conf)
> >   {
> >   	int degraded;
> >   
> > +	if (test_bit(MD_BROKEN, &conf->mddev->flags))
> > +		return 1;
> > +
> >   	if (conf->mddev->reshape_position == MaxSector)
> >   		return conf->mddev->degraded > conf->max_degraded;
> >   
> > @@ -2877,34 +2880,29 @@ static void raid5_error(struct mddev
> > *mddev, struct md_rdev *rdev) unsigned long flags;
> >   	pr_debug("raid456: error called\n");
> >   
> > -	spin_lock_irqsave(&conf->device_lock, flags);
> > -
> > -	if (test_bit(In_sync, &rdev->flags) &&
> > -	    mddev->degraded == conf->max_degraded) {
> > -		/*
> > -		 * Don't allow to achieve failed state
> > -		 * Don't try to recover this device
> > -		 */
> > -		conf->recovery_disabled = mddev->recovery_disabled;
> > -		spin_unlock_irqrestore(&conf->device_lock, flags);
> > -		return;
> > -	}
> > +	pr_crit("md/raid:%s: Disk failure on %s, disabling
> > device.\n",
> > +		mdname(mddev), bdevname(rdev->bdev, b));
> >   
> > +	spin_lock_irqsave(&conf->device_lock, flags);
> >   	set_bit(Faulty, &rdev->flags);
> >   	clear_bit(In_sync, &rdev->flags);
> >   	mddev->degraded = raid5_calc_degraded(conf);
> > +
> > +	if (has_failed(conf)) {
> > +		set_bit(MD_BROKEN, &mddev->flags);  
> 
> What about other callers of has_failed? Do they need to set BROKEN
> flag? Or set the flag in has_failed if it returns true, just FYI.
> 

The function checks rdev->state for Faulty. There are two places
where it can be set:
- raid5_error (handled here)
- raid5_spare_active (not a case IMO).

I left it in raid5_error to be consistent with other levels.
I think that moving it to has_failed is safe but I don't see any
additional value in it.
I see that in raid5_error we hold device_lock. It is not true for
all has_failed calls.

Do you have any recommendations?

Thanks,
Mariusz 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] raid5: introduce MD_BROKEN
  2021-12-16 14:52 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
@ 2021-12-17  2:26   ` Guoqing Jiang
  2021-12-17  8:37     ` Mariusz Tkaczyk
  0 siblings, 1 reply; 29+ messages in thread
From: Guoqing Jiang @ 2021-12-17  2:26 UTC (permalink / raw)
  To: Mariusz Tkaczyk, song; +Cc: linux-raid



On 12/16/21 10:52 PM, Mariusz Tkaczyk wrote:
> The raid456 module used to allow the array to reach the failed state. This
> was fixed by fb73b357fb9 ("raid5: block failing device if raid will be
> failed"). That fix introduced a bug: now, if raid5 fails during IO, it may
> result in a hung task that never completes. The Faulty flag on the device
> is necessary to process all requests and is checked many times, mainly in
> analyze_stripe().
> Allow setting Faulty on a drive again, and set MD_BROKEN if the raid is
> failed.
>
> As a result, this level is allowed to reach the failed state again, but
> communication with userspace (via -EBUSY status) is preserved.
>
> This restores the possibility to fail an array via the mdadm --set-faulty
> command; additional verification will be added on the mdadm side.
>
> Reproduction steps:
>   mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
>   mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
>   mkfs.xfs /dev/md126 -f
>   mount /dev/md126 /mnt/root/
>
>  fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw \
>    --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 \
>    --time_based --group_reporting --name=throughput-test-job \
>    --eta-newline=1 &
>
>   echo 1 > /sys/block/nvme2n1/device/device/remove
>   echo 1 > /sys/block/nvme1n1/device/device/remove
>
>   [ 1475.787779] Call Trace:
>   [ 1475.793111] __schedule+0x2a6/0x700
>   [ 1475.799460] schedule+0x38/0xa0
>   [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
>   [ 1475.813856] ? finish_wait+0x80/0x80
>   [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
>   [ 1475.828281] ? finish_wait+0x80/0x80
>   [ 1475.834727] ? finish_wait+0x80/0x80
>   [ 1475.841127] ? finish_wait+0x80/0x80
>   [ 1475.847480] md_handle_request+0x119/0x190
>   [ 1475.854390] md_make_request+0x8a/0x190
>   [ 1475.861041] generic_make_request+0xcf/0x310
>   [ 1475.868145] submit_bio+0x3c/0x160
>   [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
>   [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
>   [ 1475.889149] iomap_apply+0xff/0x310
>   [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
>   [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
>   [ 1475.909974] iomap_dio_rw+0x2f2/0x490
>   [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
>   [ 1475.923680] ? atime_needs_update+0x77/0xe0
>   [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
>   [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
>   [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
>   [ 1475.953403] aio_read+0xd5/0x180
>   [ 1475.959395] ? _cond_resched+0x15/0x30
>   [ 1475.965907] io_submit_one+0x20b/0x3c0
>   [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
>   [ 1475.979335] ? do_io_getevents+0x7c/0xc0
>   [ 1475.986009] do_syscall_64+0x5b/0x1a0
>   [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
>   [ 1476.000255] RIP: 0033:0x7f11fc27978d
>   [ 1476.006631] Code: Bad RIP value.
>   [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.
>
> Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> ---
>   drivers/md/raid5.c | 34 ++++++++++++++++------------------
>   1 file changed, 16 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 1240a5c16af8..8b5561811431 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -690,6 +690,9 @@ static int has_failed(struct r5conf *conf)
>   {
>   	int degraded;
>   
> +	if (test_bit(MD_BROKEN, &conf->mddev->flags))
> +		return 1;
> +
>   	if (conf->mddev->reshape_position == MaxSector)
>   		return conf->mddev->degraded > conf->max_degraded;
>   
> @@ -2877,34 +2880,29 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
>   	unsigned long flags;
>   	pr_debug("raid456: error called\n");
>   
> -	spin_lock_irqsave(&conf->device_lock, flags);
> -
> -	if (test_bit(In_sync, &rdev->flags) &&
> -	    mddev->degraded == conf->max_degraded) {
> -		/*
> -		 * Don't allow to achieve failed state
> -		 * Don't try to recover this device
> -		 */
> -		conf->recovery_disabled = mddev->recovery_disabled;
> -		spin_unlock_irqrestore(&conf->device_lock, flags);
> -		return;
> -	}
> +	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
> +		mdname(mddev), bdevname(rdev->bdev, b));
>   
> +	spin_lock_irqsave(&conf->device_lock, flags);
>   	set_bit(Faulty, &rdev->flags);
>   	clear_bit(In_sync, &rdev->flags);
>   	mddev->degraded = raid5_calc_degraded(conf);
> +
> +	if (has_failed(conf)) {
> +		set_bit(MD_BROKEN, &mddev->flags);

What about other callers of has_failed? Do they need to set BROKEN flag?
Or set the flag in has_failed if it returns true, just FYI.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 3/3] raid5: introduce MD_BROKEN
  2021-12-16 14:52 [PATCH v2 0/3] Use MD_BROKEN for redundant arrays Mariusz Tkaczyk
@ 2021-12-16 14:52 ` Mariusz Tkaczyk
  2021-12-17  2:26   ` Guoqing Jiang
  0 siblings, 1 reply; 29+ messages in thread
From: Mariusz Tkaczyk @ 2021-12-16 14:52 UTC (permalink / raw)
  To: song; +Cc: linux-raid

The raid456 module used to allow the array to reach the failed state. This
was fixed by fb73b357fb9 ("raid5: block failing device if raid will be
failed"). That fix introduced a bug: now, if raid5 fails during IO, it may
result in a hung task that never completes. The Faulty flag on the device
is necessary to process all requests and is checked many times, mainly in
analyze_stripe().
Allow setting Faulty on a drive again, and set MD_BROKEN if the raid is
failed.

As a result, this level is allowed to reach the failed state again, but
communication with userspace (via -EBUSY status) is preserved.

This restores the possibility to fail an array via the mdadm --set-faulty
command; additional verification will be added on the mdadm side.

Reproduction steps:
 mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1
 mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean
 mkfs.xfs /dev/md126 -f
 mount /dev/md126 /mnt/root/

 fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw \
   --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 \
   --time_based --group_reporting --name=throughput-test-job \
   --eta-newline=1 &

 echo 1 > /sys/block/nvme2n1/device/device/remove
 echo 1 > /sys/block/nvme1n1/device/device/remove

 [ 1475.787779] Call Trace:
 [ 1475.793111] __schedule+0x2a6/0x700
 [ 1475.799460] schedule+0x38/0xa0
 [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456]
 [ 1475.813856] ? finish_wait+0x80/0x80
 [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456]
 [ 1475.828281] ? finish_wait+0x80/0x80
 [ 1475.834727] ? finish_wait+0x80/0x80
 [ 1475.841127] ? finish_wait+0x80/0x80
 [ 1475.847480] md_handle_request+0x119/0x190
 [ 1475.854390] md_make_request+0x8a/0x190
 [ 1475.861041] generic_make_request+0xcf/0x310
 [ 1475.868145] submit_bio+0x3c/0x160
 [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60
 [ 1475.882070] iomap_dio_bio_actor+0x175/0x390
 [ 1475.889149] iomap_apply+0xff/0x310
 [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.909974] iomap_dio_rw+0x2f2/0x490
 [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390
 [ 1475.923680] ? atime_needs_update+0x77/0xe0
 [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs]
 [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs]
 [ 1475.953403] aio_read+0xd5/0x180
 [ 1475.959395] ? _cond_resched+0x15/0x30
 [ 1475.965907] io_submit_one+0x20b/0x3c0
 [ 1475.972398] __x64_sys_io_submit+0xa2/0x180
 [ 1475.979335] ? do_io_getevents+0x7c/0xc0
 [ 1475.986009] do_syscall_64+0x5b/0x1a0
 [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca
 [ 1476.000255] RIP: 0033:0x7f11fc27978d
 [ 1476.006631] Code: Bad RIP value.
 [ 1476.073251] INFO: task fio:3877 blocked for more than 120 seconds.

Fixes: fb73b357fb9 ("raid5: block failing device if raid will be failed")
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/md/raid5.c | 34 ++++++++++++++++------------------
 1 file changed, 16 insertions(+), 18 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1240a5c16af8..8b5561811431 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -690,6 +690,9 @@ static int has_failed(struct r5conf *conf)
 {
 	int degraded;
 
+	if (test_bit(MD_BROKEN, &conf->mddev->flags))
+		return 1;
+
 	if (conf->mddev->reshape_position == MaxSector)
 		return conf->mddev->degraded > conf->max_degraded;
 
@@ -2877,34 +2880,29 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 	unsigned long flags;
 	pr_debug("raid456: error called\n");
 
-	spin_lock_irqsave(&conf->device_lock, flags);
-
-	if (test_bit(In_sync, &rdev->flags) &&
-	    mddev->degraded == conf->max_degraded) {
-		/*
-		 * Don't allow to achieve failed state
-		 * Don't try to recover this device
-		 */
-		conf->recovery_disabled = mddev->recovery_disabled;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
-		return;
-	}
+	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n",
+		mdname(mddev), bdevname(rdev->bdev, b));
 
+	spin_lock_irqsave(&conf->device_lock, flags);
 	set_bit(Faulty, &rdev->flags);
 	clear_bit(In_sync, &rdev->flags);
 	mddev->degraded = raid5_calc_degraded(conf);
+
+	if (has_failed(conf)) {
+		set_bit(MD_BROKEN, &mddev->flags);
+		conf->recovery_disabled = mddev->recovery_disabled;
+		pr_crit("md/raid:%s: Cannot continue on %d devices.\n",
+			mdname(mddev), conf->raid_disks - mddev->degraded);
+	} else
+		pr_crit("md/raid:%s: Operation continuing on %d devices.\n",
+			mdname(mddev), conf->raid_disks - mddev->degraded);
+
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 	set_bit(MD_RECOVERY_INTR, &mddev->recovery);
 
 	set_bit(Blocked, &rdev->flags);
 	set_mask_bits(&mddev->sb_flags, 0,
 		      BIT(MD_SB_CHANGE_DEVS) | BIT(MD_SB_CHANGE_PENDING));
-	pr_crit("md/raid:%s: Disk failure on %s, disabling device.\n"
-		"md/raid:%s: Operation continuing on %d devices.\n",
-		mdname(mddev),
-		bdevname(rdev->bdev, b),
-		mdname(mddev),
-		conf->raid_disks - mddev->degraded);
 	r5c_update_on_rdev_error(mddev, rdev);
 }
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2022-04-08  0:29 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-27 15:39 [PATCH v3 0/3] Improve failed arrays handling Mariusz Tkaczyk
2022-01-27 15:39 ` [PATCH 1/3] raid0, linear, md: add error_handlers for raid0 and linear Mariusz Tkaczyk
2022-02-12  1:12   ` Guoqing Jiang
2022-02-14  9:37     ` Mariusz Tkaczyk
2022-02-15  3:43       ` Guoqing Jiang
2022-02-15 14:06         ` Mariusz Tkaczyk
2022-02-16  9:47           ` Xiao Ni
2022-02-22  6:34           ` Song Liu
2022-02-22 13:02             ` Mariusz Tkaczyk
2022-01-27 15:39 ` [PATCH 2/3] md: Set MD_BROKEN for RAID1 and RAID10 Mariusz Tkaczyk
2022-01-31  8:29   ` Xiao Ni
2022-01-31  9:06     ` Mariusz Tkaczyk
2022-02-08  7:13       ` Song Liu
2022-01-31 12:23     ` Wols Lists
2022-02-12  1:17   ` Guoqing Jiang
2022-02-14  8:55     ` Mariusz Tkaczyk
2022-01-27 15:39 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
2022-01-31  8:58   ` Xiao Ni
2022-02-12  1:47   ` Guoqing Jiang
2022-02-22 14:18     ` Mariusz Tkaczyk
2022-02-25  7:22       ` Guoqing Jiang
2022-03-03 16:21         ` Mariusz Tkaczyk
2022-02-08  7:18 ` [PATCH v3 0/3] Improve failed arrays handling Song Liu
  -- strict thread matches above, loose matches on Subject: below --
2022-03-22 15:23 [PATCH 0/3] Failed array handling improvements Mariusz Tkaczyk
2022-03-22 15:23 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
2022-04-08  0:29   ` Song Liu
2021-12-16 14:52 [PATCH v2 0/3] Use MD_BROKEN for redundant arrays Mariusz Tkaczyk
2021-12-16 14:52 ` [PATCH 3/3] raid5: introduce MD_BROKEN Mariusz Tkaczyk
2021-12-17  2:26   ` Guoqing Jiang
2021-12-17  8:37     ` Mariusz Tkaczyk
2021-12-22  1:46       ` Guoqing Jiang
