* [RFC] [PATCH V2 0/1] Introduce emergency raid0 stop for mounted arrays
@ 2019-04-18 22:04 ` Guilherme G. Piccoli
  0 siblings, 0 replies; 12+ messages in thread
From: Guilherme G. Piccoli @ 2019-04-18 22:04 UTC (permalink / raw)
  To: axboe, linux-raid
  Cc: kernel, gpiccoli, neilb, linux-block, dm-devel, linux-fsdevel,
	jay.vosburgh

Currently the md driver relies completely on userspace to stop an
array in case of a failure - for raid0, if we remove a raid0 member (for
example via PCI hot(un)plugging an NVMe device) while the raid0 array is
_mounted_, mdadm cannot stop the array, since the tool tries to open the
block device with the O_EXCL flag in order to perform the ioctl.
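
Just to illustrate that failure mode, here is a minimal userspace sketch
(not part of the patch; "/dev/md0" is merely a placeholder for the
mounted array):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* An exclusive open of a block device that is mounted (or
	 * otherwise claimed) fails with EBUSY, so mdadm never gets to
	 * issue the STOP_ARRAY ioctl.
	 */
	int fd = open("/dev/md0", O_RDONLY | O_EXCL);

	if (fd < 0)
		fprintf(stderr, "open(O_EXCL): %s\n", strerror(errno));
	else
		close(fd);
	return 0;
}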

So, an array in this situation is "alive" - users may write to it, and unless
they check the kernel log or some other monitoring tool, everything will seem
fine and the writes complete with no errors. More precisely, direct writes
will not work, but since writes are usually buffered (backed by the page
cache), the most common scenario is a user being able to keep writing to a
broken raid0 and getting all their data corrupted.
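
To make that asymmetry concrete, a hedged sketch (again not part of the
patch; error handling is trimmed and the paths under /mnt/md0 are
placeholders for files on the broken, still-mounted array):

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int dfd, bfd;

	if (posix_memalign(&buf, 4096, 4096))	/* O_DIRECT needs alignment */
		return 1;
	memset(buf, 0, 4096);

	dfd = open("/mnt/md0/direct", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	bfd = open("/mnt/md0/buffered", O_WRONLY | O_CREAT, 0644);

	/* The O_DIRECT write hits the missing member and fails... */
	printf("direct  : %zd\n", write(dfd, buf, 4096));
	/* ...while the buffered write lands in the page cache and appears
	 * to succeed, hiding the failure until writeback time.
	 */
	printf("buffered: %zd\n", write(bfd, buf, 4096));

	free(buf);
	return 0;
}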


PROPOSAL:
The idea proposed here to fix this behavior is to mimic other block devices:
if one has a filesystem mounted on a block device on top of an NVMe or
SCSI disk and the disk gets removed, writes are prevented, errors are
observed and it's obvious something is wrong; the same goes for USB sticks.

We believe the md driver is currently not behaving properly for raid0
arrays (it does handle these errors for other levels, though). The approach
taken for raid0 is basically an emergency removal procedure, in which I/O
is blocked from the device, the regular clean-up happens and the associated
disk is deleted. It went through extensive testing, as detailed below.

Not everything is rosy; we have some caveats that still need to be resolved.
Feedback is much appreciated.
There is a caveats / questions / polemic choices section below the test
section.

V1 link: https://marc.info/?l=linux-raid&m=153313538013458

Thanks in advance,


Guilherme


* Testing:

The device topology for the tests is as follows:

                            md6
                             |
              |******************************|
              |                              |
             md4                            md5
              |                              |
       |*************|                |*************|
       |             |                |             |
      md0           md2              md1           md3
       |             |                |             |
   |*******|       |***|            |***|       |*******|
   |       |       |   |            |   |       |       |
nvme1n1 nvme0n1   sda sdd          sde sdf   nvme2n1 nvme3n1


We chose to test such a complex topology to expose corner cases and timing
issues (which were indeed found during development). There are basically 3
test subsets: the whole set of devices, called here "md cascade", and 2
small branches, called here "md direct" testing.

So, in summary:
### md direct (single arrays composed of SCSI/NVMe block devices) ###
A1 = md1      A2 = md3
C1 = sde      C1 = nvme2n1
C2 = sdf      C2 = nvme3n1

### md cascade (array composed of md devices) ###
A3 = md6
underlying component UC1 = nvme1n1
underlying component UC2 = sdd
underlying component UC3 = sdf
underlying component UC4 = nvme2n1


### Method of removal ###
- For NVMe devices, "echo 1 > /sys/block/nvmeXnY/device/device/remove"
- For SCSI devices, "echo 1 > /sys/block/sdX/device/delete"


### Test ###
Write a file using the command "dd if=/dev/zero of=tmpfile bs=X", where
X was either 8K or 384K (varied across the tests).

Each array also contains a valid file to be checked later, in order to
validate that the filesystem didn't get severely corrupted by the
procedure.

Tests marked with @ indicate direct writes were tested too.
Tests with a [] indicator exhibited some oddity/issue, detailed in the
caveats section.

After each test, the guest was rebooted.

Tests were performed on kernel v5.1-rc5.


* Test results
("partition X" means we have a GPT table with 2 partitions in the device)


### md direct

Remove members and start writing to array right after:
A1 with:                               A2 with:
 -ext4                                  -ext4
 --Removed C1: PASSED @                 --Removed C2: PASSED @

 -xfs                                   -xfs
 --Removed C2: PASSED                   --Removed C1: PASSED

 -partition 1 + ext4                    -partition 1 + xfs
 -partition 2 + xfs                     -partition 2 + ext4
 --Removed C1: PASSED                   --Removed C2: PASSED


Start writing to array and remove member right after:
A1 with:                               A2 with:
 -ext4                                  -ext4
 --Removed C1: PASSED                   --Removed C1: PASSED
 --Removed C2: PASSED @

 -xfs                                   -xfs
 --Removed C1: PASSED                   --Removed C1: PASSED
                                        --Removed C2: PASSED @

 -partition 1 + ext4                    -partition 1 + xfs
 -partition 2 + xfs                     -partition 2 + ext4
 --Removed C2: PASSED                   --Removed C1: PASSED @


### md cascade
Remove members and start writing to array right after:
A3 with:
 -ext4:
 --Removed UC2: PASSED
 --Removed UC4: PASSED @

  -xfs:
 --Removed UC2: PASSED @
 --Removed UC4: PASSED

 -partition 1 + ext4
 -partition 2 + xfs
 --Removed UC1: PASSED @
 --Removed UC3: PASSED


Start writing to array and remove member right after:
A3 with:
 -ext4:
 --Removed UC2: PASSED
 --Removed UC4: PARTIAL @ [(a) (b)]

  -xfs:
 --Removed UC2: PASSED
 --Removed UC4: PARTIAL @ [(c)]

 -partition 1 + ext4
 -partition 2 + xfs
 --Removed UC1: PASSED
 --Removed UC3: PARTIAL @ [(b)]


* Caveats / points of uncertainty / questions:

=> Some arrays still show up in /dev after the removal. We noticed that if
the FS is unmounted and "mdadm --detail" is then run against such an array,
it hangs.

#For md cascade only (nested arrays):

a) After direct writes, if we then issue a regular write (backed by the
page cache), we sometimes observed an oops in the ext4 filesystem.

b) We might face an ext4 journal task blocked in io_schedule().
Backtrace: https://pastebin.ubuntu.com/p/KgvHjRn6Pg

c) Hung task in xfs, after direct I/O only, when trying to write again
after failing one of the members (the 2nd write uses the page cache).
Backtrace: https://pastebin.ubuntu.com/p/7NMRV9fG2H



#Generic questions / polemic choices:

i) With this patch, the STOP_ARRAY ioctl won't proceed if a disk was removed
and the emergency stop routine has already started to act, even for
unmounted md arrays. This is a change in the old behavior, triggered by the
way we check for failed members in the raid0 driver.

ii) Currently the patch implements a kernel-only removal policy - should it
instead rely on userspace (mdadm) to do it? A first approach was based on
userspace, but it proved to be more problematic in tests.

iii) Would a new state ("error") for RAID0 be a better solution? I'm
experimenting with a simpler approach that would introduce a new state but
wouldn't force the removal of the device.

Guilherme G. Piccoli (1):
  md/raid0: Introduce emergency stop for raid0 arrays

 drivers/md/md.c    | 123 +++++++++++++++++++++++++++++++++++++++++----
 drivers/md/md.h    |   6 +++
 drivers/md/raid0.c |  14 ++++++
 3 files changed, 134 insertions(+), 9 deletions(-)

-- 
2.21.0


* [RFC] [PATCH V2 1/1] md/raid0: Introduce emergency stop for raid0 arrays
  2019-04-18 22:04 ` Guilherme G. Piccoli
@ 2019-04-18 22:04   ` Guilherme G. Piccoli
  -1 siblings, 0 replies; 12+ messages in thread
From: Guilherme G. Piccoli @ 2019-04-18 22:04 UTC (permalink / raw)
  To: axboe, linux-raid
  Cc: kernel, gpiccoli, neilb, linux-block, dm-devel, linux-fsdevel,
	jay.vosburgh

Currently the raid0 driver is not provided with any health-checking
mechanism to verify that its members are fine. So, if a member is
suddenly removed, for example, a STOP_ARRAY ioctl will be triggered
from userspace, i.e., all the logic for stopping an array relies on
userspace tools, like mdadm/udev. In particular, if a raid0 array is
mounted, this stop procedure will fail, since mdadm tries to open
the md block device with the O_EXCL flag, which isn't allowed while
the md device is mounted.

That leads to the following situation: if a raid0 array member is
removed while the array is mounted, a user writing to this array
won't realize errors are happening unless they check the kernel log.
In other words, no -EIO is returned and writes (except direct I/O)
appear normal. This means the user might think the written data is
safely stored in the array, when in fact garbage was written, since
raid0 does striping and requires all members to be working in order
not to corrupt data.

This patch proposes a change in this behavior: emergency-stop a
raid0 array in case one of its members is gone. The check happens
when I/O is queued to the raid0 driver, which verifies whether the
block device it plans to read/write has a healthy queue; if it
doesn't (e.g. a dying or dead queue), the raid0 driver invokes an
emergency removal routine that marks the md device as broken and
triggers a delayed stop procedure. From this point on, raid0 also
starts refusing new BIOs, returning -EIO.
The emergency stop routine also marks the md request queue as dying,
as a "flag" to indicate failure in case of a nested raid0 array
configuration (a raid0 composed of raid0 devices).

The delayed stop procedure then performs the basic stop of the md
device and takes care of the case where it holds mounted filesystems,
allowing the stop of a mounted raid0 array - which is already the
case for other regular block devices like NVMe and SCSI.

This emergency stop mechanism only affects raid0 arrays.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 drivers/md/md.c    | 123 +++++++++++++++++++++++++++++++++++++++++----
 drivers/md/md.h    |   6 +++
 drivers/md/raid0.c |  14 ++++++
 3 files changed, 134 insertions(+), 9 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 05ffffb8b769..5dc0ed7d37d8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -94,6 +94,7 @@ EXPORT_SYMBOL(md_cluster_mod);
 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
+static struct workqueue_struct *md_stop_wq;
 
 static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
@@ -446,7 +447,9 @@ static void md_end_flush(struct bio *fbio)
 	rdev_dec_pending(rdev, mddev);
 
 	if (atomic_dec_and_test(&fi->flush_pending)) {
-		if (bio->bi_iter.bi_size == 0) {
+		if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+			goto bail_flush;
+		} else if (bio->bi_iter.bi_size == 0) {
 			/* an empty barrier - all done */
 			bio_endio(bio);
 			mempool_free(fi, mddev->flush_pool);
@@ -456,6 +459,12 @@ static void md_end_flush(struct bio *fbio)
 		}
 	}
 
+bail_flush:
+	/* Prevents dangling bios to crash after raid0 emergency stop */
+	if (test_bit(MD_BROKEN, &mddev->flags) && !mddev->raid_disks) {
+		bio_uninit(fbio);
+		return;
+	}
 	mempool_free(fb, mddev->flush_bio_pool);
 	bio_put(fbio);
 }
@@ -465,6 +474,11 @@ void md_flush_request(struct mddev *mddev, struct bio *bio)
 	struct md_rdev *rdev;
 	struct flush_info *fi;
 
+	if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+		bio_io_error(bio);
+		return;
+	}
+
 	fi = mempool_alloc(mddev->flush_pool, GFP_NOIO);
 
 	fi->bio = bio;
@@ -5202,6 +5216,17 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	return rv;
 }
 
+static void __md_free(struct mddev *mddev)
+{
+	if (mddev->gendisk)
+		del_gendisk(mddev->gendisk);
+	if (mddev->queue)
+		blk_cleanup_queue(mddev->queue);
+	if (mddev->gendisk)
+		put_disk(mddev->gendisk);
+	percpu_ref_exit(&mddev->writes_pending);
+}
+
 static void md_free(struct kobject *ko)
 {
 	struct mddev *mddev = container_of(ko, struct mddev, kobj);
@@ -5209,13 +5234,8 @@ static void md_free(struct kobject *ko)
 	if (mddev->sysfs_state)
 		sysfs_put(mddev->sysfs_state);
 
-	if (mddev->gendisk)
-		del_gendisk(mddev->gendisk);
-	if (mddev->queue)
-		blk_cleanup_queue(mddev->queue);
-	if (mddev->gendisk)
-		put_disk(mddev->gendisk);
-	percpu_ref_exit(&mddev->writes_pending);
+	if (!test_bit(MD_BROKEN, &mddev->flags))
+		__md_free(mddev);
 
 	bioset_exit(&mddev->bio_set);
 	bioset_exit(&mddev->sync_set);
@@ -5238,7 +5258,9 @@ static void mddev_delayed_delete(struct work_struct *ws)
 {
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
-	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
+	if (!test_bit(MD_BROKEN, &mddev->flags))
+		sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
+
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);
 }
@@ -5974,6 +5996,75 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 	return err;
 }
 
+static void mddev_delayed_stop(struct work_struct *ws)
+{
+	struct mddev *mddev = container_of(ws, struct mddev, stop_work);
+	struct gendisk *disk = mddev->gendisk;
+	struct block_device *bdev;
+	struct md_rdev *rdev;
+	int i = 0;
+
+	mutex_lock(&mddev->open_mutex);
+	del_timer_sync(&mddev->safemode_timer);
+	__md_stop(mddev);
+	rdev_for_each(rdev, mddev)
+		if (rdev->raid_disk >= 0)
+			sysfs_unlink_rdev(mddev, rdev);
+	set_capacity(disk, 0);
+	mutex_unlock(&mddev->open_mutex);
+
+	mddev->changed = 1;
+	revalidate_disk(disk);
+	export_array(mddev);
+	mddev->hold_active = 0;
+
+	/* Make sure unbind_rdev_from_array() jobs are done */
+	flush_workqueue(md_misc_wq);
+
+	do {
+		bdev = bdget_disk(disk, i++);
+		if (bdev) {
+			struct super_block *sb = bdev->bd_super;
+
+			if (sb) {
+				sb->s_flags |= SB_RDONLY;
+				sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK;
+			}
+			bdput(bdev);
+		}
+	} while (bdev);
+
+	md_clean(mddev);
+	set_bit(MD_BROKEN, &mddev->flags); /* md_clean() will clear this too */
+	md_new_event(mddev);
+	sysfs_notify_dirent_safe(mddev->sysfs_state);
+
+	__md_free(mddev);
+	mddev->gendisk = NULL;
+}
+
+void mddev_emergency_stop(struct mddev *mddev)
+{
+	/* Prevents races if raid0 driver detects errors in multiple requests
+	 * at the same time.
+	 */
+	mutex_lock(&mddev->open_mutex);
+	if (test_bit(MD_BROKEN, &mddev->flags))
+		return;
+
+	set_bit(MD_BROKEN, &mddev->flags);
+	mutex_unlock(&mddev->open_mutex);
+
+	/* Necessary to set md queue as dying here in case this md device is
+	 * a member of some other md array - nested removal will proceed then.
+	 */
+	blk_set_queue_dying(mddev->queue);
+
+	INIT_WORK(&mddev->stop_work, mddev_delayed_stop);
+	queue_work(md_stop_wq, &mddev->stop_work);
+}
+EXPORT_SYMBOL_GPL(mddev_emergency_stop);
+
 /* mode:
  *   0 - completely stop and dis-assemble array
  *   2 - stop but do not disassemble array
@@ -5985,6 +6076,13 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	struct md_rdev *rdev;
 	int did_freeze = 0;
 
+	/* If we already started the emergency removal procedure,
+	 * the STOP_ARRAY ioctl can race with the removal, leading to
+	 * dangling devices. Below check is to prevent this corner case.
+	 */
+	if (test_bit(MD_BROKEN, &mddev->flags))
+		return -EBUSY;
+
 	if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
 		did_freeze = 1;
 		set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
@@ -9150,6 +9248,10 @@ static int __init md_init(void)
 	if (!md_misc_wq)
 		goto err_misc_wq;
 
+	md_stop_wq = alloc_ordered_workqueue("md_stop", 0);
+	if (!md_stop_wq)
+		goto err_stop_wq;
+
 	if ((ret = register_blkdev(MD_MAJOR, "md")) < 0)
 		goto err_md;
 
@@ -9171,6 +9273,8 @@ static int __init md_init(void)
 err_mdp:
 	unregister_blkdev(MD_MAJOR, "md");
 err_md:
+	destroy_workqueue(md_stop_wq);
+err_stop_wq:
 	destroy_workqueue(md_misc_wq);
 err_misc_wq:
 	destroy_workqueue(md_wq);
@@ -9458,6 +9562,7 @@ static __exit void md_exit(void)
 		 * destroy_workqueue() below will wait for that to complete.
 		 */
 	}
+	destroy_workqueue(md_stop_wq);
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index c52afb52c776..eebccc318828 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -243,6 +243,9 @@ enum mddev_flags {
 	MD_UPDATING_SB,		/* md_check_recovery is updating the metadata
 				 * without explicitly holding reconfig_mutex.
 				 */
+	MD_BROKEN,		/* This is used in the emergency RAID-0 stop
+				 * in case one of its members gets removed.
+				 */
 };
 
 enum mddev_sb_flags {
@@ -411,6 +414,8 @@ struct mddev {
 
 	struct work_struct del_work;	/* used for delayed sysfs removal */
 
+	struct work_struct stop_work;	/* used for delayed emergency stop */
+
 	/* "lock" protects:
 	 *   flush_bio transition from NULL to !NULL
 	 *   rdev superblocks, events
@@ -715,6 +720,7 @@ extern void md_rdev_clear(struct md_rdev *rdev);
 extern void md_handle_request(struct mddev *mddev, struct bio *bio);
 extern void mddev_suspend(struct mddev *mddev);
 extern void mddev_resume(struct mddev *mddev);
+extern void mddev_emergency_stop(struct mddev *mddev);
 extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 				   struct mddev *mddev);
 
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index f3fb5bb8c82a..1054243be92b 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -555,6 +555,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
+	struct block_device *bd;
 	struct md_rdev *tmp_dev;
 	sector_t bio_sector;
 	sector_t sector;
@@ -593,6 +594,19 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 
 	zone = find_zone(mddev->private, &sector);
 	tmp_dev = map_sector(mddev, zone, sector, &sector);
+	bd = tmp_dev->bdev;
+
+	if (unlikely((blk_queue_dead(bd->bd_queue) ||
+	    blk_queue_dying(bd->bd_queue)) &&
+	    !test_bit(MD_BROKEN, &mddev->flags))) {
+		mddev_emergency_stop(mddev);
+	}
+
+	if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+		bio_io_error(bio);
+		return true;
+	}
+
 	bio_set_dev(bio, tmp_dev->bdev);
 	bio->bi_iter.bi_sector = sector + zone->dev_start +
 		tmp_dev->data_offset;
-- 
2.21.0


* Re: [RFC] [PATCH V2 0/1] Introduce emergency raid0 stop for mounted arrays
  2019-04-18 22:04 ` Guilherme G. Piccoli
@ 2019-04-19 17:08   ` Song Liu
  -1 siblings, 0 replies; 12+ messages in thread
From: Song Liu @ 2019-04-19 17:08 UTC (permalink / raw)
  To: Guilherme G. Piccoli
  Cc: axboe, linux-block, kernel, NeilBrown, linux-raid, dm-devel,
	Linux-Fsdevel, jay.vosburgh

On Thu, Apr 18, 2019 at 3:05 PM Guilherme G. Piccoli
<gpiccoli@canonical.com> wrote:
>
> Currently the md driver completely relies in the userspace to stop an
> array in case of some failure - for raid0, if we remove a raid0 member like via
> PCI hot(un)plugging an NVMe device, and the raid0 array is _mounted_, mdadm
> cannot stop the array, since the tool tries to open the block device to perform
> the ioctl with the O_EXCL flag.
>
> So, an array in this situation is "alive" - users may write to it and unless
> they check the kernel log or some other monitor tool, everything will seem fine
> and the writes are completed with no errors. Being more precise, direct
> writes will not work, but since usually writes are done in a regular form
> (backed by the page cache) the most common scenario is an user being able to
> regularly write to a broken raid0, and get all their data corrupted.
>
>
> PROPOSAL:
> The idea proposed here to fix this behavior is mimic other block devices:
> if one have a filesystem mounted in a block device on top of an NVMe or
> SCSI disk and the disk gets removed, writes are prevented, errors are
> observed and it's obvious something is wrong; same goes for USB sticks.
>
> We believe right now the md driver is not behaving properly for raid0
> arrays (it is handling these errors for other levels though). The approach
> took for raid-0 is basically an emergency removal procedure, in which I/O
> is blocked from the device, the regular clean-up happens and the associate
> disk is deleted. It went to extensive testing, as detailed below.
>
> Not all are roses, we have some caveats that need to be resolved.
> Feedback is much appreciated.
> There is a caveats / questions / polemic choices section below the test
> section.
>
> V1 link: https://marc.info/?l=linux-raid&m=153313538013458

I read through the discussion in V1, and I would agree with Neil that
current behavior is reasonable.

For the following example:

fd = open("file", "w");
write(fd, buf, size);
ret = fsync(fd);

If "size" is big enough, the write is not expected to be atomic for
md or other drives. If we remove the underlining block device
after write() and before fsync(), the file could get corrupted. This
is the same for md or NVMe/SCSI drives.

The application needs to check "ret" from fsync(); the data is safe
only when fsync() returns 0.
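
For instance (purely illustrative; "file" and the 8 MiB size are just
placeholders):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t size = 8 << 20;	/* "big enough" for this example */
	char *buf = calloc(1, size);
	int fd = open("file", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (!buf || fd < 0)
		return 1;

	/* The write() may appear to succeed via the page cache; only a
	 * successful fsync() tells us the data really reached the array.
	 */
	if (write(fd, buf, size) < 0 || fsync(fd) != 0) {
		fprintf(stderr, "write/fsync failed: %s\n", strerror(errno));
		return 1;
	}

	close(fd);
	free(buf);
	return 0;
}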

Does this make sense?

Also, could you please highlight changes from V1 (if more than
just rebase)?

Thanks,
Song

> [...]


* Re: [RFC] [PATCH V2 0/1] Introduce emergency raid0 stop for mounted arrays
  2019-04-19 17:08   ` Song Liu
@ 2019-04-30 22:41     ` Guilherme G. Piccoli
  -1 siblings, 0 replies; 12+ messages in thread
From: Guilherme G. Piccoli @ 2019-04-30 22:41 UTC (permalink / raw)
  To: Song Liu
  Cc: axboe, linux-block, kernel, NeilBrown, linux-raid, dm-devel,
	Linux-Fsdevel, jay.vosburgh, gavin.guo

> On 19/04/2019 14:08, Song Liu wrote:
> [...]
> I read through the discussion in V1, and I would agree with Neil that
> current behavior is reasonable.
> 
> For the following example:
> 
> fd = open("file", "w");
> write(fd, buf, size);
> ret = fsync(fd);
> 
> If "size" is big enough, the write is not expected to be atomic for
> md or other drives. If we remove the underlining block device
> after write() and before fsync(), the file could get corrupted. This
> is the same for md or NVMe/SCSI drives.
> 
> The application need to check "ret" from fsync(), the data is safe
> only when fsync() returns 0.
> 
> Does this make sense?
> 

Hi Song, thanks for your quick response, and sorry for my delay.
I've noticed that after v4.18 the kernel started to crash when we remove
a raid0 member while writing, so I was investigating that before
performing your test (in fact, I found 2 issues [0]), hence my delay.

Your test does make sense; in fact I've tested your scenario with the
following code (with the patches from [0]):
https://pastebin.ubuntu.com/p/cyqpDqpM7x/

Indeed, fsync returns -1 in this case.
Interestingly, when I do a "dd if=<some_file> of=<raid0_mount>" and try
to "sync -f <some_file>" and "sync", it succeeds and the file is
written, although corrupted.

Do you think this behavior is correct? On other devices, like a plain
SCSI disk or NVMe, the 'dd' write fails.
Also, what about the status of the raid0 array in mdadm - it still shows
as "clean" even after the member is removed; should we change that?


> Also, could you please highlight changes from V1 (if more than
> just rebase)?

No changes other than a rebase. Worth mentioning here that a kernel bot
(and Julia Lawall) found an issue in my patch; I forgot a
"mutex_lock(&mddev->open_mutex);" in line 6053, which caused the first
caveat (hung mdadm and a persistent device in /dev). Thanks for pointing
out this silly mistake of mine! In case this patch gets some traction,
I'll re-submit with that fixed.

Cheers,


Guilherme

[0] https://marc.info/?l=linux-block&m=155666385707413

> 
> Thanks,
> Song
> 


* Re: [RFC] [PATCH V2 0/1] Introduce emergency raid0 stop for mounted arrays
  2019-04-30 22:41     ` Guilherme G. Piccoli
@ 2019-05-01 15:33       ` Song Liu
  -1 siblings, 0 replies; 12+ messages in thread
From: Song Liu @ 2019-05-01 15:33 UTC (permalink / raw)
  To: Guilherme G. Piccoli
  Cc: axboe, linux-block, kernel, NeilBrown, linux-raid, dm-devel,
	Linux-Fsdevel, Jay Vosburgh, gavin.guo

On Tue, Apr 30, 2019 at 3:41 PM Guilherme G. Piccoli
<gpiccoli@canonical.com> wrote:
>
> > On 19/04/2019 14:08, Song Liu wrote:
> > [...]
> > I read through the discussion in V1, and I would agree with Neil that
> > current behavior is reasonable.
> >
> > For the following example:
> >
> > fd = open("file", O_WRONLY | O_CREAT, 0644);
> > write(fd, buf, size);
> > ret = fsync(fd);
> >
> > If "size" is big enough, the write is not expected to be atomic for
> > md or other drives. If we remove the underlining block device
> > after write() and before fsync(), the file could get corrupted. This
> > is the same for md or NVMe/SCSI drives.
> >
> > The application need to check "ret" from fsync(), the data is safe
> > only when fsync() returns 0.
> >
> > Does this make sense?
> >
>
> Hi Song, thanks for your quick response, and sorry for my delay.
> I've noticed that after v4.18 the kernel started to crash when we remove
> a raid0 member while writing, so I was investigating this before
> performing your test (in fact, I found 2 issues [0]), hence my delay.
>
> Your test does make sense; in fact I've tested your scenario with the
> following code (with the patches from [0]):
> https://pastebin.ubuntu.com/p/cyqpDqpM7x/
>
> Indeed, fsync returns -1 in this case.
> Interestingly, when I do a "dd if=<some_file> of=<raid0_mount>" and try
> to "sync -f <some_file>" and "sync", it succeeds and the file is
> written, although corrupted.

I guess this is some issue with the sync command, but I haven't had time
to look into it. How about running dd with oflag=sync or oflag=direct?

>
> Do you think this behavior is correct? On other devices, like a pure
> SCSI disk or NVMe, the 'dd' write fails.
> Also, what about the status of the raid0 array in mdadm: it shows as
> "clean" even after the member is removed; should we change that?

I guess this is because the kernel hasn't detected that the array is
gone? In that case, I think reducing the latency would be useful for some
use cases.

Thanks,
Song

>
>
> > Also, could you please highlight changes from V1 (if more than
> > just rebase)?
>
> No changes other than rebase. Worth mentioning here that a kernel bot
> (and Julia Lawall) found an issue in my patch; I forgot a
> "mutex_lock(&mddev->open_mutex);" at line 6053, which caused the first
> caveat (hung mdadm and a persistent device in /dev). Thanks for pointing
> out this silly mistake of mine! In case this patch gets some traction,
> I'll re-submit with that fixed.
>
> Cheers,
>
>
> Guilherme
>
> [0] https://marc.info/?l=linux-block&m=155666385707413
>
> >
> > Thanks,
> > Song
> >

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] [PATCH V2 0/1] Introduce emergency raid0 stop for mounted arrays
  2019-05-01 15:33       ` Song Liu
@ 2019-05-01 18:00         ` Guilherme G. Piccoli
  -1 siblings, 0 replies; 12+ messages in thread
From: Guilherme G. Piccoli @ 2019-05-01 18:00 UTC (permalink / raw)
  To: Song Liu
  Cc: axboe, linux-block, kernel, Guilherme G. Piccoli, NeilBrown,
	linux-raid, dm-devel, Linux-Fsdevel, Jay Vosburgh, gavin.guo

 > On 5/1/19 12:33 PM, Song Liu wrote:
>> [...]
>> Indeed, fsync returns -1 in this case.
>> Interestingly, when I do a "dd if=<some_file> of=<raid0_mount>" and try
>> to "sync -f <some_file>" and "sync", it succeeds and the file is
>> written, although corrupted.
> 
> I guess this is some issue with the sync command, but I haven't had time
> to look into it. How about running dd with oflag=sync or oflag=direct?
> 

Hi Song, it could be some problem with the sync command; using either
'oflag=direct' or 'oflag=sync' makes the dd command fail instantly when a
member is removed.
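
In C terms, that is roughly what those dd flags do; a minimal sketch, with
the path being just a placeholder (oflag=sync maps to O_SYNC, oflag=direct
maps to O_DIRECT, which additionally wants an aligned buffer and
block-multiple sizes):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char *buf;
	int fd;

	/* 1 MiB buffer, 4 KiB aligned so the same code also works with O_DIRECT */
	if (posix_memalign((void **)&buf, 4096, 1 << 20))
		return 1;
	memset(buf, 0xab, 1 << 20);

	fd = open("/mnt/raid0/testfile", O_WRONLY | O_CREAT | O_SYNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* With O_SYNC the error is reported right here, no fsync() needed */
	if (write(fd, buf, 1 << 20) < 0)
		perror("write");

	close(fd);
	free(buf);
	return 0;
}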


>> Do you think this behavior is correct? On other devices, like a pure
>> SCSI disk or NVMe, the 'dd' write fails.
>> Also, what about the status of the raid0 array in mdadm: it shows as
>> "clean" even after the member is removed; should we change that?
> 
> I guess this is because the kernel hasn't detected that the array is
> gone? In that case, I think reducing the latency would be useful for some
> use cases.
> 

Exactly! This is the main concern here: mdadm cannot stop the array
since it's mounted, and there's no filesystem API to quickly shut down
the filesystem, hence the array stays "alive" for too long after the
failure.

For instance, if we have a raid0 with 2 members and remove the 1st, it
fails much more quickly than if we remove the 2nd; the filesystem
"realizes" the device is flawed sooner when the 1st member is removed,
and goes to read-only mode. In particular, xfs seems even faster than
ext4 at noticing the failure.

Do you have any suggestion on how we could reduce this latency? And how
about the status exhibited by mdadm: should it move from 'clean' to
something more meaningful in the failure case?
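
For completeness, I believe the same state mdadm reports is also exposed
in sysfs (/sys/block/<md>/md/array_state); a quick sketch of reading it,
with "md0" as a placeholder name:

#include <stdio.h>

int main(void)
{
	char state[64] = "";
	FILE *f = fopen("/sys/block/md0/md/array_state", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	if (fgets(state, sizeof(state), f))
		/* expected to still read "clean" after the member removal */
		printf("array_state: %s", state);
	fclose(f);
	return 0;
}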

Thanks again,


Guilherme

> Thanks,
> Song
> 
>>
>>
>>> Also, could you please highlight changes from V1 (if more than
>>> just rebase)?
>>
>> No changes other than rebase. Worth mentioning here that a kernel bot
>> (and Julia Lawall) found an issue in my patch; I forgot a
>> "mutex_lock(&mddev->open_mutex);" at line 6053, which caused the first
>> caveat (hung mdadm and a persistent device in /dev). Thanks for pointing
>> out this silly mistake of mine! In case this patch gets some traction,
>> I'll re-submit with that fixed.
>>
>> Cheers,
>>
>>
>> Guilherme
>>
>> [0] https://marc.info/?l=linux-block&m=155666385707413
>>
>>>
>>> Thanks,
>>> Song
>>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-05-01 18:00 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-18 22:04 [RFC] [PATCH V2 0/1] Introduce emergency raid0 stop for mounted arrays Guilherme G. Piccoli
2019-04-18 22:04 ` Guilherme G. Piccoli
2019-04-18 22:04 ` [RFC] [PATCH V2 1/1] md/raid0: Introduce emergency stop for raid0 arrays Guilherme G. Piccoli
2019-04-18 22:04   ` Guilherme G. Piccoli
2019-04-19 17:08 ` [RFC] [PATCH V2 0/1] Introduce emergency raid0 stop for mounted arrays Song Liu
2019-04-19 17:08   ` Song Liu
2019-04-30 22:41   ` Guilherme G. Piccoli
2019-04-30 22:41     ` Guilherme G. Piccoli
2019-05-01 15:33     ` Song Liu
2019-05-01 15:33       ` Song Liu
2019-05-01 18:00       ` Guilherme G. Piccoli
2019-05-01 18:00         ` Guilherme G. Piccoli
