* [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays
@ 2018-08-01 14:56 Guilherme G. Piccoli
  2018-08-01 14:56 ` [RFC] [PATCH 1/1] md/raid0: Introduce emergency stop for raid0 arrays Guilherme G. Piccoli
  2018-08-02  1:51 ` [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays NeilBrown
  0 siblings, 2 replies; 7+ messages in thread
From: Guilherme G. Piccoli @ 2018-08-01 14:56 UTC (permalink / raw)
  To: linux-raid, shli
  Cc: gpiccoli, kernel, jay.vosburgh, neilb, dm-devel, linux-block,
	linux-fsdevel

Currently the md driver relies entirely on userspace to stop an
array in case of some failure. There's an interesting case for raid0: if
we remove a raid0 member, e.g. by PCI hot(un)plugging an NVMe device, while
the raid0 array is _mounted_, mdadm cannot stop the array, since the tool
tries to open the block device (to perform the ioctl) with the O_EXCL flag.

So, in this case the array is still alive - users may write to this
"broken-yet-alive" array and, unless they check the kernel log or some
other monitoring tool, everything will seem fine and the writes will
complete with no errors. To be more precise, direct writes will not work,
but since writes are usually buffered, i.e., backed by the page cache,
the most common scenario is a user being able to keep writing to a broken
raid0 and getting all their data corrupted.
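
As a minimal illustration of why the stop fails (this is only a sketch of
the open + STOP_ARRAY sequence mdadm effectively performs, not mdadm's
actual code; /dev/md0 is an example path):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/major.h>
#include <linux/raid/md_u.h>	/* STOP_ARRAY */

int main(void)
{
	/* O_EXCL fails with EBUSY while a filesystem holds the device,
	 * so STOP_ARRAY is never even issued for a mounted array.
	 */
	int fd = open("/dev/md0", O_RDONLY | O_EXCL);

	if (fd < 0) {
		fprintf(stderr, "open /dev/md0: %s\n", strerror(errno));
		return 1;
	}
	if (ioctl(fd, STOP_ARRAY, NULL) < 0)
		fprintf(stderr, "STOP_ARRAY: %s\n", strerror(errno));
	close(fd);
	return 0;
}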

PROPOSAL:
The idea proposed here to fix this behavior is to mimic other block devices:
if a filesystem is mounted on a block device on top of an NVMe or
SCSI disk and the disk gets removed, writes are prevented, errors are
observed and it's obvious something is wrong. The same goes for USB sticks,
which are sometimes even physically removed from the machine without
their filesystem getting unmounted first.

We believe the md driver is currently not behaving properly for raid0
arrays (it does handle these errors for other levels). The approach
taken for raid0 is basically an emergency removal procedure, in which I/O
is blocked from the device, the regular clean-up happens and the associated
disk is deleted. It went through extensive testing, as detailed below.

Not everything is roses; we have some caveats that need to be resolved.
Feedback is _much appreciated_.
There is a caveats / questions / polemic choices section below the test
section.

Thanks in advance,


Guilherme


* Testing:

The device topology for the tests is as follows:


                            md6
                             |
              |******************************|
              |                              |
             md4                            md5
              |                              |
       |*************|                |*************|
       |             |                |             |
      md0           md2              md1           md3
       |             |                |             |
   |*******|       |***|            |***|       |*******|
   |       |       |   |            |   |       |       |
nvme1n1 nvme0n1   sda sdd          sde sdf   nvme2n1 nvme3n1


We chose to test such a complex topology to exercise corner cases and timing
issues (which were indeed found during development). There are basically 3
test subsets: the whole set of devices, called here "md cascade", and
2 small branches, called here "md direct" testing.

So, in summary:

### md direct (single arrays composed of SCSI/NVMe block devices) ###
A1 = md1      A2 = md3
C1 = sde      C1 = nvme2n1
C2 = sdf      C2 = nvme3n1

### md cascade (array composed of md devices) ###
A3 = md6
underlying component UC1 = nvme1n1
underlying component UC2 = sdd
underlying component UC3 = sdf
underlying component UC4 = nvme2n1


### Method of removal ###
- For NVMe devices, "echo 1 > /sys/block/nvmeXnY/device/device/remove"
- For SCSI devices, "echo 1 > /sys/block/sdX/device/delete"
When tests are marked with *, it means PCI/disk hotplug was exercised
too, using the qemu hotplug feature.


### Test ###
Write a file using the command: "dd if=/dev/zero of=tmpfile bs=X" where
X might be 1M or 4K - we've experimented with both.
Each array also contains a known-good file to be checked later, in order to
validate that the filesystem didn't get severely corrupted by the
procedure.
Tests marked with @ indicate direct writes were tested too.
Tests with a [] indicator exhibited some oddity/issue, detailed in the
caveats section.

After each test, the guest was rebooted. In some cases we also tested
re-adding the previously removed SCSI/NVMe device (after unmounting
the previous mount points) and restarting the arrays - this worked fine.


* Test results
("partition X" means we have a GPT table with 2 partitions in the device)


### md direct

Remove members and start writing to array right after:
A1 with:                               A2 with:
 -ext4                                  -ext4
 --Removed C1: PASSED @                 --Removed C2: PASSED @

 -xfs                                   -xfs
 --Removed C2: PASSED @                  --Removed C1: PASSED @

 -ext2                                  -btrfs
 --Removed C1: PASSED                   --Removed C2: PASSED

 -partition 1 + ext4                    -partition 1 + xfs
 -partition 2 + xfs                     -partition 2 + ext4
 --Removed C1: PASSED @                 --Removed C2: PASSED @

 -LVM + xfs                             -LVM + ext4
 --Removed C2: PASSED                   --Removed C1: PASSED

Start writing to array and remove member right after:
A1 with:                               A2 with:
 -ext4                                  -ext4
 --Removed C1: PASSED @                 --Removed C1: PASSED @
 --Removed C2: PASSED *@

 -xfs                                   -xfs
 --Removed C1: PARTIAL @ [(a)]          --Removed C1: PASSED *
                                        --Removed C2: PASSED *@

 -ext2                                  -btrfs
 --Removed C2: PASSED @                  --Removed C1: PASSED @

 -partition 1 + ext4                    -partition 1 + xfs
 -partition 2 + xfs                     -partition 2 + ext4
 --Removed C2: PARTIAL @ [(a)]          --Removed C1: PASSED *@


 -LVM + xfs                             -LVM + ext4
 --Removed C1: PASSED                   --Removed C2: PASSED

                                        -fuse (NTFS)
                                        --Removed C2: PASSED

### md cascade

Remove members and start writing to array right after:
A3 with:
 -ext4:
 --Removed UC2: PASSED @
 --Removed UC4: PASSED @

  -xfs:
 --Removed UC2: PASSED @
 --Removed UC4: PASSED @

 -partition 1 + ext4
 -partition 2 + xfs
 --Removed UC1: PASSED
 --Removed UC3: PASSED

 -ext2:
 --Removed UC3: PASSED @

 -btrfs:
 --Removed UC1:  PARTIAL [(d)]

Start writing to array and remove member right after:
A3 with:
 -ext4:
 --Removed UC2: PARTIAL @ [(a), (b)]
 --Removed UC4: PARTIAL @ [(b), (d)]

  -xfs:
 --Removed UC2: PARTIAL @  [(a), (c)]
 --Removed UC4: PARTIAL *@ [(c), (d)]

 -partition 1 + ext4
 -partition 2 + xfs
 --Removed UC1: PASSED *
 --Removed UC3: PASSED

 -ext2:
 --Removed UC3: PASSED @

 -btrfs:
 --Removed UC1: PARTIAL @ [(d)]


* Caveats / points of uncertainty / questions:

a) When using SCSI array members, depending on when the raid member gets
removed, the raid0 driver might not quickly observe a failure in the queue,
because the requests may all be "scheduled" in the SCSI cache, waiting to be
flushed to the device. In other words, depending on the timing, the
mechanism used in this patch to detect a failed array member may take
some time to trigger. This does not happen in direct I/O mode nor with
NVMe - it's related to the SCSI cache sync. As soon as a new I/O is queued
to the md device, the array removal is triggered.
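
A hedged sketch of how the direct I/O path exposes the failure right away
(the /mnt/array mount point, the file name and the 4K size/alignment are
assumptions for illustration, not values from the tests above):

#define _GNU_SOURCE	/* for O_DIRECT */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	/* O_DIRECT bypasses the page cache, so the write reaches the
	 * md/raid0 layer immediately and a broken array returns -EIO
	 * right away instead of "succeeding" into the cache.
	 */
	int fd = open("/mnt/array/tmpfile",
		      O_WRONLY | O_CREAT | O_DIRECT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, 4096, 4096)) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	memset(buf, 0, 4096);
	if (write(fd, buf, 4096) != 4096)
		perror("direct write");	/* expected: EIO on a broken raid0 */
	free(buf);
	close(fd);
	return 0;
}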

#For md cascade only (nested arrays):

b) In the case of direct I/Os, we might face an ext4 journal task blocked
in io_schedule(). Stack trace: https://pastebin.ubuntu.com/p/RZ4qFvJjy2

c) Hung task in xfs after direct I/O only, when trying to unmount
[xlog_wait() or xlog_dealloc_log()]. Stack trace:
https://pastebin.ubuntu.com/p/fq3G45jHX2

d) Some arrays in the nested scenario still show in /sys/block after
the removal.


#Generic questions / polemic choices:

i) With this patch, the STOP_ARRAY ioctl won't proceed if a disk has been
removed and the emergency stop routine has already started to act, even for
unmounted md arrays. This is a change from the old behavior, triggered by
the way we check for failed members in the raid0 driver.

ii) Currently, the patch implements a kernel-only removal policy - should
it rely on userspace (mdadm) to do it instead? A first approach was based
on userspace, but it proved to be more problematic in tests.

iii) The patch implements the mddev_delayed_stop() function to perform the
emergency stop - should we refactor do_md_stop() instead, to encompass
the minimal code for the emergency delayed stop?

Guilherme G. Piccoli (1):
  md/raid0: Introduce emergency stop for raid0 arrays

 drivers/md/md.c    | 126 +++++++++++++++++++++++++++++++++++++++++----
 drivers/md/md.h    |   6 +++
 drivers/md/raid0.c |  14 +++++
 3 files changed, 136 insertions(+), 10 deletions(-)

-- 
2.18.0


* [RFC] [PATCH 1/1] md/raid0: Introduce emergency stop for raid0 arrays
  2018-08-01 14:56 [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays Guilherme G. Piccoli
@ 2018-08-01 14:56 ` Guilherme G. Piccoli
  2018-08-02  1:51 ` [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays NeilBrown
  1 sibling, 0 replies; 7+ messages in thread
From: Guilherme G. Piccoli @ 2018-08-01 14:56 UTC (permalink / raw)
  To: linux-raid, shli
  Cc: gpiccoli, kernel, jay.vosburgh, neilb, dm-devel, linux-block,
	linux-fsdevel

Currently the raid0 driver has no health-checking mechanism to verify
that its members are fine. So, if a member is suddenly removed, for
example, a STOP_ARRAY ioctl has to be triggered from userspace, i.e.,
all the logic for stopping an array relies on userspace tools such as
mdadm/udev. In particular, if a raid0 array is mounted, this stop
procedure will fail, since mdadm tries to open the md block device
with the O_EXCL flag, which isn't allowed while the md device is
mounted.

That leads to the following situation: if a raid0 array member is
removed while the array is mounted, a user writing to this array
won't realize errors are happening unless they check the kernel log.
In other words, no -EIO is returned and writes (except direct I/Os)
appear normal. The user might think the written data is safely stored
in the array, when in fact garbage was written, since raid0 does striping
and requires all members to be working in order not to corrupt data.

This patch proposes a change in this behavior: emergency-stop a
raid0 array in case one of its members is gone. The check happens
when I/O is queued to the raid0 driver, so the driver will verify
whether the block device it plans to read/write has a healthy queue;
if it's not fine (e.g. a dying or dead queue), the raid0 driver will
invoke an emergency removal routine that marks the md device as
broken and triggers a delayed stop procedure. Also, raid0 will start
refusing new BIOs from this point on, returning -EIO.
The emergency stop routine will mark the md request queue as dying
too, as a "flag" to indicate failure in case of a nested raid0 array
configuration (a raid0 composed of raid0 devices).

The delayed stop procedure then performs the basic stop of the
md device, and takes care of the case in which it holds mounted
filesystems, allowing the stop of a mounted raid0 array - just as is
already possible with other regular block devices like NVMe and SCSI disks.

This emergency stop mechanism only affects raid0 arrays.

Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 drivers/md/md.c    | 126 +++++++++++++++++++++++++++++++++++++++++----
 drivers/md/md.h    |   6 +++
 drivers/md/raid0.c |  14 +++++
 3 files changed, 136 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 994aed2f9dff..bf725ba50ff2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -94,6 +94,7 @@ EXPORT_SYMBOL(md_cluster_mod);
 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
+static struct workqueue_struct *md_stop_wq;
 
 static int remove_and_add_spares(struct mddev *mddev,
 				 struct md_rdev *this);
@@ -455,15 +456,24 @@ static void md_end_flush(struct bio *fbio)
 	rdev_dec_pending(rdev, mddev);
 
 	if (atomic_dec_and_test(&fi->flush_pending)) {
-		if (bio->bi_iter.bi_size == 0)
+		if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+			goto bail_flush;
+		} else if (bio->bi_iter.bi_size == 0) {
 			/* an empty barrier - all done */
 			bio_endio(bio);
-		else {
+		} else {
 			INIT_WORK(&fi->flush_work, submit_flushes);
 			queue_work(md_wq, &fi->flush_work);
 		}
 	}
 
+bail_flush:
+	/* Prevent dangling bios from crashing after a raid0 emergency stop */
+	if (test_bit(MD_BROKEN, &mddev->flags) && !mddev->raid_disks) {
+		bio_uninit(fbio);
+		return;
+	}
+
 	mempool_free(fb, mddev->flush_bio_pool);
 	bio_put(fbio);
 }
@@ -473,6 +483,11 @@ void md_flush_request(struct mddev *mddev, struct bio *bio)
 	struct md_rdev *rdev;
 	struct flush_info *fi;
 
+	if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+		bio_io_error(bio);
+		return;
+	}
+
 	fi = mempool_alloc(mddev->flush_pool, GFP_NOIO);
 
 	fi->bio = bio;
@@ -5211,6 +5226,17 @@ md_attr_store(struct kobject *kobj, struct attribute *attr,
 	return rv;
 }
 
+static void __md_free(struct mddev *mddev)
+{
+	if (mddev->gendisk)
+		del_gendisk(mddev->gendisk);
+	if (mddev->queue)
+		blk_cleanup_queue(mddev->queue);
+	if (mddev->gendisk)
+		put_disk(mddev->gendisk);
+	percpu_ref_exit(&mddev->writes_pending);
+}
+
 static void md_free(struct kobject *ko)
 {
 	struct mddev *mddev = container_of(ko, struct mddev, kobj);
@@ -5218,13 +5244,8 @@ static void md_free(struct kobject *ko)
 	if (mddev->sysfs_state)
 		sysfs_put(mddev->sysfs_state);
 
-	if (mddev->gendisk)
-		del_gendisk(mddev->gendisk);
-	if (mddev->queue)
-		blk_cleanup_queue(mddev->queue);
-	if (mddev->gendisk)
-		put_disk(mddev->gendisk);
-	percpu_ref_exit(&mddev->writes_pending);
+	if (!test_bit(MD_BROKEN, &mddev->flags))
+		__md_free(mddev);
 
 	bioset_exit(&mddev->bio_set);
 	bioset_exit(&mddev->sync_set);
@@ -5247,7 +5268,9 @@ static void mddev_delayed_delete(struct work_struct *ws)
 {
 	struct mddev *mddev = container_of(ws, struct mddev, del_work);
 
-	sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
+	if (!test_bit(MD_BROKEN, &mddev->flags))
+		sysfs_remove_group(&mddev->kobj, &md_bitmap_group);
+
 	kobject_del(&mddev->kobj);
 	kobject_put(&mddev->kobj);
 }
@@ -5987,6 +6010,75 @@ static int md_set_readonly(struct mddev *mddev, struct block_device *bdev)
 	return err;
 }
 
+static void mddev_delayed_stop(struct work_struct *ws)
+{
+	struct mddev *mddev = container_of(ws, struct mddev, stop_work);
+	struct gendisk *disk = mddev->gendisk;
+	struct block_device *bdev;
+	struct md_rdev *rdev;
+	int i = 0;
+
+	mutex_lock(&mddev->open_mutex);
+	del_timer_sync(&mddev->safemode_timer);
+	__md_stop(mddev);
+	rdev_for_each(rdev, mddev)
+		if (rdev->raid_disk >= 0)
+			sysfs_unlink_rdev(mddev, rdev);
+	set_capacity(disk, 0);
+	mutex_unlock(&mddev->open_mutex);
+
+	mddev->changed = 1;
+	revalidate_disk(disk);
+	export_array(mddev);
+	mddev->hold_active = 0;
+
+	/* Make sure unbind_rdev_from_array() jobs are done */
+	flush_workqueue(md_misc_wq);
+
+	do {
+		bdev = bdget_disk(disk, i++);
+		if (bdev) {
+			struct super_block *sb = bdev->bd_super;
+
+			if (sb) {
+				sb->s_flags |= SB_RDONLY;
+				sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK;
+			}
+			bdput(bdev);
+		}
+	} while (bdev);
+
+	md_clean(mddev);
+	set_bit(MD_BROKEN, &mddev->flags); /* md_clean() will clear this too */
+	md_new_event(mddev);
+	sysfs_notify_dirent_safe(mddev->sysfs_state);
+
+	__md_free(mddev);
+	mddev->gendisk = NULL;
+}
+
+void mddev_emergency_stop(struct mddev *mddev)
+{
+	/* Prevents races if raid0 driver detects errors in multiple requests
+	 * at the same time.
+	 */
+	mutex_lock(&mddev->open_mutex);
+	if (test_bit(MD_BROKEN, &mddev->flags)) {
+		mutex_unlock(&mddev->open_mutex);
+		return;
+	}
+
+	set_bit(MD_BROKEN, &mddev->flags);
+	mutex_unlock(&mddev->open_mutex);
+
+	/* Necessary to set md queue as dying here in case this md device is
+	 * a member of some other md array - nested removal will proceed then.
+	 */
+	blk_set_queue_dying(mddev->queue);
+
+	INIT_WORK(&mddev->stop_work, mddev_delayed_stop);
+	queue_work(md_stop_wq, &mddev->stop_work);
+}
+EXPORT_SYMBOL_GPL(mddev_emergency_stop);
+
 /* mode:
  *   0 - completely stop and dis-assemble array
  *   2 - stop but do not disassemble array
@@ -5998,6 +6090,13 @@ static int do_md_stop(struct mddev *mddev, int mode,
 	struct md_rdev *rdev;
 	int did_freeze = 0;
 
+	/* If we already started the emergency removal procedure,
+	 * the STOP_ARRAY ioctl can race with the removal, leading to
+	 * dangling devices. Below check is to prevent this corner case.
+	 */
+	if (test_bit(MD_BROKEN, &mddev->flags))
+		return -EBUSY;
+
 	if (!test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) {
 		did_freeze = 1;
 		set_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
@@ -9122,6 +9221,10 @@ static int __init md_init(void)
 	if (!md_misc_wq)
 		goto err_misc_wq;
 
+	md_stop_wq = alloc_ordered_workqueue("md_stop", 0);
+	if (!md_stop_wq)
+		goto err_stop_wq;
+
 	if ((ret = register_blkdev(MD_MAJOR, "md")) < 0)
 		goto err_md;
 
@@ -9143,6 +9246,8 @@ static int __init md_init(void)
 err_mdp:
 	unregister_blkdev(MD_MAJOR, "md");
 err_md:
+	destroy_workqueue(md_stop_wq);
+err_stop_wq:
 	destroy_workqueue(md_misc_wq);
 err_misc_wq:
 	destroy_workqueue(md_wq);
@@ -9402,6 +9507,7 @@ static __exit void md_exit(void)
 		 * destroy_workqueue() below will wait for that to complete.
 		 */
 	}
+	destroy_workqueue(md_stop_wq);
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
 }
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 2d148bdaba74..e5e7e3977b46 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -243,6 +243,9 @@ enum mddev_flags {
 	MD_UPDATING_SB,		/* md_check_recovery is updating the metadata
 				 * without explicitly holding reconfig_mutex.
 				 */
+	MD_BROKEN,		/* This is used in the emergency RAID-0 stop
+				 * in case one of its members gets removed.
+				 */
 };
 
 enum mddev_sb_flags {
@@ -411,6 +414,8 @@ struct mddev {
 
 	struct work_struct del_work;	/* used for delayed sysfs removal */
 
+	struct work_struct stop_work;	/* used for delayed emergency stop */
+
 	/* "lock" protects:
 	 *   flush_bio transition from NULL to !NULL
 	 *   rdev superblocks, events
@@ -713,6 +718,7 @@ extern void md_rdev_clear(struct md_rdev *rdev);
 extern void md_handle_request(struct mddev *mddev, struct bio *bio);
 extern void mddev_suspend(struct mddev *mddev);
 extern void mddev_resume(struct mddev *mddev);
+extern void mddev_emergency_stop(struct mddev *mddev);
 extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 				   struct mddev *mddev);
 
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index ac1cffd2a09b..4f282087aba2 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -555,6 +555,7 @@ static void raid0_handle_discard(struct mddev *mddev, struct bio *bio)
 static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 {
 	struct strip_zone *zone;
+	struct block_device *bd;
 	struct md_rdev *tmp_dev;
 	sector_t bio_sector;
 	sector_t sector;
@@ -593,6 +594,19 @@ static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
 
 	zone = find_zone(mddev->private, &sector);
 	tmp_dev = map_sector(mddev, zone, sector, &sector);
+	bd = tmp_dev->bdev;
+
+	if (unlikely((blk_queue_dead(bd->bd_queue) ||
+	    blk_queue_dying(bd->bd_queue)) &&
+	    !test_bit(MD_BROKEN, &mddev->flags))) {
+		mddev_emergency_stop(mddev);
+	}
+
+	if (unlikely(test_bit(MD_BROKEN, &mddev->flags))) {
+		bio_io_error(bio);
+		return true;
+	}
+
 	bio_set_dev(bio, tmp_dev->bdev);
 	bio->bi_iter.bi_sector = sector + zone->dev_start +
 		tmp_dev->data_offset;
-- 
2.18.0


* Re: [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays
  2018-08-01 14:56 [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays Guilherme G. Piccoli
  2018-08-01 14:56 ` [RFC] [PATCH 1/1] md/raid0: Introduce emergency stop for raid0 arrays Guilherme G. Piccoli
@ 2018-08-02  1:51 ` NeilBrown
  2018-08-02 13:30   ` Guilherme G. Piccoli
  1 sibling, 1 reply; 7+ messages in thread
From: NeilBrown @ 2018-08-02  1:51 UTC (permalink / raw)
  To: Guilherme G. Piccoli, linux-raid, shli
  Cc: gpiccoli, kernel, jay.vosburgh, dm-devel, linux-block, linux-fsdevel


On Wed, Aug 01 2018, Guilherme G. Piccoli wrote:

> Currently the md driver completely relies in the userspace to stop an
> array in case of some failure. There's an interesting case for raid0: if
> we remove a raid0 member, like PCI hot(un)plugging an NVMe device, and
> the raid0 array is _mounted_, mdadm cannot stop the array, since the tool
> tries to open the block device (to perform the ioctl) with O_EXCL flag.
>
> So, in this case the array is still alive - users may write to this
> "broken-yet-alive" array and unless they check the kernel log or some
> other monitor tool, everything will seem fine and the writes are completed
> with no errors. Being more precise, direct writes will not work, but since
> usually writes are done in a regular form, i.e., backed by the page
> cache, the most common scenario is an user being able to regularly write
> to a broken raid0, and get all their data corrupted.
>
> PROPOSAL:
> The idea proposed here to fix this behavior is mimic other block devices:
> if one have a filesystem mounted in a block device on top of an NVMe or
> SCSI disk and the disk gets removed, writes are prevented, errors are
> observed and it's obvious something is wrong. Same goes for USB sticks,
> which are sometimes even removed physically from the machine without
> getting their filesystem unmounted before.
>
> We believe right now the md driver is not behaving properly for raid0
> arrays (it is handling these errors for other levels though). The approach
> took for raid-0 is basically an emergency removal procedure, in which I/O
> is blocked from the device, the regular clean-up happens and the associate
> disk is deleted. It went to extensive testing, as detailed below.
>
> Not all are roses, we have some caveats that need to be resolved.
> Feedback is _much appreciated_.

If you have a hard drive and some sectors or tracks stop working, I think
you would still expect IO to the other sectors or tracks to keep
working.

For this reason, the behaviour of md/raid0 is to continue to serve IO to
working devices, and only fail IO to failed/missing devices.

It seems reasonable that you might want a different behaviour, but I
think that should be optional.  i.e. you would need to explicitly set a
"one-out-all-out" flag on the array.  I'm not sure if this should cause
reads to fail, but it seems quite reasonable that it would cause all
writes to fail.

I would only change the kernel to recognise the flag and refuse any
writes after any error has been seen.
I would use udev/mdadm to detect a device removal and to mark the
relevant component device as missing.

NeilBrown


* Re: [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays
  2018-08-02  1:51 ` [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays NeilBrown
@ 2018-08-02 13:30   ` Guilherme G. Piccoli
  2018-08-02 21:37     ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Guilherme G. Piccoli @ 2018-08-02 13:30 UTC (permalink / raw)
  To: NeilBrown, linux-raid, shli
  Cc: kernel, jay.vosburgh, dm-devel, linux-block, linux-fsdevel

On 01/08/2018 22:51, NeilBrown wrote:
>> [...] 
> If you have hard drive and some sectors or track stop working, I think
> you would still expect IO to the other sectors or tracks to keep
> working.
> 
> For this reason, the behaviour of md/raid0 is to continue to serve IO to
> working devices, and only fail IO to failed/missing devices.
> 

Hi Neil, thanks for your quick response. I agree with you about the
potential sector failure; it shouldn't automatically fail the entire
array for a single failed write.

The check I'm using in the patch is against the device request queue - if
a raid0 member's queue is dying/dead, then we consider the device dead,
and as a consequence, the array is marked dead.

In my understanding of raid0/striping, data is split among N devices,
called raid members. If one member fails, the data written
to the array will certainly be corrupted, since the "portion" of data
going to the failed device won't be stored.
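
To make the striping argument concrete, here is a toy model of the chunk
placement (a simplification of the real find_zone()/map_sector() logic,
assuming equal-size members and a single zone; the chunk size and member
count in main() are illustrative values only):

#include <stdio.h>

/* Consecutive chunks rotate across the members, so any file larger
 * than one chunk is spread over all of them and cannot survive the
 * loss of a single member.
 */
static int chunk_owner(unsigned long long sector,
		       unsigned int chunk_sectors, int nr_members)
{
	return (int)((sector / chunk_sectors) % nr_members);
}

int main(void)
{
	unsigned long long s;

	/* 512K chunks (1024 sectors of 512 bytes), 2 members */
	for (s = 0; s < 4096; s += 1024)
		printf("sector %llu -> member %d\n",
		       s, chunk_owner(s, 1024, 2));
	return 0;
}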

Regarding the current behavior, one test I made was to remove 1 device
of a 2-disk raid0 array and, after that, write a file. The write completed
normally (no errors from the userspace perspective), and I hashed the
file using md5. I then rebooted the machine, the raid0 was back with the 2
devices, and guess what?
The written file was there, but corrupted (with a different hash). I
don't think this is fine; a user could have written important
data without realizing it was getting corrupted while writing.


> It seems reasonable that you might want a different behaviour, but I
> think that should be optional.  i.e. you would need to explicitly set a
> "one-out-all-out" flag on the array.  I'm not sure if this should cause
> reads to fail, but it seems quite reasonable that it would cause all
> writes to fail.
> 
> I would only change the kernel to recognise the flag and refuse any
> writes after any error has been seen.
> I would use udev/mdadm to detect a device removal and to mark the
> relevant component device as missing.
>

Using udev/mdadm to notice that a member has failed and the array must be
stopped might work; it was my first approach. The main issue here is
timing: it takes "some time" until userspace is aware of the failure, so
we have a window in which writes are sent between

(A) the array member failed/got removed, and
(B) mdadm notices and instructs the driver to refuse new writes;

between (A) and (B), those writes are seen as completed, since they are
indeed complete (at least, they are fine from the page cache point of
view). Then, writeback will try to write them out, which will either
cause problems or complete in a corrupted form (the file will
be present in the array's filesystem after the array is restored, but
corrupted).

So, the in-kernel mechanism avoids most of the (A)-(B) window,
although it seems we still have some problems when nesting arrays,
due to this same window, even with the in-kernel mechanism (given that
it takes some time to remove the top array when a fairly "distant"
bottom member fails).

More suggestions on how to deal with this in a definitive manner are
highly appreciated.
Thanks,


Guilherme


* Re: [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays
  2018-08-02 13:30   ` Guilherme G. Piccoli
@ 2018-08-02 21:37     ` NeilBrown
  2018-08-09 23:17       ` Guilherme G. Piccoli
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2018-08-02 21:37 UTC (permalink / raw)
  To: Guilherme G. Piccoli, linux-raid, shli
  Cc: kernel, jay.vosburgh, dm-devel, linux-block, linux-fsdevel


On Thu, Aug 02 2018, Guilherme G. Piccoli wrote:

> On 01/08/2018 22:51, NeilBrown wrote:
>>> [...] 
>> If you have hard drive and some sectors or track stop working, I think
>> you would still expect IO to the other sectors or tracks to keep
>> working.
>> 
>> For this reason, the behaviour of md/raid0 is to continue to serve IO to
>> working devices, and only fail IO to failed/missing devices.
>> 
>
> Hi Neil, thanks for your quick response. I agree with you about the
> potential sector failure, it shouldn't automatically fail the entire
> array for a single failed write.
>
> The check I'm using in the patch is against device request queue - if a
> raid0 member queue is dying/dead, then we consider the device as dead,
> and as a consequence, the array is marked dead.
>
> In my understanding of raid0/stripping, data is split among N devices,
> called raid members. If one member is failed, for sure the data written
> to the array will be corrupted, since that "portion" of data going to
> the failed device won't be stored.
>
> Regarding the current behavior, one test I made was to remove 1 device
> of a 2-disk raid0 array and after that, write a file. Write completed
> normally (no errors from the userspace perspective), and I hashed the
> file using md5. I then rebooted the machine, raid0 was back with the 2
> devices, and guess what?
> The written file was there, but corrupted (with a different hash). I
> don't think this is something fine, user could have written important
> data and don't realize it was getting corrupted while writing.


In your test, did you "fsync" the file after writing to it?  That is
essential for data security.
If fsync succeeded even though the data wasn't written, that is
certainly a bug.  If it doesn't succeed, then you know there is a
problem with your data.
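
For reference, a minimal sketch of the write-then-fsync discipline being
discussed (the path is a placeholder; the point is that durability is only
known once fsync()'s return value has been checked):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/mnt/array/tmpfile",
		      O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		perror("write");	/* buffered: may still "succeed" on a broken raid0 */
	if (fsync(fd) < 0)
		perror("fsync");	/* any durability error must surface here */
	close(fd);
	return 0;
}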

>
>
>> It seems reasonable that you might want a different behaviour, but I
>> think that should be optional.  i.e. you would need to explicitly set a
>> "one-out-all-out" flag on the array.  I'm not sure if this should cause
>> reads to fail, but it seems quite reasonable that it would cause all
>> writes to fail.
>> 
>> I would only change the kernel to recognise the flag and refuse any
>> writes after any error has been seen.
>> I would use udev/mdadm to detect a device removal and to mark the
>> relevant component device as missing.
>>
>
> Using the udev/mdadm to notice a member has failed and the array must be
> stopped might work, it was my first approach. The main issue here is
> timing: it takes "some time" until userspace is aware of the failure, so
> we have a window in which writes were sent between
>
> (A) the array member failed/got removed and
> (B) mdadm notices and instruct driver to refuse new writes;

I don't think the delay is relevant.
If writes are happening, then the kernel will get write errors from the
failed devices and can flag the array as faulty.
If writes aren't happening, then there is no significant cost in the
"device is removed" message going up to user-space and back.

NeilBrown

>
> between (A) and (B), those writes are seen as completed, since they are
> indeed complete (at least, they are fine from the page cache point of
> view). Then, writeback will try to write those, which will cause
> problems or they will complete in a corrupted form (the file will
> be present in the array's filesystem after array is restored, but
> corrupted).
>
> So, the in-kernel mechanism avoided most part of window (A)-(B),
> although it seems we still have some problems when nesting arrays,
> due to this same window, even with the in-kernel mechanism (given the
> fact it takes some time to remove the top array when a pretty "far"
> bottom-member is failed).
>
> More suggestions on how to deal with this in a definitive manner are
> highly appreciated.
> Thanks,
>
>
> Guilherme


* Re: [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays
  2018-08-02 21:37     ` NeilBrown
@ 2018-08-09 23:17       ` Guilherme G. Piccoli
  2018-09-03 12:16         ` Guilherme G. Piccoli
  0 siblings, 1 reply; 7+ messages in thread
From: Guilherme G. Piccoli @ 2018-08-09 23:17 UTC (permalink / raw)
  To: NeilBrown, linux-raid, shli
  Cc: kernel, jay.vosburgh, dm-devel, linux-block, linux-fsdevel

Hi Neil, sorry for my delay.


On 02/08/2018 18:37, NeilBrown wrote:
> On Thu, Aug 02 2018, Guilherme G. Piccoli wrote:
>> [...]
>> Regarding the current behavior, one test I made was to remove 1 device
>> of a 2-disk raid0 array and after that, write a file. Write completed
>> normally (no errors from the userspace perspective), and I hashed the
>> file using md5. I then rebooted the machine, raid0 was back with the 2
>> devices, and guess what?
>> The written file was there, but corrupted (with a different hash). I
>> don't think this is something fine, user could have written important
>> data and don't realize it was getting corrupted while writing.
> 
> 
> In your test, did you "fsync" the file after writing to it?  That is
> essential for data security.
> If fsync succeeded even though the data wasn't written, that is
> certainly a bug.  If it doesn't succeed, then you know there is a
> problem with your data.
> 

Yes, I did. After writing, I ran both "sync" and "sync -f" once the "dd"
command completed (with no errors). The sync procedures also finished
without errors, and the file was there. After a reboot, though, the
file had a different md5, since it was corrupted.


>> [...]
>> Using the udev/mdadm to notice a member has failed and the array must be
>> stopped might work, it was my first approach. The main issue here is
>> timing: it takes "some time" until userspace is aware of the failure, so
>> we have a window in which writes were sent between
>>
>> (A) the array member failed/got removed and
>> (B) mdadm notices and instruct driver to refuse new writes;
> 
> I don't think the delay is relevant.
> If writes are happening, then the kernel will get write error from the
> failed devices and can flag the array as faulty.
> If writes aren't happening, then it no important cost in the "device is
> removed" message going up to user-space and back.

The problem with the time between userspace noticing something is wrong
and "warning" the kernel to stop writes is that many writes will be sent
to the device in the meantime, and they can complete later - handling
async completions from dead devices proved to be tricky, at least in my
approach.
Also, writeback threads will be filled with I/Os to be written to the
dead devices too; this is the other part of the problem.

If you have suggestions to improve my approach, or perhaps a totally
different idea than mine, I highly appreciate the feedback.

Thank you very much for the attention.
Cheers,


Guilherme

> 
> NeilBrown
> 
>>
>> between (A) and (B), those writes are seen as completed, since they are
>> indeed complete (at least, they are fine from the page cache point of
>> view). Then, writeback will try to write those, which will cause
>> problems or they will complete in a corrupted form (the file will
>> be present in the array's filesystem after array is restored, but
>> corrupted).
>>
>> So, the in-kernel mechanism avoided most part of window (A)-(B),
>> although it seems we still have some problems when nesting arrays,
>> due to this same window, even with the in-kernel mechanism (given the
>> fact it takes some time to remove the top array when a pretty "far"
>> bottom-member is failed).
>>
>> More suggestions on how to deal with this in a definitive manner are
>> highly appreciated.
>> Thanks,
>>
>>
>> Guilherme


* Re: [RFC] [PATCH 0/1] Introduce emergency raid0 stop for mounted arrays
  2018-08-09 23:17       ` Guilherme G. Piccoli
@ 2018-09-03 12:16         ` Guilherme G. Piccoli
  0 siblings, 0 replies; 7+ messages in thread
From: Guilherme G. Piccoli @ 2018-09-03 12:16 UTC (permalink / raw)
  To: NeilBrown, linux-raid, shli
  Cc: kernel, jay.vosburgh, dm-devel, linux-block, linux-fsdevel

On 09/08/2018 20:17, Guilherme G. Piccoli wrote:
> [..] 
> If you have suggestions to improve my approach, or perhaps a totally
> different idea than mine, I highly appreciate the feedback.
> 
> Thank you very much for the attention.
> Cheers,
> 
> 
> Guilherme
>

Hi Neil and Shaohua Li, sorry for the ping - do you have any
news on this? Any feedback is much appreciated.

Cheers,


Guilherme

