* [PATCH v8 0/2] [RFC] Introduce device state 'failed'
@ 2017-10-03 15:59 Anand Jain
  2017-10-03 15:59 ` [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed Anand Jain
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Anand Jain @ 2017-10-03 15:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: bo.li.liu

When a device fails it has to be closed and marked as failed.
Further, we need a sysfs (or similar) interface to provide complete
information about the device and the volume status from the kernel
to user land. And when the disappeared device reappears we need to
resilver/resync it, which has to be handled per RAID profile.

The effort here is to fix the three missing items above.

To begin with, this patch set brings a device with write/flush
failures to the failed state.

Next, bringing the device back to the alloc list, verifying its
consistency and kicking off the re-silvering is still WIP, and
feedback helps. For RAID1, converting any chunks that fell back to
the single profile back to raid1 will do. For RAID56 I am counting
on Liu Bo's recent RAID56 write-hole work, which I have yet to look
at more deeply. Also for RAID1 there can be a split-brain scenario
where each of the devices was mounted independently; to detect this
I am planning to set a (new) incompat flag whenever one device is
written without the other. When the devices are brought back
together, the incompat flag should then be on only one of them;
however, if the incompat flag is on both devices it is a split-brain
scenario and user intervention will be required. A rough sketch of
that check follows.
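
A minimal sketch of the intended check (every name in it is
hypothetical; nothing in this series implements it yet):

	/* Hypothetical split-brain check when a RAID1 device reappears. */
	static int check_split_brain(struct btrfs_device *a,
				     struct btrfs_device *b)
	{
		/* assumed helper reading the proposed incompat flag */
		bool a_dirty = device_written_degraded(a);
		bool b_dirty = device_written_degraded(b);

		if (a_dirty && b_dirty)
			return -EINVAL;		/* split brain, user decides */
		if (a_dirty)
			return resilver(b, a);	/* b is stale, sync it from a */
		if (b_dirty)
			return resilver(a, b);	/* a is stale, sync it from b */
		return 0;			/* neither was written alone */
	}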

On the sysfs part, there are patches sent earlier to the ML;
I shall be reviving them as well.

Thanks, Anand

Anand Jain (2):
  btrfs: introduce device dynamic state transition to failed
  btrfs: check device for critical errors and mark failed

 fs/btrfs/ctree.h   |   2 +
 fs/btrfs/disk-io.c |  78 ++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  19 +++++++++-
 4 files changed, 202 insertions(+), 2 deletions(-)

-- 
2.7.0



* [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed
  2017-10-03 15:59 [PATCH v8 0/2] [RFC] Introduce device state 'failed' Anand Jain
@ 2017-10-03 15:59 ` Anand Jain
  2017-10-13 18:47   ` Liu Bo
  2017-10-03 15:59 ` [PATCH v8 2/2] btrfs: check device for critical errors and mark failed Anand Jain
  2017-10-05 13:54 ` [PATCH v8.1 2/2] btrfs: mark device failed for write and flush errors Anand Jain
  2 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2017-10-03 15:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: bo.li.liu

From: Anand Jain <Anand.Jain@oracle.com>

This patch provides helper functions to force a device into the
failed state, and we need it for the following reasons:
1) a. Report that a device has failed when it does, and
   b. Close the device when it goes offline so that the block layer
      can clean up
2) Identify the candidate for auto replace
3) Stop further RW to the failing device
4) A device in a multi-device btrfs may fail, but as of now in some
   system configs the whole of btrfs gets unmounted.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
V8: General misc cleanup. Based on v4.14-rc2

 fs/btrfs/volumes.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  15 +++++++-
 2 files changed, 118 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c305df..06e7cf4cef81 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7255,3 +7255,107 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info)
 		fs_devices = fs_devices->seed;
 	}
 }
+
+static void do_close_device(struct work_struct *work)
+{
+	struct btrfs_device *device;
+
+	device = container_of(work, struct btrfs_device, rcu_work);
+
+	if (device->closing_bdev)
+		blkdev_put(device->closing_bdev, device->mode);
+
+	device->closing_bdev = NULL;
+}
+
+static void btrfs_close_one_device(struct rcu_head *head)
+{
+	struct btrfs_device *device;
+
+	device = container_of(head, struct btrfs_device, rcu);
+
+	INIT_WORK(&device->rcu_work, do_close_device);
+	schedule_work(&device->rcu_work);
+}
+
+void btrfs_force_device_close(struct btrfs_device *device)
+{
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = device->fs_devices;
+	fs_info = fs_devices->fs_info;
+
+	btrfs_sysfs_rm_device_link(fs_devices, device);
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	mutex_lock(&fs_devices->fs_info->chunk_mutex);
+
+	btrfs_assign_next_active_device(fs_devices->fs_info, device, NULL);
+
+	if (device->bdev)
+		fs_devices->open_devices--;
+
+	if (device->writeable) {
+		list_del_init(&device->dev_alloc_list);
+		fs_devices->rw_devices--;
+	}
+	device->writeable = 0;
+
+	/*
+	 * Todo: We have misused the missing flag all around, and here
+	 * too for now. (In the long run I want missing to only
+	 * indicate that it was not present when RAID was assembled.)
+	 */
+	device->missing = 1;
+	fs_devices->missing_devices++;
+	device->closing_bdev = device->bdev;
+	device->bdev = NULL;
+
+	call_rcu(&device->rcu, btrfs_close_one_device);
+
+	mutex_unlock(&fs_devices->fs_info->chunk_mutex);
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	rcu_barrier();
+
+	btrfs_warn_in_rcu(fs_info, "device %s failed",
+				rcu_str_deref(device->name));
+
+	/*
+	 * We lost one or more disks, which means it's not as it
+	 * was configured by the user, so the mount should show
+	 * degraded.
+	 */
+	btrfs_set_opt(fs_info->mount_opt, DEGRADED);
+
+	/*
+	 * Now having lost one of the devices, check if the chunk
+	 * stripe is incomplete and handle the fatal error if needed.
+	 */
+	if (!btrfs_check_rw_degradable(fs_info))
+		btrfs_handle_fs_error(fs_info, -EIO,
+				"devices below critical level");
+}
+
+void btrfs_mark_device_failed(struct btrfs_device *dev)
+{
+	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
+
+	/* This shouldn't be called if the device is already missing */
+	if (dev->missing || !dev->bdev)
+		return;
+	if (dev->failed)
+		return;
+	dev->failed = 1;
+
+	/* The last RW device failed; let the FS error handling deal with it. */
+	if (fs_devices->rw_devices == 1) {
+		btrfs_handle_fs_error(fs_devices->fs_info, -EIO,
+					"Last RW device failed");
+		return;
+	}
+
+	/* The point of no return starts here. */
+	btrfs_force_device_close(dev);
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6108fdfec67f..05b150c03995 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -65,13 +65,26 @@ struct btrfs_device {
 	struct btrfs_pending_bios pending_sync_bios;
 
 	struct block_device *bdev;
+	struct block_device *closing_bdev;
 
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
 	int writeable;
 	int in_fs_metadata;
+	/* missing: device not found at the time of mount */
 	int missing;
+	/* failed: device confirmed to have experienced critical io failure */
+	int failed;
+	/*
+	 * offline: system or user or block layer transport has removed
+	 * or offlined the device which was once present, without going
+	 * through unmount. Implies an interim communication breakdown
+	 * and not necessarily a candidate for device replace. The
+	 * device might come back online after user intervention or
+	 * after block transport layer error recovery.
+	 */
+	int offline;
 	int can_discard;
 	int is_tgtdev_for_dev_replace;
 	blk_status_t last_flush_error;
@@ -544,5 +557,5 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
 void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
 				 u8 *uuid);
-
+void btrfs_mark_device_failed(struct btrfs_device *dev);
 #endif
-- 
2.7.0


* [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-03 15:59 [PATCH v8 0/2] [RFC] Introduce device state 'failed' Anand Jain
  2017-10-03 15:59 ` [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed Anand Jain
@ 2017-10-03 15:59 ` Anand Jain
  2017-10-04 20:11   ` Liu Bo
  2017-10-05 13:54 ` [PATCH v8.1 2/2] btrfs: mark device failed for write and flush errors Anand Jain
  2 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2017-10-03 15:59 UTC (permalink / raw)
  To: linux-btrfs; +Cc: bo.li.liu

From: Anand Jain <Anand.Jain@oracle.com>

Write and flush errors are critical errors, upon which the device fd
must be closed and marked as failed.

There are two types of device close in btrfs. One is close as part
of cleanup, where we release the struct btrfs_device and/or
btrfs_fs_devices as well. The other type, which is introduced
here, is where we close the device fd because it has failed
while the mounted FS is still present, using the other redundant
device. In this new case we keep the failed device's struct
btrfs_device, similar to a missing device.

Further, the approach here is to monitor the device statistics and
trigger the action based on one or more of them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
V8: General misc cleanup. Based on v4.14-rc2

 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c |  1 +
 fs/btrfs/volumes.h |  4 +++
 4 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a8933da39a7..bad8fbaff18d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -824,6 +824,7 @@ struct btrfs_fs_info {
 	struct mutex tree_log_mutex;
 	struct mutex transaction_kthread_mutex;
 	struct mutex cleaner_mutex;
+	struct mutex health_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
 
@@ -941,6 +942,7 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *extent_workers;
 	struct task_struct *transaction_kthread;
 	struct task_struct *cleaner_kthread;
+	struct task_struct *health_kthread;
 	int thread_pool_size;
 
 	struct kobject *space_info_kobj;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 487bbe4fb3c6..be22104bafbf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1922,6 +1922,70 @@ static int cleaner_kthread(void *arg)
 	return 0;
 }
 
+static void btrfs_check_device_fatal_errors(struct btrfs_root *root)
+{
+	struct btrfs_device *device;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+
+	/* Mark devices with write or flush errors as failed. */
+	mutex_lock(&fs_info->volume_mutex);
+	list_for_each_entry_rcu(device,
+			&fs_info->fs_devices->devices, dev_list) {
+		int c_err;
+
+		if (device->failed)
+			continue;
+
+		/* Todo: Skip replace target for now. */
+		if (device->is_tgtdev_for_dev_replace)
+			continue;
+		if (!device->dev_stats_valid)
+			continue;
+
+		c_err = atomic_read(&device->new_critical_errs);
+		atomic_sub(c_err, &device->new_critical_errs);
+		if (c_err) {
+			btrfs_crit_in_rcu(fs_info,
+				"%s: Fatal write/flush error",
+				rcu_str_deref(device->name));
+			btrfs_mark_device_failed(device);
+		}
+	}
+	mutex_unlock(&fs_info->volume_mutex);
+}
+
+static int health_kthread(void *arg)
+{
+	struct btrfs_root *root = arg;
+
+	do {
+		/* Todo: rename the function below */
+		if (btrfs_need_cleaner_sleep(root->fs_info))
+			goto sleep;
+
+		if (!mutex_trylock(&root->fs_info->health_mutex))
+			goto sleep;
+
+		if (btrfs_need_cleaner_sleep(root->fs_info)) {
+			mutex_unlock(&root->fs_info->health_mutex);
+			goto sleep;
+		}
+
+		/* Check device health */
+		btrfs_check_device_fatal_errors(root);
+
+		mutex_unlock(&root->fs_info->health_mutex);
+
+sleep:
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!kthread_should_stop())
+			schedule();
+		__set_current_state(TASK_RUNNING);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
 static int transaction_kthread(void *arg)
 {
 	struct btrfs_root *root = arg;
@@ -1969,6 +2033,7 @@ static int transaction_kthread(void *arg)
 			btrfs_end_transaction(trans);
 		}
 sleep:
+		wake_up_process(fs_info->health_kthread);
 		wake_up_process(fs_info->cleaner_kthread);
 		mutex_unlock(&fs_info->transaction_kthread_mutex);
 
@@ -2713,6 +2778,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->chunk_mutex);
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
+	mutex_init(&fs_info->health_mutex);
 	mutex_init(&fs_info->volume_mutex);
 	mutex_init(&fs_info->ro_block_group_mutex);
 	init_rwsem(&fs_info->commit_root_sem);
@@ -3049,11 +3115,16 @@ int open_ctree(struct super_block *sb,
 	if (IS_ERR(fs_info->cleaner_kthread))
 		goto fail_sysfs;
 
+	fs_info->health_kthread = kthread_run(health_kthread, tree_root,
+						"btrfs-health");
+	if (IS_ERR(fs_info->health_kthread))
+		goto fail_cleaner;
+
 	fs_info->transaction_kthread = kthread_run(transaction_kthread,
 						   tree_root,
 						   "btrfs-transaction");
 	if (IS_ERR(fs_info->transaction_kthread))
-		goto fail_cleaner;
+		goto fail_health;
 
 	if (!btrfs_test_opt(fs_info, NOSSD) &&
 	    !fs_info->fs_devices->rotating) {
@@ -3222,6 +3293,10 @@ int open_ctree(struct super_block *sb,
 	kthread_stop(fs_info->transaction_kthread);
 	btrfs_cleanup_transaction(fs_info);
 	btrfs_free_fs_roots(fs_info);
+
+fail_health:
+	kthread_stop(fs_info->health_kthread);
+
 fail_cleaner:
 	kthread_stop(fs_info->cleaner_kthread);
 
@@ -3896,6 +3971,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
 
 	kthread_stop(fs_info->transaction_kthread);
 	kthread_stop(fs_info->cleaner_kthread);
+	kthread_stop(fs_info->health_kthread);
 
 	set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 06e7cf4cef81..18dabd0364bf 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -247,6 +247,7 @@ static struct btrfs_device *__alloc_device(void)
 	spin_lock_init(&dev->reada_lock);
 	atomic_set(&dev->reada_in_flight, 0);
 	atomic_set(&dev->dev_stats_ccnt, 0);
+	atomic_set(&dev->new_critical_errs, 0);
 	btrfs_device_data_ordered_init(dev);
 	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 05b150c03995..9328a5d12e78 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -163,6 +163,7 @@ struct btrfs_device {
 	/* Counter to record the change of device stats */
 	atomic_t dev_stats_ccnt;
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
+	atomic_t new_critical_errs;
 };
 
 /*
@@ -513,6 +514,9 @@ static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
 	atomic_inc(dev->dev_stat_values + index);
 	smp_mb__before_atomic();
 	atomic_inc(&dev->dev_stats_ccnt);
+	if (index == BTRFS_DEV_STAT_WRITE_ERRS ||
+		index == BTRFS_DEV_STAT_FLUSH_ERRS)
+		atomic_inc(&dev->new_critical_errs);
 }
 
 static inline int btrfs_dev_stat_read(struct btrfs_device *dev,
-- 
2.7.0


* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-03 15:59 ` [PATCH v8 2/2] btrfs: check device for critical errors and mark failed Anand Jain
@ 2017-10-04 20:11   ` Liu Bo
  2017-10-05 11:07     ` Austin S. Hemmelgarn
  2017-10-05 13:56     ` Anand Jain
  0 siblings, 2 replies; 15+ messages in thread
From: Liu Bo @ 2017-10-04 20:11 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
> From: Anand Jain <Anand.Jain@oracle.com>
> 
> Write and flush errors are critical errors, upon which the device fd
> must be closed and marked as failed.
>

Can we defer the job of closing the device to umount?

We can go mark the device failed and skip it while doing read/write,
and umount can do the cleanup work.

That way we don't need a dedicated thread looping around to detect a
rare situation; a rough sketch of the idea follows.
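
A minimal sketch of that alternative (the ->failed check is the new
bit and is hypothetical; the surrounding lines follow the existing
missing-device handling in btrfs_map_bio()):

	/* in btrfs_map_bio(): treat a failed device like a missing one */
	if (!dev || !dev->bdev || dev->failed ||
	    (bio_op(first_bio) == REQ_OP_WRITE && !dev->writeable)) {
		bbio_error(bbio, first_bio, logical);
		continue;
	}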

Thanks,

-liubo


* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-04 20:11   ` Liu Bo
@ 2017-10-05 11:07     ` Austin S. Hemmelgarn
  2017-10-06 23:33       ` Liu Bo
  2017-10-05 13:56     ` Anand Jain
  1 sibling, 1 reply; 15+ messages in thread
From: Austin S. Hemmelgarn @ 2017-10-05 11:07 UTC (permalink / raw)
  To: bo.li.liu, Anand Jain; +Cc: linux-btrfs

On 2017-10-04 16:11, Liu Bo wrote:
> On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
>> From: Anand Jain <Anand.Jain@oracle.com>
>>
>> Write and flush errors are critical errors, upon which the device fd
>> must be closed and marked as failed.
>>
> 
> Can we defer the job of closing device to umount?
> 
> We can go mark the device failed and skip it while doing read/write,
> and umount can do the cleanup work.
> 
> That way we don't need a dedicated thread looping around to detect a
> rare situation.
If BTRFS doesn't close the device, then it's 100% guaranteed that, if 
it reconnects, it will show up under a different device node.  It would 
also mean that the device node stays visible when there is in fact no 
device connected to it, which is a pain from a monitoring perspective.


* [PATCH v8.1 2/2] btrfs: mark device failed for write and flush errors
  2017-10-03 15:59 [PATCH v8 0/2] [RFC] Introduce device state 'failed' Anand Jain
  2017-10-03 15:59 ` [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed Anand Jain
  2017-10-03 15:59 ` [PATCH v8 2/2] btrfs: check device for critical errors and mark failed Anand Jain
@ 2017-10-05 13:54 ` Anand Jain
  2 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2017-10-05 13:54 UTC (permalink / raw)
  To: linux-btrfs

This patch marks the device as failed for write/flush errors
so that it can be brought out of the RW device list. However,
if a chunk stripe becomes incomplete due to the missing device
then the volume/raid fails; this situation is handled by
calling btrfs_handle_fs_error(), which puts the whole FS into
RDONLY.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
This patch is to replace
  [PATCH v8 2/2] btrfs: check device for critical errors and mark failed

 fs/btrfs/volumes.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 05b150c03995..53397d077e91 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -501,6 +501,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 				u64 chunk_offset, u64 chunk_size);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 		       struct btrfs_fs_info *fs_info, u64 chunk_offset);
+void btrfs_mark_device_failed(struct btrfs_device *dev);
 
 static inline int btrfs_dev_stats_dirty(struct btrfs_device *dev)
 {
@@ -513,6 +514,10 @@ static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
 	atomic_inc(dev->dev_stat_values + index);
 	smp_mb__before_atomic();
 	atomic_inc(&dev->dev_stats_ccnt);
+
+	if (index == BTRFS_DEV_STAT_WRITE_ERRS ||
+		index == BTRFS_DEV_STAT_FLUSH_ERRS)
+		btrfs_mark_device_failed(dev);
 }
 
 static inline int btrfs_dev_stat_read(struct btrfs_device *dev,
-- 
2.13.1



* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-04 20:11   ` Liu Bo
  2017-10-05 11:07     ` Austin S. Hemmelgarn
@ 2017-10-05 13:56     ` Anand Jain
  2017-10-06 23:56       ` Liu Bo
  1 sibling, 1 reply; 15+ messages in thread
From: Anand Jain @ 2017-10-05 13:56 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs



On 10/05/2017 04:11 AM, Liu Bo wrote:
> On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
>> From: Anand Jain <Anand.Jain@oracle.com>
>>
>> Write and flush errors are critical errors, upon which the device fd
>> must be closed and marked as failed.
>>
> 
> Can we defer the job of closing device to umount?

  Originally I think it was designed to handle a bad disk the
  same as a missing device, as below [1]. This patch just does that.
  But I am curious to know if there is any issue with closing a
  failed device?

[1]
-----
static int read_one_dev(struct btrfs_fs_info *fs_info,
                         struct extent_buffer *leaf,
                         struct btrfs_dev_item *dev_item)
::
             /*
              * this happens when a device that was properly setup
              * in the device info lists suddenly goes bad.
              * device->bdev is NULL, and so we have to set
              * device->missing to one here
              */
------

  So here I misuse the device-missing scenario (device->bdev = NULL)
  for the failed-device scenario (the code comment mentions it).

  Secondly, as in the patch 1/2 commit log and as Austin also
  mentioned, if the failed device close is deferred, it will also
  defer the cleanup of the failed device at the block layer.

  But yes, the block layer can still clean up when btrfs closes the
  device at the time of replace; also, in the long run, pulling
  out the failed disk would (as planned) trigger a udev notification
  into the btrfs sysfs interface, which can then close the device.

  Anything that I have missed?

  Further, jfyi, closing the failed device isn't related to the
  looping for device failure though.

> We can go mark the device failed and skip it while doing read/write,
> and umount can do the cleanup work.

  On servers a need for umount is considered downtime; things
  shouldn't wait for umount to clean up.

> That way we don't need a dedicated thread looping around to detect a
> rare situation.

  I think it's better to make it event based instead of thread
  looping, please take a look:

    [PATCH v8.1 2/2] btrfs: mark device failed for write and flush errors

Thanks, Anand



* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-05 11:07     ` Austin S. Hemmelgarn
@ 2017-10-06 23:33       ` Liu Bo
  2017-10-09 11:58         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 15+ messages in thread
From: Liu Bo @ 2017-10-06 23:33 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Anand Jain, linux-btrfs

On Thu, Oct 05, 2017 at 07:07:44AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-04 16:11, Liu Bo wrote:
> > On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
> > > From: Anand Jain <Anand.Jain@oracle.com>
> > > 
> > > Write and flush errors are critical errors, upon which the device fd
> > > must be closed and marked as failed.
> > > 
> > 
> > Can we defer the job of closing device to umount?
> > 
> > We can go mark the device failed and skip it while doing read/write,
> > and umount can do the cleanup work.
> > 
> > That way we don't need a dedicated thread looping around to detect a
> > rare situation.
> If BTRFS doesn't close the device, then it's 100% guaranteed if it
> reconnects that it will show up under a different device node.  It would
> also mean that the device node stays visible when there is in fact no device
> connected to it, which is a pain from a monitoring perspective.

I see, you're assuming that these errors are due to disconnection of
disks; it could also be bad sectors (although almost impossible on
enterprise hard disks) or some other errors across the stack.

I do agree that cleanup needs to be done if the disk got disconnected,
but rather than doing the cleanup here, a udev rule is needed to
handle such an event.


thanks,
-liubo


* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-05 13:56     ` Anand Jain
@ 2017-10-06 23:56       ` Liu Bo
  2017-10-08 14:23         ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Liu Bo @ 2017-10-06 23:56 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Thu, Oct 05, 2017 at 09:56:59PM +0800, Anand Jain wrote:
> 
> 
> On 10/05/2017 04:11 AM, Liu Bo wrote:
> > On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
> > > From: Anand Jain <Anand.Jain@oracle.com>
> > > 
> > > Write and flush errors are critical errors, upon which the device fd
> > > must be closed and marked as failed.
> > > 
> > 
> > Can we defer the job of closing device to umount?
> 
>  Originally I think it was designed to handle the bad disk
>  same as missing device. as below [1]. This patch just does that.
>  But I am curious to know if there is any issue to close failed
>  device ?
> 
> [1]
> -----
> static int read_one_dev(struct btrfs_fs_info *fs_info,
>                         struct extent_buffer *leaf,
>                         struct btrfs_dev_item *dev_item)
> ::
>             /*
>              * this happens when a device that was properly setup
>              * in the device info lists suddenly goes bad.
>              * device->bdev is NULL, and so we have to set
>              * device->missing to one here
>              */
> ------
> 
>  So here I misuse the device missing scenario (device->bdev = NULL)
>  to the failed device scenario. (code comment mentioned it).
>

This is OK to me, we can unify this stuff later; for now, ->missing
is to bypass the failed disk for r/w.

>  Secondly as in the patch 1/2 commit log and also Austin mentioned it,
>  if failed device close is deferred it will also defer the cleanup of
>  the failed device at the block layer.
> 
>  But yes block layer can still clean up when btrfs closes the
>  device at the time of replace, also where in the long run, pulling
>  out of the failed disk would (planned) trigger a udev notification
>  into the btrfs sysfs interface and it can close the device.
> 
>  Anything that I have missed ?
>

I'm more inclined to do cleanup by making udev call 'device remove',
which is supposed to do all the validation checks you need done
here.

>  Further, jfyi closing of failed device isn't related to the
>  looping for device failure though.
> 
> > We can go mark the device failed and skip it while doing read/write,
> > and umount can do the cleanup work.
> 
>  In servers a need of umount is considered as downtime, things
>  shouldn't wait for umount to cleanup.
>

OK.

> > That way we don't need a dedicated thread looping around to detect a
> > rare situation.
> 
>  I think its better to make it event based instead of thread looping,
>  pls take a look..
> 
>    [PATCH v8.1 2/2] btrfs: mark device failed for write and flush errors
> 

Event based is good; for anything about a failing disk, we would have
to follow what md is doing eventually, i.e. let udev handle it.

thanks,
-liubo


* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-06 23:56       ` Liu Bo
@ 2017-10-08 14:23         ` Anand Jain
  2017-10-13 18:46           ` Liu Bo
  0 siblings, 1 reply; 15+ messages in thread
From: Anand Jain @ 2017-10-08 14:23 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs



On 10/07/2017 07:56 AM, Liu Bo wrote:
> On Thu, Oct 05, 2017 at 09:56:59PM +0800, Anand Jain wrote:
>>
>>
>> On 10/05/2017 04:11 AM, Liu Bo wrote:
>>> On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
>>>> From: Anand Jain <Anand.Jain@oracle.com>
>>>>
>>>> Write and flush errors are critical errors, upon which the device fd
>>>> must be closed and marked as failed.
>>>>
>>>
>>> Can we defer the job of closing device to umount?
>>
>>   Originally I think it was designed to handle the bad disk
>>   same as missing device. as below [1]. This patch just does that.

>>   But I am curious to know if there is any issue to close failed
>>   device ?

    Anything on this ?


>> [1]
>> -----
>> static int read_one_dev(struct btrfs_fs_info *fs_info,
>>                          struct extent_buffer *leaf,
>>                          struct btrfs_dev_item *dev_item)
>> ::
>>              /*
>>               * this happens when a device that was properly setup
>>               * in the device info lists suddenly goes bad.
>>               * device->bdev is NULL, and so we have to set
>>               * device->missing to one here
>>               */
>> ------
>>
>>   So here I misuse the device missing scenario (device->bdev = NULL)
>>   to the failed device scenario. (code comment mentioned it).
>>
> 
> This is OK to me, we can unify these stuff later, for now, ->missing
> is to bypass this failed disk for r/w.
> 
>>   Secondly as in the patch 1/2 commit log and also Austin mentioned it,
>>   if failed device close is deferred it will also defer the cleanup of
>>   the failed device at the block layer.
>>
>>   But yes block layer can still clean up when btrfs closes the
>>   device at the time of replace, also where in the long run, pulling
>>   out of the failed disk would (planned) trigger a udev notification
>>   into the btrfs sysfs interface and it can close the device.
>>
>>   Anything that I have missed ?
>>
> 
> I'm more inclined to do cleanup by making udev call 'device remove',
> which is supposed to do all the validation checks you need to done
> here.

    Validation check like what ?

Thanks, Anand


* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-06 23:33       ` Liu Bo
@ 2017-10-09 11:58         ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 15+ messages in thread
From: Austin S. Hemmelgarn @ 2017-10-09 11:58 UTC (permalink / raw)
  To: bo.li.liu; +Cc: Anand Jain, linux-btrfs

On 2017-10-06 19:33, Liu Bo wrote:
> On Thu, Oct 05, 2017 at 07:07:44AM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-10-04 16:11, Liu Bo wrote:
>>> On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
>>>> From: Anand Jain <Anand.Jain@oracle.com>
>>>>
>>>> Write and flush errors are critical errors, upon which the device fd
>>>> must be closed and marked as failed.
>>>>
>>>
>>> Can we defer the job of closing device to umount?
>>>
>>> We can go mark the device failed and skip it while doing read/write,
>>> and umount can do the cleanup work.
>>>
>>> That way we don't need a dedicated thread looping around to detect a
>>> rare situation.
>> If BTRFS doesn't close the device, then it's 100% guaranteed if it
>> reconnects that it will show up under a different device node.  It would
>> also mean that the device node stays visible when there is in fact no device
>> connected to it, which is a pain from a monitoring perspective.
> 
> I see, you're assuming that these errors are due to disconnection of
> disks, it could be bad sectors (although almost impossible from
> enterprise hard disks) or some other errors across the stack.
And we need to handle any of these cases.  If BTRFS is going to claim 
handling of device failures, it needs to handle all types of device 
failure it reasonably can without making assumptions about what type of 
failure occurred.
> 
> I do agree that cleanup needs to be done if disk got disconnected, but
> not doing cleanup here, a udev rule is needed to handle such an event.
And why exactly does it need to involve userspace beyond notifying that 
something bad happened?  We know the failure happened, why should we 
tell userspace about it and require userspace to tell us to handle it? 
It's not like this is a performance critical path since it won't happen 
very often, and the primary concern is data safety, not performance, so 
why exactly do we need to push something that isn't a policy decision 
(and arguably isn't a decision, if we're not using a device, we 
shouldn't be holding it open, period) out to userspace?

Looking at it a bit differently though, every time I've seen any write 
or flush errors, either the connection to the device was unstable, or 
the storage device was failing, and neither case is something that 
should require userspace to make a decision about how to handle things.

To be entirely honest though, from a system administrator's perspective, 
I would actually expect a device getting disconnected (or otherwise 
disappearing off the bus) to end up in a 'Missing' state, not a 'Failed' 
state.  I see no reason to treat it any differently than if it 
disappeared before the volume was mounted, other than the requirement to 
actually handle it.  That type of handling _will_ require userspace 
involvement (because I seriously doubt there will ever be proper 
notification from the block layer to holders that a device is gone), but 
that's also largely orthogonal to this patch set (there's no reason that 
we need to handle write or flush errors the same way).


* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-08 14:23         ` Anand Jain
@ 2017-10-13 18:46           ` Liu Bo
  2017-10-16  6:09             ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Liu Bo @ 2017-10-13 18:46 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Sun, Oct 08, 2017 at 10:23:58PM +0800, Anand Jain wrote:
> 
> 
> On 10/07/2017 07:56 AM, Liu Bo wrote:
> > On Thu, Oct 05, 2017 at 09:56:59PM +0800, Anand Jain wrote:
> > > 
> > > 
> > > On 10/05/2017 04:11 AM, Liu Bo wrote:
> > > > On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
> > > > > From: Anand Jain <Anand.Jain@oracle.com>
> > > > > 
> > > > > Write and flush errors are critical errors, upon which the device fd
> > > > > must be closed and marked as failed.
> > > > > 
> > > > 
> > > > Can we defer the job of closing device to umount?
> > > 
> > >   Originally I think it was designed to handle the bad disk
> > >   same as missing device. as below [1]. This patch just does that.
> 
> > >   But I am curious to know if there is any issue to close failed
> > >   device ?
> 
>    Anything on this ?

No big deal, it's just different from md's behavior, but surely it
doesn't make a big difference whether we close the device fd here or not.

> 
> 
> > > [1]
> > > -----
> > > static int read_one_dev(struct btrfs_fs_info *fs_info,
> > >                          struct extent_buffer *leaf,
> > >                          struct btrfs_dev_item *dev_item)
> > > ::
> > >              /*
> > >               * this happens when a device that was properly setup
> > >               * in the device info lists suddenly goes bad.
> > >               * device->bdev is NULL, and so we have to set
> > >               * device->missing to one here
> > >               */
> > > ------
> > > 
> > >   So here I misuse the device missing scenario (device->bdev = NULL)
> > >   to the failed device scenario. (code comment mentioned it).
> > > 
> > 
> > This is OK to me, we can unify these stuff later, for now, ->missing
> > is to bypass this failed disk for r/w.
> > 
> > >   Secondly as in the patch 1/2 commit log and also Austin mentioned it,
> > >   if failed device close is deferred it will also defer the cleanup of
> > >   the failed device at the block layer.
> > > 
> > >   But yes block layer can still clean up when btrfs closes the
> > >   device at the time of replace, also where in the long run, pulling
> > >   out of the failed disk would (planned) trigger a udev notification
> > >   into the btrfs sysfs interface and it can close the device.
> > > 
> > >   Anything that I have missed ?
> > > 
> > 
> > I'm more inclined to do cleanup by making udev call 'device remove',
> > which is supposed to do all the validation checks you need to done
> > here.
> 
>    Validation check like what ?

OK, I just realized that I was talking about what we should do upon
the device getting pulled out; in that case, a validation check of the
raid profile is needed, such as whether we can keep the raid profile
or need to degrade it, and 'device delete' can do that.

Although I agree with the fact that write/flush errors are critical
ones, it's still too strict if the device got only a few write/flush
errors, given that cable issues or timeouts from some layers may also
lead to those errors.  A compromise is to have a threshold that gives
drives another chance, along the lines of the sketch below.
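
A minimal sketch of such a threshold (the constant and the running
total are hypothetical, not part of this series):

	/* tolerate a few critical errors before failing the device */
	#define BTRFS_CRIT_ERR_THRESHOLD	5

	c_err = atomic_read(&device->new_critical_errs);
	if (c_err) {
		atomic_sub(c_err, &device->new_critical_errs);
		/* hypothetical per-device running total */
		device->total_critical_errs += c_err;
		if (device->total_critical_errs >= BTRFS_CRIT_ERR_THRESHOLD)
			btrfs_mark_device_failed(device);
	}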

thanks,

-liubo


* Re: [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed
  2017-10-03 15:59 ` [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed Anand Jain
@ 2017-10-13 18:47   ` Liu Bo
  2017-10-16  6:09     ` Anand Jain
  0 siblings, 1 reply; 15+ messages in thread
From: Liu Bo @ 2017-10-13 18:47 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Oct 03, 2017 at 11:59:19PM +0800, Anand Jain wrote:
> From: Anand Jain <Anand.Jain@oracle.com>
> 
> This patch provides helper functions to force a device to failed,
> and we need it for the following reasons,
> 1) a. It can be reported that device has failed when it does and
>    b. Close the device when it goes offline so that blocklayer can
>       cleanup
> 2) Identify the candidate for the auto replace
> 3) Stop further RW to the failing device and
> 4) A device in the multi device btrfs may fail, but as of now in
>    some system config whole of btrfs gets unmounted.
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
> ---
> V8: General misc cleanup. Based on v4.14-rc2
> 
>  fs/btrfs/volumes.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  15 +++++++-
>  2 files changed, 118 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 0e8f16c305df..06e7cf4cef81 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7255,3 +7255,107 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info)
>  		fs_devices = fs_devices->seed;
>  	}
>  }
> +
> +static void do_close_device(struct work_struct *work)
> +{
> +	struct btrfs_device *device;
> +
> +	device = container_of(work, struct btrfs_device, rcu_work);
> +
> +	if (device->closing_bdev)
> +		blkdev_put(device->closing_bdev, device->mode);
> +
> +	device->closing_bdev = NULL;
> +}
> +
> +static void btrfs_close_one_device(struct rcu_head *head)
> +{
> +	struct btrfs_device *device;
> +
> +	device = container_of(head, struct btrfs_device, rcu);
> +
> +	INIT_WORK(&device->rcu_work, do_close_device);
> +	schedule_work(&device->rcu_work);
> +}

No need to schedule the job to a worker.

thanks,

-liubo


* Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
  2017-10-13 18:46           ` Liu Bo
@ 2017-10-16  6:09             ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2017-10-16  6:09 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs



On 10/14/2017 02:46 AM, Liu Bo wrote:
> On Sun, Oct 08, 2017 at 10:23:58PM +0800, Anand Jain wrote:
>>
>>
>> On 10/07/2017 07:56 AM, Liu Bo wrote:
>>> On Thu, Oct 05, 2017 at 09:56:59PM +0800, Anand Jain wrote:
>>>>
>>>>
>>>> On 10/05/2017 04:11 AM, Liu Bo wrote:
>>>>> On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote:
>>>>>> From: Anand Jain <Anand.Jain@oracle.com>
>>>>>>
>>>>>> Write and flush errors are critical errors, upon which the device fd
>>>>>> must be closed and marked as failed.
>>>>>>
>>>>>
>>>>> Can we defer the job of closing device to umount?
>>>>
>>>>    Originally I think it was designed to handle the bad disk
>>>>    same as missing device. as below [1]. This patch just does that.
>>
>>>>    But I am curious to know if there is any issue to close failed
>>>>    device ?
>>
>>     Anything on this ?
> 
> No big deal, it's just different with md's behavior, but surely it
> doesn't make a big difference whether we close device fd here or not.

  Ok.

>>
>>
>>>> [1]
>>>> -----
>>>> static int read_one_dev(struct btrfs_fs_info *fs_info,
>>>>                           struct extent_buffer *leaf,
>>>>                           struct btrfs_dev_item *dev_item)
>>>> ::
>>>>               /*
>>>>                * this happens when a device that was properly setup
>>>>                * in the device info lists suddenly goes bad.
>>>>                * device->bdev is NULL, and so we have to set
>>>>                * device->missing to one here
>>>>                */
>>>> ------
>>>>
>>>>    So here I misuse the device missing scenario (device->bdev = NULL)
>>>>    to the failed device scenario. (code comment mentioned it).
>>>>
>>>
>>> This is OK to me, we can unify these stuff later, for now, ->missing
>>> is to bypass this failed disk for r/w.
>>>
>>>>    Secondly as in the patch 1/2 commit log and also Austin mentioned it,
>>>>    if failed device close is deferred it will also defer the cleanup of
>>>>    the failed device at the block layer.
>>>>
>>>>    But yes block layer can still clean up when btrfs closes the
>>>>    device at the time of replace, also where in the long run, pulling
>>>>    out of the failed disk would (planned) trigger a udev notification
>>>>    into the btrfs sysfs interface and it can close the device.
>>>>
>>>>    Anything that I have missed ?
>>>>
>>>
>>> I'm more inclined to do cleanup by making udev call 'device remove',
>>> which is supposed to do all the validation checks you need to done
>>> here.
>>
>>     Validation check like what ?
> 
> OK, just realize that I was talking about what we should do upon
> device getting pulled out, in that case, validation check about raid
> profile is needed, such as can we keep the raid profile or do we need
> to degrade raid proflie, and 'device delete' can do that.

  OK.

> Although I agree with the fact that write/flush errors are critical
> ones,


> it's still too strict if it got only a few write/flush errors,
> given cable issues or timeout from some layers may also lead to those
> errors.  A compromised way is to have a threshold to give drives
> another chance.


  Fundamentally, IO retries are the IO module's job; it knows what
  type of IO error is worth a retry. If the FS is told an IO has
  failed, it should trust that the IO module has done its best effort.
  Adding additional retries at the FS layer is _not_ a good idea - it
  would affect the time to fail over, which is very crucial for HA
  (high availability) configurations.

  Instead, to deal with spurious IO errors, the plan is to create an
  offline device state, as I wrote about here [1].

    [1]
    https://www.spinics.net/lists/linux-btrfs/msg49643.html

  I removed the offline part here to keep the changes specific to the
  trigger, and it can be added later.


Thanks, Anand


* Re: [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed
  2017-10-13 18:47   ` Liu Bo
@ 2017-10-16  6:09     ` Anand Jain
  0 siblings, 0 replies; 15+ messages in thread
From: Anand Jain @ 2017-10-16  6:09 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs



On 10/14/2017 02:47 AM, Liu Bo wrote:
> On Tue, Oct 03, 2017 at 11:59:19PM +0800, Anand Jain wrote:
>> From: Anand Jain <Anand.Jain@oracle.com>
>>
>> This patch provides helper functions to force a device to failed,
>> and we need it for the following reasons,
>> 1) a. It can be reported that device has failed when it does and
>>     b. Close the device when it goes offline so that blocklayer can
>>        cleanup
>> 2) Identify the candidate for the auto replace
>> 3) Stop further RW to the failing device and
>> 4) A device in the multi device btrfs may fail, but as of now in
>>     some system config whole of btrfs gets unmounted.
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
>> ---
>> V8: General misc cleanup. Based on v4.14-rc2
>>
>>   fs/btrfs/volumes.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   fs/btrfs/volumes.h |  15 +++++++-
>>   2 files changed, 118 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 0e8f16c305df..06e7cf4cef81 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -7255,3 +7255,107 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info)
>>   		fs_devices = fs_devices->seed;
>>   	}
>>   }
>> +
>> +static void do_close_device(struct work_struct *work)
>> +{
>> +	struct btrfs_device *device;
>> +
>> +	device = container_of(work, struct btrfs_device, rcu_work);
>> +
>> +	if (device->closing_bdev)
>> +		blkdev_put(device->closing_bdev, device->mode);
>> +
>> +	device->closing_bdev = NULL;
>> +}
>> +
>> +static void btrfs_close_one_device(struct rcu_head *head)
>> +{
>> +	struct btrfs_device *device;
>> +
>> +	device = container_of(head, struct btrfs_device, rcu);
>> +
>> +	INIT_WORK(&device->rcu_work, do_close_device);
>> +	schedule_work(&device->rcu_work);
>> +}
> 
> No need to schedule the job to a worker.

  Right. Will fix it. Thanks.

-Anand
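
The fix could look roughly like the sketch below (not the final
version: it assumes both the RCU callback and the worker are dropped,
with the bdev put synchronously once synchronize_rcu() guarantees no
reader can still see it):

void btrfs_force_device_close(struct btrfs_device *device)
{
	...
	device->closing_bdev = device->bdev;
	device->bdev = NULL;

	mutex_unlock(&fs_devices->fs_info->chunk_mutex);
	mutex_unlock(&fs_devices->device_list_mutex);

	/* Wait out any reader that may still dereference device->bdev. */
	synchronize_rcu();

	if (device->closing_bdev)
		blkdev_put(device->closing_bdev, device->mode);
	device->closing_bdev = NULL;
	...
}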


> thanks,
> 
> -liubo
> 
>> +
>> +void btrfs_force_device_close(struct btrfs_device *device)
>> +{
>> +	struct btrfs_fs_info *fs_info;
>> +	struct btrfs_fs_devices *fs_devices;
>> +
>> +	fs_devices = device->fs_devices;
>> +	fs_info = fs_devices->fs_info;
>> +
>> +	btrfs_sysfs_rm_device_link(fs_devices, device);
>> +
>> +	mutex_lock(&fs_devices->device_list_mutex);
>> +	mutex_lock(&fs_devices->fs_info->chunk_mutex);
>> +
>> +	btrfs_assign_next_active_device(fs_devices->fs_info, device, NULL);
>> +
>> +	if (device->bdev)
>> +		fs_devices->open_devices--;
>> +
>> +	if (device->writeable) {
>> +		list_del_init(&device->dev_alloc_list);
>> +		fs_devices->rw_devices--;
>> +	}
>> +	device->writeable = 0;
>> +
>> +	/*
>> +	 * Todo: We have misused the missing flag all around, and here
>> +	 * too for now. (In the long run I want missing to only indicate
>> +	 * that the device was not present when the RAID was assembled.)
>> +	 */
>> +	device->missing = 1;
>> +	fs_devices->missing_devices++;
>> +	device->closing_bdev = device->bdev;
>> +	device->bdev = NULL;
>> +
>> +	call_rcu(&device->rcu, btrfs_close_one_device);
>> +
>> +	mutex_unlock(&fs_devices->fs_info->chunk_mutex);
>> +	mutex_unlock(&fs_devices->device_list_mutex);
>> +
>> +	rcu_barrier();
>> +
>> +	btrfs_warn_in_rcu(fs_info, "device %s failed",
>> +				rcu_str_deref(device->name));
>> +
>> +	/*
>> +	 * We lost one or more disks, which means the filesystem is no
>> +	 * longer as the user configured it. Mount should now show
>> +	 * degraded.
>> +	 */
>> +	btrfs_set_opt(fs_info->mount_opt, DEGRADED);
>> +
>> +	/*
>> +	 * Now, having lost one of the devices, check whether the chunk
>> +	 * stripe is incomplete and handle a fatal error if needed.
>> +	 */
>> +	if (!btrfs_check_rw_degradable(fs_info))
>> +		btrfs_handle_fs_error(fs_info, -EIO,
>> +				"devices below critical level");
>> +}
>> +
>> +void btrfs_mark_device_failed(struct btrfs_device *dev)
>> +{
>> +	struct btrfs_fs_devices *fs_devices = dev->fs_devices;
>> +
>> +	/* This shouldn't be called if device is already missing */
>> +	if (dev->missing || !dev->bdev)
>> +		return;
>> +	if (dev->failed)
>> +		return;
>> +	dev->failed = 1;
>> +
>> +	/* Last RW device is requested to force close; let the FS handle it. */
>> +	if (fs_devices->rw_devices == 1) {
>> +		btrfs_handle_fs_error(fs_devices->fs_info, -EIO,
>> +					"Last RW device failed");
>> +		return;
>> +	}
>> +
>> +	/* Point of no return starts here. */
>> +	btrfs_force_device_close(dev);
>> +}
>> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>> index 6108fdfec67f..05b150c03995 100644
>> --- a/fs/btrfs/volumes.h
>> +++ b/fs/btrfs/volumes.h
>> @@ -65,13 +65,26 @@ struct btrfs_device {
>>   	struct btrfs_pending_bios pending_sync_bios;
>>   
>>   	struct block_device *bdev;
>> +	struct block_device *closing_bdev;
>>   
>>   	/* the mode sent to blkdev_get */
>>   	fmode_t mode;
>>   
>>   	int writeable;
>>   	int in_fs_metadata;
>> +	/* missing: device not found at the time of mount */
>>   	int missing;
>> +	/* failed: device confirmed to have experienced critical io failure */
>> +	int failed;
>> +	/*
>> +	 * offline: the system, the user, or the block layer transport
>> +	 * has removed or offlined a device that was once present,
>> +	 * without going through unmount. Implies an interim
>> +	 * communication breakdown and not necessarily a candidate for
>> +	 * device replace; the device might come back online after user
>> +	 * intervention or after block transport layer error recovery.
>> +	 */
>> +	int offline;
>>   	int can_discard;
>>   	int is_tgtdev_for_dev_replace;
>>   	blk_status_t last_flush_error;
>> @@ -544,5 +557,5 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
>>   bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
>>   void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
>>   				 u8 *uuid);
>> -
>> +void btrfs_mark_device_failed(struct btrfs_device *dev);
>>   #endif
>> -- 
>> 2.7.0
>>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-10-16  6:09 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-03 15:59 [PATCH v8 0/2] [RFC] Introduce device state 'failed' Anand Jain
2017-10-03 15:59 ` [PATCH v8 1/2] btrfs: introduce device dynamic state transition to failed Anand Jain
2017-10-13 18:47   ` Liu Bo
2017-10-16  6:09     ` Anand Jain
2017-10-03 15:59 ` [PATCH v8 2/2] btrfs: check device for critical errors and mark failed Anand Jain
2017-10-04 20:11   ` Liu Bo
2017-10-05 11:07     ` Austin S. Hemmelgarn
2017-10-06 23:33       ` Liu Bo
2017-10-09 11:58         ` Austin S. Hemmelgarn
2017-10-05 13:56     ` Anand Jain
2017-10-06 23:56       ` Liu Bo
2017-10-08 14:23         ` Anand Jain
2017-10-13 18:46           ` Liu Bo
2017-10-16  6:09             ` Anand Jain
2017-10-05 13:54 ` [PATCH v8.1 2/2] btrfs: mark device failed for write and flush errors Anand Jain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.