* [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
@ 2016-04-12 14:15 Anand Jain
  2016-04-12 14:15 ` [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount Anand Jain
                   ` (16 more replies)
  0 siblings, 17 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

Thanks for various comments, tests and feedback.

Background: Spare device and Auto replace:
 A spare device is predominantly used to narrow the time window
 during which a raid array runs in degraded mode, because any further
 disk failure in that window could lead to catastrophic data loss.
 Data center storage generally has a couple of disks reserved as
 spares, so that one of them automatically kicks in to resilver
 the storage pool and bring the pool back to a healthy state.
 This is mainly a storage feature rather than a FS feature.
 I believe people acquainted with enterprise storage use cases
 will appreciate the need for it, and most/all enterprise
 storage has a spare device feature.

Btrfs device states:
 This patch set adds the 'failed' state and makes provision for the
 'offline' state, giving two new device states. To summarize
 the various device states and their meanings:

 /* missing: device wasn't found at the time of mount */
 int missing;

 /*
  * failed: device confirmed to have experienced critical
  * io failure
  */
 int failed;

 /*
  * offline: there is no confirmation that the disk has
  * failed, only an interim communication breakdown, so the
  * device is not necessarily a candidate for device replace.
  * The device might come back online after user intervention
  * or after block transport layer error recovery.
  */
 int offline;
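
 To make the difference between the three flags concrete, here is a
 minimal user-space sketch. It is not code from this patch set: the
 struct and both helpers are hypothetical, and the rule that IO is
 skipped for an offline device is my assumption for illustration. The
 cover letter only states that a failed device gets no further IO and
 is the one picked up by auto replace.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the three flags described above. */
struct dev_state {
	int missing;	/* device wasn't found at the time of mount */
	int failed;	/* confirmed critical IO failure */
	int offline;	/* interim communication breakdown */
};

/* Only a confirmed failure makes the device an auto replace candidate. */
static bool needs_replace(const struct dev_state *s)
{
	return s->failed != 0;
}

/* Assumption: no IO is sent to a device in any of the three states. */
static bool can_do_io(const struct dev_state *s)
{
	return !s->missing && !s->failed && !s->offline;
}
```

 In this model an offline device is skipped for IO but is left alone
 by auto replace, since it may come back after error recovery.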


Device state transition tuning and visualization:
 Sysfs interfaces are planned to provide the required tuning of
 device state transitions and sensitivities, and visualization of
 device states. However, the sysfs framework that could provide such
 an interface is still being reviewed/tested and is not yet ready. So
 for testing and debugging these features I have used an updated
 version of the procfs patch which is on the ML:

  [PATCH] btrfs: debug: procfs-devlist: introduce procfs interface for
the device list for debugging

 I find the above patch very useful, easy to use (as compared to
 sysfs to visualize the device state) and stable.

This patch set does not depend on any of the sysfs patches as such.

Backward compatibility:
 Adds a new incompat feature flag
 (BTRFS_FEATURE_INCOMPAT_SPARE_DEV) to manage the spare device
 when older kernels are used. It has been tested to work fine
 with older kernel/progs versions.


Auto replace:
 Replace happens automatically: when any write or flush
 fails, the device is marked as failed, which
 stops any further IO attempts to that device. In the next
 commit cycle, auto replace picks a spare device to
 replace the failed device, so that the btrfs volume is back to a
 healthy state.
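
 The flow above can be sketched in user space as follows. This is only
 an illustration under my own assumptions, not the kernel code from
 this set: model_dev, model_fs, spare_pool, model_io_error() and
 model_commit() are all hypothetical names, and the real replace
 copies data to the target instead of just swapping a pointer.

```c
#include <stddef.h>

/* Hypothetical model types, not the kernel structures. */
struct model_dev {
	const char *path;
	int failed;
};

struct model_fs {
	struct model_dev *devs[4];
	size_t ndevs;
};

struct spare_pool {
	struct model_dev *spares[4];
	size_t nspares;
};

/* A write or flush error marks the device as failed. */
static void model_io_error(struct model_dev *dev)
{
	dev->failed = 1;
}

/*
 * At commit time, swap each failed device for a spare taken from the
 * global pool. Returns the number of devices replaced.
 */
static int model_commit(struct model_fs *fs, struct spare_pool *pool)
{
	int replaced = 0;

	for (size_t i = 0; i < fs->ndevs; i++) {
		if (!fs->devs[i]->failed)
			continue;
		if (pool->nspares == 0)
			break;	/* no spare left, the fs stays degraded */
		fs->devs[i] = pool->spares[--pool->nspares];
		replaced++;
	}
	return replaced;
}
```

 When the pool is empty the model leaves the failed device in place,
 which corresponds to the volume staying degraded until a spare is
 added.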

Per FSID spare vs Global spare:
 As of now only a global spare is supported, that is, spare(s)
 are shared by all the btrfs FS in the system. However, in the future
 there will be a fs_info->no_auto_replace tunable which the user can
 set to limit the use of the global spare.


Example use case:
 Below is an example of the spare setup.

 Add a spare device:
        btrfs spare add /dev/sde -f

 If a spare device was already added earlier,
 just run

        btrfs dev scan [/dev/sde]

 which will register the spare device with the kernel.

        btrfs fi show
         Label: none uuid: 52f170c1-725c-457d-8cfd-d57090460091
          Total devices 2 FS bytes used 112.00KiB
          devid 1 size 2.00GiB used 417.50MiB path /dev/sdc
          devid 2 size 2.00GiB used 417.50MiB path /dev/sdd

        Global spare
          device size 3.00GiB path /dev/sde


Patches:

Kernel:
 First, it needs Qu's per-chunk missing device patchset, which is
 included in this set (patches 1-5).

 Next, patches 6-9 add support for the spare device. A kernel without
 the spare feature is kept away from the spare device, and a kernel
 that supports the spare device refuses to mount it. Further,
 these patches provide helper functions to pick a spare device and
 to release a spare device back to the spare device pool.

 Patch 10 provides helper functions for auto replace.
 Patch 11 provides a helper function to bring a device to the failed state.
 Patch 12 marks a device as failed based on flush and write errors,
  and avoids any further IO to it.
 Finally, patch 13 triggers auto replace.

Progs:
 Needs the 4 btrfs-progs patches listed below, which add the
 sub-command 'spare' to manage the spare device. As of now deleting
 a spare device has to be done using wipefs. However, in the long run
 we would want a proper btrfs command to do that job.


v3->v4:
Kernel:
 a.
  Mainly bug fixes. Thanks to Yauhen for the bug reports.
  Fixed the issue of bdev not being null. Also fixed the
  issue where auto replace didn't check for
  mutually_exclusive_operation_running. In the process,
  the function force_device_close() changed quite a
  bit: bdev is now copied and nulled within the lock
  context, and close is called later on the copied bdev.
 b.
  Changed the wording from hot spare to spare device. Some
  legacy raid setups need a particular device
  order for some reason, so there the hot spare would copy
  back the replace target to the replaced disk. However,
  we don't need such a setup on modern hw and btrfs won't
  do it that way. To avoid any confusion I won't use the term
  hot spare here.

progs:
 No change. Same as v2.

V2->V3:
Kernel:
  Thanks to Yauhen and Austin for the review comments.
  Split patches 11 and 12 again, which were merged in V2.
  Patch numbers are reordered (sorry about that), but for the better.
  Fix rcu issue in btrfs_get_spare_device(); we don't need rcu
   as it's under uuid_mutex
  Fix rcu issue and check for the replace lock in
   btrfs_auto_replace_start()
  Cleanup old: casualty_kthread() new: health_kthread() with
    changes as per
    838fe188 'btrfs: cleaner_kthread() doesn't need explicit freeze'
    (thanks Yauhen)
  Yauhen reported this issue:
	When a disk is removed through the virtualbox interface.
	BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
	IP: generic_make_request_checks+0x4d/0x910
	::
 	bvec_alloc+0x5e/0x100
	generic_make_request+0x24/0x290
	submit_bio+0x67/0x140
	finish_rmw+0x409/0x570 [btrfs]
	full_stripe_write+0xa5/0xb0 [btrfs]
	raid56_parity_write+0xf5/0x180 [btrfs]
	btrfs_map_bio+0x105/0x300 [btrfs]
	btrfs_get_extent+0x83/0xb20 [btrfs]

	Status: So far the raid group profile would adapt to a lower suitable
	group profile when a device is missing/failed. This appears
	not to be happening with RAID56, OR there is stale IO which wasn't
	flushed out. Anyway, to have this fixed I am moving the patch
	  btrfs: introduce device dynamic state transition to offline or failed
	to the top in v3.
	But first we need a reliable test case, or a very carefully
	crafted test case, which can create this situation.

Progs:
  No change, same as V2.

V1->V2:
Kernel:
 (Based on tests and comments provided on the ML)
 a. Now transition_kthread() wakes up the casualty_kthread to check
    the device states, instead of doing that in transition_kthread()
    itself. Cleaner, and less pressure on transition_kthread().
 b. Dropped
     [PATCH 05/15] btrfs: optimize btrfs_check_degradable() for calls outside of barrier
    as it was a wrong patch and the optimization was incomplete.
 c. Merged patch
    btrfs: check for failed device and hot replace
      into
    btrfs: check device for critical errors and mark failed
    in an effort to make the changes described in (a) above.

Progs:
 a. Added a call to btrfs_register_one_device() when doing btrfs
    spare add

Anand Jain (8):
  btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
  btrfs: add check not to mount a spare device
  btrfs: support btrfs dev scan for spare device
  btrfs: provide framework to get and put a spare device
  btrfs: introduce helper functions to perform hot replace
  btrfs: introduce device dynamic state transition to offline or failed
  btrfs: check device for critical errors and mark failed
  btrfs: check for failed device and hot replace

Qu Wenruo (5):
  btrfs: Introduce a new function to check if all chunks a OK for
    degraded mount
  btrfs: Do per-chunk check for mount time check
  btrfs: Do per-chunk degraded check for remount
  btrfs: Allow barrier_all_devices to do per-chunk device check
  btrfs: Cleanup num_tolerated_disk_barrier_failures

 fs/btrfs/ctree.h       |  11 +-
 fs/btrfs/dev-replace.c |  43 ++++++++
 fs/btrfs/dev-replace.h |   1 +
 fs/btrfs/disk-io.c     | 231 +++++++++++++++++++++++++++-------------
 fs/btrfs/disk-io.h     |   2 -
 fs/btrfs/super.c       |  16 ++-
 fs/btrfs/volumes.c     | 280 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h     |  27 +++++
 8 files changed, 512 insertions(+), 99 deletions(-)

-- 
2.7.0

From 05f00e7e71ce03309ea6408f7dcc507e19860be6 Mon Sep 17 00:00:00 2001
From: Anand Jain <anand.jain@oracle.com>
Date: Sat, 2 Apr 2016 07:08:36 +0800
Subject: [PATCH 00/13 v3] Introduce device state 'failed', Hot spare and Auto replace

Thanks for various comments, tests and feedback.

Background: Hot spare and Auto replace:
 A hot spare is predominantly used to narrow the time window
 of degraded mode, during which any further disk
 failure might lead to catastrophic data loss. Data center
 storage generally has a couple of disks reserved as spares,
 so that one of them automatically kicks in to resilver
 the storage pool and bring the pool back to a healthy state.
 This is mainly a storage feature rather than a FS feature.
 I believe people acquainted with enterprise storage use cases
 will appreciate the need for it, and most/all enterprise
 storage has a hot spare feature.

Btrfs device states:
 This patch set adds the 'failed' state and makes provision for the
 'offline' state, giving two new device states. To summarize
 the various device states and their meanings:

 /* missing: device wasn't found at the time of mount */
 int missing;

 /*
  * failed: device confirmed to have experienced critical
  * io failure
  */
 int failed;

 /*
  * offline: there is no confirmation that the disk has
  * failed, only an interim communication breakdown, so the
  * device is not necessarily a candidate for device replace.
  * The device might come back online after user intervention
  * or after block transport layer error recovery.
  */
 int offline;


Device state transition Tuning and visualization:
 Sysfs interfaces are planned to provide the required tuning of
 device state transitions and sensitivities, and visualization of
 device states. However, the sysfs framework that could provide such
 an interface is still being reviewed/tested and is not yet ready. So
 for testing and debugging these features I have used an updated
 version of the procfs patch which is on the ML:

  [PATCH] btrfs: debug: procfs-devlist: introduce procfs interface for
the device list for debugging

 I find the above patch very useful, easy to use (as compared to
 sysfs to visualize the device state) and stable.

This patch set does not depend on any of the sysfs patches as such.

Backward compatibility:
 Adds a new incompat feature flag
 (BTRFS_FEATURE_INCOMPAT_SPARE_DEV) to manage the spare device
 when older kernels are used. It has been tested to work fine
 with older kernel/progs versions.


Auto replace:
 Replace happens automatically: when any write or flush
 fails, the device is marked as failed, which
 stops any further IO attempts to that device. In the next
 commit cycle, auto replace picks a spare device to
 replace the failed device, so that the btrfs volume is back to a
 healthy state.

Per FSID spare vs Global spare:
 As of now only a global hot spare is supported, that is, hot spare(s)
 are shared by all the btrfs FS in the system. However, in the future
 there will be a fs_info->no_auto_replace tunable which the user can
 set to limit the use of the global spare.


Example use case:
 Below is an example of the hot spare setup.

 Add a spare device:
        btrfs spare add /dev/sde -f

 If a spare device was already added earlier,
 just run

        btrfs dev scan [/dev/sde]

 which will register the spare device with the kernel.

        btrfs fi show
         Label: none uuid: 52f170c1-725c-457d-8cfd-d57090460091
          Total devices 2 FS bytes used 112.00KiB
          devid 1 size 2.00GiB used 417.50MiB path /dev/sdc
          devid 2 size 2.00GiB used 417.50MiB path /dev/sdd

        Global spare
          device size 3.00GiB path /dev/sde


Patches:

Kernel:
 First, it needs Qu's per-chunk missing device patchset, which is
 included in this set (patches 1-5).

 Next, patches 6-9 add support for the spare device. A kernel without
 the spare feature is kept away from the spare device, and a kernel
 that supports the spare device refuses to mount it. Further,
 these patches provide helper functions to pick a spare device and
 to release a spare device back to the spare device pool.

 Patch 10 provides helper functions for auto replace.
 Patch 11 provides a helper function to bring a device to the failed state.
 Patch 12 marks a device as failed based on flush and write errors,
  and avoids any further IO to it.
 Finally, patch 13 triggers auto replace.

Progs:
 Needs the 4 btrfs-progs patches listed below, which add the
 sub-command 'spare' to manage the spare device. As of now deleting
 a spare device has to be done using wipefs. However, in the long run
 we would want a proper btrfs command to do that job.

V2->V3:
Kernel:
  Thanks to Yauhen and Austin for the review comments.
  Split patches 11 and 12 again, which were merged in V2.
  Patch numbers are reordered (sorry about that), but for the better.
  Fix rcu issue in btrfs_get_spare_device(); we don't need rcu
   as it's under uuid_mutex
  Fix rcu issue and check for the replace lock in
   btrfs_auto_replace_start()
  Cleanup old: casualty_kthread() new: health_kthread() with
    changes as per
    838fe188 'btrfs: cleaner_kthread() doesn't need explicit freeze'
    (thanks Yauhen)
  Yauhen reported this issue:
	When a disk is removed through the virtualbox interface.
	BUG: unable to handle kernel NULL pointer dereference at 0000000000000548
	IP: generic_make_request_checks+0x4d/0x910
	::
 	bvec_alloc+0x5e/0x100
	generic_make_request+0x24/0x290
	submit_bio+0x67/0x140
	finish_rmw+0x409/0x570 [btrfs]
	full_stripe_write+0xa5/0xb0 [btrfs]
	raid56_parity_write+0xf5/0x180 [btrfs]
	btrfs_map_bio+0x105/0x300 [btrfs]
	btrfs_get_extent+0x83/0xb20 [btrfs]

	Status: So far the raid group profile would adapt to a lower suitable
	group profile when a device is missing/failed. This appears
	not to be happening with RAID56, OR there is stale IO which wasn't
	flushed out. Anyway, to have this fixed I am moving the patch
	  btrfs: introduce device dynamic state transition to offline or failed
	to the top in v3.
	But first we need a reliable test case, or a very carefully
	crafted test case, which can create this situation.

Progs:
  No change, same as V2.

V1->V2:
Kernel:
 (Based on tests and comments provided on the ML)
 a. Now transition_kthread() wakes up the casualty_kthread to check
    the device states, instead of doing that in transition_kthread()
    itself. Cleaner, and less pressure on transition_kthread().
 b. Dropped
     [PATCH 05/15] btrfs: optimize btrfs_check_degradable() for calls outside of barrier
    as it was a wrong patch and the optimization was incomplete.
 c. Merged patch
    btrfs: check for failed device and hot replace
      into
    btrfs: check device for critical errors and mark failed
    in an effort to make the changes described in (a) above.

Progs:
 a. Added a call to btrfs_register_one_device() when doing btrfs
    spare add

Anand Jain (8):
  btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
  btrfs: add check not to mount a spare device
  btrfs: support btrfs dev scan for spare device
  btrfs: provide framework to get and put a spare device
  btrfs: introduce helper functions to perform hot replace
  btrfs: introduce device dynamic state transition to offline or failed
  btrfs: check device for critical errors and mark failed
  btrfs: check for failed device and hot replace

Qu Wenruo (5):
  btrfs: Introduce a new function to check if all chunks a OK for
    degraded mount
  btrfs: Do per-chunk check for mount time check
  btrfs: Do per-chunk degraded check for remount
  btrfs: Allow barrier_all_devices to do per-chunk device check
  btrfs: Cleanup num_tolerated_disk_barrier_failures

 fs/btrfs/ctree.h       |  11 +-
 fs/btrfs/dev-replace.c |  43 ++++++++
 fs/btrfs/dev-replace.h |   1 +
 fs/btrfs/disk-io.c     | 230 +++++++++++++++++++++++++++-------------
 fs/btrfs/disk-io.h     |   2 -
 fs/btrfs/super.c       |  16 ++-
 fs/btrfs/volumes.c     | 279 ++++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h     |  26 +++++
 8 files changed, 509 insertions(+), 99 deletions(-)


btrfs-progs:

Anand Jain (4):
  btrfs-progs: Introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV SB flags
  btrfs-progs: Introduce btrfs spare subcommand
  btrfs-progs: add fi show for spare
  btrfs-progs: add global spare device list to filesystem show

 Android.mk        |   2 +-
 Makefile.in       |   3 +-
 btrfs.c           |   1 +
 cmds-filesystem.c |   9 ++
 cmds-spare.c      | 292 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commands.h        |   2 +
 ctree.h           |   4 +-
 utils.h           |   1 +
 volumes.c         |   4 +
 volumes.h         |   2 +
 10 files changed, 317 insertions(+), 3 deletions(-)
 create mode 100644 cmds-spare.c

-- 
2.7.0



* [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 19:21   ` Yauhen Kharuzhy
  2016-04-12 14:15 ` [PATCH 02/13] btrfs: Do per-chunk check for mount time check Anand Jain
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Qu Wenruo <quwenruo@cn.fujitsu.com>

Introduce a new function, btrfs_check_degradable(), to judge whether all
chunks in a btrfs filesystem are OK for a degraded mount.

It provides a new basis for accurate btrfs mount/remount and even
runtime degraded checks, replacing the old one-size-fits-all method.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/volumes.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 64 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9d72dabdddfc..a351c5dd9e9b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7039,3 +7039,66 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 
 	call_rcu(&device->rcu, free_device);
 }
+
+/*
+ * Check if all chunks in the fs are OK for degraded mount
+ * For a >0 return value, the caller itself should additionally check
+ * whether the DEGRADED mount option is given.
+ *
+ * Return 0 if all chunks are OK.
+ * Return >0 if all chunks are degradable but not all OK.
+ * Return <0 if any chunk is not degradable or other bug.
+ */
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
+{
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	u64 next_start = 0;
+	int ret = 0;
+
+	if (flags & MS_RDONLY)
+		return 0;
+
+	read_lock(&map_tree->map_tree.lock);
+	em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)(-1));
+	/* No chunk at all? Should be a huge bug */
+	if (!em) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	while (em) {
+		struct map_lookup *map;
+		int missing = 0;
+		int max_tolerated;
+		int i;
+
+		map = (struct map_lookup *) em->bdev;
+		max_tolerated =
+			btrfs_get_num_tolerated_disk_barrier_failures(
+					map->type);
+		for (i = 0; i < map->num_stripes; i++) {
+			if (map->stripes[i].dev->missing)
+				missing++;
+		}
+		if (missing > max_tolerated) {
+			ret = -EIO;
+			btrfs_warn(fs_info,
+				   "missing devices(%d) exceeds the limit(%d), writeable mount is not allowed",
+				   missing, max_tolerated);
+			goto out;
+		} else if (missing)
+			ret = 1;
+		next_start = extent_map_end(em);
+
+		/*
+		 * Always search range [next_start, (u64)-1) to find the next
+		 * chunk map
+		 */
+		em = lookup_extent_mapping(&map_tree->map_tree, next_start,
+					   (u64)(-1) - next_start);
+	}
+out:
+	read_unlock(&map_tree->map_tree.lock);
+	return ret;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1939ebde63df..351431a3f5aa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -566,5 +566,6 @@ static inline void unlock_chunks(struct btrfs_root *root)
 struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags);
 
 #endif
-- 
2.7.0



* [PATCH 02/13] btrfs: Do per-chunk check for mount time check
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
  2016-04-12 14:15 ` [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:15 ` [PATCH 03/13] btrfs: Do per-chunk degraded check for remount Anand Jain
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Qu Wenruo <quwenruo@cn.fujitsu.com>

Now use btrfs_check_degradable() to do the mount-time degraded check.

With this patch, we can now mount in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
 As the single data chunk is only on sdb, it's OK to mount as degraded,
 since missing one device is OK for RAID1.

But it still fails in the following case, as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
 As the data chunk is only on sdb, it's not OK to mount it as degraded.
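
The two cases above can be reproduced with a small user-space model of
the per-chunk check. This is a sketch under stated assumptions, not the
kernel function: model_chunk and model_check_degradable() are
hypothetical, device 0 stands for sdb and device 1 for sdc, and the
filesystem is simplified to one RAID1 metadata chunk plus one single
data chunk. The return convention (0, >0, <0) follows the comment on
btrfs_check_degradable().

```c
#include <stddef.h>

#define MODEL_MAX_STRIPES 4

/* Hypothetical stand-in for one chunk mapping. */
struct model_chunk {
	int max_tolerated;			/* 0 for single, 1 for RAID1 */
	int num_stripes;
	int stripe_dev[MODEL_MAX_STRIPES];	/* index into missing[] */
};

/*
 * Return 0 if no chunk has a missing device, 1 if some chunks are
 * degraded but every chunk is within its own tolerance, and -1 if
 * any chunk has lost more devices than its profile tolerates.
 */
static int model_check_degradable(const struct model_chunk *chunks,
				  size_t nchunks, const int *missing)
{
	int ret = 0;

	for (size_t c = 0; c < nchunks; c++) {
		int nmissing = 0;

		for (int i = 0; i < chunks[c].num_stripes; i++) {
			if (missing[chunks[c].stripe_dev[i]])
				nmissing++;
		}
		if (nmissing > chunks[c].max_tolerated)
			return -1;
		if (nmissing)
			ret = 1;
	}
	return ret;
}
```

With chunks = { RAID1 on devices {0, 1}, single on device {0} }, losing
device 1 (sdc) gives 1, so a degraded mount is allowed, while losing
device 0 (sdb) gives -1 because the single data chunk tolerates no
missing device.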

Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Reported-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

[Btrfs: use btrfs_error instead of btrfs_err during mount]
Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/disk-io.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d01f89d130e0..4f91a049fbca 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2885,6 +2885,16 @@ int open_ctree(struct super_block *sb,
 		goto fail_tree_roots;
 	}
 
+	ret = btrfs_check_degradable(fs_info, fs_info->sb->s_flags);
+	if (ret < 0) {
+		btrfs_err(fs_info, "degraded writable mount failed %d", ret);
+		goto fail_tree_roots;
+	} else if (ret > 0 && !btrfs_test_opt(chunk_root, DEGRADED)) {
+		btrfs_warn(fs_info,
+			"Some device missing, but still degraded mountable, please mount with -o degraded option");
+		ret = -EACCES;
+		goto fail_tree_roots;
+	}
 	/*
 	 * keep the device that is marked to be the target device for the
 	 * dev_replace procedure
@@ -2988,14 +2998,6 @@ retry_root_backup:
 	}
 	fs_info->num_tolerated_disk_barrier_failures =
 		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-	if (fs_info->fs_devices->missing_devices >
-	     fs_info->num_tolerated_disk_barrier_failures &&
-	    !(sb->s_flags & MS_RDONLY)) {
-		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-			fs_info->fs_devices->missing_devices,
-			fs_info->num_tolerated_disk_barrier_failures);
-		goto fail_sysfs;
-	}
 
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
-- 
2.7.0



* [PATCH 03/13] btrfs: Do per-chunk degraded check for remount
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
  2016-04-12 14:15 ` [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount Anand Jain
  2016-04-12 14:15 ` [PATCH 02/13] btrfs: Do per-chunk check for mount time check Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:15 ` [PATCH 04/13] btrfs: Allow barrier_all_devices to do per-chunk device check Anand Jain
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Qu Wenruo <quwenruo@cn.fujitsu.com>

Same as for the mount-time check, use the new btrfs_check_degradable()
to do a per-chunk check.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

Btrfs: use btrfs_error instead of btrfs_err during remount

Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/super.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b4e15416704d..729f596b540a 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1767,11 +1767,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			goto restore;
 		}
 
-		if (fs_info->fs_devices->missing_devices >
-		     fs_info->num_tolerated_disk_barrier_failures &&
-		    !(*flags & MS_RDONLY)) {
+		ret = btrfs_check_degradable(fs_info, *flags);
+		if (ret < 0) {
+			btrfs_err(fs_info,
+				"degraded writable remount failed %d", ret);
+			goto restore;
+		} else if (ret > 0 && !btrfs_test_opt(root, DEGRADED)) {
 			btrfs_warn(fs_info,
-				"too many missing devices, writeable remount is not allowed");
+				"some device missing, but still degraded mountable, please remount with -o degraded option");
 			ret = -EACCES;
 			goto restore;
 		}
-- 
2.7.0



* [PATCH 04/13] btrfs: Allow barrier_all_devices to do per-chunk device check
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (2 preceding siblings ...)
  2016-04-12 14:15 ` [PATCH 03/13] btrfs: Do per-chunk degraded check for remount Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:15 ` [PATCH 05/13] btrfs: Cleanup num_tolerated_disk_barrier_failures Anand Jain
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Qu Wenruo <quwenruo@cn.fujitsu.com>

The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices(). But it can easily be changed to use the new
per-chunk degradable check framework.

Now btrfs_device has two extra members, representing send/wait
errors, which are set at write_dev_flush() time. They are then checked
in a way similar to, but more accurate than, the old code.
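
As a rough illustration of how the two new members are consumed, here
is a user-space sketch. It is hypothetical (model_bdev and
model_count_unusable() do not exist in the kernel) and only mirrors the
idea in the volumes.c hunk below: a device with a send or wait error is
counted like a missing one for the chunk being checked, and the
transient flags are cleared once they have been read.

```c
/* Hypothetical stand-in for the relevant btrfs_device members. */
struct model_bdev {
	int missing;
	int err_send;	/* flush could not be sent */
	int err_wait;	/* waiting for the flush failed */
};

/*
 * Count the devices of one chunk that cannot be relied on, then
 * reset the per-barrier error flags so that the next barrier starts
 * from a clean state.
 */
static int model_count_unusable(struct model_bdev **stripes, int num_stripes)
{
	int unusable = 0;

	for (int i = 0; i < num_stripes; i++) {
		struct model_bdev *dev = stripes[i];

		if (dev->missing || dev->err_send || dev->err_wait)
			unusable++;
		dev->err_send = 0;
		dev->err_wait = 0;
	}
	return unusable;
}
```

The count is then compared against the tolerance of that chunk's
profile, instead of comparing a global error count against a single
filesystem-wide limit.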

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/disk-io.c | 13 +++++--------
 fs/btrfs/volumes.c |  6 +++++-
 fs/btrfs/volumes.h |  4 ++++
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4f91a049fbca..9ad3667f5e71 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3496,8 +3496,6 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
 	struct list_head *head;
 	struct btrfs_device *dev;
-	int errors_send = 0;
-	int errors_wait = 0;
 	int ret;
 
 	/* send down all the barriers */
@@ -3506,7 +3504,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_send++;
+			dev->err_send = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3514,7 +3512,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
 		ret = write_dev_flush(dev, 0);
 		if (ret)
-			errors_send++;
+			dev->err_send = 1;
 	}
 
 	/* wait for all the barriers */
@@ -3522,7 +3520,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_wait++;
+			dev->err_wait = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3530,10 +3528,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
 		ret = write_dev_flush(dev, 1);
 		if (ret)
-			errors_wait++;
+			dev->err_wait = 1;
 	}
-	if (errors_send > info->num_tolerated_disk_barrier_failures ||
-	    errors_wait > info->num_tolerated_disk_barrier_failures)
+	if (btrfs_check_degradable(info, info->sb->s_flags) < 0)
 		return -EIO;
 	return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a351c5dd9e9b..d9cae4d7ba55 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7078,8 +7078,12 @@ int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
 			btrfs_get_num_tolerated_disk_barrier_failures(
 					map->type);
 		for (i = 0; i < map->num_stripes; i++) {
-			if (map->stripes[i].dev->missing)
+			if (map->stripes[i].dev->missing ||
+			    map->stripes[i].dev->err_wait ||
+			    map->stripes[i].dev->err_send)
 				missing++;
+			map->stripes[i].dev->err_wait = 0;
+			map->stripes[i].dev->err_send = 0;
 		}
 		if (missing > max_tolerated) {
 			ret = -EIO;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 351431a3f5aa..48ced5cc09e4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -76,6 +76,10 @@ struct btrfs_device {
 	int can_discard;
 	int is_tgtdev_for_dev_replace;
 
+	/* for barrier_all_devices() check */
+	int err_send;
+	int err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
 	seqcount_t data_seqcount;
 #endif
-- 
2.7.0



* [PATCH 05/13] btrfs: Cleanup num_tolerated_disk_barrier_failures
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (3 preceding siblings ...)
  2016-04-12 14:15 ` [PATCH 04/13] btrfs: Allow barrier_all_devices to do per-chunk device check Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:15 ` [PATCH 06/13] btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV Anand Jain
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Qu Wenruo <quwenruo@cn.fujitsu.com>

As we now use the per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is of no use. So clean it up.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>

[Btrfs: resolve conflict to apply 'btrfs: Cleanup num_tolerated_disk_barrier_failures']
Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/ctree.h   |  2 --
 fs/btrfs/disk-io.c | 56 ------------------------------------------------------
 fs/btrfs/disk-io.h |  2 --
 fs/btrfs/volumes.c | 17 -----------------
 4 files changed, 77 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0b5c2c71dffd..7a6471269b34 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1829,8 +1829,6 @@ struct btrfs_fs_info {
 	/* next backup root to be overwritten */
 	int backup_root_index;
 
-	int num_tolerated_disk_barrier_failures;
-
 	/* device replace state */
 	struct btrfs_dev_replace dev_replace;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9ad3667f5e71..65c9f19d8017 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2996,8 +2996,6 @@ retry_root_backup:
 		printk(KERN_ERR "BTRFS: Failed to read block groups: %d\n", ret);
 		goto fail_sysfs;
 	}
-	fs_info->num_tolerated_disk_barrier_failures =
-		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
 
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
@@ -3564,60 +3562,6 @@ int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
 	return min_tolerated;
 }
 
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-	struct btrfs_fs_info *fs_info)
-{
-	struct btrfs_ioctl_space_info space;
-	struct btrfs_space_info *sinfo;
-	u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
-		       BTRFS_BLOCK_GROUP_SYSTEM,
-		       BTRFS_BLOCK_GROUP_METADATA,
-		       BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
-	int i;
-	int c;
-	int num_tolerated_disk_barrier_failures =
-		(int)fs_info->fs_devices->num_devices;
-
-	for (i = 0; i < ARRAY_SIZE(types); i++) {
-		struct btrfs_space_info *tmp;
-
-		sinfo = NULL;
-		rcu_read_lock();
-		list_for_each_entry_rcu(tmp, &fs_info->space_info, list) {
-			if (tmp->flags == types[i]) {
-				sinfo = tmp;
-				break;
-			}
-		}
-		rcu_read_unlock();
-
-		if (!sinfo)
-			continue;
-
-		down_read(&sinfo->groups_sem);
-		for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
-			u64 flags;
-
-			if (list_empty(&sinfo->block_groups[c]))
-				continue;
-
-			btrfs_get_block_group_info(&sinfo->block_groups[c],
-						   &space);
-			if (space.total_bytes == 0 || space.used_bytes == 0)
-				continue;
-			flags = space.flags;
-
-			num_tolerated_disk_barrier_failures = min(
-				num_tolerated_disk_barrier_failures,
-				btrfs_get_num_tolerated_disk_barrier_failures(
-					flags));
-		}
-		up_read(&sinfo->groups_sem);
-	}
-
-	return num_tolerated_disk_barrier_failures;
-}
-
 static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 {
 	struct list_head *head;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 8e79d0070bcf..dd155621f95f 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -141,8 +141,6 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 int btree_lock_page_hook(struct page *page, void *data,
 				void (*flush_fn)(void *));
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-	struct btrfs_fs_info *fs_info);
 int __init btrfs_end_io_wq_init(void);
 void btrfs_end_io_wq_exit(void);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d9cae4d7ba55..8549bd2b3a42 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1872,9 +1872,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		free_fs_devices(cur_devices);
 	}
 
-	root->fs_info->num_tolerated_disk_barrier_failures =
-		btrfs_calc_num_tolerated_disk_barrier_failures(root->fs_info);
-
 	/*
 	 * at this point, the device is zero sized.  We want to
 	 * remove it from the devices list and zero out the old super
@@ -2402,8 +2399,6 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 				"sysfs: failed to create fsid for sprout");
 	}
 
-	root->fs_info->num_tolerated_disk_barrier_failures =
-		btrfs_calc_num_tolerated_disk_barrier_failures(root->fs_info);
 	ret = btrfs_commit_transaction(trans, root);
 
 	if (seeding_dev) {
@@ -3754,13 +3749,6 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			bctl->meta.target, bctl->data.target);
 	}
 
-	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
-		fs_info->num_tolerated_disk_barrier_failures = min(
-			btrfs_calc_num_tolerated_disk_barrier_failures(fs_info),
-			btrfs_get_num_tolerated_disk_barrier_failures(
-				bctl->sys.target));
-	}
-
 	ret = insert_balance_item(fs_info->tree_root, bctl);
 	if (ret && ret != -EEXIST)
 		goto out;
@@ -3783,11 +3771,6 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 	mutex_lock(&fs_info->balance_mutex);
 	atomic_dec(&fs_info->balance_running);
 
-	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
-		fs_info->num_tolerated_disk_barrier_failures =
-			btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-	}
-
 	if (bargs) {
 		memset(bargs, 0, sizeof(*bargs));
 		update_ioctl_balance_args(fs_info, 0, bargs);
-- 
2.7.0


* [PATCH 06/13] btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (4 preceding siblings ...)
  2016-04-12 14:15 ` [PATCH 05/13] btrfs: Cleanup num_tolerated_disk_barrier_failures Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:15 ` [PATCH 07/13] btrfs: add check not to mount a spare device Anand Jain
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

Add the BTRFS_FEATURE_INCOMPAT_SPARE_DEV (0x400) flag to identify
a spare device.

Along with this, a check in the mount context makes sure that mounting
a spare device fails, as spare devices aren't mountable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/ctree.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7a6471269b34..a823ff7944f1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -531,6 +531,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_SPARE_DEV	(1ULL << 10)
 
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
@@ -551,7 +552,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_RAID56 |		\
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
-	 BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
+	 BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.7.0
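
The flag test that this patch enables can be sketched in userspace C as
follows. This is a minimal illustration, not the kernel code: only the bit
value (1ULL << 10) comes from the patch, while SPARE_DEV_FLAG,
is_spare_device() and check_mountable() are hypothetical names chosen for
the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit value taken from the patch: 1ULL << 10, i.e. 0x400. */
#define SPARE_DEV_FLAG (1ULL << 10)

/* Hypothetical helper: true when the incompat flags mark a spare device. */
bool is_spare_device(uint64_t incompat_flags)
{
	return (incompat_flags & SPARE_DEV_FLAG) != 0;
}

/*
 * Hypothetical mount-time gate: returns 0 to continue mounting and
 * -1 to refuse, because a spare device must not be mounted.
 */
int check_mountable(uint64_t incompat_flags)
{
	return is_spare_device(incompat_flags) ? -1 : 0;
}
```

Because the bit is an incompat bit, a kernel that does not know about it
would be expected to refuse the device as well, which fits the intent that
spares are never mounted directly.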


* [PATCH 07/13] btrfs: add check not to mount a spare device
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (5 preceding siblings ...)
  2016-04-12 14:15 ` [PATCH 06/13] btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:15 ` [PATCH 08/13] btrfs: support btrfs dev scan for " Anand Jain
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

Spare devices can be scanned but shouldn't be mountable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/disk-io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 65c9f19d8017..e9fca3bc7e42 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2811,6 +2811,14 @@ int open_ctree(struct super_block *sb,
 		goto fail_alloc;
 	}
 
+	if (btrfs_super_incompat_flags(disk_super) &
+			BTRFS_FEATURE_INCOMPAT_SPARE_DEV) {
+		/* You can scan a spare device, but not mount it */
+		printk(KERN_ERR "BTRFS: You can't mount a spare device\n");
+		err = -ENOTSUPP;
+		goto fail_alloc;
+	}
+
 	/*
 	 * Needn't use the lock because there is no other task which will
 	 * update the flag.
-- 
2.7.0


* [PATCH 08/13] btrfs: support btrfs dev scan for spare device
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (6 preceding siblings ...)
  2016-04-12 14:15 ` [PATCH 07/13] btrfs: add check not to mount a spare device Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:15 ` [PATCH 09/13] btrfs: provide framework to get and put a " Anand Jain
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

When the user or system calls the BTRFS_IOC_SCAN_DEV ioctl on a
spare device, this patch makes sure the device is added to the
device list and marked as spare.

The same applies to the BTRFS_IOC_DEVICES_READY ioctl, since it
has historically performed a scan as well.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/volumes.c | 4 ++++
 fs/btrfs/volumes.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8549bd2b3a42..150807e0310e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -605,6 +605,10 @@ static noinline int device_list_add(const char *path,
 		if (IS_ERR(fs_devices))
 			return PTR_ERR(fs_devices);
 
+		if (btrfs_super_incompat_flags(disk_super) &
+				BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
+			fs_devices->spare = 1;
+
 		list_add(&fs_devices->list, &fs_uuids);
 
 		device = NULL;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 48ced5cc09e4..51cf716eb35b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -263,6 +263,8 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	int spare;
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
2.7.0


* [PATCH 09/13] btrfs: provide framework to get and put a spare device
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (7 preceding siblings ...)
  2016-04-12 14:15 ` [PATCH 08/13] btrfs: support btrfs dev scan for " Anand Jain
@ 2016-04-12 14:15 ` Anand Jain
  2016-04-12 14:16 ` [PATCH 10/13] btrfs: introduce helper functions to perform hot replace Anand Jain
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:15 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

This adds functions to get and put a spare device from the list,
so that the hot replace code can pick a spare device when needed.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/super.c   |  5 +++++
 fs/btrfs/volumes.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  2 ++
 4 files changed, 61 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a823ff7944f1..1cf1bbf3058f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4185,6 +4185,7 @@ void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info);
 ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
 
 /* super.c */
+struct file_system_type *btrfs_get_fs_type(void);
 int btrfs_parse_options(struct btrfs_root *root, char *options,
 			unsigned long new_flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 729f596b540a..2d77a8dde92c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -69,6 +69,11 @@ static struct file_system_type btrfs_fs_type;
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data);
 
+struct file_system_type *btrfs_get_fs_type(void)
+{
+	return &btrfs_fs_type;
+}
+
 const char *btrfs_decode_error(int errno)
 {
 	char *errstr = "unknown";
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 150807e0310e..00d82872ede0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -525,6 +525,59 @@ static void pending_bios_fn(struct btrfs_work *work)
 	run_scheduled_bios(device);
 }
 
+int btrfs_get_spare_device(char **path)
+{
+	int ret = 1;
+	struct btrfs_fs_devices *fs_devices;
+	struct btrfs_device *device;
+	struct list_head *fs_uuids = btrfs_get_fs_uuids();
+
+	mutex_lock(&uuid_mutex);
+	list_for_each_entry(fs_devices, fs_uuids, list) {
+		if (!fs_devices->spare)
+			continue;
+
+		/* as of now there is only one device in the spare fs_devices */
+		device = list_entry(fs_devices->devices.next,
+					struct btrfs_device, dev_list);
+
+		if (!device || !device->name)
+			continue;
+
+		fs_devices->spare = 0;
+		/*
+		 * It's under uuid_mutex and there is one spare per fsid
+		 * so rcu lock is actually not required
+		 */
+		*path = kstrdup(device->name->str, GFP_KERNEL);
+		if (*path)
+			ret = 0;
+		else
+			ret = -ENOMEM;
+		break;
+	}
+
+	if (!ret) {
+		btrfs_sysfs_remove_fsid(fs_devices);
+		list_del(&fs_devices->list);
+		free_fs_devices(fs_devices);
+	}
+	mutex_unlock(&uuid_mutex);
+
+	return ret;
+}
+
+void btrfs_put_spare_device(char *path)
+{
+	struct file_system_type *btrfs_fs_type;
+	struct btrfs_fs_devices *fs_devices;
+
+	btrfs_fs_type = btrfs_get_fs_type();
+
+	if (btrfs_scan_one_device(path, FMODE_READ,
+				    btrfs_fs_type, &fs_devices))
+		printk(KERN_INFO "failed to return spare device\n");
+}
 
 void btrfs_free_stale_device(struct btrfs_device *cur_dev)
 {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 51cf716eb35b..b4308afa3097 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -469,6 +469,8 @@ int btrfs_init_new_device(struct btrfs_root *root, char *path);
 int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
 				  struct btrfs_device *srcdev,
 				  struct btrfs_device **device_out);
+int btrfs_get_spare_device(char **path);
+void btrfs_put_spare_device(char *path);
 int btrfs_balance(struct btrfs_balance_control *bctl,
 		  struct btrfs_ioctl_balance_args *bargs);
 int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info);
-- 
2.7.0
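
The get/put idea in this patch can be sketched as a small userspace model.
This is a sketch under stated assumptions: the fixed-size array, struct
pool_entry, pool_put_spare() and pool_get_spare() are hypothetical stand-ins
for the fs_uuids list and the kernel functions, and the return values only
loosely follow btrfs_get_spare_device() (0 on success, 1 when no spare is
found, negative on allocation failure).

```c
#include <stdlib.h>
#include <string.h>

#define MAX_POOL 4

/* Hypothetical stand-in for a scanned fs_devices entry. */
struct pool_entry {
	const char *path;	/* device path, NULL when the slot is empty */
	int spare;		/* 1 when the entry is an unused spare */
};

struct pool_entry pool[MAX_POOL];

/* Register a device path as a spare; returns 0 on success, -1 if full. */
int pool_put_spare(const char *path)
{
	for (int i = 0; i < MAX_POOL; i++) {
		if (!pool[i].path) {
			pool[i].path = path;
			pool[i].spare = 1;
			return 0;
		}
	}
	return -1;
}

/*
 * Claim the first spare and hand back a heap copy of its path.
 * Returns 0 on success, 1 when no spare exists, -1 on allocation failure.
 * The caller owns *path_out and must free() it.
 */
int pool_get_spare(char **path_out)
{
	for (int i = 0; i < MAX_POOL; i++) {
		size_t len;
		char *copy;

		if (!pool[i].path || !pool[i].spare)
			continue;

		len = strlen(pool[i].path) + 1;
		copy = malloc(len);
		if (!copy)
			return -1;
		memcpy(copy, pool[i].path, len);

		/* Drop the entry so the same spare cannot be claimed twice. */
		pool[i].path = NULL;
		pool[i].spare = 0;
		*path_out = copy;
		return 0;
	}
	return 1;
}
```

Removing the entry on a successful get is what keeps two replace operations
from picking the same spare; putting it back re-registers the path, much as
the patch rescans the device in btrfs_put_spare_device().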


* [PATCH 10/13] btrfs: introduce helper functions to perform hot replace
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (8 preceding siblings ...)
  2016-04-12 14:15 ` [PATCH 09/13] btrfs: provide framework to get and put a " Anand Jain
@ 2016-04-12 14:16 ` Anand Jain
  2016-04-12 14:40   ` kbuild test robot
  2016-04-12 14:16 ` [PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

Hot replace / auto replace is an important volume manager feature
and is critical to data center operations, so that a degraded
volume can be brought back to a healthy state at the earliest and
without manual intervention.

This modifies the existing replace code to suit the needs of auto
replace. In the long run, I hope both code paths can be merged.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/dev-replace.c | 43 +++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |  1 +
 2 files changed, 44 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2b926867d136..ddc4843604df 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -957,3 +957,46 @@ void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 				     &fs_info->fs_state));
 	}
 }
+
+int btrfs_auto_replace_start(struct btrfs_root *root, u64 src_devid)
+{
+	int ret;
+	char *tgt_path;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+
+	if (!src_devid)
+		return -EINVAL;
+
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
+	if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+		btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+		return -EBUSY;
+	}
+	btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+
+	if (btrfs_get_spare_device(&tgt_path)) {
+		btrfs_err(root->fs_info,
+			"No spare device found/configured in the kernel");
+		return -EINVAL;
+	}
+
+	if (atomic_xchg(
+		&root->fs_info->mutually_exclusive_operation_running, 1)) {
+		ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
+	} else {
+		ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
+		BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS);
+		atomic_set(
+			&root->fs_info->mutually_exclusive_operation_running, 0);
+	}
+
+	if (ret)
+		btrfs_put_spare_device(tgt_path);
+
+	kfree(tgt_path);
+
+	return ret;
+}
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index e922b42d91df..54b0812c8ba4 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -46,4 +46,5 @@ static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
 {
 	atomic64_inc(stat_value);
 }
+int btrfs_auto_replace_start(struct btrfs_root *root, u64 src_devid);
 #endif
-- 
2.7.0
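
The mutual-exclusion guard used in btrfs_auto_replace_start() (an atomic
exchange on mutually_exclusive_operation_running) can be sketched with C11
atomics. This is an illustrative userspace model only: exclusive_op_running,
try_begin_exclusive_op(), end_exclusive_op(), run_exclusive() and
probe_busy() are hypothetical names, and -1 stands in for the
BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS result.

```c
#include <stdatomic.h>

/* Hypothetical flag standing in for mutually_exclusive_operation_running. */
atomic_int exclusive_op_running;

/* Returns 1 if the caller now owns the guard, 0 if it was already held. */
int try_begin_exclusive_op(void)
{
	/* atomic_exchange returns the previous value; nonzero means busy. */
	return atomic_exchange(&exclusive_op_running, 1) == 0;
}

void end_exclusive_op(void)
{
	atomic_store(&exclusive_op_running, 0);
}

/*
 * Run op() only if no other exclusive operation is in flight.
 * Returns op()'s result, or -1 when the guard is already held.
 */
int run_exclusive(int (*op)(void))
{
	int ret;

	if (!try_begin_exclusive_op())
		return -1;
	ret = op();
	end_exclusive_op();
	return ret;
}

/* Example operation: reports whether the guard is held while it runs. */
int probe_busy(void)
{
	return atomic_load(&exclusive_op_running);
}
```

The exchange both tests and sets the flag in one step, so two callers cannot
both observe "not running" and proceed; the loser gets the busy result and,
in the patch, returns the spare it had claimed.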


* [PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (9 preceding siblings ...)
  2016-04-12 14:16 ` [PATCH 10/13] btrfs: introduce helper functions to perform hot replace Anand Jain
@ 2016-04-12 14:16 ` Anand Jain
  2016-04-14  1:15   ` [PATCH] Btrfs: Set superblock s_bdev field properly at device closing Yauhen Kharuzhy
  2016-04-14 10:51   ` [PATCH v5 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
  2016-04-12 14:16 ` [PATCH 12/13] btrfs: check device for critical errors and mark failed Anand Jain
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

This patch provides helper functions to force a device offline
or failed. We need these device states for the following reasons:
1) a. report that a device has failed when it does
   b. close the device when it goes offline so that the block layer
      can clean up
2) identify the candidate for the auto replace
3) avoid further commit errors being reported against the failing device
4) a device in a multi-device btrfs may go offline from the system
   (but as of now, in some system configs btrfs gets unmounted in this
    context, which is not the correct behavior)

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/volumes.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  14 ++++++
 2 files changed, 152 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 00d82872ede0..275143c42374 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7146,3 +7146,141 @@ out:
 	read_unlock(&map_tree->map_tree.lock);
 	return ret;
 }
+
+static void __close_device(struct work_struct *work)
+{
+	struct btrfs_device *device;
+
+	device = container_of(work, struct btrfs_device, rcu_work);
+
+	if (device->closing_bdev)
+		blkdev_put(device->closing_bdev, device->mode);
+
+	device->closing_bdev = NULL;
+}
+
+static void close_device(struct rcu_head *head)
+{
+	struct btrfs_device *device;
+
+	device = container_of(head, struct btrfs_device, rcu);
+
+	INIT_WORK(&device->rcu_work, __close_device);
+	schedule_work(&device->rcu_work);
+}
+
+void device_force_close(struct btrfs_device *device)
+{
+	struct btrfs_device *next_device;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = device->fs_devices;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	mutex_lock(&fs_devices->fs_info->chunk_mutex);
+	spin_lock(&fs_devices->fs_info->free_chunk_lock);
+
+	next_device = list_entry(fs_devices->devices.next,
+					struct btrfs_device, dev_list);
+	if (device->bdev == fs_devices->fs_info->sb->s_bdev)
+		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
+
+	if (device->bdev == fs_devices->latest_bdev)
+		fs_devices->latest_bdev = next_device->bdev;
+
+	if (device->bdev)
+		fs_devices->open_devices--;
+
+	if (device->writeable) {
+		list_del_init(&device->dev_alloc_list);
+		fs_devices->rw_devices--;
+	}
+	device->writeable = 0;
+
+	/*
+	 * fixme: works for now, but it's better to keep the state of
+	 * missing and offline different, and update rest of the
+	 * places where we check for only missing and not for failed
+	 * or offline as of now.
+	 */
+	device->missing = 1;
+	fs_devices->missing_devices++;
+	device->closing_bdev = device->bdev;
+	device->bdev = NULL;
+
+	call_rcu(&device->rcu, close_device);
+
+	spin_unlock(&fs_devices->fs_info->free_chunk_lock);
+	mutex_unlock(&fs_devices->fs_info->chunk_mutex);
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	rcu_barrier();
+}
+
+void btrfs_device_enforce_state(struct btrfs_device *dev, char *why)
+{
+	int tolerance;
+	bool degrade_option;
+	char dev_status[10];
+	char chunk_status[25];
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = dev->fs_devices;
+	fs_info = fs_devices->fs_info;
+	degrade_option = btrfs_test_opt(fs_info->fs_root, DEGRADED);
+
+	/* todo: support seed later */
+	if (fs_devices->seeding)
+		return;
+
+	/* this shouldn't be called if device is already missing */
+	if (dev->missing || !dev->bdev)
+		return;
+
+	if (dev->offline || dev->failed)
+		return;
+
+	/* The only RW device was requested to force close; let the FS handle it */
+	if (fs_devices->rw_devices == 1) {
+		btrfs_std_error(fs_info, -EIO,
+			"force offline last RW device");
+		return;
+	}
+
+	if (!strcmp(why, "offline"))
+		dev->offline = 1;
+	else if (!strcmp(why, "failed"))
+		dev->failed = 1;
+	else
+		return;
+
+	/*
+	 * Hereafter, there shouldn't be any reason why we can't force
+	 * close this device
+	 */
+	btrfs_sysfs_rm_device_link(fs_devices, dev);
+	device_force_close(dev);
+	strcpy(dev_status, "closed");
+
+	tolerance = btrfs_check_degradable(fs_info,
+						fs_info->sb->s_flags);
+	if (tolerance > 0) {
+		strncpy(chunk_status, "chunk(s) degraded", 25);
+	} else if (tolerance < 0) {
+		strncpy(chunk_status, "chunk(s) failed", 25);
+	} else {
+		strncpy(chunk_status, "No chunk(s) are degraded", 25);
+	}
+
+	btrfs_warn_in_rcu(fs_info, "device %s marked %s, %s, %s",
+		rcu_str_deref(dev->name), why, dev_status, chunk_status);
+	btrfs_info_in_rcu(fs_info,
+		"num_devices %llu rw_devices %llu degraded-option: %s",
+		fs_devices->num_devices, fs_devices->rw_devices,
+		degrade_option ? "set":"unset");
+
+	if (tolerance < 0)
+		btrfs_std_error(fs_info, -EIO, "devices below critical level");
+
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b4308afa3097..60eb098d8c76 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -66,13 +66,26 @@ struct btrfs_device {
 	struct btrfs_pending_bios pending_sync_bios;
 
 	struct block_device *bdev;
+	struct block_device *closing_bdev;
 
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
 	int writeable;
 	int in_fs_metadata;
+	/* missing: device wasn't found at the time of mount */
 	int missing;
+	/* failed: device confirmed to have experienced critical io failure */
+	int failed;
+	/*
+	 * offline: system or user or block layer transport has removed or
+	 * offlined the device which was once present, without going
+	 * through unmount. Implies an interim communication breakdown
+	 * and not necessarily a candidate for the device replace. And
+	 * device might be online after user intervention or after
+	 * block transport layer error recovery.
+	 */
+	int offline;
 	int can_discard;
 	int is_tgtdev_for_dev_replace;
 
@@ -575,5 +588,6 @@ struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
 int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags);
+void btrfs_device_enforce_state(struct btrfs_device *dev, char *why);
 
 #endif
-- 
2.7.0
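
The decision flow of btrfs_device_enforce_state() can be sketched as a
simplified userspace model. Assumptions are labeled: struct dev_state,
enforce_state() and chunk_status() are hypothetical, the rw_devices count is
passed in by the caller, and locking, sysfs updates and the actual device
close are left out entirely.

```c
#include <string.h>

/* Hypothetical, simplified device record; field names follow the patch. */
struct dev_state {
	int missing;
	int failed;
	int offline;
};

/*
 * Apply a requested state ("offline" or "failed") to a device.
 * Returns 0 when the state was applied, -1 when the request is ignored:
 * the device is already missing/offline/failed, it is the last writable
 * device, or the reason string is unknown.
 */
int enforce_state(struct dev_state *dev, int rw_devices, const char *why)
{
	if (dev->missing || dev->offline || dev->failed)
		return -1;
	if (rw_devices == 1)
		return -1;

	if (strcmp(why, "offline") == 0)
		dev->offline = 1;
	else if (strcmp(why, "failed") == 0)
		dev->failed = 1;
	else
		return -1;

	/* The patch also marks the device missing when force-closing it. */
	dev->missing = 1;
	return 0;
}

/* Map the sign of the degradable check result to the logged status text. */
const char *chunk_status(int tolerance)
{
	if (tolerance > 0)
		return "chunk(s) degraded";
	if (tolerance < 0)
		return "chunk(s) failed";
	return "No chunk(s) are degraded";
}
```

Checking the last-writable-device case before setting any flag mirrors the
patch, where that situation is handed to the filesystem error path instead
of closing the device.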


* [PATCH 12/13] btrfs: check device for critical errors and mark failed
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (10 preceding siblings ...)
  2016-04-12 14:16 ` [PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
@ 2016-04-12 14:16 ` Anand Jain
  2016-04-12 14:16 ` [PATCH 13/13] btrfs: check for failed device and hot replace Anand Jain
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

Write and flush errors are considered critical errors, upon which
the device is brought offline and marked as failed. Write and flush
errors are identified using the device error statistics, which are
monitored by a kthread, btrfs-health.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/ctree.h   |   2 ++
 fs/btrfs/disk-io.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.c |   1 +
 fs/btrfs/volumes.h |   4 +++
 4 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1cf1bbf3058f..e36200cf6ead 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1569,6 +1569,7 @@ struct btrfs_fs_info {
 	struct mutex tree_log_mutex;
 	struct mutex transaction_kthread_mutex;
 	struct mutex cleaner_mutex;
+	struct mutex health_mutex;
 	struct mutex chunk_mutex;
 	struct mutex volume_mutex;
 
@@ -1686,6 +1687,7 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *extent_workers;
 	struct task_struct *transaction_kthread;
 	struct task_struct *cleaner_kthread;
+	struct task_struct *health_kthread;
 	int thread_pool_size;
 
 	struct kobject *space_info_kobj;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e9fca3bc7e42..1deb5714cc3a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1876,6 +1876,93 @@ sleep:
 	return 0;
 }
 
+/*
+ * returns:
+ * < 0 : Check didn't run, std error
+ *   0 : No errors found
+ * > 0 : # of devices having fatal errors
+ */
+static int btrfs_update_devices_health(struct btrfs_root *root)
+{
+	int ret = 0;
+	struct btrfs_device *device;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+
+	if (btrfs_fs_closing(fs_info))
+		return -EBUSY;
+
+	/* mark disk(s) with write or flush error(s) as failed */
+	mutex_lock(&fs_info->volume_mutex);
+	list_for_each_entry_rcu(device,
+			&fs_info->fs_devices->devices, dev_list) {
+		int c_err;
+
+		if (device->failed) {
+			ret++;
+			continue;
+		}
+
+		/*
+		 * todo: replace target device's write/flush error,
+		 * skip for now
+		 */
+		if (device->is_tgtdev_for_dev_replace)
+			continue;
+
+		if (!device->dev_stats_valid)
+			continue;
+
+		c_err = atomic_read(&device->new_critical_errs);
+		atomic_sub(c_err, &device->new_critical_errs);
+		if (c_err) {
+			btrfs_crit_in_rcu(fs_info,
+				"fatal error on device %s",
+					rcu_str_deref(device->name));
+			btrfs_device_enforce_state(device, "failed");
+			ret++;
+		}
+	}
+	mutex_unlock(&fs_info->volume_mutex);
+
+	return ret;
+}
+
+/*
+ * Devices health maintenance kthread, gets woken up by the transaction
+ * kthread. Once sysfs is ready, this should publish the report
+ * through sysfs so that userland scripts can invoke actions.
+ */
+static int health_kthread(void *arg)
+{
+	struct btrfs_root *root = arg;
+
+	do {
+		if (btrfs_need_cleaner_sleep(root))
+			goto sleep;
+
+		if (!mutex_trylock(&root->fs_info->health_mutex))
+			goto sleep;
+
+		if (btrfs_need_cleaner_sleep(root)) {
+			mutex_unlock(&root->fs_info->health_mutex);
+			goto sleep;
+		}
+
+		/* Check devices health */
+		btrfs_update_devices_health(root);
+
+		mutex_unlock(&root->fs_info->health_mutex);
+
+sleep:
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (!kthread_should_stop())
+			schedule();
+		__set_current_state(TASK_RUNNING);
+	} while (!kthread_should_stop());
+
+	return 0;
+}
+
 static int transaction_kthread(void *arg)
 {
 	struct btrfs_root *root = arg;
@@ -1922,6 +2009,7 @@ static int transaction_kthread(void *arg)
 			btrfs_end_transaction(trans, root);
 		}
 sleep:
+		wake_up_process(root->fs_info->health_kthread);
 		wake_up_process(root->fs_info->cleaner_kthread);
 		mutex_unlock(&root->fs_info->transaction_kthread_mutex);
 
@@ -2668,6 +2756,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->chunk_mutex);
 	mutex_init(&fs_info->transaction_kthread_mutex);
 	mutex_init(&fs_info->cleaner_mutex);
+	mutex_init(&fs_info->health_mutex);
 	mutex_init(&fs_info->volume_mutex);
 	mutex_init(&fs_info->ro_block_group_mutex);
 	init_rwsem(&fs_info->commit_root_sem);
@@ -3010,11 +3099,16 @@ retry_root_backup:
 	if (IS_ERR(fs_info->cleaner_kthread))
 		goto fail_sysfs;
 
+	fs_info->health_kthread = kthread_run(health_kthread, tree_root,
+					       "btrfs-health");
+	if (IS_ERR(fs_info->health_kthread))
+		goto fail_cleaner;
+
 	fs_info->transaction_kthread = kthread_run(transaction_kthread,
 						   tree_root,
 						   "btrfs-transaction");
 	if (IS_ERR(fs_info->transaction_kthread))
-		goto fail_cleaner;
+		goto fail_health;
 
 	if (!btrfs_test_opt(tree_root, SSD) &&
 	    !btrfs_test_opt(tree_root, NOSSD) &&
@@ -3178,6 +3272,10 @@ fail_trans_kthread:
 	kthread_stop(fs_info->transaction_kthread);
 	btrfs_cleanup_transaction(fs_info->tree_root);
 	btrfs_free_fs_roots(fs_info);
+
+fail_health:
+	kthread_stop(fs_info->health_kthread);
+
 fail_cleaner:
 	kthread_stop(fs_info->cleaner_kthread);
 
@@ -3833,6 +3931,7 @@ void close_ctree(struct btrfs_root *root)
 
 	kthread_stop(fs_info->transaction_kthread);
 	kthread_stop(fs_info->cleaner_kthread);
+	kthread_stop(fs_info->health_kthread);
 
 	fs_info->closing = 2;
 	smp_mb();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 275143c42374..c2a87fc127a7 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -233,6 +233,7 @@ static struct btrfs_device *__alloc_device(void)
 	spin_lock_init(&dev->reada_lock);
 	atomic_set(&dev->reada_in_flight, 0);
 	atomic_set(&dev->dev_stats_ccnt, 0);
+	atomic_set(&dev->new_critical_errs, 0);
 	btrfs_device_data_ordered_init(dev);
 	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 60eb098d8c76..1ad63ce5d328 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -167,6 +167,7 @@ struct btrfs_device {
 	/* Counter to record the change of device stats */
 	atomic_t dev_stats_ccnt;
 	atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
+	atomic_t new_critical_errs;
 };
 
 /*
@@ -537,6 +538,9 @@ static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
 	atomic_inc(dev->dev_stat_values + index);
 	smp_mb__before_atomic();
 	atomic_inc(&dev->dev_stats_ccnt);
+	if (index == BTRFS_DEV_STAT_WRITE_ERRS ||
+		index == BTRFS_DEV_STAT_FLUSH_ERRS)
+		atomic_inc(&dev->new_critical_errs);
 }
 
 static inline int btrfs_dev_stat_read(struct btrfs_device *dev,
-- 
2.7.0
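
The "count new critical errors, then consume them in the health check"
pattern can be sketched with C11 atomics. This is a userspace illustration
with hypothetical names (struct dev_health, record_critical_error(),
update_health()). Note one deliberate difference: the patch reads the
counter with atomic_read() and then subtracts that value with atomic_sub(),
whereas this sketch drains the counter with a single atomic_exchange().

```c
#include <stdatomic.h>

/* Hypothetical per-device record; new_critical_errs is named after the patch. */
struct dev_health {
	atomic_int new_critical_errs;
	int failed;
};

/* I/O path: record one write or flush error. */
void record_critical_error(struct dev_health *dev)
{
	atomic_fetch_add(&dev->new_critical_errs, 1);
}

/*
 * Health check: take all pending errors in one atomic step and mark the
 * device failed if there were any. Returns 1 when the device is (or was
 * already) failed, 0 otherwise.
 */
int update_health(struct dev_health *dev)
{
	if (dev->failed)
		return 1;

	if (atomic_exchange(&dev->new_critical_errs, 0) > 0) {
		dev->failed = 1;
		return 1;
	}
	return 0;
}
```

Keeping a separate "new errors" counter, instead of comparing against the
cumulative device statistics, means errors recorded before the current
health pass do not need to be remembered by the checker.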


* [PATCH 13/13] btrfs: check for failed device and hot replace
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (11 preceding siblings ...)
  2016-04-12 14:16 ` [PATCH 12/13] btrfs: check device for critical errors and mark failed Anand Jain
@ 2016-04-12 14:16 ` Anand Jain
  2016-04-12 20:02 ` [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Yauhen Kharuzhy
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-12 14:16 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

This patch checks for a failed device and kicks off auto
replace. If the user decides to disable auto replace, that
can be done through a future sysfs or ioctl interface that
sets the fs_info->no_auto_replace parameter to 1.
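
The decision described above can be sketched in user-space C. This is only
an illustrative model, not the kernel code: struct model_device and both
function names are hypothetical, and the sketch assumes (as the patch does)
that a devid of 0 means "no failed device found".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct btrfs_device. */
struct model_device {
	uint64_t devid;	/* 0 is used here to mean "none" */
	int failed;	/* set once critical I/O failure is confirmed */
};

/* Return the devid of the first failed device, or 0 if none failed. */
static uint64_t first_failed_devid(const struct model_device *devs, size_t n)
{
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (devs[i].failed)
			return devs[i].devid;
	}
	return 0;
}

/*
 * Return 1 when an auto replace should be started: a failed device
 * exists and the no_auto_replace knob is not set. The devid found
 * (or 0) is stored in *failed_devid either way.
 */
static int should_auto_replace(const struct model_device *devs, size_t n,
			       int no_auto_replace, uint64_t *failed_devid)
{
	*failed_devid = first_failed_devid(devs, n);
	return *failed_devid != 0 && !no_auto_replace;
}
```

In this model only the first failed device is picked up per pass, which
matches the single "break" in the loop of the patch below.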

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e36200cf6ead..3262430d65a3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1862,6 +1862,8 @@ struct btrfs_fs_info {
 	struct list_head pinned_chunks;
 
 	int creating_free_space_tree;
+
+	int no_auto_replace;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1deb5714cc3a..5c5c51319bec 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1876,6 +1876,39 @@ sleep:
 	return 0;
 }
 
+static int btrfs_recuperate(struct btrfs_root *root)
+{
+	int ret = 0;
+	u64 failed_devid = 0;
+	struct btrfs_device *device;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = root->fs_info->fs_devices;
+
+	/* fixme: does it need device_list_mutex */
+	mutex_lock(&fs_devices->device_list_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(device,
+			&fs_devices->devices, dev_list) {
+		if (device->failed) {
+			failed_devid = device->devid;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	/*
+	 * We use the replace code, which should be interruptible
+	 * during unmount. As of now no userland stop request is
+	 * supported, so this runs until it is complete.
+	 */
+	if (failed_devid && !root->fs_info->no_auto_replace)
+		ret = btrfs_auto_replace_start(root, failed_devid);
+
+	return ret;
+}
+
 /*
  * returns:
  * < 0 : Check didn't run, std error
@@ -1951,6 +1984,8 @@ static int health_kthread(void *arg)
 		/* Check devices health */
 		btrfs_update_devices_health(root);
 
+		btrfs_recuperate(root);
+
 		mutex_unlock(&root->fs_info->health_mutex);
 
 sleep:
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 10/13] btrfs: introduce helper functions to perform hot replace
  2016-04-12 14:16 ` [PATCH 10/13] btrfs: introduce helper functions to perform hot replace Anand Jain
@ 2016-04-12 14:40   ` kbuild test robot
  0 siblings, 0 replies; 32+ messages in thread
From: kbuild test robot @ 2016-04-12 14:40 UTC (permalink / raw)
  To: Anand Jain; +Cc: kbuild-all, linux-btrfs, dsterba, yauhen.kharuzhy

[-- Attachment #1: Type: text/plain, Size: 2070 bytes --]

Hi Anand,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc3 next-20160412]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Anand-Jain/Introduce-device-state-failed-spare-device-and-auto-replace/20160412-222557
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: x86_64-randconfig-x010-201615 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/dev-replace.c: In function 'btrfs_auto_replace_start':
   fs/btrfs/dev-replace.c:979:39: warning: passing argument 2 of 'btrfs_dev_replace_start' from incompatible pointer type [-Wincompatible-pointer-types]
      ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
                                          ^
   fs/btrfs/dev-replace.c:308:5: note: expected 'struct btrfs_ioctl_dev_replace_args *' but argument is of type 'char *'
    int btrfs_dev_replace_start(struct btrfs_root *root,
        ^
>> fs/btrfs/dev-replace.c:979:9: error: too many arguments to function 'btrfs_dev_replace_start'
      ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
            ^
   fs/btrfs/dev-replace.c:308:5: note: declared here
    int btrfs_dev_replace_start(struct btrfs_root *root,
        ^

vim +/btrfs_dev_replace_start +979 fs/btrfs/dev-replace.c

   973		}
   974	
   975		if (atomic_xchg(
   976			&root->fs_info->mutually_exclusive_operation_running, 1)) {
   977			ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
   978		} else {
 > 979			ret = btrfs_dev_replace_start(root, tgt_path, src_devid, NULL,
   980			BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS);
   981			atomic_set(
   982				&root->fs_info->mutually_exclusive_operation_running, 0);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 25015 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount
  2016-04-12 14:15 ` [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount Anand Jain
@ 2016-04-12 19:21   ` Yauhen Kharuzhy
  0 siblings, 0 replies; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-12 19:21 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba

On Tue, Apr 12, 2016 at 10:15:51PM +0800, Anand Jain wrote:
> From: Qu Wenruo <quwenruo@cn.fujitsu.com>
> 
> Introduce a new function, btrfs_check_degradable(), to judge if all chunks
> in btrfs is OK for degraded mount.
> 
> It provides the new basis for accurate btrfs mount/remount and even
> runtime degraded mount check other than old one-size-fit-all method.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/volumes.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/volumes.h |  1 +
>  2 files changed, 64 insertions(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 9d72dabdddfc..a351c5dd9e9b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7039,3 +7039,66 @@ static void btrfs_close_one_device(struct btrfs_device *device)
>  
>  	call_rcu(&device->rcu, free_device);
>  }
> +
> +/*
> + * Check if all chunks in the fs is OK for degraded mount
> + * Caller itself should do extra check if DEGRADED mount option is given
> + * for >0 return value.
> + *
> + * Return 0 if all chunks are OK.
> + * Return >0 if all chunks are degradable but not all OK.
> + * Return <0 if any chunk is not degradable or other bug.
> + */
> +int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
> +{
> +	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
> +	struct extent_map *em;
> +	u64 next_start = 0;
> +	int ret = 0;
> +
> +	if (flags & MS_RDONLY)
> +		return 0;
> +
> +	read_lock(&map_tree->map_tree.lock);
> +	em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)(-1));
> +	/* No any chunk? Should be a huge bug */
> +	if (!em) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	while (em) {
> +		struct map_lookup *map;
> +		int missing = 0;
> +		int max_tolerated;
> +		int i;
> +
> +		map = (struct map_lookup *) em->bdev;
> +		max_tolerated =
> +			btrfs_get_num_tolerated_disk_barrier_failures(
> +					map->type);
> +		for (i = 0; i < map->num_stripes; i++) {
> +			if (map->stripes[i].dev->missing)
> +				missing++;
> +		}
> +		if (missing > max_tolerated) {
> +			ret = -EIO;
> +			btrfs_warn(fs_info,
> +				   "missing devices(%d) exceeds the limit(%d), writebale mount is not allowed",
> +				   missing, max_tolerated);

Typo: s/writebale/writeable/
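
For reference, the return-value contract in the quoted comment can be
modelled in user-space C as below. This is a rough sketch under stated
assumptions: struct model_chunk and model_check_degradable() are
hypothetical names, and returning exactly 1 for the "degradable but not all
OK" case is a guess, since the quoted hunk is cut off before that path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-chunk summary; not a kernel structure. */
struct model_chunk {
	int missing;		/* stripes whose device is missing */
	int max_tolerated;	/* missing devices the profile tolerates */
};

/*
 * Return 0 if no chunk has a missing device, 1 if some chunks are
 * degraded but all are within their limit, and a negative errno if
 * there are no chunks or any chunk exceeds its limit.
 */
static int model_check_degradable(const struct model_chunk *chunks, size_t n)
{
	int degraded = 0;
	size_t i;

	assert(chunks != NULL || n == 0);
	if (n == 0)
		return -ENOENT;	/* no chunks at all is treated as a bug */

	for (i = 0; i < n; i++) {
		if (chunks[i].missing > chunks[i].max_tolerated)
			return -EIO;	/* writeable mount not allowed */
		if (chunks[i].missing > 0)
			degraded = 1;
	}
	return degraded;
}
```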


-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (12 preceding siblings ...)
  2016-04-12 14:16 ` [PATCH 13/13] btrfs: check for failed device and hot replace Anand Jain
@ 2016-04-12 20:02 ` Yauhen Kharuzhy
  2016-04-13 22:43   ` Anand Jain
  2016-04-13 21:21 ` Yauhen Kharuzhy
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-12 20:02 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba

On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
> Thanks for various comments, tests and feedback.

Seems to be working for me. I triggered the OOM killer while testing this
in VirtualBox, but I don't think it is related to autoreplace; it seems to
be a scrub implementation issue:

[  449.615157] CPU: 0 PID: 1771 Comm: btrfs-health Not tainted 4.4.5-scst31x+ #25
[  449.621763] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  449.647614]  0000000000000000 ffff8800601c7660 ffffffff813529e3 ffff8800601c7858
[  449.659766]  ffff88005ba66140 ffff8800601c76d0 ffffffff8121b41e ffff8800601c7680
[  449.683167]  ffffffff810d7ccd ffff8800601c76a0 0000000000000206 ffffffff81c6d0e0
[  449.700746] Call Trace:
[  449.705078]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[  449.715238]  [<ffffffff8121b41e>] dump_header+0x5a/0x21d
[  449.725400]  [<ffffffff810d7ccd>] ? trace_hardirqs_on+0xd/0x10
[  449.741261]  [<ffffffff811a3e80>] oom_kill_process+0x200/0x3d0
[  449.753042]  [<ffffffff811a4602>] out_of_memory+0x562/0x580
[  449.765923]  [<ffffffff811a4373>] ? out_of_memory+0x2d3/0x580
[  449.768455]  [<ffffffff811aa98c>] __alloc_pages_nodemask+0xafc/0xc80
[  449.770281]  [<ffffffff811f5ebb>] alloc_pages_current+0x9b/0x1c0
[  449.783371]  [<ffffffffa02160f5>] scrub_pages+0xb5/0x400 [btrfs]
[  449.804598]  [<ffffffffa0212a65>] ? scrub_find_csum+0xd5/0x110 [btrfs]
[  449.819145]  [<ffffffffa0216dce>] scrub_stripe+0x82e/0x1180 [btrfs]
[  449.829299]  [<ffffffffa0217830>] scrub_chunk+0x110/0x160 [btrfs]
[  449.835859]  [<ffffffffa0217afc>] scrub_enumerate_chunks+0x27c/0x560 [btrfs]
[  449.852805]  [<ffffffff810ceb00>] ? wake_atomic_t_function+0x30/0x70
[  449.867081]  [<ffffffffa021930d>] btrfs_scrub_dev+0x1cd/0x680 [btrfs]
[  449.876784]  [<ffffffffa022d234>] btrfs_dev_replace_start+0x334/0x540 [btrfs]
[  449.891503]  [<ffffffffa022def8>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[  449.911958]  [<ffffffffa01ac4e6>] health_kthread+0x246/0x490 [btrfs]
[  449.922132]  [<ffffffffa01ac3d8>] ? health_kthread+0x138/0x490 [btrfs]
[  449.946273]  [<ffffffffa01ac2a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[  449.975742]  [<ffffffff810a70df>] kthread+0xef/0x110
[  449.994914]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[  450.022306]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[  450.036069]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[  450.045622]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[  450.047625] Mem-Info:
[  450.055195] active_anon:30 inactive_anon:71 isolated_anon:0
[  450.055195]  active_file:220 inactive_file:980 isolated_file:0
[  450.055195]  unevictable:527 dirty:41 writeback:59 unstable:0
[  450.055195]  slab_reclaimable:18226 slab_unreclaimable:283931
[  450.055195]  mapped:612 shmem:10 pagetables:1209 bounce:0
[  450.055195]  free:3310 free_pcp:153 free_cma:0
[  450.069070] Node 0 DMA free:6232kB min:48kB low:60kB high:72kB active_anon:0kB inactive_anon:0kB active_file:8kB inactive_file:16kB unevictable:28kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:28kB dirty:4kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:788kB slab_unreclaimable:6236kB kernel_stack:96kB pagetables:48kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:220 all_unreclaimable? yes
[  450.161023] lowmem_reserve[]: 0 1546 1546 1546
[  450.181786] Node 0 DMA32 free:10620kB min:4896kB low:6120kB high:7344kB active_anon:120kB inactive_anon:176kB active_file:964kB inactive_file:1132kB unevictable:2080kB isolated(anon):0kB isolated(file):0kB present:1668032kB managed:1583780kB mlocked:2080kB dirty:160kB writeback:112kB mapped:2568kB shmem:40kB slab_reclaimable:72116kB slab_unreclaimable:1129488kB kernel_stack:4192kB pagetables:4788kB unstable:0kB bounce:0kB free_pcp:740kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  450.267804] lowmem_reserve[]: 0 0 0 0
[  450.272899] Node 0 DMA: 45*4kB (UME) 31*8kB (UME) 19*16kB (ME) 10*32kB (ME) 7*64kB (ME) 7*128kB (UME) 3*256kB (UME) 2*512kB (UM) 2*1024kB (M) 0*2048kB 0*4096kB = 6236kB
[  450.286381] Node 0 DMA32: 2006*4kB (UME) 453*8kB (UME) 68*16kB (UME) 15*32kB (UM) 2*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13472kB
[  450.299928] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  450.304622] 985 total pagecache pages
[  450.306857] 111 pages in swap cache
[  450.308870] Swap cache stats: add 9380, delete 9269, find 113/183
[  450.312090] Free swap  = 381628kB
[  450.314188] Total swap = 418492kB
[  450.317644] 421006 pages RAM
[  450.319573] 0 pages HighMem/MovableOnly
[  450.322100] 21084 pages reserved
[  450.323853] 0 pages hwpoisoned
...

-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (13 preceding siblings ...)
  2016-04-12 20:02 ` [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Yauhen Kharuzhy
@ 2016-04-13 21:21 ` Yauhen Kharuzhy
  2016-04-14  8:45   ` Anand Jain
  2016-04-14 19:12 ` Yauhen Kharuzhy
  2016-04-14 23:09 ` Yauhen Kharuzhy
  16 siblings, 1 reply; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-13 21:21 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba

On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
> Thanks for various comments, tests and feedback.

Hmm... I broke it :)

I get a kernel oops after a few cycles of drive removal, insertion and
replacement.

My steps to reproduce:
1) create a RAID (I used RAID6)
2) remove a drive (I tested the /sys interface and VBox storage
management for this; reproduced with both methods). Write & sync the fs
to detect the failure.
3) insert the drive again
4) wipe it
5) replace the missing device (reproduced with user-initiated replace and
autoreplace)
6) repeat steps 2-3

At reboot, the kernel oopses (see below). Sometimes more than one repeat of
steps 2-5 is needed (I am still working to localize this now).

Commands from my last session:

root@grack12:~# btrfs fi show
Label: 'test'  uuid: 833fef31-5536-411c-8f58-53b527569fa5
        Total devices 4 FS bytes used 768.00KiB
        devid    1 size 8.00GiB used 1.41GiB path /dev/sdc
        devid    2 size 8.00GiB used 1.41GiB path /dev/sdd
        devid    3 size 8.00GiB used 1.41GiB path /dev/sde
        devid    5 size 8.00GiB used 1.12GiB path /dev/sdg

Global spare

root@test:~# ls -l /sys/block/sdg
lrwxrwxrwx 1 root root 0 Apr  8 20:03 /sys/block/sdg -> ../devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:0/6:0:0:0/block/sdg
root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete 
root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan 
root@test:~# wipefs -a /dev/sdg
8 bytes were erased at offset 0x10040 (btrfs)
they were: 5f 42 48 52 66 53 5f 4d
root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete 
root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan 
root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete 
root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan 
root@test:~# wipefs -a /dev/sdg
8 bytes were erased at offset 0x10040 (btrfs)
they were: 5f 42 48 52 66 53 5f 4d
root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete 
root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan 
root@test:~# wipefs -a /dev/sdg
8 bytes were erased at offset 0x10040 (btrfs)
they were: 5f 42 48 52 66 53 5f 4d
root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
root@test:~# reboot

Oops itself:

[  349.559019] BTRFS info (device sdd): dev_replace from <missing disk> (devid 5) to /dev/sdg started
[  349.647966] BTRFS info (device sdd): dev_replace from <missing disk> (devid 5) to /dev/sdg finished
[  373.701691] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC 
[  373.731698] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic snd_pcm snd_timer iTCO_wdt hmac drbg iTCO_vendor_support ansi_cprng snd soundcore aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse evdev serio_raw pcspkr acpi_cpufreq 8250_fintek lpc_ich video ac battery parport_pc tpm_tis tpm mfd_core parport button processor rng_core i2c_piix4 btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod sr_mod cdrom ata_generic ahci libahci ata_piix libata crc32c_intel scsi_mod pcnet32 mii
[  373.933548] CPU: 0 PID: 3955 Comm: umount Not tainted 4.4.5-scst31x-debug+ #33
[  373.941730] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  373.945337] task: ffff88005b2fe080 ti: ffff880056cbc000 task.ti: ffff880056cbc000
[  373.951991] RIP: 0010:[<ffffffff811a1069>]  [<ffffffff811a1069>] __filemap_fdatawrite_range+0x29/0xf0
[  373.954135] RSP: 0018:ffff880056cbfd50  EFLAGS: 00010286
[  373.972201] RAX: 0000000000000000 RBX: ffff880056cbfd50 RCX: 0000000000000000
[  374.003989] RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff880056cbfdb0
[  374.044001] RBP: ffff880056cbfdc8 R08: 0000000000000000 R09: 0000000000000002
[  374.099584] R10: ffffffff81d1b880 R11: ffffffff81d1b840 R12: 00441f0f0000441f
[  374.113566] R13: ffff88005b2fe080 R14: 0000000000000000 R15: ffff88005b2fe080
[  374.157600] FS:  00007f9281eea7e0(0000) GS:ffff880066600000(0000) knlGS:0000000000000000
[  374.164870] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  374.184379] CR2: 0000000001277048 CR3: 0000000060324000 CR4: 00000000000406f0
[  374.190320] Stack:
[  374.201539]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  374.245946]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  374.286073]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  374.313913] Call Trace:
[  374.329665]  [<ffffffff811a11ec>] filemap_flush+0x1c/0x20
[  374.357488]  [<ffffffff812619d6>] __sync_blockdev+0x26/0x30
[  374.389452]  [<ffffffff8125814e>] sync_filesystem+0x4e/0xa0
[  374.425568]  [<ffffffff81220867>] generic_shutdown_super+0x27/0xf0
[  374.457622]  [<ffffffff81220ba2>] kill_anon_super+0x12/0x20
[  374.489572]  [<ffffffffa018ea88>] btrfs_kill_super+0x18/0x120 [btrfs]
[  374.529552]  [<ffffffff81220e1e>] deactivate_locked_super+0x3e/0x70
[  374.561196]  [<ffffffff8122126c>] deactivate_super+0x5c/0x60
[  374.602015]  [<ffffffff8124182f>] cleanup_mnt+0x3f/0x90
[  374.632955]  [<ffffffff812418c2>] __cleanup_mnt+0x12/0x20
[  374.652292]  [<ffffffff810a5533>] task_work_run+0x73/0xa0
[  374.662654]  [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
[  374.672858]  [<ffffffff81003e0c>] syscall_return_slowpath+0xcc/0xe0
[  374.702077]  [<ffffffff816379e2>] int_ret_from_sys_call+0x25/0x9f
[  374.721467] Code: ff 90 0f 1f 44 00 00 55 31 c0 41 89 c8 b9 0c 00 00 00 48 89 e5 41 55 41 54 53 48 8d 5d 88 49 89 fc 48 89 df 48 83 ec 60 f3 48 ab <49> 8b 3c 24 48 b8 ff ff ff ff ff ff ff 7f 48 89 55 a0 48 89 45
[  374.853615] RIP  [<ffffffff811a1069>] __filemap_fdatawrite_range+0x29/0xf0
[  374.909672]  RSP <ffff880056cbfd50>
[  374.937941] ---[ end trace 2bbc2fd699f402ff ]---


-- 
Yauhen Kharuzhy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-12 20:02 ` [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Yauhen Kharuzhy
@ 2016-04-13 22:43   ` Anand Jain
  0 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-13 22:43 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs, dsterba



On 04/13/2016 04:02 AM, Yauhen Kharuzhy wrote:
> On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
>> Thanks for various comments, tests and feedback.
>
> Seems working for me. I have triggered OOM killer while testing this in VirtualBox but


> I don't think that it is related to autoreplace,

Yes, it looks that way. I suggest reporting those bugs separately rather
than as a review/testing reply to the patch.

Thanks, Anand


 > it seems to be scrub implementation issue:

> [  449.615157] CPU: 0 PID: 1771 Comm: btrfs-health Not tainted 4.4.5-scst31x+ #25
> [  449.621763] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [  449.647614]  0000000000000000 ffff8800601c7660 ffffffff813529e3 ffff8800601c7858
> [  449.659766]  ffff88005ba66140 ffff8800601c76d0 ffffffff8121b41e ffff8800601c7680
> [  449.683167]  ffffffff810d7ccd ffff8800601c76a0 0000000000000206 ffffffff81c6d0e0
> [  449.700746] Call Trace:
> [  449.705078]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
> [  449.715238]  [<ffffffff8121b41e>] dump_header+0x5a/0x21d
> [  449.725400]  [<ffffffff810d7ccd>] ? trace_hardirqs_on+0xd/0x10
> [  449.741261]  [<ffffffff811a3e80>] oom_kill_process+0x200/0x3d0
> [  449.753042]  [<ffffffff811a4602>] out_of_memory+0x562/0x580
> [  449.765923]  [<ffffffff811a4373>] ? out_of_memory+0x2d3/0x580
> [  449.768455]  [<ffffffff811aa98c>] __alloc_pages_nodemask+0xafc/0xc80
> [  449.770281]  [<ffffffff811f5ebb>] alloc_pages_current+0x9b/0x1c0
> [  449.783371]  [<ffffffffa02160f5>] scrub_pages+0xb5/0x400 [btrfs]
> [  449.804598]  [<ffffffffa0212a65>] ? scrub_find_csum+0xd5/0x110 [btrfs]
> [  449.819145]  [<ffffffffa0216dce>] scrub_stripe+0x82e/0x1180 [btrfs]
> [  449.829299]  [<ffffffffa0217830>] scrub_chunk+0x110/0x160 [btrfs]
> [  449.835859]  [<ffffffffa0217afc>] scrub_enumerate_chunks+0x27c/0x560 [btrfs]
> [  449.852805]  [<ffffffff810ceb00>] ? wake_atomic_t_function+0x30/0x70
> [  449.867081]  [<ffffffffa021930d>] btrfs_scrub_dev+0x1cd/0x680 [btrfs]
> [  449.876784]  [<ffffffffa022d234>] btrfs_dev_replace_start+0x334/0x540 [btrfs]
> [  449.891503]  [<ffffffffa022def8>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
> [  449.911958]  [<ffffffffa01ac4e6>] health_kthread+0x246/0x490 [btrfs]
> [  449.922132]  [<ffffffffa01ac3d8>] ? health_kthread+0x138/0x490 [btrfs]
> [  449.946273]  [<ffffffffa01ac2a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
> [  449.975742]  [<ffffffff810a70df>] kthread+0xef/0x110
> [  449.994914]  [<ffffffff810dc081>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> [  450.022306]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
> [  450.036069]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
> [  450.045622]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
> [  450.047625] Mem-Info:
> [  450.055195] active_anon:30 inactive_anon:71 isolated_anon:0
> [  450.055195]  active_file:220 inactive_file:980 isolated_file:0
> [  450.055195]  unevictable:527 dirty:41 writeback:59 unstable:0
> [  450.055195]  slab_reclaimable:18226 slab_unreclaimable:283931
> [  450.055195]  mapped:612 shmem:10 pagetables:1209 bounce:0
> [  450.055195]  free:3310 free_pcp:153 free_cma:0
> [  450.069070] Node 0 DMA free:6232kB min:48kB low:60kB high:72kB active_anon:0kB inactive_anon:0kB active_file:8kB inactive_file:16kB unevictable:28kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:28kB dirty:4kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:788kB slab_unreclaimable:6236kB kernel_stack:96kB pagetables:48kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:220 all_unreclaimable? yes
> [  450.161023] lowmem_reserve[]: 0 1546 1546 1546
> [  450.181786] Node 0 DMA32 free:10620kB min:4896kB low:6120kB high:7344kB active_anon:120kB inactive_anon:176kB active_file:964kB inactive_file:1132kB unevictable:2080kB isolated(anon):0kB isolated(file):0kB present:1668032kB managed:1583780kB mlocked:2080kB dirty:160kB writeback:112kB mapped:2568kB shmem:40kB slab_reclaimable:72116kB slab_unreclaimable:1129488kB kernel_stack:4192kB pagetables:4788kB unstable:0kB bounce:0kB free_pcp:740kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [  450.267804] lowmem_reserve[]: 0 0 0 0
> [  450.272899] Node 0 DMA: 45*4kB (UME) 31*8kB (UME) 19*16kB (ME) 10*32kB (ME) 7*64kB (ME) 7*128kB (UME) 3*256kB (UME) 2*512kB (UM) 2*1024kB (M) 0*2048kB 0*4096kB = 6236kB
> [  450.286381] Node 0 DMA32: 2006*4kB (UME) 453*8kB (UME) 68*16kB (UME) 15*32kB (UM) 2*64kB (UM) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13472kB
> [  450.299928] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [  450.304622] 985 total pagecache pages
> [  450.306857] 111 pages in swap cache
> [  450.308870] Swap cache stats: add 9380, delete 9269, find 113/183
> [  450.312090] Free swap  = 381628kB
> [  450.314188] Total swap = 418492kB
> [  450.317644] 421006 pages RAM
> [  450.319573] 0 pages HighMem/MovableOnly
> [  450.322100] 21084 pages reserved
> [  450.323853] 0 pages hwpoisoned
> ...
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] Btrfs: Set superblock s_bdev field properly at device closing
  2016-04-12 14:16 ` [PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
@ 2016-04-14  1:15   ` Yauhen Kharuzhy
  2016-04-14  6:59     ` Anand Jain
  2016-04-14 10:51   ` [PATCH v5 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
  1 sibling, 1 reply; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-14  1:15 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, Yauhen Kharuzhy

The fs_info->sb->s_bdev field isn't set to any value at mount time but is
set after a device replace or at device close. The existing code of
device_force_close() checks whether the current s_bdev equals the bdev
being closed and, if so, replaces it with the bdev field of the first
btrfs_device in the device list. That device may be the same as the one
being closed, in which case the s_bdev field will be invalid.

If s_bdev is not NULL but references a freed block device, the kernel
oopses at filesystem sync time on unmount.

For a multi-device FS, setting this field may be pointless, but its use
should be consistent across all of the btrfs code. So, set it at mount
time and select a valid device at device close time.

An alternative solution may be to not set s_bdev at all.
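
The selection rule can be sketched outside the kernel as follows. This is
only an illustrative model of the loop in the diff below: struct sketch_dev
and pick_replacement_bdev() are hypothetical names, and bdev is treated as
an opaque pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct btrfs_device. */
struct sketch_dev {
	const void *bdev;	/* NULL when no block device is open */
};

/*
 * Return the bdev of the first device that has an open block device
 * different from the one being closed, or NULL if there is none.
 */
static const void *pick_replacement_bdev(const struct sketch_dev *devs,
					 size_t n, const void *closing)
{
	size_t i;

	assert(devs != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (devs[i].bdev && devs[i].bdev != closing)
			return devs[i].bdev;
	}
	return NULL;
}
```

Returning NULL when nothing suitable is found corresponds to the
"found ? next_device->bdev : NULL" expressions in the patch, so a stale
pointer is never left behind.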

Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
---
 fs/btrfs/super.c   |  1 +
 fs/btrfs/volumes.c | 16 ++++++++++++----
 2 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3dd154e..1a2c58f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1522,6 +1522,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 		char b[BDEVNAME_SIZE];
 
 		strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
+		s->s_bdev = bdev;
 		btrfs_sb(s)->bdev_holder = fs_type;
 		error = btrfs_fill_super(s, fs_devices, data,
 					 flags & MS_SILENT ? 1 : 0);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 08ab116..f14f3f2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7132,6 +7132,7 @@ void device_force_close(struct btrfs_device *device)
 {
 	struct btrfs_device *next_device;
 	struct btrfs_fs_devices *fs_devices;
+	int found = 0;
 
 	fs_devices = device->fs_devices;
 
@@ -7139,13 +7140,20 @@ void device_force_close(struct btrfs_device *device)
 	mutex_lock(&fs_devices->fs_info->chunk_mutex);
 	spin_lock(&fs_devices->fs_info->free_chunk_lock);
 
-	next_device = list_entry(fs_devices->devices.next,
-					struct btrfs_device, dev_list);
+	list_for_each_entry(next_device, &fs_devices->devices, dev_list) {
+		if (next_device->bdev && next_device->bdev != device->bdev) {
+			found = 1;
+			break;
+		}
+	}
+
 	if (device->bdev == fs_devices->fs_info->sb->s_bdev)
-		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
+		fs_devices->fs_info->sb->s_bdev =
+			found ? next_device->bdev : NULL;
 
 	if (device->bdev == fs_devices->latest_bdev)
-		fs_devices->latest_bdev = next_device->bdev;
+		fs_devices->latest_bdev =
+			found ? next_device->bdev : NULL;
 
 	if (device->bdev)
 		fs_devices->open_devices--;
-- 
2.5.0


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH] Btrfs: Set superblock s_bdev field properly at device closing
  2016-04-14  1:15   ` [PATCH] Btrfs: Set superblock s_bdev field properly at device closing Yauhen Kharuzhy
@ 2016-04-14  6:59     ` Anand Jain
  2016-04-14  9:10       ` Yauhen Kharuzhy
  0 siblings, 1 reply; 32+ messages in thread
From: Anand Jain @ 2016-04-14  6:59 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs



Hi Yauhen

On 04/14/2016 09:15 AM, Yauhen Kharuzhy wrote:
> fs_info->sb->s_bdev field isn't set to any value at mount time

  There was a patch to set it at the VFS layer, or something like that.

> but is set
> after device replacing or at device closing.

  Actually we update s_bdev/latest_bdev incorrectly in most of
  the device-related operations, not just here. I had planned to
  wrap all of those into a common helper function, separate
  from this patch set, when time permits. I am OK with you fixing
  all of those.

Thanks, Anand


> Existing code of
> device_force_close() checks if current s_bdev is not equal to closing
> bdev and, if equal, replace it by bdev field of first btrfs_device from
> device list. This device may be the same as closed, and s_bdev field will
> be invalid.
>
> If s_bdev is not NULL but references an freed block device, kernel
> oopses at filesystem sync time on unmount.
>
> For multi-device FS setting of this field may be senseless, but using of
> it should be consistent over the all btrfs code. So, set it on mount
> time and select valid device at device closing time.
>
> Alternative solution may be to not set s_bdev entirely.
>
> Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
> ---
>   fs/btrfs/super.c   |  1 +
>   fs/btrfs/volumes.c | 16 ++++++++++++----
>   2 files changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 3dd154e..1a2c58f 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1522,6 +1522,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
>   		char b[BDEVNAME_SIZE];
>
>   		strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
> +		s->s_bdev = bdev;
>   		btrfs_sb(s)->bdev_holder = fs_type;
>   		error = btrfs_fill_super(s, fs_devices, data,
>   					 flags & MS_SILENT ? 1 : 0);
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 08ab116..f14f3f2 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7132,6 +7132,7 @@ void device_force_close(struct btrfs_device *device)
>   {
>   	struct btrfs_device *next_device;
>   	struct btrfs_fs_devices *fs_devices;
> +	int found = 0;
>
>   	fs_devices = device->fs_devices;
>
> @@ -7139,13 +7140,20 @@ void device_force_close(struct btrfs_device *device)
>   	mutex_lock(&fs_devices->fs_info->chunk_mutex);
>   	spin_lock(&fs_devices->fs_info->free_chunk_lock);
>
> -	next_device = list_entry(fs_devices->devices.next,
> -					struct btrfs_device, dev_list);
> +	list_for_each_entry(next_device, &fs_devices->devices, dev_list) {
> +		if (next_device->bdev && next_device->bdev != device->bdev) {
> +			found = 1;
> +			break;
> +		}
> +	}
> +
>   	if (device->bdev == fs_devices->fs_info->sb->s_bdev)
> -		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
> +		fs_devices->fs_info->sb->s_bdev =
> +			found ? next_device->bdev : NULL;
>
>   	if (device->bdev == fs_devices->latest_bdev)
> -		fs_devices->latest_bdev = next_device->bdev;
> +		fs_devices->latest_bdev =
> +			found ? next_device->bdev : NULL;
>
>   	if (device->bdev)
>   		fs_devices->open_devices--;
>


* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-13 21:21 ` Yauhen Kharuzhy
@ 2016-04-14  8:45   ` Anand Jain
  2016-04-14  9:22     ` Yauhen Kharuzhy
  0 siblings, 1 reply; 32+ messages in thread
From: Anand Jain @ 2016-04-14  8:45 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs, dsterba




Thanks for the report! More below.


On 04/14/2016 05:21 AM, Yauhen Kharuzhy wrote:
> On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
>> Thanks for various comments, tests and feedback.
>
> Hmm... I broke it :)
>
> I get kernel oops after few cycles of drive removing-insertion-replacing.
>
> My steps to reproduce:
> 1) create RAID (I used RAID6)
> 2) remove drive (I tested the /sys interface for this and VBox storage
> management – reproduced with both methods). Write & sync fs to detect
> failure.
> 3) insert drive again
> 4) wipe it
> 5) replace missing device (reproduced with user-initiated replace and
> autoreplace)
> 6) repeat steps 2-3
>
> At reboot, kernel oopses (see below). Sometimes more than one repeat of
> steps 2-5 needed (I am still working to localize this now).
>
> Commands from my last session:
>
> root@grack12:~# btrfs fi show
> Label: 'test'  uuid: 833fef31-5536-411c-8f58-53b527569fa5
>          Total devices 4 FS bytes used 768.00KiB
>          devid    1 size 8.00GiB used 1.41GiB path /dev/sdc
>          devid    2 size 8.00GiB used 1.41GiB path /dev/sdd
>          devid    3 size 8.00GiB used 1.41GiB path /dev/sde
>          devid    5 size 8.00GiB used 1.12GiB path /dev/sdg
>
> Global spare
>
> root@test:~# ls -l /sys/block/sdg
> lrwxrwxrwx 1 root root 0 Apr  8 20:03 /sys/block/sdg -> ../devices/pci0000:00/0000:00:1f.2/ata7/host6/target6:0:0/6:0:0:0/block/sdg
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete
> root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
> FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# wipefs -a /dev/sdg
> 8 bytes were erased at offset 0x10040 (btrfs)
> they were: 5f 42 48 52 66 53 5f 4d
> root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete

  You may use the simpler devmgt tool: https://github.com/asj/devmgt

> root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
> FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# wipefs -a /dev/sdg
> 8 bytes were erased at offset 0x10040 (btrfs)
> they were: 5f 42 48 52 66 53 5f 4d
> root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
> root@test:~# echo 1 > /sys/class/scsi_device/6\:0\:0\:0/device/delete
> root@test:~# touch /media/833fef31-5536-411c-8f58-53b527569fa5/ && btrfs fi sync /media/833fef31-5536-411c-8f58-53b527569fa5/
> FSSync '/media/833fef31-5536-411c-8f58-53b527569fa5/'
> root@test:~# echo 0 0 0 > /sys/class/scsi_host/host6/scan
> root@test:~# wipefs -a /dev/sdg
> 8 bytes were erased at offset 0x10040 (btrfs)
> they were: 5f 42 48 52 66 53 5f 4d
> root@test:~# btrfs replace start 5 /dev/sdg /media/833fef31-5536-411c-8f58-53b527569fa5/
> root@test:~# reboot
>
> Oops itself:
>
> [  349.559019] BTRFS info (device sdd): dev_replace from <missing disk> (devid 5) to /dev/sdg started
> [  349.647966] BTRFS info (device sdd): dev_replace from <missing disk> (devid 5) to /dev/sdg finished
> [  373.701691] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
> [  373.731698] Modules linked in: cpufreq_powersave cpufreq_stats cpufreq_userspace cpufreq_conservative softdog nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ipmi_devintf ipmi_msghandler iosf_mbi crct10dif_pclmul crc32_pclmul sha256_ssse3 sha256_generic snd_pcm snd_timer iTCO_wdt hmac drbg iTCO_vendor_support ansi_cprng snd soundcore aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse evdev serio_raw pcspkr acpi_cpufreq 8250_fintek lpc_ich video ac battery parport_pc tpm_tis tpm mfd_core parport button processor rng_core i2c_piix4 btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod sr_mod cdrom ata_generic ahci libahci ata_piix libata crc32c_intel scsi_mod pcnet32 mii
> [  373.933548] CPU: 0 PID: 3955 Comm: umount Not tainted 4.4.5-scst31x-debug+ #33
> [  373.941730] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [  373.945337] task: ffff88005b2fe080 ti: ffff880056cbc000 task.ti: ffff880056cbc000
> [  373.951991] RIP: 0010:[<ffffffff811a1069>]  [<ffffffff811a1069>] __filemap_fdatawrite_range+0x29/0xf0


  You are failing the replace target, presumably while the replace is
  still running. Note, however, that this patch set does not fail the
  replace target on errors (as of now I have no idea how to do that
  without leading to a messy situation), so it follows the original
  code path, the same as without this patch set.
  Next, without this patch set we originally never close a device
  on errors. So when you delete the device at the block layer and
  re-attach (scan) it, most probably you get a new device path
  to the block device (which rather defeats the idea of testing
  an intermittently disappearing device). So I doubt whether the
  test case is reliable, whether the above panic is btrfs related,
  and whether it is related to this patch set.



HTH.

Thanks, Anand


> [  373.954135] RSP: 0018:ffff880056cbfd50  EFLAGS: 00010286
> [  373.972201] RAX: 0000000000000000 RBX: ffff880056cbfd50 RCX: 0000000000000000
> [  374.003989] RDX: 7fffffffffffffff RSI: 0000000000000000 RDI: ffff880056cbfdb0
> [  374.044001] RBP: ffff880056cbfdc8 R08: 0000000000000000 R09: 0000000000000002
> [  374.099584] R10: ffffffff81d1b880 R11: ffffffff81d1b840 R12: 00441f0f0000441f
> [  374.113566] R13: ffff88005b2fe080 R14: 0000000000000000 R15: ffff88005b2fe080
> [  374.157600] FS:  00007f9281eea7e0(0000) GS:ffff880066600000(0000) knlGS:0000000000000000
> [  374.164870] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  374.184379] CR2: 0000000001277048 CR3: 0000000060324000 CR4: 00000000000406f0
> [  374.190320] Stack:
> [  374.201539]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  374.245946]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  374.286073]  0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  374.313913] Call Trace:
> [  374.329665]  [<ffffffff811a11ec>] filemap_flush+0x1c/0x20
> [  374.357488]  [<ffffffff812619d6>] __sync_blockdev+0x26/0x30
> [  374.389452]  [<ffffffff8125814e>] sync_filesystem+0x4e/0xa0
> [  374.425568]  [<ffffffff81220867>] generic_shutdown_super+0x27/0xf0
> [  374.457622]  [<ffffffff81220ba2>] kill_anon_super+0x12/0x20
> [  374.489572]  [<ffffffffa018ea88>] btrfs_kill_super+0x18/0x120 [btrfs]
> [  374.529552]  [<ffffffff81220e1e>] deactivate_locked_super+0x3e/0x70
> [  374.561196]  [<ffffffff8122126c>] deactivate_super+0x5c/0x60
> [  374.602015]  [<ffffffff8124182f>] cleanup_mnt+0x3f/0x90
> [  374.632955]  [<ffffffff812418c2>] __cleanup_mnt+0x12/0x20
> [  374.652292]  [<ffffffff810a5533>] task_work_run+0x73/0xa0
> [  374.662654]  [<ffffffff810032ac>] exit_to_usermode_loop+0xcc/0xd0
> [  374.672858]  [<ffffffff81003e0c>] syscall_return_slowpath+0xcc/0xe0
> [  374.702077]  [<ffffffff816379e2>] int_ret_from_sys_call+0x25/0x9f
> [  374.721467] Code: ff 90 0f 1f 44 00 00 55 31 c0 41 89 c8 b9 0c 00 00 00 48 89 e5 41 55 41 54 53 48 8d 5d 88 49 89 fc 48 89 df 48 83 ec 60 f3 48 ab <49> 8b 3c 24 48 b8 ff ff ff ff ff ff ff 7f 48 89 55 a0 48 89 45
> [  374.853615] RIP  [<ffffffff811a1069>] __filemap_fdatawrite_range+0x29/0xf0
> [  374.909672]  RSP <ffff880056cbfd50>
> [  374.937941] ---[ end trace 2bbc2fd699f402ff ]---
>
>


* Re: [PATCH] Btrfs: Set superblock s_bdev field properly at device closing
  2016-04-14  6:59     ` Anand Jain
@ 2016-04-14  9:10       ` Yauhen Kharuzhy
  2016-04-14  9:48         ` Anand Jain
  0 siblings, 1 reply; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-14  9:10 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Thu, Apr 14, 2016 at 02:59:23PM +0800, Anand Jain wrote:
> 
> 
> Hi Yauhen
> 
> On 04/14/2016 09:15 AM, Yauhen Kharuzhy wrote:
> >fs_info->sb->s_bdev field isn't set to any value at mount time
> 
>  There were patch to do set it at the vfs layer, or something like that.
> 
> >but is set
> >after device replacing or at device closing.
> 
>  Actually we are updating s_bdev/latest_bdev wrongly at most of
>  the device related operations, and not just here. I had plans
>  of wrapping all those into a common helper function and separate
>  from this patch set, when time permits, I am ok if you are fixing
>  all those.
> 
> Thanks, Anand

Yes, I can, but I need to know the expected behaviour. Should we set the s_bdev
field (as proposed in my patch), or can we keep it NULL because btrfs has
its own sync implementation and a non-NULL value makes no sense for a multi-device FS?

Actually s_bdev is changed in only two places: after device replace
and at device closing. At device replace we may always (I hope...) assume that
the target device's bdev is valid, so we only need to change the behaviour of
device_force_close(). But I still haven't checked the exact meaning of
latest_bdev.

In any case, I will first try to help make the global spare code stable.

> 
> 
> >Existing code of
> >device_force_close() checks if current s_bdev is not equal to closing
> >bdev and, if equal, replace it by bdev field of first btrfs_device from
> >device list. This device may be the same as closed, and s_bdev field will
> >be invalid.
> >
> >If s_bdev is not NULL but references an freed block device, kernel
> >oopses at filesystem sync time on unmount.
> >
> >For multi-device FS setting of this field may be senseless, but using of
> >it should be consistent over the all btrfs code. So, set it on mount
> >time and select valid device at device closing time.
> >
> >Alternative solution may be to not set s_bdev entirely.
> >
> >Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
> >---
> >  fs/btrfs/super.c   |  1 +
> >  fs/btrfs/volumes.c | 16 ++++++++++++----
> >  2 files changed, 13 insertions(+), 4 deletions(-)
> >
> >diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> >index 3dd154e..1a2c58f 100644
> >--- a/fs/btrfs/super.c
> >+++ b/fs/btrfs/super.c
> >@@ -1522,6 +1522,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
> >  		char b[BDEVNAME_SIZE];
> >
> >  		strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
> >+		s->s_bdev = bdev;
> >  		btrfs_sb(s)->bdev_holder = fs_type;
> >  		error = btrfs_fill_super(s, fs_devices, data,
> >  					 flags & MS_SILENT ? 1 : 0);
> >diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> >index 08ab116..f14f3f2 100644
> >--- a/fs/btrfs/volumes.c
> >+++ b/fs/btrfs/volumes.c
> >@@ -7132,6 +7132,7 @@ void device_force_close(struct btrfs_device *device)
> >  {
> >  	struct btrfs_device *next_device;
> >  	struct btrfs_fs_devices *fs_devices;
> >+	int found = 0;
> >
> >  	fs_devices = device->fs_devices;
> >
> >@@ -7139,13 +7140,20 @@ void device_force_close(struct btrfs_device *device)
> >  	mutex_lock(&fs_devices->fs_info->chunk_mutex);
> >  	spin_lock(&fs_devices->fs_info->free_chunk_lock);
> >
> >-	next_device = list_entry(fs_devices->devices.next,
> >-					struct btrfs_device, dev_list);
> >+	list_for_each_entry(next_device, &fs_devices->devices, dev_list) {
> >+		if (next_device->bdev && next_device->bdev != device->bdev) {
> >+			found = 1;
> >+			break;
> >+		}
> >+	}
> >+
> >  	if (device->bdev == fs_devices->fs_info->sb->s_bdev)
> >-		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
> >+		fs_devices->fs_info->sb->s_bdev =
> >+			found ? next_device->bdev : NULL;
> >
> >  	if (device->bdev == fs_devices->latest_bdev)
> >-		fs_devices->latest_bdev = next_device->bdev;
> >+		fs_devices->latest_bdev =
> >+			found ? next_device->bdev : NULL;
> >
> >  	if (device->bdev)
> >  		fs_devices->open_devices--;
> >

-- 
Yauhen Kharuzhy


* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-14  8:45   ` Anand Jain
@ 2016-04-14  9:22     ` Yauhen Kharuzhy
  2016-04-14  9:57       ` Anand Jain
  0 siblings, 1 reply; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-14  9:22 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba

On Thu, Apr 14, 2016 at 04:45:11PM +0800, Anand Jain wrote:
> 
> 
> 
> Thanks for the report ! more below..
> 
> 
>  You may use simpler devmgt tool, https://github.com/asj/devmgt

Thanks, will try.

> 
>  You are failing the replace-target, presumably when the replace is
>  still running, however note that this patch-set does not fail the
>  replace-target for errors (as of now I have no idea how to do that
>  without leading to a messy situation), and so it would follow the
>  original code as without this patch.
>  Next, originally with-out this patch-set we won't close any device
>  for errors. So when you delete the device at the block-layer and
>  re-attach (scan) most probably you are having a newer device path
>  to the block device. (which kind of defeats the idea of testing
>  an intermittently disappearing device), so I doubt, if the test
>  case is reliable,  and above panic is btrfs related and if its
>  this patch-set related.

No, it is fixed by my latest patch (about the s_bdev field in the
superblock). The actual sequence which leads to the oops is:
1) FS is mounted, s_bdev is NULL
2) the failed device is closed, s_bdev is untouched
3) the missing device is replaced, s_bdev is set to non-NULL – the bdev of
the replaced device
4) at the second device closing, s_bdev is "changed" to the first device in
the device list, but that is... the same device, because the closed device
has not been deleted from the list yet!
5) after the device is closed, s_bdev points to an invalid bdev.
6) umount -> sync_filesystem() -> sync_blockdev(s_bdev) -> OOPS.
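
The sequence above can be replayed in a small userspace model. This is only
a sketch: struct bdev, struct dev, struct fs and all function names are
hypothetical stand-ins rather than kernel code, and it assumes (as the
report describes) that the replaced device ends up first in the device list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct block_device and struct btrfs_device. */
struct bdev { bool open; };
struct dev  { struct bdev *bdev; };

struct fs {
	struct dev  *head;    /* first entry of the device list */
	struct bdev *s_bdev;  /* models fs_info->sb->s_bdev */
};

/* Replace without a NULL guard: a missing source (bdev == NULL)
 * compares equal to a NULL s_bdev. */
static void replace_unguarded(struct fs *fs, struct dev *src, struct bdev *tgt)
{
	if (fs->s_bdev == src->bdev)
		fs->s_bdev = tgt;
	src->bdev = tgt;   /* the slot now refers to the replacement */
	fs->head = src;    /* assumption: it ends up first in the list */
}

/* Force close that always falls back to the list head, even when the
 * head is the device being closed. */
static void force_close_head(struct fs *fs, struct dev *d)
{
	struct dev *next = fs->head;

	if (d->bdev == fs->s_bdev)
		fs->s_bdev = next->bdev;
	if (d->bdev)
		d->bdev->open = false;  /* models the eventual blkdev_put() */
	d->bdev = NULL;
}

/* Replays steps 1-5; returns true if s_bdev is left pointing at a
 * closed bdev. */
static bool s_bdev_is_stale_after_replay(void)
{
	struct bdev disk_a = { true }, disk_b = { true }, other = { true };
	struct dev first = { &other }, victim = { &disk_a };
	struct fs fs = { &first, NULL };          /* 1) mounted, s_bdev NULL */

	force_close_head(&fs, &victim);           /* 2) s_bdev untouched */
	assert(fs.s_bdev == NULL);
	replace_unguarded(&fs, &victim, &disk_b); /* 3) s_bdev becomes non-NULL */
	force_close_head(&fs, &victim);           /* 4) head is the closing device */
	return fs.s_bdev && !fs.s_bdev->open;     /* 5) stale pointer */
}
```

In this model the function returns true, which corresponds to the state just
before step 6.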


-- 
Yauhen Kharuzhy


* Re: [PATCH] Btrfs: Set superblock s_bdev field properly at device closing
  2016-04-14  9:10       ` Yauhen Kharuzhy
@ 2016-04-14  9:48         ` Anand Jain
  0 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-14  9:48 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs



On 04/14/2016 05:10 PM, Yauhen Kharuzhy wrote:
> On Thu, Apr 14, 2016 at 02:59:23PM +0800, Anand Jain wrote:
>>
>>
>> Hi Yauhen
>>
>> On 04/14/2016 09:15 AM, Yauhen Kharuzhy wrote:
>>> fs_info->sb->s_bdev field isn't set to any value at mount time
>>
>>   There were patch to do set it at the vfs layer, or something like that.
>>
>>> but is set
>>> after device replacing or at device closing.
>>
>>   Actually we are updating s_bdev/latest_bdev wrongly at most of
>>   the device related operations, and not just here. I had plans
>>   of wrapping all those into a common helper function and separate
>>   from this patch set, when time permits, I am ok if you are fixing
>>   all those.
>>
>> Thanks, Anand
>
> Yes, I can but I need to know expected behaviour. Should we set a s_bdev
> field (as proposed in my patch) or we can keep it NULL because btrfs has
> its own sync implementation and non-NULL value has no sense with multi-device FS?

  I am not completely sure; I need to dig. But I think, and suggest,
  that we not change s_bdev, mainly because:
  For ext4 the vfs sets s_bdev (yes, we use mount_fs() instead).
  Also, from
    https://lwn.net/Articles/379862/
  it is clear that we originally planned to keep s_bdev
  NULL.
  But that is quite old, and there were some patches to handle multiple
  devices in the vfs layer. That part needs a review, and I am not
  sure about it.

  In any case, it's better to have one fix per patch, and
  s_bdev is distinctly a candidate for a fresh analysis and a
  separate fix.

> Actually s_bdev is changed in two locations only: after device replace
> and at device closing. At device replace we always (I hope...) may suppose that
> target device's bdev is valid, so we need to change behaviour of
> device_force_close() only.

  I will do something like this..

-----
-       if (fs_info->sb->s_bdev == src_device->bdev)
+       if (fs_info->sb->s_bdev &&
+               (fs_info->sb->s_bdev == src_device->bdev))
                 fs_info->sb->s_bdev = tgt_device->bdev;
-----
so that we won't have s_bdev set to something after replacing
a missing or failed device. I will also move these updates to a helper function.
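
The guarded comparison in the diff above could be wrapped in a helper along
these lines. This is a userspace sketch only: struct bdev and the name
sb_bdev_after_replace() are hypothetical, and the assert encodes the
assumption (stated earlier in the thread) that the replace target always
has a valid bdev.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct block_device. */
struct bdev { int id; };

/*
 * Returns the value s_bdev should have after a replace finishes.
 * A NULL s_bdev stays NULL, even when the source device is missing
 * (src_bdev == NULL) and would otherwise compare equal to it.
 */
static struct bdev *sb_bdev_after_replace(struct bdev *s_bdev,
					  struct bdev *src_bdev,
					  struct bdev *tgt_bdev)
{
	assert(tgt_bdev != NULL);  /* assumption: target is always open */

	if (s_bdev && s_bdev == src_bdev)
		return tgt_bdev;
	return s_bdev;
}
```

With a missing source and a NULL s_bdev the helper returns NULL, while an
s_bdev that really pointed at the source is moved to the target.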

Thanks, Anand


> But I still didn't check latest_bdev exactly meaning.


> In any case, I will try to help to make global spare code stable first.
>
>>
>>
>>> Existing code of
>>> device_force_close() checks if current s_bdev is not equal to closing
>>> bdev and, if equal, replace it by bdev field of first btrfs_device from
>>> device list. This device may be the same as closed, and s_bdev field will
>>> be invalid.
>>>
>>> If s_bdev is not NULL but references an freed block device, kernel
>>> oopses at filesystem sync time on unmount.
>>>
>>> For multi-device FS setting of this field may be senseless, but using of
>>> it should be consistent over the all btrfs code. So, set it on mount
>>> time and select valid device at device closing time.
>>>
>>> Alternative solution may be to not set s_bdev entirely.
>>>
>>> Signed-off-by: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>
>>> ---
>>>   fs/btrfs/super.c   |  1 +
>>>   fs/btrfs/volumes.c | 16 ++++++++++++----
>>>   2 files changed, 13 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
>>> index 3dd154e..1a2c58f 100644
>>> --- a/fs/btrfs/super.c
>>> +++ b/fs/btrfs/super.c
>>> @@ -1522,6 +1522,7 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
>>>   		char b[BDEVNAME_SIZE];
>>>
>>>   		strlcpy(s->s_id, bdevname(bdev, b), sizeof(s->s_id));
>>> +		s->s_bdev = bdev;
>>>   		btrfs_sb(s)->bdev_holder = fs_type;
>>>   		error = btrfs_fill_super(s, fs_devices, data,
>>>   					 flags & MS_SILENT ? 1 : 0);
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 08ab116..f14f3f2 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -7132,6 +7132,7 @@ void device_force_close(struct btrfs_device *device)
>>>   {
>>>   	struct btrfs_device *next_device;
>>>   	struct btrfs_fs_devices *fs_devices;
>>> +	int found = 0;
>>>
>>>   	fs_devices = device->fs_devices;
>>>
>>> @@ -7139,13 +7140,20 @@ void device_force_close(struct btrfs_device *device)
>>>   	mutex_lock(&fs_devices->fs_info->chunk_mutex);
>>>   	spin_lock(&fs_devices->fs_info->free_chunk_lock);
>>>
>>> -	next_device = list_entry(fs_devices->devices.next,
>>> -					struct btrfs_device, dev_list);
>>> +	list_for_each_entry(next_device, &fs_devices->devices, dev_list) {
>>> +		if (next_device->bdev && next_device->bdev != device->bdev) {
>>> +			found = 1;
>>> +			break;
>>> +		}
>>> +	}
>>> +
>>>   	if (device->bdev == fs_devices->fs_info->sb->s_bdev)
>>> -		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
>>> +		fs_devices->fs_info->sb->s_bdev =
>>> +			found ? next_device->bdev : NULL;
>>>
>>>   	if (device->bdev == fs_devices->latest_bdev)
>>> -		fs_devices->latest_bdev = next_device->bdev;
>>> +		fs_devices->latest_bdev =
>>> +			found ? next_device->bdev : NULL;
>>>
>>>   	if (device->bdev)
>>>   		fs_devices->open_devices--;
>>>
>


* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-14  9:22     ` Yauhen Kharuzhy
@ 2016-04-14  9:57       ` Anand Jain
  0 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-14  9:57 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs, dsterba



On 04/14/2016 05:22 PM, Yauhen Kharuzhy wrote:
> On Thu, Apr 14, 2016 at 04:45:11PM +0800, Anand Jain wrote:
>>
>>
>>
>> Thanks for the report ! more below..
>>
>>
>>   You may use simpler devmgt tool, https://github.com/asj/devmgt
>
> Thanks, will try.
>
>>
>>   You are failing the replace-target, presumably when the replace is
>>   still running, however note that this patch-set does not fail the
>>   replace-target for errors (as of now I have no idea how to do that
>>   without leading to a messy situation), and so it would follow the
>>   original code as without this patch.
>>   Next, originally with-out this patch-set we won't close any device
>>   for errors. So when you delete the device at the block-layer and
>>   re-attach (scan) most probably you are having a newer device path
>>   to the block device. (which kind of defeats the idea of testing
>>   an intermittently disappearing device), so I doubt, if the test
>>   case is reliable,  and above panic is btrfs related and if its
>>   this patch-set related.
>
> No, It is fixed by my latest patch (about of s_bdev field in
> superblock). Actual sequence which leads to oops is:
> 1) FS is mounted, s_bdev is NULL
> 2) failed device is closed, s_bdev untouched


> 3) missing device is replaced, s_bdev is set to non-NULL – bdev of
> the replaced device
> 4) at second device closing, s_bdev is "changed" to first device from
> the device list but it is... some device because closed dev still
> didn't delete from the list!
> 5) after device closing, s_bdev points to invalid bdev.
> 6) umount -> sync_filesystem() -> sync_blokdev(s_bdev) -> OOPS.
>

  This is wrong. It should be the other way around: s_bdev
  should continue to be NULL, and if it stays NULL the sync
  thread will fail safe.

  The diff sent in the other thread will fix it.

Thanks, Anand


* [PATCH v5 11/13] btrfs: introduce device dynamic state transition to offline or failed
  2016-04-12 14:16 ` [PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
  2016-04-14  1:15   ` [PATCH] Btrfs: Set superblock s_bdev field properly at device closing Yauhen Kharuzhy
@ 2016-04-14 10:51   ` Anand Jain
  2016-04-14 16:56     ` Yauhen Kharuzhy
  1 sibling, 1 reply; 32+ messages in thread
From: Anand Jain @ 2016-04-14 10:51 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, yauhen.kharuzhy

From: Anand Jain <Anand.Jain@oracle.com>

This patch provides helper functions to force a device to offline
or failed. We need these device states for the following reasons:
1) a. it can be reported that a device has failed when it does
   b. close the device when it goes offline so that the block layer
      can clean up
2) identify the candidate for the auto replace
3) avoid further commit errors reported against the failing device and
4) a device in a multi-device btrfs may go offline from the system
   (but as of now in some system configs btrfs gets unmounted in this
    context, which is not the correct behavior)

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
---
v5:
  Originally we had a bug, which is fixed in the patch
   [PATCH] btrfs: s_bdev is not null after missing replace
  Incorporate those changes when force closing a failed device.
  To test, please apply both this patch and the above patch, which
  fixes the original issue; that issue was not introduced by this
  patch set.

v4..v1:
  please refer to the patch set cover letter

 fs/btrfs/volumes.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h |  14 ++++++
 2 files changed, 153 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f76e2c4aac96..5b5bc48c8a98 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7148,3 +7148,142 @@ out:
 	read_unlock(&map_tree->map_tree.lock);
 	return ret;
 }
+
+static void __close_device(struct work_struct *work)
+{
+	struct btrfs_device *device;
+
+	device = container_of(work, struct btrfs_device, rcu_work);
+
+	if (device->closing_bdev)
+		blkdev_put(device->closing_bdev, device->mode);
+
+	device->closing_bdev = NULL;
+}
+
+static void close_device(struct rcu_head *head)
+{
+	struct btrfs_device *device;
+
+	device = container_of(head, struct btrfs_device, rcu);
+
+	INIT_WORK(&device->rcu_work, __close_device);
+	schedule_work(&device->rcu_work);
+}
+
+void device_force_close(struct btrfs_device *device)
+{
+	struct btrfs_device *next_device;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = device->fs_devices;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	mutex_lock(&fs_devices->fs_info->chunk_mutex);
+	spin_lock(&fs_devices->fs_info->free_chunk_lock);
+
+	next_device = list_entry(fs_devices->devices.next,
+					struct btrfs_device, dev_list);
+	if (fs_devices->fs_info->sb->s_bdev &&
+		(fs_devices->fs_info->sb->s_bdev == device->bdev))
+		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
+
+	if (device->bdev == fs_devices->latest_bdev)
+		fs_devices->latest_bdev = next_device->bdev;
+
+	if (device->bdev)
+		fs_devices->open_devices--;
+
+	if (device->writeable) {
+		list_del_init(&device->dev_alloc_list);
+		fs_devices->rw_devices--;
+	}
+	device->writeable = 0;
+
+	/*
+	 * fixme: works for now, but it's better to keep the states of
+	 * missing and offline distinct, and to update the rest of the
+	 * places where we check only for missing and not for failed
+	 * or offline as of now.
+	 */
+	device->missing = 1;
+	fs_devices->missing_devices++;
+	device->closing_bdev = device->bdev;
+	device->bdev = NULL;
+
+	call_rcu(&device->rcu, close_device);
+
+	spin_unlock(&fs_devices->fs_info->free_chunk_lock);
+	mutex_unlock(&fs_devices->fs_info->chunk_mutex);
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	rcu_barrier();
+}
+
+void btrfs_device_enforce_state(struct btrfs_device *dev, char *why)
+{
+	int tolerance;
+	bool degrade_option;
+	char dev_status[10];
+	char chunk_status[25];
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = dev->fs_devices;
+	fs_info = fs_devices->fs_info;
+	degrade_option = btrfs_test_opt(fs_info->fs_root, DEGRADED);
+
+	/* todo: support seed later */
+	if (fs_devices->seeding)
+		return;
+
+	/* this shouldn't be called if device is already missing */
+	if (dev->missing || !dev->bdev)
+		return;
+
+	if (dev->offline || dev->failed)
+		return;
+
+	/* The only RW device is requested to force close; let the FS handle it */
+	if (fs_devices->rw_devices == 1) {
+		btrfs_std_error(fs_info, -EIO,
+			"force offline last RW device");
+		return;
+	}
+
+	if (!strcmp(why, "offline"))
+		dev->offline = 1;
+	else if (!strcmp(why, "failed"))
+		dev->failed = 1;
+	else
+		return;
+
+	/*
+	 * From here on, there shouldn't be any reason why we can't force
+	 * close this device
+	 */
+	btrfs_sysfs_rm_device_link(fs_devices, dev);
+	device_force_close(dev);
+	strcpy(dev_status, "closed");
+
+	tolerance = btrfs_check_degradable(fs_info,
+						fs_info->sb->s_flags);
+	if (tolerance > 0) {
+		strncpy(chunk_status, "chunk(s) degraded", 25);
+	} else if (tolerance < 0) {
+		strncpy(chunk_status, "chunk(s) failed", 25);
+	} else {
+		strncpy(chunk_status, "No chunk(s) are degraded", 25);
+	}
+
+	btrfs_warn_in_rcu(fs_info, "device %s marked %s, %s, %s",
+		rcu_str_deref(dev->name), why, dev_status, chunk_status);
+	btrfs_info_in_rcu(fs_info,
+		"num_devices %llu rw_devices %llu degraded-option: %s",
+		fs_devices->num_devices, fs_devices->rw_devices,
+		degrade_option ? "set":"unset");
+
+	if (tolerance < 0)
+		btrfs_std_error(fs_info, -EIO, "devices below critical level");
+
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b4308afa3097..60eb098d8c76 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -66,13 +66,26 @@ struct btrfs_device {
 	struct btrfs_pending_bios pending_sync_bios;
 
 	struct block_device *bdev;
+	struct block_device *closing_bdev;
 
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
 	int writeable;
 	int in_fs_metadata;
+	/* missing: device wasn't found at the time of mount */
 	int missing;
+	/* failed: device confirmed to have experienced critical io failure */
+	int failed;
+	/*
+	 * offline: the system, the user or the block layer transport has
+	 * removed or offlined a device that was once present, without going
+	 * through unmount. Implies an interim communication breakdown
+	 * and not necessarily a candidate for device replace. The
+	 * device might be online again after user intervention or after
+	 * block transport layer error recovery.
+	 */
+	int offline;
 	int can_discard;
 	int is_tgtdev_for_dev_replace;
 
@@ -575,5 +588,6 @@ struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
 int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags);
+void btrfs_device_enforce_state(struct btrfs_device *dev, char *why);
 
 #endif
-- 
2.7.0
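
As a reading aid, the early-return checks in btrfs_device_enforce_state()
above can be modelled in userspace roughly as follows. This is a sketch and
not part of the patch: struct dev_view, enum enforce_result and
enforce_state_model() are hypothetical names, and the real function
additionally takes locks, updates sysfs and logs the chunk status.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum enforce_result { ENFORCE_SKIPPED, ENFORCE_FS_ERROR, ENFORCE_CLOSED };

/* Hypothetical, flattened view of the fields the patch consults. */
struct dev_view {
	bool seeding, missing, has_bdev, offline, failed;
	unsigned long long rw_devices;  /* fs-wide count of RW devices */
};

static enum enforce_result enforce_state_model(struct dev_view *d,
					       const char *why)
{
	assert(d && why);

	if (d->seeding)                  /* seed devices are not handled yet */
		return ENFORCE_SKIPPED;
	if (d->missing || !d->has_bdev)  /* already missing */
		return ENFORCE_SKIPPED;
	if (d->offline || d->failed)     /* state already enforced */
		return ENFORCE_SKIPPED;
	if (d->rw_devices == 1)          /* last RW device: report -EIO instead */
		return ENFORCE_FS_ERROR;

	if (!strcmp(why, "offline"))
		d->offline = true;
	else if (!strcmp(why, "failed"))
		d->failed = true;
	else
		return ENFORCE_SKIPPED;  /* unknown reason string */

	d->has_bdev = false;             /* models device_force_close() */
	d->missing = true;
	return ENFORCE_CLOSED;
}
```

The model makes one property easy to see: a second call on the same device
is a no-op, because the device is already marked missing after the first.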



* Re: [PATCH v5 11/13] btrfs: introduce device dynamic state transition to offline or failed
  2016-04-14 10:51   ` [PATCH v5 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
@ 2016-04-14 16:56     ` Yauhen Kharuzhy
  2016-04-18 10:50       ` Anand Jain
  0 siblings, 1 reply; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-14 16:56 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba

On Thu, Apr 14, 2016 at 06:51:58PM +0800, Anand Jain wrote:
> From: Anand Jain <Anand.Jain@oracle.com>
> 
> This patch provides helper functions to force a device to offline
> or failed, and we need these device states for the following reasons,
> 1) a. it can be reported that device has failed when it does
>    b. close the device when it goes offline so that blocklayer can
>       cleanup
> 2) identify the candidate for the auto replace
> 3) avoid further commit error reported against the failing device and
> 4) a device in the multi device btrfs may go offline from the system
>    (but as of now in some system config btrfs gets unmounted in this
>     context, which is not a correct behavior)
> 
> Signed-off-by: Anand Jain <anand.jain@oracle.com>
> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
> ---
> v5:
>   Originally we had a bug as fixed in the patch
>    [PATCH] btrfs: s_bdev is not null after missing replace
>   Incorporate those changes at force close a failed device.
>   To test pls have both this patch and above patch which
>   fixes the original issue, not introduced as part of this
>   patch set.

...

> +void device_force_close(struct btrfs_device *device)
> +{
> +	struct btrfs_device *next_device;
> +	struct btrfs_fs_devices *fs_devices;
> +
> +	fs_devices = device->fs_devices;
> +
> +	mutex_lock(&fs_devices->device_list_mutex);
> +	mutex_lock(&fs_devices->fs_info->chunk_mutex);
> +	spin_lock(&fs_devices->fs_info->free_chunk_lock);
> +
> +	next_device = list_entry(fs_devices->devices.next,
> +					struct btrfs_device, dev_list);
> +	if (fs_devices->fs_info->sb->s_bdev &&
> +		(fs_devices->fs_info->sb->s_bdev == device->bdev))
> +		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
> +
> +	if (device->bdev == fs_devices->latest_bdev)
> +		fs_devices->latest_bdev = next_device->bdev;

latest_bdev can point to invalid bdev here if next_device is the same as
closing device.
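
To make the concern concrete, here is a minimal user-space sketch of a
selection helper that cannot hand back the device being closed. It is only an
illustration: struct blk, struct dev and next_active_device() are hypothetical
stand-ins, not the kernel structures or any helper from the series, and the
device list is modelled as a plain array.

```c
#include <assert.h>
#include <stddef.h>

struct blk { int id; };		/* stand-in for struct block_device */

struct dev {
	struct blk *bdev;	/* NULL when the device is closed or missing */
};

/*
 * Hypothetical helper: return the first device that is not the one
 * being closed and still has an open bdev, or NULL if there is none.
 */
static struct dev *next_active_device(struct dev *devs, size_t n,
				      const struct dev *closing)
{
	assert(devs != NULL);
	for (size_t i = 0; i < n; i++) {
		if (&devs[i] != closing && devs[i].bdev != NULL)
			return &devs[i];
	}
	return NULL;
}
```

With a helper of this shape the caller would update latest_bdev and s_bdev
only when the result is non-NULL, so neither pointer can end up referring to
the bdev that is about to be closed.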


-- 
Yauhen Kharuzhy


* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (14 preceding siblings ...)
  2016-04-13 21:21 ` Yauhen Kharuzhy
@ 2016-04-14 19:12 ` Yauhen Kharuzhy
  2016-04-14 23:09 ` Yauhen Kharuzhy
  16 siblings, 0 replies; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-14 19:12 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs, dsterba

On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
> Thanks for various comments, tests and feedback.

Tested-By: Yauhen Kharuzhy <yauhen.kharuzhy@zavadatar.com>,
for all patches in the series.


-- 
Yauhen Kharuzhy


* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
                   ` (15 preceding siblings ...)
  2016-04-14 19:12 ` Yauhen Kharuzhy
@ 2016-04-14 23:09 ` Yauhen Kharuzhy
  2016-04-18  8:54   ` Anand Jain
  16 siblings, 1 reply; 32+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-14 23:09 UTC (permalink / raw)
  To: Anand Jain; +Cc: linux-btrfs

On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
> Thanks for various comments, tests and feedback.

Hmm... Yet another lockdep warning. It appeared when I removed the target
drive during a replace:

[ 5375.718844] 
[ 5375.718845] ======================================================
[ 5375.718846] [ INFO: possible circular locking dependency detected ]
[ 5375.718849] 4.4.5-scst31x-debug-11+ #40 Not tainted
[ 5375.718849] -------------------------------------------------------
[ 5375.718851] btrfs-health/4662 is trying to acquire lock:
[ 5375.718861]  (sb_writers){.+.+.+}, at: [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.718862] 
[ 5375.718862] but task is already holding lock:
[ 5375.718907]  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.718907] 
[ 5375.718907] which lock already depends on the new lock.
[ 5375.718907] 
[ 5375.718908] 
[ 5375.718908] the existing dependency chain (in reverse order) is:
[ 5375.718911] 
[ 5375.718911] -> #3 (&fs_devs->device_list_mutex){+.+.+.}:
[ 5375.718917]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718921]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.718940]        [<ffffffffa0219bf6>] btrfs_show_devname+0x36/0x210 [btrfs]
[ 5375.718945]        [<ffffffff81267079>] show_vfsmnt+0x49/0x150
[ 5375.718948]        [<ffffffff81240b07>] m_show+0x17/0x20
[ 5375.718951]        [<ffffffff81246868>] seq_read+0x2d8/0x3b0
[ 5375.718955]        [<ffffffff8121df28>] __vfs_read+0x28/0xd0
[ 5375.718959]        [<ffffffff8121e806>] vfs_read+0x86/0x130
[ 5375.718962]        [<ffffffff8121f4c9>] SyS_read+0x49/0xa0
[ 5375.718966]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718968] 
[ 5375.718968] -> #2 (namespace_sem){+++++.}:
[ 5375.718971]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718974]        [<ffffffff81635199>] down_write+0x49/0x80
[ 5375.718977]        [<ffffffff81243593>] lock_mount+0x43/0x1c0
[ 5375.718979]        [<ffffffff81243c13>] do_add_mount+0x23/0xd0
[ 5375.718982]        [<ffffffff81244afb>] do_mount+0x27b/0xe30
[ 5375.718985]        [<ffffffff812459dc>] SyS_mount+0x8c/0xd0
[ 5375.718988]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.718991] 
[ 5375.718991] -> #1 (&sb->s_type->i_mutex_key#5){+.+.+.}:
[ 5375.718994]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.718996]        [<ffffffff81633949>] mutex_lock_nested+0x69/0x3c0
[ 5375.719001]        [<ffffffff8122d608>] path_openat+0x468/0x1360
[ 5375.719004]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719007]        [<ffffffff8121da7b>] do_sys_open+0x12b/0x210
[ 5375.719010]        [<ffffffff8121db7e>] SyS_open+0x1e/0x20
[ 5375.719013]        [<ffffffff81637976>] entry_SYSCALL_64_fastpath+0x16/0x7a
[ 5375.719015] 
[ 5375.719015] -> #0 (sb_writers){.+.+.+}:
[ 5375.719018]        [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719021]        [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719026]        [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719028]        [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719031]        [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719035]        [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719037]        [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719040]        [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719043]        [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719073]        [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719099]        [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719123]        [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719150]        [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719175]        [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719199]        [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719222]        [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719225]        [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719229]        [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719230] 
[ 5375.719230] other info that might help us debug this:
[ 5375.719230] 
[ 5375.719233] Chain exists of:
[ 5375.719233]   sb_writers --> namespace_sem --> &fs_devs->device_list_mutex
[ 5375.719233] 
[ 5375.719234]  Possible unsafe locking scenario:
[ 5375.719234] 
[ 5375.719234]        CPU0                    CPU1
[ 5375.719235]        ----                    ----
[ 5375.719236]   lock(&fs_devs->device_list_mutex);
[ 5375.719238]                                lock(namespace_sem);
[ 5375.719239]                                lock(&fs_devs->device_list_mutex);
[ 5375.719241]   lock(sb_writers);
[ 5375.719241] 
[ 5375.719241]  *** DEADLOCK ***
[ 5375.719241] 
[ 5375.719243] 4 locks held by btrfs-health/4662:
[ 5375.719266]  #0:  (&fs_info->health_mutex){+.+.+.}, at: [<ffffffffa0246303>] health_kthread+0x63/0x490 [btrfs]
[ 5375.719293]  #1:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.+.}, at: [<ffffffffa02c6611>] btrfs_dev_replace_finishing+0x41/0x990 [btrfs]
[ 5375.719319]  #2:  (uuid_mutex){+.+.+.}, at: [<ffffffffa0282620>] btrfs_destroy_dev_replace_tgtdev+0x20/0x150 [btrfs]
[ 5375.719343]  #3:  (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa028263c>] btrfs_destroy_dev_replace_tgtdev+0x3c/0x150 [btrfs]
[ 5375.719343] 
[ 5375.719343] stack backtrace:
[ 5375.719347] CPU: 2 PID: 4662 Comm: btrfs-health Not tainted 4.4.5-scst31x-debug-11+ #40
[ 5375.719348] Hardware name: Supermicro SYS-6018R-WTRT/X10DRW-iT, BIOS 1.0c 01/07/2015
[ 5375.719352]  0000000000000000 ffff880856f73880 ffffffff813529e3 ffffffff826182a0
[ 5375.719354]  ffffffff8260c090 ffff880856f738c0 ffffffff810d667c ffff880856f73930
[ 5375.719357]  ffff880861f32b40 ffff880861f32b68 0000000000000003 0000000000000004
[ 5375.719357] Call Trace:
[ 5375.719363]  [<ffffffff813529e3>] dump_stack+0x85/0xc2
[ 5375.719366]  [<ffffffff810d667c>] print_circular_bug+0x1ec/0x260
[ 5375.719369]  [<ffffffff810d97ca>] __lock_acquire+0x17ba/0x1ae0
[ 5375.719373]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719376]  [<ffffffff810da4be>] lock_acquire+0xce/0x1e0
[ 5375.719378]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719383]  [<ffffffff810d3bef>] percpu_down_read+0x4f/0xa0
[ 5375.719385]  [<ffffffff812214f7>] ? __sb_start_write+0xb7/0xf0
[ 5375.719387]  [<ffffffff812214f7>] __sb_start_write+0xb7/0xf0
[ 5375.719389]  [<ffffffff81242eb4>] mnt_want_write+0x24/0x50
[ 5375.719393]  [<ffffffff8122ded2>] path_openat+0xd32/0x1360
[ 5375.719415]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719418]  [<ffffffff810f606d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[ 5375.719420]  [<ffffffff8122f86e>] do_filp_open+0x7e/0xe0
[ 5375.719423]  [<ffffffff810f615d>] ? rcu_read_lock_sched_held+0x6d/0x80
[ 5375.719426]  [<ffffffff81201a9b>] ? kmem_cache_alloc+0x26b/0x5d0
[ 5375.719430]  [<ffffffff8122e7d4>] ? getname_kernel+0x34/0x120
[ 5375.719433]  [<ffffffff8121d8a4>] file_open_name+0xe4/0x130
[ 5375.719436]  [<ffffffff8121d923>] filp_open+0x33/0x60
[ 5375.719462]  [<ffffffffa02776a6>] update_dev_time+0x16/0x40 [btrfs]
[ 5375.719485]  [<ffffffffa02825be>] btrfs_scratch_superblocks+0x4e/0x90 [btrfs]
[ 5375.719506]  [<ffffffffa0282665>] btrfs_destroy_dev_replace_tgtdev+0x65/0x150 [btrfs]
[ 5375.719530]  [<ffffffffa02c6c80>] btrfs_dev_replace_finishing+0x6b0/0x990 [btrfs]
[ 5375.719554]  [<ffffffffa02c6b23>] ? btrfs_dev_replace_finishing+0x553/0x990 [btrfs]
[ 5375.719576]  [<ffffffffa02c729e>] btrfs_dev_replace_start+0x33e/0x540 [btrfs]
[ 5375.719598]  [<ffffffffa02c7f58>] btrfs_auto_replace_start+0xf8/0x140 [btrfs]
[ 5375.719621]  [<ffffffffa02464e6>] health_kthread+0x246/0x490 [btrfs]
[ 5375.719641]  [<ffffffffa02463d8>] ? health_kthread+0x138/0x490 [btrfs]
[ 5375.719661]  [<ffffffffa02462a0>] ? btrfs_congested_fn+0x180/0x180 [btrfs]
[ 5375.719663]  [<ffffffff810a70df>] kthread+0xef/0x110
[ 5375.719666]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719669]  [<ffffffff81637d2f>] ret_from_fork+0x3f/0x70
[ 5375.719672]  [<ffffffff810a6ff0>] ? kthread_create_on_node+0x200/0x200
[ 5375.719697] ------------[ cut here ]------------


-- 
Yauhen Kharuzhy


* Re: [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace
  2016-04-14 23:09 ` Yauhen Kharuzhy
@ 2016-04-18  8:54   ` Anand Jain
  0 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-18  8:54 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs



On 04/15/2016 07:09 AM, Yauhen Kharuzhy wrote:
> On Tue, Apr 12, 2016 at 10:15:50PM +0800, Anand Jain wrote:
>> Thanks for various comments, tests and feedback.
>
> Hmm... Yet another lockdep warning. It appeared when I removed the target
> drive during a replace:

  Thanks for the report.

  This was not introduced by this patch set; it is in the original
  set. I have sent out

   btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex

  to fix this.
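
  In the trace, the scratch path (btrfs_scratch_superblocks ->
  update_dev_time -> filp_open) takes sb_writers while device_list_mutex
  is still held. Below is a minimal user-space sketch of the idea in the
  patch title, doing the scratch only after the lock is dropped. It is a
  model, not the actual fix: struct model_lock, struct fs_model and the
  *_model() functions are hypothetical stand-ins, and the lock is just a
  flag so that the ordering can be checked.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: a flag is enough to observe who runs under it. */
struct model_lock { bool held; };

static void model_lock_acquire(struct model_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct fs_model {
	struct model_lock device_list_lock; /* models device_list_mutex */
	int num_devices;
	bool scratched;
	bool scratched_under_lock;	/* set if the ordering is wrong */
};

/* Stand-in for the scratch step, which in the trace opens a file. */
static void scratch_model(struct fs_model *fs)
{
	fs->scratched = true;
	if (fs->device_list_lock.held)
		fs->scratched_under_lock = true;
}

/* List bookkeeping under the lock, scratch only after unlocking. */
static void destroy_tgtdev_model(struct fs_model *fs)
{
	model_lock_acquire(&fs->device_list_lock);
	fs->num_devices--;
	model_lock_release(&fs->device_list_lock);

	scratch_model(fs);
}
```

  Because the scratch no longer nests inside the device list lock in this
  model, the device_list_mutex -> sb_writers edge that lockdep reported
  would not be created.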

Thanks, Anand




* Re: [PATCH v5 11/13] btrfs: introduce device dynamic state transition to offline or failed
  2016-04-14 16:56     ` Yauhen Kharuzhy
@ 2016-04-18 10:50       ` Anand Jain
  0 siblings, 0 replies; 32+ messages in thread
From: Anand Jain @ 2016-04-18 10:50 UTC (permalink / raw)
  To: Yauhen Kharuzhy; +Cc: linux-btrfs, dsterba



On 04/15/2016 12:56 AM, Yauhen Kharuzhy wrote:
> On Thu, Apr 14, 2016 at 06:51:58PM +0800, Anand Jain wrote:
>> From: Anand Jain <Anand.Jain@oracle.com>
>>
>> This patch provides helper functions to force a device to offline
>> or failed, and we need these device states for the following reasons,
>> 1) a. it can be reported that device has failed when it does
>>     b. close the device when it goes offline so that blocklayer can
>>        cleanup
>> 2) identify the candidate for the auto replace
>> 3) avoid further commit error reported against the failing device and
>> 4) a device in the multi device btrfs may go offline from the system
>>     (but as of now in some system config btrfs gets unmounted in this
>>      context, which is not a correct behavior)
>>
>> Signed-off-by: Anand Jain <anand.jain@oracle.com>
>> Tested-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
>> ---
>> v5:
>>    Originally we had a bug as fixed in the patch
>>     [PATCH] btrfs: s_bdev is not null after missing replace
>>    Incorporate those changes at force close a failed device.
>>    To test pls have both this patch and above patch which
>>    fixes the original issue, not introduced as part of this
>>    patch set.
>
> ...
>
>> +void device_force_close(struct btrfs_device *device)
>> +{
>> +	struct btrfs_device *next_device;
>> +	struct btrfs_fs_devices *fs_devices;
>> +
>> +	fs_devices = device->fs_devices;
>> +
>> +	mutex_lock(&fs_devices->device_list_mutex);
>> +	mutex_lock(&fs_devices->fs_info->chunk_mutex);
>> +	spin_lock(&fs_devices->fs_info->free_chunk_lock);
>> +
>> +	next_device = list_entry(fs_devices->devices.next,
>> +					struct btrfs_device, dev_list);
>> +	if (fs_devices->fs_info->sb->s_bdev &&
>> +		(fs_devices->fs_info->sb->s_bdev == device->bdev))
>> +		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
>> +
>> +	if (device->bdev == fs_devices->latest_bdev)
>> +		fs_devices->latest_bdev = next_device->bdev;
>
> latest_bdev can point to invalid bdev here if next_device is the same as
> closing device.

  As mentioned, a wrapper helper function is better for this.
  (I thought you would do it, as I didn't see the patch.) I just sent out
    [PATCH] btrfs: cleanup assigning next active device with a check

Thanks, Anand





Thread overview: 32+ messages
2016-04-12 14:15 [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Anand Jain
2016-04-12 14:15 ` [PATCH 01/13] btrfs: Introduce a new function to check if all chunks a OK for degraded mount Anand Jain
2016-04-12 19:21   ` Yauhen Kharuzhy
2016-04-12 14:15 ` [PATCH 02/13] btrfs: Do per-chunk check for mount time check Anand Jain
2016-04-12 14:15 ` [PATCH 03/13] btrfs: Do per-chunk degraded check for remount Anand Jain
2016-04-12 14:15 ` [PATCH 04/13] btrfs: Allow barrier_all_devices to do per-chunk device check Anand Jain
2016-04-12 14:15 ` [PATCH 05/13] btrfs: Cleanup num_tolerated_disk_barrier_failures Anand Jain
2016-04-12 14:15 ` [PATCH 06/13] btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV Anand Jain
2016-04-12 14:15 ` [PATCH 07/13] btrfs: add check not to mount a spare device Anand Jain
2016-04-12 14:15 ` [PATCH 08/13] btrfs: support btrfs dev scan for " Anand Jain
2016-04-12 14:15 ` [PATCH 09/13] btrfs: provide framework to get and put a " Anand Jain
2016-04-12 14:16 ` [PATCH 10/13] btrfs: introduce helper functions to perform hot replace Anand Jain
2016-04-12 14:40   ` kbuild test robot
2016-04-12 14:16 ` [PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
2016-04-14  1:15   ` [PATCH] Btrfs: Set superblock s_bdev field properly at device closing Yauhen Kharuzhy
2016-04-14  6:59     ` Anand Jain
2016-04-14  9:10       ` Yauhen Kharuzhy
2016-04-14  9:48         ` Anand Jain
2016-04-14 10:51   ` [PATCH v5 11/13] btrfs: introduce device dynamic state transition to offline or failed Anand Jain
2016-04-14 16:56     ` Yauhen Kharuzhy
2016-04-18 10:50       ` Anand Jain
2016-04-12 14:16 ` [PATCH 12/13] btrfs: check device for critical errors and mark failed Anand Jain
2016-04-12 14:16 ` [PATCH 13/13] btrfs: check for failed device and hot replace Anand Jain
2016-04-12 20:02 ` [PATCH v4 00/13] Introduce device state 'failed', spare device and auto replace Yauhen Kharuzhy
2016-04-13 22:43   ` Anand Jain
2016-04-13 21:21 ` Yauhen Kharuzhy
2016-04-14  8:45   ` Anand Jain
2016-04-14  9:22     ` Yauhen Kharuzhy
2016-04-14  9:57       ` Anand Jain
2016-04-14 19:12 ` Yauhen Kharuzhy
2016-04-14 23:09 ` Yauhen Kharuzhy
2016-04-18  8:54   ` Anand Jain
