* [PATCH v3 0/5] raid56: scrub related fixes
@ 2017-03-29  1:33 Qu Wenruo
  2017-03-29  1:33 ` [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56 Qu Wenruo
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Qu Wenruo @ 2017-03-29  1:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba

This patchset can be fetched from my github repo:
https://github.com/adam900710/linux.git raid56_fixes

It's based on v4.11-rc2; the last two patches were modified according to
advice from Liu Bo.

The patchset fixes the following bugs:

1) False alert or wrong csum error number when scrubbing RAID5/6
   The bug itself won't cause any damage to the fs; it is purely a race.

   This can be triggered by scrubbing a data stripe with 64K of corruption.
   Normally 16 recovered csum errors are reported (64K / 4K page size), but
   sometimes more than 16 are reported, and in rare cases even an
   unrecoverable error can be reported.

2) Rebuilding a corrupted data stripe corrupts P/Q
   So scrub turns one error into another, not really fixing anything.

   Since kernel scrub doesn't report parity errors, either an offline
   scrub or a manual check is needed to expose such corruption.

3) Use-after-free caused by cancelling dev-replace
   This is quite a deadly bug, since cancelling dev-replace can
   cause a kernel panic.

   Can be triggered by btrfs/069.

v2:
  Use bio_counter to protect rbio against dev-replace cancel, instead of
  the original btrfs_device refcount, which is too restrictive and requires
  disabling the rbio cache. Suggested by Liu Bo.

v3:
  Add a fix for another possible use-after-free when rechecking a recovered
  full stripe.
  Squash two patches that fix the same problem, to make bisecting easier.
  Use a mutex instead of a spinlock to protect the full stripe locks tree;
  this allows allocating memory inside the critical section on demand.
  Encapsulate rb_root and mutex into btrfs_full_stripe_locks_tree.
  Rename scrub_full_stripe_lock to full_stripe_lock inside scrub.c.
  Rename related functions for unified naming.
  Code style changes to follow the existing scrub code style.

Qu Wenruo (5):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: scrub: Don't append on-disk pages for raid56 scrub
  btrfs: Wait flighting bio before freeing target device for raid56
  btrfs: Prevent scrub recheck from racing with dev replace

 fs/btrfs/ctree.h       |  17 ++++
 fs/btrfs/dev-replace.c |   2 +
 fs/btrfs/extent-tree.c |   2 +
 fs/btrfs/raid56.c      |  14 +++
 fs/btrfs/scrub.c       | 248 +++++++++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 275 insertions(+), 8 deletions(-)

-- 
2.12.1





* [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
  2017-03-29  1:33 [PATCH v3 0/5] raid56: scrub related fixes Qu Wenruo
@ 2017-03-29  1:33 ` Qu Wenruo
  2017-03-30  0:34   ` Liu Bo
  2017-03-29  1:33 ` [PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition Qu Wenruo
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2017-03-29  1:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba

Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't protect it properly, it can easily cause a race condition.

Introduce two new functions, lock_full_stripe() and unlock_full_stripe(),
for RAID5/6. They are backed by an rb_tree of per-full-stripe mutexes, so
scrub callers can use them to lock a full stripe and avoid the race.
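
The intended calling pattern, roughly (a sketch only; the actual scrub
caller is wired up in the next patch of this series):

  ret = lock_full_stripe(fs_info, logical);
  if (ret < 0)
          return ret;

  /* recheck/repair the full stripe covering @logical */

  ret = unlock_full_stripe(fs_info, logical);
  if (ret < 0)
          return ret;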

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/ctree.h       |  17 ++++
 fs/btrfs/extent-tree.c |   2 +
 fs/btrfs/scrub.c       | 212 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 231 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..9fe56da21fed 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -538,6 +538,14 @@ struct btrfs_io_ctl {
 	unsigned check_crcs:1;
 };
 
+/*
+ * Tree to record all locked full stripes of a RAID5/6 block group
+ */
+struct btrfs_full_stripe_locks_tree {
+	struct rb_root root;
+	struct mutex lock;
+};
+
 struct btrfs_block_group_cache {
 	struct btrfs_key key;
 	struct btrfs_block_group_item item;
@@ -648,6 +656,9 @@ struct btrfs_block_group_cache {
 	 * Protected by free_space_lock.
 	 */
 	int needs_free_space;
+
+	/* Record locked full stripes for RAID5/6 block group */
+	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
 /* delayed seq elem */
@@ -3647,6 +3658,12 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
 			   struct btrfs_device *dev);
 int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
 			 struct btrfs_scrub_progress *progress);
+static inline void btrfs_init_full_stripe_locks_tree(
+			struct btrfs_full_stripe_locks_tree *locks_root)
+{
+	locks_root->root = RB_ROOT;
+	mutex_init(&locks_root->lock);
+}
 
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..e4d48997d927 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -131,6 +131,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
 	if (atomic_dec_and_test(&cache->count)) {
 		WARN_ON(cache->pinned > 0);
 		WARN_ON(cache->reserved > 0);
+		WARN_ON(!RB_EMPTY_ROOT(&cache->full_stripe_locks_root.root));
 		kfree(cache->free_space_ctl);
 		kfree(cache);
 	}
@@ -9917,6 +9918,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
+	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 
 	return cache;
 }
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb1239f..ab33b9a8aac2 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -240,6 +240,13 @@ struct scrub_warning {
 	struct btrfs_device	*dev;
 };
 
+struct full_stripe_lock {
+	struct rb_node node;
+	u64 logical;
+	u64 refs;
+	struct mutex mutex;
+};
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
@@ -349,6 +356,211 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Insert new full stripe lock into full stripe locks tree
+ *
+ * Return pointer to existing or newly inserted full_stripe_lock structure if
+ * everything works well.
+ * Return ERR_PTR(-ENOMEM) if we failed to allocate memory
+ *
+ * NOTE: caller must hold full_stripe_locks_root->lock before calling this
+ * function
+ */
+static struct full_stripe_lock *insert_full_stripe_lock(
+		struct btrfs_full_stripe_locks_tree *locks_root,
+		u64 fstripe_logical)
+{
+	struct rb_node **p;
+	struct rb_node *parent = NULL;
+	struct full_stripe_lock *entry;
+	struct full_stripe_lock *ret;
+
+	WARN_ON(!mutex_is_locked(&locks_root->lock));
+
+	p = &locks_root->root.rb_node;
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct full_stripe_lock, node);
+		if (fstripe_logical < entry->logical) {
+			p = &(*p)->rb_left;
+		} else if (fstripe_logical > entry->logical) {
+			p = &(*p)->rb_right;
+		} else {
+			entry->refs++;
+			return entry;
+		}
+	}
+
+	/* Insert new lock */
+	ret = kmalloc(sizeof(*ret), GFP_KERNEL);
+	if (!ret)
+		return ERR_PTR(-ENOMEM);
+	ret->logical = fstripe_logical;
+	ret->refs = 1;
+	mutex_init(&ret->mutex);
+
+	rb_link_node(&ret->node, parent, p);
+	rb_insert_color(&ret->node, &locks_root->root);
+	return ret;
+}
+
+/*
+ * Search for a full stripe lock of a block group
+ *
+ * Return pointer to existing full stripe lock if found
+ * Return NULL if not found
+ */
+static struct full_stripe_lock *search_full_stripe_lock(
+		struct btrfs_full_stripe_locks_tree *locks_root,
+		u64 fstripe_logical)
+{
+	struct rb_node *node;
+	struct full_stripe_lock *entry;
+
+	WARN_ON(!mutex_is_locked(&locks_root->lock));
+
+	node = locks_root->root.rb_node;
+	while (node) {
+		entry = rb_entry(node, struct full_stripe_lock, node);
+		if (fstripe_logical < entry->logical)
+			node = node->rb_left;
+		else if (fstripe_logical > entry->logical)
+			node = node->rb_right;
+		else
+			return entry;
+	}
+	return NULL;
+}
+
+/*
+ * Helper to get full stripe logical from a normal bytenr.
+ * Thanks to the chaos of scrub structures, we need to get it all
+ * by ourselves, using btrfs_map_sblock().
+ */
+static int get_full_stripe_logical(struct btrfs_fs_info *fs_info, u64 bytenr,
+				   u64 *bytenr_ret)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 len;
+	int ret;
+
+	/* Just use map_sblock() to get full stripe logical */
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, bytenr,
+			       &len, &bbio, 0, 1);
+	if (ret || !bbio || !bbio->raid_map)
+		goto error;
+	*bytenr_ret = bbio->raid_map[0];
+	btrfs_put_bbio(bbio);
+	return 0;
+error:
+	btrfs_put_bbio(bbio);
+	if (ret)
+		return ret;
+	return -EIO;
+}
+
+/*
+ * To lock a full stripe to avoid concurrency of recovery and read
+ * It's only used for profiles with parities(RAID5/6), for other profiles it
+ * does nothing
+ *
+ * Return 0 if we locked full stripe covering @bytenr, with a mutex hold.
+ * So caller must call unlock_full_stripe() at the same context.
+ *
+ * Return <0 if encounters error.
+ */
+static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+	struct btrfs_block_group_cache *bg_cache;
+	struct btrfs_full_stripe_locks_tree *locks_root;
+	struct full_stripe_lock *existing;
+	u64 fstripe_start;
+	int ret = 0;
+
+	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
+	if (!bg_cache)
+		return -ENOENT;
+
+	/* Profiles not based on parity don't need full stripe lock */
+	if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
+		goto out;
+	locks_root = &bg_cache->full_stripe_locks_root;
+
+	ret = get_full_stripe_logical(fs_info, bytenr, &fstripe_start);
+	if (ret < 0)
+		goto out;
+
+	/* Now insert the full stripe lock */
+	mutex_lock(&locks_root->lock);
+	existing = insert_full_stripe_lock(locks_root, fstripe_start);
+	mutex_unlock(&locks_root->lock);
+	if (IS_ERR(existing)) {
+		ret = PTR_ERR(existing);
+		goto out;
+	}
+	mutex_lock(&existing->mutex);
+out:
+	btrfs_put_block_group(bg_cache);
+	return ret;
+}
+
+/*
+ * Unlock a full stripe.
+ * NOTE: Caller must ensure it's the same context calling corresponding
+ * lock_full_stripe().
+ *
+ * Return 0 if we unlock full stripe without problem.
+ * Return <0 for error
+ */
+static int unlock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+	struct btrfs_block_group_cache *bg_cache;
+	struct btrfs_full_stripe_locks_tree *locks_root;
+	struct full_stripe_lock *fstripe_lock;
+	u64 fstripe_start;
+	bool freeit = false;
+	int ret = 0;
+
+	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
+	if (!bg_cache)
+		return -ENOENT;
+	if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
+		goto out;
+
+	locks_root = &bg_cache->full_stripe_locks_root;
+	ret = get_full_stripe_logical(fs_info, bytenr, &fstripe_start);
+	if (ret < 0)
+		goto out;
+
+	mutex_lock(&locks_root->lock);
+	fstripe_lock = search_full_stripe_lock(locks_root, fstripe_start);
+	/* Unparied unlock_full_stripe() detected */
+	if (WARN_ON(!fstripe_lock)) {
+		ret = -ENOENT;
+		mutex_unlock(&locks_root->lock);
+		goto out;
+	}
+
+	if (fstripe_lock->refs == 0) {
+		WARN_ON(1);
+		btrfs_warn(fs_info, "full stripe lock at %llu refcount underflow",
+			fstripe_lock->logical);
+	} else
+		fstripe_lock->refs--;
+	if (fstripe_lock->refs == 0) {
+		rb_erase(&fstripe_lock->node, &locks_root->root);
+		freeit = true;
+	}
+	mutex_unlock(&locks_root->lock);
+
+	mutex_unlock(&fstripe_lock->mutex);
+	if (freeit)
+		kfree(fstripe_lock);
+out:
+	btrfs_put_block_group(bg_cache);
+	return ret;
+}
+
+/*
  * used for workers that require transaction commits (i.e., for the
  * NOCOW case)
  */
-- 
2.12.1





* [PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition
  2017-03-29  1:33 [PATCH v3 0/5] raid56: scrub related fixes Qu Wenruo
  2017-03-29  1:33 ` [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56 Qu Wenruo
@ 2017-03-29  1:33 ` Qu Wenruo
  2017-03-30  0:51   ` Liu Bo
  2017-03-29  1:33 ` [PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub Qu Wenruo
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2017-03-29  1:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba

When scrubbing a RAID5 which has recoverable data corruption (only one
data stripe is corrupted), sometimes scrub will report more csum errors
than expected, and sometimes even an unrecoverable error will be reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error count should be 16 (64K of corruption / 4K page
size, assuming x86_64). A larger csum error count is reported with roughly
a 1/3 chance, and an unrecoverable error with roughly a 1/10 chance.

The root cause of the problem is that the RAID5/6 recovery code has a race
condition, due to the fact that a full scrub is initiated per device.

For other mirror-based profiles each mirror is independent of the others,
so the race won't cause any big problem.

For example:
        Corrupted       |       Correct          |      Correct        |
|   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
------------------------------------------------------------------------
Read out D1             |Read out D2             |Read full stripe     |
Check csum              |Check csum              |Check parity         |
Csum mismatch           |Csum match, continue    |Parity mismatch      |
handle_errored_block    |                        |handle_errored_block |
 Read out full stripe   |                        | Read out full stripe|
 D1 csum error(err++)   |                        | D1 csum error(err++)|
 Recover D1             |                        | Recover D1          |

So D1's csum error is accounted twice, just because
handle_errored_block() doesn't have enough protection and the race can happen.

In an even worse case, if D1's recovery code is re-writing D1/D2/P while
P's recovery code is just reading out the full stripe, we can end up with
an unrecoverable error.

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery, so no extra csum errors or unrecoverable
errors are reported.

Reported-by: Goffredo Baroncelli <kreijack@libero.it>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/scrub.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ab33b9a8aac2..f92d2512f4f3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1129,6 +1129,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	/*
+	 * For RAID5/6 race can happen for different dev scrub thread.
+	 * For data corruption, Parity and Data thread will both try
+	 * to recovery the data.
+	 * Race can lead to double added csum error, or even unrecoverable
+	 * error.
+	 */
+	ret = lock_full_stripe(fs_info, logical);
+	if (ret < 0)
+		return ret;
+
 	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
 		sblocks_for_recheck = NULL;
 		goto nodatasum_case;
@@ -1463,6 +1474,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 		kfree(sblocks_for_recheck);
 	}
 
+	ret = unlock_full_stripe(fs_info, logical);
+	if (ret < 0)
+		return ret;
 	return 0;
 }
 
-- 
2.12.1





* [PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub
  2017-03-29  1:33 [PATCH v3 0/5] raid56: scrub related fixes Qu Wenruo
  2017-03-29  1:33 ` [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56 Qu Wenruo
  2017-03-29  1:33 ` [PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition Qu Wenruo
@ 2017-03-29  1:33 ` Qu Wenruo
  2017-03-29 18:00   ` Liu Bo
  2017-03-29  1:33 ` [PATCH v3 4/5] btrfs: Wait flighting bio before freeing target device for raid56 Qu Wenruo
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2017-03-29  1:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba

In the following situation, scrub will calculate wrong parity to
overwrite correct one:

RAID5 full stripe:

Before
|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

After scrubbing dev3 only:

|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

The reason is that after the raid56 read-rebuild, rbio->stripe_pages are all
correctly recovered (0xcd for data stripes).

However when we check and repair parity, in
scrub_parity_check_and_repair(), we will appending pages in
sparity->spages list to rbio->bio_pages[], which contains old on-disk
data.

And when we submit parity data to disk, we calculate parity using
rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback
to rbio->stripe_pages[].

The patch fix it by not appending pages from sparity->spages.
So finish_parity_scrub() will use rbio->stripe_pages[] which is correct.
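
Conceptually, the parity recomputation picks each data page like this (a
simplified sketch of the selection logic, not the exact raid56.c code):

  /* prefer the page attached through the bio, else the rebuilt copy */
  page = rbio->bio_pages[index];
  if (!page)
          page = rbio->stripe_pages[index];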

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/scrub.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f92d2512f4f3..2fd259dbf4db 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2991,7 +2991,6 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct bio *bio;
 	struct btrfs_raid_bio *rbio;
-	struct scrub_page *spage;
 	struct btrfs_bio *bbio = NULL;
 	u64 length;
 	int ret;
@@ -3021,9 +3020,6 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 	if (!rbio)
 		goto rbio_out;
 
-	list_for_each_entry(spage, &sparity->spages, list)
-		raid56_add_scrub_pages(rbio, spage->page, spage->logical);
-
 	scrub_pending_bio_inc(sctx);
 	raid56_parity_submit_scrub_rbio(rbio);
 	return;
-- 
2.12.1





* [PATCH v3 4/5] btrfs: Wait flighting bio before freeing target device for raid56
  2017-03-29  1:33 [PATCH v3 0/5] raid56: scrub related fixes Qu Wenruo
                   ` (2 preceding siblings ...)
  2017-03-29  1:33 ` [PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub Qu Wenruo
@ 2017-03-29  1:33 ` Qu Wenruo
  2017-03-29 18:05   ` Liu Bo
  2017-03-29  1:33 ` [PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace Qu Wenruo
  2017-03-30 16:52 ` [PATCH v3 0/5] raid56: scrub related fixes David Sterba
  5 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2017-03-29  1:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Liu Bo

When raid56 dev replace is cancelled by running scrub, we will free the target
device without waiting for in-flight bios, causing the following NULL
pointer dereference or general protection fault.

 BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
 IP: generic_make_request_checks+0x4d/0x610
 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O    4.11.0-rc2 #72
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
 task: ffff88002875b4c0 task.stack: ffffc90001334000
 RIP: 0010:generic_make_request_checks+0x4d/0x610
 Call Trace:
  ? generic_make_request+0xc7/0x360
  generic_make_request+0x24/0x360
  ? generic_make_request+0xc7/0x360
  submit_bio+0x64/0x120
  ? page_in_rbio+0x4d/0x80 [btrfs]
  ? rbio_orig_end_io+0x80/0x80 [btrfs]
  finish_rmw+0x3f4/0x540 [btrfs]
  validate_rbio_for_rmw+0x36/0x40 [btrfs]
  raid_rmw_end_io+0x7a/0x90 [btrfs]
  bio_endio+0x56/0x60
  end_workqueue_fn+0x3c/0x40 [btrfs]
  btrfs_scrubparity_helper+0xef/0x620 [btrfs]
  btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
  process_one_work+0x2af/0x720
  ? process_one_work+0x22b/0x720
  worker_thread+0x4b/0x4f0
  kthread+0x10f/0x150
  ? process_one_work+0x720/0x720
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x2e/0x40
 RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8

In btrfs_dev_replace_finishing(), we will call
btrfs_rm_dev_replace_blocked() to wait bios before destroying the target
device when scrub is finished normally.

However when scrub is aborted, either due to error or canceled by scrub,
we didn't wait bios, this can leads to use-after-free if there are bios
holding the target device.

Furthermore, for raid56 scrub, at least 2 places are calling
btrfs_map_sblock() without protection of bio_counter, leading to the
problem.

This patch fixes the problen by
1) Wait bio_counter before freeing target device when canceling replace
2) When calling btrfs_map_sblock() for raid56, use bio_counter to
   protect the call.
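
Condensed, the pattern applied to the raid56 btrfs_map_sblock() call sites
in the diff below is:

  btrfs_bio_counter_inc_blocked(fs_info);
  ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
                         &length, &bbio, 0, 1);
  if (ret || !bbio || !bbio->raid_map)
          goto bbio_out;  /* the bbio_out path calls btrfs_bio_counter_dec() */

  /* On success the rbio takes over the reference: generic_bio_cnt is set
   * to 1 so rbio_orig_end_io() drops the counter. */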

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/dev-replace.c |  2 ++
 fs/btrfs/raid56.c      | 14 ++++++++++++++
 fs/btrfs/scrub.c       |  5 +++++
 3 files changed, 21 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e653921f05d9..b9d88136b5a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -546,8 +546,10 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		mutex_unlock(&fs_info->chunk_mutex);
 		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		mutex_unlock(&uuid_mutex);
+		btrfs_rm_dev_replace_blocked(fs_info);
 		if (tgt_device)
 			btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+		btrfs_rm_dev_replace_unblocked(fs_info);
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
 		return scrub_ret;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1571bf26dc07..5c180fea32ab 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2194,6 +2194,8 @@ static void read_rebuild_work(struct btrfs_work *work)
 /*
  * The following code is used to scrub/replace the parity stripe
  *
+ * Caller must have already increased bio_counter for getting @bbio.
+ *
  * Note: We need make sure all the pages that add into the scrub/replace
  * raid bio are correct and not be changed during the scrub/replace. That
  * is those pages just hold metadata or file data with checksum.
@@ -2231,6 +2233,12 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 	ASSERT(rbio->stripe_npages == stripe_nsectors);
 	bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
 
+	/*
+	 * We have already increased bio_counter when getting bbio, record it
+	 * so we can free it at rbio_orig_end_io().
+	 */
+	rbio->generic_bio_cnt = 1;
+
 	return rbio;
 }
 
@@ -2673,6 +2681,12 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
 		return NULL;
 	}
 
+	/*
+	 * When we get bbio, we have already increased bio_counter, record it
+	 * so we can free it at rbio_orig_end_io()
+	 */
+	rbio->generic_bio_cnt = 1;
+
 	return rbio;
 }
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2fd259dbf4db..b8c49074d1b3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2413,6 +2413,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
 	int ret;
 	int i;
 
+	btrfs_bio_counter_inc_blocked(fs_info);
 	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
 			&length, &bbio, 0, 1);
 	if (ret || !bbio || !bbio->raid_map)
@@ -2457,6 +2458,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
 rbio_out:
 	bio_put(bio);
 bbio_out:
+	btrfs_bio_counter_dec(fs_info);
 	btrfs_put_bbio(bbio);
 	spin_lock(&sctx->stat_lock);
 	sctx->stat.malloc_errors++;
@@ -3000,6 +3002,8 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 		goto out;
 
 	length = sparity->logic_end - sparity->logic_start;
+
+	btrfs_bio_counter_inc_blocked(fs_info);
 	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_WRITE, sparity->logic_start,
 			       &length, &bbio, 0, 1);
 	if (ret || !bbio || !bbio->raid_map)
@@ -3027,6 +3031,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
 rbio_out:
 	bio_put(bio);
 bbio_out:
+	btrfs_bio_counter_dec(fs_info);
 	btrfs_put_bbio(bbio);
 	bitmap_or(sparity->ebitmap, sparity->ebitmap, sparity->dbitmap,
 		  sparity->nsectors);
-- 
2.12.1





* [PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace
  2017-03-29  1:33 [PATCH v3 0/5] raid56: scrub related fixes Qu Wenruo
                   ` (3 preceding siblings ...)
  2017-03-29  1:33 ` [PATCH v3 4/5] btrfs: Wait flighting bio before freeing target device for raid56 Qu Wenruo
@ 2017-03-29  1:33 ` Qu Wenruo
  2017-03-29 18:08   ` Liu Bo
  2017-03-30 16:52 ` [PATCH v3 0/5] raid56: scrub related fixes David Sterba
  5 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2017-03-29  1:33 UTC (permalink / raw)
  To: linux-btrfs; +Cc: dsterba, Liu Bo

scrub_setup_recheck_block() calls btrfs_map_sblock() and then access
bbio without protection of bio_counter.

This can leads to use-after-free if racing with dev replace cancel.

Fix it by increasing bio_counter before calling btrfs_map_sblock() and
decrease the bio_counter when corresponding recover is finished.
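
Condensed from the diff below, the bio_counter now follows the
scrub_recover lifetime:

  /* take the counter before mapping the block ... */
  btrfs_bio_counter_inc_blocked(fs_info);
  ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
                         logical, &mapped_length, &bbio, 0, 1);

  /* ... and drop it only when the last scrub_recover reference goes away */
  if (atomic_dec_and_test(&recover->refs)) {
          btrfs_bio_counter_dec(fs_info);
          btrfs_put_bbio(recover->bbio);
          kfree(recover);
  }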

Cc: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
---
 fs/btrfs/scrub.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b8c49074d1b3..84b077c993c0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1072,9 +1072,11 @@ static inline void scrub_get_recover(struct scrub_recover *recover)
 	atomic_inc(&recover->refs);
 }
 
-static inline void scrub_put_recover(struct scrub_recover *recover)
+static inline void scrub_put_recover(struct btrfs_fs_info *fs_info,
+				     struct scrub_recover *recover)
 {
 	if (atomic_dec_and_test(&recover->refs)) {
+		btrfs_bio_counter_dec(fs_info);
 		btrfs_put_bbio(recover->bbio);
 		kfree(recover);
 	}
@@ -1464,7 +1466,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 				sblock->pagev[page_index]->sblock = NULL;
 				recover = sblock->pagev[page_index]->recover;
 				if (recover) {
-					scrub_put_recover(recover);
+					scrub_put_recover(fs_info, recover);
 					sblock->pagev[page_index]->recover =
 									NULL;
 				}
@@ -1556,16 +1558,19 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 		 * with a length of PAGE_SIZE, each returned stripe
 		 * represents one mirror
 		 */
+		btrfs_bio_counter_inc_blocked(fs_info);
 		ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
 				logical, &mapped_length, &bbio, 0, 1);
 		if (ret || !bbio || mapped_length < sublen) {
 			btrfs_put_bbio(bbio);
+			btrfs_bio_counter_dec(fs_info);
 			return -EIO;
 		}
 
 		recover = kzalloc(sizeof(struct scrub_recover), GFP_NOFS);
 		if (!recover) {
 			btrfs_put_bbio(bbio);
+			btrfs_bio_counter_dec(fs_info);
 			return -ENOMEM;
 		}
 
@@ -1591,7 +1596,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 				spin_lock(&sctx->stat_lock);
 				sctx->stat.malloc_errors++;
 				spin_unlock(&sctx->stat_lock);
-				scrub_put_recover(recover);
+				scrub_put_recover(fs_info, recover);
 				return -ENOMEM;
 			}
 			scrub_page_get(page);
@@ -1633,7 +1638,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 			scrub_get_recover(recover);
 			page->recover = recover;
 		}
-		scrub_put_recover(recover);
+		scrub_put_recover(fs_info, recover);
 		length -= sublen;
 		logical += sublen;
 		page_index++;
-- 
2.12.1





* Re: [PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub
  2017-03-29  1:33 ` [PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub Qu Wenruo
@ 2017-03-29 18:00   ` Liu Bo
  0 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2017-03-29 18:00 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba

On Wed, Mar 29, 2017 at 09:33:20AM +0800, Qu Wenruo wrote:
> In the following situation, scrub will calculate wrong parity to
> overwrite correct one:
> 
> RAID5 full stripe:
> 
> Before
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
> 
> After scrubbing dev3 only:
> 
> |     Dev 1      |     Dev  2     |     Dev 3     |
> | Data stripe 1  | Data stripe 2  | Parity Stripe |
> --------------------------------------------------- 0
> | 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
> --------------------------------------------------- 4K
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> ...
> |     0xcdcd     |     0xcdcd     |     0x0000    |
> --------------------------------------------------- 64K
> 
> The reason is that after the raid56 read-rebuild, rbio->stripe_pages are all
> correctly recovered (0xcd for data stripes).
> 
> However when we check and repair parity, in
> scrub_parity_check_and_repair(), we will appending pages in
> sparity->spages list to rbio->bio_pages[], which contains old on-disk
> data.
>

s/appending/append/

> And when we submit parity data to disk, we calculate parity using
> rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback
> to rbio->stripe_pages[].
>

*is not found

> The patch fix it by not appending pages from sparity->spages.
> So finish_parity_scrub() will use rbio->stripe_pages[] which is correct.
>

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo

> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/scrub.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index f92d2512f4f3..2fd259dbf4db 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2991,7 +2991,6 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
>  	struct btrfs_fs_info *fs_info = sctx->fs_info;
>  	struct bio *bio;
>  	struct btrfs_raid_bio *rbio;
> -	struct scrub_page *spage;
>  	struct btrfs_bio *bbio = NULL;
>  	u64 length;
>  	int ret;
> @@ -3021,9 +3020,6 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
>  	if (!rbio)
>  		goto rbio_out;
>  
> -	list_for_each_entry(spage, &sparity->spages, list)
> -		raid56_add_scrub_pages(rbio, spage->page, spage->logical);
> -
>  	scrub_pending_bio_inc(sctx);
>  	raid56_parity_submit_scrub_rbio(rbio);
>  	return;
> -- 
> 2.12.1
> 
> 
> 


* Re: [PATCH v3 4/5] btrfs: Wait flighting bio before freeing target device for raid56
  2017-03-29  1:33 ` [PATCH v3 4/5] btrfs: Wait flighting bio before freeing target device for raid56 Qu Wenruo
@ 2017-03-29 18:05   ` Liu Bo
  0 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2017-03-29 18:05 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba

On Wed, Mar 29, 2017 at 09:33:21AM +0800, Qu Wenruo wrote:
> When raid56 dev replace is cancelled by running scrub, we will free the target
> device without waiting for in-flight bios, causing the following NULL
> pointer dereference or general protection fault.
> 
>  BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
>  IP: generic_make_request_checks+0x4d/0x610
>  CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O    4.11.0-rc2 #72
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
>  Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
>  task: ffff88002875b4c0 task.stack: ffffc90001334000
>  RIP: 0010:generic_make_request_checks+0x4d/0x610
>  Call Trace:
>   ? generic_make_request+0xc7/0x360
>   generic_make_request+0x24/0x360
>   ? generic_make_request+0xc7/0x360
>   submit_bio+0x64/0x120
>   ? page_in_rbio+0x4d/0x80 [btrfs]
>   ? rbio_orig_end_io+0x80/0x80 [btrfs]
>   finish_rmw+0x3f4/0x540 [btrfs]
>   validate_rbio_for_rmw+0x36/0x40 [btrfs]
>   raid_rmw_end_io+0x7a/0x90 [btrfs]
>   bio_endio+0x56/0x60
>   end_workqueue_fn+0x3c/0x40 [btrfs]
>   btrfs_scrubparity_helper+0xef/0x620 [btrfs]
>   btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
>   process_one_work+0x2af/0x720
>   ? process_one_work+0x22b/0x720
>   worker_thread+0x4b/0x4f0
>   kthread+0x10f/0x150
>   ? process_one_work+0x720/0x720
>   ? kthread_create_on_node+0x40/0x40
>   ret_from_fork+0x2e/0x40
>  RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8
> 
> In btrfs_dev_replace_finishing(), we will call
> btrfs_rm_dev_replace_blocked() to wait bios before destroying the target
> device when scrub is finished normally.
> 
> However when scrub is aborted, either due to error or canceled by scrub,

Isn't it "when dev replace is aborted"?

> we didn't wait bios, this can leads to use-after-free if there are bios
> holding the target device.
> 
> Furthermore, for raid56 scrub, at least 2 places are calling
> btrfs_map_sblock() without protection of bio_counter, leading to the
> problem.
> 
> This patch fixes the problen by

typo: *problem"

> 1) Wait bio_counter before freeing target device when canceling replace
> 2) When calling btrfs_map_sblock() for raid56, use bio_counter to
>    protect the call.
>

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo
 
> Cc: Liu Bo <bo.li.liu@oracle.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/dev-replace.c |  2 ++
>  fs/btrfs/raid56.c      | 14 ++++++++++++++
>  fs/btrfs/scrub.c       |  5 +++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index e653921f05d9..b9d88136b5a9 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -546,8 +546,10 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
>  		mutex_unlock(&fs_info->chunk_mutex);
>  		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>  		mutex_unlock(&uuid_mutex);
> +		btrfs_rm_dev_replace_blocked(fs_info);
>  		if (tgt_device)
>  			btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
> +		btrfs_rm_dev_replace_unblocked(fs_info);
>  		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
>  
>  		return scrub_ret;
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index 1571bf26dc07..5c180fea32ab 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -2194,6 +2194,8 @@ static void read_rebuild_work(struct btrfs_work *work)
>  /*
>   * The following code is used to scrub/replace the parity stripe
>   *
> + * Caller must have already increased bio_counter for getting @bbio.
> + *
>   * Note: We need make sure all the pages that add into the scrub/replace
>   * raid bio are correct and not be changed during the scrub/replace. That
>   * is those pages just hold metadata or file data with checksum.
> @@ -2231,6 +2233,12 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
>  	ASSERT(rbio->stripe_npages == stripe_nsectors);
>  	bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
>  
> +	/*
> +	 * We have already increased bio_counter when getting bbio, record it
> +	 * so we can free it at rbio_orig_end_io().
> +	 */
> +	rbio->generic_bio_cnt = 1;
> +
>  	return rbio;
>  }
>  
> @@ -2673,6 +2681,12 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
>  		return NULL;
>  	}
>  
> +	/*
> +	 * When we get bbio, we have already increased bio_counter, record it
> +	 * so we can free it at rbio_orig_end_io()
> +	 */
> +	rbio->generic_bio_cnt = 1;
> +
>  	return rbio;
>  }
>  
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 2fd259dbf4db..b8c49074d1b3 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2413,6 +2413,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
>  	int ret;
>  	int i;
>  
> +	btrfs_bio_counter_inc_blocked(fs_info);
>  	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
>  			&length, &bbio, 0, 1);
>  	if (ret || !bbio || !bbio->raid_map)
> @@ -2457,6 +2458,7 @@ static void scrub_missing_raid56_pages(struct scrub_block *sblock)
>  rbio_out:
>  	bio_put(bio);
>  bbio_out:
> +	btrfs_bio_counter_dec(fs_info);
>  	btrfs_put_bbio(bbio);
>  	spin_lock(&sctx->stat_lock);
>  	sctx->stat.malloc_errors++;
> @@ -3000,6 +3002,8 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
>  		goto out;
>  
>  	length = sparity->logic_end - sparity->logic_start;
> +
> +	btrfs_bio_counter_inc_blocked(fs_info);
>  	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_WRITE, sparity->logic_start,
>  			       &length, &bbio, 0, 1);
>  	if (ret || !bbio || !bbio->raid_map)
> @@ -3027,6 +3031,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
>  rbio_out:
>  	bio_put(bio);
>  bbio_out:
> +	btrfs_bio_counter_dec(fs_info);
>  	btrfs_put_bbio(bbio);
>  	bitmap_or(sparity->ebitmap, sparity->ebitmap, sparity->dbitmap,
>  		  sparity->nsectors);
> -- 
> 2.12.1
> 
> 
> 


* Re: [PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace
  2017-03-29  1:33 ` [PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace Qu Wenruo
@ 2017-03-29 18:08   ` Liu Bo
  0 siblings, 0 replies; 18+ messages in thread
From: Liu Bo @ 2017-03-29 18:08 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba

On Wed, Mar 29, 2017 at 09:33:22AM +0800, Qu Wenruo wrote:
> scrub_setup_recheck_block() calls btrfs_map_sblock() and then access
> bbio without protection of bio_counter.
>

s/access/accesses/

> This can leads to use-after-free if racing with dev replace cancel.
>

s/leads/lead/

> Fix it by increasing bio_counter before calling btrfs_map_sblock() and
> decrease the bio_counter when corresponding recover is finished.
>

*decreasing

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo


> Cc: Liu Bo <bo.li.liu@oracle.com>
> Reported-by: Liu Bo <bo.li.liu@oracle.com>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/scrub.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index b8c49074d1b3..84b077c993c0 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1072,9 +1072,11 @@ static inline void scrub_get_recover(struct scrub_recover *recover)
>  	atomic_inc(&recover->refs);
>  }
>  
> -static inline void scrub_put_recover(struct scrub_recover *recover)
> +static inline void scrub_put_recover(struct btrfs_fs_info *fs_info,
> +				     struct scrub_recover *recover)
>  {
>  	if (atomic_dec_and_test(&recover->refs)) {
> +		btrfs_bio_counter_dec(fs_info);
>  		btrfs_put_bbio(recover->bbio);
>  		kfree(recover);
>  	}
> @@ -1464,7 +1466,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  				sblock->pagev[page_index]->sblock = NULL;
>  				recover = sblock->pagev[page_index]->recover;
>  				if (recover) {
> -					scrub_put_recover(recover);
> +					scrub_put_recover(fs_info, recover);
>  					sblock->pagev[page_index]->recover =
>  									NULL;
>  				}
> @@ -1556,16 +1558,19 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
>  		 * with a length of PAGE_SIZE, each returned stripe
>  		 * represents one mirror
>  		 */
> +		btrfs_bio_counter_inc_blocked(fs_info);
>  		ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
>  				logical, &mapped_length, &bbio, 0, 1);
>  		if (ret || !bbio || mapped_length < sublen) {
>  			btrfs_put_bbio(bbio);
> +			btrfs_bio_counter_dec(fs_info);
>  			return -EIO;
>  		}
>  
>  		recover = kzalloc(sizeof(struct scrub_recover), GFP_NOFS);
>  		if (!recover) {
>  			btrfs_put_bbio(bbio);
> +			btrfs_bio_counter_dec(fs_info);
>  			return -ENOMEM;
>  		}
>  
> @@ -1591,7 +1596,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
>  				spin_lock(&sctx->stat_lock);
>  				sctx->stat.malloc_errors++;
>  				spin_unlock(&sctx->stat_lock);
> -				scrub_put_recover(recover);
> +				scrub_put_recover(fs_info, recover);
>  				return -ENOMEM;
>  			}
>  			scrub_page_get(page);
> @@ -1633,7 +1638,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
>  			scrub_get_recover(recover);
>  			page->recover = recover;
>  		}
> -		scrub_put_recover(recover);
> +		scrub_put_recover(fs_info, recover);
>  		length -= sublen;
>  		logical += sublen;
>  		page_index++;
> -- 
> 2.12.1
> 
> 
> 


* Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
  2017-03-29  1:33 ` [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56 Qu Wenruo
@ 2017-03-30  0:34   ` Liu Bo
  2017-03-30  1:03     ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Liu Bo @ 2017-03-30  0:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba

On Wed, Mar 29, 2017 at 09:33:18AM +0800, Qu Wenruo wrote:
> Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
> whole full stripe.
>
> And if we don't protect it properly, it can easily cause a race condition.
>
> Introduce two new functions, lock_full_stripe() and unlock_full_stripe(),
> for RAID5/6. They are backed by an rb_tree of per-full-stripe mutexes, so
> scrub callers can use them to lock a full stripe and avoid the race.
> 
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/ctree.h       |  17 ++++
>  fs/btrfs/extent-tree.c |   2 +
>  fs/btrfs/scrub.c       | 212 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 231 insertions(+)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 29b7fc28c607..9fe56da21fed 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -538,6 +538,14 @@ struct btrfs_io_ctl {
>  	unsigned check_crcs:1;
>  };
>  
> +/*
> + * Tree to record all locked full stripes of a RAID5/6 block group
> + */
> +struct btrfs_full_stripe_locks_tree {
> +	struct rb_root root;
> +	struct mutex lock;
> +};
> +
>  struct btrfs_block_group_cache {
>  	struct btrfs_key key;
>  	struct btrfs_block_group_item item;
> @@ -648,6 +656,9 @@ struct btrfs_block_group_cache {
>  	 * Protected by free_space_lock.
>  	 */
>  	int needs_free_space;
> +
> +	/* Record locked full stripes for RAID5/6 block group */
> +	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
>  };
>  
>  /* delayed seq elem */
> @@ -3647,6 +3658,12 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
>  			   struct btrfs_device *dev);
>  int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
>  			 struct btrfs_scrub_progress *progress);
> +static inline void btrfs_init_full_stripe_locks_tree(
> +			struct btrfs_full_stripe_locks_tree *locks_root)
> +{
> +	locks_root->root = RB_ROOT;
> +	mutex_init(&locks_root->lock);
> +}
>  
>  /* dev-replace.c */
>  void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index be5477676cc8..e4d48997d927 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -131,6 +131,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
>  	if (atomic_dec_and_test(&cache->count)) {
>  		WARN_ON(cache->pinned > 0);
>  		WARN_ON(cache->reserved > 0);
> +		WARN_ON(!RB_EMPTY_ROOT(&cache->full_stripe_locks_root.root));

It's also necessary to free objects left in the rbtree because we may get here
due to -ENOMEM in unlock_full_stripe.

>  		kfree(cache->free_space_ctl);
>  		kfree(cache);
>  	}
> @@ -9917,6 +9918,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>  	btrfs_init_free_space_ctl(cache);
>  	atomic_set(&cache->trimming, 0);
>  	mutex_init(&cache->free_space_lock);
> +	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
>  
>  	return cache;
>  }
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index b0251eb1239f..ab33b9a8aac2 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -240,6 +240,13 @@ struct scrub_warning {
>  	struct btrfs_device	*dev;
>  };
>  
> +struct full_stripe_lock {
> +	struct rb_node node;
> +	u64 logical;
> +	u64 refs;
> +	struct mutex mutex;
> +};
> +
>  static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
>  static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
>  static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
> @@ -349,6 +356,211 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
>  }
>  
>  /*
> + * Insert new full stripe lock into full stripe locks tree
> + *
> + * Return pointer to existing or newly inserted full_stripe_lock structure if
> + * everything works well.
> + * Return ERR_PTR(-ENOMEM) if we failed to allocate memory
> + *
> + * NOTE: caller must hold full_stripe_locks_root->lock before calling this
> + * function
> + */
> +static struct full_stripe_lock *insert_full_stripe_lock(
> +		struct btrfs_full_stripe_locks_tree *locks_root,
> +		u64 fstripe_logical)
> +{
> +	struct rb_node **p;
> +	struct rb_node *parent = NULL;
> +	struct full_stripe_lock *entry;
> +	struct full_stripe_lock *ret;
> +
> +	WARN_ON(!mutex_is_locked(&locks_root->lock));
> +
> +	p = &locks_root->root.rb_node;
> +	while (*p) {
> +		parent = *p;
> +		entry = rb_entry(parent, struct full_stripe_lock, node);
> +		if (fstripe_logical < entry->logical) {
> +			p = &(*p)->rb_left;
> +		} else if (fstripe_logical > entry->logical) {
> +			p = &(*p)->rb_right;
> +		} else {
> +			entry->refs++;
> +			return entry;
> +		}
> +	}
> +
> +	/* Insert new lock */
> +	ret = kmalloc(sizeof(*ret), GFP_KERNEL);
> +	if (!ret)
> +		return ERR_PTR(-ENOMEM);
> +	ret->logical = fstripe_logical;
> +	ret->refs = 1;
> +	mutex_init(&ret->mutex);
> +
> +	rb_link_node(&ret->node, parent, p);
> +	rb_insert_color(&ret->node, &locks_root->root);
> +	return ret;
> +}
> +
> +/*
> + * Search for a full stripe lock of a block group
> + *
> + * Return pointer to existing full stripe lock if found
> + * Return NULL if not found
> + */
> +static struct full_stripe_lock *search_full_stripe_lock(
> +		struct btrfs_full_stripe_locks_tree *locks_root,
> +		u64 fstripe_logical)
> +{
> +	struct rb_node *node;
> +	struct full_stripe_lock *entry;
> +
> +	WARN_ON(!mutex_is_locked(&locks_root->lock));
> +
> +	node = locks_root->root.rb_node;
> +	while (node) {
> +		entry = rb_entry(node, struct full_stripe_lock, node);
> +		if (fstripe_logical < entry->logical)
> +			node = node->rb_left;
> +		else if (fstripe_logical > entry->logical)
> +			node = node->rb_right;
> +		else
> +			return entry;
> +	}
> +	return NULL;
> +}
> +
> +/*
> + * Helper to get full stripe logical from a normal bytenr.
> + * Thanks to the chaos of scrub structures, we need to get it all
> + * by ourselves, using btrfs_map_sblock().
> + */
> +static int get_full_stripe_logical(struct btrfs_fs_info *fs_info, u64 bytenr,
> +				   u64 *bytenr_ret)
> +{
> +	struct btrfs_bio *bbio = NULL;
> +	u64 len;
> +	int ret;
> +
> +	/* Just use map_sblock() to get full stripe logical */
> +	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, bytenr,
> +			       &len, &bbio, 0, 1);
> +	if (ret || !bbio || !bbio->raid_map)
> +		goto error;
> +	*bytenr_ret = bbio->raid_map[0];

If only the start logical address of a full stripe is needed, we could
get it by aligning
(bytenr - block_group_start) to (nr_data_stripes * stripe_len).
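
Something along these lines, I suppose (just a sketch; the chunk map would
still have to be looked up for nr_data_stripes and stripe_len):

  u64 full_stripe_len = nr_data_stripes(map) * map->stripe_len;
  u64 fstripe_start = bg_cache->key.objectid;

  fstripe_start += div64_u64(bytenr - bg_cache->key.objectid,
                             full_stripe_len) * full_stripe_len;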

> +	btrfs_put_bbio(bbio);
> +	return 0;
> +error:
> +	btrfs_put_bbio(bbio);
> +	if (ret)
> +		return ret;
> +	return -EIO;
> +}
> +
> +/*
> + * To lock a full stripe to avoid concurrency of recovery and read
> + * It's only used for profiles with parities(RAID5/6), for other profiles it
> + * does nothing
> + *
> + * Return 0 if we locked full stripe covering @bytenr, with a mutex hold.
> + * So caller must call unlock_full_stripe() at the same context.
> + *
> + * Return <0 if encounters error.
> + */
> +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
> +{
> +	struct btrfs_block_group_cache *bg_cache;
> +	struct btrfs_full_stripe_locks_tree *locks_root;
> +	struct full_stripe_lock *existing;
> +	u64 fstripe_start;
> +	int ret = 0;
> +
> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
> +	if (!bg_cache)
> +		return -ENOENT;
> +

When starting to scrub a chunk, we've already increased a ref for the block
group, so could you please put an ASSERT to catch it?

> +	/* Profiles not based on parity don't need full stripe lock */
> +	if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
> +		goto out;
> +	locks_root = &bg_cache->full_stripe_locks_root;
> +
> +	ret = get_full_stripe_logical(fs_info, bytenr, &fstripe_start);
> +	if (ret < 0)
> +		goto out;
> +
> +	/* Now insert the full stripe lock */
> +	mutex_lock(&locks_root->lock);
> +	existing = insert_full_stripe_lock(locks_root, fstripe_start);
> +	mutex_unlock(&locks_root->lock);
> +	if (IS_ERR(existing)) {
> +		ret = PTR_ERR(existing);
> +		goto out;
> +	}
> +	mutex_lock(&existing->mutex);
> +out:
> +	btrfs_put_block_group(bg_cache);
> +	return ret;
> +}
> +
> +/*
> + * Unlock a full stripe.
> + * NOTE: Caller must ensure it's the same context calling corresponding
> + * lock_full_stripe().
> + *
> + * Return 0 if we unlock full stripe without problem.
> + * Return <0 for error
> + */
> +static int unlock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
> +{
> +	struct btrfs_block_group_cache *bg_cache;
> +	struct btrfs_full_stripe_locks_tree *locks_root;
> +	struct full_stripe_lock *fstripe_lock;
> +	u64 fstripe_start;
> +	bool freeit = false;
> +	int ret = 0;
> +
> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
> +	if (!bg_cache)
> +		return -ENOENT;

Ditto.

> +	if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
> +		goto out;
> +
> +	locks_root = &bg_cache->full_stripe_locks_root;
> +	ret = get_full_stripe_logical(fs_info, bytenr, &fstripe_start);
> +	if (ret < 0)
> +		goto out;
> +
> +	mutex_lock(&locks_root->lock);
> +	fstripe_lock = search_full_stripe_lock(locks_root, fstripe_start);
> +	/* Unparied unlock_full_stripe() detected */

* unpaired *

> +	if (WARN_ON(!fstripe_lock)) {
> +		ret = -ENOENT;
> +		mutex_unlock(&locks_root->lock);
> +		goto out;
> +	}
> +
> +	if (fstripe_lock->refs == 0) {
> +		WARN_ON(1);
> +		btrfs_warn(fs_info, "full stripe lock at %llu refcount underflow",
> +			fstripe_lock->logical);
> +	} else
> +		fstripe_lock->refs--;

This is error-prone, could you please change it to the following?

} else {
	fstripe_lock->refs--;
}

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>

Thanks,

-liubo

> +	if (fstripe_lock->refs == 0) {
> +		rb_erase(&fstripe_lock->node, &locks_root->root);
> +		freeit = true;
> +	}
> +	mutex_unlock(&locks_root->lock);
> +
> +	mutex_unlock(&fstripe_lock->mutex);
> +	if (freeit)
> +		kfree(fstripe_lock);
> +out:
> +	btrfs_put_block_group(bg_cache);
> +	return ret;
> +}
> +
> +/*
>   * used for workers that require transaction commits (i.e., for the
>   * NOCOW case)
>   */
> -- 
> 2.12.1
> 
> 
> 


* Re: [PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition
  2017-03-29  1:33 ` [PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition Qu Wenruo
@ 2017-03-30  0:51   ` Liu Bo
  2017-03-30  3:24     ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: Liu Bo @ 2017-03-30  0:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba

On Wed, Mar 29, 2017 at 09:33:19AM +0800, Qu Wenruo wrote:
[...]
> 
> Reported-by: Goffredo Baroncelli <kreijack@libero.it>
> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
> ---
>  fs/btrfs/scrub.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ab33b9a8aac2..f92d2512f4f3 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -1129,6 +1129,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  	have_csum = sblock_to_check->pagev[0]->have_csum;
>  	dev = sblock_to_check->pagev[0]->dev;
>  
> +	/*
> +	 * For RAID5/6 race can happen for different dev scrub thread.
> +	 * For data corruption, Parity and Data thread will both try
> +	 * to recovery the data.
> +	 * Race can lead to double added csum error, or even unrecoverable
> +	 * error.
> +	 */
> +	ret = lock_full_stripe(fs_info, logical);
> +	if (ret < 0)
> +		return ret;
> +

Firstly, sctx->stat needs to be set with errors before returning errors.

Secondly, I think the critical section starts right before re-checking the
failed mirror, doesn't it?  If so, we could get the benefit from
sblock_bad->page, since page->recover can tell if this is a raid56 profile, so
that we may save a search for the block group.
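
E.g. something like (illustrative only):

  struct scrub_recover *recover = sblock_bad->pagev[0]->recover;

  if (recover &&
      (recover->bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK))
          ret = lock_full_stripe(fs_info, logical);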

>  	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
>  		sblocks_for_recheck = NULL;
>  		goto nodatasum_case;
> @@ -1463,6 +1474,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>  		kfree(sblocks_for_recheck);
>  	}
>  
> +	ret = unlock_full_stripe(fs_info, logical);
> +	if (ret < 0)
> +		return ret;

Seems that the callers of scrub_handle_errored_block() don't care about @ret.

And could you please set a 'locked' flag after taking the lock successfully?
Otherwise every raid profile has to look up the block group to check the
raid flag.
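
E.g. (a rough sketch, the exact prototype change is up to you):

  bool full_stripe_locked = false;

  ret = lock_full_stripe(fs_info, logical, &full_stripe_locked);
  if (ret < 0)
          return ret;

  /* ... recheck/repair ... */

  if (full_stripe_locked)
          unlock_full_stripe(fs_info, logical);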

Thanks,

-liubo
>  	return 0;
>  }
>  
> -- 
> 2.12.1
> 
> 
> 


* Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
  2017-03-30  0:34   ` Liu Bo
@ 2017-03-30  1:03     ` Qu Wenruo
  2017-03-30 10:31       ` David Sterba
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2017-03-30  1:03 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, dsterba



At 03/30/2017 08:34 AM, Liu Bo wrote:
> On Wed, Mar 29, 2017 at 09:33:18AM +0800, Qu Wenruo wrote:
>> Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
>> whole full stripe.
>>
>> And if we don't protect it properly, it can easily cause a race condition.
>>
>> Introduce two new functions, lock_full_stripe() and unlock_full_stripe(),
>> for RAID5/6. They are backed by an rb_tree of per-full-stripe mutexes, so
>> scrub callers can use them to lock a full stripe and avoid the race.
>>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>  fs/btrfs/ctree.h       |  17 ++++
>>  fs/btrfs/extent-tree.c |   2 +
>>  fs/btrfs/scrub.c       | 212 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 231 insertions(+)
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 29b7fc28c607..9fe56da21fed 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -538,6 +538,14 @@ struct btrfs_io_ctl {
>>  	unsigned check_crcs:1;
>>  };
>>
>> +/*
>> + * Tree to record all locked full stripes of a RAID5/6 block group
>> + */
>> +struct btrfs_full_stripe_locks_tree {
>> +	struct rb_root root;
>> +	struct mutex lock;
>> +};
>> +
>>  struct btrfs_block_group_cache {
>>  	struct btrfs_key key;
>>  	struct btrfs_block_group_item item;
>> @@ -648,6 +656,9 @@ struct btrfs_block_group_cache {
>>  	 * Protected by free_space_lock.
>>  	 */
>>  	int needs_free_space;
>> +
>> +	/* Record locked full stripes for RAID5/6 block group */
>> +	struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
>>  };
>>
>>  /* delayed seq elem */
>> @@ -3647,6 +3658,12 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
>>  			   struct btrfs_device *dev);
>>  int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
>>  			 struct btrfs_scrub_progress *progress);
>> +static inline void btrfs_init_full_stripe_locks_tree(
>> +			struct btrfs_full_stripe_locks_tree *locks_root)
>> +{
>> +	locks_root->root = RB_ROOT;
>> +	mutex_init(&locks_root->lock);
>> +}
>>
>>  /* dev-replace.c */
>>  void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index be5477676cc8..e4d48997d927 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -131,6 +131,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache *cache)
>>  	if (atomic_dec_and_test(&cache->count)) {
>>  		WARN_ON(cache->pinned > 0);
>>  		WARN_ON(cache->reserved > 0);
>> +		WARN_ON(!RB_EMPTY_ROOT(&cache->full_stripe_locks_root.root));
>
> It's also necessary to free objects left in the rbtree because we may get here
> due to -ENOMEM in unlock_full_stripe().

Nope, unlock_full_stripe() won't return -ENOMEM.

So here, if full_stripe_locks_root is not empty, it means that someone
inserted a full stripe lock but never unlocked it.

In that case, they still hold the mutex, and we can't do anything but
warn and continue, since only the lock holder can release the mutex.

>
>>  		kfree(cache->free_space_ctl);
>>  		kfree(cache);
>>  	}
>> @@ -9917,6 +9918,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
>>  	btrfs_init_free_space_ctl(cache);
>>  	atomic_set(&cache->trimming, 0);
>>  	mutex_init(&cache->free_space_lock);
>> +	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
>>
>>  	return cache;
>>  }
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index b0251eb1239f..ab33b9a8aac2 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -240,6 +240,13 @@ struct scrub_warning {
>>  	struct btrfs_device	*dev;
>>  };
>>
>> +struct full_stripe_lock {
>> +	struct rb_node node;
>> +	u64 logical;
>> +	u64 refs;
>> +	struct mutex mutex;
>> +};
>> +
>>  static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
>>  static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
>>  static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
>> @@ -349,6 +356,211 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info *fs_info)
>>  }
>>
>>  /*
>> + * Insert new full stripe lock into full stripe locks tree
>> + *
>> + * Return pointer to existing or newly inserted full_stripe_lock structure if
>> + * everything works well.
>> + * Return ERR_PTR(-ENOMEM) if we failed to allocate memory
>> + *
>> + * NOTE: caller must hold full_stripe_locks_root->lock before calling this
>> + * function
>> + */
>> +static struct full_stripe_lock *insert_full_stripe_lock(
>> +		struct btrfs_full_stripe_locks_tree *locks_root,
>> +		u64 fstripe_logical)
>> +{
>> +	struct rb_node **p;
>> +	struct rb_node *parent = NULL;
>> +	struct full_stripe_lock *entry;
>> +	struct full_stripe_lock *ret;
>> +
>> +	WARN_ON(!mutex_is_locked(&locks_root->lock));
>> +
>> +	p = &locks_root->root.rb_node;
>> +	while (*p) {
>> +		parent = *p;
>> +		entry = rb_entry(parent, struct full_stripe_lock, node);
>> +		if (fstripe_logical < entry->logical) {
>> +			p = &(*p)->rb_left;
>> +		} else if (fstripe_logical > entry->logical) {
>> +			p = &(*p)->rb_right;
>> +		} else {
>> +			entry->refs++;
>> +			return entry;
>> +		}
>> +	}
>> +
>> +	/* Insert new lock */
>> +	ret = kmalloc(sizeof(*ret), GFP_KERNEL);
>> +	if (!ret)
>> +		return ERR_PTR(-ENOMEM);
>> +	ret->logical = fstripe_logical;
>> +	ret->refs = 1;
>> +	mutex_init(&ret->mutex);
>> +
>> +	rb_link_node(&ret->node, parent, p);
>> +	rb_insert_color(&ret->node, &locks_root->root);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Search for a full stripe lock of a block group
>> + *
>> + * Return pointer to existing full stripe lock if found
>> + * Return NULL if not found
>> + */
>> +static struct full_stripe_lock *search_full_stripe_lock(
>> +		struct btrfs_full_stripe_locks_tree *locks_root,
>> +		u64 fstripe_logical)
>> +{
>> +	struct rb_node *node;
>> +	struct full_stripe_lock *entry;
>> +
>> +	WARN_ON(!mutex_is_locked(&locks_root->lock));
>> +
>> +	node = locks_root->root.rb_node;
>> +	while (node) {
>> +		entry = rb_entry(node, struct full_stripe_lock, node);
>> +		if (fstripe_logical < entry->logical)
>> +			node = node->rb_left;
>> +		else if (fstripe_logical > entry->logical)
>> +			node = node->rb_right;
>> +		else
>> +			return entry;
>> +	}
>> +	return NULL;
>> +}
>> +
>> +/*
>> + * Helper to get full stripe logical from a normal bytenr.
>> + * Thanks to the chaos of scrub structures, we need to get it all
>> + * by ourselves, using btrfs_map_sblock().
>> + */
>> +static int get_full_stripe_logical(struct btrfs_fs_info *fs_info, u64 bytenr,
>> +				   u64 *bytenr_ret)
>> +{
>> +	struct btrfs_bio *bbio = NULL;
>> +	u64 len;
>> +	int ret;
>> +
>> +	/* Just use map_sblock() to get full stripe logical */
>> +	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, bytenr,
>> +			       &len, &bbio, 0, 1);
>> +	if (ret || !bbio || !bbio->raid_map)
>> +		goto error;
>> +	*bytenr_ret = bbio->raid_map[0];
>
> If only the start logical address of a full stripe is needed, we could
> get it by aligning
> (bytenr - block_group_start) to (nr_data_stripes * stripe_len).

Good idea, and this avoids the possible error returned from map_sblock().
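
Something like the following should do it (just a sketch; getting the map
for nr_data_stripes() inside lock/unlock_full_stripe() is an assumption
about what's available in that context):
------
	/*
	 * Sketch only: align the offset inside the block group down to
	 * the full stripe length, instead of calling btrfs_map_sblock().
	 */
	u64 full_stripe_len = (u64)nr_data_stripes(map) * BTRFS_STRIPE_LEN;
	u64 offset = bytenr - bg_cache->key.objectid;

	offset = div64_u64(offset, full_stripe_len) * full_stripe_len;
	*bytenr_ret = bg_cache->key.objectid + offset;
------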

>
>> +	btrfs_put_bbio(bbio);
>> +	return 0;
>> +error:
>> +	btrfs_put_bbio(bbio);
>> +	if (ret)
>> +		return ret;
>> +	return -EIO;
>> +}
>> +
>> +/*
>> + * To lock a full stripe to avoid concurrency of recovery and read
>> + * It's only used for profiles with parities(RAID5/6), for other profiles it
>> + * does nothing
>> + *
>> + * Return 0 if we locked full stripe covering @bytenr, with a mutex hold.
>> + * So caller must call unlock_full_stripe() at the same context.
>> + *
>> + * Return <0 if encounters error.
>> + */
>> +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
>> +{
>> +	struct btrfs_block_group_cache *bg_cache;
>> +	struct btrfs_full_stripe_locks_tree *locks_root;
>> +	struct full_stripe_lock *existing;
>> +	u64 fstripe_start;
>> +	int ret = 0;
>> +
>> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
>> +	if (!bg_cache)
>> +		return -ENOENT;
>> +
>
> When starting to scrub a chunk, we've already increased a ref for block group,
> could you please put a ASSERT to catch it?

Personally I prefer WARN_ON() over ASSERT().

ASSERT() always panics the module and forces us to reset the system,
wiping out any possibility of examining the system afterwards.

>
>> +	/* Profiles not based on parity don't need full stripe lock */
>> +	if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
>> +		goto out;
>> +	locks_root = &bg_cache->full_stripe_locks_root;
>> +
>> +	ret = get_full_stripe_logical(fs_info, bytenr, &fstripe_start);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	/* Now insert the full stripe lock */
>> +	mutex_lock(&locks_root->lock);
>> +	existing = insert_full_stripe_lock(locks_root, fstripe_start);
>> +	mutex_unlock(&locks_root->lock);
>> +	if (IS_ERR(existing)) {
>> +		ret = PTR_ERR(existing);
>> +		goto out;
>> +	}
>> +	mutex_lock(&existing->mutex);
>> +out:
>> +	btrfs_put_block_group(bg_cache);
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Unlock a full stripe.
>> + * NOTE: Caller must ensure it's the same context calling corresponding
>> + * lock_full_stripe().
>> + *
>> + * Return 0 if we unlock full stripe without problem.
>> + * Return <0 for error
>> + */
>> +static int unlock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
>> +{
>> +	struct btrfs_block_group_cache *bg_cache;
>> +	struct btrfs_full_stripe_locks_tree *locks_root;
>> +	struct full_stripe_lock *fstripe_lock;
>> +	u64 fstripe_start;
>> +	bool freeit = false;
>> +	int ret = 0;
>> +
>> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
>> +	if (!bg_cache)
>> +		return -ENOENT;
>
> Ditto.
>
>> +	if (!(bg_cache->flags & BTRFS_BLOCK_GROUP_RAID56_MASK))
>> +		goto out;
>> +
>> +	locks_root = &bg_cache->full_stripe_locks_root;
>> +	ret = get_full_stripe_logical(fs_info, bytenr, &fstripe_start);
>> +	if (ret < 0)
>> +		goto out;
>> +
>> +	mutex_lock(&locks_root->lock);
>> +	fstripe_lock = search_full_stripe_lock(locks_root, fstripe_start);
>> +	/* Unparied unlock_full_stripe() detected */
>
> * unpaired *
>
>> +	if (WARN_ON(!fstripe_lock)) {
>> +		ret = -ENOENT;
>> +		mutex_unlock(&locks_root->lock);
>> +		goto out;
>> +	}
>> +
>> +	if (fstripe_lock->refs == 0) {
>> +		WARN_ON(1);
>> +		btrfs_warn(fs_info, "full stripe lock at %llu refcount underflow",
>> +			fstripe_lock->logical);
>> +	} else
>> +		fstripe_lock->refs--;
>
> This is error-prone, could you please change it to the following?
>
> } else {
> 	fstripe_lock->refs--;
> }
>

Right, I'll add the brackets back.

Thanks for the review,
Qu

> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
>
> Thanks,
>
> -liubo
>
>> +	if (fstripe_lock->refs == 0) {
>> +		rb_erase(&fstripe_lock->node, &locks_root->root);
>> +		freeit = true;
>> +	}
>> +	mutex_unlock(&locks_root->lock);
>> +
>> +	mutex_unlock(&fstripe_lock->mutex);
>> +	if (freeit)
>> +		kfree(fstripe_lock);
>> +out:
>> +	btrfs_put_block_group(bg_cache);
>> +	return ret;
>> +}
>> +
>> +/*
>>   * used for workers that require transaction commits (i.e., for the
>>   * NOCOW case)
>>   */
>> --
>> 2.12.1
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition
  2017-03-30  0:51   ` Liu Bo
@ 2017-03-30  3:24     ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2017-03-30  3:24 UTC (permalink / raw)
  To: bo.li.liu; +Cc: linux-btrfs, dsterba



At 03/30/2017 08:51 AM, Liu Bo wrote:
> On Wed, Mar 29, 2017 at 09:33:19AM +0800, Qu Wenruo wrote:
> [...]
>>
>> Reported-by: Goffredo Baroncelli <kreijack@libero.it>
>> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
>> ---
>>  fs/btrfs/scrub.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
>> index ab33b9a8aac2..f92d2512f4f3 100644
>> --- a/fs/btrfs/scrub.c
>> +++ b/fs/btrfs/scrub.c
>> @@ -1129,6 +1129,17 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>  	have_csum = sblock_to_check->pagev[0]->have_csum;
>>  	dev = sblock_to_check->pagev[0]->dev;
>>
>> +	/*
>> +	 * For RAID5/6 race can happen for different dev scrub thread.
>> +	 * For data corruption, Parity and Data thread will both try
>> +	 * to recovery the data.
>> +	 * Race can lead to double added csum error, or even unrecoverable
>> +	 * error.
>> +	 */
>> +	ret = lock_full_stripe(fs_info, logical);
>> +	if (ret < 0)
>> +		return ret;
>> +
>
> Firstly, sctx->stat needs to be set with errors before returning errors.

Right.

I'll update malloc_errors for the ret == -ENOMEM case and
uncorrectable_errors for the ret == -ENOENT case.
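
Roughly like this (a sketch; the exact placement inside
scrub_handle_errored_block() may change in the resend):
------
	ret = lock_full_stripe(fs_info, logical);
	if (ret < 0) {
		spin_lock(&sctx->stat_lock);
		if (ret == -ENOMEM)
			sctx->stat.malloc_errors++;
		else
			sctx->stat.uncorrectable_errors++;
		spin_unlock(&sctx->stat_lock);
		return ret;
	}
------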

>
> Secondly, I think the critical section starts right before re-checking the
> failed mirror, doesn't it?  If so, we could get the benefit from
> sblock_bad->page since the page->recover can tell if this is a raid56 profile so
> that we may save a searching for block group.

It's true we could save a block group search, but that relies on
bbio->flags.

I prefer to keep lock/unlock_full_stripe() as independent as possible,
so such a modification is not favoured.

That said, I'm looking forward to a better scrub structure which would
allow us to get the block group cache more easily.
If we could get bg_cache more easily in that context, then
lock/unlock_full_stripe() would have no need to search for bg_cache
themselves.

But for now, I prefer to trade a little performance for more independent
code.

>
>>  	if (sctx->is_dev_replace && !is_metadata && !have_csum) {
>>  		sblocks_for_recheck = NULL;
>>  		goto nodatasum_case;
>> @@ -1463,6 +1474,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
>>  		kfree(sblocks_for_recheck);
>>  	}
>>
>> +	ret = unlock_full_stripe(fs_info, logical);
>> +	if (ret < 0)
>> +		return ret;
>
> Seems that the callers of scrub_handle_errored_block doesn't care about @ret.
>
> And could you please put a 'locked' flag after taking the lock succesfully?
> Otherwise every raid profile has to check block group for raid flag.

I'm OK with introducing a bool as a new parameter for
lock/unlock_full_stripe() as a shortcut to avoid the bg search at unlock
time.
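
Roughly like this (the parameter names are only placeholders for now):
------
/* Reports back whether a full stripe lock was really taken */
static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr,
			    bool *locked_ret);

/* Only needs to search the locks tree when @locked is true */
static int unlock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr,
			      bool locked);
------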

But the real fix would be a scrub structure cleanup that lets us access
bg_cache without hassle.

Thanks for the review,
Qu

>
> Thanks,
>
> -liubo
>>  	return 0;
>>  }
>>
>> --
>> 2.12.1
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
  2017-03-30  1:03     ` Qu Wenruo
@ 2017-03-30 10:31       ` David Sterba
  2017-03-31  2:03         ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: David Sterba @ 2017-03-30 10:31 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: bo.li.liu, linux-btrfs

On Thu, Mar 30, 2017 at 09:03:21AM +0800, Qu Wenruo wrote:
> >> +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
> >> +{
> >> +	struct btrfs_block_group_cache *bg_cache;
> >> +	struct btrfs_full_stripe_locks_tree *locks_root;
> >> +	struct full_stripe_lock *existing;
> >> +	u64 fstripe_start;
> >> +	int ret = 0;
> >> +
> >> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
> >> +	if (!bg_cache)
> >> +		return -ENOENT;
> >> +
> >
> > When starting to scrub a chunk, we've already increased a ref for block group,
> > could you please put a ASSERT to catch it?
> 
> Personally I prefer WARN_ON() than ASSERT().
> 
> ASSERT() always panic the modules and forces us to reset the system. 
> Wiping out any possibility to check the system.

I think the semantics of WARN_ON and ASSERT are different, so it should
be decided case by case which one to use. Assert is good for 'never
happens' conditions or for catching errors at development time (wrong API
use, an invariant condition that must always match).

Also the asserts are gone if the config option is unset, while WARN_ON
will stay in some form (verbose or not). Both are suitable for catching
problems, but the warning is for less critical errors, so we want to know
when one happens but can still continue.

The above case looks like a candidate for ASSERT, as the refcounts must
be correct; continuing with just the warning could lead to other
unspecified problems.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/5] raid56: scrub related fixes
  2017-03-29  1:33 [PATCH v3 0/5] raid56: scrub related fixes Qu Wenruo
                   ` (4 preceding siblings ...)
  2017-03-29  1:33 ` [PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace Qu Wenruo
@ 2017-03-30 16:52 ` David Sterba
  5 siblings, 0 replies; 18+ messages in thread
From: David Sterba @ 2017-03-30 16:52 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs, dsterba

On Wed, Mar 29, 2017 at 09:33:17AM +0800, Qu Wenruo wrote:
> This patchset can be fetched from my github repo:
> https://github.com/adam900710/linux.git raid56_fixes
> 
> It's based on v4.11-rc2, the last two patches get modified according to
> the advice from Liu Bo.
> 
> The patchset fixes the following bugs:
> 
> 1) False alert or wrong csum error number when scrubbing RAID5/6
>    The bug itself won't cause any damage to fs, just pure race.
> 
>    This can be triggered by running scrub for 64K corrupted data stripe,
>    Normally it will report 16 csum error recovered, but sometimes it
>    will report more than 16 csum error recovered, under rare case, even
>    unrecoverable error an be reported.
> 
> 2) Corrupted data stripe rebuild corrupts P/Q
>    So scrub makes one error into another, not really fixing anything
> 
>    Since kernel scrub doesn't report parity error, so either offline
>    scrub or manual check is needed to expose such error.
> 
> 3) Use-after-free caused by cancelling dev-replace 
>    This is quite a deadly bug, since cancelling dev-replace can
>    cause kernel panic.
> 
>    Can be triggered by btrfs/069.
> 
> v2:
>   Use bio_counter to protect rbio against dev-replace cancel, instead of
>   original btrfs_device refcount, which is too restrict and must disable
>   rbio cache, suggested by Liu Bo.
> 
> v3:
>   Add fix for another possible use-after-free when rechecking recovered
>   full stripe
>   Squashing two patches as they are fixing the same problem, to make
>   bisect easier.
>   Use mutex other than spinlock to protect full stripe locks tree, this
>   allow us to allocate memory inside the critical section on demand.
>   Encapsulate rb_root and mutex into btrfs_full_stripe_locks_tree.
>   Rename scrub_full_stripe_lock to full_stripe_lock inside scrub.c.
>   Rename related function to have unified naming.
>   Code style change to follow the existing scrub code style.
> 
> Qu Wenruo (5):
>   btrfs: scrub: Introduce full stripe lock for RAID56
>   btrfs: scrub: Fix RAID56 recovery race condition
>   btrfs: scrub: Don't append on-disk pages for raid56 scrub
>   btrfs: Wait flighting bio before freeing target device for raid56
>   btrfs: Prevent scrub recheck from racing with dev replace

As Liu Bo reviewed 3-5 and they otherwise look good to me, I'm going to
add them to the 4.12 queue and will fix the typos myself. Please update 1
and 2 and resend.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
  2017-03-30 10:31       ` David Sterba
@ 2017-03-31  2:03         ` Qu Wenruo
  2017-03-31 17:26           ` David Sterba
  0 siblings, 1 reply; 18+ messages in thread
From: Qu Wenruo @ 2017-03-31  2:03 UTC (permalink / raw)
  To: dsterba, bo.li.liu, linux-btrfs



At 03/30/2017 06:31 PM, David Sterba wrote:
> On Thu, Mar 30, 2017 at 09:03:21AM +0800, Qu Wenruo wrote:
>>>> +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
>>>> +{
>>>> +	struct btrfs_block_group_cache *bg_cache;
>>>> +	struct btrfs_full_stripe_locks_tree *locks_root;
>>>> +	struct full_stripe_lock *existing;
>>>> +	u64 fstripe_start;
>>>> +	int ret = 0;
>>>> +
>>>> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
>>>> +	if (!bg_cache)
>>>> +		return -ENOENT;
>>>> +
>>>
>>> When starting to scrub a chunk, we've already increased a ref for block group,
>>> could you please put a ASSERT to catch it?
>>
>> Personally I prefer WARN_ON() than ASSERT().
>>
>> ASSERT() always panic the modules and forces us to reset the system.
>> Wiping out any possibility to check the system.
>
> I think the sematnics of WARN_ON and ASSERT are different, so it should
> be decided case by case which one to use. Assert is good for 'never
> happens' or catch errors at development time (wrong API use, invariant
> condition that must always match).
>
> Also the asserts are gone if the config option is unset, while WARN_ON
> will stay in some form (verbose or not). Both are suitable for catching
> problems, but the warning is for less critical errors so we want to know
> when it happens but still can continue.
>
> The above case looks like a candidate for ASSERT as the refcounts must
> be correct, continuing with the warning could lead to other unspecified
> problems.

I'm OK with using ASSERT() here, but the current ASSERT() in btrfs can
hide real problems if CONFIG_BTRFS_ASSERT is not set.

When CONFIG_BTRFS_ASSERT is not set, ASSERT() just does nothing, and
execution *continues*.

This forces us to build a fallback method.
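
For reference, the current definition in ctree.h is roughly:
------
#ifdef CONFIG_BTRFS_ASSERT
static inline void assfail(char *expr, char *file, int line)
{
	pr_err("assertion failed: %s, file: %s, line: %d\n",
	       expr, file, line);
	BUG();
}

#define ASSERT(expr)	\
	(likely(expr) ? (void)0 : assfail(#expr, __FILE__, __LINE__))
#else
#define ASSERT(expr)	((void)0)
#endif
------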

For the above case, if we simply do "ASSERT(bg_cache);" then, with
CONFIG_BTRFS_ASSERT not set (which is quite common for most
distributions), we will cause a NULL pointer dereference.

So here, we still need to check the bg_cache return value, but just
change "WARN_ON(1);" to "ASSERT(0);" like:
------
	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
	if (!bg_cache) {
		ASSERT(0); /* WARN_ON(1); */
		return -ENOENT;
	}
------

Can we make ASSERT() really catch the problem regardless of the kernel
config?  The current ASSERT() behavior in fact forces us to consider both
situations, which makes it less handy.

Thanks,
Qu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
  2017-03-31  2:03         ` Qu Wenruo
@ 2017-03-31 17:26           ` David Sterba
  2017-04-03  0:43             ` Qu Wenruo
  0 siblings, 1 reply; 18+ messages in thread
From: David Sterba @ 2017-03-31 17:26 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, bo.li.liu, linux-btrfs

On Fri, Mar 31, 2017 at 10:03:28AM +0800, Qu Wenruo wrote:
> 
> 
> At 03/30/2017 06:31 PM, David Sterba wrote:
> > On Thu, Mar 30, 2017 at 09:03:21AM +0800, Qu Wenruo wrote:
> >>>> +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
> >>>> +{
> >>>> +	struct btrfs_block_group_cache *bg_cache;
> >>>> +	struct btrfs_full_stripe_locks_tree *locks_root;
> >>>> +	struct full_stripe_lock *existing;
> >>>> +	u64 fstripe_start;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
> >>>> +	if (!bg_cache)
> >>>> +		return -ENOENT;
> >>>> +
> >>>
> >>> When starting to scrub a chunk, we've already increased a ref for block group,
> >>> could you please put a ASSERT to catch it?
> >>
> >> Personally I prefer WARN_ON() than ASSERT().
> >>
> >> ASSERT() always panic the modules and forces us to reset the system.
> >> Wiping out any possibility to check the system.
> >
> > I think the sematnics of WARN_ON and ASSERT are different, so it should
> > be decided case by case which one to use. Assert is good for 'never
> > happens' or catch errors at development time (wrong API use, invariant
> > condition that must always match).
> >
> > Also the asserts are gone if the config option is unset, while WARN_ON
> > will stay in some form (verbose or not). Both are suitable for catching
> > problems, but the warning is for less critical errors so we want to know
> > when it happens but still can continue.
> >
> > The above case looks like a candidate for ASSERT as the refcounts must
> > be correct, continuing with the warning could lead to other unspecified
> > problems.
> 
> I'm OK to use ASSERT() here, but current ASSERT() in btrfs can hide real 
> problem if CONFIG_BTRFS_ASSERT is not set.
> 
> When CONFIG_BTRFS_ASSERT is not set, ASSERT() just does thing, and 
> *continue* executing.
> 
> This forces us to build a fallback method.
> 
> For above case, if we simply do "ASSERT(bg_cache);" then for 
> BTRFS_CONFIG_ASSERT not set case (which is quite common for most 
> distributions) we will cause NULL pointer deference.
> 
> So here, we still need to do bg_cache return value check, but just 
> change "WARN_ON(1);" to "ASSERT(0);" like:
> ------
> 	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
> 	if (!bg_cache) {
> 		ASSERT(0); /* WARN_ON(1); */
> 		return -ENOENT;
> 	}
> ------
> 
> Can we make ASSERT() really catch problem no matter kernel config?
> Current ASSERT() behavior is in fact forcing us to consider both 
> situation, which makes it less handy.

All agreed, I'm not very happy about how the current ASSERT is
implemented. We want to add more, and potentially expensive, checks during
debugging builds, but we also want to make sure that the code does not
proceed past certain points if the invariants and expected values do not
hold.

BUG_ON does that, but we have tons of them already and some of them are
just temporary error handling, while in other places it serves as the
sanity checker. We'd probably need a 3rd option that behaves like BUG_ON
but is named differently, so we can clearly see that it's intentional, or
we can annotate the BUG_ONs with comments.
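
Just to illustrate the idea (the name is made up on the spot), something
like:
------
/* Always-on sanity check, kept regardless of the debugging config */
#define BTRFS_SANITY_BUG_ON(cond)					\
do {									\
	if (unlikely(cond)) {						\
		pr_crit("BTRFS: sanity check failed: %s, file: %s, line: %d\n", \
			#cond, __FILE__, __LINE__);			\
		BUG();							\
	}								\
} while (0)
------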

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56
  2017-03-31 17:26           ` David Sterba
@ 2017-04-03  0:43             ` Qu Wenruo
  0 siblings, 0 replies; 18+ messages in thread
From: Qu Wenruo @ 2017-04-03  0:43 UTC (permalink / raw)
  To: dsterba, bo.li.liu, linux-btrfs



At 04/01/2017 01:26 AM, David Sterba wrote:
> On Fri, Mar 31, 2017 at 10:03:28AM +0800, Qu Wenruo wrote:
>>
>>
>> At 03/30/2017 06:31 PM, David Sterba wrote:
>>> On Thu, Mar 30, 2017 at 09:03:21AM +0800, Qu Wenruo wrote:
>>>>>> +static int lock_full_stripe(struct btrfs_fs_info *fs_info, u64 bytenr)
>>>>>> +{
>>>>>> +	struct btrfs_block_group_cache *bg_cache;
>>>>>> +	struct btrfs_full_stripe_locks_tree *locks_root;
>>>>>> +	struct full_stripe_lock *existing;
>>>>>> +	u64 fstripe_start;
>>>>>> +	int ret = 0;
>>>>>> +
>>>>>> +	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
>>>>>> +	if (!bg_cache)
>>>>>> +		return -ENOENT;
>>>>>> +
>>>>>
>>>>> When starting to scrub a chunk, we've already increased a ref for block group,
>>>>> could you please put a ASSERT to catch it?
>>>>
>>>> Personally I prefer WARN_ON() than ASSERT().
>>>>
>>>> ASSERT() always panic the modules and forces us to reset the system.
>>>> Wiping out any possibility to check the system.
>>>
>>> I think the sematnics of WARN_ON and ASSERT are different, so it should
>>> be decided case by case which one to use. Assert is good for 'never
>>> happens' or catch errors at development time (wrong API use, invariant
>>> condition that must always match).
>>>
>>> Also the asserts are gone if the config option is unset, while WARN_ON
>>> will stay in some form (verbose or not). Both are suitable for catching
>>> problems, but the warning is for less critical errors so we want to know
>>> when it happens but still can continue.
>>>
>>> The above case looks like a candidate for ASSERT as the refcounts must
>>> be correct, continuing with the warning could lead to other unspecified
>>> problems.
>>
>> I'm OK to use ASSERT() here, but current ASSERT() in btrfs can hide real
>> problem if CONFIG_BTRFS_ASSERT is not set.
>>
>> When CONFIG_BTRFS_ASSERT is not set, ASSERT() just does thing, and
>> *continue* executing.
>>
>> This forces us to build a fallback method.
>>
>> For above case, if we simply do "ASSERT(bg_cache);" then for
>> BTRFS_CONFIG_ASSERT not set case (which is quite common for most
>> distributions) we will cause NULL pointer deference.
>>
>> So here, we still need to do bg_cache return value check, but just
>> change "WARN_ON(1);" to "ASSERT(0);" like:
>> ------
>> 	bg_cache = btrfs_lookup_block_group(fs_info, bytenr);
>> 	if (!bg_cache) {
>> 		ASSERT(0); /* WARN_ON(1); */
>> 		return -ENOENT;
>> 	}
>> ------
>>
>> Can we make ASSERT() really catch problem no matter kernel config?
>> Current ASSERT() behavior is in fact forcing us to consider both
>> situation, which makes it less handy.
>
> All agreed, I'm not very happy about how the current ASSERT is
> implemented. We want to add more and potentially expensive checks during
> debugging builds, but also want to make sure that code does not proceed
> pass some points if the invariants and expected values do not hold.
>
> BUG_ON does that but then we have tons of them already and some of them
> are just a temporary error handling, while at some other places it
> serves as the sanity checker. We'd probably need 3rd option, that would
> behave like BUG_ON but named differently, so we can clearly see that
> it's intentional, or we can annotate the BUG_ON by comments.

Right, until we have a better solution, I'll use the above method and
resend the first 2 patches.

BTW, I'm not quite a fan of a new BUG_ON with a different name.
The current WARN_ON/BUG_ON/ASSERT set is already a little more complex
than we need.

I prefer to remove all the error-handling BUG_ON()s step by step, as we're
already doing, and then leave BUG_ON() as a sanity check for really
impossible cases.

And to modify WARN_ON() to give less scary info (without a backtrace),
depending on the kernel config, for more or less foreseeable and less
critical error cases.
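
Purely as an illustration of that last point (both the name and the
config dependency here are made up):
------
#ifdef CONFIG_BTRFS_DEBUG
#define btrfs_warn_on(cond)	WARN_ON(cond)
#else
#define btrfs_warn_on(cond)						\
({									\
	int __ret_warn_on = !!(cond);					\
	if (unlikely(__ret_warn_on))					\
		pr_warn("BTRFS: condition failed: %s\n", #cond);	\
	unlikely(__ret_warn_on);					\
})
#endif
------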

Thanks,
Qu



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-04-03  0:43 UTC | newest]

Thread overview: 18+ messages
2017-03-29  1:33 [PATCH v3 0/5] raid56: scrub related fixes Qu Wenruo
2017-03-29  1:33 ` [PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56 Qu Wenruo
2017-03-30  0:34   ` Liu Bo
2017-03-30  1:03     ` Qu Wenruo
2017-03-30 10:31       ` David Sterba
2017-03-31  2:03         ` Qu Wenruo
2017-03-31 17:26           ` David Sterba
2017-04-03  0:43             ` Qu Wenruo
2017-03-29  1:33 ` [PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition Qu Wenruo
2017-03-30  0:51   ` Liu Bo
2017-03-30  3:24     ` Qu Wenruo
2017-03-29  1:33 ` [PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub Qu Wenruo
2017-03-29 18:00   ` Liu Bo
2017-03-29  1:33 ` [PATCH v3 4/5] btrfs: Wait flighting bio before freeing target device for raid56 Qu Wenruo
2017-03-29 18:05   ` Liu Bo
2017-03-29  1:33 ` [PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace Qu Wenruo
2017-03-29 18:08   ` Liu Bo
2017-03-30 16:52 ` [PATCH v3 0/5] raid56: scrub related fixes David Sterba
