* [PATCH] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time
@ 2018-12-11  5:02 Qu Wenruo
  2018-12-19 20:39 ` Jeff Mahoney
  2018-12-20 16:12 ` Jeff Mahoney
  0 siblings, 2 replies; 4+ messages in thread
From: Qu Wenruo @ 2018-12-11  5:02 UTC (permalink / raw)
  To: linux-btrfs

[BUG]
Since fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting
time out of commit trans"), the kernel may lock up when quota is enabled.

The following trace is triggered by snapshot dropping combined with
write operations in the source subvolume, and it can be reproduced
reliably:

  btrfs-cleaner   D    0  4062      2 0x80000000
  Call Trace:
   schedule+0x32/0x90
   btrfs_tree_read_lock+0x93/0x130 [btrfs]
   find_parent_nodes+0x29b/0x1170 [btrfs]
   btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
   btrfs_find_all_roots+0x57/0x70 [btrfs]
   btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
   btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
   btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
   do_walk_down+0x541/0x5e3 [btrfs]
   walk_down_tree+0xab/0xe7 [btrfs]
   btrfs_drop_snapshot+0x356/0x71a [btrfs]
   btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
   cleaner_kthread+0x12b/0x160 [btrfs]
   kthread+0x112/0x130
   ret_from_fork+0x27/0x50

[CAUSE]
When dropping a snapshot with qgroup enabled, we trigger a backref
walk.

However, a backref walk at that point is dangerous: if one of the
parent nodes is write-locked by another thread, we can deadlock.

For example:

           FS 260     FS 261 (Dropped)
            node A        node B
           /      \      /      \
       node C      node D      node E
      /   \         /  \        /     \
  leaf F|leaf G|leaf H|leaf I|leaf J|leaf K

The lock sequence would be:

      Thread A (cleaner)             |       Thread B (other writer)
-----------------------------------------------------------------------
write_lock(B)                        |
write_lock(D)                        |
^^^ called by walk_down_tree()       |
                                     |       write_lock(A)
                                     |       write_lock(D) << Stall
read_lock(H) << for backref walk     |
read_lock(D) << lock owner is        |
                the same thread A    |
                so read lock is OK   |
read_lock(A) << Stall                |

So thread A holds the write lock on D and needs the read lock on A to
make progress, while thread B holds the write lock on A and needs the
write lock on D.  Neither can proceed: a deadlock.
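
For illustration, the same lock-order inversion can be reproduced in
userspace with POSIX rwlocks (a minimal sketch, not btrfs code; the
analogy is inexact, since btrfs tree locks additionally allow a thread
to read-lock a block it already holds write-locked, which is why
read_lock(D) succeeds for thread A above):

#include <pthread.h>

static pthread_rwlock_t lock_a = PTHREAD_RWLOCK_INITIALIZER;
static pthread_rwlock_t lock_d = PTHREAD_RWLOCK_INITIALIZER;

/* Like thread A: holds D for the tree walk, then wants A. */
static void *cleaner(void *arg)
{
	pthread_rwlock_wrlock(&lock_d);
	pthread_rwlock_rdlock(&lock_a);	/* stalls forever */
	pthread_rwlock_unlock(&lock_a);
	pthread_rwlock_unlock(&lock_d);
	return NULL;
}

/* Like thread B: COWs top-down, so it takes A first, then D. */
static void *writer(void *arg)
{
	pthread_rwlock_wrlock(&lock_a);
	pthread_rwlock_wrlock(&lock_d);	/* stalls forever */
	pthread_rwlock_unlock(&lock_d);
	pthread_rwlock_unlock(&lock_a);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, cleaner, NULL);
	pthread_create(&t2, NULL, writer, NULL);
	pthread_join(t1, NULL);	/* once both threads grab their first
				   lock, this never returns */
	pthread_join(t2, NULL);
	return 0;
}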

This is not limited to the snapshot dropping case: the backref walk,
even though it only happens on commit trees, breaks the normal
top-down locking order and is therefore deadlock prone.

[FIX]
The solution is to never trigger a backref walk while holding any
write lock, which means we can't do the backref walk at delayed ref
insert time.

So this patch mostly reverts commit fb235dc06fac ("btrfs: qgroup:
Move half of the qgroup accounting time out of commit trans").

Reported-by: David Sterba <dsterba@suse.cz>
Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/delayed-ref.c |  6 ------
 fs/btrfs/qgroup.c      | 35 ++---------------------------------
 2 files changed, 2 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..7aaae8436986 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -788,9 +788,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
 	if (ret > 0)
 		kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
 
-	if (qrecord_inserted)
-		btrfs_qgroup_trace_extent_post(fs_info, record);
-
 	return 0;
 }
 
@@ -869,9 +866,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
 	if (ret > 0)
 		kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
 
-
-	if (qrecord_inserted)
-		return btrfs_qgroup_trace_extent_post(fs_info, record);
 	return 0;
 }
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 45868fd76209..7d485b1e0efb 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1553,33 +1553,6 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
 	return 0;
 }
 
-int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info *fs_info,
-				   struct btrfs_qgroup_extent_record *qrecord)
-{
-	struct ulist *old_root;
-	u64 bytenr = qrecord->bytenr;
-	int ret;
-
-	ret = btrfs_find_all_roots(NULL, fs_info, bytenr, 0, &old_root, false);
-	if (ret < 0) {
-		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
-		btrfs_warn(fs_info,
-"error accounting new delayed refs extent (err code: %d), quota inconsistent",
-			ret);
-		return 0;
-	}
-
-	/*
-	 * Here we don't need to get the lock of
-	 * trans->transaction->delayed_refs, since inserted qrecord won't
-	 * be deleted, only qrecord->node may be modified (new qrecord insert)
-	 *
-	 * So modifying qrecord->old_roots is safe here
-	 */
-	qrecord->old_roots = old_root;
-	return 0;
-}
-
 int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 			      u64 num_bytes, gfp_t gfp_flag)
 {
@@ -1607,7 +1580,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
 		kfree(record);
 		return 0;
 	}
-	return btrfs_qgroup_trace_extent_post(fs_info, record);
+	return 0;
 }
 
 int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
@@ -2557,11 +2530,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 		trace_btrfs_qgroup_account_extents(fs_info, record);
 
 		if (!ret) {
-			/*
-			 * Old roots should be searched when inserting qgroup
-			 * extent record
-			 */
-			if (WARN_ON(!record->old_roots)) {
+			if (!record->old_roots) {
 				/* Search commit root to find old_roots */
 				ret = btrfs_find_all_roots(NULL, fs_info,
 						record->bytenr, 0,
-- 
2.19.2



* Re: [PATCH] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time
  2018-12-11  5:02 [PATCH] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time Qu Wenruo
@ 2018-12-19 20:39 ` Jeff Mahoney
  2018-12-20 16:12 ` Jeff Mahoney
  1 sibling, 0 replies; 4+ messages in thread
From: Jeff Mahoney @ 2018-12-19 20:39 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 12/11/18 12:02 AM, Qu Wenruo wrote:
> [BUG]
> Since fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting
> time out of commit trans"), the kernel may lock up when quota is enabled.
> 
> The following trace is triggered by snapshot dropping combined with
> write operations in the source subvolume, and it can be reproduced
> reliably:
> 
>   btrfs-cleaner   D    0  4062      2 0x80000000
>   Call Trace:
>    schedule+0x32/0x90
>    btrfs_tree_read_lock+0x93/0x130 [btrfs]
>    find_parent_nodes+0x29b/0x1170 [btrfs]
>    btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
>    btrfs_find_all_roots+0x57/0x70 [btrfs]
>    btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
>    btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
>    btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
>    do_walk_down+0x541/0x5e3 [btrfs]
>    walk_down_tree+0xab/0xe7 [btrfs]
>    btrfs_drop_snapshot+0x356/0x71a [btrfs]
>    btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
>    cleaner_kthread+0x12b/0x160 [btrfs]
>    kthread+0x112/0x130
>    ret_from_fork+0x27/0x50

FWIW, I can hit this repeatedly with btrfs/078 and qgroups enabled:

[ 2938.002109] fsstress        D    0 14791  14790 0x00000000
[ 2938.002111] Call Trace:
[ 2938.002116]  ? __schedule+0x333/0xb80
[ 2938.002123]  schedule+0x3c/0x90
[ 2938.017789]  btrfs_tree_read_lock+0xa5/0x110 [btrfs]
[ 2938.017794]  ? wait_woken+0x80/0x80
[ 2938.020626]  find_parent_nodes+0x362/0xf10 [btrfs]
[ 2938.020650]  btrfs_find_all_roots_safe+0xab/0x110 [btrfs]
[ 2938.020665]  btrfs_find_all_roots+0x62/0x80 [btrfs]
[ 2938.020679]  btrfs_qgroup_trace_extent_post+0x27/0x60 [btrfs]
[ 2938.020696]  btrfs_add_delayed_data_ref+0x22f/0x3d0 [btrfs]
[ 2938.029540]  ? add_pinned_bytes+0x60/0x60 [btrfs]
[ 2938.030888]  btrfs_inc_extent_ref+0x74/0x100 [btrfs]
[ 2938.030902]  __btrfs_mod_ref+0x13f/0x240 [btrfs]
[ 2938.030919]  update_ref_for_cow+0x223/0x300 [btrfs]
[ 2938.030941]  __btrfs_cow_block+0x209/0x590 [btrfs]
[ 2938.030954]  btrfs_cow_block+0xea/0x250 [btrfs]
[ 2938.030966]  btrfs_search_slot+0x4c2/0x9b0 [btrfs]
[ 2938.030982]  btrfs_lookup_file_extent+0x37/0x40 [btrfs]
[ 2938.030998]  __btrfs_drop_extents+0x155/0xda0 [btrfs]
[ 2938.031008]  ? kmem_cache_alloc+0x1a9/0x2b0
[ 2938.031021]  btrfs_drop_extents+0x76/0xa0 [btrfs]
[ 2938.031039]  btrfs_clone+0xac3/0x1160 [btrfs]
[ 2938.031060]  btrfs_extent_same_range+0x406/0x500 [btrfs]
[ 2938.031076]  btrfs_remap_file_range+0x2b5/0x2d0 [btrfs]
[ 2938.031083]  vfs_dedupe_file_range_one+0x105/0x160
[ 2938.031087]  vfs_dedupe_file_range+0x152/0x1f0
[ 2938.031091]  do_vfs_ioctl+0x23c/0x6a0
[ 2938.031094]  ? __do_sys_newfstat+0x29/0x40
[ 2938.031098]  ksys_ioctl+0x60/0x90
[ 2938.031101]  __x64_sys_ioctl+0x16/0x20
[ 2938.031103]  do_syscall_64+0x57/0x190
[ 2938.031106]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

So it's not just snapshots.  This patch fixes this case as well.

-Jeff


> [...]

-- 
Jeff Mahoney
SUSE Labs




* Re: [PATCH] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time
  2018-12-11  5:02 [PATCH] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time Qu Wenruo
  2018-12-19 20:39 ` Jeff Mahoney
@ 2018-12-20 16:12 ` Jeff Mahoney
  2018-12-20 23:59   ` Qu Wenruo
  1 sibling, 1 reply; 4+ messages in thread
From: Jeff Mahoney @ 2018-12-20 16:12 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 12/11/18 12:02 AM, Qu Wenruo wrote:
> [...]
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..7aaae8436986 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -788,9 +788,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  	if (ret > 0)
>  		kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
>  
> -	if (qrecord_inserted)
> -		btrfs_qgroup_trace_extent_post(fs_info, record);
> -
>  	return 0;
>  }
>  
> @@ -869,9 +866,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>  	if (ret > 0)
>  		kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
>  
> -
> -	if (qrecord_inserted)
> -		return btrfs_qgroup_trace_extent_post(fs_info, record);
>  	return 0;
>  }


This looks good, but it leaves some cruft behind.

These were the only uses of qrecord_inserted, so we don't need it in
either function, and we don't need to pass it to add_delayed_ref_head().
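
Roughly (a hypothetical sketch; add_delayed_ref_head()'s real
signature varies by kernel version, so treat the parameter lists
below as illustrative rather than authoritative):

 static noinline struct btrfs_delayed_ref_head *
 add_delayed_ref_head(struct btrfs_trans_handle *trans,
 		     struct btrfs_delayed_ref_head *head_ref,
 		     struct btrfs_qgroup_extent_record *qrecord,
-		     int action, int *qrecord_inserted_ret,
-		     int *old_ref_mod, int *new_ref_mod)
+		     int action, int *old_ref_mod, int *new_ref_mod)

plus dropping the local "int qrecord_inserted;" variable and the
store through qrecord_inserted_ret, and trimming the corresponding
argument from both callers.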

-Jeff

> [...]

-- 
Jeff Mahoney
SUSE Labs




* Re: [PATCH] btrfs: qgroup: Don't trigger backref walk at delayed ref insert time
  2018-12-20 16:12 ` Jeff Mahoney
@ 2018-12-20 23:59   ` Qu Wenruo
  0 siblings, 0 replies; 4+ messages in thread
From: Qu Wenruo @ 2018-12-20 23:59 UTC (permalink / raw)
  To: Jeff Mahoney, Qu Wenruo, linux-btrfs





On 2018/12/21 12:12 AM, Jeff Mahoney wrote:
> On 12/11/18 12:02 AM, Qu Wenruo wrote:
>> [BUG]
>> Since fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting
>> time out of commit trans"), kernel may lockup with quota enabled.
>>
>> There is one backref trace triggered by snapshot dropping along with
>> write operation in the source subvolume.
>> The example can be stably reproduced.
>>
>>   btrfs-cleaner   D    0  4062      2 0x80000000
>>   Call Trace:
>>    schedule+0x32/0x90
>>    btrfs_tree_read_lock+0x93/0x130 [btrfs]
>>    find_parent_nodes+0x29b/0x1170 [btrfs]
>>    btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
>>    btrfs_find_all_roots+0x57/0x70 [btrfs]
>>    btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
>>    btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
>>    btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
>>    do_walk_down+0x541/0x5e3 [btrfs]
>>    walk_down_tree+0xab/0xe7 [btrfs]
>>    btrfs_drop_snapshot+0x356/0x71a [btrfs]
>>    btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
>>    cleaner_kthread+0x12b/0x160 [btrfs]
>>    kthread+0x112/0x130
>>    ret_from_fork+0x27/0x50
>>
>> [CAUSE]
>> When dropping snapshots with qgroup enabled, we will trigger backref
>> walk.
>>
>> However such backref walk at that timing is pretty dangerous, as if one
>> of the parent nodes get WRITE locked by other thread, we could cause a
>> dead lock.
>>
>> For example:
>>
>>            FS 260     FS 261 (Dropped)
>>             node A        node B
>>            /      \      /      \
>>        node C      node D      node E
>>       /   \         /  \        /     \
>>   leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
>>
>> The lock sequence would be:
>>
>>       Thread A (cleaner)             |       Thread B (other writer)
>> -----------------------------------------------------------------------
>> write_lock(B)                        |
>> write_lock(D)                        |
>> ^^^ called by walk_down_tree()       |
>>                                      |       write_lock(A)
>>                                      |       write_lock(D) << Stall
>> read_lock(H) << for backref walk     |
>> read_lock(D) << lock owner is        |
>>                 the same thread A    |
>>                 so read lock is OK   |
>> read_lock(A) << Stall                |
>>
>> So thread A hold write lock D, and needs read lock A to unlock.
>> While thread B holds write lock A, while needs lock D to unlock.
>>
>> This will cause a dead lock.
>>
>> This is not only limited to snapshot dropping case.
>> As the backref walk, even only happens on commit trees, is breaking the
>> normal top-down locking order, makes it deadlock prone.
>>
>> [FIX]
>> The solution is not to trigger backref walk with any write lock hold.
>> This means we can't do backref walk at delayed ref insert time.
>>
>> So this patch will mostly revert commit fb235dc06fac ("btrfs: qgroup:
>> Move half of the qgroup accounting time out of commit trans").
>>
>> Reported-by: David Sterba <dsterba@suse.cz>
>> Reported-by: Filipe Manana <fdmanana@suse.com>
>> Fixes: fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>  fs/btrfs/delayed-ref.c |  6 ------
>>  fs/btrfs/qgroup.c      | 35 ++---------------------------------
>>  2 files changed, 2 insertions(+), 39 deletions(-)
>>
>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>> index 9301b3ad9217..7aaae8436986 100644
>> --- a/fs/btrfs/delayed-ref.c
>> +++ b/fs/btrfs/delayed-ref.c
>> @@ -788,9 +788,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>>  	if (ret > 0)
>>  		kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
>>  
>> -	if (qrecord_inserted)
>> -		btrfs_qgroup_trace_extent_post(fs_info, record);
>> -
>>  	return 0;
>>  }
>>  
>> @@ -869,9 +866,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>>  	if (ret > 0)
>>  		kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
>>  
>> -
>> -	if (qrecord_inserted)
>> -		return btrfs_qgroup_trace_extent_post(fs_info, record);
>>  	return 0;
>>  }
> 
> 
> This looks good, but it leaves some cruft behind.
> 
> These were the only uses of qrecord_inserted, so we don't need it in
> either function, and we don't need to pass it to add_delayed_ref_head().

This version is a little old; the latest one is titled:
[PATCH v3 2/7] btrfs: qgroup: Don't trigger backref walk at delayed ref
insert time

Or at patchwork:
https://patchwork.kernel.org/patch/10725371/

Thanks,
Qu



