On 12/11/18 12:02 AM, Qu Wenruo wrote:
> [BUG]
> Since fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting
> time out of commit trans"), the kernel may lock up with quota enabled.
>
> There is one deadlock trace, triggered by snapshot dropping together
> with write operations in the source subvolume.  It can be reproduced
> reliably.
>
> btrfs-cleaner   D    0  4062      2 0x80000000
> Call Trace:
>  schedule+0x32/0x90
>  btrfs_tree_read_lock+0x93/0x130 [btrfs]
>  find_parent_nodes+0x29b/0x1170 [btrfs]
>  btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
>  btrfs_find_all_roots+0x57/0x70 [btrfs]
>  btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
>  btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
>  btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
>  do_walk_down+0x541/0x5e3 [btrfs]
>  walk_down_tree+0xab/0xe7 [btrfs]
>  btrfs_drop_snapshot+0x356/0x71a [btrfs]
>  btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
>  cleaner_kthread+0x12b/0x160 [btrfs]
>  kthread+0x112/0x130
>  ret_from_fork+0x27/0x50
>
> [CAUSE]
> When dropping snapshots with qgroups enabled, we trigger a backref
> walk.
>
> However, a backref walk at that time is dangerous: if one of the
> parent nodes gets write-locked by another thread, we can deadlock.
>
> For example (node D is shared by both trees):
>
>        FS 260                     FS 261 (Dropped)
>         node A                       node B
>        /      \                     /      \
>    node C    node D  <-------------+       node E
>    /    \    /    \                        /    \
> leaf F|leaf G|leaf H|leaf I         leaf J|leaf K
>
> The lock sequence would be:
>
> Thread A (cleaner)                 | Thread B (other writer)
> -----------------------------------------------------------------------
> write_lock(B)                      |
> write_lock(D)                      |
> ^^^ called by walk_down_tree()     |
>                                    | write_lock(A)
>                                    | write_lock(D) << Stall
> read_lock(H) << for backref walk   |
> read_lock(D) << lock owner is the  |
>                 same thread A, so  |
>                 read lock is OK    |
> read_lock(A) << Stall              |
>
> So thread A holds the write lock on D and needs the read lock on A to
> proceed, while thread B holds the write lock on A and needs the write
> lock on D to proceed.  Neither can make progress: a classic ABBA
> deadlock.
>
> This is not limited to the snapshot-dropping case.  Because the
> backref walk breaks the normal top-down locking order, even though it
> only happens on commit trees, it is deadlock prone.
>
> [FIX]
> The solution is to never trigger a backref walk while holding any
> write lock.  This means we can't do the backref walk at delayed ref
> insertion time.
>
> So this patch mostly reverts commit fb235dc06fac ("btrfs: qgroup:
> Move half of the qgroup accounting time out of commit trans").
>
> Reported-by: David Sterba
> Reported-by: Filipe Manana
> Fixes: fb235dc06fac ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
> Signed-off-by: Qu Wenruo
> ---
>  fs/btrfs/delayed-ref.c |  6 ------
>  fs/btrfs/qgroup.c      | 35 ++---------------------------------
>  2 files changed, 2 insertions(+), 39 deletions(-)
>
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..7aaae8436986 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -788,9 +788,6 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
>  	if (ret > 0)
>  		kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
>
> -	if (qrecord_inserted)
> -		btrfs_qgroup_trace_extent_post(fs_info, record);
> -
>  	return 0;
>  }
>
> @@ -869,9 +866,6 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
>  	if (ret > 0)
>  		kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
>
> -
> -	if (qrecord_inserted)
> -		return btrfs_qgroup_trace_extent_post(fs_info, record);
>  	return 0;
>  }

This looks good, but it leaves some cruft behind.  These were the only
uses of qrecord_inserted, so we don't need it in either function, and
we don't need to pass it to add_delayed_ref_head().

-Jeff
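For context, the extra cleanup Jeff is describing would look roughly
like the sketch below.  It assumes the 4.20-era code, where both
btrfs_add_delayed_tree_ref() and btrfs_add_delayed_data_ref() declare a
local int qrecord_inserted that add_delayed_ref_head() fills in through
an int *qrecord_inserted_ret out-parameter; treat it as an illustration
of the idea, not a tested patch:

 static noinline struct btrfs_delayed_ref_head *
 add_delayed_ref_head(struct btrfs_trans_handle *trans,
 		     struct btrfs_delayed_ref_head *head_ref,
 		     struct btrfs_qgroup_extent_record *qrecord,
-		     int action, int *qrecord_inserted_ret,
-		     int *old_ref_mod, int *new_ref_mod)
+		     int action, int *old_ref_mod, int *new_ref_mod)
 {
 	...
-	if (qrecord_inserted_ret)
-		*qrecord_inserted_ret = qrecord_inserted;
 	...
 }

With the out-parameter gone, the local qrecord_inserted variables in
both callers go away too, and their add_delayed_ref_head() call sites
simply drop the &qrecord_inserted argument.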
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 45868fd76209..7d485b1e0efb 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -1553,33 +1553,6 @@ int btrfs_qgroup_trace_extent_nolock(struct btrfs_fs_info *fs_info,
>  	return 0;
>  }
>
> -int btrfs_qgroup_trace_extent_post(struct btrfs_fs_info *fs_info,
> -				   struct btrfs_qgroup_extent_record *qrecord)
> -{
> -	struct ulist *old_root;
> -	u64 bytenr = qrecord->bytenr;
> -	int ret;
> -
> -	ret = btrfs_find_all_roots(NULL, fs_info, bytenr, 0, &old_root, false);
> -	if (ret < 0) {
> -		fs_info->qgroup_flags |= BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
> -		btrfs_warn(fs_info,
> -"error accounting new delayed refs extent (err code: %d), quota inconsistent",
> -			ret);
> -		return 0;
> -	}
> -
> -	/*
> -	 * Here we don't need to get the lock of
> -	 * trans->transaction->delayed_refs, since inserted qrecord won't
> -	 * be deleted, only qrecord->node may be modified (new qrecord insert)
> -	 *
> -	 * So modifying qrecord->old_roots is safe here
> -	 */
> -	qrecord->old_roots = old_root;
> -	return 0;
> -}
> -
>  int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>  			      u64 num_bytes, gfp_t gfp_flag)
>  {
> @@ -1607,7 +1580,7 @@ int btrfs_qgroup_trace_extent(struct btrfs_trans_handle *trans, u64 bytenr,
>  		kfree(record);
>  		return 0;
>  	}
> -	return btrfs_qgroup_trace_extent_post(fs_info, record);
> +	return 0;
>  }
>
>  int btrfs_qgroup_trace_leaf_items(struct btrfs_trans_handle *trans,
> @@ -2557,11 +2530,7 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
>  		trace_btrfs_qgroup_account_extents(fs_info, record);
>
>  		if (!ret) {
> -			/*
> -			 * Old roots should be searched when inserting qgroup
> -			 * extent record
> -			 */
> -			if (WARN_ON(!record->old_roots)) {
> +			if (!record->old_roots) {
>  				/* Search commit root to find old_roots */
>  				ret = btrfs_find_all_roots(NULL, fs_info,
>  							   record->bytenr, 0,
>

-- 
Jeff Mahoney
SUSE Labs
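As an aside, the ABBA inversion from the [CAUSE] section above can be
modeled in isolation with the minimal userspace sketch below, using
plain POSIX rwlocks in place of btrfs tree locks (the names and the
usleep() windows are purely illustrative).  One simplification: btrfs
tree locks let a thread re-take a read lock on a node it already
write-locked (how thread A gets read_lock(D) above), which POSIX
rwlocks do not, so only the A/D inversion that actually deadlocks is
modeled.  Build with gcc -pthread; the program hangs with the cleaner
blocked on A and the writer blocked on D, mirroring the lock-sequence
table.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Two rwlocks standing in for tree nodes A and D from the diagram. */
static pthread_rwlock_t lock_a = PTHREAD_RWLOCK_INITIALIZER;
static pthread_rwlock_t lock_d = PTHREAD_RWLOCK_INITIALIZER;

/* Thread A (cleaner): walk_down_tree() write-locks D, then the backref
 * walk tries to read-lock A, against the top-down order. */
static void *cleaner(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&lock_d);
	usleep(100 * 1000);		/* let the writer grab A first */
	pthread_rwlock_rdlock(&lock_a);	/* << stalls: writer holds A */
	pthread_rwlock_unlock(&lock_a);
	pthread_rwlock_unlock(&lock_d);
	return NULL;
}

/* Thread B (other writer): takes the normal top-down order A, then D. */
static void *writer(void *arg)
{
	(void)arg;
	pthread_rwlock_wrlock(&lock_a);
	usleep(100 * 1000);		/* let the cleaner grab D first */
	pthread_rwlock_wrlock(&lock_d);	/* << stalls: cleaner holds D */
	pthread_rwlock_unlock(&lock_d);
	pthread_rwlock_unlock(&lock_a);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, cleaner, NULL);
	pthread_create(&t2, NULL, writer, NULL);
	pthread_join(t1, NULL);		/* never returns: ABBA deadlock */
	pthread_join(t2, NULL);
	printf("no deadlock this run\n");
	return 0;
}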