All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings
@ 2021-03-11 14:31 fdmanana
  2021-03-11 14:31 ` [PATCH 1/9] btrfs: fix race when cloning extent buffer during rewind of an old root fdmanana
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

This patchset fixes a couple bugs, in the two first patches, with the tree
mod log code. The remaining patches just move all that code into a separate
file, since it's quite large and ctree.c is huge as well, and do some small
refactorings and cleanups.

One of the bugs in particular, has been hit frequently by Zygo, hitting a
BUG_ON().

Filipe Manana (9):
  btrfs: fix race when cloning extent buffer during rewind of an old
    root
  btrfs: always pin deleted leaves when there are active tree mod log
    users
  btrfs: move the tree mod log code into its own file
  btrfs: use booleans where appropriate for the tree mod log functions
  btrfs: use a bit to track the existence of tree mod log users
  btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at
    btrfs_free_tree_block()
  btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb()
  btrfs: add and use helper to get lowest sequence number for the tree
    mod log
  btrfs: update debug message when checking seq number of a delayed ref

 fs/btrfs/Makefile       |   2 +-
 fs/btrfs/backref.c      |  33 +-
 fs/btrfs/ctree.c        | 954 ++--------------------------------------
 fs/btrfs/ctree.h        |  20 +-
 fs/btrfs/delayed-ref.c  |  31 +-
 fs/btrfs/extent-tree.c  |  21 +-
 fs/btrfs/qgroup.c       |   9 +-
 fs/btrfs/tree-mod-log.c | 912 ++++++++++++++++++++++++++++++++++++++
 fs/btrfs/tree-mod-log.h |  53 +++
 9 files changed, 1056 insertions(+), 979 deletions(-)
 create mode 100644 fs/btrfs/tree-mod-log.c
 create mode 100644 fs/btrfs/tree-mod-log.h

-- 
2.28.0


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/9] btrfs: fix race when cloning extent buffer during rewind of an old root
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-11 14:31 ` [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users fdmanana
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

While resolving backreferences, as part of a logical ino ioctl call or
fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
operations of a root, triggering a stack trace like the following:

  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/ctree.c:1210!
  invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G        W         5.11.0-2d11c0084b02-misc-next+ #89
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
  Code: 05 48 8d 74 10 (...)
  RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
  RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
  RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
  RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
  R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
  R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
  FS:  00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
  Call Trace:
   btrfs_search_old_slot+0x265/0x10d0
   ? lock_acquired+0xbb/0x600
   ? btrfs_search_slot+0x1090/0x1090
   ? free_extent_buffer.part.61+0xd7/0x140
   ? free_extent_buffer+0x13/0x20
   resolve_indirect_refs+0x3e9/0xfc0
   ? lock_downgrade+0x3d0/0x3d0
   ? __kasan_check_read+0x11/0x20
   ? add_prelim_ref.part.11+0x150/0x150
   ? lock_downgrade+0x3d0/0x3d0
   ? __kasan_check_read+0x11/0x20
   ? lock_acquired+0xbb/0x600
   ? __kasan_check_write+0x14/0x20
   ? do_raw_spin_unlock+0xa8/0x140
   ? rb_insert_color+0x30/0x360
   ? prelim_ref_insert+0x12d/0x430
   find_parent_nodes+0x5c3/0x1830
   ? resolve_indirect_refs+0xfc0/0xfc0
   ? lock_release+0xc8/0x620
   ? fs_reclaim_acquire+0x67/0xf0
   ? lock_acquire+0xc7/0x510
   ? lock_downgrade+0x3d0/0x3d0
   ? lockdep_hardirqs_on_prepare+0x160/0x210
   ? lock_release+0xc8/0x620
   ? fs_reclaim_acquire+0x67/0xf0
   ? lock_acquire+0xc7/0x510
   ? poison_range+0x38/0x40
   ? unpoison_range+0x14/0x40
   ? trace_hardirqs_on+0x55/0x120
   btrfs_find_all_roots_safe+0x142/0x1e0
   ? find_parent_nodes+0x1830/0x1830
   ? btrfs_inode_flags_to_xflags+0x50/0x50
   iterate_extent_inodes+0x20e/0x580
   ? tree_backref_for_extent+0x230/0x230
   ? lock_downgrade+0x3d0/0x3d0
   ? read_extent_buffer+0xdd/0x110
   ? lock_downgrade+0x3d0/0x3d0
   ? __kasan_check_read+0x11/0x20
   ? lock_acquired+0xbb/0x600
   ? __kasan_check_write+0x14/0x20
   ? _raw_spin_unlock+0x22/0x30
   ? __kasan_check_write+0x14/0x20
   iterate_inodes_from_logical+0x129/0x170
   ? iterate_inodes_from_logical+0x129/0x170
   ? btrfs_inode_flags_to_xflags+0x50/0x50
   ? iterate_extent_inodes+0x580/0x580
   ? __vmalloc_node+0x92/0xb0
   ? init_data_container+0x34/0xb0
   ? init_data_container+0x34/0xb0
   ? kvmalloc_node+0x60/0x80
   btrfs_ioctl_logical_to_ino+0x158/0x230
   btrfs_ioctl+0x205e/0x4040
   ? __might_sleep+0x71/0xe0
   ? btrfs_ioctl_get_supported_features+0x30/0x30
   ? getrusage+0x4b6/0x9c0
   ? __kasan_check_read+0x11/0x20
   ? lock_release+0xc8/0x620
   ? __might_fault+0x64/0xd0
   ? lock_acquire+0xc7/0x510
   ? lock_downgrade+0x3d0/0x3d0
   ? lockdep_hardirqs_on_prepare+0x210/0x210
   ? lockdep_hardirqs_on_prepare+0x210/0x210
   ? __kasan_check_read+0x11/0x20
   ? do_vfs_ioctl+0xfc/0x9d0
   ? ioctl_file_clone+0xe0/0xe0
   ? lock_downgrade+0x3d0/0x3d0
   ? lockdep_hardirqs_on_prepare+0x210/0x210
   ? __kasan_check_read+0x11/0x20
   ? lock_release+0xc8/0x620
   ? __task_pid_nr_ns+0xd3/0x250
   ? lock_acquire+0xc7/0x510
   ? __fget_files+0x160/0x230
   ? __fget_light+0xf2/0x110
   __x64_sys_ioctl+0xc3/0x100
   do_syscall_64+0x37/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xa9
  RIP: 0033:0x7fd1976e2427
  Code: 00 00 90 48 8b 05 (...)
  RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
  RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
  RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
  R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
  R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
  Modules linked in:
  ---[ end trace ec8931a1c36e57be ]---

  (gdb) l *(__tree_mod_log_rewind+0x3b1)
  0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
  1205                     * the modification. as we're going backwards, we do the
  1206                     * opposite of each operation here.
  1207                     */
  1208                    switch (tm->op) {
  1209                    case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
  1210                            BUG_ON(tm->slot < n);
  1211                            fallthrough;
  1212                    case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
  1213                    case MOD_LOG_KEY_REMOVE:
  1214                            btrfs_set_node_key(eb, &tm->key, tm->slot);

Here's what happens to hit that BUG_ON():

1) We have one tree mod log user (through fiemap or the logical ino ioctl),
   with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;

2) Another task is at ctree.c:balance_level() and we have eb X currently as
   the root of the tree, and we promote its single child, eb Y, as the new
   root.

   Then, at ctree.c:balance_level(), we call:

      tree_mod_log_insert_root(eb X, eb Y, 1);

3) At tree_mod_log_insert_root() we create tree mod log elements for each
   slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
   with a ->logical pointing to ebX->start. These are placed in an array
   named tm_list.
   Lets assume there are N elements (N pointers in eb X);

4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
   element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
   ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
   to the level of eb X and ->generation set to the generation of eb X;

5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
   tm_list as argument. After that, tree_mod_log_free_eb() calls
   __tree_mod_log_insert() for each member of tm_list in reverse order,
   from highest slot in eb X, slot N - 1, to slot 0 of eb X;

6) __tree_mod_log_insert() sets the sequence number of each given tree mod
   log operation - it increments fs_info->tree_mod_seq and sets
   fs_info->tree_mod_seq as the sequence number of the given tree mod log
   operation.

   This means that for the tm_list created at tree_mod_log_insert_root(),
   the element corresponding to slot 0 of eb X has the highest sequence
   number (1 + N), and the element corresponding to the last slot has the
   lowest sequence number (2);

7) Then, after inserting tm_list's elements into the tree mod log rbtree,
   the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
   sequence number, which is N + 2;

8) Back to ctree.c:balance_level(), we free eb X by calling
   btrfs_free_tree_block() on it. Because eb X was created in the current
   transaction, has no other references and writeback did not happen for
   it, we add it back to the free space cache/tree;

9) Later some other task T allocates the metadata extent from eb X, since
   it is marked as free space in the space cache/tree, and uses it as a
   node for some other btree;

10) The tree mod log user task calls btrfs_search_old_slot(), which calls
    get_old_root(), and finally that calls __tree_mod_log_oldest_root()
    with time_seq == 1 and eb_root == eb Y;

11) First iteration of the while loop finds the tree mod log element with
    sequence number N + 2, for the logical address of eb Y and of type
    MOD_LOG_ROOT_REPLACE;

12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
    of the loop, and set root_logical to point to tm->old_root.logical
    which corresponds to the logical address of eb X;

13) On the next iteration of the while loop, the call to
    tree_mod_log_search_oldest() returns the smallest tree mod log element
    for the logical address of eb X, which has a sequence number of 2, an
    operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
    the old slot N - 1 of eb X (eb X had N items in it before being freed);

14) We then break out of the while loop and return the tree mod log operation
    of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
    eb X, to get_old_root();

15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
    and set "logical" to the logical address of eb X, which was the old
    root. We then call tree_mod_log_search() passing it the logical
    address of eb X and time_seq == 1;

16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
    which results in adding a tree mod log operation of type
    MOD_LOG_KEY_ADD to the tree mod log - this is done at
    ctree.c:insert_ptr() - but after adding the tree mod log operation
    and before updating the number of items in eb X from 0 to 1...

17) The task at get_old_root() calls tree_mod_log_search() and gets the
    tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
    Then it enters the following if branch:

    if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
       (...)
    } (...)

    Calls read_tree_block() for eb X, which gets a reference on eb X but
    does not lock it - task T has it locked.
    Then it clones eb X while it has nritems set to 0 in its header, before
    task T sets nritems to 1 in eb X's header. From hereupon we use the
    clone of eb X which no other task has access to;

18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
    mod log operation we just got from tree_mod_log_search() in the
    previous step and the cloned version of eb X;

19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
    of items set in eb X's clone, which is 0. Then we enter the while loop,
    and in its first iteration we process the MOD_LOG_KEY_ADD operation,
    which just decrements "n" from 0 to (u32)-1, since "n" is declared with
    a type of u32. At the end of this iteration we call rb_next() to find the
    next tree mod log operation for eb X, that gives us the mod log operation
    of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
    number of N + 1 (steps 3 to 6);

20) Then we go back to the top of the while loop and trigger the following
    BUG_ON():

        (...)
        switch (tm->op) {
        case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
                 BUG_ON(tm->slot < n);
                 fallthrough;
        (...)

    Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.

Fix this by taking a read lock on the extent buffer before cloning it at
ctree.c:get_old_root(). This should be done regardless of the extent
buffer having been freed and reused, as a concurrent task might be
modifying it (while holding a write lock on it).

Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index f33e69f02230..19d7b81cee55 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1365,7 +1365,9 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
 				   "failed to read tree block %llu from get_old_root",
 				   logical);
 		} else {
+			btrfs_tree_read_lock(old);
 			eb = btrfs_clone_extent_buffer(old);
+			btrfs_tree_read_unlock(old);
 			free_extent_buffer(old);
 		}
 	} else if (old_root) {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
  2021-03-11 14:31 ` [PATCH 1/9] btrfs: fix race when cloning extent buffer during rewind of an old root fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-15 19:28   ` David Sterba
  2021-03-11 14:31 ` [PATCH 3/9] btrfs: move the tree mod log code into its own file fdmanana
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

When freeing a tree block we may end up adding its extent back to the
free space cache/tree, as long as there are no more references for it,
it was created in the current transaction and writeback for it never
happened. This is generally fine, however when we have tree mod log
operations it can result in inconsistent versions of a btree after
unwinding extent buffers with the recorded tree mod log operations.

This is because:

* We only log operations for nodes (adding and removing key/pointers),
  for leaves we don't do anything;

* This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation
  for a node that points to a leaf that was deleted;

* Before we apply the logged operation to unwind a node, we can have
  that leaf's extent allocated again, either as a node or as a leaf, and
  possibly for another btree. This is possible if the leaf was created in
  the current transaction and writeback for it never started, in which
  case btrfs_free_tree_block() returns its extent back to the free space
  cache/tree;

* Then, before applying the tree mod log operation, some task allocates
  the metadata extent just freed before, and uses it either as a leaf or
  as a node for some btree (can be the same or another one, it does not
  matter);

* After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now
  get the target node with an item pointing to the metadata extent that
  now has content different from what it had before the leaf was deleted.
  It might now belong to a different btree and be a node and not a leaf
  anymore.

  As a consequence, the results of searches after the unwinding can be
  unpredictable and produce unexpected results.

So make sure we pin extent buffers corresponding to leaves when there
are tree mod log users.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5e228d6ad63f..2482b26b1971 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3310,6 +3310,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 	if (last_ref && btrfs_header_generation(buf) == trans->transid) {
 		struct btrfs_block_group *cache;
+		bool must_pin = false;
 
 		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 			ret = check_ref_cleanup(trans, buf->start);
@@ -3327,7 +3328,27 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
-		if (btrfs_is_zoned(fs_info)) {
+		/*
+		 * If this is a leaf and there are tree mod log users, we may
+		 * have recorded mod log operations that point to this leaf.
+		 * So we must make sure no one reuses this leaf's extent before
+		 * mod log operations are applied to a node, otherwise after
+		 * rewinding a node using the mod log operations we get an
+		 * inconsistent btree, as the leaf's extent may now be used as
+		 * a node or leaf for another different btree.
+		 * We are safe from races here because at this point no other
+		 * node or root points to this extent buffer, so if after this
+		 * check a new tree mod log user joins, it will not be able to
+		 * find a node pointing to this leaf and record operations that
+		 * point to this leaf.
+		 */
+		if (btrfs_header_level(buf) == 0) {
+			read_lock(&fs_info->tree_mod_log_lock);
+			must_pin = !list_empty(&fs_info->tree_mod_seq_list);
+			read_unlock(&fs_info->tree_mod_log_lock);
+		}
+
+		if (must_pin || btrfs_is_zoned(fs_info)) {
 			btrfs_redirty_list_add(trans->transaction, buf);
 			pin_down_extent(trans, cache, buf->start, buf->len, 1);
 			btrfs_put_block_group(cache);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/9] btrfs: move the tree mod log code into its own file
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
  2021-03-11 14:31 ` [PATCH 1/9] btrfs: fix race when cloning extent buffer during rewind of an old root fdmanana
  2021-03-11 14:31 ` [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-11 17:26     ` kernel test robot
  2021-03-12  8:50   ` Anand Jain
  2021-03-11 14:31 ` [PATCH 4/9] btrfs: use booleans where appropriate for the tree mod log functions fdmanana
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

The tree modification log, which records modifications done to btrees, is
quite large and currently spread all over ctree.c, which is a huge file
already.

To make things better organized, move all that code into its own separate
source and header files. Functions and definitions that are used outside
of the module (mostly by ctree.c) are renamed so that they start with a
"btrfs_" prefix. Everything else remains unchanged.

This makes it easier to go over the tree modification log code everytime
I need to go read it to fix a bug.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/Makefile       |   2 +-
 fs/btrfs/backref.c      |  33 +-
 fs/btrfs/ctree.c        | 956 ++--------------------------------------
 fs/btrfs/ctree.h        |  17 -
 fs/btrfs/delayed-ref.c  |   9 +-
 fs/btrfs/qgroup.c       |   9 +-
 fs/btrfs/tree-mod-log.c | 889 +++++++++++++++++++++++++++++++++++++
 fs/btrfs/tree-mod-log.h |  52 +++
 8 files changed, 1006 insertions(+), 961 deletions(-)
 create mode 100644 fs/btrfs/tree-mod-log.c
 create mode 100644 fs/btrfs/tree-mod-log.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index b634c42115ea..fdba34b10a98 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -28,7 +28,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
 	   block-rsv.o delalloc-space.o block-group.o discard.o reflink.o \
-	   subpage.o
+	   subpage.o tree-mod-log.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index f47c1528eb9a..117d423fdb93 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -14,6 +14,7 @@
 #include "delayed-ref.h"
 #include "locking.h"
 #include "misc.h"
+#include "tree-mod-log.h"
 
 /* Just an arbitrary number so we can be sure this happened */
 #define BACKREF_FOUND_SHARED 6
@@ -452,7 +453,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 	if (path->slots[0] >= btrfs_header_nritems(eb) ||
 	    is_shared_data_backref(preftrees, eb->start) ||
 	    ref->root_id != btrfs_header_owner(eb)) {
-		if (time_seq == SEQ_LAST)
+		if (time_seq == BTRFS_SEQ_LAST)
 			ret = btrfs_next_leaf(root, path);
 		else
 			ret = btrfs_next_old_leaf(root, path, time_seq);
@@ -476,7 +477,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 		if (slot == 0 &&
 		    (is_shared_data_backref(preftrees, eb->start) ||
 		     ref->root_id != btrfs_header_owner(eb))) {
-			if (time_seq == SEQ_LAST)
+			if (time_seq == BTRFS_SEQ_LAST)
 				ret = btrfs_next_leaf(root, path);
 			else
 				ret = btrfs_next_old_leaf(root, path, time_seq);
@@ -514,7 +515,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 			eie = NULL;
 		}
 next:
-		if (time_seq == SEQ_LAST)
+		if (time_seq == BTRFS_SEQ_LAST)
 			ret = btrfs_next_item(root, path);
 		else
 			ret = btrfs_next_old_item(root, path, time_seq);
@@ -574,7 +575,7 @@ static int resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 
 	if (path->search_commit_root)
 		root_level = btrfs_header_level(root->commit_root);
-	else if (time_seq == SEQ_LAST)
+	else if (time_seq == BTRFS_SEQ_LAST)
 		root_level = btrfs_header_level(root->node);
 	else
 		root_level = btrfs_old_root_level(root, time_seq);
@@ -605,7 +606,7 @@ static int resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 	    search_key.offset >= LLONG_MAX)
 		search_key.offset = 0;
 	path->lowest_level = level;
-	if (time_seq == SEQ_LAST)
+	if (time_seq == BTRFS_SEQ_LAST)
 		ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
 	else
 		ret = btrfs_search_old_slot(root, &search_key, path, time_seq);
@@ -1147,8 +1148,8 @@ static int add_keyed_refs(struct btrfs_fs_info *fs_info,
  * indirect refs to their parent bytenr.
  * When roots are found, they're added to the roots list
  *
- * If time_seq is set to SEQ_LAST, it will not search delayed_refs, and behave
- * much like trans == NULL case, the difference only lies in it will not
+ * If time_seq is set to BTRFS_SEQ_LAST, it will not search delayed_refs, and
+ * behave much like trans == NULL case, the difference only lies in it will not
  * commit root.
  * The special case is for qgroup to search roots in commit_transaction().
  *
@@ -1199,7 +1200,7 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
 		path->skip_locking = 1;
 	}
 
-	if (time_seq == SEQ_LAST)
+	if (time_seq == BTRFS_SEQ_LAST)
 		path->skip_locking = 1;
 
 	/*
@@ -1217,9 +1218,9 @@ static int find_parent_nodes(struct btrfs_trans_handle *trans,
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
 	if (trans && likely(trans->type != __TRANS_DUMMY) &&
-	    time_seq != SEQ_LAST) {
+	    time_seq != BTRFS_SEQ_LAST) {
 #else
-	if (trans && time_seq != SEQ_LAST) {
+	if (trans && time_seq != BTRFS_SEQ_LAST) {
 #endif
 		/*
 		 * look if there are updates for this ref queued and lock the
@@ -1527,7 +1528,7 @@ int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr,
 	struct btrfs_trans_handle *trans;
 	struct ulist_iterator uiter;
 	struct ulist_node *node;
-	struct seq_list elem = SEQ_LIST_INIT(elem);
+	struct btrfs_seq_list elem = BTRFS_SEQ_LIST_INIT(elem);
 	int ret = 0;
 	struct share_check shared = {
 		.root_objectid = root->root_key.objectid,
@@ -1953,7 +1954,7 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 	struct ulist *roots = NULL;
 	struct ulist_node *ref_node = NULL;
 	struct ulist_node *root_node = NULL;
-	struct seq_list tree_mod_seq_elem = SEQ_LIST_INIT(tree_mod_seq_elem);
+	struct btrfs_seq_list seq_elem = BTRFS_SEQ_LIST_INIT(seq_elem);
 	struct ulist_iterator ref_uiter;
 	struct ulist_iterator root_uiter;
 
@@ -1971,12 +1972,12 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 	}
 
 	if (trans)
-		btrfs_get_tree_mod_seq(fs_info, &tree_mod_seq_elem);
+		btrfs_get_tree_mod_seq(fs_info, &seq_elem);
 	else
 		down_read(&fs_info->commit_root_sem);
 
 	ret = btrfs_find_all_leafs(trans, fs_info, extent_item_objectid,
-				   tree_mod_seq_elem.seq, &refs,
+				   seq_elem.seq, &refs,
 				   &extent_item_pos, ignore_offset);
 	if (ret)
 		goto out;
@@ -1984,7 +1985,7 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 	ULIST_ITER_INIT(&ref_uiter);
 	while (!ret && (ref_node = ulist_next(refs, &ref_uiter))) {
 		ret = btrfs_find_all_roots_safe(trans, fs_info, ref_node->val,
-						tree_mod_seq_elem.seq, &roots,
+						seq_elem.seq, &roots,
 						ignore_offset);
 		if (ret)
 			break;
@@ -2007,7 +2008,7 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 	free_leaf_list(refs);
 out:
 	if (trans) {
-		btrfs_put_tree_mod_seq(fs_info, &tree_mod_seq_elem);
+		btrfs_put_tree_mod_seq(fs_info, &seq_elem);
 		btrfs_end_transaction(trans);
 	} else {
 		up_read(&fs_info->commit_root_sem);
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 19d7b81cee55..2c6e525b3458 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -14,6 +14,7 @@
 #include "locking.h"
 #include "volumes.h"
 #include "qgroup.h"
+#include "tree-mod-log.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
 		      *root, struct btrfs_path *path, int level);
@@ -233,597 +234,6 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-enum mod_log_op {
-	MOD_LOG_KEY_REPLACE,
-	MOD_LOG_KEY_ADD,
-	MOD_LOG_KEY_REMOVE,
-	MOD_LOG_KEY_REMOVE_WHILE_FREEING,
-	MOD_LOG_KEY_REMOVE_WHILE_MOVING,
-	MOD_LOG_MOVE_KEYS,
-	MOD_LOG_ROOT_REPLACE,
-};
-
-struct tree_mod_root {
-	u64 logical;
-	u8 level;
-};
-
-struct tree_mod_elem {
-	struct rb_node node;
-	u64 logical;
-	u64 seq;
-	enum mod_log_op op;
-
-	/* this is used for MOD_LOG_KEY_* and MOD_LOG_MOVE_KEYS operations */
-	int slot;
-
-	/* this is used for MOD_LOG_KEY* and MOD_LOG_ROOT_REPLACE */
-	u64 generation;
-
-	/* those are used for op == MOD_LOG_KEY_{REPLACE,REMOVE} */
-	struct btrfs_disk_key key;
-	u64 blockptr;
-
-	/* this is used for op == MOD_LOG_MOVE_KEYS */
-	struct {
-		int dst_slot;
-		int nr_items;
-	} move;
-
-	/* this is used for op == MOD_LOG_ROOT_REPLACE */
-	struct tree_mod_root old_root;
-};
-
-/*
- * Pull a new tree mod seq number for our operation.
- */
-static inline u64 btrfs_inc_tree_mod_seq(struct btrfs_fs_info *fs_info)
-{
-	return atomic64_inc_return(&fs_info->tree_mod_seq);
-}
-
-/*
- * This adds a new blocker to the tree mod log's blocker list if the @elem
- * passed does not already have a sequence number set. So when a caller expects
- * to record tree modifications, it should ensure to set elem->seq to zero
- * before calling btrfs_get_tree_mod_seq.
- * Returns a fresh, unused tree log modification sequence number, even if no new
- * blocker was added.
- */
-u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
-			   struct seq_list *elem)
-{
-	write_lock(&fs_info->tree_mod_log_lock);
-	if (!elem->seq) {
-		elem->seq = btrfs_inc_tree_mod_seq(fs_info);
-		list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
-	}
-	write_unlock(&fs_info->tree_mod_log_lock);
-
-	return elem->seq;
-}
-
-void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
-			    struct seq_list *elem)
-{
-	struct rb_root *tm_root;
-	struct rb_node *node;
-	struct rb_node *next;
-	struct tree_mod_elem *tm;
-	u64 min_seq = (u64)-1;
-	u64 seq_putting = elem->seq;
-
-	if (!seq_putting)
-		return;
-
-	write_lock(&fs_info->tree_mod_log_lock);
-	list_del(&elem->list);
-	elem->seq = 0;
-
-	if (!list_empty(&fs_info->tree_mod_seq_list)) {
-		struct seq_list *first;
-
-		first = list_first_entry(&fs_info->tree_mod_seq_list,
-					 struct seq_list, list);
-		if (seq_putting > first->seq) {
-			/*
-			 * Blocker with lower sequence number exists, we
-			 * cannot remove anything from the log.
-			 */
-			write_unlock(&fs_info->tree_mod_log_lock);
-			return;
-		}
-		min_seq = first->seq;
-	}
-
-	/*
-	 * anything that's lower than the lowest existing (read: blocked)
-	 * sequence number can be removed from the tree.
-	 */
-	tm_root = &fs_info->tree_mod_log;
-	for (node = rb_first(tm_root); node; node = next) {
-		next = rb_next(node);
-		tm = rb_entry(node, struct tree_mod_elem, node);
-		if (tm->seq >= min_seq)
-			continue;
-		rb_erase(node, tm_root);
-		kfree(tm);
-	}
-	write_unlock(&fs_info->tree_mod_log_lock);
-}
-
-/*
- * key order of the log:
- *       node/leaf start address -> sequence
- *
- * The 'start address' is the logical address of the *new* root node
- * for root replace operations, or the logical address of the affected
- * block for all other operations.
- */
-static noinline int
-__tree_mod_log_insert(struct btrfs_fs_info *fs_info, struct tree_mod_elem *tm)
-{
-	struct rb_root *tm_root;
-	struct rb_node **new;
-	struct rb_node *parent = NULL;
-	struct tree_mod_elem *cur;
-
-	lockdep_assert_held_write(&fs_info->tree_mod_log_lock);
-
-	tm->seq = btrfs_inc_tree_mod_seq(fs_info);
-
-	tm_root = &fs_info->tree_mod_log;
-	new = &tm_root->rb_node;
-	while (*new) {
-		cur = rb_entry(*new, struct tree_mod_elem, node);
-		parent = *new;
-		if (cur->logical < tm->logical)
-			new = &((*new)->rb_left);
-		else if (cur->logical > tm->logical)
-			new = &((*new)->rb_right);
-		else if (cur->seq < tm->seq)
-			new = &((*new)->rb_left);
-		else if (cur->seq > tm->seq)
-			new = &((*new)->rb_right);
-		else
-			return -EEXIST;
-	}
-
-	rb_link_node(&tm->node, parent, new);
-	rb_insert_color(&tm->node, tm_root);
-	return 0;
-}
-
-/*
- * Determines if logging can be omitted. Returns 1 if it can. Otherwise, it
- * returns zero with the tree_mod_log_lock acquired. The caller must hold
- * this until all tree mod log insertions are recorded in the rb tree and then
- * write unlock fs_info::tree_mod_log_lock.
- */
-static inline int tree_mod_dont_log(struct btrfs_fs_info *fs_info,
-				    struct extent_buffer *eb) {
-	smp_mb();
-	if (list_empty(&(fs_info)->tree_mod_seq_list))
-		return 1;
-	if (eb && btrfs_header_level(eb) == 0)
-		return 1;
-
-	write_lock(&fs_info->tree_mod_log_lock);
-	if (list_empty(&(fs_info)->tree_mod_seq_list)) {
-		write_unlock(&fs_info->tree_mod_log_lock);
-		return 1;
-	}
-
-	return 0;
-}
-
-/* Similar to tree_mod_dont_log, but doesn't acquire any locks. */
-static inline int tree_mod_need_log(const struct btrfs_fs_info *fs_info,
-				    struct extent_buffer *eb)
-{
-	smp_mb();
-	if (list_empty(&(fs_info)->tree_mod_seq_list))
-		return 0;
-	if (eb && btrfs_header_level(eb) == 0)
-		return 0;
-
-	return 1;
-}
-
-static struct tree_mod_elem *
-alloc_tree_mod_elem(struct extent_buffer *eb, int slot,
-		    enum mod_log_op op, gfp_t flags)
-{
-	struct tree_mod_elem *tm;
-
-	tm = kzalloc(sizeof(*tm), flags);
-	if (!tm)
-		return NULL;
-
-	tm->logical = eb->start;
-	if (op != MOD_LOG_KEY_ADD) {
-		btrfs_node_key(eb, &tm->key, slot);
-		tm->blockptr = btrfs_node_blockptr(eb, slot);
-	}
-	tm->op = op;
-	tm->slot = slot;
-	tm->generation = btrfs_node_ptr_generation(eb, slot);
-	RB_CLEAR_NODE(&tm->node);
-
-	return tm;
-}
-
-static noinline int tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
-		enum mod_log_op op, gfp_t flags)
-{
-	struct tree_mod_elem *tm;
-	int ret;
-
-	if (!tree_mod_need_log(eb->fs_info, eb))
-		return 0;
-
-	tm = alloc_tree_mod_elem(eb, slot, op, flags);
-	if (!tm)
-		return -ENOMEM;
-
-	if (tree_mod_dont_log(eb->fs_info, eb)) {
-		kfree(tm);
-		return 0;
-	}
-
-	ret = __tree_mod_log_insert(eb->fs_info, tm);
-	write_unlock(&eb->fs_info->tree_mod_log_lock);
-	if (ret)
-		kfree(tm);
-
-	return ret;
-}
-
-static noinline int tree_mod_log_insert_move(struct extent_buffer *eb,
-		int dst_slot, int src_slot, int nr_items)
-{
-	struct tree_mod_elem *tm = NULL;
-	struct tree_mod_elem **tm_list = NULL;
-	int ret = 0;
-	int i;
-	int locked = 0;
-
-	if (!tree_mod_need_log(eb->fs_info, eb))
-		return 0;
-
-	tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
-	if (!tm_list)
-		return -ENOMEM;
-
-	tm = kzalloc(sizeof(*tm), GFP_NOFS);
-	if (!tm) {
-		ret = -ENOMEM;
-		goto free_tms;
-	}
-
-	tm->logical = eb->start;
-	tm->slot = src_slot;
-	tm->move.dst_slot = dst_slot;
-	tm->move.nr_items = nr_items;
-	tm->op = MOD_LOG_MOVE_KEYS;
-
-	for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
-		tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
-		    MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
-		if (!tm_list[i]) {
-			ret = -ENOMEM;
-			goto free_tms;
-		}
-	}
-
-	if (tree_mod_dont_log(eb->fs_info, eb))
-		goto free_tms;
-	locked = 1;
-
-	/*
-	 * When we override something during the move, we log these removals.
-	 * This can only happen when we move towards the beginning of the
-	 * buffer, i.e. dst_slot < src_slot.
-	 */
-	for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
-		ret = __tree_mod_log_insert(eb->fs_info, tm_list[i]);
-		if (ret)
-			goto free_tms;
-	}
-
-	ret = __tree_mod_log_insert(eb->fs_info, tm);
-	if (ret)
-		goto free_tms;
-	write_unlock(&eb->fs_info->tree_mod_log_lock);
-	kfree(tm_list);
-
-	return 0;
-free_tms:
-	for (i = 0; i < nr_items; i++) {
-		if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
-			rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
-		kfree(tm_list[i]);
-	}
-	if (locked)
-		write_unlock(&eb->fs_info->tree_mod_log_lock);
-	kfree(tm_list);
-	kfree(tm);
-
-	return ret;
-}
-
-static inline int
-__tree_mod_log_free_eb(struct btrfs_fs_info *fs_info,
-		       struct tree_mod_elem **tm_list,
-		       int nritems)
-{
-	int i, j;
-	int ret;
-
-	for (i = nritems - 1; i >= 0; i--) {
-		ret = __tree_mod_log_insert(fs_info, tm_list[i]);
-		if (ret) {
-			for (j = nritems - 1; j > i; j--)
-				rb_erase(&tm_list[j]->node,
-					 &fs_info->tree_mod_log);
-			return ret;
-		}
-	}
-
-	return 0;
-}
-
-static noinline int tree_mod_log_insert_root(struct extent_buffer *old_root,
-			 struct extent_buffer *new_root, int log_removal)
-{
-	struct btrfs_fs_info *fs_info = old_root->fs_info;
-	struct tree_mod_elem *tm = NULL;
-	struct tree_mod_elem **tm_list = NULL;
-	int nritems = 0;
-	int ret = 0;
-	int i;
-
-	if (!tree_mod_need_log(fs_info, NULL))
-		return 0;
-
-	if (log_removal && btrfs_header_level(old_root) > 0) {
-		nritems = btrfs_header_nritems(old_root);
-		tm_list = kcalloc(nritems, sizeof(struct tree_mod_elem *),
-				  GFP_NOFS);
-		if (!tm_list) {
-			ret = -ENOMEM;
-			goto free_tms;
-		}
-		for (i = 0; i < nritems; i++) {
-			tm_list[i] = alloc_tree_mod_elem(old_root, i,
-			    MOD_LOG_KEY_REMOVE_WHILE_FREEING, GFP_NOFS);
-			if (!tm_list[i]) {
-				ret = -ENOMEM;
-				goto free_tms;
-			}
-		}
-	}
-
-	tm = kzalloc(sizeof(*tm), GFP_NOFS);
-	if (!tm) {
-		ret = -ENOMEM;
-		goto free_tms;
-	}
-
-	tm->logical = new_root->start;
-	tm->old_root.logical = old_root->start;
-	tm->old_root.level = btrfs_header_level(old_root);
-	tm->generation = btrfs_header_generation(old_root);
-	tm->op = MOD_LOG_ROOT_REPLACE;
-
-	if (tree_mod_dont_log(fs_info, NULL))
-		goto free_tms;
-
-	if (tm_list)
-		ret = __tree_mod_log_free_eb(fs_info, tm_list, nritems);
-	if (!ret)
-		ret = __tree_mod_log_insert(fs_info, tm);
-
-	write_unlock(&fs_info->tree_mod_log_lock);
-	if (ret)
-		goto free_tms;
-	kfree(tm_list);
-
-	return ret;
-
-free_tms:
-	if (tm_list) {
-		for (i = 0; i < nritems; i++)
-			kfree(tm_list[i]);
-		kfree(tm_list);
-	}
-	kfree(tm);
-
-	return ret;
-}
-
-static struct tree_mod_elem *
-__tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq,
-		      int smallest)
-{
-	struct rb_root *tm_root;
-	struct rb_node *node;
-	struct tree_mod_elem *cur = NULL;
-	struct tree_mod_elem *found = NULL;
-
-	read_lock(&fs_info->tree_mod_log_lock);
-	tm_root = &fs_info->tree_mod_log;
-	node = tm_root->rb_node;
-	while (node) {
-		cur = rb_entry(node, struct tree_mod_elem, node);
-		if (cur->logical < start) {
-			node = node->rb_left;
-		} else if (cur->logical > start) {
-			node = node->rb_right;
-		} else if (cur->seq < min_seq) {
-			node = node->rb_left;
-		} else if (!smallest) {
-			/* we want the node with the highest seq */
-			if (found)
-				BUG_ON(found->seq > cur->seq);
-			found = cur;
-			node = node->rb_left;
-		} else if (cur->seq > min_seq) {
-			/* we want the node with the smallest seq */
-			if (found)
-				BUG_ON(found->seq < cur->seq);
-			found = cur;
-			node = node->rb_right;
-		} else {
-			found = cur;
-			break;
-		}
-	}
-	read_unlock(&fs_info->tree_mod_log_lock);
-
-	return found;
-}
-
-/*
- * this returns the element from the log with the smallest time sequence
- * value that's in the log (the oldest log item). any element with a time
- * sequence lower than min_seq will be ignored.
- */
-static struct tree_mod_elem *
-tree_mod_log_search_oldest(struct btrfs_fs_info *fs_info, u64 start,
-			   u64 min_seq)
-{
-	return __tree_mod_log_search(fs_info, start, min_seq, 1);
-}
-
-/*
- * this returns the element from the log with the largest time sequence
- * value that's in the log (the most recent log item). any element with
- * a time sequence lower than min_seq will be ignored.
- */
-static struct tree_mod_elem *
-tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq)
-{
-	return __tree_mod_log_search(fs_info, start, min_seq, 0);
-}
-
-static noinline int tree_mod_log_eb_copy(struct extent_buffer *dst,
-		     struct extent_buffer *src, unsigned long dst_offset,
-		     unsigned long src_offset, int nr_items)
-{
-	struct btrfs_fs_info *fs_info = dst->fs_info;
-	int ret = 0;
-	struct tree_mod_elem **tm_list = NULL;
-	struct tree_mod_elem **tm_list_add, **tm_list_rem;
-	int i;
-	int locked = 0;
-
-	if (!tree_mod_need_log(fs_info, NULL))
-		return 0;
-
-	if (btrfs_header_level(dst) == 0 && btrfs_header_level(src) == 0)
-		return 0;
-
-	tm_list = kcalloc(nr_items * 2, sizeof(struct tree_mod_elem *),
-			  GFP_NOFS);
-	if (!tm_list)
-		return -ENOMEM;
-
-	tm_list_add = tm_list;
-	tm_list_rem = tm_list + nr_items;
-	for (i = 0; i < nr_items; i++) {
-		tm_list_rem[i] = alloc_tree_mod_elem(src, i + src_offset,
-		    MOD_LOG_KEY_REMOVE, GFP_NOFS);
-		if (!tm_list_rem[i]) {
-			ret = -ENOMEM;
-			goto free_tms;
-		}
-
-		tm_list_add[i] = alloc_tree_mod_elem(dst, i + dst_offset,
-		    MOD_LOG_KEY_ADD, GFP_NOFS);
-		if (!tm_list_add[i]) {
-			ret = -ENOMEM;
-			goto free_tms;
-		}
-	}
-
-	if (tree_mod_dont_log(fs_info, NULL))
-		goto free_tms;
-	locked = 1;
-
-	for (i = 0; i < nr_items; i++) {
-		ret = __tree_mod_log_insert(fs_info, tm_list_rem[i]);
-		if (ret)
-			goto free_tms;
-		ret = __tree_mod_log_insert(fs_info, tm_list_add[i]);
-		if (ret)
-			goto free_tms;
-	}
-
-	write_unlock(&fs_info->tree_mod_log_lock);
-	kfree(tm_list);
-
-	return 0;
-
-free_tms:
-	for (i = 0; i < nr_items * 2; i++) {
-		if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
-			rb_erase(&tm_list[i]->node, &fs_info->tree_mod_log);
-		kfree(tm_list[i]);
-	}
-	if (locked)
-		write_unlock(&fs_info->tree_mod_log_lock);
-	kfree(tm_list);
-
-	return ret;
-}
-
-static noinline int tree_mod_log_free_eb(struct extent_buffer *eb)
-{
-	struct tree_mod_elem **tm_list = NULL;
-	int nritems = 0;
-	int i;
-	int ret = 0;
-
-	if (btrfs_header_level(eb) == 0)
-		return 0;
-
-	if (!tree_mod_need_log(eb->fs_info, NULL))
-		return 0;
-
-	nritems = btrfs_header_nritems(eb);
-	tm_list = kcalloc(nritems, sizeof(struct tree_mod_elem *), GFP_NOFS);
-	if (!tm_list)
-		return -ENOMEM;
-
-	for (i = 0; i < nritems; i++) {
-		tm_list[i] = alloc_tree_mod_elem(eb, i,
-		    MOD_LOG_KEY_REMOVE_WHILE_FREEING, GFP_NOFS);
-		if (!tm_list[i]) {
-			ret = -ENOMEM;
-			goto free_tms;
-		}
-	}
-
-	if (tree_mod_dont_log(eb->fs_info, eb))
-		goto free_tms;
-
-	ret = __tree_mod_log_free_eb(eb->fs_info, tm_list, nritems);
-	write_unlock(&eb->fs_info->tree_mod_log_lock);
-	if (ret)
-		goto free_tms;
-	kfree(tm_list);
-
-	return 0;
-
-free_tms:
-	for (i = 0; i < nritems; i++)
-		kfree(tm_list[i]);
-	kfree(tm_list);
-
-	return ret;
-}
-
 /*
  * check if the tree block can be shared by multiple trees
  */
@@ -1090,7 +500,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 			parent_start = buf->start;
 
 		atomic_inc(&cow->refs);
-		ret = tree_mod_log_insert_root(root->node, cow, 1);
+		ret = btrfs_tree_mod_log_insert_root(root->node, cow, 1);
 		BUG_ON(ret < 0);
 		rcu_assign_pointer(root->node, cow);
 
@@ -1100,15 +510,15 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 		add_root_to_dirty_list(root);
 	} else {
 		WARN_ON(trans->transid != btrfs_header_generation(parent));
-		tree_mod_log_insert_key(parent, parent_slot,
-					MOD_LOG_KEY_REPLACE, GFP_NOFS);
+		btrfs_tree_mod_log_insert_key(parent, parent_slot,
+					      BTRFS_MOD_LOG_KEY_REPLACE, GFP_NOFS);
 		btrfs_set_node_blockptr(parent, parent_slot,
 					cow->start);
 		btrfs_set_node_ptr_generation(parent, parent_slot,
 					      trans->transid);
 		btrfs_mark_buffer_dirty(parent);
 		if (last_ref) {
-			ret = tree_mod_log_free_eb(buf);
+			ret = btrfs_tree_mod_log_free_eb(buf);
 			if (ret) {
 				btrfs_tree_unlock(cow);
 				free_extent_buffer(cow);
@@ -1127,298 +537,6 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 	return 0;
 }
 
-/*
- * returns the logical address of the oldest predecessor of the given root.
- * entries older than time_seq are ignored.
- */
-static struct tree_mod_elem *__tree_mod_log_oldest_root(
-		struct extent_buffer *eb_root, u64 time_seq)
-{
-	struct tree_mod_elem *tm;
-	struct tree_mod_elem *found = NULL;
-	u64 root_logical = eb_root->start;
-	int looped = 0;
-
-	if (!time_seq)
-		return NULL;
-
-	/*
-	 * the very last operation that's logged for a root is the
-	 * replacement operation (if it is replaced at all). this has
-	 * the logical address of the *new* root, making it the very
-	 * first operation that's logged for this root.
-	 */
-	while (1) {
-		tm = tree_mod_log_search_oldest(eb_root->fs_info, root_logical,
-						time_seq);
-		if (!looped && !tm)
-			return NULL;
-		/*
-		 * if there are no tree operation for the oldest root, we simply
-		 * return it. this should only happen if that (old) root is at
-		 * level 0.
-		 */
-		if (!tm)
-			break;
-
-		/*
-		 * if there's an operation that's not a root replacement, we
-		 * found the oldest version of our root. normally, we'll find a
-		 * MOD_LOG_KEY_REMOVE_WHILE_FREEING operation here.
-		 */
-		if (tm->op != MOD_LOG_ROOT_REPLACE)
-			break;
-
-		found = tm;
-		root_logical = tm->old_root.logical;
-		looped = 1;
-	}
-
-	/* if there's no old root to return, return what we found instead */
-	if (!found)
-		found = tm;
-
-	return found;
-}
-
-/*
- * tm is a pointer to the first operation to rewind within eb. then, all
- * previous operations will be rewound (until we reach something older than
- * time_seq).
- */
-static void
-__tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct extent_buffer *eb,
-		      u64 time_seq, struct tree_mod_elem *first_tm)
-{
-	u32 n;
-	struct rb_node *next;
-	struct tree_mod_elem *tm = first_tm;
-	unsigned long o_dst;
-	unsigned long o_src;
-	unsigned long p_size = sizeof(struct btrfs_key_ptr);
-
-	n = btrfs_header_nritems(eb);
-	read_lock(&fs_info->tree_mod_log_lock);
-	while (tm && tm->seq >= time_seq) {
-		/*
-		 * all the operations are recorded with the operator used for
-		 * the modification. as we're going backwards, we do the
-		 * opposite of each operation here.
-		 */
-		switch (tm->op) {
-		case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
-			BUG_ON(tm->slot < n);
-			fallthrough;
-		case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
-		case MOD_LOG_KEY_REMOVE:
-			btrfs_set_node_key(eb, &tm->key, tm->slot);
-			btrfs_set_node_blockptr(eb, tm->slot, tm->blockptr);
-			btrfs_set_node_ptr_generation(eb, tm->slot,
-						      tm->generation);
-			n++;
-			break;
-		case MOD_LOG_KEY_REPLACE:
-			BUG_ON(tm->slot >= n);
-			btrfs_set_node_key(eb, &tm->key, tm->slot);
-			btrfs_set_node_blockptr(eb, tm->slot, tm->blockptr);
-			btrfs_set_node_ptr_generation(eb, tm->slot,
-						      tm->generation);
-			break;
-		case MOD_LOG_KEY_ADD:
-			/* if a move operation is needed it's in the log */
-			n--;
-			break;
-		case MOD_LOG_MOVE_KEYS:
-			o_dst = btrfs_node_key_ptr_offset(tm->slot);
-			o_src = btrfs_node_key_ptr_offset(tm->move.dst_slot);
-			memmove_extent_buffer(eb, o_dst, o_src,
-					      tm->move.nr_items * p_size);
-			break;
-		case MOD_LOG_ROOT_REPLACE:
-			/*
-			 * this operation is special. for roots, this must be
-			 * handled explicitly before rewinding.
-			 * for non-roots, this operation may exist if the node
-			 * was a root: root A -> child B; then A gets empty and
-			 * B is promoted to the new root. in the mod log, we'll
-			 * have a root-replace operation for B, a tree block
-			 * that is no root. we simply ignore that operation.
-			 */
-			break;
-		}
-		next = rb_next(&tm->node);
-		if (!next)
-			break;
-		tm = rb_entry(next, struct tree_mod_elem, node);
-		if (tm->logical != first_tm->logical)
-			break;
-	}
-	read_unlock(&fs_info->tree_mod_log_lock);
-	btrfs_set_header_nritems(eb, n);
-}
-
-/*
- * Called with eb read locked. If the buffer cannot be rewound, the same buffer
- * is returned. If rewind operations happen, a fresh buffer is returned. The
- * returned buffer is always read-locked. If the returned buffer is not the
- * input buffer, the lock on the input buffer is released and the input buffer
- * is freed (its refcount is decremented).
- */
-static struct extent_buffer *
-tree_mod_log_rewind(struct btrfs_fs_info *fs_info, struct btrfs_path *path,
-		    struct extent_buffer *eb, u64 time_seq)
-{
-	struct extent_buffer *eb_rewin;
-	struct tree_mod_elem *tm;
-
-	if (!time_seq)
-		return eb;
-
-	if (btrfs_header_level(eb) == 0)
-		return eb;
-
-	tm = tree_mod_log_search(fs_info, eb->start, time_seq);
-	if (!tm)
-		return eb;
-
-	if (tm->op == MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
-		BUG_ON(tm->slot != 0);
-		eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start);
-		if (!eb_rewin) {
-			btrfs_tree_read_unlock(eb);
-			free_extent_buffer(eb);
-			return NULL;
-		}
-		btrfs_set_header_bytenr(eb_rewin, eb->start);
-		btrfs_set_header_backref_rev(eb_rewin,
-					     btrfs_header_backref_rev(eb));
-		btrfs_set_header_owner(eb_rewin, btrfs_header_owner(eb));
-		btrfs_set_header_level(eb_rewin, btrfs_header_level(eb));
-	} else {
-		eb_rewin = btrfs_clone_extent_buffer(eb);
-		if (!eb_rewin) {
-			btrfs_tree_read_unlock(eb);
-			free_extent_buffer(eb);
-			return NULL;
-		}
-	}
-
-	btrfs_tree_read_unlock(eb);
-	free_extent_buffer(eb);
-
-	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb_rewin),
-				       eb_rewin, btrfs_header_level(eb_rewin));
-	btrfs_tree_read_lock(eb_rewin);
-	__tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
-	WARN_ON(btrfs_header_nritems(eb_rewin) >
-		BTRFS_NODEPTRS_PER_BLOCK(fs_info));
-
-	return eb_rewin;
-}
-
-/*
- * get_old_root() rewinds the state of @root's root node to the given @time_seq
- * value. If there are no changes, the current root->root_node is returned. If
- * anything changed in between, there's a fresh buffer allocated on which the
- * rewind operations are done. In any case, the returned buffer is read locked.
- * Returns NULL on error (with no locks held).
- */
-static inline struct extent_buffer *
-get_old_root(struct btrfs_root *root, u64 time_seq)
-{
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct tree_mod_elem *tm;
-	struct extent_buffer *eb = NULL;
-	struct extent_buffer *eb_root;
-	u64 eb_root_owner = 0;
-	struct extent_buffer *old;
-	struct tree_mod_root *old_root = NULL;
-	u64 old_generation = 0;
-	u64 logical;
-	int level;
-
-	eb_root = btrfs_read_lock_root_node(root);
-	tm = __tree_mod_log_oldest_root(eb_root, time_seq);
-	if (!tm)
-		return eb_root;
-
-	if (tm->op == MOD_LOG_ROOT_REPLACE) {
-		old_root = &tm->old_root;
-		old_generation = tm->generation;
-		logical = old_root->logical;
-		level = old_root->level;
-	} else {
-		logical = eb_root->start;
-		level = btrfs_header_level(eb_root);
-	}
-
-	tm = tree_mod_log_search(fs_info, logical, time_seq);
-	if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
-		btrfs_tree_read_unlock(eb_root);
-		free_extent_buffer(eb_root);
-		old = read_tree_block(fs_info, logical, root->root_key.objectid,
-				      0, level, NULL);
-		if (WARN_ON(IS_ERR(old) || !extent_buffer_uptodate(old))) {
-			if (!IS_ERR(old))
-				free_extent_buffer(old);
-			btrfs_warn(fs_info,
-				   "failed to read tree block %llu from get_old_root",
-				   logical);
-		} else {
-			btrfs_tree_read_lock(old);
-			eb = btrfs_clone_extent_buffer(old);
-			btrfs_tree_read_unlock(old);
-			free_extent_buffer(old);
-		}
-	} else if (old_root) {
-		eb_root_owner = btrfs_header_owner(eb_root);
-		btrfs_tree_read_unlock(eb_root);
-		free_extent_buffer(eb_root);
-		eb = alloc_dummy_extent_buffer(fs_info, logical);
-	} else {
-		eb = btrfs_clone_extent_buffer(eb_root);
-		btrfs_tree_read_unlock(eb_root);
-		free_extent_buffer(eb_root);
-	}
-
-	if (!eb)
-		return NULL;
-	if (old_root) {
-		btrfs_set_header_bytenr(eb, eb->start);
-		btrfs_set_header_backref_rev(eb, BTRFS_MIXED_BACKREF_REV);
-		btrfs_set_header_owner(eb, eb_root_owner);
-		btrfs_set_header_level(eb, old_root->level);
-		btrfs_set_header_generation(eb, old_generation);
-	}
-	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb), eb,
-				       btrfs_header_level(eb));
-	btrfs_tree_read_lock(eb);
-	if (tm)
-		__tree_mod_log_rewind(fs_info, eb, time_seq, tm);
-	else
-		WARN_ON(btrfs_header_level(eb) != 0);
-	WARN_ON(btrfs_header_nritems(eb) > BTRFS_NODEPTRS_PER_BLOCK(fs_info));
-
-	return eb;
-}
-
-int btrfs_old_root_level(struct btrfs_root *root, u64 time_seq)
-{
-	struct tree_mod_elem *tm;
-	int level;
-	struct extent_buffer *eb_root = btrfs_root_node(root);
-
-	tm = __tree_mod_log_oldest_root(eb_root, time_seq);
-	if (tm && tm->op == MOD_LOG_ROOT_REPLACE) {
-		level = tm->old_root.level;
-	} else {
-		level = btrfs_header_level(eb_root);
-	}
-	free_extent_buffer(eb_root);
-
-	return level;
-}
-
 static inline int should_cow_block(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root,
 				   struct extent_buffer *buf)
@@ -1840,7 +958,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 			goto enospc;
 		}
 
-		ret = tree_mod_log_insert_root(root->node, child, 1);
+		ret = btrfs_tree_mod_log_insert_root(root->node, child, 1);
 		BUG_ON(ret < 0);
 		rcu_assign_pointer(root->node, child);
 
@@ -1920,8 +1038,8 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		} else {
 			struct btrfs_disk_key right_key;
 			btrfs_node_key(right, &right_key, 0);
-			ret = tree_mod_log_insert_key(parent, pslot + 1,
-					MOD_LOG_KEY_REPLACE, GFP_NOFS);
+			ret = btrfs_tree_mod_log_insert_key(parent, pslot + 1,
+					BTRFS_MOD_LOG_KEY_REPLACE, GFP_NOFS);
 			BUG_ON(ret < 0);
 			btrfs_set_node_key(parent, &right_key, pslot + 1);
 			btrfs_mark_buffer_dirty(parent);
@@ -1966,8 +1084,8 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		/* update the parent key to reflect our changes */
 		struct btrfs_disk_key mid_key;
 		btrfs_node_key(mid, &mid_key, 0);
-		ret = tree_mod_log_insert_key(parent, pslot,
-				MOD_LOG_KEY_REPLACE, GFP_NOFS);
+		ret = btrfs_tree_mod_log_insert_key(parent, pslot,
+				BTRFS_MOD_LOG_KEY_REPLACE, GFP_NOFS);
 		BUG_ON(ret < 0);
 		btrfs_set_node_key(parent, &mid_key, pslot);
 		btrfs_mark_buffer_dirty(parent);
@@ -2068,8 +1186,8 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 			struct btrfs_disk_key disk_key;
 			orig_slot += left_nr;
 			btrfs_node_key(mid, &disk_key, 0);
-			ret = tree_mod_log_insert_key(parent, pslot,
-					MOD_LOG_KEY_REPLACE, GFP_NOFS);
+			ret = btrfs_tree_mod_log_insert_key(parent, pslot,
+					BTRFS_MOD_LOG_KEY_REPLACE, GFP_NOFS);
 			BUG_ON(ret < 0);
 			btrfs_set_node_key(parent, &disk_key, pslot);
 			btrfs_mark_buffer_dirty(parent);
@@ -2122,8 +1240,8 @@ static noinline int push_nodes_for_insert(struct btrfs_trans_handle *trans,
 			struct btrfs_disk_key disk_key;
 
 			btrfs_node_key(right, &disk_key, 0);
-			ret = tree_mod_log_insert_key(parent, pslot + 1,
-					MOD_LOG_KEY_REPLACE, GFP_NOFS);
+			ret = btrfs_tree_mod_log_insert_key(parent, pslot + 1,
+					BTRFS_MOD_LOG_KEY_REPLACE, GFP_NOFS);
 			BUG_ON(ret < 0);
 			btrfs_set_node_key(parent, &disk_key, pslot + 1);
 			btrfs_mark_buffer_dirty(parent);
@@ -2881,7 +1999,7 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 	}
 
 again:
-	b = get_old_root(root, time_seq);
+	b = btrfs_get_old_root(root, time_seq);
 	if (!b) {
 		ret = -EIO;
 		goto done;
@@ -2936,7 +2054,7 @@ int btrfs_search_old_slot(struct btrfs_root *root, const struct btrfs_key *key,
 
 		level = btrfs_header_level(b);
 		btrfs_tree_read_lock(b);
-		b = tree_mod_log_rewind(fs_info, p, b, time_seq);
+		b = btrfs_tree_mod_log_rewind(fs_info, p, b, time_seq);
 		if (!b) {
 			ret = -ENOMEM;
 			goto done;
@@ -3050,8 +2168,8 @@ static void fixup_low_keys(struct btrfs_path *path,
 		if (!path->nodes[i])
 			break;
 		t = path->nodes[i];
-		ret = tree_mod_log_insert_key(t, tslot, MOD_LOG_KEY_REPLACE,
-				GFP_ATOMIC);
+		ret = btrfs_tree_mod_log_insert_key(t, tslot,
+				BTRFS_MOD_LOG_KEY_REPLACE, GFP_ATOMIC);
 		BUG_ON(ret < 0);
 		btrfs_set_node_key(t, key, tslot);
 		btrfs_mark_buffer_dirty(path->nodes[i]);
@@ -3214,7 +2332,7 @@ static int push_node_left(struct btrfs_trans_handle *trans,
 		btrfs_abort_transaction(trans, ret);
 		return ret;
 	}
-	ret = tree_mod_log_eb_copy(dst, src, dst_nritems, 0, push_items);
+	ret = btrfs_tree_mod_log_eb_copy(dst, src, dst_nritems, 0, push_items);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		return ret;
@@ -3226,8 +2344,8 @@ static int push_node_left(struct btrfs_trans_handle *trans,
 
 	if (push_items < src_nritems) {
 		/*
-		 * Don't call tree_mod_log_insert_move here, key removal was
-		 * already fully logged by tree_mod_log_eb_copy above.
+		 * Don't call btrfs_tree_mod_log_insert_move() here, key removal
+		 * was already fully logged by btrfs_tree_mod_log_eb_copy() above.
 		 */
 		memmove_extent_buffer(src, btrfs_node_key_ptr_offset(0),
 				      btrfs_node_key_ptr_offset(push_items),
@@ -3288,15 +2406,15 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
 		btrfs_abort_transaction(trans, ret);
 		return ret;
 	}
-	ret = tree_mod_log_insert_move(dst, push_items, 0, dst_nritems);
+	ret = btrfs_tree_mod_log_insert_move(dst, push_items, 0, dst_nritems);
 	BUG_ON(ret < 0);
 	memmove_extent_buffer(dst, btrfs_node_key_ptr_offset(push_items),
 				      btrfs_node_key_ptr_offset(0),
 				      (dst_nritems) *
 				      sizeof(struct btrfs_key_ptr));
 
-	ret = tree_mod_log_eb_copy(dst, src, 0, src_nritems - push_items,
-				   push_items);
+	ret = btrfs_tree_mod_log_eb_copy(dst, src, 0, src_nritems - push_items,
+					 push_items);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		return ret;
@@ -3362,7 +2480,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(c);
 
 	old = root->node;
-	ret = tree_mod_log_insert_root(root->node, c, 0);
+	ret = btrfs_tree_mod_log_insert_root(root->node, c, 0);
 	BUG_ON(ret < 0);
 	rcu_assign_pointer(root->node, c);
 
@@ -3401,8 +2519,8 @@ static void insert_ptr(struct btrfs_trans_handle *trans,
 	BUG_ON(nritems == BTRFS_NODEPTRS_PER_BLOCK(trans->fs_info));
 	if (slot != nritems) {
 		if (level) {
-			ret = tree_mod_log_insert_move(lower, slot + 1, slot,
-					nritems - slot);
+			ret = btrfs_tree_mod_log_insert_move(lower, slot + 1,
+					slot, nritems - slot);
 			BUG_ON(ret < 0);
 		}
 		memmove_extent_buffer(lower,
@@ -3411,8 +2529,8 @@ static void insert_ptr(struct btrfs_trans_handle *trans,
 			      (nritems - slot) * sizeof(struct btrfs_key_ptr));
 	}
 	if (level) {
-		ret = tree_mod_log_insert_key(lower, slot, MOD_LOG_KEY_ADD,
-				GFP_NOFS);
+		ret = btrfs_tree_mod_log_insert_key(lower, slot,
+					    BTRFS_MOD_LOG_KEY_ADD, GFP_NOFS);
 		BUG_ON(ret < 0);
 	}
 	btrfs_set_node_key(lower, key, slot);
@@ -3453,9 +2571,9 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 		 * tree mod log: We don't log_removal old root in
 		 * insert_new_root, because that root buffer will be kept as a
 		 * normal node. We are going to log removal of half of the
-		 * elements below with tree_mod_log_eb_copy. We're holding a
-		 * tree lock on the buffer, which is why we cannot race with
-		 * other tree_mod_log users.
+		 * elements below with btrfs_tree_mod_log_eb_copy(). We're
+		 * holding a tree lock on the buffer, which is why we cannot
+		 * race with other tree_mod_log users.
 		 */
 		ret = insert_new_root(trans, root, path, level + 1);
 		if (ret)
@@ -3482,7 +2600,7 @@ static noinline int split_node(struct btrfs_trans_handle *trans,
 	root_add_used(root, fs_info->nodesize);
 	ASSERT(btrfs_header_level(c) == level);
 
-	ret = tree_mod_log_eb_copy(split, c, 0, mid, c_nritems - mid);
+	ret = btrfs_tree_mod_log_eb_copy(split, c, 0, mid, c_nritems - mid);
 	if (ret) {
 		btrfs_abort_transaction(trans, ret);
 		return ret;
@@ -4864,8 +3982,8 @@ static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
 	nritems = btrfs_header_nritems(parent);
 	if (slot != nritems - 1) {
 		if (level) {
-			ret = tree_mod_log_insert_move(parent, slot, slot + 1,
-					nritems - slot - 1);
+			ret = btrfs_tree_mod_log_insert_move(parent, slot,
+					slot + 1, nritems - slot - 1);
 			BUG_ON(ret < 0);
 		}
 		memmove_extent_buffer(parent,
@@ -4874,8 +3992,8 @@ static void del_ptr(struct btrfs_root *root, struct btrfs_path *path,
 			      sizeof(struct btrfs_key_ptr) *
 			      (nritems - slot - 1));
 	} else if (level) {
-		ret = tree_mod_log_insert_key(parent, slot, MOD_LOG_KEY_REMOVE,
-				GFP_NOFS);
+		ret = btrfs_tree_mod_log_insert_key(parent, slot,
+				BTRFS_MOD_LOG_KEY_REMOVE, GFP_NOFS);
 		BUG_ON(ret < 0);
 	}
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ae8901aec4f1..50208673bd55 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -500,16 +500,6 @@ struct btrfs_discard_ctl {
 	atomic64_t discard_bytes_saved;
 };
 
-/* delayed seq elem */
-struct seq_list {
-	struct list_head list;
-	u64 seq;
-};
-
-#define SEQ_LIST_INIT(name)	{ .list = LIST_HEAD_INIT((name).list), .seq = 0 }
-
-#define SEQ_LAST	((u64)-1)
-
 enum btrfs_orphan_cleanup_state {
 	ORPHAN_CLEANUP_STARTED	= 1,
 	ORPHAN_CLEANUP_DONE	= 2,
@@ -2946,13 +2936,6 @@ static inline void btrfs_clear_sb_rdonly(struct super_block *sb)
 	clear_bit(BTRFS_FS_STATE_RO, &btrfs_sb(sb)->fs_state);
 }
 
-/* tree mod log functions from ctree.c */
-u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
-			   struct seq_list *elem);
-void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
-			    struct seq_list *elem);
-int btrfs_old_root_level(struct btrfs_root *root, u64 time_seq);
-
 /* root-item.c */
 int btrfs_add_root_ref(struct btrfs_trans_handle *trans, u64 root_id,
 		       u64 ref_id, u64 dirid, u64 sequence, const char *name,
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 63be7d01a9a3..d87472a68bf4 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -11,6 +11,7 @@
 #include "transaction.h"
 #include "qgroup.h"
 #include "space-info.h"
+#include "tree-mod-log.h"
 
 struct kmem_cache *btrfs_delayed_ref_head_cachep;
 struct kmem_cache *btrfs_delayed_tree_ref_cachep;
@@ -496,10 +497,10 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 
 	read_lock(&fs_info->tree_mod_log_lock);
 	if (!list_empty(&fs_info->tree_mod_seq_list)) {
-		struct seq_list *elem;
+		struct btrfs_seq_list *elem;
 
 		elem = list_first_entry(&fs_info->tree_mod_seq_list,
-					struct seq_list, list);
+					struct btrfs_seq_list, list);
 		seq = elem->seq;
 	}
 	read_unlock(&fs_info->tree_mod_log_lock);
@@ -517,13 +518,13 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 
 int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, u64 seq)
 {
-	struct seq_list *elem;
+	struct btrfs_seq_list *elem;
 	int ret = 0;
 
 	read_lock(&fs_info->tree_mod_log_lock);
 	if (!list_empty(&fs_info->tree_mod_seq_list)) {
 		elem = list_first_entry(&fs_info->tree_mod_seq_list,
-					struct seq_list, list);
+					struct btrfs_seq_list, list);
 		if (seq >= elem->seq) {
 			btrfs_debug(fs_info,
 				"holding back delayed_ref %#x.%x, lowest is %#x.%x",
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1b7d3cadc39e..9d8d1ac86962 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -23,6 +23,7 @@
 #include "qgroup.h"
 #include "block-group.h"
 #include "sysfs.h"
+#include "tree-mod-log.h"
 
 /* TODO XXX FIXME
  *  - subvol delete -> delete when ref goes to 0? delete limits also?
@@ -2631,12 +2632,12 @@ int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans)
 					record->data_rsv,
 					BTRFS_QGROUP_RSV_DATA);
 			/*
-			 * Use SEQ_LAST as time_seq to do special search, which
-			 * doesn't lock tree or delayed_refs and search current
-			 * root. It's safe inside commit_transaction().
+			 * Use BTRFS_SEQ_LAST as time_seq to do special search,
+			 * which doesn't lock tree or delayed_refs and search
+			 * current root. It's safe inside commit_transaction().
 			 */
 			ret = btrfs_find_all_roots(trans, fs_info,
-				record->bytenr, SEQ_LAST, &new_roots, false);
+				record->bytenr, BTRFS_SEQ_LAST, &new_roots, false);
 			if (ret < 0)
 				goto cleanup;
 			if (qgroup_to_skip) {
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
new file mode 100644
index 000000000000..fc7eb33ec383
--- /dev/null
+++ b/fs/btrfs/tree-mod-log.c
@@ -0,0 +1,889 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "tree-mod-log.h"
+#include "disk-io.h"
+
+struct tree_mod_root {
+	u64 logical;
+	u8 level;
+};
+
+struct tree_mod_elem {
+	struct rb_node node;
+	u64 logical;
+	u64 seq;
+	enum btrfs_mod_log_op op;
+
+	/*
+	 * This is used for BTRFS_MOD_LOG_KEY_* and BTRFS_MOD_LOG_MOVE_KEYS
+	 * operations.
+	 */
+	int slot;
+
+	/* This is used for BTRFS_MOD_LOG_KEY* and BTRFS_MOD_LOG_ROOT_REPLACE. */
+	u64 generation;
+
+	/* Those are used for op == BTRFS_MOD_LOG_KEY_{REPLACE,REMOVE}. */
+	struct btrfs_disk_key key;
+	u64 blockptr;
+
+	/* This is used for op == BTRFS_MOD_LOG_MOVE_KEYS. */
+	struct {
+		int dst_slot;
+		int nr_items;
+	} move;
+
+	/* This is used for op == BTRFS_MOD_LOG_ROOT_REPLACE. */
+	struct tree_mod_root old_root;
+};
+
+/*
+ * Pull a new tree mod seq number for our operation.
+ */
+static inline u64 btrfs_inc_tree_mod_seq(struct btrfs_fs_info *fs_info)
+{
+	return atomic64_inc_return(&fs_info->tree_mod_seq);
+}
+
+/*
+ * This adds a new blocker to the tree mod log's blocker list if the @elem
+ * passed does not already have a sequence number set. So when a caller expects
+ * to record tree modifications, it should ensure to set elem->seq to zero
+ * before calling btrfs_get_tree_mod_seq.
+ * Returns a fresh, unused tree log modification sequence number, even if no new
+ * blocker was added.
+ */
+u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
+			   struct btrfs_seq_list *elem)
+{
+	write_lock(&fs_info->tree_mod_log_lock);
+	if (!elem->seq) {
+		elem->seq = btrfs_inc_tree_mod_seq(fs_info);
+		list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
+	}
+	write_unlock(&fs_info->tree_mod_log_lock);
+
+	return elem->seq;
+}
+
+void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
+			    struct btrfs_seq_list *elem)
+{
+	struct rb_root *tm_root;
+	struct rb_node *node;
+	struct rb_node *next;
+	struct tree_mod_elem *tm;
+	u64 min_seq = BTRFS_SEQ_LAST;
+	u64 seq_putting = elem->seq;
+
+	if (!seq_putting)
+		return;
+
+	write_lock(&fs_info->tree_mod_log_lock);
+	list_del(&elem->list);
+	elem->seq = 0;
+
+	if (!list_empty(&fs_info->tree_mod_seq_list)) {
+		struct btrfs_seq_list *first;
+
+		first = list_first_entry(&fs_info->tree_mod_seq_list,
+					 struct btrfs_seq_list, list);
+		if (seq_putting > first->seq) {
+			/*
+			 * Blocker with lower sequence number exists, we
+			 * cannot remove anything from the log.
+			 */
+			write_unlock(&fs_info->tree_mod_log_lock);
+			return;
+		}
+		min_seq = first->seq;
+	}
+
+	/*
+	 * anything that's lower than the lowest existing (read: blocked)
+	 * sequence number can be removed from the tree.
+	 */
+	tm_root = &fs_info->tree_mod_log;
+	for (node = rb_first(tm_root); node; node = next) {
+		next = rb_next(node);
+		tm = rb_entry(node, struct tree_mod_elem, node);
+		if (tm->seq >= min_seq)
+			continue;
+		rb_erase(node, tm_root);
+		kfree(tm);
+	}
+	write_unlock(&fs_info->tree_mod_log_lock);
+}
+
+/*
+ * key order of the log:
+ *       node/leaf start address -> sequence
+ *
+ * The 'start address' is the logical address of the *new* root node
+ * for root replace operations, or the logical address of the affected
+ * block for all other operations.
+ */
+static noinline int tree_mod_log_insert(struct btrfs_fs_info *fs_info,
+					struct tree_mod_elem *tm)
+{
+	struct rb_root *tm_root;
+	struct rb_node **new;
+	struct rb_node *parent = NULL;
+	struct tree_mod_elem *cur;
+
+	lockdep_assert_held_write(&fs_info->tree_mod_log_lock);
+
+	tm->seq = btrfs_inc_tree_mod_seq(fs_info);
+
+	tm_root = &fs_info->tree_mod_log;
+	new = &tm_root->rb_node;
+	while (*new) {
+		cur = rb_entry(*new, struct tree_mod_elem, node);
+		parent = *new;
+		if (cur->logical < tm->logical)
+			new = &((*new)->rb_left);
+		else if (cur->logical > tm->logical)
+			new = &((*new)->rb_right);
+		else if (cur->seq < tm->seq)
+			new = &((*new)->rb_left);
+		else if (cur->seq > tm->seq)
+			new = &((*new)->rb_right);
+		else
+			return -EEXIST;
+	}
+
+	rb_link_node(&tm->node, parent, new);
+	rb_insert_color(&tm->node, tm_root);
+	return 0;
+}
+
+/*
+ * Determines if logging can be omitted. Returns 1 if it can. Otherwise, it
+ * returns zero with the tree_mod_log_lock acquired. The caller must hold
+ * this until all tree mod log insertions are recorded in the rb tree and then
+ * write unlock fs_info::tree_mod_log_lock.
+ */
+static inline int tree_mod_dont_log(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb)
+{
+	smp_mb();
+	if (list_empty(&(fs_info)->tree_mod_seq_list))
+		return 1;
+	if (eb && btrfs_header_level(eb) == 0)
+		return 1;
+
+	write_lock(&fs_info->tree_mod_log_lock);
+	if (list_empty(&(fs_info)->tree_mod_seq_list)) {
+		write_unlock(&fs_info->tree_mod_log_lock);
+		return 1;
+	}
+
+	return 0;
+}
+
+/* Similar to tree_mod_dont_log, but doesn't acquire any locks. */
+static inline int tree_mod_need_log(const struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb)
+{
+	smp_mb();
+	if (list_empty(&(fs_info)->tree_mod_seq_list))
+		return 0;
+	if (eb && btrfs_header_level(eb) == 0)
+		return 0;
+
+	return 1;
+}
+
+static struct tree_mod_elem *alloc_tree_mod_elem(struct extent_buffer *eb,
+						 int slot,
+						 enum btrfs_mod_log_op op,
+						 gfp_t flags)
+{
+	struct tree_mod_elem *tm;
+
+	tm = kzalloc(sizeof(*tm), flags);
+	if (!tm)
+		return NULL;
+
+	tm->logical = eb->start;
+	if (op != BTRFS_MOD_LOG_KEY_ADD) {
+		btrfs_node_key(eb, &tm->key, slot);
+		tm->blockptr = btrfs_node_blockptr(eb, slot);
+	}
+	tm->op = op;
+	tm->slot = slot;
+	tm->generation = btrfs_node_ptr_generation(eb, slot);
+	RB_CLEAR_NODE(&tm->node);
+
+	return tm;
+}
+
+int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
+				  enum btrfs_mod_log_op op, gfp_t flags)
+{
+	struct tree_mod_elem *tm;
+	int ret;
+
+	if (!tree_mod_need_log(eb->fs_info, eb))
+		return 0;
+
+	tm = alloc_tree_mod_elem(eb, slot, op, flags);
+	if (!tm)
+		return -ENOMEM;
+
+	if (tree_mod_dont_log(eb->fs_info, eb)) {
+		kfree(tm);
+		return 0;
+	}
+
+	ret = tree_mod_log_insert(eb->fs_info, tm);
+	write_unlock(&eb->fs_info->tree_mod_log_lock);
+	if (ret)
+		kfree(tm);
+
+	return ret;
+}
+
+int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
+				   int dst_slot, int src_slot,
+				   int nr_items)
+{
+	struct tree_mod_elem *tm = NULL;
+	struct tree_mod_elem **tm_list = NULL;
+	int ret = 0;
+	int i;
+	int locked = 0;
+
+	if (!tree_mod_need_log(eb->fs_info, eb))
+		return 0;
+
+	tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
+	if (!tm_list)
+		return -ENOMEM;
+
+	tm = kzalloc(sizeof(*tm), GFP_NOFS);
+	if (!tm) {
+		ret = -ENOMEM;
+		goto free_tms;
+	}
+
+	tm->logical = eb->start;
+	tm->slot = src_slot;
+	tm->move.dst_slot = dst_slot;
+	tm->move.nr_items = nr_items;
+	tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
+
+	for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
+		tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
+		    BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
+		if (!tm_list[i]) {
+			ret = -ENOMEM;
+			goto free_tms;
+		}
+	}
+
+	if (tree_mod_dont_log(eb->fs_info, eb))
+		goto free_tms;
+	locked = 1;
+
+	/*
+	 * When we override something during the move, we log these removals.
+	 * This can only happen when we move towards the beginning of the
+	 * buffer, i.e. dst_slot < src_slot.
+	 */
+	for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
+		ret = tree_mod_log_insert(eb->fs_info, tm_list[i]);
+		if (ret)
+			goto free_tms;
+	}
+
+	ret = tree_mod_log_insert(eb->fs_info, tm);
+	if (ret)
+		goto free_tms;
+	write_unlock(&eb->fs_info->tree_mod_log_lock);
+	kfree(tm_list);
+
+	return 0;
+free_tms:
+	for (i = 0; i < nr_items; i++) {
+		if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
+			rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
+		kfree(tm_list[i]);
+	}
+	if (locked)
+		write_unlock(&eb->fs_info->tree_mod_log_lock);
+	kfree(tm_list);
+	kfree(tm);
+
+	return ret;
+}
+
+static inline int tree_mod_log_free_eb(struct btrfs_fs_info *fs_info,
+				       struct tree_mod_elem **tm_list,
+				       int nritems)
+{
+	int i, j;
+	int ret;
+
+	for (i = nritems - 1; i >= 0; i--) {
+		ret = tree_mod_log_insert(fs_info, tm_list[i]);
+		if (ret) {
+			for (j = nritems - 1; j > i; j--)
+				rb_erase(&tm_list[j]->node,
+					 &fs_info->tree_mod_log);
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
+int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
+				   struct extent_buffer *new_root,
+				   int log_removal)
+{
+	struct btrfs_fs_info *fs_info = old_root->fs_info;
+	struct tree_mod_elem *tm = NULL;
+	struct tree_mod_elem **tm_list = NULL;
+	int nritems = 0;
+	int ret = 0;
+	int i;
+
+	if (!tree_mod_need_log(fs_info, NULL))
+		return 0;
+
+	if (log_removal && btrfs_header_level(old_root) > 0) {
+		nritems = btrfs_header_nritems(old_root);
+		tm_list = kcalloc(nritems, sizeof(struct tree_mod_elem *),
+				  GFP_NOFS);
+		if (!tm_list) {
+			ret = -ENOMEM;
+			goto free_tms;
+		}
+		for (i = 0; i < nritems; i++) {
+			tm_list[i] = alloc_tree_mod_elem(old_root, i,
+			    BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, GFP_NOFS);
+			if (!tm_list[i]) {
+				ret = -ENOMEM;
+				goto free_tms;
+			}
+		}
+	}
+
+	tm = kzalloc(sizeof(*tm), GFP_NOFS);
+	if (!tm) {
+		ret = -ENOMEM;
+		goto free_tms;
+	}
+
+	tm->logical = new_root->start;
+	tm->old_root.logical = old_root->start;
+	tm->old_root.level = btrfs_header_level(old_root);
+	tm->generation = btrfs_header_generation(old_root);
+	tm->op = BTRFS_MOD_LOG_ROOT_REPLACE;
+
+	if (tree_mod_dont_log(fs_info, NULL))
+		goto free_tms;
+
+	if (tm_list)
+		ret = tree_mod_log_free_eb(fs_info, tm_list, nritems);
+	if (!ret)
+		ret = tree_mod_log_insert(fs_info, tm);
+
+	write_unlock(&fs_info->tree_mod_log_lock);
+	if (ret)
+		goto free_tms;
+	kfree(tm_list);
+
+	return ret;
+
+free_tms:
+	if (tm_list) {
+		for (i = 0; i < nritems; i++)
+			kfree(tm_list[i]);
+		kfree(tm_list);
+	}
+	kfree(tm);
+
+	return ret;
+}
+
+static struct tree_mod_elem *__tree_mod_log_search(struct btrfs_fs_info *fs_info,
+						   u64 start, u64 min_seq,
+						   int smallest)
+{
+	struct rb_root *tm_root;
+	struct rb_node *node;
+	struct tree_mod_elem *cur = NULL;
+	struct tree_mod_elem *found = NULL;
+
+	read_lock(&fs_info->tree_mod_log_lock);
+	tm_root = &fs_info->tree_mod_log;
+	node = tm_root->rb_node;
+	while (node) {
+		cur = rb_entry(node, struct tree_mod_elem, node);
+		if (cur->logical < start) {
+			node = node->rb_left;
+		} else if (cur->logical > start) {
+			node = node->rb_right;
+		} else if (cur->seq < min_seq) {
+			node = node->rb_left;
+		} else if (!smallest) {
+			/* we want the node with the highest seq */
+			if (found)
+				BUG_ON(found->seq > cur->seq);
+			found = cur;
+			node = node->rb_left;
+		} else if (cur->seq > min_seq) {
+			/* we want the node with the smallest seq */
+			if (found)
+				BUG_ON(found->seq < cur->seq);
+			found = cur;
+			node = node->rb_right;
+		} else {
+			found = cur;
+			break;
+		}
+	}
+	read_unlock(&fs_info->tree_mod_log_lock);
+
+	return found;
+}
+
+/*
+ * This returns the element from the log with the smallest time sequence
+ * value that's in the log (the oldest log item). any element with a time
+ * sequence lower than min_seq will be ignored.
+ */
+static struct tree_mod_elem *tree_mod_log_search_oldest(struct btrfs_fs_info *fs_info,
+							u64 start, u64 min_seq)
+{
+	return __tree_mod_log_search(fs_info, start, min_seq, 1);
+}
+
+/*
+ * This returns the element from the log with the largest time sequence
+ * value that's in the log (the most recent log item). any element with
+ * a time sequence lower than min_seq will be ignored.
+ */
+static struct tree_mod_elem *tree_mod_log_search(struct btrfs_fs_info *fs_info,
+						 u64 start, u64 min_seq)
+{
+	return __tree_mod_log_search(fs_info, start, min_seq, 0);
+}
+
+
+int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
+			       struct extent_buffer *src,
+			       unsigned long dst_offset,
+			       unsigned long src_offset,
+			       int nr_items)
+{
+	struct btrfs_fs_info *fs_info = dst->fs_info;
+	int ret = 0;
+	struct tree_mod_elem **tm_list = NULL;
+	struct tree_mod_elem **tm_list_add, **tm_list_rem;
+	int i;
+	int locked = 0;
+
+	if (!tree_mod_need_log(fs_info, NULL))
+		return 0;
+
+	if (btrfs_header_level(dst) == 0 && btrfs_header_level(src) == 0)
+		return 0;
+
+	tm_list = kcalloc(nr_items * 2, sizeof(struct tree_mod_elem *),
+			  GFP_NOFS);
+	if (!tm_list)
+		return -ENOMEM;
+
+	tm_list_add = tm_list;
+	tm_list_rem = tm_list + nr_items;
+	for (i = 0; i < nr_items; i++) {
+		tm_list_rem[i] = alloc_tree_mod_elem(src, i + src_offset,
+		    BTRFS_MOD_LOG_KEY_REMOVE, GFP_NOFS);
+		if (!tm_list_rem[i]) {
+			ret = -ENOMEM;
+			goto free_tms;
+		}
+
+		tm_list_add[i] = alloc_tree_mod_elem(dst, i + dst_offset,
+		    BTRFS_MOD_LOG_KEY_ADD, GFP_NOFS);
+		if (!tm_list_add[i]) {
+			ret = -ENOMEM;
+			goto free_tms;
+		}
+	}
+
+	if (tree_mod_dont_log(fs_info, NULL))
+		goto free_tms;
+	locked = 1;
+
+	for (i = 0; i < nr_items; i++) {
+		ret = tree_mod_log_insert(fs_info, tm_list_rem[i]);
+		if (ret)
+			goto free_tms;
+		ret = tree_mod_log_insert(fs_info, tm_list_add[i]);
+		if (ret)
+			goto free_tms;
+	}
+
+	write_unlock(&fs_info->tree_mod_log_lock);
+	kfree(tm_list);
+
+	return 0;
+
+free_tms:
+	for (i = 0; i < nr_items * 2; i++) {
+		if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
+			rb_erase(&tm_list[i]->node, &fs_info->tree_mod_log);
+		kfree(tm_list[i]);
+	}
+	if (locked)
+		write_unlock(&fs_info->tree_mod_log_lock);
+	kfree(tm_list);
+
+	return ret;
+}
+
+int btrfs_tree_mod_log_free_eb(struct extent_buffer *eb)
+{
+	struct tree_mod_elem **tm_list = NULL;
+	int nritems = 0;
+	int i;
+	int ret = 0;
+
+	if (btrfs_header_level(eb) == 0)
+		return 0;
+
+	if (!tree_mod_need_log(eb->fs_info, NULL))
+		return 0;
+
+	nritems = btrfs_header_nritems(eb);
+	tm_list = kcalloc(nritems, sizeof(struct tree_mod_elem *), GFP_NOFS);
+	if (!tm_list)
+		return -ENOMEM;
+
+	for (i = 0; i < nritems; i++) {
+		tm_list[i] = alloc_tree_mod_elem(eb, i,
+		    BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, GFP_NOFS);
+		if (!tm_list[i]) {
+			ret = -ENOMEM;
+			goto free_tms;
+		}
+	}
+
+	if (tree_mod_dont_log(eb->fs_info, eb))
+		goto free_tms;
+
+	ret = tree_mod_log_free_eb(eb->fs_info, tm_list, nritems);
+	write_unlock(&eb->fs_info->tree_mod_log_lock);
+	if (ret)
+		goto free_tms;
+	kfree(tm_list);
+
+	return 0;
+
+free_tms:
+	for (i = 0; i < nritems; i++)
+		kfree(tm_list[i]);
+	kfree(tm_list);
+
+	return ret;
+}
+
+/*
+ * Returns the logical address of the oldest predecessor of the given root.
+ * entries older than time_seq are ignored.
+ */
+static struct tree_mod_elem *tree_mod_log_oldest_root(struct extent_buffer *eb_root,
+						      u64 time_seq)
+{
+	struct tree_mod_elem *tm;
+	struct tree_mod_elem *found = NULL;
+	u64 root_logical = eb_root->start;
+	int looped = 0;
+
+	if (!time_seq)
+		return NULL;
+
+	/*
+	 * the very last operation that's logged for a root is the
+	 * replacement operation (if it is replaced at all). this has
+	 * the logical address of the *new* root, making it the very
+	 * first operation that's logged for this root.
+	 */
+	while (1) {
+		tm = tree_mod_log_search_oldest(eb_root->fs_info, root_logical,
+						time_seq);
+		if (!looped && !tm)
+			return NULL;
+		/*
+		 * if there are no tree operation for the oldest root, we simply
+		 * return it. this should only happen if that (old) root is at
+		 * level 0.
+		 */
+		if (!tm)
+			break;
+
+		/*
+		 * if there's an operation that's not a root replacement, we
+		 * found the oldest version of our root. normally, we'll find a
+		 * BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING operation here.
+		 */
+		if (tm->op != BTRFS_MOD_LOG_ROOT_REPLACE)
+			break;
+
+		found = tm;
+		root_logical = tm->old_root.logical;
+		looped = 1;
+	}
+
+	/* if there's no old root to return, return what we found instead */
+	if (!found)
+		found = tm;
+
+	return found;
+}
+
+
+/*
+ * tm is a pointer to the first operation to rewind within eb. then, all
+ * previous operations will be rewound (until we reach something older than
+ * time_seq).
+ */
+static void tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
+				struct extent_buffer *eb,
+				u64 time_seq,
+				struct tree_mod_elem *first_tm)
+{
+	u32 n;
+	struct rb_node *next;
+	struct tree_mod_elem *tm = first_tm;
+	unsigned long o_dst;
+	unsigned long o_src;
+	unsigned long p_size = sizeof(struct btrfs_key_ptr);
+
+	n = btrfs_header_nritems(eb);
+	read_lock(&fs_info->tree_mod_log_lock);
+	while (tm && tm->seq >= time_seq) {
+		/*
+		 * all the operations are recorded with the operator used for
+		 * the modification. as we're going backwards, we do the
+		 * opposite of each operation here.
+		 */
+		switch (tm->op) {
+		case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
+			BUG_ON(tm->slot < n);
+			fallthrough;
+		case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
+		case BTRFS_MOD_LOG_KEY_REMOVE:
+			btrfs_set_node_key(eb, &tm->key, tm->slot);
+			btrfs_set_node_blockptr(eb, tm->slot, tm->blockptr);
+			btrfs_set_node_ptr_generation(eb, tm->slot,
+						      tm->generation);
+			n++;
+			break;
+		case BTRFS_MOD_LOG_KEY_REPLACE:
+			BUG_ON(tm->slot >= n);
+			btrfs_set_node_key(eb, &tm->key, tm->slot);
+			btrfs_set_node_blockptr(eb, tm->slot, tm->blockptr);
+			btrfs_set_node_ptr_generation(eb, tm->slot,
+						      tm->generation);
+			break;
+		case BTRFS_MOD_LOG_KEY_ADD:
+			/* if a move operation is needed it's in the log */
+			n--;
+			break;
+		case BTRFS_MOD_LOG_MOVE_KEYS:
+			o_dst = btrfs_node_key_ptr_offset(tm->slot);
+			o_src = btrfs_node_key_ptr_offset(tm->move.dst_slot);
+			memmove_extent_buffer(eb, o_dst, o_src,
+					      tm->move.nr_items * p_size);
+			break;
+		case BTRFS_MOD_LOG_ROOT_REPLACE:
+			/*
+			 * this operation is special. for roots, this must be
+			 * handled explicitly before rewinding.
+			 * for non-roots, this operation may exist if the node
+			 * was a root: root A -> child B; then A gets empty and
+			 * B is promoted to the new root. in the mod log, we'll
+			 * have a root-replace operation for B, a tree block
+			 * that is no root. we simply ignore that operation.
+			 */
+			break;
+		}
+		next = rb_next(&tm->node);
+		if (!next)
+			break;
+		tm = rb_entry(next, struct tree_mod_elem, node);
+		if (tm->logical != first_tm->logical)
+			break;
+	}
+	read_unlock(&fs_info->tree_mod_log_lock);
+	btrfs_set_header_nritems(eb, n);
+}
+
+
+/*
+ * Called with eb read locked. If the buffer cannot be rewound, the same buffer
+ * is returned. If rewind operations happen, a fresh buffer is returned. The
+ * returned buffer is always read-locked. If the returned buffer is not the
+ * input buffer, the lock on the input buffer is released and the input buffer
+ * is freed (its refcount is decremented).
+ */
+struct extent_buffer *btrfs_tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
+						struct btrfs_path *path,
+						struct extent_buffer *eb,
+						u64 time_seq)
+{
+	struct extent_buffer *eb_rewin;
+	struct tree_mod_elem *tm;
+
+	if (!time_seq)
+		return eb;
+
+	if (btrfs_header_level(eb) == 0)
+		return eb;
+
+	tm = tree_mod_log_search(fs_info, eb->start, time_seq);
+	if (!tm)
+		return eb;
+
+	if (tm->op == BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
+		BUG_ON(tm->slot != 0);
+		eb_rewin = alloc_dummy_extent_buffer(fs_info, eb->start);
+		if (!eb_rewin) {
+			btrfs_tree_read_unlock(eb);
+			free_extent_buffer(eb);
+			return NULL;
+		}
+		btrfs_set_header_bytenr(eb_rewin, eb->start);
+		btrfs_set_header_backref_rev(eb_rewin,
+					     btrfs_header_backref_rev(eb));
+		btrfs_set_header_owner(eb_rewin, btrfs_header_owner(eb));
+		btrfs_set_header_level(eb_rewin, btrfs_header_level(eb));
+	} else {
+		eb_rewin = btrfs_clone_extent_buffer(eb);
+		if (!eb_rewin) {
+			btrfs_tree_read_unlock(eb);
+			free_extent_buffer(eb);
+			return NULL;
+		}
+	}
+
+	btrfs_tree_read_unlock(eb);
+	free_extent_buffer(eb);
+
+	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb_rewin),
+				       eb_rewin, btrfs_header_level(eb_rewin));
+	btrfs_tree_read_lock(eb_rewin);
+	tree_mod_log_rewind(fs_info, eb_rewin, time_seq, tm);
+	WARN_ON(btrfs_header_nritems(eb_rewin) >
+		BTRFS_NODEPTRS_PER_BLOCK(fs_info));
+
+	return eb_rewin;
+}
+
+/*
+ * Rewinds the state of @root's root node to the given @time_seq value.
+ * If there are no changes, the current root->root_node is returned. If anything
+ * changed in between, there's a fresh buffer allocated on which the rewind
+ * operations are done. In any case, the returned buffer is read locked.
+ * Returns NULL on error (with no locks held).
+ */
+struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq)
+{
+	struct btrfs_fs_info *fs_info = root->fs_info;
+	struct tree_mod_elem *tm;
+	struct extent_buffer *eb = NULL;
+	struct extent_buffer *eb_root;
+	u64 eb_root_owner = 0;
+	struct extent_buffer *old;
+	struct tree_mod_root *old_root = NULL;
+	u64 old_generation = 0;
+	u64 logical;
+	int level;
+
+	eb_root = btrfs_read_lock_root_node(root);
+	tm = tree_mod_log_oldest_root(eb_root, time_seq);
+	if (!tm)
+		return eb_root;
+
+	if (tm->op == BTRFS_MOD_LOG_ROOT_REPLACE) {
+		old_root = &tm->old_root;
+		old_generation = tm->generation;
+		logical = old_root->logical;
+		level = old_root->level;
+	} else {
+		logical = eb_root->start;
+		level = btrfs_header_level(eb_root);
+	}
+
+	tm = tree_mod_log_search(fs_info, logical, time_seq);
+	if (old_root && tm && tm->op != BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
+		btrfs_tree_read_unlock(eb_root);
+		free_extent_buffer(eb_root);
+		old = read_tree_block(fs_info, logical, root->root_key.objectid,
+				      0, level, NULL);
+		if (WARN_ON(IS_ERR(old) || !extent_buffer_uptodate(old))) {
+			if (!IS_ERR(old))
+				free_extent_buffer(old);
+			btrfs_warn(fs_info,
+				   "failed to read tree block %llu from get_old_root",
+				   logical);
+		} else {
+			btrfs_tree_read_lock(old);
+			eb = btrfs_clone_extent_buffer(old);
+			btrfs_tree_read_unlock(old);
+			free_extent_buffer(old);
+		}
+	} else if (old_root) {
+		eb_root_owner = btrfs_header_owner(eb_root);
+		btrfs_tree_read_unlock(eb_root);
+		free_extent_buffer(eb_root);
+		eb = alloc_dummy_extent_buffer(fs_info, logical);
+	} else {
+		eb = btrfs_clone_extent_buffer(eb_root);
+		btrfs_tree_read_unlock(eb_root);
+		free_extent_buffer(eb_root);
+	}
+
+	if (!eb)
+		return NULL;
+	if (old_root) {
+		btrfs_set_header_bytenr(eb, eb->start);
+		btrfs_set_header_backref_rev(eb, BTRFS_MIXED_BACKREF_REV);
+		btrfs_set_header_owner(eb, eb_root_owner);
+		btrfs_set_header_level(eb, old_root->level);
+		btrfs_set_header_generation(eb, old_generation);
+	}
+	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb), eb,
+				       btrfs_header_level(eb));
+	btrfs_tree_read_lock(eb);
+	if (tm)
+		tree_mod_log_rewind(fs_info, eb, time_seq, tm);
+	else
+		WARN_ON(btrfs_header_level(eb) != 0);
+	WARN_ON(btrfs_header_nritems(eb) > BTRFS_NODEPTRS_PER_BLOCK(fs_info));
+
+	return eb;
+}
+
+int btrfs_old_root_level(struct btrfs_root *root, u64 time_seq)
+{
+	struct tree_mod_elem *tm;
+	int level;
+	struct extent_buffer *eb_root = btrfs_root_node(root);
+
+	tm = tree_mod_log_oldest_root(eb_root, time_seq);
+	if (tm && tm->op == BTRFS_MOD_LOG_ROOT_REPLACE)
+		level = tm->old_root.level;
+	else
+		level = btrfs_header_level(eb_root);
+
+	free_extent_buffer(eb_root);
+
+	return level;
+}
+
diff --git a/fs/btrfs/tree-mod-log.h b/fs/btrfs/tree-mod-log.h
new file mode 100644
index 000000000000..15e7a4c4bb14
--- /dev/null
+++ b/fs/btrfs/tree-mod-log.h
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef BTRFS_TREE_MOD_LOG_H
+#define BTRFS_TREE_MOD_LOG_H
+
+#include "ctree.h"
+
+/* Represents a tree mod log user. */
+struct btrfs_seq_list {
+	struct list_head list;
+	u64 seq;
+};
+
+#define BTRFS_SEQ_LIST_INIT(name) { .list = LIST_HEAD_INIT((name).list), .seq = 0 }
+#define BTRFS_SEQ_LAST            ((u64)-1)
+
+enum btrfs_mod_log_op {
+	BTRFS_MOD_LOG_KEY_REPLACE,
+	BTRFS_MOD_LOG_KEY_ADD,
+	BTRFS_MOD_LOG_KEY_REMOVE,
+	BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING,
+	BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING,
+	BTRFS_MOD_LOG_MOVE_KEYS,
+	BTRFS_MOD_LOG_ROOT_REPLACE,
+};
+
+u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
+			   struct btrfs_seq_list *elem);
+void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
+			    struct btrfs_seq_list *elem);
+int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
+				   struct extent_buffer *new_root,
+				   int log_removal);
+int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
+				  enum btrfs_mod_log_op op, gfp_t flags);
+int btrfs_tree_mod_log_free_eb(struct extent_buffer *eb);
+struct extent_buffer *btrfs_tree_mod_log_rewind(struct btrfs_fs_info *fs_info,
+						struct btrfs_path *path,
+						struct extent_buffer *eb,
+						u64 time_seq);
+struct extent_buffer *btrfs_get_old_root(struct btrfs_root *root, u64 time_seq);
+int btrfs_old_root_level(struct btrfs_root *root, u64 time_seq);
+int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
+			       struct extent_buffer *src,
+			       unsigned long dst_offset,
+			       unsigned long src_offset,
+			       int nr_items);
+int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
+				   int dst_slot, int src_slot,
+				   int nr_items);
+
+#endif /* BTRFS_TREE_MOD_LOG_H */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/9] btrfs: use booleans where appropriate for the tree mod log functions
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
                   ` (2 preceding siblings ...)
  2021-03-11 14:31 ` [PATCH 3/9] btrfs: move the tree mod log code into its own file fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-12 12:44   ` Anand Jain
  2021-03-11 14:31 ` [PATCH 5/9] btrfs: use a bit to track the existence of tree mod log users fdmanana
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

Several functions of the tree modification log use integers as booleans,
so change them to use booleans instead, making their use more clear.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.c        |  6 +++---
 fs/btrfs/tree-mod-log.c | 42 ++++++++++++++++++++---------------------
 fs/btrfs/tree-mod-log.h |  2 +-
 3 files changed, 25 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 2c6e525b3458..1b0f28f69cb3 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -500,7 +500,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 			parent_start = buf->start;
 
 		atomic_inc(&cow->refs);
-		ret = btrfs_tree_mod_log_insert_root(root->node, cow, 1);
+		ret = btrfs_tree_mod_log_insert_root(root->node, cow, true);
 		BUG_ON(ret < 0);
 		rcu_assign_pointer(root->node, cow);
 
@@ -958,7 +958,7 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 			goto enospc;
 		}
 
-		ret = btrfs_tree_mod_log_insert_root(root->node, child, 1);
+		ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
 		BUG_ON(ret < 0);
 		rcu_assign_pointer(root->node, child);
 
@@ -2480,7 +2480,7 @@ static noinline int insert_new_root(struct btrfs_trans_handle *trans,
 	btrfs_mark_buffer_dirty(c);
 
 	old = root->node;
-	ret = btrfs_tree_mod_log_insert_root(root->node, c, 0);
+	ret = btrfs_tree_mod_log_insert_root(root->node, c, false);
 	BUG_ON(ret < 0);
 	rcu_assign_pointer(root->node, c);
 
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index fc7eb33ec383..f2a6da1b2903 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -158,40 +158,40 @@ static noinline int tree_mod_log_insert(struct btrfs_fs_info *fs_info,
 }
 
 /*
- * Determines if logging can be omitted. Returns 1 if it can. Otherwise, it
- * returns zero with the tree_mod_log_lock acquired. The caller must hold
+ * Determines if logging can be omitted. Returns true if it can. Otherwise, it
+ * returns false with the tree_mod_log_lock acquired. The caller must hold
  * this until all tree mod log insertions are recorded in the rb tree and then
  * write unlock fs_info::tree_mod_log_lock.
  */
-static inline int tree_mod_dont_log(struct btrfs_fs_info *fs_info,
+static inline bool tree_mod_dont_log(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb)
 {
 	smp_mb();
 	if (list_empty(&(fs_info)->tree_mod_seq_list))
-		return 1;
+		return true;
 	if (eb && btrfs_header_level(eb) == 0)
-		return 1;
+		return true;
 
 	write_lock(&fs_info->tree_mod_log_lock);
 	if (list_empty(&(fs_info)->tree_mod_seq_list)) {
 		write_unlock(&fs_info->tree_mod_log_lock);
-		return 1;
+		return true;
 	}
 
-	return 0;
+	return false;
 }
 
 /* Similar to tree_mod_dont_log, but doesn't acquire any locks. */
-static inline int tree_mod_need_log(const struct btrfs_fs_info *fs_info,
+static inline bool tree_mod_need_log(const struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb)
 {
 	smp_mb();
 	if (list_empty(&(fs_info)->tree_mod_seq_list))
-		return 0;
+		return false;
 	if (eb && btrfs_header_level(eb) == 0)
-		return 0;
+		return false;
 
-	return 1;
+	return true;
 }
 
 static struct tree_mod_elem *alloc_tree_mod_elem(struct extent_buffer *eb,
@@ -252,7 +252,7 @@ int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
 	struct tree_mod_elem **tm_list = NULL;
 	int ret = 0;
 	int i;
-	int locked = 0;
+	bool locked = false;
 
 	if (!tree_mod_need_log(eb->fs_info, eb))
 		return 0;
@@ -284,7 +284,7 @@ int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
 
 	if (tree_mod_dont_log(eb->fs_info, eb))
 		goto free_tms;
-	locked = 1;
+	locked = true;
 
 	/*
 	 * When we override something during the move, we log these removals.
@@ -340,7 +340,7 @@ static inline int tree_mod_log_free_eb(struct btrfs_fs_info *fs_info,
 
 int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
 				   struct extent_buffer *new_root,
-				   int log_removal)
+				   bool log_removal)
 {
 	struct btrfs_fs_info *fs_info = old_root->fs_info;
 	struct tree_mod_elem *tm = NULL;
@@ -410,7 +410,7 @@ int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
 
 static struct tree_mod_elem *__tree_mod_log_search(struct btrfs_fs_info *fs_info,
 						   u64 start, u64 min_seq,
-						   int smallest)
+						   bool smallest)
 {
 	struct rb_root *tm_root;
 	struct rb_node *node;
@@ -458,7 +458,7 @@ static struct tree_mod_elem *__tree_mod_log_search(struct btrfs_fs_info *fs_info
 static struct tree_mod_elem *tree_mod_log_search_oldest(struct btrfs_fs_info *fs_info,
 							u64 start, u64 min_seq)
 {
-	return __tree_mod_log_search(fs_info, start, min_seq, 1);
+	return __tree_mod_log_search(fs_info, start, min_seq, true);
 }
 
 /*
@@ -469,7 +469,7 @@ static struct tree_mod_elem *tree_mod_log_search_oldest(struct btrfs_fs_info *fs
 static struct tree_mod_elem *tree_mod_log_search(struct btrfs_fs_info *fs_info,
 						 u64 start, u64 min_seq)
 {
-	return __tree_mod_log_search(fs_info, start, min_seq, 0);
+	return __tree_mod_log_search(fs_info, start, min_seq, false);
 }
 
 
@@ -484,7 +484,7 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 	struct tree_mod_elem **tm_list = NULL;
 	struct tree_mod_elem **tm_list_add, **tm_list_rem;
 	int i;
-	int locked = 0;
+	bool locked = false;
 
 	if (!tree_mod_need_log(fs_info, NULL))
 		return 0;
@@ -517,7 +517,7 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 
 	if (tree_mod_dont_log(fs_info, NULL))
 		goto free_tms;
-	locked = 1;
+	locked = true;
 
 	for (i = 0; i < nr_items; i++) {
 		ret = tree_mod_log_insert(fs_info, tm_list_rem[i]);
@@ -602,7 +602,7 @@ static struct tree_mod_elem *tree_mod_log_oldest_root(struct extent_buffer *eb_r
 	struct tree_mod_elem *tm;
 	struct tree_mod_elem *found = NULL;
 	u64 root_logical = eb_root->start;
-	int looped = 0;
+	bool looped = false;
 
 	if (!time_seq)
 		return NULL;
@@ -636,7 +636,7 @@ static struct tree_mod_elem *tree_mod_log_oldest_root(struct extent_buffer *eb_r
 
 		found = tm;
 		root_logical = tm->old_root.logical;
-		looped = 1;
+		looped = true;
 	}
 
 	/* if there's no old root to return, return what we found instead */
diff --git a/fs/btrfs/tree-mod-log.h b/fs/btrfs/tree-mod-log.h
index 15e7a4c4bb14..2f7396ea57ae 100644
--- a/fs/btrfs/tree-mod-log.h
+++ b/fs/btrfs/tree-mod-log.h
@@ -30,7 +30,7 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
 			    struct btrfs_seq_list *elem);
 int btrfs_tree_mod_log_insert_root(struct extent_buffer *old_root,
 				   struct extent_buffer *new_root,
-				   int log_removal);
+				   bool log_removal);
 int btrfs_tree_mod_log_insert_key(struct extent_buffer *eb, int slot,
 				  enum btrfs_mod_log_op op, gfp_t flags);
 int btrfs_tree_mod_log_free_eb(struct extent_buffer *eb);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/9] btrfs: use a bit to track the existence of tree mod log users
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
                   ` (3 preceding siblings ...)
  2021-03-11 14:31 ` [PATCH 4/9] btrfs: use booleans where appropriate for the tree mod log functions fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-13  7:26   ` Wang Yugui
  2021-03-11 14:31 ` [PATCH 6/9] btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block() fdmanana
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

The tree modification log functions are called very frequently, basically
they are called everytime a btree is modified (a pointer added or removed
to a node, a new root for a btree is set, etc). Because of that, to avoid
heavy lock contention on the lock that protects the list of tree mod log
users, we have checks that test the emptiness of the list with a full
memory barrier before the checks, so that when there are no tree mod log
users we avoid taking the lock.

Replace the memory barrier and list emptiness check with a test for a new
bit set at fs_info->flags. This bit is used to indicate when there are
tree mod log users, set whenever a user is added to the list and cleared
when the last user is removed from the list. This makes the intention a
bit more obvious and possibly more efficient (assuming test_bit() may be
cheaper than a full memory barrier on some architectures).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/ctree.h        |  3 +++
 fs/btrfs/tree-mod-log.c | 11 ++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 50208673bd55..90184ee2f832 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -580,6 +580,9 @@ enum {
 
 	/* Indicate that we can't trust the free space tree for caching yet */
 	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
+
+	/* Indicate whether there are any tree modification log users. */
+	BTRFS_FS_TREE_MOD_LOG_USERS,
 };
 
 /*
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index f2a6da1b2903..b912b82c36c9 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -60,6 +60,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
 	if (!elem->seq) {
 		elem->seq = btrfs_inc_tree_mod_seq(fs_info);
 		list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
+		set_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags);
 	}
 	write_unlock(&fs_info->tree_mod_log_lock);
 
@@ -83,7 +84,9 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
 	list_del(&elem->list);
 	elem->seq = 0;
 
-	if (!list_empty(&fs_info->tree_mod_seq_list)) {
+	if (list_empty(&fs_info->tree_mod_seq_list)) {
+		clear_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags);
+	} else {
 		struct btrfs_seq_list *first;
 
 		first = list_first_entry(&fs_info->tree_mod_seq_list,
@@ -166,8 +169,7 @@ static noinline int tree_mod_log_insert(struct btrfs_fs_info *fs_info,
 static inline bool tree_mod_dont_log(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb)
 {
-	smp_mb();
-	if (list_empty(&(fs_info)->tree_mod_seq_list))
+	if (!test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
 		return true;
 	if (eb && btrfs_header_level(eb) == 0)
 		return true;
@@ -185,8 +187,7 @@ static inline bool tree_mod_dont_log(struct btrfs_fs_info *fs_info,
 static inline bool tree_mod_need_log(const struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb)
 {
-	smp_mb();
-	if (list_empty(&(fs_info)->tree_mod_seq_list))
+	if (!test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
 		return false;
 	if (eb && btrfs_header_level(eb) == 0)
 		return false;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/9] btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
                   ` (4 preceding siblings ...)
  2021-03-11 14:31 ` [PATCH 5/9] btrfs: use a bit to track the existence of tree mod log users fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-11 14:31 ` [PATCH 7/9] btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb() fdmanana
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

Instead of exposing implementation details of the tree mod log to check
if there are active tree mod log users at btrfs_free_tree_block(), use
the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This
way extent-tree.c does not need to known about any of the internals of
the tree mod log and avoids taking a lock unnecessarily as well.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/extent-tree.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2482b26b1971..7a28314189b4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3342,11 +3342,9 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 		 * find a node pointing to this leaf and record operations that
 		 * point to this leaf.
 		 */
-		if (btrfs_header_level(buf) == 0) {
-			read_lock(&fs_info->tree_mod_log_lock);
-			must_pin = !list_empty(&fs_info->tree_mod_seq_list);
-			read_unlock(&fs_info->tree_mod_log_lock);
-		}
+		if (btrfs_header_level(buf) == 0 &&
+		    test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
+			must_pin = true;
 
 		if (must_pin || btrfs_is_zoned(fs_info)) {
 			btrfs_redirty_list_add(trans->transaction, buf);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 7/9] btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb()
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
                   ` (5 preceding siblings ...)
  2021-03-11 14:31 ` [PATCH 6/9] btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block() fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-11 14:31 ` [PATCH 8/9] btrfs: add and use helper to get lowest sequence number for the tree mod log fdmanana
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf,
and if so, return immediately and do nothing. However this check can be
removed, because after it we call tree_mod_need_log(), which returns
false when given an extent buffer that corresponds to a leaf.

So just remove the leaf check and pass the extent buffer to
tree_mod_need_log().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-mod-log.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index b912b82c36c9..c854cebf27f6 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -554,10 +554,7 @@ int btrfs_tree_mod_log_free_eb(struct extent_buffer *eb)
 	int i;
 	int ret = 0;
 
-	if (btrfs_header_level(eb) == 0)
-		return 0;
-
-	if (!tree_mod_need_log(eb->fs_info, NULL))
+	if (!tree_mod_need_log(eb->fs_info, eb))
 		return 0;
 
 	nritems = btrfs_header_nritems(eb);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 8/9] btrfs: add and use helper to get lowest sequence number for the tree mod log
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
                   ` (6 preceding siblings ...)
  2021-03-11 14:31 ` [PATCH 7/9] btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb() fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-11 14:31 ` [PATCH 9/9] btrfs: update debug message when checking seq number of a delayed ref fdmanana
  2021-03-16 16:58 ` [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings David Sterba
  9 siblings, 0 replies; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

There are two places outside the tree mod log module that extract the
lowest sequence number of the tree mod log. These places end up
duplicating code and open coding the logic and internal implementation
details of the tree mod log. So add a helper to the tree mod log module
and header that returns the lowest sequence number or 0 if there aren't
any tree mod log users at the moment.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-ref.c  | 31 ++++++++-----------------------
 fs/btrfs/tree-mod-log.c | 25 +++++++++++++++++++++++++
 fs/btrfs/tree-mod-log.h |  1 +
 3 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index d87472a68bf4..7a3131cbf1fb 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -495,16 +495,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 	if (head->is_data)
 		return;
 
-	read_lock(&fs_info->tree_mod_log_lock);
-	if (!list_empty(&fs_info->tree_mod_seq_list)) {
-		struct btrfs_seq_list *elem;
-
-		elem = list_first_entry(&fs_info->tree_mod_seq_list,
-					struct btrfs_seq_list, list);
-		seq = elem->seq;
-	}
-	read_unlock(&fs_info->tree_mod_log_lock);
-
+	seq = btrfs_tree_mod_log_lowest_seq(fs_info);
 again:
 	for (node = rb_first_cached(&head->ref_tree); node;
 	     node = rb_next(node)) {
@@ -518,23 +509,17 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 
 int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, u64 seq)
 {
-	struct btrfs_seq_list *elem;
+	u64 min_seq = btrfs_tree_mod_log_lowest_seq(fs_info);
 	int ret = 0;
 
-	read_lock(&fs_info->tree_mod_log_lock);
-	if (!list_empty(&fs_info->tree_mod_seq_list)) {
-		elem = list_first_entry(&fs_info->tree_mod_seq_list,
-					struct btrfs_seq_list, list);
-		if (seq >= elem->seq) {
-			btrfs_debug(fs_info,
-				"holding back delayed_ref %#x.%x, lowest is %#x.%x",
-				(u32)(seq >> 32), (u32)seq,
-				(u32)(elem->seq >> 32), (u32)elem->seq);
-			ret = 1;
-		}
+	if (min_seq != 0 && seq >= min_seq) {
+		btrfs_debug(fs_info,
+			    "holding back delayed_ref %#x.%x, lowest is %#x.%x",
+			    (u32)(seq >> 32), (u32)seq,
+			    (u32)(min_seq >> 32), (u32)min_seq);
+		ret = 1;
 	}
 
-	read_unlock(&fs_info->tree_mod_log_lock);
 	return ret;
 }
 
diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
index c854cebf27f6..b2083b575e21 100644
--- a/fs/btrfs/tree-mod-log.c
+++ b/fs/btrfs/tree-mod-log.c
@@ -885,3 +885,28 @@ int btrfs_old_root_level(struct btrfs_root *root, u64 time_seq)
 	return level;
 }
 
+/*
+ * Return the lowest sequence number in the tree modification log.
+ *
+ * @fs_info:	the filesystem context
+ *
+ * Returns the sequence number of the oldest tree modification log user, which
+ * corresponds to the lowest sequence number of all existing users. If there are
+ * no users it returns 0.
+ */
+u64 btrfs_tree_mod_log_lowest_seq(struct btrfs_fs_info *fs_info)
+{
+	u64 ret = 0;
+
+	read_lock(&fs_info->tree_mod_log_lock);
+	if (!list_empty(&fs_info->tree_mod_seq_list)) {
+		struct btrfs_seq_list *elem;
+
+		elem = list_first_entry(&fs_info->tree_mod_seq_list,
+					struct btrfs_seq_list, list);
+		ret = elem->seq;
+	}
+	read_unlock(&fs_info->tree_mod_log_lock);
+
+	return ret;
+}
diff --git a/fs/btrfs/tree-mod-log.h b/fs/btrfs/tree-mod-log.h
index 2f7396ea57ae..d218f8a661bc 100644
--- a/fs/btrfs/tree-mod-log.h
+++ b/fs/btrfs/tree-mod-log.h
@@ -48,5 +48,6 @@ int btrfs_tree_mod_log_eb_copy(struct extent_buffer *dst,
 int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
 				   int dst_slot, int src_slot,
 				   int nr_items);
+u64 btrfs_tree_mod_log_lowest_seq(struct btrfs_fs_info *fs_info);
 
 #endif /* BTRFS_TREE_MOD_LOG_H */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 9/9] btrfs: update debug message when checking seq number of a delayed ref
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
                   ` (7 preceding siblings ...)
  2021-03-11 14:31 ` [PATCH 8/9] btrfs: add and use helper to get lowest sequence number for the tree mod log fdmanana
@ 2021-03-11 14:31 ` fdmanana
  2021-03-16 16:58 ` [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings David Sterba
  9 siblings, 0 replies; 22+ messages in thread
From: fdmanana @ 2021-03-11 14:31 UTC (permalink / raw)
  To: linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

From: Filipe Manana <fdmanana@suse.com>

We used to encode two different numbers in the tree mod log counter used
for sequence numbers, one in the upper 32 bits and the other one in the
lower 32 bits. However that is no longer the case, we stopped doing that
since commit fcebe4562dec83 ("Btrfs: rework qgroup accounting").

So update the debug message at btrfs_check_delayed_seq to stop extracting
the two 32 bits counters and print instead the 64 bits sequence numbers.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/delayed-ref.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 7a3131cbf1fb..8c1d78befb5b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -514,9 +514,8 @@ int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, u64 seq)
 
 	if (min_seq != 0 && seq >= min_seq) {
 		btrfs_debug(fs_info,
-			    "holding back delayed_ref %#x.%x, lowest is %#x.%x",
-			    (u32)(seq >> 32), (u32)seq,
-			    (u32)(min_seq >> 32), (u32)min_seq);
+			    "holding back delayed_ref %llu, lowest is %llu",
+			    seq, min_seq);
 		ret = 1;
 	}
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/9] btrfs: move the tree mod log code into its own file
  2021-03-11 14:31 ` [PATCH 3/9] btrfs: move the tree mod log code into its own file fdmanana
@ 2021-03-11 17:26     ` kernel test robot
  2021-03-12  8:50   ` Anand Jain
  1 sibling, 0 replies; 22+ messages in thread
From: kernel test robot @ 2021-03-11 17:26 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: kbuild-all, ce3g8jdj, Filipe Manana

[-- Attachment #1: Type: text/plain, Size: 3445 bytes --]

Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v5.12-rc2]
[also build test WARNING on next-20210311]
[cannot apply to kdave/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/fdmanana-kernel-org/btrfs-bug-fixes-for-the-tree-mod-log-and-small-refactorings/20210311-223429
base:    a38fd8748464831584a19438cbb3082b5a2dab15
config: x86_64-randconfig-m001-20210311 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

smatch warnings:
fs/btrfs/tree-mod-log.c:286 btrfs_tree_mod_log_insert_move() warn: missing error code 'ret'
fs/btrfs/tree-mod-log.c:519 btrfs_tree_mod_log_eb_copy() warn: missing error code 'ret'
fs/btrfs/tree-mod-log.c:577 btrfs_tree_mod_log_free_eb() warn: missing error code 'ret'

vim +/ret +286 fs/btrfs/tree-mod-log.c

   246	
   247	int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
   248					   int dst_slot, int src_slot,
   249					   int nr_items)
   250	{
   251		struct tree_mod_elem *tm = NULL;
   252		struct tree_mod_elem **tm_list = NULL;
   253		int ret = 0;
   254		int i;
   255		int locked = 0;
   256	
   257		if (!tree_mod_need_log(eb->fs_info, eb))
   258			return 0;
   259	
   260		tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
   261		if (!tm_list)
   262			return -ENOMEM;
   263	
   264		tm = kzalloc(sizeof(*tm), GFP_NOFS);
   265		if (!tm) {
   266			ret = -ENOMEM;
   267			goto free_tms;
   268		}
   269	
   270		tm->logical = eb->start;
   271		tm->slot = src_slot;
   272		tm->move.dst_slot = dst_slot;
   273		tm->move.nr_items = nr_items;
   274		tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
   275	
   276		for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
   277			tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
   278			    BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
   279			if (!tm_list[i]) {
   280				ret = -ENOMEM;
   281				goto free_tms;
   282			}
   283		}
   284	
   285		if (tree_mod_dont_log(eb->fs_info, eb))
 > 286			goto free_tms;
   287		locked = 1;
   288	
   289		/*
   290		 * When we override something during the move, we log these removals.
   291		 * This can only happen when we move towards the beginning of the
   292		 * buffer, i.e. dst_slot < src_slot.
   293		 */
   294		for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
   295			ret = tree_mod_log_insert(eb->fs_info, tm_list[i]);
   296			if (ret)
   297				goto free_tms;
   298		}
   299	
   300		ret = tree_mod_log_insert(eb->fs_info, tm);
   301		if (ret)
   302			goto free_tms;
   303		write_unlock(&eb->fs_info->tree_mod_log_lock);
   304		kfree(tm_list);
   305	
   306		return 0;
   307	free_tms:
   308		for (i = 0; i < nr_items; i++) {
   309			if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
   310				rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
   311			kfree(tm_list[i]);
   312		}
   313		if (locked)
   314			write_unlock(&eb->fs_info->tree_mod_log_lock);
   315		kfree(tm_list);
   316		kfree(tm);
   317	
   318		return ret;
   319	}
   320	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30789 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/9] btrfs: move the tree mod log code into its own file
@ 2021-03-11 17:26     ` kernel test robot
  0 siblings, 0 replies; 22+ messages in thread
From: kernel test robot @ 2021-03-11 17:26 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3552 bytes --]

Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v5.12-rc2]
[also build test WARNING on next-20210311]
[cannot apply to kdave/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/fdmanana-kernel-org/btrfs-bug-fixes-for-the-tree-mod-log-and-small-refactorings/20210311-223429
base:    a38fd8748464831584a19438cbb3082b5a2dab15
config: x86_64-randconfig-m001-20210311 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

smatch warnings:
fs/btrfs/tree-mod-log.c:286 btrfs_tree_mod_log_insert_move() warn: missing error code 'ret'
fs/btrfs/tree-mod-log.c:519 btrfs_tree_mod_log_eb_copy() warn: missing error code 'ret'
fs/btrfs/tree-mod-log.c:577 btrfs_tree_mod_log_free_eb() warn: missing error code 'ret'

vim +/ret +286 fs/btrfs/tree-mod-log.c

   246	
   247	int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
   248					   int dst_slot, int src_slot,
   249					   int nr_items)
   250	{
   251		struct tree_mod_elem *tm = NULL;
   252		struct tree_mod_elem **tm_list = NULL;
   253		int ret = 0;
   254		int i;
   255		int locked = 0;
   256	
   257		if (!tree_mod_need_log(eb->fs_info, eb))
   258			return 0;
   259	
   260		tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
   261		if (!tm_list)
   262			return -ENOMEM;
   263	
   264		tm = kzalloc(sizeof(*tm), GFP_NOFS);
   265		if (!tm) {
   266			ret = -ENOMEM;
   267			goto free_tms;
   268		}
   269	
   270		tm->logical = eb->start;
   271		tm->slot = src_slot;
   272		tm->move.dst_slot = dst_slot;
   273		tm->move.nr_items = nr_items;
   274		tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
   275	
   276		for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
   277			tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
   278			    BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
   279			if (!tm_list[i]) {
   280				ret = -ENOMEM;
   281				goto free_tms;
   282			}
   283		}
   284	
   285		if (tree_mod_dont_log(eb->fs_info, eb))
 > 286			goto free_tms;
   287		locked = 1;
   288	
   289		/*
   290		 * When we override something during the move, we log these removals.
   291		 * This can only happen when we move towards the beginning of the
   292		 * buffer, i.e. dst_slot < src_slot.
   293		 */
   294		for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
   295			ret = tree_mod_log_insert(eb->fs_info, tm_list[i]);
   296			if (ret)
   297				goto free_tms;
   298		}
   299	
   300		ret = tree_mod_log_insert(eb->fs_info, tm);
   301		if (ret)
   302			goto free_tms;
   303		write_unlock(&eb->fs_info->tree_mod_log_lock);
   304		kfree(tm_list);
   305	
   306		return 0;
   307	free_tms:
   308		for (i = 0; i < nr_items; i++) {
   309			if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
   310				rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
   311			kfree(tm_list[i]);
   312		}
   313		if (locked)
   314			write_unlock(&eb->fs_info->tree_mod_log_lock);
   315		kfree(tm_list);
   316		kfree(tm);
   317	
   318		return ret;
   319	}
   320	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 30789 bytes --]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/9] btrfs: move the tree mod log code into its own file
  2021-03-11 17:26     ` kernel test robot
@ 2021-03-11 17:41       ` Filipe Manana
  -1 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2021-03-11 17:41 UTC (permalink / raw)
  To: kernel test robot; +Cc: linux-btrfs, kbuild-all, Zygo Blaxell, Filipe Manana

On Thu, Mar 11, 2021 at 5:27 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on v5.12-rc2]
> [also build test WARNING on next-20210311]
> [cannot apply to kdave/for-next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:    https://github.com/0day-ci/linux/commits/fdmanana-kernel-org/btrfs-bug-fixes-for-the-tree-mod-log-and-small-refactorings/20210311-223429
> base:    a38fd8748464831584a19438cbb3082b5a2dab15
> config: x86_64-randconfig-m001-20210311 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> smatch warnings:
> fs/btrfs/tree-mod-log.c:286 btrfs_tree_mod_log_insert_move() warn: missing error code 'ret'
> fs/btrfs/tree-mod-log.c:519 btrfs_tree_mod_log_eb_copy() warn: missing error code 'ret'
> fs/btrfs/tree-mod-log.c:577 btrfs_tree_mod_log_free_eb() warn: missing error code 'ret'

No, those are all ok.
For all cases, 'ret' was initialized to 0 when declared, and that's
what we want to return when tree_mod_dont_log() returns true.

>
> vim +/ret +286 fs/btrfs/tree-mod-log.c
>
>    246
>    247  int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
>    248                                     int dst_slot, int src_slot,
>    249                                     int nr_items)
>    250  {
>    251          struct tree_mod_elem *tm = NULL;
>    252          struct tree_mod_elem **tm_list = NULL;
>    253          int ret = 0;
>    254          int i;
>    255          int locked = 0;
>    256
>    257          if (!tree_mod_need_log(eb->fs_info, eb))
>    258                  return 0;
>    259
>    260          tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
>    261          if (!tm_list)
>    262                  return -ENOMEM;
>    263
>    264          tm = kzalloc(sizeof(*tm), GFP_NOFS);
>    265          if (!tm) {
>    266                  ret = -ENOMEM;
>    267                  goto free_tms;
>    268          }
>    269
>    270          tm->logical = eb->start;
>    271          tm->slot = src_slot;
>    272          tm->move.dst_slot = dst_slot;
>    273          tm->move.nr_items = nr_items;
>    274          tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
>    275
>    276          for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
>    277                  tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
>    278                      BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
>    279                  if (!tm_list[i]) {
>    280                          ret = -ENOMEM;
>    281                          goto free_tms;
>    282                  }
>    283          }
>    284
>    285          if (tree_mod_dont_log(eb->fs_info, eb))
>  > 286                  goto free_tms;
>    287          locked = 1;
>    288
>    289          /*
>    290           * When we override something during the move, we log these removals.
>    291           * This can only happen when we move towards the beginning of the
>    292           * buffer, i.e. dst_slot < src_slot.
>    293           */
>    294          for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
>    295                  ret = tree_mod_log_insert(eb->fs_info, tm_list[i]);
>    296                  if (ret)
>    297                          goto free_tms;
>    298          }
>    299
>    300          ret = tree_mod_log_insert(eb->fs_info, tm);
>    301          if (ret)
>    302                  goto free_tms;
>    303          write_unlock(&eb->fs_info->tree_mod_log_lock);
>    304          kfree(tm_list);
>    305
>    306          return 0;
>    307  free_tms:
>    308          for (i = 0; i < nr_items; i++) {
>    309                  if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
>    310                          rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
>    311                  kfree(tm_list[i]);
>    312          }
>    313          if (locked)
>    314                  write_unlock(&eb->fs_info->tree_mod_log_lock);
>    315          kfree(tm_list);
>    316          kfree(tm);
>    317
>    318          return ret;
>    319  }
>    320
>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/9] btrfs: move the tree mod log code into its own file
@ 2021-03-11 17:41       ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2021-03-11 17:41 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 4655 bytes --]

On Thu, Mar 11, 2021 at 5:27 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi,
>
> Thank you for the patch! Perhaps something to improve:
>
> [auto build test WARNING on v5.12-rc2]
> [also build test WARNING on next-20210311]
> [cannot apply to kdave/for-next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url:    https://github.com/0day-ci/linux/commits/fdmanana-kernel-org/btrfs-bug-fixes-for-the-tree-mod-log-and-small-refactorings/20210311-223429
> base:    a38fd8748464831584a19438cbb3082b5a2dab15
> config: x86_64-randconfig-m001-20210311 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> smatch warnings:
> fs/btrfs/tree-mod-log.c:286 btrfs_tree_mod_log_insert_move() warn: missing error code 'ret'
> fs/btrfs/tree-mod-log.c:519 btrfs_tree_mod_log_eb_copy() warn: missing error code 'ret'
> fs/btrfs/tree-mod-log.c:577 btrfs_tree_mod_log_free_eb() warn: missing error code 'ret'

No, those are all ok.
For all cases, 'ret' was initialized to 0 when declared, and that's
what we want to return when tree_mod_dont_log() returns true.

>
> vim +/ret +286 fs/btrfs/tree-mod-log.c
>
>    246
>    247  int btrfs_tree_mod_log_insert_move(struct extent_buffer *eb,
>    248                                     int dst_slot, int src_slot,
>    249                                     int nr_items)
>    250  {
>    251          struct tree_mod_elem *tm = NULL;
>    252          struct tree_mod_elem **tm_list = NULL;
>    253          int ret = 0;
>    254          int i;
>    255          int locked = 0;
>    256
>    257          if (!tree_mod_need_log(eb->fs_info, eb))
>    258                  return 0;
>    259
>    260          tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
>    261          if (!tm_list)
>    262                  return -ENOMEM;
>    263
>    264          tm = kzalloc(sizeof(*tm), GFP_NOFS);
>    265          if (!tm) {
>    266                  ret = -ENOMEM;
>    267                  goto free_tms;
>    268          }
>    269
>    270          tm->logical = eb->start;
>    271          tm->slot = src_slot;
>    272          tm->move.dst_slot = dst_slot;
>    273          tm->move.nr_items = nr_items;
>    274          tm->op = BTRFS_MOD_LOG_MOVE_KEYS;
>    275
>    276          for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
>    277                  tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
>    278                      BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
>    279                  if (!tm_list[i]) {
>    280                          ret = -ENOMEM;
>    281                          goto free_tms;
>    282                  }
>    283          }
>    284
>    285          if (tree_mod_dont_log(eb->fs_info, eb))
>  > 286                  goto free_tms;
>    287          locked = 1;
>    288
>    289          /*
>    290           * When we override something during the move, we log these removals.
>    291           * This can only happen when we move towards the beginning of the
>    292           * buffer, i.e. dst_slot < src_slot.
>    293           */
>    294          for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
>    295                  ret = tree_mod_log_insert(eb->fs_info, tm_list[i]);
>    296                  if (ret)
>    297                          goto free_tms;
>    298          }
>    299
>    300          ret = tree_mod_log_insert(eb->fs_info, tm);
>    301          if (ret)
>    302                  goto free_tms;
>    303          write_unlock(&eb->fs_info->tree_mod_log_lock);
>    304          kfree(tm_list);
>    305
>    306          return 0;
>    307  free_tms:
>    308          for (i = 0; i < nr_items; i++) {
>    309                  if (tm_list[i] && !RB_EMPTY_NODE(&tm_list[i]->node))
>    310                          rb_erase(&tm_list[i]->node, &eb->fs_info->tree_mod_log);
>    311                  kfree(tm_list[i]);
>    312          }
>    313          if (locked)
>    314                  write_unlock(&eb->fs_info->tree_mod_log_lock);
>    315          kfree(tm_list);
>    316          kfree(tm);
>    317
>    318          return ret;
>    319  }
>    320
>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/9] btrfs: move the tree mod log code into its own file
  2021-03-11 14:31 ` [PATCH 3/9] btrfs: move the tree mod log code into its own file fdmanana
  2021-03-11 17:26     ` kernel test robot
@ 2021-03-12  8:50   ` Anand Jain
  1 sibling, 0 replies; 22+ messages in thread
From: Anand Jain @ 2021-03-12  8:50 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

On 11/3/21 10:31 pm, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The tree modification log, which records modifications done to btrees, is
> quite large and currently spread all over ctree.c, which is a huge file
> already.
> 
> To make things better organized, move all that code into its own separate
> source and header files. Functions and definitions that are used outside
> of the module (mostly by ctree.c) are renamed so that they start with a
> "btrfs_" prefix. Everything else remains unchanged.
> 
> This makes it easier to go over the tree modification log code everytime
> I need to go read it to fix a bug.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Anand Jain <anand.jain@oracle.com>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/9] btrfs: use booleans where appropriate for the tree mod log functions
  2021-03-11 14:31 ` [PATCH 4/9] btrfs: use booleans where appropriate for the tree mod log functions fdmanana
@ 2021-03-12 12:44   ` Anand Jain
  0 siblings, 0 replies; 22+ messages in thread
From: Anand Jain @ 2021-03-12 12:44 UTC (permalink / raw)
  To: fdmanana, linux-btrfs; +Cc: ce3g8jdj, Filipe Manana

On 11/3/21 10:31 pm, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Several functions of the tree modification log use integers as booleans,
> so change them to use booleans instead, making their use more clear.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

  Reviewed-by: Anand Jain <anand.jain@oracle.com>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/9] btrfs: use a bit to track the existence of tree mod log users
  2021-03-11 14:31 ` [PATCH 5/9] btrfs: use a bit to track the existence of tree mod log users fdmanana
@ 2021-03-13  7:26   ` Wang Yugui
  2021-03-15  9:52     ` Filipe Manana
  0 siblings, 1 reply; 22+ messages in thread
From: Wang Yugui @ 2021-03-13  7:26 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, ce3g8jdj, Filipe Manana

Hi,

It will be more easy to backport for heavy i/o load if we put this patch
before '[PATCH 3/9] btrfs: move the tree mod log code into its own file' ?

And this patch can be merged into one with the following 
'[PATCH 6/9] btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()'?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/03/13

> From: Filipe Manana <fdmanana@suse.com>
> 
> The tree modification log functions are called very frequently, basically
> they are called everytime a btree is modified (a pointer added or removed
> to a node, a new root for a btree is set, etc). Because of that, to avoid
> heavy lock contention on the lock that protects the list of tree mod log
> users, we have checks that test the emptiness of the list with a full
> memory barrier before the checks, so that when there are no tree mod log
> users we avoid taking the lock.
> 
> Replace the memory barrier and list emptiness check with a test for a new
> bit set at fs_info->flags. This bit is used to indicate when there are
> tree mod log users, set whenever a user is added to the list and cleared
> when the last user is removed from the list. This makes the intention a
> bit more obvious and possibly more efficient (assuming test_bit() may be
> cheaper than a full memory barrier on some architectures).
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/ctree.h        |  3 +++
>  fs/btrfs/tree-mod-log.c | 11 ++++++-----
>  2 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 50208673bd55..90184ee2f832 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -580,6 +580,9 @@ enum {
>  
>  	/* Indicate that we can't trust the free space tree for caching yet */
>  	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> +
> +	/* Indicate whether there are any tree modification log users. */
> +	BTRFS_FS_TREE_MOD_LOG_USERS,
>  };
>  
>  /*
> diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
> index f2a6da1b2903..b912b82c36c9 100644
> --- a/fs/btrfs/tree-mod-log.c
> +++ b/fs/btrfs/tree-mod-log.c
> @@ -60,6 +60,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
>  	if (!elem->seq) {
>  		elem->seq = btrfs_inc_tree_mod_seq(fs_info);
>  		list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
> +		set_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags);
>  	}
>  	write_unlock(&fs_info->tree_mod_log_lock);
>  
> @@ -83,7 +84,9 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
>  	list_del(&elem->list);
>  	elem->seq = 0;
>  
> -	if (!list_empty(&fs_info->tree_mod_seq_list)) {
> +	if (list_empty(&fs_info->tree_mod_seq_list)) {
> +		clear_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags);
> +	} else {
>  		struct btrfs_seq_list *first;
>  
>  		first = list_first_entry(&fs_info->tree_mod_seq_list,
> @@ -166,8 +169,7 @@ static noinline int tree_mod_log_insert(struct btrfs_fs_info *fs_info,
>  static inline bool tree_mod_dont_log(struct btrfs_fs_info *fs_info,
>  				    struct extent_buffer *eb)
>  {
> -	smp_mb();
> -	if (list_empty(&(fs_info)->tree_mod_seq_list))
> +	if (!test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
>  		return true;
>  	if (eb && btrfs_header_level(eb) == 0)
>  		return true;
> @@ -185,8 +187,7 @@ static inline bool tree_mod_dont_log(struct btrfs_fs_info *fs_info,
>  static inline bool tree_mod_need_log(const struct btrfs_fs_info *fs_info,
>  				    struct extent_buffer *eb)
>  {
> -	smp_mb();
> -	if (list_empty(&(fs_info)->tree_mod_seq_list))
> +	if (!test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
>  		return false;
>  	if (eb && btrfs_header_level(eb) == 0)
>  		return false;
> -- 
> 2.28.0



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 5/9] btrfs: use a bit to track the existence of tree mod log users
  2021-03-13  7:26   ` Wang Yugui
@ 2021-03-15  9:52     ` Filipe Manana
  0 siblings, 0 replies; 22+ messages in thread
From: Filipe Manana @ 2021-03-15  9:52 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-btrfs, Zygo Blaxell, Filipe Manana

On Sat, Mar 13, 2021 at 7:31 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi,
>
> It will be more easy to backport for heavy i/o load if we put this patch
> before '[PATCH 3/9] btrfs: move the tree mod log code into its own file' ?

Only bug fixes and fixes for serious performance regressions are
backported to stable kernels.

This patch is neither of them. It's a cleanup.
It mentions that replacing the memory barriers with test_bit() may
provide a slight benefit. If it does it's such a micro optimization
that I doubt it causes any observable effect on any workload.
Even if it does, which I seriously doubt, it's still not a candidate
for stable backports.

>
> And this patch can be merged into one with the following
> '[PATCH 6/9] btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()'?

They do the same thing, but for slightly different cases. So I prefer
to keep them separate for ease of review.

>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/03/13
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > The tree modification log functions are called very frequently, basically
> > they are called everytime a btree is modified (a pointer added or removed
> > to a node, a new root for a btree is set, etc). Because of that, to avoid
> > heavy lock contention on the lock that protects the list of tree mod log
> > users, we have checks that test the emptiness of the list with a full
> > memory barrier before the checks, so that when there are no tree mod log
> > users we avoid taking the lock.
> >
> > Replace the memory barrier and list emptiness check with a test for a new
> > bit set at fs_info->flags. This bit is used to indicate when there are
> > tree mod log users, set whenever a user is added to the list and cleared
> > when the last user is removed from the list. This makes the intention a
> > bit more obvious and possibly more efficient (assuming test_bit() may be
> > cheaper than a full memory barrier on some architectures).
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/ctree.h        |  3 +++
> >  fs/btrfs/tree-mod-log.c | 11 ++++++-----
> >  2 files changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index 50208673bd55..90184ee2f832 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -580,6 +580,9 @@ enum {
> >
> >       /* Indicate that we can't trust the free space tree for caching yet */
> >       BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> > +
> > +     /* Indicate whether there are any tree modification log users. */
> > +     BTRFS_FS_TREE_MOD_LOG_USERS,
> >  };
> >
> >  /*
> > diff --git a/fs/btrfs/tree-mod-log.c b/fs/btrfs/tree-mod-log.c
> > index f2a6da1b2903..b912b82c36c9 100644
> > --- a/fs/btrfs/tree-mod-log.c
> > +++ b/fs/btrfs/tree-mod-log.c
> > @@ -60,6 +60,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
> >       if (!elem->seq) {
> >               elem->seq = btrfs_inc_tree_mod_seq(fs_info);
> >               list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
> > +             set_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags);
> >       }
> >       write_unlock(&fs_info->tree_mod_log_lock);
> >
> > @@ -83,7 +84,9 @@ void btrfs_put_tree_mod_seq(struct btrfs_fs_info *fs_info,
> >       list_del(&elem->list);
> >       elem->seq = 0;
> >
> > -     if (!list_empty(&fs_info->tree_mod_seq_list)) {
> > +     if (list_empty(&fs_info->tree_mod_seq_list)) {
> > +             clear_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags);
> > +     } else {
> >               struct btrfs_seq_list *first;
> >
> >               first = list_first_entry(&fs_info->tree_mod_seq_list,
> > @@ -166,8 +169,7 @@ static noinline int tree_mod_log_insert(struct btrfs_fs_info *fs_info,
> >  static inline bool tree_mod_dont_log(struct btrfs_fs_info *fs_info,
> >                                   struct extent_buffer *eb)
> >  {
> > -     smp_mb();
> > -     if (list_empty(&(fs_info)->tree_mod_seq_list))
> > +     if (!test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
> >               return true;
> >       if (eb && btrfs_header_level(eb) == 0)
> >               return true;
> > @@ -185,8 +187,7 @@ static inline bool tree_mod_dont_log(struct btrfs_fs_info *fs_info,
> >  static inline bool tree_mod_need_log(const struct btrfs_fs_info *fs_info,
> >                                   struct extent_buffer *eb)
> >  {
> > -     smp_mb();
> > -     if (list_empty(&(fs_info)->tree_mod_seq_list))
> > +     if (!test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
> >               return false;
> >       if (eb && btrfs_header_level(eb) == 0)
> >               return false;
> > --
> > 2.28.0
>
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users
  2021-03-11 14:31 ` [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users fdmanana
@ 2021-03-15 19:28   ` David Sterba
  2021-03-16 11:49     ` Filipe Manana
  0 siblings, 1 reply; 22+ messages in thread
From: David Sterba @ 2021-03-15 19:28 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, ce3g8jdj, Filipe Manana

On Thu, Mar 11, 2021 at 02:31:06PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When freeing a tree block we may end up adding its extent back to the
> free space cache/tree, as long as there are no more references for it,
> it was created in the current transaction and writeback for it never
> happened. This is generally fine, however when we have tree mod log
> operations it can result in inconsistent versions of a btree after
> unwinding extent buffers with the recorded tree mod log operations.
> 
> This is because:
> 
> * We only log operations for nodes (adding and removing key/pointers),
>   for leaves we don't do anything;
> 
> * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation
>   for a node that points to a leaf that was deleted;
> 
> * Before we apply the logged operation to unwind a node, we can have
>   that leaf's extent allocated again, either as a node or as a leaf, and
>   possibly for another btree. This is possible if the leaf was created in
>   the current transaction and writeback for it never started, in which
>   case btrfs_free_tree_block() returns its extent back to the free space
>   cache/tree;
> 
> * Then, before applying the tree mod log operation, some task allocates
>   the metadata extent just freed before, and uses it either as a leaf or
>   as a node for some btree (can be the same or another one, it does not
>   matter);
> 
> * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now
>   get the target node with an item pointing to the metadata extent that
>   now has content different from what it had before the leaf was deleted.
>   It might now belong to a different btree and be a node and not a leaf
>   anymore.
> 
>   As a consequence, the results of searches after the unwinding can be
>   unpredictable and produce unexpected results.
> 
> So make sure we pin extent buffers corresponding to leaves when there
> are tree mod log users.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5e228d6ad63f..2482b26b1971 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3310,6 +3310,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>  
>  	if (last_ref && btrfs_header_generation(buf) == trans->transid) {
>  		struct btrfs_block_group *cache;
> +		bool must_pin = false;
>  
>  		if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
>  			ret = check_ref_cleanup(trans, buf->start);
> @@ -3327,7 +3328,27 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>  			goto out;
>  		}
>  
> -		if (btrfs_is_zoned(fs_info)) {
> +		/*
> +		 * If this is a leaf and there are tree mod log users, we may
> +		 * have recorded mod log operations that point to this leaf.
> +		 * So we must make sure no one reuses this leaf's extent before
> +		 * mod log operations are applied to a node, otherwise after
> +		 * rewinding a node using the mod log operations we get an
> +		 * inconsistent btree, as the leaf's extent may now be used as
> +		 * a node or leaf for another different btree.
> +		 * We are safe from races here because at this point no other
> +		 * node or root points to this extent buffer, so if after this
> +		 * check a new tree mod log user joins, it will not be able to
> +		 * find a node pointing to this leaf and record operations that
> +		 * point to this leaf.
> +		 */
> +		if (btrfs_header_level(buf) == 0) {
> +			read_lock(&fs_info->tree_mod_log_lock);
> +			must_pin = !list_empty(&fs_info->tree_mod_seq_list);
> +			read_unlock(&fs_info->tree_mod_log_lock);
> +		}
> +
> +		if (must_pin || btrfs_is_zoned(fs_info)) {
>  			btrfs_redirty_list_add(trans->transaction, buf);
>  			pin_down_extent(trans, cache, buf->start, buf->len, 1);
>  			btrfs_put_block_group(cache);

This has been added in d3575156f662 ("btrfs: zoned: redirty released
extent buffers") 5.12-rc1, so it is a regression but otherwise it sounds
like it's not related only to zoned mode. I'm not sure if this is
relevant for older stable trees because of missing
btrfs_redirty_list_add, possibly with some tweaks. Please let me know,
thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users
  2021-03-15 19:28   ` David Sterba
@ 2021-03-16 11:49     ` Filipe Manana
  2021-03-16 11:54       ` Johannes Thumshirn
  0 siblings, 1 reply; 22+ messages in thread
From: Filipe Manana @ 2021-03-16 11:49 UTC (permalink / raw)
  To: dsterba, Filipe Manana, linux-btrfs, Zygo Blaxell, Filipe Manana

On Mon, Mar 15, 2021 at 7:31 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Thu, Mar 11, 2021 at 02:31:06PM +0000, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > When freeing a tree block we may end up adding its extent back to the
> > free space cache/tree, as long as there are no more references for it,
> > it was created in the current transaction and writeback for it never
> > happened. This is generally fine, however when we have tree mod log
> > operations it can result in inconsistent versions of a btree after
> > unwinding extent buffers with the recorded tree mod log operations.
> >
> > This is because:
> >
> > * We only log operations for nodes (adding and removing key/pointers),
> >   for leaves we don't do anything;
> >
> > * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation
> >   for a node that points to a leaf that was deleted;
> >
> > * Before we apply the logged operation to unwind a node, we can have
> >   that leaf's extent allocated again, either as a node or as a leaf, and
> >   possibly for another btree. This is possible if the leaf was created in
> >   the current transaction and writeback for it never started, in which
> >   case btrfs_free_tree_block() returns its extent back to the free space
> >   cache/tree;
> >
> > * Then, before applying the tree mod log operation, some task allocates
> >   the metadata extent just freed before, and uses it either as a leaf or
> >   as a node for some btree (can be the same or another one, it does not
> >   matter);
> >
> > * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now
> >   get the target node with an item pointing to the metadata extent that
> >   now has content different from what it had before the leaf was deleted.
> >   It might now belong to a different btree and be a node and not a leaf
> >   anymore.
> >
> >   As a consequence, the results of searches after the unwinding can be
> >   unpredictable and produce unexpected results.
> >
> > So make sure we pin extent buffers corresponding to leaves when there
> > are tree mod log users.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> >  fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++-
> >  1 file changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 5e228d6ad63f..2482b26b1971 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -3310,6 +3310,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> >
> >       if (last_ref && btrfs_header_generation(buf) == trans->transid) {
> >               struct btrfs_block_group *cache;
> > +             bool must_pin = false;
> >
> >               if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
> >                       ret = check_ref_cleanup(trans, buf->start);
> > @@ -3327,7 +3328,27 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> >                       goto out;
> >               }
> >
> > -             if (btrfs_is_zoned(fs_info)) {
> > +             /*
> > +              * If this is a leaf and there are tree mod log users, we may
> > +              * have recorded mod log operations that point to this leaf.
> > +              * So we must make sure no one reuses this leaf's extent before
> > +              * mod log operations are applied to a node, otherwise after
> > +              * rewinding a node using the mod log operations we get an
> > +              * inconsistent btree, as the leaf's extent may now be used as
> > +              * a node or leaf for another different btree.
> > +              * We are safe from races here because at this point no other
> > +              * node or root points to this extent buffer, so if after this
> > +              * check a new tree mod log user joins, it will not be able to
> > +              * find a node pointing to this leaf and record operations that
> > +              * point to this leaf.
> > +              */
> > +             if (btrfs_header_level(buf) == 0) {
> > +                     read_lock(&fs_info->tree_mod_log_lock);
> > +                     must_pin = !list_empty(&fs_info->tree_mod_seq_list);
> > +                     read_unlock(&fs_info->tree_mod_log_lock);
> > +             }
> > +
> > +             if (must_pin || btrfs_is_zoned(fs_info)) {
> >                       btrfs_redirty_list_add(trans->transaction, buf);
> >                       pin_down_extent(trans, cache, buf->start, buf->len, 1);
> >                       btrfs_put_block_group(cache);
>
> This has been added in d3575156f662 ("btrfs: zoned: redirty released
> extent buffers") 5.12-rc1, so it is a regression but otherwise it sounds
> like it's not related only to zoned mode. I'm not sure if this is
> relevant for older stable trees because of missing
> btrfs_redirty_list_add, possibly with some tweaks. Please let me know,
> thanks.

It's not related to zoned filesystems at all.
I just happened to reuse that if branch for the zoned case because it
does the same thing I needed to do, and the call to
btrfs_redirty_list_add() does nothing for the non-zoned case, so it's
safe.

I.e., it's the same as adding the following instead:

if (must_pin) {
  pin_down_extent(trans, cache, buf->start, buf->len, 1);
  btrfs_put_block_group(cache);
  goto out;
}

I opted for the shorter version by ORing the two cases.

But yes, for pre 5.12-rc1 the patch will not apply and the above code
would be needed.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users
  2021-03-16 11:49     ` Filipe Manana
@ 2021-03-16 11:54       ` Johannes Thumshirn
  0 siblings, 0 replies; 22+ messages in thread
From: Johannes Thumshirn @ 2021-03-16 11:54 UTC (permalink / raw)
  To: Filipe Manana, dsterba, linux-btrfs, Zygo Blaxell, Filipe Manana

On 16/03/2021 12:50, Filipe Manana wrote:
> On Mon, Mar 15, 2021 at 7:31 PM David Sterba <dsterba@suse.cz> wrote:
>>
>> On Thu, Mar 11, 2021 at 02:31:06PM +0000, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> When freeing a tree block we may end up adding its extent back to the
>>> free space cache/tree, as long as there are no more references for it,
>>> it was created in the current transaction and writeback for it never
>>> happened. This is generally fine, however when we have tree mod log
>>> operations it can result in inconsistent versions of a btree after
>>> unwinding extent buffers with the recorded tree mod log operations.
>>>
>>> This is because:
>>>
>>> * We only log operations for nodes (adding and removing key/pointers),
>>>   for leaves we don't do anything;
>>>
>>> * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation
>>>   for a node that points to a leaf that was deleted;
>>>
>>> * Before we apply the logged operation to unwind a node, we can have
>>>   that leaf's extent allocated again, either as a node or as a leaf, and
>>>   possibly for another btree. This is possible if the leaf was created in
>>>   the current transaction and writeback for it never started, in which
>>>   case btrfs_free_tree_block() returns its extent back to the free space
>>>   cache/tree;
>>>
>>> * Then, before applying the tree mod log operation, some task allocates
>>>   the metadata extent just freed before, and uses it either as a leaf or
>>>   as a node for some btree (can be the same or another one, it does not
>>>   matter);
>>>
>>> * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now
>>>   get the target node with an item pointing to the metadata extent that
>>>   now has content different from what it had before the leaf was deleted.
>>>   It might now belong to a different btree and be a node and not a leaf
>>>   anymore.
>>>
>>>   As a consequence, the results of searches after the unwinding can be
>>>   unpredictable and produce unexpected results.
>>>
>>> So make sure we pin extent buffers corresponding to leaves when there
>>> are tree mod log users.
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>> ---
>>>  fs/btrfs/extent-tree.c | 23 ++++++++++++++++++++++-
>>>  1 file changed, 22 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 5e228d6ad63f..2482b26b1971 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -3310,6 +3310,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>>>
>>>       if (last_ref && btrfs_header_generation(buf) == trans->transid) {
>>>               struct btrfs_block_group *cache;
>>> +             bool must_pin = false;
>>>
>>>               if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
>>>                       ret = check_ref_cleanup(trans, buf->start);
>>> @@ -3327,7 +3328,27 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
>>>                       goto out;
>>>               }
>>>
>>> -             if (btrfs_is_zoned(fs_info)) {
>>> +             /*
>>> +              * If this is a leaf and there are tree mod log users, we may
>>> +              * have recorded mod log operations that point to this leaf.
>>> +              * So we must make sure no one reuses this leaf's extent before
>>> +              * mod log operations are applied to a node, otherwise after
>>> +              * rewinding a node using the mod log operations we get an
>>> +              * inconsistent btree, as the leaf's extent may now be used as
>>> +              * a node or leaf for another different btree.
>>> +              * We are safe from races here because at this point no other
>>> +              * node or root points to this extent buffer, so if after this
>>> +              * check a new tree mod log user joins, it will not be able to
>>> +              * find a node pointing to this leaf and record operations that
>>> +              * point to this leaf.
>>> +              */
>>> +             if (btrfs_header_level(buf) == 0) {
>>> +                     read_lock(&fs_info->tree_mod_log_lock);
>>> +                     must_pin = !list_empty(&fs_info->tree_mod_seq_list);
>>> +                     read_unlock(&fs_info->tree_mod_log_lock);
>>> +             }
>>> +
>>> +             if (must_pin || btrfs_is_zoned(fs_info)) {
>>>                       btrfs_redirty_list_add(trans->transaction, buf);
>>>                       pin_down_extent(trans, cache, buf->start, buf->len, 1);
>>>                       btrfs_put_block_group(cache);
>>
>> This has been added in d3575156f662 ("btrfs: zoned: redirty released
>> extent buffers") 5.12-rc1, so it is a regression but otherwise it sounds
>> like it's not related only to zoned mode. I'm not sure if this is
>> relevant for older stable trees because of missing
>> btrfs_redirty_list_add, possibly with some tweaks. Please let me know,
>> thanks.
> 
> It's not related to zoned filesystems at all.
> I just happened to reuse that if branch for the zoned case because it
> does the same thing I needed to do, and the call to
> btrfs_redirty_list_add() does nothing for the non-zoned case, so it's
> safe.
> 
> I.e., it's the same as adding the following instead:
> 
> if (must_pin) {
>   pin_down_extent(trans, cache, buf->start, buf->len, 1);
>   btrfs_put_block_group(cache);
>   goto out;
> }
> 
> I opted for the shorter version by ORing the two cases.

FYI the redirtying is one thing we promised Josef to get rid in the zoned ASAP.
We just can't get it to work properly yet.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings
  2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
                   ` (8 preceding siblings ...)
  2021-03-11 14:31 ` [PATCH 9/9] btrfs: update debug message when checking seq number of a delayed ref fdmanana
@ 2021-03-16 16:58 ` David Sterba
  9 siblings, 0 replies; 22+ messages in thread
From: David Sterba @ 2021-03-16 16:58 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs, ce3g8jdj, Filipe Manana

On Thu, Mar 11, 2021 at 02:31:04PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> This patchset fixes a couple bugs, in the two first patches, with the tree
> mod log code. The remaining patches just move all that code into a separate
> file, since it's quite large and ctree.c is huge as well, and do some small
> refactorings and cleanups.

Great, thanks. Patchset moved from for-next to misc-next.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-03-16 17:01 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-11 14:31 [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings fdmanana
2021-03-11 14:31 ` [PATCH 1/9] btrfs: fix race when cloning extent buffer during rewind of an old root fdmanana
2021-03-11 14:31 ` [PATCH 2/9] btrfs: always pin deleted leaves when there are active tree mod log users fdmanana
2021-03-15 19:28   ` David Sterba
2021-03-16 11:49     ` Filipe Manana
2021-03-16 11:54       ` Johannes Thumshirn
2021-03-11 14:31 ` [PATCH 3/9] btrfs: move the tree mod log code into its own file fdmanana
2021-03-11 17:26   ` kernel test robot
2021-03-11 17:26     ` kernel test robot
2021-03-11 17:41     ` Filipe Manana
2021-03-11 17:41       ` Filipe Manana
2021-03-12  8:50   ` Anand Jain
2021-03-11 14:31 ` [PATCH 4/9] btrfs: use booleans where appropriate for the tree mod log functions fdmanana
2021-03-12 12:44   ` Anand Jain
2021-03-11 14:31 ` [PATCH 5/9] btrfs: use a bit to track the existence of tree mod log users fdmanana
2021-03-13  7:26   ` Wang Yugui
2021-03-15  9:52     ` Filipe Manana
2021-03-11 14:31 ` [PATCH 6/9] btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block() fdmanana
2021-03-11 14:31 ` [PATCH 7/9] btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb() fdmanana
2021-03-11 14:31 ` [PATCH 8/9] btrfs: add and use helper to get lowest sequence number for the tree mod log fdmanana
2021-03-11 14:31 ` [PATCH 9/9] btrfs: update debug message when checking seq number of a delayed ref fdmanana
2021-03-16 16:58 ` [PATCH 0/9] btrfs: bug fixes for the tree mod log and small refactorings David Sterba

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.