Re: [PATCH 5/6] btrfs: avoid inode logging during rename and link when possible

From: Filipe Manana <fdmanana@kernel.org>
To: linux-btrfs <linux-btrfs@vger.kernel.org>,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH 5/6] btrfs: avoid inode logging during rename and link when possible
Date: Wed, 2 Feb 2022 16:49:46 +0000	[thread overview]
Message-ID: <CAL3q7H4b7NcJMgkfGOoBi0HPTQS8zFy6mGxyZamU_51dtKp2HQ@mail.gmail.com> (raw)
In-Reply-To: <e4f702b102c093d3b67a36194fbe6c51a9e5a94c.1642676248.git.fdmanana@suse.com>

On Fri, Jan 21, 2022 at 11:20 PM <fdmanana@kernel.org> wrote:
>
> From: Filipe Manana <fdmanana@suse.com>
>
> During a rename or link operation, we need to determine if an inode was
> previously logged or not, and if it was, do some update to the logged
> inode. We used to rely exclusively on the logged_trans field of struct
> btrfs_inode to determine that, but that was not reliable because the
> value of that field is not persisted in the inode item, so it's lost
> when an inode is evicted and loaded back again. That led to several
> issues in the past, such as not persisting deletions (such as the case
> fixed by commit 803f0f64d17769 ("Btrfs: fix fsync not persisting dentry
> deletions due to inode evictions")), or resulting in losing a file
> after an inode eviction followed by a rename (commit ecc64fab7d49c6
> ("btrfs: fix lost inode on log replay after mix of fsync, rename and
> inode eviction")), besides other issues.
>
> So the inode_logged() helper was introduced and used to determine if an
> inode was possibly logged before in the current transaction, with the
> caveat that it could return false positives, in the sense that even if an
> inode was not logged before in the current transaction, it could still
> return true, but never to return false in case the inode was logged.
> From a functional point of view that is fine, but from a performance
> perspective it can introduce significant latencies to rename and link
> operations, as they will end up doing inode logging even when it is not
> necessary.
>
> Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
> installations and upgrades, with the zypper tool, were often taking a
> long time to complete. With strace it could be observed that zypper was
> spending about 99% of its time on rename operations, and then with
> further analysis we checked that directory logging was happening too
> frequently. Taking into account that installation/upgrade of some of the
> packages needed a few thousand file renames, the slowdown was very
> noticeable for the user.
>
> The issue was caused indirectly due to an excessive number of inode
> evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
> a 5.16-rc8 kernel. While triggering the inode evictions if something
> outside btrfs' control, btrfs could still behave better by eliminating
> the false positives from the inode_logged() helper.
>
> So change inode_logged() to actually eliminate such false positives caused
> by inode eviction and when an inode was never logged since the filesystem
> was mounted, as both cases relate to when the logges_trans field of struct
> btrfs_inode has a value of zero. When it can not determine if the inode
> was logged based only on the logged_trans value, lookup for the existence
> of the inode item in the log tree - if it's there then we known the inode
> was logged, if it's not there then it can not have been logged in the
> current transaction. Once we determine if the inode was logged, update
> the logged_trans value to avoid future calls to have to search in the log
> tree again.
>
> Alternatively, we could start storing logged_trans in the on disk inode
> item structure (struct btrfs_inode_item) in the unused space it still has,
> but that would be a bit odd because:
>
> 1) We only care about logged_trans since the filesystem was mounted, we
>    don't care about its value from a previous mount. Having it persisted
>    in the inode item structure would not make the best use of the precious
>    unused space;
>
> 2) In order to get logged_trans persisted before inode eviction, we would
>    have to update the delayed inode when we finish logging the inode and
>    update its logged_trans in struct btrfs_inode, which makes it a bit
>    cumbersome since we need to check if the delayed inode exists, if not
>    create it and populate it and deal with any errors (-ENOMEM mostly).
>
> This change is part of a patchset comprised of the following patches:
>
>   1/5 btrfs: add helper to delete a dir entry from a log tree
>   2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
>   3/5 btrfs: avoid logging all directory changes during renames
>   4/5 btrfs: stop doing unnecessary log updates during a rename
>   5/5 btrfs: avoid inode logging during rename and link when possible
>
> The following test script mimics part of what the zypper tool does during
> package installations/upgrades. It does not triggers inode evictions, but
> it's similar because it triggers false positives from the inode_logged()
> helper, because the inodes have a logged_trans of 0, there's a log tree
> due to a fsync of an unrelated file and the directory inode has its
> last_trans field set to the current transaction:
>
>   $ cat test.sh
>
>   #!/bin/bash
>
>   DEV=/dev/nvme0n1
>   MNT=/mnt/nvme0n1
>
>   NUM_FILES=10000
>
>   mkfs.btrfs -f $DEV
>   mount $DEV $MNT
>
>   mkdir $MNT/testdir
>
>   for ((i = 1; i <= $NUM_FILES; i++)); do
>       echo -n > $MNT/testdir/file_$i
>   done
>
>   sync
>
>   # Now do some change to an unrelated file and fsync it.
>   # This is just to create a log tree to make sure that inode_logged()
>   # does not return false when called against "testdir".
>   xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
>
>   # Do some change to testdir. This is to make sure inode_logged()
>   # will return true when called against "testdir", because its
>   # logged_trans is 0, it was changed in the current transaction
>   # and there's a log tree.
>   echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
>
>   echo "Renaming $NUM_FILES files..."
>   start=$(date +%s%N)
>   for ((i = 1; i <= $NUM_FILES; i++)); do
>       mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
>   done
>   end=$(date +%s%N)
>
>   dur=$(( (end - start) / 1000000 ))
>   echo "Renames took $dur milliseconds"
>
>   umount $MNT
>
> Testing this change on a box using a non-debug kernel (Debian's default
> kernel config) gave the following results:
>
> NUM_FILES=10000, before patchset:                   27837 ms
> NUM_FILES=10000, after patches 1/5 to 4/5 applied:   9236 ms (-66.8%)
> NUM_FILES=10000, after whole patchset applied:       8902 ms (-68.0%)
>
> NUM_FILES=5000, before patchset:                     9127 ms
> NUM_FILES=5000, after patches 1/5 to 4/5 applied:    4640 ms (-49.2%)
> NUM_FILES=5000, after whole patchset applied:        4441 ms (-51.3%)
>
> NUM_FILES=2000, before patchset:                     2528 ms
> NUM_FILES=2000, after patches 1/5 to 4/5 applied:    1983 ms (-21.6%)
> NUM_FILES=2000, after whole patchset applied:        1747 ms (-30.9%)
>
> NUM_FILES=1000, before patchset:                     1085 ms
> NUM_FILES=1000, after patches 1/5 to 4/5 applied:     893 ms (-17.7%)
> NUM_FILES=1000, after whole patchset applied:         867 ms (-20.1%)
>
> Running dbench on the same physical machine with the following script:
>
>   $ cat run-dbench.sh
>   #!/bin/bash
>
>   NUM_JOBS=$(nproc --all)
>
>   DEV=/dev/nvme0n1
>   MNT=/mnt/nvme0n1
>   MOUNT_OPTIONS="-o ssd"
>   MKFS_OPTIONS="-O no-holes -R free-space-tree"
>
>   echo "performance" | \
>       tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
>
>   mkfs.btrfs -f $MKFS_OPTIONS $DEV
>   mount $MOUNT_OPTIONS $DEV $MNT
>
>   dbench -D $MNT -t 120 $NUM_JOBS
>
>   umount $MNT
>
> Before patchset:
>
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX    3761352     0.032   143.843
>  Close        2762770     0.002     2.273
>  Rename        159304     0.291    67.037
>  Unlink        759784     0.207   143.998
>  Deltree           72     4.028    15.977
>  Mkdir             36     0.003     0.006
>  Qpathinfo    3409780     0.013     9.678
>  Qfileinfo     596772     0.001     0.878
>  Qfsinfo       625189     0.003     1.245
>  Sfileinfo     306443     0.006     1.840
>  Find         1318106     0.063    19.798
>  WriteX       1871137     0.021     8.532
>  ReadX        5897325     0.003     3.567
>  LockX          12252     0.003     0.258
>  UnlockX        12252     0.002     0.100
>  Flush         263666     3.327   155.632
>
> Throughput 980.047 MB/sec  12 clients  12 procs  max_latency=155.636 ms
>
> After whole patchset applied:
>
>  Operation      Count    AvgLat    MaxLat
>  ----------------------------------------
>  NTCreateX    4195584     0.033   107.742
>  Close        3081932     0.002     1.935
>  Rename        177641     0.218    14.905
>  Unlink        847333     0.166   107.822
>  Deltree          118     5.315    15.247
>  Mkdir             59     0.004     0.048
>  Qpathinfo    3802612     0.014    10.302
>  Qfileinfo     666748     0.001     1.034
>  Qfsinfo       697329     0.003     0.944
>  Sfileinfo     341712     0.006     2.099
>  Find         1470365     0.065     9.359
>  WriteX       2093921     0.021     8.087
>  ReadX        6576234     0.003     3.407
>  LockX          13660     0.003     0.308
>  UnlockX        13660     0.002     0.114
>  Flush         294090     2.906   115.539
>
> Throughput 1093.11 MB/sec  12 clients  12 procs  max_latency=115.544 ms
>
> +11.5% throughput    -25.8% max latency   rename max latency -77.8%
>
> Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
> ---
>  fs/btrfs/tree-log.c | 245 ++++++++++++++++++++++++++++++++------------
>  fs/btrfs/tree-log.h |   3 +
>  2 files changed, 183 insertions(+), 65 deletions(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index bf529c6cb3ff..6f9829188948 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -3459,35 +3459,121 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
>  }
>
>  /*
> - * Check if an inode was logged in the current transaction. This may often
> - * return some false positives, because logged_trans is an in memory only field,
> - * not persisted anywhere. This is meant to be used in contexts where a false
> - * positive has no functional consequences.
> + * Check if an inode was logged in the current transaction. This correctly deals
> + * with the case where the inode was logged but has a logged_trans of 0, which
> + * happens if the inode is evicted and loaded again, as logged_trans is an in
> + * memory only field (not persisted).
> + *
> + * Returns 1 if the inode was logged before in the transaction, 0 if it was not,
> + * and < 0 on error.
>   */
> -static bool inode_logged(struct btrfs_trans_handle *trans,
> -                        struct btrfs_inode *inode)
> +static int inode_logged(struct btrfs_trans_handle *trans,
> +                       struct btrfs_inode *inode,
> +                       struct btrfs_path *path_in)
>  {
> +       struct btrfs_path *path = path_in;
> +       struct btrfs_key key;
> +       int ret;
> +
>         if (inode->logged_trans == trans->transid)
> -               return true;
> +               return 1;
>
> -       if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &inode->root->state))
> -               return false;
> +       /*
> +        * If logged_trans is not 0, then we know the inode logged was not logged
> +        * in this transaction, so we can return false right away.
> +        */
> +       if (inode->logged_trans > 0)
> +               return 0;
> +
> +       /*
> +        * If no log tree was created for this root in this transaction, then
> +        * the inode can not have been logged in this transaction. In that case
> +        * set logged_trans to anything greater than 0 and less than the current
> +        * transaction's ID, to avoid the search below in a future call in case
> +        * a log tree gets created after this.
> +        */
> +       if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &inode->root->state)) {
> +               inode->logged_trans = trans->transid - 1;
> +               return 0;
> +       }
> +
> +       /*
> +        * We have a log tree and the inode's logged_trans is 0. We can't tell
> +        * for sure if the inode was logged before in this transaction by looking
> +        * only at logged_trans. We could be pessimistic and assume it was, but
> +        * that can lead to unnecessarily logging an inode during rename and link
> +        * operations, and then further updating the log in followup rename and
> +        * link operations, specially if it's a directory, which adds latency
> +        * visible to applications doing a series of rename or link operations.
> +        *
> +        * A logged_trans of 0 here can mean several things:
> +        *
> +        * 1) The inode was never logged since the filesystem was mounted, and may
> +        *    or may have not been evicted and loaded again;
> +        *
> +        * 2) The inode was logged in a previous transaction, then evicted and
> +        *    then loaded again;
> +        *
> +        * 3) The inode was logged in the current transaction, then evicted and
> +        *    then loaded again.
> +        *
> +        * For cases 1) and 2) we don't want to return true, but we need to detect
> +        * case 3) and return true. So we do a search in the log root for the inode
> +        * item.
> +        */
> +       key.objectid = btrfs_ino(inode);
> +       key.type = BTRFS_INODE_ITEM_KEY;
> +       key.offset = 0;
> +
> +       if (!path) {
> +               path = btrfs_alloc_path();
> +               if (!path)
> +                       return -ENOMEM;
> +       }
> +
> +       ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0);
> +
> +       if (path_in)
> +               btrfs_release_path(path);
> +       else
> +               btrfs_free_path(path);
>
>         /*
> -        * The inode's logged_trans is always 0 when we load it (because it is
> -        * not persisted in the inode item or elsewhere). So if it is 0, the
> -        * inode was last modified in the current transaction then the inode may
> -        * have been logged before in the current transaction, then evicted and
> -        * loaded again in the current transaction - or may have never been logged
> -        * in the current transaction, but since we can not be sure, we have to
> -        * assume it was, otherwise our callers can leave an inconsistent log.
> +        * Logging an inode always results in logging its inode item. So if we
> +        * did not find the item we know the inode was not logged for sure.
>          */
> -       if (inode->logged_trans == 0 &&
> -           inode->last_trans == trans->transid &&
> -           !test_bit(BTRFS_FS_LOG_RECOVERING, &trans->fs_info->flags))
> -               return true;
> +       if (ret < 0) {
> +               return ret;
> +       } else if (ret > 0) {
> +               /*
> +                * Set logged_trans to a value greater than 0 and less then the
> +                * current transaction to avoid doing the search in future calls.
> +                */
> +               inode->logged_trans = trans->transid - 1;
> +               return 0;
> +       }
> +
> +       /*
> +        * The inode was previously logged and then evicted, set logged_trans to
> +        * the current transacion's ID, to avoid future tree searches as long as
> +        * the inode is not evicted again.
> +        */
> +       inode->logged_trans = trans->transid;
> +
> +       /*
> +        * If it's a directory, then we must set last_dir_index_offset to the
> +        * maximum possible value, so that the next attempt to log the inode does
> +        * not skip checking if dir index keys found in modified subvolume tree
> +        * leaves have been logged before, otherwise it would result in attempts
> +        * to insert duplicate dir index keys in the log tree. This must be done
> +        * because last_dir_index_offset is an in-memory only field, not persisted
> +        * in the inode item or any other on-disk structure, so its value is lost
> +        * once the inode is evicted.
> +        */
> +       if (S_ISDIR(inode->vfs_inode.i_mode))
> +               inode->last_dir_index_offset = (u64)-1;
>
> -       return false;
> +       return 1;
>  }
>
>  /*
> @@ -3552,8 +3638,13 @@ void btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
>         struct btrfs_path *path;
>         int ret;
>
> -       if (!inode_logged(trans, dir))
> +       ret = inode_logged(trans, dir, NULL);
> +       if (ret == 0)
> +               return;
> +       else if (ret < 0) {
> +               btrfs_set_log_full_commit(trans);
>                 return;
> +       }
>
>         ret = join_running_log_trans(root);
>         if (ret)
> @@ -3587,8 +3678,13 @@ void btrfs_del_inode_ref_in_log(struct btrfs_trans_handle *trans,
>         u64 index;
>         int ret;
>
> -       if (!inode_logged(trans, inode))
> +       ret = inode_logged(trans, inode, NULL);
> +       if (ret == 0)
>                 return;
> +       else if (ret < 0) {
> +               btrfs_set_log_full_commit(trans);
> +               return;
> +       }
>
>         ret = join_running_log_trans(root);
>         if (ret)
> @@ -3720,7 +3816,6 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans,
>         struct extent_buffer *src = path->nodes[0];
>         const int nritems = btrfs_header_nritems(src);
>         const u64 ino = btrfs_ino(inode);
> -       const bool inode_logged_before = inode_logged(trans, inode);
>         bool last_found = false;
>         int batch_start = 0;
>         int batch_size = 0;
> @@ -3796,14 +3891,16 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans,
>                                 ctx->log_new_dentries = true;
>                 }
>
> -               if (!inode_logged_before)
> +               if (!ctx->logged_before)
>                         goto add_to_batch;
>
>                 /*
>                  * If we were logged before and have logged dir items, we can skip
>                  * checking if any item with a key offset larger than the last one
>                  * we logged is in the log tree, saving time and avoiding adding
> -                * contention on the log tree.
> +                * contention on the log tree. We can only rely on the value of
> +                * last_dir_index_offset when we know for sure that the inode was
> +                * previously logged in the current transaction.
>                  */
>                 if (key.offset > inode->last_dir_index_offset)
>                         goto add_to_batch;
> @@ -4046,21 +4143,6 @@ static noinline int log_directory_changes(struct btrfs_trans_handle *trans,
>         u64 max_key;
>         int ret;
>
> -       /*
> -        * If this is the first time we are being logged in the current
> -        * transaction, or we were logged before but the inode was evicted and
> -        * reloaded later, in which case its logged_trans is 0, reset the value
> -        * of the last logged key offset. Note that we don't use the helper
> -        * function inode_logged() here - that is because the function returns
> -        * true after an inode eviction, assuming the worst case as it can not
> -        * know for sure if the inode was logged before. So we can not skip key
> -        * searches in the case the inode was evicted, because it may not have
> -        * been logged in this transaction and may have been logged in a past
> -        * transaction, so we need to reset the last dir index offset to (u64)-1.
> -        */
> -       if (inode->logged_trans != trans->transid)
> -               inode->last_dir_index_offset = (u64)-1;
> -
>         min_key = BTRFS_DIR_START_INDEX;
>         max_key = 0;
>         ctx->last_dir_item_offset = inode->last_dir_index_offset;
> @@ -4097,9 +4179,6 @@ static int drop_inode_items(struct btrfs_trans_handle *trans,
>         struct btrfs_key found_key;
>         int start_slot;
>
> -       if (!inode_logged(trans, inode))
> -               return 0;
> -
>         key.objectid = btrfs_ino(inode);
>         key.type = max_key_type;
>         key.offset = (u64)-1;
> @@ -4597,7 +4676,7 @@ static int log_one_extent(struct btrfs_trans_handle *trans,
>          * are small, with a root at level 2 or 3 at most, due to their short
>          * life span.
>          */
> -       if (inode_logged(trans, inode)) {
> +       if (ctx->logged_before) {
>                 drop_args.path = path;
>                 drop_args.start = em->start;
>                 drop_args.end = em->start + em->len;
> @@ -5594,6 +5673,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>         bool xattrs_logged = false;
>         bool recursive_logging = false;
>         bool inode_item_dropped = true;
> +       const bool orig_logged_before = ctx->logged_before;
>
>         path = btrfs_alloc_path();
>         if (!path)
> @@ -5643,6 +5723,18 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>                 mutex_lock(&inode->log_mutex);
>         }
>
> +       /*
> +        * Before logging the inode item, cache the value returned by
> +        * inode_logged(), because after that we have the need to figure out if
> +        * the inode was previously logged in this transaction.
> +        */
> +       ret = inode_logged(trans, inode, path);
> +       if (ret < 0) {
> +               err = ret;
> +               goto out;

This should be 'goto out_unlock', otherwise it returns without
unlocking the inode's log_mutex.
David, do you prefer me to send a new version of the patchset or can
you edit this in misc-next?

Thanks.

> +       }
> +       ctx->logged_before = (ret == 1);
> +
>         /*
>          * This is for cases where logging a directory could result in losing a
>          * a file after replaying the log. For example, if we move a file from a
> @@ -5668,9 +5760,11 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>                 clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags);
>                 if (inode_only == LOG_INODE_EXISTS)
>                         max_key_type = BTRFS_XATTR_ITEM_KEY;
> -               ret = drop_inode_items(trans, log, path, inode, max_key_type);
> +               if (ctx->logged_before)
> +                       ret = drop_inode_items(trans, log, path, inode,
> +                                              max_key_type);
>         } else {
> -               if (inode_only == LOG_INODE_EXISTS && inode_logged(trans, inode)) {
> +               if (inode_only == LOG_INODE_EXISTS && ctx->logged_before) {
>                         /*
>                          * Make sure the new inode item we write to the log has
>                          * the same isize as the current one (if it exists).
> @@ -5692,14 +5786,15 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>                              &inode->runtime_flags)) {
>                         if (inode_only == LOG_INODE_EXISTS) {
>                                 max_key.type = BTRFS_XATTR_ITEM_KEY;
> -                               ret = drop_inode_items(trans, log, path, inode,
> -                                                      max_key.type);
> +                               if (ctx->logged_before)
> +                                       ret = drop_inode_items(trans, log, path,
> +                                                              inode, max_key.type);
>                         } else {
>                                 clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
>                                           &inode->runtime_flags);
>                                 clear_bit(BTRFS_INODE_COPY_EVERYTHING,
>                                           &inode->runtime_flags);
> -                               if (inode_logged(trans, inode))
> +                               if (ctx->logged_before)
>                                         ret = truncate_inode_items(trans, log,
>                                                                    inode, 0, 0);
>                         }
> @@ -5709,8 +5804,9 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>                         if (inode_only == LOG_INODE_ALL)
>                                 fast_search = true;
>                         max_key.type = BTRFS_XATTR_ITEM_KEY;
> -                       ret = drop_inode_items(trans, log, path, inode,
> -                                              max_key.type);
> +                       if (ctx->logged_before)
> +                               ret = drop_inode_items(trans, log, path, inode,
> +                                                      max_key.type);
>                 } else {
>                         if (inode_only == LOG_INODE_ALL)
>                                 fast_search = true;
> @@ -5830,6 +5926,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
>  out:
>         btrfs_free_path(path);
>         btrfs_free_path(dst_path);
> +
> +       if (recursive_logging)
> +               ctx->logged_before = orig_logged_before;
> +
>         return err;
>  }
>
> @@ -6774,7 +6874,7 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
>         struct btrfs_root *root = inode->root;
>         struct btrfs_log_ctx ctx;
>         bool log_pinned = false;
> -       int ret = 0;
> +       int ret;
>
>         /*
>          * this will force the logging code to walk the dentry chain
> @@ -6787,9 +6887,24 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
>          * if this inode hasn't been logged and directory we're renaming it
>          * from hasn't been logged, we don't need to log it
>          */
> -       if (!inode_logged(trans, inode) &&
> -           (!old_dir || !inode_logged(trans, old_dir)))
> -               return;
> +       ret = inode_logged(trans, inode, NULL);
> +       if (ret < 0) {
> +               goto out;
> +       } else if (ret == 0) {
> +               if (!old_dir)
> +                       return;
> +               /*
> +                * If the inode was not logged and we are doing a rename (old_dir is not
> +                * NULL), check if old_dir was logged - if it was not we can return and
> +                * do nothing.
> +                */
> +               ret = inode_logged(trans, old_dir, NULL);
> +               if (ret < 0)
> +                       goto out;
> +               else if (ret == 0)
> +                       return;
> +       }
> +       ret = 0;
>
>         /*
>          * If we are doing a rename (old_dir is not NULL) from a directory that
> @@ -6849,15 +6964,15 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
>          */
>         btrfs_log_inode_parent(trans, inode, parent, LOG_INODE_EXISTS, &ctx);
>  out:
> -       if (log_pinned) {
> -               /*
> -                * If an error happened mark the log for a full commit because
> -                * it's not consistent and up to date. Do it before unpinning the
> -                * log, to avoid any races with someone else trying to commit it.
> -                */
> -               if (ret < 0)
> -                       btrfs_set_log_full_commit(trans);
> +       /*
> +        * If an error happened mark the log for a full commit because it's not
> +        * consistent and up to date or we couldn't find out if one of the
> +        * inodes was logged before in this transaction. Do it before unpinning
> +        * the log, to avoid any races with someone else trying to commit it.
> +        */
> +       if (ret < 0)
> +               btrfs_set_log_full_commit(trans);
> +       if (log_pinned)
>                 btrfs_end_log_trans(root);
> -       }
>  }
>
> diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h
> index f1acb7fa944c..1620f8170629 100644
> --- a/fs/btrfs/tree-log.h
> +++ b/fs/btrfs/tree-log.h
> @@ -17,6 +17,8 @@ struct btrfs_log_ctx {
>         int log_transid;
>         bool log_new_dentries;
>         bool logging_new_name;
> +       /* Indicate if the inode being logged was logged before. */
> +       bool logged_before;
>         /* Tracks the last logged dir item/index key offset. */
>         u64 last_dir_item_offset;
>         struct inode *inode;
> @@ -32,6 +34,7 @@ static inline void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx,
>         ctx->log_transid = 0;
>         ctx->log_new_dentries = false;
>         ctx->logging_new_name = false;
> +       ctx->logged_before = false;
>         ctx->inode = inode;
>         INIT_LIST_HEAD(&ctx->list);
>         INIT_LIST_HEAD(&ctx->ordered_extents);
> --
> 2.33.0
>