[PATCH 0/7] btrfs: more performance improvements for dbench workloads

linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/7] btrfs: more performance improvements for dbench workloads
@ 2021-01-27 10:34 fdmanana
  2021-01-27 10:34 ` [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry fdmanana
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

The following patchset brings one more batch of performance improvements
with dbench workloads, or anything that mixes file creation, file writes,
renames, unlinks, etc with fsync like dbench does. This patchset is mostly
based on avoiding logging directory inodes multiple times when not necessary,
falling back to transaction commits less frequently and often waiting less
time for transaction commits to complete. Performance results are listed in
the change log of the last patch, but in short, I've experienced a reduction
of maximum latency up to about -40% and throuhput gains up to about +6%.

Filipe Manana (7):
  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

 fs/btrfs/file.c        |   1 +
 fs/btrfs/transaction.c |  39 +++++++--
 fs/btrfs/transaction.h |   2 +
 fs/btrfs/tree-log.c    | 195 ++++++++++++-----------------------------
 4 files changed, 92 insertions(+), 145 deletions(-)

-- 
2.28.0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
@ 2021-01-27 10:34 ` fdmanana
  2021-01-27 10:34 ` [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging fdmanana
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we remove a directory entry, as part of an unlink operation, if the
directory was logged before we must remove the directory index items from
the log. We are also updating the inode item of the directory to update
its i_size, but that is not necessary because during log replay we do not
need it and we correctly adjust the i_size in the inode item of the
subvolume as we process directory index items and replay deletes.

This is not needed since commit d555438b6e1dad ("Btrfs: drop dir i_size
when adding new names on replay"), where we explicitly ignore the i_size
of directory inode items on log replay. Before that we used it but it
was buggy as mentioned in that commit's change log (i_size got a larger
value then it should have).

So stop updating the i_size of the directory inode item in the log, as
that is a waste of time, adds more log contention to the log tree and
often results in COWing more extent buffers for the log tree.

This code path is triggered often during dbench workloads for example.
This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 39 ++++-----------------------------------
 1 file changed, 4 insertions(+), 35 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8ee0700a980f..5d87afc6058a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3379,7 +3379,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	int ret;
 	int err = 0;
-	int bytes_del = 0;
 	u64 dir_ino = btrfs_ino(dir);
 
 	if (!inode_logged(trans, dir))
@@ -3406,7 +3405,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
 	}
 	if (di) {
 		ret = btrfs_delete_one_dir_name(trans, log, path, di);
-		bytes_del += name_len;
 		if (ret) {
 			err = ret;
 			goto fail;
@@ -3421,46 +3419,17 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
 	}
 	if (di) {
 		ret = btrfs_delete_one_dir_name(trans, log, path, di);
-		bytes_del += name_len;
 		if (ret) {
 			err = ret;
 			goto fail;
 		}
 	}
 
-	/* update the directory size in the log to reflect the names
-	 * we have removed
+	/*
+	 * We do not need to update the size field of the directory's inode item
+	 * because on log replay we update the field to reflect all existing
+	 * entries in the directory (see overwrite_item()).
 	 */
-	if (bytes_del) {
-		struct btrfs_key key;
-
-		key.objectid = dir_ino;
-		key.offset = 0;
-		key.type = BTRFS_INODE_ITEM_KEY;
-		btrfs_release_path(path);
-
-		ret = btrfs_search_slot(trans, log, &key, path, 0, 1);
-		if (ret < 0) {
-			err = ret;
-			goto fail;
-		}
-		if (ret == 0) {
-			struct btrfs_inode_item *item;
-			u64 i_size;
-
-			item = btrfs_item_ptr(path->nodes[0], path->slots[0],
-					      struct btrfs_inode_item);
-			i_size = btrfs_inode_size(path->nodes[0], item);
-			if (i_size > bytes_del)
-				i_size -= bytes_del;
-			else
-				i_size = 0;
-			btrfs_set_inode_size(path->nodes[0], item, i_size);
-			btrfs_mark_buffer_dirty(path->nodes[0]);
-		} else
-			ret = 0;
-		btrfs_release_path(path);
-	}
 fail:
 	btrfs_free_path(path);
 out_unlock:
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
  2021-01-27 10:34 ` [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry fdmanana
@ 2021-01-27 10:34 ` fdmanana
  2021-01-27 10:34 ` [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode fdmanana
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we fill an inode item for logging we are setting its nbytes field
with the value returned by inode_get_bytes() (a VFS API), however we do
not need it because it is not used during log replay. In fact, for fast
fsyncs, when we call inode_get_bytes() we may even get an outdated value
for nbytes because the nbytes field of the inode is only updated when
ordered extents complete, and a fast fsync only waits for writeback to
complete, it does not wait for ordered extent completion.

So just remove the setup of nbytes and add an explicit comment mentioning
why we do not set it. This also avoids adding contention on the inode's
i_lock (VFS) with concurrent stat() calls, since that spinlock is used by
inode_get_bytes() which is also called by our stat callback
(btrfs_getattr()).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5d87afc6058a..be62759f0aac 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3858,7 +3858,14 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
 	btrfs_set_token_timespec_nsec(&token, &item->ctime,
 				      inode->i_ctime.tv_nsec);
 
-	btrfs_set_token_inode_nbytes(&token, item, inode_get_bytes(inode));
+	/*
+	 * We do not need to set the nbytes field, in fact during a fast fsync
+	 * its value may not even be correct, since a fast fsync does not wait
+	 * for ordered extent completion, which is where we update nbytes, it
+	 * only waits for writeback to complete. During log replay as we find
+	 * file extent items and replay them, we adjust the nbytes field of the
+	 * inode item in subvolume tree as needed (see overwrite_item()).
+	 */
 
 	btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode));
 	btrfs_set_token_inode_transid(&token, item, trans->transid);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
  2021-01-27 10:34 ` [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry fdmanana
  2021-01-27 10:34 ` [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging fdmanana
@ 2021-01-27 10:34 ` fdmanana
  2021-01-27 10:34 ` [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents fdmanana
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When we fsync a new file, created in the current transaction, we check
all its ancestor inodes and always log them if they were created in the
current transaction - even if we have already logged them before, which
is a waste of time.

So avoid logging new ancestor inodes if they were already logged before
and have no xattrs added/updated/removed since they were last logged.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index be62759f0aac..105cf316ee27 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5272,6 +5272,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	if (S_ISDIR(inode->vfs_inode.i_mode)) {
 		int max_key_type = BTRFS_DIR_LOG_INDEX_KEY;
 
+		clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags);
 		if (inode_only == LOG_INODE_EXISTS)
 			max_key_type = BTRFS_XATTR_ITEM_KEY;
 		ret = drop_objectid_items(trans, log, path, ino, max_key_type);
@@ -5520,6 +5521,34 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
 	return ret;
 }
 
+/*
+ * Check if we need to log an inode. This is used in contexts where while
+ * logging an inode we need to log another inode (either that it exists or in
+ * full mode). This is used instead of btrfs_inode_in_log() because the later
+ * requires the inode to be in the log and have the log transaction committed,
+ * while here we do not care if the log transaction was already committed - our
+ * caller will commit the log later - and we want to avoid logging an inode
+ * multiple times when multiple tasks have joined the same log transaction.
+ */
+static bool need_log_inode(struct btrfs_trans_handle *trans,
+			   struct btrfs_inode *inode)
+{
+	/*
+	 * If this inode does not have new/updated/deleted xattrs since the last
+	 * time it was logged and is flagged as logged in the current transaction,
+	 * we can skip logging it. As for new/deleted names, those are updated in
+	 * the log by link/unlink/rename operations.
+	 * In case the inode was logged and then evicted and reloaded, its
+	 * logged_trans will be 0, in which case we have to fully log it since
+	 * logged_trans is a transient field, not persisted.
+	 */
+	if (inode->logged_trans == trans->transid &&
+	    !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags))
+		return false;
+
+	return true;
+}
+
 struct btrfs_dir_list {
 	u64 ino;
 	struct list_head list;
@@ -5848,7 +5877,8 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
 		if (IS_ERR(inode))
 			return PTR_ERR(inode);
 
-		if (BTRFS_I(inode)->generation >= trans->transid)
+		if (BTRFS_I(inode)->generation >= trans->transid &&
+		    need_log_inode(trans, BTRFS_I(inode)))
 			ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
 					      LOG_INODE_EXISTS, ctx);
 		btrfs_add_delayed_iput(inode);
@@ -5902,7 +5932,8 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
 		if (root != inode->root)
 			break;
 
-		if (inode->generation >= trans->transid) {
+		if (inode->generation >= trans->transid &&
+		    need_log_inode(trans, inode)) {
 			ret = btrfs_log_inode(trans, root, inode,
 					      LOG_INODE_EXISTS, ctx);
 			if (ret)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
                   ` (2 preceding siblings ...)
  2021-01-27 10:34 ` [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode fdmanana
@ 2021-01-27 10:34 ` fdmanana
  2021-01-27 10:34 ` [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries fdmanana
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Some times when we fsync an inode we need to do a full log of all its
ancestors (due to unlink, link or rename operations), which can be an
expensive operation, specially if the directories are large.

However if we find an ancestor directory inode that is already logged in
the current transaction, and has no inserted/updated/deleted xattrs since
it was last logged, we can skip logging the directory again. We are safe
to skip that since we know that for logged directories, any link, unlink
or rename operations that implicate the directory will update the log as
necessary.

So use the helper need_log_dir(), introduced in a previous commit, to
detect already logged directories that can be skipped.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 105cf316ee27..c0dce99c2c14 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5826,6 +5826,11 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
 				goto out;
 			}

+			if (!need_log_inode(trans, BTRFS_I(dir_inode))) {
+				btrfs_add_delayed_iput(dir_inode);
+				continue;
+			}
+
 			if (ctx)
 				ctx->log_new_dentries = false;
 			ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode),
-- 
2.28.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
                   ` (3 preceding siblings ...)
  2021-01-27 10:34 ` [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents fdmanana
@ 2021-01-27 10:34 ` fdmanana
  2021-01-27 10:34 ` [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() fdmanana
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

When logging new directory entries of a directory, we log the inodes of
new dentries and the inodes of dentries pointing to directories that
may have been created in past transactions. For the case of directories
we log in full mode, which can be particularly expensive for large
directories.

We do use btrfs_inode_in_log() to skip already logged inodes, however for
that helper to return true, it requires that the log transaction used to
log the inode to be already committed. This means that when we have more
than one task using the same log transaction we can end up logging an
inode multiple times, which is a waste of time and not necessary since
the log will be committed by one of the tasks and the others will wait for
the log transaction to be committed before returning to user space.

So simply replace the use of btrfs_inode_in_log() with the new helper
function need_log_inode(), introduced in a previous commit.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c0dce99c2c14..6dc376a16cf2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5676,7 +5676,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 				goto next_dir_inode;
 			}

-			if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
+			if (!need_log_inode(trans, BTRFS_I(di_inode))) {
 				btrfs_add_delayed_iput(di_inode);
 				break;
 			}
-- 
2.28.0

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
                   ` (4 preceding siblings ...)
  2021-01-27 10:34 ` [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries fdmanana
@ 2021-01-27 10:34 ` fdmanana
  2021-01-27 15:23   ` Josef Bacik
  2021-01-27 10:35 ` [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fdmanana
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Whenever we fsync an inode, if it is a directory, a regular file that was
created in the current transaction or has last_unlink_trans set to the
generation of the current transaction, we check if any of its ancestor
inodes (and the inode itself if it is a directory) can not be logged and
need a fallback to a full transaction commit - if so, we return with a
value of 1 in order to fallback to a transaction commit.

However we often do not need to fallback to a transaction commit because:

1) The ancestor inode is not an immediate parent, and therefore there is
   not an explicit request to log it and it is not needed neither to
   guarantee the consistency of the inode originally asked to be logged
   (fsynced) nor its immediate parent;

2) The ancestor inode was already logged before, in which case any link,
   unlink or rename operation updates the log as needed.

So for these two cases we can avoid an unnecessary transaction commit.
Therefore remove check_parent_dirs_for_sync() and add a check at the top
of btrfs_log_inode() to make us fallback immediately to a transaction
commit when we are logging a directory inode that can not be logged and
needs a full transaction commit. All we need to protect is the case where
after renaming a file someone fsyncs only the old directory, which would
result is losing the renamed file after a log replay.

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

Performance results, after applying all patches, are mentioned in the
change log of the last patch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/tree-log.c | 121 ++++++--------------------------------------
 1 file changed, 15 insertions(+), 106 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6dc376a16cf2..4c7b283ed2b2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5265,6 +5265,21 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 		mutex_lock(&inode->log_mutex);
 	}
 
+	/*
+	 * This is for cases where logging a directory could result in losing a
+	 * a file after replaying the log. For example, if we move a file from a
+	 * directory A to a directory B, then fsync directory A, we have no way
+	 * to known the file was moved from A to B, so logging just A would
+	 * result in losing the file after a log replay.
+	 */
+	if (S_ISDIR(inode->vfs_inode.i_mode) &&
+	    inode_only == LOG_INODE_ALL &&
+	    inode->last_unlink_trans >= trans->transid) {
+		btrfs_set_log_full_commit(trans);
+		err = 1;
+		goto out_unlock;
+	}
+
 	/*
 	 * a brute force approach to making sure we get the most uptodate
 	 * copies of everything.
@@ -5428,99 +5443,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	return err;
 }
 
-/*
- * Check if we must fallback to a transaction commit when logging an inode.
- * This must be called after logging the inode and is used only in the context
- * when fsyncing an inode requires the need to log some other inode - in which
- * case we can't lock the i_mutex of each other inode we need to log as that
- * can lead to deadlocks with concurrent fsync against other inodes (as we can
- * log inodes up or down in the hierarchy) or rename operations for example. So
- * we take the log_mutex of the inode after we have logged it and then check for
- * its last_unlink_trans value - this is safe because any task setting
- * last_unlink_trans must take the log_mutex and it must do this before it does
- * the actual unlink operation, so if we do this check before a concurrent task
- * sets last_unlink_trans it means we've logged a consistent version/state of
- * all the inode items, otherwise we are not sure and must do a transaction
- * commit (the concurrent task might have only updated last_unlink_trans before
- * we logged the inode or it might have also done the unlink).
- */
-static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans,
-					  struct btrfs_inode *inode)
-{
-	bool ret = false;
-
-	mutex_lock(&inode->log_mutex);
-	if (inode->last_unlink_trans >= trans->transid) {
-		/*
-		 * Make sure any commits to the log are forced to be full
-		 * commits.
-		 */
-		btrfs_set_log_full_commit(trans);
-		ret = true;
-	}
-	mutex_unlock(&inode->log_mutex);
-
-	return ret;
-}
-
-/*
- * follow the dentry parent pointers up the chain and see if any
- * of the directories in it require a full commit before they can
- * be logged.  Returns zero if nothing special needs to be done or 1 if
- * a full commit is required.
- */
-static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
-					       struct btrfs_inode *inode,
-					       struct dentry *parent,
-					       struct super_block *sb)
-{
-	int ret = 0;
-	struct dentry *old_parent = NULL;
-
-	/*
-	 * for regular files, if its inode is already on disk, we don't
-	 * have to worry about the parents at all.  This is because
-	 * we can use the last_unlink_trans field to record renames
-	 * and other fun in this file.
-	 */
-	if (S_ISREG(inode->vfs_inode.i_mode) &&
-	    inode->generation < trans->transid &&
-	    inode->last_unlink_trans < trans->transid)
-		goto out;
-
-	if (!S_ISDIR(inode->vfs_inode.i_mode)) {
-		if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
-			goto out;
-		inode = BTRFS_I(d_inode(parent));
-	}
-
-	while (1) {
-		if (btrfs_must_commit_transaction(trans, inode)) {
-			ret = 1;
-			break;
-		}
-
-		if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
-			break;
-
-		if (IS_ROOT(parent)) {
-			inode = BTRFS_I(d_inode(parent));
-			if (btrfs_must_commit_transaction(trans, inode))
-				ret = 1;
-			break;
-		}
-
-		parent = dget_parent(parent);
-		dput(old_parent);
-		old_parent = parent;
-		inode = BTRFS_I(d_inode(parent));
-
-	}
-	dput(old_parent);
-out:
-	return ret;
-}
-
 /*
  * Check if we need to log an inode. This is used in contexts where while
  * logging an inode we need to log another inode (either that it exists or in
@@ -5686,9 +5608,6 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
 				log_mode = LOG_INODE_ALL;
 			ret = btrfs_log_inode(trans, root, BTRFS_I(di_inode),
 					      log_mode, ctx);
-			if (!ret &&
-			    btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
-				ret = 1;
 			btrfs_add_delayed_iput(di_inode);
 			if (ret)
 				goto next_dir_inode;
@@ -5835,9 +5754,6 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
 				ctx->log_new_dentries = false;
 			ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode),
 					      LOG_INODE_ALL, ctx);
-			if (!ret &&
-			    btrfs_must_commit_transaction(trans, BTRFS_I(dir_inode)))
-				ret = 1;
 			if (!ret && ctx && ctx->log_new_dentries)
 				ret = log_new_dir_dentries(trans, root,
 						   BTRFS_I(dir_inode), ctx);
@@ -6053,12 +5969,9 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
-	struct super_block *sb;
 	int ret = 0;
 	bool log_dentries = false;
 
-	sb = inode->vfs_inode.i_sb;
-
 	if (btrfs_test_opt(fs_info, NOTREELOG)) {
 		ret = 1;
 		goto end_no_trans;
@@ -6069,10 +5982,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
 		goto end_no_trans;
 	}
 
-	ret = check_parent_dirs_for_sync(trans, inode, parent, sb);
-	if (ret)
-		goto end_no_trans;
-
 	/*
 	 * Skip already logged inodes or inodes corresponding to tmpfiles
 	 * (since logging them is pointless, a link count of 0 means they
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
  2021-01-27 10:34 ` [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() fdmanana
@ 2021-01-27 15:23   ` Josef Bacik
  2021-01-27 15:36     ` Filipe Manana
  0 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:23 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Whenever we fsync an inode, if it is a directory, a regular file that was
> created in the current transaction or has last_unlink_trans set to the
> generation of the current transaction, we check if any of its ancestor
> inodes (and the inode itself if it is a directory) can not be logged and
> need a fallback to a full transaction commit - if so, we return with a
> value of 1 in order to fallback to a transaction commit.
> 
> However we often do not need to fallback to a transaction commit because:
> 
> 1) The ancestor inode is not an immediate parent, and therefore there is
>     not an explicit request to log it and it is not needed neither to
>     guarantee the consistency of the inode originally asked to be logged
>     (fsynced) nor its immediate parent;
> 
> 2) The ancestor inode was already logged before, in which case any link,
>     unlink or rename operation updates the log as needed.
> 
> So for these two cases we can avoid an unnecessary transaction commit.
> Therefore remove check_parent_dirs_for_sync() and add a check at the top
> of btrfs_log_inode() to make us fallback immediately to a transaction
> commit when we are logging a directory inode that can not be logged and
> needs a full transaction commit. All we need to protect is the case where
> after renaming a file someone fsyncs only the old directory, which would
> result is losing the renamed file after a log replay.
> 
> This patch is part of a patchset comprised of the following patches:
> 
>    btrfs: remove unnecessary directory inode item update when deleting dir entry
>    btrfs: stop setting nbytes when filling inode item for logging
>    btrfs: avoid logging new ancestor inodes when logging new inode
>    btrfs: skip logging directories already logged when logging all parents
>    btrfs: skip logging inodes already logged when logging new entries
>    btrfs: remove unnecessary check_parent_dirs_for_sync()
>    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> 
> Performance results, after applying all patches, are mentioned in the
> change log of the last patch.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

I'm having a hard time with this one.

Previously we would commit the transaction if the inode was a regular file, that 
was created in this current transaction, and had been renamed.  Now with this 
patch you're only committing the transaction if we are a directory and were 
renamed ourselves.  Before if you already had directories A and B and then did 
something like

echo "foo" > /mnt/test/A/blah
fsync(/mnt/test/A/blah);
fsync(/mnt/test/A);
mv /mnt/test/A/blah /mnt/test/B
fsync(/mnt/test/B/blah);

we would commit the transaction on this second fsync, but with your patch we are 
not.  I suppose that's keeping in line with how fsync is allowed to work, but 
it's definitely a change in behavior from what we used to do.  Not sure if 
that's good or not, I'll have to think about it.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
  2021-01-27 15:23   ` Josef Bacik
@ 2021-01-27 15:36     ` Filipe Manana
  2021-01-27 15:42       ` Josef Bacik
  0 siblings, 1 reply; 14+ messages in thread
From: Filipe Manana @ 2021-01-27 15:36 UTC (permalink / raw)
  To: Josef Bacik; +Cc: linux-btrfs

On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Whenever we fsync an inode, if it is a directory, a regular file that was
> > created in the current transaction or has last_unlink_trans set to the
> > generation of the current transaction, we check if any of its ancestor
> > inodes (and the inode itself if it is a directory) can not be logged and
> > need a fallback to a full transaction commit - if so, we return with a
> > value of 1 in order to fallback to a transaction commit.
> >
> > However we often do not need to fallback to a transaction commit because:
> >
> > 1) The ancestor inode is not an immediate parent, and therefore there is
> >     not an explicit request to log it and it is not needed neither to
> >     guarantee the consistency of the inode originally asked to be logged
> >     (fsynced) nor its immediate parent;
> >
> > 2) The ancestor inode was already logged before, in which case any link,
> >     unlink or rename operation updates the log as needed.
> >
> > So for these two cases we can avoid an unnecessary transaction commit.
> > Therefore remove check_parent_dirs_for_sync() and add a check at the top
> > of btrfs_log_inode() to make us fallback immediately to a transaction
> > commit when we are logging a directory inode that can not be logged and
> > needs a full transaction commit. All we need to protect is the case where
> > after renaming a file someone fsyncs only the old directory, which would
> > result is losing the renamed file after a log replay.
> >
> > This patch is part of a patchset comprised of the following patches:
> >
> >    btrfs: remove unnecessary directory inode item update when deleting dir entry
> >    btrfs: stop setting nbytes when filling inode item for logging
> >    btrfs: avoid logging new ancestor inodes when logging new inode
> >    btrfs: skip logging directories already logged when logging all parents
> >    btrfs: skip logging inodes already logged when logging new entries
> >    btrfs: remove unnecessary check_parent_dirs_for_sync()
> >    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> >
> > Performance results, after applying all patches, are mentioned in the
> > change log of the last patch.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> I'm having a hard time with this one.
>
> Previously we would commit the transaction if the inode was a regular file, that
> was created in this current transaction, and had been renamed.  Now with this
> patch you're only committing the transaction if we are a directory and were
> renamed ourselves.  Before if you already had directories A and B and then did
> something like
>
> echo "foo" > /mnt/test/A/blah
> fsync(/mnt/test/A/blah);
> fsync(/mnt/test/A);
> mv /mnt/test/A/blah /mnt/test/B
> fsync(/mnt/test/B/blah);
>
> we would commit the transaction on this second fsync, but with your patch we are
> not.  I suppose that's keeping in line with how fsync is allowed to work, but
> it's definitely a change in behavior from what we used to do.  Not sure if
> that's good or not, I'll have to think about it.  Thanks,

Yes. Because of the rename (or a link), we will set last_unlink_trans
to the current transaction, and when logging the file that will cause
logging of all its old parents (A). That was added several years ago
to fix corruptions, and it turned out to be needed later as well to
ensure we have
a behaviour similar to xfs and ext4 (and others) regarding strictly
ordered metadata updates (I added several tests to fstests over the
years for all the cases).
There's also the fact that on replay we will delete any inode refs
that aren't in the log (that one was added in commit 1f250e929a9c
("Btrfs: fix log replay failure after unlink and link combination").

For that example we also have A updated in the log by the rename. So
we know the log is consistent.

So that's why the whole check_parents_for_sync() is not needed.

Thanks.

>
> Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
  2021-01-27 15:36     ` Filipe Manana
@ 2021-01-27 15:42       ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:42 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs

On 1/27/21 10:36 AM, Filipe Manana wrote:
> On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> Whenever we fsync an inode, if it is a directory, a regular file that was
>>> created in the current transaction or has last_unlink_trans set to the
>>> generation of the current transaction, we check if any of its ancestor
>>> inodes (and the inode itself if it is a directory) can not be logged and
>>> need a fallback to a full transaction commit - if so, we return with a
>>> value of 1 in order to fallback to a transaction commit.
>>>
>>> However we often do not need to fallback to a transaction commit because:
>>>
>>> 1) The ancestor inode is not an immediate parent, and therefore there is
>>>      not an explicit request to log it and it is not needed neither to
>>>      guarantee the consistency of the inode originally asked to be logged
>>>      (fsynced) nor its immediate parent;
>>>
>>> 2) The ancestor inode was already logged before, in which case any link,
>>>      unlink or rename operation updates the log as needed.
>>>
>>> So for these two cases we can avoid an unnecessary transaction commit.
>>> Therefore remove check_parent_dirs_for_sync() and add a check at the top
>>> of btrfs_log_inode() to make us fallback immediately to a transaction
>>> commit when we are logging a directory inode that can not be logged and
>>> needs a full transaction commit. All we need to protect is the case where
>>> after renaming a file someone fsyncs only the old directory, which would
>>> result is losing the renamed file after a log replay.
>>>
>>> This patch is part of a patchset comprised of the following patches:
>>>
>>>     btrfs: remove unnecessary directory inode item update when deleting dir entry
>>>     btrfs: stop setting nbytes when filling inode item for logging
>>>     btrfs: avoid logging new ancestor inodes when logging new inode
>>>     btrfs: skip logging directories already logged when logging all parents
>>>     btrfs: skip logging inodes already logged when logging new entries
>>>     btrfs: remove unnecessary check_parent_dirs_for_sync()
>>>     btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
>>>
>>> Performance results, after applying all patches, are mentioned in the
>>> change log of the last patch.
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>
>> I'm having a hard time with this one.
>>
>> Previously we would commit the transaction if the inode was a regular file, that
>> was created in this current transaction, and had been renamed.  Now with this
>> patch you're only committing the transaction if we are a directory and were
>> renamed ourselves.  Before if you already had directories A and B and then did
>> something like
>>
>> echo "foo" > /mnt/test/A/blah
>> fsync(/mnt/test/A/blah);
>> fsync(/mnt/test/A);
>> mv /mnt/test/A/blah /mnt/test/B
>> fsync(/mnt/test/B/blah);
>>
>> we would commit the transaction on this second fsync, but with your patch we are
>> not.  I suppose that's keeping in line with how fsync is allowed to work, but
>> it's definitely a change in behavior from what we used to do.  Not sure if
>> that's good or not, I'll have to think about it.  Thanks,
> 
> Yes. Because of the rename (or a link), we will set last_unlink_trans
> to the current transaction, and when logging the file that will cause
> logging of all its old parents (A). That was added several years ago
> to fix corruptions, and it turned out to be needed later as well to
> ensure we have
> a behaviour similar to xfs and ext4 (and others) regarding strictly
> ordered metadata updates (I added several tests to fstests over the
> years for all the cases).
> There's also the fact that on replay we will delete any inode refs
> that aren't in the log (that one was added in commit 1f250e929a9c
> ("Btrfs: fix log replay failure after unlink and link combination").
> 
> For that example we also have A updated in the log by the rename. So
> we know the log is consistent.
> 
> So that's why the whole check_parents_for_sync() is not needed.
> 

Ok that's reasonable, thanks,

Josef


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
                   ` (5 preceding siblings ...)
  2021-01-27 10:34 ` [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() fdmanana
@ 2021-01-27 10:35 ` fdmanana
  2021-01-27 15:26   ` Josef Bacik
  2021-01-27 15:42 ` [PATCH 0/7] btrfs: more performance improvements for dbench workloads Josef Bacik
  2021-02-01 21:56 ` David Sterba
  8 siblings, 1 reply; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:35 UTC (permalink / raw)
  To: linux-btrfs

From: Filipe Manana <fdmanana@suse.com>

Often an fsync needs to fallback to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).

In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fallback to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish the transaction commit - however
they wait until the full transaction commit happens, which is not needed,
as they only need to wait for the superblocks to be persisted and not for
unpinning all the extents pinned during the transaction's lifetime, which
even for short lived transactions can be a few thousand and take some
significant amount of time to complete - for dbench workloads I have
observed up to 4~5 milliseconds of time spent unpinning extents in the
worst cases, and the number of pinned extents was between 2 to 3 thousand.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).

This patch is part of a patchset comprised of the following patches:

  btrfs: remove unnecessary directory inode item update when deleting dir entry
  btrfs: stop setting nbytes when filling inode item for logging
  btrfs: avoid logging new ancestor inodes when logging new inode
  btrfs: skip logging directories already logged when logging all parents
  btrfs: skip logging inodes already logged when logging new entries
  btrfs: remove unnecessary check_parent_dirs_for_sync()
  btrfs: make concurrent fsyncs wait less when waiting for a transaction commit

After applying the entire patchset, dbench shows improvements in respect
to throughput and latency. The script used to measure it is the following:

  $ cat dbench-test.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-m single -d single"

  echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  umount $DEV &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  dbench -D $MNT -t 300 64

  umount $MNT

The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).

Before applying patchset, 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9627107     0.153    61.938
 Close        7072076     0.001     3.175
 Rename        407633     1.222    44.439
 Unlink       1943895     0.658    44.440
 Deltree          256    17.339   110.891
 Mkdir            128     0.003     0.009
 Qpathinfo    8725406     0.064    17.850
 Qfileinfo    1529516     0.001     2.188
 Qfsinfo      1599884     0.002     1.457
 Sfileinfo     784200     0.005     3.562
 Find         3373513     0.411    30.312
 WriteX       4802132     0.053    29.054
 ReadX        15089959     0.002     5.801
 LockX          31344     0.002     0.425
 UnlockX        31344     0.001     0.173
 Flush         674724     5.952   341.830

Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms

After applying patchset, 32 clients:

After patchset, with 32 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9931568     0.111    25.597
 Close        7295730     0.001     2.171
 Rename        420549     0.982    49.714
 Unlink       2005366     0.497    39.015
 Deltree          256    11.149    89.242
 Mkdir            128     0.002     0.014
 Qpathinfo    9001863     0.049    20.761
 Qfileinfo    1577730     0.001     2.546
 Qfsinfo      1650508     0.002     3.531
 Sfileinfo     809031     0.005     5.846
 Find         3480259     0.309    23.977
 WriteX       4952505     0.043    41.283
 ReadX        15568127     0.002     5.476
 LockX          32338     0.002     0.978
 UnlockX        32338     0.001     2.032
 Flush         696017     7.485   228.835

Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms

 --> +4.1% throughput, -39.6% max latency

Before applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    8956748     0.342   108.312
 Close        6579660     0.001     3.823
 Rename        379209     2.396    81.897
 Unlink       1808625     1.108   131.148
 Deltree          256    25.632   172.176
 Mkdir            128     0.003     0.018
 Qpathinfo    8117615     0.131    55.916
 Qfileinfo    1423495     0.001     2.635
 Qfsinfo      1488496     0.002     5.412
 Sfileinfo     729472     0.007     8.643
 Find         3138598     0.855    78.321
 WriteX       4470783     0.102    79.442
 ReadX        14038139     0.002     7.578
 LockX          29158     0.002     0.844
 UnlockX        29158     0.001     0.567
 Flush         627746    14.168   506.151

Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms

After applying patchset, 64 clients:

 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX    9069003     0.303    43.193
 Close        6662328     0.001     3.888
 Rename        383976     2.194    46.418
 Unlink       1831080     1.022    43.873
 Deltree          256    24.037   155.763
 Mkdir            128     0.002     0.005
 Qpathinfo    8219173     0.137    30.233
 Qfileinfo    1441203     0.001     3.204
 Qfsinfo      1507092     0.002     4.055
 Sfileinfo     738775     0.006     5.431
 Find         3177874     0.936    38.170
 WriteX       4526152     0.084    39.518
 ReadX        14213562     0.002    24.760
 LockX          29522     0.002     1.221
 UnlockX        29522     0.001     0.694
 Flush         635652    14.358   422.039

Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms

 --> +6.8% throughput, -18.1% max latency

Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c        |  1 +
 fs/btrfs/transaction.c | 39 +++++++++++++++++++++++++++++++--------
 fs/btrfs/transaction.h |  2 ++
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d81ae1f518f2..be5350f5bedf 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2238,6 +2238,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		ret = PTR_ERR(trans);
 		goto out_release_extents;
 	}
+	trans->in_fsync = true;
 
 	ret = btrfs_log_dentry_safe(trans, dentry, &ctx);
 	btrfs_release_log_ctx_extents(&ctx);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3bcb5444536e..ff8efa6f8986 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -107,6 +107,11 @@ static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
 					   __TRANS_JOIN |
 					   __TRANS_JOIN_NOLOCK |
 					   __TRANS_JOIN_NOSTART),
+	[TRANS_STATE_SUPER_COMMITTED]	= (__TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN |
+					   __TRANS_JOIN_NOLOCK |
+					   __TRANS_JOIN_NOSTART),
 	[TRANS_STATE_COMPLETED]		= (__TRANS_START |
 					   __TRANS_ATTACH |
 					   __TRANS_JOIN |
@@ -826,10 +831,11 @@ btrfs_attach_transaction_barrier(struct btrfs_root *root)
 	return trans;
 }
 
-/* wait for a transaction commit to be fully complete */
-static noinline void wait_for_commit(struct btrfs_transaction *commit)
+/* Wait for a transaction commit to reach at least the given state. */
+static noinline void wait_for_commit(struct btrfs_transaction *commit,
+				     const enum btrfs_trans_state min_state)
 {
-	wait_event(commit->commit_wait, commit->state == TRANS_STATE_COMPLETED);
+	wait_event(commit->commit_wait, commit->state >= min_state);
 }
 
 int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
@@ -884,7 +890,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
 			goto out;  /* nothing committing|committed */
 	}
 
-	wait_for_commit(cur_trans);
+	wait_for_commit(cur_trans, TRANS_STATE_COMPLETED);
 	btrfs_put_transaction(cur_trans);
 out:
 	return ret;
@@ -2102,11 +2108,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	spin_lock(&fs_info->trans_lock);
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
+		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
+
 		spin_unlock(&fs_info->trans_lock);
 		refcount_inc(&cur_trans->use_count);
-		ret = btrfs_end_transaction(trans);
 
-		wait_for_commit(cur_trans);
+		if (trans->in_fsync)
+			want_state = TRANS_STATE_SUPER_COMMITTED;
+		ret = btrfs_end_transaction(trans);
+		wait_for_commit(cur_trans, want_state);
 
 		if (TRANS_ABORTED(cur_trans))
 			ret = cur_trans->aborted;
@@ -2120,13 +2130,19 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	wake_up(&fs_info->transaction_blocked_wait);
 
 	if (cur_trans->list.prev != &fs_info->trans_list) {
+		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
+
+		if (trans->in_fsync)
+			want_state = TRANS_STATE_SUPER_COMMITTED;
+
 		prev_trans = list_entry(cur_trans->list.prev,
 					struct btrfs_transaction, list);
-		if (prev_trans->state != TRANS_STATE_COMPLETED) {
+		if (prev_trans->state < want_state) {
 			refcount_inc(&prev_trans->use_count);
 			spin_unlock(&fs_info->trans_lock);
 
-			wait_for_commit(prev_trans);
+			wait_for_commit(prev_trans, want_state);
+
 			ret = READ_ONCE(prev_trans->aborted);
 
 			btrfs_put_transaction(prev_trans);
@@ -2345,6 +2361,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (ret)
 		goto scrub_continue;
 
+	/*
+	 * We needn't acquire the lock here because there is no other task
+	 * which can change it.
+	 */
+	cur_trans->state = TRANS_STATE_SUPER_COMMITTED;
+	wake_up(&cur_trans->commit_wait);
+
 	btrfs_finish_extent_commit(trans);
 
 	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &cur_trans->flags))
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 31ca81bad822..935bd6958a8a 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -16,6 +16,7 @@ enum btrfs_trans_state {
 	TRANS_STATE_COMMIT_START,
 	TRANS_STATE_COMMIT_DOING,
 	TRANS_STATE_UNBLOCKED,
+	TRANS_STATE_SUPER_COMMITTED,
 	TRANS_STATE_COMPLETED,
 	TRANS_STATE_MAX,
 };
@@ -133,6 +134,7 @@ struct btrfs_trans_handle {
 	bool can_flush_pending_bgs;
 	bool reloc_reserved;
 	bool dirty;
+	bool in_fsync;
 	struct btrfs_root *root;
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
  2021-01-27 10:35 ` [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fdmanana
@ 2021-01-27 15:26   ` Josef Bacik
  0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:26 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 1/27/21 5:35 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> Often an fsync needs to fallback to a transaction commit for several
> reasons (to ensure consistency after a power failure, a new block group
> was allocated or a temporary error such as ENOMEM or ENOSPC happened).
> 
> In that case the log is marked as needing a full commit and any concurrent
> tasks attempting to log inodes or commit the log will also fallback to the
> transaction commit. When this happens they all wait for the task that first
> started the transaction commit to finish the transaction commit - however
> they wait until the full transaction commit happens, which is not needed,
> as they only need to wait for the superblocks to be persisted and not for
> unpinning all the extents pinned during the transaction's lifetime, which
> even for short lived transactions can be a few thousand and take some
> significant amount of time to complete - for dbench workloads I have
> observed up to 4~5 milliseconds of time spent unpinning extents in the
> worst cases, and the number of pinned extents was between 2 to 3 thousand.
> 
> So allow fsync tasks to skip waiting for the unpinning of extents when
> they call btrfs_commit_transaction() and they were not the task that
> started the transaction commit (that one has to do it, the alternative
> would be to offload the transaction commit to another task so that it
> could avoid waiting for the extent unpinning or offload the extent
> unpinning to another task).
> 
> This patch is part of a patchset comprised of the following patches:
> 
>    btrfs: remove unnecessary directory inode item update when deleting dir entry
>    btrfs: stop setting nbytes when filling inode item for logging
>    btrfs: avoid logging new ancestor inodes when logging new inode
>    btrfs: skip logging directories already logged when logging all parents
>    btrfs: skip logging inodes already logged when logging new entries
>    btrfs: remove unnecessary check_parent_dirs_for_sync()
>    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> 
> After applying the entire patchset, dbench shows improvements in respect
> to throughput and latency. The script used to measure it is the following:
> 
>    $ cat dbench-test.sh
>    #!/bin/bash
> 
>    DEV=/dev/sdk
>    MNT=/mnt/sdk
>    MOUNT_OPTIONS="-o ssd"
>    MKFS_OPTIONS="-m single -d single"
> 
>    echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> 
>    umount $DEV &> /dev/null
>    mkfs.btrfs -f $MKFS_OPTIONS $DEV
>    mount $MOUNT_OPTIONS $DEV $MNT
> 
>    dbench -D $MNT -t 300 64
> 
>    umount $MNT
> 
> The test was run on a physical machine with 12 cores (Intel corei7), 64G
> of ram, using a NVMe device and a non-debug kernel configuration (Debian's
> default configuration).
> 
> Before applying patchset, 32 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    9627107     0.153    61.938
>   Close        7072076     0.001     3.175
>   Rename        407633     1.222    44.439
>   Unlink       1943895     0.658    44.440
>   Deltree          256    17.339   110.891
>   Mkdir            128     0.003     0.009
>   Qpathinfo    8725406     0.064    17.850
>   Qfileinfo    1529516     0.001     2.188
>   Qfsinfo      1599884     0.002     1.457
>   Sfileinfo     784200     0.005     3.562
>   Find         3373513     0.411    30.312
>   WriteX       4802132     0.053    29.054
>   ReadX        15089959     0.002     5.801
>   LockX          31344     0.002     0.425
>   UnlockX        31344     0.001     0.173
>   Flush         674724     5.952   341.830
> 
> Throughput 1008.02 MB/sec  32 clients  32 procs  max_latency=341.833 ms
> 
> After applying patchset, 32 clients:
> 
> After patchset, with 32 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    9931568     0.111    25.597
>   Close        7295730     0.001     2.171
>   Rename        420549     0.982    49.714
>   Unlink       2005366     0.497    39.015
>   Deltree          256    11.149    89.242
>   Mkdir            128     0.002     0.014
>   Qpathinfo    9001863     0.049    20.761
>   Qfileinfo    1577730     0.001     2.546
>   Qfsinfo      1650508     0.002     3.531
>   Sfileinfo     809031     0.005     5.846
>   Find         3480259     0.309    23.977
>   WriteX       4952505     0.043    41.283
>   ReadX        15568127     0.002     5.476
>   LockX          32338     0.002     0.978
>   UnlockX        32338     0.001     2.032
>   Flush         696017     7.485   228.835
> 
> Throughput 1049.91 MB/sec  32 clients  32 procs  max_latency=228.847 ms
> 
>   --> +4.1% throughput, -39.6% max latency
> 
> Before applying patchset, 64 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    8956748     0.342   108.312
>   Close        6579660     0.001     3.823
>   Rename        379209     2.396    81.897
>   Unlink       1808625     1.108   131.148
>   Deltree          256    25.632   172.176
>   Mkdir            128     0.003     0.018
>   Qpathinfo    8117615     0.131    55.916
>   Qfileinfo    1423495     0.001     2.635
>   Qfsinfo      1488496     0.002     5.412
>   Sfileinfo     729472     0.007     8.643
>   Find         3138598     0.855    78.321
>   WriteX       4470783     0.102    79.442
>   ReadX        14038139     0.002     7.578
>   LockX          29158     0.002     0.844
>   UnlockX        29158     0.001     0.567
>   Flush         627746    14.168   506.151
> 
> Throughput 924.738 MB/sec  64 clients  64 procs  max_latency=506.154 ms
> 
> After applying patchset, 64 clients:
> 
>   Operation      Count    AvgLat    MaxLat
>   ----------------------------------------
>   NTCreateX    9069003     0.303    43.193
>   Close        6662328     0.001     3.888
>   Rename        383976     2.194    46.418
>   Unlink       1831080     1.022    43.873
>   Deltree          256    24.037   155.763
>   Mkdir            128     0.002     0.005
>   Qpathinfo    8219173     0.137    30.233
>   Qfileinfo    1441203     0.001     3.204
>   Qfsinfo      1507092     0.002     4.055
>   Sfileinfo     738775     0.006     5.431
>   Find         3177874     0.936    38.170
>   WriteX       4526152     0.084    39.518
>   ReadX        14213562     0.002    24.760
>   LockX          29522     0.002     1.221
>   UnlockX        29522     0.001     0.694
>   Flush         635652    14.358   422.039
> 
> Throughput 990.13 MB/sec  64 clients  64 procs  max_latency=422.043 ms
> 
>   --> +6.8% throughput, -18.1% max latency
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Funnily enough I've got a patch to async off unpinning as it drastically affects 
our WhatsApp workload.  Once that lands maybe we can undo this bit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] btrfs: more performance improvements for dbench workloads
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
                   ` (6 preceding siblings ...)
  2021-01-27 10:35 ` [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fdmanana
@ 2021-01-27 15:42 ` Josef Bacik
  2021-02-01 21:56 ` David Sterba
  8 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:42 UTC (permalink / raw)
  To: fdmanana, linux-btrfs

On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The following patchset brings one more batch of performance improvements
> with dbench workloads, or anything that mixes file creation, file writes,
> renames, unlinks, etc with fsync like dbench does. This patchset is mostly
> based on avoiding logging directory inodes multiple times when not necessary,
> falling back to transaction commits less frequently and often waiting less
> time for transaction commits to complete. Performance results are listed in
> the change log of the last patch, but in short, I've experienced a reduction
> of maximum latency up to about -40% and throuhput gains up to about +6%.
> 
> Filipe Manana (7):
>    btrfs: remove unnecessary directory inode item update when deleting dir entry
>    btrfs: stop setting nbytes when filling inode item for logging
>    btrfs: avoid logging new ancestor inodes when logging new inode
>    btrfs: skip logging directories already logged when logging all parents
>    btrfs: skip logging inodes already logged when logging new entries
>    btrfs: remove unnecessary check_parent_dirs_for_sync()
>    btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> 
>   fs/btrfs/file.c        |   1 +
>   fs/btrfs/transaction.c |  39 +++++++--
>   fs/btrfs/transaction.h |   2 +
>   fs/btrfs/tree-log.c    | 195 ++++++++++++-----------------------------
>   4 files changed, 92 insertions(+), 145 deletions(-)
> 

You can add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

to the whole series, thanks,

Josef

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/7] btrfs: more performance improvements for dbench workloads
  2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
                   ` (7 preceding siblings ...)
  2021-01-27 15:42 ` [PATCH 0/7] btrfs: more performance improvements for dbench workloads Josef Bacik
@ 2021-02-01 21:56 ` David Sterba
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2021-02-01 21:56 UTC (permalink / raw)
  To: fdmanana; +Cc: linux-btrfs

On Wed, Jan 27, 2021 at 10:34:53AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> The following patchset brings one more batch of performance improvements
> with dbench workloads, or anything that mixes file creation, file writes,
> renames, unlinks, etc with fsync like dbench does. This patchset is mostly
> based on avoiding logging directory inodes multiple times when not necessary,
> falling back to transaction commits less frequently and often waiting less
> time for transaction commits to complete. Performance results are listed in
> the change log of the last patch, but in short, I've experienced a reduction
> of maximum latency up to about -40% and throuhput gains up to about +6%.

Nice, added to misc-next, thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-02-01 21:58 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
2021-01-27 10:34 ` [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry fdmanana
2021-01-27 10:34 ` [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging fdmanana
2021-01-27 10:34 ` [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode fdmanana
2021-01-27 10:34 ` [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents fdmanana
2021-01-27 10:34 ` [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries fdmanana
2021-01-27 10:34 ` [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() fdmanana
2021-01-27 15:23   ` Josef Bacik
2021-01-27 15:36     ` Filipe Manana
2021-01-27 15:42       ` Josef Bacik
2021-01-27 10:35 ` [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fdmanana
2021-01-27 15:26   ` Josef Bacik
2021-01-27 15:42 ` [PATCH 0/7] btrfs: more performance improvements for dbench workloads Josef Bacik
2021-02-01 21:56 ` David Sterba

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).