* [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
@ 2021-01-27 10:34 ` fdmanana
2021-01-27 10:34 ` [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging fdmanana
` (7 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When we remove a directory entry, as part of an unlink operation, if the
directory was logged before we must remove the directory index items from
the log. We are also updating the inode item of the directory to update
its i_size, but that is not necessary because during log replay we do not
need it and we correctly adjust the i_size in the inode item of the
subvolume as we process directory index items and replay deletes.
This is not needed since commit d555438b6e1dad ("Btrfs: drop dir i_size
when adding new names on replay"), where we explicitly ignore the i_size
of directory inode items on log replay. Before that we used it but it
was buggy, as mentioned in that commit's change log (i_size got a larger
value than it should have).
So stop updating the i_size of the directory inode item in the log, as
that is a waste of time, adds more contention on the log tree and
often results in COWing more extent buffers for the log tree.
This code path is triggered often during dbench workloads for example.
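To illustrate why the logged i_size is redundant, here is a minimal userspace sketch (not kernel code) of the replay behaviour described above: the size of the directory is rebuilt from the names that are replayed, and the i_size stored in the logged inode item is never consulted. All model_* names are hypothetical, and counting each name as twice its length (one directory item plus one directory index item) is an assumption made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical kinds of directory changes found in a log. */
enum model_op { MODEL_ADD_NAME, MODEL_DEL_NAME };

/* Hypothetical, simplified stand-in for a logged directory entry change. */
struct model_log_entry {
	enum model_op op;
	const char *name;
};

/*
 * Toy model of replaying a directory. The starting point is the i_size
 * already present in the subvolume tree (0 for a directory that does not
 * exist there yet). The i_size found in the logged inode item is
 * deliberately ignored; only the replayed names adjust the size.
 */
static uint64_t model_replay_dir_size(uint64_t subvol_i_size,
				      uint64_t logged_i_size,
				      const struct model_log_entry *entries,
				      size_t nr)
{
	uint64_t size = subvol_i_size;
	size_t i;

	(void)logged_i_size;	/* never used, which is the whole point */
	assert(entries != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		/* Assumption: a name accounts for twice its length. */
		uint64_t bytes = 2 * (uint64_t)strlen(entries[i].name);

		if (entries[i].op == MODEL_ADD_NAME)
			size += bytes;
		else
			size = size > bytes ? size - bytes : 0;
	}
	return size;
}
```

Calling model_replay_dir_size() with different logged_i_size values and the same entries gives the same result, which is why keeping that field up to date in the log buys nothing.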
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 39 ++++-----------------------------------
1 file changed, 4 insertions(+), 35 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 8ee0700a980f..5d87afc6058a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3379,7 +3379,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
struct btrfs_path *path;
int ret;
int err = 0;
- int bytes_del = 0;
u64 dir_ino = btrfs_ino(dir);
if (!inode_logged(trans, dir))
@@ -3406,7 +3405,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
}
if (di) {
ret = btrfs_delete_one_dir_name(trans, log, path, di);
- bytes_del += name_len;
if (ret) {
err = ret;
goto fail;
@@ -3421,46 +3419,17 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans,
}
if (di) {
ret = btrfs_delete_one_dir_name(trans, log, path, di);
- bytes_del += name_len;
if (ret) {
err = ret;
goto fail;
}
}
- /* update the directory size in the log to reflect the names
- * we have removed
+ /*
+ * We do not need to update the size field of the directory's inode item
+ * because on log replay we update the field to reflect all existing
+ * entries in the directory (see overwrite_item()).
*/
- if (bytes_del) {
- struct btrfs_key key;
-
- key.objectid = dir_ino;
- key.offset = 0;
- key.type = BTRFS_INODE_ITEM_KEY;
- btrfs_release_path(path);
-
- ret = btrfs_search_slot(trans, log, &key, path, 0, 1);
- if (ret < 0) {
- err = ret;
- goto fail;
- }
- if (ret == 0) {
- struct btrfs_inode_item *item;
- u64 i_size;
-
- item = btrfs_item_ptr(path->nodes[0], path->slots[0],
- struct btrfs_inode_item);
- i_size = btrfs_inode_size(path->nodes[0], item);
- if (i_size > bytes_del)
- i_size -= bytes_del;
- else
- i_size = 0;
- btrfs_set_inode_size(path->nodes[0], item, i_size);
- btrfs_mark_buffer_dirty(path->nodes[0]);
- } else
- ret = 0;
- btrfs_release_path(path);
- }
fail:
btrfs_free_path(path);
out_unlock:
--
2.28.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
2021-01-27 10:34 ` [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry fdmanana
@ 2021-01-27 10:34 ` fdmanana
2021-01-27 10:34 ` [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode fdmanana
` (6 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When we fill an inode item for logging we are setting its nbytes field
with the value returned by inode_get_bytes() (a VFS API), however we do
not need it because it is not used during log replay. In fact, for fast
fsyncs, when we call inode_get_bytes() we may even get an outdated value
for nbytes because the nbytes field of the inode is only updated when
ordered extents complete, and a fast fsync only waits for writeback to
complete, it does not wait for ordered extent completion.
So just remove the setup of nbytes and add an explicit comment mentioning
why we do not set it. This also avoids adding contention on the inode's
i_lock (VFS) with concurrent stat() calls, since that spinlock is used by
inode_get_bytes() which is also called by our stat callback
(btrfs_getattr()).
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5d87afc6058a..be62759f0aac 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3858,7 +3858,14 @@ static void fill_inode_item(struct btrfs_trans_handle *trans,
btrfs_set_token_timespec_nsec(&token, &item->ctime,
inode->i_ctime.tv_nsec);
- btrfs_set_token_inode_nbytes(&token, item, inode_get_bytes(inode));
+ /*
+ * We do not need to set the nbytes field, in fact during a fast fsync
+ * its value may not even be correct, since a fast fsync does not wait
+ * for ordered extent completion, which is where we update nbytes, it
+ * only waits for writeback to complete. During log replay as we find
+ * file extent items and replay them, we adjust the nbytes field of the
+ * inode item in subvolume tree as needed (see overwrite_item()).
+ */
btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode));
btrfs_set_token_inode_transid(&token, item, trans->transid);
--
2.28.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
2021-01-27 10:34 ` [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry fdmanana
2021-01-27 10:34 ` [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging fdmanana
@ 2021-01-27 10:34 ` fdmanana
2021-01-27 10:34 ` [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents fdmanana
` (5 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When we fsync a new file, created in the current transaction, we check
all its ancestor inodes and always log them if they were created in the
current transaction - even if we have already logged them before, which
is a waste of time.
So avoid logging new ancestor inodes if they were already logged before
and have no xattrs added/updated/removed since they were last logged.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 35 +++++++++++++++++++++++++++++++++--
1 file changed, 33 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index be62759f0aac..105cf316ee27 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5272,6 +5272,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
if (S_ISDIR(inode->vfs_inode.i_mode)) {
int max_key_type = BTRFS_DIR_LOG_INDEX_KEY;
+ clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags);
if (inode_only == LOG_INODE_EXISTS)
max_key_type = BTRFS_XATTR_ITEM_KEY;
ret = drop_objectid_items(trans, log, path, ino, max_key_type);
@@ -5520,6 +5521,34 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
return ret;
}
+/*
+ * Check if we need to log an inode. This is used in contexts where while
+ * logging an inode we need to log another inode (either that it exists or in
+ * full mode). This is used instead of btrfs_inode_in_log() because the latter
+ * requires the inode to be in the log and have the log transaction committed,
+ * while here we do not care if the log transaction was already committed - our
+ * caller will commit the log later - and we want to avoid logging an inode
+ * multiple times when multiple tasks have joined the same log transaction.
+ */
+static bool need_log_inode(struct btrfs_trans_handle *trans,
+ struct btrfs_inode *inode)
+{
+ /*
+ * If this inode does not have new/updated/deleted xattrs since the last
+ * time it was logged and is flagged as logged in the current transaction,
+ * we can skip logging it. As for new/deleted names, those are updated in
+ * the log by link/unlink/rename operations.
+ * In case the inode was logged and then evicted and reloaded, its
+ * logged_trans will be 0, in which case we have to fully log it since
+ * logged_trans is a transient field, not persisted.
+ */
+ if (inode->logged_trans == trans->transid &&
+ !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags))
+ return false;
+
+ return true;
+}
+
struct btrfs_dir_list {
u64 ino;
struct list_head list;
@@ -5848,7 +5877,8 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
if (IS_ERR(inode))
return PTR_ERR(inode);
- if (BTRFS_I(inode)->generation >= trans->transid)
+ if (BTRFS_I(inode)->generation >= trans->transid &&
+ need_log_inode(trans, BTRFS_I(inode)))
ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
LOG_INODE_EXISTS, ctx);
btrfs_add_delayed_iput(inode);
@@ -5902,7 +5932,8 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
if (root != inode->root)
break;
- if (inode->generation >= trans->transid) {
+ if (inode->generation >= trans->transid &&
+ need_log_inode(trans, inode)) {
ret = btrfs_log_inode(trans, root, inode,
LOG_INODE_EXISTS, ctx);
if (ret)
--
2.28.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
` (2 preceding siblings ...)
2021-01-27 10:34 ` [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode fdmanana
@ 2021-01-27 10:34 ` fdmanana
2021-01-27 10:34 ` [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries fdmanana
` (4 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Sometimes when we fsync an inode we need to do a full log of all its
ancestors (due to unlink, link or rename operations), which can be an
expensive operation, especially if the directories are large.
However if we find an ancestor directory inode that is already logged in
the current transaction, and has no inserted/updated/deleted xattrs since
it was last logged, we can skip logging the directory again. We are safe
to skip that since we know that for logged directories, any link, unlink
or rename operation that involves the directory will update the log as
necessary.
So use the helper need_log_inode(), introduced in a previous commit, to
detect already logged directories that can be skipped.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 105cf316ee27..c0dce99c2c14 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5826,6 +5826,11 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
goto out;
}
+ if (!need_log_inode(trans, BTRFS_I(dir_inode))) {
+ btrfs_add_delayed_iput(dir_inode);
+ continue;
+ }
+
if (ctx)
ctx->log_new_dentries = false;
ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode),
--
2.28.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
` (3 preceding siblings ...)
2021-01-27 10:34 ` [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents fdmanana
@ 2021-01-27 10:34 ` fdmanana
2021-01-27 10:34 ` [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() fdmanana
` (3 subsequent siblings)
8 siblings, 0 replies; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging new directory entries of a directory, we log the inodes of
new dentries and the inodes of dentries pointing to directories that
may have been created in past transactions. For the case of directories
we log in full mode, which can be particularly expensive for large
directories.
We do use btrfs_inode_in_log() to skip already logged inodes, however for
that helper to return true, the log transaction used to log the inode
must already be committed. This means that when we have more
than one task using the same log transaction we can end up logging an
inode multiple times, which is a waste of time and not necessary since
the log will be committed by one of the tasks and the others will wait for
the log transaction to be committed before returning to user space.
So simply replace the use of btrfs_inode_in_log() with the new helper
function need_log_inode(), introduced in a previous commit.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c0dce99c2c14..6dc376a16cf2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5676,7 +5676,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
goto next_dir_inode;
}
- if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) {
+ if (!need_log_inode(trans, BTRFS_I(di_inode))) {
btrfs_add_delayed_iput(di_inode);
break;
}
--
2.28.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
` (4 preceding siblings ...)
2021-01-27 10:34 ` [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries fdmanana
@ 2021-01-27 10:34 ` fdmanana
2021-01-27 15:23 ` Josef Bacik
2021-01-27 10:35 ` [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fdmanana
` (2 subsequent siblings)
8 siblings, 1 reply; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:34 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Whenever we fsync an inode, if it is a directory, a regular file that was
created in the current transaction or has last_unlink_trans set to the
generation of the current transaction, we check if any of its ancestor
inodes (and the inode itself if it is a directory) cannot be logged and
needs a fallback to a full transaction commit - if so, we return with a
value of 1 in order to fall back to a transaction commit.
However we often do not need to fall back to a transaction commit because:
1) The ancestor inode is not an immediate parent, and therefore there is
not an explicit request to log it and it is needed neither to
guarantee the consistency of the inode originally asked to be logged
(fsynced) nor that of its immediate parent;
2) The ancestor inode was already logged before, in which case any link,
unlink or rename operation updates the log as needed.
So for these two cases we can avoid an unnecessary transaction commit.
Therefore remove check_parent_dirs_for_sync() and add a check at the top
of btrfs_log_inode() to make us fall back immediately to a transaction
commit when we are logging a directory inode that cannot be logged and
needs a full transaction commit. All we need to protect is the case where
after renaming a file someone fsyncs only the old directory, which would
result in losing the renamed file after a log replay.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 121 ++++++--------------------------------------
1 file changed, 15 insertions(+), 106 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6dc376a16cf2..4c7b283ed2b2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5265,6 +5265,21 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
mutex_lock(&inode->log_mutex);
}
+ /*
+ * This is for cases where logging a directory could result in losing a
+ * file after replaying the log. For example, if we move a file from a
+ * directory A to a directory B, then fsync directory A, we have no way
+ * to know the file was moved from A to B, so logging just A would
+ * result in losing the file after a log replay.
+ */
+ if (S_ISDIR(inode->vfs_inode.i_mode) &&
+ inode_only == LOG_INODE_ALL &&
+ inode->last_unlink_trans >= trans->transid) {
+ btrfs_set_log_full_commit(trans);
+ err = 1;
+ goto out_unlock;
+ }
+
/*
* a brute force approach to making sure we get the most uptodate
* copies of everything.
@@ -5428,99 +5443,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
return err;
}
-/*
- * Check if we must fallback to a transaction commit when logging an inode.
- * This must be called after logging the inode and is used only in the context
- * when fsyncing an inode requires the need to log some other inode - in which
- * case we can't lock the i_mutex of each other inode we need to log as that
- * can lead to deadlocks with concurrent fsync against other inodes (as we can
- * log inodes up or down in the hierarchy) or rename operations for example. So
- * we take the log_mutex of the inode after we have logged it and then check for
- * its last_unlink_trans value - this is safe because any task setting
- * last_unlink_trans must take the log_mutex and it must do this before it does
- * the actual unlink operation, so if we do this check before a concurrent task
- * sets last_unlink_trans it means we've logged a consistent version/state of
- * all the inode items, otherwise we are not sure and must do a transaction
- * commit (the concurrent task might have only updated last_unlink_trans before
- * we logged the inode or it might have also done the unlink).
- */
-static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans,
- struct btrfs_inode *inode)
-{
- bool ret = false;
-
- mutex_lock(&inode->log_mutex);
- if (inode->last_unlink_trans >= trans->transid) {
- /*
- * Make sure any commits to the log are forced to be full
- * commits.
- */
- btrfs_set_log_full_commit(trans);
- ret = true;
- }
- mutex_unlock(&inode->log_mutex);
-
- return ret;
-}
-
-/*
- * follow the dentry parent pointers up the chain and see if any
- * of the directories in it require a full commit before they can
- * be logged. Returns zero if nothing special needs to be done or 1 if
- * a full commit is required.
- */
-static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
- struct btrfs_inode *inode,
- struct dentry *parent,
- struct super_block *sb)
-{
- int ret = 0;
- struct dentry *old_parent = NULL;
-
- /*
- * for regular files, if its inode is already on disk, we don't
- * have to worry about the parents at all. This is because
- * we can use the last_unlink_trans field to record renames
- * and other fun in this file.
- */
- if (S_ISREG(inode->vfs_inode.i_mode) &&
- inode->generation < trans->transid &&
- inode->last_unlink_trans < trans->transid)
- goto out;
-
- if (!S_ISDIR(inode->vfs_inode.i_mode)) {
- if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
- goto out;
- inode = BTRFS_I(d_inode(parent));
- }
-
- while (1) {
- if (btrfs_must_commit_transaction(trans, inode)) {
- ret = 1;
- break;
- }
-
- if (!parent || d_really_is_negative(parent) || sb != parent->d_sb)
- break;
-
- if (IS_ROOT(parent)) {
- inode = BTRFS_I(d_inode(parent));
- if (btrfs_must_commit_transaction(trans, inode))
- ret = 1;
- break;
- }
-
- parent = dget_parent(parent);
- dput(old_parent);
- old_parent = parent;
- inode = BTRFS_I(d_inode(parent));
-
- }
- dput(old_parent);
-out:
- return ret;
-}
-
/*
* Check if we need to log an inode. This is used in contexts where while
* logging an inode we need to log another inode (either that it exists or in
@@ -5686,9 +5608,6 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans,
log_mode = LOG_INODE_ALL;
ret = btrfs_log_inode(trans, root, BTRFS_I(di_inode),
log_mode, ctx);
- if (!ret &&
- btrfs_must_commit_transaction(trans, BTRFS_I(di_inode)))
- ret = 1;
btrfs_add_delayed_iput(di_inode);
if (ret)
goto next_dir_inode;
@@ -5835,9 +5754,6 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans,
ctx->log_new_dentries = false;
ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode),
LOG_INODE_ALL, ctx);
- if (!ret &&
- btrfs_must_commit_transaction(trans, BTRFS_I(dir_inode)))
- ret = 1;
if (!ret && ctx && ctx->log_new_dentries)
ret = log_new_dir_dentries(trans, root,
BTRFS_I(dir_inode), ctx);
@@ -6053,12 +5969,9 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
{
struct btrfs_root *root = inode->root;
struct btrfs_fs_info *fs_info = root->fs_info;
- struct super_block *sb;
int ret = 0;
bool log_dentries = false;
- sb = inode->vfs_inode.i_sb;
-
if (btrfs_test_opt(fs_info, NOTREELOG)) {
ret = 1;
goto end_no_trans;
@@ -6069,10 +5982,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
goto end_no_trans;
}
- ret = check_parent_dirs_for_sync(trans, inode, parent, sb);
- if (ret)
- goto end_no_trans;
-
/*
* Skip already logged inodes or inodes corresponding to tmpfiles
* (since logging them is pointless, a link count of 0 means they
--
2.28.0
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
2021-01-27 10:34 ` [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() fdmanana
@ 2021-01-27 15:23 ` Josef Bacik
2021-01-27 15:36 ` Filipe Manana
0 siblings, 1 reply; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:23 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Whenever we fsync an inode, if it is a directory, a regular file that was
> created in the current transaction or has last_unlink_trans set to the
> generation of the current transaction, we check if any of its ancestor
> inodes (and the inode itself if it is a directory) can not be logged and
> need a fallback to a full transaction commit - if so, we return with a
> value of 1 in order to fallback to a transaction commit.
>
> However we often do not need to fallback to a transaction commit because:
>
> 1) The ancestor inode is not an immediate parent, and therefore there is
> not an explicit request to log it and it is not needed neither to
> guarantee the consistency of the inode originally asked to be logged
> (fsynced) nor its immediate parent;
>
> 2) The ancestor inode was already logged before, in which case any link,
> unlink or rename operation updates the log as needed.
>
> So for these two cases we can avoid an unnecessary transaction commit.
> Therefore remove check_parent_dirs_for_sync() and add a check at the top
> of btrfs_log_inode() to make us fallback immediately to a transaction
> commit when we are logging a directory inode that can not be logged and
> needs a full transaction commit. All we need to protect is the case where
> after renaming a file someone fsyncs only the old directory, which would
> result is losing the renamed file after a log replay.
>
> This patch is part of a patchset comprised of the following patches:
>
> btrfs: remove unnecessary directory inode item update when deleting dir entry
> btrfs: stop setting nbytes when filling inode item for logging
> btrfs: avoid logging new ancestor inodes when logging new inode
> btrfs: skip logging directories already logged when logging all parents
> btrfs: skip logging inodes already logged when logging new entries
> btrfs: remove unnecessary check_parent_dirs_for_sync()
> btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
>
> Performance results, after applying all patches, are mentioned in the
> change log of the last patch.
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
I'm having a hard time with this one.
Previously we would commit the transaction if the inode was a regular file, that
was created in this current transaction, and had been renamed. Now with this
patch you're only committing the transaction if we are a directory and were
renamed ourselves. Before if you already had directories A and B and then did
something like
echo "foo" > /mnt/test/A/blah
fsync(/mnt/test/A/blah);
fsync(/mnt/test/A);
mv /mnt/test/A/blah /mnt/test/B
fsync(/mnt/test/B/blah);
we would commit the transaction on this second fsync, but with your patch we are
not. I suppose that's keeping in line with how fsync is allowed to work, but
it's definitely a change in behavior from what we used to do. Not sure if
that's good or not, I'll have to think about it. Thanks,
Josef
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
2021-01-27 15:23 ` Josef Bacik
@ 2021-01-27 15:36 ` Filipe Manana
2021-01-27 15:42 ` Josef Bacik
0 siblings, 1 reply; 14+ messages in thread
From: Filipe Manana @ 2021-01-27 15:36 UTC (permalink / raw)
To: Josef Bacik; +Cc: linux-btrfs
On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik <josef@toxicpanda.com> wrote:
>
> On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Whenever we fsync an inode, if it is a directory, a regular file that was
> > created in the current transaction or has last_unlink_trans set to the
> > generation of the current transaction, we check if any of its ancestor
> > inodes (and the inode itself if it is a directory) can not be logged and
> > need a fallback to a full transaction commit - if so, we return with a
> > value of 1 in order to fallback to a transaction commit.
> >
> > However we often do not need to fallback to a transaction commit because:
> >
> > 1) The ancestor inode is not an immediate parent, and therefore there is
> > not an explicit request to log it and it is not needed neither to
> > guarantee the consistency of the inode originally asked to be logged
> > (fsynced) nor its immediate parent;
> >
> > 2) The ancestor inode was already logged before, in which case any link,
> > unlink or rename operation updates the log as needed.
> >
> > So for these two cases we can avoid an unnecessary transaction commit.
> > Therefore remove check_parent_dirs_for_sync() and add a check at the top
> > of btrfs_log_inode() to make us fallback immediately to a transaction
> > commit when we are logging a directory inode that can not be logged and
> > needs a full transaction commit. All we need to protect is the case where
> > after renaming a file someone fsyncs only the old directory, which would
> > result is losing the renamed file after a log replay.
> >
> > This patch is part of a patchset comprised of the following patches:
> >
> > btrfs: remove unnecessary directory inode item update when deleting dir entry
> > btrfs: stop setting nbytes when filling inode item for logging
> > btrfs: avoid logging new ancestor inodes when logging new inode
> > btrfs: skip logging directories already logged when logging all parents
> > btrfs: skip logging inodes already logged when logging new entries
> > btrfs: remove unnecessary check_parent_dirs_for_sync()
> > btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
> >
> > Performance results, after applying all patches, are mentioned in the
> > change log of the last patch.
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
>
> I'm having a hard time with this one.
>
> Previously we would commit the transaction if the inode was a regular file, that
> was created in this current transaction, and had been renamed. Now with this
> patch you're only committing the transaction if we are a directory and were
> renamed ourselves. Before if you already had directories A and B and then did
> something like
>
> echo "foo" > /mnt/test/A/blah
> fsync(/mnt/test/A/blah);
> fsync(/mnt/test/A);
> mv /mnt/test/A/blah /mnt/test/B
> fsync(/mnt/test/B/blah);
>
> we would commit the transaction on this second fsync, but with your patch we are
> not. I suppose that's keeping in line with how fsync is allowed to work, but
> it's definitely a change in behavior from what we used to do. Not sure if
> that's good or not, I'll have to think about it. Thanks,
Yes. Because of the rename (or a link), we will set last_unlink_trans
to the current transaction, and when logging the file that will cause
logging of all its old parents (A). That was added several years ago
to fix corruptions, and it turned out to be needed later as well to
ensure we have a behaviour similar to xfs and ext4 (and others)
regarding strictly ordered metadata updates (I added several tests to
fstests over the years for all the cases).

There's also the fact that on replay we will delete any inode refs
that aren't in the log (that one was added in commit 1f250e929a9c
("Btrfs: fix log replay failure after unlink and link combination")).

For that example we also have A updated in the log by the rename. So
we know the log is consistent.

So that's why the whole check_parent_dirs_for_sync() is not needed.

Thanks.
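For reference, the sequence from the example above can be driven from userspace with a small C helper. This is only a sketch: the function names and the fixed file name are illustrative, it assumes `base` is an existing directory on the filesystem under test, and it creates A and B itself whereas the example assumes they already exist. It says nothing by itself about what survives a crash, since checking that needs a simulated power failure followed by a log replay.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open an existing file or directory read-only, fsync it and close it. */
int fsync_path(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;
	ret = fsync(fd);
	close(fd);
	return ret;
}

/*
 * Run the create/fsync/rename/fsync sequence under the existing directory
 * "base". Returns 0 on success and -1 on the first failing step.
 */
int run_rename_fsync_scenario(const char *base)
{
	char dir_a[512], dir_b[512], old_path[512], new_path[512];
	int fd;

	/* Keep the composed paths well inside the fixed-size buffers. */
	if (strlen(base) > 400)
		return -1;
	snprintf(dir_a, sizeof(dir_a), "%s/A", base);
	snprintf(dir_b, sizeof(dir_b), "%s/B", base);
	snprintf(old_path, sizeof(old_path), "%s/A/blah", base);
	snprintf(new_path, sizeof(new_path), "%s/B/blah", base);

	if (mkdir(dir_a, 0755) || mkdir(dir_b, 0755))
		return -1;

	/* echo "foo" > A/blah; fsync(A/blah) */
	fd = open(old_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, "foo\n", 4) != 4 || fsync(fd)) {
		close(fd);
		return -1;
	}
	close(fd);

	/* fsync(A) */
	if (fsync_path(dir_a))
		return -1;

	/* mv A/blah B */
	if (rename(old_path, new_path))
		return -1;

	/* fsync(B/blah), the fsync of the renamed file */
	return fsync_path(new_path);
}
```

Calling run_rename_fsync_scenario() with a scratch directory on a btrfs mount walks through the same steps; the return value only reports whether each system call succeeded.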
>
> Josef
* Re: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync()
2021-01-27 15:36 ` Filipe Manana
@ 2021-01-27 15:42 ` Josef Bacik
0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:42 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
On 1/27/21 10:36 AM, Filipe Manana wrote:
> On Wed, Jan 27, 2021 at 3:23 PM Josef Bacik <josef@toxicpanda.com> wrote:
>>
>> On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
>>> From: Filipe Manana <fdmanana@suse.com>
>>>
>>> Whenever we fsync an inode, if it is a directory, a regular file that was
>>> created in the current transaction or has last_unlink_trans set to the
>>> generation of the current transaction, we check if any of its ancestor
>>> inodes (and the inode itself if it is a directory) cannot be logged and
>>> need a fallback to a full transaction commit - if so, we return with a
>>> value of 1 in order to fall back to a transaction commit.
>>>
>>> However we often do not need to fall back to a transaction commit because:
>>>
>>> 1) The ancestor inode is not an immediate parent, and therefore there is
>>> no explicit request to log it, and it is needed neither to guarantee
>>> the consistency of the inode originally asked to be logged (fsynced)
>>> nor that of its immediate parent;
>>>
>>> 2) The ancestor inode was already logged before, in which case any link,
>>> unlink or rename operation updates the log as needed.
>>>
>>> So for these two cases we can avoid an unnecessary transaction commit.
>>> Therefore remove check_parent_dirs_for_sync() and add a check at the top
>>> of btrfs_log_inode() to make us fall back immediately to a transaction
>>> commit when we are logging a directory inode that cannot be logged and
>>> needs a full transaction commit. All we need to protect is the case where
>>> after renaming a file someone fsyncs only the old directory, which would
>>> result in losing the renamed file after a log replay.
>>>
>>> This patch is part of a patchset comprised of the following patches:
>>>
>>> btrfs: remove unnecessary directory inode item update when deleting dir entry
>>> btrfs: stop setting nbytes when filling inode item for logging
>>> btrfs: avoid logging new ancestor inodes when logging new inode
>>> btrfs: skip logging directories already logged when logging all parents
>>> btrfs: skip logging inodes already logged when logging new entries
>>> btrfs: remove unnecessary check_parent_dirs_for_sync()
>>> btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
>>>
>>> Performance results, after applying all patches, are mentioned in the
>>> change log of the last patch.
>>>
>>> Signed-off-by: Filipe Manana <fdmanana@suse.com>
>>
>> I'm having a hard time with this one.
>>
>> Previously we would commit the transaction if the inode was a regular file, that
>> was created in this current transaction, and had been renamed. Now with this
>> patch you're only committing the transaction if we are a directory and were
>> renamed ourselves. Before if you already had directories A and B and then did
>> something like
>>
>> echo "foo" > /mnt/test/A/blah
>> fsync(/mnt/test/A/blah);
>> fsync(/mnt/test/A);
>> mv /mnt/test/A/blah /mnt/test/B
>> fsync(/mnt/test/B/blah);
>>
>> we would commit the transaction on this second fsync, but with your patch we are
>> not. I suppose that's keeping in line with how fsync is allowed to work, but
>> it's definitely a change in behavior from what we used to do. Not sure if
>> that's good or not, I'll have to think about it. Thanks,
>
> Yes. Because of the rename (or a link), we will set last_unlink_trans
> to the current transaction, and when logging the file that will cause
> logging of all its old parents (A). That was added several years ago
> to fix corruptions, and it turned out to be needed later as well to
> ensure we have a behaviour similar to xfs and ext4 (and others)
> regarding strictly ordered metadata updates (I added several tests to
> fstests over the years for all the cases).
>
> There's also the fact that on replay we will delete any inode refs
> that aren't in the log (that one was added in commit 1f250e929a9c
> ("Btrfs: fix log replay failure after unlink and link combination")).
>
> For that example we also have A updated in the log by the rename. So
> we know the log is consistent.
>
> So that's why the whole check_parent_dirs_for_sync() is not needed.
>
Ok that's reasonable, thanks,
Josef
* [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
` (5 preceding siblings ...)
2021-01-27 10:34 ` [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() fdmanana
@ 2021-01-27 10:35 ` fdmanana
2021-01-27 15:26 ` Josef Bacik
2021-01-27 15:42 ` [PATCH 0/7] btrfs: more performance improvements for dbench workloads Josef Bacik
2021-02-01 21:56 ` David Sterba
8 siblings, 1 reply; 14+ messages in thread
From: fdmanana @ 2021-01-27 10:35 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Often an fsync needs to fall back to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).

In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fall back to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish it. However, they wait until the
full transaction commit happens, which is not needed: they only need to
wait for the superblocks to be persisted, not for the unpinning of all the
extents pinned during the transaction's lifetime. Even for short lived
transactions there can be a few thousand pinned extents, and unpinning
them can take a significant amount of time. For dbench workloads I have
observed up to 4 to 5 milliseconds spent unpinning extents in the worst
cases, with between 2 and 3 thousand pinned extents.

So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it; the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning, or to offload the extent
unpinning to another task).
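The effect on a waiter can be modelled in a few lines of userspace C. This is a sketch, not the kernel code: the enum mirrors the state order used by the diff further down (the first entry, TRANS_STATE_RUNNING, is an assumption because the hunk does not show it), and wanted_state() and waiter_may_proceed() are hypothetical helpers standing in for the want_state selection and for the condition passed to wait_event().

```c
#include <stdbool.h>

/* Userspace model of the transaction state order; not kernel code. */
enum trans_state {
	TRANS_STATE_RUNNING,		/* assumed, not shown in the hunk */
	TRANS_STATE_COMMIT_START,
	TRANS_STATE_COMMIT_DOING,
	TRANS_STATE_UNBLOCKED,
	TRANS_STATE_SUPER_COMMITTED,	/* superblocks persisted */
	TRANS_STATE_COMPLETED,		/* pinned extents unpinned as well */
	TRANS_STATE_MAX,
};

/* State a task waits for: fsync callers settle for persisted superblocks. */
enum trans_state wanted_state(bool in_fsync)
{
	return in_fsync ? TRANS_STATE_SUPER_COMMITTED : TRANS_STATE_COMPLETED;
}

/* Same comparison as the wait condition: has the commit reached min_state? */
bool waiter_may_proceed(enum trans_state current, enum trans_state min_state)
{
	return current >= min_state;
}
```

With the commit at TRANS_STATE_SUPER_COMMITTED, waiter_may_proceed() is true for wanted_state(true) and false for wanted_state(false), which is the difference an fsync task sees: it resumes before the extent unpinning runs. The ordered comparison only works because the new state is placed between TRANS_STATE_UNBLOCKED and TRANS_STATE_COMPLETED in the enum.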
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
After applying the entire patchset, dbench shows improvements in both
throughput and latency. The script used to measure them is the following:
$ cat dbench-test.sh
#!/bin/bash

DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"

echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 300 64

umount $MNT
The test was run on a physical machine with 12 cores (Intel Core i7), 64G
of RAM, using an NVMe device and a non-debug kernel configuration (Debian's
default configuration).
Before applying patchset, 32 clients:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    9627107     0.153    61.938
Close        7072076     0.001     3.175
Rename        407633     1.222    44.439
Unlink       1943895     0.658    44.440
Deltree          256    17.339   110.891
Mkdir            128     0.003     0.009
Qpathinfo    8725406     0.064    17.850
Qfileinfo    1529516     0.001     2.188
Qfsinfo      1599884     0.002     1.457
Sfileinfo     784200     0.005     3.562
Find         3373513     0.411    30.312
WriteX       4802132     0.053    29.054
ReadX       15089959     0.002     5.801
LockX          31344     0.002     0.425
UnlockX        31344     0.001     0.173
Flush         674724     5.952   341.830
Throughput 1008.02 MB/sec 32 clients 32 procs max_latency=341.833 ms
After applying patchset, 32 clients:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    9931568     0.111    25.597
Close        7295730     0.001     2.171
Rename        420549     0.982    49.714
Unlink       2005366     0.497    39.015
Deltree          256    11.149    89.242
Mkdir            128     0.002     0.014
Qpathinfo    9001863     0.049    20.761
Qfileinfo    1577730     0.001     2.546
Qfsinfo      1650508     0.002     3.531
Sfileinfo     809031     0.005     5.846
Find         3480259     0.309    23.977
WriteX       4952505     0.043    41.283
ReadX       15568127     0.002     5.476
LockX          32338     0.002     0.978
UnlockX        32338     0.001     2.032
Flush         696017     7.485   228.835
Throughput 1049.91 MB/sec 32 clients 32 procs max_latency=228.847 ms
--> +4.1% throughput, -39.6% max latency
Before applying patchset, 64 clients:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    8956748     0.342   108.312
Close        6579660     0.001     3.823
Rename        379209     2.396    81.897
Unlink       1808625     1.108   131.148
Deltree          256    25.632   172.176
Mkdir            128     0.003     0.018
Qpathinfo    8117615     0.131    55.916
Qfileinfo    1423495     0.001     2.635
Qfsinfo      1488496     0.002     5.412
Sfileinfo     729472     0.007     8.643
Find         3138598     0.855    78.321
WriteX       4470783     0.102    79.442
ReadX       14038139     0.002     7.578
LockX          29158     0.002     0.844
UnlockX        29158     0.001     0.567
Flush         627746    14.168   506.151
Throughput 924.738 MB/sec 64 clients 64 procs max_latency=506.154 ms
After applying patchset, 64 clients:
Operation      Count    AvgLat    MaxLat
----------------------------------------
NTCreateX    9069003     0.303    43.193
Close        6662328     0.001     3.888
Rename        383976     2.194    46.418
Unlink       1831080     1.022    43.873
Deltree          256    24.037   155.763
Mkdir            128     0.002     0.005
Qpathinfo    8219173     0.137    30.233
Qfileinfo    1441203     0.001     3.204
Qfsinfo      1507092     0.002     4.055
Sfileinfo     738775     0.006     5.431
Find         3177874     0.936    38.170
WriteX       4526152     0.084    39.518
ReadX       14213562     0.002    24.760
LockX          29522     0.002     1.221
UnlockX        29522     0.001     0.694
Flush         635652    14.358   422.039
Throughput 990.13 MB/sec 64 clients 64 procs max_latency=422.043 ms
--> +6.8% throughput, -18.1% max latency
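The summary percentages do not appear to be relative to the "before" value. They match the difference divided by the mean of the two measurements; that is an inference from the arithmetic, not something stated in the change log. A small sketch computing both conventions (the function names are illustrative):

```c
/* Percent change relative to the "before" measurement. */
double pct_vs_before(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Percent difference relative to the mean of both measurements. */
double pct_vs_mean(double before, double after)
{
	return (after - before) / ((before + after) / 2.0) * 100.0;
}
```

For the 32 clients run, max latency goes from 341.833 ms to 228.847 ms: pct_vs_mean() gives about -39.6, matching the figure quoted for that run, while pct_vs_before() gives about -33.1. For the 64 clients throughput (924.738 to 990.13 MB/sec) the two give about +6.8 and +7.1 respectively.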
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
 fs/btrfs/file.c        |  1 +
 fs/btrfs/transaction.c | 39 +++++++++++++++++++++++++++++++--------
 fs/btrfs/transaction.h |  2 ++
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d81ae1f518f2..be5350f5bedf 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2238,6 +2238,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		ret = PTR_ERR(trans);
 		goto out_release_extents;
 	}
+	trans->in_fsync = true;
 
 	ret = btrfs_log_dentry_safe(trans, dentry, &ctx);
 	btrfs_release_log_ctx_extents(&ctx);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3bcb5444536e..ff8efa6f8986 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -107,6 +107,11 @@ static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
 					   __TRANS_JOIN |
 					   __TRANS_JOIN_NOLOCK |
 					   __TRANS_JOIN_NOSTART),
+	[TRANS_STATE_SUPER_COMMITTED]	= (__TRANS_START |
+					   __TRANS_ATTACH |
+					   __TRANS_JOIN |
+					   __TRANS_JOIN_NOLOCK |
+					   __TRANS_JOIN_NOSTART),
 	[TRANS_STATE_COMPLETED]		= (__TRANS_START |
 					   __TRANS_ATTACH |
 					   __TRANS_JOIN |
@@ -826,10 +831,11 @@ btrfs_attach_transaction_barrier(struct btrfs_root *root)
 	return trans;
 }
 
-/* wait for a transaction commit to be fully complete */
-static noinline void wait_for_commit(struct btrfs_transaction *commit)
+/* Wait for a transaction commit to reach at least the given state. */
+static noinline void wait_for_commit(struct btrfs_transaction *commit,
+				     const enum btrfs_trans_state min_state)
 {
-	wait_event(commit->commit_wait, commit->state == TRANS_STATE_COMPLETED);
+	wait_event(commit->commit_wait, commit->state >= min_state);
 }
 
 int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
@@ -884,7 +890,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
 		goto out; /* nothing committing|committed */
 	}
 
-	wait_for_commit(cur_trans);
+	wait_for_commit(cur_trans, TRANS_STATE_COMPLETED);
 	btrfs_put_transaction(cur_trans);
 out:
 	return ret;
@@ -2102,11 +2108,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	spin_lock(&fs_info->trans_lock);
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START) {
+		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
+
 		spin_unlock(&fs_info->trans_lock);
 		refcount_inc(&cur_trans->use_count);
-		ret = btrfs_end_transaction(trans);
 
-		wait_for_commit(cur_trans);
+		if (trans->in_fsync)
+			want_state = TRANS_STATE_SUPER_COMMITTED;
+		ret = btrfs_end_transaction(trans);
+		wait_for_commit(cur_trans, want_state);
 
 		if (TRANS_ABORTED(cur_trans))
 			ret = cur_trans->aborted;
@@ -2120,13 +2130,19 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	wake_up(&fs_info->transaction_blocked_wait);
 
 	if (cur_trans->list.prev != &fs_info->trans_list) {
+		enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED;
+
+		if (trans->in_fsync)
+			want_state = TRANS_STATE_SUPER_COMMITTED;
+
 		prev_trans = list_entry(cur_trans->list.prev,
 					struct btrfs_transaction, list);
-		if (prev_trans->state != TRANS_STATE_COMPLETED) {
+		if (prev_trans->state < want_state) {
 			refcount_inc(&prev_trans->use_count);
 			spin_unlock(&fs_info->trans_lock);
 
-			wait_for_commit(prev_trans);
+			wait_for_commit(prev_trans, want_state);
+
 			ret = READ_ONCE(prev_trans->aborted);
 
 			btrfs_put_transaction(prev_trans);
@@ -2345,6 +2361,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	if (ret)
 		goto scrub_continue;
 
+	/*
+	 * We needn't acquire the lock here because there is no other task
+	 * which can change it.
+	 */
+	cur_trans->state = TRANS_STATE_SUPER_COMMITTED;
+	wake_up(&cur_trans->commit_wait);
+
 	btrfs_finish_extent_commit(trans);
 
 	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &cur_trans->flags))
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 31ca81bad822..935bd6958a8a 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -16,6 +16,7 @@ enum btrfs_trans_state {
 	TRANS_STATE_COMMIT_START,
 	TRANS_STATE_COMMIT_DOING,
 	TRANS_STATE_UNBLOCKED,
+	TRANS_STATE_SUPER_COMMITTED,
 	TRANS_STATE_COMPLETED,
 	TRANS_STATE_MAX,
 };
@@ -133,6 +134,7 @@ struct btrfs_trans_handle {
 	bool can_flush_pending_bgs;
 	bool reloc_reserved;
 	bool dirty;
+	bool in_fsync;
 	struct btrfs_root *root;
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
--
2.28.0
* Re: [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
2021-01-27 10:35 ` [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fdmanana
@ 2021-01-27 15:26 ` Josef Bacik
0 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:26 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 1/27/21 5:35 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> Often an fsync needs to fall back to a transaction commit for several
> reasons (to ensure consistency after a power failure, a new block group
> was allocated or a temporary error such as ENOMEM or ENOSPC happened).
>
> In that case the log is marked as needing a full commit and any concurrent
> tasks attempting to log inodes or commit the log will also fall back to the
> transaction commit. When this happens they all wait for the task that first
> started the transaction commit to finish it. However, they wait until the
> full transaction commit happens, which is not needed: they only need to
> wait for the superblocks to be persisted, not for the unpinning of all the
> extents pinned during the transaction's lifetime. Even for short lived
> transactions there can be a few thousand pinned extents, and unpinning
> them can take a significant amount of time. For dbench workloads I have
> observed up to 4 to 5 milliseconds spent unpinning extents in the worst
> cases, with between 2 and 3 thousand pinned extents.
>
> So allow fsync tasks to skip waiting for the unpinning of extents when
> they call btrfs_commit_transaction() and they were not the task that
> started the transaction commit (that one has to do it; the alternative
> would be to offload the transaction commit to another task so that it
> could avoid waiting for the extent unpinning, or to offload the extent
> unpinning to another task).
>
> This patch is part of a patchset comprised of the following patches:
>
> btrfs: remove unnecessary directory inode item update when deleting dir entry
> btrfs: stop setting nbytes when filling inode item for logging
> btrfs: avoid logging new ancestor inodes when logging new inode
> btrfs: skip logging directories already logged when logging all parents
> btrfs: skip logging inodes already logged when logging new entries
> btrfs: remove unnecessary check_parent_dirs_for_sync()
> btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
>
> After applying the entire patchset, dbench shows improvements in both
> throughput and latency. The script used to measure them is the following:
>
> $ cat dbench-test.sh
> #!/bin/bash
>
> DEV=/dev/sdk
> MNT=/mnt/sdk
> MOUNT_OPTIONS="-o ssd"
> MKFS_OPTIONS="-m single -d single"
>
> echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
>
> umount $DEV &> /dev/null
> mkfs.btrfs -f $MKFS_OPTIONS $DEV
> mount $MOUNT_OPTIONS $DEV $MNT
>
> dbench -D $MNT -t 300 64
>
> umount $MNT
>
> The test was run on a physical machine with 12 cores (Intel Core i7), 64G
> of RAM, using an NVMe device and a non-debug kernel configuration (Debian's
> default configuration).
>
> Before applying patchset, 32 clients:
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 9627107 0.153 61.938
> Close 7072076 0.001 3.175
> Rename 407633 1.222 44.439
> Unlink 1943895 0.658 44.440
> Deltree 256 17.339 110.891
> Mkdir 128 0.003 0.009
> Qpathinfo 8725406 0.064 17.850
> Qfileinfo 1529516 0.001 2.188
> Qfsinfo 1599884 0.002 1.457
> Sfileinfo 784200 0.005 3.562
> Find 3373513 0.411 30.312
> WriteX 4802132 0.053 29.054
> ReadX 15089959 0.002 5.801
> LockX 31344 0.002 0.425
> UnlockX 31344 0.001 0.173
> Flush 674724 5.952 341.830
>
> Throughput 1008.02 MB/sec 32 clients 32 procs max_latency=341.833 ms
>
> After applying patchset, 32 clients:
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 9931568 0.111 25.597
> Close 7295730 0.001 2.171
> Rename 420549 0.982 49.714
> Unlink 2005366 0.497 39.015
> Deltree 256 11.149 89.242
> Mkdir 128 0.002 0.014
> Qpathinfo 9001863 0.049 20.761
> Qfileinfo 1577730 0.001 2.546
> Qfsinfo 1650508 0.002 3.531
> Sfileinfo 809031 0.005 5.846
> Find 3480259 0.309 23.977
> WriteX 4952505 0.043 41.283
> ReadX 15568127 0.002 5.476
> LockX 32338 0.002 0.978
> UnlockX 32338 0.001 2.032
> Flush 696017 7.485 228.835
>
> Throughput 1049.91 MB/sec 32 clients 32 procs max_latency=228.847 ms
>
> --> +4.1% throughput, -39.6% max latency
>
> Before applying patchset, 64 clients:
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 8956748 0.342 108.312
> Close 6579660 0.001 3.823
> Rename 379209 2.396 81.897
> Unlink 1808625 1.108 131.148
> Deltree 256 25.632 172.176
> Mkdir 128 0.003 0.018
> Qpathinfo 8117615 0.131 55.916
> Qfileinfo 1423495 0.001 2.635
> Qfsinfo 1488496 0.002 5.412
> Sfileinfo 729472 0.007 8.643
> Find 3138598 0.855 78.321
> WriteX 4470783 0.102 79.442
> ReadX 14038139 0.002 7.578
> LockX 29158 0.002 0.844
> UnlockX 29158 0.001 0.567
> Flush 627746 14.168 506.151
>
> Throughput 924.738 MB/sec 64 clients 64 procs max_latency=506.154 ms
>
> After applying patchset, 64 clients:
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 9069003 0.303 43.193
> Close 6662328 0.001 3.888
> Rename 383976 2.194 46.418
> Unlink 1831080 1.022 43.873
> Deltree 256 24.037 155.763
> Mkdir 128 0.002 0.005
> Qpathinfo 8219173 0.137 30.233
> Qfileinfo 1441203 0.001 3.204
> Qfsinfo 1507092 0.002 4.055
> Sfileinfo 738775 0.006 5.431
> Find 3177874 0.936 38.170
> WriteX 4526152 0.084 39.518
> ReadX 14213562 0.002 24.760
> LockX 29522 0.002 1.221
> UnlockX 29522 0.001 0.694
> Flush 635652 14.358 422.039
>
> Throughput 990.13 MB/sec 64 clients 64 procs max_latency=422.043 ms
>
> --> +6.8% throughput, -18.1% max latency
>
> Signed-off-by: Filipe Manana <fdmanana@suse.com>
Funnily enough I've got a patch to async off unpinning as it drastically affects
our WhatsApp workload. Once that lands maybe we can undo this bit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
* Re: [PATCH 0/7] btrfs: more performance improvements for dbench workloads
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
` (6 preceding siblings ...)
2021-01-27 10:35 ` [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit fdmanana
@ 2021-01-27 15:42 ` Josef Bacik
2021-02-01 21:56 ` David Sterba
8 siblings, 0 replies; 14+ messages in thread
From: Josef Bacik @ 2021-01-27 15:42 UTC (permalink / raw)
To: fdmanana, linux-btrfs
On 1/27/21 5:34 AM, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The following patchset brings one more batch of performance improvements
> with dbench workloads, or anything that mixes file creation, file writes,
> renames, unlinks, etc with fsync like dbench does. This patchset is mostly
> based on avoiding logging directory inodes multiple times when not necessary,
> falling back to transaction commits less frequently and often waiting less
> time for transaction commits to complete. Performance results are listed in
> the change log of the last patch, but in short, I've experienced a reduction
> of maximum latency of up to about 40% and throughput gains of up to about 6%.
>
> Filipe Manana (7):
> btrfs: remove unnecessary directory inode item update when deleting dir entry
> btrfs: stop setting nbytes when filling inode item for logging
> btrfs: avoid logging new ancestor inodes when logging new inode
> btrfs: skip logging directories already logged when logging all parents
> btrfs: skip logging inodes already logged when logging new entries
> btrfs: remove unnecessary check_parent_dirs_for_sync()
> btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
>
> fs/btrfs/file.c | 1 +
> fs/btrfs/transaction.c | 39 +++++++--
> fs/btrfs/transaction.h | 2 +
> fs/btrfs/tree-log.c | 195 ++++++++++++-----------------------------
> 4 files changed, 92 insertions(+), 145 deletions(-)
>
You can add
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
to the whole series, thanks,
Josef
* Re: [PATCH 0/7] btrfs: more performance improvements for dbench workloads
2021-01-27 10:34 [PATCH 0/7] btrfs: more performance improvements for dbench workloads fdmanana
` (7 preceding siblings ...)
2021-01-27 15:42 ` [PATCH 0/7] btrfs: more performance improvements for dbench workloads Josef Bacik
@ 2021-02-01 21:56 ` David Sterba
8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2021-02-01 21:56 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Wed, Jan 27, 2021 at 10:34:53AM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> The following patchset brings one more batch of performance improvements
> with dbench workloads, or anything that mixes file creation, file writes,
> renames, unlinks, etc with fsync like dbench does. This patchset is mostly
> based on avoiding logging directory inodes multiple times when not necessary,
> falling back to transaction commits less frequently and often waiting less
> time for transaction commits to complete. Performance results are listed in
> the change log of the last patch, but in short, I've experienced a reduction
> of maximum latency of up to about 40% and throughput gains of up to about 6%.
Nice, added to misc-next, thanks.