* [PATCH 1/6] btrfs: fix race causing unnecessary inode logging during link and rename
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 2/6] btrfs: fix race that results in logging old extents during a fast fsync fdmanana
` (5 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When we are doing a rename or a link operation for an inode that was logged
in the previous transaction and that transaction is still committing, we
have a time window where we incorrectly consider that the inode was logged
previously in the current transaction and therefore decide to log it to
update it in the log. The following steps give an example of how this
happens during a link operation:
1) Inode X is logged in transaction 1000, so its logged_trans field is set
to 1000;
2) Task A starts to commit transaction 1000;
3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;
4) Task B starts a link operation for inode X, and as a consequence it
starts transaction 1001;
5) Task A is still committing transaction 1000, therefore the value stored
at fs_info->last_trans_committed is still 999;
6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
fs_info->last_trans_committed and because the logged_trans field of
inode X has a value of 1000, the function does not return immediately,
instead it proceeds to logging the inode, which should not happen
because the inode was logged in the previous transaction (1000) and
not in the current one (1001).
This is not a functional problem, just wasted time and space logging an
inode that does not need to be logged, contributing to higher latency
for link and rename operations.
So fix this by comparing the inodes' logged_trans field with the
generation of the current transaction instead of comparing with the value
stored in fs_info->last_trans_committed.
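To make the race concrete, here is a minimal userspace model of the old
and new checks, using the transaction numbers from the example above
(illustrative only, not kernel code; the function names are made up for
the sketch):

#include <stdbool.h>
#include <stdio.h>

/* Old check: races while the previous commit has not yet advanced
 * last_trans_committed from 999 to 1000. */
static bool old_check_logs_inode(unsigned long long logged_trans,
                                 unsigned long long last_trans_committed)
{
        return logged_trans > last_trans_committed;
}

/* New check: trans->transid is fixed for the lifetime of the handle,
 * so there is no window where the value we compare against lags. */
static bool new_check_logs_inode(unsigned long long logged_trans,
                                 unsigned long long transid)
{
        return logged_trans >= transid;
}

int main(void)
{
        /* Inode X was logged in transaction 1000, which is still
         * committing (last_trans_committed == 999), and the link runs
         * in transaction 1001. */
        printf("old: %d\n", old_check_logs_inode(1000, 999));  /* 1: logs needlessly */
        printf("new: %d\n", new_check_logs_inode(1000, 1001)); /* 0: correctly skips */
        return 0;
}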
This case is often hit when running dbench for a long enough duration, as
it does lots of rename operations.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 80cf496226c2..b63f5c2b982a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -6443,7 +6443,6 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode, struct btrfs_inode *old_dir,
struct dentry *parent)
{
- struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_log_ctx ctx;
/*
@@ -6457,8 +6456,8 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
* if this inode hasn't been logged and directory we're renaming it
* from hasn't been logged, we don't need to log it
*/
- if (inode->logged_trans <= fs_info->last_trans_committed &&
- (!old_dir || old_dir->logged_trans <= fs_info->last_trans_committed))
+ if (inode->logged_trans < trans->transid &&
+ (!old_dir || old_dir->logged_trans < trans->transid))
return;
btrfs_init_log_ctx(&ctx, &inode->vfs_inode);
--
2.28.0
* [PATCH 2/6] btrfs: fix race that results in logging old extents during a fast fsync
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
2020-11-25 12:19 ` [PATCH 1/6] btrfs: fix race causing unnecessary inode logging during link and rename fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 3/6] btrfs: fix race that causes unnecessary logging of ancestor inodes fdmanana
` (4 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging the extents of an inode during a fast fsync, we have a time
window where we can log extents that are from the previous transaction and
already persisted. This only makes us waste time unnecessarily.
The following sequence of steps shows how this can happen:
1) We are at transaction 1000;
2) An ordered extent E from inode I completes, that is, it has gone through
btrfs_finish_ordered_io(), which set the extent map's generation to
1000, the generation of the current transaction, when the extent was
unpinned;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Some change is made to inode I, resulting in creation of a new
transaction with a generation of 1001;
6) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B starts an fsync on inode I, and when it gets to
btrfs_log_changed_extents() it sees the extent map for extent E in the
list of modified extents. It sees the extent map has a generation of
1000 and fs_info->last_trans_committed has a value of 999, so it
proceeds to logging the respective file extent item and all the
checksums covering its range.
So we end up wasting time since the extent was already persisted and
is reachable through the trees pointed to by the super block committed
by transaction 1000.
So just fix this by comparing the extent map's generation against the
generation of the transaction handle - if it is smaller than the id in the
handle, we know the extent was already persisted and we do not need to log
it.
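As a quick illustration, here is a minimal userspace model of that
filter over a list of extent map generations (illustrative only; the
kernel code iterates the modified_extents list as shown in the diff
below):

#include <stdio.h>

int main(void)
{
        const unsigned long long transid = 1001; /* current handle's id */
        const unsigned long long gens[] = { 998, 1000, 1001 };

        for (int i = 0; i < 3; i++) {
                if (gens[i] < transid) {
                        /* Persisted by an earlier transaction that has
                         * already written its super blocks, skip it. */
                        printf("gen %llu: skip\n", gens[i]);
                        continue;
                }
                printf("gen %llu: log file extent item and csums\n", gens[i]);
        }
        return 0;
}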
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index b63f5c2b982a..596d72d239e9 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4416,14 +4416,12 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
struct extent_map *em, *n;
struct list_head extents;
struct extent_map_tree *tree = &inode->extent_tree;
- u64 test_gen;
int ret = 0;
int num = 0;
INIT_LIST_HEAD(&extents);
write_lock(&tree->lock);
- test_gen = root->fs_info->last_trans_committed;
list_for_each_entry_safe(em, n, &tree->modified_extents, list) {
list_del_init(&em->list);
@@ -4439,7 +4437,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
goto process;
}
- if (em->generation <= test_gen)
+ if (em->generation < trans->transid)
continue;
/* We log prealloc extents beyond eof later. */
--
2.28.0
* [PATCH 3/6] btrfs: fix race that causes unnecessary logging of ancestor inodes
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
2020-11-25 12:19 ` [PATCH 1/6] btrfs: fix race causing unnecessary inode logging during link and rename fdmanana
2020-11-25 12:19 ` [PATCH 2/6] btrfs: fix race that results in logging old extents during a fast fsync fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 4/6] btrfs: fix race that makes inode logging fallback to transaction commit fdmanana
` (3 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging an inode and we are checking if we need to log ancestors that
are new, if the previous transaction is still committing we have a time
window where we can unnecessarily log ancestor inodes that were created in
the previous transaction.
The race is described by the following steps:
1) We are at transaction 1000;
2) Directory inode X is created, its generation is set to 1000;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Inode Y, a regular file, is created under directory inode X, this
results in starting a new transaction with a generation of 1001;
6) The transaction 1000 commit is unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B calls fsync on inode Y and gets a handle for transaction 1001;
8) Task B ends up at log_all_new_ancestors() and then, because inode Y has
only one hard link, ends up at log_new_ancestors_fast(). There it reads
a value of 999 from fs_info->last_trans_committed, and sees that the
parent inode X has a generation of 1000, so we end up logging inode X:
if (inode->generation > fs_info->last_trans_committed) {
        ret = btrfs_log_inode(trans, root, inode,
                              LOG_INODE_EXISTS, ctx);
(...)
which is not necessary since it was created in the past transaction,
with a generation of 1000, and that transaction has already committed
its super blocks - it's still unpinning extents so it has not yet
updated fs_info->last_trans_committed from 999 to 1000.
So this just causes us to spend more time logging and allocating and
writing more tree blocks for the log tree.
So fix this by comparing an inode's generation with the generation of the
transaction our transaction handle refers to - if the inode's generation
matches the generation of the current transaction then we know it is a
new inode we need to log, otherwise don't log it.
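Since an inode's generation is assigned when it is created, the check
boils down to the predicate in the following sketch (illustrative only,
not the kernel helper):

#include <stdbool.h>
#include <stdio.h>

/* An inode is new iff its generation is not older than the handle's
 * transid; in practice the generation never exceeds the transid, so
 * >= is effectively an equality test, and unlike last_trans_committed
 * the transid never lags behind what is already persisted. */
static bool created_in_current_trans(unsigned long long generation,
                                     unsigned long long transid)
{
        return generation >= transid;
}

int main(void)
{
        /* Inode X was created in transaction 1000, the fsync of inode
         * Y runs with a handle on transaction 1001. */
        printf("log ancestor X: %d\n",
               created_in_current_trans(1000, 1001)); /* 0: skip it */
        return 0;
}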
This case is often hit when running dbench for a long enough duration.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 596d72d239e9..6313c50f5e33 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5835,7 +5835,6 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
while (true) {
struct btrfs_fs_info *fs_info = root->fs_info;
- const u64 last_committed = fs_info->last_trans_committed;
struct extent_buffer *leaf = path->nodes[0];
int slot = path->slots[0];
struct btrfs_key search_key;
@@ -5854,7 +5853,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
if (IS_ERR(inode))
return PTR_ERR(inode);
- if (BTRFS_I(inode)->generation > last_committed)
+ if (BTRFS_I(inode)->generation >= trans->transid)
ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
LOG_INODE_EXISTS, ctx);
btrfs_add_delayed_iput(inode);
@@ -5895,7 +5894,6 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
struct btrfs_log_ctx *ctx)
{
struct btrfs_root *root = inode->root;
- struct btrfs_fs_info *fs_info = root->fs_info;
struct dentry *old_parent = NULL;
struct super_block *sb = inode->vfs_inode.i_sb;
int ret = 0;
@@ -5909,7 +5907,7 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
if (root != inode->root)
break;
- if (inode->generation > fs_info->last_trans_committed) {
+ if (inode->generation >= trans->transid) {
ret = btrfs_log_inode(trans, root, inode,
LOG_INODE_EXISTS, ctx);
if (ret)
--
2.28.0
* [PATCH 4/6] btrfs: fix race that makes inode logging fallback to transaction commit
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (2 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 3/6] btrfs: fix race that causes unnecessary logging of ancestor inodes fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 5/6] btrfs: fix race leading to unnecessary transaction commit when logging inode fdmanana
` (2 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging an inode and the previous transaction is still committing, we
have a time window where we can end up incorrectly thinking that an inode
has a last_unlink_trans value greater than the last committed transaction,
which results in the logging falling back to a full transaction commit,
which is usually much more expensive than doing a log commit.
The race is described by the following steps:
1) We are at transaction 1000;
2) We modify an inode X (a directory) using transaction 1000 and set its
last_unlink_trans field to 1000, because for example we removed one
of its subdirectories;
3) We create a new inode Y with a dentry in inode X using transaction 1000,
so its generation field is set to 1000;
4) The commit for transaction 1000 is started by task A;
5) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
6) Some task starts a new transaction with a generation of 1001;
7) We do some modification to inode Y (using transaction 1001);
8) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
9) Task B starts an fsync on inode Y, and gets a handle for transaction
1001. When it gets to check_parent_dirs_for_sync() it does the checking
of the ancestor dentries because the following check does not evaluate
to true:
if (S_ISREG(inode->vfs_inode.i_mode) &&
    inode->generation <= last_committed &&
    inode->last_unlink_trans <= last_committed)
        goto out;
The generation value for inode Y is 1000 and last_committed, which has
the value read from fs_info->last_trans_committed, has a value of 999,
so that check evaluates to false and we proceed to check the ancestor
inodes.
Once we get to the first ancestor, inode X, we call
btrfs_must_commit_transaction() on it, which evaluates to true:
static bool btrfs_must_commit_transaction(...)
{
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        bool ret = false;

        mutex_lock(&inode->log_mutex);
        if (inode->last_unlink_trans > fs_info->last_trans_committed) {
                /*
                 * Make sure any commits to the log are forced to be full
                 * commits.
                 */
                btrfs_set_log_full_commit(trans);
                ret = true;
        }
(...)
because inode X's last_unlink_trans has a value of 1000 and
fs_info->last_trans_committed still has a value of 999, it returns
true to check_parent_dirs_for_sync(), making it return 1, which is
propagated up to btrfs_sync_file(), causing it to fall back to a full
transaction commit of transaction 1001.
We should not have fallen back to committing transaction 1001, since
inode X had last_unlink_trans set to 1000 and the super blocks for
transaction 1000 were already written. So while not resulting in a
functional problem, it leads to a lot more work and higher latencies
for an fsync, since committing a transaction is usually more expensive
than committing a log (if other filesystem changes happened under that
transaction).
A similar problem happens when logging directories, for the same reason:
btrfs_must_commit_transaction() returns true for an inode whose
last_unlink_trans has the generation of the previous transaction while
that transaction is still committing and unpinning its freed extents.
So fix this by comparing last_unlink_trans with the id of the current
transaction instead of fs_info->last_trans_committed.
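Plugging in the numbers from the example above, a minimal sketch of why
the new comparison closes the window (illustrative only, not kernel
code):

#include <stdio.h>

int main(void)
{
        const unsigned long long last_unlink_trans = 1000;   /* set in step 2 */
        const unsigned long long last_trans_committed = 999; /* commit of 1000 unfinished */
        const unsigned long long transid = 1001;             /* fsync's handle */

        /* Old check: forces a needless full transaction commit. */
        printf("old: must commit = %d\n",
               (int)(last_unlink_trans > last_trans_committed)); /* 1 */

        /* New check: transaction 1000 already wrote its super blocks,
         * so committing the log is enough. */
        printf("new: must commit = %d\n",
               (int)(last_unlink_trans >= transid)); /* 0 */
        return 0;
}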
This case is often hit when running dbench for a long enough duration, as
it does lots of rename and rmdir operations (both update the
last_unlink_trans field of an inode) and fsyncs of files and directories.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6313c50f5e33..592a50d9697f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5449,11 +5449,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
bool ret = false;
mutex_lock(&inode->log_mutex);
- if (inode->last_unlink_trans > fs_info->last_trans_committed) {
+ if (inode->last_unlink_trans >= trans->transid) {
/*
* Make sure any commits to the log are forced to be full
* commits.
@@ -5475,8 +5474,7 @@ static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans,
static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode,
struct dentry *parent,
- struct super_block *sb,
- u64 last_committed)
+ struct super_block *sb)
{
int ret = 0;
struct dentry *old_parent = NULL;
@@ -5488,8 +5486,8 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
* and other fun in this file.
*/
if (S_ISREG(inode->vfs_inode.i_mode) &&
- inode->generation <= last_committed &&
- inode->last_unlink_trans <= last_committed)
+ inode->generation < trans->transid &&
+ inode->last_unlink_trans < trans->transid)
goto out;
if (!S_ISDIR(inode->vfs_inode.i_mode)) {
@@ -6024,7 +6022,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info = root->fs_info;
struct super_block *sb;
int ret = 0;
- u64 last_committed = fs_info->last_trans_committed;
bool log_dentries = false;
sb = inode->vfs_inode.i_sb;
@@ -6049,8 +6046,7 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
goto end_no_trans;
}
- ret = check_parent_dirs_for_sync(trans, inode, parent, sb,
- last_committed);
+ ret = check_parent_dirs_for_sync(trans, inode, parent, sb);
if (ret)
goto end_no_trans;
@@ -6080,8 +6076,8 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
* and other fun in this file.
*/
if (S_ISREG(inode->vfs_inode.i_mode) &&
- inode->generation <= last_committed &&
- inode->last_unlink_trans <= last_committed) {
+ inode->generation < trans->transid &&
+ inode->last_unlink_trans < trans->transid) {
ret = 0;
goto end_trans;
}
@@ -6130,7 +6126,7 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
* but the file inode does not have a matching BTRFS_INODE_REF_KEY item
* and has a link count of 2.
*/
- if (inode->last_unlink_trans > last_committed) {
+ if (inode->last_unlink_trans >= trans->transid) {
ret = btrfs_log_all_parents(trans, inode, ctx);
if (ret)
goto end_trans;
--
2.28.0
* [PATCH 5/6] btrfs: fix race leading to unnecessary transaction commit when logging inode
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (3 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 4/6] btrfs: fix race that makes inode logging fallback to transaction commit fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit fdmanana
2020-11-30 17:30 ` [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads David Sterba
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging an inode we may often have to fall back to a full transaction
commit, either because a new block group was allocated, there is some case
we can not deal with without a transaction commit, or some error like
-ENOMEM happened. However after we fall back to a transaction commit, we
have a time window where the next attempt to log any inode can
unnecessarily commit the next transaction, adding additional overhead and
increasing latency.
A sequence of steps that leads to this issue is the following:
1) The current open transaction has a generation of 1000;
2) A new block group is allocated, and as a consequence we must make sure
any attempts to commit a log fallback to a transaction commit, so
btrfs_set_log_full_commit() is called from btrfs_make_block_group().
This sets fs_info->last_trans_log_full_commit to 1000;
3) Task A is holding a handle on transaction 1000 and tries to log inode X.
Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
which returns true, since fs_info->last_trans_log_full_commit has a
value of 1000. So we end up returning -EAGAIN and propagating it up to
btrfs_sync_file(), where we commit transaction 1000;
4) The transaction commit task (task A) sets the transaction state to
unblocked (TRANS_STATE_UNBLOCKED);
5) Some other task, task B, starts a new transaction with a generation of
1001;
6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
7) Transaction 1000 has not fully committed yet, we are still writing all
the extent buffers it created;
8) Some new task, task C, starts an fsync of inode Y, gets a handle for
transaction 1001, and it gets to btrfs_log_inode_parent() which does
the following check:
if (fs_info->last_trans_log_full_commit > last_committed) {
        ret = 1;
        goto end_no_trans;
}
At that point last_trans_log_full_commit has a value of 1000 and
last_committed (value of fs_info->last_trans_committed) has a value of
999, since transaction 1000 has not yet committed - it is either still
writing out dirty extent buffers, its super blocks or unpinning
extents.
As a consequence we return 1, which gets propagated up to
btrfs_sync_file(), which will then call btrfs_commit_transaction()
for transaction 1001.
As a consequence we have an unnecessary second transaction commit: we
previously committed transaction 1000 and now commit transaction 1001
as well, resulting in more overhead and increased latency.
So fix this double transaction commit issue simply by removing that check.
All we need to do is wait for the previous transaction to finish its
commit, and we already do that later when starting the log transaction at
start_log_trans(), because there we acquire the tree_log_mutex, which is
held by a transaction commit and only released after the transaction
commits its super blocks.
Another issue that check has is that it reads last_trans_log_full_commit
without using READ_ONCE(), which is incorrect since that member of
struct btrfs_fs_info is always updated with WRITE_ONCE() through the
helper btrfs_set_log_full_commit().
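For context, the helper pair looks roughly like this sketch, based on
fs/btrfs/tree-log.h around this kernel version (check the tree for the
exact definitions):

static inline void btrfs_set_log_full_commit(struct btrfs_trans_handle *trans)
{
        WRITE_ONCE(trans->fs_info->last_trans_log_full_commit,
                   trans->transid);
}

static inline int btrfs_need_log_full_commit(struct btrfs_trans_handle *trans)
{
        return READ_ONCE(trans->fs_info->last_trans_log_full_commit) ==
                trans->transid;
}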
This double transaction commit issue can actually be triggered quite often
in long runs of dbench, since besides the creation of new block groups
that force inode logging to fallback to a transaction commit, there are
cases where dbench asks to fsync a directory which had files in it that
were previously renamed or subdirectories that were removed, resulting in
the inode logging to fallback to a full transaction commit.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 592a50d9697f..bc5b652f4f64 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -6031,16 +6031,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
goto end_no_trans;
}
- /*
- * The prev transaction commit doesn't complete, we need do
- * full commit by ourselves.
- */
- if (fs_info->last_trans_log_full_commit >
- fs_info->last_trans_committed) {
- ret = 1;
- goto end_no_trans;
- }
-
if (btrfs_root_refs(&root->root_item) == 0) {
ret = 1;
goto end_no_trans;
--
2.28.0
* [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (4 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 5/6] btrfs: fix race leading to unnecessary transaction commit when logging inode fdmanana
@ 2020-11-25 12:19 ` fdmanana
2021-02-02 5:38 ` Wang Yugui
2020-11-30 17:30 ` [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads David Sterba
6 siblings, 1 reply; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.
This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).
Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:
1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers
They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.
So change start_log_trans() to use the root tree's log_mutex to serialize
the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.
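A minimal userspace model of the resulting serialization, with pthreads
standing in for the kernel tasks (illustrative only): the log sync path
does its heavy work outside tree_log_mutex and only takes the mutex
around the super block write, so it waits for the previous commit's
super block write instead of the whole commit:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tree_log_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Previous transaction commit: as before, holds the mutex while
 * writing its dirty extent buffers and super blocks. */
static void *transaction_commit(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&tree_log_mutex);
        puts("commit: write dirty extent buffers");
        puts("commit: write super blocks");
        pthread_mutex_unlock(&tree_log_mutex);
        return NULL;
}

/* Log sync for the next transaction: creating log trees and writing
 * log tree blocks no longer needs the mutex. */
static void *log_sync(void *arg)
{
        (void)arg;
        puts("fsync: create log trees, log the inode, write log blocks");
        pthread_mutex_lock(&tree_log_mutex); /* wait only here */
        puts("fsync: write super blocks");
        pthread_mutex_unlock(&tree_log_mutex);
        return NULL;
}

int main(void)
{
        pthread_t commit_task, fsync_task;

        pthread_create(&commit_task, NULL, transaction_commit, NULL);
        pthread_create(&fsync_task, NULL, log_sync, NULL);
        pthread_join(commit_task, NULL);
        pthread_join(fsync_task, NULL);
        return 0;
}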
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
The following script that uses dbench was used to measure the impact of
the whole patchset:
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a machine with 12 cores, 64G of RAM, using an NVMe
device and a non-debug kernel configuration (Debian's default).
Before patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236
Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
After patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259
Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
+12.7% throughput, -8.2% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ctree.h | 2 +-
fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++--------------
2 files changed, 40 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c0c6e79c43f9..7185384f475a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1026,7 +1026,7 @@ enum {
BTRFS_ROOT_DEAD_RELOC_TREE,
/* Mark dead root stored on device whose cleanup needs to be resumed */
BTRFS_ROOT_DEAD_TREE,
- /* The root has a log tree. Used only for subvolume roots. */
+ /* The root has a log tree. Used for subvolume roots and the tree root. */
BTRFS_ROOT_HAS_LOG_TREE,
/* Qgroup flushing is in progress */
BTRFS_ROOT_QGROUP_FLUSHING,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index bc5b652f4f64..e8b84543d565 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -139,8 +139,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
struct btrfs_log_ctx *ctx)
{
struct btrfs_fs_info *fs_info = root->fs_info;
+ struct btrfs_root *tree_root = fs_info->tree_root;
int ret = 0;
+ /*
+ * First check if the log root tree was already created. If not, create
+ * it before locking the root's log_mutex, just to keep lockdep happy.
+ */
+ if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
+ mutex_lock(&tree_root->log_mutex);
+ if (!fs_info->log_root_tree) {
+ ret = btrfs_init_log_root_tree(trans, fs_info);
+ if (!ret)
+ set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
+ }
+ mutex_unlock(&tree_root->log_mutex);
+ if (ret)
+ return ret;
+ }
+
mutex_lock(&root->log_mutex);
if (root->log_root) {
@@ -156,13 +173,6 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
}
} else {
- mutex_lock(&fs_info->tree_log_mutex);
- if (!fs_info->log_root_tree)
- ret = btrfs_init_log_root_tree(trans, fs_info);
- mutex_unlock(&fs_info->tree_log_mutex);
- if (ret)
- goto out;
-
ret = btrfs_add_log_tree(trans, root);
if (ret)
goto out;
@@ -3022,6 +3032,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
int log_transid = 0;
struct btrfs_log_ctx root_log_ctx;
struct blk_plug plug;
+ u64 log_root_start;
+ u64 log_root_level;
mutex_lock(&root->log_mutex);
log_transid = ctx->log_transid;
@@ -3199,22 +3211,31 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
goto out_wake_log_root;
}
- btrfs_set_super_log_root(fs_info->super_for_commit,
- log_root_tree->node->start);
- btrfs_set_super_log_root_level(fs_info->super_for_commit,
- btrfs_header_level(log_root_tree->node));
-
+ log_root_start = log_root_tree->node->start;
+ log_root_level = btrfs_header_level(log_root_tree->node);
log_root_tree->log_transid++;
mutex_unlock(&log_root_tree->log_mutex);
/*
- * Nobody else is going to jump in and write the ctree
- * super here because the log_commit atomic below is protecting
- * us. We must be called with a transaction handle pinning
- * the running transaction open, so a full commit can't hop
- * in and cause problems either.
+ * Here we are guaranteed that nobody is going to write the superblock
+ * for the current transaction before us, and that we do not write our
+ * superblock before the previous transaction finishes its commit and
+ * writes its superblock, because:
+ *
+ * 1) We are holding a handle on the current transaction, so nobody
+ * can commit it until we release the handle;
+ *
+ * 2) Before writing our superblock we acquire the tree_log_mutex, so
+ * if the previous transaction is still committing, and hasn't yet
+ * written its superblock, we wait for it to do it, because a
+ * transaction commit acquires the tree_log_mutex when the commit
+ * begins and releases it only after writing its superblock.
*/
+ mutex_lock(&fs_info->tree_log_mutex);
+ btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start);
+ btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level);
ret = write_all_supers(fs_info, 1);
+ mutex_unlock(&fs_info->tree_log_mutex);
if (ret) {
btrfs_set_log_full_commit(trans);
btrfs_abort_transaction(trans, ret);
@@ -3299,6 +3320,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
if (fs_info->log_root_tree) {
free_log_tree(trans, fs_info->log_root_tree);
fs_info->log_root_tree = NULL;
+ clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state);
}
return 0;
}
--
2.28.0
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2020-11-25 12:19 ` [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit fdmanana
@ 2021-02-02 5:38 ` Wang Yugui
2021-02-02 11:28 ` Filipe Manana
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-02 5:38 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
Hi, Filipe Manana
The dbench result with these patches is very good. thanks a lot.
This is the dbench(synchronous mode) result , and then a question.
command: dbench -s -t 60 -D /btrfs/ 32
mount option:ssd,space_cache=v2
kernel:5.10.12 + patchset 1 + this patchset
patchset 1:
0001-btrfs-fix-race-causing-unnecessary-inode-logging-dur.patch
0002-btrfs-fix-race-that-results-in-logging-old-extents-d.patch
0003-btrfs-fix-race-that-causes-unnecessary-logging-of-an.patch
0004-btrfs-fix-race-that-makes-inode-logging-fallback-to-.patch
0005-btrfs-fix-race-leading-to-unnecessary-transaction-co.patch
0006-btrfs-do-not-block-inode-logging-for-so-long-during-.patch
We get two types of result as below, and the result type 1 is not easy
to reproduce now.
Question:
for synchronous mode, the result type 1 is perfect?
and there is still some minor place about the flush to do for
the result type2?
result type 1:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 868942 0.028 3.017
Close 638536 0.003 0.061
Rename 36851 0.663 4.000
Unlink 175182 0.399 5.358
Qpathinfo 789014 0.014 1.846
Qfileinfo 137684 0.002 0.047
Qfsinfo 144241 0.004 0.059
Sfileinfo 70913 0.008 0.046
Find 304554 0.057 1.889
** WriteX 429696 3.960 2239.973
ReadX 1363356 0.005 0.358
LockX 2836 0.004 0.038
UnlockX 2836 0.002 0.018
** Flush 60771 0.621 6.794
Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.10.12-4.el7.x86_64
result type 2:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 888943 0.028 2.679
Close 652765 0.002 0.058
Rename 37705 0.572 3.962
Unlink 179713 0.383 3.983
Qpathinfo 806705 0.014 2.294
Qfileinfo 140752 0.002 0.125
Qfsinfo 147909 0.004 0.049
Sfileinfo 72374 0.008 0.104
Find 311839 0.058 2.305
** WriteX 439656 3.854 1872.109
ReadX 1396868 0.005 0.324
LockX 2910 0.004 0.026
UnlockX 2910 0.002 0.025
** Flush 62260 0.750 1659.364
Throughput 461.856 MB/sec (sync open) 32 clients 32 procs max_latency=1872.118 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.10.12-4.el7.x86_64
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02
> [...]
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 5:38 ` Wang Yugui
@ 2021-02-02 11:28 ` Filipe Manana
2021-02-02 13:01 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Filipe Manana @ 2021-02-02 11:28 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-btrfs
On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi, Filipe Manana
>
> The dbench result with these patches is very good. thanks a lot.
>
> This is the dbench(synchronous mode) result , and then a question.
>
> command: dbench -s -t 60 -D /btrfs/ 32
> mount option:ssd,space_cache=v2
> kernel:5.10.12 + patchset 1 + this patchset
patchset 1 and "this patchset" are the same, did you mean two
different patchsets or just a single patchset?
> patchset 1:
> 0001-btrfs-fix-race-causing-unnecessary-inode-logging-dur.patch
> 0002-btrfs-fix-race-that-results-in-logging-old-extents-d.patch
> 0003-btrfs-fix-race-that-causes-unnecessary-logging-of-an.patch
> 0004-btrfs-fix-race-that-makes-inode-logging-fallback-to-.patch
> 0005-btrfs-fix-race-leading-to-unnecessary-transaction-co.patch
> 0006-btrfs-do-not-block-inode-logging-for-so-long-during-.patch
>
> We get two types of result as below, and the result type 1 is not easy
> to reproduce now.
Ok, there are outliers often, that's why multiple tests should be done.
>
> Question:
> for synchronous mode, the result type 1 is perfect?
What do you mean by perfect? You mean if result 1 is better than result 2?
> and there is still some minor place about the flush to do for
> the result type2?
By "minor place" you mean the huge difference I suppose.
>
>
> result type 1:
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 868942 0.028 3.017
> Close 638536 0.003 0.061
> Rename 36851 0.663 4.000
> Unlink 175182 0.399 5.358
> Qpathinfo 789014 0.014 1.846
> Qfileinfo 137684 0.002 0.047
> Qfsinfo 144241 0.004 0.059
> Sfileinfo 70913 0.008 0.046
> Find 304554 0.057 1.889
> ** WriteX 429696 3.960 2239.973
> ReadX 1363356 0.005 0.358
> LockX 2836 0.004 0.038
> UnlockX 2836 0.002 0.018
> ** Flush 60771 0.621 6.794
>
> Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
> + stat -f -c %T /btrfs/
> btrfs
> + uname -r
> 5.10.12-4.el7.x86_64
>
>
> result type 2:
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 888943 0.028 2.679
> Close 652765 0.002 0.058
> Rename 37705 0.572 3.962
> Unlink 179713 0.383 3.983
> Qpathinfo 806705 0.014 2.294
> Qfileinfo 140752 0.002 0.125
> Qfsinfo 147909 0.004 0.049
> Sfileinfo 72374 0.008 0.104
> Find 311839 0.058 2.305
> ** WriteX 439656 3.854 1872.109
> ReadX 1396868 0.005 0.324
> LockX 2910 0.004 0.026
> UnlockX 2910 0.002 0.025
> ** Flush 62260 0.750 1659.364
>
> Throughput 461.856 MB/sec (sync open) 32 clients 32 procs max_latency=1872.118 ms
> + stat -f -c %T /btrfs/
> btrfs
> + uname -r
> 5.10.12-4.el7.x86_64
I'm not sure what your question is exactly.
Are both results after applying the same patchset, or are they before
and after applying the patchset, respectively?
If they are both with the patchset applied, and you wonder about the
big variation in the "Flush" operations, I am not sure about why it is
so.
Both throughput and max latency are better in result 2.
It's normal to have variations across dbench runs, I get them too, and
I do several runs (5 or 6) to check things out.
I don't use virtualization (testing on bare metal), I set the cpu
governor mode to performance (instead of the "powersave" default) and
use a non-debug kernel configuration, because otherwise I get
significant variations in latencies and throughput too (though I never
got a huge difference such as from 6.794 to 1659.364).
Thanks.
>
>
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/02
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Early on during a transaction commit we acquire the tree_log_mutex and
> > hold it until after we write the super blocks. But before writing the
> > extent buffers dirtied by the transaction and the super blocks we unblock
> > the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
> > fs_info->running_transaction to NULL.
> >
> > This means that after that and before writing the super blocks, new
> > transactions can start. However if any transaction wants to log an inode,
> > it will block waiting for the transaction commit to write its dirty
> > extent buffers and the super blocks because the tree_log_mutex is only
> > released after those operations are complete, and starting a new log
> > transaction blocks on that mutex (at start_log_trans()).
> >
> > Writing the dirty extent buffers and the super blocks can take a very
> > significant amount of time to complete, but we could allow the tasks
> > wanting to log an inode to proceed with most of their steps:
> >
> > 1) create the log trees
> > 2) log metadata in the trees
> > 3) write their dirty extent buffers
> >
> > They only need to wait for the previous transaction commit to complete
> > (write its super blocks) before they attempt to write their super blocks,
> > otherwise we could end up with a corrupt filesystem after a crash
> >
> > So change start_log_trans() to use the root tree's log_mutex to serialize
> > for the creation of the log root tree instead of using the tree_log_mutex,
> > and make btrfs_sync_log() acquire the tree_log_mutex before writing the
> > super blocks. This allows for inode logging to wait much less time when
> > there is a previous transaction that is still committing, often not having
> > to wait at all, as by the time when we try to sync the log the previous
> > transaction already wrote its super blocks.
> >
> > This patch belongs to a patch set that is comprised of the following
> > patches:
> >
> > btrfs: fix race causing unnecessary inode logging during link and rename
> > btrfs: fix race that results in logging old extents during a fast fsync
> > btrfs: fix race that causes unnecessary logging of ancestor inodes
> > btrfs: fix race that makes inode logging fallback to transaction commit
> > btrfs: fix race leading to unnecessary transaction commit when logging inode
> > btrfs: do not block inode logging for so long during transaction commit
> >
> > The following script that uses dbench was used to measure the impact of
> > the whole patchset:
> >
> > $ cat test-dbench.sh
> > #!/bin/bash
> >
> > DEV=/dev/nvme0n1
> > MNT=/mnt/btrfs
> > MOUNT_OPTIONS="-o ssd"
> >
> > echo "performance" | \
> > tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> >
> > mkfs.btrfs -f -m single -d single $DEV
> > mount $MOUNT_OPTIONS $DEV $MNT
> >
> > dbench -D $MNT -t 300 64
> >
> > umount $MNT
> >
> > The test was run on a machine with 12 cores, 64G of ram, using a NVMe
> > device and a non-debug kernel configuration (Debian's default).
> >
> > Before patch set:
> >
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 11277211 0.250 85.340
> > Close 8283172 0.002 6.479
> > Rename 477515 1.935 86.026
> > Unlink 2277936 0.770 87.071
> > Deltree 256 15.732 81.379
> > Mkdir 128 0.003 0.009
> > Qpathinfo 10221180 0.056 44.404
> > Qfileinfo 1789967 0.002 4.066
> > Qfsinfo 1874399 0.003 9.176
> > Sfileinfo 918589 0.061 10.247
> > Find 3951758 0.341 54.040
> > WriteX 5616547 0.047 85.079
> > ReadX 17676028 0.005 9.704
> > LockX 36704 0.003 1.800
> > UnlockX 36704 0.002 0.687
> > Flush 790541 14.115 676.236
> >
> > Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
> >
> > After patch set:
> >
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 12687926 0.171 86.526
> > Close 9320780 0.002 8.063
> > Rename 537253 1.444 78.576
> > Unlink 2561827 0.559 87.228
> > Deltree 374 11.499 73.549
> > Mkdir 187 0.003 0.005
> > Qpathinfo 11500300 0.061 36.801
> > Qfileinfo 2017118 0.002 7.189
> > Qfsinfo 2108641 0.003 4.825
> > Sfileinfo 1033574 0.008 8.065
> > Find 4446553 0.408 47.835
> > WriteX 6335667 0.045 84.388
> > ReadX 19887312 0.003 9.215
> > LockX 41312 0.003 1.394
> > UnlockX 41312 0.002 1.425
> > Flush 889233 13.014 623.259
> >
> > Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
> >
> > +12.7% throughput, -8.2% max latency
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/ctree.h | 2 +-
> > fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++--------------
> > 2 files changed, 40 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index c0c6e79c43f9..7185384f475a 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1026,7 +1026,7 @@ enum {
> > BTRFS_ROOT_DEAD_RELOC_TREE,
> > /* Mark dead root stored on device whose cleanup needs to be resumed */
> > BTRFS_ROOT_DEAD_TREE,
> > - /* The root has a log tree. Used only for subvolume roots. */
> > + /* The root has a log tree. Used for subvolume roots and the tree root. */
> > BTRFS_ROOT_HAS_LOG_TREE,
> > /* Qgroup flushing is in progress */
> > BTRFS_ROOT_QGROUP_FLUSHING,
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index bc5b652f4f64..e8b84543d565 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -139,8 +139,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> > struct btrfs_log_ctx *ctx)
> > {
> > struct btrfs_fs_info *fs_info = root->fs_info;
> > + struct btrfs_root *tree_root = fs_info->tree_root;
> > int ret = 0;
> >
> > + /*
> > + * First check if the log root tree was already created. If not, create
> > + * it before locking the root's log_mutex, just to keep lockdep happy.
> > + */
> > + if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
> > + mutex_lock(&tree_root->log_mutex);
> > + if (!fs_info->log_root_tree) {
> > + ret = btrfs_init_log_root_tree(trans, fs_info);
> > + if (!ret)
> > + set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
> > + }
> > + mutex_unlock(&tree_root->log_mutex);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > mutex_lock(&root->log_mutex);
> >
> > if (root->log_root) {
> > @@ -156,13 +173,6 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> > set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
> > }
> > } else {
> > - mutex_lock(&fs_info->tree_log_mutex);
> > - if (!fs_info->log_root_tree)
> > - ret = btrfs_init_log_root_tree(trans, fs_info);
> > - mutex_unlock(&fs_info->tree_log_mutex);
> > - if (ret)
> > - goto out;
> > -
> > ret = btrfs_add_log_tree(trans, root);
> > if (ret)
> > goto out;
> > @@ -3022,6 +3032,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > int log_transid = 0;
> > struct btrfs_log_ctx root_log_ctx;
> > struct blk_plug plug;
> > + u64 log_root_start;
> > + u64 log_root_level;
> >
> > mutex_lock(&root->log_mutex);
> > log_transid = ctx->log_transid;
> > @@ -3199,22 +3211,31 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > goto out_wake_log_root;
> > }
> >
> > - btrfs_set_super_log_root(fs_info->super_for_commit,
> > - log_root_tree->node->start);
> > - btrfs_set_super_log_root_level(fs_info->super_for_commit,
> > - btrfs_header_level(log_root_tree->node));
> > -
> > + log_root_start = log_root_tree->node->start;
> > + log_root_level = btrfs_header_level(log_root_tree->node);
> > log_root_tree->log_transid++;
> > mutex_unlock(&log_root_tree->log_mutex);
> >
> > /*
> > - * Nobody else is going to jump in and write the ctree
> > - * super here because the log_commit atomic below is protecting
> > - * us. We must be called with a transaction handle pinning
> > - * the running transaction open, so a full commit can't hop
> > - * in and cause problems either.
> > + * Here we are guaranteed that nobody is going to write the superblock
> > + * for the current transaction before us, and that we do not write our
> > + * superblock before the previous transaction finishes its commit and
> > + * writes its superblock, because:
> > + *
> > + * 1) We are holding a handle on the current transaction, so nobody
> > + * can commit it until we release the handle;
> > + *
> > + * 2) Before writing our superblock we acquire the tree_log_mutex, so
> > + * if the previous transaction is still committing, and hasn't yet
> > + * written its superblock, we wait for it to do it, because a
> > + * transaction commit acquires the tree_log_mutex when the commit
> > + * begins and releases it only after writing its superblock.
> > */
> > + mutex_lock(&fs_info->tree_log_mutex);
> > + btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start);
> > + btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level);
> > ret = write_all_supers(fs_info, 1);
> > + mutex_unlock(&fs_info->tree_log_mutex);
> > if (ret) {
> > btrfs_set_log_full_commit(trans);
> > btrfs_abort_transaction(trans, ret);
> > @@ -3299,6 +3320,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
> > if (fs_info->log_root_tree) {
> > free_log_tree(trans, fs_info->log_root_tree);
> > fs_info->log_root_tree = NULL;
> > + clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state);
> > }
> > return 0;
> > }
> > --
> > 2.28.0
>
>
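The start_log_trans() change above is a check-lock-recheck pattern: an
unlocked test of the HAS_LOG_TREE bit avoids taking tree_root->log_mutex on
every call, and fs_info->log_root_tree is re-checked under the mutex so only
one task actually creates the tree. A minimal userspace analogue of that
pattern (the names here are illustrative stand-ins, not btrfs API):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;
static void *log_root_tree;        /* stands in for fs_info->log_root_tree */
static atomic_bool has_log_tree;   /* stands in for the HAS_LOG_TREE bit */

/* Unlocked fast path first; re-check the pointer under the mutex so that
 * concurrent callers create the tree exactly once (the kernel uses the
 * atomic test_bit()/set_bit() helpers for the flag). */
static int ensure_log_root_tree(void)
{
        int ret = 0;

        if (!atomic_load(&has_log_tree)) {
                pthread_mutex_lock(&log_mutex);
                if (!log_root_tree) {
                        /* stands in for btrfs_init_log_root_tree() */
                        log_root_tree = calloc(1, 64);
                        if (log_root_tree)
                                atomic_store(&has_log_tree, true);
                        else
                                ret = -1;
                }
                pthread_mutex_unlock(&log_mutex);
        }
        return ret;
}

int main(void)
{
        if (ensure_log_root_tree())
                return 1;
        printf("log root tree ready: %d\n", (int)atomic_load(&has_log_tree));
        return 0;
}

Doing this before taking the subvolume root's log_mutex also keeps the lock
nesting flat, which is the lockdep concern the comment in the hunk refers to.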
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 11:28 ` Filipe Manana
@ 2021-02-02 13:01 ` Wang Yugui
2021-02-02 14:09 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-02 13:01 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
Hi, Filipe Manana
> On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> >
> > Hi, Filipe Manana
> >
> > The dbench result with these patches is very good. Thanks a lot.
> >
> > This is the dbench (synchronous mode) result, and then a question.
> >
> > command: dbench -s -t 60 -D /btrfs/ 32
> > mount option:ssd,space_cache=v2
> > kernel:5.10.12 + patchset 1 + this patchset
>
> patchset 1 and "this patchset" are the same; did you mean two
> different patchsets or just a single patchset?
patchset1:
btrfs: some performance improvements for dbench alike workloads
patchset2:
btrfs: more performance improvements for dbench workloads
https://patchwork.kernel.org/project/linux-btrfs/list/?series=422801
I'm sorry that I replied to the wrong patchset.
>
> >
> > Question:
> > for synchronous mode, the result type 1 is perfect?
>
> What do you mean by perfect? You mean whether result 1 is better than result 2?
In result 1, the Flush MaxLat in dbench synchronous mode is low, as
expected, at the same level as kernel 5.4.91.
But in result 2, the Flush MaxLat is as high as the WriteX level, even
though this is synchronous mode, where most of the work should already
be done before the flush.
> > and there is still some minor place about the flush to do for
> > the result type2?
>
> By "minor place" you mean the huge difference I suppose.
>
> >
> >
> > result type 1:
> >
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 868942 0.028 3.017
> > Close 638536 0.003 0.061
> > Rename 36851 0.663 4.000
> > Unlink 175182 0.399 5.358
> > Qpathinfo 789014 0.014 1.846
> > Qfileinfo 137684 0.002 0.047
> > Qfsinfo 144241 0.004 0.059
> > Sfileinfo 70913 0.008 0.046
> > Find 304554 0.057 1.889
> > ** WriteX 429696 3.960 2239.973
> > ReadX 1363356 0.005 0.358
> > LockX 2836 0.004 0.038
> > UnlockX 2836 0.002 0.018
> > ** Flush 60771 0.621 6.794
> >
> > Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
> > + stat -f -c %T /btrfs/
> > btrfs
> > + uname -r
> > 5.10.12-4.el7.x86_64
> >
> >
> > result type 2:
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 888943 0.028 2.679
> > Close 652765 0.002 0.058
> > Rename 37705 0.572 3.962
> > Unlink 179713 0.383 3.983
> > Qpathinfo 806705 0.014 2.294
> > Qfileinfo 140752 0.002 0.125
> > Qfsinfo 147909 0.004 0.049
> > Sfileinfo 72374 0.008 0.104
> > Find 311839 0.058 2.305
> > ** WriteX 439656 3.854 1872.109
> > ReadX 1396868 0.005 0.324
> > LockX 2910 0.004 0.026
> > UnlockX 2910 0.002 0.025
> > ** Flush 62260 0.750 1659.364
> >
> > Throughput 461.856 MB/sec (sync open) 32 clients 32 procs max_latency=1872.118 ms
> > + stat -f -c %T /btrfs/
> > btrfs
> > + uname -r
> > 5.10.12-4.el7.x86_64
>
> I'm not sure what your question is exactly.
>
> Are both results after applying the same patchset, or are they before
> and after applying the patchset, respectively?
Both results are after applying the same patchset,
and both are from the same server, with the same SAS SSD disk,
but the results are not stable, and the major difference is the Flush MaxLat.
Server:Dell T7610
CPU: E5-2660 v2(10core 20threads) x2
SSD:TOSHIBA PX05SMQ040
Memory:192G (with ECC)
> If they are both with the patchset applied, and you wonder about the
> big variation in the "Flush" operations, I am not sure why that is so.
> Both throughput and max latency are better in result 2.
>
> It's normal to have variations across dbench runs; I get them too, and
> I do several runs (5 or 6) to check things out.
>
> I don't use virtualization (I test on bare metal); I set the cpu
> governor mode to performance (instead of the "powersave" default) and
> use a non-debug kernel configuration, because otherwise I get
> significant variations in latencies and throughput too (though I never
> got a huge difference such as from 6.794 to 1659.364).
This is bare metal (Dell T7610).
The CPU governor is set to performance by the BIOS, and I checked it with
'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'.
Maybe it matters that I used a SAS SSD, whose queue depth is 254,
smaller than the 1023 of an NVMe SSD, but that should still be enough
for dbench with 32 threads?
The huge difference in the Flush MaxLat, such as from 6.794 to 1659.364,
is a problem.
It is not easy to reproduce both; sometimes the small one is easy to
reproduce, sometimes the big one.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 13:01 ` Wang Yugui
@ 2021-02-02 14:09 ` Wang Yugui
2021-02-03 10:51 ` Filipe Manana
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-02 14:09 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
Hi, Filipe Manana
Here are some dbench (sync mode) results on the same hardware,
but with different Linux kernels:
4.14.200
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 225281 5.163 82.143
Flush 32161 2.250 62.669
Throughput 236.719 MB/sec (sync open) 32 clients 32 procs max_latency=82.149 ms
4.19.21
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 118842 10.946 116.345
Flush 16506 0.115 44.575
Throughput 125.973 MB/sec (sync open) 32 clients 32 procs max_latency=116.390 ms
4.19.150
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 144509 9.151 117.353
Flush 20563 0.128 52.014
Throughput 153.707 MB/sec (sync open) 32 clients 32 procs max_latency=117.379 ms
5.4.91
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 367033 4.377 1908.724
Flush 52037 0.159 39.871
Throughput 384.554 MB/sec (sync open) 32 clients 32 procs max_latency=1908.968 ms
5.10.12+patches
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 429696 3.960 2239.973
Flush 60771 0.621 6.794
Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
MaxLat / AvgLat of WriteX increased from 82.143/5.163 = 15.9 to
2239.973/3.960 = 565.6.
For QoS, can we have an option to tune things so that the MaxLat / AvgLat
ratio of WriteX stays below 100?
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 14:09 ` Wang Yugui
@ 2021-02-03 10:51 ` Filipe Manana
2021-02-03 11:03 ` Nikolay Borisov
0 siblings, 1 reply; 20+ messages in thread
From: Filipe Manana @ 2021-02-03 10:51 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-btrfs
On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi, Filipe Manana
>
> Here are some dbench (sync mode) results on the same hardware,
> but with different Linux kernels:
>
> 4.14.200
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 225281 5.163 82.143
> Flush 32161 2.250 62.669
> Throughput 236.719 MB/sec (sync open) 32 clients 32 procs max_latency=82.149 ms
>
> 4.19.21
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 118842 10.946 116.345
> Flush 16506 0.115 44.575
> Throughput 125.973 MB/sec (sync open) 32 clients 32 procs max_latency=116.390 ms
>
> 4.19.150
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 144509 9.151 117.353
> Flush 20563 0.128 52.014
> Throughput 153.707 MB/sec (sync open) 32 clients 32 procs max_latency=117.379 ms
>
> 5.4.91
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 367033 4.377 1908.724
> Flush 52037 0.159 39.871
> Throughput 384.554 MB/sec (sync open) 32 clients 32 procs max_latency=1908.968 ms
Ok, it seems somewhere between 4.19 and 5.4, something made the
latency much worse for you at least.
Is it only when using sync open (O_SYNC, dbench's -s flag)? What about
when not using it?
I'll have to look at it, but it will likely take some time.
Thanks.
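For reference, a minimal sketch of what the -s flag changes, assuming a
btrfs filesystem mounted at /btrfs (the file names are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "hello";

        /* dbench -s: files are opened with O_SYNC, so every write() returns
         * only once the data and required metadata are on stable storage. */
        int fd_sync = open("/btrfs/sync-file", O_CREAT | O_WRONLY | O_SYNC, 0644);
        if (fd_sync >= 0) {
                if (write(fd_sync, buf, strlen(buf)) < 0)
                        perror("sync write");
                close(fd_sync);
        }

        /* Without -s: write() only dirties the page cache, and durability is
         * deferred until an explicit fsync() (dbench's Flush operation). */
        int fd = open("/btrfs/buffered-file", O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) {
                if (write(fd, buf, strlen(buf)) < 0)
                        perror("buffered write");
                if (fsync(fd) < 0)
                        perror("fsync");
                close(fd);
        }
        return 0;
}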
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-03 10:51 ` Filipe Manana
@ 2021-02-03 11:03 ` Nikolay Borisov
2021-02-03 15:16 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2021-02-03 11:03 UTC (permalink / raw)
To: Filipe Manana, Wang Yugui; +Cc: linux-btrfs
On 3.02.21 г. 12:51 ч., Filipe Manana wrote:
> On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>>
>> Hi, Filipe Manana
>>
>> Here are some dbench (sync mode) results on the same hardware,
>> but with different Linux kernels:
>>
>> 4.14.200
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 225281 5.163 82.143
>> Flush 32161 2.250 62.669
>> Throughput 236.719 MB/sec (sync open) 32 clients 32 procs max_latency=82.149 ms
>>
>> 4.19.21
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 118842 10.946 116.345
>> Flush 16506 0.115 44.575
>> Throughput 125.973 MB/sec (sync open) 32 clients 32 procs max_latency=116.390 ms
>>
>> 4.19.150
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 144509 9.151 117.353
>> Flush 20563 0.128 52.014
>> Throughput 153.707 MB/sec (sync open) 32 clients 32 procs max_latency=117.379 ms
>>
>> 5.4.91
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 367033 4.377 1908.724
>> Flush 52037 0.159 39.871
>> Throughput 384.554 MB/sec (sync open) 32 clients 32 procs max_latency=1908.968 ms
>
> Ok, it seems somewhere between 4.19 and 5.4, something made the
> latency much worse for you at least.
>
> Is it only when using sync open (O_SYNC, dbench's -s flag)? What about
> when not using it?
>
> I'll have to look at it, but it will likely take some time.
This seems like the perf regression I observed starting with kernel 5.0:
preemptive flushing of metadata had been broken for quite some time, but
kernel 5.0 removed a btrfs_end_transaction call from
should_end_transaction, which unmasked the issue.
In particular this should have been fixed by the following commit in
misc-next:
https://github.com/kdave/btrfs-devel/commit/28d7e221e4323a5b98e5d248eb5603ff5206a188
which is part of a larger series of patches. So Wang, in order to test
this hypothesis, can you re-run those tests with the latest misc-next
branch?
<snip>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-03 11:03 ` Nikolay Borisov
@ 2021-02-03 15:16 ` Wang Yugui
2021-02-04 3:17 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-03 15:16 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Filipe Manana, linux-btrfs
Hi,
Here is the dbench (sync open) test result of misc-next (5.11-rc6 + 81
patches), and then some observations.
1, The MaxLat dropped from the 1900ms level to the 1000ms level.
That is a good improvement.
2, It is OK that NTCreateX/Rename/Unlink/WriteX have the same level of
MaxLat, because all of them write something to disk.
3, I'm not sure whether the MaxLat of Flush is OK.
There should be nothing left to write for a Flush in dbench (sync open)
mode, but I'm not sure whether some SCSI command such as SYNCHRONIZE
CACHE is still executed by each Flush; see the sketch below.
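If I read dbench right, each Flush operation boils down to an fsync() on
the file, roughly like this minimal sketch (the path is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/btrfs/some-file", O_RDWR);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Even when the file was written through O_SYNC and no dirty pages
         * remain, this may still make btrfs commit tree-log work and ask the
         * device to flush its volatile write cache. */
        if (fsync(fd) < 0)
                perror("fsync");
        close(fd);
        return 0;
}

So a Flush is not necessarily free: the device-level cache flush
(SYNCHRONIZE CACHE on SCSI, FLUSH on NVMe) can still be issued even when
there is no file data left to write.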
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 924298 0.035 745.523
Close 678499 0.002 0.093
Rename 39280 0.597 744.714
Unlink 187119 0.435 745.547
Qpathinfo 838558 0.013 0.978
Qfileinfo 146308 0.002 0.055
Qfsinfo 154172 0.004 0.108
Sfileinfo 75423 0.010 0.109
Find 324478 0.050 1.115
WriteX 457376 3.665 957.922
ReadX 1454051 0.005 0.131
LockX 3032 0.004 0.027
UnlockX 3032 0.002 0.022
Flush 64867 0.649 718.676
Throughput 481.002 MB/sec (sync open) 32 clients 32 procs max_latency=957.929 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.11.0-0.rc6.141.el8.x86_64
detail:
32 33682 548.70 MB/sec execute 1 sec latency 7.024 ms
32 36692 538.41 MB/sec execute 2 sec latency 6.719 ms
32 39787 544.75 MB/sec execute 3 sec latency 6.405 ms
32 42850 540.09 MB/sec execute 4 sec latency 7.717 ms
32 45872 535.64 MB/sec execute 5 sec latency 7.271 ms
32 48861 524.29 MB/sec execute 6 sec latency 6.685 ms
32 51343 512.54 MB/sec execute 7 sec latency 128.666 ms
32 53366 496.73 MB/sec execute 8 sec latency 703.255 ms *1
32 56428 498.60 MB/sec execute 9 sec latency 6.346 ms
32 59471 498.40 MB/sec execute 10 sec latency 9.958 ms
32 62175 495.68 MB/sec execute 11 sec latency 14.270 ms
32 65143 498.90 MB/sec execute 12 sec latency 9.391 ms
32 68116 502.19 MB/sec execute 13 sec latency 5.713 ms
32 71078 501.45 MB/sec execute 14 sec latency 6.235 ms
32 74030 500.43 MB/sec execute 15 sec latency 7.135 ms
32 76908 500.82 MB/sec execute 16 sec latency 7.071 ms
32 79794 499.61 MB/sec execute 17 sec latency 7.556 ms
32 82791 502.30 MB/sec execute 18 sec latency 6.509 ms
32 85676 502.05 MB/sec execute 19 sec latency 6.938 ms
32 86950 486.44 MB/sec execute 20 sec latency 554.015 ms *2
32 89670 487.60 MB/sec execute 21 sec latency 901.490 ms *3
32 92577 487.64 MB/sec execute 22 sec latency 6.715 ms
32 95528 488.03 MB/sec execute 23 sec latency 7.457 ms
32 98507 488.76 MB/sec execute 24 sec latency 7.266 ms
32 101340 488.76 MB/sec execute 25 sec latency 6.699 ms
32 104331 489.94 MB/sec execute 26 sec latency 6.506 ms
32 107166 490.87 MB/sec execute 27 sec latency 6.582 ms
32 110022 490.17 MB/sec execute 28 sec latency 7.072 ms
32 112931 490.34 MB/sec execute 29 sec latency 6.484 ms
32 115658 489.67 MB/sec execute 30 sec latency 6.767 ms
32 118569 490.14 MB/sec execute 31 sec latency 6.825 ms
32 121334 490.34 MB/sec execute 32 sec latency 7.270 ms
32 124182 489.95 MB/sec execute 33 sec latency 6.849 ms
32 127073 490.05 MB/sec execute 34 sec latency 6.934 ms
32 129835 489.56 MB/sec execute 35 sec latency 7.455 ms
32 130952 481.58 MB/sec execute 36 sec latency 635.676 ms *4
32 133736 481.74 MB/sec execute 37 sec latency 957.929 ms *5
32 136646 481.79 MB/sec execute 38 sec latency 7.339 ms
32 139616 483.23 MB/sec execute 39 sec latency 7.199 ms
32 142526 483.62 MB/sec execute 40 sec latency 7.344 ms
32 145429 483.58 MB/sec execute 41 sec latency 6.967 ms
32 148329 484.09 MB/sec execute 42 sec latency 8.043 ms
32 151091 483.89 MB/sec execute 43 sec latency 7.476 ms
32 153913 484.33 MB/sec execute 44 sec latency 7.611 ms
32 156679 484.29 MB/sec execute 45 sec latency 7.612 ms
32 159534 483.90 MB/sec execute 46 sec latency 8.295 ms
32 162328 484.17 MB/sec execute 47 sec latency 6.582 ms
32 165080 483.64 MB/sec execute 48 sec latency 8.939 ms
32 167861 484.12 MB/sec execute 49 sec latency 6.684 ms
32 170616 483.56 MB/sec execute 50 sec latency 7.051 ms
32 173557 483.89 MB/sec execute 51 sec latency 6.923 ms
32 176424 484.52 MB/sec execute 52 sec latency 6.689 ms
32 179255 484.14 MB/sec execute 53 sec latency 7.973 ms
32 181195 481.47 MB/sec execute 54 sec latency 305.495 ms
32 183309 479.62 MB/sec execute 55 sec latency 866.862 ms *6
32 186256 479.82 MB/sec execute 56 sec latency 7.016 ms
32 189209 480.82 MB/sec execute 57 sec latency 6.789 ms
32 192072 480.93 MB/sec execute 58 sec latency 7.305 ms
32 195054 481.00 MB/sec execute 59 sec latency 7.432 ms
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/03
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-03 15:16 ` Wang Yugui
@ 2021-02-04 3:17 ` Wang Yugui
2021-02-04 6:19 ` Nikolay Borisov
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-04 3:17 UTC (permalink / raw)
To: Nikolay Borisov, Filipe Manana; +Cc: linux-btrfs
Hi,
I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
5.10.12, with all other kernel components and the kernel config unchanged.
The dbench (sync open) result is better on both throughput and max_latency.
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 958611 0.026 440.564
Close 703418 0.002 0.085
Rename 40599 0.518 439.825
Unlink 194095 0.282 439.792
Qpathinfo 869295 0.010 1.852
Qfileinfo 151285 0.002 0.095
Qfsinfo 159492 0.004 0.051
Sfileinfo 78078 0.010 0.104
Find 335905 0.036 3.043
WriteX 473353 3.637 713.039
ReadX 1502197 0.005 0.752
LockX 3100 0.004 0.024
UnlockX 3100 0.002 0.020
Flush 67264 0.575 363.550
Throughput 497.551 MB/sec (sync open) 32 clients 32 procs max_latency=713.045 ms
detail:
32 34620 506.62 MB/sec execute 1 sec latency 6.860 ms
32 37878 528.91 MB/sec execute 2 sec latency 6.366 ms
32 41015 541.17 MB/sec execute 3 sec latency 6.326 ms
32 43981 539.10 MB/sec execute 4 sec latency 6.743 ms
32 47111 540.92 MB/sec execute 5 sec latency 7.597 ms
32 50163 544.44 MB/sec execute 6 sec latency 6.984 ms
32 52952 532.23 MB/sec execute 7 sec latency 63.769 ms
32 55561 514.29 MB/sec execute 8 sec latency 482.822 ms
32 58415 515.72 MB/sec execute 9 sec latency 6.238 ms
32 61365 515.94 MB/sec execute 10 sec latency 8.102 ms
32 64309 518.10 MB/sec execute 11 sec latency 6.176 ms
32 67322 517.59 MB/sec execute 12 sec latency 6.690 ms
32 70343 515.28 MB/sec execute 13 sec latency 6.831 ms
32 73323 512.81 MB/sec execute 14 sec latency 7.754 ms
32 76151 513.38 MB/sec execute 15 sec latency 7.275 ms
32 79042 513.44 MB/sec execute 16 sec latency 6.678 ms
32 81995 514.71 MB/sec execute 17 sec latency 8.393 ms
32 84911 512.80 MB/sec execute 18 sec latency 7.574 ms
32 87884 510.68 MB/sec execute 19 sec latency 8.363 ms
32 89711 502.50 MB/sec execute 20 sec latency 695.270 ms *1
32 92628 503.68 MB/sec execute 21 sec latency 7.405 ms
32 95627 503.66 MB/sec execute 22 sec latency 6.545 ms
32 98638 502.90 MB/sec execute 23 sec latency 6.649 ms
32 101641 503.29 MB/sec execute 24 sec latency 7.740 ms
32 104515 503.72 MB/sec execute 25 sec latency 6.941 ms
32 107448 504.28 MB/sec execute 26 sec latency 6.704 ms
32 110321 504.79 MB/sec execute 27 sec latency 6.216 ms
32 113235 503.77 MB/sec execute 28 sec latency 7.366 ms
32 116185 503.18 MB/sec execute 29 sec latency 7.153 ms
32 119074 503.35 MB/sec execute 30 sec latency 7.026 ms
32 121970 503.51 MB/sec execute 31 sec latency 8.789 ms
32 123948 498.76 MB/sec execute 32 sec latency 539.597 ms *2
32 127045 499.14 MB/sec execute 33 sec latency 6.112 ms
32 130059 499.66 MB/sec execute 34 sec latency 7.762 ms
32 133062 500.45 MB/sec execute 35 sec latency 6.706 ms
32 136014 501.03 MB/sec execute 36 sec latency 8.384 ms
32 139018 502.13 MB/sec execute 37 sec latency 6.822 ms
32 142047 502.50 MB/sec execute 38 sec latency 6.924 ms
32 144935 501.74 MB/sec execute 39 sec latency 7.319 ms
32 147836 501.38 MB/sec execute 40 sec latency 6.958 ms
32 150677 501.36 MB/sec execute 41 sec latency 6.993 ms
32 153562 501.63 MB/sec execute 42 sec latency 7.126 ms
32 156385 501.63 MB/sec execute 43 sec latency 7.602 ms
32 159258 500.99 MB/sec execute 44 sec latency 7.281 ms
32 162248 501.05 MB/sec execute 45 sec latency 6.896 ms
32 165128 501.12 MB/sec execute 46 sec latency 7.134 ms
32 168070 501.39 MB/sec execute 47 sec latency 6.674 ms
32 169978 497.76 MB/sec execute 48 sec latency 355.037 ms
32 172936 497.92 MB/sec execute 49 sec latency 713.045 ms *3
32 175828 497.93 MB/sec execute 50 sec latency 6.585 ms
32 178731 498.06 MB/sec execute 51 sec latency 8.511 ms
32 181681 498.79 MB/sec execute 52 sec latency 7.093 ms
32 184622 498.83 MB/sec execute 53 sec latency 6.265 ms
32 187647 498.90 MB/sec execute 54 sec latency 6.902 ms
32 190125 497.59 MB/sec execute 55 sec latency 142.628 ms
32 192920 497.36 MB/sec execute 56 sec latency 432.224 ms
32 195822 497.85 MB/sec execute 57 sec latency 6.497 ms
32 198709 497.56 MB/sec execute 58 sec latency 6.418 ms
32 201665 497.55 MB/sec execute 59 sec latency 6.699 ms
Because some kernel interfaces changed between 5.10 and 5.11, we need
some modifications to btrfs misc-next (5.11-rc6 + 81 patches) to run it on
Linux LTS 5.10.12:
# drop 0001-block-add-a-bdev_kobj-helper.patch
# drop 0001-block-remove-i_bdev.patch
# add 0001-btrfs-bdev_nr_sectors.patch
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 3:17 ` Wang Yugui
@ 2021-02-04 6:19 ` Nikolay Borisov
2021-02-04 11:34 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2021-02-04 6:19 UTC (permalink / raw)
To: Wang Yugui, Filipe Manana; +Cc: linux-btrfs
On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> Hi,
>
> I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
> 5.10.12, with all other kernel components and the kernel config unchanged.
>
> The dbench (sync open) result is better on both throughput and max_latency.
>
If I understand correctly, you rebased current misc-next onto 5.10.12. If
so, this means something else in the main kernel, not btrfs related,
degrades performance. It seems you got a 300ms win by running on 5.10 as
compared to the 5.11-rc6-based misc-next, is that right?
<snip>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 6:19 ` Nikolay Borisov
@ 2021-02-04 11:34 ` Wang Yugui
2021-02-04 11:58 ` Nikolay Borisov
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-04 11:34 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Filipe Manana, linux-btrfs
Hi,
> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> > Hi,
> >
> > I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
> > 5.10.12, with all other kernel components and the kernel config unchanged.
> >
> > The dbench (sync open) result is better on both throughput and max_latency.
> >
>
> If I understand correctly, you rebased current misc-next onto 5.10.12. If
> so, this means something else in the main kernel, not btrfs related,
> degrades performance. It seems you got a 300ms win by running on 5.10 as
> compared to the 5.11-rc6-based misc-next, is that right?
Yes.
Maybe some code other than btrfs in the mainline 5.11-rc6 kernel degrades
performance, or maybe it is just because of a different kernel config.
The kernel configs I used:
https://kojipkgs.fedoraproject.org/packages/kernel/5.11.0/0.rc6.141.eln108/
https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/
I rebased current misc-next onto 5.10.12, so that the only diff is in the
btrfs source code.
Only 3 minor patches were needed for this rebase; there seems to be no
breaking kernel API change for btrfs between 5.10 and 5.11:
# add-to-5.10 0001-block-add-a-bdev_kobj-helper.patch
# drop-from-btrfs-misc-next 0001-block-remove-i_bdev.patch
# fix-to-btrfs-misc-next 0001-btrfs-bdev_nr_sectors.patch
More patches came into misc-next today; they are not yet rebased/tested.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 11:34 ` Wang Yugui
@ 2021-02-04 11:58 ` Nikolay Borisov
2021-02-05 1:47 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2021-02-04 11:58 UTC (permalink / raw)
To: Wang Yugui; +Cc: Filipe Manana, linux-btrfs
On 4.02.21 г. 13:34 ч., Wang Yugui wrote:
> Hi,
>
>> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
>>> Hi,
>>>
>>> I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
>>> 5.10.12, with all other kernel components and the kernel config unchanged.
>>>
>>> The dbench (sync open) result is better on both throughput and max_latency.
>>>
>>
>> If I understand correctly, you rebased current misc-next onto 5.10.12. If
>> so, this means something else in the main kernel, not btrfs related,
>> degrades performance. It seems you got a 300ms win by running on 5.10 as
>> compared to the 5.11-rc6-based misc-next, is that right?
>
> Yes.
I just realized this could also be caused by btrfs code that has already
landed in v5.11-rc1, i.e. the main pull request for this release.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 11:58 ` Nikolay Borisov
@ 2021-02-05 1:47 ` Wang Yugui
0 siblings, 0 replies; 20+ messages in thread
From: Wang Yugui @ 2021-02-05 1:47 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Filipe Manana, linux-btrfs
Hi,
> On 4.02.21 г. 13:34 ч., Wang Yugui wrote:
> > Hi,
> >
> >> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> >>> Hi,
> >>>
> >>> I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
> >>> 5.10.12, with all other kernel components and the kernel config unchanged.
> >>>
> >>> The dbench (sync open) result is better on both throughput and max_latency.
> >>>
> >>
> >> If I understand correctly, you rebased current misc-next onto 5.10.12. If
> >> so, this means something else in the main kernel, not btrfs related,
> >> degrades performance. It seems you got a 300ms win by running on 5.10 as
> >> compared to the 5.11-rc6-based misc-next, is that right?
> >
> > Yes.
>
> I just realized this could also be caused by btrfs code that has already
> landed in v5.11-rc1, i.e. the main pull request for this release.
This performance problem seems very complex.
Here is the result of btrfs 5.10.12 + these btrfs patchsets + the rest of
kernel 5.10.12:
- 5.11 free-space-tree 13 patch
- 5.11 btrfs: some performance improvements for dbench alike workloads
- misc-next btrfs: more performance improvements for dbench workloads
- misc-next Improve preemptive ENOSPC flushing
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 858500 0.030 2.969
Close 630548 0.002 0.071
Rename 36416 0.744 3.791
Unlink 173375 0.503 4.109
Qpathinfo 779355 0.014 2.715
Qfileinfo 136041 0.002 0.054
Qfsinfo 142728 0.004 0.084
Sfileinfo 69953 0.010 0.098
Find 301093 0.059 2.670
WriteX 424594 3.999 1765.527
ReadX 1348726 0.005 0.397
LockX 2816 0.004 0.022
UnlockX 2816 0.003 0.025
Flush 60094 0.764 1478.133
Throughput 445.83 MB/sec (sync open) 32 clients 32 procs max_latency=1765.539 ms
This is still worse than btrfs misc-next (20210203) + the rest of kernel 5.10.12:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     958611     0.026   440.564
 Close         703418     0.002     0.085
 Rename         40599     0.518   439.825
 Unlink        194095     0.282   439.792
 Qpathinfo     869295     0.010     1.852
 Qfileinfo     151285     0.002     0.095
 Qfsinfo       159492     0.004     0.051
 Sfileinfo      78078     0.010     0.104
 Find          335905     0.036     3.043
 WriteX        473353     3.637   713.039
 ReadX        1502197     0.005     0.752
 LockX           3100     0.004     0.024
 UnlockX         3100     0.002     0.020
 Flush          67264     0.575   363.550
Throughput 497.551 MB/sec (sync open) 32 clients 32 procs max_latency=713.045 ms
So some other patch in misc-next improved this performance too.
But the WriteX MaxLat/AvgLat ratio with btrfs misc-next (20210203) on the
5.10.12 kernel is still very high (713.039 / 3.637, roughly 196), so some
patch merged after 5.10 may have introduced a new performance problem too.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04
> >
> > Maybe some code other than btrfs in the mainline 5.11-rc6 kernel degrades
> > performance, or maybe it is just due to a different kernel config.
> > The kernel configs I used:
> > kernel config I used:
> > https://kojipkgs.fedoraproject.org/packages/kernel/5.11.0/0.rc6.141.eln108/
> > https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/
> >
> > I rebased current misc-next onto 5.10.12, so that the only difference is
> > in the btrfs source code.
> >
> > Only 3 minor patches were needed for this rebase; there seems to be no
> > breaking kernel API change for btrfs between 5.10 and 5.11:
> > # add-to-5.10 0001-block-add-a-bdev_kobj-helper.patch
> > # drop-from-btrfs-misc-next 0001-block-remove-i_bdev.patch
> > # fix-to-btrfs-misc-next 0001-btrfs-bdev_nr_sectors.patch
> >
> > More patches came into misc-next today; they are not yet rebased/tested.
> >
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2021/02/04
> >
* Re: [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (5 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit fdmanana
@ 2020-11-30 17:30 ` David Sterba
6 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2020-11-30 17:30 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Wed, Nov 25, 2020 at 12:19:22PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> This patchset mostly fixes some races that result in an fsync falling back
> to a transaction commit unnecessarily or logging more metadata than it needs
> to. It also addresses link and rename operations logging inodes that were
> fsynced in the previous transaction but do not need to be logged in
> the current transaction. All these cases can be triggered often by dbench.
>
> Finally it avoids blocking an fsync for a significant time window when the
> previous transaction is still committing. All these together resulted in
> about +12% throughput with dbench and -8% maximum latency, as mentioned in
> the changelog of the last patch.
>
> There will be more changes to improve performance with dbench or similar
> workloads, but these have been cooking and under testing for about one week
> and are independent of what's coming next.
Looking forward to more perf improvements, much appreciated.
> Filipe Manana (6):
> btrfs: fix race causing unnecessary inode logging during link and rename
> btrfs: fix race that results in logging old extents during a fast fsync
> btrfs: fix race that causes unnecessary logging of ancestor inodes
> btrfs: fix race that makes inode logging fallback to transaction commit
> btrfs: fix race leading to unnecessary transaction commit when logging inode
> btrfs: do not block inode logging for so long during transaction commit
Added to misc-next, thanks.