* [PATCH 1/6] btrfs: fix race causing unnecessary inode logging during link and rename
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 2/6] btrfs: fix race that results in logging old extents during a fast fsync fdmanana
` (5 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When we are doing a rename or a link operation for an inode that was logged
in the previous transaction and that transaction is still committing, we
have a time window where we incorrectly consider that the inode was logged
previously in the current transaction and therefore decide to log it to
update it in the log. The following steps give an example of how this
happens during a link operation:
1) Inode X is logged in transaction 1000, so its logged_trans field is set
to 1000;
2) Task A starts to commit transaction 1000;
3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED;
4) Task B starts a link operation for inode X, and as a consequence it
starts transaction 1001;
5) Task A is still committing transaction 1000, therefore the value stored
at fs_info->last_trans_committed is still 999;
6) Task B calls btrfs_log_new_name(), it reads a value of 999 from
fs_info->last_trans_committed and because the logged_trans field of
inode X has a value of 1000, the function does not return immediately,
instead it proceeds to logging the inode, which should not happen
because the inode was logged in the previous transaction (1000) and
not in the current one (1001).
This is not a functional problem, just wasted time and space logging an
inode that does not need to be logged, contributing to higher latency
for link and rename operations.
So fix this by comparing the inodes' logged_trans field with the
generation of the current transaction instead of comparing with the value
stored in fs_info->last_trans_committed.
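To make the race concrete, here is a minimal userspace model of the old
and new checks, using the transaction numbers from the example above
(illustrative only, not kernel code; the function names are made up for
the sketch):

#include <stdbool.h>
#include <stdio.h>

/* Old check: races while the previous commit has not yet advanced
 * last_trans_committed from 999 to 1000. */
static bool old_check_logs_inode(unsigned long long logged_trans,
                                 unsigned long long last_trans_committed)
{
        return logged_trans > last_trans_committed;
}

/* New check: trans->transid is fixed for the lifetime of the handle,
 * so there is no window where the value we compare against lags. */
static bool new_check_logs_inode(unsigned long long logged_trans,
                                 unsigned long long transid)
{
        return logged_trans >= transid;
}

int main(void)
{
        /* Inode X was logged in transaction 1000, which is still
         * committing (last_trans_committed == 999), and the link runs
         * in transaction 1001. */
        printf("old: %d\n", old_check_logs_inode(1000, 999));  /* 1: logs needlessly */
        printf("new: %d\n", new_check_logs_inode(1000, 1001)); /* 0: correctly skips */
        return 0;
}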
This case is often hit when running dbench for a long enough duration, as
it does lots of rename operations.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 80cf496226c2..b63f5c2b982a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -6443,7 +6443,6 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode, struct btrfs_inode *old_dir,
struct dentry *parent)
{
- struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_log_ctx ctx;
/*
@@ -6457,8 +6456,8 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans,
* if this inode hasn't been logged and directory we're renaming it
* from hasn't been logged, we don't need to log it
*/
- if (inode->logged_trans <= fs_info->last_trans_committed &&
- (!old_dir || old_dir->logged_trans <= fs_info->last_trans_committed))
+ if (inode->logged_trans < trans->transid &&
+ (!old_dir || old_dir->logged_trans < trans->transid))
return;
btrfs_init_log_ctx(&ctx, &inode->vfs_inode);
--
2.28.0
* [PATCH 2/6] btrfs: fix race that results in logging old extents during a fast fsync
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
2020-11-25 12:19 ` [PATCH 1/6] btrfs: fix race causing unnecessary inode logging during link and rename fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 3/6] btrfs: fix race that causes unnecessary logging of ancestor inodes fdmanana
` (4 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging the extents of an inode during a fast fsync, we have a time
window where we can log extents that are from the previous transaction and
already persisted. This only makes us waste time unnecessarily.
The following sequence of steps shows how this can happen:
1) We are at transaction 1000;
2) An ordered extent E from inode I completes, that is, it has gone through
btrfs_finish_ordered_io(), which set the extent map's generation to
1000, the generation of the current transaction, when the extent was
unpinned;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Some change is made to inode I, resulting in creation of a new
transaction with a generation of 1001;
6) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B starts an fsync on inode I, and when it gets to
btrfs_log_changed_extents() it sees the extent map for extent E in the
list of modified extents. It sees the extent map has a generation of
1000 and fs_info->last_trans_committed has a value of 999, so it
proceeds to logging the respective file extent item and all the
checksums covering its range.
So we end up wasting time since the extent was already persisted and
is reachable through the trees pointed to by the super block committed
by transaction 1000.
So just fix this by comparing the extent map's generation against the
generation of the transaction handle - if it is smaller than the id in the
handle, we know the extent was already persisted and we do not need to log
it.
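As a quick illustration, here is a minimal userspace model of that
filter over a list of extent map generations (illustrative only; the
kernel code iterates the modified_extents list as shown in the diff
below):

#include <stdio.h>

int main(void)
{
        const unsigned long long transid = 1001; /* current handle's id */
        const unsigned long long gens[] = { 998, 1000, 1001 };

        for (int i = 0; i < 3; i++) {
                if (gens[i] < transid) {
                        /* Persisted by an earlier transaction that has
                         * already written its super blocks, skip it. */
                        printf("gen %llu: skip\n", gens[i]);
                        continue;
                }
                printf("gen %llu: log file extent item and csums\n", gens[i]);
        }
        return 0;
}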
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index b63f5c2b982a..596d72d239e9 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4416,14 +4416,12 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
struct extent_map *em, *n;
struct list_head extents;
struct extent_map_tree *tree = &inode->extent_tree;
- u64 test_gen;
int ret = 0;
int num = 0;
INIT_LIST_HEAD(&extents);
write_lock(&tree->lock);
- test_gen = root->fs_info->last_trans_committed;
list_for_each_entry_safe(em, n, &tree->modified_extents, list) {
list_del_init(&em->list);
@@ -4439,7 +4437,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
goto process;
}
- if (em->generation <= test_gen)
+ if (em->generation < trans->transid)
continue;
/* We log prealloc extents beyond eof later. */
--
2.28.0
* [PATCH 3/6] btrfs: fix race that causes unnecessary logging of ancestor inodes
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
2020-11-25 12:19 ` [PATCH 1/6] btrfs: fix race causing unnecessary inode logging during link and rename fdmanana
2020-11-25 12:19 ` [PATCH 2/6] btrfs: fix race that results in logging old extents during a fast fsync fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 4/6] btrfs: fix race that makes inode logging fallback to transaction commit fdmanana
` (3 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging an inode and we are checking if we need to log ancestors that
are new, if the previous transaction is still committing we have a time
window where we can unnecessarily log ancestor inodes that were created in
the previous transaction.
The race is described by the following steps:
1) We are at transaction 1000;
2) Directory inode X is created, its generation is set to 1000;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Inode Y, a regular file, is created under directory inode X, this
results in starting a new transaction with a generation of 1001;
6) The transaction 1000 commit is unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B calls fsync on inode Y and gets a handle for transaction 1001;
8) Task B ends up at log_all_new_ancestors() and then, because inode Y has
only one hard link, ends up at log_new_ancestors_fast(). There it reads
a value of 999 from fs_info->last_trans_committed, and sees that the
parent inode X has a generation of 1000, so we end up logging inode X:
if (inode->generation > fs_info->last_trans_committed) {
        ret = btrfs_log_inode(trans, root, inode,
                              LOG_INODE_EXISTS, ctx);
(...)
which is not necessary since it was created in the past transaction,
with a generation of 1000, and that transaction has already committed
its super blocks - it's still unpinning extents so it has not yet
updated fs_info->last_trans_committed from 999 to 1000.
So this just causes us to spend more time logging and allocating and
writing more tree blocks for the log tree.
So fix this by comparing an inode's generation with the generation of the
transaction our transaction handle refers to - if the inode's generation
matches the generation of the current transaction then we know it is a
new inode we need to log, otherwise don't log it.
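Since an inode's generation is assigned when it is created, the check
boils down to the predicate in the following sketch (illustrative only,
not the kernel helper):

#include <stdbool.h>
#include <stdio.h>

/* An inode is new iff its generation is not older than the handle's
 * transid; in practice the generation never exceeds the transid, so
 * >= is effectively an equality test, and unlike last_trans_committed
 * the transid never lags behind what is already persisted. */
static bool created_in_current_trans(unsigned long long generation,
                                     unsigned long long transid)
{
        return generation >= transid;
}

int main(void)
{
        /* Inode X was created in transaction 1000, the fsync of inode
         * Y runs with a handle on transaction 1001. */
        printf("log ancestor X: %d\n",
               created_in_current_trans(1000, 1001)); /* 0: skip it */
        return 0;
}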
This case is often hit when running dbench for a long enough duration.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 596d72d239e9..6313c50f5e33 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5835,7 +5835,6 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
while (true) {
struct btrfs_fs_info *fs_info = root->fs_info;
- const u64 last_committed = fs_info->last_trans_committed;
struct extent_buffer *leaf = path->nodes[0];
int slot = path->slots[0];
struct btrfs_key search_key;
@@ -5854,7 +5853,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans,
if (IS_ERR(inode))
return PTR_ERR(inode);
- if (BTRFS_I(inode)->generation > last_committed)
+ if (BTRFS_I(inode)->generation >= trans->transid)
ret = btrfs_log_inode(trans, root, BTRFS_I(inode),
LOG_INODE_EXISTS, ctx);
btrfs_add_delayed_iput(inode);
@@ -5895,7 +5894,6 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
struct btrfs_log_ctx *ctx)
{
struct btrfs_root *root = inode->root;
- struct btrfs_fs_info *fs_info = root->fs_info;
struct dentry *old_parent = NULL;
struct super_block *sb = inode->vfs_inode.i_sb;
int ret = 0;
@@ -5909,7 +5907,7 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans,
if (root != inode->root)
break;
- if (inode->generation > fs_info->last_trans_committed) {
+ if (inode->generation >= trans->transid) {
ret = btrfs_log_inode(trans, root, inode,
LOG_INODE_EXISTS, ctx);
if (ret)
--
2.28.0
* [PATCH 4/6] btrfs: fix race that makes inode logging fallback to transaction commit
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (2 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 3/6] btrfs: fix race that causes unnecessary logging of ancestor inodes fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 5/6] btrfs: fix race leading to unnecessary transaction commit when logging inode fdmanana
` (2 subsequent siblings)
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging an inode and the previous transaction is still committing, we
have a time window where we can end up incorrectly thinking that an inode
has a last_unlink_trans value greater than the last committed transaction,
which results in the logging falling back to a full transaction commit,
which is usually much more expensive than doing a log commit.
The race is described by the following steps:
1) We are at transaction 1000;
2) We modify an inode X (a directory) using transaction 1000 and set its
last_unlink_trans field to 1000, because for example we removed one
of its subdirectories;
3) We create a new inode Y with a dentry in inode X using transaction 1000,
so its generation field is set to 1000;
4) The commit for transaction 1000 is started by task A;
5) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
6) Some task starts a new transaction with a generation of 1001;
7) We do some modification to inode Y (using transaction 1001);
8) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
9) Task B starts an fsync on inode Y, and gets a handle for transaction
1001. When it gets to check_parent_dirs_for_sync() it does the checking
of the ancestor dentries because the following check does not evaluate
to true:
if (S_ISREG(inode->vfs_inode.i_mode) &&
    inode->generation <= last_committed &&
    inode->last_unlink_trans <= last_committed)
        goto out;
The generation value for inode Y is 1000 and last_committed, which has
the value read from fs_info->last_trans_committed, has a value of 999,
so that check evaluates to false and we proceed to check the ancestor
inodes.
Once we get to the first ancestor, inode X, we call
btrfs_must_commit_transaction() on it, which evaluates to true:
static bool btrfs_must_commit_transaction(...)
{
        struct btrfs_fs_info *fs_info = inode->root->fs_info;
        bool ret = false;

        mutex_lock(&inode->log_mutex);
        if (inode->last_unlink_trans > fs_info->last_trans_committed) {
                /*
                 * Make sure any commits to the log are forced to be full
                 * commits.
                 */
                btrfs_set_log_full_commit(trans);
                ret = true;
        }
(...)
because inode X's last_unlink_trans has a value of 1000 and
fs_info->last_trans_committed still has a value of 999, it returns
true to check_parent_dirs_for_sync(), making it return 1, which is
propagated up to btrfs_sync_file(), causing it to fall back to a full
transaction commit of transaction 1001.
We should not have fallen back to committing transaction 1001, since
inode X had last_unlink_trans set to 1000 and the super blocks for
transaction 1000 were already written. So while not resulting in a
functional problem, it leads to a lot more work and higher latencies
for an fsync, since committing a transaction is usually more expensive
than committing a log (if other filesystem changes happened under that
transaction).
A similar problem happens when logging directories, for the same reason:
btrfs_must_commit_transaction() returns true for an inode whose
last_unlink_trans has the generation of the previous transaction while
that transaction is still committing and unpinning its freed extents.
So fix this by comparing last_unlink_trans with the id of the current
transaction instead of fs_info->last_trans_committed.
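Plugging in the numbers from the example above, a minimal sketch of why
the new comparison closes the window (illustrative only, not kernel
code):

#include <stdio.h>

int main(void)
{
        const unsigned long long last_unlink_trans = 1000;   /* set in step 2 */
        const unsigned long long last_trans_committed = 999; /* commit of 1000 unfinished */
        const unsigned long long transid = 1001;             /* fsync's handle */

        /* Old check: forces a needless full transaction commit. */
        printf("old: must commit = %d\n",
               (int)(last_unlink_trans > last_trans_committed)); /* 1 */

        /* New check: transaction 1000 already wrote its super blocks,
         * so committing the log is enough. */
        printf("new: must commit = %d\n",
               (int)(last_unlink_trans >= transid)); /* 0 */
        return 0;
}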
This case is often hit when running dbench for a long enough duration, as
it does lots of rename and rmdir operations (both update the
last_unlink_trans field of an inode) and fsyncs of files and directories.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 20 ++++++++------------
1 file changed, 8 insertions(+), 12 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6313c50f5e33..592a50d9697f 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5449,11 +5449,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode)
{
- struct btrfs_fs_info *fs_info = inode->root->fs_info;
bool ret = false;
mutex_lock(&inode->log_mutex);
- if (inode->last_unlink_trans > fs_info->last_trans_committed) {
+ if (inode->last_unlink_trans >= trans->transid) {
/*
* Make sure any commits to the log are forced to be full
* commits.
@@ -5475,8 +5474,7 @@ static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans,
static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode,
struct dentry *parent,
- struct super_block *sb,
- u64 last_committed)
+ struct super_block *sb)
{
int ret = 0;
struct dentry *old_parent = NULL;
@@ -5488,8 +5486,8 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans,
* and other fun in this file.
*/
if (S_ISREG(inode->vfs_inode.i_mode) &&
- inode->generation <= last_committed &&
- inode->last_unlink_trans <= last_committed)
+ inode->generation < trans->transid &&
+ inode->last_unlink_trans < trans->transid)
goto out;
if (!S_ISDIR(inode->vfs_inode.i_mode)) {
@@ -6024,7 +6022,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info = root->fs_info;
struct super_block *sb;
int ret = 0;
- u64 last_committed = fs_info->last_trans_committed;
bool log_dentries = false;
sb = inode->vfs_inode.i_sb;
@@ -6049,8 +6046,7 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
goto end_no_trans;
}
- ret = check_parent_dirs_for_sync(trans, inode, parent, sb,
- last_committed);
+ ret = check_parent_dirs_for_sync(trans, inode, parent, sb);
if (ret)
goto end_no_trans;
@@ -6080,8 +6076,8 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
* and other fun in this file.
*/
if (S_ISREG(inode->vfs_inode.i_mode) &&
- inode->generation <= last_committed &&
- inode->last_unlink_trans <= last_committed) {
+ inode->generation < trans->transid &&
+ inode->last_unlink_trans < trans->transid) {
ret = 0;
goto end_trans;
}
@@ -6130,7 +6126,7 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
* but the file inode does not have a matching BTRFS_INODE_REF_KEY item
* and has a link count of 2.
*/
- if (inode->last_unlink_trans > last_committed) {
+ if (inode->last_unlink_trans >= trans->transid) {
ret = btrfs_log_all_parents(trans, inode, ctx);
if (ret)
goto end_trans;
--
2.28.0
* [PATCH 5/6] btrfs: fix race leading to unnecessary transaction commit when logging inode
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (3 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 4/6] btrfs: fix race that makes inode logging fallback to transaction commit fdmanana
@ 2020-11-25 12:19 ` fdmanana
2020-11-25 12:19 ` [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit fdmanana
2020-11-30 17:30 ` [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads David Sterba
6 siblings, 0 replies; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
When logging an inode we may often have to fall back to a full transaction
commit, either because a new block group was allocated, there is some case
we can not deal with without a transaction commit, or some error like
-ENOMEM happened. However after we fall back to a transaction commit, we
have a time window where the next attempt to log any inode can
unnecessarily commit the next transaction, adding additional overhead and
increasing latency.
A sequence of steps that leads to this issue is the following:
1) The current open transaction has a generation of 1000;
2) A new block group is allocated, and as a consequence we must make sure
any attempts to commit a log fallback to a transaction commit, so
btrfs_set_log_full_commit() is called from btrfs_make_block_group().
This sets fs_info->last_trans_log_full_commit to 1000;
3) Task A is holding a handle on transaction 1000 and tries to log inode X.
Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
which returns true, since fs_info->last_trans_log_full_commit has a
value of 1000. So we end up returning -EAGAIN and propagating it up to
btrfs_sync_file(), where we commit transaction 1000;
4) The transaction commit task (task A) sets the transaction state to
unblocked (TRANS_STATE_UNBLOCKED);
5) Some other task, task B, starts a new transaction with a generation of
1001;
6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
7) Transaction 1000 has not fully committed yet, we are still writing all
the extent buffers it created;
8) Some new task, task C, starts an fsync of inode Y, gets a handle for
transaction 1001, and it gets to btrfs_log_inode_parent() which does
the following check:
if (fs_info->last_trans_log_full_commit > last_committed) {
        ret = 1;
        goto end_no_trans;
}
At that point last_trans_log_full_commit has a value of 1000 and
last_committed (value of fs_info->last_trans_committed) has a value of
999, since transaction 1000 has not yet committed - it is either still
writing out dirty extent buffers, its super blocks or unpinning
extents.
As a consequence we return 1, which gets propagated up to
btrfs_sync_file(), which will then call btrfs_commit_transaction()
for transaction 1001.
As a consequence we have an unnecessary second transaction commit: we
previously committed transaction 1000 and now commit transaction 1001
as well, resulting in more overhead and increased latency.
So fix this double transaction commit issue simply by removing that check.
All we need to do is wait for the previous transaction to finish its
commit, and we already do that later when starting the log transaction at
start_log_trans(), because there we acquire the tree_log_mutex, which is
held by a transaction commit and only released after the transaction
commits its super blocks.
Another issue that check has is that it reads last_trans_log_full_commit
without using READ_ONCE(), which is incorrect since that member of
struct btrfs_fs_info is always updated with WRITE_ONCE() through the
helper btrfs_set_log_full_commit().
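For context, the helper pair looks roughly like this sketch, based on
fs/btrfs/tree-log.h around this kernel version (check the tree for the
exact definitions):

static inline void btrfs_set_log_full_commit(struct btrfs_trans_handle *trans)
{
        WRITE_ONCE(trans->fs_info->last_trans_log_full_commit,
                   trans->transid);
}

static inline int btrfs_need_log_full_commit(struct btrfs_trans_handle *trans)
{
        return READ_ONCE(trans->fs_info->last_trans_log_full_commit) ==
                trans->transid;
}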
This double transaction commit issue can actually be triggered quite often
in long runs of dbench, since besides the creation of new block groups
that force inode logging to fallback to a transaction commit, there are
cases where dbench asks to fsync a directory which had files in it that
were previously renamed or subdirectories that were removed, resulting in
the inode logging to fallback to a full transaction commit.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/tree-log.c | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 592a50d9697f..bc5b652f4f64 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -6031,16 +6031,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans,
goto end_no_trans;
}
- /*
- * The prev transaction commit doesn't complete, we need do
- * full commit by ourselves.
- */
- if (fs_info->last_trans_log_full_commit >
- fs_info->last_trans_committed) {
- ret = 1;
- goto end_no_trans;
- }
-
if (btrfs_root_refs(&root->root_item) == 0) {
ret = 1;
goto end_no_trans;
--
2.28.0
* [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (4 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 5/6] btrfs: fix race leading to unnecessary transaction commit when logging inode fdmanana
@ 2020-11-25 12:19 ` fdmanana
2021-02-02 5:38 ` Wang Yugui
2020-11-30 17:30 ` [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads David Sterba
6 siblings, 1 reply; 20+ messages in thread
From: fdmanana @ 2020-11-25 12:19 UTC (permalink / raw)
To: linux-btrfs
From: Filipe Manana <fdmanana@suse.com>
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.
This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).
Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:
1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers
They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.
So change start_log_trans() to use the root tree's log_mutex to serialize
the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.
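A minimal userspace model of the resulting serialization, with pthreads
standing in for the kernel tasks (illustrative only): the log sync path
does its heavy work outside tree_log_mutex and only takes the mutex
around the super block write, so it waits for the previous commit's
super block write instead of the whole commit:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tree_log_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Previous transaction commit: as before, holds the mutex while
 * writing its dirty extent buffers and super blocks. */
static void *transaction_commit(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&tree_log_mutex);
        puts("commit: write dirty extent buffers");
        puts("commit: write super blocks");
        pthread_mutex_unlock(&tree_log_mutex);
        return NULL;
}

/* Log sync for the next transaction: creating log trees and writing
 * log tree blocks no longer needs the mutex. */
static void *log_sync(void *arg)
{
        (void)arg;
        puts("fsync: create log trees, log the inode, write log blocks");
        pthread_mutex_lock(&tree_log_mutex); /* wait only here */
        puts("fsync: write super blocks");
        pthread_mutex_unlock(&tree_log_mutex);
        return NULL;
}

int main(void)
{
        pthread_t commit_task, fsync_task;

        pthread_create(&commit_task, NULL, transaction_commit, NULL);
        pthread_create(&fsync_task, NULL, log_sync, NULL);
        pthread_join(commit_task, NULL);
        pthread_join(fsync_task, NULL);
        return 0;
}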
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
The following script that uses dbench was used to measure the impact of
the whole patchset:
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a machine with 12 cores, 64G of RAM, using an NVMe
device and a non-debug kernel configuration (Debian's default).
Before patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236
Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
After patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259
Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
+12.7% throughput, -8.2% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ctree.h | 2 +-
fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++--------------
2 files changed, 40 insertions(+), 18 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c0c6e79c43f9..7185384f475a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1026,7 +1026,7 @@ enum {
BTRFS_ROOT_DEAD_RELOC_TREE,
/* Mark dead root stored on device whose cleanup needs to be resumed */
BTRFS_ROOT_DEAD_TREE,
- /* The root has a log tree. Used only for subvolume roots. */
+ /* The root has a log tree. Used for subvolume roots and the tree root. */
BTRFS_ROOT_HAS_LOG_TREE,
/* Qgroup flushing is in progress */
BTRFS_ROOT_QGROUP_FLUSHING,
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index bc5b652f4f64..e8b84543d565 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -139,8 +139,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
struct btrfs_log_ctx *ctx)
{
struct btrfs_fs_info *fs_info = root->fs_info;
+ struct btrfs_root *tree_root = fs_info->tree_root;
int ret = 0;
+ /*
+ * First check if the log root tree was already created. If not, create
+ * it before locking the root's log_mutex, just to keep lockdep happy.
+ */
+ if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
+ mutex_lock(&tree_root->log_mutex);
+ if (!fs_info->log_root_tree) {
+ ret = btrfs_init_log_root_tree(trans, fs_info);
+ if (!ret)
+ set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
+ }
+ mutex_unlock(&tree_root->log_mutex);
+ if (ret)
+ return ret;
+ }
+
mutex_lock(&root->log_mutex);
if (root->log_root) {
@@ -156,13 +173,6 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
}
} else {
- mutex_lock(&fs_info->tree_log_mutex);
- if (!fs_info->log_root_tree)
- ret = btrfs_init_log_root_tree(trans, fs_info);
- mutex_unlock(&fs_info->tree_log_mutex);
- if (ret)
- goto out;
-
ret = btrfs_add_log_tree(trans, root);
if (ret)
goto out;
@@ -3022,6 +3032,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
int log_transid = 0;
struct btrfs_log_ctx root_log_ctx;
struct blk_plug plug;
+ u64 log_root_start;
+ u64 log_root_level;
mutex_lock(&root->log_mutex);
log_transid = ctx->log_transid;
@@ -3199,22 +3211,31 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
goto out_wake_log_root;
}
- btrfs_set_super_log_root(fs_info->super_for_commit,
- log_root_tree->node->start);
- btrfs_set_super_log_root_level(fs_info->super_for_commit,
- btrfs_header_level(log_root_tree->node));
-
+ log_root_start = log_root_tree->node->start;
+ log_root_level = btrfs_header_level(log_root_tree->node);
log_root_tree->log_transid++;
mutex_unlock(&log_root_tree->log_mutex);
/*
- * Nobody else is going to jump in and write the ctree
- * super here because the log_commit atomic below is protecting
- * us. We must be called with a transaction handle pinning
- * the running transaction open, so a full commit can't hop
- * in and cause problems either.
+ * Here we are guaranteed that nobody is going to write the superblock
+ * for the current transaction before us, and that we do not write our
+ * superblock before the previous transaction finishes its commit and
+ * writes its superblock, because:
+ *
+ * 1) We are holding a handle on the current transaction, so nobody
+ * can commit it until we release the handle;
+ *
+ * 2) Before writing our superblock we acquire the tree_log_mutex, so
+ * if the previous transaction is still committing, and hasn't yet
+ * written its superblock, we wait for it to do it, because a
+ * transaction commit acquires the tree_log_mutex when the commit
+ * begins and releases it only after writing its superblock.
*/
+ mutex_lock(&fs_info->tree_log_mutex);
+ btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start);
+ btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level);
ret = write_all_supers(fs_info, 1);
+ mutex_unlock(&fs_info->tree_log_mutex);
if (ret) {
btrfs_set_log_full_commit(trans);
btrfs_abort_transaction(trans, ret);
@@ -3299,6 +3320,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
if (fs_info->log_root_tree) {
free_log_tree(trans, fs_info->log_root_tree);
fs_info->log_root_tree = NULL;
+ clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state);
}
return 0;
}
--
2.28.0
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2020-11-25 12:19 ` [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit fdmanana
@ 2021-02-02 5:38 ` Wang Yugui
2021-02-02 11:28 ` Filipe Manana
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-02 5:38 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
Hi, Filipe Manana
The dbench result with these patches is very good. thanks a lot.
This is the dbench(synchronous mode) result , and then a question.
command: dbench -s -t 60 -D /btrfs/ 32
mount option:ssd,space_cache=v2
kernel:5.10.12 + patchset 1 + this patchset
patchset 1:
0001-btrfs-fix-race-causing-unnecessary-inode-logging-dur.patch
0002-btrfs-fix-race-that-results-in-logging-old-extents-d.patch
0003-btrfs-fix-race-that-causes-unnecessary-logging-of-an.patch
0004-btrfs-fix-race-that-makes-inode-logging-fallback-to-.patch
0005-btrfs-fix-race-leading-to-unnecessary-transaction-co.patch
0006-btrfs-do-not-block-inode-logging-for-so-long-during-.patch
We get two types of result as below, and the result type 1 is not easy
to reproduce now.
Question:
for synchronous mode, the result type 1 is perfect?
and there is still some minor place about the flush to do for
the result type2?
result type 1:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 868942 0.028 3.017
Close 638536 0.003 0.061
Rename 36851 0.663 4.000
Unlink 175182 0.399 5.358
Qpathinfo 789014 0.014 1.846
Qfileinfo 137684 0.002 0.047
Qfsinfo 144241 0.004 0.059
Sfileinfo 70913 0.008 0.046
Find 304554 0.057 1.889
** WriteX 429696 3.960 2239.973
ReadX 1363356 0.005 0.358
LockX 2836 0.004 0.038
UnlockX 2836 0.002 0.018
** Flush 60771 0.621 6.794
Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.10.12-4.el7.x86_64
result type 2:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 888943 0.028 2.679
Close 652765 0.002 0.058
Rename 37705 0.572 3.962
Unlink 179713 0.383 3.983
Qpathinfo 806705 0.014 2.294
Qfileinfo 140752 0.002 0.125
Qfsinfo 147909 0.004 0.049
Sfileinfo 72374 0.008 0.104
Find 311839 0.058 2.305
** WriteX 439656 3.854 1872.109
ReadX 1396868 0.005 0.324
LockX 2910 0.004 0.026
UnlockX 2910 0.002 0.025
** Flush 62260 0.750 1659.364
Throughput 461.856 MB/sec (sync open) 32 clients 32 procs max_latency=1872.118 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.10.12-4.el7.x86_64
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02
> [...]
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 5:38 ` Wang Yugui
@ 2021-02-02 11:28 ` Filipe Manana
2021-02-02 13:01 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Filipe Manana @ 2021-02-02 11:28 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-btrfs
On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi, Filipe Manana
>
> The dbench result with these patches is very good. thanks a lot.
>
> This is the dbench(synchronous mode) result , and then a question.
>
> command: dbench -s -t 60 -D /btrfs/ 32
> mount option:ssd,space_cache=v2
> kernel:5.10.12 + patchset 1 + this patchset
patchset 1 and "this patchset" are the same, did you mean two
different patchsets or just a single patchset?
> patchset 1:
> 0001-btrfs-fix-race-causing-unnecessary-inode-logging-dur.patch
> 0002-btrfs-fix-race-that-results-in-logging-old-extents-d.patch
> 0003-btrfs-fix-race-that-causes-unnecessary-logging-of-an.patch
> 0004-btrfs-fix-race-that-makes-inode-logging-fallback-to-.patch
> 0005-btrfs-fix-race-leading-to-unnecessary-transaction-co.patch
> 0006-btrfs-do-not-block-inode-logging-for-so-long-during-.patch
>
> We get two types of result as below, and the result type 1 is not easy
> to reproduce now.
Ok, there are outliers often, that's why multiple tests should be done.
>
> Question:
> for synchronous mode, the result type 1 is perfect?
What do you mean by perfect? You mean if result 1 is better than result 2?
> and there is still some minor place about the flush to do for
> the result type2?
By "minor place" you mean the huge difference I suppose.
>
>
> result type 1:
>
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 868942 0.028 3.017
> Close 638536 0.003 0.061
> Rename 36851 0.663 4.000
> Unlink 175182 0.399 5.358
> Qpathinfo 789014 0.014 1.846
> Qfileinfo 137684 0.002 0.047
> Qfsinfo 144241 0.004 0.059
> Sfileinfo 70913 0.008 0.046
> Find 304554 0.057 1.889
> ** WriteX 429696 3.960 2239.973
> ReadX 1363356 0.005 0.358
> LockX 2836 0.004 0.038
> UnlockX 2836 0.002 0.018
> ** Flush 60771 0.621 6.794
>
> Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
> + stat -f -c %T /btrfs/
> btrfs
> + uname -r
> 5.10.12-4.el7.x86_64
>
>
> result type 2:
> Operation Count AvgLat MaxLat
> ----------------------------------------
> NTCreateX 888943 0.028 2.679
> Close 652765 0.002 0.058
> Rename 37705 0.572 3.962
> Unlink 179713 0.383 3.983
> Qpathinfo 806705 0.014 2.294
> Qfileinfo 140752 0.002 0.125
> Qfsinfo 147909 0.004 0.049
> Sfileinfo 72374 0.008 0.104
> Find 311839 0.058 2.305
> ** WriteX 439656 3.854 1872.109
> ReadX 1396868 0.005 0.324
> LockX 2910 0.004 0.026
> UnlockX 2910 0.002 0.025
> ** Flush 62260 0.750 1659.364
>
> Throughput 461.856 MB/sec (sync open) 32 clients 32 procs max_latency=1872.118 ms
> + stat -f -c %T /btrfs/
> btrfs
> + uname -r
> 5.10.12-4.el7.x86_64
I'm not sure what your question is exactly.
Are both results after applying the same patchset, or are they before
and after applying the patchset, respectively?
If they are both with the patchset applied, and you wonder about the
big variation in the "Flush" operations, I am not sure about why it is
so.
Both throughput and max latency are better in result 2.
It's normal to have variations across dbench runs, I get them too, and
I do several runs (5 or 6) to check things out.
I don't use virtualization (testing on bare metal), I set the cpu
governor mode to performance (instead of the "powersave" default) and
use a non-debug kernel configuration, because otherwise I get
significant variations in latencies and throughput too (though I never
got a huge difference such as from 6.794 to 1659.364).
Thanks.
>
>
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2021/02/02
>
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > Early on during a transaction commit we acquire the tree_log_mutex and
> > hold it until after we write the super blocks. But before writing the
> > extent buffers dirtied by the transaction and the super blocks we unblock
> > the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
> > fs_info->running_transaction to NULL.
> >
> > This means that after that and before writing the super blocks, new
> > transactions can start. However if any transaction wants to log an inode,
> > it will block waiting for the transaction commit to write its dirty
> > extent buffers and the super blocks because the tree_log_mutex is only
> > released after those operations are complete, and starting a new log
> > transaction blocks on that mutex (at start_log_trans()).
> >
> > Writing the dirty extent buffers and the super blocks can take a very
> > significant amount of time to complete, but we could allow the tasks
> > wanting to log an inode to proceed with most of their steps:
> >
> > 1) create the log trees
> > 2) log metadata in the trees
> > 3) write their dirty extent buffers
> >
> > They only need to wait for the previous transaction commit to complete
> > (write its super blocks) before they attempt to write their super blocks,
> > otherwise we could end up with a corrupt filesystem after a crash
> >
> > So change start_log_trans() to use the root tree's log_mutex to serialize
> > for the creation of the log root tree instead of using the tree_log_mutex,
> > and make btrfs_sync_log() acquire the tree_log_mutex before writing the
> > super blocks. This allows for inode logging to wait much less time when
> > there is a previous transaction that is still committing, often not having
> > to wait at all, as by the time when we try to sync the log the previous
> > transaction already wrote its super blocks.
> >
> > This patch belongs to a patch set that is comprised of the following
> > patches:
> >
> > btrfs: fix race causing unnecessary inode logging during link and rename
> > btrfs: fix race that results in logging old extents during a fast fsync
> > btrfs: fix race that causes unnecessary logging of ancestor inodes
> > btrfs: fix race that makes inode logging fallback to transaction commit
> > btrfs: fix race leading to unnecessary transaction commit when logging inode
> > btrfs: do not block inode logging for so long during transaction commit
> >
> > The following script that uses dbench was used to measure the impact of
> > the whole patchset:
> >
> > $ cat test-dbench.sh
> > #!/bin/bash
> >
> > DEV=/dev/nvme0n1
> > MNT=/mnt/btrfs
> > MOUNT_OPTIONS="-o ssd"
> >
> > echo "performance" | \
> > tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
> >
> > mkfs.btrfs -f -m single -d single $DEV
> > mount $MOUNT_OPTIONS $DEV $MNT
> >
> > dbench -D $MNT -t 300 64
> >
> > umount $MNT
> >
> > The test was run on a machine with 12 cores, 64G of ram, using a NVMe
> > device and a non-debug kernel configuration (Debian's default).
> >
> > Before patch set:
> >
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 11277211 0.250 85.340
> > Close 8283172 0.002 6.479
> > Rename 477515 1.935 86.026
> > Unlink 2277936 0.770 87.071
> > Deltree 256 15.732 81.379
> > Mkdir 128 0.003 0.009
> > Qpathinfo 10221180 0.056 44.404
> > Qfileinfo 1789967 0.002 4.066
> > Qfsinfo 1874399 0.003 9.176
> > Sfileinfo 918589 0.061 10.247
> > Find 3951758 0.341 54.040
> > WriteX 5616547 0.047 85.079
> > ReadX 17676028 0.005 9.704
> > LockX 36704 0.003 1.800
> > UnlockX 36704 0.002 0.687
> > Flush 790541 14.115 676.236
> >
> > Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
> >
> > After patch set:
> >
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 12687926 0.171 86.526
> > Close 9320780 0.002 8.063
> > Rename 537253 1.444 78.576
> > Unlink 2561827 0.559 87.228
> > Deltree 374 11.499 73.549
> > Mkdir 187 0.003 0.005
> > Qpathinfo 11500300 0.061 36.801
> > Qfileinfo 2017118 0.002 7.189
> > Qfsinfo 2108641 0.003 4.825
> > Sfileinfo 1033574 0.008 8.065
> > Find 4446553 0.408 47.835
> > WriteX 6335667 0.045 84.388
> > ReadX 19887312 0.003 9.215
> > LockX 41312 0.003 1.394
> > UnlockX 41312 0.002 1.425
> > Flush 889233 13.014 623.259
> >
> > Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
> >
> > +12.7% throughput, -8.2% max latency
> >
> > Signed-off-by: Filipe Manana <fdmanana@suse.com>
> > ---
> > fs/btrfs/ctree.h | 2 +-
> > fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++--------------
> > 2 files changed, 40 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index c0c6e79c43f9..7185384f475a 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -1026,7 +1026,7 @@ enum {
> > BTRFS_ROOT_DEAD_RELOC_TREE,
> > /* Mark dead root stored on device whose cleanup needs to be resumed */
> > BTRFS_ROOT_DEAD_TREE,
> > - /* The root has a log tree. Used only for subvolume roots. */
> > + /* The root has a log tree. Used for subvolume roots and the tree root. */
> > BTRFS_ROOT_HAS_LOG_TREE,
> > /* Qgroup flushing is in progress */
> > BTRFS_ROOT_QGROUP_FLUSHING,
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index bc5b652f4f64..e8b84543d565 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -139,8 +139,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> > struct btrfs_log_ctx *ctx)
> > {
> > struct btrfs_fs_info *fs_info = root->fs_info;
> > + struct btrfs_root *tree_root = fs_info->tree_root;
> > int ret = 0;
> >
> > + /*
> > + * First check if the log root tree was already created. If not, create
> > + * it before locking the root's log_mutex, just to keep lockdep happy.
> > + */
> > + if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) {
> > + mutex_lock(&tree_root->log_mutex);
> > + if (!fs_info->log_root_tree) {
> > + ret = btrfs_init_log_root_tree(trans, fs_info);
> > + if (!ret)
> > + set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state);
> > + }
> > + mutex_unlock(&tree_root->log_mutex);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > mutex_lock(&root->log_mutex);
> >
> > if (root->log_root) {
> > @@ -156,13 +173,6 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
> > set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
> > }
> > } else {
> > - mutex_lock(&fs_info->tree_log_mutex);
> > - if (!fs_info->log_root_tree)
> > - ret = btrfs_init_log_root_tree(trans, fs_info);
> > - mutex_unlock(&fs_info->tree_log_mutex);
> > - if (ret)
> > - goto out;
> > -
> > ret = btrfs_add_log_tree(trans, root);
> > if (ret)
> > goto out;
> > @@ -3022,6 +3032,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > int log_transid = 0;
> > struct btrfs_log_ctx root_log_ctx;
> > struct blk_plug plug;
> > + u64 log_root_start;
> > + u64 log_root_level;
> >
> > mutex_lock(&root->log_mutex);
> > log_transid = ctx->log_transid;
> > @@ -3199,22 +3211,31 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
> > goto out_wake_log_root;
> > }
> >
> > - btrfs_set_super_log_root(fs_info->super_for_commit,
> > - log_root_tree->node->start);
> > - btrfs_set_super_log_root_level(fs_info->super_for_commit,
> > - btrfs_header_level(log_root_tree->node));
> > -
> > + log_root_start = log_root_tree->node->start;
> > + log_root_level = btrfs_header_level(log_root_tree->node);
> > log_root_tree->log_transid++;
> > mutex_unlock(&log_root_tree->log_mutex);
> >
> > /*
> > - * Nobody else is going to jump in and write the ctree
> > - * super here because the log_commit atomic below is protecting
> > - * us. We must be called with a transaction handle pinning
> > - * the running transaction open, so a full commit can't hop
> > - * in and cause problems either.
> > + * Here we are guaranteed that nobody is going to write the superblock
> > + * for the current transaction before us, and that we do not write our
> > + * superblock before the previous transaction finishes its commit and
> > + * writes its superblock, because:
> > + *
> > + * 1) We are holding a handle on the current transaction, so nobody
> > + * can commit it until we release the handle;
> > + *
> > + * 2) Before writing our superblock we acquire the tree_log_mutex, so
> > + * if the previous transaction is still committing, and hasn't yet
> > + * written its superblock, we wait for it to do it, because a
> > + * transaction commit acquires the tree_log_mutex when the commit
> > + * begins and releases it only after writing its superblock.
> > */
> > + mutex_lock(&fs_info->tree_log_mutex);
> > + btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start);
> > + btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level);
> > ret = write_all_supers(fs_info, 1);
> > + mutex_unlock(&fs_info->tree_log_mutex);
> > if (ret) {
> > btrfs_set_log_full_commit(trans);
> > btrfs_abort_transaction(trans, ret);
> > @@ -3299,6 +3320,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans,
> > if (fs_info->log_root_tree) {
> > free_log_tree(trans, fs_info->log_root_tree);
> > fs_info->log_root_tree = NULL;
> > + clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state);
> > }
> > return 0;
> > }
> > --
> > 2.28.0
>
>
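The start_log_trans() change above is a check-lock-recheck pattern: an
unlocked test of the HAS_LOG_TREE bit avoids taking tree_root->log_mutex on
every call, and fs_info->log_root_tree is re-checked under the mutex so only
one task actually creates the tree. A minimal userspace analogue of that
pattern (the names here are illustrative stand-ins, not btrfs API):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;
static void *log_root_tree;        /* stands in for fs_info->log_root_tree */
static atomic_bool has_log_tree;   /* stands in for the HAS_LOG_TREE bit */

/* Unlocked fast path first; re-check the pointer under the mutex so that
 * concurrent callers create the tree exactly once (the kernel uses the
 * atomic test_bit()/set_bit() helpers for the flag). */
static int ensure_log_root_tree(void)
{
        int ret = 0;

        if (!atomic_load(&has_log_tree)) {
                pthread_mutex_lock(&log_mutex);
                if (!log_root_tree) {
                        /* stands in for btrfs_init_log_root_tree() */
                        log_root_tree = calloc(1, 64);
                        if (log_root_tree)
                                atomic_store(&has_log_tree, true);
                        else
                                ret = -1;
                }
                pthread_mutex_unlock(&log_mutex);
        }
        return ret;
}

int main(void)
{
        if (ensure_log_root_tree())
                return 1;
        printf("log root tree ready: %d\n", (int)atomic_load(&has_log_tree));
        return 0;
}

Doing this before taking the subvolume root's log_mutex also keeps the lock
nesting flat, which is the lockdep concern the comment in the hunk refers to.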
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 11:28 ` Filipe Manana
@ 2021-02-02 13:01 ` Wang Yugui
2021-02-02 14:09 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-02 13:01 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
Hi, Filipe Manana
> On Tue, Feb 2, 2021 at 5:42 AM Wang Yugui <wangyugui@e16-tech.com> wrote:
> >
> > Hi, Filipe Manana
> >
> > The dbench result with these patches is very good. Thanks a lot.
> >
> > This is the dbench (synchronous mode) result, and then a question.
> >
> > command: dbench -s -t 60 -D /btrfs/ 32
> > mount option:ssd,space_cache=v2
> > kernel:5.10.12 + patchset 1 + this patchset
>
> patchset 1 and "this patchset" are the same; did you mean two
> different patchsets or just a single patchset?
patchset1:
btrfs: some performance improvements for dbench alike workloads
patchset2:
btrfs: more performance improvements for dbench workloads
https://patchwork.kernel.org/project/linux-btrfs/list/?series=422801
I'm sorry that I replied to the wrong patchset.
>
> >
> > Question:
> > for synchronous mode, the result type 1 is perfect?
>
> What do you mean by perfect? You mean whether result 1 is better than result 2?
In result 1, the Flush MaxLat in dbench synchronous mode is low, as
expected, at the same level as kernel 5.4.91.
But in result 2, the Flush MaxLat is as high as the WriteX level, even
though this is synchronous mode, where most of the work should already
be done before the flush.
> > and there is still some minor place about the flush to do for
> > the result type2?
>
> By "minor place" you mean the huge difference I suppose.
>
> >
> >
> > result type 1:
> >
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 868942 0.028 3.017
> > Close 638536 0.003 0.061
> > Rename 36851 0.663 4.000
> > Unlink 175182 0.399 5.358
> > Qpathinfo 789014 0.014 1.846
> > Qfileinfo 137684 0.002 0.047
> > Qfsinfo 144241 0.004 0.059
> > Sfileinfo 70913 0.008 0.046
> > Find 304554 0.057 1.889
> > ** WriteX 429696 3.960 2239.973
> > ReadX 1363356 0.005 0.358
> > LockX 2836 0.004 0.038
> > UnlockX 2836 0.002 0.018
> > ** Flush 60771 0.621 6.794
> >
> > Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
> > + stat -f -c %T /btrfs/
> > btrfs
> > + uname -r
> > 5.10.12-4.el7.x86_64
> >
> >
> > result type 2:
> > Operation Count AvgLat MaxLat
> > ----------------------------------------
> > NTCreateX 888943 0.028 2.679
> > Close 652765 0.002 0.058
> > Rename 37705 0.572 3.962
> > Unlink 179713 0.383 3.983
> > Qpathinfo 806705 0.014 2.294
> > Qfileinfo 140752 0.002 0.125
> > Qfsinfo 147909 0.004 0.049
> > Sfileinfo 72374 0.008 0.104
> > Find 311839 0.058 2.305
> > ** WriteX 439656 3.854 1872.109
> > ReadX 1396868 0.005 0.324
> > LockX 2910 0.004 0.026
> > UnlockX 2910 0.002 0.025
> > ** Flush 62260 0.750 1659.364
> >
> > Throughput 461.856 MB/sec (sync open) 32 clients 32 procs max_latency=1872.118 ms
> > + stat -f -c %T /btrfs/
> > btrfs
> > + uname -r
> > 5.10.12-4.el7.x86_64
>
> I'm not sure what your question is exactly.
>
> Are both results after applying the same patchset, or are they before
> and after applying the patchset, respectively?
Both results are after applying the same patchset,
and both are from the same server, with the same SAS SSD disk,
but the results are not stable, and the major difference is the Flush MaxLat.
Server:Dell T7610
CPU: E5-2660 v2(10core 20threads) x2
SSD:TOSHIBA PX05SMQ040
Memory:192G (with ECC)
> If they are both with the patchset applied, and you wonder about the
> big variation in the "Flush" operations, I am not sure why that is so.
> Both throughput and max latency are better in result 2.
>
> It's normal to have variations across dbench runs; I get them too, and
> I do several runs (5 or 6) to check things out.
>
> I don't use virtualization (I test on bare metal); I set the cpu
> governor mode to performance (instead of the "powersave" default) and
> use a non-debug kernel configuration, because otherwise I get
> significant variations in latencies and throughput too (though I never
> got a huge difference such as from 6.794 to 1659.364).
This is bare metal (Dell T7610).
The CPU governor is set to performance by the BIOS, and I checked it with
'cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor'.
Maybe it matters that I used a SAS SSD, whose queue depth is 254,
smaller than the 1023 of an NVMe SSD, but that should still be enough
for dbench with 32 threads?
The huge difference in the Flush MaxLat, such as from 6.794 to 1659.364,
is a problem.
It is not easy to reproduce both; sometimes the small one is easy to
reproduce, sometimes the big one.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 13:01 ` Wang Yugui
@ 2021-02-02 14:09 ` Wang Yugui
2021-02-03 10:51 ` Filipe Manana
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-02 14:09 UTC (permalink / raw)
To: Filipe Manana; +Cc: linux-btrfs
Hi, Filipe Manana
Here are some dbench (sync mode) results on the same hardware,
but with different Linux kernels:
4.14.200
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 225281 5.163 82.143
Flush 32161 2.250 62.669
Throughput 236.719 MB/sec (sync open) 32 clients 32 procs max_latency=82.149 ms
4.19.21
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 118842 10.946 116.345
Flush 16506 0.115 44.575
Throughput 125.973 MB/sec (sync open) 32 clients 32 procs max_latency=116.390 ms
4.19.150
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 144509 9.151 117.353
Flush 20563 0.128 52.014
Throughput 153.707 MB/sec (sync open) 32 clients 32 procs max_latency=117.379 ms
5.4.91
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 367033 4.377 1908.724
Flush 52037 0.159 39.871
Throughput 384.554 MB/sec (sync open) 32 clients 32 procs max_latency=1908.968 ms
5.10.12+patches
Operation Count AvgLat MaxLat
----------------------------------------
WriteX 429696 3.960 2239.973
Flush 60771 0.621 6.794
Throughput 452.385 MB/sec (sync open) 32 clients 32 procs max_latency=1963.312 ms
MaxLat / AvgLat of WriteX increased from 82.143/5.163 = 15.9 to
2239.973/3.960 = 565.6.
For QoS, can we have an option to tune things so that the MaxLat / AvgLat
ratio of WriteX stays below 100?
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/02
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-02 14:09 ` Wang Yugui
@ 2021-02-03 10:51 ` Filipe Manana
2021-02-03 11:03 ` Nikolay Borisov
0 siblings, 1 reply; 20+ messages in thread
From: Filipe Manana @ 2021-02-03 10:51 UTC (permalink / raw)
To: Wang Yugui; +Cc: linux-btrfs
On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>
> Hi, Filipe Manana
>
> Here are some dbench (sync mode) results on the same hardware,
> but with different Linux kernels:
>
> 4.14.200
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 225281 5.163 82.143
> Flush 32161 2.250 62.669
> Throughput 236.719 MB/sec (sync open) 32 clients 32 procs max_latency=82.149 ms
>
> 4.19.21
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 118842 10.946 116.345
> Flush 16506 0.115 44.575
> Throughput 125.973 MB/sec (sync open) 32 clients 32 procs max_latency=116.390 ms
>
> 4.19.150
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 144509 9.151 117.353
> Flush 20563 0.128 52.014
> Throughput 153.707 MB/sec (sync open) 32 clients 32 procs max_latency=117.379 ms
>
> 5.4.91
> Operation Count AvgLat MaxLat
> ----------------------------------------
> WriteX 367033 4.377 1908.724
> Flush 52037 0.159 39.871
> Throughput 384.554 MB/sec (sync open) 32 clients 32 procs max_latency=1908.968 ms
Ok, it seems somewhere between 4.19 and 5.4, something made the
latency much worse for you at least.
Is it only when using sync open (O_SYNC, dbench's -s flag)? What about
when not using it?
I'll have to look at it, but it will likely take some time.
Thanks.
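For reference, a minimal sketch of what the -s flag changes, assuming a
btrfs filesystem mounted at /btrfs (the file names are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char buf[] = "hello";

        /* dbench -s: files are opened with O_SYNC, so every write() returns
         * only once the data and required metadata are on stable storage. */
        int fd_sync = open("/btrfs/sync-file", O_CREAT | O_WRONLY | O_SYNC, 0644);
        if (fd_sync >= 0) {
                if (write(fd_sync, buf, strlen(buf)) < 0)
                        perror("sync write");
                close(fd_sync);
        }

        /* Without -s: write() only dirties the page cache, and durability is
         * deferred until an explicit fsync() (dbench's Flush operation). */
        int fd = open("/btrfs/buffered-file", O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) {
                if (write(fd, buf, strlen(buf)) < 0)
                        perror("buffered write");
                if (fsync(fd) < 0)
                        perror("fsync");
                close(fd);
        }
        return 0;
}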
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-03 10:51 ` Filipe Manana
@ 2021-02-03 11:03 ` Nikolay Borisov
2021-02-03 15:16 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2021-02-03 11:03 UTC (permalink / raw)
To: Filipe Manana, Wang Yugui; +Cc: linux-btrfs
On 3.02.21 г. 12:51 ч., Filipe Manana wrote:
> On Tue, Feb 2, 2021 at 2:15 PM Wang Yugui <wangyugui@e16-tech.com> wrote:
>>
>> Hi, Filipe Manana
>>
>> Here are some dbench (sync mode) results on the same hardware,
>> but with different Linux kernels:
>>
>> 4.14.200
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 225281 5.163 82.143
>> Flush 32161 2.250 62.669
>> Throughput 236.719 MB/sec (sync open) 32 clients 32 procs max_latency=82.149 ms
>>
>> 4.19.21
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 118842 10.946 116.345
>> Flush 16506 0.115 44.575
>> Throughput 125.973 MB/sec (sync open) 32 clients 32 procs max_latency=116.390 ms
>>
>> 4.19.150
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 144509 9.151 117.353
>> Flush 20563 0.128 52.014
>> Throughput 153.707 MB/sec (sync open) 32 clients 32 procs max_latency=117.379 ms
>>
>> 5.4.91
>> Operation Count AvgLat MaxLat
>> ----------------------------------------
>> WriteX 367033 4.377 1908.724
>> Flush 52037 0.159 39.871
>> Throughput 384.554 MB/sec (sync open) 32 clients 32 procs max_latency=1908.968 ms
>
> Ok, it seems somewhere between 4.19 and 5.4, something made the
> latency much worse for you at least.
>
> Is it only when using sync open (O_SYNC, dbench's -s flag)? What about
> when not using it?
>
> I'll have to look at it, but it will likely take some time.
This seems like the perf regression I observed starting with kernel 5.0:
preemptive flushing of metadata had been broken for quite some time, but
kernel 5.0 removed a btrfs_end_transaction call from
should_end_transaction, which unmasked the issue.
In particular this should have been fixed by the following commit in
misc-next:
https://github.com/kdave/btrfs-devel/commit/28d7e221e4323a5b98e5d248eb5603ff5206a188
which is part of a larger series of patches. So Wang, in order to test
this hypothesis, can you re-run those tests with the latest misc-next
branch?
<snip>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-03 11:03 ` Nikolay Borisov
@ 2021-02-03 15:16 ` Wang Yugui
2021-02-04 3:17 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-03 15:16 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Filipe Manana, linux-btrfs
Hi,
Here is the dbench (sync open) test result of misc-next (5.11-rc6 + 81
patches), and then some observations.
1, The MaxLat dropped from the 1900ms level to the 1000ms level.
That is a good improvement.
2, It is OK that NTCreateX/Rename/Unlink/WriteX have the same level of
MaxLat, because all of them write something to disk.
3, I'm not sure whether the MaxLat of Flush is OK.
There should be nothing left to write for a Flush in dbench (sync open)
mode, but I'm not sure whether some SCSI command such as SYNCHRONIZE
CACHE is still executed by each Flush; see the sketch below.
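If I read dbench right, each Flush operation boils down to an fsync() on
the file, roughly like this minimal sketch (the path is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/btrfs/some-file", O_RDWR);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Even when the file was written through O_SYNC and no dirty pages
         * remain, this may still make btrfs commit tree-log work and ask the
         * device to flush its volatile write cache. */
        if (fsync(fd) < 0)
                perror("fsync");
        close(fd);
        return 0;
}

So a Flush is not necessarily free: the device-level cache flush
(SYNCHRONIZE CACHE on SCSI, FLUSH on NVMe) can still be issued even when
there is no file data left to write.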
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 924298 0.035 745.523
Close 678499 0.002 0.093
Rename 39280 0.597 744.714
Unlink 187119 0.435 745.547
Qpathinfo 838558 0.013 0.978
Qfileinfo 146308 0.002 0.055
Qfsinfo 154172 0.004 0.108
Sfileinfo 75423 0.010 0.109
Find 324478 0.050 1.115
WriteX 457376 3.665 957.922
ReadX 1454051 0.005 0.131
LockX 3032 0.004 0.027
UnlockX 3032 0.002 0.022
Flush 64867 0.649 718.676
Throughput 481.002 MB/sec (sync open) 32 clients 32 procs max_latency=957.929 ms
+ stat -f -c %T /btrfs/
btrfs
+ uname -r
5.11.0-0.rc6.141.el8.x86_64
detail:
32 33682 548.70 MB/sec execute 1 sec latency 7.024 ms
32 36692 538.41 MB/sec execute 2 sec latency 6.719 ms
32 39787 544.75 MB/sec execute 3 sec latency 6.405 ms
32 42850 540.09 MB/sec execute 4 sec latency 7.717 ms
32 45872 535.64 MB/sec execute 5 sec latency 7.271 ms
32 48861 524.29 MB/sec execute 6 sec latency 6.685 ms
32 51343 512.54 MB/sec execute 7 sec latency 128.666 ms
32 53366 496.73 MB/sec execute 8 sec latency 703.255 ms *1
32 56428 498.60 MB/sec execute 9 sec latency 6.346 ms
32 59471 498.40 MB/sec execute 10 sec latency 9.958 ms
32 62175 495.68 MB/sec execute 11 sec latency 14.270 ms
32 65143 498.90 MB/sec execute 12 sec latency 9.391 ms
32 68116 502.19 MB/sec execute 13 sec latency 5.713 ms
32 71078 501.45 MB/sec execute 14 sec latency 6.235 ms
32 74030 500.43 MB/sec execute 15 sec latency 7.135 ms
32 76908 500.82 MB/sec execute 16 sec latency 7.071 ms
32 79794 499.61 MB/sec execute 17 sec latency 7.556 ms
32 82791 502.30 MB/sec execute 18 sec latency 6.509 ms
32 85676 502.05 MB/sec execute 19 sec latency 6.938 ms
32 86950 486.44 MB/sec execute 20 sec latency 554.015 ms *2
32 89670 487.60 MB/sec execute 21 sec latency 901.490 ms *3
32 92577 487.64 MB/sec execute 22 sec latency 6.715 ms
32 95528 488.03 MB/sec execute 23 sec latency 7.457 ms
32 98507 488.76 MB/sec execute 24 sec latency 7.266 ms
32 101340 488.76 MB/sec execute 25 sec latency 6.699 ms
32 104331 489.94 MB/sec execute 26 sec latency 6.506 ms
32 107166 490.87 MB/sec execute 27 sec latency 6.582 ms
32 110022 490.17 MB/sec execute 28 sec latency 7.072 ms
32 112931 490.34 MB/sec execute 29 sec latency 6.484 ms
32 115658 489.67 MB/sec execute 30 sec latency 6.767 ms
32 118569 490.14 MB/sec execute 31 sec latency 6.825 ms
32 121334 490.34 MB/sec execute 32 sec latency 7.270 ms
32 124182 489.95 MB/sec execute 33 sec latency 6.849 ms
32 127073 490.05 MB/sec execute 34 sec latency 6.934 ms
32 129835 489.56 MB/sec execute 35 sec latency 7.455 ms
32 130952 481.58 MB/sec execute 36 sec latency 635.676 ms *4
32 133736 481.74 MB/sec execute 37 sec latency 957.929 ms *5
32 136646 481.79 MB/sec execute 38 sec latency 7.339 ms
32 139616 483.23 MB/sec execute 39 sec latency 7.199 ms
32 142526 483.62 MB/sec execute 40 sec latency 7.344 ms
32 145429 483.58 MB/sec execute 41 sec latency 6.967 ms
32 148329 484.09 MB/sec execute 42 sec latency 8.043 ms
32 151091 483.89 MB/sec execute 43 sec latency 7.476 ms
32 153913 484.33 MB/sec execute 44 sec latency 7.611 ms
32 156679 484.29 MB/sec execute 45 sec latency 7.612 ms
32 159534 483.90 MB/sec execute 46 sec latency 8.295 ms
32 162328 484.17 MB/sec execute 47 sec latency 6.582 ms
32 165080 483.64 MB/sec execute 48 sec latency 8.939 ms
32 167861 484.12 MB/sec execute 49 sec latency 6.684 ms
32 170616 483.56 MB/sec execute 50 sec latency 7.051 ms
32 173557 483.89 MB/sec execute 51 sec latency 6.923 ms
32 176424 484.52 MB/sec execute 52 sec latency 6.689 ms
32 179255 484.14 MB/sec execute 53 sec latency 7.973 ms
32 181195 481.47 MB/sec execute 54 sec latency 305.495 ms
32 183309 479.62 MB/sec execute 55 sec latency 866.862 ms *6
32 186256 479.82 MB/sec execute 56 sec latency 7.016 ms
32 189209 480.82 MB/sec execute 57 sec latency 6.789 ms
32 192072 480.93 MB/sec execute 58 sec latency 7.305 ms
32 195054 481.00 MB/sec execute 59 sec latency 7.432 ms
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/03
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-03 15:16 ` Wang Yugui
@ 2021-02-04 3:17 ` Wang Yugui
2021-02-04 6:19 ` Nikolay Borisov
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-04 3:17 UTC (permalink / raw)
To: Nikolay Borisov, Filipe Manana; +Cc: linux-btrfs
Hi,
I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
5.10.12, with all other kernel components and the kernel config unchanged.
The dbench (sync open) result is better on both throughput and max_latency.
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 958611 0.026 440.564
Close 703418 0.002 0.085
Rename 40599 0.518 439.825
Unlink 194095 0.282 439.792
Qpathinfo 869295 0.010 1.852
Qfileinfo 151285 0.002 0.095
Qfsinfo 159492 0.004 0.051
Sfileinfo 78078 0.010 0.104
Find 335905 0.036 3.043
WriteX 473353 3.637 713.039
ReadX 1502197 0.005 0.752
LockX 3100 0.004 0.024
UnlockX 3100 0.002 0.020
Flush 67264 0.575 363.550
Throughput 497.551 MB/sec (sync open) 32 clients 32 procs max_latency=713.045 ms
detail:
32 34620 506.62 MB/sec execute 1 sec latency 6.860 ms
32 37878 528.91 MB/sec execute 2 sec latency 6.366 ms
32 41015 541.17 MB/sec execute 3 sec latency 6.326 ms
32 43981 539.10 MB/sec execute 4 sec latency 6.743 ms
32 47111 540.92 MB/sec execute 5 sec latency 7.597 ms
32 50163 544.44 MB/sec execute 6 sec latency 6.984 ms
32 52952 532.23 MB/sec execute 7 sec latency 63.769 ms
32 55561 514.29 MB/sec execute 8 sec latency 482.822 ms
32 58415 515.72 MB/sec execute 9 sec latency 6.238 ms
32 61365 515.94 MB/sec execute 10 sec latency 8.102 ms
32 64309 518.10 MB/sec execute 11 sec latency 6.176 ms
32 67322 517.59 MB/sec execute 12 sec latency 6.690 ms
32 70343 515.28 MB/sec execute 13 sec latency 6.831 ms
32 73323 512.81 MB/sec execute 14 sec latency 7.754 ms
32 76151 513.38 MB/sec execute 15 sec latency 7.275 ms
32 79042 513.44 MB/sec execute 16 sec latency 6.678 ms
32 81995 514.71 MB/sec execute 17 sec latency 8.393 ms
32 84911 512.80 MB/sec execute 18 sec latency 7.574 ms
32 87884 510.68 MB/sec execute 19 sec latency 8.363 ms
32 89711 502.50 MB/sec execute 20 sec latency 695.270 ms *1
32 92628 503.68 MB/sec execute 21 sec latency 7.405 ms
32 95627 503.66 MB/sec execute 22 sec latency 6.545 ms
32 98638 502.90 MB/sec execute 23 sec latency 6.649 ms
32 101641 503.29 MB/sec execute 24 sec latency 7.740 ms
32 104515 503.72 MB/sec execute 25 sec latency 6.941 ms
32 107448 504.28 MB/sec execute 26 sec latency 6.704 ms
32 110321 504.79 MB/sec execute 27 sec latency 6.216 ms
32 113235 503.77 MB/sec execute 28 sec latency 7.366 ms
32 116185 503.18 MB/sec execute 29 sec latency 7.153 ms
32 119074 503.35 MB/sec execute 30 sec latency 7.026 ms
32 121970 503.51 MB/sec execute 31 sec latency 8.789 ms
32 123948 498.76 MB/sec execute 32 sec latency 539.597 ms *2
32 127045 499.14 MB/sec execute 33 sec latency 6.112 ms
32 130059 499.66 MB/sec execute 34 sec latency 7.762 ms
32 133062 500.45 MB/sec execute 35 sec latency 6.706 ms
32 136014 501.03 MB/sec execute 36 sec latency 8.384 ms
32 139018 502.13 MB/sec execute 37 sec latency 6.822 ms
32 142047 502.50 MB/sec execute 38 sec latency 6.924 ms
32 144935 501.74 MB/sec execute 39 sec latency 7.319 ms
32 147836 501.38 MB/sec execute 40 sec latency 6.958 ms
32 150677 501.36 MB/sec execute 41 sec latency 6.993 ms
32 153562 501.63 MB/sec execute 42 sec latency 7.126 ms
32 156385 501.63 MB/sec execute 43 sec latency 7.602 ms
32 159258 500.99 MB/sec execute 44 sec latency 7.281 ms
32 162248 501.05 MB/sec execute 45 sec latency 6.896 ms
32 165128 501.12 MB/sec execute 46 sec latency 7.134 ms
32 168070 501.39 MB/sec execute 47 sec latency 6.674 ms
32 169978 497.76 MB/sec execute 48 sec latency 355.037 ms
32 172936 497.92 MB/sec execute 49 sec latency 713.045 ms *3
32 175828 497.93 MB/sec execute 50 sec latency 6.585 ms
32 178731 498.06 MB/sec execute 51 sec latency 8.511 ms
32 181681 498.79 MB/sec execute 52 sec latency 7.093 ms
32 184622 498.83 MB/sec execute 53 sec latency 6.265 ms
32 187647 498.90 MB/sec execute 54 sec latency 6.902 ms
32 190125 497.59 MB/sec execute 55 sec latency 142.628 ms
32 192920 497.36 MB/sec execute 56 sec latency 432.224 ms
32 195822 497.85 MB/sec execute 57 sec latency 6.497 ms
32 198709 497.56 MB/sec execute 58 sec latency 6.418 ms
32 201665 497.55 MB/sec execute 59 sec latency 6.699 ms
Because some kernel interfaces changed between 5.10 and 5.11, we need
some modifications to btrfs misc-next (5.11-rc6 + 81 patches) to run it on
Linux LTS 5.10.12:
# drop 0001-block-add-a-bdev_kobj-helper.patch
# drop 0001-block-remove-i_bdev.patch
# add 0001-btrfs-bdev_nr_sectors.patch
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 3:17 ` Wang Yugui
@ 2021-02-04 6:19 ` Nikolay Borisov
2021-02-04 11:34 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2021-02-04 6:19 UTC (permalink / raw)
To: Wang Yugui, Filipe Manana; +Cc: linux-btrfs
On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> Hi,
>
> I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
> 5.10.12, with all other kernel components and the kernel config unchanged.
>
> The dbench (sync open) result is better on both throughput and max_latency.
>
If I understand correctly, you rebased current misc-next onto 5.10.12. If
so, this means something else in the main kernel, not btrfs related,
degrades performance. It seems you got a 300ms win by running on 5.10 as
compared to the 5.11-rc6-based misc-next, is that right?
<snip>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 6:19 ` Nikolay Borisov
@ 2021-02-04 11:34 ` Wang Yugui
2021-02-04 11:58 ` Nikolay Borisov
0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2021-02-04 11:34 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Filipe Manana, linux-btrfs
Hi,
> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> > Hi,
> >
> > I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
> > 5.10.12, with all other kernel components and the kernel config unchanged.
> >
> > The dbench (sync open) result is better on both throughput and max_latency.
> >
>
> If I understand correctly, you rebased current misc-next onto 5.10.12. If
> so, this means something else in the main kernel, not btrfs related,
> degrades performance. It seems you got a 300ms win by running on 5.10 as
> compared to the 5.11-rc6-based misc-next, is that right?
Yes.
Maybe some code other than btrfs in the mainline 5.11-rc6 kernel degrades
performance, or maybe it is just because of a different kernel config.
The kernel configs I used:
https://kojipkgs.fedoraproject.org/packages/kernel/5.11.0/0.rc6.141.eln108/
https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/
I rebased current misc-next onto 5.10.12, so that the only diff is in the
btrfs source code.
Only 3 minor patches were needed for this rebase; there seems to be no
breaking kernel API change for btrfs between 5.10 and 5.11:
# add-to-5.10 0001-block-add-a-bdev_kobj-helper.patch
# drop-from-btrfs-misc-next 0001-block-remove-i_bdev.patch
# fix-to-btrfs-misc-next 0001-btrfs-bdev_nr_sectors.patch
More patches came into misc-next today; they are not yet rebased/tested.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 11:34 ` Wang Yugui
@ 2021-02-04 11:58 ` Nikolay Borisov
2021-02-05 1:47 ` Wang Yugui
0 siblings, 1 reply; 20+ messages in thread
From: Nikolay Borisov @ 2021-02-04 11:58 UTC (permalink / raw)
To: Wang Yugui; +Cc: Filipe Manana, linux-btrfs
On 4.02.21 г. 13:34 ч., Wang Yugui wrote:
> Hi,
>
>> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
>>> Hi,
>>>
>>> I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
>>> 5.10.12, with all other kernel components and the kernel config unchanged.
>>>
>>> The dbench (sync open) result is better on both throughput and max_latency.
>>>
>>
>> If I understand correctly, you rebased current misc-next onto 5.10.12. If
>> so, this means something else in the main kernel, not btrfs related,
>> degrades performance. It seems you got a 300ms win by running on 5.10 as
>> compared to the 5.11-rc6-based misc-next, is that right?
>
> Yes.
I just realized this could also be caused by btrfs code that has already
landed in v5.11-rc1, i.e. the main pull request for this release.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit
2021-02-04 11:58 ` Nikolay Borisov
@ 2021-02-05 1:47 ` Wang Yugui
0 siblings, 0 replies; 20+ messages in thread
From: Wang Yugui @ 2021-02-05 1:47 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Filipe Manana, linux-btrfs
Hi,
> On 4.02.21 г. 13:34 ч., Wang Yugui wrote:
> > Hi,
> >
> >> On 4.02.21 г. 5:17 ч., Wang Yugui wrote:
> >>> Hi,
> >>>
> >>> I tried to run btrfs misc-next (5.11-rc6 + 81 patches) on top of Linux LTS
> >>> 5.10.12, with all other kernel components and the kernel config unchanged.
> >>>
> >>> The dbench (sync open) result is better on both throughput and max_latency.
> >>>
> >>
> >> If I understand correctly, you rebased current misc-next onto 5.10.12. If
> >> so, this means something else in the main kernel, not btrfs related,
> >> degrades performance. It seems you got a 300ms win by running on 5.10 as
> >> compared to the 5.11-rc6-based misc-next, is that right?
> >
> > Yes.
>
> I just realized this could also be caused by btrfs code that has already
> landed in v5.11-rc1, i.e. the main pull request for this release.
This performance problem seems very complex.
Here is the result of btrfs 5.10.12 + these btrfs patchsets + the rest of
kernel 5.10.12:
- 5.11 free-space-tree 13 patch
- 5.11 btrfs: some performance improvements for dbench alike workloads
- misc-next btrfs: more performance improvements for dbench workloads
- misc-next Improve preemptive ENOSPC flushing
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 858500 0.030 2.969
Close 630548 0.002 0.071
Rename 36416 0.744 3.791
Unlink 173375 0.503 4.109
Qpathinfo 779355 0.014 2.715
Qfileinfo 136041 0.002 0.054
Qfsinfo 142728 0.004 0.084
Sfileinfo 69953 0.010 0.098
Find 301093 0.059 2.670
WriteX 424594 3.999 1765.527
ReadX 1348726 0.005 0.397
LockX 2816 0.004 0.022
UnlockX 2816 0.003 0.025
Flush 60094 0.764 1478.133
Throughput 445.83 MB/sec (sync open) 32 clients 32 procs max_latency=1765.539 ms
This is still worse than btrfs misc-next (20210203) + the rest of kernel 5.10.12:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     958611     0.026   440.564
 Close         703418     0.002     0.085
 Rename         40599     0.518   439.825
 Unlink        194095     0.282   439.792
 Qpathinfo     869295     0.010     1.852
 Qfileinfo     151285     0.002     0.095
 Qfsinfo       159492     0.004     0.051
 Sfileinfo      78078     0.010     0.104
 Find          335905     0.036     3.043
 WriteX        473353     3.637   713.039
 ReadX        1502197     0.005     0.752
 LockX           3100     0.004     0.024
 UnlockX         3100     0.002     0.020
 Flush          67264     0.575   363.550
Throughput 497.551 MB/sec (sync open) 32 clients 32 procs max_latency=713.045 ms
So some other patch in misc-next improved this performance too.
But the WriteX MaxLat/AvgLat ratio with btrfs misc-next (20210203) on the
5.10.12 kernel is still very high (713.039 / 3.637, roughly 196), so some
patch merged after 5.10 may have introduced a new performance problem too.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/02/04
> >
> > Maybe some code other than btrfs in the mainline 5.11-rc6 kernel degrades
> > performance, or maybe it is just due to a different kernel config.
> > The kernel configs I used:
> > kernel config I used:
> > https://kojipkgs.fedoraproject.org/packages/kernel/5.11.0/0.rc6.141.eln108/
> > https://kojipkgs.fedoraproject.org/packages/kernel/5.10.12/200.fc33/
> >
> > I rebased current misc-next onto 5.10.12, so that the only difference is
> > in the btrfs source code.
> >
> > Only 3 minor patches were needed for this rebase; there seems to be no
> > breaking kernel API change for btrfs between 5.10 and 5.11:
> > # add-to-5.10 0001-block-add-a-bdev_kobj-helper.patch
> > # drop-from-btrfs-misc-next 0001-block-remove-i_bdev.patch
> > # fix-to-btrfs-misc-next 0001-btrfs-bdev_nr_sectors.patch
> >
> > More patches came into misc-next today; they are not yet rebased/tested.
> >
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2021/02/04
> >
* Re: [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads
2020-11-25 12:19 [PATCH 0/6] btrfs: some performance improvements for dbench alike workloads fdmanana
` (5 preceding siblings ...)
2020-11-25 12:19 ` [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit fdmanana
@ 2020-11-30 17:30 ` David Sterba
6 siblings, 0 replies; 20+ messages in thread
From: David Sterba @ 2020-11-30 17:30 UTC (permalink / raw)
To: fdmanana; +Cc: linux-btrfs
On Wed, Nov 25, 2020 at 12:19:22PM +0000, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
>
> This patchset mostly fixes some races that result in an fsync falling back
> to a transaction commit unnecessarily or logging more metadata than it needs
> to. It also addresses link and rename operations logging inodes that were
> fsynced in the previous transaction but do not need to be logged in
> the current transaction. All these cases can be triggered often by dbench.
>
> Finally it avoids blocking an fsync for a significant time window when the
> previous transaction is still committing. All these together resulted in
> about +12% throughput with dbench and -8% maximum latency, as mentioned in
> the changelog of the last patch.
>
> There will be more changes to improve performance with dbench or similar
> workloads, but these have been cooking and under testing for about one week
> and are independent of what's coming next.
Looking forward to more perf improvements, much appreciated.
> Filipe Manana (6):
> btrfs: fix race causing unnecessary inode logging during link and rename
> btrfs: fix race that results in logging old extents during a fast fsync
> btrfs: fix race that causes unnecessary logging of ancestor inodes
> btrfs: fix race that makes inode logging fallback to transaction commit
> btrfs: fix race leading to unnecessary transaction commit when logging inode
> btrfs: do not block inode logging for so long during transaction commit
Added to misc-next, thanks.