stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Lu, Davina" <davinalu@amazon.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"adilger.kernel@dilger.ca" <adilger.kernel@dilger.ca>,
	"regressions@lists.linux.dev" <regressions@lists.linux.dev>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com>,
	hazem ahmed mohamed <hazem.ahmed.abuelfotoh@gmail.com>,
	"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
	"Kiselev, Oleg" <okiselev@amazon.com>,
	"Liu, Frank" <franklmz@amazon.com>
Subject: RE: significant drop  fio IOPS performance on v5.10
Date: Wed, 9 Nov 2022 01:02:35 +0000	[thread overview]
Message-ID: <53153bdf0cce4675b09bc2ee6483409f@amazon.com> (raw)
In-Reply-To: <YzTMZ26AfioIbl27@mit.edu>

[-- Attachment #1: Type: text/plain, Size: 6392 bytes --]

Hi Ted,

For adding the lock per inode per CPU, I didn't use the "inode->i_data_sem" since this already been added to protect at ext4_map_blocks(). This will been called eventually by ext4-rsv-conversion workqueue. So not sure if we still need to another semaphore for per inode protection?

So I just simply add a new rw_semaphore under ext4_inode_info, also change the maximum queue size won't be more than supported CPU number. I tested fio and xfstest-dev ext4 cases and all good here.

Not sure your opinions about the protection, the patch is below and I also put into attachment.

Thanks
Davina

From 4906bba1e9ce127fb8a92cf709c3948ac92c617c Mon Sep 17 00:00:00 2001
From: davinalu <davinalu@amazon.com>
Date: Wed, 9 Nov 2022 00:44:25 +0000
Subject: [PATCH] ext4:allow_multi_rsv_conversion_queue

---
 fs/ext4/ext4.h    | 1 +
 fs/ext4/extents.c | 4 +++-
 fs/ext4/super.c   | 5 ++++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3bf9a6926798..4414750b0134 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1166,6 +1166,7 @@ struct ext4_inode_info {
        atomic_t i_unwritten; /* Nr. of inflight conversions pending */

        spinlock_t i_block_reservation_lock;
+       struct rw_semaphore i_rsv_unwritten_sem;

        /*
         * Transactions that contain inode's metadata needed to complete
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5235974126bd..806393da92cf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4808,8 +4808,9 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
                                break;
                        }
                }
+               down_write(&EXT4_I(inode)->i_rsv_unwritten_sem);
                ret = ext4_map_blocks(handle, inode, &map,
-                                     EXT4_GET_BLOCKS_IO_CONVERT_EXT);
+                               EXT4_GET_BLOCKS_IO_CONVERT_EXT);
                if (ret <= 0)
                        ext4_warning(inode->i_sb,
                                     "inode #%lu: block %u: len %u: "
@@ -4817,6 +4818,7 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
                                     inode->i_ino, map.m_lblk,
                                     map.m_len, ret);
                ret2 = ext4_mark_inode_dirty(handle, inode);
+               up_write(&EXT4_I(inode)->i_rsv_unwritten_sem);
                if (credits) {
                        ret3 = ext4_journal_stop(handle);
                        if (unlikely(ret3))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 091db733834e..b3c7544798b8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1399,6 +1399,7 @@ static void init_once(void *foo)
        INIT_LIST_HEAD(&ei->i_orphan);
        init_rwsem(&ei->xattr_sem);
        init_rwsem(&ei->i_data_sem);
+       init_rwsem(&ei->i_rsv_unwritten_sem);
        inode_init_once(&ei->vfs_inode);
        ext4_fc_init_inode(&ei->vfs_inode);
 }
@@ -5212,7 +5213,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
         * concurrency isn't really necessary.  Limit it to 1.
         */
        EXT4_SB(sb)->rsv_conversion_wq =
-               alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+               alloc_workqueue("ext4-rsv-conversion",
+                               WQ_MEM_RECLAIM | WQ_UNBOUND | __WQ_ORDERED,
+                               num_active_cpus() > 1 ? num_active_cpus() : 1);
        if (!EXT4_SB(sb)->rsv_conversion_wq) {
                printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
                ret = -ENOMEM;
--
2.37.1

-----Original Message-----
From: Theodore Ts'o <tytso@mit.edu> 
Sent: Thursday, September 29, 2022 8:36 AM
To: Lu, Davina <davinalu@amazon.com>
Cc: linux-ext4@vger.kernel.org; adilger.kernel@dilger.ca; regressions@lists.linux.dev; stable@vger.kernel.org; Mohamed Abuelfotoh, Hazem <abuehaze@amazon.com>; hazem ahmed mohamed <hazem.ahmed.abuelfotoh@gmail.com>; Ritesh Harjani (IBM) <ritesh.list@gmail.com>; Kiselev, Oleg <okiselev@amazon.com>; Liu, Frank <franklmz@amazon.com>
Subject: RE: [EXTERNAL]significant drop fio IOPS performance on v5.10

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



On Wed, Sep 28, 2022 at 06:07:34AM +0000, Lu, Davina wrote:
>
> Hello,
>
> My analyzing is that the degradation is introduce by commit 
> {244adf6426ee31a83f397b700d964cff12a247d3} and the issue is the 
> contention on rsv_conversion_wq.  The simplest option is to increase 
> the journal size, but that introduces more operational complexity.
> Another option is to add the following change in attachment "allow 
> more ext4-rsv-conversion workqueue.patch"

Hi, thanks for the patch.  However, I don't believe as written it is safe.  That's because your patch will allow multiple CPU's to run ext4_flush_completed_IO(), and this function is not set up to be safe to be run concurrently.  That is, I don't see the necessary locking if we have two CPU's trying to convert unwritten extents on the same inode.

It could be made safe, and certainly if the problem is that you are worried about contention across multiple inodes being written to by different FIO jobs, then certainly this could be a potential contention issue.

However, what doesn't make sense is that increasing the journal size also seems to fix the issue for you.  That implies the problem is one of the journal being to small, and so you are running into an issue of stalls caused by the need to do a synchronous checkpoint to clear space in the journal.  That is a different problem than one of there being a contention problem with rsv_conversion_wq.

So I want to make sure we understand what you are seeing before we make such a change.  One potential concern is that will cause a large number of additional kernel threads.  Now, if these extra kernel threads are doing useful work, then that's not necessarily an issue.
But if not, then it's going to be burning a large amount of kernel memory (especially for a system with a large number of CPU cores).

Regards,

                                        - Ted

[-- Attachment #2: 0001-ext4-allow_multi_rsv_conversion_queue.patch --]
[-- Type: application/octet-stream, Size: 2544 bytes --]

From 4906bba1e9ce127fb8a92cf709c3948ac92c617c Mon Sep 17 00:00:00 2001
From: davinalu <davinalu@amazon.com>
Date: Wed, 9 Nov 2022 00:44:25 +0000
Subject: [PATCH] ext4:allow_multi_rsv_conversion_queue

---
 fs/ext4/ext4.h    | 1 +
 fs/ext4/extents.c | 4 +++-
 fs/ext4/super.c   | 5 ++++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3bf9a6926798..4414750b0134 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1166,6 +1166,7 @@ struct ext4_inode_info {
 	atomic_t i_unwritten; /* Nr. of inflight conversions pending */
 
 	spinlock_t i_block_reservation_lock;
+	struct rw_semaphore i_rsv_unwritten_sem;
 
 	/*
 	 * Transactions that contain inode's metadata needed to complete
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5235974126bd..806393da92cf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4808,8 +4808,9 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
 				break;
 			}
 		}
+		down_write(&EXT4_I(inode)->i_rsv_unwritten_sem);
 		ret = ext4_map_blocks(handle, inode, &map,
-				      EXT4_GET_BLOCKS_IO_CONVERT_EXT);
+				EXT4_GET_BLOCKS_IO_CONVERT_EXT);
 		if (ret <= 0)
 			ext4_warning(inode->i_sb,
 				     "inode #%lu: block %u: len %u: "
@@ -4817,6 +4818,7 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
 				     inode->i_ino, map.m_lblk,
 				     map.m_len, ret);
 		ret2 = ext4_mark_inode_dirty(handle, inode);
+		up_write(&EXT4_I(inode)->i_rsv_unwritten_sem);
 		if (credits) {
 			ret3 = ext4_journal_stop(handle);
 			if (unlikely(ret3))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 091db733834e..b3c7544798b8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1399,6 +1399,7 @@ static void init_once(void *foo)
 	INIT_LIST_HEAD(&ei->i_orphan);
 	init_rwsem(&ei->xattr_sem);
 	init_rwsem(&ei->i_data_sem);
+	init_rwsem(&ei->i_rsv_unwritten_sem);
 	inode_init_once(&ei->vfs_inode);
 	ext4_fc_init_inode(&ei->vfs_inode);
 }
@@ -5212,7 +5213,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	 * concurrency isn't really necessary.  Limit it to 1.
 	 */
 	EXT4_SB(sb)->rsv_conversion_wq =
-		alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+		alloc_workqueue("ext4-rsv-conversion",
+				WQ_MEM_RECLAIM | WQ_UNBOUND | __WQ_ORDERED,
+				num_active_cpus() > 1 ? num_active_cpus() : 1);
 	if (!EXT4_SB(sb)->rsv_conversion_wq) {
 		printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
 		ret = -ENOMEM;
-- 
2.37.1


  parent reply	other threads:[~2022-11-09  1:02 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <357ace228adf4e859df5e9f3f4f18b49@amazon.com>
     [not found] ` <1cdc68e6a98d4e93a95be5d887bcc75d@amazon.com>
2022-09-28  6:07   ` significant drop fio IOPS performance on v5.10 Lu, Davina
2022-09-28 22:36     ` Theodore Ts'o
2022-10-05  7:24       ` Lu, Davina
2022-11-09  1:02       ` Lu, Davina [this message]
2022-09-29  7:03     ` significant drop fio IOPS performance on v5.10 #forregzbot Thorsten Leemhuis
2022-09-29 11:36       ` Theodore Ts'o

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=53153bdf0cce4675b09bc2ee6483409f@amazon.com \
    --to=davinalu@amazon.com \
    --cc=abuehaze@amazon.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=franklmz@amazon.com \
    --cc=hazem.ahmed.abuelfotoh@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=okiselev@amazon.com \
    --cc=regressions@lists.linux.dev \
    --cc=ritesh.list@gmail.com \
    --cc=stable@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).