RE: significant drop fio IOPS performance on v5.10

From: "Lu, Davina" <davinalu@amazon.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	"adilger.kernel@dilger.ca" <adilger.kernel@dilger.ca>,
	"regressions@lists.linux.dev" <regressions@lists.linux.dev>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com>,
	hazem ahmed mohamed <hazem.ahmed.abuelfotoh@gmail.com>,
	"Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
	"Kiselev, Oleg" <okiselev@amazon.com>,
	"Liu, Frank" <franklmz@amazon.com>
Subject: RE: significant drop  fio IOPS performance on v5.10
Date: Wed, 5 Oct 2022 07:24:21 +0000	[thread overview]
Message-ID: <b7aaf155f80f4605aff2ccbc4067d9a6@amazon.com> (raw)
In-Reply-To: <YzTMZ26AfioIbl27@mit.edu>

Hi Ted,

Thanks for your reply! Sorry for the late reply since I was busy on another urgent EXT4 issue recently.

Understood the concern about the patch here-- the patch is just a simple start and good suggestions on to make it safe update in multiple CPUs.
1.  Max thread number could not really need 256. From my side, ~30 could be good enough and this only apply with cpu_num >= some_threads (E.g. 16), we can do this in a sysctl/mount option etc...
2. Definitely agree that we need to make safe way and thanks for clear explanation.  A quick question, so if we make sure converting unwritten extents on same inode  is in "atomic" way, then it should be a good patch to start with?

About the larger the journal size -- which can help this IOPS issue, I found is, with small journal size with bad IOPS, journal transaction are more frequently (e.g. from 450 to 700 in a period of time) and each transaction has very less handlers (e.g from 1000 to 70).
Tracing the code, I found: the reserved block exists in each journal update in EXT4 which requires more blocks, then in 
add_transaction_credits()
if (jbd2_log_space_left(journal) < journal->j_max_transaction_buffers) { //since journal size is too small, it has less change to goes into this if to start an journal update.
            if (jbd2_log_space_left(journal) <
                                                journal->j_max_transaction_buffers) 
                   __jbd2_log_wait_for_space(journal); // good to trigger commit-> 
                   jbd2_log_do_checkpoint() ->  jbd2_log_start_commit() ->
                    __jbd2_log_start_commit  -> journal->j_commit_request = target;

So most of time, in this func, it has to wait for journal commit... The slower journal update could affect the IO performance from my understanding (fio test is like: one pwrite64 and then followed one fsync). Therefore if we larger the journal size from 128M to 2GB, mostly the "if" condition can go into to ease this problem.

Thanks
Davina

-----Original Message-----
From: Theodore Ts'o <tytso@mit.edu> 
Sent: Thursday, September 29, 2022 8:36 AM
To: Lu, Davina <davinalu@amazon.com>
Cc: linux-ext4@vger.kernel.org; adilger.kernel@dilger.ca; regressions@lists.linux.dev; stable@vger.kernel.org; Mohamed Abuelfotoh, Hazem <abuehaze@amazon.com>; hazem ahmed mohamed <hazem.ahmed.abuelfotoh@gmail.com>; Ritesh Harjani (IBM) <ritesh.list@gmail.com>; Kiselev, Oleg <okiselev@amazon.com>; Liu, Frank <franklmz@amazon.com>
Subject: RE: [EXTERNAL]significant drop fio IOPS performance on v5.10

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

On Wed, Sep 28, 2022 at 06:07:34AM +0000, Lu, Davina wrote:
>
> Hello,
>
> My analyzing is that the degradation is introduce by commit 
> {244adf6426ee31a83f397b700d964cff12a247d3} and the issue is the 
> contention on rsv_conversion_wq.  The simplest option is to increase 
> the journal size, but that introduces more operational complexity.
> Another option is to add the following change in attachment "allow 
> more ext4-rsv-conversion workqueue.patch"

Hi, thanks for the patch.  However, I don't believe as written it is safe.  That's because your patch will allow multiple CPU's to run ext4_flush_completed_IO(), and this function is not set up to be safe to be run concurrently.  That is, I don't see the necessary locking if we have two CPU's trying to convert unwritten extents on the same inode.

It could be made safe, and certainly if the problem is that you are worried about contention across multiple inodes being written to by different FIO jobs, then certainly this could be a potential contention issue.

However, what doesn't make sense is that increasing the journal size also seems to fix the issue for you.  That implies the problem is one of the journal being to small, and so you are running into an issue of stalls caused by the need to do a synchronous checkpoint to clear space in the journal.  That is a different problem than one of there being a contention problem with rsv_conversion_wq.

So I want to make sure we understand what you are seeing before we make such a change.  One potential concern is that will cause a large number of additional kernel threads.  Now, if these extra kernel threads are doing useful work, then that's not necessarily an issue.
But if not, then it's going to be burning a large amount of kernel memory (especially for a system with a large number of CPU cores).

Regards,

                                        - Ted