linux-kernel.vger.kernel.org archive mirror
From: "Bhatnagar, Rishabh" <risbhat@amazon.com>
To: Jan Kara <jack@suse.cz>
Cc: <tytso@mit.edu>, <adilger.kernel@dilger.ca>,
	<linux-ext4@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<abuehaze@amazon.com>
Subject: Re: EXT4 IOPS degradation in 5.10 compared to 5.4
Date: Fri, 13 Jan 2023 14:13:51 -0800	[thread overview]
Message-ID: <1cfef086-b3c1-6607-9328-b1bf70896ce4@amazon.com> (raw)
In-Reply-To: <20230112113820.hjwvieq3ucbwreql@quack3>


On 1/12/23 3:38 AM, Jan Kara wrote:
> Hi!
>
> On Wed 11-01-23 18:06:39, Bhatnagar, Rishabh wrote:
>> We have been seeing a consistent 3% degradation in IOPS score between 5.4
>> and 5.10 stable kernels while running fio tests.
>>
>> I'm running the test case on m6g.8xlarge AWS instances (arm64). The test
>> involves:
>>
>> 1. Creating a 100GB io1 volume with 500 provisioned IOPS and attaching it
>> to the instance.
>>
>> 2. Setting up and mounting the fs:
>>
>> mke2fs -m 1 -t ext4 -b 4096 -L /mnt /dev/nvme1n1
>> mount -t ext4 -o noatime,nodiratime,data=ordered /dev/nvme1n1 /mnt
>>
>> 3. Installing the fio package and running the following test
>> (16 threads doing random buffered 16KB writes to a file;
>> ioengine=psync, runtime=60 secs):
>>
>> jobs=16
>> blocksize="16k"
>> filesize=1000000
>>
>> if [[ -n $1 ]]; then jobs=$1; fi
>> if [[ -n $2 ]]; then blocksize=$2; fi
>>
>> /usr/bin/fio --name=fio-test --directory=/mnt --rw=randwrite \
>> --ioengine=psync --buffered=1 --bs=${blocksize} \
>>          --max-jobs=${jobs} --numjobs=${jobs} --runtime=30 --thread \
>>          --filename=file0 --filesize=${filesize} \
>>          --fsync=1 --group_reporting --create_only=1 > /dev/null
>>
>> echo 1 | sudo tee /proc/sys/vm/drop_caches > /dev/null
>>
>> set -x
>> echo "Running with jobs=${jobs} filesize=${filesize} blocksize=${blocksize}"
>> /usr/bin/fio --name=fio-test --directory=/mnt --rw=randwrite \
>> --ioengine=psync --buffered=1 --bs=${blocksize} \
>>          --max-jobs=${jobs} --numjobs=${jobs} --runtime=60 --thread \
>>          --filename=file0 --filesize=${filesize} \
>>          --fsync=1 --group_reporting --time_based
>>
>> After doing some kernel bisection we were able to pinpoint the commit
>> that drops the IOPS score by 10-15 points (~3%):
>> ext4: avoid unnecessary transaction starts during writeback
>> (6b8ed62008a49751fc71fefd2a4f89202a7c2d4d)
>>
>> We see higher IOPS/bandwidth/total IO after reverting the commit compared
>> to the base 5.10 kernel.
>> Although the average clat is higher after reverting the commit, the higher
>> bandwidth drives the IOPS score up, as seen in the fio output below.
> I expect the difference is somewhere in waiting for the journal. Can you
> just check whether there's a difference if you use --fdatasync=1 instead of
> --fsync=1? With this workload that should avoid waiting for the journal
> because the only metadata updates are mtime timestamps in the inode.
There is a difference of 5% with and without the commit if I change this to
fdatasync=1.
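
For reference, only the sync flag changes in the run step; a sketch of the
modified invocation (same script variables as in the test above):

    # fdatasync(2) does not flush mtime-only metadata updates, so this run
    # should avoid waiting on the journal commit for timestamp updates.
    /usr/bin/fio --name=fio-test --directory=/mnt --rw=randwrite \
            --ioengine=psync --buffered=1 --bs=${blocksize} \
            --max-jobs=${jobs} --numjobs=${jobs} --runtime=60 --thread \
            --filename=file0 --filesize=${filesize} \
            --fdatasync=1 --group_reporting --time_based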
>> Fio output (5.10.162):
>> write: io=431280KB, bw=7186.3KB/s, iops=449, runt= 60015msec
>> clat (usec): min=6, max=25942, avg=267.76, stdev=1604.25
>> lat (usec): min=6, max=25943, avg=267.93, stdev=1604.25
>> clat percentiles (usec):
>> | 1.00th=[ 9], 5.00th=[ 10], 10.00th=[ 16], 20.00th=[ 24]
>> | 30.00th=[ 34], 40.00th=[ 45], 50.00th=[ 58], 60.00th=[ 70],
>> | 70.00th=[ 81], 80.00th=[ 94], 90.00th=[ 107], 95.00th=[ 114],
>> | 99.00th=[10048], 99.50th=[14016], 99.90th=[20096], 99.95th=[21888],
>> | 99.99th=[24448]
>> lat (usec) : 10=3.46%, 20=12.54%, 50=26.66%, 100=41.16%, 250=13.64%
>> lat (usec) : 500=0.02%, 750=0.03%, 1000=0.01%
>> lat (msec) : 2=0.23%, 4=0.50%, 10=0.73%, 20=0.91%, 50=0.12%
>> cpu : usr=0.02%, sys=0.42%, ctx=299540, majf=0, minf=0
>> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued : total=r=0/w=26955/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=1
>> Run status group 0 (all jobs):
>> WRITE: io=431280KB, aggrb=7186KB/s, minb=7186KB/s, maxb=7186KB/s,
>> mint=60015msec, maxt=60015msec
>> Disk stats (read/write):
>> nvme1n1: ios=0/30627, merge=0/2125, ticks=0/410990, in_queue=410990,
>> util=99.94%
>>
>> Fio output (5.10.162 with revert):
>> write: io=441920KB, bw=7363.7KB/s, iops=460, runt= 60014msec
>> clat (usec): min=6, max=35768, avg=289.09, stdev=1736.62
>> lat (usec): min=6, max=35768, avg=289.28, stdev=1736.62
>> clat percentiles (usec):
>> | 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 16], 20.00th=[ 24],
>> | 30.00th=[ 36], 40.00th=[ 46], 50.00th=[ 59], 60.00th=[ 71],
>> | 70.00th=[ 83], 80.00th=[ 97], 90.00th=[ 110], 95.00th=[ 117],
>> | 99.00th=[10048], 99.50th=[14144], 99.90th=[21632], 99.95th=[25984],
>> | 99.99th=[28288]
>> lat (usec) : 10=4.13%, 20=11.67%, 50=26.59%, 100=39.57%, 250=15.28%
>> lat (usec) : 500=0.03%, 750=0.03%, 1000=0.03%
>> lat (msec) : 2=0.20%, 4=0.64%, 10=0.80%, 20=0.86%, 50=0.18%
>> cpu : usr=0.01%, sys=0.43%, ctx=313909, majf=0, minf=0
>> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>> issued : total=r=0/w=27620/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>> latency : target=0, window=0, percentile=100.00%, depth=1
>> Run status group 0 (all jobs):
>> WRITE: io=441920KB, aggrb=7363KB/s, minb=7363KB/s, maxb=7363KB/s,
>> mint=60014msec, maxt=60014msec
>> Disk stats (read/write):
>> nvme1n1: ios=0/31549, merge=0/2348, ticks=0/409221, in_queue=409221,
>> util=99.88%
>>
>>
>> Also I looked at ext4_writepages latency, which increases when the commit
>> is reverted. (This makes sense since the commit avoids unnecessary
>> transactions.)
>>
>> ./funclatency ext4_writepages -->(5.10.162)
>> avg = 7734912 nsecs, total: 134131121171 nsecs, count: 17341
>>
>> ./funclatency ext4_writepages -->(5.10.162 with revert)
>> avg = 9036068 nsecs, total: 168956404886 nsecs, count: 18698
>>
>>
>> Looking at the journal transaction data I can see that the average
>> transaction commit time decreases after reverting the commit.
>> This probably helps the IOPS score.
> So what the workload is doing is:
> write()
>    inode lock
>    dirty 16k of page cache
>    dirty inode i_mtime
>    inode unlock
> fsync()
>    walk all inode pages, write dirty ones
>    wait for all pages under writeback in the inode to complete IO
>    force transaction commit and wait for it
>
> Now this has the best throughput (and the worst latency) if all 16
> processes work in lockstep - i.e., like:
>
>    task1         task2           task3 ...
>    write()
>                  write()
>                                  write()
>    fsync()
>                  fsync()
>                                  fsync()
>
> because in that case we write out all dirty pages from the 16 processes in
> one sweep and also accumulate 16 mtime updates in a single transaction
> commit.
>
> Now I suspect the commit you've identified leads to less synchronization
> between the processes and thus to less batching. In particular, before the
> commit we called mpage_prepare_extent_to_map() twice, and the second
> invocation starts where the first invocation saw the last dirty page. So it
> potentially writes newly dirtied pages beyond that place as well, and that
> effectively synchronizes the processes more.
>
> To confirm the theory, it might be interesting to gather a histogram of the
> number of pages written back per ext4_writepages() call with / without the
> commit.
>
>                                                                  Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Hi Jan

I collected some data on the number of pages written back per ext4_writepages() call. What you pointed out seems to be correct:
without the commit I see more batching (higher writeback counts in the 4-20 page range) than with the commit.
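
For reference, a minimal sketch of how such a histogram can be gathered with
bpftrace, assuming the 5.10 signature ext4_writepages(mapping, wbc); the
per-call page count is inferred from how much of the wbc->nr_to_write budget
the call consumed:

    bpftrace -e '#include <linux/writeback.h>
    kprobe:ext4_writepages {
            /* arg1 is struct writeback_control *wbc */
            @wbc[tid] = arg1;
            @start[tid] = ((struct writeback_control *)arg1)->nr_to_write;
    }
    kretprobe:ext4_writepages /@wbc[tid]/ {
            $wbc = (struct writeback_control *)@wbc[tid];
            /* pages written ~= nr_to_write budget consumed */
            @pages = lhist(@start[tid] - $wbc->nr_to_write, 0, 30, 2);
            delete(@wbc[tid]);
            delete(@start[tid]);
    }'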

Without the commit (reverted):

[0-1]   ->  4246
[2-3]   ->  312
[4-5]   ->  20836
[6-7]   ->  205
[8-9]   ->  895
[10-11] ->  56
[12-13] ->  422
[14-15] ->  62
[16-17] ->  234
[18-19] ->  66
[20-21] ->  77
[22-23] ->  9
[24-25] ->  26
[26-27] ->  1
[28-29] ->  13

Average page count: 3.9194


With the commit:

[0-1]   -> 1635
[2-3]   -> 123
[4-5]   -> 24302
[6-7]   -> 38
[8-9]   -> 604
[10-11] -> 19
[12-13] -> 123
[14-15] -> 12
[16-17] -> 24
[18-19] -> 3
[20-21] -> 8
[22-23] -> 1
[24-25] -> 3
[26-27] -> 0
[28-29] -> 1

Average page count: 3.9184

Also, looking at the journal data I see that without the commit we have more handles per journal transaction:

Without the commit:
cat /proc/fs/jbd2/nvme1n1-8/info
2092 transactions (2091 requested), each up to 8192 blocks
average:
0ms waiting for transaction
0ms request delay
20ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
20ms logging transaction
15981us average transaction commit time
67 handles per transaction
1 blocks per transaction
2 logged blocks per transaction

With the commit:
cat /proc/fs/jbd2/nvme1n1-8/info
2143 transactions (2143 requested), each up to 8192 blocks
average:
0ms waiting for transaction
0ms request delay
0ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
20ms logging transaction
20731us average transaction commit time
51 handles per transaction
1 blocks per transaction
3 logged blocks per transaction

Thanks
Rishabh

