* EXT4 IOPS degradation between 4.14 and 5.10
@ 2023-01-26  0:33 Bhatnagar, Rishabh
  2023-01-26  9:32 ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Bhatnagar, Rishabh @ 2023-01-26  0:33 UTC (permalink / raw)
  To: Jan Kara, tytso, akpm; +Cc: linux-mm, linux-ext4, linux-kernel, abuehaze

Hi Jan

As discussed in the previous thread, I'm chasing an IOPS regression between
the 4.14 and 5.10 kernels:
https://lore.kernel.org/lkml/20230112113820.hjwvieq3ucbwreql@quack3/T/

The last issue we discussed was difficult to resolve, so I'm keeping it on
the back burner for now.

I did some more bisecting and found another series of patches that
potentially impacts the IOPS score:
72b045aecdd856b083521f2a963705b4c2e59680 (mm: implement
find_get_pages_range_tag())

Running fio tests with the tip at 9c19a9cb1642c074aa8bc7693cd4c038643960ae
(including the 16-patch series) vs the tip at
6b4c54e3787bc03e810062bd257a3b05fd9c72d6 (without the above series)
shows a drop in IOPS with the series applied.

Fio with buffered io/fsync=1/randwrite

With HEAD as 9c19a9cb1642c074aa8bc7693cd4c038643960ae (with the above 
series)

write: io=445360KB, bw=7418.6KB/s, *iops=463*, runt= 60033msec
clat (usec): min=4, max=32132, avg=311.90, stdev=1812.74
lat (usec): min=5, max=32132, avg=312.28, stdev=1812.74
clat percentiles (usec):
| 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 16], 20.00th=[ 25],
| 30.00th=[ 36], 40.00th=[ 47], 50.00th=[ 60], 60.00th=[ 71],
| 70.00th=[ 84], 80.00th=[ 97], 90.00th=[ 111], 95.00th=[ 118],
| 99.00th=[11840], 99.50th=[15936], 99.90th=[21888], 99.95th=[23936],

With HEAD as 6b4c54e3787bc03e810062bd257a3b05fd9c72d6 (without the above 
series)

write: io=455184KB, bw=7583.4KB/s, *iops=473*, runt= 60024msec
clat (usec): min=6, max=24325, avg=319.72, stdev=1694.52
lat (usec): min=6, max=24326, avg=319.99, stdev=1694.53
clat percentiles (usec):
| 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 17], 20.00th=[ 26],
| 30.00th=[ 38], 40.00th=[ 50], 50.00th=[ 60], 60.00th=[ 73],
| 70.00th=[ 85], 80.00th=[ 98], 90.00th=[ 111], 95.00th=[ 118],
| 99.00th=[ 9792], 99.50th=[14016], 99.90th=[21888], 99.95th=[22400],
| 99.99th=[24192]


I also see that the number of handles per transaction was much higher
before this patch series:

0ms waiting for transaction
0ms request delay
20ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
10ms logging transaction
*13524us average transaction commit time*
*73 handles per transaction*
0 blocks per transaction
1 logged blocks per transaction

vs after the patch series.

0ms waiting for transaction
0ms request delay
20ms running transaction
0ms transaction was being locked
0ms flushing data (in ordered mode)
20ms logging transaction
*21468us average transaction commit time*
*66 handles per transaction*
1 blocks per transaction
1 logged blocks per transaction

This is probably again helping to bunch up the writeback transactions and
increase throughput.

I looked at the code to understand what might be going on.
Commit 72b045aecdd856b083521f2a963705b4c2e59680 changes the behavior of
find_get_pages_range_tag().
Before this commit, if find_get_pages_tag() could not find nr_pages
(PAGEVEC_SIZE) pages, it returned the number of pages found as ret and set
*index to the index of the last page it found plus one. After the commit,
if fewer than nr_pages pages are found, *index is set to end + 1 rather
than just past the last page found. (Diff from the commit and a sketch of
the caller loop below.)
Since pagevec_lookup_range_tag() is always called in a while (index <= end)
loop, the pre-commit behavior helps coalesce writeback of pages when
multiple threads are writing, because the loop can keep finding newly
dirtied (tagged) pages instead of terminating once *index reaches end.

+	/*
+	 * We come here when we got at @end. We take care to not overflow the
+	 * index @index as it confuses some of the callers. This breaks the
+	 * iteration when there is page at index -1 but that is already broken
+	 * anyway.
+	 */
+	if (end == (pgoff_t)-1)
+		*index = (pgoff_t)-1;
+	else
+		*index = end + 1;
+out:
 	rcu_read_unlock();
 
-	if (ret)
-		*index = pages[ret - 1]->index + 1;
-
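
For reference, here is a minimal sketch of the kind of caller loop I mean
(a simplified write_cache_pages()-style loop, not the exact kernel code;
locking, error handling and the actual page submission are elided, and the
function name is just illustrative):

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/sched.h>

/*
 * How *index is advanced by pagevec_lookup_range_tag() on a short (not
 * full) pagevec decides whether this loop keeps picking up pages dirtied
 * by other writers in the meantime, or exits after a single pass.
 */
static void writeback_range_sketch(struct address_space *mapping,
				   pgoff_t index, pgoff_t end)
{
	struct pagevec pvec;
	unsigned int nr;

	pagevec_init(&pvec);
	while (index <= end) {
		nr = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
					      PAGECACHE_TAG_DIRTY);
		if (!nr)
			break;
		/* lock, map and submit pvec.pages[0..nr-1] for writeback */
		pagevec_release(&pvec);
		cond_resched();
	}
}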

From the description of the patch I didn't see any mention of this
functional change.
Was this change intentional, and did it help some use case or provide a
general performance improvement?

Thanks
Rishabh



* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-01-26  0:33 EXT4 IOPS degradation between 4.14 and 5.10 Bhatnagar, Rishabh
@ 2023-01-26  9:32 ` Jan Kara
  2023-01-26 18:23   ` Bhatnagar, Rishabh
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2023-01-26  9:32 UTC (permalink / raw)
  To: Bhatnagar, Rishabh
  Cc: Jan Kara, tytso, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze

Hello!

On Wed 25-01-23 16:33:54, Bhatnagar, Rishabh wrote:
> As discussed in the previous thread I'm chasing IOPS regression between 4.14
> -> 5.10 kernels.
> https://lore.kernel.org/lkml/20230112113820.hjwvieq3ucbwreql@quack3/T/
> 
> Last issue we discussed was difficult to resolve so keeping it on the back
> burner for now.
> 
> I did some more bisecting and saw another series of patches that potentially
> impacts iops score:
> 72b045aecdd856b083521f2a963705b4c2e59680 (mm: implement
> find_get_pages_range_tag())
> 
> Running fio tests on tip as 9c19a9cb1642c074aa8bc7693cd4c038643960ae
> (including the 16 patch series) vs tip as
> 6b4c54e3787bc03e810062bd257a3b05fd9c72d6 (without the above series) shows an
> IOPS jump.

Ok, thanks for pinpointing this. That series landed a long time ago ;).

> Fio with buffered io/fsync=1/randwrite

So I'm curious. Do you have any workload that actually does these
synchronous random buffered IOs? Or is it just a benchmarking exercise?

> With HEAD as 9c19a9cb1642c074aa8bc7693cd4c038643960ae (with the above
> series)
> 
> write: io=445360KB, bw=7418.6KB/s, *iops=463*, runt= 60033msec
> clat (usec): min=4, max=32132, avg=311.90, stdev=1812.74
> lat (usec): min=5, max=32132, avg=312.28, stdev=1812.74
> clat percentiles (usec):
> | 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 16], 20.00th=[ 25],
> | 30.00th=[ 36], 40.00th=[ 47], 50.00th=[ 60], 60.00th=[ 71],
> | 70.00th=[ 84], 80.00th=[ 97], 90.00th=[ 111], 95.00th=[ 118],
> | 99.00th=[11840], 99.50th=[15936], 99.90th=[21888], 99.95th=[23936],
> 
> With HEAD as 6b4c54e3787bc03e810062bd257a3b05fd9c72d6(without the above
> series)
> 
> write: io=455184KB, bw=7583.4KB/s, *iops=473*, runt= 60024msec
> clat (usec): min=6, max=24325, avg=319.72, stdev=1694.52
> lat (usec): min=6, max=24326, avg=319.99, stdev=1694.53
> clat percentiles (usec):
> | 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 17], 20.00th=[ 26],
> | 30.00th=[ 38], 40.00th=[ 50], 50.00th=[ 60], 60.00th=[ 73],
> | 70.00th=[ 85], 80.00th=[ 98], 90.00th=[ 111], 95.00th=[ 118],
> | 99.00th=[ 9792], 99.50th=[14016], 99.90th=[21888], 99.95th=[22400],
> | 99.99th=[24192]

OK, about 2% regression. How stable is that across multiple runs?

> I also see that number of handles per transaction were much higher before
> this patch series
> 
> 0ms waiting for transaction
> 0ms request delay
> 20ms running transaction
> 0ms transaction was being locked
> 0ms flushing data (in ordered mode)
> 10ms logging transaction
> *13524us average transaction commit time*
> *73 handles per transaction*
> 0 blocks per transaction
> 1 logged blocks per transaction
> 
> vs after the patch series.
> 
> 0ms waiting for transaction
> 0ms request delay
> 20ms running transaction
> 0ms transaction was being locked
> 0ms flushing data (in ordered mode)
> 20ms logging transaction
> *21468us average transaction commit time*
> *66 handles per transaction*
> 1 blocks per transaction
> 1 logged blocks per transaction
> 
> This is probably again helping in bunching the writeback transactions and
> increasing throughput.

Yeah, probably.

> I looked at the code to understand what might be going on.
> It seems like commit 72b045aecdd856b083521f2a963705b4c2e59680 changes the
> behavior of find_get_pages_range_tag.
> Before this commit if find_get_pages_tag cannot find nr_pages (PAGEVEC_SIZE)
> it returns the number of pages found as ret and
> sets the *index to the last page it found + 1. After the commit the behavior
> changes such that if we don’t find nr_pages pages
> we set the index to end and not to the last found page. (added diff from
> above commit)
> Since pagevec_lookup_range_tag is always called in a while loop (index <=
> end) the code before the commit helps in coalescing
> writeback of pages if there are multiple threads doing write as it might
> keep finding new dirty (tagged) pages since it doesn’t set index to end.
> 
> + /*
> + * We come here when we got at @end. We take care to not overflow the
> + * index @index as it confuses some of the callers. This breaks the
> + * iteration when there is page at index -1 but that is already broken
> + * anyway.
> + */
> + if (end == (pgoff_t)-1)
> + *index = (pgoff_t)-1;
> + else
> + *index = end + 1;
> +out:
> rcu_read_unlock();
> 
> - if (ret)
> - *index = pages[ret - 1]->index + 1;
> -
> 
> From the description of the patch i didn't see any mention of this
> functional change.
> Was this change intentional and did help some usecase or general performance
> improvement?

So the change was intentional. When I was working on the series, I was
somewhat concerned that the old code could end up in a pathological
situation like:
  We scan range 0-1000000, find the only dirty page at index 0, return it.
  We scan range 1-1000000, find the only dirty page at index 1, return it.
  ...

This way we end up with rather inefficient scanning, and in theory a malicious
user could livelock writeback like this. That being said, this was/is mostly
a theoretical concern.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-01-26  9:32 ` Jan Kara
@ 2023-01-26 18:23   ` Bhatnagar, Rishabh
  2023-01-27 12:17     ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Bhatnagar, Rishabh @ 2023-01-26 18:23 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze


On 1/26/23 1:32 AM, Jan Kara wrote:
> Hello!
>
> On Wed 25-01-23 16:33:54, Bhatnagar, Rishabh wrote:
>> As discussed in the previous thread I'm chasing IOPS regression between 4.14
>> -> 5.10 kernels.
>> https://lore.kernel.org/lkml/20230112113820.hjwvieq3ucbwreql@quack3/T/
>>
>> Last issue we discussed was difficult to resolve so keeping it on the back
>> burner for now.
>>
>> I did some more bisecting and saw another series of patches that potentially
>> impacts iops score:
>> 72b045aecdd856b083521f2a963705b4c2e59680 (mm: implement
>> find_get_pages_range_tag())
>>
>> Running fio tests on tip as 9c19a9cb1642c074aa8bc7693cd4c038643960ae
>> (including the 16 patch series) vs tip as
>> 6b4c54e3787bc03e810062bd257a3b05fd9c72d6 (without the above series) shows an
>> IOPS jump.
> Ok, thanks for pinpointing this. That series landed a long time ago ;).
>
Yes, it's quite an old patch series. I was tracking a regression for teams
moving from 4.14 to the newer 5.10 kernel (although that's also old now).
>> Fio with buffered io/fsync=1/randwrite
> So I'm curious. Do you have any workload that actually does these
> synchronous random buffered IOs? Or is it just a benchmarking exercise?
>
The database team was running a benchmark called HammerDB with 16
connections, where they started noticing dropped performance on 5.10. We
traced it back to the filesystem with this benchmark, although I guess the
actual database workload would be something similar.

>> With HEAD as 9c19a9cb1642c074aa8bc7693cd4c038643960ae (with the above
>> series)
>>
>> write: io=445360KB, bw=7418.6KB/s, *iops=463*, runt= 60033msec
>> clat (usec): min=4, max=32132, avg=311.90, stdev=1812.74
>> lat (usec): min=5, max=32132, avg=312.28, stdev=1812.74
>> clat percentiles (usec):
>> | 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 16], 20.00th=[ 25],
>> | 30.00th=[ 36], 40.00th=[ 47], 50.00th=[ 60], 60.00th=[ 71],
>> | 70.00th=[ 84], 80.00th=[ 97], 90.00th=[ 111], 95.00th=[ 118],
>> | 99.00th=[11840], 99.50th=[15936], 99.90th=[21888], 99.95th=[23936],
>>
>> With HEAD as 6b4c54e3787bc03e810062bd257a3b05fd9c72d6(without the above
>> series)
>>
>> write: io=455184KB, bw=7583.4KB/s, *iops=473*, runt= 60024msec
>> clat (usec): min=6, max=24325, avg=319.72, stdev=1694.52
>> lat (usec): min=6, max=24326, avg=319.99, stdev=1694.53
>> clat percentiles (usec):
>> | 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 17], 20.00th=[ 26],
>> | 30.00th=[ 38], 40.00th=[ 50], 50.00th=[ 60], 60.00th=[ 73],
>> | 70.00th=[ 85], 80.00th=[ 98], 90.00th=[ 111], 95.00th=[ 118],
>> | 99.00th=[ 9792], 99.50th=[14016], 99.90th=[21888], 99.95th=[22400],
>> | 99.99th=[24192]
> OK, about 2% regression. How stable is that across multiple runs?
This is quite consistent. The thing to note here is that this test is on a
4.14+ tip, with and without this patch series.
I tried reverting to the old behavior on 5.10 with the following patch, but
I only see a very small improvement (~1%).

This is because 5.10 also has the patch
"ext4: avoid unnecessary transaction starts during writeback", which
limits how often this function is invoked within the while loop.
If I revert that patch as well, along with applying the patch below, I get
back around 4% IOPS.

-       if (end == (pgoff_t)-1)
-               *index = (pgoff_t)-1;
-       else
-               *index = end + 1;
+       if (xas.xa_index > end) {
+               if (end == (pgoff_t)-1)
+                       *index = (pgoff_t)-1;
+               else
+                       *index = end + 1;
+       }
+       else if (ret)
+               *index = xas.xa_index + 1;

>
>> I also see that number of handles per transaction were much higher before
>> this patch series
>>
>> 0ms waiting for transaction
>> 0ms request delay
>> 20ms running transaction
>> 0ms transaction was being locked
>> 0ms flushing data (in ordered mode)
>> 10ms logging transaction
>> *13524us average transaction commit time*
>> *73 handles per transaction*
>> 0 blocks per transaction
>> 1 logged blocks per transaction
>>
>> vs after the patch series.
>>
>> 0ms waiting for transaction
>> 0ms request delay
>> 20ms running transaction
>> 0ms transaction was being locked
>> 0ms flushing data (in ordered mode)
>> 20ms logging transaction
>> *21468us average transaction commit time*
>> *66 handles per transaction*
>> 1 blocks per transaction
>> 1 logged blocks per transaction
>>
>> This is probably again helping in bunching the writeback transactions and
>> increasing throughput.
> Yeah, probably.
>
>> I looked at the code to understand what might be going on.
>> It seems like commit 72b045aecdd856b083521f2a963705b4c2e59680 changes the
>> behavior of find_get_pages_range_tag.
>> Before this commit if find_get_pages_tag cannot find nr_pages (PAGEVEC_SIZE)
>> it returns the number of pages found as ret and
>> sets the *index to the last page it found + 1. After the commit the behavior
>> changes such that if we don’t find nr_pages pages
>> we set the index to end and not to the last found page. (added diff from
>> above commit)
>> Since pagevec_lookup_range_tag is always called in a while loop (index <=
>> end) the code before the commit helps in coalescing
>> writeback of pages if there are multiple threads doing write as it might
>> keep finding new dirty (tagged) pages since it doesn’t set index to end.
>>
>> + /*
>> + * We come here when we got at @end. We take care to not overflow the
>> + * index @index as it confuses some of the callers. This breaks the
>> + * iteration when there is page at index -1 but that is already broken
>> + * anyway.
>> + */
>> + if (end == (pgoff_t)-1)
>> + *index = (pgoff_t)-1;
>> + else
>> + *index = end + 1;
>> +out:
>> rcu_read_unlock();
>>
>> - if (ret)
>> - *index = pages[ret - 1]->index + 1;
>> -
>>
>>  From the description of the patch i didn't see any mention of this
>> functional change.
>> Was this change intentional and did help some usecase or general performance
>> improvement?
> So the change was intentional. When I was working on the series, I was
> somewhat concerned that the old code could end up in a pathological
> situation like:
>    We scan range 0-1000000, find the only dirty page at index 0, return it.
>    We scan range 1-1000000, find the only dirty page at index 1, return it.
>    ...
>
> This way we end up with rather inefficient scanning and in theory malicious
> user could livelock writeback like this. That being said this was/is mostly
> a theoretical concern.

OK, so it's more of a security concern. But do you think this has a
latency impact? I didn't see much latency difference between the two sets,
and throughput is higher.

>
>                                                                  Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-01-26 18:23   ` Bhatnagar, Rishabh
@ 2023-01-27 12:17     ` Jan Kara
  2023-01-30 17:45       ` Bhatnagar, Rishabh
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2023-01-27 12:17 UTC (permalink / raw)
  To: Bhatnagar, Rishabh
  Cc: Jan Kara, tytso, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze

On Thu 26-01-23 10:23:07, Bhatnagar, Rishabh wrote:
> On 1/26/23 1:32 AM, Jan Kara wrote:
> > On Wed 25-01-23 16:33:54, Bhatnagar, Rishabh wrote:
> > > Fio with buffered io/fsync=1/randwrite
> > So I'm curious. Do you have any workload that actually does these
> > synchronous random buffered IOs? Or is it just a benchmarking exercise?
> > 
> So database team was a running benchmark called hammerdb with 16 connections
> where they started noticing dropped
> performance for 5.10. We traced it back to filesystem with this benchmark.
> Although the database workload would be something similar
>  I guess.

HammerDB is a TPC-C and TPC-H benchmark but details of the IO depend more
on the database it is hammering. Do you know which one? Anyway the most
important fact for me is that it is visible in a real world workload (well,
as much as DB benchmarking is real-world ;)).

> > > With HEAD as 9c19a9cb1642c074aa8bc7693cd4c038643960ae (with the above
> > > series)
> > > 
> > > write: io=445360KB, bw=7418.6KB/s, *iops=463*, runt= 60033msec
> > > clat (usec): min=4, max=32132, avg=311.90, stdev=1812.74
> > > lat (usec): min=5, max=32132, avg=312.28, stdev=1812.74
> > > clat percentiles (usec):
> > > | 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 16], 20.00th=[ 25],
> > > | 30.00th=[ 36], 40.00th=[ 47], 50.00th=[ 60], 60.00th=[ 71],
> > > | 70.00th=[ 84], 80.00th=[ 97], 90.00th=[ 111], 95.00th=[ 118],
> > > | 99.00th=[11840], 99.50th=[15936], 99.90th=[21888], 99.95th=[23936],
> > > 
> > > With HEAD as 6b4c54e3787bc03e810062bd257a3b05fd9c72d6(without the above
> > > series)
> > > 
> > > write: io=455184KB, bw=7583.4KB/s, *iops=473*, runt= 60024msec
> > > clat (usec): min=6, max=24325, avg=319.72, stdev=1694.52
> > > lat (usec): min=6, max=24326, avg=319.99, stdev=1694.53
> > > clat percentiles (usec):
> > > | 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 17], 20.00th=[ 26],
> > > | 30.00th=[ 38], 40.00th=[ 50], 50.00th=[ 60], 60.00th=[ 73],
> > > | 70.00th=[ 85], 80.00th=[ 98], 90.00th=[ 111], 95.00th=[ 118],
> > > | 99.00th=[ 9792], 99.50th=[14016], 99.90th=[21888], 99.95th=[22400],
> > > | 99.99th=[24192]
> > OK, about 2% regression. How stable is that across multiple runs?
> This is quite consistent. The thing to note here is this test is on 4.14+
> tip with and without
> these patch series.
> I tried reverting to the same behavior in 5.10 with the following patch but
> i only see a very low improvement. (~1%)

I see. Thanks for the experiments.

> This is because on 5.10 we also have this patch
> ext4: avoid unnecessary transaction starts during writeback. (This limits
> the invocation of this function within the while loop)
> If i revert the above patch as well along with applying below patch i get
> back around 4% IOPS.
> 
> -       if (end == (pgoff_t)-1)
> -               *index = (pgoff_t)-1;
> -       else
> -               *index = end + 1;
> +       if (xas.xa_index > end) {
> +               if (end == (pgoff_t)-1)
> +                       *index = (pgoff_t)-1;
> +               else
> +                       *index = end + 1;
> +       }
> +       else if (ret)
> +               *index = xas.xa_index + 1;

Yes, so these patches have a kind of magnifying effect together, as each
multiplies the amount of scanning by 2. So with both of them we get about 4
scans of the inode for dirty pages in one writeback round instead of 1.

> > > I also see that number of handles per transaction were much higher before
> > > this patch series
> > > 
> > > 0ms waiting for transaction
> > > 0ms request delay
> > > 20ms running transaction
> > > 0ms transaction was being locked
> > > 0ms flushing data (in ordered mode)
> > > 10ms logging transaction
> > > *13524us average transaction commit time*
> > > *73 handles per transaction*
> > > 0 blocks per transaction
> > > 1 logged blocks per transaction
> > > 
> > > vs after the patch series.
> > > 
> > > 0ms waiting for transaction
> > > 0ms request delay
> > > 20ms running transaction
> > > 0ms transaction was being locked
> > > 0ms flushing data (in ordered mode)
> > > 20ms logging transaction
> > > *21468us average transaction commit time*
> > > *66 handles per transaction*
> > > 1 blocks per transaction
> > > 1 logged blocks per transaction
> > > 
> > > This is probably again helping in bunching the writeback transactions and
> > > increasing throughput.
> > Yeah, probably.
> > 
> > > I looked at the code to understand what might be going on.
> > > It seems like commit 72b045aecdd856b083521f2a963705b4c2e59680 changes the
> > > behavior of find_get_pages_range_tag.
> > > Before this commit if find_get_pages_tag cannot find nr_pages (PAGEVEC_SIZE)
> > > it returns the number of pages found as ret and
> > > sets the *index to the last page it found + 1. After the commit the behavior
> > > changes such that if we don’t find nr_pages pages
> > > we set the index to end and not to the last found page. (added diff from
> > > above commit)
> > > Since pagevec_lookup_range_tag is always called in a while loop (index <=
> > > end) the code before the commit helps in coalescing
> > > writeback of pages if there are multiple threads doing write as it might
> > > keep finding new dirty (tagged) pages since it doesn’t set index to end.
> > > 
> > > + /*
> > > + * We come here when we got at @end. We take care to not overflow the
> > > + * index @index as it confuses some of the callers. This breaks the
> > > + * iteration when there is page at index -1 but that is already broken
> > > + * anyway.
> > > + */
> > > + if (end == (pgoff_t)-1)
> > > + *index = (pgoff_t)-1;
> > > + else
> > > + *index = end + 1;
> > > +out:
> > > rcu_read_unlock();
> > > 
> > > - if (ret)
> > > - *index = pages[ret - 1]->index + 1;
> > > -
> > > 
> > >  From the description of the patch i didn't see any mention of this
> > > functional change.
> > > Was this change intentional and did help some usecase or general performance
> > > improvement?
> > So the change was intentional. When I was working on the series, I was
> > somewhat concerned that the old code could end up in a pathological
> > situation like:
> >    We scan range 0-1000000, find the only dirty page at index 0, return it.
> >    We scan range 1-1000000, find the only dirty page at index 1, return it.
> >    ...
> > 
> > This way we end up with rather inefficient scanning and in theory malicious
> > user could livelock writeback like this. That being said this was/is mostly
> > a theoretical concern.
> 
> Ok so its more of a security concern. But do you think this has a latency
> impact? I didn't see
> much latency impact between the two sets and throughput is higher.

Yes, I expect there will be a latency impact, but for this workload it is
probably relatively small. I expect the expensive part of your workload is
the fsync(2) call, in particular the committing of the transaction and the
flushing of the disk caches as a result of that. Data writes themselves are
relatively cheap. If you had, say, 1MB blocks instead of 16k ones, I'd expect
the numbers to start looking different as the cost of the IO and of the
cache flushing become more comparable - obviously it all depends on the
details of the backing storage as well. Perhaps you could measure how much
time we spend in file_write_and_wait_range() calls vs in the whole
ext4_sync_file() call to confirm that?

Overall your tests show we could gain some throughput without sacrificing
too much latency if we somehow batch tiny fsync requests more. The trick is
how to do this without regressing other workloads and also across various
storage types which are going to have very different properties.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-01-27 12:17     ` Jan Kara
@ 2023-01-30 17:45       ` Bhatnagar, Rishabh
  2023-02-08 14:02         ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Bhatnagar, Rishabh @ 2023-01-30 17:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze


On 1/27/23 4:17 AM, Jan Kara wrote:
> On Thu 26-01-23 10:23:07, Bhatnagar, Rishabh wrote:
>> On 1/26/23 1:32 AM, Jan Kara wrote:
>>> On Wed 25-01-23 16:33:54, Bhatnagar, Rishabh wrote:
>>>> Fio with buffered io/fsync=1/randwrite
>>> So I'm curious. Do you have any workload that actually does these
>>> synchronous random buffered IOs? Or is it just a benchmarking exercise?
>>>
>> So database team was a running benchmark called hammerdb with 16 connections
>> where they started noticing dropped
>> performance for 5.10. We traced it back to filesystem with this benchmark.
>> Although the database workload would be something similar
>>   I guess.
> HammerDB is a TPC-C and TPC-H benchmark but details of the IO depend more
> on the database it is hammering. Do you know which one? Anyway the most
> important fact for me is that it is visible in a real world workload (well,
> as much as DB benchmarking is real-world ;)).
I believe it's the MySQL database, though I'm not so sure.
>
>>>> With HEAD as 9c19a9cb1642c074aa8bc7693cd4c038643960ae (with the above
>>>> series)
>>>>
>>>> write: io=445360KB, bw=7418.6KB/s, *iops=463*, runt= 60033msec
>>>> clat (usec): min=4, max=32132, avg=311.90, stdev=1812.74
>>>> lat (usec): min=5, max=32132, avg=312.28, stdev=1812.74
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 8], 5.00th=[ 10], 10.00th=[ 16], 20.00th=[ 25],
>>>> | 30.00th=[ 36], 40.00th=[ 47], 50.00th=[ 60], 60.00th=[ 71],
>>>> | 70.00th=[ 84], 80.00th=[ 97], 90.00th=[ 111], 95.00th=[ 118],
>>>> | 99.00th=[11840], 99.50th=[15936], 99.90th=[21888], 99.95th=[23936],
>>>>
>>>> With HEAD as 6b4c54e3787bc03e810062bd257a3b05fd9c72d6(without the above
>>>> series)
>>>>
>>>> write: io=455184KB, bw=7583.4KB/s, *iops=473*, runt= 60024msec
>>>> clat (usec): min=6, max=24325, avg=319.72, stdev=1694.52
>>>> lat (usec): min=6, max=24326, avg=319.99, stdev=1694.53
>>>> clat percentiles (usec):
>>>> | 1.00th=[ 9], 5.00th=[ 11], 10.00th=[ 17], 20.00th=[ 26],
>>>> | 30.00th=[ 38], 40.00th=[ 50], 50.00th=[ 60], 60.00th=[ 73],
>>>> | 70.00th=[ 85], 80.00th=[ 98], 90.00th=[ 111], 95.00th=[ 118],
>>>> | 99.00th=[ 9792], 99.50th=[14016], 99.90th=[21888], 99.95th=[22400],
>>>> | 99.99th=[24192]
>>> OK, about 2% regression. How stable is that across multiple runs?
>> This is quite consistent. The thing to note here is this test is on 4.14+
>> tip with and without
>> these patch series.
>> I tried reverting to the same behavior in 5.10 with the following patch but
>> i only see a very low improvement. (~1%)
> I see. Thanks for experiments.
>
>> This is because on 5.10 we also have this patch
>> ext4: avoid unnecessary transaction starts during writeback. (This limits
>> the invocation of this function within the while loop)
>> If i revert the above patch as well along with applying below patch i get
>> back around 4% IOPS.
>>
>> -       if (end == (pgoff_t)-1)
>> -               *index = (pgoff_t)-1;
>> -       else
>> -               *index = end + 1;
>> +       if (xas.xa_index > end) {
>> +               if (end == (pgoff_t)-1)
>> +                       *index = (pgoff_t)-1;
>> +               else
>> +                       *index = end + 1;
>> +       }
>> +       else if (ret)
>> +               *index = xas.xa_index + 1;
> Yes, so these patches have kind of magnifying effect together as each
> multiplies the number of scanning by 2. So with both of them we get like 4
> scans of the inode for dirty pages in one writeback round instead of 1.
>
>>>> I also see that number of handles per transaction were much higher before
>>>> this patch series
>>>>
>>>> 0ms waiting for transaction
>>>> 0ms request delay
>>>> 20ms running transaction
>>>> 0ms transaction was being locked
>>>> 0ms flushing data (in ordered mode)
>>>> 10ms logging transaction
>>>> *13524us average transaction commit time*
>>>> *73 handles per transaction*
>>>> 0 blocks per transaction
>>>> 1 logged blocks per transaction
>>>>
>>>> vs after the patch series.
>>>>
>>>> 0ms waiting for transaction
>>>> 0ms request delay
>>>> 20ms running transaction
>>>> 0ms transaction was being locked
>>>> 0ms flushing data (in ordered mode)
>>>> 20ms logging transaction
>>>> *21468us average transaction commit time*
>>>> *66 handles per transaction*
>>>> 1 blocks per transaction
>>>> 1 logged blocks per transaction
>>>>
>>>> This is probably again helping in bunching the writeback transactions and
>>>> increasing throughput.
>>> Yeah, probably.
>>>
>>>> I looked at the code to understand what might be going on.
>>>> It seems like commit 72b045aecdd856b083521f2a963705b4c2e59680 changes the
>>>> behavior of find_get_pages_range_tag.
>>>> Before this commit if find_get_pages_tag cannot find nr_pages (PAGEVEC_SIZE)
>>>> it returns the number of pages found as ret and
>>>> sets the *index to the last page it found + 1. After the commit the behavior
>>>> changes such that if we don’t find nr_pages pages
>>>> we set the index to end and not to the last found page. (added diff from
>>>> above commit)
>>>> Since pagevec_lookup_range_tag is always called in a while loop (index <=
>>>> end) the code before the commit helps in coalescing
>>>> writeback of pages if there are multiple threads doing write as it might
>>>> keep finding new dirty (tagged) pages since it doesn’t set index to end.
>>>>
>>>> + /*
>>>> + * We come here when we got at @end. We take care to not overflow the
>>>> + * index @index as it confuses some of the callers. This breaks the
>>>> + * iteration when there is page at index -1 but that is already broken
>>>> + * anyway.
>>>> + */
>>>> + if (end == (pgoff_t)-1)
>>>> + *index = (pgoff_t)-1;
>>>> + else
>>>> + *index = end + 1;
>>>> +out:
>>>> rcu_read_unlock();
>>>>
>>>> - if (ret)
>>>> - *index = pages[ret - 1]->index + 1;
>>>> -
>>>>
>>>>   From the description of the patch i didn't see any mention of this
>>>> functional change.
>>>> Was this change intentional and did help some usecase or general performance
>>>> improvement?
>>> So the change was intentional. When I was working on the series, I was
>>> somewhat concerned that the old code could end up in a pathological
>>> situation like:
>>>     We scan range 0-1000000, find the only dirty page at index 0, return it.
>>>     We scan range 1-1000000, find the only dirty page at index 1, return it.
>>>     ...
>>>
>>> This way we end up with rather inefficient scanning and in theory malicious
>>> user could livelock writeback like this. That being said this was/is mostly
>>> a theoretical concern.
>> Ok so its more of a security concern. But do you think this has a latency
>> impact? I didn't see
>> much latency impact between the two sets and throughput is higher.
> Yes, I expect there will be latency impact but for this workload probably
> relatively small. I expect the expensive part on your workload is the
> fsync(2) call, in particular the committing of the transaction and the
> flushing of the disk caches as a result of that. Data writes themselves are
> relatively cheap. If you had say 1MB blocks instead of 16k ones, I'd expect
> the numbers to start looking differently as the cost of IO and of cache
> flushing becomes for comparable - obviously it all depends on the details
> of the backing storage as well. Perhaps could you measure how much time we
> spend in file_write_and_wait_range() calls vs in the whole ext4_sync_file()
> call to confirm that?
>
> Overall your tests show we could gain some throughput without sacrificing
> too much latency if we somehow batch tiny fsync requests more. The trick is
> how to do this without regressing other workloads and also across various
> storage types which are going to have very different properties.
>
>                                                                  Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

Yeah, I agree fsync is a much more expensive operation than just dirtying
buffers in the page cache.
I used the ext4dist tool from bcc to get the latency distribution of write
vs fsync.
Overall fsync is a much more expensive operation, so if we can get higher
throughput here with batching it shouldn't impact fsync latency that much.

operation = write
      usecs               : count     distribution
          0 -> 1          : 0 |                                        |
          2 -> 3          : 0 |                                        |
          4 -> 7          : 84 |                                        |
          8 -> 15         : 1833 |************                            |
         16 -> 31         : 3148 |********************                    |
         32 -> 63         : 4149 |***************************             |
         64 -> 127        : 6083 |****************************************|
        128 -> 255        : 46 |                                        |
        256 -> 511        : 0 |                                        |
        512 -> 1023       : 6 |                                        |
       1024 -> 2047       : 63 |                                        |
       2048 -> 4095       : 44 |                                        |
       4096 -> 8191       : 190 |*                                       |
       8192 -> 16383      : 191 |*                                       |
      16384 -> 32767      : 201 |*                                       |
      32768 -> 65535      : 43 |                                        |

operation = fsync
      usecs               : count     distribution
          0 -> 1          : 0 |                                        |
          2 -> 3          : 0 |                                        |
          4 -> 7          : 0 |                                        |
          8 -> 15         : 0 |                                        |
         16 -> 31         : 0 |                                        |
         32 -> 63         : 0 |                                        |
         64 -> 127        : 0 |                                        |
        128 -> 255        : 0 |                                        |
        256 -> 511        : 1 |                                        |
        512 -> 1023       : 27 |                                        |
       1024 -> 2047       : 62 |                                        |
       2048 -> 4095       : 212 |*                                       |
       4096 -> 8191       : 56 |                                        |
       8192 -> 16383      : 94 |                                        |
      16384 -> 32767      : 1070 |*****                                   |
      32768 -> 65535      : 7649 |****************************************|
      65536 -> 131071     : 6901 |************************************    |



* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-01-30 17:45       ` Bhatnagar, Rishabh
@ 2023-02-08 14:02         ` Jan Kara
  2023-02-08 19:17           ` Bhatnagar, Rishabh
  2023-02-08 19:37           ` Theodore Ts'o
  0 siblings, 2 replies; 9+ messages in thread
From: Jan Kara @ 2023-02-08 14:02 UTC (permalink / raw)
  To: Bhatnagar, Rishabh
  Cc: Jan Kara, tytso, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze

On Mon 30-01-23 09:45:32, Bhatnagar, Rishabh wrote:
> 
> On 1/27/23 4:17 AM, Jan Kara wrote:
> > On Thu 26-01-23 10:23:07, Bhatnagar, Rishabh wrote:
> > > On 1/26/23 1:32 AM, Jan Kara wrote:
> > > > On Wed 25-01-23 16:33:54, Bhatnagar, Rishabh wrote:
> > > > > Fio with buffered io/fsync=1/randwrite
> > > > So I'm curious. Do you have any workload that actually does these
> > > > synchronous random buffered IOs? Or is it just a benchmarking exercise?
> > > > 
> > > So database team was a running benchmark called hammerdb with 16 connections
> > > where they started noticing dropped
> > > performance for 5.10. We traced it back to filesystem with this benchmark.
> > > Although the database workload would be something similar
> > >   I guess.
> > HammerDB is a TPC-C and TPC-H benchmark but details of the IO depend more
> > on the database it is hammering. Do you know which one? Anyway the most
> > important fact for me is that it is visible in a real world workload (well,
> > as much as DB benchmarking is real-world ;)).
> I believe its the MySQL database though not so sure.

Well, in that case I think your MySQL DB is somewhat misconfigured. At
least when we have consulted MySQL / MariaDB developers regarding
benchmarking, they suggested we should configure the database to use direct
IO and increase the DB's internal buffers instead of relying on buffered IO
and pagecache behavior. And if your fio job is representative of the IO
load the DB really creates, I'd agree that would be a saner and likely more
performant configuration ;)

> > > > > + /*
> > > > > + * We come here when we got at @end. We take care to not overflow the
> > > > > + * index @index as it confuses some of the callers. This breaks the
> > > > > + * iteration when there is page at index -1 but that is already broken
> > > > > + * anyway.
> > > > > + */
> > > > > + if (end == (pgoff_t)-1)
> > > > > + *index = (pgoff_t)-1;
> > > > > + else
> > > > > + *index = end + 1;
> > > > > +out:
> > > > > rcu_read_unlock();
> > > > > 
> > > > > - if (ret)
> > > > > - *index = pages[ret - 1]->index + 1;
> > > > > -
> > > > > 
> > > > >   From the description of the patch i didn't see any mention of this
> > > > > functional change.
> > > > > Was this change intentional and did help some usecase or general performance
> > > > > improvement?
> > > > So the change was intentional. When I was working on the series, I was
> > > > somewhat concerned that the old code could end up in a pathological
> > > > situation like:
> > > >     We scan range 0-1000000, find the only dirty page at index 0, return it.
> > > >     We scan range 1-1000000, find the only dirty page at index 1, return it.
> > > >     ...
> > > > 
> > > > This way we end up with rather inefficient scanning and in theory malicious
> > > > user could livelock writeback like this. That being said this was/is mostly
> > > > a theoretical concern.
> > > Ok so its more of a security concern. But do you think this has a latency
> > > impact? I didn't see
> > > much latency impact between the two sets and throughput is higher.
> > Yes, I expect there will be latency impact but for this workload probably
> > relatively small. I expect the expensive part on your workload is the
> > fsync(2) call, in particular the committing of the transaction and the
> > flushing of the disk caches as a result of that. Data writes themselves are
> > relatively cheap. If you had say 1MB blocks instead of 16k ones, I'd expect
> > the numbers to start looking differently as the cost of IO and of cache
> > flushing becomes for comparable - obviously it all depends on the details
> > of the backing storage as well. Perhaps could you measure how much time we
> > spend in file_write_and_wait_range() calls vs in the whole ext4_sync_file()
> > call to confirm that?
> > 
> > Overall your tests show we could gain some throughput without sacrificing
> > too much latency if we somehow batch tiny fsync requests more. The trick is
> > how to do this without regressing other workloads and also across various
> > storage types which are going to have very different properties.
> 
> Yeah i agree fsync is much more expensive operation than just dirtying
> buffers in page cache.  I did use the ext4dist tool from bcc to get the
> distribution of write vs fsync.  Overall fsync is much more expensive
> operation so yeah if we can get higher throughput here with batching it
> shouldn't impact fsync latency that much.

Well, I was more interested in a file_write_and_wait_range() vs
ext4_sync_file() latency comparison. write(2) calls are indeed very fast
because they just copy into the page cache, so that is not very interesting.
But file_write_and_wait_range() is more interesting because it measures the
cost of writing the file data to the disk, while the whole ext4_sync_file()
measures the cost of writing the file data to the disk plus the cost of
flushing the journal, and I'm interested in how costly the flushing of the
journal is compared to the data writeback.
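
Schematically, the two costs being compared look like this (a rough sketch,
not the actual fs/ext4/fsync.c code; the nojournal, fast-commit and barrier
details are elided, the function name is just illustrative, and the
parameters only mirror ext4_sync_file()):

#include <linux/fs.h>
/* ext4_force_commit() is ext4-internal, declared in fs/ext4/ext4.h */

int ext4_sync_file_sketch(struct file *file, loff_t start, loff_t end,
			  int datasync)
{
	struct inode *inode = file_inode(file);
	int ret;

	/* Cost 1: write out and wait on the file's dirty data pages. */
	ret = file_write_and_wait_range(file, start, end);
	if (ret)
		return ret;

	/*
	 * Cost 2: commit the journal transaction covering this fsync,
	 * which also triggers the disk cache flush - the part suspected
	 * to dominate for this workload. (datasync handling elided.)
	 */
	return ext4_force_commit(inode->i_sb);
}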

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-02-08 14:02         ` Jan Kara
@ 2023-02-08 19:17           ` Bhatnagar, Rishabh
  2023-02-09  8:34             ` Jan Kara
  2023-02-08 19:37           ` Theodore Ts'o
  1 sibling, 1 reply; 9+ messages in thread
From: Bhatnagar, Rishabh @ 2023-02-08 19:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze


On 2/8/23 6:02 AM, Jan Kara wrote:
> On Mon 30-01-23 09:45:32, Bhatnagar, Rishabh wrote:
>> On 1/27/23 4:17 AM, Jan Kara wrote:
>>> On Thu 26-01-23 10:23:07, Bhatnagar, Rishabh wrote:
>>>> On 1/26/23 1:32 AM, Jan Kara wrote:
>>>>> On Wed 25-01-23 16:33:54, Bhatnagar, Rishabh wrote:
>>>>>> Fio with buffered io/fsync=1/randwrite
>>>>> So I'm curious. Do you have any workload that actually does these
>>>>> synchronous random buffered IOs? Or is it just a benchmarking exercise?
>>>>>
>>>> So database team was a running benchmark called hammerdb with 16 connections
>>>> where they started noticing dropped
>>>> performance for 5.10. We traced it back to filesystem with this benchmark.
>>>> Although the database workload would be something similar
>>>>    I guess.
>>> HammerDB is a TPC-C and TPC-H benchmark but details of the IO depend more
>>> on the database it is hammering. Do you know which one? Anyway the most
>>> important fact for me is that it is visible in a real world workload (well,
>>> as much as DB benchmarking is real-world ;)).
>> I believe its the MySQL database though not so sure.
> Well, in that case I think your MySQL DB is somewhat misconfigured. At
> least as far as we have been consulting MySQL / MariaDB developers
> regarding benchmarking, they suggested we should configure the database to
> use direct IO and increase DB internal buffers instead of relying on
> buffered IO and pagecache behavior. And if your fio job is representative
> of the IO load the DB really creates, I'd agree that that would be a saner
> and likely more performant configuration ;)
>
>>>>>> + /*
>>>>>> + * We come here when we got at @end. We take care to not overflow the
>>>>>> + * index @index as it confuses some of the callers. This breaks the
>>>>>> + * iteration when there is page at index -1 but that is already broken
>>>>>> + * anyway.
>>>>>> + */
>>>>>> + if (end == (pgoff_t)-1)
>>>>>> + *index = (pgoff_t)-1;
>>>>>> + else
>>>>>> + *index = end + 1;
>>>>>> +out:
>>>>>> rcu_read_unlock();
>>>>>>
>>>>>> - if (ret)
>>>>>> - *index = pages[ret - 1]->index + 1;
>>>>>> -
>>>>>>
>>>>>>    From the description of the patch i didn't see any mention of this
>>>>>> functional change.
>>>>>> Was this change intentional and did help some usecase or general performance
>>>>>> improvement?
>>>>> So the change was intentional. When I was working on the series, I was
>>>>> somewhat concerned that the old code could end up in a pathological
>>>>> situation like:
>>>>>      We scan range 0-1000000, find the only dirty page at index 0, return it.
>>>>>      We scan range 1-1000000, find the only dirty page at index 1, return it.
>>>>>      ...
>>>>>
>>>>> This way we end up with rather inefficient scanning and in theory malicious
>>>>> user could livelock writeback like this. That being said this was/is mostly
>>>>> a theoretical concern.
>>>> Ok so its more of a security concern. But do you think this has a latency
>>>> impact? I didn't see
>>>> much latency impact between the two sets and throughput is higher.
>>> Yes, I expect there will be latency impact but for this workload probably
>>> relatively small. I expect the expensive part on your workload is the
>>> fsync(2) call, in particular the committing of the transaction and the
>>> flushing of the disk caches as a result of that. Data writes themselves are
>>> relatively cheap. If you had say 1MB blocks instead of 16k ones, I'd expect
>>> the numbers to start looking differently as the cost of IO and of cache
>>> flushing becomes for comparable - obviously it all depends on the details
>>> of the backing storage as well. Perhaps could you measure how much time we
>>> spend in file_write_and_wait_range() calls vs in the whole ext4_sync_file()
>>> call to confirm that?
>>>
>>> Overall your tests show we could gain some throughput without sacrificing
>>> too much latency if we somehow batch tiny fsync requests more. The trick is
>>> how to do this without regressing other workloads and also across various
>>> storage types which are going to have very different properties.
>> Yeah i agree fsync is much more expensive operation than just dirtying
>> buffers in page cache.  I did use the ext4dist tool from bcc to get the
>> distribution of write vs fsync.  Overall fsync is much more expensive
>> operation so yeah if we can get higher throughput here with batching it
>> shouldn't impact fsync latency that much.
> Well, I was more interested in file_write_and_wait_range() vs
> ext4_sync_file() latency comparison. write(2) calls are indeed very fast
> because they just copy into the page cache so that is not very interesting.
> But file_write_and_wait_range() is more interesting because that measures
> the cost of writing file data to the disk while whole ext4_sync_file()
> measures the cost of writing file data to the disk + the cost of flushing
> the journal and I'm interested in how much is the flushing of the journal
> costly compared to the data writeback.

Sorry, I misunderstood your comment. Here is the revised data. Flushing the
journal is very heavy compared to flushing the data:

ext4_sync_file:    ~18.6 msecs
fdatawrite_range:  ~4 usecs
fdatawait_range:   ~83.6 usecs
fc_commit:         ~18.6 msecs

Tracing 1 functions for "ext4_sync_file"


      nsecs               : count     distribution
    1048576 -> 2097151    : 75 |                                        |
    2097152 -> 4194303    : 1496 |****                                    |
    4194304 -> 8388607    : 3461 |**********                              |
    8388608 -> 16777215   : 6693 |********************                    |
   16777216 -> 33554431   : 13355 |****************************************|
   33554432 -> 67108863   : 1631 |****                                    |

avg = 18624922 nsecs, total: 505778389231 nsecs, count: 27156


Tracing 1 functions for "__filemap_fdatawrite_range"

      nsecs               : count     distribution
        512 -> 1023       : 0 |                                        |
       1024 -> 2047       : 1 |**                                      |
       2048 -> 4095       : 14 |****************************************|
       4096 -> 8191       : 5 |**************                          |
       8192 -> 16383      : 1 |**                                      |

avg = 3943 nsecs, total: 82809 nsecs, count: 21

Tracing 1 functions for "__filemap_fdatawait_range"

      nsecs               : count     distribution
        128 -> 255        : 0 |                                        |
        256 -> 511        : 1 |********************                    |
        512 -> 1023       : 2 |****************************************|
       1024 -> 2047       : 2 |****************************************|
       2048 -> 4095       : 0 |                                        |
       4096 -> 8191       : 0 |                                        |
       8192 -> 16383      : 0 |                                        |
      16384 -> 32767      : 0 |                                        |
      32768 -> 65535      : 0 |                                        |
      65536 -> 131071     : 0 |                                        |
     131072 -> 262143     : 0 |                                        |
     262144 -> 524287     : 1 |********************                    |

avg = 83644 nsecs, total: 501866 nsecs, count: 6

Tracing 1 functions for "ext4_fc_commit"

      nsecs               : count     distribution
    1048576 -> 2097151    : 70 |                                        |
    2097152 -> 4194303    : 1512 |****                                    |
    4194304 -> 8388607    : 3436 |**********                              |
    8388608 -> 16777215   : 6699 |********************                    |
   16777216 -> 33554431   : 13349 |****************************************|
   33554432 -> 67108863   : 1641 |****                                    |

avg = 18622010 nsecs, total: 505699318966 nsecs, count: 27156

Thanks
Rishabh

>
>                                                                  Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-02-08 14:02         ` Jan Kara
  2023-02-08 19:17           ` Bhatnagar, Rishabh
@ 2023-02-08 19:37           ` Theodore Ts'o
  1 sibling, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2023-02-08 19:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Bhatnagar, Rishabh, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze

On Wed, Feb 08, 2023 at 03:02:47PM +0100, Jan Kara wrote:
> > I believe its the MySQL database though not so sure.
> 
> Well, in that case I think your MySQL DB is somewhat misconfigured. At
> least as far as we have been consulting MySQL / MariaDB developers
> regarding benchmarking, they suggested we should configure the database to
> use direct IO and increase DB internal buffers instead of relying on
> buffered IO and pagecache behavior. And if your fio job is representative
> of the IO load the DB really creates, I'd agree that that would be a saner
> and likely more performant configuration ;)

Could it possibly be Postgres?  I happen to know that Amazon RDS and
Google Cloud SQL support both MySQL and Postgres, and there are some
optimizations that some of us in the Cloud space have been pursuing
which are much easier because MySQL uses Direct I/O, but unfortunately
Postgres uses buffered I/O and doesn't support DIO.  :-(

							- Ted


* Re: EXT4 IOPS degradation between 4.14 and 5.10
  2023-02-08 19:17           ` Bhatnagar, Rishabh
@ 2023-02-09  8:34             ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2023-02-09  8:34 UTC (permalink / raw)
  To: Bhatnagar, Rishabh
  Cc: Jan Kara, tytso, akpm, linux-mm, linux-ext4, linux-kernel, abuehaze

On Wed 08-02-23 11:17:21, Bhatnagar, Rishabh wrote:
> > > > > > > + /*
> > > > > > > + * We come here when we got at @end. We take care to not overflow the
> > > > > > > + * index @index as it confuses some of the callers. This breaks the
> > > > > > > + * iteration when there is page at index -1 but that is already broken
> > > > > > > + * anyway.
> > > > > > > + */
> > > > > > > + if (end == (pgoff_t)-1)
> > > > > > > + *index = (pgoff_t)-1;
> > > > > > > + else
> > > > > > > + *index = end + 1;
> > > > > > > +out:
> > > > > > > rcu_read_unlock();
> > > > > > > 
> > > > > > > - if (ret)
> > > > > > > - *index = pages[ret - 1]->index + 1;
> > > > > > > -
> > > > > > > 
> > > > > > >    From the description of the patch i didn't see any mention of this
> > > > > > > functional change.
> > > > > > > Was this change intentional and did help some usecase or general performance
> > > > > > > improvement?
> > > > > > So the change was intentional. When I was working on the series, I was
> > > > > > somewhat concerned that the old code could end up in a pathological
> > > > > > situation like:
> > > > > >      We scan range 0-1000000, find the only dirty page at index 0, return it.
> > > > > >      We scan range 1-1000000, find the only dirty page at index 1, return it.
> > > > > >      ...
> > > > > > 
> > > > > > This way we end up with rather inefficient scanning and in theory malicious
> > > > > > user could livelock writeback like this. That being said this was/is mostly
> > > > > > a theoretical concern.
> > > > > Ok so its more of a security concern. But do you think this has a latency
> > > > > impact? I didn't see
> > > > > much latency impact between the two sets and throughput is higher.
> > > > Yes, I expect there will be latency impact but for this workload probably
> > > > relatively small. I expect the expensive part on your workload is the
> > > > fsync(2) call, in particular the committing of the transaction and the
> > > > flushing of the disk caches as a result of that. Data writes themselves are
> > > > relatively cheap. If you had say 1MB blocks instead of 16k ones, I'd expect
> > > > the numbers to start looking differently as the cost of IO and of cache
> > > > flushing becomes for comparable - obviously it all depends on the details
> > > > of the backing storage as well. Perhaps could you measure how much time we
> > > > spend in file_write_and_wait_range() calls vs in the whole ext4_sync_file()
> > > > call to confirm that?
> > > > 
> > > > Overall your tests show we could gain some throughput without sacrificing
> > > > too much latency if we somehow batch tiny fsync requests more. The trick is
> > > > how to do this without regressing other workloads and also across various
> > > > storage types which are going to have very different properties.
> > > Yeah i agree fsync is much more expensive operation than just dirtying
> > > buffers in page cache.  I did use the ext4dist tool from bcc to get the
> > > distribution of write vs fsync.  Overall fsync is much more expensive
> > > operation so yeah if we can get higher throughput here with batching it
> > > shouldn't impact fsync latency that much.
> > Well, I was more interested in file_write_and_wait_range() vs
> > ext4_sync_file() latency comparison. write(2) calls are indeed very fast
> > because they just copy into the page cache so that is not very interesting.
> > But file_write_and_wait_range() is more interesting because that measures
> > the cost of writing file data to the disk while whole ext4_sync_file()
> > measures the cost of writing file data to the disk + the cost of flushing
> > the journal and I'm interested in how much is the flushing of the journal
> > costly compared to the data writeback.
> 
> Sorry i misunderstood your comment. Here is the revised data. Flushing
> journal very heavy compared to flushing data.
> 
> ext4_sync_file:       ~18.6 msecs
> fdatawrite_range:   ~ 4usecs
> fdatawait_range:     ~ 83.6usecs
> fc_commit:             ~18.6 msecs
> 
> Tracing 1 functions for "ext4_sync_file"
> 
> 
>      nsecs               : count     distribution
>    1048576 -> 2097151    : 75 |                                        |
>    2097152 -> 4194303    : 1496 |****                                    |
>    4194304 -> 8388607    : 3461 |**********                              |
>    8388608 -> 16777215   : 6693 |********************                    |
>   16777216 -> 33554431   : 13355 |****************************************|
>   33554432 -> 67108863   : 1631 |****                                    |
> 
> avg = 18624922 nsecs, total: 505778389231 nsecs, count: 27156
> 
> 
> Tracing 1 functions for "__filemap_fdatawrite_range"
> 
>      nsecs               : count     distribution
>        512 -> 1023       : 0 |                                        |
>       1024 -> 2047       : 1 |**                                      |
>       2048 -> 4095       : 14 |****************************************|
>       4096 -> 8191       : 5 |**************                          |
>       8192 -> 16383      : 1 |**                                      |
> 
> avg = 3943 nsecs, total: 82809 nsecs, count: 21
> 
> Tracing 1 functions for "__filemap_fdatawait_range
> 
>      nsecs               : count     distribution
>        128 -> 255        : 0 |                                        |
>        256 -> 511        : 1 |********************                    |
>        512 -> 1023       : 2 |****************************************|
>       1024 -> 2047       : 2 |****************************************|
>       2048 -> 4095       : 0 |                                        |
>       4096 -> 8191       : 0 |                                        |
>       8192 -> 16383      : 0 |                                        |
>      16384 -> 32767      : 0 |                                        |
>      32768 -> 65535      : 0 |                                        |
>      65536 -> 131071     : 0 |                                        |
>     131072 -> 262143     : 0 |                                        |
>     262144 -> 524287     : 1 |********************                    |
> 
> avg = 83644 nsecs, total: 501866 nsecs, count: 6
> 
> Tracing 1 functions for "ext4_fc_commit
> 
>      nsecs               : count     distribution
>    1048576 -> 2097151    : 70 |                                        |
>    2097152 -> 4194303    : 1512 |****                                    |
>    4194304 -> 8388607    : 3436 |**********                              |
>    8388608 -> 16777215   : 6699 |********************                    |
>   16777216 -> 33554431   : 13349 |****************************************|
>   33554432 -> 67108863   : 1641 |****                                    |
> 
> avg = 18622010 nsecs, total: 505699318966 nsecs, count: 27156

Thanks for the data! So indeed, as I suspected, for your storage the
transaction commit (likely due to the flushing of disk caches) accounts for
the vast majority of the time. I guess I'll try to experiment with this a
bit to see if we can somehow achieve more batching in ext4_sync_file()...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

