* performance regression between 6.1.x and 5.15.x
@ 2023-05-08  9:24 Wang Yugui
  2023-05-08 14:46 ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-08  9:24 UTC (permalink / raw)
  To: linux-xfs

Hi,

I noticed a performance regression of xfs on 6.1.27/6.1.23
compared to xfs on 5.15.110.

It is not yet clear whether it is a problem of xfs or lvm2.

Any guidance on how to troubleshoot it?

test case:
  disk: NVMe PCIe3 SSD *4 
  LVM: raid0, default stripe size 64K.
  fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
   -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
   -directory=/mnt/test
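
For reference, the volume and filesystem were set up roughly like this
(a sketch only; the device, VG and LV names below are placeholders, not
the exact ones used here):

  vgcreate vg_test /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  lvcreate --type raid0 --stripes 4 --stripesize 64k -l 100%FREE -n lv_test vg_test
  mkfs.xfs /dev/vg_test/lv_test            # default mkfs options
  mount /dev/vg_test/lv_test /mnt/test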


6.1.27/6.1.23
fio bw=2623MiB/s (2750MB/s)
perf report:
Samples: 330K of event 'cycles', Event count (approx.): 120739812790
Overhead  Command  Shared Object        Symbol
  31.07%  fio      [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
   5.11%  fio      [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
   3.36%  fio      [kernel.kallsyms]    [k] asm_exc_nmi
   3.29%  fio      [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
   2.27%  fio      [kernel.kallsyms]    [k] iomap_write_begin
   2.18%  fio      [kernel.kallsyms]    [k] get_page_from_freelist
   2.11%  fio      [kernel.kallsyms]    [k] xas_load
   2.10%  fio      [kernel.kallsyms]    [k] xas_descend

5.15.110
fio bw=6796MiB/s (7126MB/s)
perf report:
Samples: 267K of event 'cycles', Event count (approx.): 186688803871
Overhead  Command  Shared Object       Symbol
  38.09%  fio      [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
   6.76%  fio      [kernel.kallsyms]   [k] iomap_set_range_uptodate
   4.40%  fio      [kernel.kallsyms]   [k] xas_load
   3.94%  fio      [kernel.kallsyms]   [k] get_page_from_freelist
   3.04%  fio      [kernel.kallsyms]   [k] asm_exc_nmi
   1.97%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
   1.88%  fio      [kernel.kallsyms]   [k] __pagevec_lru_add
   1.53%  fio      [kernel.kallsyms]   [k] iomap_write_begin
   1.53%  fio      [kernel.kallsyms]   [k] __add_to_page_cache_locked
   1.41%  fio      [kernel.kallsyms]   [k] xas_start


Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/08




* Re: performance regression between 6.1.x and 5.15.x
  2023-05-08  9:24 performance regression between 6.1.x and 5.15.x Wang Yugui
@ 2023-05-08 14:46 ` Wang Yugui
  2023-05-08 22:32   ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-08 14:46 UTC (permalink / raw)
  To: linux-xfs

Hi,

> Hi,
> 
> I noticed a performance regression of xfs 6.1.27/6.1.23,
> with the compare to xfs 5.15.110.
> 
> It is yet not clear whether  it is a problem of xfs or lvm2.
> 
> any guide to troubleshoot it?
> 
> test case:
>   disk: NVMe PCIe3 SSD *4 
>   LVM: raid0 default strip size 64K.
>   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
>    -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
>    -directory=/mnt/test
> 
> 
> 6.1.27/6.1.23
> fio bw=2623MiB/s (2750MB/s)
> perf report:
> Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> Overhead  Command  Shared Object        Symbol
>   31.07%  fio      [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
>    5.11%  fio      [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
>    3.36%  fio      [kernel.kallsyms]    [k] asm_exc_nmi
>    3.29%  fio      [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
>    2.27%  fio      [kernel.kallsyms]    [k] iomap_write_begin
>    2.18%  fio      [kernel.kallsyms]    [k] get_page_from_freelist
>    2.11%  fio      [kernel.kallsyms]    [k] xas_load
>    2.10%  fio      [kernel.kallsyms]    [k] xas_descend
> 
> 5.15.110
> fio bw=6796MiB/s (7126MB/s)
> perf report:
> Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> Overhead  Command  Shared Object       Symbol
>   38.09%  fio      [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
>    6.76%  fio      [kernel.kallsyms]   [k] iomap_set_range_uptodate
>    4.40%  fio      [kernel.kallsyms]   [k] xas_load
>    3.94%  fio      [kernel.kallsyms]   [k] get_page_from_freelist
>    3.04%  fio      [kernel.kallsyms]   [k] asm_exc_nmi
>    1.97%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
>    1.88%  fio      [kernel.kallsyms]   [k] __pagevec_lru_add
>    1.53%  fio      [kernel.kallsyms]   [k] iomap_write_begin
>    1.53%  fio      [kernel.kallsyms]   [k] __add_to_page_cache_locked
>    1.41%  fio      [kernel.kallsyms]   [k] xas_start
> 


more info:

1, 6.2.14 has the same performance as 6.1.x

2, 6.1.x fio performance detail:
Jobs: 4 (f=4): [W(4)][16.7%][w=10.2GiB/s][w=10.4k IOPS][eta 00m:15s]
Jobs: 4 (f=4): [W(4)][25.0%][w=9949MiB/s][w=9949 IOPS][eta 00m:12s] 
Jobs: 4 (f=4): [W(4)][31.2%][w=9618MiB/s][w=9618 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][37.5%][w=7970MiB/s][w=7970 IOPS][eta 00m:10s]
Jobs: 4 (f=4): [W(4)][41.2%][w=5048MiB/s][w=5047 IOPS][eta 00m:10s]
Jobs: 4 (f=4): [W(4)][42.1%][w=2489MiB/s][w=2488 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][42.9%][w=3227MiB/s][w=3226 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][45.5%][w=3622MiB/s][w=3622 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][47.8%][w=3651MiB/s][w=3650 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][52.2%][w=3435MiB/s][w=3435 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][52.0%][w=2464MiB/s][w=2463 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][53.8%][w=2438MiB/s][w=2438 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][55.6%][w=2435MiB/s][w=2434 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][57.1%][w=2449MiB/s][w=2448 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][60.7%][w=2422MiB/s][w=2421 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][62.1%][w=2457MiB/s][w=2457 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][63.3%][w=2436MiB/s][w=2436 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][64.5%][w=2432MiB/s][w=2431 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][67.7%][w=2440MiB/s][w=2440 IOPS][eta 00m:10s]
Jobs: 4 (f=4): [W(4)][71.0%][w=2437MiB/s][w=2437 IOPS][eta 00m:09s]
Jobs: 4 (f=4): [W(4)][74.2%][w=2442MiB/s][w=2442 IOPS][eta 00m:08s]
Jobs: 4 (f=4): [W(4)][77.4%][w=2425MiB/s][w=2424 IOPS][eta 00m:07s]
Jobs: 4 (f=4): [W(4)][80.6%][w=2459MiB/s][w=2459 IOPS][eta 00m:06s]
Jobs: 4 (f=4): [W(4)][86.7%][w=2428MiB/s][w=2427 IOPS][eta 00m:04s]
Jobs: 4 (f=4): [W(4)][90.0%][w=2441MiB/s][w=2440 IOPS][eta 00m:03s]
Jobs: 4 (f=4): [W(4)][93.3%][w=2438MiB/s][w=2437 IOPS][eta 00m:02s]
Jobs: 4 (f=4): [W(4)][96.7%][w=2450MiB/s][w=2449 IOPS][eta 00m:01s]
Jobs: 4 (f=4): [W(4)][100.0%][w=2430MiB/s][w=2430 IOPS][eta 00m:00s]
Jobs: 4 (f=4): [F(4)][100.0%][w=2372MiB/s][w=2372 IOPS][eta 00m:00s]

5.15 fio performance detail:
Jobs: 4 (f=4): [W(4)][14.3%][w=8563MiB/s][w=8563 IOPS][eta 00m:18s]
Jobs: 4 (f=4): [W(4)][18.2%][w=6376MiB/s][w=6375 IOPS][eta 00m:18s]
Jobs: 4 (f=4): [W(4)][20.8%][w=4566MiB/s][w=4565 IOPS][eta 00m:19s]
Jobs: 4 (f=4): [W(4)][23.1%][w=3947MiB/s][w=3947 IOPS][eta 00m:20s]
Jobs: 4 (f=4): [W(4)][25.9%][w=4601MiB/s][w=4601 IOPS][eta 00m:20s]
Jobs: 4 (f=4): [W(4)][28.6%][w=5797MiB/s][w=5796 IOPS][eta 00m:20s]
Jobs: 4 (f=4): [W(4)][32.1%][w=6802MiB/s][w=6801 IOPS][eta 00m:19s]
Jobs: 4 (f=4): [W(4)][35.7%][w=7411MiB/s][w=7411 IOPS][eta 00m:18s]
Jobs: 4 (f=4): [W(4)][40.7%][w=8445MiB/s][w=8444 IOPS][eta 00m:16s]
Jobs: 4 (f=4): [W(4)][46.2%][w=7992MiB/s][w=7992 IOPS][eta 00m:14s]
Jobs: 4 (f=4): [W(4)][52.0%][w=8118MiB/s][w=8117 IOPS][eta 00m:12s]
Jobs: 4 (f=4): [W(4)][56.0%][w=7742MiB/s][w=7741 IOPS][eta 00m:11s]
Jobs: 4 (f=4): [W(4)][62.5%][w=7497MiB/s][w=7496 IOPS][eta 00m:09s]
Jobs: 4 (f=4): [W(4)][66.7%][w=7248MiB/s][w=7248 IOPS][eta 00m:08s]
Jobs: 4 (f=4): [W(4)][70.8%][w=7461MiB/s][w=7460 IOPS][eta 00m:07s]
Jobs: 4 (f=4): [W(4)][75.0%][w=7959MiB/s][w=7959 IOPS][eta 00m:06s]
Jobs: 4 (f=4): [W(3),F(1)][79.2%][w=6982MiB/s][w=6982 IOPS][eta 00m:05s]
Jobs: 1 (f=1): [_(2),W(1),_(1)][87.0%][w=2809MiB/s][w=2808 IOPS][eta 00m:03s]
Jobs: 1 (f=1): [_(2),W(1),_(1)][95.5%][w=2669MiB/s][w=2668 IOPS][eta 00m:01s]
Jobs: 1 (f=1): [_(2),F(1),_(1)][100.0%][w=2552MiB/s][w=2552 IOPS][eta 00m:00s]


3, 'sysctl -a | grep dirty' on both 6.1.x and 5.15.x:
vm.dirty_background_bytes = 1073741824
vm.dirty_background_ratio = 0
vm.dirty_bytes = 8589934592
vm.dirty_expire_centisecs = 600
vm.dirty_ratio = 0
vm.dirty_writeback_centisecs = 200
vm.dirtytime_expire_seconds = 43200

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/08




* Re: performance regression between 6.1.x and 5.15.x
  2023-05-08 14:46 ` Wang Yugui
@ 2023-05-08 22:32   ` Dave Chinner
  2023-05-08 23:25     ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-05-08 22:32 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-xfs

On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> Hi,
> 
> > Hi,
> > 
> > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > with the compare to xfs 5.15.110.
> > 
> > It is yet not clear whether  it is a problem of xfs or lvm2.
> > 
> > any guide to troubleshoot it?
> > 
> > test case:
> >   disk: NVMe PCIe3 SSD *4 
> >   LVM: raid0 default strip size 64K.
> >   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> >    -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> >    -directory=/mnt/test
> > 
> > 
> > 6.1.27/6.1.23
> > fio bw=2623MiB/s (2750MB/s)
> > perf report:
> > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > Overhead  Command  Shared Object        Symbol
> >   31.07%  fio      [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
> >    5.11%  fio      [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
> >    3.36%  fio      [kernel.kallsyms]    [k] asm_exc_nmi
> >    3.29%  fio      [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
> >    2.27%  fio      [kernel.kallsyms]    [k] iomap_write_begin
> >    2.18%  fio      [kernel.kallsyms]    [k] get_page_from_freelist
> >    2.11%  fio      [kernel.kallsyms]    [k] xas_load
> >    2.10%  fio      [kernel.kallsyms]    [k] xas_descend
> > 
> > 5.15.110
> > fio bw=6796MiB/s (7126MB/s)
> > perf report:
> > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > Overhead  Command  Shared Object       Symbol
> >   38.09%  fio      [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
> >    6.76%  fio      [kernel.kallsyms]   [k] iomap_set_range_uptodate
> >    4.40%  fio      [kernel.kallsyms]   [k] xas_load
> >    3.94%  fio      [kernel.kallsyms]   [k] get_page_from_freelist
> >    3.04%  fio      [kernel.kallsyms]   [k] asm_exc_nmi
> >    1.97%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
> >    1.88%  fio      [kernel.kallsyms]   [k] __pagevec_lru_add
> >    1.53%  fio      [kernel.kallsyms]   [k] iomap_write_begin
> >    1.53%  fio      [kernel.kallsyms]   [k] __add_to_page_cache_locked
> >    1.41%  fio      [kernel.kallsyms]   [k] xas_start

Because you are testing buffered IO, you need to run perf across all
CPUs and tasks, not just the fio process so that it captures the
profile of memory reclaim and writeback that is being performed by
the kernel.
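
Something like the following (just a sketch; adjust the duration so it
covers the whole fio run) will capture a system-wide profile:

  # profile all CPUs and all tasks while the fio job is running
  perf record -a -g -o perf.data.all -- sleep 30
  perf report -i perf.data.all --sort comm,dso,symbol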

> more info:
> 
> 1, 6.2.14 have same performance as 6.1.x
> 
> 2,  6.1.x fio performance detail:
> Jobs: 4 (f=4): [W(4)][16.7%][w=10.2GiB/s][w=10.4k IOPS][eta 00m:15s]
> Jobs: 4 (f=4): [W(4)][25.0%][w=9949MiB/s][w=9949 IOPS][eta 00m:12s] 
> Jobs: 4 (f=4): [W(4)][31.2%][w=9618MiB/s][w=9618 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][37.5%][w=7970MiB/s][w=7970 IOPS][eta 00m:10s]
> Jobs: 4 (f=4): [W(4)][41.2%][w=5048MiB/s][w=5047 IOPS][eta 00m:10s]
> Jobs: 4 (f=4): [W(4)][42.1%][w=2489MiB/s][w=2488 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][42.9%][w=3227MiB/s][w=3226 IOPS][eta 00m:12s]
> Jobs: 4 (f=4): [W(4)][45.5%][w=3622MiB/s][w=3622 IOPS][eta 00m:12s]
> Jobs: 4 (f=4): [W(4)][47.8%][w=3651MiB/s][w=3650 IOPS][eta 00m:12s]
> Jobs: 4 (f=4): [W(4)][52.2%][w=3435MiB/s][w=3435 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][52.0%][w=2464MiB/s][w=2463 IOPS][eta 00m:12s]
> Jobs: 4 (f=4): [W(4)][53.8%][w=2438MiB/s][w=2438 IOPS][eta 00m:12s]

Looks like it's throttled on dirty pages at this point.

How much memory does your test system have, and what does changing
the writeback throttling thresholds do? What's the numa layout?


Watching the stats in /proc/meminfo would be useful. I tend to use
Performance Co-Pilot (PCP) to collect these sorts of stats and plot
them in real time so I can see how the state of the machine is
changing as the test is running....
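
If you don't have PCP handy, even something as crude as this (the sample
interval is arbitrary) will show whether you're bouncing off the dirty
thresholds:

  # watch dirty/writeback page counts while the test runs
  while sleep 1; do
      grep -E '^(Dirty|Writeback):' /proc/meminfo
  done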


> Jobs: 4 (f=4): [W(4)][55.6%][w=2435MiB/s][w=2434 IOPS][eta 00m:12s]
> Jobs: 4 (f=4): [W(4)][57.1%][w=2449MiB/s][w=2448 IOPS][eta 00m:12s]
> Jobs: 4 (f=4): [W(4)][60.7%][w=2422MiB/s][w=2421 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][62.1%][w=2457MiB/s][w=2457 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][63.3%][w=2436MiB/s][w=2436 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][64.5%][w=2432MiB/s][w=2431 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][67.7%][w=2440MiB/s][w=2440 IOPS][eta 00m:10s]
> Jobs: 4 (f=4): [W(4)][71.0%][w=2437MiB/s][w=2437 IOPS][eta 00m:09s]
> Jobs: 4 (f=4): [W(4)][74.2%][w=2442MiB/s][w=2442 IOPS][eta 00m:08s]
> Jobs: 4 (f=4): [W(4)][77.4%][w=2425MiB/s][w=2424 IOPS][eta 00m:07s]
> Jobs: 4 (f=4): [W(4)][80.6%][w=2459MiB/s][w=2459 IOPS][eta 00m:06s]
> Jobs: 4 (f=4): [W(4)][86.7%][w=2428MiB/s][w=2427 IOPS][eta 00m:04s]
> Jobs: 4 (f=4): [W(4)][90.0%][w=2441MiB/s][w=2440 IOPS][eta 00m:03s]
> Jobs: 4 (f=4): [W(4)][93.3%][w=2438MiB/s][w=2437 IOPS][eta 00m:02s]
> Jobs: 4 (f=4): [W(4)][96.7%][w=2450MiB/s][w=2449 IOPS][eta 00m:01s]
> Jobs: 4 (f=4): [W(4)][100.0%][w=2430MiB/s][w=2430 IOPS][eta 00m:00s]
> Jobs: 4 (f=4): [F(4)][100.0%][w=2372MiB/s][w=2372 IOPS][eta 00m:00s]

fsync at the end is instant, which indicates that writing into the
kernel is almost certainly being throttled.

> 5.15 fio performance detail:
> Jobs: 4 (f=4): [W(4)][14.3%][w=8563MiB/s][w=8563 IOPS][eta 00m:18s]
> Jobs: 4 (f=4): [W(4)][18.2%][w=6376MiB/s][w=6375 IOPS][eta 00m:18s]
> Jobs: 4 (f=4): [W(4)][20.8%][w=4566MiB/s][w=4565 IOPS][eta 00m:19s]
> Jobs: 4 (f=4): [W(4)][23.1%][w=3947MiB/s][w=3947 IOPS][eta 00m:20s]

So throttling starts and perf drops...

> Jobs: 4 (f=4): [W(4)][25.9%][w=4601MiB/s][w=4601 IOPS][eta 00m:20s]
> Jobs: 4 (f=4): [W(4)][28.6%][w=5797MiB/s][w=5796 IOPS][eta 00m:20s]
> Jobs: 4 (f=4): [W(4)][32.1%][w=6802MiB/s][w=6801 IOPS][eta 00m:19s]
> Jobs: 4 (f=4): [W(4)][35.7%][w=7411MiB/s][w=7411 IOPS][eta 00m:18s]
> Jobs: 4 (f=4): [W(4)][40.7%][w=8445MiB/s][w=8444 IOPS][eta 00m:16s]

.... and then it picks back up....

> Jobs: 4 (f=4): [W(4)][46.2%][w=7992MiB/s][w=7992 IOPS][eta 00m:14s]
> Jobs: 4 (f=4): [W(4)][52.0%][w=8118MiB/s][w=8117 IOPS][eta 00m:12s]
> Jobs: 4 (f=4): [W(4)][56.0%][w=7742MiB/s][w=7741 IOPS][eta 00m:11s]
> Jobs: 4 (f=4): [W(4)][62.5%][w=7497MiB/s][w=7496 IOPS][eta 00m:09s]
> Jobs: 4 (f=4): [W(4)][66.7%][w=7248MiB/s][w=7248 IOPS][eta 00m:08s]
> Jobs: 4 (f=4): [W(4)][70.8%][w=7461MiB/s][w=7460 IOPS][eta 00m:07s]
> Jobs: 4 (f=4): [W(4)][75.0%][w=7959MiB/s][w=7959 IOPS][eta 00m:06s]
> Jobs: 4 (f=4): [W(3),F(1)][79.2%][w=6982MiB/s][w=6982 IOPS][eta 00m:05s]
> Jobs: 1 (f=1): [_(2),W(1),_(1)][87.0%][w=2809MiB/s][w=2808 IOPS][eta 00m:03s]
> Jobs: 1 (f=1): [_(2),W(1),_(1)][95.5%][w=2669MiB/s][w=2668 IOPS][eta 00m:01s]
> Jobs: 1 (f=1): [_(2),F(1),_(1)][100.0%][w=2552MiB/s][w=2552 IOPS][eta 00m:00s]

I suspect that the likely culprit is mm-level changes - the
page reclaim algorithm was completely replaced in 6.1 with a
multi-generation LRU that will have different cache footprint
behaviour in exactly this sort of "repeatedly over-write same files
in a set that are significantly larger than memory" micro-benchmark.

i.e. these commits:

07017acb0601 mm: multi-gen LRU: admin guide
d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
1332a809d95a mm: multi-gen LRU: thrashing prevention
354ed5974429 mm: multi-gen LRU: kill switch
f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
bd74fdaea146 mm: multi-gen LRU: support page table walks
018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
ac35a4902374 mm: multi-gen LRU: minimal implementation
ec1c86b25f4b mm: multi-gen LRU: groundwork

If that's the case, I'd expect kernels up to 6.0 to demonstrate the
same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
as you've reported.

So I'm thinking that the best thing to do is confirm that the change
of behaviour is a result of the multi-gen LRU changes. If it is,
then it's up to the multi-gen LRU developers to determine how to fix
it.  If it's not the multi-gen LRU, we'll have to keep digging.
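
If the 6.1.x kernel was built with CONFIG_LRU_GEN, a quick first check
(roughly - note the current value before restoring it) is the runtime
kill switch:

  # see whether the multi-gen LRU is present and enabled
  cat /sys/kernel/mm/lru_gen/enabled
  # disable it, re-run the fio job, then re-enable it
  echo n > /sys/kernel/mm/lru_gen/enabled
  echo y > /sys/kernel/mm/lru_gen/enabled

If bandwidth recovers with it disabled, that points straight at the
multi-gen LRU; if not, we keep looking.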

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: performance regression between 6.1.x and 5.15.x
  2023-05-08 22:32   ` Dave Chinner
@ 2023-05-08 23:25     ` Wang Yugui
  2023-05-09  1:36       ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-08 23:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi,

> On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > > Hi,
> > > 
> > > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > > with the compare to xfs 5.15.110.
> > > 
> > > It is yet not clear whether  it is a problem of xfs or lvm2.
> > > 
> > > any guide to troubleshoot it?
> > > 
> > > test case:
> > >   disk: NVMe PCIe3 SSD *4 
> > >   LVM: raid0 default strip size 64K.
> > >   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > >    -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > >    -directory=/mnt/test
> > > 
> > > 
> > > 6.1.27/6.1.23
> > > fio bw=2623MiB/s (2750MB/s)
> > > perf report:
> > > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > > Overhead  Command  Shared Object        Symbol
> > >   31.07%  fio      [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
> > >    5.11%  fio      [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
> > >    3.36%  fio      [kernel.kallsyms]    [k] asm_exc_nmi
> > >    3.29%  fio      [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
> > >    2.27%  fio      [kernel.kallsyms]    [k] iomap_write_begin
> > >    2.18%  fio      [kernel.kallsyms]    [k] get_page_from_freelist
> > >    2.11%  fio      [kernel.kallsyms]    [k] xas_load
> > >    2.10%  fio      [kernel.kallsyms]    [k] xas_descend
> > > 
> > > 5.15.110
> > > fio bw=6796MiB/s (7126MB/s)
> > > perf report:
> > > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > > Overhead  Command  Shared Object       Symbol
> > >   38.09%  fio      [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
> > >    6.76%  fio      [kernel.kallsyms]   [k] iomap_set_range_uptodate
> > >    4.40%  fio      [kernel.kallsyms]   [k] xas_load
> > >    3.94%  fio      [kernel.kallsyms]   [k] get_page_from_freelist
> > >    3.04%  fio      [kernel.kallsyms]   [k] asm_exc_nmi
> > >    1.97%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
> > >    1.88%  fio      [kernel.kallsyms]   [k] __pagevec_lru_add
> > >    1.53%  fio      [kernel.kallsyms]   [k] iomap_write_begin
> > >    1.53%  fio      [kernel.kallsyms]   [k] __add_to_page_cache_locked
> > >    1.41%  fio      [kernel.kallsyms]   [k] xas_start
> 
> Because you are testing buffered IO, you need to run perf across all
> CPUs and tasks, not just the fio process so that it captures the
> profile of memory reclaim and writeback that is being performed by
> the kernel.

'perf report' across all CPUs:
Samples: 211K of event 'cycles', Event count (approx.): 56590727219
Overhead  Command          Shared Object            Symbol
  16.29%  fio              [kernel.kallsyms]        [k] rep_movs_alternative
   3.38%  kworker/u98:1+f  [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
   3.11%  fio              [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
   3.05%  swapper          [kernel.kallsyms]        [k] intel_idle
   2.63%  fio              [kernel.kallsyms]        [k] get_page_from_freelist
   2.33%  fio              [kernel.kallsyms]        [k] asm_exc_nmi
   2.26%  kworker/u98:1+f  [kernel.kallsyms]        [k] __folio_start_writeback
   1.40%  fio              [kernel.kallsyms]        [k] __filemap_add_folio
   1.37%  fio              [kernel.kallsyms]        [k] lru_add_fn
   1.35%  fio              [kernel.kallsyms]        [k] xas_load
   1.33%  fio              [kernel.kallsyms]        [k] iomap_write_begin
   1.31%  fio              [kernel.kallsyms]        [k] xas_descend
   1.19%  kworker/u98:1+f  [kernel.kallsyms]        [k] folio_clear_dirty_for_io
   1.07%  fio              [kernel.kallsyms]        [k] folio_add_lru
   1.01%  fio              [kernel.kallsyms]        [k] __folio_mark_dirty
   1.00%  kworker/u98:1+f  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave

and 'top' shows that 'kworker/u98:1' has over 80% CPU usage.


> > more info:
> > 
> > 1, 6.2.14 have same performance as 6.1.x
> > 
> > 2,  6.1.x fio performance detail:
> > Jobs: 4 (f=4): [W(4)][16.7%][w=10.2GiB/s][w=10.4k IOPS][eta 00m:15s]
> > Jobs: 4 (f=4): [W(4)][25.0%][w=9949MiB/s][w=9949 IOPS][eta 00m:12s] 
> > Jobs: 4 (f=4): [W(4)][31.2%][w=9618MiB/s][w=9618 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][37.5%][w=7970MiB/s][w=7970 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][41.2%][w=5048MiB/s][w=5047 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][42.1%][w=2489MiB/s][w=2488 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][42.9%][w=3227MiB/s][w=3226 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][45.5%][w=3622MiB/s][w=3622 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][47.8%][w=3651MiB/s][w=3650 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][52.2%][w=3435MiB/s][w=3435 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][52.0%][w=2464MiB/s][w=2463 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][53.8%][w=2438MiB/s][w=2438 IOPS][eta 00m:12s]
> 
> Looks like it's throttled on dirty pages at this point.
> 
> How much memory does your test system have, and what does changing
> the writeback throttling thresholds do? What's the numa layout?

This is a Dell server with two E5-2680 v2 CPUs configured as two NUMA nodes.

There is 192G of memory in this server:
128G of memory is installed on CPU1,
 64G of memory is installed on CPU2.

and sysctl config:
vm.dirty_background_bytes = 1073741824  # 1G
vm.dirty_background_ratio = 0
vm.dirty_bytes = 8589934592 # 8G
vm.dirty_ratio = 0

and the NVMe SSD *4 are connected to one NVMe HBA on CPU1.
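
This layout can be confirmed with something like the following (the exact
device names will differ):

  numactl --hardware                          # node sizes and CPU lists
  cat /sys/class/nvme/nvme0/device/numa_node  # NUMA node of an NVMe controller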

> Watching the stats in /proc/meminfo would be useful. I tend to use
> Performance Co-Pilot (PCP) to collect these sorts of stats and plot
> them in real time so I can see how the state of the machine is
> changing as the test is running....
> 
> 
> > Jobs: 4 (f=4): [W(4)][55.6%][w=2435MiB/s][w=2434 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][57.1%][w=2449MiB/s][w=2448 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][60.7%][w=2422MiB/s][w=2421 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][62.1%][w=2457MiB/s][w=2457 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][63.3%][w=2436MiB/s][w=2436 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][64.5%][w=2432MiB/s][w=2431 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][67.7%][w=2440MiB/s][w=2440 IOPS][eta 00m:10s]
> > Jobs: 4 (f=4): [W(4)][71.0%][w=2437MiB/s][w=2437 IOPS][eta 00m:09s]
> > Jobs: 4 (f=4): [W(4)][74.2%][w=2442MiB/s][w=2442 IOPS][eta 00m:08s]
> > Jobs: 4 (f=4): [W(4)][77.4%][w=2425MiB/s][w=2424 IOPS][eta 00m:07s]
> > Jobs: 4 (f=4): [W(4)][80.6%][w=2459MiB/s][w=2459 IOPS][eta 00m:06s]
> > Jobs: 4 (f=4): [W(4)][86.7%][w=2428MiB/s][w=2427 IOPS][eta 00m:04s]
> > Jobs: 4 (f=4): [W(4)][90.0%][w=2441MiB/s][w=2440 IOPS][eta 00m:03s]
> > Jobs: 4 (f=4): [W(4)][93.3%][w=2438MiB/s][w=2437 IOPS][eta 00m:02s]
> > Jobs: 4 (f=4): [W(4)][96.7%][w=2450MiB/s][w=2449 IOPS][eta 00m:01s]
> > Jobs: 4 (f=4): [W(4)][100.0%][w=2430MiB/s][w=2430 IOPS][eta 00m:00s]
> > Jobs: 4 (f=4): [F(4)][100.0%][w=2372MiB/s][w=2372 IOPS][eta 00m:00s]
> 
> fsync at the end is instant, which indicates that writing into the
> kernel is almost certainly being throttled.
> 
> > 5.15 fio performance detail:
> > Jobs: 4 (f=4): [W(4)][14.3%][w=8563MiB/s][w=8563 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][18.2%][w=6376MiB/s][w=6375 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][20.8%][w=4566MiB/s][w=4565 IOPS][eta 00m:19s]
> > Jobs: 4 (f=4): [W(4)][23.1%][w=3947MiB/s][w=3947 IOPS][eta 00m:20s]
> 
> So throttling starts and perf drops...
> 
> > Jobs: 4 (f=4): [W(4)][25.9%][w=4601MiB/s][w=4601 IOPS][eta 00m:20s]
> > Jobs: 4 (f=4): [W(4)][28.6%][w=5797MiB/s][w=5796 IOPS][eta 00m:20s]
> > Jobs: 4 (f=4): [W(4)][32.1%][w=6802MiB/s][w=6801 IOPS][eta 00m:19s]
> > Jobs: 4 (f=4): [W(4)][35.7%][w=7411MiB/s][w=7411 IOPS][eta 00m:18s]
> > Jobs: 4 (f=4): [W(4)][40.7%][w=8445MiB/s][w=8444 IOPS][eta 00m:16s]
> 
> .... and then it picks back up....
> 
> > Jobs: 4 (f=4): [W(4)][46.2%][w=7992MiB/s][w=7992 IOPS][eta 00m:14s]
> > Jobs: 4 (f=4): [W(4)][52.0%][w=8118MiB/s][w=8117 IOPS][eta 00m:12s]
> > Jobs: 4 (f=4): [W(4)][56.0%][w=7742MiB/s][w=7741 IOPS][eta 00m:11s]
> > Jobs: 4 (f=4): [W(4)][62.5%][w=7497MiB/s][w=7496 IOPS][eta 00m:09s]
> > Jobs: 4 (f=4): [W(4)][66.7%][w=7248MiB/s][w=7248 IOPS][eta 00m:08s]
> > Jobs: 4 (f=4): [W(4)][70.8%][w=7461MiB/s][w=7460 IOPS][eta 00m:07s]
> > Jobs: 4 (f=4): [W(4)][75.0%][w=7959MiB/s][w=7959 IOPS][eta 00m:06s]
> > Jobs: 4 (f=4): [W(3),F(1)][79.2%][w=6982MiB/s][w=6982 IOPS][eta 00m:05s]
> > Jobs: 1 (f=1): [_(2),W(1),_(1)][87.0%][w=2809MiB/s][w=2808 IOPS][eta 00m:03s]
> > Jobs: 1 (f=1): [_(2),W(1),_(1)][95.5%][w=2669MiB/s][w=2668 IOPS][eta 00m:01s]
> > Jobs: 1 (f=1): [_(2),F(1),_(1)][100.0%][w=2552MiB/s][w=2552 IOPS][eta 00m:00s]
> 
> I suspect that the likely culprit is mm-level changes - the
> page reclaim algorithm was completely replaced in 6.1 with a
> multi-generation LRU that will have different cache footprint
> behaviour in exactly this sort of "repeatedly over-write same files
> in a set that are significantly larger than memory" micro-benchmark.
> 
> i.e. these commits:
> 
> 07017acb0601 mm: multi-gen LRU: admin guide
> d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
> 1332a809d95a mm: multi-gen LRU: thrashing prevention
> 354ed5974429 mm: multi-gen LRU: kill switch
> f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
> bd74fdaea146 mm: multi-gen LRU: support page table walks
> 018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
> ac35a4902374 mm: multi-gen LRU: minimal implementation
> ec1c86b25f4b mm: multi-gen LRU: groundwork
> 
> If that's the case, I'd expect kernels up to 6.0 to demonstrate the
> same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
> as you've reported.

I tested 6.4.0-rc1; the performance became a little worse.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/09

> So I'm thinking that the best thing to do is confirm that the change
> of behaviour is a result of the multi-gen LRU changes. If it is,
> then it's up to the multi-gen LRU developers to determine how to fix
> it.  If it's not the multi-gen LRU, we'll have to kep digging.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com




* Re: performance regression between 6.1.x and 5.15.x
  2023-05-08 23:25     ` Wang Yugui
@ 2023-05-09  1:36       ` Dave Chinner
  2023-05-09 12:37         ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-05-09  1:36 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-xfs

On Tue, May 09, 2023 at 07:25:53AM +0800, Wang Yugui wrote:
> Hi,
> 
> > On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > > Hi,
> > > 
> > > > Hi,
> > > > 
> > > > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > > > with the compare to xfs 5.15.110.
> > > > 
> > > > It is yet not clear whether  it is a problem of xfs or lvm2.
> > > > 
> > > > any guide to troubleshoot it?
> > > > 
> > > > test case:
> > > >   disk: NVMe PCIe3 SSD *4 
> > > >   LVM: raid0 default strip size 64K.
> > > >   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > >    -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > >    -directory=/mnt/test
> > > > 
> > > > 
> > > > 6.1.27/6.1.23
> > > > fio bw=2623MiB/s (2750MB/s)
> > > > perf report:
> > > > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > > > Overhead  Command  Shared Object        Symbol
> > > >   31.07%  fio      [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
> > > >    5.11%  fio      [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
> > > >    3.36%  fio      [kernel.kallsyms]    [k] asm_exc_nmi
> > > >    3.29%  fio      [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
> > > >    2.27%  fio      [kernel.kallsyms]    [k] iomap_write_begin
> > > >    2.18%  fio      [kernel.kallsyms]    [k] get_page_from_freelist
> > > >    2.11%  fio      [kernel.kallsyms]    [k] xas_load
> > > >    2.10%  fio      [kernel.kallsyms]    [k] xas_descend
> > > > 
> > > > 5.15.110
> > > > fio bw=6796MiB/s (7126MB/s)
> > > > perf report:
> > > > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > > > Overhead  Command  Shared Object       Symbol
> > > >   38.09%  fio      [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
> > > >    6.76%  fio      [kernel.kallsyms]   [k] iomap_set_range_uptodate
> > > >    4.40%  fio      [kernel.kallsyms]   [k] xas_load
> > > >    3.94%  fio      [kernel.kallsyms]   [k] get_page_from_freelist
> > > >    3.04%  fio      [kernel.kallsyms]   [k] asm_exc_nmi
> > > >    1.97%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
> > > >    1.88%  fio      [kernel.kallsyms]   [k] __pagevec_lru_add
> > > >    1.53%  fio      [kernel.kallsyms]   [k] iomap_write_begin
> > > >    1.53%  fio      [kernel.kallsyms]   [k] __add_to_page_cache_locked
> > > >    1.41%  fio      [kernel.kallsyms]   [k] xas_start
> > 
> > Because you are testing buffered IO, you need to run perf across all
> > CPUs and tasks, not just the fio process so that it captures the
> > profile of memory reclaim and writeback that is being performed by
> > the kernel.
> 
> 'perf report' of all CPU.
> Samples: 211K of event 'cycles', Event count (approx.): 56590727219
> Overhead  Command          Shared Object            Symbol
>   16.29%  fio              [kernel.kallsyms]        [k] rep_movs_alternative
>    3.38%  kworker/u98:1+f  [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
>    3.11%  fio              [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
>    3.05%  swapper          [kernel.kallsyms]        [k] intel_idle
>    2.63%  fio              [kernel.kallsyms]        [k] get_page_from_freelist
>    2.33%  fio              [kernel.kallsyms]        [k] asm_exc_nmi
>    2.26%  kworker/u98:1+f  [kernel.kallsyms]        [k] __folio_start_writeback
>    1.40%  fio              [kernel.kallsyms]        [k] __filemap_add_folio
>    1.37%  fio              [kernel.kallsyms]        [k] lru_add_fn
>    1.35%  fio              [kernel.kallsyms]        [k] xas_load
>    1.33%  fio              [kernel.kallsyms]        [k] iomap_write_begin
>    1.31%  fio              [kernel.kallsyms]        [k] xas_descend
>    1.19%  kworker/u98:1+f  [kernel.kallsyms]        [k] folio_clear_dirty_for_io
>    1.07%  fio              [kernel.kallsyms]        [k] folio_add_lru
>    1.01%  fio              [kernel.kallsyms]        [k] __folio_mark_dirty
>    1.00%  kworker/u98:1+f  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
> 
> and 'top' show that 'kworker/u98:1' have over 80% CPU usage.

Can you provide an expanded callgraph profile for both the good and
bad kernels showing the CPU used in the fio write() path and the
kworker-based writeback path?

[ The test machine I had that could reproduce this sort of
performance anomaly went bad a month ago, so I have no hardware
available to me right now to reproduce this behaviour locally.
Hence I'll need you to do the profiling I need to understand the
regression for me. ]

> > I suspect that the likely culprit is mm-level changes - the
> > page reclaim algorithm was completely replaced in 6.1 with a
> > multi-generation LRU that will have different cache footprint
> > behaviour in exactly this sort of "repeatedly over-write same files
> > in a set that are significantly larger than memory" micro-benchmark.
> > 
> > i.e. these commits:
> > 
> > 07017acb0601 mm: multi-gen LRU: admin guide
> > d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
> > 1332a809d95a mm: multi-gen LRU: thrashing prevention
> > 354ed5974429 mm: multi-gen LRU: kill switch
> > f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
> > bd74fdaea146 mm: multi-gen LRU: support page table walks
> > 018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
> > ac35a4902374 mm: multi-gen LRU: minimal implementation
> > ec1c86b25f4b mm: multi-gen LRU: groundwork
> > 
> > If that's the case, I'd expect kernels up to 6.0 to demonstrate the
> > same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
> > as you've reported.
> 
> I tested 6.4.0-rc1. the performance become a little worse.

Thanks, that's as I expected.

Which means that the interesting kernel version to check now is a
6.0.x kernel. If it has the same perf as 5.15.x, then compare the
commit just before the multi-gen LRU was introduced against the
commit just after it, to see if that is the functionality that
introduced the regression....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: performance regression between 6.1.x and 5.15.x
  2023-05-09  1:36       ` Dave Chinner
@ 2023-05-09 12:37         ` Wang Yugui
  2023-05-09 22:14           ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-09 12:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi,

> On Tue, May 09, 2023 at 07:25:53AM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > > On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > > > Hi,
> > > > 
> > > > > Hi,
> > > > > 
> > > > > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > > > > with the compare to xfs 5.15.110.
> > > > > 
> > > > > It is yet not clear whether  it is a problem of xfs or lvm2.
> > > > > 
> > > > > any guide to troubleshoot it?
> > > > > 
> > > > > test case:
> > > > >   disk: NVMe PCIe3 SSD *4 
> > > > >   LVM: raid0 default strip size 64K.
> > > > >   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > > >    -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > > >    -directory=/mnt/test
> > > > > 
> > > > > 
> > > > > 6.1.27/6.1.23
> > > > > fio bw=2623MiB/s (2750MB/s)
> > > > > perf report:
> > > > > Samples: 330K of event 'cycles', Event count (approx.): 120739812790
> > > > > Overhead  Command  Shared Object        Symbol
> > > > >   31.07%  fio      [kernel.kallsyms]    [k] copy_user_enhanced_fast_string
> > > > >    5.11%  fio      [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
> > > > >    3.36%  fio      [kernel.kallsyms]    [k] asm_exc_nmi
> > > > >    3.29%  fio      [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
> > > > >    2.27%  fio      [kernel.kallsyms]    [k] iomap_write_begin
> > > > >    2.18%  fio      [kernel.kallsyms]    [k] get_page_from_freelist
> > > > >    2.11%  fio      [kernel.kallsyms]    [k] xas_load
> > > > >    2.10%  fio      [kernel.kallsyms]    [k] xas_descend
> > > > > 
> > > > > 5.15.110
> > > > > fio bw=6796MiB/s (7126MB/s)
> > > > > perf report:
> > > > > Samples: 267K of event 'cycles', Event count (approx.): 186688803871
> > > > > Overhead  Command  Shared Object       Symbol
> > > > >   38.09%  fio      [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
> > > > >    6.76%  fio      [kernel.kallsyms]   [k] iomap_set_range_uptodate
> > > > >    4.40%  fio      [kernel.kallsyms]   [k] xas_load
> > > > >    3.94%  fio      [kernel.kallsyms]   [k] get_page_from_freelist
> > > > >    3.04%  fio      [kernel.kallsyms]   [k] asm_exc_nmi
> > > > >    1.97%  fio      [kernel.kallsyms]   [k] native_queued_spin_lock_slowpath
> > > > >    1.88%  fio      [kernel.kallsyms]   [k] __pagevec_lru_add
> > > > >    1.53%  fio      [kernel.kallsyms]   [k] iomap_write_begin
> > > > >    1.53%  fio      [kernel.kallsyms]   [k] __add_to_page_cache_locked
> > > > >    1.41%  fio      [kernel.kallsyms]   [k] xas_start
> > > 
> > > Because you are testing buffered IO, you need to run perf across all
> > > CPUs and tasks, not just the fio process so that it captures the
> > > profile of memory reclaim and writeback that is being performed by
> > > the kernel.
> > 
> > 'perf report' of all CPU.
> > Samples: 211K of event 'cycles', Event count (approx.): 56590727219
> > Overhead  Command          Shared Object            Symbol
> >   16.29%  fio              [kernel.kallsyms]        [k] rep_movs_alternative
> >    3.38%  kworker/u98:1+f  [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
> >    3.11%  fio              [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
> >    3.05%  swapper          [kernel.kallsyms]        [k] intel_idle
> >    2.63%  fio              [kernel.kallsyms]        [k] get_page_from_freelist
> >    2.33%  fio              [kernel.kallsyms]        [k] asm_exc_nmi
> >    2.26%  kworker/u98:1+f  [kernel.kallsyms]        [k] __folio_start_writeback
> >    1.40%  fio              [kernel.kallsyms]        [k] __filemap_add_folio
> >    1.37%  fio              [kernel.kallsyms]        [k] lru_add_fn
> >    1.35%  fio              [kernel.kallsyms]        [k] xas_load
> >    1.33%  fio              [kernel.kallsyms]        [k] iomap_write_begin
> >    1.31%  fio              [kernel.kallsyms]        [k] xas_descend
> >    1.19%  kworker/u98:1+f  [kernel.kallsyms]        [k] folio_clear_dirty_for_io
> >    1.07%  fio              [kernel.kallsyms]        [k] folio_add_lru
> >    1.01%  fio              [kernel.kallsyms]        [k] __folio_mark_dirty
> >    1.00%  kworker/u98:1+f  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
> > 
> > and 'top' show that 'kworker/u98:1' have over 80% CPU usage.
> 
> Can you provide an expanded callgraph profile for both the good and
> bad kernels showing the CPU used in the fio write() path and the
> kworker-based writeback path?

I'm sorry, could you give some more detailed guidance on gathering that
info for this test?

The test machine here is already reserved for this testing.

> [ The test machine I have that I could reproduce this sort of
> performance anomoly went bad a month ago, so I have no hardware
> available to me right now to reproduce this behaviour locally.
> Hence I'll need you to do the profiling I need to understand the
> regression for me. ]
> 
> > > I suspect that the likely culprit is mm-level changes - the
> > > page reclaim algorithm was completely replaced in 6.1 with a
> > > multi-generation LRU that will have different cache footprint
> > > behaviour in exactly this sort of "repeatedly over-write same files
> > > in a set that are significantly larger than memory" micro-benchmark.
> > > 
> > > i.e. these commits:
> > > 
> > > 07017acb0601 mm: multi-gen LRU: admin guide
> > > d6c3af7d8a2b mm: multi-gen LRU: debugfs interface
> > > 1332a809d95a mm: multi-gen LRU: thrashing prevention
> > > 354ed5974429 mm: multi-gen LRU: kill switch
> > > f76c83378851 mm: multi-gen LRU: optimize multiple memcgs
> > > bd74fdaea146 mm: multi-gen LRU: support page table walks
> > > 018ee47f1489 mm: multi-gen LRU: exploit locality in rmap
> > > ac35a4902374 mm: multi-gen LRU: minimal implementation
> > > ec1c86b25f4b mm: multi-gen LRU: groundwork
> > > 
> > > If that's the case, I'd expect kernels up to 6.0 to demonstrate the
> > > same behaviour as 5.15, and 6.1+ to demonstrate the same behaviour
> > > as you've reported.
> > I tested 6.4.0-rc1. the performance become a little worse.
> 
> Thanks, that's as I expected.
> 
> WHich means that the interesting kernel versions to check now are a
> 6.0.x kernel, and then if it has the same perf as 5.15.x, then the
> commit before the multi-gen LRU was introduced vs the commit after
> the multi-gen LRU was introduced to see if that is the functionality
> that introduced the regression....

More performance test results:

linux 6.0.18
	fio WRITE: bw=2565MiB/s (2689MB/s)
linux 5.17.0
	fio WRITE: bw=2602MiB/s (2729MB/s) 
linux 5.16.20
	fio WRITE: bw=7666MiB/s (8039MB/s),

So is it a problem introduced between 5.16.20 and 5.17.0?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/09




* Re: performance regression between 6.1.x and 5.15.x
  2023-05-09 12:37         ` Wang Yugui
@ 2023-05-09 22:14           ` Dave Chinner
  2023-05-10  5:46             ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-05-09 22:14 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-xfs

On Tue, May 09, 2023 at 08:37:52PM +0800, Wang Yugui wrote:
> > On Tue, May 09, 2023 at 07:25:53AM +0800, Wang Yugui wrote:
> > > > On Mon, May 08, 2023 at 10:46:12PM +0800, Wang Yugui wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I noticed a performance regression of xfs 6.1.27/6.1.23,
> > > > > > with the compare to xfs 5.15.110.
> > > > > > 
> > > > > > It is yet not clear whether  it is a problem of xfs or lvm2.
> > > > > > 
> > > > > > any guide to troubleshoot it?
> > > > > > 
> > > > > > test case:
> > > > > >   disk: NVMe PCIe3 SSD *4 
> > > > > >   LVM: raid0 default strip size 64K.
> > > > > >   fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30
> > > > > >    -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> > > > > >    -directory=/mnt/test
.....
> > > > Because you are testing buffered IO, you need to run perf across all
> > > > CPUs and tasks, not just the fio process so that it captures the
> > > > profile of memory reclaim and writeback that is being performed by
> > > > the kernel.
> > > 
> > > 'perf report' of all CPU.
> > > Samples: 211K of event 'cycles', Event count (approx.): 56590727219
> > > Overhead  Command          Shared Object            Symbol
> > >   16.29%  fio              [kernel.kallsyms]        [k] rep_movs_alternative
> > >    3.38%  kworker/u98:1+f  [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
> > >    3.11%  fio              [kernel.kallsyms]        [k] native_queued_spin_lock_slowpath
> > >    3.05%  swapper          [kernel.kallsyms]        [k] intel_idle
> > >    2.63%  fio              [kernel.kallsyms]        [k] get_page_from_freelist
> > >    2.33%  fio              [kernel.kallsyms]        [k] asm_exc_nmi
> > >    2.26%  kworker/u98:1+f  [kernel.kallsyms]        [k] __folio_start_writeback
> > >    1.40%  fio              [kernel.kallsyms]        [k] __filemap_add_folio
> > >    1.37%  fio              [kernel.kallsyms]        [k] lru_add_fn
> > >    1.35%  fio              [kernel.kallsyms]        [k] xas_load
> > >    1.33%  fio              [kernel.kallsyms]        [k] iomap_write_begin
> > >    1.31%  fio              [kernel.kallsyms]        [k] xas_descend
> > >    1.19%  kworker/u98:1+f  [kernel.kallsyms]        [k] folio_clear_dirty_for_io
> > >    1.07%  fio              [kernel.kallsyms]        [k] folio_add_lru
> > >    1.01%  fio              [kernel.kallsyms]        [k] __folio_mark_dirty
> > >    1.00%  kworker/u98:1+f  [kernel.kallsyms]        [k] _raw_spin_lock_irqsave
> > > 
> > > and 'top' show that 'kworker/u98:1' have over 80% CPU usage.
> > 
> > Can you provide an expanded callgraph profile for both the good and
> > bad kernels showing the CPU used in the fio write() path and the
> > kworker-based writeback path?
> 
> I'm sorry that some detail guide for info gather of this test please.

'perf record -g' and 'perf report -g' should enable callgraph
profiling and reporting. See the perf-record man page for
'--call-graph' to make sure you have the right kernel config for this
to work efficiently.

You can do quick snapshots in time via 'perf top -U -g' and then
after a few seconds type 'E' then immediately type 'P' and the fully
expanded callgraph profile will get written to a perf.hist.N file in
the current working directory...
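
As a concrete starting point (a sketch - you may need '--call-graph dwarf'
instead of frame pointers, depending on how the kernel and fio are built):

  # whole-system callgraph profile for the duration of the run
  perf record -a --call-graph fp -o perf.data.cg -- sleep 30
  perf report -g -i perf.data.cg > perf-callgraph.txt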

> > > I tested 6.4.0-rc1. the performance become a little worse.
> > 
> > Thanks, that's as I expected.
> > 
> > WHich means that the interesting kernel versions to check now are a
> > 6.0.x kernel, and then if it has the same perf as 5.15.x, then the
> > commit before the multi-gen LRU was introduced vs the commit after
> > the multi-gen LRU was introduced to see if that is the functionality
> > that introduced the regression....
> 
> more performance test result:
> 
> linux 6.0.18
> 	fio WRITE: bw=2565MiB/s (2689MB/s)
> linux 5.17.0
> 	fio WRITE: bw=2602MiB/s (2729MB/s) 
> linux 5.16.20
> 	fio WRITE: bw=7666MiB/s (8039MB/s),
> 
> so it is a problem between 5.16.20 and 5.17.0?

Ok, that is further back in time than I expected. In terms of XFS,
there are only two commits between 5.16..5.17 that might impact
performance:

ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")

and

6795801366da ("xfs: Support large folios")

To test whether ebb7fb1557b1 is the cause, go to
fs/iomap/buffered-io.c and change:

-#define IOEND_BATCH_SIZE        4096
+#define IOEND_BATCH_SIZE        1048576

This will increase the IO submission chain lengths to at least 4GB
from the 16MB bound that was placed on 5.17 and newer kernels.

To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
and comment out both calls to mapping_set_large_folios(). This will
ensure the page cache only instantiates single page folios the same
as 5.16 would have.

If neither of them change behaviour, then I think you're going to
need to do a bisect between 5.16..5.17 to find the commit that
introduced the regression. I know kernel bisects are slow and
painful, but it's exactly what I'd be doing right now if my
performance test machine wasn't broken....
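
The mechanics are just the usual bisect loop - roughly, in a mainline
git tree:

  git bisect start v5.17 v5.16        # bad, then good
  # at each step: build, boot, run the fio job, then mark the result
  git bisect good                     # or 'git bisect bad', based on bandwidth
  git bisect reset                    # when it converges on a commit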

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: performance regression between 6.1.x and 5.15.x
  2023-05-09 22:14           ` Dave Chinner
@ 2023-05-10  5:46             ` Wang Yugui
  2023-05-10  7:27               ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-10  5:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi,

> 'perf record -g' and 'perf report -g' should enable callgraph
> profiling and reporting. See the perf-record man page for
> '--callgraph' to make sure you have the right kernel config for this
> to work efficiently.
> 
> You can do quick snapshots in time via 'perf top -U -g' and then
> after a few seconds type 'E' then immediately type 'P' and the fully
> expanded callgraph profile will get written to a perf.hist.N file in
> the current working directory...

'perf report -g' of the BAD kernel 6.1.y:

Samples: 439K of event 'cycles', Event count (approx.): 148701386235
  Children      Self  Command          Shared Object            Symbol
+   61.82%     0.01%  fio              [kernel.kallsyms]        [k] entry_SYSCALL_64_after_hwfra
+   61.81%     0.01%  fio              [kernel.kallsyms]        [k] do_syscall_64
+   61.71%     0.00%  fio              libpthread-2.17.so       [.] 0x00007f4e7a40e6fd
+   61.66%     0.00%  fio              [kernel.kallsyms]        [k] ksys_write
+   61.64%     0.02%  fio              [kernel.kallsyms]        [k] vfs_write
+   61.60%     0.02%  fio              [kernel.kallsyms]        [k] xfs_file_buffered_write
+   61.20%     0.56%  fio              [kernel.kallsyms]        [k] iomap_file_buffered_write
+   25.25%     1.44%  fio              [kernel.kallsyms]        [k] iomap_write_begin
+   23.19%     0.75%  fio              [kernel.kallsyms]        [k] __filemap_get_folio
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] ret_from_fork
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] kthread
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] worker_thread
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] process_one_work
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] wb_workfn
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] wb_writeback
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] __writeback_inodes_wb
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] writeback_sb_inodes
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] __writeback_single_inode
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] do_writepages
+   21.57%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] xfs_vm_writepages
+   21.55%     0.00%  kworker/u98:4+f  [kernel.kallsyms]        [k] iomap_writepages
+   21.55%     0.85%  kworker/u98:4+f  [kernel.kallsyms]        [k] write_cache_pages
+   20.23%     0.22%  fio              [kernel.kallsyms]        [k] copy_page_from_iter_atomic
+   19.99%     0.04%  fio              [kernel.kallsyms]        [k] copyin
+   19.94%    19.77%  fio              [kernel.kallsyms]        [k] copy_user_enhanced_fast_stri
+   16.47%     0.00%  fio              [unknown]                [k] 0x00000000024803f0
+   16.47%     0.00%  fio              [unknown]                [k] 0x0000000002480440
+   16.47%     0.00%  fio              [unknown]                [k] 0x00000000024804b0
+   16.47%     0.00%  fio              [unknown]                [k] 0x0000000002480520
+   16.47%     0.00%  fio              [unknown]                [k] 0x0000000002480590
+   16.47%     0.00%  fio              [unknown]                [k] 0x0000000002480670
+   16.47%     0.00%  fio              [unknown]                [k] 0x00000000024806e0
+   16.47%     0.00%  fio              [unknown]                [k] 0x0000000002480750
+   16.47%     0.00%  fio              [unknown]                [k] 0x00000000024807c0
+   16.47%     0.00%  fio              [unknown]                [k] 0x0000000002480830
+   16.47%     0.00%  fio              [unknown]                [k] 0x00000000024808a0
+   16.47%     0.00%  fio              [unknown]                [k] 0x0000000002480910
+   16.47%     0.00%  fio              [unknown]                [k] 0x00007f4e21070430
+   15.73%     0.00%  fio              [unknown]                [k] 0x0000000002480290
+   15.73%     0.00%  fio              [unknown]                [k] 0x000000000247f790
+   15.73%     0.00%  fio              [unknown]                [k] 0x000000000247fdf0
+   15.73%     0.00%  fio              [unknown]                [k] 0x000000000247fe60
+   15.73%     0.00%  fio              [unknown]                [k] 0x000000000247fed0
+   15.73%     0.00%  fio              [unknown]                [k] 0x000000000247ff40
+   15.73%     0.00%  fio              [unknown]                [k] 0x000000000247ffb0
+   15.73%     0.00%  fio              [unknown]                [k] 0x0000000002480020
+   15.73%     0.00%  fio              [unknown]                [k] 0x0000000002480090
+   15.73%     0.00%  fio              [unknown]                [k] 0x0000000002480180
+   15.73%     0.00%  fio              [unknown]                [k] 0x00000000024801b0
+   15.73%     0.00%  fio              [unknown]                [k] 0x0000000002480220
+   15.73%     0.00%  fio              [unknown]                [k] 0x00007f4e2102ba18
+   15.07%     0.74%  kworker/u98:4+f  [kernel.kallsyms]        [k] iomap_writepage_map
+   15.01%     0.00%  fio              [unknown]                [k] 0x0000000002480a70


'perf report -g' of the GOOD kernel 5.15.y:

Samples: 335K of event 'cycles', Event count (approx.): 229260446285
  Children      Self  Command          Shared Object                    Symbol
+   66.35%     0.01%  fio              [kernel.kallsyms]                [k] entry_SYSCALL_64_aft
+   66.34%     0.01%  fio              [kernel.kallsyms]                [k] do_syscall_64
+   65.49%     0.00%  fio              libpthread-2.17.so               [.] 0x00007ff968de56fd
+   65.45%     0.00%  fio              [kernel.kallsyms]                [k] ksys_write
+   65.44%     0.02%  fio              [kernel.kallsyms]                [k] vfs_write
+   65.41%     0.00%  fio              [kernel.kallsyms]                [k] new_sync_write
+   65.40%     0.03%  fio              [kernel.kallsyms]                [k] xfs_file_buffered_wr
+   65.13%     0.53%  fio              [kernel.kallsyms]                [k] iomap_file_buffered_
+   27.17%     0.36%  fio              [kernel.kallsyms]                [k] copy_page_from_iter_
+   26.76%     0.05%  fio              [kernel.kallsyms]                [k] copyin
+   26.73%    26.52%  fio              [kernel.kallsyms]                [k] copy_user_enhanced_f
+   23.12%     1.05%  fio              [kernel.kallsyms]                [k] iomap_write_begin
+   21.59%     0.34%  fio              [kernel.kallsyms]                [k] grab_cache_page_writ
+   21.13%     0.65%  fio              [kernel.kallsyms]                [k] pagecache_get_page
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] ret_from_fork
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] kthread
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] worker_thread
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] process_one_work
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] wb_workfn
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] wb_writeback
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] __writeback_inodes_w
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] writeback_sb_inodes
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] __writeback_single_i
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] do_writepages
+   18.73%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] xfs_vm_writepages
+   18.66%     0.00%  kworker/u97:11+  [kernel.kallsyms]                [k] iomap_writepages
+   18.66%     0.48%  kworker/u97:11+  [kernel.kallsyms]                [k] write_cache_pages
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e613f0
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e61440
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e614b0
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e61520
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e61590
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e61670
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e616e0
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e61750
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e617c0
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e61830
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e618a0
+   16.75%     0.00%  fio              [unknown]                        [k] 0x0000000001e61910
+   16.75%     0.00%  fio              [unknown]                        [k] 0x00007ff9151e0430
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e60790
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e60df0
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e60e60
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e60ed0
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e60f40
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e60fb0
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e61020
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e61090
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e61180
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e611b0
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e61220
+   16.70%     0.00%  fio              [unknown]                        [k] 0x0000000001e61290
+   16.70%     0.00%  fio              [unknown]                        [k] 0x00007ff91519ba18


> Ok, that is further back in time than I expected. In terms of XFS,
> there are only two commits between 5.16..5.17 that might impact
> performance:
> 
> ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> 
> and
> 
> 6795801366da ("xfs: Support large folios")
> 
> To test whether ebb7fb1557b1 is the cause, go to
> fs/iomap/buffered-io.c and change:
> 
> -#define IOEND_BATCH_SIZE        4096
> +#define IOEND_BATCH_SIZE        1048576
> This will increase the IO submission chain lengths to at least 4GB
> from the 16MB bound that was placed on 5.17 and newer kernels.
> 
> To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> and comment out both calls to mapping_set_large_folios(). This will
> ensure the page cache only instantiates single page folios the same
> as 5.16 would have.
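
For reference, the mapping_set_large_folios() part of that test was applied
as roughly the following sketch (from memory of the 6.1.x tree; I am assuming
the two call sites sit in xfs_inode_alloc() and xfs_reinit_inode() in
fs/xfs/xfs_icache.c, so the exact lines may differ in your tree):

# sketch only - call sites and arguments assumed, check your tree
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ xfs_inode_alloc(): don't enable large folios on newly allocated inodes
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
@@ xfs_reinit_inode(): likewise for recycled inodes
-	mapping_set_large_folios(inode->i_mapping);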

6.1.x with 'mapping_set_large_folios removed' and 'IOEND_BATCH_SIZE=1048576':
	fio WRITE: bw=6451MiB/s (6764MB/s)

There is still a performance regression compared to linux 5.16.20:
	fio WRITE: bw=7666MiB/s (8039MB/s)

but the remaining regression is not big, so it is difficult to bisect.
We noticed a similar level of performance regression on btrfs too,
so maybe the problem is in some code that is used by both btrfs and xfs,
such as iomap and mm/folio.

6.1.x with 'mapping_set_large_folios removed' only:
	fio WRITE: bw=2676MiB/s (2806MB/s)

6.1.x with 'IOEND_BATCH_SIZE=1048576' only:
	fio WRITE: bw=5092MiB/s (5339MB/s)
	fio WRITE: bw=6076MiB/s (6371MB/s)

Maybe we need a further fix for ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback").

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/10


> 
> If neither of them change behaviour, then I think you're going to
> need to do a bisect between 5.16..5.17 to find the commit that
> introduced the regression. I know kernel bisects are slow and
> painful, but it's exactly what I'd be doing right now if my
> performance test machine wasn't broken....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: performance regression between 6.1.x and 5.15.x
  2023-05-10  5:46             ` Wang Yugui
@ 2023-05-10  7:27               ` Dave Chinner
  2023-05-10  8:50                 ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-05-10  7:27 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-xfs

On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > Ok, that is further back in time than I expected. In terms of XFS,
> > there are only two commits between 5.16..5.17 that might impact
> > performance:
> > 
> > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > 
> > and
> > 
> > 6795801366da ("xfs: Support large folios")
> > 
> > To test whether ebb7fb1557b1 is the cause, go to
> > fs/iomap/buffered-io.c and change:
> > 
> > -#define IOEND_BATCH_SIZE        4096
> > +#define IOEND_BATCH_SIZE        1048576
> > This will increase the IO submission chain lengths to at least 4GB
> > from the 16MB bound that was placed on 5.17 and newer kernels.
> > 
> > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > and comment out both calls to mapping_set_large_folios(). This will
> > ensure the page cache only instantiates single page folios the same
> > as 5.16 would have.
> 
> 6.1.x with 'mapping_set_large_folios removed' and 'IOEND_BATCH_SIZE=1048576':
> 	fio WRITE: bw=6451MiB/s (6764MB/s)
> 
> There is still a performance regression compared to linux 5.16.20:
> 	fio WRITE: bw=7666MiB/s (8039MB/s)
> 
> but the remaining regression is not big, so it is difficult to bisect.
> We noticed a similar level of performance regression on btrfs too,
> so maybe the problem is in some code that is used by both btrfs and xfs,
> such as iomap and mm/folio.

Yup, that's quite possibly something like the multi-gen LRU changes,
but that's not the regression we need to find. :/

> 6.1.x with 'mapping_set_large_folios removed' only:
> 	fio WRITE: bw=2676MiB/s (2806MB/s)
> 
> 6.1.x with 'IOEND_BATCH_SIZE=1048576' only:
> 	fio WRITE: bw=5092MiB/s (5339MB/s)
> 	fio WRITE: bw=6076MiB/s (6371MB/s)
> 
> Maybe we need a further fix for ebb7fb1557b1 ("xfs, iomap: limit
> individual ioend chain lengths in writeback").

OK, can you re-run the two 6.1.x kernels above (the slow and the
fast) and record the output of `iostat -dxm 1` whilst the
fio test is running? I want to see what the overall differences in
the IO load on the devices are between the two runs. This will tell
us how the IO sizes and queue depths change between the two kernels,
etc.

Right now I'm suspecting a contention interaction between write(),
do_writepages() and folio_end_writeback()...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: performance regression between 6.1.x and 5.15.x
  2023-05-10  7:27               ` Dave Chinner
@ 2023-05-10  8:50                 ` Wang Yugui
  2023-05-11  1:34                   ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-10  8:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

[-- Attachment #1: Type: text/plain, Size: 2664 bytes --]

Hi,


> On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > > Ok, that is further back in time than I expected. In terms of XFS,
> > > there are only two commits between 5.16..5.17 that might impact
> > > performance:
> > > 
> > > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > > 
> > > and
> > > 
> > > 6795801366da ("xfs: Support large folios")
> > > 
> > > To test whether ebb7fb1557b1 is the cause, go to
> > > fs/iomap/buffered-io.c and change:
> > > 
> > > -#define IOEND_BATCH_SIZE        4096
> > > +#define IOEND_BATCH_SIZE        1048576
> > > This will increase the IO submission chain lengths to at least 4GB
> > > from the 16MB bound that was placed on 5.17 and newer kernels.
> > > 
> > > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > > and comment out both calls to mapping_set_large_folios(). This will
> > > ensure the page cache only instantiates single page folios the same
> > > as 5.16 would have.
> > 
> > 6.1.x with 'mapping_set_large_folios removed' and 'IOEND_BATCH_SIZE=1048576':
> > 	fio WRITE: bw=6451MiB/s (6764MB/s)
> > 
> > There is still a performance regression compared to linux 5.16.20:
> > 	fio WRITE: bw=7666MiB/s (8039MB/s)
> > 
> > but the remaining regression is not big, so it is difficult to bisect.
> > We noticed a similar level of performance regression on btrfs too,
> > so maybe the problem is in some code that is used by both btrfs and xfs,
> > such as iomap and mm/folio.
> 
> Yup, that's quite possibly something like the multi-gen LRU changes,
> but that's not the regression we need to find. :/
> 
> > 6.1.x with 'mapping_set_large_folios removed' only:
> > 	fio WRITE: bw=2676MiB/s (2806MB/s)
> > 
> > 6.1.x with 'IOEND_BATCH_SIZE=1048576' only:
> > 	fio WRITE: bw=5092MiB/s (5339MB/s)
> > 	fio WRITE: bw=6076MiB/s (6371MB/s)
> > 
> > Maybe we need a further fix for ebb7fb1557b1 ("xfs, iomap: limit
> > individual ioend chain lengths in writeback").
> 
> OK, can you re-run the two 6.1.x kernels above (the slow and the
> fast) and record the output of `iostat -dxm 1` whilst the
> fio test is running? I want to see what the overall differences in
> the IO load on the devices are between the two runs. This will tell
> us how the IO sizes and queue depths change between the two kernels,
> etc.

The `iostat -dxm 1` results are saved in the attached files:
good.txt	good performance
bad.txt		bad performance

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/10

> 
> Right now I'm suspecting a contention interaction between write(),
> do_writepages() and folio_end_writeback()...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


[-- Attachment #2: good.txt --]
[-- Type: application/octet-stream, Size: 17825 bytes --]


[root@T620 ~]# iostat -dxm 1
Linux 6.1.28-0.6.el7.x86_64 (T620)      05/10/2023      _x86_64_        (40 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00   203.46    0.17   20.83     0.00    13.92  1357.81     0.21    9.62    0.09    9.70   0.61   1.29
nvme2c2n1         0.00   203.40    0.18   20.87     0.00    13.93  1355.38     0.09    4.37    0.08    4.40   0.27   0.57
nvme1n1           0.00   203.40    0.18   20.89     0.00    13.93  1353.59     0.08    3.53    0.03    3.56   0.29   0.61
nvme3n1           0.00   203.61    0.21   20.87     0.00    13.92  1352.33     0.15    7.20    0.05    7.27   0.38   0.79
sda               0.00     0.02    0.85    0.13     0.03     0.00    60.13     0.00    0.72    0.61    1.42   0.78   0.08
dm-0              0.00     0.00    0.05   42.88     0.00     2.67   127.43     1.04   24.19    0.07   24.22   0.04   0.18
dm-1              0.00     0.00    0.03   42.77     0.00     2.67   127.81     1.37   31.79    0.11   31.81   0.09   0.40
dm-2              0.00     0.00    0.03   42.77     0.00     2.67   127.82     0.89   20.79    0.09   20.80   0.03   0.12
dm-3              0.00     0.00    0.03   42.77     0.00     2.67   127.82     0.70   16.30    0.02   16.31   0.03   0.14
dm-4              0.00     0.00    0.14  171.18     0.00    10.68   127.72     3.99   23.27    0.07   23.29   0.02   0.41

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 25582.00    0.00 1368.00     0.00  1705.69  2553.55    15.84   11.58    0.00   11.58   0.62  85.00
nvme2c2n1         0.00 25583.00    0.00 1347.00     0.00  1680.68  2555.33     5.53    4.10    0.00    4.10   0.45  60.50
nvme1n1           0.00 25590.00    0.00 1348.00     0.00  1680.75  2553.54     3.85    2.86    0.00    2.86   0.45  61.10
nvme3n1           0.00 25585.00    0.00 1371.00     0.00  1709.50  2553.65    14.62   10.66    0.00   10.66   0.65  89.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 26940.00     0.00  1682.44   127.90   283.90   10.54    0.00   10.54   0.03  89.20
dm-1              0.00     0.00    0.00 26952.00     0.00  1682.57   127.85   308.07   11.43    0.00   11.43   0.03  85.30
dm-2              0.00     0.00    0.00 26948.00     0.00  1682.62   127.88   102.20    3.79    0.00    3.79   0.02  62.30
dm-3              0.00     0.00    0.00 26951.00     0.00  1682.77   127.87    68.75    2.55    0.00    2.55   0.02  62.80
dm-4              0.00     0.00    0.00 107893.00     0.00  6736.76   127.88   762.46    7.07    0.00    7.07   0.01  91.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 26957.00    0.00 1421.00     0.00  1771.55  2553.23    14.10    9.92    0.00    9.92   0.63  89.20
nvme2c2n1         0.00 26955.00    0.00 1421.00     0.00  1771.60  2553.29     5.67    3.99    0.00    3.99   0.44  63.20
nvme1n1           0.00 26953.00    0.00 1421.00     0.00  1771.60  2553.30     3.80    2.67    0.00    2.67   0.44  63.20
nvme3n1           0.00 26955.00    0.00 1421.00     0.00  1771.56  2553.24    15.01   10.56    0.00   10.56   0.64  91.00
sda               0.00     0.00    0.00    5.00     0.00     0.04    18.20     0.01    1.20    0.00    1.20   0.60   0.30
dm-0              0.00     0.00    0.00 28370.00     0.00  1771.02   127.85   292.33   10.30    0.00   10.30   0.03  91.40
dm-1              0.00     0.00    0.00 28357.00     0.00  1770.87   127.90   273.60    9.65    0.00    9.65   0.03  89.30
dm-2              0.00     0.00    0.00 28359.00     0.00  1770.85   127.89   105.64    3.73    0.00    3.73   0.02  64.50
dm-3              0.00     0.00    0.00 28361.00     0.00  1770.73   127.87    68.33    2.41    0.00    2.41   0.02  64.40
dm-4              0.00     0.00    0.00 113345.00     0.00  7077.08   127.87   739.98    6.53    0.00    6.53   0.01  94.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 29349.00    0.00 1499.00     0.00  1869.08  2553.62    29.05   19.38    0.00   19.38   0.60  90.60
nvme2c2n1         0.00 29355.00    0.00 1529.00     0.00  1906.56  2553.72     6.26    4.10    0.00    4.10   0.45  68.60
nvme1n1           0.00 29352.00    0.00 1537.00     0.00  1915.27  2552.04     4.21    2.74    0.00    2.74   0.45  68.50
nvme3n1           0.00 29353.00    0.00 1512.00     0.00  1884.06  2551.96    24.56   16.24    0.00   16.24   0.63  94.90
sda               0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    0.00    0.00    0.00   1.00   0.10
dm-0              0.00     0.00    0.00 30902.00     0.00  1929.52   127.88   481.12   15.57    0.00   15.57   0.03  95.10
dm-1              0.00     0.00    0.00 30897.00     0.00  1929.46   127.89   571.80   18.51    0.00   18.51   0.03  90.40
dm-2              0.00     0.00    0.00 30903.00     0.00  1929.44   127.87   116.15    3.76    0.00    3.76   0.02  69.30
dm-3              0.00     0.00    0.00 30901.00     0.00  1929.46   127.88    74.80    2.42    0.00    2.42   0.02  69.20
dm-4              0.00     0.00    0.00 123603.00     0.00  7717.87   127.88  1243.94   10.06    0.00   10.06   0.01  96.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 25950.00    0.00 1392.00     0.00  1734.20  2551.47    23.06   16.56    0.00   16.56   0.60  83.20
nvme2c2n1         0.00 25954.00    0.00 1387.00     0.00  1729.16  2553.22     5.69    4.10    0.00    4.10   0.44  61.40
nvme1n1           0.00 25947.00    0.00 1380.00     0.00  1720.43  2553.22     4.02    2.91    0.00    2.91   0.45  61.80
nvme3n1           0.00 25954.00    0.00 1378.00     0.00  1717.91  2553.17    18.68   13.55    0.00   13.55   0.63  87.20
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 27322.00     0.00  1705.69   127.85   364.35   13.34    0.00   13.34   0.03  87.40
dm-1              0.00     0.00    0.00 27319.00     0.00  1705.76   127.87   451.39   16.52    0.00   16.52   0.03  83.50
dm-2              0.00     0.00    0.00 27322.00     0.00  1705.72   127.86   104.82    3.84    0.00    3.84   0.02  62.40
dm-3              0.00     0.00    0.00 27315.00     0.00  1705.68   127.89    71.30    2.61    0.00    2.61   0.02  62.50
dm-4              0.00     0.00    0.00 109278.00     0.00  6822.85   127.87   992.64    9.08    0.00    9.08   0.01  89.60

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 26937.00    0.00 1416.00     0.00  1765.16  2552.99    35.40   25.00    0.00   25.00   0.63  89.90
nvme2c2n1         0.00 26944.00    0.00 1420.00     0.00  1770.19  2553.06     5.75    4.05    0.00    4.05   0.45  63.80
nvme1n1           0.00 26938.00    0.00 1420.00     0.00  1770.18  2553.05     3.86    2.72    0.00    2.72   0.45  63.30
nvme3n1           0.00 26936.00    0.00 1448.00     0.00  1803.89  2551.35     6.91    4.77    0.00    4.77   0.52  75.40
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 28357.00     0.00  1770.76   127.89   129.60    4.57    0.00    4.57   0.03  76.20
dm-1              0.00     0.00    0.00 28357.00     0.00  1770.77   127.89   698.41   24.63    0.00   24.63   0.03  89.90
dm-2              0.00     0.00    0.00 28364.00     0.00  1770.81   127.86   106.04    3.74    0.00    3.74   0.02  65.30
dm-3              0.00     0.00    0.00 28358.00     0.00  1770.81   127.89    68.45    2.41    0.00    2.41   0.02  64.80
dm-4              0.00     0.00    0.00 113436.00     0.00  7083.15   127.88  1002.26    8.84    0.00    8.84   0.01  93.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 24627.00    0.00 1318.00     0.00  1645.44  2556.80    11.61    8.81    0.00    8.81   0.63  83.40
nvme2c2n1         0.00 24631.00    0.00 1298.00     0.00  1619.25  2554.87     5.13    3.95    0.00    3.95   0.44  57.60
nvme1n1           0.00 24634.00    0.00 1298.00     0.00  1619.25  2554.88     3.47    2.68    0.00    2.68   0.44  56.90
nvme3n1           0.00 24634.00    0.00 1298.00     0.00  1619.23  2554.84     4.19    3.23    0.00    3.23   0.52  67.70
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 25932.00     0.00  1619.05   127.87    75.80    2.92    0.00    2.92   0.03  69.40
dm-1              0.00     0.00    0.00 25925.00     0.00  1619.02   127.90   224.33    8.65    0.00    8.65   0.03  84.10
dm-2              0.00     0.00    0.00 25929.00     0.00  1619.06   127.88    94.71    3.65    0.00    3.65   0.02  59.30
dm-3              0.00     0.00    0.00 25932.00     0.00  1619.07   127.87    61.35    2.37    0.00    2.37   0.02  58.90
dm-4              0.00     0.00    0.00 103718.00     0.00  6476.21   127.88   455.83    4.39    0.00    4.39   0.01  84.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 27534.00    0.00 1462.00     0.00  1821.66  2551.82    25.20   17.24    0.00   17.24   0.60  88.40
nvme2c2n1         0.00 27537.00    0.00 1452.00     0.00  1810.38  2553.48     6.29    4.33    0.00    4.33   0.44  64.00
nvme1n1           0.00 27535.00    0.00 1452.00     0.00  1810.38  2553.48     4.08    2.81    0.00    2.81   0.44  63.90
nvme3n1           0.00 27532.00    0.00 1452.00     0.00  1810.40  2553.51     6.85    4.72    0.00    4.72   0.52  75.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 28992.00     0.00  1810.63   127.90   128.65    4.44    0.00    4.44   0.03  75.50
dm-1              0.00     0.00    0.00 28998.00     0.00  1810.60   127.87   494.88   17.07    0.00   17.07   0.03  88.60
dm-2              0.00     0.00    0.00 29001.00     0.00  1810.56   127.86   117.12    4.04    0.00    4.04   0.02  65.00
dm-3              0.00     0.00    0.00 29000.00     0.00  1810.56   127.86    73.24    2.53    0.00    2.53   0.02  64.90
dm-4              0.00     0.00    0.00 116103.00     0.00  7249.36   127.88   813.95    7.01    0.00    7.01   0.01  90.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 26478.00    0.00 1391.00     0.00  1735.71  2555.53    22.99   16.53    0.00   16.53   0.63  87.00
nvme2c2n1         0.00 26471.00    0.00 1395.00     0.00  1740.75  2555.60     5.75    4.12    0.00    4.12   0.44  61.60
nvme1n1           0.00 26477.00    0.00 1395.00     0.00  1740.75  2555.60     3.82    2.74    0.00    2.74   0.44  61.20
nvme3n1           0.00 26480.00    0.00 1395.00     0.00  1740.78  2555.64     6.88    4.93    0.00    4.93   0.52  72.60
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 27867.00     0.00  1739.59   127.85   129.73    4.66    0.00    4.66   0.03  73.40
dm-1              0.00     0.00    0.00 27861.00     0.00  1739.65   127.88   452.05   16.23    0.00   16.23   0.03  87.40
dm-2              0.00     0.00    0.00 27854.00     0.00  1739.65   127.91   107.19    3.85    0.00    3.85   0.02  63.00
dm-3              0.00     0.00    0.00 27859.00     0.00  1739.62   127.89    68.79    2.47    0.00    2.47   0.02  62.70
dm-4              0.00     0.00    0.00 111329.00     0.00  6951.52   127.88   757.81    6.81    0.00    6.81   0.01  89.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 26664.00    0.00 1399.00     0.00  1741.75  2549.75    34.17   24.42    0.00   24.42   0.62  87.40
nvme2c2n1         0.00 26664.00    0.00 1395.00     0.00  1736.70  2549.64     5.63    4.03    0.00    4.03   0.45  63.10
nvme1n1           0.00 26670.00    0.00 1400.00     0.00  1742.94  2549.67     3.90    2.78    0.00    2.78   0.45  63.40
nvme3n1           0.00 26664.00    0.00 1399.00     0.00  1741.72  2549.71     7.68    5.49    0.00    5.49   0.52  72.40
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 28071.00     0.00  1752.78   127.88   144.22    5.14    0.00    5.14   0.03  73.40
dm-1              0.00     0.00    0.00 28071.00     0.00  1752.81   127.88   673.15   23.98    0.00   23.98   0.03  87.90
dm-2              0.00     0.00    0.00 28071.00     0.00  1752.76   127.88   102.83    3.66    0.00    3.66   0.02  64.10
dm-3              0.00     0.00    0.00 28077.00     0.00  1752.75   127.85    67.94    2.42    0.00    2.42   0.02  64.90
dm-4              0.00     0.00    0.00 112290.00     0.00  7011.10   127.87   988.33    8.80    0.00    8.80   0.01  90.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 28862.00    0.00 1524.00     0.00  1900.19  2553.53    38.41   25.20    0.00   25.20   0.61  92.40
nvme2c2n1         0.00 28857.00    0.00 1525.00     0.00  1901.43  2553.53     6.28    4.12    0.00    4.12   0.45  68.50
nvme1n1           0.00 28853.00    0.00 1524.00     0.00  1900.17  2553.51     4.21    2.76    0.00    2.76   0.45  68.40
nvme3n1           0.00 28847.00    0.00 1524.00     0.00  1900.12  2553.45     9.62    6.31    0.00    6.31   0.52  79.70
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 30369.00     0.00  1896.88   127.92   183.75    6.05    0.00    6.05   0.03  80.30
dm-1              0.00     0.00    0.00 30384.00     0.00  1896.86   127.86   758.64   24.97    0.00   24.97   0.03  92.60
dm-2              0.00     0.00    0.00 30379.00     0.00  1896.91   127.88   117.22    3.86    0.00    3.86   0.02  69.80
dm-3              0.00     0.00    0.00 30375.00     0.00  1896.92   127.90    75.70    2.49    0.00    2.49   0.02  69.10
dm-4              0.00     0.00    0.00 121507.00     0.00  7587.55   127.89  1135.30    9.34    0.00    9.34   0.01  94.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 26847.00    0.00 1411.00     0.00  1757.56  2551.02    19.05   13.50    0.00   13.50   0.61  86.00
nvme2c2n1         0.00 26836.00    0.00 1425.00     0.00  1775.09  2551.14     5.94    4.17    0.00    4.17   0.44  63.00
nvme1n1           0.00 26847.00    0.00 1421.00     0.00  1770.14  2551.20     3.93    2.77    0.00    2.77   0.44  62.40
nvme3n1           0.00 26843.00    0.00 1422.00     0.00  1771.41  2551.23     5.43    3.82    0.00    3.82   0.52  74.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 28259.00     0.00  1764.54   127.88    98.97    3.50    0.00    3.50   0.03  74.60
dm-1              0.00     0.00    0.00 28263.00     0.00  1764.42   127.85   370.40   13.11    0.00   13.11   0.03  86.30
dm-2              0.00     0.00    0.00 28252.00     0.00  1764.46   127.91   108.94    3.86    0.00    3.86   0.02  64.20
dm-3              0.00     0.00    0.00 28263.00     0.00  1764.52   127.86    69.18    2.45    0.00    2.45   0.02  63.60
dm-4              0.00     0.00    0.00 113037.00     0.00  7057.93   127.88   647.52    5.73    0.00    5.73   0.01  88.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 24345.00    0.00 1300.00     0.00  1619.89  2551.95    12.49    9.61    0.00    9.61   0.63  82.20
nvme2c2n1         0.00 24347.00    0.00 1285.00     0.00  1601.16  2551.89     5.07    3.94    0.00    3.94   0.45  58.20
nvme1n1           0.00 24349.00    0.00 1285.00     0.00  1601.19  2551.93     3.46    2.69    0.00    2.69   0.46  58.60
nvme3n1           0.00 24350.00    0.00 1285.00     0.00  1601.23  2552.01     4.56    3.55    0.00    3.55   0.53  68.00
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 25635.00     0.00  1600.61   127.87    81.69    3.19    0.00    3.19   0.03  71.80
dm-1              0.00     0.00    0.00 25630.00     0.00  1600.55   127.89   240.19    9.37    0.00    9.37   0.03  85.10
dm-2              0.00     0.00    0.00 25632.00     0.00  1600.54   127.88    91.86    3.58    0.00    3.58   0.02  62.90
dm-3              0.00     0.00    0.00 25634.00     0.00  1600.56   127.88    59.96    2.34    0.00    2.34   0.02  63.40
dm-4              0.00     0.00    0.00 102531.00     0.00  6402.26   127.88   473.77    4.62    0.00    4.62   0.01  86.70

^C
[root@T620 ~]#


[-- Attachment #3: bad.txt --]
[-- Type: application/octet-stream, Size: 8279 bytes --]


[root@T620 ~]# iostat -dxm 1
Linux 6.1.28-0.5.el7.x86_64 (T620)      05/10/2023      _x86_64_        (40 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00   779.13    1.45   62.83     0.03    52.31  1667.62     0.36    3.89    0.07    3.97   1.17   7.50
nvme2c2n1         0.00   775.15    1.49   62.80     0.03    52.33  1667.93     0.07    1.06    0.09    1.09   0.47   3.04
nvme1n1           0.00   775.15    1.54   62.86     0.03    52.33  1665.16     0.07    0.92    0.03    0.94   0.52   3.33
nvme3n1           0.00   777.00    1.77   62.83     0.03    52.33  1659.94     0.11    1.53    0.05    1.57   0.60   3.86
sda               0.02     0.10   12.71    0.76     0.43     0.02    67.86     0.00    0.32    0.26    1.45   0.40   0.55
dm-0              0.00     0.00    0.85  839.78     0.01    52.33   127.50     1.41    1.66    0.05    1.66   0.05   3.86
dm-1              0.00     0.00    0.45  841.92     0.01    52.33   127.24     2.74    3.13    0.09    3.13   0.09   7.50
dm-2              0.00     0.00    0.48  837.90     0.00    52.33   127.84     0.94    1.12    0.10    1.12   0.04   3.05
dm-3              0.00     0.00    0.53  837.95     0.00    52.33   127.83     0.76    0.90    0.01    0.90   0.04   3.34
dm-4              0.00     0.00    2.32 3357.55     0.02   209.32   127.60     5.83    1.70    0.06    1.70   0.02   7.72

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00 10015.00    0.00  542.00     0.00   663.12  2505.68     1.71    3.15    0.00    3.15   0.96  51.90
nvme2c2n1         0.00 10015.00    0.00  539.00     0.00   659.34  2505.25     1.20    2.22    0.00    2.22   0.76  40.90
nvme1n1           0.00 10014.00    0.00  539.00     0.00   659.31  2505.14     1.11    2.07    0.00    2.07   0.74  39.70
nvme3n1           0.00 10014.00    0.00  539.00     0.00   659.31  2505.14     1.37    2.54    0.00    2.54   0.94  50.70
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 10553.00     0.00   659.56   128.00    18.83    1.78    0.00    1.78   0.05  52.10
dm-1              0.00     0.00    0.00 10554.00     0.00   659.62   128.00    25.53    2.42    0.00    2.42   0.05  53.30
dm-2              0.00     0.00    0.00 10554.00     0.00   659.59   127.99    15.50    1.47    0.00    1.47   0.04  42.80
dm-3              0.00     0.00    0.00 10553.00     0.00   659.56   128.00    13.93    1.32    0.00    1.32   0.04  41.50
dm-4              0.00     0.00    0.00 42214.00     0.00  2638.34   128.00    73.85    1.75    0.00    1.75   0.01  54.70

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00  9258.00    0.00  502.00     0.00   610.50  2490.65     1.58    3.15    0.00    3.15   0.96  48.00
nvme2c2n1         0.00  9258.00    0.00  502.00     0.00   610.50  2490.65     1.12    2.24    0.00    2.24   0.75  37.60
nvme1n1           0.00  9258.00    0.00  502.00     0.00   610.50  2490.65     1.05    2.09    0.00    2.09   0.74  37.10
nvme3n1           0.00  9258.00    0.00  503.00     0.00   610.51  2485.73     1.27    2.53    0.00    2.53   0.91  45.60
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 9761.00     0.00   610.01   127.99    16.79    1.72    0.00    1.72   0.05  48.30
dm-1              0.00     0.00    0.00 9760.00     0.00   610.00   128.00    22.74    2.33    0.00    2.33   0.05  50.40
dm-2              0.00     0.00    0.00 9760.00     0.00   610.00   128.00    13.89    1.42    0.00    1.42   0.04  40.40
dm-3              0.00     0.00    0.00 9760.00     0.00   610.00   128.00    12.56    1.29    0.00    1.29   0.04  40.00
dm-4              0.00     0.00    0.00 39041.00     0.00  2440.01   128.00    66.05    1.69    0.00    1.69   0.01  51.30

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00  7498.00    0.00  407.00     0.00   494.00  2485.78     1.30    3.19    0.00    3.19   0.94  38.30
nvme2c2n1         0.00  7533.00    0.00  407.00     0.00   493.97  2485.64     0.99    2.43    0.00    2.43   0.75  30.50
nvme1n1           0.00  7497.00    0.00  407.00     0.00   494.00  2485.78     0.92    2.26    0.00    2.26   0.73  29.60
nvme3n1           0.00  7497.00    0.00  407.00     0.00   494.00  2485.78     1.04    2.55    0.00    2.55   0.85  34.50
sda               0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    1.00    0.00    1.00   1.00   0.10
dm-0              0.00     0.00    0.00 7904.00     0.00   494.00   128.00    12.50    1.58    0.00    1.58   0.05  37.10
dm-1              0.00     0.00    0.00 7905.00     0.00   494.00   127.98    17.70    2.24    0.00    2.24   0.05  40.60
dm-2              0.00     0.00    0.00 7940.00     0.00   494.00   127.42    11.46    1.44    0.00    1.44   0.04  33.10
dm-3              0.00     0.00    0.00 7904.00     0.00   494.00   128.00    10.23    1.29    0.00    1.29   0.04  32.30
dm-4              0.00     0.00    0.00 31653.00     0.00  1976.00   127.85    51.94    1.64    0.00    1.64   0.01  40.90

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00  9697.00    0.00  528.00     0.00   639.25  2479.52     1.65    3.12    0.00    3.12   0.99  52.50
nvme2c2n1         0.00  9725.00    0.00  529.00     0.00   639.28  2474.93     1.11    2.09    0.00    2.09   0.76  40.30
nvme1n1           0.00  9697.00    0.00  528.00     0.00   639.25  2479.52     1.03    1.95    0.00    1.95   0.74  39.30
nvme3n1           0.00  9697.00    0.00  528.00     0.00   639.25  2479.52     1.07    2.02    0.00    2.02   0.77  40.80
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00 10224.00     0.00   639.00   128.00    12.52    1.22    0.00    1.22   0.04  42.70
dm-1              0.00     0.00    0.00 10224.00     0.00   639.00   128.00    23.98    2.35    0.00    2.35   0.05  54.20
dm-2              0.00     0.00    0.00 10253.00     0.00   639.00   127.64    13.46    1.31    0.00    1.31   0.04  42.10
dm-3              0.00     0.00    0.00 10224.00     0.00   639.00   128.00    11.66    1.14    0.00    1.14   0.04  41.20
dm-4              0.00     0.00    0.00 40925.00     0.00  2556.00   127.91    61.66    1.51    0.00    1.51   0.01  54.70

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00  6690.00    0.00  365.00     0.00   440.75  2473.03     1.25    3.42    0.00    3.42   0.94  34.40
nvme2c2n1         0.00  6737.00    0.00  367.00     0.00   440.72  2459.40     0.95    2.59    0.00    2.59   0.75  27.70
nvme1n1           0.00  6690.00    0.00  365.00     0.00   440.75  2473.03     0.88    2.42    0.00    2.42   0.75  27.20
nvme3n1           0.00  6690.00    0.00  365.00     0.00   440.75  2473.03     0.90    2.46    0.00    2.46   0.75  27.40
sda               0.00     0.00    0.00    1.00     0.00     0.00     8.00     0.00    1.00    0.00    1.00   1.00   0.10
dm-0              0.00     0.00    0.00 7056.00     0.00   441.00   128.00     9.82    1.39    0.00    1.39   0.04  28.80
dm-1              0.00     0.00    0.00 7056.00     0.00   441.00   128.00    16.67    2.36    0.00    2.36   0.05  35.70
dm-2              0.00     0.00    0.00 7105.00     0.00   441.00   127.12    10.70    1.51    0.00    1.51   0.04  29.10
dm-3              0.00     0.00    0.00 7056.00     0.00   441.00   128.00     9.60    1.36    0.00    1.36   0.04  28.50
dm-4              0.00     0.00    0.00 28273.00     0.00  1764.00   127.78    46.83    1.66    0.00    1.66   0.01  36.00

^C
[root@T620 ~]#

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: performance regression between 6.1.x and 5.15.x
  2023-05-10  8:50                 ` Wang Yugui
@ 2023-05-11  1:34                   ` Dave Chinner
  2023-05-17 13:07                     ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2023-05-11  1:34 UTC (permalink / raw)
  To: Wang Yugui; +Cc: linux-xfs

On Wed, May 10, 2023 at 04:50:56PM +0800, Wang Yugui wrote:
> Hi,
> 
> 
> > On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > > > Ok, that is further back in time than I expected. In terms of XFS,
> > > > there are only two commits between 5.16..5.17 that might impact
> > > > performance:
> > > > 
> > > > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > > > 
> > > > and
> > > > 
> > > > 6795801366da ("xfs: Support large folios")
> > > > 
> > > > To test whether ebb7fb1557b1 is the cause, go to
> > > > fs/iomap/buffered-io.c and change:
> > > > 
> > > > -#define IOEND_BATCH_SIZE        4096
> > > > +#define IOEND_BATCH_SIZE        1048576
> > > > This will increase the IO submission chain lengths to at least 4GB
> > > > from the 16MB bound that was placed on 5.17 and newer kernels.
> > > > 
> > > > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > > > and comment out both calls to mapping_set_large_folios(). This will
> > > > ensure the page cache only instantiates single page folios the same
> > > > as 5.16 would have.
> > > 
> > > 6.1.x with 'mapping_set_large_folios removed' and 'IOEND_BATCH_SIZE=1048576':
> > > 	fio WRITE: bw=6451MiB/s (6764MB/s)
> > > 
> > > There is still a performance regression compared to linux 5.16.20:
> > > 	fio WRITE: bw=7666MiB/s (8039MB/s)
> > > 
> > > but the remaining regression is not big, so it is difficult to bisect.
> > > We noticed a similar level of performance regression on btrfs too,
> > > so maybe the problem is in some code that is used by both btrfs and xfs,
> > > such as iomap and mm/folio.
> > 
> > Yup, that's quite possibly something like the multi-gen LRU changes,
> > but that's not the regression we need to find. :/
> > 
> > > 6.1.x with 'mapping_set_large_folios removed' only:
> > > 	fio WRITE: bw=2676MiB/s (2806MB/s)
> > > 
> > > 6.1.x with 'IOEND_BATCH_SIZE=1048576' only:
> > > 	fio WRITE: bw=5092MiB/s (5339MB/s)
> > > 	fio WRITE: bw=6076MiB/s (6371MB/s)
> > > 
> > > Maybe we need a further fix for ebb7fb1557b1 ("xfs, iomap: limit
> > > individual ioend chain lengths in writeback").
> > 
> > OK, can you re-run the two 6.1.x kernels above (the slow and the
> > fast) and record the output of `iostat -dxm 1` whilst the
> > fio test is running? I want to see what the overall differences in
> > the IO load on the devices are between the two runs. This will tell
> > us how the IO sizes and queue depths change between the two kernels,
> > etc.
> 
> The `iostat -dxm 1` results are saved in the attached files:
> good.txt	good performance
> bad.txt		bad performance

Thanks!

What I see here is that neither the good nor the bad config is able
to drive the hardware to 100% utilisation, but the way the IO stack
is behaving is identical. The only difference is that
the good config is driving much more IO to the devices, such that
the top level RAID0 stripe reports ~90% utilisation vs 50%
utilisation.

What this says to me is that the limitation in throughput is that the
single threaded background IO submission (the bdi-flush thread) is
CPU bound in both cases, and that the difference is in how much CPU
each IO submission is consuming.

From some tests here at lower bandwidth (1-2GB/s) with a batch size
of 4096, I'm seeing the vast majority of submission CPU time being
spent in folio_start_writeback(), and the vast majority of CPU time
in IO completion being spent in folio_end_writeback. There's an
order of magnitude more CPU time in these functions than in any of
the XFS or iomap writeback functions.

A typical 5 second expanded snapshot profile (from `perf top -g -U`)
of the bdi-flusher thread looks like this:

   99.22%     3.68%  [kernel]  [k] write_cache_pages
   - 65.13% write_cache_pages
      - 46.84% iomap_do_writepage
         - 35.50% __folio_start_writeback
            - 7.94% _raw_spin_lock_irqsave
               - 11.35% do_raw_spin_lock
                    __pv_queued_spin_lock_slowpath
            - 5.37% _raw_spin_unlock_irqrestore
               - 5.32% do_raw_spin_unlock
                    __raw_callee_save___pv_queued_spin_unlock
               - 0.92% asm_common_interrupt
                    common_interrupt
                    __common_interrupt
                    handle_edge_irq
                    handle_irq_event
                    __handle_irq_event_percpu
                    vring_interrupt
                    virtblk_done
            - 4.18% __mod_lruvec_page_state
               - 2.18% __mod_lruvec_state
                    1.16% __mod_node_page_state
                    0.68% __mod_memcg_lruvec_state
                 0.90% __mod_memcg_lruvec_state
              2.88% xas_descend
              1.63% percpu_counter_add_batch
              1.63% mod_zone_page_state
              1.15% xas_load
              1.11% xas_start
              0.93% __rcu_read_unlock
            - 0.89% folio_memcg_lock
              0.63% asm_common_interrupt
                 common_interrupt
                 __common_interrupt
                 handle_edge_irq
                 handle_irq_event
                 __handle_irq_event_percpu
                 vring_interrupt
                 virtblk_done
                 virtblk_complete_batch
                 blk_mq_end_request_batch
                 bio_endio
                 iomap_writepage_end_bio
                 iomap_finish_ioend
         - 2.75% xfs_map_blocks
            - 1.55% __might_sleep
                 1.26% __might_resched
         - 1.90% bio_add_folio
              1.13% __bio_try_merge_page
         - 1.82% submit_bio
            - submit_bio_noacct
               - 1.82% submit_bio_noacct_nocheck
                  - __submit_bio
                       1.77% blk_mq_submit_bio
           1.27% inode_to_bdi
           1.19% xas_clear_mark
           0.65% xas_set_mark
           0.57% iomap_page_create.isra.0
      - 12.91% folio_clear_dirty_for_io
         - 2.72% __mod_lruvec_page_state
            - 1.84% __mod_lruvec_state
                 0.98% __mod_node_page_state
                 0.58% __mod_memcg_lruvec_state
           1.55% mod_zone_page_state
           1.49% percpu_counter_add_batch
         - 0.72% asm_common_interrupt
              common_interrupt
              __common_interrupt
              handle_edge_irq
              handle_irq_event
              __handle_irq_event_percpu
              vring_interrupt
              virtblk_done
              virtblk_complete_batch
              blk_mq_end_request_batch
              bio_endio
              iomap_writepage_end_bio
              iomap_finish_ioend
           0.55% folio_mkclean
      - 8.08% filemap_get_folios_tag
           1.84% xas_find_marked
      - 1.89% __pagevec_release
           1.87% release_pages
      - 1.65% __might_sleep
           1.33% __might_resched
        1.22% folio_unlock
   - 3.68% ret_from_fork
        kthread
        worker_thread
        process_one_work
        wb_workfn
        wb_writeback
        __writeback_inodes_wb
        writeback_sb_inodes
        __writeback_single_inode
        do_writepages
        xfs_vm_writepages
        iomap_writepages
        write_cache_pages

This indicates that 35% of writeback submission CPU is in
__folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
is in filemap_get_folios_tag() and only ~8% of CPU time is in the
rest of the iomap/XFS code building and submitting bios from the
folios passed to it.  i.e.  it looks a lot like writeback is
contending with the incoming write(), IO completion and memory
reclaim contexts for access to the page cache mapping and mm
accounting structures.

Unfortunately, I don't have access to hardware that I can use to
confirm this is the cause, but it doesn't look like it's directly an
XFS/iomap issue at this point. The larger batch sizes reduce both
memory reclaim and IO completion competition with submission, so it
kinda points in this direction.

I suspect we need to start using high order folios in the write path
where we have large user IOs for streaming writes, but I also wonder
if there isn't some sort of batched accounting/mapping tree updates
we could do for all the adjacent folios in a single bio....
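
As a very rough sketch of the first idea, purely to illustrate the
direction (not a tested patch - fgp_order_hint() is a made-up helper that
would turn the write length into extra FGP flag bits, capped at whatever
the page cache supports):

static struct folio *
iomap_write_get_large_folio(struct iomap_iter *iter, loff_t pos, size_t len)
{
	/* roughly the flags the buffered write path already passes */
	unsigned fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE |
		       FGP_NOFS;

	/* hypothetical: fold get_order(len) into the fgp flags so that
	 * __filemap_get_folio() can allocate a high order folio instead
	 * of an order-0 one for a large streaming write */
	fgp |= fgp_order_hint(len);

	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
			fgp, mapping_gfp_mask(iter->inode->i_mapping));
}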

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: performance regression between 6.1.x and 5.15.x
  2023-05-11  1:34                   ` Dave Chinner
@ 2023-05-17 13:07                     ` Wang Yugui
  2023-05-17 22:11                       ` Dave Chinner
  2023-05-18 18:36                       ` Creating large folios in iomap buffered write path Matthew Wilcox
  0 siblings, 2 replies; 20+ messages in thread
From: Wang Yugui @ 2023-05-17 13:07 UTC (permalink / raw)
  To: Dave Chinner, Matthew Wilcox; +Cc: linux-xfs

Hi,

> On Wed, May 10, 2023 at 04:50:56PM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > 
> > > On Wed, May 10, 2023 at 01:46:49PM +0800, Wang Yugui wrote:
> > > > > Ok, that is further back in time than I expected. In terms of XFS,
> > > > > there are only two commits between 5.16..5.17 that might impact
> > > > > performance:
> > > > > 
> > > > > ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths in writeback")
> > > > > 
> > > > > and
> > > > > 
> > > > > 6795801366da ("xfs: Support large folios")
> > > > > 
> > > > > To test whether ebb7fb1557b1 is the cause, go to
> > > > > fs/iomap/buffered-io.c and change:
> > > > > 
> > > > > -#define IOEND_BATCH_SIZE        4096
> > > > > +#define IOEND_BATCH_SIZE        1048576
> > > > > This will increase the IO submission chain lengths to at least 4GB
> > > > > from the 16MB bound that was placed on 5.17 and newer kernels.
> > > > > 
> > > > > To test whether 6795801366da is the cause, go to fs/xfs/xfs_icache.c
> > > > > and comment out both calls to mapping_set_large_folios(). This will
> > > > > ensure the page cache only instantiates single page folios the same
> > > > > as 5.16 would have.
> > > > 
> > > > 6.1.x with 'mapping_set_large_folios removed' and 'IOEND_BATCH_SIZE=1048576':
> > > > 	fio WRITE: bw=6451MiB/s (6764MB/s)
> > > > 
> > > > There is still a performance regression compared to linux 5.16.20:
> > > > 	fio WRITE: bw=7666MiB/s (8039MB/s)
> > > > 
> > > > but the remaining regression is not big, so it is difficult to bisect.
> > > > We noticed a similar level of performance regression on btrfs too,
> > > > so maybe the problem is in some code that is used by both btrfs and xfs,
> > > > such as iomap and mm/folio.
> > > 
> > > Yup, that's quite possibly something like the multi-gen LRU changes,
> > > but that's not the regression we need to find. :/
> > > 
> > > > 6.1.x with 'mapping_set_large_folios removed' only:
> > > > 	fio WRITE: bw=2676MiB/s (2806MB/s)
> > > > 
> > > > 6.1.x with 'IOEND_BATCH_SIZE=1048576' only:
> > > > 	fio WRITE: bw=5092MiB/s (5339MB/s)
> > > > 	fio WRITE: bw=6076MiB/s (6371MB/s)
> > > > 
> > > > Maybe we need a further fix for ebb7fb1557b1 ("xfs, iomap: limit
> > > > individual ioend chain lengths in writeback").
> > > 
> > > OK, can you re-run the two 6.1.x kernels above (the slow and the
> > > fast) and record the output of `iostat -dxm 1` whilst the
> > > fio test is running? I want to see what the overall differences in
> > > the IO load on the devices are between the two runs. This will tell
> > > us how the IO sizes and queue depths change between the two kernels,
> > > etc.
> > 
> > `iostat -dxm 1` result saved in attachment file.
> > good.txt	good performance
> > bad.txt		bad performance
> 
> Thanks!
> 
> What I see here is that neither the good nor the bad config is able
> to drive the hardware to 100% utilisation, but the way the IO stack
> is behaving is identical. The only difference is that
> the good config is driving much more IO to the devices, such that
> the top level RAID0 stripe reports ~90% utilisation vs 50%
> utilisation.
> 
> What this says to me is that the limitation in throughput is that the
> single threaded background IO submission (the bdi-flush thread) is
> CPU bound in both cases, and that the difference is in how much CPU
> each IO submission is consuming.
> 
> From some tests here at lower bandwidth (1-2GB/s) with a batch size
> of 4096, I'm seeing the vast majority of submission CPU time being
> spent in folio_start_writeback(), and the vast majority of CPU time
> in IO completion being spent in folio_end_writeback. There's an
> order of magnitude more CPU time in these functions than in any of
> the XFS or iomap writeback functions.
> 
> A typical 5 second expanded snapshot profile (from `perf top -g -U`)
> of the bdi-flusher thread looks like this:
> 
>    99.22%     3.68%  [kernel]  [k] write_cache_pages
>    - 65.13% write_cache_pages
>       - 46.84% iomap_do_writepage
>          - 35.50% __folio_start_writeback
>             - 7.94% _raw_spin_lock_irqsave
>                - 11.35% do_raw_spin_lock
>                     __pv_queued_spin_lock_slowpath
>             - 5.37% _raw_spin_unlock_irqrestore
>                - 5.32% do_raw_spin_unlock
>                     __raw_callee_save___pv_queued_spin_unlock
>                - 0.92% asm_common_interrupt
>                     common_interrupt
>                     __common_interrupt
>                     handle_edge_irq
>                     handle_irq_event
>                     __handle_irq_event_percpu
>                     vring_interrupt
>                     virtblk_done
>             - 4.18% __mod_lruvec_page_state
>                - 2.18% __mod_lruvec_state
>                     1.16% __mod_node_page_state
>                     0.68% __mod_memcg_lruvec_state
>                  0.90% __mod_memcg_lruvec_state
>               2.88% xas_descend
>               1.63% percpu_counter_add_batch
>               1.63% mod_zone_page_state
>               1.15% xas_load
>               1.11% xas_start
>               0.93% __rcu_read_unlock
>             - 0.89% folio_memcg_lock
>               0.63% asm_common_interrupt
>                  common_interrupt
>                  __common_interrupt
>                  handle_edge_irq
>                  handle_irq_event
>                  __handle_irq_event_percpu
>                  vring_interrupt
>                  virtblk_done
>                  virtblk_complete_batch
>                  blk_mq_end_request_batch
>                  bio_endio
>                  iomap_writepage_end_bio
>                  iomap_finish_ioend
>          - 2.75% xfs_map_blocks
>             - 1.55% __might_sleep
>                  1.26% __might_resched
>          - 1.90% bio_add_folio
>               1.13% __bio_try_merge_page
>          - 1.82% submit_bio
>             - submit_bio_noacct
>                - 1.82% submit_bio_noacct_nocheck
>                   - __submit_bio
>                        1.77% blk_mq_submit_bio
>            1.27% inode_to_bdi
>            1.19% xas_clear_mark
>            0.65% xas_set_mark
>            0.57% iomap_page_create.isra.0
>       - 12.91% folio_clear_dirty_for_io
>          - 2.72% __mod_lruvec_page_state
>             - 1.84% __mod_lruvec_state
>                  0.98% __mod_node_page_state
>                  0.58% __mod_memcg_lruvec_state
>            1.55% mod_zone_page_state
>            1.49% percpu_counter_add_batch
>          - 0.72% asm_common_interrupt
>               common_interrupt
>               __common_interrupt
>               handle_edge_irq
>               handle_irq_event
>               __handle_irq_event_percpu
>               vring_interrupt
>               virtblk_done
>               virtblk_complete_batch
>               blk_mq_end_request_batch
>               bio_endio
>               iomap_writepage_end_bio
>               iomap_finish_ioend
>            0.55% folio_mkclean
>       - 8.08% filemap_get_folios_tag
>            1.84% xas_find_marked
>       - 1.89% __pagevec_release
>            1.87% release_pages
>       - 1.65% __might_sleep
>            1.33% __might_resched
>         1.22% folio_unlock
>    - 3.68% ret_from_fork
>         kthread
>         worker_thread
>         process_one_work
>         wb_workfn
>         wb_writeback
>         __writeback_inodes_wb
>         writeback_sb_inodes
>         __writeback_single_inode
>         do_writepages
>         xfs_vm_writepages
>         iomap_writepages
>         write_cache_pages
> 
> This indicates that 35% of writeback submission CPU is in
> __folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
> is in filemap_get_folios_tag() and only ~8% of CPU time is in the
> rest of the iomap/XFS code building and submitting bios from the
> folios passed to it.  i.e.  it looks a lot like writeback is
> contending with the incoming write(), IO completion and memory
> reclaim contexts for access to the page cache mapping and mm
> accounting structures.
> 
> Unfortunately, I don't have access to hardware that I can use to
> confirm this is the cause, but it doesn't look like it's directly an
> XFS/iomap issue at this point. The larger batch sizes reduce both
> memory reclaim and IO completion competition with submission, so it
> kinda points in this direction.
> 
> I suspect we need to start using high order folios in the write path
> where we have large user IOs for streaming writes, but I also wonder
> if there isn't some sort of batched accounting/mapping tree updates
> we could do for all the adjacent folios in a single bio....


Is there any comment from Matthew Wilcox,
since it seems to be a folio problem?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/17



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: performance regression between 6.1.x and 5.15.x
  2023-05-17 13:07                     ` Wang Yugui
@ 2023-05-17 22:11                       ` Dave Chinner
  2023-05-18 18:36                       ` Creating large folios in iomap buffered write path Matthew Wilcox
  1 sibling, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2023-05-17 22:11 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Matthew Wilcox, linux-xfs

On Wed, May 17, 2023 at 09:07:41PM +0800, Wang Yugui wrote:
> > This indicates that 35% of writeback submission CPU is in
> > __folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
> > is in filemap_get_folios_tag() and only ~8% of CPU time is in the
> > rest of the iomap/XFS code building and submitting bios from the
> > folios passed to it.  i.e.  it looks a lot like writeback is
> > contending with the incoming write(), IO completion and memory
> > reclaim contexts for access to the page cache mapping and mm
> > accounting structures.
> > 
> > Unfortunately, I don't have access to hardware that I can use to
> > confirm this is the cause, but it doesn't look like it's directly an
> > XFS/iomap issue at this point. The larger batch sizes reduce both
> > memory reclaim and IO completion competition with submission, so it
> > kinda points in this direction.
> > 
> > I suspect we need to start using high order folios in the write path
> > where we have large user IOs for streaming writes, but I also wonder
> > if there isn't some sort of batched accounting/mapping tree updates
> > we could do for all the adjacent folios in a single bio....
> 
> 
> Is there any comment from Matthew Wilcox,
> since it seems to be a folio problem?

None of these are new "folio problems" - we've known about these
scalability limitations of page-based writeback caching for over 15
years. e.g. from 2006:

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

The fundamental problem is the huge number of page cache objects
that buffered IO must handle when moving multiple GB/s to/from
storage devices. Folios offer a way to mitigate that by reducing
the number of page cache objects via using large folios in the
write() path, but we have not enabled that functionality yet.

If you want to look at making the iomap path and filemap_get_folio()
paths allocate high order folios, then that will largely mitigate
the worst of the performance degradation.

Another possible avenue is to batch all the folio updates in the IO
completion path. We currently do that one folio at a time, so a
typical IO might be doing a several dozen (or more) page cache
updates that largely could be done as a single update per IO. Worse
is that these individual updates are typically done under exclusive
locking, so this means the lock holds are not only more frequent than
they need to be, they are also longer than they need to be.
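
As a purely illustrative sketch of what such batching could look like
(folio_end_writeback_batch() is a made-up helper - nothing like it exists
today - and the per-folio iomap state and error bookkeeping that
iomap_finish_ioend() really does is left out):

static void iomap_finish_bio_writeback_batched(struct bio *bio, int error)
{
	struct folio_batch fbatch;
	struct folio_iter fi;

	folio_batch_init(&fbatch);
	bio_for_each_folio_all(fi, bio) {
		if (error)
			folio_set_error(fi.folio);
		/* gather the bio's folios; flush when the batch is full */
		if (!folio_batch_add(&fbatch, fi.folio)) {
			folio_end_writeback_batch(&fbatch);	/* hypothetical */
			folio_batch_init(&fbatch);
		}
	}
	if (folio_batch_count(&fbatch))
		folio_end_writeback_batch(&fbatch);		/* hypothetical */
}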

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Creating large folios in iomap buffered write path
  2023-05-17 13:07                     ` Wang Yugui
  2023-05-17 22:11                       ` Dave Chinner
@ 2023-05-18 18:36                       ` Matthew Wilcox
  2023-05-18 21:46                         ` Matthew Wilcox
  1 sibling, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2023-05-18 18:36 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Dave Chinner, linux-xfs, linux-fsdevel

On Wed, May 17, 2023 at 09:07:41PM +0800, Wang Yugui wrote:
> Dave Chinner wrote:
> > I suspect we need to start using high order folios in the write path
> > where we have large user IOs for streaming writes, but I also wonder
> > if there isn't some sort of batched accounting/mapping tree updates
> > we could do for all the adjacent folios in a single bio....
> 
> Is there any comment from Matthew Wilcox,
> since it seems to be a folio problem?

Not so much a "folio problem" as "an enhancement nobody got around to doing
yet".  Here's a first attempt.  It's still churning through an xfstests
run for me.  I have seen this warning trigger:

                WARN_ON_ONCE(!folio_test_uptodate(folio) &&
                             folio_test_dirty(folio));

in iomap_invalidate_folio() as it's now possible to create a folio
for write that is larger than the write, and therefore we won't
mark it uptodate.  Maybe we should create slightly smaller folios.

Anyway, how does this help your performance problem?
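
To put some rough numbers on the fgp_order() encoding below (my own
arithmetic, assuming 4kB pages): a 1MiB write gives get_order(1MiB) = 8, so
the lookup may allocate an order-8 (256-page) folio, while a 300kB write
also rounds up, to order 7 (512kB) - which is exactly how a folio can end up
larger than the write, be left !uptodate, and trip the warning above.
In code terms:

	/* illustration only, not part of the patch (4kB pages assumed) */
	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS | fgp_order(SZ_1M);

	/* FGP_ORDER(fgp) == get_order(SZ_1M) == 8, i.e. a 256-page folio,
	 * subject to whatever cap (MAX_PAGECACHE_ORDER) the filemap side
	 * of the patch applies */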


diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index c739b258a2d9..3702e5e47b0f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 	if (status)
 		return ERR_PTR(status);
 
-	folio = iomap_get_folio(iter, pos);
+	folio = iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		gfs2_trans_end(sdp);
 	return folio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..21f33731617a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -461,16 +461,18 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  * iomap_get_folio - get a folio reference for writing
  * @iter: iteration structure
  * @pos: start offset of write
+ * @len: length of write
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
  */
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgp_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
@@ -603,7 +605,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
-		return iomap_get_folio(iter, pos);
+		return iomap_get_folio(iter, pos, len);
 }
 
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..80facb9c9e5b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..5d1341862c5d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -505,14 +518,20 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
 #define FGP_STABLE		0x00000080
+#define FGP_ORDER(fgp)		((fgp) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+static inline unsigned fgp_order(size_t size)
+{
+	return get_order(size) << 26;
+}
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		unsigned fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		unsigned fgp_flags, gfp_t gfp);
 
 /**
  * filemap_get_folio - Find and get a folio.
@@ -586,7 +605,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
-					pgoff_t offset, int fgp_flags)
+					pgoff_t offset, unsigned fgp_flags)
 {
 	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..2eab5e6b6646 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1910,7 +1910,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  * Return: The found folio or an ERR_PTR() otherwise.
  */
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		unsigned fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
@@ -1952,7 +1952,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = fgp_order(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1961,26 +1963,38 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return ERR_PTR(-ENOMEM);
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			folio = filemap_alloc_folio(gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c6f056c20503..c96e88d9a262 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
 
 noinline
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		unsigned fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..59a071badb90 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Creating large folios in iomap buffered write path
  2023-05-18 18:36                       ` Creating large folios in iomap buffered write path Matthew Wilcox
@ 2023-05-18 21:46                         ` Matthew Wilcox
  2023-05-18 22:03                           ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2023-05-18 21:46 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Dave Chinner, linux-xfs, linux-fsdevel

On Thu, May 18, 2023 at 07:36:38PM +0100, Matthew Wilcox wrote:
> Not so much a "folio problem" as "an enhancement nobody got around to doing
> yet".  Here's a first attempt.  It's still churning through an xfstests
> run for me.  I have seen this warning trigger:
> 
>                 WARN_ON_ONCE(!folio_test_uptodate(folio) &&
>                              folio_test_dirty(folio));
> 
> in iomap_invalidate_folio() as it's now possible to create a folio
> for write that is larger than the write, and therefore we won't
> mark it uptodate.  Maybe we should create slightly smaller folios.

Here's one that does.  A couple of other small problems also fixed.


diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index c739b258a2d9..3702e5e47b0f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 	if (status)
 		return ERR_PTR(status);
 
-	folio = iomap_get_folio(iter, pos);
+	folio = iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		gfs2_trans_end(sdp);
 	return folio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..32ddddf9f35c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -461,19 +461,25 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  * iomap_get_folio - get a folio reference for writing
  * @iter: iteration structure
  * @pos: start offset of write
+ * @len: length of write
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
  */
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
+	struct folio *folio;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgp_order(len);
 
-	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
+	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
+	if (!IS_ERR(folio) && folio_test_large(folio))
+		printk("index:%lu len:%zu order:%u\n", (unsigned long)(pos / PAGE_SIZE), len, folio_order(folio));
+	return folio;
 }
 EXPORT_SYMBOL_GPL(iomap_get_folio);
 
@@ -510,8 +516,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 		iomap_page_release(folio);
 	} else if (folio_test_large(folio)) {
 		/* Must release the iop so the page can be split */
-		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
-			     folio_test_dirty(folio));
+		VM_WARN_ON_ONCE_FOLIO(!folio_test_uptodate(folio) &&
+				folio_test_dirty(folio), folio);
 		iomap_page_release(folio);
 	}
 }
@@ -603,7 +609,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
-		return iomap_get_folio(iter, pos);
+		return iomap_get_folio(iter, pos, len);
 }
 
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..80facb9c9e5b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..f4d05beb64eb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -505,14 +518,24 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
 #define FGP_STABLE		0x00000080
+#define FGP_ORDER(fgp)		((fgp) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+static inline unsigned fgp_order(size_t size)
+{
+	unsigned int shift = ilog2(size);
+
+	if (shift <= PAGE_SHIFT)
+		return 0;
+	return (shift - PAGE_SHIFT) << 26;
+}
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		unsigned fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		unsigned fgp_flags, gfp_t gfp);
 
 /**
  * filemap_get_folio - Find and get a folio.
@@ -586,7 +609,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
-					pgoff_t offset, int fgp_flags)
+					pgoff_t offset, unsigned fgp_flags)
 {
 	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..7abbb072d4d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1910,7 +1910,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  * Return: The found folio or an ERR_PTR() otherwise.
  */
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		unsigned fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
@@ -1952,7 +1952,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = FGP_ORDER(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1961,26 +1963,38 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return ERR_PTR(-ENOMEM);
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			folio = filemap_alloc_folio(gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c6f056c20503..c96e88d9a262 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
 
 noinline
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		unsigned fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..59a071badb90 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
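
For readers following the flag encoding: the order is packed into the
otherwise unused top six bits of the fgp flags and read back with
FGP_ORDER().  A small userspace sketch of that round trip, assuming
PAGE_SHIFT == 12; the two order helpers reimplement the
get_order()-based fgp_order() from the first version and the
ilog2()-based one above, rather than using the kernel's versions:

#include <stddef.h>
#include <stdio.h>

#define PAGE_SHIFT	12				/* assumed 4kB pages */
#define FGP_ORDER_SHIFT	26				/* top 6 bits of the fgp flags */
#define FGP_ORDER(fgp)	((fgp) >> FGP_ORDER_SHIFT)

/* like the kernel's get_order(): smallest order whose folio covers size */
static unsigned int order_roundup(size_t size)
{
	unsigned int order = 0;

	size = (size - 1) >> PAGE_SHIFT;
	while (size) {
		order++;
		size >>= 1;
	}
	return order;
}

/* like the ilog2()-based fgp_order() above: rounds down instead */
static unsigned int order_rounddown(size_t size)
{
	unsigned int shift = 0;

	while (size >> (shift + 1))
		shift++;				/* shift == ilog2(size) */
	return shift <= PAGE_SHIFT ? 0 : shift - PAGE_SHIFT;
}

int main(void)
{
	size_t sizes[] = { 292, 4096, 12288, 1048576 };

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		unsigned int down = order_rounddown(sizes[i]);
		unsigned int fgp = down << FGP_ORDER_SHIFT;	/* encode */

		printf("len %7zu: get_order %u, ilog2-based %u, decoded %u\n",
		       sizes[i], order_roundup(sizes[i]), down, FGP_ORDER(fgp));
	}
	return 0;
}

For a 1MB write both helpers ask __filemap_get_folio() for an order-8
folio; for lengths that are not a power of two (e.g. 12kB) the
get_order() form rounds up while the ilog2() form rounds down.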

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Creating large folios in iomap buffered write path
  2023-05-18 21:46                         ` Matthew Wilcox
@ 2023-05-18 22:03                           ` Matthew Wilcox
  2023-05-19  2:55                             ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2023-05-18 22:03 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Dave Chinner, linux-xfs, linux-fsdevel

On Thu, May 18, 2023 at 10:46:43PM +0100, Matthew Wilcox wrote:
> -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> +	struct folio *folio;
>  
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> +	fgp |= fgp_order(len);
>  
> -	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> +	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> +	if (!IS_ERR(folio) && folio_test_large(folio))
> +		printk("index:%lu len:%zu order:%u\n", (unsigned long)(pos / PAGE_SIZE), len, folio_order(folio));
> +	return folio;
>  }

Forgot to take the debugging out.  This should read:

-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgp_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
 }

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Creating large folios in iomap buffered write path
  2023-05-18 22:03                           ` Matthew Wilcox
@ 2023-05-19  2:55                             ` Wang Yugui
  2023-05-19 15:38                               ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-19  2:55 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Dave Chinner, linux-xfs, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 1669 bytes --]

Hi,

> On Thu, May 18, 2023 at 10:46:43PM +0100, Matthew Wilcox wrote:
> > -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> >  {
> >  	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> > +	struct folio *folio;
> >  
> >  	if (iter->flags & IOMAP_NOWAIT)
> >  		fgp |= FGP_NOWAIT;
> > +	fgp |= fgp_order(len);
> >  
> > -	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > +	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> >  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> > +	if (!IS_ERR(folio) && folio_test_large(folio))
> > +		printk("index:%lu len:%zu order:%u\n", (unsigned long)(pos / PAGE_SIZE), len, folio_order(folio));
> > +	return folio;
> >  }
> 
> Forgot to take the debugging out.  This should read:
> 
> -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
> +	fgp |= fgp_order(len);
>  
>  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
>  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
>  }

I tested it (attachment file) on 6.4.0-rc2.
fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4 -directory=/mnt/test

fio  WRITE: bw=2430MiB/s.
expected value: > 6000MiB/s
so there is no fio write bandwidth improvement yet.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/19




[-- Attachment #2: 0001-Creating-large-folios-in-iomap-buffered-write-path.patch --]
[-- Type: application/octet-stream, Size: 9714 bytes --]

From 46a20dc267cd2624841e0e12af12ea6c05e5f6b9 Mon Sep 17 00:00:00 2001
From: Matthew Wilcox <willy@infradead.org>
Date: Thu, 18 May 2023 22:46:43 +0100
Subject: [PATCH] Creating large folios in iomap buffered write path

On Thu, May 18, 2023 at 07:36:38PM +0100, Matthew Wilcox wrote:
> Not so much a "folio problem" as "an enhancement nobody got around to doing
> yet".  Here's a first attempt.  It's still churning through an xfstests
> run for me.  I have seen this warning trigger:
>
>                 WARN_ON_ONCE(!folio_test_uptodate(folio) &&
>                              folio_test_dirty(folio));
>
> in iomap_invalidate_folio() as it's now possible to create a folio
> for write that is larger than the write, and therefore we won't
> mark it uptodate.  Maybe we should create slightly smaller folios.

Here's one that does.  A couple of other small problems also fixed.
---
 fs/gfs2/bmap.c          |  2 +-
 fs/iomap/buffered-io.c  | 10 ++++++----
 include/linux/iomap.h   |  2 +-
 include/linux/pagemap.h | 29 +++++++++++++++++++++++++---
 mm/filemap.c            | 42 +++++++++++++++++++++++++++--------------
 mm/folio-compat.c       |  2 +-
 mm/readahead.c          | 13 -------------
 7 files changed, 63 insertions(+), 37 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index c739b258a2d9..3702e5e47b0f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -971,7 +971,7 @@ gfs2_iomap_get_folio(struct iomap_iter *iter, loff_t pos, unsigned len)
 	if (status)
 		return ERR_PTR(status);
 
-	folio = iomap_get_folio(iter, pos);
+	folio = iomap_get_folio(iter, pos, len);
 	if (IS_ERR(folio))
 		gfs2_trans_end(sdp);
 	return folio;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 063133ec77f4..651af2d424ac 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -461,16 +461,18 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  * iomap_get_folio - get a folio reference for writing
  * @iter: iteration structure
  * @pos: start offset of write
+ * @len: length of write
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
  */
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
+	fgp |= fgp_order(len);
 
 	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
 			fgp, mapping_gfp_mask(iter->inode->i_mapping));
@@ -510,8 +512,8 @@ void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len)
 		iomap_page_release(folio);
 	} else if (folio_test_large(folio)) {
 		/* Must release the iop so the page can be split */
-		WARN_ON_ONCE(!folio_test_uptodate(folio) &&
-			     folio_test_dirty(folio));
+		VM_WARN_ON_ONCE_FOLIO(!folio_test_uptodate(folio) &&
+				folio_test_dirty(folio), folio);
 		iomap_page_release(folio);
 	}
 }
@@ -603,7 +605,7 @@ static struct folio *__iomap_get_folio(struct iomap_iter *iter, loff_t pos,
 	if (folio_ops && folio_ops->get_folio)
 		return folio_ops->get_folio(iter, pos, len);
 	else
-		return iomap_get_folio(iter, pos);
+		return iomap_get_folio(iter, pos, len);
 }
 
 static void __iomap_put_folio(struct iomap_iter *iter, loff_t pos, size_t ret,
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index e2b836c2e119..80facb9c9e5b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -261,7 +261,7 @@ int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
 int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops);
 void iomap_readahead(struct readahead_control *, const struct iomap_ops *ops);
 bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
-struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos);
+struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
 bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
 void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
 int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index a56308a9d1a4..f4d05beb64eb 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -466,6 +466,19 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
@@ -505,14 +518,24 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
 #define FGP_STABLE		0x00000080
+#define FGP_ORDER(fgp)		((fgp) >> 26)	/* top 6 bits */
 
 #define FGP_WRITEBEGIN		(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE)
 
+static inline unsigned fgp_order(size_t size)
+{
+	unsigned int shift = ilog2(size);
+
+	if (shift <= PAGE_SHIFT)
+		return 0;
+	return (shift - PAGE_SHIFT) << 26;
+}
+
 void *filemap_get_entry(struct address_space *mapping, pgoff_t index);
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		unsigned fgp_flags, gfp_t gfp);
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp);
+		unsigned fgp_flags, gfp_t gfp);
 
 /**
  * filemap_get_folio - Find and get a folio.
@@ -586,7 +609,7 @@ static inline struct page *find_get_page(struct address_space *mapping,
 }
 
 static inline struct page *find_get_page_flags(struct address_space *mapping,
-					pgoff_t offset, int fgp_flags)
+					pgoff_t offset, unsigned fgp_flags)
 {
 	return pagecache_get_page(mapping, offset, fgp_flags, 0);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..7abbb072d4d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1910,7 +1910,7 @@ void *filemap_get_entry(struct address_space *mapping, pgoff_t index)
  * Return: The found folio or an ERR_PTR() otherwise.
  */
 struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		unsigned fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
@@ -1952,7 +1952,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
+		unsigned order = FGP_ORDER(fgp_flags);
 		int err;
+
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
 		if (fgp_flags & FGP_NOFS)
@@ -1961,26 +1963,38 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp &= ~GFP_KERNEL;
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
-
-		folio = filemap_alloc_folio(gfp, 0);
-		if (!folio)
-			return ERR_PTR(-ENOMEM);
-
 		if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
 			fgp_flags |= FGP_LOCK;
 
-		/* Init accessed so avoid atomic mark_page_accessed later */
-		if (fgp_flags & FGP_ACCESSED)
-			__folio_set_referenced(folio);
+		if (order > MAX_PAGECACHE_ORDER)
+			order = MAX_PAGECACHE_ORDER;
+		/* If we're not aligned, allocate a smaller folio */
+		if (index & ((1UL << order) - 1))
+			order = __ffs(index);
 
-		err = filemap_add_folio(mapping, folio, index, gfp);
-		if (unlikely(err)) {
+		do {
+			err = -ENOMEM;
+			if (order == 1)
+				order = 0;
+			folio = filemap_alloc_folio(gfp, order);
+			if (!folio)
+				continue;
+
+			/* Init accessed so avoid atomic mark_page_accessed later */
+			if (fgp_flags & FGP_ACCESSED)
+				__folio_set_referenced(folio);
+
+			err = filemap_add_folio(mapping, folio, index, gfp);
+			if (!err)
+				break;
 			folio_put(folio);
 			folio = NULL;
-			if (err == -EEXIST)
-				goto repeat;
-		}
+		} while (order-- > 0);
 
+		if (err == -EEXIST)
+			goto repeat;
+		if (err)
+			return ERR_PTR(err);
 		/*
 		 * filemap_add_folio locks the page, and for mmap
 		 * we expect an unlocked page.
diff --git a/mm/folio-compat.c b/mm/folio-compat.c
index c6f056c20503..c96e88d9a262 100644
--- a/mm/folio-compat.c
+++ b/mm/folio-compat.c
@@ -92,7 +92,7 @@ EXPORT_SYMBOL(add_to_page_cache_lru);
 
 noinline
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t index,
-		int fgp_flags, gfp_t gfp)
+		unsigned fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
 
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..59a071badb90 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -462,19 +462,6 @@ static int try_context_readahead(struct address_space *mapping,
 	return 1;
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 		pgoff_t mark, unsigned int order, gfp_t gfp)
 {
-- 
2.36.2


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Creating large folios in iomap buffered write path
  2023-05-19  2:55                             ` Wang Yugui
@ 2023-05-19 15:38                               ` Matthew Wilcox
  2023-05-20 13:35                                 ` Wang Yugui
  0 siblings, 1 reply; 20+ messages in thread
From: Matthew Wilcox @ 2023-05-19 15:38 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Dave Chinner, linux-xfs, linux-fsdevel

On Fri, May 19, 2023 at 10:55:29AM +0800, Wang Yugui wrote:
> Hi,
> 
> > On Thu, May 18, 2023 at 10:46:43PM +0100, Matthew Wilcox wrote:
> > > -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> > > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> > >  {
> > >  	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> > > +	struct folio *folio;
> > >  
> > >  	if (iter->flags & IOMAP_NOWAIT)
> > >  		fgp |= FGP_NOWAIT;
> > > +	fgp |= fgp_order(len);
> > >  
> > > -	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > > +	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > >  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> > > +	if (!IS_ERR(folio) && folio_test_large(folio))
> > > +		printk("index:%lu len:%zu order:%u\n", (unsigned long)(pos / PAGE_SIZE), len, folio_order(folio));
> > > +	return folio;
> > >  }
> > 
> > Forgot to take the debugging out.  This should read:
> > 
> > -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> >  {
> >  	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> >  	if (iter->flags & IOMAP_NOWAIT)
> >  		fgp |= FGP_NOWAIT;
> > +	fgp |= fgp_order(len);
> >  
> >  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> >  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> >  }
> 
> I tested it (attachment file) on 6.4.0-rc2.
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4 -directory=/mnt/test
> 
> fio  WRITE: bw=2430MiB/s.
> expected value: > 6000MiB/s
> so there is no fio write bandwidth improvement yet.

That's basically unchanged.  There's no per-page or per-block work being
done in start/end writeback, so if Dave's investigation is applicable
to your situation, I'd expect to see an improvement.

Maybe try the second version of the patch I sent with the debug in,
to confirm you really are seeing large folios being created (you might
want to use printk_ratelimit() instead of printk to ensure it doesn't
overwhelm your system)?  That fio command you were using ought to create
them, but there's always a chance it doesn't.
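
For example, a rate-limited form of the debug line from the earlier
patch could look something like this (sketch of the change only, on
top of the debug version of iomap_get_folio()):

	if (!IS_ERR(folio) && folio_test_large(folio) && printk_ratelimit())
		printk("index:%lu len:%zu order:%u\n",
		       (unsigned long)(pos / PAGE_SIZE), len, folio_order(folio));
	return folio;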

Perhaps you could use perf (the command Dave used) to see where the time
is being spent.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Creating large folios in iomap buffered write path
  2023-05-19 15:38                               ` Matthew Wilcox
@ 2023-05-20 13:35                                 ` Wang Yugui
  2023-05-20 16:35                                   ` Matthew Wilcox
  0 siblings, 1 reply; 20+ messages in thread
From: Wang Yugui @ 2023-05-20 13:35 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Dave Chinner, linux-xfs, linux-fsdevel

Hi,

> On Fri, May 19, 2023 at 10:55:29AM +0800, Wang Yugui wrote:
> > Hi,
> > 
> > > On Thu, May 18, 2023 at 10:46:43PM +0100, Matthew Wilcox wrote:
> > > > -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> > > > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> > > >  {
> > > >  	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> > > > +	struct folio *folio;
> > > >  
> > > >  	if (iter->flags & IOMAP_NOWAIT)
> > > >  		fgp |= FGP_NOWAIT;
> > > > +	fgp |= fgp_order(len);
> > > >  
> > > > -	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > > > +	folio = __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > > >  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> > > > +	if (!IS_ERR(folio) && folio_test_large(folio))
> > > > +		printk("index:%lu len:%zu order:%u\n", (unsigned long)(pos / PAGE_SIZE), len, folio_order(folio));
> > > > +	return folio;
> > > >  }
> > > 
> > > Forgot to take the debugging out.  This should read:
> > > 
> > > -struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos)
> > > +struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> > >  {
> > >  	unsigned fgp = FGP_WRITEBEGIN | FGP_NOFS;
> > >  	if (iter->flags & IOMAP_NOWAIT)
> > >  		fgp |= FGP_NOWAIT;
> > > +	fgp |= fgp_order(len);
> > >  
> > >  	return __filemap_get_folio(iter->inode->i_mapping, pos >> PAGE_SHIFT,
> > >  			fgp, mapping_gfp_mask(iter->inode->i_mapping));
> > >  }
> > 
> > I tested it (attachment file) on 6.4.0-rc2.
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4 -directory=/mnt/test
> > 
> > fio  WRITE: bw=2430MiB/s.
> > expected value: > 6000MiB/s
> > so there is no fio write bandwidth improvement yet.
> 
> That's basically unchanged.  There's no per-page or per-block work being
> done in start/end writeback, so if Dave's investigation is applicable
> to your situation, I'd expect to see an improvement.
> 
> Maybe try the second version of the patch I sent with the debug in,
> to confirm you really are seeing large folios being created (you might
> want to use printk_ratelimit() instead of printk to ensure it doesn't
> overwhelm your system)?  That fio command you were using ought to create
> them, but there's always a chance it doesn't.
> 
> Perhaps you could use perf (the command Dave used) to see where the time
> is being spent.

test result of the second version of the patch.

# dmesg |grep 'index\|suppressed'
[   89.376149] index:0 len:292 order:2
[   97.862938] index:0 len:4096 order:2
[   98.340665] index:0 len:4096 order:2
[   98.346633] index:0 len:4096 order:2
[   98.352323] index:0 len:4096 order:2
[   98.359952] index:0 len:4096 order:2
[   98.364015] index:3 len:4096 order:2
[   98.368943] index:0 len:4096 order:2
[   98.374285] index:0 len:4096 order:2
[   98.379163] index:3 len:4096 order:2
[   98.384760] index:0 len:4096 order:2
[  181.103751] iomap_get_folio: 342 callbacks suppressed
[  181.103761] index:0 len:292 order:2


'perf report -g' result:
Samples: 344K of event 'cycles', Event count (approx.): 103747884662
  Children      Self  Command          Shared Object            Symbol
+   58.73%     0.01%  fio              [kernel.kallsyms]        [k] entry_SYSCALL_64_after_hwframe
+   58.72%     0.01%  fio              [kernel.kallsyms]        [k] do_syscall_64
+   58.53%     0.00%  fio              libpthread-2.17.so       [.] 0x00007f83e400e6fd
+   58.47%     0.01%  fio              [kernel.kallsyms]        [k] ksys_write
+   58.45%     0.02%  fio              [kernel.kallsyms]        [k] vfs_write
+   58.41%     0.03%  fio              [kernel.kallsyms]        [k] xfs_file_buffered_write
+   57.96%     0.57%  fio              [kernel.kallsyms]        [k] iomap_file_buffered_write
+   27.57%     1.29%  fio              [kernel.kallsyms]        [k] iomap_write_begin
+   25.32%     0.43%  fio              [kernel.kallsyms]        [k] iomap_get_folio
+   24.84%     0.70%  fio              [kernel.kallsyms]        [k] __filemap_get_folio.part.69
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] ret_from_fork
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] kthread
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] worker_thread
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] process_one_work
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] wb_workfn
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] wb_writeback
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] __writeback_inodes_wb
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] writeback_sb_inodes
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] __writeback_single_inode
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] do_writepages
+   20.11%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] xfs_vm_writepages
+   20.10%     0.00%  kworker/u98:15-  [kernel.kallsyms]        [k] iomap_writepages
+   20.10%     0.69%  kworker/u98:15-  [kernel.kallsyms]        [k] write_cache_pages
+   16.95%     0.39%  fio              [kernel.kallsyms]        [k] copy_page_from_iter_atomic
+   16.53%     0.10%  fio              [kernel.kallsyms]        [k] copyin


'perf report ' result:

Samples: 335K of event 'cycles', Event count (approx.): 108508755052
Overhead  Command          Shared Object        Symbol
  17.70%  fio              [kernel.kallsyms]    [k] rep_movs_alternative
   2.89%  kworker/u98:2-e  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
   2.88%  fio              [kernel.kallsyms]    [k] get_page_from_freelist
   2.67%  fio              [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
   2.59%  fio              [kernel.kallsyms]    [k] asm_exc_nmi
   2.25%  swapper          [kernel.kallsyms]    [k] intel_idle
   1.59%  kworker/u98:2-e  [kernel.kallsyms]    [k] __folio_start_writeback
   1.52%  fio              [kernel.kallsyms]    [k] xas_load
   1.45%  fio              [kernel.kallsyms]    [k] lru_add_fn
   1.44%  fio              [kernel.kallsyms]    [k] xas_descend
   1.32%  fio              [kernel.kallsyms]    [k] iomap_write_begin
   1.29%  fio              [kernel.kallsyms]    [k] __filemap_add_folio
   1.08%  kworker/u98:2-e  [kernel.kallsyms]    [k] folio_clear_dirty_for_io
   1.07%  fio              [kernel.kallsyms]    [k] __folio_mark_dirty
   0.94%  fio              [kernel.kallsyms]    [k] iomap_set_range_uptodate.part.24
   0.93%  fio              [kernel.kallsyms]    [k] node_dirty_ok
   0.92%  kworker/u98:2-e  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
   0.83%  fio              [kernel.kallsyms]    [k] xas_start
   0.83%  fio              [kernel.kallsyms]    [k] __alloc_pages
   0.83%  fio              [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
   0.81%  kworker/u98:2-e  [kernel.kallsyms]    [k] asm_exc_nmi
   0.79%  fio              [kernel.kallsyms]    [k] percpu_counter_add_batch
   0.75%  kworker/u98:2-e  [kernel.kallsyms]    [k] iomap_writepage_map
   0.74%  kworker/u98:2-e  [kernel.kallsyms]    [k] __mod_lruvec_page_state
   0.70%  fio              fio                  [.] 0x000000000001b1ac
   0.70%  fio              [kernel.kallsyms]    [k] filemap_dirty_folio
   0.69%  kworker/u98:2-e  [kernel.kallsyms]    [k] write_cache_pages
   0.69%  fio              [kernel.kallsyms]    [k] __filemap_get_folio.part.69
   0.67%  kworker/1:0-eve  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
   0.66%  fio              [kernel.kallsyms]    [k] __mod_lruvec_page_state
   0.64%  fio              [kernel.kallsyms]    [k] __mod_node_page_state
   0.64%  fio              [kernel.kallsyms]    [k] folio_add_lru
   0.64%  fio              [kernel.kallsyms]    [k] balance_dirty_pages_ratelimited_flags
   0.62%  fio              [kernel.kallsyms]    [k] __mod_memcg_lruvec_state
   0.61%  fio              [kernel.kallsyms]    [k] iomap_write_end
   0.60%  fio              [kernel.kallsyms]    [k] xas_find_conflict
   0.59%  fio              [kernel.kallsyms]    [k] bad_range
   0.58%  kworker/u98:2-e  [kernel.kallsyms]    [k] xas_load
   0.57%  fio              [kernel.kallsyms]    [k] iomap_file_buffered_write
   0.56%  kworker/u98:2-e  [kernel.kallsyms]    [k] percpu_counter_add_batch
   0.49%  fio              [kernel.kallsyms]    [k] __might_resched


'perf top -g -U' result:
Samples: 78K of event 'cycles', 4000 Hz, Event count (approx.): 29400815085 lost: 0/0 drop: 0/0
  Children      Self  Shared Object       Symbol
+   62.59%     0.03%  [kernel]            [k] entry_SYSCALL_64_after_hwframe
+   60.15%     0.02%  [kernel]            [k] do_syscall_64
+   59.45%     0.02%  [kernel]            [k] vfs_write
+   59.09%     0.54%  [kernel]            [k] iomap_file_buffered_write
+   57.41%     0.00%  [kernel]            [k] ksys_write
+   57.36%     0.01%  [kernel]            [k] xfs_file_buffered_write
+   37.82%     0.00%  libpthread-2.17.so  [.] 0x00007fce6f20e6fd
+   26.83%     1.20%  [kernel]            [k] iomap_write_begin
+   24.65%     0.45%  [kernel]            [k] iomap_get_folio
+   24.15%     0.74%  [kernel]            [k] __filemap_get_folio.part.69
+   20.17%     0.00%  [kernel]            [k] __writeback_single_inode
+   20.17%     0.65%  [kernel]            [k] write_cache_pages
+   17.66%     0.43%  [kernel]            [k] copy_page_from_iter_atomic
+   17.18%     0.12%  [kernel]            [k] copyin
+   17.08%    16.71%  [kernel]            [k] rep_movs_alternative
+   16.97%     0.00%  [kernel]            [k] ret_from_fork
+   16.97%     0.00%  [kernel]            [k] kthread
+   16.87%     0.00%  [kernel]            [k] worker_thread
+   16.84%     0.00%  [kernel]            [k] process_one_work
+   14.86%     0.17%  [kernel]            [k] filemap_add_folio
+   13.83%     0.77%  [kernel]            [k] iomap_writepage_map
+   11.90%     0.33%  [kernel]            [k] iomap_finish_ioend
+   11.57%     0.23%  [kernel]            [k] folio_end_writeback
+   11.51%     0.73%  [kernel]            [k] iomap_write_end
+   11.30%     2.14%  [kernel]            [k] __folio_end_writeback
+   10.70%     0.00%  [kernel]            [k] wb_workfn
+   10.70%     0.00%  [kernel]            [k] wb_writeback
+   10.70%     0.00%  [kernel]            [k] __writeback_inodes_wb
+   10.70%     0.00%  [kernel]            [k] writeback_sb_inodes
+   10.70%     0.00%  [kernel]            [k] do_writepages
+   10.70%     0.00%  [kernel]            [k] xfs_vm_writepages
+   10.70%     0.00%  [kernel]            [k] iomap_writepages
+   10.19%     2.68%  [kernel]            [k] _raw_spin_lock_irqsave
+   10.17%     1.35%  [kernel]            [k] __filemap_add_folio
+    9.94%     0.00%  [unknown]           [k] 0x0000000001942a70
+    9.94%     0.00%  [unknown]           [k] 0x0000000001942ac0
+    9.94%     0.00%  [unknown]           [k] 0x0000000001942b30

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/20




^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Creating large folios in iomap buffered write path
  2023-05-20 13:35                                 ` Wang Yugui
@ 2023-05-20 16:35                                   ` Matthew Wilcox
  0 siblings, 0 replies; 20+ messages in thread
From: Matthew Wilcox @ 2023-05-20 16:35 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Dave Chinner, linux-xfs, linux-fsdevel

On Sat, May 20, 2023 at 09:35:32PM +0800, Wang Yugui wrote:
> test result of the second version of the patch.
> 
> # dmesg |grep 'index\|suppressed'
> [   89.376149] index:0 len:292 order:2
> [   97.862938] index:0 len:4096 order:2
> [   98.340665] index:0 len:4096 order:2
> [   98.346633] index:0 len:4096 order:2
> [   98.352323] index:0 len:4096 order:2
> [   98.359952] index:0 len:4096 order:2
> [   98.364015] index:3 len:4096 order:2
> [   98.368943] index:0 len:4096 order:2
> [   98.374285] index:0 len:4096 order:2
> [   98.379163] index:3 len:4096 order:2
> [   98.384760] index:0 len:4096 order:2
> [  181.103751] iomap_get_folio: 342 callbacks suppressed
> [  181.103761] index:0 len:292 order:2

Thanks.  Clearly we're not creating large folios in the write path.
I tracked it down, and used your fio command.  My system creates 1MB
folios, so I think yours will too.  Patch series incoming (I fixed a
couple of other oversights too).

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-05-20 16:36 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-08  9:24 performance regression between 6.1.x and 5.15.x Wang Yugui
2023-05-08 14:46 ` Wang Yugui
2023-05-08 22:32   ` Dave Chinner
2023-05-08 23:25     ` Wang Yugui
2023-05-09  1:36       ` Dave Chinner
2023-05-09 12:37         ` Wang Yugui
2023-05-09 22:14           ` Dave Chinner
2023-05-10  5:46             ` Wang Yugui
2023-05-10  7:27               ` Dave Chinner
2023-05-10  8:50                 ` Wang Yugui
2023-05-11  1:34                   ` Dave Chinner
2023-05-17 13:07                     ` Wang Yugui
2023-05-17 22:11                       ` Dave Chinner
2023-05-18 18:36                       ` Creating large folios in iomap buffered write path Matthew Wilcox
2023-05-18 21:46                         ` Matthew Wilcox
2023-05-18 22:03                           ` Matthew Wilcox
2023-05-19  2:55                             ` Wang Yugui
2023-05-19 15:38                               ` Matthew Wilcox
2023-05-20 13:35                                 ` Wang Yugui
2023-05-20 16:35                                   ` Matthew Wilcox
