* FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
@ 2017-05-06  1:37 Scott Branden
From: Scott Branden @ 2017-05-06  1:37 UTC
  To: linux-arm-kernel, Will Deacon, Mark Rutland, Arnd Bergmann,
	Russell King, Catalin Marinas, linux-kernel,
	bcm-kernel-feedback-list, Olof Johansson

I have updated the kernel to 4.11 and see significant performance
drops using fio-2.9.

Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
single core and task.  The percentage drop becomes even worse if
multiple cores and threads are used.

The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
results, or does anyone know what may have changed to cause such a
dramatic drop?

The FIO command and resulting log output are below, using null_blk to
remove as many hardware-specific driver dependencies as possible.

modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0 \
  submit_queues=1 bs=4096

taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1 \
  --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k \
  --iodepth=128 --time_based --runtime=15 --readwrite=read
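
To sanity-check that the module really loaded with these values, the
null_blk parameters can be read back from sysfs, e.g.:

cat /sys/module/null_blk/parameters/queue_mode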

**** 281 KIOPS RESULT on 4.11 Kernel ****
readtest: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
fio-2.9
Starting 1 process
Jobs: 1 (f=1): [R(1)] [100.0% done] [1098MB/0KB/0KB /s] [281K/0/0 iops] [eta 00m:00s]
readtest: (groupid=0, jobs=1): err= 0: pid=2868: Mon Apr  3 20:24:25 2017
   read : io=16456MB, bw=1096.1MB/s, iops=280825, runt= 15001msec
   cpu          : usr=28.35%, sys=71.55%, ctx=1560, majf=0, minf=146
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
      issued    : total=r=4212670/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
    READ: io=16456MB, aggrb=1096.1MB/s, minb=1096.1MB/s, maxb=1096.1MB/s, mint=15001msec, maxt=15001msec

Disk stats (read/write):
   nullb0: ios=4185627/0, merge=0/0, ticks=3664/0, in_queue=3308, util=22.05%


**** 207 KIOPS RESULT on 4.10 Kernel ****
taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1 \
  --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k \
  --iodepth=128 --time_based --runtime=15 --readwrite=read
readtest: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
fio-2.9
Starting 1 process
Jobs: 1 (f=1): [R(1)] [100.0% done] [807.6MB/0KB/0KB /s] [207K/0/0 iops] [eta 00m:00s]
readtest: (groupid=0, jobs=1): err= 0: pid=2832: Mon Apr  3 20:09:31 2017
   read : io=12109MB, bw=826620KB/s, iops=206654, runt= 15001msec
   cpu          : usr=24.62%, sys=75.28%, ctx=1571, majf=0, minf=146
   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
      issued    : total=r=3100030/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
      latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
    READ: io=12109MB, aggrb=826619KB/s, minb=826619KB/s, maxb=826619KB/s, mint=15001msec, maxt=15001msec

Disk stats (read/write):
   nullb0: ios=3080149/0, merge=0/0, ticks=3952/0, in_queue=3560, util=23.73%



Regards,
  Scott

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-06  1:37 Scott Branden
@ 2017-05-06  1:54 Scott Branden
From: Scott Branden @ 2017-05-06  1:54 UTC
  To: linux-arm-kernel, Will Deacon, Mark Rutland, Arnd Bergmann,
	Russell King, Catalin Marinas, linux-kernel,
	bcm-kernel-feedback-list, Olof Johansson

Please note that the 4.11 and 4.10 log results in my previous email
were reversed - see the corrections below.

A performance regression is observed - FIO performance in 4.11 is
much lower than in 4.10 on an ARM64-based platform.

On 17-05-05 06:37 PM, Scott Branden wrote:
> I have updated the kernel to 4.11 and see significant performance
> drops using fio-2.9.
>
> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
> single core and task.  The percentage drop becomes even worse if
> multiple cores and threads are used.
>
> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
> results, or does anyone know what may have changed to cause such a
> dramatic drop?
>
> The FIO command and resulting log output are below, using null_blk to
> remove as many hardware-specific driver dependencies as possible.
>
> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
> submit_queues=1 bs=4096
>
> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
> --iodepth=128 --time_based --runtime=15 --readwrite=read
>
> **** 281 KIOPS RESULT on 4.11 Kernel ****
CORRECTION:
**** 281 KIOPS RESULT on 4.10 Kernel ****

> readtest: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
> fio-2.9
> Starting 1 process
> Jobs: 1 (f=1): [R(1)] [100.0% done] [1098MB/0KB/0KB /s] [281K/0/0 iops] [eta 00m:00s]
> readtest: (groupid=0, jobs=1): err= 0: pid=2868: Mon Apr  3 20:24:25 2017
>   read : io=16456MB, bw=1096.1MB/s, iops=280825, runt= 15001msec
>   cpu          : usr=28.35%, sys=71.55%, ctx=1560, majf=0, minf=146
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=4212670/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
>    READ: io=16456MB, aggrb=1096.1MB/s, minb=1096.1MB/s, maxb=1096.1MB/s, mint=15001msec, maxt=15001msec
>
> Disk stats (read/write):
>   nullb0: ios=4185627/0, merge=0/0, ticks=3664/0, in_queue=3308, util=22.05%
>
>
> **** 207 KIOPS RESULT on 4.10 Kernel ****
CORRECTION:
**** 207 KIOPS RESULT on 4.11 Kernel ****

> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
> --iodepth=128 --time_based --runtime=15 --readwrite=read
> readtest: (g=0): rw=read, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
> fio-2.9
> Starting 1 process
> Jobs: 1 (f=1): [R(1)] [100.0% done] [807.6MB/0KB/0KB /s] [207K/0/0 iops] [eta 00m:00s]
> readtest: (groupid=0, jobs=1): err= 0: pid=2832: Mon Apr  3 20:09:31 2017
>   read : io=12109MB, bw=826620KB/s, iops=206654, runt= 15001msec
>   cpu          : usr=24.62%, sys=75.28%, ctx=1571, majf=0, minf=146
>   IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>      issued    : total=r=3100030/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
>      latency   : target=0, window=0, percentile=100.00%, depth=128
>
> Run status group 0 (all jobs):
>    READ: io=12109MB, aggrb=826619KB/s, minb=826619KB/s, maxb=826619KB/s, mint=15001msec, maxt=15001msec
>
> Disk stats (read/write):
>   nullb0: ios=3080149/0, merge=0/0, ticks=3952/0, in_queue=3560, util=23.73%
>
>
>
> Regards,
>  Scott

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-06  1:37 Scott Branden
@ 2017-05-08 11:07 Will Deacon
From: Will Deacon @ 2017-05-08 11:07 UTC
  To: Scott Branden
  Cc: linux-arm-kernel, Mark Rutland, Arnd Bergmann, Russell King,
	Catalin Marinas, linux-kernel, bcm-kernel-feedback-list,
	Olof Johansson

Hi Scott,

Thanks for the report.

On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
> I have updated the kernel to 4.11 and see significant performance
> drops using fio-2.9.
>
> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
> single core and task.  The percentage drop becomes even worse if
> multiple cores and threads are used.
>
> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
> results, or does anyone know what may have changed to cause such a
> dramatic drop?
>
> The FIO command and resulting log output are below, using null_blk to
> remove as many hardware-specific driver dependencies as possible.
> 
> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
> submit_queues=1 bs=4096
> 
> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
> --iodepth=128 --time_based --runtime=15 --readwrite=read

I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
log.

Things you could try:

  1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
     defconfig between the releases).

  2. Try to reproduce on an x86 box

  3. Have a go at bisecting the issue, so we can revert the offender if
     necessary.
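
For the bisect (item 3), a minimal sketch, assuming the null_blk/fio
commands above as the reproducer and manual good/bad marking:

  $ git bisect start v4.11 v4.10
  # build and boot each kernel git checks out, run the fio command,
  # then mark the result:
  $ git bisect good    # or: git bisect bad
  # repeat until git reports the first bad commit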

Cheers,

Will

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-08 11:07 Will Deacon
@ 2017-05-08 11:19 Arnd Bergmann
From: Arnd Bergmann @ 2017-05-08 11:19 UTC
  To: Will Deacon
  Cc: Scott Branden, linux-arm-kernel, Mark Rutland, Russell King,
	Catalin Marinas, Linux Kernel Mailing List,
	bcm-kernel-feedback-list, Olof Johansson, Jens Axboe

On Mon, May 8, 2017 at 1:07 PM, Will Deacon <will.deacon@arm.com> wrote:
> Hi Scott,
>
> Thanks for the report.
>
> On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
>> I have updated the kernel to 4.11 and see significant performance
>> drops using fio-2.9.
>>
>> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
>> single core and task.  The percentage drop becomes even worse if
>> multiple cores and threads are used.
>>
>> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
>> results, or does anyone know what may have changed to cause such a
>> dramatic drop?
>>
>> The FIO command and resulting log output are below, using null_blk to
>> remove as many hardware-specific driver dependencies as possible.
>>
>> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
>> submit_queues=1 bs=4096
>>
>> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
>> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
>> --iodepth=128 --time_based --runtime=15 --readwrite=read
>
> I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
> my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
> log.
>
> Things you could try:
>
>   1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
>      defconfig between the releases).
>
>   2. Try to reproduce on an x86 box
>
>   3. Have a go at bisecting the issue, so we can revert the offender if
>      necessary.

One more thing to try early: As 4.11 gained support for blk-mq I/O
schedulers compared to 4.10, null_blk will now also need some extra
cycles for each I/O request. Try loading the driver with "queue_mode=0"
or "queue_mode=1" instead of "queue_mode=2".

        Arnd

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-08 11:19 Arnd Bergmann
@ 2017-05-08 14:08 Jens Axboe
From: Jens Axboe @ 2017-05-08 14:08 UTC
  To: Arnd Bergmann, Will Deacon
  Cc: Scott Branden, linux-arm-kernel, Mark Rutland, Russell King,
	Catalin Marinas, Linux Kernel Mailing List,
	bcm-kernel-feedback-list, Olof Johansson

On 05/08/2017 05:19 AM, Arnd Bergmann wrote:
> On Mon, May 8, 2017 at 1:07 PM, Will Deacon <will.deacon@arm.com> wrote:
>> Hi Scott,
>>
>> Thanks for the report.
>>
>> On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
>>> I have updated the kernel to 4.11 and see significant performance
>>> drops using fio-2.9.
>>>
>>> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
>>> single core and task.  The percentage drop becomes even worse if
>>> multiple cores and threads are used.
>>>
>>> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
>>> results, or does anyone know what may have changed to cause such a
>>> dramatic drop?
>>>
>>> The FIO command and resulting log output are below, using null_blk to
>>> remove as many hardware-specific driver dependencies as possible.
>>>
>>> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
>>> submit_queues=1 bs=4096
>>>
>>> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
>>> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
>>> --iodepth=128 --time_based --runtime=15 --readwrite=read
>>
>> I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
>> my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
>> log.
>>
>> Things you could try:
>>
>>   1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
>>      defconfig between the releases).
>>
>>   2. Try to reproduce on an x86 box
>>
>>   3. Have a go at bisecting the issue, so we can revert the offender if
>>      necessary.
> 
> One more thing to try early: As 4.11 gained support for blk-mq I/O
> schedulers compared to 4.10, null_blk will now also need some extra
> cycles for each I/O request. Try loading the driver with "queue_mode=0"
> or "queue_mode=1" instead of "queue_mode=2".

Since you have submit_queues=1 set, the device comes up with the
mq-deadline scheduler attached. To compare 4.10 and 4.11 with
queue_mode=2 and submit_queues=1, after loading null_blk in 4.11, do:

# echo none > /sys/block/nullb0/queue/scheduler

and re-test.
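
To verify which scheduler is active, read the same file back; the
entry in brackets is the one in use (output here is a sketch):

# cat /sys/block/nullb0/queue/scheduler
[none] mq-deadline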

-- 
Jens Axboe

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-08 14:08 Jens Axboe
@ 2017-05-08 15:24 Will Deacon
From: Will Deacon @ 2017-05-08 15:24 UTC
  To: Jens Axboe
  Cc: Arnd Bergmann, Scott Branden, linux-arm-kernel, Mark Rutland,
	Russell King, Catalin Marinas, Linux Kernel Mailing List,
	bcm-kernel-feedback-list, Olof Johansson

On Mon, May 08, 2017 at 08:08:55AM -0600, Jens Axboe wrote:
> On 05/08/2017 05:19 AM, Arnd Bergmann wrote:
> > On Mon, May 8, 2017 at 1:07 PM, Will Deacon <will.deacon@arm.com> wrote:
> >> On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
> >>> I have updated the kernel to 4.11 and see significant performance
> >>> drops using fio-2.9.
> >>>
> >>> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
> >>> single core and task.  The percentage drop becomes even worse if
> >>> multiple cores and threads are used.
> >>>
> >>> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
> >>> results, or does anyone know what may have changed to cause such a
> >>> dramatic drop?
> >>>
> >>> The FIO command and resulting log output are below, using null_blk to
> >>> remove as many hardware-specific driver dependencies as possible.
> >>>
> >>> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
> >>> submit_queues=1 bs=4096
> >>>
> >>> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
> >>> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
> >>> --iodepth=128 --time_based --runtime=15 --readwrite=read
> >>
> >> I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
> >> my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
> >> log.
> >>
> >> Things you could try:
> >>
> >>   1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
> >>      defconfig between the releases).
> >>
> >>   2. Try to reproduce on an x86 box
> >>
> >>   3. Have a go at bisecting the issue, so we can revert the offender if
> >>      necessary.
> > 
> > One more thing to try early: As 4.11 gained support for blk-mq I/O
> > schedulers compared to 4.10, null_blk will now also need some extra
> > cycles for each I/O request. Try loading the driver with "queue_mode=0"
> > or "queue_mode=1" instead of "queue_mode=2".
> 
> Since you have 1 submit queues set, you are being loaded with deadline
> attached. To compare 4.10 and 4.11, with queue_mode=2 and submit_queues=1,
> after loading null_blk in 4.11, do:
> 
> # echo none > /sys/block/nullb0/queue/scheduler
> 
> and re-test.

On my setup, doing this restored a bunch of the performance, but the numbers
are still ~5% worse than 4.10 (as opposed to ~20% worse with mq-deadline).
Disabling NUMA as well cuts this down to ~2%.

Scott -- do you see the same sort of thing?

Will

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-08 15:24 Will Deacon
@ 2017-05-08 15:28 Jens Axboe
From: Jens Axboe @ 2017-05-08 15:28 UTC
  To: Will Deacon
  Cc: Arnd Bergmann, Scott Branden, linux-arm-kernel, Mark Rutland,
	Russell King, Catalin Marinas, Linux Kernel Mailing List,
	bcm-kernel-feedback-list, Olof Johansson

On 05/08/2017 09:24 AM, Will Deacon wrote:
> On Mon, May 08, 2017 at 08:08:55AM -0600, Jens Axboe wrote:
>> On 05/08/2017 05:19 AM, Arnd Bergmann wrote:
>>> On Mon, May 8, 2017 at 1:07 PM, Will Deacon <will.deacon@arm.com> wrote:
>>>> On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
>>>>> I have updated the kernel to 4.11 and see significant performance
>>>>> drops using fio-2.9.
>>>>>
>>>>> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
>>>>> single core and task.  The percentage drop becomes even worse if
>>>>> multiple cores and threads are used.
>>>>>
>>>>> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
>>>>> results, or does anyone know what may have changed to cause such a
>>>>> dramatic drop?
>>>>>
>>>>> The FIO command and resulting log output are below, using null_blk to
>>>>> remove as many hardware-specific driver dependencies as possible.
>>>>>
>>>>> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
>>>>> submit_queues=1 bs=4096
>>>>>
>>>>> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
>>>>> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
>>>>> --iodepth=128 --time_based --runtime=15 --readwrite=read
>>>>
>>>> I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
>>>> my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
>>>> log.
>>>>
>>>> Things you could try:
>>>>
>>>>   1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
>>>>      defconfig between the releases).
>>>>
>>>>   2. Try to reproduce on an x86 box
>>>>
>>>>   3. Have a go at bisecting the issue, so we can revert the offender if
>>>>      necessary.
>>>
>>> One more thing to try early: As 4.11 gained support for blk-mq I/O
>>> schedulers compared to 4.10, null_blk will now also need some extra
>>> cycles for each I/O request. Try loading the driver with "queue_mode=0"
>>> or "queue_mode=1" instead of "queue_mode=2".
>>
>> Since you have 1 submit queues set, you are being loaded with deadline
>> attached. To compare 4.10 and 4.11, with queue_mode=2 and submit_queues=1,
>> after loading null_blk in 4.11, do:
>>
>> # echo none > /sys/block/nullb0/queue/scheduler
>>
>> and re-test.
> 
> On my setup, doing this restored a bunch of the performance, but the numbers
> are still ~5% worse than 4.10 (as opposed to ~20% worse with mq-deadline).
> Disabling NUMA as well cuts this down to ~2%.

So we're down to 2%. How stable are these numbers? With mq-deadline attached,
I'm not surprised there's a drop for a null_blk type of test.

Maybe a perf profile comparison between the two kernels would help?
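
Something along these lines on each kernel would be a start (the perf
options are only a suggestion):

# perf record -g -- taskset 0x1 fio [same arguments as before]
# perf report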

-- 
Jens Axboe

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-08 15:24 Will Deacon
@ 2017-05-08 16:32 Scott Branden
From: Scott Branden @ 2017-05-08 16:32 UTC
  To: Will Deacon, Jens Axboe
  Cc: Arnd Bergmann, linux-arm-kernel, Mark Rutland, Russell King,
	Catalin Marinas, Linux Kernel Mailing List,
	bcm-kernel-feedback-list, Olof Johansson

Hi Will/Jens,

Thanks for reproducing.  Comments inline.

On 17-05-08 08:24 AM, Will Deacon wrote:
> On Mon, May 08, 2017 at 08:08:55AM -0600, Jens Axboe wrote:
>> On 05/08/2017 05:19 AM, Arnd Bergmann wrote:
>>> On Mon, May 8, 2017 at 1:07 PM, Will Deacon <will.deacon@arm.com> wrote:
>>>> On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
>>>>> I have updated the kernel to 4.11 and see significant performance
>>>>> drops using fio-2.9.
>>>>>
>>>>> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
>>>>> single core and task.  The percentage drop becomes even worse if
>>>>> multiple cores and threads are used.
>>>>>
>>>>> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
>>>>> results, or does anyone know what may have changed to cause such a
>>>>> dramatic drop?
>>>>>
>>>>> The FIO command and resulting log output are below, using null_blk to
>>>>> remove as many hardware-specific driver dependencies as possible.
>>>>>
>>>>> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
>>>>> submit_queues=1 bs=4096
>>>>>
>>>>> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
>>>>> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
>>>>> --iodepth=128 --time_based --runtime=15 --readwrite=read
>>>>
>>>> I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
>>>> my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
>>>> log.
>>>>
>>>> Things you could try:
>>>>
>>>>   1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
>>>>      defconfig between the releases).
>>>>
>>>>   2. Try to reproduce on an x86 box
>>>>
>>>>   3. Have a go at bisecting the issue, so we can revert the offender if
>>>>      necessary.
>>>
>>> One more thing to try early: As 4.11 gained support for blk-mq I/O
>>> schedulers compared to 4.10, null_blk will now also need some extra
>>> cycles for each I/O request. Try loading the driver with "queue_mode=0"
>>> or "queue_mode=1" instead of "queue_mode=2".
>>
>> Since you have 1 submit queues set, you are being loaded with deadline
>> attached. To compare 4.10 and 4.11, with queue_mode=2 and submit_queues=1,
>> after loading null_blk in 4.11, do:
>>
>> # echo none > /sys/block/nullb0/queue/scheduler
>>
>> and re-test.
>
> On my setup, doing this restored a bunch of the performance, but the numbers
> are still ~5% worse than 4.10 (as opposed to ~20% worse with mq-deadline).
> Disabling NUMA as well cuts this down to ~2%.
>
> Scott -- do you see the same sort of thing?
NUMA was already disabled in my defconfig.

Using the echo to the scheduler restored half of my performance loss
vs. 4.10:

echo none > /sys/block/nullb0/queue/scheduler

I will spend some time comparing and building defconfigs.
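
For the config comparison, the kernel's own scripts/diffconfig helper
should do the job (the paths here are placeholders):

scripts/diffconfig /path/to/4.10/.config /path/to/4.11/.config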
>
> Will
>

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-08 15:28 Jens Axboe
@ 2017-05-08 17:38 Scott Branden
From: Scott Branden @ 2017-05-08 17:38 UTC
  To: Jens Axboe, Will Deacon
  Cc: Arnd Bergmann, linux-arm-kernel, Mark Rutland, Russell King,
	Catalin Marinas, Linux Kernel Mailing List,
	bcm-kernel-feedback-list, Olof Johansson

Hi Jens/Will,

A more complex FIO test is provided inline.  I think more than one
change in 4.11 has degraded performance.

On 17-05-08 08:28 AM, Jens Axboe wrote:
> On 05/08/2017 09:24 AM, Will Deacon wrote:
>> On Mon, May 08, 2017 at 08:08:55AM -0600, Jens Axboe wrote:
>>> On 05/08/2017 05:19 AM, Arnd Bergmann wrote:
>>>> On Mon, May 8, 2017 at 1:07 PM, Will Deacon <will.deacon@arm.com> wrote:
>>>>> On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
>>>>>> I have updated the kernel to 4.11 and see significant performance
>>>>>> drops using fio-2.9.
>>>>>>
>>>>>> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
>>>>>> single core and task.  The percentage drop becomes even worse if
>>>>>> multiple cores and threads are used.
>>>>>>
>>>>>> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
>>>>>> results, or does anyone know what may have changed to cause such a
>>>>>> dramatic drop?
>>>>>>
>>>>>> The FIO command and resulting log output are below, using null_blk to
>>>>>> remove as many hardware-specific driver dependencies as possible.
>>>>>>
>>>>>> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
>>>>>> submit_queues=1 bs=4096
>>>>>>
>>>>>> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
>>>>>> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
>>>>>> --iodepth=128 --time_based --runtime=15 --readwrite=read
>>>>>
>>>>> I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
>>>>> my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
>>>>> log.
>>>>>
>>>>> Things you could try:
>>>>>
>>>>>   1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
>>>>>      defconfig between the releases).
>>>>>
>>>>>   2. Try to reproduce on an x86 box
>>>>>
>>>>>   3. Have a go at bisecting the issue, so we can revert the offender if
>>>>>      necessary.
>>>>
>>>> One more thing to try early: As 4.11 gained support for blk-mq I/O
>>>> schedulers compared to 4.10, null_blk will now also need some extra
>>>> cycles for each I/O request. Try loading the driver with "queue_mode=0"
>>>> or "queue_mode=1" instead of "queue_mode=2".
>>>
>>> Since you have 1 submit queues set, you are being loaded with deadline
>>> attached. To compare 4.10 and 4.11, with queue_mode=2 and submit_queues=1,
>>> after loading null_blk in 4.11, do:
>>>
>>> # echo none > /sys/block/nullb0/queue/scheduler
>>>
>>> and re-test.
>>
>> On my setup, doing this restored a bunch of the performance, but the numbers
>> are still ~5% worse than 4.10 (as opposed to ~20% worse with mq-deadline).
>> Disabling NUMA as well cuts this down to ~2%.
>
> So we're down to 2%. How stable are these numbers? With mq-deadline attached,
> I'm not surprised there's a drop for a null_blk type of test.
Could you try the following FIO test as well?  It is substantially
worse on 4.11 vs. 4.10.  Echoing none to the scheduler has some
benefit, but with queue_mode=0 4.11 is actually slightly better than
4.10.  So the blk-mq change Arnd mentioned also has a negative impact?

modprobe null_blk nr_devices=4

fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=readtest \
  --filename=/dev/nullb0:/dev/nullb1:/dev/nullb2:/dev/nullb3 --bs=4k \
  --iodepth=128 --time_based --runtime=10 --readwrite=randread \
  --iodepth_low=96 --iodepth_batch=16 --numjobs=8

>
> Maybe a perf profile comparison between the two kernels would help?
>

* Re: FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64
  2017-05-08 15:28 Jens Axboe
@ 2017-05-15 20:10 Scott Branden
From: Scott Branden @ 2017-05-15 20:10 UTC
  To: Jens Axboe, Will Deacon
  Cc: Arnd Bergmann, linux-arm-kernel, Mark Rutland, Russell King,
	Catalin Marinas, Linux Kernel Mailing List,
	bcm-kernel-feedback-list, Olof Johansson

Hi Jens,

Details on the bisect are inline.


On 17-05-08 08:28 AM, Jens Axboe wrote:
> On 05/08/2017 09:24 AM, Will Deacon wrote:
>> On Mon, May 08, 2017 at 08:08:55AM -0600, Jens Axboe wrote:
>>> On 05/08/2017 05:19 AM, Arnd Bergmann wrote:
>>>> On Mon, May 8, 2017 at 1:07 PM, Will Deacon <will.deacon@arm.com> wrote:
>>>>> On Fri, May 05, 2017 at 06:37:55PM -0700, Scott Branden wrote:
>>>>>> I have updated the kernel to 4.11 and see significant performance
>>>>>> drops using fio-2.9.
>>>>>>
>>>>>> Using FIO, the performance drops from 281 KIOPS to 207 KIOPS using a
>>>>>> single core and task.  The percentage drop becomes even worse if
>>>>>> multiple cores and threads are used.
>>>>>>
>>>>>> The platform is an ARM64-based Cortex-A72.  Can somebody reproduce the
>>>>>> results, or does anyone know what may have changed to cause such a
>>>>>> dramatic drop?
>>>>>>
>>>>>> The FIO command and resulting log output are below, using null_blk to
>>>>>> remove as many hardware-specific driver dependencies as possible.
>>>>>>
>>>>>> modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0
>>>>>> submit_queues=1 bs=4096
>>>>>>
>>>>>> taskset 0x1 fio --randrepeat=1 --ioengine=libaio --direct=1 --numjobs=1
>>>>>> --gtod_reduce=1 --name=readtest --filename=/dev/nullb0 --bs=4k
>>>>>> --iodepth=128 --time_based --runtime=15 --readwrite=read
>>>>> I can confirm that I also see a ~20% drop in results from 4.10 to 4.11 on
>>>>> my AMD Seattle board w/ defconfig, but I can't see anything obvious in the
>>>>> log.
>>>>>
>>>>> Things you could try:
>>>>>
>>>>>    1. Try disabling CONFIG_NUMA in the 4.11 kernel (this was enabled in
>>>>>       defconfig between the releases).
>>>>>
>>>>>    2. Try to reproduce on an x86 box
>>>>>
>>>>>    3. Have a go at bisecting the issue, so we can revert the offender if
>>>>>       necessary.
The 4.11 kernel has numerous performance regressions.  I only bisected 
up to 4.11-rc1 as merge conflicts and complexities arise after that.

The first performance regression is:
b86dd815ff74 "block: get rid of blk-mq default scheduler choice Kconfig
entries"

Using "echo none > /sys/block/nullb0/queue/scheduler" does not restore
all of the performance loss at this point.  I needed to revert this
change to recover it fully.

Second Performance regression is:
113285b47382 "blk-mq: ensure that bd->last is always set correctly"

Third Performance regression is:
a528d35e8bfc "statx: Add a system call to make enhanced file info available"

Unfortunately, reverting a528d35e8bfc causes merge conflicts in later 
4.11-rcX versions.
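
For reference, the three reverts stack cleanly on -rc1, roughly (a 
sketch; on later -rcX tags the last revert hits the merge conflicts 
mentioned above and needs manual resolution):

git checkout v4.11-rc1
git revert b86dd815ff74
git revert 113285b47382
git revert a528d35e8bfc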

I have only reported the simplest test case we have, but there are 
other scenarios that are even worse in 4.11 than the single-queue case.

Here is one:
modprobe null_blk nr_devices=4;
fio --ioengine=libaio --direct=1 --gtod_reduce=1 --name=readtest 
--filename=/dev/nullb0:/dev/nullb1:/dev/nullb2:/dev/nullb3 --bs=4k 
--iodepth=128 --time_based --runtime=10 --readwrite=randread 
--iodepth_low=96 --iodepth_batch=16 --numjobs=8

What is the next step to fix these regressions?  It is not a single 
commit causing the performance problems.

Is anyone running performance tests on the kernel to catch issues like these?
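
For the perf profile comparison suggested below, I can collect 
something like the following on each kernel (a sketch; the per-kernel 
file names are just my own convention):

perf record -a -g -o perf.data.4.10 -- taskset 0x1 fio <same arguments as above>
perf report -i perf.data.4.10 --sort=dso,symbol
perf diff perf.data.4.10 perf.data.4.11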
>>>> One more thing to try early: As 4.11 gained support for blk-mq I/O
>>>> schedulers compared to 4.10, null_blk will now also need some extra
>>>> cycles for each I/O request. Try loading the driver with "queue_mode=0"
>>>> or "queue_mode=1" instead of "queue_mode=2".
>>> Since you have 1 submit queue set, the device is loaded with deadline
>>> attached. To compare 4.10 and 4.11, with queue_mode=2 and submit_queues=1,
>>> after loading null_blk in 4.11, do:
>>>
>>> # echo none > /sys/block/nullb0/queue/scheduler
>>>
>>> and re-test.
>> On my setup, doing this restored a bunch of the performance, but the numbers
>> are still ~5% worse than 4.10 (as opposed to ~20% worse with mq-deadline).
>> Disabling NUMA as well cuts this down to ~2%.
> So we're down to 2%. How stable are these numbers? With mq-deadline attached,
> I'm not surprised there's a drop for a null_blk type of test.
>
> Maybe a perf profile comparison between the two kernels would help?
>
Regards,
Scott

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-05-15 20:10 UTC | newest]

Thread overview: 20+ messages
2017-05-06  1:37 FIO performance regression in 4.11 kernel vs. 4.10 kernel observed on ARM64 Scott Branden
2017-05-06  1:54 ` Scott Branden
2017-05-08 11:07 ` Will Deacon
2017-05-08 11:19   ` Arnd Bergmann
2017-05-08 14:08     ` Jens Axboe
2017-05-08 15:24       ` Will Deacon
2017-05-08 15:28         ` Jens Axboe
2017-05-08 17:38           ` Scott Branden
2017-05-15 20:10           ` Scott Branden
2017-05-08 16:32         ` Scott Branden
