linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* NVME performance regression in Linux 5.x due to lack of block level IO queueing
@ 2022-03-30 14:32 Michael Marod
  2022-03-30 14:50 ` Michael Marod
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Marod @ 2022-03-30 14:32 UTC (permalink / raw)
  To: linux-kernel

It has been noticed in an AWS Cloud environment that the IO performance on an Ubuntu Bionic 18.04 i3 instance with Linux 5.4.0 is significantly worse than that of an Ubuntu Xenial 16.04 i3 instance with Linux 4.4.0. I've been digging into the issue and it seems like block level queueing is not working as well (if at all?) on the Linux 5 instance. When I've run performance tests using fio, iostat reports 0.00 average queue size for the Linux 5.4.0 instance but goes up to 0.5 or higher on the 4.4.0 one.

In my setup, I have 2 NVMe devices (/dev/nvme0n1 and /dev/nvme1n1) configured in a raid0 on /dev/md127 which is mounted on /opt. I've tested devices directly and through the RAID controller and it does not seem to make a difference. In the output below I am writing through /opt which exercises the file system and RAID controller. Also, fio was compiled from source with latest version on both instances.

# fio -name=randrw -filename=/opt/testfile -direct=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=1 -group_reporting=1 -runtime=120

Note: these are different versions of iostat which is why the output is different

# Linux 4.4.0-1137-aws:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.00 3248.00 3173.00 12992.00 12692.00     8.00     0.44    0.07    0.10    0.03   0.07  44.00
nvme0n1           0.00     0.00 3229.00 3292.00 12916.00 13168.00     8.00     0.47    0.07    0.10    0.04   0.07  46.80

# Linux 5.4.0-1068-aws:
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1       2842.57 2789.11  11370.30  11156.44     0.00     0.00   0.00   0.00    0.10    0.04   0.00     4.00     4.00   0.18  99.01
nvme1n1       2849.50 2748.51  11398.02  10994.06     0.00     0.00   0.00   0.00    0.10    0.04   0.00     4.00     4.00   0.18  99.01

For another (possibly simpler) demonstration -- I ran blktrace/blkparse on both Linux 4.4.0 and 5.5.0 instances with the same hardware (i3.4xlarge). Then I used fio to do a simple stress test that sends 5 write requests.

# fio -name=write -filename=/opt/testfile -direct=1 -thread -rw=write, -ioengine=psync -bs=4k -size=4k -numjobs=5 -group_reporting=1 -runtime=5

The blkparse output shows that the Linux 4.4.0 instance had 5 "Writes Queued" while the Linux 5.4.0 one had 0. Additionally, there were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "Timer unplugs" and 0 "IO unplugs" on the Linux 5.4.0 one.

// On the Bionic machine with Linux 5.4.0-1068-aws #72~18.04.1-Ubuntu SMP Thu Mar 3 08:49:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux:
# blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
259,0   12        1     0.000000000 114067  A  WS 401408 + 8 <- (9,127) 274432
259,0   12        2     0.000003433 114067  G  WS 401408 + 8 [fio]
259,0   12        3     0.000004078 114067  P   N [fio]
259,0   12        4     0.000004548 114067  U   N [fio] 1
259,0   12        5     0.000005654 114067  D  WS 401408 + 8 [fio]
259,0    4        1     0.000045379     0  C  WS 401408 + 8 [0]
259,0   14        1     0.000210927     0  C  WS 401408 + 8 [0]
259,0   14        2     0.000254591 114065  A  WS 401408 + 8 <- (9,127) 274432
259,0   14        3     0.000257373 114065  G  WS 401408 + 8 [fio]
259,0   14        4     0.000257853 114065  P   N [fio]
259,0   14        5     0.000258340 114065  U   N [fio] 1
259,0   14        6     0.000259293 114065  D  WS 401408 + 8 [fio]
259,0    8        2     0.000290583     0  C  WS 401408 + 8 [0]
259,0   13        1     0.000080964 114068  A  WS 401408 + 8 <- (9,127) 274432
259,0   13        2     0.000083702 114068  G  WS 401408 + 8 [fio]
259,0   13        3     0.000084247 114068  P   N [fio]
259,0   13        4     0.000084654 114068  U   N [fio] 1
259,0   13        5     0.000085801 114068  D  WS 401408 + 8 [fio]
259,0    3        1 1266874889.709450585 114066  A  WS 401408 + 8 <- (9,127) 274432
259,0    3        2 1266874889.709455280 114066  G  WS 401408 + 8 [fio]
259,0    3        3 1266874889.709456267 114066  P   N [fio]
259,0    3        4 1266874889.709456843 114066  U   N [fio] 1
259,0    3        5 1266874889.709458888 114066  D  WS 401408 + 8 [fio]
259,0   15        1     0.000164738 114069  A  WS 401408 + 8 <- (9,127) 274432
259,0   15        2     0.000168016 114069  G  WS 401408 + 8 [fio]
259,0   15        3     0.000168571 114069  P   N [fio]
259,0   15        4     0.000169037 114069  U   N [fio] 1
259,0   15        5     0.000169993 114069  D  WS 401408 + 8 [fio]
259,0   10        1     0.000128797     0  C  WS 401408 + 8 [0]
^C259,0    8        1 1266874889.709500996     0  C  WS 401408 + 8 [0]
CPU3 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             1        	 Timer unplugs:           0
CPU4 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             0        	 Timer unplugs:           0
CPU8 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        2,        8KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             0        	 Timer unplugs:           0
CPU10 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             0        	 Timer unplugs:           0
CPU12 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             1        	 Timer unplugs:           0
CPU13 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             1        	 Timer unplugs:           0
CPU14 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             1        	 Timer unplugs:           0
CPU15 (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             3
 IO unplugs:             1        	 Timer unplugs:           0

Total (259,0):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        5,       20KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        5,       20KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 IO unplugs:             5        	 Timer unplugs:           0

Throughput (R/W): 0KiB/s / 0KiB/s
Events (259,0): 30 entries
Skips: 0 forward (0 -   0.0%)

// On the Xenial machine with 4.4.0-1137-aws #151-Ubuntu SMP Tue Mar 1 09:47:43 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
# blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
259,1    1        1     0.000000000 82519  Q  WS 399368 + 8 [fio]
259,1    1        2     0.000000980 82519  G  WS 399368 + 8 [fio]
259,1    1        3     0.000003468 82519 UT   N [fio] 1
259,1    1        4     0.000003867 82519  I  WS 399368 + 8 [fio]
259,1    1        5     0.000011785  1393  D  WS 399368 + 8 [kworker/1:1H]
259,1    2        1 1266874889.709465021 82517  Q  WS 399368 + 8 [fio]
259,1    2        2 1266874889.709467133 82517  G  WS 399368 + 8 [fio]
259,1    2        3 1266874889.709471568 82517 UT   N [fio] 1
259,1    2        4 1266874889.709471568 82517  I  WS 399368 + 8 [fio]
259,1    2        5 1266874889.709476205   980  D  WS 399368 + 8 [kworker/2:1H]
259,1   15        1     0.000051309     0  C  WS 399368 + 8 [0]
259,1    3        1     0.000079179 82520  Q  WS 399368 + 8 [fio]
259,1    3        2     0.000080079 82520  G  WS 399368 + 8 [fio]
259,1    3        3     0.000082140 82520 UT   N [fio] 1
259,1    3        4     0.000082391 82520  I  WS 399368 + 8 [fio]
259,1    3        5     0.000087819   975  D  WS 399368 + 8 [kworker/3:1H]
259,1   12        1     0.000125455     0  C  WS 399368 + 8 [0]
259,1    5        1     0.000152778 82521  Q  WS 399368 + 8 [fio]
259,1    5        2     0.000153811 82521  G  WS 399368 + 8 [fio]
259,1    5        3     0.000155811 82521 UT   N [fio] 1
259,1    5        4     0.000156314 82521  I  WS 399368 + 8 [fio]
259,1    5        5     0.000168577  1552  D  WS 399368 + 8 [kworker/5:1H]
259,1    7        1     0.000206251     0  C  WS 399368 + 8 [0]
259,1    9        1     0.000234725 82518  Q  WS 399368 + 8 [fio]
259,1    9        2     0.000235496 82518  G  WS 399368 + 8 [fio]
259,1    9        3     0.000237147 82518 UT   N [fio] 1
259,1    9        4     0.000237382 82518  I  WS 399368 + 8 [fio]
259,1    9        5     0.000242425  2001  D  WS 399368 + 8 [kworker/9:1H]
259,1   15        2     0.000290161     0  C  WS 399368 + 8 [0]
259,1   13        1 1266874889.709520875     0  C  WS 399368 + 8 [0]
^CCPU1 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           1
CPU2 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           1
CPU3 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           1
CPU5 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           1
CPU7 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           0
CPU9 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           1
CPU12 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           0
CPU13 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           0
CPU15 (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        2,        8KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 Read depth:             0        	 Write depth:             2
 IO unplugs:             0        	 Timer unplugs:           0

Total (259,1):
 Reads Queued:           0,        0KiB	 Writes Queued:           5,       20KiB
 Read Dispatches:        0,        0KiB	 Write Dispatches:        5,       20KiB
 Reads Requeued:         0		 Writes Requeued:         0
 Reads Completed:        0,        0KiB	 Writes Completed:        5,       20KiB
 Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
 IO unplugs:             0        	 Timer unplugs:           5

Throughput (R/W): 0KiB/s / 0KiB/s
Events (259,1): 30 entries
Skips: 0 forward (0 -   0.0%)

Please CC me directly as I am not subscribed to LKML.

Thanks,

-- Michael Marod (michael@michaelmarod.com)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing
  2022-03-30 14:32 NVME performance regression in Linux 5.x due to lack of block level IO queueing Michael Marod
@ 2022-03-30 14:50 ` Michael Marod
  2022-03-31  0:11   ` Michael Marod
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Marod @ 2022-03-30 14:50 UTC (permalink / raw)
  To: linux-kernel

Fix: There were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "IO unplugs" and 0 "Timer unplugs" on the Linux 5.4.0 one.

> On Mar 30, 2022, at 10:31 AM, Michael Marod <michael@michaelmarod.com> wrote:
>
> It has been noticed in an AWS Cloud environment that the IO performance on an Ubuntu Bionic 18.04 i3 instance with Linux 5.4.0 is significantly worse than that of an Ubuntu Xenial 16.04 i3 instance with Linux 4.4.0. I've been digging into the issue and it seems like block level queueing is not working as well (if at all?) on the Linux 5 instance. When I've run performance tests using fio, iostat reports 0.00 average queue size for the Linux 5.4.0 instance but goes up to 0.5 or higher on the 4.4.0 one.
>
> In my setup, I have 2 NVMe devices (/dev/nvme0n1 and /dev/nvme1n1) configured in a raid0 on /dev/md127 which is mounted on /opt. I've tested devices directly and through the RAID controller and it does not seem to make a difference. In the output below I am writing through /opt which exercises the file system and RAID controller. Also, fio was compiled from source with latest version on both instances.
>
> # fio -name=randrw -filename=/opt/testfile -direct=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=1 -group_reporting=1 -runtime=120
>
> Note: these are different versions of iostat which is why the output is different
>
> # Linux 4.4.0-1137-aws:
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> nvme1n1           0.00     0.00 3248.00 3173.00 12992.00 12692.00     8.00     0.44    0.07    0.10    0.03   0.07  44.00
> nvme0n1           0.00     0.00 3229.00 3292.00 12916.00 13168.00     8.00     0.47    0.07    0.10    0.04   0.07  46.80
>
> # Linux 5.4.0-1068-aws:
> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
> nvme0n1       2842.57 2789.11  11370.30  11156.44     0.00     0.00   0.00   0.00    0.10    0.04   0.00     4.00     4.00   0.18  99.01
> nvme1n1       2849.50 2748.51  11398.02  10994.06     0.00     0.00   0.00   0.00    0.10    0.04   0.00     4.00     4.00   0.18  99.01
>
> For another (possibly simpler) demonstration -- I ran blktrace/blkparse on both Linux 4.4.0 and 5.5.0 instances with the same hardware (i3.4xlarge). Then I used fio to do a simple stress test that sends 5 write requests.
>
> # fio -name=write -filename=/opt/testfile -direct=1 -thread -rw=write, -ioengine=psync -bs=4k -size=4k -numjobs=5 -group_reporting=1 -runtime=5
>
> The blkparse output shows that the Linux 4.4.0 instance had 5 "Writes Queued" while the Linux 5.4.0 one had 0. Additionally, there were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "Timer unplugs" and 0 "IO unplugs" on the Linux 5.4.0 one.
>
> // On the Bionic machine with Linux 5.4.0-1068-aws #72~18.04.1-Ubuntu SMP Thu Mar 3 08:49:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux:
> # blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
> 259,0   12        1     0.000000000 114067  A  WS 401408 + 8 <- (9,127) 274432
> 259,0   12        2     0.000003433 114067  G  WS 401408 + 8 [fio]
> 259,0   12        3     0.000004078 114067  P   N [fio]
> 259,0   12        4     0.000004548 114067  U   N [fio] 1
> 259,0   12        5     0.000005654 114067  D  WS 401408 + 8 [fio]
> 259,0    4        1     0.000045379     0  C  WS 401408 + 8 [0]
> 259,0   14        1     0.000210927     0  C  WS 401408 + 8 [0]
> 259,0   14        2     0.000254591 114065  A  WS 401408 + 8 <- (9,127) 274432
> 259,0   14        3     0.000257373 114065  G  WS 401408 + 8 [fio]
> 259,0   14        4     0.000257853 114065  P   N [fio]
> 259,0   14        5     0.000258340 114065  U   N [fio] 1
> 259,0   14        6     0.000259293 114065  D  WS 401408 + 8 [fio]
> 259,0    8        2     0.000290583     0  C  WS 401408 + 8 [0]
> 259,0   13        1     0.000080964 114068  A  WS 401408 + 8 <- (9,127) 274432
> 259,0   13        2     0.000083702 114068  G  WS 401408 + 8 [fio]
> 259,0   13        3     0.000084247 114068  P   N [fio]
> 259,0   13        4     0.000084654 114068  U   N [fio] 1
> 259,0   13        5     0.000085801 114068  D  WS 401408 + 8 [fio]
> 259,0    3        1 1266874889.709450585 114066  A  WS 401408 + 8 <- (9,127) 274432
> 259,0    3        2 1266874889.709455280 114066  G  WS 401408 + 8 [fio]
> 259,0    3        3 1266874889.709456267 114066  P   N [fio]
> 259,0    3        4 1266874889.709456843 114066  U   N [fio] 1
> 259,0    3        5 1266874889.709458888 114066  D  WS 401408 + 8 [fio]
> 259,0   15        1     0.000164738 114069  A  WS 401408 + 8 <- (9,127) 274432
> 259,0   15        2     0.000168016 114069  G  WS 401408 + 8 [fio]
> 259,0   15        3     0.000168571 114069  P   N [fio]
> 259,0   15        4     0.000169037 114069  U   N [fio] 1
> 259,0   15        5     0.000169993 114069  D  WS 401408 + 8 [fio]
> 259,0   10        1     0.000128797     0  C  WS 401408 + 8 [0]
> ^C259,0    8        1 1266874889.709500996     0  C  WS 401408 + 8 [0]
> CPU3 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             1        	 Timer unplugs:           0
> CPU4 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             0        	 Timer unplugs:           0
> CPU8 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        2,        8KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             0        	 Timer unplugs:           0
> CPU10 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             0        	 Timer unplugs:           0
> CPU12 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             1        	 Timer unplugs:           0
> CPU13 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             1        	 Timer unplugs:           0
> CPU14 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             1        	 Timer unplugs:           0
> CPU15 (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             3
> IO unplugs:             1        	 Timer unplugs:           0
>
> Total (259,0):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        5,       20KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        5,       20KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> IO unplugs:             5        	 Timer unplugs:           0
>
> Throughput (R/W): 0KiB/s / 0KiB/s
> Events (259,0): 30 entries
> Skips: 0 forward (0 -   0.0%)
>
> // On the Xenial machine with 4.4.0-1137-aws #151-Ubuntu SMP Tue Mar 1 09:47:43 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> # blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
> 259,1    1        1     0.000000000 82519  Q  WS 399368 + 8 [fio]
> 259,1    1        2     0.000000980 82519  G  WS 399368 + 8 [fio]
> 259,1    1        3     0.000003468 82519 UT   N [fio] 1
> 259,1    1        4     0.000003867 82519  I  WS 399368 + 8 [fio]
> 259,1    1        5     0.000011785  1393  D  WS 399368 + 8 [kworker/1:1H]
> 259,1    2        1 1266874889.709465021 82517  Q  WS 399368 + 8 [fio]
> 259,1    2        2 1266874889.709467133 82517  G  WS 399368 + 8 [fio]
> 259,1    2        3 1266874889.709471568 82517 UT   N [fio] 1
> 259,1    2        4 1266874889.709471568 82517  I  WS 399368 + 8 [fio]
> 259,1    2        5 1266874889.709476205   980  D  WS 399368 + 8 [kworker/2:1H]
> 259,1   15        1     0.000051309     0  C  WS 399368 + 8 [0]
> 259,1    3        1     0.000079179 82520  Q  WS 399368 + 8 [fio]
> 259,1    3        2     0.000080079 82520  G  WS 399368 + 8 [fio]
> 259,1    3        3     0.000082140 82520 UT   N [fio] 1
> 259,1    3        4     0.000082391 82520  I  WS 399368 + 8 [fio]
> 259,1    3        5     0.000087819   975  D  WS 399368 + 8 [kworker/3:1H]
> 259,1   12        1     0.000125455     0  C  WS 399368 + 8 [0]
> 259,1    5        1     0.000152778 82521  Q  WS 399368 + 8 [fio]
> 259,1    5        2     0.000153811 82521  G  WS 399368 + 8 [fio]
> 259,1    5        3     0.000155811 82521 UT   N [fio] 1
> 259,1    5        4     0.000156314 82521  I  WS 399368 + 8 [fio]
> 259,1    5        5     0.000168577  1552  D  WS 399368 + 8 [kworker/5:1H]
> 259,1    7        1     0.000206251     0  C  WS 399368 + 8 [0]
> 259,1    9        1     0.000234725 82518  Q  WS 399368 + 8 [fio]
> 259,1    9        2     0.000235496 82518  G  WS 399368 + 8 [fio]
> 259,1    9        3     0.000237147 82518 UT   N [fio] 1
> 259,1    9        4     0.000237382 82518  I  WS 399368 + 8 [fio]
> 259,1    9        5     0.000242425  2001  D  WS 399368 + 8 [kworker/9:1H]
> 259,1   15        2     0.000290161     0  C  WS 399368 + 8 [0]
> 259,1   13        1 1266874889.709520875     0  C  WS 399368 + 8 [0]
> ^CCPU1 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           1
> CPU2 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           1
> CPU3 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           1
> CPU5 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           1
> CPU7 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           0
> CPU9 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           1
> CPU12 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           0
> CPU13 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           0
> CPU15 (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        2,        8KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> Read depth:             0        	 Write depth:             2
> IO unplugs:             0        	 Timer unplugs:           0
>
> Total (259,1):
> Reads Queued:           0,        0KiB	 Writes Queued:           5,       20KiB
> Read Dispatches:        0,        0KiB	 Write Dispatches:        5,       20KiB
> Reads Requeued:         0		 Writes Requeued:         0
> Reads Completed:        0,        0KiB	 Writes Completed:        5,       20KiB
> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
> IO unplugs:             0        	 Timer unplugs:           5
>
> Throughput (R/W): 0KiB/s / 0KiB/s
> Events (259,1): 30 entries
> Skips: 0 forward (0 -   0.0%)
>
> Please CC me directly as I am not subscribed to LKML.
>
> Thanks,
>
> -- Michael Marod (michael@michaelmarod.com)



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing
  2022-03-30 14:50 ` Michael Marod
@ 2022-03-31  0:11   ` Michael Marod
  2022-03-31  4:35     ` Christoph Hellwig
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Marod @ 2022-03-31  0:11 UTC (permalink / raw)
  To: linux-kernel, linux-block

Cross-posting from linux-kernel to linux-block as it relates to linux-block primarily.

Also realized that "Fix" is the wrong thing to say in my previous email -- "Edit" would have made more sense. The issue still remains, I just meant to correct my first email.

Finally wanted to point out that on both systems, the scheduler is set to "none".

Thanks,
Michael

> On Mar 30, 2022, at 10:49 AM, Michael Marod <michael@michaelmarod.com> wrote:
>
> Fix: There were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "IO unplugs" and 0 "Timer unplugs" on the Linux 5.4.0 one.
>
>> On Mar 30, 2022, at 10:31 AM, Michael Marod <michael@michaelmarod.com> wrote:
>>
>> It has been noticed in an AWS Cloud environment that the IO performance on an Ubuntu Bionic 18.04 i3 instance with Linux 5.4.0 is significantly worse than that of an Ubuntu Xenial 16.04 i3 instance with Linux 4.4.0. I've been digging into the issue and it seems like block level queueing is not working as well (if at all?) on the Linux 5 instance. When I've run performance tests using fio, iostat reports 0.00 average queue size for the Linux 5.4.0 instance but goes up to 0.5 or higher on the 4.4.0 one.
>>
>> In my setup, I have 2 NVMe devices (/dev/nvme0n1 and /dev/nvme1n1) configured in a raid0 on /dev/md127 which is mounted on /opt. I've tested devices directly and through the RAID controller and it does not seem to make a difference. In the output below I am writing through /opt which exercises the file system and RAID controller. Also, fio was compiled from source with latest version on both instances.
>>
>> # fio -name=randrw -filename=/opt/testfile -direct=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=1 -group_reporting=1 -runtime=120
>>
>> Note: these are different versions of iostat which is why the output is different
>>
>> # Linux 4.4.0-1137-aws:
>> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> nvme1n1           0.00     0.00 3248.00 3173.00 12992.00 12692.00     8.00     0.44    0.07    0.10    0.03   0.07  44.00
>> nvme0n1           0.00     0.00 3229.00 3292.00 12916.00 13168.00     8.00     0.47    0.07    0.10    0.04   0.07  46.80
>>
>> # Linux 5.4.0-1068-aws:
>> Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
>> nvme0n1       2842.57 2789.11  11370.30  11156.44     0.00     0.00   0.00   0.00    0.10    0.04   0.00     4.00     4.00   0.18  99.01
>> nvme1n1       2849.50 2748.51  11398.02  10994.06     0.00     0.00   0.00   0.00    0.10    0.04   0.00     4.00     4.00   0.18  99.01
>>
>> For another (possibly simpler) demonstration -- I ran blktrace/blkparse on both Linux 4.4.0 and 5.5.0 instances with the same hardware (i3.4xlarge). Then I used fio to do a simple stress test that sends 5 write requests.
>>
>> # fio -name=write -filename=/opt/testfile -direct=1 -thread -rw=write, -ioengine=psync -bs=4k -size=4k -numjobs=5 -group_reporting=1 -runtime=5
>>
>> The blkparse output shows that the Linux 4.4.0 instance had 5 "Writes Queued" while the Linux 5.4.0 one had 0. Additionally, there were 0 "IO unplugs" and 5 "Timer unplugs" on the Linux 4.4.0 instance and 5 "Timer unplugs" and 0 "IO unplugs" on the Linux 5.4.0 one.
>>
>> // On the Bionic machine with Linux 5.4.0-1068-aws #72~18.04.1-Ubuntu SMP Thu Mar 3 08:49:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux:
>> # blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
>> 259,0   12        1     0.000000000 114067  A  WS 401408 + 8 <- (9,127) 274432
>> 259,0   12        2     0.000003433 114067  G  WS 401408 + 8 [fio]
>> 259,0   12        3     0.000004078 114067  P   N [fio]
>> 259,0   12        4     0.000004548 114067  U   N [fio] 1
>> 259,0   12        5     0.000005654 114067  D  WS 401408 + 8 [fio]
>> 259,0    4        1     0.000045379     0  C  WS 401408 + 8 [0]
>> 259,0   14        1     0.000210927     0  C  WS 401408 + 8 [0]
>> 259,0   14        2     0.000254591 114065  A  WS 401408 + 8 <- (9,127) 274432
>> 259,0   14        3     0.000257373 114065  G  WS 401408 + 8 [fio]
>> 259,0   14        4     0.000257853 114065  P   N [fio]
>> 259,0   14        5     0.000258340 114065  U   N [fio] 1
>> 259,0   14        6     0.000259293 114065  D  WS 401408 + 8 [fio]
>> 259,0    8        2     0.000290583     0  C  WS 401408 + 8 [0]
>> 259,0   13        1     0.000080964 114068  A  WS 401408 + 8 <- (9,127) 274432
>> 259,0   13        2     0.000083702 114068  G  WS 401408 + 8 [fio]
>> 259,0   13        3     0.000084247 114068  P   N [fio]
>> 259,0   13        4     0.000084654 114068  U   N [fio] 1
>> 259,0   13        5     0.000085801 114068  D  WS 401408 + 8 [fio]
>> 259,0    3        1 1266874889.709450585 114066  A  WS 401408 + 8 <- (9,127) 274432
>> 259,0    3        2 1266874889.709455280 114066  G  WS 401408 + 8 [fio]
>> 259,0    3        3 1266874889.709456267 114066  P   N [fio]
>> 259,0    3        4 1266874889.709456843 114066  U   N [fio] 1
>> 259,0    3        5 1266874889.709458888 114066  D  WS 401408 + 8 [fio]
>> 259,0   15        1     0.000164738 114069  A  WS 401408 + 8 <- (9,127) 274432
>> 259,0   15        2     0.000168016 114069  G  WS 401408 + 8 [fio]
>> 259,0   15        3     0.000168571 114069  P   N [fio]
>> 259,0   15        4     0.000169037 114069  U   N [fio] 1
>> 259,0   15        5     0.000169993 114069  D  WS 401408 + 8 [fio]
>> 259,0   10        1     0.000128797     0  C  WS 401408 + 8 [0]
>> ^C259,0    8        1 1266874889.709500996     0  C  WS 401408 + 8 [0]
>> CPU3 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             1        	 Timer unplugs:           0
>> CPU4 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             0        	 Timer unplugs:           0
>> CPU8 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        2,        8KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             0        	 Timer unplugs:           0
>> CPU10 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             0        	 Timer unplugs:           0
>> CPU12 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             1        	 Timer unplugs:           0
>> CPU13 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             1        	 Timer unplugs:           0
>> CPU14 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             1        	 Timer unplugs:           0
>> CPU15 (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             3
>> IO unplugs:             1        	 Timer unplugs:           0
>>
>> Total (259,0):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        5,       20KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        5,       20KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> IO unplugs:             5        	 Timer unplugs:           0
>>
>> Throughput (R/W): 0KiB/s / 0KiB/s
>> Events (259,0): 30 entries
>> Skips: 0 forward (0 -   0.0%)
>>
>> // On the Xenial machine with 4.4.0-1137-aws #151-Ubuntu SMP Tue Mar 1 09:47:43 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
>> # blktrace -d /dev/nvme0n1 -d /dev/nvme1n1 -o - | blkparse -i -
>> 259,1    1        1     0.000000000 82519  Q  WS 399368 + 8 [fio]
>> 259,1    1        2     0.000000980 82519  G  WS 399368 + 8 [fio]
>> 259,1    1        3     0.000003468 82519 UT   N [fio] 1
>> 259,1    1        4     0.000003867 82519  I  WS 399368 + 8 [fio]
>> 259,1    1        5     0.000011785  1393  D  WS 399368 + 8 [kworker/1:1H]
>> 259,1    2        1 1266874889.709465021 82517  Q  WS 399368 + 8 [fio]
>> 259,1    2        2 1266874889.709467133 82517  G  WS 399368 + 8 [fio]
>> 259,1    2        3 1266874889.709471568 82517 UT   N [fio] 1
>> 259,1    2        4 1266874889.709471568 82517  I  WS 399368 + 8 [fio]
>> 259,1    2        5 1266874889.709476205   980  D  WS 399368 + 8 [kworker/2:1H]
>> 259,1   15        1     0.000051309     0  C  WS 399368 + 8 [0]
>> 259,1    3        1     0.000079179 82520  Q  WS 399368 + 8 [fio]
>> 259,1    3        2     0.000080079 82520  G  WS 399368 + 8 [fio]
>> 259,1    3        3     0.000082140 82520 UT   N [fio] 1
>> 259,1    3        4     0.000082391 82520  I  WS 399368 + 8 [fio]
>> 259,1    3        5     0.000087819   975  D  WS 399368 + 8 [kworker/3:1H]
>> 259,1   12        1     0.000125455     0  C  WS 399368 + 8 [0]
>> 259,1    5        1     0.000152778 82521  Q  WS 399368 + 8 [fio]
>> 259,1    5        2     0.000153811 82521  G  WS 399368 + 8 [fio]
>> 259,1    5        3     0.000155811 82521 UT   N [fio] 1
>> 259,1    5        4     0.000156314 82521  I  WS 399368 + 8 [fio]
>> 259,1    5        5     0.000168577  1552  D  WS 399368 + 8 [kworker/5:1H]
>> 259,1    7        1     0.000206251     0  C  WS 399368 + 8 [0]
>> 259,1    9        1     0.000234725 82518  Q  WS 399368 + 8 [fio]
>> 259,1    9        2     0.000235496 82518  G  WS 399368 + 8 [fio]
>> 259,1    9        3     0.000237147 82518 UT   N [fio] 1
>> 259,1    9        4     0.000237382 82518  I  WS 399368 + 8 [fio]
>> 259,1    9        5     0.000242425  2001  D  WS 399368 + 8 [kworker/9:1H]
>> 259,1   15        2     0.000290161     0  C  WS 399368 + 8 [0]
>> 259,1   13        1 1266874889.709520875     0  C  WS 399368 + 8 [0]
>> ^CCPU1 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           1
>> CPU2 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           1
>> CPU3 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           1
>> CPU5 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           1
>> CPU7 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           0
>> CPU9 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           1,        4KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        1,        4KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        0,        0KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           1
>> CPU12 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           0
>> CPU13 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        1,        4KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           0
>> CPU15 (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           0,        0KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        0,        0KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        2,        8KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> Read depth:             0        	 Write depth:             2
>> IO unplugs:             0        	 Timer unplugs:           0
>>
>> Total (259,1):
>> Reads Queued:           0,        0KiB	 Writes Queued:           5,       20KiB
>> Read Dispatches:        0,        0KiB	 Write Dispatches:        5,       20KiB
>> Reads Requeued:         0		 Writes Requeued:         0
>> Reads Completed:        0,        0KiB	 Writes Completed:        5,       20KiB
>> Read Merges:            0,        0KiB	 Write Merges:            0,        0KiB
>> IO unplugs:             0        	 Timer unplugs:           5
>>
>> Throughput (R/W): 0KiB/s / 0KiB/s
>> Events (259,1): 30 entries
>> Skips: 0 forward (0 -   0.0%)
>>
>> Please CC me directly as I am not subscribed to LKML.
>>
>> Thanks,
>>
>> -- Michael Marod (michael@michaelmarod.com)
>



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing
  2022-03-31  0:11   ` Michael Marod
@ 2022-03-31  4:35     ` Christoph Hellwig
  2022-03-31 23:22       ` Michael Marod
  0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2022-03-31  4:35 UTC (permalink / raw)
  To: Michael Marod; +Cc: linux-kernel, linux-block

5.4 is very old.  Do you still have a problem with the latest kernel
(5.17) ?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing
  2022-03-31  4:35     ` Christoph Hellwig
@ 2022-03-31 23:22       ` Michael Marod
  2022-04-01  2:27         ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: Michael Marod @ 2022-03-31 23:22 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-block

Good call -- Turns out that that cache issue is resolved in 5.17. I tried a number of kernels and narrowed it down to a problem that started after 4.9 and before 4.15, and ended some time after 5.13. Namely, 4.9 is good, 4.15 is bad, 5.13 is bad, and 5.17 is good. I did not bisect it all the way down to the specific versions where the behaviors changed.

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme1n1       2758.00 2783.00  11032.00  11132.00     0.00     0.00   0.00   0.00    0.10    0.03   0.36     4.00     4.00   0.18 100.00
nvme0n1       2830.00 2875.00  11320.00  11500.00     0.00     0.00   0.00   0.00    0.10    0.03   0.39     4.00     4.00   0.18 100.00

With regards to the performance between 4.4.0 and 5.17, for a single thread, 4.4.0 still had better performance over 5.17. However, the 5.17 kernel was significantly better at multiple threads. In fact, it is so much better I don't believe the results (10x improvement!). Is this to be expected that a single thread would be slower in 5.17, but recent improvements make it possible to run many of them in parallel more efficiently?

# /usr/local/bin/fio -name=randrw -filename=/opt/foo -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=16 -group_reporting=1 -runtime=120

// Ubuntu 16.04 / Linux 4.4.0:
Run status group 0 (all jobs):
   READ: bw=54.5MiB/s (57.1MB/s), 54.5MiB/s-54.5MiB/s (57.1MB/s-57.1MB/s), io=6537MiB (6854MB), run=120002-120002msec
  WRITE: bw=54.5MiB/s (57.2MB/s), 54.5MiB/s-54.5MiB/s (57.2MB/s-57.2MB/s), io=6544MiB (6862MB), run=120002-120002msec

// Ubuntu 18.04 / Linux 5.4.0:
Run status group 0 (all jobs):
   READ: bw=23.5MiB/s (24.7MB/s), 23.5MiB/s-23.5MiB/s (24.7MB/s-24.7MB/s), io=2821MiB (2959MB), run=120002-120002msec
  WRITE: bw=23.5MiB/s (24.6MB/s), 23.5MiB/s-23.5MiB/s (24.6MB/s-24.6MB/s), io=2819MiB (2955MB), run=120002-120002msec

// Ubuntu 18.04 / Linux 5.17:
Run status group 0 (all jobs):
   READ: bw=244MiB/s (255MB/s), 244MiB/s-244MiB/s (255MB/s-255MB/s), io=28.6GiB (30.7GB), run=120001-120001msec
  WRITE: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=28.6GiB (30.7GB), run=120001-120001msec

Thanks,
Michael

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing
  2022-03-31 23:22       ` Michael Marod
@ 2022-04-01  2:27         ` Keith Busch
  2022-04-01  2:59           ` Chaitanya Kulkarni
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2022-04-01  2:27 UTC (permalink / raw)
  To: Michael Marod; +Cc: Christoph Hellwig, linux-kernel, linux-block

On Thu, Mar 31, 2022 at 11:22:03PM +0000, Michael Marod wrote:
> # /usr/local/bin/fio -name=randrw -filename=/opt/foo -direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=4k -size=10G -numjobs=16 -group_reporting=1 -runtime=120
> 
> // Ubuntu 16.04 / Linux 4.4.0:
> Run status group 0 (all jobs):
>    READ: bw=54.5MiB/s (57.1MB/s), 54.5MiB/s-54.5MiB/s (57.1MB/s-57.1MB/s), io=6537MiB (6854MB), run=120002-120002msec
>   WRITE: bw=54.5MiB/s (57.2MB/s), 54.5MiB/s-54.5MiB/s (57.2MB/s-57.2MB/s), io=6544MiB (6862MB), run=120002-120002msec
> 
> // Ubuntu 18.04 / Linux 5.4.0:
> Run status group 0 (all jobs):
>    READ: bw=23.5MiB/s (24.7MB/s), 23.5MiB/s-23.5MiB/s (24.7MB/s-24.7MB/s), io=2821MiB (2959MB), run=120002-120002msec
>   WRITE: bw=23.5MiB/s (24.6MB/s), 23.5MiB/s-23.5MiB/s (24.6MB/s-24.6MB/s), io=2819MiB (2955MB), run=120002-120002msec
> 
> // Ubuntu 18.04 / Linux 5.17:
> Run status group 0 (all jobs):
>    READ: bw=244MiB/s (255MB/s), 244MiB/s-244MiB/s (255MB/s-255MB/s), io=28.6GiB (30.7GB), run=120001-120001msec
>   WRITE: bw=244MiB/s (256MB/s), 244MiB/s-244MiB/s (256MB/s-256MB/s), io=28.6GiB (30.7GB), run=120001-120001msec

Thanks for the info. I don't know of anything block or nvme specific that might
explain an order of magnitude perf difference.

Could you try the same test without the filesytems? You mentioned using mdraid,
so try '--filename=/dev/mdX'. If that also shows similiar performance
difference, try using one of your nvme member drives directly, like
'--filename=/dev/nvme1n1'. That should isolate which subsystem is contributing
to the difference.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVME performance regression in Linux 5.x due to lack of block level IO queueing
  2022-04-01  2:27         ` Keith Busch
@ 2022-04-01  2:59           ` Chaitanya Kulkarni
  0 siblings, 0 replies; 7+ messages in thread
From: Chaitanya Kulkarni @ 2022-04-01  2:59 UTC (permalink / raw)
  To: Michael Marod; +Cc: Keith Busch, Christoph Hellwig, linux-kernel, linux-block


> Thanks for the info. I don't know of anything block or nvme specific that might
> explain an order of magnitude perf difference.
> 
> Could you try the same test without the filesytems? You mentioned using mdraid,
> so try '--filename=/dev/mdX'. If that also shows similiar performance
> difference, try using one of your nvme member drives directly, like
> '--filename=/dev/nvme1n1'. That should isolate which subsystem is contributing
> to the difference.
> 

With Keith's suggestions you can always take the perf numbers and
compare before and after mainly for nvme_queue_rq() and nvme_irq().

-ck



^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-04-01  2:59 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-30 14:32 NVME performance regression in Linux 5.x due to lack of block level IO queueing Michael Marod
2022-03-30 14:50 ` Michael Marod
2022-03-31  0:11   ` Michael Marod
2022-03-31  4:35     ` Christoph Hellwig
2022-03-31 23:22       ` Michael Marod
2022-04-01  2:27         ` Keith Busch
2022-04-01  2:59           ` Chaitanya Kulkarni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).