linux-nvme.lists.infradead.org archive mirror
* Polled io for Linux kernel 5.x
@ 2019-12-19 19:25 Ober, Frank
  2019-12-19 20:52 ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Ober, Frank @ 2019-12-19 19:25 UTC (permalink / raw)
  To: linux-block, linux-nvme
  Cc: Rajendiran, Swetha, Liang, Mark, Derrick, Jonathan

Hi block/nvme communities,
On 4.x kernels we used to be able to do:
# echo 1 > /sys/block/nvme0n1/queue/io_poll
and then run a polled I/O job in fio with pvsync2 as our ioengine and the hipri flag set. This is how we test the very best SSDs built on 3D XPoint media.
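
For reference, a minimal version of the job we run looks roughly like this (the device path and exact options are illustrative, not our actual job file):

# fio --name=rand-read-4k-qd1 --filename=/dev/nvme0n1 --direct=1 \
      --ioengine=pvsync2 --hipri --rw=randread --bs=4k --iodepth=1 \
      --runtime=120 --time_based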

On 5.x kernels we see the following error when trying to write that setting:
-bash: echo: write error: Invalid argument

We can reload the entire nvme module with the poll_queues parameter set, but this is not well explained or written up anywhere that we could find.
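
Concretely, the only way we found is to reload the driver with the parameter set; the queue count of 4 below is just an example:

# modprobe -r nvme      # only possible if nothing is in use on the nvme devices
# modprobe nvme poll_queues=4

(or persistently via nvme.poll_queues=4 on the kernel command line). Once poll queues are reserved, the io_poll write above no longer appears to return EINVAL.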

This is reproducible on 5.3 and 5.4 kernels with fio 3.16 builds.

What is the background on what has changed? Jens wrote this note back in 2015, and it did work in the 4.x kernel era, but things have since changed and there is no newer LWN article to replace this one:
https://lwn.net/Articles/663543/

The confusion that exists today is also documented here: https://stackoverflow.com/questions/55223883/echo-write-error-invalid-argument-while-setting-io-poll-for-nvme-ssd/

Can a new LWN article be written covering the design decisions and usage of these poll_queues?

Why can we not have device/controller-level setup of polled I/O in 5.x kernels today? All that exists is module-based.
Thank you!
Frank Ober

_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


* Re: Polled io for Linux kernel 5.x
  2019-12-19 19:25 Polled io for Linux kernel 5.x Ober, Frank
@ 2019-12-19 20:52 ` Keith Busch
  2019-12-19 21:59   ` Ober, Frank
  0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2019-12-19 20:52 UTC (permalink / raw)
  To: Ober, Frank
  Cc: linux-block, Rajendiran, Swetha, Liang, Mark, Derrick, Jonathan,
	linux-nvme

On Thu, Dec 19, 2019 at 07:25:51PM +0000, Ober, Frank wrote:
> Hi block/nvme communities,
> On 4.x kernels we used to be able to do:
> # echo 1 > /sys/block/nvme0n1/queue/io_poll
> and then run a polled I/O job in fio with pvsync2 as our ioengine and the hipri flag set. This is how we test the very best SSDs built on 3D XPoint media.
> 
> On 5.x kernels we see the following error when trying to write that setting:
> -bash: echo: write error: Invalid argument
> 
> We can reload the entire nvme module with the poll_queues parameter set, but this is not well explained or written up anywhere that we could find.
> 
> This is reproducible on 5.3 and 5.4 kernels with fio 3.16 builds.
> 
> What is the background on what has changed? Jens wrote this note back in 2015, and it did work in the 4.x kernel era.

The original polling implementation shared resources that generate
interrupts. This prevented polling from running as fast as it could, so
dedicated polling queues are used now.

> Why can we not have device/controller-level setup of polled I/O in 5.x kernels today? All that exists is module-based.

Polled queues are a dedicated resource that we have to reserve up front.
They're optional, so you don't need to use the hipri flag if you have a
device you don't want polled. But we need to know how many queues to
reserve before we've even discovered the controllers, so we don't have a
good way to define it per-controller.
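
If it helps, a quick sanity check after loading with poll_queues set might
look like this (paths and log wording are approximate and may differ by
kernel version):

# cat /sys/module/nvme/parameters/poll_queues
4
# dmesg | grep -i 'poll queues'
nvme nvme0: 8/0/4 default/read/poll queues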


* RE: Polled io for Linux kernel 5.x
  2019-12-19 20:52 ` Keith Busch
@ 2019-12-19 21:59   ` Ober, Frank
  2019-12-20 21:20     ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Ober, Frank @ 2019-12-19 21:59 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, Rajendiran, Swetha, Liang, Mark, Derrick, Jonathan,
	linux-nvme

Thanks Keith, it makes sense to reserve and set these up separately if it saves hardware interrupts. But why would io_uring not need these queues? A stack trace I ran shows that, even without the special queues, I am still entering bio_poll. With pvsync2 I can only do polled I/O when poll_queues is set?

Does io_uring avoid the shared resources?
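
The io_uring side of the comparison is run the same way, just switching engines; roughly (options again illustrative):

# fio --name=rand-read-4k-qd1 --filename=/dev/nvme0n1 --direct=1 \
      --ioengine=io_uring --hipri --rw=randread --bs=4k --iodepth=1 \
      --runtime=120 --time_based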




* Re: Polled io for Linux kernel 5.x
  2019-12-19 21:59   ` Ober, Frank
@ 2019-12-20 21:20     ` Keith Busch
  2019-12-31 19:06       ` Ober, Frank
  0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2019-12-20 21:20 UTC (permalink / raw)
  To: Ober, Frank
  Cc: linux-block, Rajendiran, Swetha, Liang, Mark, Derrick, Jonathan,
	linux-nvme

On Thu, Dec 19, 2019 at 09:59:14PM +0000, Ober, Frank wrote:
> Thanks Keith, it makes sense to reserve and set these up separately if
> it saves hardware interrupts. But why would io_uring not need these
> queues? A stack trace I ran shows that, even without the special queues,
> I am still entering bio_poll. With pvsync2 I can only do polled I/O when
> poll_queues is set?

Polling can happen only if you have polled queues, so io_uring is not
accomplishing anything by calling iopoll. I don't see an immediately
good way to pass that information up to io_uring, though.


* RE: Polled io for Linux kernel 5.x
  2019-12-20 21:20     ` Keith Busch
@ 2019-12-31 19:06       ` Ober, Frank
  0 siblings, 0 replies; 5+ messages in thread
From: Ober, Frank @ 2019-12-31 19:06 UTC (permalink / raw)
  To: Keith Busch
  Cc: linux-block, Rajendiran, Swetha, Liang, Mark, Derrick, Jonathan,
	linux-nvme

Hi Keith, the performance results I see are very close between poll_queues (pvsync2) and io_uring. I posted them below, since I think this topic is pretty new to people.

Is there anything we need to tell the reader/user about poll_queues? What is important for usage?

And can it be dynamic, or can poll_queues only be defined at module load time?

My goal is to update the blog we built around testing Optane SSDs. Is there a possibility of creating an LWN article that goes deeper into this change to poll_queues?

What's interesting in the data below is that the completion latency (clat) for io_uring is lower (better), but the IOPS is not higher. pvsync2 is the most efficient, by a small margin, against the newer 3D XPoint device.
Thanks
Frank

Results:
kernel (elrepo) - 5.4.1-1.el8.elrepo.x86_64
cpu - Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz - pinned to run at 3.1 GHz
fio - fio-3.16-64-gfd988
Results of Gen2 Optane SSD with poll_queues (pvsync2) vs io_uring/hipri
pvsync2 (poll queues)
fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=552MiB/s][r=141k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10309: Tue Dec 31 10:49:33 2019
  read: IOPS=141k, BW=552MiB/s (579MB/s)(64.7GiB/120001msec)
    clat (nsec): min=6548, max=186309, avg=6809.48, stdev=497.58
     lat (nsec): min=6572, max=186333, avg=6834.24, stdev=499.28
    clat percentiles (usec):
     |  1.0000th=[    7],  5.0000th=[    7], 10.0000th=[    7],
     | 20.0000th=[    7], 30.0000th=[    7], 40.0000th=[    7],
     | 50.0000th=[    7], 60.0000th=[    7], 70.0000th=[    7],
     | 80.0000th=[    7], 90.0000th=[    7], 95.0000th=[    8],
     | 99.0000th=[    8], 99.5000th=[    8], 99.9000th=[    9],
     | 99.9500th=[   10], 99.9900th=[   18], 99.9990th=[  117],
     | 99.9999th=[  163]
   bw (  KiB/s): min=563512, max=567392, per=100.00%, avg=565635.38, stdev=846.99, samples=239
   iops        : min=140878, max=141848, avg=141408.82, stdev=211.76, samples=239
  lat (usec)   : 10=99.97%, 20=0.03%, 50=0.01%, 100=0.01%, 250=0.01%
  cpu          : usr=6.28%, sys=93.55%, ctx=408, majf=0, minf=96
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16969949,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=552MiB/s (579MB/s), 552MiB/s-552MiB/s (579MB/s-579MB/s), io=64.7GiB (69.5GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16955008/0, merge=0/0, ticks=101477/0, in_queue=0, util=99.95%

io_uring:
fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=538MiB/s][r=138k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10797: Tue Dec 31 10:53:29 2019
  read: IOPS=138k, BW=539MiB/s (565MB/s)(63.1GiB/120001msec)
    slat (nsec): min=1029, max=161248, avg=1204.69, stdev=219.02
    clat (nsec): min=262, max=208952, avg=5735.42, stdev=469.73
     lat (nsec): min=6691, max=210136, avg=7008.54, stdev=516.99
    clat percentiles (usec):
     |  1.0000th=[    6],  5.0000th=[    6], 10.0000th=[    6],
     | 20.0000th=[    6], 30.0000th=[    6], 40.0000th=[    6],
     | 50.0000th=[    6], 60.0000th=[    6], 70.0000th=[    6],
     | 80.0000th=[    6], 90.0000th=[    6], 95.0000th=[    6],
     | 99.0000th=[    7], 99.5000th=[    7], 99.9000th=[    8],
     | 99.9500th=[    9], 99.9900th=[   10], 99.9990th=[   52],
     | 99.9999th=[  161]
   bw (  KiB/s): min=548208, max=554504, per=100.00%, avg=551620.30, stdev=984.77, samples=239
   iops        : min=137052, max=138626, avg=137905.07, stdev=246.17, samples=239
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.98%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=7.39%, sys=92.44%, ctx=408, majf=0, minf=93
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16548899,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=539MiB/s (565MB/s), 539MiB/s-539MiB/s (565MB/s-565MB/s), io=63.1GiB (67.8GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16534429/0, merge=0/0, ticks=100320/0, in_queue=0, util=99.95%

Happy New Year Keith!

