* Polled io for Linux kernel 5.x
From: Ober, Frank <frank.ober@intel.com> @ 2019-12-19 19:25 UTC
To: linux-block <linux-block@vger.kernel.org>, linux-nvme <linux-nvme@lists.infradead.org>
Cc: Rajendiran, Swetha; Liang, Mark; Derrick, Jonathan

Hi block/nvme communities,

On 4.x kernels we used to be able to do:

  # echo 1 > /sys/block/nvme0n1/queue/io_poll

and then run a polled-I/O job in fio with pvsync2 as our ioengine and the
hipri flag set. This is actually how we test the very best SSDs, the ones
built on 3D XPoint media.

On 5.x kernels we see the following error when trying to write that device
setting:

  -bash: echo: write error: Invalid argument

We can reload the entire nvme module with the poll_queues parameter, but
this is not well explained or written up anywhere that we could find.

This is reproducible on 5.3 and 5.4 kernels with fio 3.16 builds.

What is the background on what has changed? Jens wrote a note on this back
in 2015 that did apply in the 4.x kernel era, but things have since changed
and no newer LWN article has replaced the one here:
https://lwn.net/Articles/663543/

The confusion that exists today is also documented here:
https://stackoverflow.com/questions/55223883/echo-write-error-invalid-argument-while-setting-io-poll-for-nvme-ssd/

Can a new LWN article be written around the design decisions and usage of
these poll_queues? Why can we not have device/controller-level setup of
polled I/O in today's 5.x kernels, when all that exists is module based?

Thank you!
Frank Ober
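A minimal sketch of the two setups being discussed, assuming a mainline 5.x
kernel with the nvme driver built as a module; the queue count of 4 and the
device name are only illustrative:

  # 4.x era: enable classic polling per block device; on 5.x this write
  # returns EINVAL unless dedicated poll queues were reserved at load time
  echo 1 > /sys/block/nvme0n1/queue/io_poll

  # 5.x era: reserve dedicated poll queues when (re)loading the nvme module
  modprobe -r nvme
  modprobe nvme poll_queues=4

  # for a built-in driver, the same parameter goes on the kernel command
  # line as nvme.poll_queues=4

  # check whether polling is now available on the namespace
  cat /sys/block/nvme0n1/queue/io_poll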
* Re: Polled io for Linux kernel 5.x
From: Keith Busch <kbusch@kernel.org> @ 2019-12-19 20:52 UTC
To: Ober, Frank
Cc: linux-block; Rajendiran, Swetha; Liang, Mark; Derrick, Jonathan; linux-nvme

On Thu, Dec 19, 2019 at 07:25:51PM +0000, Ober, Frank wrote:
> Hi block/nvme communities,
> On 4.x kernels we used to be able to do:
>   # echo 1 > /sys/block/nvme0n1/queue/io_poll
> and then run a polled-I/O job in fio with pvsync2 as our ioengine and the
> hipri flag set. This is actually how we test the very best SSDs, the ones
> built on 3D XPoint media.
>
> On 5.x kernels we see the following error when trying to write that device
> setting:
>   -bash: echo: write error: Invalid argument
>
> We can reload the entire nvme module with the poll_queues parameter, but
> this is not well explained or written up anywhere that we could find.
>
> This is reproducible on 5.3 and 5.4 kernels with fio 3.16 builds.
>
> What is the background on what has changed? Jens wrote a note on this back
> in 2015 that did apply in the 4.x kernel era.

The original polling implementation shared resources that generate
interrupts. That prevented it from running as fast as it could, so dedicated
polling queues are used now.

> Why can we not have device/controller-level setup of polled I/O in today's
> 5.x kernels, when all that exists is module based?

Polled queues are a dedicated resource that we have to reserve up front.
They're optional, so you don't need to use the hipri flag if you have a
device you don't want polled. But we need to know how many queues to reserve
before we've even discovered the controllers, so we don't have a good way to
define it per-controller.
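To illustrate the hipri flag mentioned above, a pvsync2 polled-read job in
fio might look like the following sketch; the job name, device, and runtime
are placeholders:

  # issue 4k random reads at queue depth 1, polling for completions
  fio --name=polled-4k-qd1 --filename=/dev/nvme0n1 --direct=1 \
      --rw=randread --bs=4k --iodepth=1 \
      --ioengine=pvsync2 --hipri \
      --runtime=60 --time_based

With --hipri the pvsync2 engine issues preadv2() with RWF_HIPRI, so
completions are polled on the dedicated queues; dropping --hipri leaves the
same device served by the normal interrupt-driven queues.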
* RE: Polled io for Linux kernel 5.x
From: Ober, Frank <frank.ober@intel.com> @ 2019-12-19 21:59 UTC
To: Keith Busch <kbusch@kernel.org>
Cc: linux-block; Rajendiran, Swetha; Liang, Mark; Derrick, Jonathan; linux-nvme

Thanks Keith, it makes sense to reserve these queues and set them up
separately if that saves hardware interrupts. But why would io_uring then
not need these queues? A stack trace I ran shows that, without the dedicated
queues, I am still entering bio_poll. With pvsync2 I can only do polled I/O
when poll_queues is set? Does io_uring avoid the shared resources?
* Re: Polled io for Linux kernel 5.x
From: Keith Busch <kbusch@kernel.org> @ 2019-12-20 21:20 UTC
To: Ober, Frank
Cc: linux-block; Rajendiran, Swetha; Liang, Mark; Derrick, Jonathan; linux-nvme

On Thu, Dec 19, 2019 at 09:59:14PM +0000, Ober, Frank wrote:
> Thanks Keith, it makes sense to reserve these queues and set them up
> separately if that saves hardware interrupts. But why would io_uring then
> not need these queues? A stack trace I ran shows that, without the
> dedicated queues, I am still entering bio_poll. With pvsync2 I can only do
> polled I/O when poll_queues is set?

Polling can happen only if you have polled queues, so io_uring is not
accomplishing anything by calling iopoll without them. I don't see an
immediately good way to pass that information up to io_uring, though.
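For comparison, an io_uring polled job can be expressed the same way; a
sketch with the same placeholder names, where --hipri asks fio to create the
ring with IORING_SETUP_IOPOLL, which per the above only benefits from
kernel-side polling when nvme poll queues actually exist:

  # same 4k random-read workload, driven through io_uring instead
  fio --name=uring-polled-4k-qd1 --filename=/dev/nvme0n1 --direct=1 \
      --rw=randread --bs=4k --iodepth=1 \
      --ioengine=io_uring --hipri \
      --runtime=60 --time_based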
* RE: Polled io for Linux kernel 5.x
From: Ober, Frank <frank.ober@intel.com> @ 2019-12-31 19:06 UTC
To: Keith Busch <kbusch@kernel.org>
Cc: linux-block; Rajendiran, Swetha; Liang, Mark; Derrick, Jonathan; linux-nvme

Hi Keith, the performance results I see are very close between poll_queues
(pvsync2) and io_uring; I posted them below. Since this topic is pretty new
to people: is there anything we need to tell the reader/user about
poll_queues? What is important for usage? And can it be changed dynamically,
or can poll_queues only be defined at module load time? My goal is to update
the blog we built around testing Optane SSDs. Is there a possibility of an
LWN article that goes deeper into this change and into poll_queues?

What's interesting in the data below is that the clat time for io_uring is
lower (better), but the IOPS is not higher. pvsync2 is the most efficient,
by a small margin, against the newer 3D XPoint device.

Thanks
Frank

Results:
  kernel (elrepo) - 5.4.1-1.el8.elrepo.x86_64
  cpu - Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz - pinned to run at 3.1 GHz
  fio - fio-3.16-64-gfd988

Results of Gen2 Optane SSD with poll_queues (pvsync2) vs io_uring/hipri

pvsync2 (poll queues):

fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=552MiB/s][r=141k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10309: Tue Dec 31 10:49:33 2019
  read: IOPS=141k, BW=552MiB/s (579MB/s)(64.7GiB/120001msec)
    clat (nsec): min=6548, max=186309, avg=6809.48, stdev=497.58
     lat (nsec): min=6572, max=186333, avg=6834.24, stdev=499.28
    clat percentiles (usec):
     |  1.0000th=[    7],  5.0000th=[    7], 10.0000th=[    7],
     | 20.0000th=[    7], 30.0000th=[    7], 40.0000th=[    7],
     | 50.0000th=[    7], 60.0000th=[    7], 70.0000th=[    7],
     | 80.0000th=[    7], 90.0000th=[    7], 95.0000th=[    8],
     | 99.0000th=[    8], 99.5000th=[    8], 99.9000th=[    9],
     | 99.9500th=[   10], 99.9900th=[   18], 99.9990th=[  117],
     | 99.9999th=[  163]
   bw (  KiB/s): min=563512, max=567392, per=100.00%, avg=565635.38, stdev=846.99, samples=239
   iops        : min=140878, max=141848, avg=141408.82, stdev=211.76, samples=239
  lat (usec)   : 10=99.97%, 20=0.03%, 50=0.01%, 100=0.01%, 250=0.01%
  cpu          : usr=6.28%, sys=93.55%, ctx=408, majf=0, minf=96
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16969949,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=552MiB/s (579MB/s), 552MiB/s-552MiB/s (579MB/s-579MB/s), io=64.7GiB (69.5GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16955008/0, merge=0/0, ticks=101477/0, in_queue=0, util=99.95%

io_uring:

fio-3.16-64-gfd988
Starting 1 process
Jobs: 1 (f=1): [r(1)][100.0%][r=538MiB/s][r=138k IOPS][eta 00m:00s]
rand-read-4k-qd1: (groupid=0, jobs=1): err= 0: pid=10797: Tue Dec 31 10:53:29 2019
  read: IOPS=138k, BW=539MiB/s (565MB/s)(63.1GiB/120001msec)
    slat (nsec): min=1029, max=161248, avg=1204.69, stdev=219.02
    clat (nsec): min=262, max=208952, avg=5735.42, stdev=469.73
     lat (nsec): min=6691, max=210136, avg=7008.54, stdev=516.99
    clat percentiles (usec):
     |  1.0000th=[    6],  5.0000th=[    6], 10.0000th=[    6],
     | 20.0000th=[    6], 30.0000th=[    6], 40.0000th=[    6],
     | 50.0000th=[    6], 60.0000th=[    6], 70.0000th=[    6],
     | 80.0000th=[    6], 90.0000th=[    6], 95.0000th=[    6],
     | 99.0000th=[    7], 99.5000th=[    7], 99.9000th=[    8],
     | 99.9500th=[    9], 99.9900th=[   10], 99.9990th=[   52],
     | 99.9999th=[  161]
   bw (  KiB/s): min=548208, max=554504, per=100.00%, avg=551620.30, stdev=984.77, samples=239
   iops        : min=137052, max=138626, avg=137905.07, stdev=246.17, samples=239
  lat (nsec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.01%, 4=0.01%, 10=99.98%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=7.39%, sys=92.44%, ctx=408, majf=0, minf=93
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=16548899,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=539MiB/s (565MB/s), 539MiB/s-539MiB/s (565MB/s-565MB/s), io=63.1GiB (67.8GB), run=120001-120001msec

Disk stats (read/write):
  nvme3n1: ios=16534429/0, merge=0/0, ticks=100320/0, in_queue=0, util=99.95%

Happy New Year Keith!
Thread overview: 5 messages
2019-12-19 19:25 Polled io for Linux kernel 5.x, Ober, Frank
2019-12-19 20:52 ` Keith Busch
2019-12-19 21:59 ` Ober, Frank
2019-12-20 21:20 ` Keith Busch
2019-12-31 19:06 ` Ober, Frank