* hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
@ 2020-12-10 20:51 Andres Freund
  2020-12-10 23:12 ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Andres Freund @ 2020-12-10 20:51 UTC (permalink / raw)
  To: linux-block, Jens Axboe

Hi,

When using hybrid polling (i.e. echo 0 >
/sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
an iodepth > 1. Sometimes fio hangs, other times the performance is
really poor. I reproduced this with SSDs from different vendors.


$ echo -1 | sudo tee /sys/block/nvme1n1/queue/io_poll_delay
$ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 1
93.4k iops

$ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32
426k iops

$ echo 0 | sudo tee /sys/block/nvme1n1/queue/io_poll_delay
$ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 1
94.3k iops

$ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32
167 iops
fio took 33s


However, if I ask fio / io_uring to perform all those IOs at once, the performance is pretty decent again (but obviously that's not that desirable)

$ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32 --iodepth_batch_submit=32 --iodepth_batch_complete_min=32
394k iops


So it looks like there's something wrong around tracking what needs to
be polled for in hybrid mode.

Greetings,

Andres Freund


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-10 20:51 hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5 Andres Freund
@ 2020-12-10 23:12 ` Pavel Begunkov
  2020-12-10 23:15   ` Pavel Begunkov
  2020-12-11  1:12   ` Andres Freund
  0 siblings, 2 replies; 14+ messages in thread
From: Pavel Begunkov @ 2020-12-10 23:12 UTC (permalink / raw)
  To: Andres Freund, linux-block, Jens Axboe

On 10/12/2020 20:51, Andres Freund wrote:
> Hi,
> 
> When using hybrid polling (i.e echo 0 >
> /sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
> an iodepth > 1. Sometimes fio hangs, other times the performance is
> really poor. I reproduced this with SSDs from different vendors.

Can you get poll stats from debugfs while running with hybrid?
For both iodepth=1 and 32.

cat <debugfs>/block/nvme1n1/poll_stat

e.g. if already mounted
cat /sys/kernel/debug/block/nvme1n1/poll_stat

> 
> 
> $ echo -1 | sudo tee /sys/block/nvme1n1/queue/io_poll_delay
> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 1
> 93.4k iops
> 
> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32
> 426k iops
> 
> $ echo 0 | sudo tee /sys/block/nvme1n1/queue/io_poll_delay
> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 1
> 94.3k iops
> 
> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32
> 167 iops
> fio took 33s
> 
> 
> However, if I ask fio / io_uring to perform all those IOs at once, the performance is pretty decent again (but obviously that's not that desirable)
> 
> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32 --iodepth_batch_submit=32 --iodepth_batch_complete_min=32
> 394k iops
> 
> 
> So it looks like there's something wrong around tracking what needs to
> be polled for in hybrid mode.
-- 
Pavel Begunkov


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-10 23:12 ` Pavel Begunkov
@ 2020-12-10 23:15   ` Pavel Begunkov
  2020-12-11  1:19     ` Andres Freund
  2020-12-11  1:12   ` Andres Freund
  1 sibling, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2020-12-10 23:15 UTC (permalink / raw)
  To: Andres Freund, linux-block, Jens Axboe

On 10/12/2020 23:12, Pavel Begunkov wrote:
> On 10/12/2020 20:51, Andres Freund wrote:
>> Hi,
>>
>> When using hybrid polling (i.e echo 0 >
>> /sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
>> an iodepth > 1. Sometimes fio hangs, other times the performance is
>> really poor. I reproduced this with SSDs from different vendors.
> 
> Can you get poll stats from debugfs while running with hybrid?
> For both iodepth=1 and 32.

Even better, for iodepth=32 it would help to see it dynamically, i.e. cat
it several times while the run is in progress.

> 
> cat <debugfs>/block/nvme1n1/poll_stat
> 
> e.g. if already mounted
> cat /sys/kernel/debug/block/nvme1n1/poll_stat
> 
>>
>>
>> $ echo -1 | sudo tee /sys/block/nvme1n1/queue/io_poll_delay
>> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 1
>> 93.4k iops
>>
>> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32
>> 426k iops
>>
>> $ echo 0 | sudo tee /sys/block/nvme1n1/queue/io_poll_delay
>> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 1
>> 94.3k iops
>>
>> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32
>> 167 iops
>> fio took 33s
>>
>>
>> However, if I ask fio / io_uring to perform all those IOs at once, the performance is pretty decent again (but obviously that's not that desirable)
>>
>> $ fio --ioengine io_uring --rw write --filesize 1GB --overwrite=1 --name=test --direct=1 --bs=$((1024*4)) --time_based=1 --runtime=10 --hipri --iodepth 32 --iodepth_batch_submit=32 --iodepth_batch_complete_min=32
>> 394k iops
>>
>>
>> So it looks like there's something wrong around tracking what needs to
>> be polled for in hybrid mode.

-- 
Pavel Begunkov


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-10 23:12 ` Pavel Begunkov
  2020-12-10 23:15   ` Pavel Begunkov
@ 2020-12-11  1:12   ` Andres Freund
  1 sibling, 0 replies; 14+ messages in thread
From: Andres Freund @ 2020-12-11  1:12 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: linux-block, Jens Axboe

Hi,

On 2020-12-10 23:12:15 +0000, Pavel Begunkov wrote:
> Can you get poll stats from debugfs while running with hybrid?
> For both iodepth=1 and 32.
> 
> cat <debugfs>/block/nvme1n1/poll_stat

Sure.

QD1:
read  (512 Bytes): samples=2, mean=6673855, min=68005, max=13279705
write (512 Bytes): samples=1, mean=13232585, min=13232585, max=13232585
read  (1024 Bytes): samples=0
write (1024 Bytes): samples=4, mean=4968280, min=4815727, max=5121434
read  (2048 Bytes): samples=0
write (2048 Bytes): samples=2, mean=2090473, min=2089735, max=2091212
read  (4096 Bytes): samples=3, mean=75684, min=68069, max=88749
write (4096 Bytes): samples=9901, mean=7424, min=6636, max=27371
read  (8192 Bytes): samples=12, mean=1178627, min=59709, max=13310383
write (8192 Bytes): samples=1, mean=13231993, min=13231993, max=13231993
read  (16384 Bytes): samples=1, mean=13376610, min=13376610, max=13376610
write (16384 Bytes): samples=1, mean=13230532, min=13230532, max=13230532
read  (32768 Bytes): samples=12, mean=128980, min=81628, max=173096
write (32768 Bytes): samples=1, mean=13240766, min=13240766, max=13240766
read  (65536 Bytes): samples=1, mean=234465, min=234465, max=234465
write (65536 Bytes): samples=3, mean=4224941, min=66043, max=12534481

QD32:
read  (512 Bytes): samples=2, mean=6673855, min=68005, max=13279705
write (512 Bytes): samples=1, mean=13232585, min=13232585, max=13232585
read  (1024 Bytes): samples=0
write (1024 Bytes): samples=4, mean=4614410, min=4576806, max=4652813
read  (2048 Bytes): samples=0
write (2048 Bytes): samples=2, mean=2090473, min=2089735, max=2091212
read  (4096 Bytes): samples=3, mean=75684, min=68069, max=88749
write (4096 Bytes): samples=32, mean=6155072604, min=6155008198, max=6155132851
read  (8192 Bytes): samples=12, mean=1178627, min=59709, max=13310383
write (8192 Bytes): samples=1, mean=13231993, min=13231993, max=13231993
read  (16384 Bytes): samples=1, mean=13376610, min=13376610, max=13376610
write (16384 Bytes): samples=1, mean=13230532, min=13230532, max=13230532
read  (32768 Bytes): samples=12, mean=128980, min=81628, max=173096
write (32768 Bytes): samples=1, mean=13240766, min=13240766, max=13240766
read  (65536 Bytes): samples=1, mean=234465, min=234465, max=234465
write (65536 Bytes): samples=3, mean=4224941, min=66043, max=12534481


I also saw
[1036471.387012] nvme nvme1: I/O 576 QID 32 timeout, aborting
[1036471.387123] nvme nvme1: Abort status: 0x0
during one of the QD32 runs just now. But not in all of them.


- Andres


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-10 23:15   ` Pavel Begunkov
@ 2020-12-11  1:19     ` Andres Freund
  2020-12-11  1:44       ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Andres Freund @ 2020-12-11  1:19 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: linux-block, Jens Axboe

On 2020-12-10 23:15:15 +0000, Pavel Begunkov wrote:
> On 10/12/2020 23:12, Pavel Begunkov wrote:
> > On 10/12/2020 20:51, Andres Freund wrote:
> >> Hi,
> >>
> >> When using hybrid polling (i.e echo 0 >
> >> /sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
> >> an iodepth > 1. Sometimes fio hangs, other times the performance is
> >> really poor. I reproduced this with SSDs from different vendors.
> > 
> > Can you get poll stats from debugfs while running with hybrid?
> > For both iodepth=1 and 32.
> 
> Even better if for 32 you would show it in dynamic, i.e. cat it several
> times while running it.

Should have read all the email before responding...

This is a loop grepping for 4k writes (the only type I am doing) at a 1s
interval. I started it before the fio run (after one with
iodepth=1). Once the iodepth 32 run finished (--timeout 10, but it took
42s), I started an --iodepth 1 run.

write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351

Shortly after this I started the iodepth=1 run:

write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
write (4096 Bytes): samples=1, mean=2216868822, min=2216868822, max=2216868822
write (4096 Bytes): samples=1, mean=2216868822, min=2216868822, max=2216868822
write (4096 Bytes): samples=1, mean=2216851683, min=2216851683, max=2216851683
write (4096 Bytes): samples=1, mean=1108526485, min=1108526485, max=1108526485
write (4096 Bytes): samples=1, mean=1108522634, min=1108522634, max=1108522634
write (4096 Bytes): samples=1, mean=277274275, min=277274275, max=277274275
write (4096 Bytes): samples=19, mean=5787160, min=5496432, max=10087444
write (4096 Bytes): samples=1185, mean=67915, min=66408, max=145100
write (4096 Bytes): samples=1185, mean=67915, min=66408, max=145100
write (4096 Bytes): samples=1185, mean=67915, min=66408, max=145100
write (4096 Bytes): samples=1703, mean=50492, min=39200, max=13155316
write (4096 Bytes): samples=9983, mean=7408, min=6648, max=29950
write (4096 Bytes): samples=9980, mean=7395, min=6574, max=23454
write (4096 Bytes): samples=10011, mean=7381, min=6620, max=25533
write (4096 Bytes): samples=9381, mean=7936, min=7270, max=47315
write (4096 Bytes): samples=9295, mean=7377, min=6665, max=23490
write (4096 Bytes): samples=9987, mean=7415, min=6629, max=23352
write (4096 Bytes): samples=9992, mean=7411, min=6651, max=23071
write (4096 Bytes): samples=9404, mean=7941, min=7234, max=24193
write (4096 Bytes): samples=9434, mean=7942, min=7240, max=62745
write (4096 Bytes): samples=5370, mean=7935, min=7268, max=24116
write (4096 Bytes): samples=5370, mean=7935, min=7268, max=24116
write (4096 Bytes): samples=5370, mean=7935, min=7268, max=24116

Greetings,

Andres Freund


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-11  1:19     ` Andres Freund
@ 2020-12-11  1:44       ` Pavel Begunkov
  2020-12-11  3:37         ` Keith Busch
  2020-12-11  8:00         ` Andres Freund
  0 siblings, 2 replies; 14+ messages in thread
From: Pavel Begunkov @ 2020-12-11  1:44 UTC (permalink / raw)
  To: Andres Freund; +Cc: linux-block, Jens Axboe

On 11/12/2020 01:19, Andres Freund wrote:
> On 2020-12-10 23:15:15 +0000, Pavel Begunkov wrote:
>> On 10/12/2020 23:12, Pavel Begunkov wrote:
>>> On 10/12/2020 20:51, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> When using hybrid polling (i.e echo 0 >
>>>> /sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
>>>> an iodepth > 1. Sometimes fio hangs, other times the performance is
>>>> really poor. I reproduced this with SSDs from different vendors.
>>>
>>> Can you get poll stats from debugfs while running with hybrid?
>>> For both iodepth=1 and 32.
>>
>> Even better if for 32 you would show it in dynamic, i.e. cat it several
>> times while running it.
> 
> Should read all email before responding...
> 
> This is a loop of grepping for 4k writes (only type I am doing), with 1s
> interval. I started it before the fio run (after one with
> iodepth=1). Once the iodepth 32 run finished (--timeout 10, but took
> 42s0, I started a --iodepth 1 run.

Thanks! Your mean grows to more than 30s, so it'll sleep for 15s for each
IO. Yep, the sleep time calculation is clearly broken for you.
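For reference, the sleep heuristic boils down to roughly the following
(a userspace-style sketch of the mean/2 logic in blk_mq_poll_nsecs(), not
the kernel code itself):

#include <stdint.h>

/* Sketch: hybrid polling sleeps for half of the observed mean completion
 * time for the request's size bucket before it starts busy-polling. */
static uint64_t hybrid_sleep_ns(uint64_t mean_ns, uint64_t nr_samples)
{
        if (nr_samples == 0)
                return 0;               /* no statistics yet: poll right away */
        return (mean_ns + 1) / 2;       /* a ~30s mean -> a ~15s sleep */
}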

In general the current hybrid polling doesn't work well with high QD;
that's because the statistics it is based on are not very resilient to all
sorts of problems. And it might be the problem I described long ago:

https://www.spinics.net/lists/linux-block/msg61479.html
https://lkml.org/lkml/2019/4/30/120


Are you interested in it just out of curiosity, or do you have a good
use case? Modern SSDs are so fast that even with QD1 the overhead of
sleeping is getting considerable, all the more so for higher QD.
Because if there is no one who really cares, then instead of adding
elaborate correction schemes, I'd rather just put max(time, 10ms) and
be done with it.

> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=3002, mean=7402, min=6683, max=22498
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=517838676, min=517774856, max=517901274
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=7365701186, min=7365642813, max=7365756630
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> 
> Shortly after this I started the iodepth=1 run:
> 
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=32, mean=30203322069, min=30203263000, max=30203381351
> write (4096 Bytes): samples=1, mean=2216868822, min=2216868822, max=2216868822
> write (4096 Bytes): samples=1, mean=2216868822, min=2216868822, max=2216868822
> write (4096 Bytes): samples=1, mean=2216851683, min=2216851683, max=2216851683
> write (4096 Bytes): samples=1, mean=1108526485, min=1108526485, max=1108526485
> write (4096 Bytes): samples=1, mean=1108522634, min=1108522634, max=1108522634
> write (4096 Bytes): samples=1, mean=277274275, min=277274275, max=277274275
> write (4096 Bytes): samples=19, mean=5787160, min=5496432, max=10087444
> write (4096 Bytes): samples=1185, mean=67915, min=66408, max=145100
> write (4096 Bytes): samples=1185, mean=67915, min=66408, max=145100
> write (4096 Bytes): samples=1185, mean=67915, min=66408, max=145100
> write (4096 Bytes): samples=1703, mean=50492, min=39200, max=13155316
> write (4096 Bytes): samples=9983, mean=7408, min=6648, max=29950
> write (4096 Bytes): samples=9980, mean=7395, min=6574, max=23454
> write (4096 Bytes): samples=10011, mean=7381, min=6620, max=25533
> write (4096 Bytes): samples=9381, mean=7936, min=7270, max=47315
> write (4096 Bytes): samples=9295, mean=7377, min=6665, max=23490
> write (4096 Bytes): samples=9987, mean=7415, min=6629, max=23352
> write (4096 Bytes): samples=9992, mean=7411, min=6651, max=23071
> write (4096 Bytes): samples=9404, mean=7941, min=7234, max=24193
> write (4096 Bytes): samples=9434, mean=7942, min=7240, max=62745
> write (4096 Bytes): samples=5370, mean=7935, min=7268, max=24116
> write (4096 Bytes): samples=5370, mean=7935, min=7268, max=24116
> write (4096 Bytes): samples=5370, mean=7935, min=7268, max=24116

-- 
Pavel Begunkov


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-11  1:44       ` Pavel Begunkov
@ 2020-12-11  3:37         ` Keith Busch
  2020-12-11 12:38           ` Pavel Begunkov
  2020-12-11  8:00         ` Andres Freund
  1 sibling, 1 reply; 14+ messages in thread
From: Keith Busch @ 2020-12-11  3:37 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Andres Freund, linux-block, Jens Axboe

On Fri, Dec 11, 2020 at 01:44:38AM +0000, Pavel Begunkov wrote:
> On 11/12/2020 01:19, Andres Freund wrote:
> > On 2020-12-10 23:15:15 +0000, Pavel Begunkov wrote:
> >> On 10/12/2020 23:12, Pavel Begunkov wrote:
> >>> On 10/12/2020 20:51, Andres Freund wrote:
> >>>> Hi,
> >>>>
> >>>> When using hybrid polling (i.e echo 0 >
> >>>> /sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
> >>>> an iodepth > 1. Sometimes fio hangs, other times the performance is
> >>>> really poor. I reproduced this with SSDs from different vendors.
> >>>
> >>> Can you get poll stats from debugfs while running with hybrid?
> >>> For both iodepth=1 and 32.
> >>
> >> Even better if for 32 you would show it in dynamic, i.e. cat it several
> >> times while running it.
> > 
> > Should read all email before responding...
> > 
> > This is a loop of grepping for 4k writes (only type I am doing), with 1s
> > interval. I started it before the fio run (after one with
> > iodepth=1). Once the iodepth 32 run finished (--timeout 10, but took
> > 42s0, I started a --iodepth 1 run.
> 
> Thanks! Your mean grows to more than 30s, so it'll sleep for 15s for each
> IO. Yep, the sleep time calculation is clearly broken for you.
> 
> In general the current hybrid polling doesn't work well with high QD,
> that's because statistics it based on are not very resilient to all sorts
> of problems. And it might be a problem I described long ago
> 
> https://www.spinics.net/lists/linux-block/msg61479.html
> https://lkml.org/lkml/2019/4/30/120

It sounds like the statistic is using the wrong criteria. It ought to
use the average time for the next available completion for any request
rather than the average latency of a specific IO. It might work at high
depth if the hybrid poll knew the hctx's depth when calculating the
sleep time, but that information doesn't appear to be readily available.
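As a purely hypothetical sketch of that idea (nothing like this exists in
the kernel today): scale the per-request mean by the number of requests in
flight, so the sleep approximates the time until the next completion
rather than one request's full round trip:

#include <stdint.h>

/* Hypothetical: estimate time until the *next* completion on the hw queue
 * by dividing the per-request mean by the current in-flight count. */
static uint64_t next_completion_ns(uint64_t mean_ns, unsigned int inflight)
{
        if (inflight == 0)
                return mean_ns;
        return mean_ns / inflight;      /* e.g. a 128us mean at QD=32 -> ~4us */
}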


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-11  1:44       ` Pavel Begunkov
  2020-12-11  3:37         ` Keith Busch
@ 2020-12-11  8:00         ` Andres Freund
  1 sibling, 0 replies; 14+ messages in thread
From: Andres Freund @ 2020-12-11  8:00 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: linux-block, Jens Axboe

Hi,

On 2020-12-11 01:44:38 +0000, Pavel Begunkov wrote:
> In general the current hybrid polling doesn't work well with high QD,
> that's because statistics it based on are not very resilient to all sorts
> of problems. And it might be a problem I described long ago
> 
> https://www.spinics.net/lists/linux-block/msg61479.html
> https://lkml.org/lkml/2019/4/30/120

Interesting.


> Are you interested in it just out of curiosity, or you have a good
> use case? Modern SSDs are so fast that even with QD1 the sleep overhead
> on sleeping getting considerable, all the more so for higher QD.

It's a bit more than just idle curiosity, but not a strong need (yet). I
was experimenting with using it for postgres WAL writes. The CPU cost of
"classic" polling is high enough to make it not super attractive in a
lot of cases.  Often enough the QD is just 1 for data integrity writes
on fast drives, but there are also cases (bulk load in particular, or
high-concurrency OLTP) where having multiple IOs in flight is important.


> Because if there is no one who really cares, then instead of adding
> elaborated correction schemes, I'd rather put max(time, 10ms) and
> that's it.

I wonder if it's doable to just switch from hybrid polling to classic
polling if there's more than one request in flight?
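Purely as an illustration of what I mean (this is not an existing knob or
kernel code), something like:

#include <stdint.h>

/* Hypothetical: fall back to classic polling (no sleep) above QD=1. */
static uint64_t sleep_before_poll_ns(uint64_t mean_ns, unsigned int inflight)
{
        if (inflight > 1)
                return 0;               /* classic polling: no sleep at all */
        return (mean_ns + 1) / 2;       /* QD=1: keep the half-mean hybrid sleep */
}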

Greetings,

Andres Freund


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-11  3:37         ` Keith Busch
@ 2020-12-11 12:38           ` Pavel Begunkov
  2020-12-13 18:19             ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2020-12-11 12:38 UTC (permalink / raw)
  To: Keith Busch; +Cc: Andres Freund, linux-block, Jens Axboe

On 11/12/2020 03:37, Keith Busch wrote:
> On Fri, Dec 11, 2020 at 01:44:38AM +0000, Pavel Begunkov wrote:
>> On 11/12/2020 01:19, Andres Freund wrote:
>>> On 2020-12-10 23:15:15 +0000, Pavel Begunkov wrote:
>>>> On 10/12/2020 23:12, Pavel Begunkov wrote:
>>>>> On 10/12/2020 20:51, Andres Freund wrote:
>>>>>> Hi,
>>>>>>
>>>>>> When using hybrid polling (i.e echo 0 >
>>>>>> /sys/block/nvme1n1/queue/io_poll_delay) I see stalls with fio when using
>>>>>> an iodepth > 1. Sometimes fio hangs, other times the performance is
>>>>>> really poor. I reproduced this with SSDs from different vendors.
>>>>>
>>>>> Can you get poll stats from debugfs while running with hybrid?
>>>>> For both iodepth=1 and 32.
>>>>
>>>> Even better if for 32 you would show it in dynamic, i.e. cat it several
>>>> times while running it.
>>>
>>> Should read all email before responding...
>>>
>>> This is a loop of grepping for 4k writes (only type I am doing), with 1s
>>> interval. I started it before the fio run (after one with
>>> iodepth=1). Once the iodepth 32 run finished (--timeout 10, but took
>>> 42s0, I started a --iodepth 1 run.
>>
>> Thanks! Your mean grows to more than 30s, so it'll sleep for 15s for each
>> IO. Yep, the sleep time calculation is clearly broken for you.
>>
>> In general the current hybrid polling doesn't work well with high QD,
>> that's because statistics it based on are not very resilient to all sorts
>> of problems. And it might be a problem I described long ago
>>
>> https://www.spinics.net/lists/linux-block/msg61479.html
>> https://lkml.org/lkml/2019/4/30/120
> 
> It sounds like the statistic is using the wrong criteria. It ought to
> use the average time for the next available completion for any request
> rather than the average latency of a specific IO. It might work at high
> depth if the hybrid poll knew the hctx's depth when calculating the
> sleep time, but that information doesn't appear to be readily available.

It polls (and so sleeps) from submission of a request to its completion,
not from request to request. It looks like the other scheme doesn't work
well when you don't have a constant-ish flow of requests, e.g. QD=1 with
varying latency in userspace.

-- 
Pavel Begunkov


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-11 12:38           ` Pavel Begunkov
@ 2020-12-13 18:19             ` Keith Busch
  2020-12-14 17:58               ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2020-12-13 18:19 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Andres Freund, linux-block, Jens Axboe

On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
> On 11/12/2020 03:37, Keith Busch wrote:
> > It sounds like the statistic is using the wrong criteria. It ought to
> > use the average time for the next available completion for any request
> > rather than the average latency of a specific IO. It might work at high
> > depth if the hybrid poll knew the hctx's depth when calculating the
> > sleep time, but that information doesn't appear to be readily available.
> 
> It polls (and so sleeps) from submission of a request to its completion,
> not from request to request. 

Right, but the polling thread is responsible for completing all
requests, not just the most recent cookie. If the sleep timer uses the
round trip of a single request when you have a high queue depth, there
are likely to be many completions in the pipeline that aren't getting
polled on time. This feeds back to the mean latency, pushing the sleep
timer further out.

> Looks like the other scheme doesn't suit well
> when you don't have a constant-ish flow of requests, e.g. QD=1 and with
> different latency in the userspace.

The idea I'm trying to convey shouldn't affect QD1. The following patch
seems to test "ok", but I know of at least a few scenarios where it
falls apart...

---
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e9799fed98c7..cab2dafcd3a9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3727,6 +3727,7 @@ static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
 static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 				       struct request *rq)
 {
+	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
 	unsigned long ret = 0;
 	int bucket;
 
@@ -3753,6 +3754,15 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 	if (q->poll_stat[bucket].nr_samples)
 		ret = (q->poll_stat[bucket].mean + 1) / 2;
 
+	/*
+	 * Finding completions on the first poll indicates we're sleeping too
+	 * long and pushing the latency statistic in the wrong direction for
+	 * future sleep consideration. Poll immediately until the average time
+	 * becomes more useful.
+	 */
+	if (hctx->poll_invoked < 3 * hctx->poll_considered)
+		return 0;
+
 	return ret;
 }
 
---


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-13 18:19             ` Keith Busch
@ 2020-12-14 17:58               ` Pavel Begunkov
  2020-12-14 18:23                 ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2020-12-14 17:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: Andres Freund, linux-block, Jens Axboe

On 13/12/2020 18:19, Keith Busch wrote:
> On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
>> On 11/12/2020 03:37, Keith Busch wrote:
>>> It sounds like the statistic is using the wrong criteria. It ought to
>>> use the average time for the next available completion for any request
>>> rather than the average latency of a specific IO. It might work at high
>>> depth if the hybrid poll knew the hctx's depth when calculating the
>>> sleep time, but that information doesn't appear to be readily available.
>>
>> It polls (and so sleeps) from submission of a request to its completion,
>> not from request to request. 
> 
> Right, but the polling thread is responsible for completing all
> requests, not just the most recent cookie. If the sleep timer uses the
> round trip of a single request when you have a high queue depth, there
> are likely to be many completions in the pipeline that aren't getting
> polled on time. This feeds back to the mean latency, pushing the sleep
> timer further out.

It rather polls for a particular request and completes others along the
way, and that's the problem. Completion-to-completion would make much more
sense if we had a poll task separate from the waiters.

Or if the semantics were not "poll for a request" but "poll a file".
And since io_uring, IMHO that actually makes more sense even for
non-hybrid polling.

> 
>> Looks like the other scheme doesn't suit well
>> when you don't have a constant-ish flow of requests, e.g. QD=1 and with
>> different latency in the userspace.
> 
> The idea I'm trying to convey shouldn't affect QD1. The following patch
> seems to test "ok", but I know of at least a few scenarios where it
> falls apart...
> 
> ---
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e9799fed98c7..cab2dafcd3a9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3727,6 +3727,7 @@ static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
>  static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
>  				       struct request *rq)
>  {
> +	struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  	unsigned long ret = 0;
>  	int bucket;
>  
> @@ -3753,6 +3754,15 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
>  	if (q->poll_stat[bucket].nr_samples)
>  		ret = (q->poll_stat[bucket].mean + 1) / 2;
>  
> +	/*
> +	 * Finding completions on the first poll indicates we're sleeping too
> +	 * long and pushing the latency statistic in the wrong direction for
> +	 * future sleep consideration. Poll immediately until the average time
> +	 * becomes more useful.
> +	 */
> +	if (hctx->poll_invoked < 3 * hctx->poll_considered)
> +		return 0;
> +
>  	return ret;
>  }
>  
> ---
> 

-- 
Pavel Begunkov


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-14 17:58               ` Pavel Begunkov
@ 2020-12-14 18:23                 ` Keith Busch
  2020-12-14 19:01                   ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2020-12-14 18:23 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Andres Freund, linux-block, Jens Axboe

On Mon, Dec 14, 2020 at 05:58:56PM +0000, Pavel Begunkov wrote:
> On 13/12/2020 18:19, Keith Busch wrote:
> > On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
> >> On 11/12/2020 03:37, Keith Busch wrote:
> >>> It sounds like the statistic is using the wrong criteria. It ought to
> >>> use the average time for the next available completion for any request
> >>> rather than the average latency of a specific IO. It might work at high
> >>> depth if the hybrid poll knew the hctx's depth when calculating the
> >>> sleep time, but that information doesn't appear to be readily available.
> >>
> >> It polls (and so sleeps) from submission of a request to its completion,
> >> not from request to request. 
> > 
> > Right, but the polling thread is responsible for completing all
> > requests, not just the most recent cookie. If the sleep timer uses the
> > round trip of a single request when you have a high queue depth, there
> > are likely to be many completions in the pipeline that aren't getting
> > polled on time. This feeds back to the mean latency, pushing the sleep
> > timer further out.
> 
> It rather polls for a particular request and completes others by the way,
> and that's the problem. Completion-to-completion would make much more
> sense if we'd have a separate from waiters poll task.
> 
> Or if the semantics would be not "poll for a request", but poll a file.
> And since io_uring IMHO that actually makes more sense even for
> non-hybrid polling.

The existing block layer polling semantics don't poll for a specific
request. Please see the blk_mq_ops driver API for the 'poll' function.
It takes a hardware context, which does not indicate a specific request.
See also the blk_poll() function, which doesn't consider any specific
request in order to break out of the polling loop.


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-14 18:23                 ` Keith Busch
@ 2020-12-14 19:01                   ` Pavel Begunkov
  2020-12-16 22:22                     ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2020-12-14 19:01 UTC (permalink / raw)
  To: Keith Busch; +Cc: Andres Freund, linux-block, Jens Axboe

On 14/12/2020 18:23, Keith Busch wrote:
> On Mon, Dec 14, 2020 at 05:58:56PM +0000, Pavel Begunkov wrote:
>> On 13/12/2020 18:19, Keith Busch wrote:
>>> On Fri, Dec 11, 2020 at 12:38:43PM +0000, Pavel Begunkov wrote:
>>>> On 11/12/2020 03:37, Keith Busch wrote:
>>>>> It sounds like the statistic is using the wrong criteria. It ought to
>>>>> use the average time for the next available completion for any request
>>>>> rather than the average latency of a specific IO. It might work at high
>>>>> depth if the hybrid poll knew the hctx's depth when calculating the
>>>>> sleep time, but that information doesn't appear to be readily available.
>>>>
>>>> It polls (and so sleeps) from submission of a request to its completion,
>>>> not from request to request. 
>>>
>>> Right, but the polling thread is responsible for completing all
>>> requests, not just the most recent cookie. If the sleep timer uses the
>>> round trip of a single request when you have a high queue depth, there
>>> are likely to be many completions in the pipeline that aren't getting
>>> polled on time. This feeds back to the mean latency, pushing the sleep
>>> timer further out.
>>
>> It rather polls for a particular request and completes others by the way,
>> and that's the problem. Completion-to-completion would make much more
>> sense if we'd have a separate from waiters poll task.
>>
>> Or if the semantics would be not "poll for a request", but poll a file.
>> And since io_uring IMHO that actually makes more sense even for
>> non-hybrid polling.
> 
> The existing block layer polling semantics doesn't poll for a specific
> request. Please see the blk_mq_ops driver API for the 'poll' function.
> It takes a hardware context, which does not indicate a specific request.
> See also the blk_poll() function, which doesn't consider any specific
> request in order to break out of the polling loop.

Yeah, thanks for pointing that out. It's just that the users do it that
way -- block layer dio, and somewhat true for io_uring as well -- and the
hybrid part is per-request based (and sleeps once per request), which is
what stands out. If we were to go with completion-to-completion it would
have to be changed. And let's not forget that submission-to-completion is
sometimes more desirable.

-- 
Pavel Begunkov


* Re: hybrid polling on an nvme doesn't seem to work with iodepth > 1 on 5.10.0-rc5
  2020-12-14 19:01                   ` Pavel Begunkov
@ 2020-12-16 22:22                     ` Keith Busch
  0 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2020-12-16 22:22 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Andres Freund, linux-block, Jens Axboe

On Mon, Dec 14, 2020 at 07:01:31PM +0000, Pavel Begunkov wrote:
> On 14/12/2020 18:23, Keith Busch wrote:
> > The existing block layer polling semantics doesn't poll for a specific
> > request. Please see the blk_mq_ops driver API for the 'poll' function.
> > It takes a hardware context, which does not indicate a specific request.
> > See also the blk_poll() function, which doesn't consider any specific
> > request in order to break out of the polling loop.
> 
> Yeah, thanks for pointing out, it's just the users do it that way --
> block layer dio and somewhat true for io_uring, and also hybrid part is
> per request based (and sleeps once per request), that stands out.
> If would go with coml-to-compl it should be changed. And not to forget
> that subm-to-compl sometimes is more desirable.

Right, so coming full circle to my initial reply: the block polling
thread may be responsible for multiple requests when it wakes up, yet
the hybrid sleep timer considers only one; therefore, the sleep criterion
is not always accurate and ends up worse than interrupt-driven completion
at high queue depth.

The current sleep calculation works fine for QD1, but I don't see a
clear way to calculate an accurate sleep time for higher q-depths within
a reasonable CPU cost. My only suggestion is just don't sleep at all as
long as the polling thread continues to reap completions on its first
poll.

