* Seeking help with NVMe arbitration questions
From: Wang Yicheng @ 2023-04-26  0:26 UTC
  To: linux-nvme

Hi experts,

I'm trying to evaluate how NVMe arbitration impacts overall IO
performance, but I've had trouble finding the right material, so I'm
reaching out for your expertise. Could you please comment on the
following questions? I'm a newbie to this area, so some of the
questions might be beyond this mailing list's scope, but any help is
much appreciated!

1. To be more specific, I'm trying to enable WRR to accelerate a
specific set of IOs, so first I'm trying to figure out in which layer
the arbitration works. From my understanding, the NVMe block layer
today adopts a multi-queue design to leverage the high density of CPU
cores, but I'm confused about whether WRR works on the software queues
or the hardware queues. I suppose it's among the hardware queues;
otherwise it would introduce synchronization problems across the
software queues, which seems to go against the design intention of
parallelizing IO submission. Am I getting this right?

2. I've also learned that the submission queues can be further
classified as default/read/poll. I ran some experiments issuing
different IOs to different queues (intense large writes -> default
queues, sparse small writes -> poll queues), aiming to prioritize
small writes over large writes. However, the performance didn't vary
no matter how I distributed the queues. Is it because the node has far
more submission queues (64 cores and up to 128 queues) than IO jobs
(8 jobs for large writes and 1 job for small writes), so that having
separate queues for the small IOs doesn't help? In other words, does
that mean the queue type has nothing to do with prioritization (or
arbitration)?

3. I came across this mail thread: https://lwn.net/Articles/810726/,
where Weiping was trying to add WRR support in the kernel. But it
seems the patch was eventually dropped. Does it mean today the
linux-nvme kernel doesn't have WRR support?

4. If the kernel indeed has no WRR support, what does that mean for
the application layer? Does arbitration then become purely a
device-level matter?

Many thanks in advance!

Best,
Yicheng



* Re: Seeking help with NVMe arbitration questions
From: Chaitanya Kulkarni @ 2023-04-26  5:17 UTC
  To: Wang Yicheng, linux-nvme


> 3. I came across this mail thread: https://lwn.net/Articles/810726/,
> where Weiping was trying to add WRR support in the kernel. But it
> seems the patch was eventually dropped. Does it mean today the
> linux-nvme kernel doesn't have WRR support?
>
>

no WRR:-

https://www.spinics.net/lists/linux-block/msg51575.html

-ck




* Re: Seeking help with NVMe arbitration questions
From: Wang Yicheng @ 2023-04-27 19:03 UTC
  To: Chaitanya Kulkarni, linux-nvme

Thanks Chaitanya for confirming!

Given that the kernel module doesn't support configuring device-level
arbitration, I'm now trying to leverage the I/O queues to achieve the
same goal. If you could kindly help with the following questions, it
would be much appreciated!

1. Is there a way to query the type of each I/O queue and which CPU it
resides on?

2. Is there a way to control which queue a submitted I/O goes into?

3. Say I have only 1 default queue and I submit an I/O from some CPU,
then there can be a chance that the I/O would need to cross CPUs, if
the default queue happens not to be on the same core right?

Best,
Yicheng

On Tue, Apr 25, 2023 at 10:17 PM Chaitanya Kulkarni
<chaitanyak@nvidia.com> wrote:
>
>
> > 3. I came across this mail thread: https://lwn.net/Articles/810726/,
> > where Weiping was trying to add WRR support in the kernel. But it
> > seems the patch was eventually dropped. Does it mean today the
> > linux-nvme kernel doesn't have WRR support?
> >
> >
>
> no WRR:-
>
> https://www.spinics.net/lists/linux-block/msg51575.html
>
> -ck
>
>



* Re: Seeking help with NVMe arbitration questions
From: Keith Busch @ 2023-04-27 20:45 UTC
  To: Wang Yicheng; +Cc: Chaitanya Kulkarni, linux-nvme

On Thu, Apr 27, 2023 at 12:03:05PM -0700, Wang Yicheng wrote:
> Thanks Chaitanya for confirming!
> 
> Given that the kernel module doesn't support configuring device-level
> arbitration,

You can't configure the arbitration method. The only thing you can
change is the arbitration burst size.
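
With nvme-cli that can be poked via the Arbitration feature (fid
0x01); a sketch, and values are device dependent:

  # show the current arbitration settings
  nvme get-feature /dev/nvme0 -f 0x01 -H
  # arbitration burst is bits 2:0 (power of two); 0x3 = 8 commands
  nvme set-feature /dev/nvme0 -f 0x01 -v 0x3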

> I'm now trying to leverage the I/O queues to achieve the same
> goal. If you could kindly help with the following questions, it
> would be much appreciated!
> 
> 1. Is there a way to query the type of each I/O queue and which CPU it
> resides on?

The type of I/O queue selected depends on what you submit. If you
have read queues configured, then reads go on the read queues. If
you have poll queues, hipri io goes on those. Everything else goes
on default.
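
Note the read and poll queue sets only exist if you asked the driver
for them up front; for nvme-pci that's module parameters. A sketch
(the counts are illustrative):

  # reserve 2 queues for polled IO; of the interrupt-driven queues,
  # dedicate 8 to the default (write) set and the rest to reads
  modprobe nvme poll_queues=2 write_queues=8
  # or on the kernel command line:
  #   nvme.poll_queues=2 nvme.write_queues=8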

> 2. Is there a way to control which queue a submitted I/O goes into?

The queues are assigned to specific CPUs. If you want a specific
queue to handle your command, then submit your request from one of
the CPUs that map to the queue.

If you want to know which queues map to which CPUs, consult sysfs,
example: /sys/block/nvme0n1/mq/.

That will show each queue as a unique number (not necessarily aligned
to the nvme sq/cq IDs), and each number will have a cpu_list that
tells you which CPUs can dispatch to that queue. sysfs doesn't tell
you the type (default/read/poll), but all queues are accounted for,
and each type's queues will cover all CPUs.
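
For example (device name assumed; the debugfs part needs blk-mq
debugfs enabled in your kernel config, and root):

  # which CPUs dispatch to each hardware context
  grep . /sys/block/nvme0n1/mq/*/cpu_list
  # blk-mq debugfs, unlike sysfs, does expose each hctx's type
  # (default/read/poll)
  grep . /sys/kernel/debug/block/nvme0n1/hctx*/type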
 
> 3. Say I have only 1 default queue and I submit an I/O from some CPU,
> then there can be a chance that the I/O would need to cross CPUs, if
> the default queue happens not to be on the same core right?

If you only have one queue of a particular type, then the sysfs mq
directory for that queue should show cpu_list having all CPUs set,
so no CPU crossing is necessary for dispatch. In fact, for any queue
count and CPU topology, dispatch should never need to reschedule to
another core (that was the point of the design). Completions, on the
other hand, are typically affinitized to a single specific CPU, so
the completion may happen on a different core than your submission in
this scenario.



* Re: Seeking help with NVMe arbitration questions
From: Wang Yicheng @ 2023-04-27 21:21 UTC
  To: Keith Busch; +Cc: Chaitanya Kulkarni, linux-nvme

Thanks a lot for the prompt and detailed reply!

Before asking for further clarification of your answers (thanks for
being patient with me :) ), I'd like to sort out one fundamental
question first. When we talk about configuring the queues
(default/read/poll), are those the software queues or the hardware
queues in the blk-mq model (figure 5 in this paper:
https://kernel.dk/blk-mq.pdf)?

I had always understood the configuration procedure as distributing
that number of queues among the CPUs. But as you pointed out, if I
only distribute 1 queue as default, then its cpu_list will contain all
CPUs. That gives me the impression that the queue types really apply
to the hardware queues. We don't need to set up the software queues
ourselves, as they're automatically affinitized to the underlying
hardware queues, ensuring that submitted IOs, whatever the queue type,
never need to transfer across CPUs. In other words, if I allocate all
3 types of queues, there will be at least 3 software queues on each
CPU, one for each queue type.

Best,
Yicheng



* Re: Seeking help with NVMe arbitration questions
From: Keith Busch @ 2023-04-27 21:41 UTC
  To: Wang Yicheng; +Cc: Chaitanya Kulkarni, linux-nvme

On Thu, Apr 27, 2023 at 02:21:21PM -0700, Wang Yicheng wrote:
> Thanks a lot for the prompt and detailed reply!
> 
> Before asking for further clarification of your answers (thanks for
> being patient with me :) ), I'd like to sort out one fundamental
> question first. When we talk about configuring the queues
> (default/read/poll), are those the software queues or the hardware
> queues in the blk-mq model (figure 5 in this paper:
> https://kernel.dk/blk-mq.pdf)?

We're talking about hardware queues.
 
> I had always understood the configuration procedure as distributing
> that number of queues among the CPUs. But as you pointed out, if I
> only distribute 1 queue as default, then its cpu_list will contain all
> CPUs. That gives me the impression that the queue types really apply
> to the hardware queues. We don't need to set up the software queues
> ourselves, as they're automatically affinitized to the underlying
> hardware queues, ensuring that submitted IOs, whatever the queue type,
> never need to transfer across CPUs. In other words, if I allocate all
> 3 types of queues, there will be at least 3 software queues on each
> CPU, one for each queue type.

Each CPU has one software context, and each of those has 3 possible
hardware queues. Any given software context may be sharing one or
more of its hardware queues with another software context.

The set of hardware queues of any given type is fully affinitized
to every CPU you can run on.



* Re: Seeking help with NVMe arbitration questions
From: Wang Yicheng @ 2023-04-28  0:36 UTC
  To: Keith Busch; +Cc: Chaitanya Kulkarni, linux-nvme

Thanks a lot Keith! This is very helpful!

1. Then do you see a way to prioritize a specific set of IOs (favor
small writes over large writes) from the IO queue's perspective?
Initially I was thinking of WRR, which later turned out to be not
supported. If I want to leverage the IO queues to achieve the same
goal, from what I understand I can simply send small writes to poll
queues, and allocate more of those queues. Say on average small writes
take up 20% of the total IOs. And if I distribute 40% of total queues
as poll queues, in some sense I give more weight to small writes and
thus prioritize them.

> > 3. Say I have only 1 default queue and I submit an I/O from some CPU,
> > then there can be a chance that the I/O would need to cross CPUs, if
> > the default queue happens not to be on the same core right?
>
> If you only have one queue of a particular type, then the sysfs mq
> directory for that queue should show cpu_list having all CPUs set,
> so no CPU crossing is necessary for dispatch. In fact, for any queue
> count and CPU topology, dispatch should never need to reschedule to
> another core (that was the point of the design). Completions, on the
> other hand, are typically affinitized to a single specific CPU, so
> the completion may happen on a different core than your submission in
> this scenario.

2. You mentioned that completions are affinitized to a single specific
CPU. And this is exactly what I observed in my test. This also seems
to cause worse performance. Is there a way to query that affinity or
is it invisible from outside?

Best,
Yicheng



* Re: Seeking help with NVMe arbitration questions
From: Keith Busch @ 2023-04-28 15:43 UTC
  To: Wang Yicheng; +Cc: Chaitanya Kulkarni, linux-nvme

On Thu, Apr 27, 2023 at 05:36:12PM -0700, Wang Yicheng wrote:
> Thanks a lot Keith! This is very helpful!
> 
> 1. Then do you see a way to prioritize a specific set of IOs (favor
> small writes over large writes) from the IO queue's perspective?
> Initially I was thinking of WRR, which later turned out to be not
> supported. If I want to leverage the IO queues to achieve the same
> goal, from what I understand I can simply send small writes to poll
> queues, and allocate more of those queues. Say on average small writes
> take up 20% of the total IOs. And if I distribute 40% of total queues
> as poll queues, in some sense I give more weight to small writes and
> thus prioritize them.

You might expect that a new command placed on a shallow queue will
be handled ahead of a command placed on a deep queue at the same
time. Indeed, some implementations may even show desirable results
with that scheme, but the spec doesn't really guarantee it.

For a pure software side solution, you could use an ioscheduler and
set your ioprio accordingly.
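
A sketch of that route (bfq honors ioprio, if it's built into your
kernel; "small-write-job" is a stand-in for whatever you favor):

  # attach a priority-aware IO scheduler to the namespace
  echo bfq > /sys/block/nvme0n1/queue/scheduler
  # run the favored workload at a higher best-effort priority
  ionice -c 2 -n 0 small-write-job
  # the fio equivalents are the prioclass= and prio= job options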
 
> > > 3. Say I have only 1 default queue and I submit an I/O from some CPU,
> > > then there can be a chance that the I/O would need to cross CPUs, if
> > > the default queue happens not to be on the same core right?
> >
> > If you only have one queue of a particular type, then the sysfs mq
> > directory for that queue should show cpu_list having all CPUs set,
> > so no CPU crossing is necessary for dispatch. In fact, for any queue
> > count and CPU topology, dispatch should never need to reschedule to
> > another core (that was the point of the design). Completions, on the
> > other hand, are typically affinitized to a single specific CPU, so
> > the completion may happen on a different core than your submission in
> > this scenario.
> 
> 2. You mentioned that completions are affinitized to a single specific
> CPU. And this is exactly what I observed in my test. This also seems
> to cause worse performance. Is there a way to query that affinity or
> is it invisible from outside?

To query a queue's affinity, check /proc/irq/<#>/effective_affinity.
You can check /proc/interrupts to determine which irq# goes with
which queue.
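
For example (the irq number is whatever /proc/interrupts reports for
the queue you care about):

  # find the irq numbers of the nvme0 IO queues
  grep nvme0q /proc/interrupts
  # see which CPU that queue's completions are steered to
  cat /proc/irq/<#>/effective_affinity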



* Re: Seeking help with NVMe arbitration questions
From: Wang Yicheng @ 2023-04-28 18:18 UTC
  To: Keith Busch; +Cc: Chaitanya Kulkarni, linux-nvme

> You might expect that a new command placed on a shallow queue will
> be handled ahead of a command placed on a deep queue at the same
> time. Indeed, some implementations may even show desirable results
> with that scheme, but the spec doesn't really guarantee it.
Thanks Keith, that makes total sense... It also aligns with what I
observed in my experiments (please see below). I was hoping task2
would benefit from a larger queue share, but it didn't.
1. With the default queue set-up, I ran 2 identical randwrite FIO
tasks (3 identical jobs each) against the same set of CPUs. The 2
tasks got the same performance.
2. With a 1/64/2 (default/read/poll) queue set-up and the same FIO
tasks, task1 went to the default queues and task2 went to the poll
queues. The result was very close to test 1: task2 did run faster,
but the difference was trivial (and seemed to come from IO polling
rather than from the queue distribution).

This leaves me confused about the motivation for introducing the
different queue types, though. Aren't they meant to provide some sort
of prioritization?

Best,
Yicheng

On Fri, Apr 28, 2023 at 8:43 AM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, Apr 27, 2023 at 05:36:12PM -0700, Wang Yicheng wrote:
> > Thanks a lot Keith! This is very helpful!
> >
> > 1. Then do you see a way to prioritize a specific set of IOs (favor
> > small writes over large writes) from the IO queue's perspective?
> > Initially I was thinking of WRR, which later turned out to be not
> > supported. If I want to leverage the IO queues to achieve the same
> > goal, from what I understand I can simply send small writes to poll
> > queues, and allocate more of those queues. Say on average small writes
> > take up 20% of the total IOs. And if I distribute 40% of total queues
> > as poll queues, in some sense I give more weight to small writes and
> > thus prioritize them.
>
> You might expect that a new command placed on a shallow queue will
> be handled ahead of a command placed on a deep queue at the same
> time. Indeed, some implementations may even show desirable results
> with that scheme, but the spec doesn't really guarantee it.
>
> For a pure software side solution, you could use an ioscheduler and
> set your ioprio accordingly.
>
> > > > 3. Say I have only 1 default queue and I submit an I/O from some CPU,
> > > > then there can be a chance that the I/O would need to cross CPUs, if
> > > > the default queue happens not to be on the same core right?
> > >
> > > If you only have one queue of a particular type, then the sysfs mq
> > > directory for that queue should show cpu_list having all CPUs set,
> > > so no CPU crossing is necessary for dispatch. In fact, for any queue
> > > count and CPU topology, dispatch should never need to reschedule to
> > > another core (that was the point of the design). Completions, on the
> > > other hand, are typically affinitized to a single specific CPU, so
> > > the completion may happen on a different core than your submission in
> > > this scenario.
> >
> > 2. You mentioned that completions are affinitized to a single specific
> > CPU. And this is exactly what I observed in my test. This also seems
> > to cause worse performance. Is there a way to query that affinity or
> > is it invisible from outside?
>
> To query a queue's affinity, check /proc/irq/<#>/effective_affinity.
> You can check /proc/interrupts to determine which irq# goes with
> which queue.



* Re: Seeking help with NVMe arbitration questions
From: Keith Busch @ 2023-04-28 20:06 UTC
  To: Wang Yicheng; +Cc: Chaitanya Kulkarni, linux-nvme

On Fri, Apr 28, 2023 at 11:18:03AM -0700, Wang Yicheng wrote:
> 
> This leaves me confused about the motivation for introducing the
> different queue types, though. Aren't they meant to provide some sort
> of prioritization?

Having a separate read queue ensures that reads won't get starved for
a command resource by a write-intensive workload. AFAIK, it's not a
very common option to enable.

The poll queues are intended for latency-sensitive applications. I
don't think it will be as reliable if you are running concurrently
with interrupt-driven workloads: the interrupts will just preempt
the polling threads.
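
If you want to see what polling buys you in isolation, something like
this fio sketch exercises the poll path (assuming you allocated poll
queues; io_uring is one engine that honors hipri):

  fio --name=polled --filename=/dev/nvme0n1 --ioengine=io_uring \
      --hipri=1 --direct=1 --rw=randread --bs=4k --iodepth=32 \
      --runtime=30 --time_based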



* Re: Seeking help with NVMe arbitration questions
From: Wang Yicheng @ 2023-05-05  0:27 UTC
  To: Keith Busch; +Cc: Chaitanya Kulkarni, linux-nvme

Understood, thanks Keith!

Given that the IO queue distribution is not intended for IO
prioritization, I pivoted my focus to how enabling IO polling can help
performance. I ran a very simple single-process FIO job four times
with the following set-ups:
1. W/o poll queues, "hipri" was set to 0
2. W/ poll queues, "hipri" was set to 0
3. W/o poll queues, "hipri" was set to 1
4. W/ poll queues, "hipri" was set to 1
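
For reference, each run was roughly of this shape (a sketch rather
than my exact job file):

  [global]
  filename=/dev/nvme0n1
  direct=1
  ; pvsync2 is one of the engines that honors hipri (RWF_HIPRI)
  ioengine=pvsync2
  rw=randwrite
  bs=4k
  ; flipped between 0 and 1 across the four set-ups
  hipri=0

  [job1]
  iodepth=1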

The throughput ranking was 4 > 3 > 1 = 2. Case 4 having the best
performance is expected, but what I don't understand is why case 3
outperformed cases 1 and 2. I thought IO polling would only be enabled
when there are poll queues. Could you please comment on this result as
well?

Best,
Yicheng

On Fri, Apr 28, 2023 at 1:06 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, Apr 28, 2023 at 11:18:03AM -0700, Wang Yicheng wrote:
> >
> > This leaves me confused about the motivation for introducing the
> > different queue types, though. Aren't they meant to provide some sort
> > of prioritization?
>
> Having a separate read queue ensures that reads won't get starved for
> a command resource by a write-intensive workload. AFAIK, it's not a
> very common option to enable.
>
> The poll queues are intended for latency-sensitive applications. I
> don't think it will be as reliable if you are running concurrently
> with interrupt-driven workloads: the interrupts will just preempt
> the polling threads.


