* [LSF/MM TOPIC] multiqueue and interrupt assignment
@ 2016-02-02 16:31 Hannes Reinecke
  2016-02-02 18:23 ` Bart Van Assche
  2016-02-02 18:45 ` Elliott, Robert (Persistent Memory)
  0 siblings, 2 replies; 8+ messages in thread
From: Hannes Reinecke @ 2016-02-02 16:31 UTC (permalink / raw)
  To: lsf-pc, linux-scsi@vger.kernel.org, linux-block

Hi all,

here's another topic which I've hit during my performance tests:
How should interrupt affinity be handled with blk-multiqueue?

The problem is that blk-multiqueue assumes a certain
CPU-to-queue mapping, _and_ the 'queue' in blk-mq syntax is actually
a submission/completion queue pair.
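
Roughly, the default assignment is a round-robin of CPUs onto the
hardware contexts. A simplified sketch, not the actual kernel
routine (the real code in block/blk-mq-cpumap.c also groups sibling
threads, but the effect is similar):

/*
 * Simplified sketch of blk-mq's default CPU-to-queue assignment;
 * illustrative only.
 */
static void default_mq_map_sketch(unsigned int *mq_map,
                                  unsigned int nr_cpus,
                                  unsigned int nr_hw_queues)
{
        unsigned int cpu;

        for (cpu = 0; cpu < nr_cpus; cpu++)
                mq_map[cpu] = cpu % nr_hw_queues;
}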

To achieve optimal performance one should set the interrupt affinity
for a given (hardware) queue to the matching (blk-mq) queue.
But typically the interrupt affinity has to be set during HBA setup,
i.e. way before any queues are allocated.
Which means we have three choices:
- Outguess the blk-mq algorithm in the driver and set the
  interrupt affinity during HBA setup
- Add some callbacks to coordinate interrupt affinity between
  driver and blk-mq
- Defer it to manual assignment, incurring the risk of
  suboptimal performance.

At LSF/MM I would like to have a discussion on how the interrupt
affinity should be handled for blk-mq, and whether a generic method
is possible or desirable.
Also there is the issue of certain drivers (e.g. lpfc) which normally
do interrupt affinity themselves, but disable it for multiqueue,
which results in abysmal performance when comparing single queue
against multiqueue :-(

As a side note, what does blk-mq do if the interrupt affinity is
_deliberately_ set wrong, i.e. if the completion for a command
arrives on completely the wrong queue? Discard the completion? Move
it to the correct queue?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [LSF/MM TOPIC] multiqueue and interrupt assignment
  2016-02-02 16:31 [LSF/MM TOPIC] multiqueue and interrupt assignment Hannes Reinecke
@ 2016-02-02 18:23 ` Bart Van Assche
  2016-02-03 12:57   ` Sagi Grimberg
  2016-02-02 18:45 ` Elliott, Robert (Persistent Memory)
  1 sibling, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2016-02-02 18:23 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc, linux-scsi@vger.kernel.org, linux-block

On 02/02/2016 08:31 AM, Hannes Reinecke wrote:
> here's another topic which I've hit during my performance tests:
> How should interrupt affinity be handled with blk-multiqueue?
>
> The problem is that blk-multiqueue assumes a certain
> CPU-to-queue mapping, _and_ the 'queue' in blk-mq syntax is actually
> a submission/completion queue pair.
>
> To achieve optimal performance one should set the interrupt affinity
> for a given (hardware) queue to the matching (blk-mq) queue.
> But typically the interrupt affinity has to be set during HBA setup,
> i.e. way before any queues are allocated.
> Which means we have three choices:
> - Outguess the blk-mq algorithm in the driver and set the
>    interrupt affinity during HBA setup
> - Add some callbacks to coordinate interrupt affinity between
>    driver and blk-mq
> - Defer it to manual assignment, incurring the risk of
>    suboptimal performance.
>
> At LSF/MM I would like to have a discussion on how the interrupt
> affinity should be handled for blk-mq, and whether a generic method
> is possible or desirable.
> Also there is the issue of certain drivers (e.g. lpfc) which normally
> do interrupt affinity themselves, but disable it for multiqueue,
> which results in abysmal performance when comparing single queue
> against multiqueue :-(
>
> As a side note, what does blk-mq do if the interrupt affinity is
> _deliberately_ set wrong, i.e. if the completion for a command
> arrives on completely the wrong queue? Discard the completion? Move
> it to the correct queue?

Hello Hannes,

This topic indeed needs further attention. I also encountered this
challenge while adding scsi-mq support to the SRP initiator driver. What
I learned while working on the SRP driver is the following:
- Although I agree that requests and interrupts should be processed on
   the same processor (same physical chip) if the request has been
   submitted from the CPU closest to the HBA, I'm not convinced that
   processing request completions and interrupts on the same CPU core
   yields the best performance. I would appreciate it if some freedom
   remained in how interrupts are assigned to CPU cores.
- In several older NUMA systems (Nehalem) the distance from processor
   to PCI adapter is the same for all processors. However, in current
   NUMA systems (Sandy Bridge and later) access latency to a given
   PCI adapter is typically optimal from only one processor. The
   question then becomes which code should take the QPI latency
   penalty: the interrupt handler or the blk-mq request completion
   processing code?
- All HBAs I know of support reassignment of an interrupt to another
   CPU core through /proc/irq/<n>/smp_affinity, so I was surprised to
   read that you encountered an HBA for which CPU affinity has to be
   set at driver load time.
- For HBAs that support multiple MSI-X vectors we need an approach for
   associating blk-mq hw-queues with MSI-X vectors. The approach
   implemented in the ib_srp driver is that the driver assumes that
   MSI-X vectors have been spread evenly over physical processors. The
   ib_srp driver then selects an MSI-X vector per hwqueue based on that
   assumption. Since neither the kernel nor irqbalance currently support
   this approach, I wrote a script to implement it (see also
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/21312/focus=98409).
   A sketch of such spreading follows this list.
- We need support in irqbalance for HBAs that support multiple MSI-X
   vectors. Last time I checked, irqbalance did not support this
   concept, which means it could even happen that irqbalance assigned
   multiple of these interrupt vectors to the same CPU core, something
   that doesn't make sense to me.
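
To make the "spread evenly" idea concrete, here is a minimal
user-space sketch in the spirit of that script (the irqs[] list and
CPU count are placeholders for whatever the real script discovers;
error handling omitted, and the mask math assumes at most 32 CPUs
for brevity):

/*
 * Spread a device's MSI-X vectors round-robin over CPU cores by
 * writing hex CPU masks to /proc/irq/<n>/smp_affinity.
 */
#include <stdio.h>

static void spread_msix(const int *irqs, int nr_irqs, int nr_cpus)
{
        char path[64];
        int i;

        for (i = 0; i < nr_irqs; i++) {
                FILE *f;

                snprintf(path, sizeof(path),
                         "/proc/irq/%d/smp_affinity", irqs[i]);
                f = fopen(path, "w");
                if (!f)
                        continue;
                /* smp_affinity takes a hexadecimal CPU mask */
                fprintf(f, "%x\n", 1U << (i % nr_cpus));
                fclose(f);
        }
}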

A previous discussion about this topic is available in the following
e-mail thread: Christoph Hellwig, [TECH TOPIC] IRQ affinity, linux-rdma
and linux-kernel mailing lists, July 2015
(http://thread.gmane.org/gmane.linux.drivers.rdma/27418). I would
appreciate it if Matthew Wilcox's proposal in that thread could be
discussed further during LSF/MM.

Thanks,

Bart.


* RE: [LSF/MM TOPIC] multiqueue and interrupt assignment
  2016-02-02 16:31 [LSF/MM TOPIC] multiqueue and interrupt assignment Hannes Reinecke
  2016-02-02 18:23 ` Bart Van Assche
@ 2016-02-02 18:45 ` Elliott, Robert (Persistent Memory)
  1 sibling, 0 replies; 8+ messages in thread
From: Elliott, Robert (Persistent Memory) @ 2016-02-02 18:45 UTC (permalink / raw)
  To: Hannes Reinecke, lsf-pc, linux-scsi@vger.kernel.org, linux-block


> -----Original Message-----
> From: linux-block-owner@vger.kernel.org [mailto:linux-block-
> owner@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Tuesday, February 2, 2016 10:31 AM
> To: lsf-pc@lists.linux-foundation.org; linux-scsi@vger.kernel.org; linux-
> block@vger.kernel.org
> Subject: [LSF/MM TOPIC] multiqueue and interrupt assignment
> 
...
> As a side note, what does blk-mq do if the interrupt affinity is
> _deliberately_ set wrong, i.e. if the completion for a command
> arrives on completely the wrong queue? Discard the completion? Move
> it to the correct queue?

It sends an interprocessor interrupt (IPI) to the designated
processor and processes the completion there.

CPUs overloaded by this kind of work can lead to an unusable
system - see http://marc.info/?l=linux-kernel&m=141030836723181&w=2
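
In outline, the routing logic looks like this (a condensed sketch of
the logic in block/blk-mq.c, not the verbatim source):

/*
 * If the completion arrives on a CPU other than the submitting one
 * (and the two don't share a cache), an IPI pushes the completion
 * work back to the submitting CPU; otherwise it runs locally.
 */
static void blk_mq_complete_sketch(struct request *rq)
{
        int ctx_cpu = rq->mq_ctx->cpu;
        int cpu = get_cpu();

        if (cpu != ctx_cpu && !cpus_share_cache(cpu, ctx_cpu) &&
            cpu_online(ctx_cpu)) {
                rq->csd.func = __blk_mq_complete_request_remote;
                rq->csd.info = rq;
                rq->csd.flags = 0;
                smp_call_function_single_async(ctx_cpu, &rq->csd);
        } else {
                rq->q->softirq_done_fn(rq);
        }
        put_cpu();
}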

---
Robert Elliott, HPE Persistent Memory



* Re: [LSF/MM TOPIC] multiqueue and interrupt assignment
  2016-02-02 18:23 ` Bart Van Assche
@ 2016-02-03 12:57   ` Sagi Grimberg
  2016-02-03 13:13     ` Hannes Reinecke
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2016-02-03 12:57 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, lsf-pc,
	linux-scsi@vger.kernel.org, linux-block

Hi Bart and Hannes,

> This topic indeed needs further attention. I also encountered this
> challenge while adding scsi-mq support to the SRP initiator driver. What
> I learned while working on the SRP driver is the following:
> - Although I agree that requests and interrupts should be processed on
>    the same processor (same physical chip) if the request has been
>    submitted from the CPU closest to the HBA, I'm not convinced that
>    processing request completions and interrupts on the same CPU core
>    yields the best performance. I would appreciate it if some freedom
>    remained in how interrupts are assigned to CPU cores.

This is true, and not only for this reason. Some block storage
transports (e.g. srp/iser) share the HBA with the networking stack and
possibly with user-space workloads in the case of RDMA. This is why I
don't see how MSI-X assignments can be done anywhere other than
user-space.

However, what I think we can do is have blk-mq ask the drivers for
information about the MSI-X mappings. This concept was introduced in
2011 by Ben Hutchings with the CPU affinity reverse-mapping API [1].

Perhaps we'd want to have drivers provide blk-mq a struct cpu_rmap when
assigning the hctx mappings, or possibly per I/O if we want to be
agnostic to MSI-X topology changes. I think this approach would solve
what Hannes is experiencing.
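
For concreteness, a rough sketch of the driver side (the cpu_rmap
calls are the existing lib/cpu_rmap API; how the map would be handed
to blk-mq is left open, and everything named example_* is made up):

#include <linux/cpu_rmap.h>

/*
 * Build a reverse map from CPUs to MSI-X vectors.  blk-mq could then
 * use cpu_rmap_lookup_index() to find the hw queue whose vector is
 * nearest to a given CPU under the current irq affinities.
 */
static struct cpu_rmap *example_build_rmap(const int *msix_vecs,
                                           unsigned int nr_vecs)
{
        struct cpu_rmap *rmap;
        unsigned int i;

        rmap = alloc_irq_cpu_rmap(nr_vecs);
        if (!rmap)
                return NULL;

        for (i = 0; i < nr_vecs; i++)
                irq_cpu_rmap_add(rmap, msix_vecs[i]);

        return rmap;
}

static unsigned int example_cpu_to_hctx(struct cpu_rmap *rmap,
                                        unsigned int cpu)
{
        return cpu_rmap_lookup_index(rmap, cpu);
}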

Thoughts?

[1]:
commit c39649c331c70952700f99832b03f87e9d7f5b4b
Author: Ben Hutchings <bhutchings@solarflare.com>
Date:   Wed Jan 19 11:03:25 2011 +0000

     lib: cpu_rmap: CPU affinity reverse-mapping

     When initiating I/O on a multiqueue and multi-IRQ device, we may want
     to select a queue for which the response will be handled on the same
     or a nearby CPU.  This requires a reverse-map of IRQ affinity.  Add
     library functions to support a generic reverse-mapping from CPUs to
     objects with affinity and the specific case where the objects are
     IRQs.

     Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
     Signed-off-by: David S. Miller <davem@davemloft.net>


* Re: [LSF/MM TOPIC] multiqueue and interrupt assignment
  2016-02-03 12:57   ` Sagi Grimberg
@ 2016-02-03 13:13     ` Hannes Reinecke
  2016-02-03 13:32       ` Sagi Grimberg
  0 siblings, 1 reply; 8+ messages in thread
From: Hannes Reinecke @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Sagi Grimberg, Bart Van Assche, lsf-pc,
	linux-scsi@vger.kernel.org, linux-block

On 02/03/2016 01:57 PM, Sagi Grimberg wrote:
> Hi Bart and Hannes,
> 
>> This topic indeed needs further attention. I also encountered this
>> challenge while adding scsi-mq support to the SRP initiator
>> driver. What
>> I learned while working on the SRP driver is the following:
>> - Although I agree that requests and interrupts should be
>> processed on
>>    the same processor (same physical chip) if the request has been
>>    submitted from the CPU closest to the HBA, I'm not convinced that
>>    processing request completions and interrupts on the same CPU core
>>    yields the best performance. I would appreciate it if some freedom
>>    remained in how interrupts are assigned to CPU cores.
> 
> This is true, and not only for this reason. Some block storage
> transports (e.g. srp/iser) share the HBA with the networking stack and
> possibly with user-space workloads in the case of RDMA. This is why I
> don't see how MSI-X assignments can be done anywhere other than
> user-space.
> 
Oh, I don't doubt that we should allow the MSI-X assignments to be
made from user-space.
But ATM blk-mq assigns a fixed CPU <-> hardware queue mapping,
and all we can do is fix things up afterwards,
irrespective of whether it's 'best' for any given hardware.

> However, what I think we can do is have blk-mq ask the drivers for
> information about the MSI-X mappings. This concept was introduced in
> 2011 by Ben Hutchings with the CPU affinity reverse-mapping API [1].
> 
> Perhaps we'd want to have drivers provide blk-mq a struct cpu_rmap when
> assigning the hctx mappings, or possibly per I/O if we want to be
> agnostic to MSI-X topology changes. I think this approach would solve
> what Hannes is experiencing.
> 
> Thoughts?
> 
Indeed, something like this.
Quite a few issues would be solved if we could push an hctx mapping
into blk-mq, instead of having it assign its own made-up one.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [LSF/MM TOPIC] multiqueue and interrupt assignment
  2016-02-03 13:13     ` Hannes Reinecke
@ 2016-02-03 13:32       ` Sagi Grimberg
  2016-02-03 15:03         ` Hannes Reinecke
  0 siblings, 1 reply; 8+ messages in thread
From: Sagi Grimberg @ 2016-02-03 13:32 UTC (permalink / raw)
  To: Hannes Reinecke, Bart Van Assche, lsf-pc,
	linux-scsi@vger.kernel.org, linux-block


> Indeed, something like this.
> Quite a few issues would be solved if we could push an hctx mapping
> into blk-mq, instead of having it assign its own made-up one.

For that you can provide your own .map_queue in blk_mq_ops I think
(no one does that at the moment). This requires every driver to
implement its own routine (probably with similar logic) though...
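
A skeleton of what such a driver-private .map_queue could look like
(the .map_queue hook is the existing blk_mq_ops callback; struct
my_dev, my_queue_rq() and msix_cpu_to_queue() are placeholders for
driver-specific MSI-X bookkeeping):

/* Route each CPU to the hw queue whose MSI-X vector is affine to it. */
static struct blk_mq_hw_ctx *my_map_queue(struct request_queue *q,
                                          const int cpu)
{
        struct my_dev *dev = q->queuedata;

        return q->queue_hw_ctx[msix_cpu_to_queue(dev, cpu)];
}

static struct blk_mq_ops my_mq_ops = {
        .queue_rq       = my_queue_rq,
        .map_queue      = my_map_queue,
};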


* Re: [LSF/MM TOPIC] multiqueue and interrupt assignment
  2016-02-03 13:32       ` Sagi Grimberg
@ 2016-02-03 15:03         ` Hannes Reinecke
  2016-03-03  7:59           ` Ming Lei
  0 siblings, 1 reply; 8+ messages in thread
From: Hannes Reinecke @ 2016-02-03 15:03 UTC (permalink / raw)
  To: Sagi Grimberg, Bart Van Assche, lsf-pc,
	linux-scsi@vger.kernel.org, linux-block

On 02/03/2016 02:32 PM, Sagi Grimberg wrote:
> 
>> Indeed, something like this.
>> Quite a few issues would be solved if we could push an hctx mapping
>> into blk-mq, instead of having it assign its own made-up one.
> 
> For that you can provide your own .map_queue in blk_mq_ops I think
> (no one does that at the moment). This requires every driver to
> implement its own routine (probably with similar logic) though...

And at the same time direct interrupt assignment from the driver is
frowned upon ... it feels a bit stupid, having to set up a cpu-to-queue
assignment (which typically is identical to the cpu-to-msix
assignment), then pass this information to blk-mq, which then passes
it to user-space, which then uses the information to set up a
cpu-to-msix assignment.
There is room for improvement there ...

Are there any plans to address this in blk-mq?
What do NVMe and virtio do?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [LSF/MM TOPIC] multiqueue and interrupt assignment
  2016-02-03 15:03         ` Hannes Reinecke
@ 2016-03-03  7:59           ` Ming Lei
  0 siblings, 0 replies; 8+ messages in thread
From: Ming Lei @ 2016-03-03  7:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Bart Van Assche, lsf-pc,
	linux-scsi@vger.kernel.org, linux-block

On Wed, Feb 3, 2016 at 11:03 PM, Hannes Reinecke <hare@suse.de> wrote:
> On 02/03/2016 02:32 PM, Sagi Grimberg wrote:
>>
>>> Indeed, something like this.
>>> Quite a few issues would be solved if we could push an hctx mapping
>>> into blk-mq, instead of having it assign its own made-up one.
>>
>> For that you can provide your own .map_queue in blk_mq_ops I think
>> (no one does that at the moment). This requires every driver to
>> implement its own routine (probably with similar logic) though...
>
> And at the same time direct interrupt assignment from the driver is
> frowned upon ... it feels a bit stupid, having to set up a cpu-to-queue
> assignment (which typically is identical to the cpu-to-msix
> assignment), then pass this information to blk-mq, which then passes
> it to user-space, which then uses the information to set up a
> cpu-to-msix assignment.
> There is room for improvement there ...
>
> Are there any plans to address this in blk-mq?

Last year, I sent a patchset to address the issue [1], but
it wasn't good enough to be merged, and I am happy to discuss
the issue further.

[1] http://marc.info/?t=144349691100002&r=1&w=2

> What do NVMe and virtio do?

virtio just takes the default irq affinity setting, which means that
when the irq for one vq is set to route to a group of CPUs, it is
only handled by the 1st CPU in that group.

For NVMe, as I remember, the irq affinity setting is still
fixed after setting up the queue, and it would be better to
adjust it after the hw/sw queue mapping is changed.
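
One way a driver could at least publish its preference would be to
refresh the per-vector affinity hints whenever the hw/sw queue
mapping changes. A hypothetical sketch: irq_set_affinity_hint() is
the existing kernel API, while the vector/mask layout here is made up.

#include <linux/interrupt.h>

/* Re-publish one affinity hint per queue vector so that user-space
 * (e.g. irqbalance) can follow the current hw/sw queue mapping. */
static void example_refresh_hints(const int *msix_vecs,
                                  const struct cpumask *queue_masks,
                                  unsigned int nr_queues)
{
        unsigned int i;

        for (i = 0; i < nr_queues; i++)
                irq_set_affinity_hint(msix_vecs[i], &queue_masks[i]);
}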

Thanks,
Ming

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Teamlead Storage & Networking
> hare@suse.de                                   +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)



-- 
Ming Lei
