linux-rdma.vger.kernel.org archive mirror
* [TECH TOPIC] IRQ affinity
@ 2015-07-15 12:07 Christoph Hellwig
  2015-07-15 14:38 ` Christoph Lameter
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Christoph Hellwig @ 2015-07-15 12:07 UTC (permalink / raw)
  To: ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Many years ago we decided to move the setting of IRQ-to-core affinities
to userspace with the irqbalance daemon.

These days we have systems with lots of MSI-X vectors, and we have
hardware and subsystem support for per-CPU I/O queues in the block
layer, the RDMA subsystem and probably the network stack (I'm not too
familiar with the recent developments there).  It would really help the
out-of-the-box performance and experience if we could allow such
subsystems to bind interrupt vectors to the node that the queue is
configured on.

I'd like to discuss whether the rationale for moving the IRQ affinity
setting fully to userspace is still correct in today's world, and any
pitfalls we'll have to learn from in irqbalanced and the old in-kernel
affinity code.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
       [not found] ` <20150715120708.GA24534-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-07-15 12:12   ` Thomas Gleixner
  2015-07-15 15:41     ` Bart Van Assche
  2015-07-15 14:56   ` Marc Zyngier
  2015-07-15 16:05   ` Michael S. Tsirkin
  2 siblings, 1 reply; 15+ messages in thread
From: Thomas Gleixner @ 2015-07-15 12:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, 15 Jul 2015, Christoph Hellwig wrote:

> Many years ago we decided to move the setting of IRQ-to-core affinities
> to userspace with the irqbalance daemon.
> 
> These days we have systems with lots of MSI-X vectors, and we have
> hardware and subsystem support for per-CPU I/O queues in the block
> layer, the RDMA subsystem and probably the network stack (I'm not too
> familiar with the recent developments there).  It would really help the
> out-of-the-box performance and experience if we could allow such
> subsystems to bind interrupt vectors to the node that the queue is
> configured on.
> 
> I'd like to discuss whether the rationale for moving the IRQ affinity
> setting fully to userspace is still correct in today's world, and any
> pitfalls we'll have to learn from in irqbalanced and the old in-kernel
> affinity code.

I think setting an initial affinity is not going to recreate the horror
of the old in-kernel irq balancer. It can still be changed from user
space, and it does not try to be smart by moving interrupts around in
circles all the time.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [TECH TOPIC] IRQ affinity
  2015-07-15 12:07 [TECH TOPIC] IRQ affinity Christoph Hellwig
@ 2015-07-15 14:38 ` Christoph Lameter
       [not found] ` <20150715120708.GA24534-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-10-12 16:09 ` Theodore Ts'o
  2 siblings, 0 replies; 15+ messages in thread
From: Christoph Lameter @ 2015-07-15 14:38 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: ksummit-discuss, linux-rdma, linux-nvme, linux-kernel

On Wed, 15 Jul 2015, Christoph Hellwig wrote:

> Many years ago we decided to move the setting of IRQ-to-core affinities
> to userspace with the irqbalance daemon.
>
> These days we have systems with lots of MSI-X vectors, and we have
> hardware and subsystem support for per-CPU I/O queues in the block
> layer, the RDMA subsystem and probably the network stack (I'm not too
> familiar with the recent developments there).  It would really help the
> out-of-the-box performance and experience if we could allow such
> subsystems to bind interrupt vectors to the node that the queue is
> configured on.
>
> I'd like to discuss whether the rationale for moving the IRQ affinity
> setting fully to userspace is still correct in today's world, and any
> pitfalls we'll have to learn from in irqbalanced and the old in-kernel
> affinity code.

Configurations with processors that are trying to be OS-noise-free
(NOHZ) would also benefit if device interrupts were directed to
processors that are not in the NOHZ set. Currently we use scripts at
bootup that redirect interrupts away from these processors.
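
A minimal sketch of what such a bootup redirection boils down to, in
userspace C; the IRQ number and the housekeeping CPU list below are
placeholders, and a real script would derive them from /proc/interrupts
and /sys/devices/system/cpu/nohz_full:

#include <stdio.h>

/*
 * Sketch only: steer IRQ 42 onto the housekeeping CPUs 0-3 so that it
 * never fires on the (assumed) NOHZ set 4-7.  Both the IRQ number and
 * the CPU list are placeholders for this illustration.
 */
int main(void)
{
        FILE *f = fopen("/proc/irq/42/smp_affinity_list", "w");

        if (!f) {
                perror("smp_affinity_list");
                return 1;
        }
        fprintf(f, "0-3\n");    /* housekeeping CPUs only */
        fclose(f);
        return 0;
}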

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
       [not found] ` <20150715120708.GA24534-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-07-15 12:12   ` [Ksummit-discuss] " Thomas Gleixner
@ 2015-07-15 14:56   ` Marc Zyngier
  2015-07-15 16:05   ` Michael S. Tsirkin
  2 siblings, 0 replies; 15+ messages in thread
From: Marc Zyngier @ 2015-07-15 14:56 UTC (permalink / raw)
  To: Christoph Hellwig,
	ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On 15/07/15 13:07, Christoph Hellwig wrote:
> Many years ago we decided to move the setting of IRQ-to-core affinities
> to userspace with the irqbalance daemon.
> 
> These days we have systems with lots of MSI-X vectors, and we have
> hardware and subsystem support for per-CPU I/O queues in the block
> layer, the RDMA subsystem and probably the network stack (I'm not too
> familiar with the recent developments there).  It would really help the
> out-of-the-box performance and experience if we could allow such
> subsystems to bind interrupt vectors to the node that the queue is
> configured on.
> 
> I'd like to discuss whether the rationale for moving the IRQ affinity
> setting fully to userspace is still correct in today's world, and any
> pitfalls we'll have to learn from in irqbalanced and the old in-kernel
> affinity code.

I've been pondering some notion of "grouping", where some interrupts
are logically part of the same working set: it doesn't make much sense
to spread them among CPUs, and userspace doesn't really have a clue
about this.

A related problem is that some weird HW (things like cascaded interrupt
controllers) can only move a bunch of interrupts in one go, which isn't
really what userspace expects to see (move one interrupt, see another 31
moving).

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
  2015-07-15 12:12   ` [Ksummit-discuss] " Thomas Gleixner
@ 2015-07-15 15:41     ` Bart Van Assche
       [not found]       ` <55A67F11.1030709-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2015-07-15 15:41 UTC (permalink / raw)
  To: Thomas Gleixner, Christoph Hellwig
  Cc: ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jens Axboe

On 07/15/2015 05:12 AM, Thomas Gleixner wrote:
> On Wed, 15 Jul 2015, Christoph Hellwig wrote:
>> Many years ago we decided to move the setting of IRQ-to-core affinities
>> to userspace with the irqbalance daemon.
>>
>> These days we have systems with lots of MSI-X vectors, and we have
>> hardware and subsystem support for per-CPU I/O queues in the block
>> layer, the RDMA subsystem and probably the network stack (I'm not too
>> familiar with the recent developments there).  It would really help the
>> out-of-the-box performance and experience if we could allow such
>> subsystems to bind interrupt vectors to the node that the queue is
>> configured on.
>>
>> I'd like to discuss whether the rationale for moving the IRQ affinity
>> setting fully to userspace is still correct in today's world, and any
>> pitfalls we'll have to learn from in irqbalanced and the old in-kernel
>> affinity code.
>
> I think setting an initial affinity is not going to create the horror
> of the old in-kernel irq balancer again. It still could be changed
> from user space and does not try to be smart by moving interrupts
> around in circles all the time.

Thanks Thomas for your feedback. But no matter whether IRQ balancing
happens in user space or in the kernel, the following issues need to be
addressed and have not been addressed yet:
* irqbalanced is not aware of the relationship between MSI-X vectors.
   If e.g. two kernel drivers each allocate 24 MSI-X vectors for the
   PCIe interfaces they control, irqbalanced could decide to
   associate all MSI-X vectors of the first PCIe interface with a first
   set of CPUs and the MSI-X vectors of the second PCIe interface with a
   second set of CPUs. This will result in suboptimal performance if
   these two PCIe interfaces are used alternately instead of
   simultaneously.
* With blk-mq and scsi-mq optimal performance can only be achieved if
   the relationship between MSI-X vector and NUMA node does not change
   over time. This is necessary to allow a blk-mq/scsi-mq driver to
   ensure that interrupts are processed on the same NUMA node as the
   node on which the data structures for a communication channel have
   been allocated. However, today there is no API that allows
   blk-mq/scsi-mq drivers and irqbalanced to exchange information
   about the relationship between MSI-X vector ranges and NUMA nodes.
   The only approach I know of that works today to define IRQ affinity
   for blk-mq/scsi-mq drivers is to disable irqbalanced and to run a
   custom script that defines IRQ affinity, as sketched below (see e.g.
   the spread-mlx4-ib-interrupts attachment of
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/21312/focus=98409).
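
A minimal sketch of what such an affinity script boils down to, in
userspace C; the IRQ range and the NUMA-node CPU list are placeholders,
a real script derives them from /proc/interrupts and
/sys/devices/system/node/:

#include <stdio.h>

/*
 * Sketch only: spread 8 MSI-X vectors (assumed to be IRQs 100..107)
 * round-robin over the CPUs of one NUMA node (assumed to be CPUs 0-5).
 */
int main(void)
{
        static const int node_cpus[] = { 0, 1, 2, 3, 4, 5 };
        const int nr_cpus = sizeof(node_cpus) / sizeof(node_cpus[0]);
        int irq;

        for (irq = 100; irq < 108; irq++) {
                char path[64];
                FILE *f;

                snprintf(path, sizeof(path),
                         "/proc/irq/%d/smp_affinity_list", irq);
                f = fopen(path, "w");
                if (!f)
                        continue;       /* vector may not exist */
                fprintf(f, "%d\n", node_cpus[(irq - 100) % nr_cpus]);
                fclose(f);
        }
        return 0;
}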

Bart.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
       [not found] ` <20150715120708.GA24534-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2015-07-15 12:12   ` [Ksummit-discuss] " Thomas Gleixner
  2015-07-15 14:56   ` Marc Zyngier
@ 2015-07-15 16:05   ` Michael S. Tsirkin
  2 siblings, 0 replies; 15+ messages in thread
From: Michael S. Tsirkin @ 2015-07-15 16:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, Jul 15, 2015 at 05:07:08AM -0700, Christoph Hellwig wrote:
> Many years ago we decided to move the setting of IRQ-to-core affinities
> to userspace with the irqbalance daemon.
> 
> These days we have systems with lots of MSI-X vectors, and we have
> hardware and subsystem support for per-CPU I/O queues in the block
> layer, the RDMA subsystem and probably the network stack (I'm not too
> familiar with the recent developments there).  It would really help the
> out-of-the-box performance and experience if we could allow such
> subsystems to bind interrupt vectors to the node that the queue is
> configured on.

I think you are right; it's true for networking.

Whenever someone tries to benchmark networking, the first thing done is
always to disable irqbalance and pin IRQs manually away from wherever
the benchmark is running, but on the same NUMA node.

Without that, interrupts don't let the benchmark make progress.

Alternatively, people give up on interrupts completely and
start polling hardware aggressively. Nice for a benchmark,
not nice for the environment.

> 
> I'd like to discuss whether the rationale for moving the IRQ affinity
> setting fully to userspace is still correct in today's world, and any
> pitfalls we'll have to learn from in irqbalanced and the old in-kernel
> affinity code.

IMHO there could be a benefit from better integration with the scheduler.
Maybe an interrupt handler can be viewed as a kind of thread, so the
scheduler can make decisions about where to run it next?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
       [not found]       ` <55A67F11.1030709-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2015-07-15 17:19         ` Keith Busch
       [not found]           ` <alpine.LNX.2.00.1507151700300.15930-bi+AKbBUZKYRn3MOdyr96VDQ4js95KgL@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2015-07-15 17:19 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Thomas Gleixner, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jens Axboe,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I

On Wed, 15 Jul 2015, Bart Van Assche wrote:
> * With blk-mq and scsi-mq optimal performance can only be achieved if
>  the relationship between MSI-X vector and NUMA node does not change
>  over time. This is necessary to allow a blk-mq/scsi-mq driver to
>  ensure that interrupts are processed on the same NUMA node as the
>  node on which the data structures for a communication channel have
>  been allocated. However, today there is no API that allows
>  blk-mq/scsi-mq drivers and irqbalanced to exchange information
>  about the relationship between MSI-X vector ranges and NUMA nodes.

We could have low-level drivers provide blk-mq the controller's irq
associated with a particular h/w context, and the block layer can provide
the context's cpumask to irqbalance with the smp affinity hint.

The nvme driver already uses the hwctx cpumask to set hints, but this
doesn't seem like it should be a driver responsibility. It currently
doesn't work correctly anyway with hot-cpu since blk-mq could rebalance
the h/w contexts without syncing with the low-level driver.

If we can add this to blk-mq, one additional case to consider is if the
same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
assignment needs to be aware of this to prevent sharing a vector across
NUMA nodes.
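
As an illustration, the per-hctx hint setting described above amounts to
something like the following in a driver.  irq_set_affinity_hint() and
queue_for_each_hw_ctx() are existing interfaces; struct my_queue and the
way the per-queue IRQ is looked up are made up for this sketch:

#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/interrupt.h>

/* Hypothetical per-queue bookkeeping; only the irq field matters here. */
struct my_queue {
        unsigned int irq;       /* MSI-X vector serving this hw context */
};

static void my_set_irq_hints(struct request_queue *q, struct my_queue *mqs)
{
        struct blk_mq_hw_ctx *hctx;
        unsigned int i;

        /* Export each hw context's cpumask to irqbalance as an affinity hint. */
        queue_for_each_hw_ctx(q, hctx, i)
                irq_set_affinity_hint(mqs[i].irq, hctx->cpumask);
}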

>  The only approach I know of that works today to define IRQ affinity
>  for blk-mq/scsi-mq drivers is to disable irqbalanced and to run a
>  custom script that defines IRQ affinity (see e.g. the
>  spread-mlx4-ib-interrupts attachment of 
> http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/21312/focus=98409).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
       [not found]           ` <alpine.LNX.2.00.1507151700300.15930-bi+AKbBUZKYRn3MOdyr96VDQ4js95KgL@public.gmane.org>
@ 2015-07-15 17:25             ` Jens Axboe
  2015-07-15 18:24               ` Sagi Grimberg
  2015-07-15 18:48               ` Matthew Wilcox
  0 siblings, 2 replies; 15+ messages in thread
From: Jens Axboe @ 2015-07-15 17:25 UTC (permalink / raw)
  To: Keith Busch, Bart Van Assche
  Cc: Thomas Gleixner, Christoph Hellwig,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I

On 07/15/2015 11:19 AM, Keith Busch wrote:
> On Wed, 15 Jul 2015, Bart Van Assche wrote:
>> * With blk-mq and scsi-mq optimal performance can only be achieved if
>>  the relationship between MSI-X vector and NUMA node does not change
>>  over time. This is necessary to allow a blk-mq/scsi-mq driver to
>>  ensure that interrupts are processed on the same NUMA node as the
>>  node on which the data structures for a communication channel have
>>  been allocated. However, today there is no API that allows
>>  blk-mq/scsi-mq drivers and irqbalanced to exchange information
>>  about the relationship between MSI-X vector ranges and NUMA nodes.
>
> We could have low-level drivers provide blk-mq the controller's irq
> associated with a particular h/w context, and the block layer can provide
> the context's cpumask to irqbalance with the smp affinity hint.
>
> The nvme driver already uses the hwctx cpumask to set hints, but this
> doesn't seem like it should be a driver responsibility. It currently
> doesn't work correctly anyway with hot-cpu since blk-mq could rebalance
> the h/w contexts without syncing with the low-level driver.
>
> If we can add this to blk-mq, one additional case to consider is if the
> same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
> assignment needs to be aware of this to prevent sharing a vector across
> NUMA nodes.

Exactly. I may have promised to do just that at the last LSF/MM 
conference, just haven't done it yet. The point is to share the mask; 
ideally I'd like to take it all the way where the driver just asks for a 
number of vecs through a nice API that takes care of all this. Lots of 
duplicated code in drivers for this these days, and it's a mess.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
  2015-07-15 17:25             ` Jens Axboe
@ 2015-07-15 18:24               ` Sagi Grimberg
  2015-07-15 18:48               ` Matthew Wilcox
  1 sibling, 0 replies; 15+ messages in thread
From: Sagi Grimberg @ 2015-07-15 18:24 UTC (permalink / raw)
  To: Jens Axboe, Keith Busch, Bart Van Assche
  Cc: Thomas Gleixner, Christoph Hellwig, linux-rdma, linux-nvme,
	linux-kernel, ksummit-discuss

On 7/15/2015 8:25 PM, Jens Axboe wrote:
> On 07/15/2015 11:19 AM, Keith Busch wrote:
>> On Wed, 15 Jul 2015, Bart Van Assche wrote:
>>> * With blk-mq and scsi-mq optimal performance can only be achieved if
>>>  the relationship between MSI-X vector and NUMA node does not change
>>>  over time. This is necessary to allow a blk-mq/scsi-mq driver to
>>>  ensure that interrupts are processed on the same NUMA node as the
>>>  node on which the data structures for a communication channel have
>>>  been allocated. However, today there is no API that allows
>>>  blk-mq/scsi-mq drivers and irqbalanced to exchange information
>>>  about the relationship between MSI-X vector ranges and NUMA nodes.
>>
>> We could have low-level drivers provide blk-mq the controller's irq
>> associated with a particular h/w context, and the block layer can provide
>> the context's cpumask to irqbalance with the smp affinity hint.
>>
>> The nvme driver already uses the hwctx cpumask to set hints, but this
>> doesn't seem like it should be a driver responsibility. It currently
>> doesn't work correctly anyway with hot-cpu since blk-mq could rebalance
>> the h/w contexts without syncing with the low-level driver.
>>
>> If we can add this to blk-mq, one additional case to consider is if the
>> same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
>> assignment needs to be aware of this to prevent sharing a vector across
>> NUMA nodes.
>
> Exactly. I may have promised to do just that at the last LSF/MM
> conference, just haven't done it yet. The point is to share the mask,
> I'd ideally like to take it all the way where the driver just asks for a
> number of vecs through a nice API that takes care of all this. Lots of
> duplicated code in drivers for this these days, and it's a mess.
>

These are all good points.

But I'm not sure the block layer is always the correct place to take
care of MSI-X vector assignments. It's probably a perfect fit for NVMe
and other storage devices, but if we take RDMA for example, block
storage co-exists with file storage, Ethernet traffic and user-space
applications that do RDMA, all of which share the device's MSI-X vectors.
So in this case the block layer would not be a suitable place to set
IRQ affinity, since each deployment might present different constraints.

In any event, the irqbalance daemon is not helping here. Unfortunately
the common practice is to just turn it off in order to get optimized
performance.

Sagi.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
  2015-07-15 17:25             ` Jens Axboe
  2015-07-15 18:24               ` Sagi Grimberg
@ 2015-07-15 18:48               ` Matthew Wilcox
       [not found]                 ` <20150715184800.GL13681-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  1 sibling, 1 reply; 15+ messages in thread
From: Matthew Wilcox @ 2015-07-15 18:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Bart Van Assche, ksummit-discuss, linux-rdma,
	linux-kernel, linux-nvme, Christoph Hellwig, Thomas Gleixner

On Wed, Jul 15, 2015 at 11:25:55AM -0600, Jens Axboe wrote:
> On 07/15/2015 11:19 AM, Keith Busch wrote:
> >On Wed, 15 Jul 2015, Bart Van Assche wrote:
> >>* With blk-mq and scsi-mq optimal performance can only be achieved if
> >> the relationship between MSI-X vector and NUMA node does not change
> >> over time. This is necessary to allow a blk-mq/scsi-mq driver to
> >> ensure that interrupts are processed on the same NUMA node as the
> >> node on which the data structures for a communication channel have
> >> been allocated. However, today there is no API that allows
> >> blk-mq/scsi-mq drivers and irqbalanced to exchange information
> >> about the relationship between MSI-X vector ranges and NUMA nodes.
> >
> >We could have low-level drivers provide blk-mq the controller's irq
> >associated with a particular h/w context, and the block layer can provide
> >the context's cpumask to irqbalance with the smp affinity hint.
> >
> >The nvme driver already uses the hwctx cpumask to set hints, but this
> >doesn't seem like it should be a driver responsibility. It currently
> >doesn't work correctly anyway with hot-cpu since blk-mq could rebalance
> >the h/w contexts without syncing with the low-level driver.
> >
> >If we can add this to blk-mq, one additional case to consider is if the
> >same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
> >assignment needs to be aware of this to prevent sharing a vector across
> >NUMA nodes.
> 
> Exactly. I may have promised to do just that at the last LSF/MM conference,
> just haven't done it yet. The point is to share the mask, I'd ideally like
> to take it all the way where the driver just asks for a number of vecs
> through a nice API that takes care of all this. Lots of duplicated code in
> drivers for this these days, and it's a mess.

Yes.  I think the fundamental problem is that our MSI-X API is so funky.
We have this incredibly flexible scheme where each MSI-X vector could
have its own interrupt handler, but that's not what drivers want.
They want to say "Give me eight MSI-X vectors spread across the CPUs,
and use this interrupt handler for all of them".  That is, instead of
the current scheme where each MSI-X vector gets its own Linux interrupt,
we should have one interrupt handler (of the per-cpu interrupt type),
which shows up with N bits set in its CPU mask.
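
To make that concrete, the driver-facing call being asked for might look
roughly like the sketch below.  pci_enable_msix_spread() and its
semantics are invented purely to illustrate the shape of such an
interface; it is not an existing kernel API:

#include <linux/interrupt.h>
#include <linux/pci.h>

/*
 * Hypothetical prototype, for illustration only:
 *
 *   int pci_enable_msix_spread(struct pci_dev *pdev, int max_vecs,
 *                              irq_handler_t handler, void *data);
 *
 * It would allocate up to max_vecs MSI-X vectors, spread them across
 * the online CPUs, and wire all of them to a single handler.
 */
static irqreturn_t my_irq_handler(int irq, void *data);

static int my_probe_irqs(struct pci_dev *pdev, void *drvdata)
{
        int nvec = pci_enable_msix_spread(pdev, 8, my_irq_handler, drvdata);

        if (nvec < 0)
                return nvec;    /* no vectors could be allocated */

        /* nvec vectors now feed one handler, spread over the CPUs. */
        return 0;
}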

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
       [not found]                 ` <20150715184800.GL13681-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
@ 2015-07-16  6:13                   ` Michael S. Tsirkin
  2015-07-17 15:51                   ` Thomas Gleixner
  1 sibling, 0 replies; 15+ messages in thread
From: Michael S. Tsirkin @ 2015-07-16  6:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Christoph Hellwig,
	ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Keith Busch,
	Bart Van Assche

On Wed, Jul 15, 2015 at 02:48:00PM -0400, Matthew Wilcox wrote:
> On Wed, Jul 15, 2015 at 11:25:55AM -0600, Jens Axboe wrote:
> > On 07/15/2015 11:19 AM, Keith Busch wrote:
> > >On Wed, 15 Jul 2015, Bart Van Assche wrote:
> > >>* With blk-mq and scsi-mq optimal performance can only be achieved if
> > >> the relationship between MSI-X vector and NUMA node does not change
> > >> over time. This is necessary to allow a blk-mq/scsi-mq driver to
> > >> ensure that interrupts are processed on the same NUMA node as the
> > >> node on which the data structures for a communication channel have
> > >> been allocated. However, today there is no API that allows
> > >> blk-mq/scsi-mq drivers and irqbalanced to exchange information
> > >> about the relationship between MSI-X vector ranges and NUMA nodes.
> > >
> > >We could have low-level drivers provide blk-mq the controller's irq
> > >associated with a particular h/w context, and the block layer can provide
> > >the context's cpumask to irqbalance with the smp affinity hint.
> > >
> > >The nvme driver already uses the hwctx cpumask to set hints, but this
> > >doesn't seem like it should be a driver responsibility. It currently
> > >doesn't work correctly anyway with hot-cpu since blk-mq could rebalance
> > >the h/w contexts without syncing with the low-level driver.
> > >
> > >If we can add this to blk-mq, one additional case to consider is if the
> > >same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
> > >assignment needs to be aware of this to prevent sharing a vector across
> > >NUMA nodes.
> > 
> > Exactly. I may have promised to do just that at the last LSF/MM conference,
> > just haven't done it yet. The point is to share the mask, I'd ideally like
> > to take it all the way where the driver just asks for a number of vecs
> > through a nice API that takes care of all this. Lots of duplicated code in
> > drivers for this these days, and it's a mess.
> 
> Yes.  I think the fundamental problem is that our MSI-X API is so funky.
> We have this incredibly flexible scheme where each MSI-X vector could
> have its own interrupt handler, but that's not what drivers want.
> They want to say "Give me eight MSI-X vectors spread across the CPUs,
> and use this interrupt handler for all of them".  That is, instead of
> the current scheme where each MSI-X vector gets its own Linux interrupt,
> we should have one interrupt handler (of the per-cpu interrupt type),
> which shows up with N bits set in its CPU mask.

It would definitely be nice to have a way to express that.  But it's
also pretty common for drivers to have e.g. RX and TX use separate
vectors, and these need separate handlers.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
       [not found]                 ` <20150715184800.GL13681-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
  2015-07-16  6:13                   ` Michael S. Tsirkin
@ 2015-07-17 15:51                   ` Thomas Gleixner
  1 sibling, 0 replies; 15+ messages in thread
From: Thomas Gleixner @ 2015-07-17 15:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Keith Busch, Bart Van Assche,
	ksummit-discuss-cunTk1MwBs98uUxBSJOaYoYkZiVZrdSR2LY78lusg7I,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, LKML,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Christoph Hellwig,
	Jiang Liu, Marc Zyngier

On Wed, 15 Jul 2015, Matthew Wilcox wrote:
> On Wed, Jul 15, 2015 at 11:25:55AM -0600, Jens Axboe wrote:
> > On 07/15/2015 11:19 AM, Keith Busch wrote:
> > >On Wed, 15 Jul 2015, Bart Van Assche wrote:
> > >>* With blk-mq and scsi-mq optimal performance can only be achieved if
> > >> the relationship between MSI-X vector and NUMA node does not change
> > >> over time. This is necessary to allow a blk-mq/scsi-mq driver to
> > >> ensure that interrupts are processed on the same NUMA node as the
> > >> node on which the data structures for a communication channel have
> > >> been allocated. However, today there is no API that allows
> > >> blk-mq/scsi-mq drivers and irqbalanced to exchange information
> > >> about the relationship between MSI-X vector ranges and NUMA nodes.
> > >
> > >We could have low-level drivers provide blk-mq the controller's irq
> > >associated with a particular h/w context, and the block layer can provide
> > >the context's cpumask to irqbalance with the smp affinity hint.
> > >
> > >The nvme driver already uses the hwctx cpumask to set hints, but this
> > >doesn't seem like it should be a driver responsibility. It currently
> > >doesn't work correctly anyway with hot-cpu since blk-mq could rebalance
> > >the h/w contexts without syncing with the low-level driver.
> > >
> > >If we can add this to blk-mq, one additional case to consider is if the
> > >same interrupt vector is used with multiple h/w contexts. Blk-mq's cpu
> > >assignment needs to be aware of this to prevent sharing a vector across
> > >NUMA nodes.
> > 
> > Exactly. I may have promised to do just that at the last LSF/MM conference,
> > just haven't done it yet. The point is to share the mask, I'd ideally like
> > to take it all the way where the driver just asks for a number of vecs
> > through a nice API that takes care of all this. Lots of duplicated code in
> > drivers for this these days, and it's a mess.
> 
> Yes.  I think the fundamental problem is that our MSI-X API is so funky.
> We have this incredibly flexible scheme where each MSI-X vector could
> have its own interrupt handler, but that's not what drivers want.
> They want to say "Give me eight MSI-X vectors spread across the CPUs,
> and use this interrupt handler for all of them".  That is, instead of
> the current scheme where each MSI-X vector gets its own Linux interrupt,
> we should have one interrupt handler (of the per-cpu interrupt type),
> which shows up with N bits set in its CPU mask.

That certainly would help, but I'm definitely not going to open a huge
can of worms by providing a side channel for vector allocation with
all the variants of irq remapping and whatsoever.

Though we certainly can do better than we do now. We recently reworked
the whole interrupt handling of x86 to use hierarchical interrupt
domains. This allows us to come up with a clean solution for your
issue. The current hierarchy looks like this:

  [MSI-domain]
	|
	v
  [optional REMAP-domain]
	|
	v
  [Vector-domain]

Now it's simple to add another hierarchy level:

  [MSI/X-Multiqueue-domain]
	|
	v
  [MSI-domain]
	|
	v
  [optional REMAP-domain]
	|
	v
  [Vector-domain]

The MSI/X-Multiqueue-domain would be the one which is associated to
this class of devices. The domain would provide a single virtual
interrupt number to the device and hide the underlying details.

This needs a few new interfaces at the irq core level because we
cannot map that 1:1 to the per cpu interrupt mechanism which we have
on ARM and other architectures.

irqdomain interfaces used from PCI/MSI infrastructure code:

  irq_domain_alloc_mq(....., nr_vectors, spread_scheme)

    @nr_vectors:     The number of vectors to allocate underneath

    @spread_scheme:  Some form of advice/hint how to spread the vectors
    		     (nodes, cpus, ...)

    Returns a unique virtual interrupt number which shows up in
    /proc/irq. The virtual interrupt cannot be influenced by user space
    affinity settings (e.g. irqbalanced)

    The vectors will have separate irq numbers and irq descriptors,
    but those should be suppressed in /proc/interrupts. /proc/irq/NNN
    should expose the information at least for debugging purposes.

    One advantage of these separate descriptors is that the associated
    data will be cpu/node local according to the spread scheme.

  irq_domain_free_mq()

    Counterpart to the above

irq core interfaces used from PCI/MSI infrastructure

  irq_move_mq_vector()

    Move a multiqueue vector to a new target (cpu, node)  

    That might even replace the underlying irq descriptor with a newly
    allocated one, if the vector moves across nodes.

Driver relevant interfaces:

  msi_alloc_mq_irqs()/msi_free_mq_irqs()

    PCI/MSI specific wrappers for the irqdomain interfaces


  msi_mq_move_vector(virq, vector_nr, target)

    @virq: 	The virtual interrupt number

    @vector_nr:	The vector number to move

    @target:	The target cpu/node information
    
    PCI/MSI specific wrapper around irq_move_mq_vector()
	
The existing interfaces will behave as follows:

    request_irq()
    free_irq()
    disable_irq()
    enable_irq()

    They all operate on the virtual irq number and affect all
    associated vectors.

Now the question is whether we need

    en/disable_irq_mq_vector(virq, vector_nr)

to shut down / re-enable a particular vector, but that would be
pretty straightforward to do, plus/minus the headache versus the
global disable/enable mechanism which operates on the virq.
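
Put together, driver usage of the proposed interfaces might look roughly
like the sketch below.  msi_alloc_mq_irqs(), msi_free_mq_irqs(), the
spread hint constant and the argument lists are all taken or guessed
from the proposal above and do not exist in the kernel today; only
request_irq() is an existing interface:

#include <linux/interrupt.h>
#include <linux/pci.h>

static irqreturn_t my_mq_interrupt(int irq, void *data);

/* Illustration of the proposed API only; none of the msi_*_mq_* calls exist. */
static int my_setup_queue_irqs(struct pci_dev *pdev, void *drvdata,
                               unsigned int nr_queues)
{
        int virq, ret;

        /* One virtual irq, nr_queues vectors spread across NUMA nodes. */
        virq = msi_alloc_mq_irqs(&pdev->dev, nr_queues, MSI_MQ_SPREAD_NODES);
        if (virq < 0)
                return virq;

        /* A single handler covers all underlying vectors. */
        ret = request_irq(virq, my_mq_interrupt, 0, "my-mq", drvdata);
        if (ret) {
                msi_free_mq_irqs(&pdev->dev, virq);
                return ret;
        }
        return virq;
}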

Thoughts?

	tglx



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
  2015-07-15 12:07 [TECH TOPIC] IRQ affinity Christoph Hellwig
  2015-07-15 14:38 ` Christoph Lameter
       [not found] ` <20150715120708.GA24534-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2015-10-12 16:09 ` Theodore Ts'o
  2015-10-12 18:41   ` Christoph Hellwig
  2 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2015-10-12 16:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: ksummit-discuss, linux-rdma, linux-kernel, linux-nvme

Hi Christoph,

Do you think this is still an issue that would be worth discussing at
the kernel summit as a technical topic?  If so, would you be willing
to be responsible for kicking off the discussion for this topic?

Thanks,

					- Ted



On Wed, Jul 15, 2015 at 05:07:08AM -0700, Christoph Hellwig wrote:
> Many years ago we decided to move the setting of IRQ-to-core affinities
> to userspace with the irqbalance daemon.
> 
> These days we have systems with lots of MSI-X vectors, and we have
> hardware and subsystem support for per-CPU I/O queues in the block
> layer, the RDMA subsystem and probably the network stack (I'm not too
> familiar with the recent developments there).  It would really help the
> out-of-the-box performance and experience if we could allow such
> subsystems to bind interrupt vectors to the node that the queue is
> configured on.
> 
> I'd like to discuss whether the rationale for moving the IRQ affinity
> setting fully to userspace is still correct in today's world, and any
> pitfalls we'll have to learn from in irqbalanced and the old in-kernel
> affinity code.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
  2015-10-12 16:09 ` Theodore Ts'o
@ 2015-10-12 18:41   ` Christoph Hellwig
  2015-10-14 15:56     ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2015-10-12 18:41 UTC (permalink / raw)
  To: Theodore Ts'o, Christoph Hellwig, ksummit-discuss,
	linux-rdma, linux-kernel, linux-nvme

On Mon, Oct 12, 2015 at 12:09:48PM -0400, Theodore Ts'o wrote:
> Hi Christoph,
> 
> Do you think this is still an issue that would be worth discsussing at
> the kernel summit as a technical topic?  If so, would you be willing
> to be responsible for kicking off the discussion for this topic?

Hi Ted,

while we have high-level agreement, there's still some discussion
needed.  I can prepare a few slides for a 10-minute discussion and then
take it to the hallways with the interested people.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Ksummit-discuss] [TECH TOPIC] IRQ affinity
  2015-10-12 18:41   ` Christoph Hellwig
@ 2015-10-14 15:56     ` Theodore Ts'o
  0 siblings, 0 replies; 15+ messages in thread
From: Theodore Ts'o @ 2015-10-14 15:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: ksummit-discuss, linux-rdma, linux-kernel, linux-nvme

On Mon, Oct 12, 2015 at 11:41:45AM -0700, Christoph Hellwig wrote:
> 
> Hi Ted,
> 
> while we have a high level agreement there's still some discussion
> needed.  I can prepare a few slides for 10 minute discussion and then
> take it to the hallways with the interested people.

Thanks,

The Tech session day is designed for just that.  We should have plenty
of slots so you can stay and have the chat in the room.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-10-14 15:56 UTC | newest]

Thread overview: 15+ messages
2015-07-15 12:07 [TECH TOPIC] IRQ affinity Christoph Hellwig
2015-07-15 14:38 ` Christoph Lameter
     [not found] ` <20150715120708.GA24534-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2015-07-15 12:12   ` [Ksummit-discuss] " Thomas Gleixner
2015-07-15 15:41     ` Bart Van Assche
     [not found]       ` <55A67F11.1030709-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2015-07-15 17:19         ` Keith Busch
     [not found]           ` <alpine.LNX.2.00.1507151700300.15930-bi+AKbBUZKYRn3MOdyr96VDQ4js95KgL@public.gmane.org>
2015-07-15 17:25             ` Jens Axboe
2015-07-15 18:24               ` Sagi Grimberg
2015-07-15 18:48               ` Matthew Wilcox
     [not found]                 ` <20150715184800.GL13681-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>
2015-07-16  6:13                   ` Michael S. Tsirkin
2015-07-17 15:51                   ` Thomas Gleixner
2015-07-15 14:56   ` Marc Zyngier
2015-07-15 16:05   ` Michael S. Tsirkin
2015-10-12 16:09 ` Theodore Ts'o
2015-10-12 18:41   ` Christoph Hellwig
2015-10-14 15:56     ` Theodore Ts'o
