* Re: Affinity managed interrupts vs non-managed interrupts
From: Ming Lei @ 2018-08-29  8:46 UTC
  To: Sumit Saxena; +Cc: tglx, hch, linux-kernel

Hello Sumit,

On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
>  Affinity managed interrupts vs non-managed interrupts
> 
> Hi Thomas,
> 
> We are working on next generation MegaRAID product where requirement is- to
> allocate additional 16 MSI-x vectors in addition to number of MSI-x vectors
> megaraid_sas driver usually allocates.  MegaRAID adapter supports 128 MSI-x
> vectors.
> 
> To explain the requirement and solution, consider that we have 2 socket
> system (each socket having 36 logical CPUs). Current driver will allocate
> total 72 MSI-x vectors by calling API- pci_alloc_irq_vectors(with flag-
> PCI_IRQ_AFFINITY).  All 72 MSI-x vectors will have affinity across NUMA
> nodes and interrupts are affinity managed.
> 
> If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = 16,
> driver can allocate 16 + 72 MSI-x vectors.
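
For reference, a minimal sketch of the allocation described above
(illustrative only, not the actual megaraid_sas code):

    #include <linux/cpumask.h>
    #include <linux/interrupt.h>
    #include <linux/pci.h>

    /*
     * Illustrative only: request 16 pre_vectors (excluded from affinity
     * spreading) plus one managed vector per online CPU, i.e. 16 + 72 on
     * the 2-socket system described above.
     */
    static int alloc_extra_vectors(struct pci_dev *pdev)
    {
            struct irq_affinity desc = {
                    .pre_vectors = 16,      /* the 16 extra reply queues */
            };
            int nvec;

            nvec = pci_alloc_irq_vectors_affinity(pdev,
                            16 + 1, 16 + num_online_cpus(),
                            PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
                            &desc);

            return nvec < 0 ? nvec : 0;
    }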

Could you explain a bit what the specific use case for the extra 16
vectors is?

> 
> All pre_vectors (16) will be mapped to all available online CPUs but the
> effective affinity of each vector is to CPU 0. Our requirement is to have
> the pre_vectors 16 reply queues mapped to the local NUMA node, with the
> effective CPUs spread within the local node cpu mask. Without changing
> kernel code, we can

If all CPUs in one NUMA node are offline, can this use case work as expected?
Seems we have to understand what the use case is and how it works.


Thanks,
Ming


* RE: Affinity managed interrupts vs non-managed interrupts
From: Sumit Saxena @ 2018-08-29 10:46 UTC
  To: Ming Lei
  Cc: tglx, hch, linux-kernel, Kashyap Desai, Shivasharan Srikanteshwara

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Wednesday, August 29, 2018 2:16 PM
> To: Sumit Saxena <sumit.saxena@broadcom.com>
> Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org
> Subject: Re: Affinity managed interrupts vs non-managed interrupts
>
> Hello Sumit,
Hi Ming,
Thanks for response.
>
> On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> >  Affinity managed interrupts vs non-managed interrupts
> >
> > Hi Thomas,
> >
> > We are working on next generation MegaRAID product where requirement
> > is- to allocate additional 16 MSI-x vectors in addition to number of
> > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID adapter
> > supports 128 MSI-x vectors.
> >
> > To explain the requirement and solution, consider that we have 2
> > socket system (each socket having 36 logical CPUs). Current driver
> > will allocate total 72 MSI-x vectors by calling API-
> > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > vectors will have affinity across NUMA nodes and interrupts are
> > affinity managed.
> >
> > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > 16, driver can allocate 16 + 72 MSI-x vectors.
>
> Could you explain a bit what the specific use case for the extra 16
> vectors is?
We are trying to avoid the penalty due to one interrupt per IO completion,
and decided to coalesce interrupts on these extra 16 reply queues.
For the regular 72 reply queues, we will not coalesce interrupts, as for a
low IO workload interrupt coalescing may take more time due to fewer IO
completions.
In the IO submission path, the driver will decide which set of reply queues
(either the extra 16 reply queues or the regular 72 reply queues) should be
picked based on IO workload.
>
> >
> > All pre_vectors (16) will be mapped to all available online CPUs but the
> > effective affinity of each vector is to CPU 0. Our requirement is to
> > have the pre_vectors 16 reply queues mapped to the local NUMA node, with
> > the effective CPUs spread within the local node cpu mask. Without
> > changing kernel code, we can
>
> If all CPUs in one NUMA node are offline, can this use case work as
> expected?
> Seems we have to understand what the use case is and how it works.

Yes, if all CPUs of the NUMA node are offlined, IRQ-CPU affinity will be
broken and the irqbalance daemon takes care of migrating affected IRQs to
online CPUs of a different NUMA node.
When the offline CPUs are onlined again, irqbalance restores affinity.
>
>
> Thanks,
> Ming


* RE: Affinity managed interrupts vs non-managed interrupts
From: Kashyap Desai @ 2018-08-30 17:15 UTC
  To: Sumit Saxena, Ming Lei
  Cc: tglx, hch, linux-kernel, Shivasharan Srikanteshwara

Hi Thomas, Ming, Chris et al.,

Your input will help us to make changes to the megaraid_sas driver.  We are
currently waiting for a community response.

Is it recommended to use "pci_enable_msix_range" and have the low level
driver do the affinity setting, because the current APIs around
pci_alloc_irq_vectors do not meet our requirement?

We want more MSI-x vectors than online CPUs, and using pre_vectors we can do
that, but the first 16 MSI-x vectors should be mapped to the local NUMA node
with the effective CPUs spread across the CPUs of the local NUMA node. This
is not possible using pci_alloc_irq_vectors_affinity.

Do we need kernel API changes, or should the low level driver manage it
via irq_set_affinity_hint?
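
For comparison, a minimal sketch of the non-managed alternative being asked
about (illustrative only; it assumes the driver keeps the entries array
around for its later request_irq() calls):

    #include <linux/device.h>
    #include <linux/interrupt.h>
    #include <linux/pci.h>
    #include <linux/topology.h>

    /*
     * Illustrative only: allocate MSI-x vectors without PCI_IRQ_AFFINITY
     * and hint the first 16 vectors to the CPUs of the device-local NUMA
     * node, as described above.
     */
    static int setup_msix_manual(struct pci_dev *pdev,
                                 struct msix_entry *entries, int nvec)
    {
            int i, ret;

            for (i = 0; i < nvec; i++)
                    entries[i].entry = i;

            ret = pci_enable_msix_range(pdev, entries, nvec, nvec);
            if (ret < 0)
                    return ret;

            for (i = 0; i < 16 && i < nvec; i++)
                    irq_set_affinity_hint(entries[i].vector,
                                    cpumask_of_node(dev_to_node(&pdev->dev)));
            return 0;
    }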

Kashyap

> -----Original Message-----
> From: Sumit Saxena [mailto:sumit.saxena@broadcom.com]
> Sent: Wednesday, August 29, 2018 4:46 AM
> To: Ming Lei
> Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org; Kashyap
> Desai; Shivasharan Srikanteshwara
> Subject: RE: Affinity managed interrupts vs non-managed interrupts
>
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena <sumit.saxena@broadcom.com>
> > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
> Hi Ming,
> Thanks for response.
> >
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > >  Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on next generation MegaRAID product where requirement
> > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID adapter
> > > supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have 2
> > > socket system (each socket having 36 logical CPUs). Current driver
> > > will allocate total 72 MSI-x vectors by calling API-
> > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > vectors will have affinity across NUMA nodes and interrupts are
> > > affinity managed.
> > >
> > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > 16, driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case for the extra 16
> > vectors is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low IO
> workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.
> >
> > >
> > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > the effective affinity of each vector is to CPU 0. Our requirement is
> > > to have the pre_vectors 16 reply queues mapped to the local NUMA node,
> > > with the effective CPUs spread within the local node cpu mask. Without
> > > changing kernel code, we can
> >
> > If all CPUs in one NUMA node are offline, can this use case work as
> > expected?
> > Seems we have to understand what the use case is and how it works.
>
> Yes, if all CPUs of the NUMA node are offlined, IRQ-CPU affinity will be
> broken and the irqbalance daemon takes care of migrating affected IRQs to
> online CPUs of a different NUMA node.
> When the offline CPUs are onlined again, irqbalance restores affinity.
> >
> >
> > Thanks,
> > Ming


* Re: Affinity managed interrupts vs non-managed interrupts
From: Ming Lei @ 2018-08-31  6:54 UTC
  To: sumit.saxena
  Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Kashyap Desai,
	shivasharan.srikanteshwara, linux-block

On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena <sumit.saxena@broadcom.com> wrote:
>
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Wednesday, August 29, 2018 2:16 PM
> > To: Sumit Saxena <sumit.saxena@broadcom.com>
> > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > Hello Sumit,
> Hi Ming,
> Thanks for response.
> >
> > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > >  Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hi Thomas,
> > >
> > > We are working on next generation MegaRAID product where requirement
> > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID adapter
> > > supports 128 MSI-x vectors.
> > >
> > > To explain the requirement and solution, consider that we have 2
> > > socket system (each socket having 36 logical CPUs). Current driver
> > > will allocate total 72 MSI-x vectors by calling API-
> > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > vectors will have affinity across NUMA nodes and interrupts are
> > > affinity managed.
> > >
> > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > 16, driver can allocate 16 + 72 MSI-x vectors.
> >
> > Could you explain a bit what the specific use case for the extra 16
> > vectors is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For regular 72 reply queues, we will not coalesce interrupts as for low IO
> workload, interrupt coalescing may take more time due to less IO
> completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.

I am just wondering how you can make the decision about using the extra
16 or the regular 72 queues in the submission path. Could you share a bit
of your idea? How are you going to recognize the IO workload inside your
driver? Even the current block layer doesn't recognize IO workload, such
as random IO or sequential IO.

Frankly speaking, you may reuse the 72 reply queues to do interrupt
coalescing by configuring one extra register to enable the coalescing mode,
and you may just use a small part of the 72 reply queues under the
interrupt coalescing mode.

Or you can learn from SPDK to use one or a small number of dedicated cores
or kernel threads to poll the interrupts from all reply queues, then I
guess you may benefit much compared with the extra 16 queue approach.

Introducing extra 16 queues just for interrupt coalescing and making it
coexist with the regular 72 reply queues seems a very unusual use
case; I am not sure the current genirq affinity code can support it well.

> >
> > >
> > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > the effective affinity of each vector is to CPU 0. Our requirement is
> > > to have the pre_vectors 16 reply queues mapped to the local NUMA node,
> > > with the effective CPUs spread within the local node cpu mask. Without
> > > changing kernel code, we can
> >
> > If all CPUs in one NUMA node are offline, can this use case work as
> > expected?
> > Seems we have to understand what the use case is and how it works.
>
> Yes, if all CPUs of the NUMA node are offlined, IRQ-CPU affinity will be
> broken and the irqbalance daemon takes care of migrating affected IRQs to
> online CPUs of a different NUMA node.
> When the offline CPUs are onlined again, irqbalance restores affinity.

The irqbalance daemon can't cover managed interrupts, or do you mean
you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?

Thanks,
Ming Lei


* RE: Affinity managed interrupts vs non-managed interrupts
From: Kashyap Desai @ 2018-08-31  7:50 UTC
  To: Ming Lei, Sumit Saxena
  Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

> -----Original Message-----
> From: Ming Lei [mailto:tom.leiming@gmail.com]
> Sent: Friday, August 31, 2018 12:54 AM
> To: sumit.saxena@broadcom.com
> Cc: Ming Lei; Thomas Gleixner; Christoph Hellwig; Linux Kernel Mailing
> List;
> Kashyap Desai; shivasharan.srikanteshwara@broadcom.com; linux-block
> Subject: Re: Affinity managed interrupts vs non-managed interrupts
>
> On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena
> <sumit.saxena@broadcom.com> wrote:
> >
> > > -----Original Message-----
> > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > Sent: Wednesday, August 29, 2018 2:16 PM
> > > To: Sumit Saxena <sumit.saxena@broadcom.com>
> > > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org
> > > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> > >
> > > Hello Sumit,
> > Hi Ming,
> > Thanks for response.
> > >
> > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > > >  Affinity managed interrupts vs non-managed interrupts
> > > >
> > > > Hi Thomas,
> > > >
> > > > We are working on next generation MegaRAID product where
> > > > requirement
> > > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID
> > > > adapter
> > > > supports 128 MSI-x vectors.
> > > >
> > > > To explain the requirement and solution, consider that we have 2
> > > > socket system (each socket having 36 logical CPUs). Current driver
> > > > will allocate total 72 MSI-x vectors by calling API-
> > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > > vectors will have affinity across NUMA nodes and interrupts are
> > > > affinity managed.
> > > >
> > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > > 16, driver can allocate 16 + 72 MSI-x vectors.
> > >
> > > Could you explain a bit what the specific use case for the extra 16
> > > vectors is?
> > We are trying to avoid the penalty due to one interrupt per IO
> > completion
> > and decided to coalesce interrupts on these extra 16 reply queues.
> > For regular 72 reply queues, we will not coalesce interrupts as for low
> > IO
> > workload, interrupt coalescing may take more time due to less IO
> > completions.
> > In IO submission path, driver will decide which set of reply queues
> > (either extra 16 reply queues or regular 72 reply queues) to be picked
> > based on IO workload.
>
> I am just wondering how you can make the decision about using extra
> 16 or regular 72 queues in submission path, could you share us a bit
> your idea? How are you going to recognize the IO workload inside your
> driver? Even the current block layer doesn't recognize IO workload, such
> as random IO or sequential IO.

It is not yet finalized, but it can be based on per sdev outstanding,
shost_busy etc.
We want to use the special 16 reply queues for IO acceleration (these
queues work in interrupt coalescing mode; this is a h/w feature).

>
> Frankly speaking, you may reuse the 72 reply queues to do interrupt
> coalescing by configuring one extra register to enable the coalescing
> mode, and you may just use a small part of the 72 reply queues under the
> interrupt coalescing mode.
Our h/w can set interrupt coalescing per 8 reply queues, so the smallest
unit is 8. If we choose to take 8 reply queues from the existing 72 reply
queues (without asking for extra reply queues), we still have an issue on
systems with more numa nodes.  Example - in an 8 numa node system, each node
will have only *one* reply queue for effective interrupt coalescing (since
the irq subsystem will spread msix per numa node).

To keep things scalable we cherry-picked a few reply queues and wanted them
to be out of the cpu-msix mapping.

>
> Or you can learn from SPDK to use one or a small number of dedicated cores
> or kernel threads to poll the interrupts from all reply queues, then I
> guess you may benefit much compared with the extra 16 queue approach.
Problem with polling - it requires some steady completion, otherwise the
prediction in the driver gives different results on different profiles.
We attempted irq-poll and thread-ISR based polling, but each has pros and
cons. One of the key goals of the method we are trying is not to impact
latency for lower QD workloads.
I posted an RFC at
https://www.spinics.net/lists/linux-scsi/msg122874.html
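
For context, the irq-poll variant mentioned above looks roughly like this
(a sketch, not the posted RFC; the reply_queue structure and the
complete_cmds()/mask/unmask helpers are invented for illustration):

    #include <linux/interrupt.h>
    #include <linux/irq_poll.h>
    #include <linux/kernel.h>

    struct reply_queue {
            struct irq_poll iop;
            /* ... hardware queue state ... */
    };

    /* Invented helpers standing in for real completion/IRQ-mask code. */
    int complete_cmds(struct reply_queue *rq, int budget);
    void mask_queue_irq(struct reply_queue *rq);
    void unmask_queue_irq(struct reply_queue *rq);

    /* Budget-limited poll callback: consume up to "budget" completions;
     * once the queue drains, finish polling and unmask the interrupt. */
    static int reply_queue_poll(struct irq_poll *iop, int budget)
    {
            struct reply_queue *rq = container_of(iop, struct reply_queue, iop);
            int done = complete_cmds(rq, budget);

            if (done < budget) {
                    irq_poll_complete(iop);
                    unmask_queue_irq(rq);
            }
            return done;
    }

    /* Hard IRQ handler: mask the queue and hand off to irq_poll. */
    static irqreturn_t reply_queue_isr(int irq, void *data)
    {
            struct reply_queue *rq = data;

            mask_queue_irq(rq);
            irq_poll_sched(&rq->iop);
            return IRQ_HANDLED;
    }

    /* At init time: irq_poll_init(&rq->iop, 64, reply_queue_poll); */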

We have done an extensive study and concluded that using interrupt
coalescing is better if the h/w can manage two different modes (coalescing
on/off).

>
> Introducing extra 16 queues just for interrupt coalescing and making it
> coexist with the regular 72 reply queues seems a very unusual use
> case; I am not sure the current genirq affinity code can support it well.

Yes, this is an unusual case. I think it is not used by any other drivers.

>
> > >
> > > >
> > > > All pre_vectors (16) will be mapped to all available online CPUs but
> > > > the effective affinity of each vector is to CPU 0. Our requirement is
> > > > to have the pre_vectors 16 reply queues mapped to the local NUMA node,
> > > > with the effective CPUs spread within the local node cpu mask. Without
> > > > changing kernel code, we can
> > >
> > > If all CPUs in one NUMA node are offline, can this use case work as
> > > expected?
> > > Seems we have to understand what the use case is and how it works.
> >
> > Yes, if all CPUs of the NUMA node are offlined, IRQ-CPU affinity will be
> > broken and the irqbalance daemon takes care of migrating affected IRQs to
> > online CPUs of a different NUMA node.
> > When the offline CPUs are onlined again, irqbalance restores affinity.
>
> The irqbalance daemon can't cover managed interrupts, or do you mean
> you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?

Yes. We did not use "pci_alloc_irq_vectors_affinity".
We used "pci_enable_msix_range" and manually set affinity in the driver
using irq_set_affinity_hint.

>
> Thanks,
> Ming Lei


* RE: Affinity managed interrupts vs non-managed interrupts
From: Thomas Gleixner @ 2018-08-31 20:24 UTC
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > From: Ming Lei [mailto:tom.leiming@gmail.com]
> > Sent: Friday, August 31, 2018 12:54 AM
> > To: sumit.saxena@broadcom.com
> > Cc: Ming Lei; Thomas Gleixner; Christoph Hellwig; Linux Kernel Mailing
> > List;
> > Kashyap Desai; shivasharan.srikanteshwara@broadcom.com; linux-block
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts

Can you please teach your mail client NOT to insert the whole useless mail
header?

> > On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena
> > <sumit.saxena@broadcom.com> wrote:

> > > > > We are working on next generation MegaRAID product where
> > > > > requirement
> > > > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID
> > > > > adapter
> > > > > supports 128 MSI-x vectors.
> > > > >
> > > > > To explain the requirement and solution, consider that we have 2
> > > > > socket system (each socket having 36 logical CPUs). Current driver
> > > > > will allocate total 72 MSI-x vectors by calling API-
> > > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > > > vectors will have affinity across NUMA nodes and interrupts are
> > > > > affinity managed.
> > > > >
> > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > > > 16, driver can allocate 16 + 72 MSI-x vectors.
> > > >
> > > > Could you explain a bit what the specific use case for the extra 16
> > > > vectors is?
> > > We are trying to avoid the penalty due to one interrupt per IO
> > > completion
> > > and decided to coalesce interrupts on these extra 16 reply queues.
> > > For regular 72 reply queues, we will not coalesce interrupts as for low
> > > IO
> > > workload, interrupt coalescing may take more time due to less IO
> > > completions.
> > > In IO submission path, driver will decide which set of reply queues
> > > (either extra 16 reply queues or regular 72 reply queues) to be picked
> > > based on IO workload.
> >
> > I am just wondering how you can make the decision about using extra
> > 16 or regular 72 queues in submission path, could you share us a bit
> > your idea? How are you going to recognize the IO workload inside your
> > driver? Even the current block layer doesn't recognize IO workload, such
> > as random IO or sequential IO.
> 
> It is not yet finalized, but it can be based on per sdev outstanding,
> shost_busy etc.
> We want to use the special 16 reply queues for IO acceleration (these
> queues work in interrupt coalescing mode; this is a h/w feature).

TBH, this does not make any sense whatsoever. Why are you trying to have
extra interrupts for coalescing instead of doing the following:

1) Allocate 72 reply queues which get nicely spread out to every CPU on the
   system with affinity spreading.

2) Have a configuration for your reply queues which allows them to be
   grouped, e.g. by physical package.

3) Have a mechanism to mark a reply queue offline/online and handle that on
   CPU hotplug. That means on unplug you have to wait for the reply queue
   which is associated to the outgoing CPU to be empty and no new requests
   to be queued, which has to be done for the regular per CPU reply queues
   anyway.

4) On queueing the request, flag it 'coalescing' which causes the
   hardware/firmware to direct the reply to the first online reply queue in the
   group.

If the last CPU of a group goes offline, then the normal hotplug mechanism
takes effect and the whole thing is put 'offline' as well. This works
nicely for all kind of scenarios even if you have more CPUs than queues. No
extras, no magic affinity hints, it just works.

Hmm?
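
A rough sketch of the group bookkeeping steps 2)-4) imply (purely
illustrative; the actual steering would be done by the hardware/firmware):

    #include <linux/bitmap.h>

    /* One reply-queue group, e.g. per physical package (illustrative). */
    struct reply_queue_group {
            unsigned int  nr_queues;
            unsigned long *online;  /* one bit per queue, updated on hotplug */
    };

    /*
     * Step 4): a request flagged 'coalescing' has its reply directed to
     * the first online reply queue of the group; a negative return means
     * the whole group is offline and the regular per-CPU queue is used.
     */
    static int pick_coalescing_queue(const struct reply_queue_group *grp)
    {
            unsigned int q = find_first_bit(grp->online, grp->nr_queues);

            return q < grp->nr_queues ? (int)q : -1;
    }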

> Yes. We did not use "pci_alloc_irq_vectors_affinity".
> We used "pci_enable_msix_range" and manually set affinity in the driver
> using irq_set_affinity_hint.

I still regret the day when I merged that abomination.

Thanks,

	tglx



* RE: Affinity managed interrupts vs non-managed interrupts
From: Kashyap Desai @ 2018-08-31 21:49 UTC
  To: Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

> >
> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use the special 16 reply queues for IO acceleration (these
> > queues work in interrupt coalescing mode; this is a h/w feature).
>
> TBH, this does not make any sense whatsoever. Why are you trying to have
> extra interrupts for coalescing instead of doing the following:

Thomas,

We are using this feature mainly for performance and not for CPU hotplug
issues.
I read your #1 to #4 points below as mostly addressing CPU hotplug
concerns. Right?  We also want to make sure that if we convert the
megaraid_sas driver from managed to non-managed interrupts, we can still
meet the CPU hotplug requirement.  If we use "pci_enable_msix_range" and
manually set affinity in the driver using irq_set_affinity_hint, the CPU
hotplug feature works as expected: irqbalance is able to retain the older
mapping, and whenever an offlined cpu comes back, irqbalance restores the
same old mapping.

If we use all 72 reply queues (all in interrupt coalescing mode)
without any extra reply queues, we don't have any issue with the cpu-msix
mapping or with cpu hotplug.
Our major problem with that method is that latency is very bad on lower QD
and/or the single worker case.

To solve that problem we have added an extra 16 reply queues (this is a
special h/w feature for performance only) which can work in interrupt
coalescing mode, while the existing 72 reply queues work without any
interrupt coalescing.   The best way to map the additional 16 reply queues
is to map them to the local numa node.

I understand that it is a unique requirement, but at the same time we may
be able to do it gracefully (in the irq subsystem), as you mentioned
"irq_set_affinity_hint" should be avoided in the low level driver.



>
> 1) Allocate 72 reply queues which get nicely spread out to every CPU on
>    the system with affinity spreading.
>
> 2) Have a configuration for your reply queues which allows them to be
>    grouped, e.g. by physical package.
>
> 3) Have a mechanism to mark a reply queue offline/online and handle that
>    on CPU hotplug. That means on unplug you have to wait for the reply
>    queue which is associated to the outgoing CPU to be empty and no new
>    requests to be queued, which has to be done for the regular per CPU
>    reply queues anyway.
>
> 4) On queueing the request, flag it 'coalescing' which causes the
>    hardware/firmware to direct the reply to the first online reply queue
>    in the group.
>
> If the last CPU of a group goes offline, then the normal hotplug
> mechanism takes effect and the whole thing is put 'offline' as well. This
> works nicely for all kind of scenarios even if you have more CPUs than
> queues. No extras, no magic affinity hints, it just works.
>
> Hmm?
>
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set affinity in the driver
> > using irq_set_affinity_hint.
>
> I still regret the day when I merged that abomination.

Is it possible to have a similar mapping in the managed interrupt case, as
below?

    for (i = 0; i < 16; i++)
        irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
                              cpumask_of_node(local_numa_node));

Currently we always see that the managed interrupts for the pre-vectors
have affinity 0-71 and the effective cpu is always 0.
We want some changes in the current API which would allow us to pass flags
(like *local numa affinity*) so that the cpu-msix mappings come from the
local numa node and the effective cpus are spread across the local numa
node.

>
> Thanks,
>
> 	tglx


* RE: Affinity managed interrupts vs non-managed interrupts
From: Thomas Gleixner @ 2018-08-31 22:48 UTC
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > It is not yet finalized, but it can be based on per sdev outstanding,
> > > shost_busy etc.
> > > We want to use the special 16 reply queues for IO acceleration (these
> > > queues work in interrupt coalescing mode; this is a h/w feature).
> >
> > TBH, this does not make any sense whatsoever. Why are you trying to have
> > extra interrupts for coalescing instead of doing the following:
> 
> Thomas,
> 
> We are using this feature mainly for performance and not for CPU hotplug
> issues.
> I read your below #1 to #4 points are more of addressing CPU hotplug
> stuffs. Right ? If we use all 72 reply queue (all are in interrupt
> coalescing mode) without any extra reply queues, we don't have any issue
> with cpu-msix mapping and cpu hotplug issues.  Our major problem with
> that method is latency is very bad on lower QD and/or single worker case.
> 
> To solve that problem we have added extra 16 reply queue (this is a
> special h/w feature for performance only) which can be worked in interrupt
> coalescing mode vs existing 72 reply queue will work without any interrupt
> coalescing.   Best way to map additional 16 reply queue is map it to the
> local numa node.

Ok. I misunderstood the whole thing a bit. So your real issue is that you
want to have reply queues which are instantaneous, the per cpu ones, and
then the extra 16 which do batching and are shared over a set of CPUs,
right?

> I understand that it is a unique requirement, but at the same time we may
> be able to do it gracefully (in the irq subsystem), as you mentioned
> "irq_set_affinity_hint" should be avoided in the low level driver.

> Is it possible to have a similar mapping in the managed interrupt case,
> as below?
> 
>     for (i = 0; i < 16; i++)
>         irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
>                               cpumask_of_node(local_numa_node));
> 
> Currently we always see that the managed interrupts for the pre-vectors
> have affinity 0-71 and the effective cpu is always 0.

The pre-vectors are not affinity managed. They get the default affinity
assigned and at request_irq() the vectors are dynamically spread over CPUs
to avoid that the bulk of interrupts ends up on CPU0. That's handled that
way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")

> We want some changes in current API which can allow us to  pass flags
> (like *local numa affinity*) and cpu-msix mapping are from local numa node
> + effective cpu are spread across local numa node.

What you really want is to split the vector space for your device into two
blocks. One for the regular per cpu queues and the other (16 or how many
ever) which are managed separately, i.e. spread out evenly. That needs some
extensions to the core allocation/management code, but that shouldn't be a
huge problem.
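
To make that split concrete, a purely hypothetical sketch (these fields
are invented for illustration; the irq_affinity structure of that time
only knew pre_vectors and post_vectors):

    /* Hypothetical extension, illustrative only, not an existing API. */
    struct irq_affinity_two_blocks {
            unsigned int nr_sets;      /* independently spread blocks */
            unsigned int set_size[2];  /* e.g. { 72, 16 }: per-CPU + batching */
    };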

Thanks,

	tglx


* RE: Affinity managed interrupts vs non-managed interrupts
From: Kashyap Desai @ 2018-08-31 23:37 UTC
  To: Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

> > > > It is not yet finalized, but it can be based on per sdev
> > > > outstanding, shost_busy etc.
> > > > We want to use the special 16 reply queues for IO acceleration
> > > > (these queues work in interrupt coalescing mode; this is a h/w
> > > > feature).
> > >
> > > TBH, this does not make any sense whatsoever. Why are you trying to
> > > have extra interrupts for coalescing instead of doing the following:
> >
> > Thomas,
> >
> > We are using this feature mainly for performance and not for CPU
> > hotplug issues.
> > I read your #1 to #4 points below as mostly addressing CPU hotplug
> > stuff. Right? If we use all 72 reply queues (all in interrupt
> > coalescing mode) without any extra reply queues, we don't have any
> > issue with the cpu-msix mapping or with cpu hotplug.  Our major problem
> > with that method is that latency is very bad on lower QD and/or the
> > single worker case.
> >
> > To solve that problem we have added an extra 16 reply queues (this is a
> > special h/w feature for performance only) which can work in interrupt
> > coalescing mode, while the existing 72 reply queues work without any
> > interrupt coalescing.   The best way to map the additional 16 reply
> > queues is to map them to the local numa node.
>
> Ok. I misunderstood the whole thing a bit. So your real issue is that you
> want to have reply queues which are instantaneous, the per cpu ones, and
> then the extra 16 which do batching and are shared over a set of CPUs,
> right?

Yes, that is correct.  The extra 16 or whatever should be shared over the
set of CPUs of the *local* numa node of the PCI device.

>
> > I understand that it is a unique requirement, but at the same time we
> > may be able to do it gracefully (in the irq subsystem), as you mentioned
> > "irq_set_affinity_hint" should be avoided in the low level driver.
>
> > Is it possible to have a similar mapping in the managed interrupt case,
> > as below?
> >
> >     for (i = 0; i < 16; i++)
> >         irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
> >                               cpumask_of_node(local_numa_node));
> >
> > Currently we always see that the managed interrupts for the pre-vectors
> > have affinity 0-71 and the effective cpu is always 0.
>
> The pre-vectors are not affinity managed. They get the default affinity
> assigned and at request_irq() the vectors are dynamically spread over
> CPUs to avoid that the bulk of interrupts ends up on CPU0. That's handled
> that way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")

I am not sure if this is working on the 4.18 kernel. I can double check.
What I remember is that the pre_vectors are mapped to 0-71 in my case and
the effective cpu is always 0.
You mentioned that ideally they should be spread; let me check that.

>
> > We want some changes in the current API which would allow us to pass
> > flags (like *local numa affinity*) so that the cpu-msix mappings come
> > from the local numa node and the effective cpus are spread across the
> > local numa node.
>
> What you really want is to split the vector space for your device into
> two blocks. One for the regular per cpu queues and the other (16 or how
> many ever) which are managed separately, i.e. spread out evenly. That
> needs some extensions to the core allocation/management code, but that
> shouldn't be a huge problem.

Yes, this is the correct understanding.  I can test any proposed patch if
that is what we want to use as best practice.
We attempted it, but due to lack of knowledge of the irq subsystem, we were
not able to settle on anything close to our requirement.

We did something like below - "added a new flag PCI_IRQ_PRE_VEC_NUMA which
will indicate that all pre and post vectors should be shared within the
local numa node."

    int irq_flags;
    struct irq_affinity desc;

    desc.pre_vectors = 16;
    desc.post_vectors = 0;

    irq_flags = PCI_IRQ_MSIX;

    i = pci_alloc_irq_vectors_affinity(instance->pdev,
                instance->high_iops_vector_start * 2,
                instance->msix_vectors,
                irq_flags | PCI_IRQ_AFFINITY | PCI_IRQ_PRE_VEC_NUMA,
                &desc);

Somehow, I was not able to work out which part of the irq subsystem should
change.

~ Kashyap


>
> Thanks,
>
> 	tglx


* RE: Affinity managed interrupts vs non-managed interrupts
From: Thomas Gleixner @ 2018-09-02 12:02 UTC
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > Ok. I misunderstood the whole thing a bit. So your real issue is that you
> > want to have reply queues which are instantaneous, the per cpu ones, and
> > then the extra 16 which do batching and are shared over a set of CPUs,
> > right?
> 
> Yes that is correct.  Extra 16 or whatever should be shared over set of
> CPUs of *local* numa node of the PCI device.

Why restrict it to the local NUMA node of the device? That doesn't
really make sense if you queue lots of requests from CPUs on a different
node.

Why don't you spread these extra interrupts across all nodes and keep the
locality for the request/reply?

That would also allow making them properly managed interrupts, as you could
shut down the per node batching interrupts when all CPUs of that node are
offlined, and you'd avoid the whole affinity hint irq balancer hackery.

Thanks,

	tglx





* Re: Affinity managed interrupts vs non-managed interrupts
From: Ming Lei @ 2018-09-03  2:13 UTC
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

On Fri, Aug 31, 2018 at 01:50:31AM -0600, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:tom.leiming@gmail.com]
> > Sent: Friday, August 31, 2018 12:54 AM
> > To: sumit.saxena@broadcom.com
> > Cc: Ming Lei; Thomas Gleixner; Christoph Hellwig; Linux Kernel Mailing
> > List;
> > Kashyap Desai; shivasharan.srikanteshwara@broadcom.com; linux-block
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena
> > <sumit.saxena@broadcom.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > Sent: Wednesday, August 29, 2018 2:16 PM
> > > > To: Sumit Saxena <sumit.saxena@broadcom.com>
> > > > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org
> > > > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> > > >
> > > > Hello Sumit,
> > > Hi Ming,
> > > Thanks for response.
> > > >
> > > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > > > >  Affinity managed interrupts vs non-managed interrupts
> > > > >
> > > > > Hi Thomas,
> > > > >
> > > > > We are working on next generation MegaRAID product where
> > > > > requirement
> > > > > is- to allocate additional 16 MSI-x vectors in addition to number of
> > > > > MSI-x vectors megaraid_sas driver usually allocates.  MegaRAID
> > > > > adapter
> > > > > supports 128 MSI-x vectors.
> > > > >
> > > > > To explain the requirement and solution, consider that we have 2
> > > > > socket system (each socket having 36 logical CPUs). Current driver
> > > > > will allocate total 72 MSI-x vectors by calling API-
> > > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY).  All 72 MSI-x
> > > > > vectors will have affinity across NUMA nodes and interrupts are
> > > > > affinity managed.
> > > > >
> > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors =
> > > > > 16, driver can allocate 16 + 72 MSI-x vectors.
> > > >
> > > > Could you explain a bit what the specific use case for the extra 16
> > > > vectors is?
> > > We are trying to avoid the penalty due to one interrupt per IO
> > > completion
> > > and decided to coalesce interrupts on these extra 16 reply queues.
> > > For regular 72 reply queues, we will not coalesce interrupts as for low
> > > IO
> > > workload, interrupt coalescing may take more time due to less IO
> > > completions.
> > > In IO submission path, driver will decide which set of reply queues
> > > (either extra 16 reply queues or regular 72 reply queues) to be picked
> > > based on IO workload.
> >
> > I am just wondering how you can make the decision about using extra
> > 16 or regular 72 queues in submission path, could you share us a bit
> > your idea? How are you going to recognize the IO workload inside your
> > driver? Even the current block layer doesn't recognize IO workload, such
> > as random IO or sequential IO.
> 
> It is not yet finalized, but it can be based on per sdev outstanding,
> shost_busy etc.
> We want to use the special 16 reply queues for IO acceleration (these
> queues work in interrupt coalescing mode; this is a h/w feature).

This part is very key to your approach, so I'd suggest finalizing it
first. That said, this way doesn't make sense if you can't figure out
one doable approach to decide when to use the coalescing mode and when to
use the regular 72 reply queues.

If it is just for IO acceleration, why not always use the coalescing mode?

> 
> >
> > Frankly speaking, you may reuse the 72 reply queues to do interrupt
> > coalescing by configuring one extra register to enable the coalescing
> > mode, and you may just use a small part of the 72 reply queues under the
> > interrupt coalescing mode.
> Our h/w can set interrupt coalescing per 8 reply queues. So smallest is 8.
> If we choose to take 8 reply queue from existing 72 reply queue (without
> asking for extra reply queue), we still have  an issue on more numa node
> systems.  Example - in 8 numa node system each node will have only *one*
> reply queue for effective interrupt coalescing. (since irq subsystem will
> spread msix per numa).
> 
> To keep things scalable we cherry picked few reply queues and wanted them to
> be out of cpu-msix mapping.

I mean you can group the reply queues according to each queue's numa node
info, given that the mapping has already been figured out by the genirq
affinity code.

> 
> >
> > Or you can learn from SPDK to use one or a small number of dedicated cores
> > or kernel threads to poll the interrupts from all reply queues, then I
> > guess you may benefit much compared with the extra 16 queue approach.
> Problem with polling -  It requires some steady completion, otherwise
> prediction in driver gives different results on different profiles.
> We attempted irq-poll and thread ISR based polling, but it has pros and
> cons. One of the key usage of method what we are trying is not to impact
> latency for lower QD workloads.

Interrupt coalescing should affect latency too[1], or could you share your
idea of how to use interrupt coalescing to address the latency issue?

	"Interrupt coalescing, also known as interrupt moderation,[1] is a
	technique in which events which would normally trigger a hardware interrupt
	are held back, either until a certain amount of work is pending, or a
	timeout timer triggers."[1] 

[1] https://en.wikipedia.org/wiki/Interrupt_coalescing

> I posted RFC at
> https://www.spinics.net/lists/linux-scsi/msg122874.html
> 
> We have done extensive study and concluded to use interrupt coalescing is
> better if h/w can manage two different modes (coalescing on/off).

Could you explain a bit why coalescing is better?

In theory, interrupt coalescing just moves the implementation into
hardware. And the IO submitted from the same coalescing group is usually
unrelated. The same problem you found in polling should exist in
coalescing too.

> 
> >
> > Introducing extra 16 queues just for interrupt coalescing and making it
> > > coexist with the regular 72 reply queues seems a very unusual use
> > > case; I am not sure the current genirq affinity code can support it well.
> 
> Yes, this is an unusual case. I think it is not used by any other drivers.
> 
> >
> > > >
> > > > >
> > > > > All pre_vectors (16) will be mapped to all available online CPUs
> > > > > but the effective affinity of each vector is to CPU 0. Our
> > > > > requirement is to have the pre_vectors 16 reply queues mapped to
> > > > > the local NUMA node, with the effective CPUs spread within the
> > > > > local node cpu mask. Without changing kernel code, we can
> > > >
> > > > If all CPUs in one NUMA node are offline, can this use case work as
> > > > expected?
> > > > Seems we have to understand what the use case is and how it works.
> > >
> > > Yes, if all CPUs of the NUMA node are offlined, IRQ-CPU affinity will
> > > be broken and the irqbalance daemon takes care of migrating affected
> > > IRQs to online CPUs of a different NUMA node.
> > > When the offline CPUs are onlined again, irqbalance restores affinity.
> >
> >  irqbalance daemon can't cover managed interrupts, or you mean
> > you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?
> 
> Yes. We did not use "pci_alloc_irq_vectors_affinity".
> We used "pci_enable_msix_range" and manually set affinity in the driver
> using irq_set_affinity_hint.

Then you have to cover all kinds of CPU hotplug issues in your driver,
because you have switched to the driver maintaining the queue mapping.

Thanks,
Ming


* RE: Affinity managed interrupts vs non-managed interrupts
From: Kashyap Desai @ 2018-09-03  5:34 UTC
  To: Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

> On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > Ok. I misunderstood the whole thing a bit. So your real issue is
> > > that you want to have reply queues which are instantaneous, the per
> > > cpu ones, and then the extra 16 which do batching and are shared over
> > > a set of CPUs, right?
> >
> > Yes, that is correct.  The extra 16 or whatever should be shared over
> > the set of CPUs of the *local* numa node of the PCI device.
>
> Why restrict it to the local NUMA node of the device? That doesn't
> really make sense if you queue lots of requests from CPUs on a different
> node.

The penalty of crossing numa nodes is minimal with the higher interrupt
coalescing used in h/w.  We see the penalty of cross-numa traffic for lower
IOPs type workloads.
In this particular case we are taking care of cross-numa traffic via higher
interrupt coalescing.

>
> Why don't you spread these extra interrupts across all nodes and keep
> the locality for the request/reply?

I assume you are referring to spreading msix vectors to all numa nodes the
way "pci_alloc_irq_vectors" does.

Having the extra 16 reply queues spread across nodes will have a negative
impact. Take the example of an 8 node system (128 logical cpus total).
If the 16 reply queues are spread across numa nodes, there will be a total
of 8 logical cpus mapped to 1 reply queue (and each numa node will have
only 2 reply queues mapped).

Running IO from one numa node will then only consume 2 reply queues.
Performance drops drastically in such a case.  This is the typical problem
when the cpu-msix mapping goes to N:1, where msix is less than online cpus.

Mapping the extra 16 reply queues to the local numa node always makes sure
that the driver will round robin all 16 reply queues irrespective of the
originating cpu.
We validated this method by sending IOs from a remote node and did not
observe a performance penalty.
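
A sketch of that round-robin selection (illustrative; a plain atomic
counter stands in for whatever the driver actually uses):

    #include <linux/atomic.h>
    #include <linux/types.h>

    /*
     * Illustrative: pick the next of the node-local coalescing queues in
     * round-robin order, independent of the submitting CPU.
     */
    static u16 next_local_queue(atomic_t *rr, u16 first_queue, u16 count)
    {
            return first_queue +
                   ((unsigned int)atomic_inc_return(rr) % count);
    }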

>
> That would also allow making them properly managed interrupts, as you
> could shut down the per node batching interrupts when all CPUs of that
> node are offlined, and you'd avoid the whole affinity hint irq balancer
> hackery.

One more clarification -

I am using "for-4.19/block" and this particular patch "a0c9259
irq/matrix: Spread interrupts on allocation" is included.
I can see that the 16 extra reply queues via pre_vectors are still assigned
to CPU 0 (effective affinity).

irq 33, cpu list 0-71
irq 34, cpu list 0-71
irq 35, cpu list 0-71
irq 36, cpu list 0-71
irq 37, cpu list 0-71
irq 38, cpu list 0-71
irq 39, cpu list 0-71
irq 40, cpu list 0-71
irq 41, cpu list 0-71
irq 42, cpu list 0-71
irq 43, cpu list 0-71
irq 44, cpu list 0-71
irq 45, cpu list 0-71
irq 46, cpu list 0-71
irq 47, cpu list 0-71
irq 48, cpu list 0-71


# cat /sys/kernel/debug/irq/irqs/34
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300001
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x40000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x22
         chip:    APIC
          flags:   0x0
         Vector:    46
         Target:     0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

# cat /sys/kernel/debug/irq/irqs/35
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300002
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x50000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x23
         chip:    APIC
          flags:   0x0
         Vector:    47
         Target:     0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

Ideally, what we are looking for with the 16 extra pre_vector reply queues
is for the "effective affinity" to be within the local numa node, as long
as that numa node has online CPUs. If not, we are ok to have the effective
cpu from any node.

>
> Thanks,
>
> 	tglx
>
>


* RE: Affinity managed interrupts vs non-managed interrupts
From: Kashyap Desai @ 2018-09-03  6:10 UTC
  To: Ming Lei
  Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use the special 16 reply queues for IO acceleration (these
> > queues work in interrupt coalescing mode; this is a h/w feature).
>
> This part is very key to your approach, so I'd suggest finalizing it
> first. That said, this way doesn't make sense if you can't figure out
> one doable approach to decide when to use the coalescing mode and when to
> use the regular 72 reply queues.
This is almost finalized, but it is going through testing and may take some
time to review all the output.
At a very high level -
If the scsi device is a Virtual Disk, the driver counts each physical disk
as a data arm, and the required condition to use the io acceleration
(interrupt coalescing) path is that the outstanding count for the sdev is
more than 8 * data_arms. Using this method we are not going to impact
low-latency intensive workloads.
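
A sketch of that submission-path condition (illustrative; the data_arms
accounting is an assumption about the in-development code, and device_busy
is used here simply as the outstanding-IO count):

    #include <linux/atomic.h>
    #include <scsi/scsi_device.h>

    /*
     * Illustrative check: route an IO to the interrupt-coalescing reply
     * queues only when the sdev has enough outstanding IO (more than 8
     * per data arm) to keep the coalesced queues flushing promptly.
     */
    static bool use_coalescing_queues(struct scsi_device *sdev, int data_arms)
    {
            return atomic_read(&sdev->device_busy) > 8 * data_arms;
    }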

>
> If it is just for IO acceleration, why not always use the coalescing
> mode?

Ming, we attempted all the possible approaches. Let me summarize.

If we use *all* interrupt coalescing, the single worker and lower queue
depth profiles are impacted and a latency drop of up to 20% is seen.

>
> >
> > >
> > > Frankly speaking, you may reuse the 72 reply queues to do interrupt
> > > coalescing by configuring one extra register to enable the coalescing
> > > mode, and you may just use a small part of the 72 reply queues under
> > > the interrupt coalescing mode.
> > Our h/w can set interrupt coalescing per 8 reply queues, so the smallest
> > unit is 8. If we choose to take 8 reply queues from the existing 72
> > reply queues (without asking for extra reply queues), we still have an
> > issue on systems with more numa nodes.  Example - in an 8 numa node
> > system, each node will have only *one* reply queue for effective
> > interrupt coalescing (since the irq subsystem will spread msix per
> > numa node).
> >
> > To keep things scalable we cherry-picked a few reply queues and wanted
> > them to be out of the cpu-msix mapping.
>
> I mean you can group the reply queues according to each queue's numa node
> info, given that the mapping has already been figured out by the genirq
> affinity code.

I am not able to follow you.  I replied to Thomas on the same topic. Does
that reply clarify things, or am I still missing something?

>
> >
> > >
> > > Or you can learn from SPDK to use one or a small number of dedicated
> > > cores or kernel threads to poll the interrupts from all reply queues,
> > > then I guess you may benefit much compared with the extra 16 queue
> > > approach.
> > Problem with polling - it requires some steady completion, otherwise the
> > prediction in the driver gives different results on different profiles.
> > We attempted irq-poll and thread-ISR based polling, but each has pros
> > and cons. One of the key goals of the method we are trying is not to
> > impact latency for lower QD workloads.
>
> Interrupt coalescing should affect latency too[1], or could you share
> your idea of how to use interrupt coalescing to address the latency
> issue?
>
> 	"Interrupt coalescing, also known as interrupt moderation,[1] is a
> 	technique in which events which would normally trigger a hardware
> 	interrupt are held back, either until a certain amount of work is
> 	pending, or a timeout timer triggers."[1]
>
> [1] https://en.wikipedia.org/wiki/Interrupt_coalescing

That is correct. We are not going to use 100% interrupt coalescing, to
avoid the latency impact.  We will have two sets of queues. You can
consider this as hybrid interrupt coalescing.
In the 72 logical cpu case, we will allocate 88 (72 + 16) reply queues
(msix indexes). Only the first 16 reply queues will be configured in
interrupt coalescing mode (this is a special h/w feature) and the
remaining 72 reply queues are without any interrupt coalescing.  The 72
reply queues have a 1:1 cpu-msix map and the 16 reply queues are mapped to
the local numa node.

As explained above, per scsi device outstanding is a key factor in routing
io to the queues with interrupt coalescing vs the regular queues (without
interrupt coalescing).
Example -
If there are sync IO requests per scsi device (one IO at a time), the
driver will keep posting those IOs to the queues without any interrupt
coalescing.
If there are more than 8 outstanding ios per scsi device, the driver will
post those ios to the reply queues with interrupt coalescing. This
particular group of ios will not have a latency impact, because the
coalescing depth is the key factor in flushing the ios. There can be some
corner-case workloads where a latency impact is theoretically possible,
but having more scsi devices doing active io submission will close that
loop, and we do not suspect those cases will need any special treatment.
In fact, this solution is meant to provide reasonable latency + higher
iops for most cases, and if there is some deployment which needs tuning,
it is still possible to disable this feature.  We really want to deal with
those scenarios on a case by case basis (through firmware settings).


>
> > I posted RFC at
> > https://www.spinics.net/lists/linux-scsi/msg122874.html
> >
> > We have done an extensive study and concluded that using interrupt
> > coalescing is better if the h/w can manage two different modes
> > (coalescing on/off).
>
> Could you explain a bit why coalescing is better?

Actually we are doing hybrid coalescing. You are correct, there is no
single answer here; there are pros and cons.
For such hybrid coalescing we need h/w support.

>
> In theory, interrupt coalescing is just to move the implementation into
> hardware. And the IO submitted from the same coalescing group is usually
> irrelevant. The same problem you found in polling should have been in
> coalescing too.

Coalescing, whether in software or hardware, is a best-effort mechanism
and there is no steady snapshot of submission and completion in either
case.

One of the problems with coalescing/polling in an OS driver is that
irq-poll works in interrupt context, and waiting in polling consumes
more CPU because the driver has to run a predictive loop. At the same
time the driver should quit after some number of completions to give
fairness to other devices.  A threaded interrupt can resolve the cpu
hogging issue, but then we are moving our key interrupt processing to
threaded context, so fairness will be compromised. In the case of
threaded interrupt polling we may be impacted if the interrupts of
other devices request the same cpu where the threaded isr is running.
If the polling logic in the driver does not work well on different
systems, we are going to see the extra penalty of doing disable/enable
interrupt calls.  This particular problem is not a concern if the h/w
does interrupt coalescing.
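
For context, the irq-poll pattern referred to here is roughly the
sketch below. The reply_queue structure, the ISR and the mask register
are hypothetical; irq_poll_sched()/irq_poll_complete() are the
lib/irq_poll.c API, and the 'budget' argument is exactly the fairness
knob discussed above:

#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/irq_poll.h>

/* Hypothetical per-reply-queue context; field names are illustrative. */
struct reply_queue {
	struct irq_poll	iop;
	void __iomem	*intr_mask_reg;
};

static bool process_one_reply(struct reply_queue *rq);	/* assumed helper */

/* Hard IRQ handler: mask this queue's interrupt, hand off to irq-poll. */
static irqreturn_t rq_isr(int irq, void *data)
{
	struct reply_queue *rq = data;

	writel(1, rq->intr_mask_reg);	/* assumed h/w mask register */
	irq_poll_sched(&rq->iop);
	return IRQ_HANDLED;
}

/* Poll callback: 'budget' caps the work done per invocation. */
static int rq_poll(struct irq_poll *iop, int budget)
{
	struct reply_queue *rq = container_of(iop, struct reply_queue, iop);
	int done = 0;

	while (done < budget && process_one_reply(rq))
		done++;

	if (done < budget) {
		/* queue drained: stop polling and unmask the interrupt */
		irq_poll_complete(iop);
		writel(0, rq->intr_mask_reg);
	}
	return done;
}

/* At queue init time: irq_poll_init(&rq->iop, 32, rq_poll); */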

>
> >
> > >
> > > Introducing extra 16 queues just for interrupt coalescing and
> > > making it coexist with the regular 72 reply queues seems one very
> > > unusual use case, not sure the current genirq affinity can support
> > > it well.
> >
> > Yes. This is an unusual case. I think it is not used by any other
> > drivers.
> >
> > >
> > > > >
> > > > > >
> > > > > > All pre_vectors (16) will be mapped to all available online
> > > > > > CPUs but the effective affinity of each vector is to CPU 0.
> > > > > > Our requirement is to have the pre_vectors 16 reply queues
> > > > > > mapped to the local NUMA node with the effective CPU spread
> > > > > > within the local node cpu mask. Without changing kernel
> > > > > > code, we can
> > > > >
> > > > > If all CPUs in one NUMA node are offline, can this use case
> > > > > work as expected?
> > > > > Seems we have to understand what the use case is and how it
> > > > > works.
> > > >
> > > > Yes, if all CPUs of the NUMA node are offlined, IRQ-CPU affinity
> > > > will be broken and irqbalancer takes care of migrating affected
> > > > IRQs to online CPUs of a different NUMA node.
> > > > When offline CPUs are onlined again, irqbalancer restores
> > > > affinity.
> > >
> > >  irqbalance daemon can't cover managed interrupts, or you mean
> > > you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?
> >
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set the affinity in the
> > driver using irq_set_affinity_hint.
>
> Then you have to cover all kinds of CPU hotplug issues in your driver
> because you switched to maintaining the queue mapping in the driver.
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-03  6:10           ` Kashyap Desai
@ 2018-09-03  9:21             ` Ming Lei
  2018-09-03  9:50               ` Kashyap Desai
  0 siblings, 1 reply; 28+ messages in thread
From: Ming Lei @ 2018-09-03  9:21 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

On Mon, Sep 03, 2018 at 11:40:53AM +0530, Kashyap Desai wrote:
> > > It is not yet finalized, but it can be based on per sdev outstanding,
> > > shost_busy etc.
> > > We want to use the special 16 reply queues for IO acceleration
> > > (these queues are working in interrupt coalescing mode. This is a
> > > h/w feature)
> >
> > This part is very key to your approach, so I'd suggest to finalize
> > it first. That said, this way doesn't make sense if you can't figure
> > out one doable approach to decide when to use the coalescing mode,
> > and when to use the regular 72 reply queues.
> This is almost finalized, but it is going through testing and may take
> some time to review all the output.
> At a very high level -
> If the scsi device is a Virtual Disk, the driver will count each
> physical disk as a data arm, and the required condition to use the io
> acceleration (interrupt coalescing) path is - outstanding IOs for the
> sdev should be more than 8 * data_arms. Using this method we are not
> going to impact low latency intensive workloads.
> 
> >
> > If it is just for IO acceleration, why not always use the coalescing
> > mode?
> 
> Ming, we attempted all the possible approaches. Let me summarize.
> 
> If we use *all* interrupt coalescing, the single-worker, lower queue
> depth profile is impacted and a latency drop of up to 20% is seen.
> 
> >
> > >
> > > >
> > > > Frankly speaking, you may reuse the 72 reply queues to do
> > > > interrupt coalescing by configuring one extra register to enable
> > > > the coalescing mode, and you may just use a small part of the 72
> > > > reply queues under the interrupt coalescing mode.
> > > Our h/w can set interrupt coalescing per 8 reply queues. So the
> > > smallest is 8.
> > > If we choose to take 8 reply queues from the existing 72 reply
> > > queues (without asking for extra reply queues), we still have an
> > > issue on systems with more numa nodes.  Example - in an 8 numa
> > > node system each node will have only *one* reply queue for
> > > effective interrupt coalescing (since the irq subsystem will
> > > spread msix per numa).
> > >
> > > To keep things scalable we cherry picked a few reply queues and
> > > wanted them to be out of cpu-msix mapping.
> >
> > I mean you can group the reply queues according to the queue's numa node
> > info, given the mapping has been figured out there by genirq affinity
> > code.
> 
> I am not able to follow you.  I replied to Thomas on the same topic.
> Does that reply clarify, or am I still missing something?
> 
> >
> > >
> > > >
> > > > Or you can learn from SPDK to use one or a small number of
> > > > dedicated cores or kernel threads to poll the interrupts from
> > > > all reply queues, then I guess you may benefit much compared
> > > > with the extra 16 queue approach.
> > > Problem with polling -  It requires some steady completion, otherwise
> > > prediction in driver gives different results on different profiles.
> > > We attempted irq-poll and thread ISR based polling, but it has
> > > pros and cons. One of the key goals of the method we are trying is
> > > not to impact latency for lower QD workloads.
> >
> > Interrupt coalescing should affect latency too[1], or could you
> > share your idea how to use interrupt coalescing to address the
> > latency issue?
> >
> > 	"Interrupt coalescing, also known as interrupt moderation,[1] is a
> > 	technique in which events which would normally trigger a hardware
> > 	interrupt are held back, either until a certain amount of work is
> > 	pending, or a timeout timer triggers."[1]
> >
> > [1] https://en.wikipedia.org/wiki/Interrupt_coalescing
> 
> That is correct. We are not going to use interrupt coalescing on 100%
> of the queues, to avoid the latency impact.  We will have two sets of
> queues. You can consider this hybrid interrupt coalescing.
> In the 72 logical cpu case, we will allocate 88 (72 + 16) reply queues
> (msix index). Only the first 16 reply queues will be configured in
> interrupt coalescing mode (this is a special h/w feature) and the
> remaining 72 reply queues are without any interrupt coalescing.  The
> 72 reply queues are a 1:1 cpu-msix map and the 16 reply queues are
> mapped to the local numa node.
> 
> As explained above, the per-scsi-device outstanding IO count is the
> key factor in routing io to the queues with interrupt coalescing vs
> the regular queues (without interrupt coalescing).
> Example -
> If there are sync IO requests on a scsi device (one IO at a time), the
> driver will keep posting those IOs to the queues without any interrupt
> coalescing. If there are more than 8 outstanding IOs per scsi device,
> the driver will post those IOs to the reply queues with interrupt
> coalescing. This particular group

If the more than 8 outstanding IOs are from different CPUs or different
NUMA nodes, which reply queue will be chosen in the io submission path?

In this situation, any one of the 16 reply queues may not work as
expected, I guess.

> of IOs will not have a latency impact because the coalescing depth is
> the key factor that flushes the IOs. There can be corner-case
> workloads where a latency impact is theoretically possible, but having
> more scsi devices doing active io submission will close that loop, and
> we do not suspect those issues need any special treatment. In fact,
> this solution is to provide reasonable latency + higher iops for most
> of the cases, and if there are some deployments which need tuning, it
> is still possible to disable this feature.  We really want to deal
> with those scenarios on a case by case basis (through firmware
> settings).
> 
> 
> >
> > > I posted RFC at
> > > https://www.spinics.net/lists/linux-scsi/msg122874.html
> > >
> > > We have done extensive study and concluded that using interrupt
> > > coalescing is better if h/w can manage two different modes
> > > (coalescing on/off).
> >
> > Could you explain a bit why coalescing is better?
> 
> Actually we are doing hybrid coalescing. You are correct, we have no
> single answer here, but there are pros and cons.
> For such hybrid coalescing we need h/w support.
> 
> >
> > In theory, interrupt coalescing is just to move the implementation into
> > hardware. And the IO submitted from the same coalescing group is usually
> > irrelevant. The same problem you found in polling should have been in
> > coalescing too.
> 
> Coalescing, whether in software or hardware, is a best-effort
> mechanism and there is no steady snapshot of submission and completion
> in either case.
> 
> One of the problems with coalescing/polling in an OS driver is that
> irq-poll works in interrupt context, and waiting in polling consumes
> more CPU because the driver has to run a predictive loop. At the same
> time the driver should quit

One similar way is to use the outstanding IO on this device to predict
the poll time.
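
A rough sketch of that idea (purely illustrative - deriving the
per-invocation poll budget from the device's outstanding IO count,
clamped to sane bounds):

#include <linux/kernel.h>

static inline int poll_budget_from_outstanding(int outstanding)
{
	/* more outstanding IO -> larger budget, within fixed bounds */
	return clamp(outstanding, 4, 64);
}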

> after some number of completions to give fairness to other devices.  A
> threaded interrupt can resolve the cpu hogging issue, but then we are
> moving our key interrupt processing to threaded context, so fairness
> will be compromised. In the case of threaded interrupt polling we may
> be impacted if the interrupts of other devices request the same cpu
> where the threaded isr is running.  If the polling logic in the driver
> does not work well on different systems, we are going to see the extra
> penalty of doing disable/enable interrupt calls.  This particular
> problem is not a concern if the h/w does interrupt coalescing.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-03  9:21             ` Ming Lei
@ 2018-09-03  9:50               ` Kashyap Desai
  0 siblings, 0 replies; 28+ messages in thread
From: Kashyap Desai @ 2018-09-03  9:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

> > In the 72 logical cpu case, we will allocate 88 (72 + 16) reply
> > queues (msix index). Only the first 16 reply queues will be
> > configured in interrupt coalescing mode (this is a special h/w
> > feature) and the remaining 72 reply queues are without any interrupt
> > coalescing.  The 72 reply queues are a 1:1 cpu-msix map and the 16
> > reply queues are mapped to the local numa node.
> >
> > As explained above, the per-scsi-device outstanding IO count is the
> > key factor in routing io to the queues with interrupt coalescing vs
> > the regular queues (without interrupt coalescing).
> > Example -
> > If there are sync IO requests on a scsi device (one IO at a time),
> > the driver will keep posting those IOs to the queues without any
> > interrupt coalescing. If there are more than 8 outstanding IOs per
> > scsi device, the driver will post those IOs to the reply queues with
> > interrupt coalescing. This particular group
>
> If the more than 8 outstanding IOs are from different CPUs or
> different NUMA nodes, which reply queue will be chosen in the io
> submission path?

We tried this combination as well. If IO is submitted from a different
NUMA node, we anyway have the penalty of the cache invalidation issue.
We trust the rq_affinity = 2 setting to have the actual io completion
go back to the origin cpu.  This approach (of io acceleration queues)
is as good as using the irqbalance policy "ignore", where we have all
reply queues mapped to the local numa node.


>
> In this situation, any one of the 16 reply queues may not work as
> expected, I guess.

I tried this, and performance was the same with or without this new
feature we are discussing.

>
> > of IOs will not have a latency impact because the coalescing depth
> > is the key factor that flushes the IOs. There can be corner-case
> > workloads where a latency impact is theoretically possible, but
> > having more scsi devices doing active io submission will close that
> > loop, and we do not suspect those issues need any special treatment.
> > In fact, this solution is to provide reasonable latency + higher
> > iops for most of the cases, and if there are some deployments which
> > need tuning, it is still possible to disable this feature.  We
> > really want to deal with those scenarios on a case by case basis
> > (through firmware settings).
> >
> >
> > >
> > > > I posted RFC at
> > > > https://www.spinics.net/lists/linux-scsi/msg122874.html
> > > >
> > > > We have done extensive study and concluded that using interrupt
> > > > coalescing is better if h/w can manage two different modes
> > > > (coalescing on/off).
> > >
> > > Could you explain a bit why coalescing is better?
> >
> > Actually we are doing hybrid coalescing. You are correct, we have no
> > single answer here, but there are pros and cons.
> > For such hybrid coalescing we need h/w support.
> >
> > >
> > > In theory, interrupt coalescing is just to move the implementation
> > > into hardware. And the IO submitted from the same coalescing group
> > > is usually irrelevant. The same problem you found in polling
> > > should have been in coalescing too.
> >
> > Coalescing, whether in software or hardware, is a best-effort
> > mechanism and there is no steady snapshot of submission and
> > completion in either case.
> >
> > One of the problems with coalescing/polling in an OS driver is that
> > irq-poll works in interrupt context, and waiting in polling consumes
> > more CPU because the driver has to run a predictive loop. At the
> > same time the driver should quit
>
> One similar way is to use the outstanding IO on this device to predict
> the poll time.

We attempted this model as well. If outstanding IO is always available
(constant workload), the driver will never quit. Most of the time the
interrupt will be disabled and the thread will be doing polling work.
Ideally, the driver should quit after some defined time. Right ? That
is what the *budget* of irq-poll is for. If outstanding IO goes up and
down (burst workload), we will be doing frequent irq enable/disable and
that will vary the results.

Irq-poll is the best option for doing polling in the OS (mainly because
of its budget and interrupt context mechanism), but while predicting
the poll helps for a constant workload, it also hogs the host CPU
because most of the time the driver keeps polling without any work in
interrupt context.
If we use h/w interrupt coalescing, we are not wasting host CPU since
the h/w can manage coalescing without consuming host cpu.

>
> > after some number of completions to give fairness to other devices.
> > A threaded interrupt can resolve the cpu hogging issue, but then we
> > are moving our key interrupt processing to threaded context, so
> > fairness will be compromised. In the case of threaded interrupt
> > polling we may be impacted if the interrupts of other devices
> > request the same cpu where the threaded isr is running.  If the
> > polling logic in the driver does not work well on different systems,
> > we are going to see the extra penalty of doing disable/enable
> > interrupt calls.  This particular problem is not a concern if the
> > h/w does interrupt coalescing.
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-03  5:34                   ` Kashyap Desai
@ 2018-09-03 16:28                     ` Thomas Gleixner
  2018-09-04 10:29                       ` Kashyap Desai
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2018-09-03 16:28 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

On Mon, 3 Sep 2018, Kashyap Desai wrote:
> I am using " for-4.19/block " and this particular patch "a0c9259
> irq/matrix: Spread interrupts on allocation" is included.

Can you please try against 4.19-rc2 or later?

> I can see that 16 extra reply queues via pre_vectors are still assigned to
> CPU 0 (effective affinity).
> 
> irq 33, cpu list 0-71

The cpu list is irrelevant because that's the allowed affinity mask. The
effective one is what counts.

> # cat /sys/kernel/debug/irq/irqs/34
> node:     0
> affinity: 0-71
> effectiv: 0

So if all 16 have their effective affinity set to CPU0 then that's strange
at least.

Can you please provide the output of /sys/kernel/debug/irq/domains/VECTOR ?

> Ideally, what we are looking for is the 16 extra pre_vector reply
> queues to have their "effective affinity" within the local numa node
> as long as that numa node has online CPUs. If not, we are ok to have
> the effective cpu from any node.

Well, we surely can do the initial allocation and spreading on the local
numa node, but once all CPUs are offline on that node, then the whole thing
goes down the drain and allocates from where it sees fit. I'll think about
it some more, especially how to avoid the proliferation of the affinity
hint.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-03 16:28                     ` Thomas Gleixner
@ 2018-09-04 10:29                       ` Kashyap Desai
  2018-09-05  5:46                         ` Dou Liyang
  0 siblings, 1 reply; 28+ messages in thread
From: Kashyap Desai @ 2018-09-04 10:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block

>
> On Mon, 3 Sep 2018, Kashyap Desai wrote:
> > I am using " for-4.19/block " and this particular patch "a0c9259
> > irq/matrix: Spread interrupts on allocation" is included.
>
> Can you please try against 4.19-rc2 or later?
>
> > I can see that 16 extra reply queues via pre_vectors are still
> > assigned to CPU 0 (effective affinity).
> >
> > irq 33, cpu list 0-71
>
> The cpu list is irrelevant because that's the allowed affinity mask. The
> effective one is what counts.
>
> > # cat /sys/kernel/debug/irq/irqs/34
> > node:     0
> > affinity: 0-71
> > effectiv: 0
>
> So if all 16 have their effective affinity set to CPU0 then that's
> strange at least.
>
> Can you please provide the output of
> /sys/kernel/debug/irq/domains/VECTOR ?

I tried 4.19-rc2. Same behavior as I posted earlier. All 16 pre_vector
irqs have effective CPU = 0.

Here is output of "/sys/kernel/debug/irq/domains/VECTOR"

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 360
 flags:  0x00000041
Online bitmaps:       72
Global available:  13062
Global reserved:      86
Total allocated:     274
System: 43: 0-19,32,50,128,236-255
 | CPU | avl | man | act | vectors
     0   169    17    32  33-49,51-65
     1   181    17     4  33,36,52-53
     2   181    17     4  33-36
     3   181    17     4  33-34,52-53
     4   181    17     4  33,35,53-54
     5   181    17     4  33,35-36,54
     6   182    17     3  33,35-36
     7   182    17     3  33-34,36
     8   182    17     3  34-35,53
     9   181    17     4  33-34,52-53
    10   182    17     3  34,36,53
    11   182    17     3  34-35,54
    12   182    17     3  33-34,53
    13   182    17     3  33,37,55
    14   181    17     4  33-36
    15   181    17     4  33,35-36,54
    16   181    17     4  33,35,53-54
    17   182    17     3  33,36-37
    18   181    17     4  33,36,54-55
    19   181    17     4  33,35-36,54
    20   181    17     4  33,35-37
    21   180    17     5  33,35,37,55-56
    22   181    17     4  33-36
    23   181    17     4  33,35,37,55
    24   180    17     5  33-36,54
    25   181    17     4  33-36
    26   181    17     4  33-35,54
    27   181    17     4  34-36,54
    28   181    17     4  33-35,53
    29   182    17     3  34-35,53
    30   182    17     3  33-35
    31   181    17     4  34-36,54
    32   182    17     3  33-34,53
    33   182    17     3  34-35,53
    34   182    17     3  33-34,53
    35   182    17     3  34-36
    36   182    17     3  33-34,53
    37   181    17     4  33,35,52-53
    38   182    17     3  34-35,53
    39   182    17     3  34,52-53
    40   182    17     3  33-35
    41   182    17     3  34-35,53
    42   182    17     3  33-35
    43   182    17     3  34,52-53
    44   182    17     3  33-34,53
    45   182    17     3  34-35,53
    46   182    17     3  34,36,54
    47   182    17     3  33-34,52
    48   182    17     3  34,36,54
    49   182    17     3  33,51-52
    50   181    17     4  33-36
    51   182    17     3  33-35
    52   182    17     3  33-35
    53   182    17     3  34-35,53
    54   182    17     3  33-34,53
    55   182    17     3  34-36
    56   181    17     4  33-35,53
    57   182    17     3  34-36
    58   182    17     3  33-34,53
    59   181    17     4  33-35,53
    60   181    17     4  33-35,53
    61   182    17     3  33-34,53
    62   182    17     3  33-35
    63   182    17     3  34-36
    64   182    17     3  33-34,54
    65   181    17     4  33-35,53
    66   182    17     3  33-34,54
    67   182    17     3  34-36
    68   182    17     3  33-34,54
    69   182    17     3  34,36,54
    70   182    17     3  33-35
    71   182    17     3  34,36,54

>
> > Ideally, what we are looking for is the 16 extra pre_vector reply
> > queues to have their "effective affinity" within the local numa
> > node as long as that numa node has online CPUs. If not, we are ok
> > to have the effective cpu from any node.
>
> Well, we surely can do the initial allocation and spreading on the
> local numa node, but once all CPUs are offline on that node, then the
> whole thing goes down the drain and allocates from where it sees fit.
> I'll think about it some more, especially how to avoid the
> proliferation of the affinity hint.

Thanks for looking into this request. This will help us implement the
WIP megaraid_sas driver changes.  I can test any patch you want me to
try.

>
> Thanks,
>
> 	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-04 10:29                       ` Kashyap Desai
@ 2018-09-05  5:46                         ` Dou Liyang
  2018-09-05  9:45                           ` Kashyap Desai
  0 siblings, 1 reply; 28+ messages in thread
From: Dou Liyang @ 2018-09-05  5:46 UTC (permalink / raw)
  To: Kashyap Desai, Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block, Dou Liyang

Hi Thomas, Kashyap,

At 09/04/2018 06:29 PM, Kashyap Desai wrote:
>>> I am using " for-4.19/block " and this particular patch "a0c9259
>>> irq/matrix: Spread interrupts on allocation" is included.
>>

IMO, this patch is just used for non-managed interrupts.

>> So if all 16 have their effective affinity set to CPU0 then that's
> strange

But, all these 16 are managed interrupts, and will be assigned vectors
by assign_managed_vector():
{
     cpumask_and(vector_searchmask, vector_searchmask, affmsk);
     cpu = cpumask_first(vector_searchmask);

     ...
     vector = irq_matrix_alloc_managed(vector_matrix, cpu);
     ...
}

Here we always use the *first* cpu in vector_searchmask (0-71), not the
most suitable one. So I guess that is how this situation happened.

Shall we also spread the managed interrupts on allocation?

Thanks,
     dou
-----------------8<----------------------------------------

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 9f148e3d45b4..57dc05691f44 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -314,13 +314,12 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
 	int vector, cpu;
 
 	cpumask_and(vector_searchmask, vector_searchmask, affmsk);
-	cpu = cpumask_first(vector_searchmask);
-	if (cpu >= nr_cpu_ids)
-		return -EINVAL;
+
 	/* set_affinity might call here for nothing */
 	if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
 		return 0;
-	vector = irq_matrix_alloc_managed(vector_matrix, cpu);
+
+	vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
+					  &cpu);
 	trace_vector_alloc_managed(irqd->irq, vector, vector);
 	if (vector < 0)
 		return vector;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 201de12a9957..36fdeff5043a 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1151,7 +1151,8 @@ void irq_matrix_offline(struct irq_matrix *m);
 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace);
 int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk);
 void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk);
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu);
+int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
+			     unsigned int *mapped_cpu);
 void irq_matrix_reserve(struct irq_matrix *m);
 void irq_matrix_remove_reserved(struct irq_matrix *m);
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 5092494bf261..d9e4e0a385fa 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -239,21 +239,40 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk)
  * @m:		Matrix pointer
  * @cpu:	On which CPU the interrupt should be allocated
  */
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu)
+int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
+			     unsigned int *mapped_cpu)
 {
-	struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
-	unsigned int bit, end = m->alloc_end;
-
-	/* Get managed bit which are not allocated */
-	bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end);
-	bit = find_first_bit(m->scratch_map, end);
-	if (bit >= end)
-		return -ENOSPC;
-	set_bit(bit, cm->alloc_map);
-	cm->allocated++;
-	m->total_allocated++;
-	trace_irq_matrix_alloc_managed(bit, cpu, m, cm);
-	return bit;
+	unsigned int cpu, best_cpu, maxavl = 0;
+	unsigned int bit, end;
+	struct cpumap *cm;
+
+	best_cpu = UINT_MAX;
+	for_each_cpu(cpu, msk) {
+		cm = per_cpu_ptr(m->maps, cpu);
+
+		if (!cm->online || cm->available <= maxavl)
+			continue;
+
+		best_cpu = cpu;
+		maxavl = cm->available;
+	}
+
+	if (maxavl) {
+		cm = per_cpu_ptr(m->maps, best_cpu);
+		end = m->alloc_end;
+		/* Get managed bit which are not allocated */
+		bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map,
+			      end);
+		bit = find_first_bit(m->scratch_map, end);
+		if (bit >= end)
+			return -ENOSPC;
+		set_bit(bit, cm->alloc_map);
+		cm->allocated++;
+		m->total_allocated++;
+		*mapped_cpu = best_cpu;
+		trace_irq_matrix_alloc_managed(bit, best_cpu, m, cm);
+		return bit;
+	}
+	return -ENOSPC;
 }
 
 /**


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-05  5:46                         ` Dou Liyang
@ 2018-09-05  9:45                           ` Kashyap Desai
  2018-09-05 10:38                             ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread
From: Kashyap Desai @ 2018-09-05  9:45 UTC (permalink / raw)
  To: Dou Liyang, Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block, Dou Liyang

> Hi Thomas, Kashyap,
>
> At 09/04/2018 06:29 PM, Kashyap Desai wrote:
> >>> I am using " for-4.19/block " and this particular patch "a0c9259
> >>> irq/matrix: Spread interrupts on allocation" is included.
> >>
>
> IMO, this patch is just used for non-managed interrupts.
>
> >> So if all 16 have their effective affinity set to CPU0 then that's
> > strange
>
> But, all these 16 are managed interrupts, and will be assigned vectors
> by assign_managed_vector():
> {
>      cpumask_and(vector_searchmask, vector_searchmask, affmsk);
>      cpu = cpumask_first(vector_searchmask);
>
>      ...
>      vector = irq_matrix_alloc_managed(vector_matrix, cpu);
>      ...
> }
>
> > Here we always use the *first* cpu in vector_searchmask (0-71), not
> > the most suitable one. So I guess that is how this situation
> > happened.
>
> Shall we also spread the managed interrupts on allocation?


Hi Dou,

I tried your proposed patch. With the patch, it is not assigning the
effective irq to CPU 0, but it picks *one* cpu from the 0-71 range.
Eventually, the effective cpu is still always *one* logical cpu. The
behavior is different, but the impact is the same.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-05  9:45                           ` Kashyap Desai
@ 2018-09-05 10:38                             ` Thomas Gleixner
  2018-09-06 10:14                               ` Dou Liyang
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2018-09-05 10:38 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Dou Liyang, Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block, Dou Liyang

On Wed, 5 Sep 2018, Kashyap Desai wrote:
> > Shall we also spread the managed interrupts on allocation?
> 
> I tried your proposed patch. Using patch, It is not assigning effective irq
> to CPU = 0 , but it pick *one* cpu from 0-71 range.
> Eventually, effective cpu is always *one* logical cpu. Behavior is
> different, but impact is still same.

Oh well. This was not intended to magically provide the solution you
want to have. It merely changed the behaviour of the managed interrupt
selection, which is a valid thing to do independent of the stuff you
want to see.

As I said that needs more thought and I really can't tell when I have a
time slot to look at that.

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-05 10:38                             ` Thomas Gleixner
@ 2018-09-06 10:14                               ` Dou Liyang
  2018-09-06 11:46                                 ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread
From: Dou Liyang @ 2018-09-06 10:14 UTC (permalink / raw)
  To: Thomas Gleixner, Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block, Dou Liyang

Hi Thomas,

At 09/05/2018 06:38 PM, Thomas Gleixner wrote:
> Oh well. This was not intended to magically provide the solution you want
> to have. It merily changed the behaviour of the managed interrupt
> selection, which is a valid thing to do independent of the stuff you want
> to see.
> 

Thank you for clarifying it, I will send the patch independently.

> As I said that needs more thought and I really can't tell when I have a
> time slot to look at that.
> 

In the meantime, I am willing to volunteer to try to do what you said
in the previous reply. May I?


Thanks
	dou


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-06 10:14                               ` Dou Liyang
@ 2018-09-06 11:46                                 ` Thomas Gleixner
  2018-09-11  9:13                                   ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2018-09-06 11:46 UTC (permalink / raw)
  To: Dou Liyang
  Cc: Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei,
	Christoph Hellwig, Linux Kernel Mailing List,
	Shivasharan Srikanteshwara, linux-block, Dou Liyang

On Thu, 6 Sep 2018, Dou Liyang wrote:
> At 09/05/2018 06:38 PM, Thomas Gleixner wrote:
> > Oh well. This was not intended to magically provide the solution you want
> > to have. It merily changed the behaviour of the managed interrupt
> > selection, which is a valid thing to do independent of the stuff you want
> > to see.
> > 
> 
> Thank you for clarifying it, I will send the patch independently.
> 
> > As I said that needs more thought and I really can't tell when I have a
> > time slot to look at that.
> > 
> 
> In this period, I am willing to be a volunteer to try to do that you
> said in the previous reply. May I?

You don't have to ask for permission. It's Open Source :)

There are a few things we need to clarify upfront:

Right now the pre and post vectors are marked managed and their
affinity mask is set to the irq default affinity mask.

The default affinity mask is by default ALL cpus, but it can be tweaked
both on the kernel command line and via proc.

If that mask is only a subset of CPUs and all of them go offline
then these vectors are shutdown in managed mode.

That means we need to set the affinity mask of the pre and post vectors to
possible mask, but that doesn't make much sense either, unless there is a
reason to have them marked managed.

I think the right solution for these pre/post vectors is to _NOT_ mark
them managed and leave them as regular interrupts which can be affinity
controlled and also can move freely on hotplug.

Christoph?

Thanks,

	Thomas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-06 11:46                                 ` Thomas Gleixner
@ 2018-09-11  9:13                                   ` Christoph Hellwig
  2018-09-11  9:38                                     ` Dou Liyang
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2018-09-11  9:13 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dou Liyang, Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei,
	Christoph Hellwig, Linux Kernel Mailing List,
	Shivasharan Srikanteshwara, linux-block, Dou Liyang

On Thu, Sep 06, 2018 at 01:46:46PM +0200, Thomas Gleixner wrote:
> There are a few things we need to clarify upfront:
> 
> Right now the pre and post vectors are marked managed and their
> affinity mask is set to the irq default affinity mask.
> 
> The default affinity mask is by default ALL cpus, but it can be tweaked
> both on the kernel command line and via proc.
> 
> If that mask is only a subset of CPUs and all of them go offline
> then these vectors are shutdown in managed mode.
> 
> That means we need to set the affinity mask of the pre and post vectors to
> possible mask, but that doesn't make much sense either, unless there is a
> reason to have them marked managed.
> 
> I think the right solution for these pre/post vectors is to _NOT_ mark
> them managed and leave them as regular interrupts which can be affinity
> controlled and also can move freely on hotplug.

Yes, agreed.  Marking the pre/post vector as managed was a mistake
(and I don't think it even was intentional, at least on my part).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-08-29 10:46   ` Sumit Saxena
  2018-08-30 17:15     ` Kashyap Desai
  2018-08-31  6:54     ` Ming Lei
@ 2018-09-11  9:21     ` Christoph Hellwig
  2018-09-11  9:54       ` Kashyap Desai
  2 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2018-09-11  9:21 UTC (permalink / raw)
  To: Sumit Saxena
  Cc: Ming Lei, tglx, hch, linux-kernel, Kashyap Desai,
	Shivasharan Srikanteshwara

On Wed, Aug 29, 2018 at 04:16:23PM +0530, Sumit Saxena wrote:
> > Could you explain a bit what the specific use case for the extra 16
> > vectors is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For the regular 72 reply queues, we will not coalesce interrupts,
> since for a low IO workload interrupt coalescing may take more time
> due to fewer IO completions.
> In IO submission path, driver will decide which set of reply queues
> (either extra 16 reply queues or regular 72 reply queues) to be picked
> based on IO workload.

The point I don't get here is why you need separate reply queues for
the interrupt coalesce setting.  Shouldn't this just be a flag at
submission time that indicates the amount of coalescing that should
happen?

What is the benefit of having different completion queues?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-08-31 22:48             ` Thomas Gleixner
  2018-08-31 23:37               ` Kashyap Desai
@ 2018-09-11  9:22               ` Christoph Hellwig
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2018-09-11  9:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei,
	Christoph Hellwig, Linux Kernel Mailing List,
	Shivasharan Srikanteshwara, linux-block

On Sat, Sep 01, 2018 at 12:48:46AM +0200, Thomas Gleixner wrote:
> > We want some changes in the current API which can allow us to pass
> > flags (like *local numa affinity*), so that the cpu-msix mapping is
> > from the local numa node and the effective cpus are spread across
> > the local numa node.
> 
> What you really want is to split the vector space for your device into two
> blocks. One for the regular per cpu queues and the other (16 or how many
> ever) which are managed separately, i.e. spread out evenly. That needs some
> extensions to the core allocation/management code, but that shouldn't be a
> huge problem.

Note that there are some other use cases for multiple sets of affinity
managed irqs.  Various network devices insist on having separate TX vs
RX interrupts for example.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-11  9:13                                   ` Christoph Hellwig
@ 2018-09-11  9:38                                     ` Dou Liyang
  0 siblings, 0 replies; 28+ messages in thread
From: Dou Liyang @ 2018-09-11  9:38 UTC (permalink / raw)
  To: Christoph Hellwig, Thomas Gleixner
  Cc: Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara,
	linux-block, Dou Liyang

Hi,
At 09/11/2018 05:13 PM, Christoph Hellwig wrote:
> On Thu, Sep 06, 2018 at 01:46:46PM +0200, Thomas Gleixner wrote:
>>
>> I think the right solution for these pre/post vectors is to _NOT_ mark
>> them managed and leave them as regular interrupts which can be affinity
>> controlled and also can move freely on hotplug.
> 
> Yes, agreed.  Marking the pre/post vector as managed was a mistake
> (and I don't think it even was intentional, at least on my part).
> 
Got it !

And, I am trying to fix this by:

  -Don't set affinity for pre/post vectors in
   irq_create_affinity_masks().

  -And do not setup the desc->affinity of pre/post vectors in
   alloc_msi_entry().

So, the affinity in alloc_descs() will be NULL, and the interrupt won't
be marked as IRQD_AFFINITY_MANAGED.

Is that OK? I will post the code after testing it.

Thanks,
	dou


^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-11  9:21     ` Christoph Hellwig
@ 2018-09-11  9:54       ` Kashyap Desai
  0 siblings, 0 replies; 28+ messages in thread
From: Kashyap Desai @ 2018-09-11  9:54 UTC (permalink / raw)
  To: Christoph Hellwig, Sumit Saxena
  Cc: Ming Lei, tglx, linux-kernel, Shivasharan Srikanteshwara

>
> The point I don't get here is why you need separate reply queues for
> the interrupt coalesce setting.  Shouldn't this just be a flag at
> submission time that indicates the amount of coalescing that should
> happen?
>
> What is the benefit of having different completion queues?

By having different sets of queues (it will be something like N:16,
where N queues are without interrupt coalescing and 16 are dedicated
queues for interrupt coalescing) we want to avoid the penalty
introduced by interrupt coalescing, especially for lower QD profiles.

Kashyap

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Affinity managed interrupts vs non-managed interrupts
@ 2018-08-28  6:47 Sumit Saxena
  0 siblings, 0 replies; 28+ messages in thread
From: Sumit Saxena @ 2018-08-28  6:47 UTC (permalink / raw)
  To: tglx; +Cc: Ming Lei, hch, linux-kernel

Hi Thomas,

We are working on next generation MegaRAID product where requirement is-
to allocate additional 16 MSI-x vectors in addition to number of MSI-x
vectors megaraid_sas driver usually allocates.  MegaRAID adapter supports
128 MSI-x vectors.

To explain the requirement and solution, consider that we have 2 socket
system (each socket having 36 logical CPUs). Current driver will allocate
total 72 MSI-x vectors by calling API- pci_alloc_irq_vectors(with flag-
PCI_IRQ_AFFINITY).  All 72 MSI-x vectors will have affinity across NUMA
nodes and interrupts are affinity managed.

If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = 16
and, driver can allocate 16 + 72 MSI-x vectors.
All pre_vectors (16) will be mapped to all available online CPUs but
effective affinity of each vector is to CPU 0. Our requirement is to have
pre_vectors 16 reply queues to be mapped to local NUMA node with effective
CPU should be spread within local node cpu mask. Without changing kernel
code, we can achieve this by driver calling pci_enable_msix_range()
(requesting to allocate 16 + 72 MSI-x vectors) instead of
pci_alloc_irq_vectors() API. If we use pci_enable_msix_range(), it also
requires MSI-x to CPU affinity handled by driver and these interrupts will
be non-managed.
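
For illustration, the managed variant of that allocation is roughly the
sketch below (error handling trimmed; the helper function is
hypothetical, the calls are the standard PCI/genirq API):

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical helper for the 16 + 72 vector layout described above. */
static int alloc_reply_queue_vectors(struct pci_dev *pdev)
{
	/* the first 16 vectors are excluded from the managed spreading */
	struct irq_affinity affd = {
		.pre_vectors	= 16,
	};

	/* the remaining 72 vectors are affinity managed by the core */
	return pci_alloc_irq_vectors_affinity(pdev, 1, 16 + 72,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}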

Question is-
Is there any restriction on, or preference for, using
pci_alloc_irq_vectors{/_affinity} vs pci_enable_msix_range in a low
level driver?
If the driver uses non-managed interrupts, all cases are handled
correctly through irqbalance. Is there any plan in future to migrate
to managed interrupts entirely, or is it a choice-based call for
driver maintainers?

Thanks,
Sumit

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2018-09-11  9:54 UTC | newest]

Thread overview: 28+ messages
-- links below jump to the message on this page --
     [not found] <eccc46e12890a1d033d9003837012502@mail.gmail.com>
2018-08-29  8:46 ` Affinity managed interrupts vs non-managed interrupts Ming Lei
2018-08-29 10:46   ` Sumit Saxena
2018-08-30 17:15     ` Kashyap Desai
2018-08-31  6:54     ` Ming Lei
2018-08-31  7:50       ` Kashyap Desai
2018-08-31 20:24         ` Thomas Gleixner
2018-08-31 21:49           ` Kashyap Desai
2018-08-31 22:48             ` Thomas Gleixner
2018-08-31 23:37               ` Kashyap Desai
2018-09-02 12:02                 ` Thomas Gleixner
2018-09-03  5:34                   ` Kashyap Desai
2018-09-03 16:28                     ` Thomas Gleixner
2018-09-04 10:29                       ` Kashyap Desai
2018-09-05  5:46                         ` Dou Liyang
2018-09-05  9:45                           ` Kashyap Desai
2018-09-05 10:38                             ` Thomas Gleixner
2018-09-06 10:14                               ` Dou Liyang
2018-09-06 11:46                                 ` Thomas Gleixner
2018-09-11  9:13                                   ` Christoph Hellwig
2018-09-11  9:38                                     ` Dou Liyang
2018-09-11  9:22               ` Christoph Hellwig
2018-09-03  2:13         ` Ming Lei
2018-09-03  6:10           ` Kashyap Desai
2018-09-03  9:21             ` Ming Lei
2018-09-03  9:50               ` Kashyap Desai
2018-09-11  9:21     ` Christoph Hellwig
2018-09-11  9:54       ` Kashyap Desai
2018-08-28  6:47 Sumit Saxena
