* Re: Affinity managed interrupts vs non-managed interrupts [not found] <eccc46e12890a1d033d9003837012502@mail.gmail.com> @ 2018-08-29 8:46 ` Ming Lei 2018-08-29 10:46 ` Sumit Saxena 0 siblings, 1 reply; 49+ messages in thread From: Ming Lei @ 2018-08-29 8:46 UTC (permalink / raw) To: Sumit Saxena; +Cc: tglx, hch, linux-kernel Hello Sumit, On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote: > Affinity managed interrupts vs non-managed interrupts > > Hi Thomas, > > We are working on a next generation MegaRAID product where the requirement is to allocate an additional 16 MSI-x vectors on top of the MSI-x vectors the megaraid_sas driver usually allocates. The MegaRAID adapter supports 128 MSI-x vectors. > > To explain the requirement and solution, consider that we have a 2 socket system (each socket having 36 logical CPUs). The current driver will allocate a total of 72 MSI-x vectors by calling pci_alloc_irq_vectors() with the PCI_IRQ_AFFINITY flag. All 72 MSI-x vectors will have affinity across NUMA nodes and the interrupts are affinity managed. > > If the driver calls pci_alloc_irq_vectors_affinity() with pre_vectors = 16, the driver can allocate 16 + 72 MSI-x vectors. Could you explain a bit what the specific use case for the extra 16 vectors is? > > All pre_vectors (16) will be mapped to all available online CPUs but the effective affinity of each vector is to CPU 0. Our requirement is to have the pre_vectors 16 reply queues mapped to the local NUMA node, with the effective CPUs spread within the local node's cpu mask. Without changing kernel code, we can If all CPUs in one NUMA node are offline, can this use case work as expected? Seems we have to understand what the use case is and how it works. Thanks, Ming ^ permalink raw reply [flat|nested] 49+ messages in thread
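(Editorial aside: the allocation Sumit describes maps onto the kernel's pci_alloc_irq_vectors_affinity() roughly as in the sketch below. This is a sketch only, using the vector counts from the example in this thread; `pdev` stands for the adapter's `struct pci_dev`, and most error handling is omitted.)

```c
/* Sketch: 16 pre_vectors (excluded from affinity spreading) plus up to
 * 72 affinity-managed vectors, per the example in this thread. */
struct irq_affinity desc = {
	.pre_vectors = 16,
};
int nvec;

nvec = pci_alloc_irq_vectors_affinity(pdev, 16 + 1, 16 + 72,
				      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
				      &desc);
if (nvec < 0)
	return nvec;
```

The pre_vectors are not affinity managed, which is exactly why their effective affinity ends up on CPU 0 as described above.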
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-29 8:46 ` Affinity managed interrupts vs non-managed interrupts Ming Lei @ 2018-08-29 10:46 ` Sumit Saxena 2018-08-30 17:15 ` Kashyap Desai ` (2 more replies) 0 siblings, 3 replies; 49+ messages in thread From: Sumit Saxena @ 2018-08-29 10:46 UTC (permalink / raw) To: Ming Lei Cc: tglx, hch, linux-kernel, Kashyap Desai, Shivasharan Srikanteshwara > -----Original Message----- > From: Ming Lei [mailto:ming.lei@redhat.com] > Sent: Wednesday, August 29, 2018 2:16 PM > To: Sumit Saxena <sumit.saxena@broadcom.com> > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org > Subject: Re: Affinity managed interrupts vs non-managed interrupts > > Hello Sumit, Hi Ming, Thanks for response. > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote: > > Affinity managed interrupts vs non-managed interrupts > > > > Hi Thomas, > > > > We are working on next generation MegaRAID product where requirement > > is- to allocate additional 16 MSI-x vectors in addition to number of > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID adapter > > supports 128 MSI-x vectors. > > > > To explain the requirement and solution, consider that we have 2 > > socket system (each socket having 36 logical CPUs). Current driver > > will allocate total 72 MSI-x vectors by calling API- > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x > > vectors will have affinity across NUMA node s and interrupts are affinity > managed. > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = > > 16 and, driver can allocate 16 + 72 MSI-x vectors. > > Could you explain a bit what the specific use case the extra 16 vectors is? We are trying to avoid the penalty due to one interrupt per IO completion and decided to coalesce interrupts on these extra 16 reply queues. 
For the regular 72 reply queues, we will not coalesce interrupts, since for a low IO workload interrupt coalescing may add latency due to fewer IO completions. In the IO submission path, the driver will decide which set of reply queues (either the extra 16 reply queues or the regular 72 reply queues) to pick based on the IO workload. > > > > > All pre_vectors (16) will be mapped to all available online CPUs but > > effective affinity of each vector is to CPU 0. Our requirement is to > > have pre_vectors 16 reply queues to be mapped to local NUMA node with > > effective CPU should be spread within local node cpu mask. Without > > changing kernel code, we can > > If all CPUs in one NUMA node are offline, can this use case work as expected? > Seems we have to understand what the use case is and how it works. Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU affinity will be broken and irqbalance takes care of migrating the affected IRQs to online CPUs of a different NUMA node. When the offline CPUs are onlined again, irqbalance restores the affinity. > > > Thanks, > Ming ^ permalink raw reply [flat|nested] 49+ messages in thread
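(Editorial aside: the submission-path choice Sumit describes — regular queues at low queue depth, coalesced queues at high queue depth — could be prototyped with a heuristic like the one below. This is purely illustrative: the function name and the threshold are hypothetical, not from the megaraid_sas driver.)

```c
#include <assert.h>

/* Hypothetical cut-over point: below this many outstanding commands on
 * the device, coalescing would only add completion latency. */
#define COALESCE_IO_THRESHOLD 8

enum reply_queue_set { REGULAR_QUEUES, COALESCED_QUEUES };

/* Pick a reply-queue set from the per-sdev outstanding command count. */
static enum reply_queue_set pick_reply_queue_set(unsigned int outstanding)
{
	return outstanding >= COALESCE_IO_THRESHOLD ?
		COALESCED_QUEUES : REGULAR_QUEUES;
}
```

In a real driver the input would be something like per-sdev outstanding or shost_busy (as discussed later in the thread), and the threshold would need tuning per workload.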
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-29 10:46 ` Sumit Saxena @ 2018-08-30 17:15 ` Kashyap Desai 2018-08-31 6:54 ` Ming Lei 2018-09-11 9:21 ` Christoph Hellwig 2 siblings, 0 replies; 49+ messages in thread From: Kashyap Desai @ 2018-08-30 17:15 UTC (permalink / raw) To: Sumit Saxena, Ming Lei Cc: tglx, hch, linux-kernel, Shivasharan Srikanteshwara Hi Thomas, Ming, Chris et all, Your input will help us to do changes for megaraid_sas driver. We are currently waiting for community response. Is it recommended to use " pci_enable_msix_range" and have low level driver do affinity setting because current APIs around pci_alloc_irq_vectors do not meet our requirement. We want more than online CPU msix vectors and using pre_vector we can do that, but first 16 msix should be mapped to local numa node with effective cpu spread across cpus of local numa node. This is not possible using pci_alloc_irq_vectors_affinity. Do we need kernel API changes or let's have low level driver to manage it via irq_set_affinity_hint ? Kashyap > -----Original Message----- > From: Sumit Saxena [mailto:sumit.saxena@broadcom.com] > Sent: Wednesday, August 29, 2018 4:46 AM > To: Ming Lei > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org; Kashyap > Desai; Shivasharan Srikanteshwara > Subject: RE: Affinity managed interrupts vs non-managed interrupts > > > -----Original Message----- > > From: Ming Lei [mailto:ming.lei@redhat.com] > > Sent: Wednesday, August 29, 2018 2:16 PM > > To: Sumit Saxena <sumit.saxena@broadcom.com> > > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org > > Subject: Re: Affinity managed interrupts vs non-managed interrupts > > > > Hello Sumit, > Hi Ming, > Thanks for response. 
> > > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote: > > > Affinity managed interrupts vs non-managed interrupts > > > > > > Hi Thomas, > > > > > > We are working on next generation MegaRAID product where requirement > > > is- to allocate additional 16 MSI-x vectors in addition to number of > > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID adapter > > > supports 128 MSI-x vectors. > > > > > > To explain the requirement and solution, consider that we have 2 > > > socket system (each socket having 36 logical CPUs). Current driver > > > will allocate total 72 MSI-x vectors by calling API- > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x > > > vectors will have affinity across NUMA node s and interrupts are > affinity > > managed. > > > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = > > > 16 and, driver can allocate 16 + 72 MSI-x vectors. > > > > Could you explain a bit what the specific use case the extra 16 vectors > is? > We are trying to avoid the penalty due to one interrupt per IO completion > and decided to coalesce interrupts on these extra 16 reply queues. > For regular 72 reply queues, we will not coalesce interrupts as for low IO > workload, interrupt coalescing may take more time due to less IO > completions. > In IO submission path, driver will decide which set of reply queues > (either extra 16 reply queues or regular 72 reply queues) to be picked > based on IO workload. > > > > > > > > All pre_vectors (16) will be mapped to all available online CPUs but e > > > ffective affinity of each vector is to CPU 0. Our requirement is to > > > have pre _vectors 16 reply queues to be mapped to local NUMA node with > > > effective CPU should be spread within local node cpu mask. Without > > > changing kernel code, we can > > > > If all CPUs in one NUMA node is offline, can this use case work as > expected? > > Seems we have to understand what the use case is and how it works. 
> > Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be > broken and irqbalancer takes care of migrating affected IRQs to online > CPUs of different NUMA node. > When offline CPUs are onlined again, irqbalancer restores affinity. > > > > > > Thanks, > > Ming ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: Affinity managed interrupts vs non-managed interrupts 2018-08-29 10:46 ` Sumit Saxena 2018-08-30 17:15 ` Kashyap Desai @ 2018-08-31 6:54 ` Ming Lei 2018-08-31 7:50 ` Kashyap Desai 2018-09-11 9:21 ` Christoph Hellwig 2 siblings, 1 reply; 49+ messages in thread From: Ming Lei @ 2018-08-31 6:54 UTC (permalink / raw) To: sumit.saxena Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig, Linux Kernel Mailing List, Kashyap Desai, shivasharan.srikanteshwara, linux-block On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena <sumit.saxena@broadcom.com> wrote: > > > -----Original Message----- > > From: Ming Lei [mailto:ming.lei@redhat.com] > > Sent: Wednesday, August 29, 2018 2:16 PM > > To: Sumit Saxena <sumit.saxena@broadcom.com> > > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org > > Subject: Re: Affinity managed interrupts vs non-managed interrupts > > > > Hello Sumit, > Hi Ming, > Thanks for response. > > > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote: > > > Affinity managed interrupts vs non-managed interrupts > > > > > > Hi Thomas, > > > > > > We are working on next generation MegaRAID product where requirement > > > is- to allocate additional 16 MSI-x vectors in addition to number of > > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID adapter > > > supports 128 MSI-x vectors. > > > > > > To explain the requirement and solution, consider that we have 2 > > > socket system (each socket having 36 logical CPUs). Current driver > > > will allocate total 72 MSI-x vectors by calling API- > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x > > > vectors will have affinity across NUMA node s and interrupts are > affinity > > managed. > > > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = > > > 16 and, driver can allocate 16 + 72 MSI-x vectors. > > > > Could you explain a bit what the specific use case the extra 16 vectors > is? 
> We are trying to avoid the penalty due to one interrupt per IO completion > and decided to coalesce interrupts on these extra 16 reply queues. > For regular 72 reply queues, we will not coalesce interrupts as for low IO > workload, interrupt coalescing may take more time due to less IO > completions. > In IO submission path, driver will decide which set of reply queues > (either extra 16 reply queues or regular 72 reply queues) to be picked > based on IO workload. I am just wondering how you can make the decision about using extra 16 or regular 72 queues in submission path, could you share us a bit your idea? How are you going to recognize the IO workload inside your driver? Even the current block layer doesn't recognize IO workload, such as random IO or sequential IO. Frankly speaking, you may reuse the 72 reply queues to do interrupt coalescing by configuring one extra register to enable the coalescing mode, and you may just use small part of the 72 reply queues under the interrupt coalescing mode. Or you can learn from SPDK to use one or small number of dedicated cores or kernel threads to poll the interrupts from all reply queues, then I guess you may benefit much compared with the extra 16 queue approach. Introducing extra 16 queues just for interrupt coalescing and making it coexisting with the regular 72 reply queues seems one very unusual use case, not sure the current genirq affinity can support it well. > > > > > > > > All pre_vectors (16) will be mapped to all available online CPUs but e > > > ffective affinity of each vector is to CPU 0. Our requirement is to > > > have pre _vectors 16 reply queues to be mapped to local NUMA node with > > > effective CPU should be spread within local node cpu mask. Without > > > changing kernel code, we can > > > > If all CPUs in one NUMA node is offline, can this use case work as > expected? > > Seems we have to understand what the use case is and how it works. 
> > Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be > broken and irqbalancer takes care of migrating affected IRQs to online > CPUs of different NUMA node. > When offline CPUs are onlined again, irqbalancer restores affinity. irqbalance daemon can't cover managed interrupts, or you mean you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)? Thanks, Ming Lei ^ permalink raw reply [flat|nested] 49+ messages in thread
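(Editorial aside: the polling direction Ming mentions can be prototyped with the kernel's irq_poll library, the successor of blk-iopoll. The sketch below assumes one irq_poll instance per reply queue; the `rq` structure and handler name are hypothetical.)

```c
/* Sketch: softirq-context polling of a reply queue via irq_poll. */
irq_poll_init(&rq->iopoll, budget, example_iopoll_handler);

/* In the MSI-X hard handler: mask the queue's interrupt, then: */
irq_poll_sched(&rq->iopoll);

/* In example_iopoll_handler(), once fewer than 'budget' completions
 * were drained, re-enable the interrupt and call: */
irq_poll_complete(&rq->iopoll);
```

As Kashyap notes later in the thread, this approach trades latency predictability at low queue depth for reduced interrupt load at high queue depth.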
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-31 6:54 ` Ming Lei @ 2018-08-31 7:50 ` Kashyap Desai 0 siblings, 0 replies; 49+ messages in thread From: Kashyap Desai @ 2018-08-31 7:50 UTC (permalink / raw) To: Ming Lei, Sumit Saxena Cc: Ming Lei, Thomas Gleixner, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block > -----Original Message----- > From: Ming Lei [mailto:tom.leiming@gmail.com] > Sent: Friday, August 31, 2018 12:54 AM > To: sumit.saxena@broadcom.com > Cc: Ming Lei; Thomas Gleixner; Christoph Hellwig; Linux Kernel Mailing > List; > Kashyap Desai; shivasharan.srikanteshwara@broadcom.com; linux-block > Subject: Re: Affinity managed interrupts vs non-managed interrupts > > On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena > <sumit.saxena@broadcom.com> wrote: > > > > > -----Original Message----- > > > From: Ming Lei [mailto:ming.lei@redhat.com] > > > Sent: Wednesday, August 29, 2018 2:16 PM > > > To: Sumit Saxena <sumit.saxena@broadcom.com> > > > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org > > > Subject: Re: Affinity managed interrupts vs non-managed interrupts > > > > > > Hello Sumit, > > Hi Ming, > > Thanks for response. > > > > > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote: > > > > Affinity managed interrupts vs non-managed interrupts > > > > > > > > Hi Thomas, > > > > > > > > We are working on next generation MegaRAID product where > requirement > > > > is- to allocate additional 16 MSI-x vectors in addition to number of > > > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID > > > > adapter > > > > supports 128 MSI-x vectors. > > > > > > > > To explain the requirement and solution, consider that we have 2 > > > > socket system (each socket having 36 logical CPUs). Current driver > > > > will allocate total 72 MSI-x vectors by calling API- > > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). 
All 72 MSI-x > > > > vectors will have affinity across NUMA nodes and interrupts are > > affinity > > > managed. > > > > > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = > > > > 16 and, driver can allocate 16 + 72 MSI-x vectors. > > > > > > Could you explain a bit what the specific use case the extra 16 > > > vectors > > is? > > We are trying to avoid the penalty due to one interrupt per IO > > completion > > and decided to coalesce interrupts on these extra 16 reply queues. > > For regular 72 reply queues, we will not coalesce interrupts as for low > > IO > > workload, interrupt coalescing may take more time due to less IO > > completions. > > In IO submission path, driver will decide which set of reply queues > > (either extra 16 reply queues or regular 72 reply queues) to be picked > > based on IO workload. > > I am just wondering how you can make the decision about using extra > 16 or regular 72 queues in submission path, could you share us a bit > your idea? How are you going to recognize the IO workload inside your > driver? Even the current block layer doesn't recognize IO workload, such > as random IO or sequential IO. It is not yet finalized, but it can be based on per-sdev outstanding, shost_busy etc. We want to use the special 16 reply queues for IO acceleration (these queues work in interrupt coalescing mode; this is a h/w feature). > > Frankly speaking, you may reuse the 72 reply queues to do interrupt > coalescing by configuring one extra register to enable the coalescing > mode, > and you may just use small part of the 72 reply queues under the > interrupt coalescing mode. Our h/w can set interrupt coalescing per 8 reply queues, so the smallest granularity is 8. If we choose to take 8 reply queues from the existing 72 reply queues (without asking for extra reply queues), we still have an issue on systems with more NUMA nodes. Example - on an 8 NUMA node system, each node will have only *one* reply queue for effective interrupt coalescing. 
(since irq subsystem will spread msix per numa). To keep things scalable we cherry picked few reply queues and wanted them to be out of cpu-msix mapping. > > Or you can learn from SPDK to use one or small number of dedicated cores > or kernel threads to poll the interrupts from all reply queues, then I > guess you may benefit much compared with the extra 16 queue approach. Problem with polling - It requires some steady completion, otherwise prediction in driver gives different results on different profiles. We attempted irq-poll and thread ISR based polling, but it has pros and cons. One of the key usage of method what we are trying is not to impact latency for lower QD workloads. I posted RFC at https://www.spinics.net/lists/linux-scsi/msg122874.html We have done extensive study and concluded to use interrupt coalescing is better if h/w can manage two different modes (coalescing on/off). > > Introducing extra 16 queues just for interrupt coalescing and making it > coexisting with the regular 72 reply queues seems one very unusual use > case, not sure the current genirq affinity can support it well. Yes. This is unusual case. I think it is not used by any other drivers. > > > > > > > > > > > > All pre_vectors (16) will be mapped to all available online CPUs but > > > > e > > > > ffective affinity of each vector is to CPU 0. Our requirement is to > > > > have pre _vectors 16 reply queues to be mapped to local NUMA node > with > > > > effective CPU should be spread within local node cpu mask. Without > > > > changing kernel code, we can > > > > > > If all CPUs in one NUMA node is offline, can this use case work as > > expected? > > > Seems we have to understand what the use case is and how it works. > > > > Yes, if all CPUs of the NUMA node is offlined, IRQ-CPU affinity will be > > broken and irqbalancer takes care of migrating affected IRQs to online > > CPUs of different NUMA node. > > When offline CPUs are onlined again, irqbalancer restores affinity. 
> > irqbalance daemon can't cover managed interrupts, or you mean > you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)? Yes. We did not used " pci_alloc_irq_vectors_affinity". We used " pci_enable_msix_range" and manually set affinity in driver using irq_set_affinity_hint. > > Thanks, > Ming Lei ^ permalink raw reply [flat|nested] 49+ messages in thread
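(Editorial aside: the non-managed scheme Kashyap describes — pci_enable_msix_range() plus irq_set_affinity_hint() — would look roughly like the sketch below: hint the first 16 vectors toward the adapter's local NUMA node. The `msix_entries` array is a placeholder; real code must pick distinct CPUs per vector and check for errors.)

```c
/* Sketch of the manual, non-managed affinity setup described above. */
int i, nvec;

nvec = pci_enable_msix_range(pdev, msix_entries, 1, 16 + 72);
for (i = 0; i < nvec && i < 16; i++)
	irq_set_affinity_hint(msix_entries[i].vector,
			      cpumask_of_node(dev_to_node(&pdev->dev)));
```

Because these vectors are not managed, irqbalance can move them and restore the hinted placement across CPU hotplug, which is the behavior Sumit described earlier in the thread.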
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-31 7:50 ` Kashyap Desai @ 2018-08-31 20:24 ` Thomas Gleixner -1 siblings, 0 replies; 49+ messages in thread From: Thomas Gleixner @ 2018-08-31 20:24 UTC (permalink / raw) To: Kashyap Desai Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block On Fri, 31 Aug 2018, Kashyap Desai wrote: > > From: Ming Lei [mailto:tom.leiming@gmail.com] > > Sent: Friday, August 31, 2018 12:54 AM > > To: sumit.saxena@broadcom.com > > Cc: Ming Lei; Thomas Gleixner; Christoph Hellwig; Linux Kernel Mailing > > List; > > Kashyap Desai; shivasharan.srikanteshwara@broadcom.com; linux-block > > Subject: Re: Affinity managed interrupts vs non-managed interrupts Can you please teach your mail client NOT to insert the whole useless mail header? > > On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena > > <sumit.saxena@broadcom.com> wrote: > > > > > We are working on next generation MegaRAID product where > > requirement > > > > > is- to allocate additional 16 MSI-x vectors in addition to number of > > > > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID > > > > > adapter > > > > > supports 128 MSI-x vectors. > > > > > > > > > > To explain the requirement and solution, consider that we have 2 > > > > > socket system (each socket having 36 logical CPUs). Current driver > > > > > will allocate total 72 MSI-x vectors by calling API- > > > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x > > > > > vectors will have affinity across NUMA node s and interrupts are > > > affinity > > > > managed. > > > > > > > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = > > > > > 16 and, driver can allocate 16 + 72 MSI-x vectors. > > > > > > > > Could you explain a bit what the specific use case the extra 16 > > > > vectors > > > is? 
> > > We are trying to avoid the penalty due to one interrupt per IO > > > completion > > > and decided to coalesce interrupts on these extra 16 reply queues. > > > For regular 72 reply queues, we will not coalesce interrupts as for low > > > IO > > > workload, interrupt coalescing may take more time due to less IO > > > completions. > > > In IO submission path, driver will decide which set of reply queues > > > (either extra 16 reply queues or regular 72 reply queues) to be picked > > > based on IO workload. > > > > I am just wondering how you can make the decision about using extra > > 16 or regular 72 queues in submission path, could you share us a bit > > your idea? How are you going to recognize the IO workload inside your > > driver? Even the current block layer doesn't recognize IO workload, such > > as random IO or sequential IO. > > It is not yet finalized, but it can be based on per sdev outstanding, > shost_busy etc. > We want to use special 16 reply queue for IO acceleration (these queues are > working interrupt coalescing mode. This is a h/w feature) TBH, this does not make any sense whatsoever. Why are you trying to have extra interrupts for coalescing instead of doing the following: 1) Allocate 72 reply queues which get nicely spread out to every CPU on the system with affinity spreading. 2) Have a configuration for your reply queues which allows them to be grouped, e.g. by physical package. 3) Have a mechanism to mark a reply queue offline/online and handle that on CPU hotplug. That means on unplug you have to wait for the reply queue which is associated to the outgoing CPU to be empty and no new requests to be queued, which has to be done for the regular per CPU reply queues anyway. 4) On queueing the request, flag it 'coalescing' which causes the hardware/firmware to direct the reply to the first online reply queue in the group. 
If the last CPU of a group goes offline, then the normal hotplug mechanism takes effect and the whole thing is put 'offline' as well. This works nicely for all kind of scenarios even if you have more CPUs than queues. No extras, no magic affinity hints, it just works. Hmm? > Yes. We did not used " pci_alloc_irq_vectors_affinity". > We used " pci_enable_msix_range" and manually set affinity in driver using > irq_set_affinity_hint. I still regret the day when I merged that abomination. Thanks, tglx ^ permalink raw reply [flat|nested] 49+ messages in thread
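(Editorial aside: step 4 of Thomas's scheme — steer a 'coalescing' reply to the first online queue of a group, with hotplug just flipping an online bit — can be sketched as plain logic. The names and group size are illustrative, not an existing kernel API.)

```c
#include <assert.h>

#define QUEUES_PER_GROUP 9	/* e.g. 72 queues over 8 packages; assumed */

struct queue_group {
	unsigned int online_mask;	/* bit i set => queue i of the group is online */
};

/* First online reply queue of the group, or -1 if the whole group is offline. */
static int first_online_queue(const struct queue_group *g)
{
	int i;

	for (i = 0; i < QUEUES_PER_GROUP; i++)
		if (g->online_mask & (1u << i))
			return i;
	return -1;
}
```

A return of -1 corresponds to the case Thomas describes where the last CPU of a group goes offline and the whole group is taken offline by the normal hotplug path.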
* RE: Affinity managed interrupts vs non-managed interrupts @ 2018-08-31 20:24 ` Thomas Gleixner 0 siblings, 0 replies; 49+ messages in thread From: Thomas Gleixner @ 2018-08-31 20:24 UTC (permalink / raw) To: Kashyap Desai Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block On Fri, 31 Aug 2018, Kashyap Desai wrote: > > From: Ming Lei [mailto:tom.leiming@gmail.com] > > Sent: Friday, August 31, 2018 12:54 AM > > To: sumit.saxena@broadcom.com > > Cc: Ming Lei; Thomas Gleixner; Christoph Hellwig; Linux Kernel Mailing > > List; > > Kashyap Desai; shivasharan.srikanteshwara@broadcom.com; linux-block > > Subject: Re: Affinity managed interrupts vs non-managed interrupts Can you please teach your mail client NOT to insert the whole useless mail header? > > On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena > > <sumit.saxena@broadcom.com> wrote: > > > > > We are working on next generation MegaRAID product where > > requirement > > > > > is- to allocate additional 16 MSI-x vectors in addition to number of > > > > > MSI-x vectors megaraid_sas driver usually allocates. MegaRAID > > > > > adapter > > > > > supports 128 MSI-x vectors. > > > > > > > > > > To explain the requirement and solution, consider that we have 2 > > > > > socket system (each socket having 36 logical CPUs). Current driver > > > > > will allocate total 72 MSI-x vectors by calling API- > > > > > pci_alloc_irq_vectors(with flag- PCI_IRQ_AFFINITY). All 72 MSI-x > > > > > vectors will have affinity across NUMA node s and interrupts are > > > affinity > > > > managed. > > > > > > > > > > If driver calls- pci_alloc_irq_vectors_affinity() with pre_vectors = > > > > > 16 and, driver can allocate 16 + 72 MSI-x vectors. > > > > > > > > Could you explain a bit what the specific use case the extra 16 > > > > vectors > > > is? 
> > > We are trying to avoid the penalty due to one interrupt per IO > > > completion > > > and decided to coalesce interrupts on these extra 16 reply queues. > > > For regular 72 reply queues, we will not coalesce interrupts as for low > > > IO > > > workload, interrupt coalescing may take more time due to less IO > > > completions. > > > In IO submission path, driver will decide which set of reply queues > > > (either extra 16 reply queues or regular 72 reply queues) to be picked > > > based on IO workload. > > > > I am just wondering how you can make the decision about using extra > > 16 or regular 72 queues in submission path, could you share us a bit > > your idea? How are you going to recognize the IO workload inside your > > driver? Even the current block layer doesn't recognize IO workload, such > > as random IO or sequential IO. > > It is not yet finalized, but it can be based on per sdev outstanding, > shost_busy etc. > We want to use special 16 reply queue for IO acceleration (these queues are > working interrupt coalescing mode. This is a h/w feature) TBH, this does not make any sense whatsoever. Why are you trying to have extra interrupts for coalescing instead of doing the following: 1) Allocate 72 reply queues which get nicely spread out to every CPU on the system with affinity spreading. 2) Have a configuration for your reply queues which allows them to be grouped, e.g. by phsyical package. 3) Have a mechanism to mark a reply queue offline/online and handle that on CPU hotplug. That means on unplug you have to wait for the reply queue which is associated to the outgoing CPU to be empty and no new requests to be queued, which has to be done for the regular per CPU reply queues anyway. 4) On queueing the request, flag it 'coalescing' which causes the hard/firmware to direct the reply to the first online reply queue in the group. 
If the last CPU of a group goes offline, then the normal hotplug mechanism takes effect and the whole thing is put 'offline' as well. This works nicely for all kinds of scenarios even if you have more CPUs than queues. No extras, no magic affinity hints, it just works.

Hmm?

> Yes. We did not use "pci_alloc_irq_vectors_affinity".
> We used "pci_enable_msix_range" and manually set affinity in the driver using
> irq_set_affinity_hint.

I still regret the day when I merged that abomination.

Thanks,

tglx

^ permalink raw reply [flat|nested] 49+ messages in thread
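Taken together, points 2)-4) of the proposal above amount to a small steering function. The following toy model (plain Python with illustrative names, not kernel code) sketches how a coalesced reply would be directed under the stated assumptions of 2 sockets with 36 CPUs each:

```python
# Toy model of the grouped reply-queue proposal: queues are grouped by
# physical package; a request flagged 'coalescing' gets its reply steered
# to the first *online* queue of the submitting CPU's group. If every CPU
# of a group is offline, the whole group is offline too (normal hotplug).

def build_groups(nr_cpus, cpus_per_package):
    """One reply queue per CPU, grouped by physical package (point 2)."""
    return [list(range(start, min(start + cpus_per_package, nr_cpus)))
            for start in range(0, nr_cpus, cpus_per_package)]

def coalesced_target(cpu, groups, online_cpus):
    """Point 4: pick the first online reply queue in cpu's group.
    Returns None when the whole group is offline (hotplug case)."""
    for group in groups:
        if cpu in group:
            for queue in group:
                if queue in online_cpus:
                    return queue
            return None  # whole group offline
    raise ValueError("cpu %d not in any group" % cpu)

groups = build_groups(nr_cpus=72, cpus_per_package=36)  # 2 sockets x 36 CPUs
online = set(range(72)) - {36}                          # CPU 36 offlined
print(coalesced_target(37, groups, online))             # -> 37
```

With this shape there is no per-queue affinity hint to maintain: hotplug only shrinks the set of online queues and the steering keeps working, which is the "no extras, no magic affinity hints" property claimed above.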
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-31 20:24 ` Thomas Gleixner @ 2018-08-31 21:49 ` Kashyap Desai 0 siblings, 0 replies; 49+ messages in thread From: Kashyap Desai @ 2018-08-31 21:49 UTC (permalink / raw) To: Thomas Gleixner Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

> > It is not yet finalized, but it can be based on per-sdev outstanding,
> > shost_busy etc.
> > We want to use the special 16 reply queues for IO acceleration (these queues
> > work in interrupt coalescing mode. This is a h/w feature)
>
> TBH, this does not make any sense whatsoever. Why are you trying to have
> extra interrupts for coalescing instead of doing the following:

Thomas,

We are using this feature mainly for performance and not for CPU hotplug issues. I read your #1 to #4 points below as addressing mostly CPU hotplug concerns. Right? We also want to make sure that if we convert the megaraid_sas driver from managed to non-managed interrupts, we can still meet the CPU hotplug requirement. If we use "pci_enable_msix_range" and manually set affinity in the driver using irq_set_affinity_hint, the cpu hotplug feature works as expected: <irqbalancer> is able to retain the older mapping, and whenever an offlined cpu comes back, irqbalancer restores the same old mapping.

If we use all 72 reply queues (all in interrupt coalescing mode) without any extra reply queues, we don't have any issue with cpu-msix mapping and cpu hotplug. Our major problem with that method is that latency is very bad at lower QD and/or in the single-worker case.

To solve that problem we have added 16 extra reply queues (this is a special h/w feature for performance only) which work in interrupt coalescing mode, whereas the existing 72 reply queues work without any interrupt coalescing. The best way to map the additional 16 reply queues is to map them to the local numa node.
I understand that it is a unique requirement, but at the same time we may be able to do it gracefully (in the irq subsystem) since, as you mentioned, "irq_set_affinity_hint" should be avoided in low-level drivers.

> 1) Allocate 72 reply queues which get nicely spread out to every CPU on the
> system with affinity spreading.
>
> 2) Have a configuration for your reply queues which allows them to be
> grouped, e.g. by physical package.
>
> 3) Have a mechanism to mark a reply queue offline/online and handle that on
> CPU hotplug. That means on unplug you have to wait for the reply queue
> which is associated to the outgoing CPU to be empty and no new requests
> to be queued, which has to be done for the regular per-CPU reply queues
> anyway.
>
> 4) On queueing the request, flag it 'coalescing' which causes the
> hardware/firmware to direct the reply to the first online reply queue in the
> group.
>
> If the last CPU of a group goes offline, then the normal hotplug mechanism
> takes effect and the whole thing is put 'offline' as well. This works
> nicely for all kinds of scenarios even if you have more CPUs than queues. No
> extras, no magic affinity hints, it just works.
>
> Hmm?
>
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set affinity in the driver
> > using irq_set_affinity_hint.
>
> I still regret the day when I merged that abomination.

Is it possible to have a similar mapping in the managed interrupt case, as below?

	for (i = 0; i < 16; i++)
		irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
				      cpumask_of_node(local_numa_node));

Currently we always see the managed interrupts for pre-vectors with affinity 0-71 and effective cpu always 0. We want some changes in the current API which allow us to pass flags (like *local numa affinity*) so that the cpu-msix mappings come from the local numa node and the effective cpus are spread across the local numa node.

> Thanks,
>
> tglx

^ permalink raw reply [flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-31 21:49 ` Kashyap Desai @ 2018-08-31 22:48 ` Thomas Gleixner 0 siblings, 0 replies; 49+ messages in thread From: Thomas Gleixner @ 2018-08-31 22:48 UTC (permalink / raw) To: Kashyap Desai Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > It is not yet finalized, but it can be based on per-sdev outstanding,
> > > shost_busy etc.
> > > We want to use the special 16 reply queues for IO acceleration (these queues
> > > work in interrupt coalescing mode. This is a h/w feature)
> >
> > TBH, this does not make any sense whatsoever. Why are you trying to have
> > extra interrupts for coalescing instead of doing the following:
>
> Thomas,
>
> We are using this feature mainly for performance and not for CPU hotplug
> issues. I read your #1 to #4 points below as addressing mostly CPU hotplug
> concerns. Right? We also want to make sure that if we convert the megaraid_sas
> driver from managed to non-managed interrupts, we can still meet the CPU
> hotplug requirement. If we use "pci_enable_msix_range" and manually set
> affinity in the driver using irq_set_affinity_hint, the cpu hotplug feature
> works as expected: <irqbalancer> is able to retain the older mapping, and
> whenever an offlined cpu comes back, irqbalancer restores the same old mapping.
>
> If we use all 72 reply queues (all in interrupt coalescing mode) without any
> extra reply queues, we don't have any issue with cpu-msix mapping and cpu
> hotplug. Our major problem with that method is that latency is very bad at
> lower QD and/or in the single-worker case.
>
> To solve that problem we have added 16 extra reply queues (this is a special
> h/w feature for performance only) which work in interrupt coalescing mode,
> whereas the existing 72 reply queues work without any interrupt coalescing.
> The best way to map the additional 16 reply queues is to map them to the
> local numa node.

Ok. I misunderstood the whole thing a bit. So your real issue is that you want to have reply queues which are instantaneous, the per-cpu ones, and then the extra 16 which do batching and are shared over a set of CPUs, right?
> I understand that it is a unique requirement, but at the same time we may
> be able to do it gracefully (in the irq subsystem) since, as you mentioned,
> "irq_set_affinity_hint" should be avoided in low-level drivers.

> Is it possible to have a similar mapping in the managed interrupt case, as
> below?
>
> for (i = 0; i < 16; i++)
>         irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
>                               cpumask_of_node(local_numa_node));
>
> Currently we always see the managed interrupts for pre-vectors with
> affinity 0-71 and effective cpu always 0.

The pre-vectors are not affinity managed. They get the default affinity assigned and at request_irq() the vectors are dynamically spread over CPUs to avoid that the bulk of interrupts ends up on CPU0. That's handled that way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")

> We want some changes in the current API which allow us to pass flags
> (like *local numa affinity*) so that the cpu-msix mappings come from the
> local numa node and the effective cpus are spread across the local numa node.

What you really want is to split the vector space for your device into two blocks. One for the regular per-cpu queues and the other (16 or how many ever) which are managed separately, i.e. spread out evenly. That needs some extensions to the core allocation/management code, but that shouldn't be a huge problem.

Thanks,

tglx

^ permalink raw reply [flat|nested] 49+ messages in thread
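The "two blocks" split described above does not exist as a kernel API at this point; as a rough toy model (plain Python with illustrative names, not a real interface), the even spreading of the extra block would look something like:

```python
# Toy model of splitting a device's vector space into two blocks: a
# regular block with one vector per CPU, and an extra block whose vectors
# are spread evenly over all CPUs, so each extra vector gets a real CPU
# set and can be managed like the regular ones.

def split_vector_space(cpus, nr_extra):
    """Return (regular, extra) as {vector: [cpus]} maps; each extra
    vector takes an even contiguous slice of the CPU list (remainder
    CPUs fold into the last slice)."""
    regular = {v: [cpu] for v, cpu in enumerate(cpus)}
    chunk = max(1, len(cpus) // nr_extra)
    extra = {v: cpus[v * chunk:(v + 1) * chunk] for v in range(nr_extra)}
    extra[nr_extra - 1].extend(cpus[nr_extra * chunk:])  # leftover CPUs
    return regular, extra

regular, extra = split_vector_space(list(range(72)), nr_extra=16)
print(len(regular), len(extra[0]))  # -> 72 4  (72 CPUs, 4 per extra vector)
```

In the real irq core the spreading would additionally have to respect NUMA/package topology, the way the managed-affinity spreading code already does for the regular block.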
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-31 22:48 ` Thomas Gleixner @ 2018-08-31 23:37 ` Kashyap Desai 0 siblings, 0 replies; 49+ messages in thread From: Kashyap Desai @ 2018-08-31 23:37 UTC (permalink / raw) To: Thomas Gleixner Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

> > > > It is not yet finalized, but it can be based on per-sdev outstanding,
> > > > shost_busy etc.
> > > > We want to use the special 16 reply queues for IO acceleration (these
> > > > queues work in interrupt coalescing mode. This is a h/w feature)
> > >
> > > TBH, this does not make any sense whatsoever. Why are you trying to have
> > > extra interrupts for coalescing instead of doing the following:
> >
> > Thomas,
> >
> > We are using this feature mainly for performance and not for CPU hotplug
> > issues. I read your #1 to #4 points below as addressing mostly CPU hotplug
> > concerns. Right? If we use all 72 reply queues (all in interrupt
> > coalescing mode) without any extra reply queues, we don't have any issue
> > with cpu-msix mapping and cpu hotplug. Our major problem with that method
> > is that latency is very bad at lower QD and/or in the single-worker case.
> >
> > To solve that problem we have added 16 extra reply queues (this is a
> > special h/w feature for performance only) which work in interrupt
> > coalescing mode, whereas the existing 72 reply queues work without any
> > interrupt coalescing. The best way to map the additional 16 reply queues
> > is to map them to the local numa node.
>
> Ok. I misunderstood the whole thing a bit. So your real issue is that you
> want to have reply queues which are instantaneous, the per-cpu ones, and
> then the extra 16 which do batching and are shared over a set of CPUs,
> right?

Yes, that is correct. The extra 16 (or however many) should be shared over the set of CPUs of the *local* numa node of the PCI device.
> > I understand that it is a unique requirement, but at the same time we may
> > be able to do it gracefully (in the irq subsystem) since, as you mentioned,
> > "irq_set_affinity_hint" should be avoided in low-level drivers.

> > Is it possible to have a similar mapping in the managed interrupt case,
> > as below?
> >
> > for (i = 0; i < 16; i++)
> >         irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
> >                               cpumask_of_node(local_numa_node));
> >
> > Currently we always see the managed interrupts for pre-vectors with
> > affinity 0-71 and effective cpu always 0.
>
> The pre-vectors are not affinity managed. They get the default affinity
> assigned and at request_irq() the vectors are dynamically spread over CPUs
> to avoid that the bulk of interrupts ends up on CPU0. That's handled that
> way since a0c9259dc4e1 ("irq/matrix: Spread interrupts on allocation")

I am not sure whether this works on the 4.18 kernel; I can double-check. What I remember is that the pre_vectors are mapped to 0-71 in my case and the effective cpu is always 0. You mentioned that ideally it should be spread... let me check that.

> > We want some changes in the current API which allow us to pass flags
> > (like *local numa affinity*) so that the cpu-msix mappings come from the
> > local numa node and the effective cpus are spread across the local numa
> > node.
>
> What you really want is to split the vector space for your device into two
> blocks. One for the regular per-cpu queues and the other (16 or how many
> ever) which are managed separately, i.e. spread out evenly. That needs some
> extensions to the core allocation/management code, but that shouldn't be a
> huge problem.

Yes, this is the correct understanding. I can test any proposed patch if that is what we want to use as best practice. We attempted this, but due to lack of knowledge of the irq subsystem, we were not able to settle on anything close to our requirement.
We did something like below - "added a new flag PCI_IRQ_PRE_VEC_NUMA which will indicate that all pre and post vectors should be shared within the local numa node."

	int irq_flags;
	struct irq_affinity desc;

	desc.pre_vectors = 16;
	desc.post_vectors = 0;

	irq_flags = PCI_IRQ_MSIX;

	i = pci_alloc_irq_vectors_affinity(instance->pdev,
			instance->high_iops_vector_start * 2,
			instance->msix_vectors,
			irq_flags | PCI_IRQ_AFFINITY | PCI_IRQ_PRE_VEC_NUMA,
			&desc);

Somehow, I was not able to understand which part of the irq subsystem should have the changes.

~ Kashyap

> Thanks,
>
> tglx

^ permalink raw reply [flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts 2018-08-31 23:37 ` Kashyap Desai @ 2018-09-02 12:02 ` Thomas Gleixner 0 siblings, 0 replies; 49+ messages in thread From: Thomas Gleixner @ 2018-09-02 12:02 UTC (permalink / raw) To: Kashyap Desai Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > Ok. I misunderstood the whole thing a bit. So your real issue is that you
> > want to have reply queues which are instantaneous, the per-cpu ones, and
> > then the extra 16 which do batching and are shared over a set of CPUs,
> > right?
>
> Yes, that is correct. The extra 16 (or however many) should be shared over
> the set of CPUs of the *local* numa node of the PCI device.

Why restrict it to the local NUMA node of the device? That doesn't really make sense if you queue lots of requests from CPUs on a different node.

Why don't you spread these extra interrupts across all nodes and keep the locality for the request/reply?

That would also allow making them properly managed interrupts, as you could shut down the per-node batching interrupts when all CPUs of that node are offlined, and you'd avoid the whole affinity hint irq balancer hackery.

Thanks,

tglx

^ permalink raw reply [flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts 2018-09-02 12:02 ` Thomas Gleixner @ 2018-09-03 5:34 ` Kashyap Desai 0 siblings, 0 replies; 49+ messages in thread From: Kashyap Desai @ 2018-09-03 5:34 UTC (permalink / raw) To: Thomas Gleixner Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

> On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > Ok. I misunderstood the whole thing a bit. So your real issue is that you
> > > want to have reply queues which are instantaneous, the per-cpu ones, and
> > > then the extra 16 which do batching and are shared over a set of CPUs,
> > > right?
> >
> > Yes, that is correct. The extra 16 (or however many) should be shared over
> > the set of CPUs of the *local* numa node of the PCI device.
>
> Why restrict it to the local NUMA node of the device? That doesn't
> really make sense if you queue lots of requests from CPUs on a different
> node.

The penalty of crossing a numa node is minimal with the higher interrupt coalescing used in h/w. We see a penalty from cross-numa traffic for lower-IOPs type workloads. In this particular case we take care of the cross-numa traffic via higher interrupt coalescing.

> Why don't you spread these extra interrupts across all nodes and keep the
> locality for the request/reply?

I assume you are referring to spreading the msix vectors to all numa nodes the way "pci_alloc_irq_vectors" does. Having the extra 16 reply queues spread across nodes would have a negative impact. Take the example of an 8-node system (128 logical cpus in total). If the 16 reply queues are spread across the numa nodes, a total of 8 logical cpus map to 1 reply queue (and each numa node ends up with only 2 reply queues). Running IO from one numa node will only use 2 reply queues. Performance drops drastically in such a case. This is the typical problem when the cpu-msix mapping becomes N:1, with fewer msix vectors than online cpus.
Mapping the extra 16 reply queues to the local numa node always makes sure that the driver round-robins all 16 reply queues irrespective of the originating cpu. We validated this method by sending IOs from a remote node and did not observe a performance penalty.

> That would also allow making them properly managed interrupts, as you could
> shut down the per-node batching interrupts when all CPUs of that node are
> offlined, and you'd avoid the whole affinity hint irq balancer hackery.

One more clarification - I am using "for-4.19/block" and this particular patch "a0c9259 irq/matrix: Spread interrupts on allocation" is included. I can see that the 16 extra reply queues via pre_vectors are still assigned to CPU 0 (effective affinity).

irq 33, cpu list 0-71
irq 34, cpu list 0-71
irq 35, cpu list 0-71
irq 36, cpu list 0-71
irq 37, cpu list 0-71
irq 38, cpu list 0-71
irq 39, cpu list 0-71
irq 40, cpu list 0-71
irq 41, cpu list 0-71
irq 42, cpu list 0-71
irq 43, cpu list 0-71
irq 44, cpu list 0-71
irq 45, cpu list 0-71
irq 46, cpu list 0-71
irq 47, cpu list 0-71
irq 48, cpu list 0-71

# cat /sys/kernel/debug/irq/irqs/34
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300001
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x40000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x22
         chip:    APIC
          flags:   0x0
         Vector:           46
         Target:           0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

# cat /sys/kernel/debug/irq/irqs/35
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300002
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x50000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x23
         chip:    APIC
          flags:   0x0
         Vector:           47
         Target:           0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

Ideally, what we are looking for with the 16 extra pre_vectors reply queues is for the "effective affinity" to be within the local numa node as long as that numa node has online CPUs. If not, we are OK with an effective cpu from any node.

> Thanks,
>
> tglx

^ permalink raw reply [flat|nested] 49+ messages in thread
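The dilution argument in the message above (16 extra queues spread over 8 nodes leaves each node with only 2 of them) is simple arithmetic; a tiny sketch with the numbers from the example:

```python
# Numbers from the example: 8 NUMA nodes, 128 logical CPUs, 16 extra
# reply queues. Spreading the 16 queues evenly across nodes dilutes them
# to 2 per node; keeping them on the device's local node leaves all 16
# usable for round-robin from any submitting CPU.

nodes, cpus, extra_queues = 8, 128, 16

queues_per_node_if_spread = extra_queues // nodes  # only 2 usable per node
cpus_per_queue_if_spread = cpus // extra_queues    # 8 CPUs funnel into 1 queue
queues_if_local = extra_queues                     # all 16 on the local node

print(queues_per_node_if_spread, cpus_per_queue_if_spread, queues_if_local)
# -> 2 8 16
```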
* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-03  5:34       ` Kashyap Desai
@ 2018-09-03 16:28         ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2018-09-03 16:28 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

On Mon, 3 Sep 2018, Kashyap Desai wrote:
> I am using " for-4.19/block " and this particular patch "a0c9259
> irq/matrix: Spread interrupts on allocation" is included.

Can you please try against 4.19-rc2 or later?

> I can see that 16 extra reply queues via pre_vectors are still assigned to
> CPU 0 (effective affinity).
>
> irq 33, cpu list 0-71

The cpu list is irrelevant because that's the allowed affinity mask. The
effective one is what counts.

> # cat /sys/kernel/debug/irq/irqs/34
> node: 0
> affinity: 0-71
> effectiv: 0

So if all 16 have their effective affinity set to CPU0 then that's strange
at least.

Can you please provide the output of
/sys/kernel/debug/irq/domains/VECTOR ?

> Ideally, what we are looking for is the 16 extra pre_vector reply queues
> to have "effective affinity" within the local NUMA node as long as that
> NUMA node has online CPUs. If not, we are ok to have the effective cpu
> from any node.

Well, we surely can do the initial allocation and spreading on the local
numa node, but once all CPUs are offline on that node, then the whole thing
goes down the drain and allocates from where it sees fit. I'll think about
it some more, especially how to avoid the proliferation of the affinity
hint.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-03 16:28         ` Thomas Gleixner
@ 2018-09-04 10:29           ` Kashyap Desai
  0 siblings, 0 replies; 49+ messages in thread
From: Kashyap Desai @ 2018-09-04 10:29 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

>
> On Mon, 3 Sep 2018, Kashyap Desai wrote:
> > I am using " for-4.19/block " and this particular patch "a0c9259
> > irq/matrix: Spread interrupts on allocation" is included.
>
> Can you please try against 4.19-rc2 or later?
>
> > I can see that 16 extra reply queues via pre_vectors are still assigned to
> > CPU 0 (effective affinity).
> >
> > irq 33, cpu list 0-71
>
> The cpu list is irrelevant because that's the allowed affinity mask. The
> effective one is what counts.
>
> > # cat /sys/kernel/debug/irq/irqs/34
> > node: 0
> > affinity: 0-71
> > effectiv: 0
>
> So if all 16 have their effective affinity set to CPU0 then that's strange
> at least.
>
> Can you please provide the output of
> /sys/kernel/debug/irq/domains/VECTOR ?

I tried 4.19-rc2. Same behavior as I posted earlier. All 16 pre_vector IRQs
have effective CPU = 0.

Here is the output of "/sys/kernel/debug/irq/domains/VECTOR":

# cat /sys/kernel/debug/irq/domains/VECTOR
name:   VECTOR
 size:   0
 mapped: 360
 flags:  0x00000041
Online bitmaps:       72
Global available:  13062
Global reserved:      86
Total allocated:     274
System: 43: 0-19,32,50,128,236-255
 | CPU | avl | man | act | vectors
     0  169   17   32  33-49,51-65
     1  181   17    4  33,36,52-53
     2  181   17    4  33-36
     3  181   17    4  33-34,52-53
     4  181   17    4  33,35,53-54
     5  181   17    4  33,35-36,54
     6  182   17    3  33,35-36
     7  182   17    3  33-34,36
     8  182   17    3  34-35,53
     9  181   17    4  33-34,52-53
    10  182   17    3  34,36,53
    11  182   17    3  34-35,54
    12  182   17    3  33-34,53
    13  182   17    3  33,37,55
    14  181   17    4  33-36
    15  181   17    4  33,35-36,54
    16  181   17    4  33,35,53-54
    17  182   17    3  33,36-37
    18  181   17    4  33,36,54-55
    19  181   17    4  33,35-36,54
    20  181   17    4  33,35-37
    21  180   17    5  33,35,37,55-56
    22  181   17    4  33-36
    23  181   17    4  33,35,37,55
    24  180   17    5  33-36,54
    25  181   17    4  33-36
    26  181   17    4  33-35,54
    27  181   17    4  34-36,54
    28  181   17    4  33-35,53
    29  182   17    3  34-35,53
    30  182   17    3  33-35
    31  181   17    4  34-36,54
    32  182   17    3  33-34,53
    33  182   17    3  34-35,53
    34  182   17    3  33-34,53
    35  182   17    3  34-36
    36  182   17    3  33-34,53
    37  181   17    4  33,35,52-53
    38  182   17    3  34-35,53
    39  182   17    3  34,52-53
    40  182   17    3  33-35
    41  182   17    3  34-35,53
    42  182   17    3  33-35
    43  182   17    3  34,52-53
    44  182   17    3  33-34,53
    45  182   17    3  34-35,53
    46  182   17    3  34,36,54
    47  182   17    3  33-34,52
    48  182   17    3  34,36,54
    49  182   17    3  33,51-52
    50  181   17    4  33-36
    51  182   17    3  33-35
    52  182   17    3  33-35
    53  182   17    3  34-35,53
    54  182   17    3  33-34,53
    55  182   17    3  34-36
    56  181   17    4  33-35,53
    57  182   17    3  34-36
    58  182   17    3  33-34,53
    59  181   17    4  33-35,53
    60  181   17    4  33-35,53
    61  182   17    3  33-34,53
    62  182   17    3  33-35
    63  182   17    3  34-36
    64  182   17    3  33-34,54
    65  181   17    4  33-35,53
    66  182   17    3  33-34,54
    67  182   17    3  34-36
    68  182   17    3  33-34,54
    69  182   17    3  34,36,54
    70  182   17    3  33-35
    71  182   17    3  34,36,54

> > Ideally, what we are looking for is the 16 extra pre_vector reply queues
> > to have "effective affinity" within the local NUMA node as long as that
> > NUMA node has online CPUs. If not, we are ok to have the effective cpu
> > from any node.
>
> Well, we surely can do the initial allocation and spreading on the local
> numa node, but once all CPUs are offline on that node, then the whole thing
> goes down the drain and allocates from where it sees fit. I'll think about
> it some more, especially how to avoid the proliferation of the affinity
> hint.

Thanks for looking into this request. This will help us implement the WIP
megaraid_sas driver changes. I can test any patch you want me to try.

>
> Thanks,
>
> 	tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-04 10:29           ` Kashyap Desai
@ 2018-09-05  5:46             ` Dou Liyang
  0 siblings, 0 replies; 49+ messages in thread
From: Dou Liyang @ 2018-09-05 5:46 UTC (permalink / raw)
  To: Kashyap Desai, Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block,
	Dou Liyang

Hi Thomas, Kashyap,

At 09/04/2018 06:29 PM, Kashyap Desai wrote:
>>> I am using " for-4.19/block " and this particular patch "a0c9259
>>> irq/matrix: Spread interrupts on allocation" is included.
>>

IMO, this patch is just used for non-managed interrupts.

>> So if all 16 have their effective affinity set to CPU0 then that's
> strange

But, all these 16 are managed interrupts, and will be assigned vectors
by assign_managed_vector():

{
	cpumask_and(vector_searchmask, vector_searchmask, affmsk);
	cpu = cpumask_first(vector_searchmask);

	...
	vector = irq_matrix_alloc_managed(vector_matrix, cpu);
	...
}

Where we always used the *first* cpu in the vector_searchmask (0-71), not
the suitable one. So I guess this situation happened.

Shall we also spread the managed interrupts on allocation?

Thanks,
	dou
-----------------8<----------------------------------------
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 9f148e3d45b4..57dc05691f44 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -314,13 +314,12 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
 	int vector, cpu;
 
 	cpumask_and(vector_searchmask, vector_searchmask, affmsk);
-	cpu = cpumask_first(vector_searchmask);
-	if (cpu >= nr_cpu_ids)
-		return -EINVAL;
+
 	/* set_affinity might call here for nothing */
 	if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
 		return 0;
-	vector = irq_matrix_alloc_managed(vector_matrix, cpu);
+
+	vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
+					  &cpu);
 	trace_vector_alloc_managed(irqd->irq, vector, vector);
 	if (vector < 0)
 		return vector;
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 201de12a9957..36fdeff5043a 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1151,7 +1151,8 @@ void irq_matrix_offline(struct irq_matrix *m);
 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace);
 int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk);
 void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk);
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu);
+int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
+			     unsigned int *mapped_cpu);
 void irq_matrix_reserve(struct irq_matrix *m);
 void irq_matrix_remove_reserved(struct irq_matrix *m);
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 5092494bf261..d9e4e0a385fa 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -239,21 +239,40 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk)
  * @m:		Matrix pointer
  * @cpu:	On which CPU the interrupt should be allocated
  */
-int irq_matrix_alloc_managed(struct irq_matrix *m, unsigned int cpu)
+int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
+			     unsigned int *mapped_cpu)
 {
-	struct cpumap *cm = per_cpu_ptr(m->maps, cpu);
-	unsigned int bit, end = m->alloc_end;
-
-	/* Get managed bit which are not allocated */
-	bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end);
-	bit = find_first_bit(m->scratch_map, end);
-	if (bit >= end)
-		return -ENOSPC;
-	set_bit(bit, cm->alloc_map);
-	cm->allocated++;
-	m->total_allocated++;
-	trace_irq_matrix_alloc_managed(bit, cpu, m, cm);
-	return bit;
+	unsigned int cpu, best_cpu, maxavl = 0;
+	unsigned int bit, end;
+	struct cpumap *cm;
+
+	best_cpu = UINT_MAX;
+	for_each_cpu(cpu, msk) {
+		cm = per_cpu_ptr(m->maps, cpu);
+
+		if (!cm->online || cm->available <= maxavl)
+			continue;
+
+		best_cpu = cpu;
+		maxavl = cm->available;
+	}
+
+	if (maxavl) {
+		cm = per_cpu_ptr(m->maps, best_cpu);
+		end = m->alloc_end;
+		/* Get managed bit which are not allocated */
+		bitmap_andnot(m->scratch_map, cm->managed_map, cm->alloc_map, end);
+		bit = find_first_bit(m->scratch_map, end);
+		if (bit >= end)
+			return -ENOSPC;
+		set_bit(bit, cm->alloc_map);
+		cm->allocated++;
+		m->total_allocated++;
+		*mapped_cpu = best_cpu;
+		trace_irq_matrix_alloc_managed(bit, cpu, m, cm);
+		return bit;
+	}
+	return -ENOSPC;
 }
 
 /**

^ permalink raw reply related	[flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-05  5:46             ` Dou Liyang
@ 2018-09-05  9:45               ` Kashyap Desai
  0 siblings, 0 replies; 49+ messages in thread
From: Kashyap Desai @ 2018-09-05 9:45 UTC (permalink / raw)
  To: Dou Liyang, Thomas Gleixner
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block,
	Dou Liyang

> Hi Thomas, Kashyap,
>
> At 09/04/2018 06:29 PM, Kashyap Desai wrote:
> >>> I am using " for-4.19/block " and this particular patch "a0c9259
> >>> irq/matrix: Spread interrupts on allocation" is included.
> >>
>
> IMO, this patch is just used for non-managed interrupts.
>
> >> So if all 16 have their effective affinity set to CPU0 then that's
> > strange
>
> But, all these 16 are managed interrupts, and will be assigned vectors
> by assign_managed_vector():
>
> {
> 	cpumask_and(vector_searchmask, vector_searchmask, affmsk);
> 	cpu = cpumask_first(vector_searchmask);
>
> 	...
> 	vector = irq_matrix_alloc_managed(vector_matrix, cpu);
> 	...
> }
>
> Where we always used the *first* cpu in the vector_searchmask (0-71), not
> the suitable one. So I guess this situation happened.
>
> Shall we also spread the managed interrupts on allocation?

Hi Dou,

I tried your proposed patch. With the patch, the effective IRQ is no longer
assigned to CPU 0, but it still picks *one* CPU from the 0-71 range.
Eventually, the effective CPU is always *one* logical CPU. The behavior is
different, but the impact is still the same.

^ permalink raw reply	[flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-05  9:45               ` Kashyap Desai
@ 2018-09-05 10:38                 ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2018-09-05 10:38 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Dou Liyang, Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block,
	Dou Liyang

On Wed, 5 Sep 2018, Kashyap Desai wrote:
> > Shall we also spread the managed interrupts on allocation?
>
> I tried your proposed patch. With the patch, the effective IRQ is no
> longer assigned to CPU 0, but it still picks *one* CPU from the 0-71
> range. Eventually, the effective CPU is always *one* logical CPU. The
> behavior is different, but the impact is still the same.

Oh well. This was not intended to magically provide the solution you want
to have. It merely changed the behaviour of the managed interrupt
selection, which is a valid thing to do independent of the stuff you want
to see.

As I said, that needs more thought and I really can't tell when I have a
time slot to look at that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-05 10:38                 ` Thomas Gleixner
@ 2018-09-06 10:14                   ` Dou Liyang
  0 siblings, 0 replies; 49+ messages in thread
From: Dou Liyang @ 2018-09-06 10:14 UTC (permalink / raw)
  To: Thomas Gleixner, Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block,
	Dou Liyang

Hi Thomas,

At 09/05/2018 06:38 PM, Thomas Gleixner wrote:
> Oh well. This was not intended to magically provide the solution you want
> to have. It merely changed the behaviour of the managed interrupt
> selection, which is a valid thing to do independent of the stuff you want
> to see.

Thank you for clarifying it. I will send the patch independently.

> As I said, that needs more thought and I really can't tell when I have a
> time slot to look at that.

In the meantime, I am willing to volunteer to try to do what you said in
your previous reply. May I?

Thanks,
	dou

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-06 10:14                   ` Dou Liyang
@ 2018-09-06 11:46                     ` Thomas Gleixner
  0 siblings, 0 replies; 49+ messages in thread
From: Thomas Gleixner @ 2018-09-06 11:46 UTC (permalink / raw)
  To: Dou Liyang
  Cc: Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block,
	Dou Liyang

On Thu, 6 Sep 2018, Dou Liyang wrote:
> At 09/05/2018 06:38 PM, Thomas Gleixner wrote:
> > Oh well. This was not intended to magically provide the solution you want
> > to have. It merely changed the behaviour of the managed interrupt
> > selection, which is a valid thing to do independent of the stuff you want
> > to see.
>
> Thank you for clarifying it. I will send the patch independently.
>
> > As I said that needs more thought and I really can't tell when I have a
> > time slot to look at that.
>
> In this period, I am willing to be a volunteer to try to do that you
> said in the previous reply. May I?

You don't have to ask for permission. It's Open Source :)

There are a few things we need to clarify upfront:

  Right now the pre and post vectors are marked managed and their
  affinity mask is set to the irq default affinity mask.

  The default affinity mask is by default ALL cpus, but it can be tweaked
  both on the kernel command line and via proc.

  If that mask is only a subset of CPUs and all of them go offline
  then these vectors are shutdown in managed mode.

That means we need to set the affinity mask of the pre and post vectors to
the possible mask, but that doesn't make much sense either, unless there is
a reason to have them marked managed.

I think the right solution for these pre/post vectors is to _NOT_ mark
them managed and leave them as regular interrupts which can be affinity
controlled and also can move freely on hotplug.

Christoph?

Thanks,

	Thomas

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-06 11:46 ` Thomas Gleixner
@ 2018-09-11  9:13 ` Christoph Hellwig

From: Christoph Hellwig @ 2018-09-11  9:13 UTC
To: Thomas Gleixner
Cc: Dou Liyang, Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei,
    Christoph Hellwig, Linux Kernel Mailing List,
    Shivasharan Srikanteshwara, linux-block, Dou Liyang

On Thu, Sep 06, 2018 at 01:46:46PM +0200, Thomas Gleixner wrote:
> There are a few things we need to clarify upfront:
>
> Right now the pre and post vectors are marked managed and their
> affinity mask is set to the irq default affinity mask.
>
> The default affinity mask is by default ALL cpus, but it can be tweaked
> both on the kernel command line and via proc.
>
> If that mask is only a subset of CPUs and all of them go offline
> then these vectors are shutdown in managed mode.
>
> That means we need to set the affinity mask of the pre and post vectors to
> possible mask, but that doesn't make much sense either, unless there is a
> reason to have them marked managed.
>
> I think the right solution for these pre/post vectors is to _NOT_ mark
> them managed and leave them as regular interrupts which can be affinity
> controlled and also can move freely on hotplug.

Yes, agreed.  Marking the pre/post vectors as managed was a mistake (and
I don't think it even was intentional, at least on my part).
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-11  9:13 ` Christoph Hellwig
@ 2018-09-11  9:38 ` Dou Liyang

From: Dou Liyang @ 2018-09-11  9:38 UTC
To: Christoph Hellwig, Thomas Gleixner
Cc: Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei,
    Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block,
    Dou Liyang

Hi,

At 09/11/2018 05:13 PM, Christoph Hellwig wrote:
> On Thu, Sep 06, 2018 at 01:46:46PM +0200, Thomas Gleixner wrote:
> >
> > I think the right solution for these pre/post vectors is to _NOT_ mark
> > them managed and leave them as regular interrupts which can be affinity
> > controlled and also can move freely on hotplug.
>
> Yes, agreed.  Marking the pre/post vector as managed was a mistake
> (and I don't think it even was intentional, at least on my part).

Got it! And I am trying to fix this by:

- not setting affinity for the pre/post vectors in
  irq_create_affinity_masks()
- not setting up the desc->affinity of the pre/post vectors in
  alloc_msi_entry()

So the affinity in alloc_descs() will be NULL, and the interrupt won't be
marked as IRQD_AFFINITY_MANAGED.

Is that OK? I will show the code after testing it.

Thanks,
dou
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-08-31 22:48 ` Thomas Gleixner
@ 2018-09-11  9:22 ` Christoph Hellwig

From: Christoph Hellwig @ 2018-09-11  9:22 UTC
To: Thomas Gleixner
Cc: Kashyap Desai, Ming Lei, Sumit Saxena, Ming Lei, Christoph Hellwig,
    Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

On Sat, Sep 01, 2018 at 12:48:46AM +0200, Thomas Gleixner wrote:
> > We want some changes in current API which can allow us to pass flags
> > (like *local numa affinity*) and cpu-msix mapping are from local numa node
> > + effective cpu are spread across local numa node.
>
> What you really want is to split the vector space for your device into two
> blocks. One for the regular per cpu queues and the other (16 or how many
> ever) which are managed separately, i.e. spread out evenly. That needs some
> extensions to the core allocation/management code, but that shouldn't be a
> huge problem.

Note that there are some other use cases for multiple sets of affinity
managed irqs.  Various network devices insist on having separate TX vs RX
interrupts, for example.
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-08-31  7:50 ` Kashyap Desai
@ 2018-09-03  2:13 ` Ming Lei

From: Ming Lei @ 2018-09-03  2:13 UTC
To: Kashyap Desai
Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
    Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

On Fri, Aug 31, 2018 at 01:50:31AM -0600, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:tom.leiming@gmail.com]
> > Sent: Friday, August 31, 2018 12:54 AM
> > To: sumit.saxena@broadcom.com
> > Cc: Ming Lei; Thomas Gleixner; Christoph Hellwig; Linux Kernel Mailing
> > List; Kashyap Desai; shivasharan.srikanteshwara@broadcom.com;
> > linux-block
> > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> >
> > On Wed, Aug 29, 2018 at 6:47 PM Sumit Saxena
> > <sumit.saxena@broadcom.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Ming Lei [mailto:ming.lei@redhat.com]
> > > > Sent: Wednesday, August 29, 2018 2:16 PM
> > > > To: Sumit Saxena <sumit.saxena@broadcom.com>
> > > > Cc: tglx@linutronix.de; hch@lst.de; linux-kernel@vger.kernel.org
> > > > Subject: Re: Affinity managed interrupts vs non-managed interrupts
> > > >
> > > > Hello Sumit,
> > > Hi Ming,
> > > Thanks for response.
> > > >
> > > > On Tue, Aug 28, 2018 at 12:04:52PM +0530, Sumit Saxena wrote:
> > > > > Affinity managed interrupts vs non-managed interrupts
> > > > >
> > > > > Hi Thomas,
> > > > >
> > > > > We are working on a next generation MegaRAID product where the
> > > > > requirement is to allocate an additional 16 MSI-x vectors on top
> > > > > of the MSI-x vectors the megaraid_sas driver usually allocates.
> > > > > The MegaRAID adapter supports 128 MSI-x vectors.
> > > > >
> > > > > To explain the requirement and solution, consider that we have a
> > > > > 2-socket system (each socket having 36 logical CPUs). The
> > > > > current driver will allocate a total of 72 MSI-x vectors by
> > > > > calling pci_alloc_irq_vectors() with the PCI_IRQ_AFFINITY flag.
> > > > > All 72 MSI-x vectors will have affinity across NUMA nodes and
> > > > > the interrupts are affinity managed.
> > > > >
> > > > > If the driver calls pci_alloc_irq_vectors_affinity() with
> > > > > pre_vectors = 16, the driver can allocate 16 + 72 MSI-x vectors.
> > > >
> > > > Could you explain a bit what the specific use case of the extra 16
> > > > vectors is?
> > > We are trying to avoid the penalty due to one interrupt per IO
> > > completion and decided to coalesce interrupts on these extra 16
> > > reply queues.
> > > For the regular 72 reply queues, we will not coalesce interrupts,
> > > as for low IO workloads interrupt coalescing may take more time due
> > > to fewer IO completions.
> > > In the IO submission path, the driver will decide which set of reply
> > > queues (either the extra 16 reply queues or the regular 72 reply
> > > queues) to pick based on the IO workload.
> >
> > I am just wondering how you can make the decision about using the
> > extra 16 or the regular 72 queues in the submission path, could you
> > share us a bit your idea? How are you going to recognize the IO
> > workload inside your driver? Even the current block layer doesn't
> > recognize IO workload, such as random IO or sequential IO.
>
> It is not yet finalized, but it can be based on per sdev outstanding,
> shost_busy etc.
> We want to use the special 16 reply queues for IO acceleration (these
> queues work in interrupt coalescing mode; this is a h/w feature).

This part is very key to your approach, so I'd suggest to finalize it
first. That said, this way doesn't make sense if you can't figure out
one doable approach to decide when to use the coalescing mode, and when
to use the regular 72 reply queues.

If it is just for IO acceleration, why not always use the coalescing
mode?

> >
> > Frankly speaking, you may reuse the 72 reply queues to do interrupt
> > coalescing by configuring one extra register to enable the coalescing
> > mode, and you may just use a small part of the 72 reply queues under
> > the interrupt coalescing mode.
> Our h/w can set interrupt coalescing per 8 reply queues, so the
> smallest granularity is 8.
> If we choose to take 8 reply queues from the existing 72 reply queues
> (without asking for extra reply queues), we still have an issue on
> systems with more NUMA nodes. Example - in an 8 NUMA node system, each
> node will have only *one* reply queue for effective interrupt
> coalescing (since the irq subsystem will spread msix per numa node).
>
> To keep things scalable we cherry picked a few reply queues and wanted
> them to be out of the cpu-msix mapping.

I mean you can group the reply queues according to the queue's numa node
info, given the mapping has been figured out there by the genirq
affinity code.

> >
> > Or you can learn from SPDK to use one or a small number of dedicated
> > cores or kernel threads to poll the interrupts from all reply queues,
> > then I guess you may benefit much compared with the extra 16 queue
> > approach.
> Problem with polling - it requires some steady completion, otherwise
> prediction in the driver gives different results on different profiles.
> We attempted irq-poll and thread ISR based polling, but each has pros
> and cons. One of the key usages of the method we are trying is not to
> impact latency for lower QD workloads.

Interrupt coalescing should affect latency too[1], or could you share
your idea how to use interrupt coalescing to address the latency issue?

"Interrupt coalescing, also known as interrupt moderation, is a
technique in which events which would normally trigger a hardware
interrupt are held back, either until a certain amount of work is
pending, or a timeout timer triggers."[1]

[1] https://en.wikipedia.org/wiki/Interrupt_coalescing

> I posted RFC at
> https://www.spinics.net/lists/linux-scsi/msg122874.html
>
> We have done an extensive study and concluded that using interrupt
> coalescing is better if the h/w can manage two different modes
> (coalescing on/off).

Could you explain a bit why coalescing is better?

In theory, interrupt coalescing is just to move the implementation into
hardware. And the IO submitted from the same coalescing group is usually
irrelevant. The same problem you found in polling should have been in
coalescing too.

> >
> > Introducing extra 16 queues just for interrupt coalescing and making
> > it coexist with the regular 72 reply queues seems one very unusual
> > use case, not sure the current genirq affinity can support it well.
>
> Yes. This is an unusual case. I think it is not used by any other
> drivers.

> > > > > All pre_vectors (16) will be mapped to all available online
> > > > > CPUs but the effective affinity of each vector is to CPU 0. Our
> > > > > requirement is to have the pre_vectors 16 reply queues mapped
> > > > > to the local NUMA node, with the effective CPU spread within
> > > > > the local node cpu mask. Without changing kernel code, we can
> > > >
> > > > If all CPUs in one NUMA node are offline, can this use case work
> > > > as expected? Seems we have to understand what the use case is and
> > > > how it works.
> > >
> > > Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU
> > > affinity will be broken and irqbalance takes care of migrating
> > > affected IRQs to online CPUs of a different NUMA node.
> > > When offline CPUs are onlined again, irqbalance restores affinity.
> >
> > The irqbalance daemon can't cover managed interrupts, or do you mean
> > you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?
>
> Yes. We did not use "pci_alloc_irq_vectors_affinity".
> We used "pci_enable_msix_range" and manually set affinity in the
> driver using irq_set_affinity_hint.

Then you have to cover all kinds of CPU hotplug issues in your driver
because you switched to the driver to maintain the queue mapping.

Thanks,
Ming
* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-03  2:13 ` Ming Lei
@ 2018-09-03  6:10 ` Kashyap Desai

From: Kashyap Desai @ 2018-09-03  6:10 UTC
To: Ming Lei
Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
    Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use the special 16 reply queues for IO acceleration (these
> > queues work in interrupt coalescing mode; this is a h/w feature).
>
> This part is very key to your approach, so I'd suggest to finalize it
> first. That said, this way doesn't make sense if you can't figure out
> one doable approach to decide when to use the coalescing mode, and
> when to use the regular 72 reply queues.

This is almost finalized, but it is going through testing and may take
some time to review all the output. At a very high level - if the scsi
device is a Virtual Disk, the driver will count each physical disk as a
data arm, and the required condition to use the io acceleration
(interrupt coalescing) path is that the outstanding count for the sdev
should be more than 8 * data_arms. Using this method we are not going to
impact low latency intensive workloads.

> If it is just for IO acceleration, why not always use the coalescing
> mode?

Ming, we attempted all the possible approaches. Let me summarize.

If we use *all* interrupt coalescing, single worker and lower queue
depth profiles are impacted and a latency drop of up to 20% is seen.

> >
> > Frankly speaking, you may reuse the 72 reply queues to do interrupt
> > coalescing by configuring one extra register to enable the coalescing
> > mode, and you may just use a small part of the 72 reply queues under
> > the interrupt coalescing mode.
> Our h/w can set interrupt coalescing per 8 reply queues, so the
> smallest granularity is 8.
> If we choose to take 8 reply queues from the existing 72 reply queues
> (without asking for extra reply queues), we still have an issue on
> systems with more NUMA nodes. Example - in an 8 NUMA node system, each
> node will have only *one* reply queue for effective interrupt
> coalescing (since the irq subsystem will spread msix per numa node).
>
> To keep things scalable we cherry picked a few reply queues and wanted
> them to be out of the cpu-msix mapping.

> I mean you can group the reply queues according to the queue's numa
> node info, given the mapping has been figured out there by the genirq
> affinity code.

Not able to follow you. I replied to Thomas on the same topic. Does that
reply clarify it, or am I still missing something?

> >
> > Or you can learn from SPDK to use one or a small number of dedicated
> > cores or kernel threads to poll the interrupts from all reply queues,
> > then I guess you may benefit much compared with the extra 16 queue
> > approach.
> Problem with polling - it requires some steady completion, otherwise
> prediction in the driver gives different results on different
> profiles.
> We attempted irq-poll and thread ISR based polling, but each has pros
> and cons. One of the key usages of the method we are trying is not to
> impact latency for lower QD workloads.

> Interrupt coalescing should affect latency too[1], or could you share
> your idea how to use interrupt coalescing to address the latency
> issue?
>
> "Interrupt coalescing, also known as interrupt moderation, is a
> technique in which events which would normally trigger a hardware
> interrupt are held back, either until a certain amount of work is
> pending, or a timeout timer triggers."[1]
>
> [1] https://en.wikipedia.org/wiki/Interrupt_coalescing

That is correct. We are not going to use 100% interrupt coalescing to
avoid a latency impact. We will have two sets of queues. You can
consider this as hybrid interrupt coalescing.

On the 72 logical cpu case, we will allocate 88 (72 + 16) reply queues
(msix indexes). Only the first 16 reply queues will be configured in
interrupt coalescing mode (this is a special h/w feature) and the
remaining 72 reply queues are without any interrupt coalescing. The 72
reply queues have a 1:1 cpu-msix mapping and the 16 reply queues are
mapped to the local numa node.

As explained above, per scsi device outstanding is the key factor in
routing io to the queues with interrupt coalescing vs the regular
queues (without interrupt coalescing).

Example - if there are sync IO requests per scsi device (one IO at a
time), the driver will keep posting those IOs to the queues without any
interrupt coalescing. If there are more than 8 outstanding ios per scsi
device, the driver will post those ios to the reply queues with
interrupt coalescing. This particular group of ios will not have a
latency impact because the coalescing depth is the key factor in
flushing the ios. There can be some corner-case workloads which can
theoretically have a latency impact, but having more scsi devices doing
active io submission will close that loop, and we do not suspect those
issues need any special treatment. In fact, this solution is to provide
reasonable latency + higher iops for most of the cases, and if there
are some deployments which need tuning, it is still possible to disable
this feature. We really want to deal with those scenarios on a case by
case basis (through firmware settings).

> > I posted RFC at
> > https://www.spinics.net/lists/linux-scsi/msg122874.html
> >
> > We have done an extensive study and concluded that using interrupt
> > coalescing is better if the h/w can manage two different modes
> > (coalescing on/off).
>
> Could you explain a bit why coalescing is better?

Actually we are doing hybrid coalescing. You are correct, we have no
single answer here, but there are pros and cons. For such hybrid
coalescing we need h/w support.

> In theory, interrupt coalescing is just to move the implementation
> into hardware. And the IO submitted from the same coalescing group is
> usually irrelevant. The same problem you found in polling should have
> been in coalescing too.

Coalescing, whether in software or hardware, is a best-effort
mechanism, and there is no steady snapshot of submission and completion
in either case.

One of the problems with coalescing/polling in an OS driver is that
irq-poll works in interrupt context, and waiting in polling consumes
more CPU because the driver has to run some predictive loop. At the
same time the driver should quit after some completions to give
fairness to other devices. A threaded interrupt can resolve the cpu
hogging issue, but then we are moving our key interrupt processing to
threaded context, so fairness will be compromised. In case of threaded
interrupt polling, we may be impacted if the interrupts of other
devices request the same cpu where the threaded isr is running. If the
polling logic in the driver does not work well on different systems, we
are going to see the extra penalty of doing disable/enable interrupt
calls. This particular problem is not a concern if the h/w does
interrupt coalescing.

> >
> > Introducing extra 16 queues just for interrupt coalescing and making
> > it coexist with the regular 72 reply queues seems one very unusual
> > use case, not sure the current genirq affinity can support it well.
>
> Yes. This is an unusual case. I think it is not used by any other
> drivers.

> > > > > All pre_vectors (16) will be mapped to all available online
> > > > > CPUs but the effective affinity of each vector is to CPU 0.
> > > > > Our requirement is to have the pre_vectors 16 reply queues
> > > > > mapped to the local NUMA node, with the effective CPU spread
> > > > > within the local node cpu mask. Without changing kernel code,
> > > > > we can
> > > >
> > > > If all CPUs in one NUMA node are offline, can this use case work
> > > > as expected? Seems we have to understand what the use case is
> > > > and how it works.
> > >
> > > Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU
> > > affinity will be broken and irqbalance takes care of migrating
> > > affected IRQs to online CPUs of a different NUMA node.
> > > When offline CPUs are onlined again, irqbalance restores affinity.
> >
> > The irqbalance daemon can't cover managed interrupts, or do you mean
> > you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?
>
> Yes. We did not use "pci_alloc_irq_vectors_affinity".
> We used "pci_enable_msix_range" and manually set affinity in the
> driver using irq_set_affinity_hint.

> Then you have to cover all kinds of CPU hotplug issues in your driver
> because you switched to the driver to maintain the queue mapping.
>
> Thanks,
> Ming
* RE: Affinity managed interrupts vs non-managed interrupts
@ 2018-09-03  6:10 ` Kashyap Desai
  0 siblings, 0 replies; 49+ messages in thread
From: Kashyap Desai @ 2018-09-03  6:10 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use special 16 reply queues for IO acceleration (these queues
> > are working in interrupt coalescing mode. This is a h/w feature)
>
> This part is very key to your approach, so I'd suggest to finalize it
> first. That said this way doesn't make sense if you can't figure out
> one doable approach to decide when to use the coalescing mode, and when to
> use the regular 72 reply queues.

This is almost finalized, but it is going through testing and it may take
some time to review all the output.
At a very high level -
If the scsi device is a Virtual Disk, the driver will count each physical
disk as a data arm, and the required condition to use the IO acceleration
(interrupt coalescing) path is that the outstanding IO count for the sdev
should be more than 8 * data_arms. Using this method we are not going to
impact latency-sensitive workloads.

> If it is just for IO acceleration, why not always use the coalescing mode?

Ming, we attempted all the possible approaches. Let me summarize.

If we use *all* interrupt coalescing, the single-worker and lower queue
depth profiles are impacted and a latency drop of up to 20% is seen.

> > > Frankly speaking, you may reuse the 72 reply queues to do interrupt
> > > coalescing by configuring one extra register to enable the coalescing
> > > mode, and you may just use a small part of the 72 reply queues under
> > > the interrupt coalescing mode.

> > Our h/w can set interrupt coalescing per 8 reply queues, so the smallest
> > unit is 8.
> > If we choose to take 8 reply queues from the existing 72 reply queues
> > (without asking for extra reply queues), we still have an issue on
> > systems with more NUMA nodes. Example - in an 8 NUMA node system, each
> > node will have only *one* reply queue for effective interrupt coalescing
> > (since the irq subsystem will spread msix per NUMA node).
> >
> > To keep things scalable we cherry-picked a few reply queues and wanted
> > them to be out of the cpu-msix mapping.
>
> I mean you can group the reply queues according to the queue's numa node
> info, given the mapping has been figured out there by genirq affinity
> code.

Not able to follow you. I replied to Thomas on the same topic. Does that
reply clarify things, or am I still missing something?

> > > Or you can learn from SPDK to use one or small number of dedicated
> > > cores or kernel threads to poll the interrupts from all reply queues,
> > > then I guess you may benefit much compared with the extra 16 queue
> > > approach.
> > Problem with polling - It requires some steady completion, otherwise
> > prediction in the driver gives different results on different profiles.
> > We attempted irq-poll and thread-ISR based polling, but each has pros
> > and cons. One of the key goals of the method we are trying is not to
> > impact latency for lower QD workloads.
>
> Interrupt coalescing should affect latency too[1], or could you share your
> idea how to use interrupt coalescing to address the latency issue?
>
> "Interrupt coalescing, also known as interrupt moderation,[1] is a
> technique in which events which would normally trigger a hardware
> interrupt are held back, either until a certain amount of work is pending,
> or a timeout timer triggers."[1]
>
> [1] https://en.wikipedia.org/wiki/Interrupt_coalescing

That is correct. We are not going to use 100% interrupt coalescing, to
avoid the latency impact. We will have two sets of queues. You can consider
this as hybrid interrupt coalescing.
On the 72 logical CPU case, we will allocate 88 (72 + 16) reply queues
(msix index). Only the first 16 reply queues will be configured in
interrupt coalescing mode (this is a special h/w feature) and the remaining
72 reply queues are without any interrupt coalescing. The 72 reply queues
are mapped 1:1 cpu-msix, and the 16 reply queues are mapped to the local
NUMA node.

As explained above, the per scsi device outstanding IO count is the key
factor to route IO either to the queues with interrupt coalescing or to the
regular queues (without interrupt coalescing).
Example -
If there are sync IO requests per scsi device (one IO at a time), the
driver will keep posting those IOs to the queues without any interrupt
coalescing. If there are more than 8 outstanding IOs per scsi device, the
driver will post those IOs to the reply queues with interrupt coalescing.
This particular group of IOs will not see a latency impact because the
coalescing depth is the key factor in flushing the IOs. There can be some
corner-case workloads which could theoretically see a latency impact, but
having more scsi devices doing active IO submission will close that loop,
and we do not suspect those cases need any special treatment. In fact, this
solution is meant to provide reasonable latency plus higher IOPS for most
of the cases, and if there is some deployment which needs tuning, it is
still possible to disable this feature. We really want to deal with those
scenarios on a case-by-case basis (through firmware settings).

> > I posted RFC at
> > https://www.spinics.net/lists/linux-scsi/msg122874.html
> >
> > We have done extensive study and concluded to use interrupt coalescing is
> > better if h/w can manage two different modes (coalescing on/off).
>
> Could you explain a bit why coalescing is better?

Actually we are doing hybrid coalescing. You are correct, we have no single
answer here; there are pros and cons.
For such hybrid coalescing we need h/w support.

> In theory, interrupt coalescing is just to move the implementation into
> hardware.
> And the IO submitted from the same coalescing group is usually
> irrelevant. The same problem you found in polling should have been in
> coalescing too.

Coalescing, either in software or hardware, is a best-effort mechanism, and
there is no steady snapshot of submission and completion in either case.

One of the problems with coalescing/polling in an OS driver is - irq-poll
works in interrupt context, and waiting in polling consumes more CPU
because the driver has to run a predictive loop. At the same time the
driver should quit after some completions to give fairness to other
devices. A threaded interrupt can resolve the CPU-hogging issue, but then
we are moving our key interrupt processing to threaded context, so fairness
will be compromised. In the case of threaded interrupt polling we may be
impacted if the interrupt of another device requests the same CPU where the
threaded ISR is running. If the polling logic in the driver does not work
well on different systems, we are going to see the extra penalty of doing
disable/enable interrupt calls. This particular problem is not a concern if
h/w does the interrupt coalescing.

> > > Introducing extra 16 queues just for interrupt coalescing and making
> > > it coexist with the regular 72 reply queues seems one very unusual use
> > > case, not sure the current genirq affinity can support it well.
> >
> > Yes. This is an unusual case. I think it is not used by any other drivers.
> >
> > > > > All pre_vectors (16) will be mapped to all available online CPUs
> > > > > but the effective affinity of each vector is to CPU 0. Our
> > > > > requirement is to have the pre_vectors 16 reply queues mapped to
> > > > > the local NUMA node, with the effective CPUs spread within the
> > > > > local node cpu mask. Without changing kernel code, we can
> > > >
> > > > If all CPUs in one NUMA node are offline, can this use case work as
> > > > expected?
> > > > Seems we have to understand what the use case is and how it works.
> > >
> > > Yes, if all CPUs of the NUMA node are offlined, the IRQ-CPU affinity
> > > will be broken and irqbalancer takes care of migrating affected IRQs
> > > to online CPUs of a different NUMA node.
> > > When the offline CPUs are onlined again, irqbalancer restores affinity.
> >
> > irqbalance daemon can't cover managed interrupts, or you mean
> > you don't use pci_alloc_irq_vectors_affinity(PCI_IRQ_AFFINITY)?
>
> Yes. We did not use "pci_alloc_irq_vectors_affinity".
> We used "pci_enable_msix_range" and manually set the affinity in the
> driver using irq_set_affinity_hint.

> Then you have to cover all kinds of CPU hotplug issues in your driver
> because you switched the driver to maintaining the queue mapping.
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-09-03  6:10 ` Kashyap Desai
@ 2018-09-03  9:21   ` Ming Lei
  -1 siblings, 0 replies; 49+ messages in thread
From: Ming Lei @ 2018-09-03  9:21 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

On Mon, Sep 03, 2018 at 11:40:53AM +0530, Kashyap Desai wrote:
> > > It is not yet finalized, but it can be based on per sdev outstanding,
> > > shost_busy etc.
> > > We want to use special 16 reply queues for IO acceleration (these
> > > queues are working in interrupt coalescing mode. This is a h/w feature)
> >
> > This part is very key to your approach, so I'd suggest to finalize it
> > first. That said this way doesn't make sense if you can't figure out
> > one doable approach to decide when to use the coalescing mode, and when
> > to use the regular 72 reply queues.
>
> This is almost finalized, but it is going through testing and it may take
> some time to review all the output.
> At a very high level -
> If the scsi device is a Virtual Disk, the driver will count each physical
> disk as a data arm, and the required condition to use the IO acceleration
> (interrupt coalescing) path is that the outstanding IO count for the sdev
> should be more than 8 * data_arms. Using this method we are not going to
> impact latency-sensitive workloads.
>
> > If it is just for IO acceleration, why not always use the coalescing mode?
>
> Ming, we attempted all the possible approaches. Let me summarize.
>
> If we use *all* interrupt coalescing, the single-worker and lower queue
> depth profiles are impacted and a latency drop of up to 20% is seen.
>
> > > > Frankly speaking, you may reuse the 72 reply queues to do interrupt
> > > > coalescing by configuring one extra register to enable the coalescing
> > > > mode, and you may just use a small part of the 72 reply queues under
> > > > the interrupt coalescing mode.
> > > Our h/w can set interrupt coalescing per 8 reply queues, so the
> > > smallest unit is 8.
> > > If we choose to take 8 reply queues from the existing 72 reply queues
> > > (without asking for extra reply queues), we still have an issue on
> > > systems with more NUMA nodes. Example - in an 8 NUMA node system, each
> > > node will have only *one* reply queue for effective interrupt
> > > coalescing (since the irq subsystem will spread msix per NUMA node).
> > >
> > > To keep things scalable we cherry-picked a few reply queues and wanted
> > > them to be out of the cpu-msix mapping.
> >
> > I mean you can group the reply queues according to the queue's numa node
> > info, given the mapping has been figured out there by genirq affinity
> > code.
>
> Not able to follow you. I replied to Thomas on the same topic. Does that
> reply clarify things, or am I still missing something?
>
> > > > Or you can learn from SPDK to use one or small number of dedicated
> > > > cores or kernel threads to poll the interrupts from all reply
> > > > queues, then I guess you may benefit much compared with the extra 16
> > > > queue approach.
> > > Problem with polling - It requires some steady completion, otherwise
> > > prediction in the driver gives different results on different profiles.
> > > We attempted irq-poll and thread-ISR based polling, but each has pros
> > > and cons. One of the key goals of the method we are trying is not to
> > > impact latency for lower QD workloads.
> >
> > Interrupt coalescing should affect latency too[1], or could you share
> > your idea how to use interrupt coalescing to address the latency issue?
> >
> > "Interrupt coalescing, also known as interrupt moderation,[1] is a
> > technique in which events which would normally trigger a hardware
> > interrupt are held back, either until a certain amount of work is
> > pending, or a timeout timer triggers."[1]
> >
> > [1] https://en.wikipedia.org/wiki/Interrupt_coalescing
>
> That is correct.
> We are not going to use 100% interrupt coalescing, to avoid the latency
> impact. We will have two sets of queues. You can consider this as hybrid
> interrupt coalescing.
> On the 72 logical CPU case, we will allocate 88 (72 + 16) reply queues
> (msix index). Only the first 16 reply queues will be configured in
> interrupt coalescing mode (this is a special h/w feature) and the
> remaining 72 reply queues are without any interrupt coalescing. The 72
> reply queues are mapped 1:1 cpu-msix, and the 16 reply queues are mapped
> to the local NUMA node.
>
> As explained above, the per scsi device outstanding IO count is the key
> factor to route IO either to the queues with interrupt coalescing or to
> the regular queues (without interrupt coalescing).
> Example -
> If there are sync IO requests per scsi device (one IO at a time), the
> driver will keep posting those IOs to the queues without any interrupt
> coalescing. If there are more than 8 outstanding IOs per scsi device, the
> driver will post those IOs to the reply queues with interrupt coalescing.
> This particular group

If the more than 8 outstanding IOs are from different CPUs or different
NUMA nodes, which reply queue will be chosen in the IO submission path?

Under this situation, any one of the 16 reply queues may not work as
expected, I guess.

> of IOs will not see a latency impact because the coalescing depth is the
> key factor in flushing the IOs. There can be some corner-case workloads
> which could theoretically see a latency impact, but having more scsi
> devices doing active IO submission will close that loop, and we do not
> suspect those cases need any special treatment. In fact, this solution is
> meant to provide reasonable latency plus higher IOPS for most of the
> cases, and if there is some deployment which needs tuning, it is still
> possible to disable this feature. We really want to deal with those
> scenarios on a case-by-case basis (through firmware settings).
> > > I posted RFC at
> > > https://www.spinics.net/lists/linux-scsi/msg122874.html
> > >
> > > We have done extensive study and concluded to use interrupt coalescing
> > > is better if h/w can manage two different modes (coalescing on/off).
> >
> > Could you explain a bit why coalescing is better?
>
> Actually we are doing hybrid coalescing. You are correct, we have no
> single answer here; there are pros and cons.
> For such hybrid coalescing we need h/w support.
>
> > In theory, interrupt coalescing is just to move the implementation into
> > hardware. And the IO submitted from the same coalescing group is usually
> > irrelevant. The same problem you found in polling should have been in
> > coalescing too.
>
> Coalescing, either in software or hardware, is a best-effort mechanism,
> and there is no steady snapshot of submission and completion in either
> case.
>
> One of the problems with coalescing/polling in an OS driver is - irq-poll
> works in interrupt context, and waiting in polling consumes more CPU
> because the driver has to run a predictive loop. At the same time the
> driver should quit

One similar way is to use the outstanding IO on this device to predict
the poll time.

> after some completions to give fairness to other devices. A threaded
> interrupt can resolve the CPU-hogging issue, but then we are moving our
> key interrupt processing to threaded context, so fairness will be
> compromised. In the case of threaded interrupt polling we may be impacted
> if the interrupt of another device requests the same CPU where the
> threaded ISR is running. If the polling logic in the driver does not work
> well on different systems, we are going to see the extra penalty of doing
> disable/enable interrupt calls. This particular problem is not a concern
> if h/w does the interrupt coalescing.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-03  9:21 ` Ming Lei
@ 2018-09-03  9:50   ` Kashyap Desai
  -1 siblings, 0 replies; 49+ messages in thread
From: Kashyap Desai @ 2018-09-03  9:50 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig,
	Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block

> > On the 72 logical CPU case, we will allocate 88 (72 + 16) reply queues
> > (msix index). Only the first 16 reply queues will be configured in
> > interrupt coalescing mode (this is a special h/w feature) and the
> > remaining 72 reply queues are without any interrupt coalescing. The 72
> > reply queues are mapped 1:1 cpu-msix, and the 16 reply queues are mapped
> > to the local NUMA node.
> >
> > As explained above, the per scsi device outstanding IO count is the key
> > factor to route IO either to the queues with interrupt coalescing or to
> > the regular queues (without interrupt coalescing).
> > Example -
> > If there are sync IO requests per scsi device (one IO at a time), the
> > driver will keep posting those IOs to the queues without any interrupt
> > coalescing. If there are more than 8 outstanding IOs per scsi device,
> > the driver will post those IOs to the reply queues with interrupt
> > coalescing. This particular group
>
> If the more than 8 outstanding IOs are from different CPUs or different
> NUMA nodes, which reply queue will be chosen in the IO submission path?

We tried this combination as well. If IO is submitted from a different NUMA
node, we anyway have the penalty of the cache invalidation issue. We trust
the rq_affinity = 2 setting to steer the actual IO completion back to the
origin CPU.
This approach (of the IO acceleration queues) is as good as using the
irqbalancer policy "ignore", where we have all reply queues mapped to the
local NUMA node.

> Under this situation, any one of the 16 reply queues may not work as
> expected, I guess.

I tried this, and performance was the same with or without this new feature
we are discussing.
> > of IOs will not see a latency impact because the coalescing depth is
> > the key factor in flushing the IOs. There can be some corner-case
> > workloads which could theoretically see a latency impact, but having
> > more scsi devices doing active IO submission will close that loop, and
> > we do not suspect those cases need any special treatment. In fact, this
> > solution is meant to provide reasonable latency plus higher IOPS for
> > most of the cases, and if there is some deployment which needs tuning,
> > it is still possible to disable this feature. We really want to deal
> > with those scenarios on a case-by-case basis (through firmware settings).
> >
> > > > I posted RFC at
> > > > https://www.spinics.net/lists/linux-scsi/msg122874.html
> > > >
> > > > We have done extensive study and concluded to use interrupt
> > > > coalescing is better if h/w can manage two different modes
> > > > (coalescing on/off).
> > >
> > > Could you explain a bit why coalescing is better?
> >
> > Actually we are doing hybrid coalescing. You are correct, we have no
> > single answer here; there are pros and cons.
> > For such hybrid coalescing we need h/w support.
> >
> > > In theory, interrupt coalescing is just to move the implementation
> > > into hardware. And the IO submitted from the same coalescing group is
> > > usually irrelevant. The same problem you found in polling should have
> > > been in coalescing too.
> >
> > Coalescing, either in software or hardware, is a best-effort mechanism,
> > and there is no steady snapshot of submission and completion in either
> > case.
> >
> > One of the problems with coalescing/polling in an OS driver is -
> > irq-poll works in interrupt context, and waiting in polling consumes
> > more CPU because the driver has to run a predictive loop. At the same
> > time the driver should quit
>
> One similar way is to use the outstanding IO on this device to predict
> the poll time.

We attempted this model as well.
If outstanding IO is always available (constant workload), the driver will
never quit; most of the time the interrupt will be disabled and the thread
will be doing the polling work. Ideally, the driver should quit after some
defined amount of work, right? That is what the *budget* of irq-poll is
for. If outstanding IO goes up and down (burst workload), we will be doing
frequent irq enable/disable, and that will vary the results.

Irq-poll is the best option to do polling in the OS (mainly because of its
budget and interrupt-context mechanism), but predictive polling only helps
for constant workloads, and at the same time it hogs the host CPU because
most of the time the driver keeps polling without any work in interrupt
context. If we use h/w interrupt coalescing, we are not wasting host CPU,
since the h/w can manage coalescing without consuming host CPU.

> > after some completions to give fairness to other devices. A threaded
> > interrupt can resolve the CPU-hogging issue, but then we are moving our
> > key interrupt processing to threaded context, so fairness will be
> > compromised. In the case of threaded interrupt polling we may be
> > impacted if the interrupt of another device requests the same CPU where
> > the threaded ISR is running. If the polling logic in the driver does
> > not work well on different systems, we are going to see the extra
> > penalty of doing disable/enable interrupt calls. This particular
> > problem is not a concern if h/w does interrupt coalescing.
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts @ 2018-09-03 9:50 ` Kashyap Desai 0 siblings, 0 replies; 49+ messages in thread From: Kashyap Desai @ 2018-09-03 9:50 UTC (permalink / raw) To: Ming Lei Cc: Ming Lei, Sumit Saxena, Thomas Gleixner, Christoph Hellwig, Linux Kernel Mailing List, Shivasharan Srikanteshwara, linux-block > > On 72 logical cpu case, we will allocate 88 (72 + 16) reply queues (msix > > index). Only first 16 reply queue will be configured in interrupt > > coalescing mode (This is special h/w feature.) and remaining 72 reply are > > without any interrupt coalescing. 72 reply queue are 1:1 cpu-msix map and > > 16 reply queue are mapped to local numa node. > > > > As explained above, per scsi device outstanding is a key factors to route > > io to queues with interrupt coalescing vs regular queue (without interrupt > > coalescing.) > > Example - > > If there are sync IO request per scsi device (one IO at a time), driver > > will keep posting those IO to the queues without any interrupt coalescing. > > If there are more than 8 outstanding io per scsi device, driver will post > > those io to reply queues with interrupt coalescing. This particular group > > If the more than 8 outstanding io are from different CPU or different NUMA > node, > which replay queue will be chosen in the io submission path? We tried this combination as well. If IO is submitted from different NUMA node, we anyways have penalty of cache invalidate issue. We trust rq_affinity = 2 settings to have actual io completion to go back to origin cpu. This approach (of io acceleration queue) is as good as using irqbalancer policy "ignore", where we have all reply queue mapped to local numa node. > > Under this situation, any one of 16 reply queues may not work as > expected, I guess. I tried this and it was same performance with or without this new feature we are discussing. 
> > of io will not have latency impact because coalescing depth are the key
> > factors to flush the ios. There can be some corner cases of workload
> > which can theoretically have latency impact, but having more scsi
> > devices doing active io submission will close that loop, and we do not
> > suspect those issues need any special treatment. In fact, this solution
> > is to provide reasonable latency + higher iops for most of the cases,
> > and if there are some deployments which need tuning, it is still
> > possible to disable this feature. We really want to deal with those
> > scenarios on a case by case basis (through firmware settings).
> >
> > > > I posted RFC at
> > > > https://www.spinics.net/lists/linux-scsi/msg122874.html
> > > >
> > > > We have done an extensive study and concluded that using interrupt
> > > > coalescing is better if h/w can manage two different modes
> > > > (coalescing on/off).
> > >
> > > Could you explain a bit why coalescing is better?
> >
> > Actually we are doing hybrid coalescing. You are correct, we have no
> > single answer here, but there are pros and cons.
> > For such hybrid coalescing we need h/w support.
> >
> > > In theory, interrupt coalescing is just to move the implementation into
> > > hardware. And the IO submitted from the same coalescing group is
> > > usually irrelevant. The same problem you found in polling should have
> > > been in coalescing too.
> >
> > Coalescing either in software or hardware is a best-attempt mechanism,
> > and there is no steady snapshot of submission and completion in either
> > case.
> >
> > One of the problems with coalescing/polling in an OS driver is that
> > irq-poll works in interrupt context, and waiting in polling consumes
> > more CPU because the driver has to do some predictive loop. At the same
> > time the driver should quit
>
> One similar way is to use the outstanding IO on this device to predict
> the poll time.

We attempted this model as well.
If outstanding IO is always available (constant workload), the driver will
never quit. Most of the time the interrupt will be disabled and the thread
will be busy polling. Ideally, the driver should quit after some defined
time, right? That is what the *budget* of irq-poll is for. If outstanding
IO goes up and down (burst workload), we will be doing frequent irq
enable/disable, and that will vary the results. Irq-poll is the best option
for doing polling in the OS (mainly because of the budget and
interrupt-context mechanism), but while predictive polling helps for a
constant workload, it also hogs the host CPU, because most of the time the
driver keeps polling without any work in interrupt context. If we use h/w
interrupt coalescing, we are not wasting host CPU, since the h/w can manage
coalescing without consuming host cpu.

> > after some completions to give fairness to other devices. Threaded
> > interrupt can resolve the cpu hogging issue, but we would be moving our
> > key interrupt processing to threaded context, so fairness will be
> > compromised. In case of threaded interrupt polling, we may be impacted
> > if an interrupt of another device requests the same cpu where the
> > threaded isr is running. If the polling logic in the driver does not
> > work well on different systems, we are going to see the extra penalty of
> > doing disable/enable interrupt calls. This particular problem is not a
> > concern if h/w does interrupt coalescing.
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 49+ messages in thread
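[Editor's note: the *budget* behaviour discussed above can be illustrated with a small userspace sketch. This is only a model of what the kernel's irq-poll core does, with invented names: the handler drains at most `budget` completions per invocation, so a constant workload cannot monopolize the CPU, and a result smaller than the budget signals that the queue is drained and the real driver would re-enable the hardware interrupt.]

```c
#include <assert.h>

/* Drain at most `budget` completions from a (simulated) reply queue.
 * Returns the number processed; fewer than `budget` means the queue is
 * empty, at which point a real driver would complete the poll cycle and
 * re-enable the hardware interrupt. */
static int reply_queue_poll(int *outstanding, int budget)
{
    int done = 0;

    while (done < budget && *outstanding > 0) {
        (*outstanding)--;  /* process one completion */
        done++;
    }
    return done;
}

/* Count how many poll invocations it takes to drain a queue: the budget
 * bounds the work done per invocation, giving other devices a turn. */
static int rounds_to_drain(int outstanding, int budget)
{
    int rounds = 0;

    while (outstanding > 0) {
        reply_queue_poll(&outstanding, budget);
        rounds++;
    }
    return rounds;
}
```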
* Re: Affinity managed interrupts vs non-managed interrupts
  2018-08-29 10:46 ` Sumit Saxena
  2018-08-30 17:15 ` Kashyap Desai
  2018-08-31  6:54 ` Ming Lei
@ 2018-09-11  9:21 ` Christoph Hellwig
  2018-09-11  9:54 ` Kashyap Desai
  2 siblings, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2018-09-11  9:21 UTC (permalink / raw)
  To: Sumit Saxena
  Cc: Ming Lei, tglx, hch, linux-kernel, Kashyap Desai,
	Shivasharan Srikanteshwara

On Wed, Aug 29, 2018 at 04:16:23PM +0530, Sumit Saxena wrote:
> > Could you explain a bit what the specific use case the extra 16
> > vectors is?
> We are trying to avoid the penalty due to one interrupt per IO completion
> and decided to coalesce interrupts on these extra 16 reply queues.
> For the regular 72 reply queues, we will not coalesce interrupts, as for
> a low IO workload interrupt coalescing may take more time due to fewer
> IO completions.
> In the IO submission path, the driver will decide which set of reply
> queues (either the extra 16 reply queues or the regular 72 reply queues)
> to pick based on IO workload.

The point I don't get here is why you need separate reply queues for
the interrupt coalesce setting.  Shouldn't this just be a flag at
submission time that indicates the amount of coalescing that should
happen?

What is the benefit of having different completion queues?

^ permalink raw reply	[flat|nested] 49+ messages in thread
* RE: Affinity managed interrupts vs non-managed interrupts
  2018-09-11  9:21 ` Christoph Hellwig
@ 2018-09-11  9:54 ` Kashyap Desai
  0 siblings, 0 replies; 49+ messages in thread
From: Kashyap Desai @ 2018-09-11  9:54 UTC (permalink / raw)
  To: Christoph Hellwig, Sumit Saxena
  Cc: Ming Lei, tglx, linux-kernel, Shivasharan Srikanteshwara

> The point I don't get here is why you need separate reply queues for
> the interrupt coalesce setting.  Shouldn't this just be a flag at
> submission time that indicates the amount of coalescing that should
> happen?
>
> What is the benefit of having different completion queues?

By having different sets of queues (something like N:16, where N queues
are without interrupt coalescing and 16 dedicated queues are for interrupt
coalescing) we can avoid the penalty introduced by interrupt coalescing,
especially for lower QD profiles.

Kashyap

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Affinity managed interrupts vs non-managed interrupts
@ 2018-08-28  6:47 Sumit Saxena
  0 siblings, 0 replies; 49+ messages in thread
From: Sumit Saxena @ 2018-08-28  6:47 UTC (permalink / raw)
  To: tglx; +Cc: Ming Lei, hch, linux-kernel

Hi Thomas,

We are working on a next generation MegaRAID product where the requirement
is to allocate an additional 16 MSI-x vectors on top of the number of
MSI-x vectors the megaraid_sas driver usually allocates. The MegaRAID
adapter supports 128 MSI-x vectors.

To explain the requirement and solution, consider that we have a 2 socket
system (each socket having 36 logical CPUs). The current driver will
allocate a total of 72 MSI-x vectors by calling the API
pci_alloc_irq_vectors() (with flag PCI_IRQ_AFFINITY). All 72 MSI-x vectors
will have affinity across NUMA nodes and the interrupts are affinity
managed.

If the driver calls pci_alloc_irq_vectors_affinity() with pre_vectors =
16, the driver can allocate 16 + 72 MSI-x vectors.

All pre_vectors (16) will be mapped to all available online CPUs, but the
effective affinity of each vector is to CPU 0. Our requirement is to have
the 16 pre_vectors reply queues mapped to the local NUMA node, with the
effective CPUs spread within the local node's cpu mask. Without changing
kernel code, we can achieve this by the driver calling
pci_enable_msix_range() (requesting 16 + 72 MSI-x vectors) instead of the
pci_alloc_irq_vectors() API. If we use pci_enable_msix_range(), it also
requires the MSI-x to CPU affinity to be handled by the driver, and these
interrupts will be non-managed.

Question is - is there any restriction on, or preference for, using
pci_alloc_irq_vectors{/_affinity} vs pci_enable_msix_range in a low level
driver? If the driver uses non-managed interrupts, all cases are handled
correctly through irqbalance. Is there any plan in future to migrate to
managed interrupts entirely, or is it a choice based call for driver
maintainers?

Thanks,
Sumit

^ permalink raw reply	[flat|nested] 49+ messages in thread
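[Editor's note: for illustration, the vector layout that pci_alloc_irq_vectors_affinity() with pre_vectors = 16 would yield in the 2-socket / 72-CPU example above can be modeled in plain userspace C. This is only a sketch of the mapping described in the mail, not kernel code; the function and constant names are invented.]

```c
#include <assert.h>

#define PRE_VECTORS 16  /* not affinity managed, driver-controlled */
#define NR_CPUS_EX  72  /* 2 sockets x 36 logical CPUs             */

/* Map an MSI-x vector index to the single CPU it is pinned to, or -1
 * for a pre_vector, whose affinity mask covers all online CPUs (with
 * effective affinity defaulting to CPU 0, as noted in the mail). */
static int vector_to_cpu(int vec)
{
    if (vec < PRE_VECTORS)
        return -1;              /* pre_vector: not affinity managed */
    return vec - PRE_VECTORS;   /* 1:1 cpu-vector mapping           */
}
```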
end of thread, other threads:[~2018-09-11 14:37 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <eccc46e12890a1d033d9003837012502@mail.gmail.com>
2018-08-29  8:46 ` Affinity managed interrupts vs non-managed interrupts Ming Lei
2018-08-29 10:46 ` Sumit Saxena
2018-08-30 17:15 ` Kashyap Desai
2018-08-31  6:54 ` Ming Lei
2018-08-31  7:50 ` Kashyap Desai
2018-08-31 20:24 ` Thomas Gleixner
2018-08-31 21:49 ` Kashyap Desai
2018-08-31 22:48 ` Thomas Gleixner
2018-08-31 23:37 ` Kashyap Desai
2018-09-02 12:02 ` Thomas Gleixner
2018-09-03  5:34 ` Kashyap Desai
2018-09-03 16:28 ` Thomas Gleixner
2018-09-04 10:29 ` Kashyap Desai
2018-09-05  5:46 ` Dou Liyang
2018-09-05  9:45 ` Kashyap Desai
2018-09-05 10:38 ` Thomas Gleixner
2018-09-06 10:14 ` Dou Liyang
2018-09-06 11:46 ` Thomas Gleixner
2018-09-11  9:13 ` Christoph Hellwig
2018-09-11  9:38 ` Dou Liyang
2018-09-11  9:22 ` Christoph Hellwig
2018-09-03  2:13 ` Ming Lei
2018-09-03  6:10 ` Kashyap Desai
2018-09-03  9:21 ` Ming Lei
2018-09-03  9:50 ` Kashyap Desai
2018-09-11  9:21 ` Christoph Hellwig
2018-09-11  9:54 ` Kashyap Desai
2018-08-28  6:47   Sumit Saxena