* RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
@ 2018-01-22 11:33 Kashyap Desai
  0 siblings, 0 replies; 7+ messages in thread
From: Kashyap Desai @ 2018-01-22 11:33 UTC (permalink / raw)
  To: linux-scsi, Peter Rivera

>
> In Summary,
> CPU completing IO which is not contributing to IO submission, may cause
> cpu lockup.
> If CPUs count to MSI-X vector count ratio is X:1 (where X > 1) then
> using irq poll interface, we can avoid the CPU lockups and by equally
> distributing the interrupts among the enabled MSI-x interrupts we can
> avoid performance issues.
>
> We are planning to use both the fixes only if cpu count is more than FW
> supported MSI-x vector.
> Please review and provide your feedback. I have appended both the
> patches.

Hi -
Assuming the method explained here is in line with the Linux SCSI
subsystem and there is no better way to fix such an issue, I am planning
to provide the same solution for internal testing, and the maintainers of
the respective drivers (mpt3sas and megaraid_sas) will post the final
patches upstream based on the results.
As of now, the PoC results look promising with the above-mentioned
solution, and no CPU lockup was observed.

>
> Thanks, Kashyap
>
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
  2018-02-02 11:38   ` Kashyap Desai
@ 2018-02-02 12:52     ` Ming Lei
  0 siblings, 0 replies; 7+ messages in thread
From: Ming Lei @ 2018-02-02 12:52 UTC (permalink / raw)
  To: Kashyap Desai; +Cc: linux-scsi, Peter Rivera

Hi Kashyap,

On Fri, Feb 02, 2018 at 05:08:12PM +0530, Kashyap Desai wrote:
> > -----Original Message-----
> > From: Ming Lei [mailto:ming.lei@redhat.com]
> > Sent: Friday, February 2, 2018 3:44 PM
> > To: Kashyap Desai
> > Cc: linux-scsi@vger.kernel.org; Peter Rivera
> > Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load
> > balancing of reply queue
> >
> > Hi Kashyap,
> >
> > On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > > Hi All -
> > >
> > > We have seen cpu lock up issue from fields if system has greater (more
> > > than 96) logical cpu count.
> > > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> > >
> > > This may be a generic issue (if PCI device support  completion on
> > > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > > h/w just to simplify the problem and possible changes to handle such
> > > issues. IT HBA
> > > (mpt3sas) supports multiple reply queues in completion path. Driver
> > > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > > queue, Logical CPUs)". If submitter is not interrupted via completion
> > > on same CPU, there is a loop in the IO path. This behavior can cause
> > > hard/soft CPU lockups, IO timeout, system sluggish etc.
> >
> > As I mentioned in another thread, this issue may be solved by SCSI_MQ
> > via mapping reply queue into hctx of blk_mq, together with
> > QUEUE_FLAG_SAME_FORCE, especially you have set 'smp_affinity_enable' as
> > 1 at default already, then pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can
> > do IRQ vectors spread on CPUs perfectly for you.
> >
> > But the following Hannes's patch is required for the conversion.
> >
> > 	https://marc.info/?l=linux-block&m=149130770004507&w=2
> >
> 
> Hi Ming -
> 
> I went through the thread discussing "support host-wide tagset". The
> link below has the latest reply on that thread:
> https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> I think there is some confusion over the mpt3sas and megaraid_sas h/w
> behavior. The Broadcom/LSI HBA and MR h/w have only one h/w queue for
> submission, but there are multiple reply queues.

That shouldn't be a problem; you can still submit to the same hw queue
from all submission paths (all hw queues), just like the current
implementation.

> Even if I include Hannes' patch for the host-wide tagset, the problem
> described in this RFC will not be resolved. In fact, the tagset can
> show the same symptoms if there are fewer completion queues than online
> CPUs. Don't you think? Or am I missing something?

If there are fewer reply queues than online CPUs, more than one CPU may
be mapped to some (or all) of the hw queues, but completion is only
handled on one of the mapped CPUs. It can still be done on the request's
submission CPU via the queue flag QUEUE_FLAG_SAME_FORCE; please see
__blk_mq_complete_request().
Or do you have any requirement other than completing a request on its
submission CPU?
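The steering Ming describes can be sketched as a small userspace model of the decision made in __blk_mq_complete_request() (a simplified toy, not the kernel's actual code; the function name `completion_cpu` and its flat boolean parameters are illustrative assumptions): when SAME_COMP handling is enabled, a completion arriving on a different CPU is bounced back to the submitter via IPI unless the two CPUs share a cache, and QUEUE_FLAG_SAME_FORCE forces the bounce even then.

```c
#include <stdbool.h>

/* Toy model of the completion-CPU decision in __blk_mq_complete_request().
 * Returns the CPU on which the completion handler will finally run. */
int completion_cpu(int submit_cpu, int irq_cpu,
                   bool same_comp, bool share_cache, bool same_force)
{
    /* QUEUE_FLAG_SAME_COMP not set: complete where the IRQ landed. */
    if (!same_comp)
        return irq_cpu;

    /* QUEUE_FLAG_SAME_FORCE overrides the shared-cache shortcut. */
    bool shared = same_force ? false : share_cache;

    /* Different CPU and no shared cache: IPI back to the submitter. */
    if (submit_cpu != irq_cpu && !shared)
        return submit_cpu;

    return irq_cpu;
}
```

Under this model, SAME_FORCE guarantees the request completes on its submission CPU even when the IRQ lands on a cache-sharing sibling.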

> 
> We don't have a problem in the submission path. The current problem is
> that one MSI-x vector serving more than one CPU can cause an I/O loop.
> This is visible if we have a higher number of online CPUs.

Yeah, I know. As I mentioned above, your requirement of completing a
request on its submission CPU can be met with the current SCSI_MQ
without much difficulty.

Thanks,
Ming


* RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
  2018-02-02 10:13 ` Ming Lei
@ 2018-02-02 11:38   ` Kashyap Desai
  2018-02-02 12:52     ` Ming Lei
  0 siblings, 1 reply; 7+ messages in thread
From: Kashyap Desai @ 2018-02-02 11:38 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-scsi, Peter Rivera

> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@redhat.com]
> Sent: Friday, February 2, 2018 3:44 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load
> balancing of reply queue
>
> Hi Kashyap,
>
> On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen cpu lock up issue from fields if system has greater (more
> > than 96) logical cpu count.
> > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> >
> > This may be a generic issue (if PCI device support  completion on
> > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > h/w just to simplify the problem and possible changes to handle such
> > issues. IT HBA
> > (mpt3sas) supports multiple reply queues in completion path. Driver
> > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > queue, Logical CPUs)". If submitter is not interrupted via completion
> > on same CPU, there is a loop in the IO path. This behavior can cause
> > hard/soft CPU lockups, IO timeout, system sluggish etc.
>
> As I mentioned in another thread, this issue may be solved by SCSI_MQ
> via mapping reply queue into hctx of blk_mq, together with
> QUEUE_FLAG_SAME_FORCE, especially you have set 'smp_affinity_enable' as
> 1 at default already, then pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can
> do IRQ vectors spread on CPUs perfectly for you.
>
> But the following Hannes's patch is required for the conversion.
>
> 	https://marc.info/?l=linux-block&m=149130770004507&w=2
>

Hi Ming -

I went through the thread discussing "support host-wide tagset". The
link below has the latest reply on that thread:
https://marc.info/?l=linux-block&m=149132580511346&w=2

I think there is some confusion over the mpt3sas and megaraid_sas h/w
behavior. The Broadcom/LSI HBA and MR h/w have only one h/w queue for
submission, but there are multiple reply queues.
Even if I include Hannes' patch for the host-wide tagset, the problem
described in this RFC will not be resolved. In fact, the tagset can
show the same symptoms if there are fewer completion queues than online
CPUs. Don't you think? Or am I missing something?

We don't have a problem in the submission path. The current problem is
that one MSI-x vector serving more than one CPU can cause an I/O loop.
This is visible if we have a higher number of online CPUs.

> >
> > Example - one CPU (e.g. CPU A) is busy submitting the IOs and another
> > CPU (e.g. CPU B) is busy with processing the corresponding IO's reply
> > descriptors from reply descriptor queue upon receiving the interrupts
> > from HBA. If the CPU A is continuously pumping the IOs then always CPU
> > B (which is executing the ISR) will see the valid reply descriptors in
> > the reply descriptor queue and it will be continuously processing
> > those reply descriptor in a loop without quitting the ISR handler.
> > Mpt3sas driver will exit ISR handler if it finds unused reply
> > descriptor in the reply descriptor queue. Since CPU A will be
> > continuously sending the IOs, CPU B may always see a valid reply
> > descriptor (posted by HBA Firmware after processing the IO) in the
> > reply descriptor queue. In worst case, driver will not quit from this
> > loop in the ISR handler. Eventually, CPU lockup will be detected by
> watchdog.
> >
> > Above mentioned behavior is not common if "rq_affinity" set to 2 or
> > affinity_hint is honored by irqbalance as "exact".
> > If rq_affinity is set to 2, submitter will be always interrupted via
> > completion on same CPU.
> > If irqbalance is using "exact" policy, interrupt will be delivered to
> > submitter CPU.
>
> Now you have used pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to get msix
> vector number, the irq affinity can't be changed by userspace any more.
>
> >
> > Problem statement -
> > If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio
> > is not 1:1, we still have  exposure of issue explained above and for
> > that we don't have any solution.
> >
> > Exposure of soft/hard lockup if CPU count is more than MSI-x supported
> > by device.
> >
> > If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if
> > CPU counts to MSI-x vector count ratio is something like X:1, where X
> > > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to
> > avoid CPU hard/soft lockups. There won't be any one to one mapping
> > between CPU to MSI-x vector instead one MSI-x interrupt (or reply
> > descriptor queue) is shared with group/set of CPUs and there is a
> > possibility of having a loop in the IO path within that CPU group and
> > may observe lockups.
> >
> > For example: Consider a system having two NUMA nodes and each node
> > having four logical CPUs and also consider that number of MSI-x
> > vectors enabled on the HBA is two, then CPUs count to MSI-x vector
> > count ratio as 4:1.
> > e.g.
> > MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node
> > 0 and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of
> > NUMA node 1.
> >
> > numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3                          --> MSI-x 0
> > node 0 size: 65536 MB
> > node 0 free: 63176 MB
> > node 1 cpus: 4 5 6 7                          --> MSI-x 1
> > node 1 size: 65536 MB
> > node 1 free: 63176 MB
> >
> > Assume that user started an application which uses all the CPUs of
> > NUMA node 0 for issuing the IOs.
> > Only one CPU from affinity list (it can be any cpu since this behavior
> > depends upon irqbalance) CPU0 will receive the interrupts from MSIx
> > vector
> > 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be
> > decreasing and ISR processing percentage will be increasing as it is
> > more busy with processing the interrupts. Gradually IO submission
> > percentage on CPU 0 will be zero and it's ISR processing percentage
> > will be 100 percentage as IO loop has already formed within the NUMA
> > node 0, i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy with
> > submitting the heavy IOs and only CPU 0 is busy in the ISR path as it
> > always find the valid reply descriptor in the reply descriptor queue.
> > Eventually, we will observe the hard lockup here.
> >
> > Chances of occurring of hard/soft lockups are directly proportional to
> > value of X. If value of X is high, then chances of observing CPU
> > lockups is high.
> >
> > Solution -
> > Fix - 1 Use IRQ poll interface defined in " irq_poll.c". mpt3sas
> > driver will execute ISR routine in Softirq context and it will always
> > quit the loop based on budget provided in IRQ poll interface.
> >
> > In these scenarios (i.e. where CPUs count to MSI-X vectors count
> > ratio is X:1 (where X > 1)), IRQ poll interface will avoid CPU hard
> > lockups due to voluntary exit from the reply queue processing based on
> > budget. Note - Only one MSI-x vector is busy doing processing. Irqstat
> > output -
> >
> > IRQs / 1 second(s)
> > IRQ#  TOTAL   NODE0   NODE1  NODE2  NODE3  NAME
> >   44  122871  122871  0      0      0      IR-PCI-MSI-edge mpt3sas0-msix0
> >   45  0       0       0      0      0      IR-PCI-MSI-edge mpt3sas0-msix1
> >
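The budget mechanism behind Fix-1 can be illustrated with a plain userspace sketch (a toy model only; the actual kernel interface is irq_poll_init()/irq_poll_sched()/irq_poll_complete(), and the function and budget value below are illustrative assumptions, not the driver's code):

```c
#define IRQ_POLL_BUDGET 256   /* illustrative; the real budget is the driver's choice */

/* Toy model of a budgeted poll callback: drain at most 'budget' reply
 * descriptors per invocation and report how many were handled.
 * Returning done == budget signals the core to reschedule the poll
 * from softirq context, instead of the handler spinning indefinitely
 * while the submitter keeps the reply queue non-empty. */
int poll_reply_queue(int *pending, int budget)
{
    int done = 0;

    while (*pending > 0 && done < budget) {
        (*pending)--;          /* "process" one reply descriptor */
        done++;
    }
    return done;
}
```

Because every invocation returns after at most one budget's worth of work, the watchdog never sees the CPU stuck in the handler, which is exactly the voluntary-exit behavior described above.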
> > Fix-2 - Above fix will avoid lockups, but there can be a performance
> > issue if only a few reply queues are busy. Driver should round robin
> > the reply queues, so that each reply queue is load balanced.
> > Irqstat output after driver does reply queue load balancing -
> >
> > IRQs / 1 second(s)
> > IRQ#  TOTAL  NODE0  NODE1  NODE2  NODE3  NAME
> >   44  62871  62871  0      0      0      IR-PCI-MSI-edge mpt3sas0-msix0
> >   45  62718  62718  0      0      0      IR-PCI-MSI-edge mpt3sas0-msix1
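A minimal sketch of the Fix-2 idea, assuming (hypothetically) that the driver picks the reply queue per I/O by round robin over the enabled MSI-x vectors, keyed off the submitting CPU; the real patch may choose its key differently:

```c
/* Toy model of Fix-2: key the reply queue off the submitting CPU so
 * completions spread across all enabled MSI-x vectors, instead of one
 * vector absorbing an entire NUMA node's I/O. */
unsigned int pick_reply_queue(unsigned int submit_cpu, unsigned int num_msix)
{
    return submit_cpu % num_msix;   /* simple round robin over vectors */
}

/* Count how many of 'nr_cpus' submitters land on 'queue'. */
unsigned int load_on_queue(unsigned int queue, unsigned int nr_cpus,
                           unsigned int num_msix)
{
    unsigned int cpu, n = 0;

    for (cpu = 0; cpu < nr_cpus; cpu++)
        if (pick_reply_queue(cpu, num_msix) == queue)
            n++;
    return n;
}
```

With 8 CPUs and 2 vectors, each vector serves 4 submitters, matching the roughly equal per-vector interrupt counts shown in the balanced irqstat output.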
> >
> > In Summary,
> > CPU completing IO which is not contributing to IO submission, may
> > cause cpu lockup.
> > If CPUs count to MSI-X vector count ratio is X:1 (where X > 1) then
> > using irq poll interface, we can avoid the CPU lockups and by equally
> > distributing the interrupts among the enabled MSI-x interrupts we can
> > avoid performance issues.
> >
> > We are planning to use both the fixes only if cpu count is more than
> > FW supported MSI-x vector.
> > Please review and provide your feedback. I have appended both the
> > patches.
> >
>
> Please take a look at pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) and
> SCSI_MQ/blk_mq; your issue can be solved without much difficulty.
>
> One annoying thing is that a SCSI driver has to support both MQ and
> non-MQ paths. A long time ago, I submitted a patch to support forcing
> MQ in a driver, but it was rejected.
>
> Thanks,
> Ming


* Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
  2018-01-15 12:12 Kashyap Desai
  2018-01-29  8:59 ` Hannes Reinecke
@ 2018-02-02 10:13 ` Ming Lei
  2018-02-02 11:38   ` Kashyap Desai
  1 sibling, 1 reply; 7+ messages in thread
From: Ming Lei @ 2018-02-02 10:13 UTC (permalink / raw)
  To: Kashyap Desai; +Cc: linux-scsi, Peter Rivera

Hi Kashyap,

On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> Hi All -
> 
> We have seen cpu lock up issue from fields if system has greater (more
> than 96) logical cpu count.
> SAS3.0 controller (Invader series) supports at max 96 msix vector and
> SAS3.5 product (Ventura) supports at max 128 msix vectors.
> 
> This may be a generic issue (if PCI device support  completion on multiple
> reply queues). Let me explain it w.r.t to mpt3sas supported h/w just to
> simplify the problem and possible changes to handle such issues. IT HBA
> (mpt3sas) supports multiple reply queues in completion path. Driver
> creates MSI-x vectors for controller as "min of ( FW supported Reply
> queue, Logical CPUs)". If submitter is not interrupted via completion on
> same CPU, there is a loop in the IO path. This behavior can cause
> hard/soft CPU lockups, IO timeout, system sluggish etc.

As I mentioned in another thread, this issue may be solved by SCSI_MQ
via mapping reply queue into hctx of blk_mq, together with QUEUE_FLAG_SAME_FORCE,
especially you have set 'smp_affinity_enable' as 1 at default already,
then pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can do IRQ vectors spread on
CPUs perfectly for you.

But the following Hannes's patch is required for the conversion.

	https://marc.info/?l=linux-block&m=149130770004507&w=2

> 
> Example - one CPU (e.g. CPU A) is busy submitting the IOs and another CPU
> (e.g. CPU B) is busy with processing the corresponding IO's reply
> descriptors from reply descriptor queue upon receiving the interrupts from
> HBA. If the CPU A is continuously pumping the IOs then always CPU B (which
> is executing the ISR) will see the valid reply descriptors in the reply
> descriptor queue and it will be continuously processing those reply
> descriptor in a loop without quitting the ISR handler.  Mpt3sas driver
> will exit ISR handler if it finds unused reply descriptor in the reply
> descriptor queue. Since CPU A will be continuously sending the IOs, CPU B
> may always see a valid reply descriptor (posted by HBA Firmware after
> processing the IO) in the reply descriptor queue. In worst case, driver
> will not quit from this loop in the ISR handler. Eventually, CPU lockup
> will be detected by watchdog.
> 
> Above mentioned behavior is not common if "rq_affinity" set to 2 or
> affinity_hint is honored by irqbalance as "exact".
> If rq_affinity is set to 2, submitter will be always interrupted via
> completion on same CPU.
> If irqbalance is using "exact" policy, interrupt will be delivered to
> submitter CPU.

Now you have used pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to get msix
vector numbers, the irq affinity can't be changed by userspace any more.
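The spread that PCI_IRQ_AFFINITY provides can be sketched as a simplified userspace model of how the managed-affinity code distributes CPUs over vectors (a toy; the real irq_create_affinity_masks() also accounts for NUMA nodes and present vs. possible CPUs, and `vector_of_cpu` is an illustrative name):

```c
/* Toy model of spreading nr_cpus over nr_vecs: each vector gets an
 * (almost) equal contiguous chunk of CPUs, with earlier vectors taking
 * the remainder when the division is uneven. */
unsigned int vector_of_cpu(unsigned int cpu, unsigned int nr_cpus,
                           unsigned int nr_vecs)
{
    unsigned int base = nr_cpus / nr_vecs;   /* CPUs per vector */
    unsigned int rem  = nr_cpus % nr_vecs;   /* first 'rem' vectors get +1 */
    unsigned int v, start = 0;

    for (v = 0; v < nr_vecs; v++) {
        unsigned int len = base + (v < rem ? 1 : 0);
        if (cpu < start + len)
            return v;
        start += len;
    }
    return nr_vecs - 1;
}
```

For the 8-CPU, 2-vector example in this thread, this yields CPUs 0-3 on vector 0 and CPUs 4-7 on vector 1, matching the numactl/MSI-x mapping shown below.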

> 
> Problem statement -
> If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio is
> not 1:1, we still have  exposure of issue explained above and for that we
> don't have any solution.
> 
> Exposure of soft/hard lockup if CPU count is more than MSI-x supported by
> device.
> 
> If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if CPU
> counts to MSI-x vector count ratio is something like X:1, where X > 1)
> then 'exact' irqbalance policy OR rq_affinity = 2 won't help to avoid CPU
> hard/soft lockups. There won't be any one to one mapping between CPU to
> MSI-x vector instead one MSI-x interrupt (or reply descriptor queue) is
> shared with group/set of CPUs and there is a possibility of having a loop
> in the IO path within that CPU group and may observe lockups.
> 
> For example: Consider a system having two NUMA nodes and each node having
> four logical CPUs and also consider that number of MSI-x vectors enabled
> on the HBA is two, then CPUs count to MSI-x vector count ratio as 4:1.
> e.g.
> MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0
> and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node
> 1.
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3                          --> MSI-x 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7                          --> MSI-x 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
> 
> Assume that user started an application which uses all the CPUs of NUMA
> node 0 for issuing the IOs.
> Only one CPU from affinity list (it can be any cpu since this behavior
> depends upon irqbalance) CPU0 will receive the interrupts from MSIx vector
> 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be
> decreasing and ISR processing percentage will be increasing as it is more
> busy with processing the interrupts. Gradually IO submission percentage on
> CPU 0 will be zero and it's ISR processing percentage will be 100
> percentage as IO loop has already formed within the NUMA node 0, i.e. CPU
> 1, CPU 2 & CPU 3 will be continuously busy with submitting the heavy IOs
> and only CPU 0 is busy in the ISR path as it always find the valid reply
> descriptor in the reply descriptor queue. Eventually, we will observe the
> hard lockup here.
> 
> Chances of occurring of hard/soft lockups are directly proportional to
> value of X. If value of X is high, then chances of observing CPU lockups
> is high.
> 
> Solution -
> Fix - 1 Use IRQ poll interface defined in " irq_poll.c". mpt3sas driver
> will execute ISR routine in Softirq context and it will always quit the
> loop based on budget provided in IRQ poll interface.
> 
> In these scenarios (i.e. where CPUs count to MSI-X vectors count ratio
> is X:1 (where X > 1)), IRQ poll interface will avoid CPU hard lockups
> due to voluntary exit from the reply queue processing based on budget.
> Note - Only one MSI-x vector is busy doing processing. Irqstat output -
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL   NODE0   NODE1  NODE2  NODE3  NAME
>   44  122871  122871  0      0      0      IR-PCI-MSI-edge mpt3sas0-msix0
>   45  0       0       0      0      0      IR-PCI-MSI-edge mpt3sas0-msix1
> 
> Fix-2 - Above fix will avoid lockups, but there can be a performance
> issue if only a few reply queues are busy. Driver should round robin
> the reply queues, so that each reply queue is load balanced. Irqstat
> output after driver does reply queue load balancing -
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0   NODE1   NODE2   NODE3  NAME
>   44  62871  62871       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45  62718  62718       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix1
> 
> In Summary,
> CPU completing IO which is not contributing to IO submission, may cause
> cpu lockup.
> If CPUs count to MSI-X vector count ratio is X:1 (where X > 1) then using
> irq poll interface, we can avoid the CPU lockups and by equally
> distributing the interrupts among the enabled MSI-x interrupts we can
> avoid performance issues.
> 
> We are planning to use both the fixes only if cpu count is more than FW
> supported MSI-x vector.
> Please review and provide your feedback. I have appended both the patches.
> 

Please take a look at pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) and
SCSI_MQ/blk_mq; your issue can be solved without much difficulty.

One annoying thing is that a SCSI driver has to support both MQ and
non-MQ paths. A long time ago, I submitted a patch to support forcing
MQ in a driver, but it was rejected.

Thanks,
Ming


* RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
  2018-01-29  8:59 ` Hannes Reinecke
@ 2018-01-29 16:52   ` Kashyap Desai
  0 siblings, 0 replies; 7+ messages in thread
From: Kashyap Desai @ 2018-01-29 16:52 UTC (permalink / raw)
  To: Hannes Reinecke, linux-scsi, Peter Rivera

> -----Original Message-----
> From: Hannes Reinecke [mailto:hare@suse.de]
> Sent: Monday, January 29, 2018 2:29 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load
> balancing of reply queue
>
> On 01/15/2018 01:12 PM, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen cpu lock up issue from fields if system has greater (more
> > than 96) logical cpu count.
> > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> >
> > This may be a generic issue (if PCI device support  completion on
> > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > h/w just to simplify the problem and possible changes to handle such
> > issues. IT HBA
> > (mpt3sas) supports multiple reply queues in completion path. Driver
> > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > queue, Logical CPUs)". If submitter is not interrupted via completion
> > on same CPU, there is a loop in the IO path. This behavior can cause
> > hard/soft CPU lockups, IO timeout, system sluggish etc.
> >
> > Example - one CPU (e.g. CPU A) is busy submitting the IOs and another
> > CPU (e.g. CPU B) is busy with processing the corresponding IO's reply
> > descriptors from reply descriptor queue upon receiving the interrupts
> > from HBA. If the CPU A is continuously pumping the IOs then always CPU
> > B (which is executing the ISR) will see the valid reply descriptors in
> > the reply descriptor queue and it will be continuously processing
> > those reply descriptor in a loop without quitting the ISR handler.
> > Mpt3sas driver will exit ISR handler if it finds unused reply
> > descriptor in the reply descriptor queue. Since CPU A will be
> > continuously sending the IOs, CPU B may always see a valid reply
> > descriptor (posted by HBA Firmware after processing the IO) in the
> > reply descriptor queue. In worst case, driver will not quit from this
> > loop in the ISR handler. Eventually, CPU lockup will be detected by
> watchdog.
> >
> > Above mentioned behavior is not common if "rq_affinity" set to 2 or
> > affinity_hint is honored by irqbalance as "exact".
> > If rq_affinity is set to 2, submitter will be always interrupted via
> > completion on same CPU.
> > If irqbalance is using "exact" policy, interrupt will be delivered to
> > submitter CPU.
> >
> > Problem statement -
> > If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio
> > is not 1:1, we still have  exposure of issue explained above and for
> > that we don't have any solution.
> >
> > Exposure of soft/hard lockup if CPU count is more than MSI-x supported
> > by device.
> >
> > If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if
> > CPU counts to MSI-x vector count ratio is something like X:1, where X
> > > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to
> > avoid CPU hard/soft lockups. There won't be any one to one mapping
> > between CPU to MSI-x vector instead one MSI-x interrupt (or reply
> > descriptor queue) is shared with group/set of CPUs and there is a
> > possibility of having a loop in the IO path within that CPU group and
> > may observe lockups.
> >
> > For example: Consider a system having two NUMA nodes and each node
> > having four logical CPUs and also consider that number of MSI-x
> > vectors enabled on the HBA is two, then CPUs count to MSI-x vector
> > count ratio as 4:1.
> > e.g.
> > MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node
> > 0 and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of
> > NUMA node 1.
> >
> > numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3                          --> MSI-x 0
> > node 0 size: 65536 MB
> > node 0 free: 63176 MB
> > node 1 cpus: 4 5 6 7                          --> MSI-x 1
> > node 1 size: 65536 MB
> > node 1 free: 63176 MB
> >
> > Assume that user started an application which uses all the CPUs of
> > NUMA node 0 for issuing the IOs.
> > Only one CPU from affinity list (it can be any cpu since this behavior
> > depends upon irqbalance) CPU0 will receive the interrupts from MSIx
> > vector
> > 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be
> > decreasing and ISR processing percentage will be increasing as it is
> > more busy with processing the interrupts. Gradually IO submission
> > percentage on CPU 0 will be zero and it's ISR processing percentage
> > will be 100 percentage as IO loop has already formed within the NUMA
> > node 0, i.e. CPU 1, CPU 2 & CPU 3 will be continuously busy with
> > submitting the heavy IOs and only CPU 0 is busy in the ISR path as it
> > always find the valid reply descriptor in the reply descriptor queue.
> > Eventually, we will observe the hard lockup here.
> >
> > Chances of occurring of hard/soft lockups are directly proportional to
> > value of X. If value of X is high, then chances of observing CPU
> > lockups is high.
> >
> > Solution -
> > Fix - 1 Use IRQ poll interface defined in " irq_poll.c". mpt3sas
> > driver will execute ISR routine in Softirq context and it will always
> > quit the loop based on budget provided in IRQ poll interface.
> >
> > In these scenarios (i.e. where CPUs count to MSI-X vectors count
> > ratio is X:1 (where X > 1)), IRQ poll interface will avoid CPU hard
> > lockups due to voluntary exit from the reply queue processing based on
> > budget. Note - Only one MSI-x vector is busy doing processing. Irqstat
> > output -
> >
> > IRQs / 1 second(s)
> > IRQ#  TOTAL   NODE0   NODE1  NODE2  NODE3  NAME
> >   44  122871  122871  0      0      0      IR-PCI-MSI-edge mpt3sas0-msix0
> >   45  0       0       0      0      0      IR-PCI-MSI-edge mpt3sas0-msix1
> >
> > Fix-2 - Above fix will avoid lockups, but there can be a performance
> > issue if only a few reply queues are busy. Driver should round robin
> > the reply queues, so that each reply queue is load balanced.
> > Irqstat output after driver does reply queue load balancing -
> >
> > IRQs / 1 second(s)
> > IRQ#  TOTAL  NODE0  NODE1  NODE2  NODE3  NAME
> >   44  62871  62871  0      0      0      IR-PCI-MSI-edge mpt3sas0-msix0
> >   45  62718  62718  0      0      0      IR-PCI-MSI-edge mpt3sas0-msix1
> >
> > In Summary,
> > CPU completing IO which is not contributing to IO submission, may
> > cause cpu lockup.
> > If CPUs count to MSI-X vector count ratio is X:1 (where X > 1) then
> > using irq poll interface, we can avoid the CPU lockups and by equally
> > distributing the interrupts among the enabled MSI-x interrupts we can
> > avoid performance issues.
> >
> > We are planning to use both the fixes only if cpu count is more than
> > FW supported MSI-x vector.
> > Please review and provide your feedback. I have appended both the
> > patches.
> >
> Actually, I think we should be discussing this issue at LSF; you are
> not alone here with this problem, as this could (potentially) hit other
> drivers, too.
>
> I think I'll be submitting a topic for this.
>
> In general I'm all for enabling irq polling in individual drivers, but
> this should be in addition to the existing code (ie enabled via a
> module option or somesuch). Enabling it in general has a high risk of
> performance degradation on slower hardware.


Thanks for the feedback and for considering a common solution. I agree
that we have a generic problem for many other drivers as well.
I saw your other thread; we can discuss this further in your latest
thread, "[LSF/MM TOPIC] irq affinity handling for high CPU count
machines".

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG Nürnberg)


* Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
  2018-01-15 12:12 Kashyap Desai
@ 2018-01-29  8:59 ` Hannes Reinecke
  2018-01-29 16:52   ` Kashyap Desai
  2018-02-02 10:13 ` Ming Lei
  1 sibling, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2018-01-29  8:59 UTC (permalink / raw)
  To: Kashyap Desai, linux-scsi, Peter Rivera

On 01/15/2018 01:12 PM, Kashyap Desai wrote:
> Hi All -
> 
> We have seen cpu lock up issue from fields if system has greater (more
> than 96) logical cpu count.
> SAS3.0 controller (Invader series) supports at max 96 msix vector and
> SAS3.5 product (Ventura) supports at max 128 msix vectors.
> 
> This may be a generic issue (if PCI device support  completion on multiple
> reply queues). Let me explain it w.r.t to mpt3sas supported h/w just to
> simplify the problem and possible changes to handle such issues. IT HBA
> (mpt3sas) supports multiple reply queues in completion path. Driver
> creates MSI-x vectors for controller as "min of ( FW supported Reply
> queue, Logical CPUs)". If submitter is not interrupted via completion on
> same CPU, there is a loop in the IO path. This behavior can cause
> hard/soft CPU lockups, IO timeout, system sluggish etc.
> 
> Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
> (e.g. CPU B) is busy processing the corresponding IO reply descriptors
> from the reply descriptor queue upon receiving interrupts from the HBA.
> If CPU A is continuously pumping IOs, then CPU B (which is executing the
> ISR) will always see valid reply descriptors in the reply descriptor
> queue and will continuously process those reply descriptors in a loop
> without quitting the ISR handler. The mpt3sas driver only exits the ISR
> handler when it finds an unused reply descriptor in the reply descriptor
> queue. Since CPU A will be continuously sending IOs, CPU B may always see
> a valid reply descriptor (posted by the HBA firmware after processing the
> IO) in the reply descriptor queue. In the worst case, the driver will not
> quit this loop in the ISR handler. Eventually, a CPU lockup will be
> detected by the watchdog.
> 
> The above-mentioned behavior is not common if "rq_affinity" is set to 2
> or the affinity_hint is honored by irqbalance as "exact".
> If rq_affinity is set to 2, the submitter is always interrupted via a
> completion on the same CPU.
> If irqbalance is using the "exact" policy, the interrupt is delivered to
> the submitter CPU.
> 
> Problem statement -
> If the CPU count to MSI-X vector (reply descriptor queue) count ratio is
> not 1:1, we are still exposed to the issue explained above, and for that
> we don't have any solution.
> 
> Soft/hard lockups can occur if the CPU count is more than the number of
> MSI-X vectors supported by the device.
> 
> If the CPU count to MSI-X vector count ratio is not 1:1 (in other words,
> if the CPU count to MSI-X vector count ratio is X:1, where X > 1), then
> the 'exact' irqbalance policy or rq_affinity = 2 won't help to avoid CPU
> hard/soft lockups. There is no one-to-one mapping between CPUs and MSI-X
> vectors; instead, one MSI-X interrupt (or reply descriptor queue) is
> shared with a group/set of CPUs, so a loop can form in the IO path within
> that CPU group and lockups may be observed.
> 
> For example: consider a system having two NUMA nodes, each node having
> four logical CPUs, and also consider that the number of MSI-X vectors
> enabled on the HBA is two; then the CPU count to MSI-X vector count ratio
> is 4:1.
> e.g.
> MSI-X vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node
> 0 and MSI-X vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA
> node 1.
> 
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3      --> MSI-x 0
> node 0 size: 65536 MB
> node 0 free: 63176 MB
> node 1 cpus: 4 5 6 7      --> MSI-x 1
> node 1 size: 65536 MB
> node 1 free: 63176 MB
> 
> Assume that a user started an application which uses all the CPUs of
> NUMA node 0 for issuing IOs.
> Only one CPU from the affinity list (it can be any CPU, since this
> behavior depends upon irqbalance), say CPU 0, will receive the interrupts
> from MSI-X vector 0 for all the IOs. Eventually, CPU 0's IO submission
> percentage will decrease and its ISR processing percentage will increase
> as it gets busier processing interrupts. Gradually, the IO submission
> percentage on CPU 0 will reach zero and its ISR processing percentage
> will reach 100 percent, as an IO loop has formed within NUMA node 0; i.e.
> CPU 1, CPU 2 & CPU 3 are continuously busy submitting heavy IOs and only
> CPU 0 is busy in the ISR path, as it always finds a valid reply
> descriptor in the reply descriptor queue. Eventually, we will observe a
> hard lockup here.
> 
> The chances of hard/soft lockups occurring are directly proportional to
> the value of X: the higher the value of X, the higher the chances of
> observing CPU lockups.
> 
> Solution -
> Fix-1 - Use the IRQ poll interface defined in "irq_poll.c". The mpt3sas
> driver will execute the ISR routine in softirq context and will always
> quit the loop based on the budget provided by the IRQ poll interface.
> 
> In these scenarios (i.e. where the CPU count to MSI-X vector count ratio
> is X:1, where X > 1), the IRQ poll interface avoids CPU hard lockups via
> a voluntary exit from reply queue processing based on the budget. Note -
> only one MSI-X vector is busy doing processing. Irqstat output -
> 
> IRQs / 1 second(s)
> IRQ#   TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
>   44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
> 
> Fix-2 - The above fix will avoid lockups, but there can be a performance
> issue if only a few reply queues are busy. The driver should round-robin
> the reply queues so that each reply queue is load balanced. Irqstat
> output after the driver does reply queue load balancing -
> 
> IRQs / 1 second(s)
> IRQ#  TOTAL  NODE0   NODE1   NODE2   NODE3  NAME
>   44  62871  62871       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix0
>   45  62718  62718       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix1
> 
> In summary,
> a CPU completing IO which is not contributing to IO submission may cause
> a CPU lockup.
> If the CPU count to MSI-X vector count ratio is X:1 (where X > 1), then
> using the IRQ poll interface we can avoid the CPU lockups, and by equally
> distributing the interrupts among the enabled MSI-X interrupts we can
> avoid the performance issues.
> 
> We are planning to use both fixes only if the CPU count is more than the
> number of FW-supported MSI-X vectors.
> Please review and provide your feedback. I have appended both patches.
> 
Actually, I think we should be discussing this issue at LSF; you are not
alone with this problem, as it could (potentially) hit other drivers,
too.

I think I'll be submitting a topic for this.

In general I'm all for enabling IRQ polling in individual drivers, but
this should be in addition to the existing code (i.e. enabled via a
module option or some such). Enabling it in general carries a high risk
of performance degradation on slower hardware.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue
@ 2018-01-15 12:12 Kashyap Desai
  2018-01-29  8:59 ` Hannes Reinecke
  2018-02-02 10:13 ` Ming Lei
  0 siblings, 2 replies; 7+ messages in thread
From: Kashyap Desai @ 2018-01-15 12:12 UTC (permalink / raw)
  To: linux-scsi, Peter Rivera

Hi All -

We have seen CPU lockup issues in the field when a system has a large
(more than 96) logical CPU count.
SAS3.0 controllers (Invader series) support at most 96 MSI-X vectors and
the SAS3.5 product (Ventura) supports at most 128 MSI-X vectors.

This may be a generic issue (if a PCI device supports completion on
multiple reply queues). Let me explain it w.r.t. mpt3sas-supported h/w
just to simplify the problem and the possible changes to handle such
issues. The IT HBA (mpt3sas) supports multiple reply queues in the
completion path. The driver creates MSI-X vectors for the controller as
"min of (FW supported reply queues, logical CPUs)". If the submitter is
not interrupted via a completion on the same CPU, there is a loop in the
IO path. This behavior can cause hard/soft CPU lockups, IO timeouts,
system sluggishness, etc.

Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
(e.g. CPU B) is busy processing the corresponding IO reply descriptors
from the reply descriptor queue upon receiving interrupts from the HBA.
If CPU A is continuously pumping IOs, then CPU B (which is executing the
ISR) will always see valid reply descriptors in the reply descriptor
queue and will continuously process those reply descriptors in a loop
without quitting the ISR handler. The mpt3sas driver only exits the ISR
handler when it finds an unused reply descriptor in the reply descriptor
queue. Since CPU A will be continuously sending IOs, CPU B may always see
a valid reply descriptor (posted by the HBA firmware after processing the
IO) in the reply descriptor queue. In the worst case, the driver will not
quit this loop in the ISR handler. Eventually, a CPU lockup will be
detected by the watchdog.

The above-mentioned behavior is not common if "rq_affinity" is set to 2
or the affinity_hint is honored by irqbalance as "exact".
If rq_affinity is set to 2, the submitter is always interrupted via a
completion on the same CPU.
If irqbalance is using the "exact" policy, the interrupt is delivered to
the submitter CPU.
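
As a concrete illustration of the rq_affinity mitigation just mentioned
(a sketch under assumptions: "sdX" is a placeholder for a block device
behind the HBA; requires root):

```
# rq_affinity=2 asks the block layer to force the completion onto the
# exact CPU that submitted the request (1 = same CPU group only).
echo 2 > /sys/block/sdX/queue/rq_affinity
```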

Problem statement -
If the CPU count to MSI-X vector (reply descriptor queue) count ratio is
not 1:1, we are still exposed to the issue explained above, and for that
we don't have any solution.

Soft/hard lockups can occur if the CPU count is more than the number of
MSI-X vectors supported by the device.

If the CPU count to MSI-X vector count ratio is not 1:1 (in other words,
if the CPU count to MSI-X vector count ratio is X:1, where X > 1), then
the 'exact' irqbalance policy or rq_affinity = 2 won't help to avoid CPU
hard/soft lockups. There is no one-to-one mapping between CPUs and MSI-X
vectors; instead, one MSI-X interrupt (or reply descriptor queue) is
shared with a group/set of CPUs, so a loop can form in the IO path within
that CPU group and lockups may be observed.

For example: consider a system having two NUMA nodes, each node having
four logical CPUs, and also consider that the number of MSI-X vectors
enabled on the HBA is two; then the CPU count to MSI-X vector count ratio
is 4:1.
e.g.
MSI-X vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node
0 and MSI-X vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA
node 1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3      --> MSI-x 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7      --> MSI-x 1
node 1 size: 65536 MB
node 1 free: 63176 MB

Assume that a user started an application which uses all the CPUs of
NUMA node 0 for issuing IOs.
Only one CPU from the affinity list (it can be any CPU, since this
behavior depends upon irqbalance), say CPU 0, will receive the interrupts
from MSI-X vector 0 for all the IOs. Eventually, CPU 0's IO submission
percentage will decrease and its ISR processing percentage will increase
as it gets busier processing interrupts. Gradually, the IO submission
percentage on CPU 0 will reach zero and its ISR processing percentage
will reach 100 percent, as an IO loop has formed within NUMA node 0; i.e.
CPU 1, CPU 2 & CPU 3 are continuously busy submitting heavy IOs and only
CPU 0 is busy in the ISR path, as it always finds a valid reply
descriptor in the reply descriptor queue. Eventually, we will observe a
hard lockup here.

The chances of hard/soft lockups occurring are directly proportional to
the value of X: the higher the value of X, the higher the chances of
observing CPU lockups.

Solution -
Fix-1 - Use the IRQ poll interface defined in "irq_poll.c". The mpt3sas
driver will execute the ISR routine in softirq context and will always
quit the loop based on the budget provided by the IRQ poll interface.

In these scenarios (i.e. where the CPU count to MSI-X vector count ratio
is X:1, where X > 1), the IRQ poll interface avoids CPU hard lockups via
a voluntary exit from reply queue processing based on the budget. Note -
only one MSI-X vector is busy doing processing. Irqstat output -

IRQs / 1 second(s)
IRQ#   TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
  44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
  45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
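
For readers unfamiliar with the interface Fix-1 names, the lib/irq_poll.c
flow looks roughly like this in a driver. This is only a sketch under
assumptions: process_reply_descriptors(), the budget value, and the
per-queue hookup names are illustrative placeholders, not the posted
patch.

```c
/* Sketch of wiring a reply queue to the in-kernel irq_poll interface. */
#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define REPLY_QUEUE_BUDGET 64	/* max reply descriptors per poll pass */

/* Hypothetical helper: drain up to 'budget' descriptors, return count. */
extern int process_reply_descriptors(struct irq_poll *iop, int budget);

static int reply_queue_poll(struct irq_poll *iop, int budget)
{
	int completed = process_reply_descriptors(iop, budget);

	if (completed < budget)
		irq_poll_complete(iop);	/* queue drained: re-arm hard IRQ */
	return completed;
}

static irqreturn_t reply_queue_isr(int irq, void *data)
{
	struct irq_poll *iop = data;

	/* Defer all descriptor processing to softirq context; each pass
	 * is bounded by the budget, so the CPU is never trapped in an
	 * unbounded hard-IRQ loop. */
	irq_poll_sched(iop);
	return IRQ_HANDLED;
}

/* At setup time, once per reply queue:
 *	irq_poll_init(&rq->iop, REPLY_QUEUE_BUDGET, reply_queue_poll);
 */
```

The budget is exactly the voluntary exit point described above: even if
the firmware keeps posting valid descriptors, the softirq yields after at
most REPLY_QUEUE_BUDGET of them.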

Fix-2 - The above fix will avoid lockups, but there can be a performance
issue if only a few reply queues are busy. The driver should round-robin
the reply queues so that each reply queue is load balanced. Irqstat
output after the driver does reply queue load balancing -

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1   NODE2   NODE3  NAME
  44  62871  62871       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix0
  45  62718  62718       0       0       0  IR-PCI-MSI-edge mpt3sas0-msix1
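
The round-robin distribution in Fix-2 can be pictured with a small
standalone sketch (plain C, not driver code; NR_REPLY_QUEUES and
pick_reply_queue() are illustrative names, not taken from the actual
mpt3sas/megaraid_sas patches):

```c
#include <assert.h>
#include <stdatomic.h>

/* Number of enabled MSI-X reply queues (two, matching the example). */
#define NR_REPLY_QUEUES 2

static atomic_uint rr_counter;

/* Return the next reply queue index in round-robin order; the atomic
 * counter keeps the rotation consistent when many submitting CPUs race. */
static unsigned int pick_reply_queue(void)
{
	return atomic_fetch_add(&rr_counter, 1) % NR_REPLY_QUEUES;
}
```

In a real driver the chosen index would steer the IO's completion toward
that MSI-X vector at submission time, so interrupt load rotates across
all enabled vectors (as in the balanced irqstat output above) instead of
piling onto one.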

In summary,
a CPU completing IO which is not contributing to IO submission may cause
a CPU lockup.
If the CPU count to MSI-X vector count ratio is X:1 (where X > 1), then
using the IRQ poll interface we can avoid the CPU lockups, and by equally
distributing the interrupts among the enabled MSI-X interrupts we can
avoid the performance issues.

We are planning to use both fixes only if the CPU count is more than the
number of FW-supported MSI-X vectors.
Please review and provide your feedback. I have appended both patches.

Thanks, Kashyap


end of thread, other threads:[~2018-02-02 12:53 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-22 11:33 [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue Kashyap Desai
  -- strict thread matches above, loose matches on Subject: below --
2018-01-15 12:12 Kashyap Desai
2018-01-29  8:59 ` Hannes Reinecke
2018-01-29 16:52   ` Kashyap Desai
2018-02-02 10:13 ` Ming Lei
2018-02-02 11:38   ` Kashyap Desai
2018-02-02 12:52     ` Ming Lei
