linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
@ 2018-08-10 13:21 Paul Menzel
  2018-08-10 13:36 ` Greg Kroah-Hartman
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-08-10 13:21 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: stable, Christoph Hellwig, Ming Lei, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 3971 bytes --]

Dear Greg,


Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
for Linux 4.14.56 causes the aacraid module to not detect the attached devices
anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.

```
$ dmesg | grep raid
[    0.269768] raid6: sse2x1   gen()  7179 MB/s
[    0.290069] raid6: sse2x1   xor()  5636 MB/s
[    0.311068] raid6: sse2x2   gen()  9160 MB/s
[    0.332076] raid6: sse2x2   xor()  6375 MB/s
[    0.353075] raid6: sse2x4   gen() 11164 MB/s
[    0.374064] raid6: sse2x4   xor()  7429 MB/s
[    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
[    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
[    0.391008] raid6: using ssse3x2 recovery algorithm
[    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
[    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
[   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
[   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
[   10.743295] aacraid: Comm Interface type3 enabled
$ lspci -nn | grep Adaptec
04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
```

But, it still works with a Dell PowerEdge R715 with two eight core AMD
Opteron 6136, the card below.

```
$ lspci -nn | grep Adaptec
22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
```

Reverting the commit fixes the issue.

commit ef86f3a72adb8a7931f67335560740a7ad696d1d
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Jan 12 10:53:05 2018 +0800

    genirq/affinity: assign vectors to all possible CPUs
    
    commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
    
    Currently we assign managed interrupt vectors to all present CPUs.  This
    works fine for systems were we only online/offline CPUs.  But in case of
    systems that support physical CPU hotplug (or the virtualized version of
    it) this means the additional CPUs covered for in the ACPI tables or on
    the command line are not catered for.  To fix this we'd either need to
    introduce new hotplug CPU states just for this case, or we can start
    assining vectors to possible but not present CPUs.
    
    Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
    Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
    Cc: linux-kernel@vger.kernel.org
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

The problem doesn’t happen with Linux 4.17.11, so there are commits in
Linux master fixing this. Unfortunately, my attempts to find out failed.

I was able to cherry-pick the three commits below on top of 4.14.62,
but the problem persists.

6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
355d7ecdea35 scsi: hpsa: fix selection of reply queue
e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity

Trying to cherry-pick the commits below, referencing the commit
in question, gave conflicts.

1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible

To avoid further trial and error with the server with a slow firmware,
do you know what commits should fix the issue?


Kind regards,

Paul


PS: I couldn’t find, who suggested this for stable, that means how
it was picked to be added to stable. Is there an easy way to find
that out?


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-10 13:21 aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs* Paul Menzel
@ 2018-08-10 13:36 ` Greg Kroah-Hartman
  2018-08-10 14:11   ` Paul Menzel
  0 siblings, 1 reply; 21+ messages in thread
From: Greg Kroah-Hartman @ 2018-08-10 13:36 UTC (permalink / raw)
  To: Paul Menzel
  Cc: stable, Christoph Hellwig, Ming Lei, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
> Dear Greg,
> 
> 
> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
> 
> ```
> $ dmesg | grep raid
> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
> [    0.391008] raid6: using ssse3x2 recovery algorithm
> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
> [   10.743295] aacraid: Comm Interface type3 enabled
> $ lspci -nn | grep Adaptec
> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
> ```
> 
> But, it still works with a Dell PowerEdge R715 with two eight core AMD
> Opteron 6136, the card below.
> 
> ```
> $ lspci -nn | grep Adaptec
> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> ```
> 
> Reverting the commit fixes the issue.
> 
> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Fri Jan 12 10:53:05 2018 +0800
> 
>     genirq/affinity: assign vectors to all possible CPUs
>     
>     commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
>     
>     Currently we assign managed interrupt vectors to all present CPUs.  This
>     works fine for systems were we only online/offline CPUs.  But in case of
>     systems that support physical CPU hotplug (or the virtualized version of
>     it) this means the additional CPUs covered for in the ACPI tables or on
>     the command line are not catered for.  To fix this we'd either need to
>     introduce new hotplug CPU states just for this case, or we can start
>     assining vectors to possible but not present CPUs.
>     
>     Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
>     Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>     Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
>     Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>     Cc: linux-kernel@vger.kernel.org
>     Cc: Thomas Gleixner <tglx@linutronix.de>
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> 
> The problem doesn’t happen with Linux 4.17.11, so there are commits in
> Linux master fixing this. Unfortunately, my attempts to find out failed.
> 
> I was able to cherry-pick the three commits below on top of 4.14.62,
> but the problem persists.
> 
> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> 
> Trying to cherry-pick the commits below, referencing the commit
> in question, gave conflicts.
> 
> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
> 
> To avoid further trial and error with the server with a slow firmware,
> do you know what commits should fix the issue?

Look at the email on the stable mailing list:
	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
it should help you out here.  Can you try the patches listed there?

> PS: I couldn’t find, who suggested this for stable, that means how
> it was picked to be added to stable. Is there an easy way to find
> that out?

Dig through the archives is usually the best way.  Looks like it came in
through a suggestion from Sasha.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-10 13:36 ` Greg Kroah-Hartman
@ 2018-08-10 14:11   ` Paul Menzel
  2018-08-10 15:55     ` Greg Kroah-Hartman
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-08-10 14:11 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: stable, Christoph Hellwig, Ming Lei, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 5374 bytes --]

Dear Greg,


On 08/10/18 15:36, Greg Kroah-Hartman wrote:
> On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
>> Dear Greg,
>>
>>
>> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
>> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
>> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
>>
>> ```
>> $ dmesg | grep raid
>> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
>> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
>> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
>> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
>> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
>> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
>> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
>> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
>> [    0.391008] raid6: using ssse3x2 recovery algorithm
>> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
>> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
>> [   10.743295] aacraid: Comm Interface type3 enabled
>> $ lspci -nn | grep Adaptec
>> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
>> ```
>>
>> But, it still works with a Dell PowerEdge R715 with two eight core AMD
>> Opteron 6136, the card below.
>>
>> ```
>> $ lspci -nn | grep Adaptec
>> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>> ```
>>
>> Reverting the commit fixes the issue.
>>
>> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
>> Author: Christoph Hellwig <hch@lst.de>
>> Date:   Fri Jan 12 10:53:05 2018 +0800
>>
>>     genirq/affinity: assign vectors to all possible CPUs
>>     
>>     commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
>>     
>>     Currently we assign managed interrupt vectors to all present CPUs.  This
>>     works fine for systems were we only online/offline CPUs.  But in case of
>>     systems that support physical CPU hotplug (or the virtualized version of
>>     it) this means the additional CPUs covered for in the ACPI tables or on
>>     the command line are not catered for.  To fix this we'd either need to
>>     introduce new hotplug CPU states just for this case, or we can start
>>     assining vectors to possible but not present CPUs.
>>     
>>     Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>     Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>     Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
>>     Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>     Cc: linux-kernel@vger.kernel.org
>>     Cc: Thomas Gleixner <tglx@linutronix.de>
>>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>
>> The problem doesn’t happen with Linux 4.17.11, so there are commits in
>> Linux master fixing this. Unfortunately, my attempts to find out failed.
>>
>> I was able to cherry-pick the three commits below on top of 4.14.62,
>> but the problem persists.
>>
>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
>> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>
>> Trying to cherry-pick the commits below, referencing the commit
>> in question, gave conflicts.
>>
>> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
>>
>> To avoid further trial and error with the server with a slow firmware,
>> do you know what commits should fix the issue?
> 
> Look at the email on the stable mailing list:
> 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
> it should help you out here.

Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
way to reply to that thread. Therefore, here is my reply.

> Can you try the patches listed there?

I tried some of these already without success.

b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
2f31115e940c scsi: core: introduce force_blk_mq
adbe552349f2 scsi: megaraid_sas: fix selection of reply queue

The commit above is already in v4.14.56.

8b834bff1b73 scsi: hpsa: fix selection of reply queue

The problem persists.

The problem also persists with the state below.

3528f73a4e5d scsi: core: introduce force_blk_mq
16dc4d8215f3 scsi: hpsa: fix selection of reply queue
f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62

So, some more commits are necessary.


Kind regards,

Paul


[1]: https://www.spinics.net/lists/stable/msg251103.html


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-10 14:11   ` Paul Menzel
@ 2018-08-10 15:55     ` Greg Kroah-Hartman
  2018-08-11  8:14       ` Paul Menzel
  0 siblings, 1 reply; 21+ messages in thread
From: Greg Kroah-Hartman @ 2018-08-10 15:55 UTC (permalink / raw)
  To: Paul Menzel
  Cc: stable, Christoph Hellwig, Ming Lei, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:
> Dear Greg,
> 
> 
> On 08/10/18 15:36, Greg Kroah-Hartman wrote:
> > On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
> >> Dear Greg,
> >>
> >>
> >> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
> >> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
> >> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
> >>
> >> ```
> >> $ dmesg | grep raid
> >> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
> >> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
> >> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
> >> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
> >> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
> >> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
> >> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
> >> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
> >> [    0.391008] raid6: using ssse3x2 recovery algorithm
> >> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> >> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> >> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
> >> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
> >> [   10.743295] aacraid: Comm Interface type3 enabled
> >> $ lspci -nn | grep Adaptec
> >> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> >> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
> >> ```
> >>
> >> But, it still works with a Dell PowerEdge R715 with two eight core AMD
> >> Opteron 6136, the card below.
> >>
> >> ```
> >> $ lspci -nn | grep Adaptec
> >> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> >> ```
> >>
> >> Reverting the commit fixes the issue.
> >>
> >> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
> >> Author: Christoph Hellwig <hch@lst.de>
> >> Date:   Fri Jan 12 10:53:05 2018 +0800
> >>
> >>     genirq/affinity: assign vectors to all possible CPUs
> >>     
> >>     commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
> >>     
> >>     Currently we assign managed interrupt vectors to all present CPUs.  This
> >>     works fine for systems were we only online/offline CPUs.  But in case of
> >>     systems that support physical CPU hotplug (or the virtualized version of
> >>     it) this means the additional CPUs covered for in the ACPI tables or on
> >>     the command line are not catered for.  To fix this we'd either need to
> >>     introduce new hotplug CPU states just for this case, or we can start
> >>     assining vectors to possible but not present CPUs.
> >>     
> >>     Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >>     Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >>     Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
> >>     Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
> >>     Cc: linux-kernel@vger.kernel.org
> >>     Cc: Thomas Gleixner <tglx@linutronix.de>
> >>     Signed-off-by: Christoph Hellwig <hch@lst.de>
> >>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>
> >> The problem doesn’t happen with Linux 4.17.11, so there are commits in
> >> Linux master fixing this. Unfortunately, my attempts to find out failed.
> >>
> >> I was able to cherry-pick the three commits below on top of 4.14.62,
> >> but the problem persists.
> >>
> >> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> >> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
> >> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> >>
> >> Trying to cherry-pick the commits below, referencing the commit
> >> in question, gave conflicts.
> >>
> >> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> >> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
> >>
> >> To avoid further trial and error with the server with a slow firmware,
> >> do you know what commits should fix the issue?
> > 
> > Look at the email on the stable mailing list:
> > 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
> > it should help you out here.
> 
> Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
> way to reply to that thread. Therefore, here is my reply.
> 
> > Can you try the patches listed there?
> 
> I tried some of these already without success.
> 
> b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> 2f31115e940c scsi: core: introduce force_blk_mq
> adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> 
> The commit above is already in v4.14.56.
> 
> 8b834bff1b73 scsi: hpsa: fix selection of reply queue
> 
> The problem persists.
> 
> The problem also persists with the state below.
> 
> 3528f73a4e5d scsi: core: introduce force_blk_mq
> 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
> f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
> 
> So, some more commits are necessary.

Or I revert the original patch here, and the follow-on ones that were
added to "fix" this issue.  I think that might be the better thing
overall here, right?  Have you tried that?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-10 15:55     ` Greg Kroah-Hartman
@ 2018-08-11  8:14       ` Paul Menzel
  2018-08-11 13:50         ` Greg Kroah-Hartman
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-08-11  8:14 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: stable, Christoph Hellwig, Ming Lei, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

Dear Greg,


Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman:
> On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:

>> On 08/10/18 15:36, Greg Kroah-Hartman wrote:
>>> On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
>>>> Dear Greg,
>>>>
>>>>
>>>> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
>>>> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
>>>> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
>>>>
>>>> ```
>>>> $ dmesg | grep raid
>>>> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
>>>> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
>>>> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
>>>> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
>>>> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
>>>> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
>>>> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
>>>> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
>>>> [    0.391008] raid6: using ssse3x2 recovery algorithm
>>>> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>>>> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>>>> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
>>>> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
>>>> [   10.743295] aacraid: Comm Interface type3 enabled
>>>> $ lspci -nn | grep Adaptec
>>>> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
>>>> ```
>>>>
>>>> But, it still works with a Dell PowerEdge R715 with two eight core AMD
>>>> Opteron 6136, the card below.
>>>>
>>>> ```
>>>> $ lspci -nn | grep Adaptec
>>>> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>> ```
>>>>
>>>> Reverting the commit fixes the issue.
>>>>
>>>> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
>>>> Author: Christoph Hellwig <hch@lst.de>
>>>> Date:   Fri Jan 12 10:53:05 2018 +0800
>>>>
>>>>      genirq/affinity: assign vectors to all possible CPUs
>>>>      
>>>>      commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
>>>>      
>>>>      Currently we assign managed interrupt vectors to all present CPUs.  This
>>>>      works fine for systems were we only online/offline CPUs.  But in case of
>>>>      systems that support physical CPU hotplug (or the virtualized version of
>>>>      it) this means the additional CPUs covered for in the ACPI tables or on
>>>>      the command line are not catered for.  To fix this we'd either need to
>>>>      introduce new hotplug CPU states just for this case, or we can start
>>>>      assining vectors to possible but not present CPUs.
>>>>      
>>>>      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>      Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
>>>>      Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>>>      Cc: linux-kernel@vger.kernel.org
>>>>      Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>      Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>      Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>>
>>>> The problem doesn’t happen with Linux 4.17.11, so there are commits in
>>>> Linux master fixing this. Unfortunately, my attempts to find out failed.
>>>>
>>>> I was able to cherry-pick the three commits below on top of 4.14.62,
>>>> but the problem persists.
>>>>
>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>>>> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
>>>> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>
>>>> Trying to cherry-pick the commits below, referencing the commit
>>>> in question, gave conflicts.
>>>>
>>>> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>>> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
>>>>
>>>> To avoid further trial and error with the server with a slow firmware,
>>>> do you know what commits should fix the issue?
>>>
>>> Look at the email on the stable mailing list:
>>> 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
>>> it should help you out here.
>>
>> Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
>> way to reply to that thread. Therefore, here is my reply.
>>
>>> Can you try the patches listed there?
>>
>> I tried some of these already without success.
>>
>> b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>> 2f31115e940c scsi: core: introduce force_blk_mq
>> adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>
>> The commit above is already in v4.14.56.
>>
>> 8b834bff1b73 scsi: hpsa: fix selection of reply queue
>>
>> The problem persists.
>>
>> The problem also persists with the state below.
>>
>> 3528f73a4e5d scsi: core: introduce force_blk_mq
>> 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
>> f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>> 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
>>
>> So, some more commits are necessary.
> 
> Or I revert the original patch here, and the follow-on ones that were
> added to "fix" this issue.  I think that might be the better thing
> overall here, right?  Have you tried that?

Yes, reverting the commit fixed the issue for us. If Christoph or Ming 
do not have another suggestion for a commit, that would be the way to go.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-11  8:14       ` Paul Menzel
@ 2018-08-11 13:50         ` Greg Kroah-Hartman
  2018-08-12  8:35           ` Paul Menzel
  2018-08-13  3:32           ` Ming Lei
  0 siblings, 2 replies; 21+ messages in thread
From: Greg Kroah-Hartman @ 2018-08-11 13:50 UTC (permalink / raw)
  To: Paul Menzel
  Cc: stable, Christoph Hellwig, Ming Lei, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

On Sat, Aug 11, 2018 at 10:14:18AM +0200, Paul Menzel wrote:
> Dear Greg,
> 
> 
> Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman:
> > On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:
> 
> > > On 08/10/18 15:36, Greg Kroah-Hartman wrote:
> > > > On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
> > > > > Dear Greg,
> > > > > 
> > > > > 
> > > > > Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
> > > > > for Linux 4.14.56 causes the aacraid module to not detect the attached devices
> > > > > anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
> > > > > 
> > > > > ```
> > > > > $ dmesg | grep raid
> > > > > [    0.269768] raid6: sse2x1   gen()  7179 MB/s
> > > > > [    0.290069] raid6: sse2x1   xor()  5636 MB/s
> > > > > [    0.311068] raid6: sse2x2   gen()  9160 MB/s
> > > > > [    0.332076] raid6: sse2x2   xor()  6375 MB/s
> > > > > [    0.353075] raid6: sse2x4   gen() 11164 MB/s
> > > > > [    0.374064] raid6: sse2x4   xor()  7429 MB/s
> > > > > [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
> > > > > [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
> > > > > [    0.391008] raid6: using ssse3x2 recovery algorithm
> > > > > [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> > > > > [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> > > > > [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
> > > > > [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
> > > > > [   10.743295] aacraid: Comm Interface type3 enabled
> > > > > $ lspci -nn | grep Adaptec
> > > > > 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> > > > > 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
> > > > > ```
> > > > > 
> > > > > But, it still works with a Dell PowerEdge R715 with two eight core AMD
> > > > > Opteron 6136, the card below.
> > > > > 
> > > > > ```
> > > > > $ lspci -nn | grep Adaptec
> > > > > 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> > > > > ```
> > > > > 
> > > > > Reverting the commit fixes the issue.
> > > > > 
> > > > > commit ef86f3a72adb8a7931f67335560740a7ad696d1d
> > > > > Author: Christoph Hellwig <hch@lst.de>
> > > > > Date:   Fri Jan 12 10:53:05 2018 +0800
> > > > > 
> > > > >      genirq/affinity: assign vectors to all possible CPUs
> > > > >      commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
> > > > >      Currently we assign managed interrupt vectors to all present CPUs.  This
> > > > >      works fine for systems were we only online/offline CPUs.  But in case of
> > > > >      systems that support physical CPU hotplug (or the virtualized version of
> > > > >      it) this means the additional CPUs covered for in the ACPI tables or on
> > > > >      the command line are not catered for.  To fix this we'd either need to
> > > > >      introduce new hotplug CPU states just for this case, or we can start
> > > > >      assining vectors to possible but not present CPUs.
> > > > >      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > > > >      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > > > >      Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
> > > > >      Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
> > > > >      Cc: linux-kernel@vger.kernel.org
> > > > >      Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > >      Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > >      Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > > >      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > > > 
> > > > > The problem doesn’t happen with Linux 4.17.11, so there are commits in
> > > > > Linux master fixing this. Unfortunately, my attempts to find out failed.
> > > > > 
> > > > > I was able to cherry-pick the three commits below on top of 4.14.62,
> > > > > but the problem persists.
> > > > > 
> > > > > 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> > > > > 355d7ecdea35 scsi: hpsa: fix selection of reply queue
> > > > > e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> > > > > 
> > > > > Trying to cherry-pick the commits below, referencing the commit
> > > > > in question, gave conflicts.
> > > > > 
> > > > > 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> > > > > 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
> > > > > 
> > > > > To avoid further trial and error with the server with a slow firmware,
> > > > > do you know what commits should fix the issue?
> > > > 
> > > > Look at the email on the stable mailing list:
> > > > 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
> > > > it should help you out here.
> > > 
> > > Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
> > > way to reply to that thread. Therefore, here is my reply.
> > > 
> > > > Can you try the patches listed there?
> > > 
> > > I tried some of these already without success.
> > > 
> > > b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> > > 2f31115e940c scsi: core: introduce force_blk_mq
> > > adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> > > 
> > > The commit above is already in v4.14.56.
> > > 
> > > 8b834bff1b73 scsi: hpsa: fix selection of reply queue
> > > 
> > > The problem persists.
> > > 
> > > The problem also persists with the state below.
> > > 
> > > 3528f73a4e5d scsi: core: introduce force_blk_mq
> > > 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
> > > f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> > > 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> > > 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
> > > 
> > > So, some more commits are necessary.
> > 
> > Or I revert the original patch here, and the follow-on ones that were
> > added to "fix" this issue.  I think that might be the better thing
> > overall here, right?  Have you tried that?
> 
> Yes, reverting the commit fixed the issue for us. If Christoph or Ming do
> not have another suggestion for a commit, that would be the way to go.

Christoph or Ming, any ideas here?

In looking at the aacraid code, I don't see anywhere that this is using
a specific cpu number for queues or anything, but I could be wrong.
Ideally this should also be failing in 4.17 or 4.18-rc right now as well
as I don't see anything that would have "fixed" this recently.  Unless
I'm missing something here?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-11 13:50         ` Greg Kroah-Hartman
@ 2018-08-12  8:35           ` Paul Menzel
  2018-08-13  3:32           ` Ming Lei
  1 sibling, 0 replies; 21+ messages in thread
From: Paul Menzel @ 2018-08-12  8:35 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: stable, Christoph Hellwig, Ming Lei, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

Dear Greg,


Am 11.08.2018 um 15:50 schrieb Greg Kroah-Hartman:
> On Sat, Aug 11, 2018 at 10:14:18AM +0200, Paul Menzel wrote:

>> Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman:
>>> On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:
>>
>>>> On 08/10/18 15:36, Greg Kroah-Hartman wrote:
>>>>> On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
>>>>>> Dear Greg,
>>>>>>
>>>>>>
>>>>>> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
>>>>>> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
>>>>>> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
>>>>>>
>>>>>> ```
>>>>>> $ dmesg | grep raid
>>>>>> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
>>>>>> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
>>>>>> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
>>>>>> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
>>>>>> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
>>>>>> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
>>>>>> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
>>>>>> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
>>>>>> [    0.391008] raid6: using ssse3x2 recovery algorithm
>>>>>> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>>>>>> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>>>>>> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
>>>>>> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
>>>>>> [   10.743295] aacraid: Comm Interface type3 enabled
>>>>>> $ lspci -nn | grep Adaptec
>>>>>> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>>>> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
>>>>>> ```
>>>>>>
>>>>>> But, it still works with a Dell PowerEdge R715 with two eight core AMD
>>>>>> Opteron 6136, the card below.
>>>>>>
>>>>>> ```
>>>>>> $ lspci -nn | grep Adaptec
>>>>>> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>>>> ```
>>>>>>
>>>>>> Reverting the commit fixes the issue.
>>>>>>
>>>>>> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
>>>>>> Author: Christoph Hellwig <hch@lst.de>
>>>>>> Date:   Fri Jan 12 10:53:05 2018 +0800
>>>>>>
>>>>>>       genirq/affinity: assign vectors to all possible CPUs
>>>>>>       commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
>>>>>>       Currently we assign managed interrupt vectors to all present CPUs.  This
>>>>>>       works fine for systems were we only online/offline CPUs.  But in case of
>>>>>>       systems that support physical CPU hotplug (or the virtualized version of
>>>>>>       it) this means the additional CPUs covered for in the ACPI tables or on
>>>>>>       the command line are not catered for.  To fix this we'd either need to
>>>>>>       introduce new hotplug CPU states just for this case, or we can start
>>>>>>       assining vectors to possible but not present CPUs.
>>>>>>       Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>>>       Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>>>       Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
>>>>>>       Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>>>>>       Cc: linux-kernel@vger.kernel.org
>>>>>>       Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>>>       Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>>>       Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>>>>
>>>>>> The problem doesn’t happen with Linux 4.17.11, so there are commits in
>>>>>> Linux master fixing this. Unfortunately, my attempts to find out failed.
>>>>>>
>>>>>> I was able to cherry-pick the three commits below on top of 4.14.62,
>>>>>> but the problem persists.
>>>>>>
>>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>>>>>> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
>>>>>> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>>>
>>>>>> Trying to cherry-pick the commits below, referencing the commit
>>>>>> in question, gave conflicts.
>>>>>>
>>>>>> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>>>>> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
>>>>>>
>>>>>> To avoid further trial and error with the server with a slow firmware,
>>>>>> do you know what commits should fix the issue?
>>>>>
>>>>> Look at the email on the stable mailing list:
>>>>> 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
>>>>> it should help you out here.
>>>>
>>>> Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
>>>> way to reply to that thread. Therefore, here is my reply.
>>>>
>>>>> Can you try the patches listed there?
>>>>
>>>> I tried some of these already without success.
>>>>
>>>> b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>> 2f31115e940c scsi: core: introduce force_blk_mq
>>>> adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>>>
>>>> The commit above is already in v4.14.56.
>>>>
>>>> 8b834bff1b73 scsi: hpsa: fix selection of reply queue
>>>>
>>>> The problem persists.
>>>>
>>>> The problem also persists with the state below.
>>>>
>>>> 3528f73a4e5d scsi: core: introduce force_blk_mq
>>>> 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
>>>> f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>>>> 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
>>>>
>>>> So, some more commits are necessary.
>>>
>>> Or I revert the original patch here, and the follow-on ones that were
>>> added to "fix" this issue.  I think that might be the better thing
>>> overall here, right?  Have you tried that?
>>
>> Yes, reverting the commit fixed the issue for us. If Christoph or Ming do
>> not have another suggestion for a commit, that would be the way to go.
> 
> Christoph or Ming, any ideas here?
> 
> In looking at the aacraid code, I don't see anywhere that this is using
> a specific cpu number for queues or anything, but I could be wrong.
> Ideally this should also be failing in 4.17 or 4.18-rc right now as well
> as I don't see anything that would have "fixed" this recently.  Unless
> I'm missing something here?

As written strangely, it works with Linux 4.17.11.


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-11 13:50         ` Greg Kroah-Hartman
  2018-08-12  8:35           ` Paul Menzel
@ 2018-08-13  3:32           ` Ming Lei
  2018-08-16 17:09             ` Paul Menzel
  1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2018-08-13  3:32 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Paul Menzel, stable, Christoph Hellwig,
	Linux Kernel Mailing List, it+linux-scsi,
	Adaptec OEM Raid Solutions, linux-scsi

On Sat, Aug 11, 2018 at 03:50:21PM +0200, Greg Kroah-Hartman wrote:
> On Sat, Aug 11, 2018 at 10:14:18AM +0200, Paul Menzel wrote:
> > Dear Greg,
> > 
> > 
> > Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman:
> > > On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:
> > 
> > > > On 08/10/18 15:36, Greg Kroah-Hartman wrote:
> > > > > On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
> > > > > > Dear Greg,
> > > > > > 
> > > > > > 
> > > > > > Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
> > > > > > for Linux 4.14.56 causes the aacraid module to not detect the attached devices
> > > > > > anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
> > > > > > 
> > > > > > ```
> > > > > > $ dmesg | grep raid
> > > > > > [    0.269768] raid6: sse2x1   gen()  7179 MB/s
> > > > > > [    0.290069] raid6: sse2x1   xor()  5636 MB/s
> > > > > > [    0.311068] raid6: sse2x2   gen()  9160 MB/s
> > > > > > [    0.332076] raid6: sse2x2   xor()  6375 MB/s
> > > > > > [    0.353075] raid6: sse2x4   gen() 11164 MB/s
> > > > > > [    0.374064] raid6: sse2x4   xor()  7429 MB/s
> > > > > > [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
> > > > > > [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
> > > > > > [    0.391008] raid6: using ssse3x2 recovery algorithm
> > > > > > [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> > > > > > [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> > > > > > [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
> > > > > > [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
> > > > > > [   10.743295] aacraid: Comm Interface type3 enabled
> > > > > > $ lspci -nn | grep Adaptec
> > > > > > 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> > > > > > 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
> > > > > > ```
> > > > > > 
> > > > > > But, it still works with a Dell PowerEdge R715 with two eight core AMD
> > > > > > Opteron 6136, the card below.
> > > > > > 
> > > > > > ```
> > > > > > $ lspci -nn | grep Adaptec
> > > > > > 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> > > > > > ```
> > > > > > 
> > > > > > Reverting the commit fixes the issue.
> > > > > > 
> > > > > > commit ef86f3a72adb8a7931f67335560740a7ad696d1d
> > > > > > Author: Christoph Hellwig <hch@lst.de>
> > > > > > Date:   Fri Jan 12 10:53:05 2018 +0800
> > > > > > 
> > > > > >      genirq/affinity: assign vectors to all possible CPUs
> > > > > >      commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
> > > > > >      Currently we assign managed interrupt vectors to all present CPUs.  This
> > > > > >      works fine for systems were we only online/offline CPUs.  But in case of
> > > > > >      systems that support physical CPU hotplug (or the virtualized version of
> > > > > >      it) this means the additional CPUs covered for in the ACPI tables or on
> > > > > >      the command line are not catered for.  To fix this we'd either need to
> > > > > >      introduce new hotplug CPU states just for this case, or we can start
> > > > > >      assining vectors to possible but not present CPUs.
> > > > > >      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > > > > >      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> > > > > >      Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
> > > > > >      Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
> > > > > >      Cc: linux-kernel@vger.kernel.org
> > > > > >      Cc: Thomas Gleixner <tglx@linutronix.de>
> > > > > >      Signed-off-by: Christoph Hellwig <hch@lst.de>
> > > > > >      Signed-off-by: Jens Axboe <axboe@kernel.dk>
> > > > > >      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > > > > 
> > > > > > The problem doesn’t happen with Linux 4.17.11, so there are commits in
> > > > > > Linux master fixing this. Unfortunately, my attempts to find out failed.
> > > > > > 
> > > > > > I was able to cherry-pick the three commits below on top of 4.14.62,
> > > > > > but the problem persists.
> > > > > > 
> > > > > > 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> > > > > > 355d7ecdea35 scsi: hpsa: fix selection of reply queue
> > > > > > e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> > > > > > 
> > > > > > Trying to cherry-pick the commits below, referencing the commit
> > > > > > in question, gave conflicts.
> > > > > > 
> > > > > > 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> > > > > > 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
> > > > > > 
> > > > > > To avoid further trial and error with the server with a slow firmware,
> > > > > > do you know what commits should fix the issue?
> > > > > 
> > > > > Look at the email on the stable mailing list:
> > > > > 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
> > > > > it should help you out here.
> > > > 
> > > > Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
> > > > way to reply to that thread. Therefore, here is my reply.
> > > > 
> > > > > Can you try the patches listed there?
> > > > 
> > > > I tried some of these already without success.
> > > > 
> > > > b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> > > > 2f31115e940c scsi: core: introduce force_blk_mq
> > > > adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> > > > 
> > > > The commit above is already in v4.14.56.
> > > > 
> > > > 8b834bff1b73 scsi: hpsa: fix selection of reply queue
> > > > 
> > > > The problem persists.
> > > > 
> > > > The problem also persists with the state below.
> > > > 
> > > > 3528f73a4e5d scsi: core: introduce force_blk_mq
> > > > 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
> > > > f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> > > > 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> > > > 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
> > > > 
> > > > So, some more commits are necessary.
> > > 
> > > Or I revert the original patch here, and the follow-on ones that were
> > > added to "fix" this issue.  I think that might be the better thing
> > > overall here, right?  Have you tried that?
> > 
> > Yes, reverting the commit fixed the issue for us. If Christoph or Ming do
> > not have another suggestion for a commit, that would be the way to go.
> 
> Christoph or Ming, any ideas here?
> 
> In looking at the aacraid code, I don't see anywhere that this is using
> a specific cpu number for queues or anything, but I could be wrong.
> Ideally this should also be failing in 4.17 or 4.18-rc right now as well
> as I don't see anything that would have "fixed" this recently.  Unless
> I'm missing something here?

From both aac_fib_vector_assign() and aac_src_deliver_message(), looks
every msix vector is used inside driver, so causes this trouble.

From 84676c1f21e8ff54b (genirq/affinity: assign vectors to all possible CPUs),
actually it should have been since 9a0ef98e18 (genirq/affinity: Assign vectors
to all present CPUs), it isn't guaranteed that one vector can be mapped to
at least one online CPU.

So it should be fixed in driver by similar approach taken in the
following patches:
	scsi: hpsa: fix selection of reply queue
	scsi: megaraid_sas: fix selection of reply queue

And the reason this issue won't be observed is that the following
patches(merged to v4.17) may avoid this issue, but in theory, the issue
won't be avoided completely by these patches:

d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
1a2d0914e23a genirq/affinity: Allow irq spreading from a given starting point
b3e6aaa8d94d genirq/affinity: Move actual irq vector spreading into a helper function
47778f33dcba genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask
0211e12dd0a5 genirq/affinity: Don't return with empty affinity masks on error

In short, this issue should have been fixed in upstream aacraid, and the idea
is simple, use the current CPU id to retrieve the msix vector offset, this
way will make sure that the msix vector can be mapped from one online
CPU.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-13  3:32           ` Ming Lei
@ 2018-08-16 17:09             ` Paul Menzel
  2018-09-11 10:53               ` Paul Menzel
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-08-16 17:09 UTC (permalink / raw)
  To: Ming Lei, Greg Kroah-Hartman
  Cc: stable, Christoph Hellwig, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi

[-- Attachment #1: Type: text/plain, Size: 8636 bytes --]

Dear Ming,


On 08/13/18 05:32, Ming Lei wrote:
> On Sat, Aug 11, 2018 at 03:50:21PM +0200, Greg Kroah-Hartman wrote:
>> On Sat, Aug 11, 2018 at 10:14:18AM +0200, Paul Menzel wrote:

>>> Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman:
>>>> On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:
>>>
>>>>> On 08/10/18 15:36, Greg Kroah-Hartman wrote:
>>>>>> On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:

>>>>>>> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
>>>>>>> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
>>>>>>> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
>>>>>>>
>>>>>>> ```
>>>>>>> $ dmesg | grep raid
>>>>>>> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
>>>>>>> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
>>>>>>> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
>>>>>>> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
>>>>>>> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
>>>>>>> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
>>>>>>> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
>>>>>>> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
>>>>>>> [    0.391008] raid6: using ssse3x2 recovery algorithm
>>>>>>> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>>>>>>> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>>>>>>> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
>>>>>>> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
>>>>>>> [   10.743295] aacraid: Comm Interface type3 enabled
>>>>>>> $ lspci -nn | grep Adaptec
>>>>>>> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>>>>> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
>>>>>>> ```
>>>>>>>
>>>>>>> But, it still works with a Dell PowerEdge R715 with two eight core AMD
>>>>>>> Opteron 6136, the card below.
>>>>>>>
>>>>>>> ```
>>>>>>> $ lspci -nn | grep Adaptec
>>>>>>> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>>>>> ```
>>>>>>>
>>>>>>> Reverting the commit fixes the issue.
>>>>>>>
>>>>>>> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
>>>>>>> Author: Christoph Hellwig <hch@lst.de>
>>>>>>> Date:   Fri Jan 12 10:53:05 2018 +0800
>>>>>>>
>>>>>>>      genirq/affinity: assign vectors to all possible CPUs
>>>>>>>      commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
>>>>>>>      Currently we assign managed interrupt vectors to all present CPUs.  This
>>>>>>>      works fine for systems were we only online/offline CPUs.  But in case of
>>>>>>>      systems that support physical CPU hotplug (or the virtualized version of
>>>>>>>      it) this means the additional CPUs covered for in the ACPI tables or on
>>>>>>>      the command line are not catered for.  To fix this we'd either need to
>>>>>>>      introduce new hotplug CPU states just for this case, or we can start
>>>>>>>      assining vectors to possible but not present CPUs.
>>>>>>>      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>>>>      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>>>>      Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
>>>>>>>      Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>>>>>>      Cc: linux-kernel@vger.kernel.org
>>>>>>>      Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>>>>      Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>>>>      Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>>>>>
>>>>>>> The problem doesn’t happen with Linux 4.17.11, so there are commits in
>>>>>>> Linux master fixing this. Unfortunately, my attempts to find out failed.
>>>>>>>
>>>>>>> I was able to cherry-pick the three commits below on top of 4.14.62,
>>>>>>> but the problem persists.
>>>>>>>
>>>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>>>>>>> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
>>>>>>> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>>>>
>>>>>>> Trying to cherry-pick the commits below, referencing the commit
>>>>>>> in question, gave conflicts.
>>>>>>>
>>>>>>> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>>>>>> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
>>>>>>>
>>>>>>> To avoid further trial and error with the server with a slow firmware,
>>>>>>> do you know what commits should fix the issue?
>>>>>>
>>>>>> Look at the email on the stable mailing list:
>>>>>> 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
>>>>>> it should help you out here.
>>>>>
>>>>> Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
>>>>> way to reply to that thread. Therefore, here is my reply.
>>>>>
>>>>>> Can you try the patches listed there?
>>>>>
>>>>> I tried some of these already without success.
>>>>>
>>>>> b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>> 2f31115e940c scsi: core: introduce force_blk_mq
>>>>> adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>>>>
>>>>> The commit above is already in v4.14.56.
>>>>>
>>>>> 8b834bff1b73 scsi: hpsa: fix selection of reply queue
>>>>>
>>>>> The problem persists.
>>>>>
>>>>> The problem also persists with the state below.
>>>>>
>>>>> 3528f73a4e5d scsi: core: introduce force_blk_mq
>>>>> 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
>>>>> f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>>>>> 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
>>>>>
>>>>> So, some more commits are necessary.
>>>>
>>>> Or I revert the original patch here, and the follow-on ones that were
>>>> added to "fix" this issue.  I think that might be the better thing
>>>> overall here, right?  Have you tried that?
>>>
>>> Yes, reverting the commit fixed the issue for us. If Christoph or Ming do
>>> not have another suggestion for a commit, that would be the way to go.
>>
>> Christoph or Ming, any ideas here?
>>
>> In looking at the aacraid code, I don't see anywhere that this is using
>> a specific cpu number for queues or anything, but I could be wrong.
>> Ideally this should also be failing in 4.17 or 4.18-rc right now as well
>> as I don't see anything that would have "fixed" this recently.  Unless
>> I'm missing something here?
> 
> From both aac_fib_vector_assign() and aac_src_deliver_message(), looks
> every msix vector is used inside driver, so causes this trouble.
> 
> From 84676c1f21e8ff54b (genirq/affinity: assign vectors to all possible CPUs),
> actually it should have been since 9a0ef98e18 (genirq/affinity: Assign vectors
> to all present CPUs), it isn't guaranteed that one vector can be mapped to
> at least one online CPU.
> 
> So it should be fixed in driver by similar approach taken in the
> following patches:
> 	scsi: hpsa: fix selection of reply queue
> 	scsi: megaraid_sas: fix selection of reply queue
> 
> And the reason this issue won't be observed is that the following
> patches(merged to v4.17) may avoid this issue, but in theory, the issue
> won't be avoided completely by these patches:
> 
> d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
> 1a2d0914e23a genirq/affinity: Allow irq spreading from a given starting point
> b3e6aaa8d94d genirq/affinity: Move actual irq vector spreading into a helper function
> 47778f33dcba genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask
> 0211e12dd0a5 genirq/affinity: Don't return with empty affinity masks on error
> 
> In short, this issue should have been fixed in upstream aacraid, and the idea
> is simple, use the current CPU id to retrieve the msix vector offset, this
> way will make sure that the msix vector can be mapped from one online
> CPU.

AACRAID folks, are you going to work on that? I looked at the code a bit,
and it’ll take me some days to understand it and to fix it.

Just out of curiosity, shouldn’t the subsystems(?) communicate better to
avoid such breakage, or was that unexpected?


Kind regards,

Paul


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-08-16 17:09             ` Paul Menzel
@ 2018-09-11 10:53               ` Paul Menzel
  2018-10-01 12:33                 ` [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs" Paul Menzel
  2019-02-18 11:40                 ` aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs* Greg Kroah-Hartman
  0 siblings, 2 replies; 21+ messages in thread
From: Paul Menzel @ 2018-09-11 10:53 UTC (permalink / raw)
  To: Ming Lei, Greg Kroah-Hartman
  Cc: stable, Christoph Hellwig, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

[-- Attachment #1: Type: text/plain, Size: 9106 bytes --]

Dear Ming, dear Raghava, dear Dave,


On 08/16/18 19:09, Paul Menzel wrote:

> On 08/13/18 05:32, Ming Lei wrote:
>> On Sat, Aug 11, 2018 at 03:50:21PM +0200, Greg Kroah-Hartman wrote:
>>> On Sat, Aug 11, 2018 at 10:14:18AM +0200, Paul Menzel wrote:
> 
>>>> Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman:
>>>>> On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:
>>>>
>>>>>> On 08/10/18 15:36, Greg Kroah-Hartman wrote:
>>>>>>> On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
> 
>>>>>>>> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
>>>>>>>> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
>>>>>>>> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
>>>>>>>>
>>>>>>>> ```
>>>>>>>> $ dmesg | grep raid
>>>>>>>> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
>>>>>>>> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
>>>>>>>> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
>>>>>>>> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
>>>>>>>> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
>>>>>>>> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
>>>>>>>> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
>>>>>>>> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
>>>>>>>> [    0.391008] raid6: using ssse3x2 recovery algorithm
>>>>>>>> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>>>>>>>> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>>>>>>>> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
>>>>>>>> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
>>>>>>>> [   10.743295] aacraid: Comm Interface type3 enabled
>>>>>>>> $ lspci -nn | grep Adaptec
>>>>>>>> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>>>>>> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> But, it still works with a Dell PowerEdge R715 with two eight core AMD
>>>>>>>> Opteron 6136, the card below.
>>>>>>>>
>>>>>>>> ```
>>>>>>>> $ lspci -nn | grep Adaptec
>>>>>>>> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Reverting the commit fixes the issue.
>>>>>>>>
>>>>>>>> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
>>>>>>>> Author: Christoph Hellwig <hch@lst.de>
>>>>>>>> Date:   Fri Jan 12 10:53:05 2018 +0800
>>>>>>>>
>>>>>>>>      genirq/affinity: assign vectors to all possible CPUs
>>>>>>>>      commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
>>>>>>>>      Currently we assign managed interrupt vectors to all present CPUs.  This
>>>>>>>>      works fine for systems were we only online/offline CPUs.  But in case of
>>>>>>>>      systems that support physical CPU hotplug (or the virtualized version of
>>>>>>>>      it) this means the additional CPUs covered for in the ACPI tables or on
>>>>>>>>      the command line are not catered for.  To fix this we'd either need to
>>>>>>>>      introduce new hotplug CPU states just for this case, or we can start
>>>>>>>>      assining vectors to possible but not present CPUs.
>>>>>>>>      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>>>>>      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
>>>>>>>>      Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
>>>>>>>>      Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>>>>>>>      Cc: linux-kernel@vger.kernel.org
>>>>>>>>      Cc: Thomas Gleixner <tglx@linutronix.de>
>>>>>>>>      Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>>>>>>      Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>>>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
>>>>>>>>
>>>>>>>> The problem doesn’t happen with Linux 4.17.11, so there are commits in
>>>>>>>> Linux master fixing this. Unfortunately, my attempts to find out failed.
>>>>>>>>
>>>>>>>> I was able to cherry-pick the three commits below on top of 4.14.62,
>>>>>>>> but the problem persists.
>>>>>>>>
>>>>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>>>>>>>> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
>>>>>>>> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>>>>>
>>>>>>>> Trying to cherry-pick the commits below, referencing the commit
>>>>>>>> in question, gave conflicts.
>>>>>>>>
>>>>>>>> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>>>>>>> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
>>>>>>>>
>>>>>>>> To avoid further trial and error with the server with a slow firmware,
>>>>>>>> do you know what commits should fix the issue?
>>>>>>>
>>>>>>> Look at the email on the stable mailing list:
>>>>>>> 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
>>>>>>> it should help you out here.
>>>>>>
>>>>>> Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
>>>>>> way to reply to that thread. Therefore, here is my reply.
>>>>>>
>>>>>>> Can you try the patches listed there?
>>>>>>
>>>>>> I tried some of these already without success.
>>>>>>
>>>>>> b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>>> 2f31115e940c scsi: core: introduce force_blk_mq
>>>>>> adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
>>>>>>
>>>>>> The commit above is already in v4.14.56.
>>>>>>
>>>>>> 8b834bff1b73 scsi: hpsa: fix selection of reply queue
>>>>>>
>>>>>> The problem persists.
>>>>>>
>>>>>> The problem also persists with the state below.
>>>>>>
>>>>>> 3528f73a4e5d scsi: core: introduce force_blk_mq
>>>>>> 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
>>>>>> f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
>>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
>>>>>> 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
>>>>>>
>>>>>> So, some more commits are necessary.
>>>>>
>>>>> Or I revert the original patch here, and the follow-on ones that were
>>>>> added to "fix" this issue.  I think that might be the better thing
>>>>> overall here, right?  Have you tried that?
>>>>
>>>> Yes, reverting the commit fixed the issue for us. If Christoph or Ming do
>>>> not have another suggestion for a commit, that would be the way to go.
>>>
>>> Christoph or Ming, any ideas here?
>>>
>>> In looking at the aacraid code, I don't see anywhere that this is using
>>> a specific cpu number for queues or anything, but I could be wrong.
>>> Ideally this should also be failing in 4.17 or 4.18-rc right now as well
>>> as I don't see anything that would have "fixed" this recently.  Unless
>>> I'm missing something here?
>>
>> From both aac_fib_vector_assign() and aac_src_deliver_message(), looks
>> every msix vector is used inside driver, so causes this trouble.
>>
>> From 84676c1f21e8ff54b (genirq/affinity: assign vectors to all possible CPUs),
>> actually it should have been since 9a0ef98e18 (genirq/affinity: Assign vectors
>> to all present CPUs), it isn't guaranteed that one vector can be mapped to
>> at least one online CPU.
>>
>> So it should be fixed in driver by similar approach taken in the
>> following patches:
>> 	scsi: hpsa: fix selection of reply queue
>> 	scsi: megaraid_sas: fix selection of reply queue
>>
>> And the reason this issue won't be observed is that the following
>> patches(merged to v4.17) may avoid this issue, but in theory, the issue
>> won't be avoided completely by these patches:
>>
>> d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
>> 1a2d0914e23a genirq/affinity: Allow irq spreading from a given starting point
>> b3e6aaa8d94d genirq/affinity: Move actual irq vector spreading into a helper function
>> 47778f33dcba genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask
>> 0211e12dd0a5 genirq/affinity: Don't return with empty affinity masks on error
>>
>> In short, this issue should have been fixed in upstream aacraid, and the idea
>> is simple, use the current CPU id to retrieve the msix vector offset, this
>> way will make sure that the msix vector can be mapped from one online
>> CPU.
> 
> AACRAID folks, are you going to work on that? I looked at the code a bit,
> and it’ll take me some days to understand it and to fix it.

Unfortunately, there was no reply yet and I won’t be able to fix it any
time soon. The aacraid driver is still broken in 4.14.68. Reverting the
*one* mentioned commit fixes the issue.

Raghava, Dave, are you aware of the issue.

> Just out of curiosity, shouldn’t the subsystems(?) communicate better to
> avoid such breakage, or was that unexpected?


Kind regards,

Paul


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-09-11 10:53               ` Paul Menzel
@ 2018-10-01 12:33                 ` Paul Menzel
  2018-10-01 12:35                   ` Christoph Hellwig
  2018-10-08  2:11                   ` Ming Lei
  2019-02-18 11:40                 ` aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs* Greg Kroah-Hartman
  1 sibling, 2 replies; 21+ messages in thread
From: Paul Menzel @ 2018-10-01 12:33 UTC (permalink / raw)
  To: Ming Lei, Greg Kroah-Hartman
  Cc: stable, Christoph Hellwig, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

[-- Attachment #1: Type: text/plain, Size: 5073 bytes --]

Date: Wed, 29 Aug 2018 17:28:45 +0200

This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.

See the discussion in the thread *aacraid: Regression in 4.14.56
with *genirq/affinity: assign vectors to all possible CPUs** on
the linux-scsi list.

Reverting the commit, the aacraid driver loads correctly, and the
drives are visible. So revert the commit to adhere to Linux’ no
regression policy.

Strangely, the issue is not reproducible with Linux 4.18.x, but
Ming Lei wrote, that this might only be by chance.

Link: https://www.spinics.net/lists/linux-scsi/msg122732.html
Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>

---
 kernel/irq/affinity.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..e12d35108225 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
 	}
 }
 
-static cpumask_var_t *alloc_node_to_possible_cpumask(void)
+static cpumask_var_t *alloc_node_to_present_cpumask(void)
 {
 	cpumask_var_t *masks;
 	int node;
@@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_possible_cpumask(void)
 	return NULL;
 }
 
-static void free_node_to_possible_cpumask(cpumask_var_t *masks)
+static void free_node_to_present_cpumask(cpumask_var_t *masks)
 {
 	int node;
 
@@ -71,22 +71,22 @@ static void free_node_to_possible_cpumask(cpumask_var_t *masks)
 	kfree(masks);
 }
 
-static void build_node_to_possible_cpumask(cpumask_var_t *masks)
+static void build_node_to_present_cpumask(cpumask_var_t *masks)
 {
 	int cpu;
 
-	for_each_possible_cpu(cpu)
+	for_each_present_cpu(cpu)
 		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }
 
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
 				const struct cpumask *mask, nodemask_t *nodemsk)
 {
 	int n, nodes = 0;
 
 	/* Calculate the number of nodes in the supplied affinity mask */
 	for_each_node(n) {
-		if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
+		if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
 			node_set(n, *nodemsk);
 			nodes++;
 		}
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	int last_affv = affv + affd->pre_vectors;
 	nodemask_t nodemsk = NODE_MASK_NONE;
 	struct cpumask *masks;
-	cpumask_var_t nmsk, *node_to_possible_cpumask;
+	cpumask_var_t nmsk, *node_to_present_cpumask;
 
 	/*
 	 * If there aren't any vectors left after applying the pre/post
@@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	if (!masks)
 		goto out;
 
-	node_to_possible_cpumask = alloc_node_to_possible_cpumask();
-	if (!node_to_possible_cpumask)
+	node_to_present_cpumask = alloc_node_to_present_cpumask();
+	if (!node_to_present_cpumask)
 		goto out;
 
 	/* Fill out vectors at the beginning that don't need affinity */
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 
 	/* Stabilize the cpumasks */
 	get_online_cpus();
-	build_node_to_possible_cpumask(node_to_possible_cpumask);
-	nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
+	build_node_to_present_cpumask(node_to_present_cpumask);
+	nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
 				     &nodemsk);
 
 	/*
@@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	if (affv <= nodes) {
 		for_each_node_mask(n, nodemsk) {
 			cpumask_copy(masks + curvec,
-				     node_to_possible_cpumask[n]);
+				     node_to_present_cpumask[n]);
 			if (++curvec == last_affv)
 				break;
 		}
@@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
 		/* Get the cpus on this node which are in the mask */
-		cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
+		cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
 
 		/* Calculate the number of cpus per vector */
 		ncpus = cpumask_weight(nmsk);
@@ -192,7 +192,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	/* Fill out vectors at the end that don't need affinity */
 	for (; curvec < nvecs; curvec++)
 		cpumask_copy(masks + curvec, irq_default_affinity);
-	free_node_to_possible_cpumask(node_to_possible_cpumask);
+	free_node_to_present_cpumask(node_to_present_cpumask);
 out:
 	free_cpumask_var(nmsk);
 	return masks;
@@ -214,7 +214,7 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
 		return 0;
 
 	get_online_cpus();
-	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
+	ret = min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
 	put_online_cpus();
 	return ret;
 }
-- 
2.17.1


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-01 12:33                 ` [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs" Paul Menzel
@ 2018-10-01 12:35                   ` Christoph Hellwig
  2018-10-01 12:43                     ` Paul Menzel
  2018-10-08  2:11                   ` Ming Lei
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2018-10-01 12:35 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ming Lei, Greg Kroah-Hartman, stable, Christoph Hellwig,
	Linux Kernel Mailing List, it+linux-scsi,
	Adaptec OEM Raid Solutions, linux-scsi, Raghava Aditya Renukunta,
	Dave Carroll

On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
> Date: Wed, 29 Aug 2018 17:28:45 +0200
> 
> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.

This seems rather odd.  If add all you'd revert the patch adding the
PCI_IRQ_AFFINITY to aacraid, not core infrastructure.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-01 12:35                   ` Christoph Hellwig
@ 2018-10-01 12:43                     ` Paul Menzel
  2018-10-01 15:59                       ` Paul Menzel
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-10-01 12:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Greg Kroah-Hartman, stable, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]

Dear Christoph,


On 10/01/18 14:35, Christoph Hellwig wrote:
> On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
>> Date: Wed, 29 Aug 2018 17:28:45 +0200
>>
>> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
> 
> This seems rather odd.  If at all you'd revert the patch adding the
> PCI_IRQ_AFFINITY to aacraid, not core infrastructure.

Thank you for the suggestion, but that flag was added in 2016
to the aacraid driver.

> commit 0910d8bbdd99856af1394d3d8830955abdefee4a
> Author: Hannes Reinecke <hare@suse.de>
> Date:   Tue Nov 8 08:11:30 2016 +0100
> 
>     scsi: aacraid: switch to pci_alloc_irq_vectors
>     
>     Use pci_alloc_irq_vectors and drop the hand-crafted interrupt affinity
>     routines.

So what would happen, if `PCI_IRQ_AFFINITY` was removed? Will the
system still work with the same performance?

As far as I understood, the no regression policy is there for
exactly that reason, and it shouldn’t matter if it’s core
infrastructure or not. As written, I have no idea, and just know
reverting the commit in question fixes the problem here. So I’ll
gladly test other solutions to fix this issue.


Kind regards,

Paul


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-01 12:43                     ` Paul Menzel
@ 2018-10-01 15:59                       ` Paul Menzel
  2018-10-15 12:17                         ` Paul Menzel
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-10-01 15:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ming Lei, Greg Kroah-Hartman, stable, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

[-- Attachment #1: Type: text/plain, Size: 1493 bytes --]

Dear Christoph,


On 10/01/18 14:43, Paul Menzel wrote:

> On 10/01/18 14:35, Christoph Hellwig wrote:
>> On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
>>> Date: Wed, 29 Aug 2018 17:28:45 +0200
>>>
>>> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
>>
>> This seems rather odd.  If at all you'd revert the patch adding the
>> PCI_IRQ_AFFINITY to aacraid, not core infrastructure.
> 
> Thank you for the suggestion, but that flag was added in 2016
> to the aacraid driver.
> 
>> commit 0910d8bbdd99856af1394d3d8830955abdefee4a
>> Author: Hannes Reinecke <hare@suse.de>
>> Date:   Tue Nov 8 08:11:30 2016 +0100
>>
>>     scsi: aacraid: switch to pci_alloc_irq_vectors
>>     
>>     Use pci_alloc_irq_vectors and drop the hand-crafted interrupt affinity
>>     routines.
> 
> So what would happen, if `PCI_IRQ_AFFINITY` was removed? Will the
> system still work with the same performance?
> 
> As far as I understood, the no regression policy is there for
> exactly that reason, and it shouldn’t matter if it’s core
> infrastructure or not. As written, I have no idea, and just know
> reverting the commit in question fixes the problem here. So I’ll
> gladly test other solutions to fix this issue.

Just as another datapoint, with `PCI_IRQ_AFFINITY` removed from
`drivers/scsi/aacraid/comminit.c` in Linux 4.14.73, the driver
initializes correctly. I have no idea regarding the performance.


Kind regards,

Paul


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-01 12:33                 ` [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs" Paul Menzel
  2018-10-01 12:35                   ` Christoph Hellwig
@ 2018-10-08  2:11                   ` Ming Lei
  2018-10-08  7:56                     ` IT (Donald Buczek)
  1 sibling, 1 reply; 21+ messages in thread
From: Ming Lei @ 2018-10-08  2:11 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Greg Kroah-Hartman, stable, Christoph Hellwig,
	Linux Kernel Mailing List, it+linux-scsi,
	Adaptec OEM Raid Solutions, linux-scsi, Raghava Aditya Renukunta,
	Dave Carroll

On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
> Date: Wed, 29 Aug 2018 17:28:45 +0200
> 
> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
> 
> See the discussion in the thread *aacraid: Regression in 4.14.56
> with *genirq/affinity: assign vectors to all possible CPUs** on
> the linux-scsi list.
> 
> Reverting the commit, the aacraid driver loads correctly, and the
> drives are visible. So revert the commit to adhere to Linux’ no
> regression policy.
> 
> Strangely, the issue is not reproducible with Linux 4.18.x, but
> Ming Lei wrote, that this might only be by chance.
> 
> Link: https://www.spinics.net/lists/linux-scsi/msg122732.html
> Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
> 
> ---
>  kernel/irq/affinity.c | 30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
> index a37a3b4b6342..e12d35108225 100644
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>  	}
>  }
>  
> -static cpumask_var_t *alloc_node_to_possible_cpumask(void)
> +static cpumask_var_t *alloc_node_to_present_cpumask(void)
>  {
>  	cpumask_var_t *masks;
>  	int node;
> @@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_possible_cpumask(void)
>  	return NULL;
>  }
>  
> -static void free_node_to_possible_cpumask(cpumask_var_t *masks)
> +static void free_node_to_present_cpumask(cpumask_var_t *masks)
>  {
>  	int node;
>  
> @@ -71,22 +71,22 @@ static void free_node_to_possible_cpumask(cpumask_var_t *masks)
>  	kfree(masks);
>  }
>  
> -static void build_node_to_possible_cpumask(cpumask_var_t *masks)
> +static void build_node_to_present_cpumask(cpumask_var_t *masks)
>  {
>  	int cpu;
>  
> -	for_each_possible_cpu(cpu)
> +	for_each_present_cpu(cpu)
>  		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
>  }
>  
> -static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
> +static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
>  				const struct cpumask *mask, nodemask_t *nodemsk)
>  {
>  	int n, nodes = 0;
>  
>  	/* Calculate the number of nodes in the supplied affinity mask */
>  	for_each_node(n) {
> -		if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
> +		if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
>  			node_set(n, *nodemsk);
>  			nodes++;
>  		}
> @@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	int last_affv = affv + affd->pre_vectors;
>  	nodemask_t nodemsk = NODE_MASK_NONE;
>  	struct cpumask *masks;
> -	cpumask_var_t nmsk, *node_to_possible_cpumask;
> +	cpumask_var_t nmsk, *node_to_present_cpumask;
>  
>  	/*
>  	 * If there aren't any vectors left after applying the pre/post
> @@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	if (!masks)
>  		goto out;
>  
> -	node_to_possible_cpumask = alloc_node_to_possible_cpumask();
> -	if (!node_to_possible_cpumask)
> +	node_to_present_cpumask = alloc_node_to_present_cpumask();
> +	if (!node_to_present_cpumask)
>  		goto out;
>  
>  	/* Fill out vectors at the beginning that don't need affinity */
> @@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  
>  	/* Stabilize the cpumasks */
>  	get_online_cpus();
> -	build_node_to_possible_cpumask(node_to_possible_cpumask);
> -	nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
> +	build_node_to_present_cpumask(node_to_present_cpumask);
> +	nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
>  				     &nodemsk);
>  
>  	/*
> @@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	if (affv <= nodes) {
>  		for_each_node_mask(n, nodemsk) {
>  			cpumask_copy(masks + curvec,
> -				     node_to_possible_cpumask[n]);
> +				     node_to_present_cpumask[n]);
>  			if (++curvec == last_affv)
>  				break;
>  		}
> @@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
>  
>  		/* Get the cpus on this node which are in the mask */
> -		cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
> +		cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
>  
>  		/* Calculate the number of cpus per vector */
>  		ncpus = cpumask_weight(nmsk);
> @@ -192,7 +192,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>  	/* Fill out vectors at the end that don't need affinity */
>  	for (; curvec < nvecs; curvec++)
>  		cpumask_copy(masks + curvec, irq_default_affinity);
> -	free_node_to_possible_cpumask(node_to_possible_cpumask);
> +	free_node_to_present_cpumask(node_to_present_cpumask);
>  out:
>  	free_cpumask_var(nmsk);
>  	return masks;
> @@ -214,7 +214,7 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>  		return 0;
>  
>  	get_online_cpus();
> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
> +	ret = min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
>  	put_online_cpus();
>  	return ret;
>  }
> -- 
> 2.17.1
> 

Hello Paul,

I have explained that this is one aacraid issue in the following link:

https://marc.info/?l=linux-kernel&m=153413115909580&w=2

So this issue should have been fixed in the driver instead of genirq
core code, otherwise regression on other drivers can be caused too.

So I suggest you to ask aacraid guys to take a look at this issue.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-08  2:11                   ` Ming Lei
@ 2018-10-08  7:56                     ` IT (Donald Buczek)
  0 siblings, 0 replies; 21+ messages in thread
From: IT (Donald Buczek) @ 2018-10-08  7:56 UTC (permalink / raw)
  To: Ming Lei
  Cc: Paul Menzel, Greg Kroah-Hartman, stable, Christoph Hellwig,
	Linux Kernel Mailing List, it+linux-scsi,
	Adaptec OEM Raid Solutions, linux-scsi, Raghava Aditya Renukunta,
	Dave Carroll

On 10/08/18 04:11, Ming Lei wrote:
> On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
>> Date: Wed, 29 Aug 2018 17:28:45 +0200
>>
>> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
>>
>> See the discussion in the thread *aacraid: Regression in 4.14.56
>> with *genirq/affinity: assign vectors to all possible CPUs** on
>> the linux-scsi list.
>>
>> Reverting the commit, the aacraid driver loads correctly, and the
>> drives are visible. So revert the commit to adhere to Linux’ no
>> regression policy.
>>
>> Strangely, the issue is not reproducible with Linux 4.18.x, but
>> Ming Lei wrote, that this might only be by chance.
>>
>> Link: https://www.spinics.net/lists/linux-scsi/msg122732.html
>> Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de>
>>
>> ---
>>   kernel/irq/affinity.c | 30 +++++++++++++++---------------
>>   1 file changed, 15 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
>> index a37a3b4b6342..e12d35108225 100644
>> --- a/kernel/irq/affinity.c
>> +++ b/kernel/irq/affinity.c
>> @@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, struct cpumask *nmsk,
>>   	}
>>   }
>>   
>> -static cpumask_var_t *alloc_node_to_possible_cpumask(void)
>> +static cpumask_var_t *alloc_node_to_present_cpumask(void)
>>   {
>>   	cpumask_var_t *masks;
>>   	int node;
>> @@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_possible_cpumask(void)
>>   	return NULL;
>>   }
>>   
>> -static void free_node_to_possible_cpumask(cpumask_var_t *masks)
>> +static void free_node_to_present_cpumask(cpumask_var_t *masks)
>>   {
>>   	int node;
>>   
>> @@ -71,22 +71,22 @@ static void free_node_to_possible_cpumask(cpumask_var_t *masks)
>>   	kfree(masks);
>>   }
>>   
>> -static void build_node_to_possible_cpumask(cpumask_var_t *masks)
>> +static void build_node_to_present_cpumask(cpumask_var_t *masks)
>>   {
>>   	int cpu;
>>   
>> -	for_each_possible_cpu(cpu)
>> +	for_each_present_cpu(cpu)
>>   		cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
>>   }
>>   
>> -static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
>> +static int get_nodes_in_cpumask(cpumask_var_t *node_to_present_cpumask,
>>   				const struct cpumask *mask, nodemask_t *nodemsk)
>>   {
>>   	int n, nodes = 0;
>>   
>>   	/* Calculate the number of nodes in the supplied affinity mask */
>>   	for_each_node(n) {
>> -		if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
>> +		if (cpumask_intersects(mask, node_to_present_cpumask[n])) {
>>   			node_set(n, *nodemsk);
>>   			nodes++;
>>   		}
>> @@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>   	int last_affv = affv + affd->pre_vectors;
>>   	nodemask_t nodemsk = NODE_MASK_NONE;
>>   	struct cpumask *masks;
>> -	cpumask_var_t nmsk, *node_to_possible_cpumask;
>> +	cpumask_var_t nmsk, *node_to_present_cpumask;
>>   
>>   	/*
>>   	 * If there aren't any vectors left after applying the pre/post
>> @@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>   	if (!masks)
>>   		goto out;
>>   
>> -	node_to_possible_cpumask = alloc_node_to_possible_cpumask();
>> -	if (!node_to_possible_cpumask)
>> +	node_to_present_cpumask = alloc_node_to_present_cpumask();
>> +	if (!node_to_present_cpumask)
>>   		goto out;
>>   
>>   	/* Fill out vectors at the beginning that don't need affinity */
>> @@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>   
>>   	/* Stabilize the cpumasks */
>>   	get_online_cpus();
>> -	build_node_to_possible_cpumask(node_to_possible_cpumask);
>> -	nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
>> +	build_node_to_present_cpumask(node_to_present_cpumask);
>> +	nodes = get_nodes_in_cpumask(node_to_present_cpumask, cpu_present_mask,
>>   				     &nodemsk);
>>   
>>   	/*
>> @@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>   	if (affv <= nodes) {
>>   		for_each_node_mask(n, nodemsk) {
>>   			cpumask_copy(masks + curvec,
>> -				     node_to_possible_cpumask[n]);
>> +				     node_to_present_cpumask[n]);
>>   			if (++curvec == last_affv)
>>   				break;
>>   		}
>> @@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>   		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
>>   
>>   		/* Get the cpus on this node which are in the mask */
>> -		cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
>> +		cpumask_and(nmsk, cpu_present_mask, node_to_present_cpumask[n]);
>>   
>>   		/* Calculate the number of cpus per vector */
>>   		ncpus = cpumask_weight(nmsk);
>> @@ -192,7 +192,7 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
>>   	/* Fill out vectors at the end that don't need affinity */
>>   	for (; curvec < nvecs; curvec++)
>>   		cpumask_copy(masks + curvec, irq_default_affinity);
>> -	free_node_to_possible_cpumask(node_to_possible_cpumask);
>> +	free_node_to_present_cpumask(node_to_present_cpumask);
>>   out:
>>   	free_cpumask_var(nmsk);
>>   	return masks;
>> @@ -214,7 +214,7 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity
>>   		return 0;
>>   
>>   	get_online_cpus();
>> -	ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
>> +	ret = min_t(int, cpumask_weight(cpu_present_mask), vecs) + resv;
>>   	put_online_cpus();
>>   	return ret;
>>   }
>> -- 
>> 2.17.1
>>
> 
> Hello Paul,
> 
> I have explained that this is one aacraid issue in the following link:
> 
> https://marc.info/?l=linux-kernel&m=153413115909580&w=2
> 
> So this issue should have been fixed in the driver instead of genirq
> core code, otherwise regression on other drivers can be caused too.

Dear Ming,

is your argument, that the bug existed before, because even with the IRQs spread over the present CPUs only, a driver was wrong to assume that every IRQ would be served, because of offline CPUs? So that the change to spread over all possible CPUs did not fundamentally change the API of pci_alloc_irq_vectors() but just increased the already existing chance of a failure?

When you say, reverting this can cause a regression, do you refer to performance or functional regression? I don't see the later yet, because even updated drivers have no guarantee, that all possible CPUs are used for the IRQs, so limiting this to the present CPUs again should do no harm?

Thank you

   Donald

> 
> So I suggest you to ask aacraid guys to take a look at this issue.> 
> Thanks,
> Ming
> 


-- 
Donald Buczek
it@molgen.mpg.de
Tel: +49 30 8413 1433

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-01 15:59                       ` Paul Menzel
@ 2018-10-15 12:17                         ` Paul Menzel
  2018-10-15 13:21                           ` Greg Kroah-Hartman
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-10-15 12:17 UTC (permalink / raw)
  To: Christoph Hellwig, Greg Kroah-Hartman
  Cc: Ming Lei, stable, Linux Kernel Mailing List, it+linux-scsi,
	Adaptec OEM Raid Solutions, linux-scsi, Raghava Aditya Renukunta,
	Dave Carroll

[-- Attachment #1: Type: text/plain, Size: 1973 bytes --]

Dear Greg, dear Linux folks,


On 10/01/18 17:59, Paul Menzel wrote:

> On 10/01/18 14:43, Paul Menzel wrote:
> 
>> On 10/01/18 14:35, Christoph Hellwig wrote:
>>> On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
>>>> Date: Wed, 29 Aug 2018 17:28:45 +0200
>>>>
>>>> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
>>>
>>> This seems rather odd.  If at all you'd revert the patch adding the
>>> PCI_IRQ_AFFINITY to aacraid, not core infrastructure.
>>
>> Thank you for the suggestion, but that flag was added in 2016
>> to the aacraid driver.
>>
>>> commit 0910d8bbdd99856af1394d3d8830955abdefee4a
>>> Author: Hannes Reinecke <hare@suse.de>
>>> Date:   Tue Nov 8 08:11:30 2016 +0100
>>>
>>>     scsi: aacraid: switch to pci_alloc_irq_vectors
>>>     
>>>     Use pci_alloc_irq_vectors and drop the hand-crafted interrupt affinity
>>>     routines.
>>
>> So what would happen, if `PCI_IRQ_AFFINITY` was removed? Will the
>> system still work with the same performance?
>>
>> As far as I understood, the no regression policy is there for
>> exactly that reason, and it shouldn’t matter if it’s core
>> infrastructure or not. As written, I have no idea, and just know
>> reverting the commit in question fixes the problem here. So I’ll
>> gladly test other solutions to fix this issue.
> 
> Just as another datapoint, with `PCI_IRQ_AFFINITY` removed from
> `drivers/scsi/aacraid/comminit.c` in Linux 4.14.73, the driver
> initializes correctly. I have no idea regarding the performance.

This commit has not been picked up yet. I guess, you are busy, but
in case there are still objections, it’d be great if the two
questions below were answered.

1.  What bug is fixed in the LTS series by backporting the commit
    causing the regression?

2.  Why does the *no regression* policy *not* apply in this case?


Kind regards,

Paul


[1]: Documentation/process/stable-kernel-rules.rst


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-15 12:17                         ` Paul Menzel
@ 2018-10-15 13:21                           ` Greg Kroah-Hartman
  2018-10-17 15:00                             ` Paul Menzel
  0 siblings, 1 reply; 21+ messages in thread
From: Greg Kroah-Hartman @ 2018-10-15 13:21 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Christoph Hellwig, Ming Lei, stable, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

On Mon, Oct 15, 2018 at 02:17:11PM +0200, Paul Menzel wrote:
> Dear Greg, dear Linux folks,
> 
> 
> On 10/01/18 17:59, Paul Menzel wrote:
> 
> > On 10/01/18 14:43, Paul Menzel wrote:
> > 
> >> On 10/01/18 14:35, Christoph Hellwig wrote:
> >>> On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
> >>>> Date: Wed, 29 Aug 2018 17:28:45 +0200
> >>>>
> >>>> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
> >>>
> >>> This seems rather odd.  If at all you'd revert the patch adding the
> >>> PCI_IRQ_AFFINITY to aacraid, not core infrastructure.
> >>
> >> Thank you for the suggestion, but that flag was added in 2016
> >> to the aacraid driver.
> >>
> >>> commit 0910d8bbdd99856af1394d3d8830955abdefee4a
> >>> Author: Hannes Reinecke <hare@suse.de>
> >>> Date:   Tue Nov 8 08:11:30 2016 +0100
> >>>
> >>>     scsi: aacraid: switch to pci_alloc_irq_vectors
> >>>     
> >>>     Use pci_alloc_irq_vectors and drop the hand-crafted interrupt affinity
> >>>     routines.
> >>
> >> So what would happen, if `PCI_IRQ_AFFINITY` was removed? Will the
> >> system still work with the same performance?
> >>
> >> As far as I understood, the no regression policy is there for
> >> exactly that reason, and it shouldn’t matter if it’s core
> >> infrastructure or not. As written, I have no idea, and just know
> >> reverting the commit in question fixes the problem here. So I’ll
> >> gladly test other solutions to fix this issue.
> > 
> > Just as another datapoint, with `PCI_IRQ_AFFINITY` removed from
> > `drivers/scsi/aacraid/comminit.c` in Linux 4.14.73, the driver
> > initializes correctly. I have no idea regarding the performance.
> 
> This commit has not been picked up yet. I guess, you are busy, but
> in case there are still objections, it’d be great if the two
> questions below were answered.
> 
> 1.  What bug is fixed in the LTS series by backporting the commit
>     causing the regression?

I can't remember anymore, but unwinding this mess is going to be a pain :(

> 2.  Why does the *no regression* policy *not* apply in this case?

It does, but also we are following the "stick to what mainline does",
and the fact that this is not showing up in mainline seems just to be a
lucky accident at the moment.  My real worry is that suddenly you are
going to have problems there and that this is just the early-warning
system happening...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-15 13:21                           ` Greg Kroah-Hartman
@ 2018-10-17 15:00                             ` Paul Menzel
  2018-10-30 15:30                               ` Paul Menzel
  0 siblings, 1 reply; 21+ messages in thread
From: Paul Menzel @ 2018-10-17 15:00 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Ming Lei, stable, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

[-- Attachment #1: Type: text/plain, Size: 3157 bytes --]

Dear Greg,


On 10/15/18 15:21, Greg Kroah-Hartman wrote:
> On Mon, Oct 15, 2018 at 02:17:11PM +0200, Paul Menzel wrote:

>> On 10/01/18 17:59, Paul Menzel wrote:
>>
>>> On 10/01/18 14:43, Paul Menzel wrote:
>>>
>>>> On 10/01/18 14:35, Christoph Hellwig wrote:
>>>>> On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
>>>>>> Date: Wed, 29 Aug 2018 17:28:45 +0200
>>>>>>
>>>>>> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
>>>>>
>>>>> This seems rather odd.  If at all you'd revert the patch adding the
>>>>> PCI_IRQ_AFFINITY to aacraid, not core infrastructure.
>>>>
>>>> Thank you for the suggestion, but that flag was added in 2016
>>>> to the aacraid driver.
>>>>
>>>>> commit 0910d8bbdd99856af1394d3d8830955abdefee4a
>>>>> Author: Hannes Reinecke <hare@suse.de>
>>>>> Date:   Tue Nov 8 08:11:30 2016 +0100
>>>>>
>>>>>     scsi: aacraid: switch to pci_alloc_irq_vectors
>>>>>     
>>>>>     Use pci_alloc_irq_vectors and drop the hand-crafted interrupt affinity
>>>>>     routines.
>>>>
>>>> So what would happen, if `PCI_IRQ_AFFINITY` was removed? Will the
>>>> system still work with the same performance?
>>>>
>>>> As far as I understood, the no regression policy is there for
>>>> exactly that reason, and it shouldn’t matter if it’s core
>>>> infrastructure or not. As written, I have no idea, and just know
>>>> reverting the commit in question fixes the problem here. So I’ll
>>>> gladly test other solutions to fix this issue.
>>>
>>> Just as another datapoint, with `PCI_IRQ_AFFINITY` removed from
>>> `drivers/scsi/aacraid/comminit.c` in Linux 4.14.73, the driver
>>> initializes correctly. I have no idea regarding the performance.
>>
>> This commit has not been picked up yet. I guess, you are busy, but
>> in case there are still objections, it’d be great if the two
>> questions below were answered.
>>
>> 1.  What bug is fixed in the LTS series by backporting the commit
>>     causing the regression?
> 
> I can't remember anymore, but unwinding this mess is going to be a
> pain :(

Agreed.

>> 2.  Why does the *no regression* policy *not* apply in this case?
> 
> It does, but also we are following the "stick to what mainline does",

Hmm, but I thought only for bug fixes.

> and the fact that this is not showing up in mainline seems just to be a
> lucky accident at the moment.  My real worry is that suddenly you are
> going to have problems there and that this is just the early-warning
> system happening...

It is still a mystery for me, why it doesn’t happen in master.

In the current situation, where the SCSI/AACRAID subsystem folks haven’t
joined the discussion, I still think, the best way for the Linux 4.14
series is to revert.

Additionally, there are other reports about errors with the aacraid
driver [1]. I heard they develop against the Linux kernel version in
the enterprise distributions, and then port that to master. Maybe
that is one of the reasons for the current state. (But also off-topic.)


Kind regards,

Paul


[1]: https://www.spinics.net/lists/linux-scsi/threads.html#123414


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs"
  2018-10-17 15:00                             ` Paul Menzel
@ 2018-10-30 15:30                               ` Paul Menzel
  0 siblings, 0 replies; 21+ messages in thread
From: Paul Menzel @ 2018-10-30 15:30 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Christoph Hellwig, Ming Lei, stable, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

[-- Attachment #1: Type: text/plain, Size: 3573 bytes --]

Dear Greg,


Hopefully, you enjoyed the maintainers summit, and didn’t have too
many emails to catch up on.

On 10/17/18 17:00, Paul Menzel wrote:

> On 10/15/18 15:21, Greg Kroah-Hartman wrote:
>> On Mon, Oct 15, 2018 at 02:17:11PM +0200, Paul Menzel wrote:
> 
>>> On 10/01/18 17:59, Paul Menzel wrote:
>>>
>>>> On 10/01/18 14:43, Paul Menzel wrote:
>>>>
>>>>> On 10/01/18 14:35, Christoph Hellwig wrote:
>>>>>> On Mon, Oct 01, 2018 at 02:33:07PM +0200, Paul Menzel wrote:
>>>>>>> Date: Wed, 29 Aug 2018 17:28:45 +0200
>>>>>>>
>>>>>>> This reverts commit ef86f3a72adb8a7931f67335560740a7ad696d1d.
>>>>>>
>>>>>> This seems rather odd.  If at all you'd revert the patch adding the
>>>>>> PCI_IRQ_AFFINITY to aacraid, not core infrastructure.
>>>>>
>>>>> Thank you for the suggestion, but that flag was added in 2016
>>>>> to the aacraid driver.
>>>>>
>>>>>> commit 0910d8bbdd99856af1394d3d8830955abdefee4a
>>>>>> Author: Hannes Reinecke <hare@suse.de>
>>>>>> Date:   Tue Nov 8 08:11:30 2016 +0100
>>>>>>
>>>>>>     scsi: aacraid: switch to pci_alloc_irq_vectors
>>>>>>     
>>>>>>     Use pci_alloc_irq_vectors and drop the hand-crafted interrupt affinity
>>>>>>     routines.
>>>>>
>>>>> So what would happen, if `PCI_IRQ_AFFINITY` was removed? Will the
>>>>> system still work with the same performance?
>>>>>
>>>>> As far as I understood, the no regression policy is there for
>>>>> exactly that reason, and it shouldn’t matter if it’s core
>>>>> infrastructure or not. As written, I have no idea, and just know
>>>>> reverting the commit in question fixes the problem here. So I’ll
>>>>> gladly test other solutions to fix this issue.
>>>>
>>>> Just as another datapoint, with `PCI_IRQ_AFFINITY` removed from
>>>> `drivers/scsi/aacraid/comminit.c` in Linux 4.14.73, the driver
>>>> initializes correctly. I have no idea regarding the performance.
>>>
>>> This commit has not been picked up yet. I guess, you are busy, but
>>> in case there are still objections, it’d be great if the two
>>> questions below were answered.
>>>
>>> 1.  What bug is fixed in the LTS series by backporting the commit
>>>     causing the regression?
>>
>> I can't remember anymore, but unwinding this mess is going to be a
>> pain :(
> 
> Agreed.
> 
>>> 2.  Why does the *no regression* policy *not* apply in this case?
>>
>> It does, but also we are following the "stick to what mainline does",
> 
> Hmm, but I thought only for bug fixes.
> 
>> and the fact that this is not showing up in mainline seems just to be a
>> lucky accident at the moment.  My real worry is that suddenly you are
>> going to have problems there and that this is just the early-warning
>> system happening...
> 
> It is still a mystery for me, why it doesn’t happen in master.
> 
> In the current situation, where the SCSI/AACRAID subsystem folks
> haven’t joined the discussion, I still think, the best way for the
> Linux 4.14 series is to revert.
> 
> Additionally, there are other reports about errors with the aacraid 
> driver [1]. I heard they develop against the Linux kernel version in 
> the enterprise distributions, and then port that to master. Maybe 
> that is one of the reasons for the current state. (But also
> off-topic.)

Do you have an update, how to deal with this situation in the LTS
series? The longer we wait, the more likely it is, that there will be
conflicts in the future releases.


Kind regards,

Paul


> [1]: https://www.spinics.net/lists/linux-scsi/threads.html#123414


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5174 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs*
  2018-09-11 10:53               ` Paul Menzel
  2018-10-01 12:33                 ` [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs" Paul Menzel
@ 2019-02-18 11:40                 ` Greg Kroah-Hartman
  1 sibling, 0 replies; 21+ messages in thread
From: Greg Kroah-Hartman @ 2019-02-18 11:40 UTC (permalink / raw)
  To: Paul Menzel
  Cc: Ming Lei, stable, Christoph Hellwig, Linux Kernel Mailing List,
	it+linux-scsi, Adaptec OEM Raid Solutions, linux-scsi,
	Raghava Aditya Renukunta, Dave Carroll

On Tue, Sep 11, 2018 at 12:53:54PM +0200, Paul Menzel wrote:
> Dear Ming, dear Raghava, dear Dave,
> 
> 
> On 08/16/18 19:09, Paul Menzel wrote:
> 
> > On 08/13/18 05:32, Ming Lei wrote:
> >> On Sat, Aug 11, 2018 at 03:50:21PM +0200, Greg Kroah-Hartman wrote:
> >>> On Sat, Aug 11, 2018 at 10:14:18AM +0200, Paul Menzel wrote:
> > 
> >>>> Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman:
> >>>>> On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote:
> >>>>
> >>>>>> On 08/10/18 15:36, Greg Kroah-Hartman wrote:
> >>>>>>> On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote:
> > 
> >>>>>>>> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added
> >>>>>>>> for Linux 4.14.56 causes the aacraid module to not detect the attached devices
> >>>>>>>> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz.
> >>>>>>>>
> >>>>>>>> ```
> >>>>>>>> $ dmesg | grep raid
> >>>>>>>> [    0.269768] raid6: sse2x1   gen()  7179 MB/s
> >>>>>>>> [    0.290069] raid6: sse2x1   xor()  5636 MB/s
> >>>>>>>> [    0.311068] raid6: sse2x2   gen()  9160 MB/s
> >>>>>>>> [    0.332076] raid6: sse2x2   xor()  6375 MB/s
> >>>>>>>> [    0.353075] raid6: sse2x4   gen() 11164 MB/s
> >>>>>>>> [    0.374064] raid6: sse2x4   xor()  7429 MB/s
> >>>>>>>> [    0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s
> >>>>>>>> [    0.386001] raid6: .... xor() 7429 MB/s, rmw enabled
> >>>>>>>> [    0.391008] raid6: using ssse3x2 recovery algorithm
> >>>>>>>> [    3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
> >>>>>>>> [    3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
> >>>>>>>> [   10.725767] Adaptec aacraid driver 1.2.1[50834]-custom
> >>>>>>>> [   10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
> >>>>>>>> [   10.743295] aacraid: Comm Interface type3 enabled
> >>>>>>>> $ lspci -nn | grep Adaptec
> >>>>>>>> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> >>>>>>>> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01)
> >>>>>>>> ```
> >>>>>>>>
> >>>>>>>> But, it still works with a Dell PowerEdge R715 with two eight core AMD
> >>>>>>>> Opteron 6136, the card below.
> >>>>>>>>
> >>>>>>>> ```
> >>>>>>>> $ lspci -nn | grep Adaptec
> >>>>>>>> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01)
> >>>>>>>> ```
> >>>>>>>>
> >>>>>>>> Reverting the commit fixes the issue.
> >>>>>>>>
> >>>>>>>> commit ef86f3a72adb8a7931f67335560740a7ad696d1d
> >>>>>>>> Author: Christoph Hellwig <hch@lst.de>
> >>>>>>>> Date:   Fri Jan 12 10:53:05 2018 +0800
> >>>>>>>>
> >>>>>>>>      genirq/affinity: assign vectors to all possible CPUs
> >>>>>>>>      commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream.
> >>>>>>>>      Currently we assign managed interrupt vectors to all present CPUs.  This
> >>>>>>>>      works fine for systems were we only online/offline CPUs.  But in case of
> >>>>>>>>      systems that support physical CPU hotplug (or the virtualized version of
> >>>>>>>>      it) this means the additional CPUs covered for in the ACPI tables or on
> >>>>>>>>      the command line are not catered for.  To fix this we'd either need to
> >>>>>>>>      introduce new hotplug CPU states just for this case, or we can start
> >>>>>>>>      assining vectors to possible but not present CPUs.
> >>>>>>>>      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >>>>>>>>      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
> >>>>>>>>      Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
> >>>>>>>>      Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
> >>>>>>>>      Cc: linux-kernel@vger.kernel.org
> >>>>>>>>      Cc: Thomas Gleixner <tglx@linutronix.de>
> >>>>>>>>      Signed-off-by: Christoph Hellwig <hch@lst.de>
> >>>>>>>>      Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >>>>>>>>      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>>>>>>>
> >>>>>>>> The problem doesn’t happen with Linux 4.17.11, so there are commits in
> >>>>>>>> Linux master fixing this. Unfortunately, my attempts to find out failed.
> >>>>>>>>
> >>>>>>>> I was able to cherry-pick the three commits below on top of 4.14.62,
> >>>>>>>> but the problem persists.
> >>>>>>>>
> >>>>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> >>>>>>>> 355d7ecdea35 scsi: hpsa: fix selection of reply queue
> >>>>>>>> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> >>>>>>>>
> >>>>>>>> Trying to cherry-pick the commits below, referencing the commit
> >>>>>>>> in question, gave conflicts.
> >>>>>>>>
> >>>>>>>> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> >>>>>>>> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
> >>>>>>>>
> >>>>>>>> To avoid further trial and error with the server with a slow firmware,
> >>>>>>>> do you know what commits should fix the issue?
> >>>>>>>
> >>>>>>> Look at the email on the stable mailing list:
> >>>>>>> 	Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y
> >>>>>>> it should help you out here.
> >>>>>>
> >>>>>> Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a
> >>>>>> way to reply to that thread. Therefore, here is my reply.
> >>>>>>
> >>>>>>> Can you try the patches listed there?
> >>>>>>
> >>>>>> I tried some of these already without success.
> >>>>>>
> >>>>>> b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> >>>>>> 2f31115e940c scsi: core: introduce force_blk_mq
> >>>>>> adbe552349f2 scsi: megaraid_sas: fix selection of reply queue
> >>>>>>
> >>>>>> The commit above is already in v4.14.56.
> >>>>>>
> >>>>>> 8b834bff1b73 scsi: hpsa: fix selection of reply queue
> >>>>>>
> >>>>>> The problem persists.
> >>>>>>
> >>>>>> The problem also persists with the state below.
> >>>>>>
> >>>>>> 3528f73a4e5d scsi: core: introduce force_blk_mq
> >>>>>> 16dc4d8215f3 scsi: hpsa: fix selection of reply queue
> >>>>>> f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity
> >>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error
> >>>>>> 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62
> >>>>>>
> >>>>>> So, some more commits are necessary.
> >>>>>
> >>>>> Or I revert the original patch here, and the follow-on ones that were
> >>>>> added to "fix" this issue.  I think that might be the better thing
> >>>>> overall here, right?  Have you tried that?
> >>>>
> >>>> Yes, reverting the commit fixed the issue for us. If Christoph or Ming do
> >>>> not have another suggestion for a commit, that would be the way to go.
> >>>
> >>> Christoph or Ming, any ideas here?
> >>>
> >>> In looking at the aacraid code, I don't see anywhere that this is using
> >>> a specific cpu number for queues or anything, but I could be wrong.
> >>> Ideally this should also be failing in 4.17 or 4.18-rc right now as well
> >>> as I don't see anything that would have "fixed" this recently.  Unless
> >>> I'm missing something here?
> >>
> >> From both aac_fib_vector_assign() and aac_src_deliver_message(), looks
> >> every msix vector is used inside driver, so causes this trouble.
> >>
> >> From 84676c1f21e8ff54b (genirq/affinity: assign vectors to all possible CPUs),
> >> actually it should have been since 9a0ef98e18 (genirq/affinity: Assign vectors
> >> to all present CPUs), it isn't guaranteed that one vector can be mapped to
> >> at least one online CPU.
> >>
> >> So it should be fixed in driver by similar approach taken in the
> >> following patches:
> >> 	scsi: hpsa: fix selection of reply queue
> >> 	scsi: megaraid_sas: fix selection of reply queue
> >>
> >> And the reason this issue won't be observed is that the following
> >> patches(merged to v4.17) may avoid this issue, but in theory, the issue
> >> won't be avoided completely by these patches:
> >>
> >> d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible
> >> 1a2d0914e23a genirq/affinity: Allow irq spreading from a given starting point
> >> b3e6aaa8d94d genirq/affinity: Move actual irq vector spreading into a helper function
> >> 47778f33dcba genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask
> >> 0211e12dd0a5 genirq/affinity: Don't return with empty affinity masks on error
> >>
> >> In short, this issue should have been fixed in upstream aacraid, and the idea
> >> is simple, use the current CPU id to retrieve the msix vector offset, this
> >> way will make sure that the msix vector can be mapped from one online
> >> CPU.
> > 
> > AACRAID folks, are you going to work on that? I looked at the code a bit,
> > and it’ll take me some days to understand it and to fix it.
> 
> Unfortunately, there was no reply yet and I won’t be able to fix it any
> time soon. The aacraid driver is still broken in 4.14.68. Reverting the
> *one* mentioned commit fixes the issue.
> 
> Raghava, Dave, are you aware of the issue.

I'm pretty sure this should now be fixed, have you tested this recently?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-02-18 11:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-10 13:21 aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs* Paul Menzel
2018-08-10 13:36 ` Greg Kroah-Hartman
2018-08-10 14:11   ` Paul Menzel
2018-08-10 15:55     ` Greg Kroah-Hartman
2018-08-11  8:14       ` Paul Menzel
2018-08-11 13:50         ` Greg Kroah-Hartman
2018-08-12  8:35           ` Paul Menzel
2018-08-13  3:32           ` Ming Lei
2018-08-16 17:09             ` Paul Menzel
2018-09-11 10:53               ` Paul Menzel
2018-10-01 12:33                 ` [PATCH] Revert "genirq/affinity: assign vectors to all possible CPUs" Paul Menzel
2018-10-01 12:35                   ` Christoph Hellwig
2018-10-01 12:43                     ` Paul Menzel
2018-10-01 15:59                       ` Paul Menzel
2018-10-15 12:17                         ` Paul Menzel
2018-10-15 13:21                           ` Greg Kroah-Hartman
2018-10-17 15:00                             ` Paul Menzel
2018-10-30 15:30                               ` Paul Menzel
2018-10-08  2:11                   ` Ming Lei
2018-10-08  7:56                     ` IT (Donald Buczek)
2019-02-18 11:40                 ` aacraid: Regression in 4.14.56 with *genirq/affinity: assign vectors to all possible CPUs* Greg Kroah-Hartman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).