Dear Ming, dear Raghava, dear Dave, On 08/16/18 19:09, Paul Menzel wrote: > On 08/13/18 05:32, Ming Lei wrote: >> On Sat, Aug 11, 2018 at 03:50:21PM +0200, Greg Kroah-Hartman wrote: >>> On Sat, Aug 11, 2018 at 10:14:18AM +0200, Paul Menzel wrote: > >>>> Am 10.08.2018 um 17:55 schrieb Greg Kroah-Hartman: >>>>> On Fri, Aug 10, 2018 at 04:11:23PM +0200, Paul Menzel wrote: >>>> >>>>>> On 08/10/18 15:36, Greg Kroah-Hartman wrote: >>>>>>> On Fri, Aug 10, 2018 at 03:21:52PM +0200, Paul Menzel wrote: > >>>>>>>> Commit ef86f3a7 (genirq/affinity: assign vectors to all possible CPUs) added >>>>>>>> for Linux 4.14.56 causes the aacraid module to not detect the attached devices >>>>>>>> anymore on a Dell PowerEdge R720 with two six core 24x E5-2630 @ 2.30GHz. >>>>>>>> >>>>>>>> ``` >>>>>>>> $ dmesg | grep raid >>>>>>>> [ 0.269768] raid6: sse2x1 gen() 7179 MB/s >>>>>>>> [ 0.290069] raid6: sse2x1 xor() 5636 MB/s >>>>>>>> [ 0.311068] raid6: sse2x2 gen() 9160 MB/s >>>>>>>> [ 0.332076] raid6: sse2x2 xor() 6375 MB/s >>>>>>>> [ 0.353075] raid6: sse2x4 gen() 11164 MB/s >>>>>>>> [ 0.374064] raid6: sse2x4 xor() 7429 MB/s >>>>>>>> [ 0.379001] raid6: using algorithm sse2x4 gen() 11164 MB/s >>>>>>>> [ 0.386001] raid6: .... xor() 7429 MB/s, rmw enabled >>>>>>>> [ 0.391008] raid6: using ssse3x2 recovery algorithm >>>>>>>> [ 3.559682] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006) >>>>>>>> [ 3.570061] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006) >>>>>>>> [ 10.725767] Adaptec aacraid driver 1.2.1[50834]-custom >>>>>>>> [ 10.731724] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control >>>>>>>> [ 10.743295] aacraid: Comm Interface type3 enabled >>>>>>>> $ lspci -nn | grep Adaptec >>>>>>>> 04:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01) >>>>>>>> 42:00.0 Serial Attached SCSI controller [0107]: Adaptec Smart Storage PQI 12G SAS/PCIe 3 [9005:028f] (rev 01) >>>>>>>> ``` >>>>>>>> >>>>>>>> But, it still works with a Dell PowerEdge R715 with two eight core AMD >>>>>>>> Opteron 6136, the card below. >>>>>>>> >>>>>>>> ``` >>>>>>>> $ lspci -nn | grep Adaptec >>>>>>>> 22:00.0 Serial Attached SCSI controller [0107]: Adaptec Series 8 12G SAS/PCIe 3 [9005:028d] (rev 01) >>>>>>>> ``` >>>>>>>> >>>>>>>> Reverting the commit fixes the issue. >>>>>>>> >>>>>>>> commit ef86f3a72adb8a7931f67335560740a7ad696d1d >>>>>>>> Author: Christoph Hellwig >>>>>>>> Date: Fri Jan 12 10:53:05 2018 +0800 >>>>>>>> >>>>>>>> genirq/affinity: assign vectors to all possible CPUs >>>>>>>> commit 84676c1f21e8ff54befe985f4f14dc1edc10046b upstream. >>>>>>>> Currently we assign managed interrupt vectors to all present CPUs. This >>>>>>>> works fine for systems were we only online/offline CPUs. But in case of >>>>>>>> systems that support physical CPU hotplug (or the virtualized version of >>>>>>>> it) this means the additional CPUs covered for in the ACPI tables or on >>>>>>>> the command line are not catered for. To fix this we'd either need to >>>>>>>> introduce new hotplug CPU states just for this case, or we can start >>>>>>>> assining vectors to possible but not present CPUs. >>>>>>>> Reported-by: Christian Borntraeger >>>>>>>> Tested-by: Christian Borntraeger >>>>>>>> Tested-by: Stefan Haberland >>>>>>>> Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU") >>>>>>>> Cc: linux-kernel@vger.kernel.org >>>>>>>> Cc: Thomas Gleixner >>>>>>>> Signed-off-by: Christoph Hellwig >>>>>>>> Signed-off-by: Jens Axboe >>>>>>>> Signed-off-by: Greg Kroah-Hartman >>>>>>>> >>>>>>>> The problem doesn’t happen with Linux 4.17.11, so there are commits in >>>>>>>> Linux master fixing this. Unfortunately, my attempts to find out failed. >>>>>>>> >>>>>>>> I was able to cherry-pick the three commits below on top of 4.14.62, >>>>>>>> but the problem persists. >>>>>>>> >>>>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error >>>>>>>> 355d7ecdea35 scsi: hpsa: fix selection of reply queue >>>>>>>> e944e9615741 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity >>>>>>>> >>>>>>>> Trying to cherry-pick the commits below, referencing the commit >>>>>>>> in question, gave conflicts. >>>>>>>> >>>>>>>> 1. adbe552349f2 scsi: megaraid_sas: fix selection of reply queue >>>>>>>> 2. d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible >>>>>>>> >>>>>>>> To avoid further trial and error with the server with a slow firmware, >>>>>>>> do you know what commits should fix the issue? >>>>>>> >>>>>>> Look at the email on the stable mailing list: >>>>>>> Subject: Re: Fix for 84676c1f (b5b6e8c8) missing in 4.14.y >>>>>>> it should help you out here. >>>>>> >>>>>> Ah, I didn’t see that [1] yet. Also I can’t find the original message, and a >>>>>> way to reply to that thread. Therefore, here is my reply. >>>>>> >>>>>>> Can you try the patches listed there? >>>>>> >>>>>> I tried some of these already without success. >>>>>> >>>>>> b5b6e8c8d3b4 scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity >>>>>> 2f31115e940c scsi: core: introduce force_blk_mq >>>>>> adbe552349f2 scsi: megaraid_sas: fix selection of reply queue >>>>>> >>>>>> The commit above is already in v4.14.56. >>>>>> >>>>>> 8b834bff1b73 scsi: hpsa: fix selection of reply queue >>>>>> >>>>>> The problem persists. >>>>>> >>>>>> The problem also persists with the state below. >>>>>> >>>>>> 3528f73a4e5d scsi: core: introduce force_blk_mq >>>>>> 16dc4d8215f3 scsi: hpsa: fix selection of reply queue >>>>>> f0a7ab12232d scsi: virtio_scsi: fix IO hang caused by automatic irq vector affinity >>>>>> 6aba81b5a2f5 genirq/affinity: Don't return with empty affinity masks on error >>>>>> 1aa1166eface (tag: v4.14.62, stable/linux-4.14.y) Linux 4.14.62 >>>>>> >>>>>> So, some more commits are necessary. >>>>> >>>>> Or I revert the original patch here, and the follow-on ones that were >>>>> added to "fix" this issue. I think that might be the better thing >>>>> overall here, right? Have you tried that? >>>> >>>> Yes, reverting the commit fixed the issue for us. If Christoph or Ming do >>>> not have another suggestion for a commit, that would be the way to go. >>> >>> Christoph or Ming, any ideas here? >>> >>> In looking at the aacraid code, I don't see anywhere that this is using >>> a specific cpu number for queues or anything, but I could be wrong. >>> Ideally this should also be failing in 4.17 or 4.18-rc right now as well >>> as I don't see anything that would have "fixed" this recently. Unless >>> I'm missing something here? >> >> From both aac_fib_vector_assign() and aac_src_deliver_message(), looks >> every msix vector is used inside driver, so causes this trouble. >> >> From 84676c1f21e8ff54b (genirq/affinity: assign vectors to all possible CPUs), >> actually it should have been since 9a0ef98e18 (genirq/affinity: Assign vectors >> to all present CPUs), it isn't guaranteed that one vector can be mapped to >> at least one online CPU. >> >> So it should be fixed in driver by similar approach taken in the >> following patches: >> scsi: hpsa: fix selection of reply queue >> scsi: megaraid_sas: fix selection of reply queue >> >> And the reason this issue won't be observed is that the following >> patches(merged to v4.17) may avoid this issue, but in theory, the issue >> won't be avoided completely by these patches: >> >> d3056812e7df genirq/affinity: Spread irq vectors among present CPUs as far as possible >> 1a2d0914e23a genirq/affinity: Allow irq spreading from a given starting point >> b3e6aaa8d94d genirq/affinity: Move actual irq vector spreading into a helper function >> 47778f33dcba genirq/affinity: Rename *node_to_possible_cpumask as *node_to_cpumask >> 0211e12dd0a5 genirq/affinity: Don't return with empty affinity masks on error >> >> In short, this issue should have been fixed in upstream aacraid, and the idea >> is simple, use the current CPU id to retrieve the msix vector offset, this >> way will make sure that the msix vector can be mapped from one online >> CPU. > > AACRAID folks, are you going to work on that? I looked at the code a bit, > and it’ll take me some days to understand it and to fix it. Unfortunately, there was no reply yet and I won’t be able to fix it any time soon. The aacraid driver is still broken in 4.14.68. Reverting the *one* mentioned commit fixes the issue. Raghava, Dave, are you aware of the issue. > Just out of curiosity, shouldn’t the subsystems(?) communicate better to > avoid such breakage, or was that unexpected? Kind regards, Paul