* [GIT PULL] nvme fix for 4.16-rc6
@ 2018-03-16 16:01 Keith Busch
  2018-03-16 16:14 ` Jens Axboe
From: Keith Busch @ 2018-03-16 16:01 UTC


Hi Jens,

One last fix from Jianchao to address a troubling scenario Ming identified
with offlining CPUs that could leave nvme controllers unusable. We
cannot allow that to occur, so we wish to include a new IRQ spread that
prevents that situation.

The following changes since commit 0c8efd610b58cb23cefdfa12015799079aef94ae:

  Linux 4.16-rc5 (2018-03-11 17:25:09 -0700)

are available in the Git repository at:

  git://git.infradead.org/nvme.git nvme-4.16-rc6

for you to fetch changes up to 3aadd3b10f018f26c451f8a0feac7605c2deb0c7:

  nvme-pci: assign separate irq vectors for adminq and ioq1 (2018-03-13 11:27:57 -0600)

----------------------------------------------------------------
Jianchao Wang (1):
      nvme-pci: assign separate irq vectors for adminq and ioq1

 drivers/nvme/host/pci.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)
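
For the curious, the heart of Jianchao's change is the irq_affinity
"pre_vectors" facility: reserve vector 0 for the admin queue so that the
automatic affinity spread only covers the I/O queues. A rough sketch of
the shape of the change in nvme_setup_io_queues() (paraphrased, not the
verbatim diff):

    struct irq_affinity affd = {
        .pre_vectors = 1,   /* vector 0: admin queue only */
    };

    /*
     * nr_io_queues was sized earlier; the +1 accounts for the
     * dedicated admin vector. PCI_IRQ_AFFINITY spreads the
     * remaining vectors across the possible CPUs.
     */
    result = pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
            PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
    if (result <= 0)
        return -EIO;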


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-16 16:01 [GIT PULL] nvme fix for 4.16-rc6 Keith Busch
@ 2018-03-16 16:14 ` Jens Axboe
  2018-03-16 16:16   ` Christoph Hellwig
From: Jens Axboe @ 2018-03-16 16:14 UTC


On 3/16/18 9:01 AM, Keith Busch wrote:
> Hi Jens,
> 
> One last fix from Jianchao to address a troubling scenario Ming identified
> with offlining CPUs that could leave nvme controllers unusable. We
> cannot allow that to occur, so we wish to include a new IRQ spread that
> prevents that situation.

Since this isn't really a regression in this release, how about
we push this one to 4.17 instead?

-- 
Jens Axboe


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-16 16:14 ` Jens Axboe
@ 2018-03-16 16:16   ` Christoph Hellwig
  2018-03-16 16:26     ` Jens Axboe
From: Christoph Hellwig @ 2018-03-16 16:16 UTC


On Fri, Mar 16, 2018@09:14:34AM -0700, Jens Axboe wrote:
> On 3/16/18 9:01 AM, Keith Busch wrote:
> > Hi Jens,
> > 
> > One last fix from Jianchao to address a troubling scenario Ming identified
> > with offlining CPUs that could leave nvme controllers unusable. We
> > cannot allow that to occur, so we wish to include a new IRQ spread that
> > prevents that situation.
> 
> Since this isn't really a regression in this release, how about
> we push this one to 4.17 instead?

It is a regression in 4.15.  So I think we should fix this ASAP.


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-16 16:16   ` Christoph Hellwig
@ 2018-03-16 16:26     ` Jens Axboe
  2018-03-16 16:53       ` Keith Busch
  2018-03-21 20:59       ` Keith Busch
From: Jens Axboe @ 2018-03-16 16:26 UTC


On 3/16/18 9:16 AM, Christoph Hellwig wrote:
> On Fri, Mar 16, 2018@09:14:34AM -0700, Jens Axboe wrote:
>> On 3/16/18 9:01 AM, Keith Busch wrote:
>>> Hi Jens,
>>>
>>> One last fix from Jianchao to address a troubling scenario Ming identified
>>> with offlining CPUs that could leave nvme controllers unusable. We
>>> cannot allow that to occur, so we wish to include a new IRQ spread that
>>> prevents that situation.
>>
>> Since this isn't really a regression in this release, how about
>> we push this one to 4.17 instead?
> 
> It is a regression in 4.15.  So I think we should fix this ASAP.

It's not that I dislike the patch (in fact it makes the code
easier to read), but it's pretty late for something that isn't
a regression in this series. I can queue it up for some testing,
but it's then -rc7 time. I guess we can see how it goes and
push the decision to the start of next week.

-- 
Jens Axboe


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-16 16:26     ` Jens Axboe
@ 2018-03-16 16:53       ` Keith Busch
  2018-03-21 20:59       ` Keith Busch
From: Keith Busch @ 2018-03-16 16:53 UTC


On Fri, Mar 16, 2018@09:26:24AM -0700, Jens Axboe wrote:
> 
> It's not that I dislike the patch (in fact it makes the code
> easier to read), but it's pretty late for something that isn't
> a regression in this series. I can queue it up for some testing,
> but it's then -rc7 time. I guess we can see how it goes and
> push the decision to the start of next week.

I understand the concern about the timing. If you can queue this
for additional testing and make a final call, that would be much
appreciated. We can copy to stable during 4.17 if you consider this too
risky for 4.16.

FWIW, this patch tests well here and even correctly aligns with Intel's
intended usage, but I'm limited to a single vendor.


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-16 16:26     ` Jens Axboe
  2018-03-16 16:53       ` Keith Busch
@ 2018-03-21 20:59       ` Keith Busch
  2018-03-21 21:02         ` Jens Axboe
From: Keith Busch @ 2018-03-21 20:59 UTC


On Fri, Mar 16, 2018@09:26:24AM -0700, Jens Axboe wrote:
> It's not that I dislike the patch (in fact it makes the code
> easier to read), but it's pretty late for something that isn't
> a regression in this series. I can queue it up for some testing,
> but it's then -rc7 time. I guess we can see how it goes and
> push the decision to the start of next week.

Hi Jens,

Do you need more time on this one or have you decided where you want
this fix to go? I'm planning to send the first nvme 4.17 pull request
this week, so just checking if I should include this one.

Thanks,
Keith


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-21 20:59       ` Keith Busch
@ 2018-03-21 21:02         ` Jens Axboe
  2018-03-21 21:44           ` Jens Axboe
From: Jens Axboe @ 2018-03-21 21:02 UTC


On 3/21/18 2:59 PM, Keith Busch wrote:
> On Fri, Mar 16, 2018@09:26:24AM -0700, Jens Axboe wrote:
>> It's not that I dislike the patch (in fact it makes the code
>> easier to read), but it's pretty late for something that isn't
>> a regression in this series. I can queue it up for some testing,
>> but it's then -rc7 time. I guess we can see how it goes and
>> push the decision to the start of next week.
> 
> Hi Jens,
> 
> Do you need more time on this one or have you decided where you want
> this fix to go? I'm planning to send the first nvme 4.17 pull request
> this week, so just checking if I should include this one.

Let's ship it for 4.16.

-- 
Jens Axboe


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-21 21:02         ` Jens Axboe
@ 2018-03-21 21:44           ` Jens Axboe
  2018-03-21 22:08             ` Keith Busch
  2018-03-22 21:09             ` Keith Busch
From: Jens Axboe @ 2018-03-21 21:44 UTC


On 3/21/18 3:02 PM, Jens Axboe wrote:
> On 3/21/18 2:59 PM, Keith Busch wrote:
>> On Fri, Mar 16, 2018@09:26:24AM -0700, Jens Axboe wrote:
>>> It's not that I dislike the patch (in fact it makes the code
>>> easier to read), but it's pretty late for something that isn't
>>> a regression in this series. I can queue it up for some testing,
>>> but it's then -rc7 time. I guess we can see how it goes and
>>> push the decision to the start of next week.
>>
>> Hi Jens,
>>
>> Do you need more time on this one or have you decided where you want
>> this fix to go? I'm planning to send the first nvme 4.17 pull request
>> this week, so just checking if I should include this one.
> 
> Let's ship it for 4.16.

On second thought, let's not. While it worked fine on one box, my other
test box (that has a bunch of drives) is not very happy:

[   30.241598] nvme nvme2: pci function 0000:0b:00.0                            
[   30.247205] nvme nvme3: pci function 0000:81:00.0                            
[   30.252684] nvme nvme4: pci function 0000:82:00.0                            
[   30.258144] nvme nvme5: pci function 0000:83:00.0                            
[   30.263606] nvme nvme6: pci function 0000:84:00.0                            
[   30.360555] nvme nvme3: could not set timestamp (8194)                       
[   30.481649] nvme nvme6: Shutdown timeout set to 8 seconds                    
[   38.790949] nvme nvme4: Device not ready; aborting initialisation            
[   38.797857] nvme nvme4: Removing after probe failure status: -19             
[   60.708816] nvme nvme3: I/O 363 QID 8 timeout, completion polled             
[   60.708820] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
[   68.068772] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
[   91.108626] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
[   98.660581] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
[  121.702691] nvme nvme6: I/O 100 QID 7 timeout, completion polled             
[  128.998648] nvme nvme3: I/O 387 QID 4 timeout, completion polled             
[  152.038523] nvme nvme6: I/O 781 QID 7 timeout, completion polled             

This is just doing an fdisk -l after module load. No interrupts are
triggering for the queues that time out, judging by /proc/interrupts.
The commands do complete eventually, but only because the timeout
handler polls the queue. Ignore the probe failure, that one is expected.
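
For reference, the "completion polled" message comes from the driver's
timeout handler, which polls the CQ directly when a command times out;
the completion was sitting there all along, we just never got the
interrupt. Roughly, paraphrasing drivers/nvme/host/pci.c from this era
(from memory, not a verbatim quote):

    static enum blk_eh_timer_return nvme_timeout(struct request *req,
                                                 bool reserved)
    {
        struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
        struct nvme_queue *nvmeq = iod->nvmeq;

        /* Did we simply miss an interrupt? */
        if (__nvme_poll(nvmeq, req->tag)) {
            dev_warn(nvmeq->dev->ctrl.device,
                     "I/O %d QID %d timeout, completion polled\n",
                     req->tag, nvmeq->qid);
            return BLK_EH_HANDLED;
        }

        /* ... otherwise the reset/abort escalation follows ... */
        return BLK_EH_RESET_TIMER;
    }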

So that's a pretty horrific failure, about half (or more) of the
devices simply don't work. For something being pushed aggressively
at -rc6 time, I'd say your testing is lacking.

I'm going to drop it from my 4.16 queue; don't queue it up for
4.17 before we figure out what's going on here.

-- 
Jens Axboe


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-21 21:44           ` Jens Axboe
@ 2018-03-21 22:08             ` Keith Busch
  2018-03-22 21:09             ` Keith Busch
From: Keith Busch @ 2018-03-21 22:08 UTC


On Wed, Mar 21, 2018@03:44:32PM -0600, Jens Axboe wrote:
> On second thought, let's not. While it worked fine on one box, my other
> test box (that has a bunch of drives) is not very happy:
> 
> [   30.241598] nvme nvme2: pci function 0000:0b:00.0                            
> [   30.247205] nvme nvme3: pci function 0000:81:00.0                            
> [   30.252684] nvme nvme4: pci function 0000:82:00.0                            
> [   30.258144] nvme nvme5: pci function 0000:83:00.0                            
> [   30.263606] nvme nvme6: pci function 0000:84:00.0                            
> [   30.360555] nvme nvme3: could not set timestamp (8194)                       
> [   30.481649] nvme nvme6: Shutdown timeout set to 8 seconds                    
> [   38.790949] nvme nvme4: Device not ready; aborting initialisation            
> [   38.797857] nvme nvme4: Removing after probe failure status: -19             
> [   60.708816] nvme nvme3: I/O 363 QID 8 timeout, completion polled             
> [   60.708820] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
> [   68.068772] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
> [   91.108626] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
> [   98.660581] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
> [  121.702691] nvme nvme6: I/O 100 QID 7 timeout, completion polled             
> [  128.998648] nvme nvme3: I/O 387 QID 4 timeout, completion polled             
> [  152.038523] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
> 
> This is just doing an fdisk -l after module load. No interrupts are
> triggering for the queues that time out, judging by /proc/interrupts.
> The commands do complete eventually, but only because the timeout
> handler polls the queue. Ignore the probe failure, that one is expected.
> 
> So that's a pretty horrific failure, about half (or more) of the
> devices simply don't work. For something being pushed aggressively
> at -rc6 time, I'd say your testing is lacking.
> 
> I'm going to drop it from my 4.16 queue; don't queue it up for
> 4.17 before we figure out what's going on here.

Well hell, that is awful. Thank you for checking; patch dropped.

Admin queue interrupts appear to be working, otherwise it couldn't have
gotten that far. I've no immediate explanation for your results, so I'm
going back to the drawing board to sort it out.


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-21 21:44           ` Jens Axboe
  2018-03-21 22:08             ` Keith Busch
@ 2018-03-22 21:09             ` Keith Busch
  2018-03-22 21:32               ` Jens Axboe
From: Keith Busch @ 2018-03-22 21:09 UTC


On Wed, Mar 21, 2018@03:44:32PM -0600, Jens Axboe wrote:
> 
> [   30.241598] nvme nvme2: pci function 0000:0b:00.0                            
> [   30.247205] nvme nvme3: pci function 0000:81:00.0                            
> [   30.252684] nvme nvme4: pci function 0000:82:00.0                            
> [   30.258144] nvme nvme5: pci function 0000:83:00.0                            
> [   30.263606] nvme nvme6: pci function 0000:84:00.0                            
> [   30.360555] nvme nvme3: could not set timestamp (8194)                       
> [   30.481649] nvme nvme6: Shutdown timeout set to 8 seconds                    
> [   38.790949] nvme nvme4: Device not ready; aborting initialisation            
> [   38.797857] nvme nvme4: Removing after probe failure status: -19             
> [   60.708816] nvme nvme3: I/O 363 QID 8 timeout, completion polled             
> [   60.708820] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
> [   68.068772] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
> [   91.108626] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
> [   98.660581] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
> [  121.702691] nvme nvme6: I/O 100 QID 7 timeout, completion polled             
> [  128.998648] nvme nvme3: I/O 387 QID 4 timeout, completion polled             
> [  152.038523] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
> 
> This is just doing an fdisk -l after module load. No interrupts are
> triggering for the queues that time out, judging by /proc/interrupts.

So no interrupts triggered for the queues that time out, but you are
getting interrupts for other queues. Are there by chance many spurious
interrupts for those other queues?


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-22 21:09             ` Keith Busch
@ 2018-03-22 21:32               ` Jens Axboe
  2018-03-22 22:02                 ` Keith Busch
From: Jens Axboe @ 2018-03-22 21:32 UTC


On 3/22/18 3:09 PM, Keith Busch wrote:
> On Wed, Mar 21, 2018@03:44:32PM -0600, Jens Axboe wrote:
>>
>> [   30.241598] nvme nvme2: pci function 0000:0b:00.0                            
>> [   30.247205] nvme nvme3: pci function 0000:81:00.0                            
>> [   30.252684] nvme nvme4: pci function 0000:82:00.0                            
>> [   30.258144] nvme nvme5: pci function 0000:83:00.0                            
>> [   30.263606] nvme nvme6: pci function 0000:84:00.0                            
>> [   30.360555] nvme nvme3: could not set timestamp (8194)                       
>> [   30.481649] nvme nvme6: Shutdown timeout set to 8 seconds                    
>> [   38.790949] nvme nvme4: Device not ready; aborting initialisation            
>> [   38.797857] nvme nvme4: Removing after probe failure status: -19             
>> [   60.708816] nvme nvme3: I/O 363 QID 8 timeout, completion polled             
>> [   60.708820] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
>> [   68.068772] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
>> [   91.108626] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
>> [   98.660581] nvme nvme2: I/O 769 QID 28 timeout, completion polled            
>> [  121.702691] nvme nvme6: I/O 100 QID 7 timeout, completion polled             
>> [  128.998648] nvme nvme3: I/O 387 QID 4 timeout, completion polled             
>> [  152.038523] nvme nvme6: I/O 781 QID 7 timeout, completion polled             
>>
>> This is just doing an fdisk -l after module load. No interrupts are
>> triggering for the queues that time out, judging by /proc/interrupts.
> 
> So no interrupts triggered for the queues that time out, but you are
> getting interrupts for other queues. Are there by chance many spurious
> interrupts for those other queues?

I picked one device, and I did:

for i in $(seq 0 47); do echo cpu $i; taskset -c $i dd if=/dev/nvme6n1 of=/dev/null bs=4k iflag=direct count=1; done
[...]
cpu 17
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 30.608 s, 0.1 kB/s

Same slowness for CPUs 19, 21, 23, 41, 43, 45, 47. Looking at dmesg:
[  701.725191] nvme nvme6: I/O 912 QID 7 timeout, completion polled
[  732.124646] nvme nvme6: I/O 896 QID 7 timeout, completion polled
[  762.652479] nvme nvme6: I/O 840 QID 7 timeout, completion polled
[  793.052259] nvme nvme6: I/O 754 QID 7 timeout, completion polled
[  823.644082] nvme nvme6: I/O 339 QID 7 timeout, completion polled
[  853.979878] nvme nvme6: I/O 86 QID 7 timeout, completion polled
[  884.571686] nvme nvme6: I/O 997 QID 7 timeout, completion polled
[  914.907506] nvme nvme6: I/O 694 QID 7 timeout, completion polled

We see them all timing out after 30s, all on QID 7, which is
hctx6. There are 8 hw queues (and interrupts), but judging by
/proc/interrupts, only some of them ever actually trigger.

There seems to be some mismatch. nvme6q7 is 244:

# cat /proc/irq/244/smp_affinity_list 
49,51,53,55,57,59,61,63

and 243 is nvme6q6:

# cat /proc/irq/243/smp_affinity_list 
17,19,21,23,41,43,45,47

244 has never triggered. If I do:

# taskset -c 17 dd if=/dev/nvme6n1 of=/dev/null bs=4k iflag=direct count=1

and then look at /proc/interrupts, none of the nvme6-associated
interrupts have triggered.

-- 
Jens Axboe


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-22 21:32               ` Jens Axboe
@ 2018-03-22 22:02                 ` Keith Busch
  2018-03-22 22:09                   ` Jens Axboe
From: Keith Busch @ 2018-03-22 22:02 UTC


On Thu, Mar 22, 2018@03:32:45PM -0600, Jens Axboe wrote:
> There seems to be some mismatch. nvme6q7 is 244:
> 
> # cat /proc/irq/244/smp_affinity_list 
> 49,51,53,55,57,59,61,63
> 
> and 243 is nvme6q6:
> 
> # cat /proc/irq/243/smp_affinity_list 
> 17,19,21,23,41,43,45,47
> 
> 244 has never triggered. If I do:
> 
> # taskset -c 17 dd if=/dev/nvme6n1 of=/dev/null bs=4k iflag=direct count=1
> 
> and then look at /proc/interrupts, none of the nvme6-associated
> interrupts have triggered.

Thanks, got it now: blk_mq_pci_map_queues() doesn't take pre_vectors into
account, nor does it even have a way to know about them. Some queues,
then, get interrupt affinity assigned to "possible" CPUs that aren't
online.

This approach will definitely need some more work. Sorry for the trouble.
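
Concretely (my reading of it, with the loop paraphrased from memory
rather than quoted): the mapping helper assumes hw queue N is serviced
by vector N, but with one pre_vector reserved for the admin queue it is
actually vector N + 1, so every CPU lands one hctx off and some vectors
end up with affinity masks that contain no online CPUs at all. A managed
IRQ like that is never started, which would explain nvme6q7 never
firing:

    int blk_mq_pci_map_queues(struct blk_mq_tag_set *set,
                              struct pci_dev *pdev)
    {
        const struct cpumask *mask;
        unsigned int queue, cpu;

        for (queue = 0; queue < set->nr_hw_queues; queue++) {
            /*
             * Off by one with pre_vectors: hw queue N is
             * wired to vector N + 1, so this looks up the
             * affinity of the wrong vector.
             */
            mask = pci_irq_get_affinity(pdev, queue);
            if (!mask)
                goto fallback;

            for_each_cpu(cpu, mask)
                set->mq_map[cpu] = queue;
        }

        return 0;

    fallback:
        WARN_ON_ONCE(1);
        return blk_mq_map_queues(set);
    }

Presumably the helper needs to learn about the driver's vector offset so
the lookup can become pci_irq_get_affinity(pdev, queue + offset).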


* [GIT PULL] nvme fix for 4.16-rc6
  2018-03-22 22:02                 ` Keith Busch
@ 2018-03-22 22:09                   ` Jens Axboe
From: Jens Axboe @ 2018-03-22 22:09 UTC


On 3/22/18 4:02 PM, Keith Busch wrote:
> On Thu, Mar 22, 2018@03:32:45PM -0600, Jens Axboe wrote:
>> There seems to be some mismatch. nvme6q7 is 244:
>>
>> # cat /proc/irq/244/smp_affinity_list 
>> 49,51,53,55,57,59,61,63
>>
>> and 243 is nvme6q6:
>>
>> # cat /proc/irq/243/smp_affinity_list 
>> 17,19,21,23,41,43,45,47
>>
>> 244 has never triggered. If I do:
>>
>> # taskset -c 17 dd if=/dev/nvme6n1 of=/dev/null bs=4k iflag=direct count=1
>>
>> and then look at /proc/interrupts, none of the nvme6-associated
>> interrupts have triggered.
> 
> Thanks, got it now: blk_mq_pci_map_queues() doesn't take pre_vectors into
> account, nor does it even have a way to know about them. Some queues,
> then, get interrupt affinity assigned to "possible" CPUs that aren't
> online.
> 
> This approach will definitely need some more work. Sorry for the trouble.

That brings up the question I had in the initial reply - what sort of
testing goes into patches, especially ones that are being pushed this
late in the game? My setup can't be so esoteric that nothing else
would hit this, since the issue is pretty generic.

This could have left nvme broken for tons of folks, potentially for
a final release. I know who would've gotten yelled at for that.

Improve your testing and quality control. This isn't good enough.

-- 
Jens Axboe

