linux-pci.vger.kernel.org archive mirror
* sysfs interface to force power off
@ 2022-11-04 23:08 James Puthukattukaran
  2022-11-07 20:41 ` Bjorn Helgaas
  0 siblings, 1 reply; 8+ messages in thread
From: James Puthukattukaran @ 2022-11-04 23:08 UTC (permalink / raw)
  To: linux-pci, helgaas

We are looking to solve a problem where nvme drives hang in the field. We are not sure of the root cause, but the working theory is that the controller is "bad" and not responding properly to commands. The nvme driver times out on outstanding I/O requests and, as part of recovery, attempts to reset the controller and reinitialize the device. The controller reset also hangs, as shown here:

kernel:info: [10419813.132341] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
kernel:warning: [10419813.132342] Call Trace:
kernel:warning: [10419813.132345]  __schedule+0x2bc/0x89b
kernel:warning: [10419813.132348]  schedule+0x36/0x7c
kernel:warning: [10419813.132351]  blk_mq_freeze_queue_wait+0x4b/0xaa
kernel:warning: [10419813.132353]  ? remove_wait_queue+0x60/0x60
kernel:warning: [10419813.132359]  nvme_wait_freeze+0x33/0x50 [nvme_core]
kernel:warning: [10419813.132362]  nvme_reset_work+0x802/0xd84 [nvme]
kernel:warning: [10419813.132364]  ? __switch_to_asm+0x40/0x62
kernel:warning: [10419813.132365]  ? __switch_to_asm+0x34/0x62
kernel:warning: [10419813.132367]  ? __switch_to+0x9b/0x505
kernel:warning: [10419813.132368]  ? __switch_to_asm+0x40/0x62
kernel:warning: [10419813.132370]  ? __switch_to_asm+0x40/0x62
kernel:warning: [10419813.132372]  process_one_work+0x169/0x399
kernel:warning: [10419813.132374]  worker_thread+0x4d/0x3e5
kernel:warning: [10419813.132377]  kthread+0x105/0x138
kernel:warning: [10419813.132379]  ? rescuer_thread+0x380/0x375
kernel:warning: [10419813.132380]  ? kthread_bind+0x20/0x15
kernel:warning: [10419813.132382]  ret_from_fork+0x24/0x49
...

So, I tried to hot power off the device via "echo 0 > /sys/bus/pci/slots/X/power", but that thread also hangs, waiting for the nvme reset work to finish (like so):


kernel:warning: [10419813.158116]  __schedule+0x2bc/0x89b
kernel:warning: [10419813.158119]  schedule+0x36/0x7c
kernel:warning: [10419813.158122]  schedule_timeout+0x1f6/0x31f
kernel:warning: [10419813.158124]  ? sched_clock_cpu+0x11/0xa5
kernel:warning: [10419813.158126]  ? try_to_wake_up+0x59/0x505
kernel:warning: [10419813.158130]  wait_for_completion+0x12b/0x18a
kernel:warning: [10419813.158132]  ? wake_up_q+0x80/0x73
kernel:warning: [10419813.158134]  flush_work+0x122/0x1a7
kernel:warning: [10419813.158137]  ? wake_up_worker+0x30/0x2b
kernel:warning: [10419813.158141]  nvme_remove+0x71/0x100 [nvme]
kernel:warning: [10419813.158146]  pci_device_remove+0x3e/0xb6
kernel:warning: [10419813.158149]  device_release_driver_internal+0x134/0x1eb
kernel:warning: [10419813.158151]  device_release_driver+0x12/0x14
kernel:warning: [10419813.158155]  pci_stop_bus_device+0x7c/0x96
kernel:warning: [10419813.158158]  pci_stop_bus_device+0x39/0x96
kernel:warning: [10419813.158164]  pci_stop_and_remove_bus_device+0x12/0x1d
kernel:warning: [10419813.158167]  pciehp_unconfigure_device+0x7a/0x1d7
kernel:warning: [10419813.158169]  pciehp_disable_slot+0x52/0xca
kernel:warning: [10419813.158171]  pciehp_sysfs_disable_slot+0x67/0x112
kernel:warning: [10419813.158174]  disable_slot+0x12/0x14
kernel:warning: [10419813.158175]  power_write_file+0x6e/0xf8
kernel:warning: [10419813.158179]  pci_slot_attr_store+0x24/0x2e
kernel:warning: [10419813.158180]  sysfs_kf_write+0x3f/0x46
kernel:warning: [10419813.158182]  kernfs_fop_write+0x124/0x1a3
kernel:warning: [10419813.158184]  __vfs_write+0x3a/0x16d
kernel:warning: [10419813.158187]  ? audit_filter_syscall+0x33/0xce
kernel:warning: [10419813.158189]  vfs_write+0xb2/0x1a1

Is there a way to force power off the device instead of the "graceful" approach? Obviously, we don't want to reset the system and don't have physical access to the device.  
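For reference, the "graceful" power-off attempted above can be sketched in C. This is only a minimal illustration: slot_power_write() is a hypothetical helper name, and the caller supplies the attribute path (for example /sys/bus/pci/slots/X/power, with X left for the caller to fill in). The write() itself can block for as long as the kernel's slot disable path blocks, which is the hang shown in the trace.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "0\n" (power off) or "1\n" (power on)
 * to a hotplug slot's "power" attribute, as "echo 0 > .../power" does.
 * Returns 0 on success or a negative errno value on failure.
 */
int slot_power_write(const char *path, int on)
{
	const char *val = on ? "1\n" : "0\n";
	size_t len = strlen(val);
	ssize_t n;
	int fd, err = 0;

	assert(path != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* This call does not return until the kernel handler returns. */
	n = write(fd, val, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;

	if (close(fd) < 0 && err == 0)
		err = -errno;
	return err;
}
```

A caller would invoke slot_power_write(path, 0) and treat a negative return as failure; nothing in this sketch adds a timeout or a forced path.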

Would it make sense to create a "force power off" attribute in /sys/bus/pci/slots/X which basically:
a) Sets the completion timeout (CTO) mask, so that outstanding IO requests do not cause a fatal error due to CTOs (not an issue for DPCs, I would think)
b) power off the slot 
c) enable CTO mask
d) unconfigure the device via pciehp_unconfigure_device

Any help here appreciated! 
thanks
James




* Re: sysfs interface to force power off
  2022-11-04 23:08 sysfs interface to force power off James Puthukattukaran
@ 2022-11-07 20:41 ` Bjorn Helgaas
  2022-11-07 21:14   ` [External] : " James Puthukattukaran
  2022-11-08  9:53   ` Lukas Wunner
  0 siblings, 2 replies; 8+ messages in thread
From: Bjorn Helgaas @ 2022-11-07 20:41 UTC (permalink / raw)
  To: James Puthukattukaran; +Cc: Lukas Wunner, Hans de Goede, linux-pci

[+cc Lukas, Hans]

On Fri, Nov 04, 2022 at 07:08:34PM -0400, James Puthukattukaran wrote:
> Looking to solve a problem where we have nvme drives that are hung
> in the field and we are not sure of the root cause but the working
> theory is that the controller is "bad" and not responding properly
> to commands. The nvme driver times out on outstanding IO requests
> and as part of recovery, attempts to reset the controller and
> reinitialize the device. The reset controller also hangs like here
> --   
> 
> kernel:info: [10419813.132341] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> kernel:warning: [10419813.132342] Call Trace:
> kernel:warning: [10419813.132345]  __schedule+0x2bc/0x89b
> kernel:warning: [10419813.132348]  schedule+0x36/0x7c
> kernel:warning: [10419813.132351]  blk_mq_freeze_queue_wait+0x4b/0xaa
> kernel:warning: [10419813.132353]  ? remove_wait_queue+0x60/0x60
> kernel:warning: [10419813.132359]  nvme_wait_freeze+0x33/0x50 [nvme_core]
> kernel:warning: [10419813.132362]  nvme_reset_work+0x802/0xd84 [nvme]
> kernel:warning: [10419813.132364]  ? __switch_to_asm+0x40/0x62
> kernel:warning: [10419813.132365]  ? __switch_to_asm+0x34/0x62
> kernel:warning: [10419813.132367]  ? __switch_to+0x9b/0x505
> kernel:warning: [10419813.132368]  ? __switch_to_asm+0x40/0x62
> kernel:warning: [10419813.132370]  ? __switch_to_asm+0x40/0x62
> kernel:warning: [10419813.132372]  process_one_work+0x169/0x399
> kernel:warning: [10419813.132374]  worker_thread+0x4d/0x3e5
> kernel:warning: [10419813.132377]  kthread+0x105/0x138
> kernel:warning: [10419813.132379]  ? rescuer_thread+0x380/0x375
> kernel:warning: [10419813.132380]  ? kthread_bind+0x20/0x15
> kernel:warning: [10419813.132382]  ret_from_fork+0x24/0x49
> ...
> 
> So, I tried to hot power off the device via
> "echo 0 > /sys/bus/pci/slots/X/power" -- the thread also hangs
> waiting for the nvme reset thread to finish (like so) -- 

Looks like this "power" sysfs file could use some documentation.  I
couldn't find anything in Documentation/ABI/testing/ that seems to
cover it.

> kernel:warning: [10419813.158116]  __schedule+0x2bc/0x89b
> kernel:warning: [10419813.158119]  schedule+0x36/0x7c
> kernel:warning: [10419813.158122]  schedule_timeout+0x1f6/0x31f
> kernel:warning: [10419813.158124]  ? sched_clock_cpu+0x11/0xa5
> kernel:warning: [10419813.158126]  ? try_to_wake_up+0x59/0x505
> kernel:warning: [10419813.158130]  wait_for_completion+0x12b/0x18a
> kernel:warning: [10419813.158132]  ? wake_up_q+0x80/0x73
> kernel:warning: [10419813.158134]  flush_work+0x122/0x1a7
> kernel:warning: [10419813.158137]  ? wake_up_worker+0x30/0x2b
> kernel:warning: [10419813.158141]  nvme_remove+0x71/0x100 [nvme]
> kernel:warning: [10419813.158146]  pci_device_remove+0x3e/0xb6
> kernel:warning: [10419813.158149]  device_release_driver_internal+0x134/0x1eb
> kernel:warning: [10419813.158151]  device_release_driver+0x12/0x14
> kernel:warning: [10419813.158155]  pci_stop_bus_device+0x7c/0x96
> kernel:warning: [10419813.158158]  pci_stop_bus_device+0x39/0x96
> kernel:warning: [10419813.158164]  pci_stop_and_remove_bus_device+0x12/0x1d
> kernel:warning: [10419813.158167]  pciehp_unconfigure_device+0x7a/0x1d7
> kernel:warning: [10419813.158169]  pciehp_disable_slot+0x52/0xca
> kernel:warning: [10419813.158171]  pciehp_sysfs_disable_slot+0x67/0x112
> kernel:warning: [10419813.158174]  disable_slot+0x12/0x14
> kernel:warning: [10419813.158175]  power_write_file+0x6e/0xf8
> kernel:warning: [10419813.158179]  pci_slot_attr_store+0x24/0x2e
> kernel:warning: [10419813.158180]  sysfs_kf_write+0x3f/0x46
> kernel:warning: [10419813.158182]  kernfs_fop_write+0x124/0x1a3
> kernel:warning: [10419813.158184]  __vfs_write+0x3a/0x16d
> kernel:warning: [10419813.158187]  ? audit_filter_syscall+0x33/0xce
> kernel:warning: [10419813.158189]  vfs_write+0xb2/0x1a1
> 
> Is there a way to force power off the device instead of the
> "graceful" approach? Obviously, we don't want to reset the system
> and don't have physical access to the device.  
> 
> Would it make sense to create a "force power off" in
> /sys/bus/pci/slots/X which basically 

> a) Sets completion timeout mask (CTO) (for outstanding IO requests
>    not causing a fatal error due to CTOs; not an issue for DPCs I
>    would think)
> b) power off the slot 
> c) enable CTO mask
> d) unconfigure the device via pciehp_unconfigure_device

So I assume the existing sysfs slot "power" interface would do what
you want except that nvme_remove() hangs?

There might be some improvement to make in nvme_remove(); maybe it
doesn't correctly detect I/O errors or something.

But maybe there's *also* a case to be made for an interface like you
suggest.  Lukas, Hans, any reaction to this?

Bjorn


* Re: [External] : Re: sysfs interface to force power off
  2022-11-07 20:41 ` Bjorn Helgaas
@ 2022-11-07 21:14   ` James Puthukattukaran
  2022-11-07 21:29     ` Bjorn Helgaas
  2022-11-08 16:12     ` Keith Busch
  2022-11-08  9:53   ` Lukas Wunner
  1 sibling, 2 replies; 8+ messages in thread
From: James Puthukattukaran @ 2022-11-07 21:14 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Lukas Wunner, Hans de Goede, linux-pci



On 11/7/22 15:41, Bjorn Helgaas wrote:
> [+cc Lukas, Hans]
> 
> On Fri, Nov 04, 2022 at 07:08:34PM -0400, James Puthukattukaran wrote:
>> Looking to solve a problem where we have nvme drives that are hung
>> in the field and we are not sure of the root cause but the working
>> theory is that the controller is "bad" and not responding properly
>> to commands. The nvme driver times out on outstanding IO requests
>> and as part of recovery, attempts to reset the controller and
>> reinitialize the device. The reset controller also hangs like here
>> --   
>>
>> kernel:info: [10419813.132341] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
>> kernel:warning: [10419813.132342] Call Trace:
>> kernel:warning: [10419813.132345]  __schedule+0x2bc/0x89b
>> kernel:warning: [10419813.132348]  schedule+0x36/0x7c
>> kernel:warning: [10419813.132351]  blk_mq_freeze_queue_wait+0x4b/0xaa
>> kernel:warning: [10419813.132353]  ? remove_wait_queue+0x60/0x60
>> kernel:warning: [10419813.132359]  nvme_wait_freeze+0x33/0x50 [nvme_core]
>> kernel:warning: [10419813.132362]  nvme_reset_work+0x802/0xd84 [nvme]
>> kernel:warning: [10419813.132364]  ? __switch_to_asm+0x40/0x62
>> kernel:warning: [10419813.132365]  ? __switch_to_asm+0x34/0x62
>> kernel:warning: [10419813.132367]  ? __switch_to+0x9b/0x505
>> kernel:warning: [10419813.132368]  ? __switch_to_asm+0x40/0x62
>> kernel:warning: [10419813.132370]  ? __switch_to_asm+0x40/0x62
>> kernel:warning: [10419813.132372]  process_one_work+0x169/0x399
>> kernel:warning: [10419813.132374]  worker_thread+0x4d/0x3e5
>> kernel:warning: [10419813.132377]  kthread+0x105/0x138
>> kernel:warning: [10419813.132379]  ? rescuer_thread+0x380/0x375
>> kernel:warning: [10419813.132380]  ? kthread_bind+0x20/0x15
>> kernel:warning: [10419813.132382]  ret_from_fork+0x24/0x49
>> ...
>>
>> So, I tried to hot power off the device via
>> "echo 0 > /sys/bus/pci/slots/X/power" -- the thread also hangs
>> waiting for the nvme reset thread to finish (like so) -- 
> 
> Looks like this "power" sysfs file could use some documentation.  I
> couldn't find anything in Documentation/ABI/testing/ that seems to
> cover it.
> 
>> kernel:warning: [10419813.158116]  __schedule+0x2bc/0x89b
>> kernel:warning: [10419813.158119]  schedule+0x36/0x7c
>> kernel:warning: [10419813.158122]  schedule_timeout+0x1f6/0x31f
>> kernel:warning: [10419813.158124]  ? sched_clock_cpu+0x11/0xa5
>> kernel:warning: [10419813.158126]  ? try_to_wake_up+0x59/0x505
>> kernel:warning: [10419813.158130]  wait_for_completion+0x12b/0x18a
>> kernel:warning: [10419813.158132]  ? wake_up_q+0x80/0x73
>> kernel:warning: [10419813.158134]  flush_work+0x122/0x1a7
>> kernel:warning: [10419813.158137]  ? wake_up_worker+0x30/0x2b
>> kernel:warning: [10419813.158141]  nvme_remove+0x71/0x100 [nvme]
>> kernel:warning: [10419813.158146]  pci_device_remove+0x3e/0xb6
>> kernel:warning: [10419813.158149]  device_release_driver_internal+0x134/0x1eb
>> kernel:warning: [10419813.158151]  device_release_driver+0x12/0x14
>> kernel:warning: [10419813.158155]  pci_stop_bus_device+0x7c/0x96
>> kernel:warning: [10419813.158158]  pci_stop_bus_device+0x39/0x96
>> kernel:warning: [10419813.158164]  pci_stop_and_remove_bus_device+0x12/0x1d
>> kernel:warning: [10419813.158167]  pciehp_unconfigure_device+0x7a/0x1d7
>> kernel:warning: [10419813.158169]  pciehp_disable_slot+0x52/0xca
>> kernel:warning: [10419813.158171]  pciehp_sysfs_disable_slot+0x67/0x112
>> kernel:warning: [10419813.158174]  disable_slot+0x12/0x14
>> kernel:warning: [10419813.158175]  power_write_file+0x6e/0xf8
>> kernel:warning: [10419813.158179]  pci_slot_attr_store+0x24/0x2e
>> kernel:warning: [10419813.158180]  sysfs_kf_write+0x3f/0x46
>> kernel:warning: [10419813.158182]  kernfs_fop_write+0x124/0x1a3
>> kernel:warning: [10419813.158184]  __vfs_write+0x3a/0x16d
>> kernel:warning: [10419813.158187]  ? audit_filter_syscall+0x33/0xce
>> kernel:warning: [10419813.158189]  vfs_write+0xb2/0x1a1
>>
>> Is there a way to force power off the device instead of the
>> "graceful" approach? Obviously, we don't want to reset the system
>> and don't have physical access to the device.  
>>
>> Would it make sense to create a "force power off" in
>> /sys/bus/pci/slots/X which basically 
> 
>> a) Sets completion timeout mask (CTO) (for outstanding IO requests
>>    not causing a fatal error due to CTOs; not an issue for DPCs I
>>    would think)
>> b) power off the slot 
>> c) enable CTO mask
>> d) unconfigure the device via pciehp_unconfigure_device
> 
> So I assume the existing sysfs slot "power" interface would do what
> you want except that nvme_remove() hangs?
> 
> There might be some improvement to make in nvme_remove(); maybe it
> doesn't correctly detect I/O errors or something.

There is a path to disable the controller and that code ran but did not help. I checked with the nvme folks and Keith mentioned that there might be an issue with the nvme queue management. Unfortunately, we can't try newer kernels in the field. So, looking for a way to just "shut off the device" when we have scenarios like this where we can't untangle the mess.

> But maybe there's *also* a case to be made for an interface like you
> suggest.  Lukas, Hans, any reaction to this?
>
> Bjorn

I have a patch that I've tested out, assuming this approach makes sense.
thanks
James




* Re: [External] : Re: sysfs interface to force power off
  2022-11-07 21:14   ` [External] : " James Puthukattukaran
@ 2022-11-07 21:29     ` Bjorn Helgaas
  2022-11-08 16:12     ` Keith Busch
  1 sibling, 0 replies; 8+ messages in thread
From: Bjorn Helgaas @ 2022-11-07 21:29 UTC (permalink / raw)
  To: James Puthukattukaran; +Cc: Lukas Wunner, Hans de Goede, linux-pci

On Mon, Nov 07, 2022 at 04:14:54PM -0500, James Puthukattukaran wrote:
> ...

> I have a patch that I've tested out, assuming this approach
> makes sense.

Don't hesitate to post the patch.  It's always easier to talk about
things when we can see the concrete details.


* Re: sysfs interface to force power off
  2022-11-07 20:41 ` Bjorn Helgaas
  2022-11-07 21:14   ` [External] : " James Puthukattukaran
@ 2022-11-08  9:53   ` Lukas Wunner
  1 sibling, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2022-11-08  9:53 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: James Puthukattukaran, Hans de Goede, linux-pci

On Mon, Nov 07, 2022 at 02:41:29PM -0600, Bjorn Helgaas wrote:
> On Fri, Nov 04, 2022 at 07:08:34PM -0400, James Puthukattukaran wrote:
> > Looking to solve a problem where we have nvme drives that are hung
> > in the field and we are not sure of the root cause but the working
> > theory is that the controller is "bad" and not responding properly
> > to commands. The nvme driver times out on outstanding IO requests
> > and as part of recovery, attempts to reset the controller and
> > reinitialize the device. The reset controller also hangs like here
> > --   
> > 
> > kernel:info: [10419813.132341] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> > kernel:warning: [10419813.132342] Call Trace:
> > kernel:warning: [10419813.132345]  __schedule+0x2bc/0x89b
> > kernel:warning: [10419813.132348]  schedule+0x36/0x7c
> > kernel:warning: [10419813.132351]  blk_mq_freeze_queue_wait+0x4b/0xaa
> > kernel:warning: [10419813.132353]  ? remove_wait_queue+0x60/0x60
> > kernel:warning: [10419813.132359]  nvme_wait_freeze+0x33/0x50 [nvme_core]
> > kernel:warning: [10419813.132362]  nvme_reset_work+0x802/0xd84 [nvme]
> > kernel:warning: [10419813.132364]  ? __switch_to_asm+0x40/0x62
> > kernel:warning: [10419813.132365]  ? __switch_to_asm+0x34/0x62
> > kernel:warning: [10419813.132367]  ? __switch_to+0x9b/0x505
> > kernel:warning: [10419813.132368]  ? __switch_to_asm+0x40/0x62
> > kernel:warning: [10419813.132370]  ? __switch_to_asm+0x40/0x62
> > kernel:warning: [10419813.132372]  process_one_work+0x169/0x399
> > kernel:warning: [10419813.132374]  worker_thread+0x4d/0x3e5
> > kernel:warning: [10419813.132377]  kthread+0x105/0x138
> > kernel:warning: [10419813.132379]  ? rescuer_thread+0x380/0x375
> > kernel:warning: [10419813.132380]  ? kthread_bind+0x20/0x15
> > kernel:warning: [10419813.132382]  ret_from_fork+0x24/0x49
> > ...
> > 
> > So, I tried to hot power off the device via
> > "echo 0 > /sys/bus/pci/slots/X/power" -- the thread also hangs
> > waiting for the nvme reset thread to finish (like so) -- 
> 
> Looks like this "power" sysfs file could use some documentation.  I
> couldn't find anything in Documentation/ABI/testing/ that seems to
> cover it.

That sysfs attribute was introduced in early 2002; I guess we were
less diligent with documentation back then:

http://git.kernel.org/tglx/history/c/a8a2069f432c

(search for power_write_file() in the commit)


The problem here is in the NVMe / block layer, not the PCI layer.
nvme_wait_freeze() calls blk_mq_freeze_queue_wait(), but obviously
it should call blk_mq_freeze_queue_wait_timeout() instead and handle
a timeout by retiring any outstanding I/O requests to the drive and
marking it as dead.
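
The bounded-wait idea can be illustrated with a userspace model. This is not kernel code: struct queue_model, wait_drained_timeout() and reset_wait() are hypothetical names, and pthread_cond_timedwait() stands in for the kernel's timed wait. The sketch only shows the shape of "wait with a deadline, and on timeout retire outstanding requests and mark the device dead".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical stand-in for a request queue with outstanding I/O. */
struct queue_model {
	pthread_mutex_t lock;
	pthread_cond_t drained;	/* signalled when inflight drops to 0 */
	unsigned int inflight;	/* outstanding requests */
	int dead;		/* set once the device is given up on */
};

/* Wait until inflight == 0 or the deadline passes; 1 = drained, 0 = timeout. */
int wait_drained_timeout(struct queue_model *q, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0, ok;

	assert(q != NULL && timeout_ms >= 0);

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&q->lock);
	/* Loop to absorb spurious wakeups; stop on timeout or any error. */
	while (q->inflight != 0 && rc == 0)
		rc = pthread_cond_timedwait(&q->drained, &q->lock, &deadline);
	ok = (q->inflight == 0);
	pthread_mutex_unlock(&q->lock);
	return ok;
}

/* Caller side: on timeout, retire outstanding requests and mark dead. */
int reset_wait(struct queue_model *q, long timeout_ms)
{
	if (wait_drained_timeout(q, timeout_ms))
		return 0;

	pthread_mutex_lock(&q->lock);
	q->inflight = 0;	/* model: fail every outstanding request */
	q->dead = 1;
	pthread_mutex_unlock(&q->lock);
	return -ETIMEDOUT;
}
```

With one request that never completes, reset_wait() returns -ETIMEDOUT after the deadline instead of blocking forever, which is the behavioural difference being argued for here.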


> > kernel:warning: [10419813.158116]  __schedule+0x2bc/0x89b
> > kernel:warning: [10419813.158119]  schedule+0x36/0x7c
> > kernel:warning: [10419813.158122]  schedule_timeout+0x1f6/0x31f
> > kernel:warning: [10419813.158124]  ? sched_clock_cpu+0x11/0xa5
> > kernel:warning: [10419813.158126]  ? try_to_wake_up+0x59/0x505
> > kernel:warning: [10419813.158130]  wait_for_completion+0x12b/0x18a
> > kernel:warning: [10419813.158132]  ? wake_up_q+0x80/0x73
> > kernel:warning: [10419813.158134]  flush_work+0x122/0x1a7
> > kernel:warning: [10419813.158137]  ? wake_up_worker+0x30/0x2b
> > kernel:warning: [10419813.158141]  nvme_remove+0x71/0x100 [nvme]
> > kernel:warning: [10419813.158146]  pci_device_remove+0x3e/0xb6
> > kernel:warning: [10419813.158149]  device_release_driver_internal+0x134/0x1eb
> > kernel:warning: [10419813.158151]  device_release_driver+0x12/0x14
> > kernel:warning: [10419813.158155]  pci_stop_bus_device+0x7c/0x96
> > kernel:warning: [10419813.158158]  pci_stop_bus_device+0x39/0x96
> > kernel:warning: [10419813.158164]  pci_stop_and_remove_bus_device+0x12/0x1d
> > kernel:warning: [10419813.158167]  pciehp_unconfigure_device+0x7a/0x1d7
> > kernel:warning: [10419813.158169]  pciehp_disable_slot+0x52/0xca
> > kernel:warning: [10419813.158171]  pciehp_sysfs_disable_slot+0x67/0x112
> > kernel:warning: [10419813.158174]  disable_slot+0x12/0x14
> > kernel:warning: [10419813.158175]  power_write_file+0x6e/0xf8
> > kernel:warning: [10419813.158179]  pci_slot_attr_store+0x24/0x2e
> > kernel:warning: [10419813.158180]  sysfs_kf_write+0x3f/0x46
> > kernel:warning: [10419813.158182]  kernfs_fop_write+0x124/0x1a3
> > kernel:warning: [10419813.158184]  __vfs_write+0x3a/0x16d
> > kernel:warning: [10419813.158187]  ? audit_filter_syscall+0x33/0xce
> > kernel:warning: [10419813.158189]  vfs_write+0xb2/0x1a1
> > 
> > Is there a way to force power off the device instead of the
> > "graceful" approach? Obviously, we don't want to reset the system
> > and don't have physical access to the device.  
> > 
> > Would it make sense to create a "force power off" in
> > /sys/bus/pci/slots/X which basically

The power attribute in sysfs already does what you want, but when
unbinding the nvme driver from the device, the flush_work() call
waits for nvme_reset_work() to finish.  And because that's stuck,
unbinding also gets stuck.  Again, the solution is a code fix
in the NVMe / block layer, so the proper mailing list to ask
would be linux-nvme and linux-block.

Thanks,

Lukas


* Re: [External] : Re: sysfs interface to force power off
  2022-11-07 21:14   ` [External] : " James Puthukattukaran
  2022-11-07 21:29     ` Bjorn Helgaas
@ 2022-11-08 16:12     ` Keith Busch
  2022-11-08 20:16       ` Lukas Wunner
  1 sibling, 1 reply; 8+ messages in thread
From: Keith Busch @ 2022-11-08 16:12 UTC (permalink / raw)
  To: James Puthukattukaran
  Cc: Bjorn Helgaas, Lukas Wunner, Hans de Goede, linux-pci

On Mon, Nov 07, 2022 at 04:14:54PM -0500, James Puthukattukaran wrote:
> 
> There is a path to disable the controller and that code ran but did
> not help. I checked with the nvme folks and Keith mentioned that there
> might be an issue with the nvme queue management. Unfortunately, we
> can't try newer kernels in the field. So, looking for a way to just
> "shut off the device" when we have scenarios like this where we can't
> untangle the mess. 

Well, I didn't request you try new kernels in the field. I asked if you
could experiment with a newer one on a development machine to confirm if
the bug was fixed by some of the significant changes in this path so
that we could confirm a reason to port to stable. You're going to have
to change your kernel to fix this observation, so it would be worth the
effort to know if the changes being considered actually address the
problem.

If you're just looking for a work-around for this specific scenario,
sorry, I don't think we'll find one. You should just avoid this scenario
if you can't change your kernel.


* Re: [External] : Re: sysfs interface to force power off
  2022-11-08 16:12     ` Keith Busch
@ 2022-11-08 20:16       ` Lukas Wunner
  2022-11-08 20:37         ` Keith Busch
  0 siblings, 1 reply; 8+ messages in thread
From: Lukas Wunner @ 2022-11-08 20:16 UTC (permalink / raw)
  To: Keith Busch
  Cc: James Puthukattukaran, Bjorn Helgaas, Hans de Goede, linux-pci

On Tue, Nov 08, 2022 at 09:12:44AM -0700, Keith Busch wrote:
> On Mon, Nov 07, 2022 at 04:14:54PM -0500, James Puthukattukaran wrote:
> > 
> > There is a path to disable the controller and that code ran but did
> > not help. I checked with the nvme folks and Keith mentioned that there
> > might be an issue with the nvme queue management. Unfortunately, we
> > can't try newer kernels in the field. So, looking for a way to just
> > "shut off the device" when we have scenarios like this where we can't
> > untangle the mess. 
> 
> Well, I didn't request you try new kernels in the field. I asked if you
> could experiment with a newer one on a development machine to confirm if
> the bug was fixed by some of the significant changes in this path so
> that we could confirm a reason to port to stable. You're going to have
> to change your kernel to fix this observation, so it would be worth the
> effort to know if the changes being considered actually address the
> problem.

Current mainline still contains this problematic sequence:

  nvme_reset_work()
    nvme_wait_freeze()
      blk_mq_freeze_queue_wait()

So I'm inclined to believe that the issue still persists, but I agree
that validating that hypothesis with a contemporary kernel should be
the first step.

I think nvme_reset_work() is overly optimistic that resetting the drive
succeeded.  It just freezes and unfreezes the I/O queue without checking
for errors.

In particular, nvme_wait_freeze() should call the _timeout variant of
blk_mq_freeze_queue_wait() and cope with failure of freezing.

Thanks,

Lukas


* Re: [External] : Re: sysfs interface to force power off
  2022-11-08 20:16       ` Lukas Wunner
@ 2022-11-08 20:37         ` Keith Busch
  0 siblings, 0 replies; 8+ messages in thread
From: Keith Busch @ 2022-11-08 20:37 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: James Puthukattukaran, Bjorn Helgaas, Hans de Goede, linux-pci

On Tue, Nov 08, 2022 at 09:16:53PM +0100, Lukas Wunner wrote:
> On Tue, Nov 08, 2022 at 09:12:44AM -0700, Keith Busch wrote:
> > On Mon, Nov 07, 2022 at 04:14:54PM -0500, James Puthukattukaran wrote:
> > > 
> > > There is a path to disable the controller and that code ran but did
> > > not help. I checked with the nvme folks and Keith mentioned that there
> > > might be an issue with the nvme queue management. Unfortunately, we
> > > can't try newer kernels in the field. So, looking for a way to just
> > > "shut off the device" when we have scenarios like this where we can't
> > > untangle the mess. 
> > 
> > Well, I didn't request you try new kernels in the field. I asked if you
> > could experiment with a newer one on a development machine to confirm if
> > the bug was fixed by some of the significant changes in this path so
> > that we could confirm a reason to port to stable. You're going to have
> > to change your kernel to fix this observation, so it would be worth the
> > effort to know if the changes being considered actually address the
> > problem.
> 
> Current mainline still contains this problematic sequence:
> 
>   nvme_reset_work()
>     nvme_wait_freeze()
>       blk_mq_freeze_queue_wait()
> 
> So I'm inclined to believe that the issue still persists, but I agree

Yeah, that sequence exists, but there are some subtle changes with how
the workqueues account for unquiescing hardware queues that can affect
how a freeze can make forward progress.

> I think nvme_reset_work() is overly optimistic that resetting the drive
> succeeded.  It just freezes and unfreezes the I/O queue without checking
> for errors.

I'm not sure what you mean. An nvme reset is a CC.EN 0->1 transition,
and we definitely confirm that succeeds.

If you're referring to the 1->0 transition, that has to happen after the
initial freeze/quiesce steps, but whether or not that succeeds shouldn't
be relevant to the rest of the sequence: we're about to disable the
device at the PCI level.
 
> In particular, nvme_wait_freeze() should call the _timeout variant of
> blk_mq_freeze_queue_wait() and cope with failure of freezing.

That would indicate we have a mismatched freeze depth or an unbalanced
quiesce problem, so the timeout freeze would just mask the underlying
issue.
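
A toy model may help show why an imbalance of this kind keeps a freeze from finishing. It is a simplification under stated assumptions, not the block layer's implementation: struct freeze_model and the model_* functions are hypothetical, and both freeze and quiesce are modelled as plain nesting counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one queue's freeze and quiesce accounting. */
struct freeze_model {
	int freeze_depth;	/* freezes not yet matched by an unfreeze */
	int quiesce_depth;	/* quiesces not yet matched by an unquiesce */
	int inflight;		/* requests entered but not yet completed */
};

void model_freeze_start(struct freeze_model *q)
{
	q->freeze_depth++;
}

void model_unfreeze(struct freeze_model *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

void model_quiesce(struct freeze_model *q)
{
	q->quiesce_depth++;
}

void model_unquiesce(struct freeze_model *q)
{
	assert(q->quiesce_depth > 0);
	q->quiesce_depth--;
}

/* A pending request can only be dispatched and completed if not quiesced. */
bool model_complete_one(struct freeze_model *q)
{
	if (q->quiesce_depth > 0 || q->inflight == 0)
		return false;
	q->inflight--;
	return true;
}

/* The freeze wait finishes only once every entered request has completed. */
bool model_freeze_done(const struct freeze_model *q)
{
	return q->freeze_depth > 0 && q->inflight == 0;
}

/* New I/O is admitted only when no freeze is outstanding. */
bool model_accepts_io(const struct freeze_model *q)
{
	return q->freeze_depth == 0;
}
```

In this model, a quiesce that is never undone leaves inflight above zero, so model_freeze_done() stays false however long the caller waits; a timeout on the wait would end the wait without repairing the counters, which matches the concern about masking the underlying issue.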


