* kernel oops after nvme_set_queue_count()
From: Hannes Reinecke @ 2021-01-21  8:25 UTC
  To: Christoph Hellwig, Sagi Grimberg, linux-nvme

Hi all,

a customer of ours ran into this oops:

[44157.918962] nvme nvme5: I/O 22 QID 0 timeout
[44163.347467] nvme nvme5: Could not set queue count (880)
[44163.347551] nvme nvme5: Successfully reconnected (6 attempts)
[44168.414977] BUG: unable to handle kernel paging request at ffff888e261e7808
[44168.414988] IP: 0xffff888e261e7808
[44168.414994] PGD 98c2ae067 P4D 98c2ae067 PUD f57937063 PMD 8000000f660001e3

It's related to this code snippet in drivers/nvme/host/core.c

	/*
	 * Degraded controllers might return an error when setting the queue
	 * count.  We still want to be able to bring them online and offer
	 * access to the admin queue, as that might be the only way to fix them up.
	 */
	if (status > 0) {
		dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
		*count = 0;


causing nvme_set_queue_count() _not_ to return an error, but rather 
letting the reconnect complete.
Of course, as this failure is due to a timeout (cf. the status code; 880 
is NVME_SC_HOST_PATH_ERROR), the admin queue has been torn down by the 
transport, causing this crash.
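
For reference, the surrounding function looks roughly like this 
(paraphrased from memory, so details may differ between kernel versions):

	int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
	{
		u32 q_count = (*count - 1) | ((*count - 1) << 16);
		u32 result;
		int status, nr_io_queues;

		status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count,
				NULL, 0, &result);
		if (status < 0)
			return status;

		/*
		 * NVME_SC_HOST_PATH_ERROR (0x370 == 880) is a _positive_
		 * status, so a command failed by a torn-down transport
		 * lands here rather than in the status < 0 branch above.
		 */
		if (status > 0) {
			dev_err(ctrl->device,
				"Could not set queue count (%d)\n", status);
			*count = 0;
		} else {
			nr_io_queues = min(result & 0xffff, result >> 16) + 1;
			*count = min(*count, nr_io_queues);
		}
		return 0;
	}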

So, question: _why_ do we ignore the status?

For fabrics I completely fail to see the reason here; even _if_ it 
worked we would end up with a connection for which just the admin queue 
is operable, the state is LIVE, and all information we could glean 
would indicate that the connection is perfectly healthy.
It just doesn't have any I/O queues.
Which will lead to some very confused customers and some very unhappy 
support folks trying to figure out what has happened.

Can we just kill this statement and always return an error?
In all other cases we are quite trigger-happy with controller reset; why 
not here?
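
I.e. something like this (untested sketch, just to illustrate the idea):

	if (status > 0) {
		dev_err(ctrl->device, "Could not set queue count (%d)\n",
			status);
		/* propagate the failure instead of faking success */
		return -EIO;
	}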

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: kernel oops after nvme_set_queue_count()
From: Sagi Grimberg @ 2021-01-21  9:06 UTC
  To: Hannes Reinecke, Christoph Hellwig, linux-nvme


> Hi all,
> 
> a customer of ours ran into this oops:
> 
> [44157.918962] nvme nvme5: I/O 22 QID 0 timeout
> [44163.347467] nvme nvme5: Could not set queue count (880)
> [44163.347551] nvme nvme5: Successfully reconnected (6 attempts)
> [44168.414977] BUG: unable to handle kernel paging request at ffff888e261e7808
> [44168.414988] IP: 0xffff888e261e7808
> [44168.414994] PGD 98c2ae067 P4D 98c2ae067 PUD f57937063 PMD 8000000f660001e3
> 
> It's related to this code snippet in drivers/nvme/host/core.c
> 
>      /*
>       * Degraded controllers might return an error when setting the queue
>       * count.  We still want to be able to bring them online and offer
>       * access to the admin queue, as that might be the only way to fix
>       * them up.
>       */
>      if (status > 0) {
>          dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
>          *count = 0;
> 
> 
> causing nvme_set_queue_count() _not_ to return an error, but rather 
> letting the reconnect complete.
> Of course, as this failure is due to a timeout (cf. the status code; 880
> is NVME_SC_HOST_PATH_ERROR), the admin queue has been torn down by the 
> transport, causing this crash.
> 
> So, question: _why_ do we ignore the status?

This used to exist for pci, where a controller reset may fail to set up
I/O queues; at least then the controller can still accept admin commands
to get some diagnostics (perhaps an error log page).
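(Even with zero I/O queues, something like 'nvme error-log /dev/nvme5'
from nvme-cli still works, as it only needs the admin queue.)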

> For fabrics I completely fail to see the reason here; even _if_ it 
> worked we would end up with a connection for which just the admin queue 
> is operable, the state is LIVE, and all information we could glean 
> would indicate that the connection is perfectly healthy.

We also had an ADMIN_ONLY state at some point, but that was dropped as
well for reasons I don't remember at the moment.

> It just doesn't have any I/O queues.
> Which will lead to some very confused customers and some very unhappy 
> support folks trying to figure out what has happened.
> 
> Can we just kill this statement and always return an error?
> In all other cases we are quite trigger-happy with controller reset; why 
> not here?

I think we will want to keep the existing behavior for pci, but agree we
probably want to change it for fabrics...
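
Maybe gate the leniency on the transport type, something like (untested):

	if (status > 0) {
		/* fabrics: the admin queue may already be gone, so fail */
		if (ctrl->ops->flags & NVME_F_FABRICS)
			return -EIO;
		dev_err(ctrl->device, "Could not set queue count (%d)\n",
			status);
		*count = 0;
	}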


* Re: kernel oops after nvme_set_queue_count()
From: Keith Busch @ 2021-01-21 17:07 UTC
  To: Hannes Reinecke; +Cc: Christoph Hellwig, linux-nvme, Sagi Grimberg

On Thu, Jan 21, 2021 at 09:25:39AM +0100, Hannes Reinecke wrote:
> Hi all,
> 
> a customer of ours ran into this oops:
> 
> [44157.918962] nvme nvme5: I/O 22 QID 0 timeout
> [44163.347467] nvme nvme5: Could not set queue count (880)
> [44163.347551] nvme nvme5: Successfully reconnected (6 attempts)
> [44168.414977] BUG: unable to handle kernel paging request at ffff888e261e7808
> [44168.414988] IP: 0xffff888e261e7808
> [44168.414994] PGD 98c2ae067 P4D 98c2ae067 PUD f57937063 PMD 8000000f660001e3
> 
> It's related to this code snippet in drivers/nvme/host/core.c
> 
> 	/*
> 	 * Degraded controllers might return an error when setting the queue
> 	 * count.  We still want to be able to bring them online and offer
> 	 * access to the admin queue, as that might be the only way to fix them up.
> 	 */
> 	if (status > 0) {
> 		dev_err(ctrl->device, "Could not set queue count (%d)\n", status);
> 		*count = 0;
> 
> 
> causing nvme_set_queue_count() _not_ to return an error, but rather 
> letting the reconnect complete.
> Of course, as this failure is due to a timeout (cf. the status code; 880
> is NVME_SC_HOST_PATH_ERROR), the admin queue has been torn down by the
> transport, causing this crash.

This doesn't sound right. A command that times out without a response
from the controller is supposed to get the -EINTR return, which exits
with an error earlier in the function, above the code you're showing.
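
For reference, __nvme_submit_sync_cmd() maps a cancelled request to
-EINTR (sketch from memory):

	if (nvme_req(req)->flags & NVME_REQ_CANCELLED)
		ret = -EINTR;
	else
		ret = nvme_req(req)->status;

so nvme_set_queue_count() should bail out at its "status < 0" check
before ever reaching the degraded-controller leniency. Seeing 880
(NVME_SC_HOST_PATH_ERROR) instead suggests the transport didn't set
NVME_REQ_CANCELLED on the timed-out command.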

