* [PATCH v3] nvme/pci: Log PCI_STATUS when the controller dies
@ 2016-12-02 16:58 Andy Lutomirski
  2016-12-02 22:48 ` Keith Busch
  2016-12-03  4:16 ` Jens Axboe
  0 siblings, 2 replies; 4+ messages in thread
From: Andy Lutomirski @ 2016-12-02 16:58 UTC (permalink / raw)


When debugging nvme controller crashes, it's nice to know whether
the controller died cleanly so that the failure is just reflected in
CSTS, whether it died and put an error in PCI_STATUS, or whether it
died so badly that it stopped responding to PCI configuration space
reads.

I've seen a failure that gives 0xffff in PCI_STATUS on a Samsung
"SM951 NVMe SAMSUNG 256GB" with firmware "BXW75D0Q".

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/pci.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5e52034ab010..628dc916ae9a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1281,6 +1281,24 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 	return true;
 }
 
+static void nvme_warn_reset(struct nvme_dev *dev, u32 csts)
+{
+	/* Read a config register to help see what died. */
+	u16 pci_status;
+	int result;
+
+	result = pci_read_config_word(to_pci_dev(dev->dev), PCI_STATUS,
+				      &pci_status);
+	if (result == PCIBIOS_SUCCESSFUL)
+		dev_warn(dev->dev,
+			 "controller is down; will reset: CSTS=0x%x, PCI_STATUS=0x%hx\n",
+			 csts, pci_status);
+	else
+		dev_warn(dev->dev,
+			 "controller is down; will reset: CSTS=0x%x, PCI_STATUS read failed (%d)\n",
+			 csts, result);
+}
+
 static void nvme_watchdog_timer(unsigned long data)
 {
 	struct nvme_dev *dev = (struct nvme_dev *)data;
@@ -1289,9 +1307,7 @@ static void nvme_watchdog_timer(unsigned long data)
 	/* Skip controllers under certain specific conditions. */
 	if (nvme_should_reset(dev, csts)) {
 		if (!nvme_reset(dev))
-			dev_warn(dev->dev,
-				"Failed status: 0x%x, reset controller.\n",
-				csts);
+			 nvme_warn_reset(dev, csts);
 		return;
 	}
 
-- 
2.9.3
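
For reference, the check nvme_warn_reset() performs from inside the driver
can also be reproduced from userspace through the device's sysfs config
file; PCI_STATUS sits at offset 0x06 of configuration space. A minimal
sketch, not part of the patch, using 0000:01:00.0 as a placeholder address:

	/* Sketch: read PCI_STATUS via sysfs and look for the all-ones pattern. */
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		uint16_t status;
		int fd = open("/sys/bus/pci/devices/0000:01:00.0/config", O_RDONLY);

		if (fd < 0 || pread(fd, &status, sizeof(status), 0x06) != sizeof(status)) {
			perror("config read");
			return 1;
		}
		if (status == 0xffff)
			printf("PCI_STATUS=0xffff: device not answering config reads\n");
		else
			printf("PCI_STATUS=0x%04x\n", status);
		close(fd);
		return 0;
	}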

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH v3] nvme/pci: Log PCI_STATUS when the controller dies
  2016-12-02 22:48 ` Keith Busch
@ 2016-12-02 22:47   ` Andy Lutomirski
  0 siblings, 0 replies; 4+ messages in thread
From: Andy Lutomirski @ 2016-12-02 22:47 UTC (permalink / raw)


On Fri, Dec 2, 2016 at 2:48 PM, Keith Busch <keith.busch@intel.com> wrote:
> On Fri, Dec 02, 2016 at 08:58:57AM -0800, Andy Lutomirski wrote:
>> When debugging nvme controller crashes, it's nice to know whether
>> the controller died cleanly so that the failure is just reflected in
>> CSTS, whether it died and put an error in PCI_STATUS, or whether it
>> died so badly that it stopped responding to PCI configuration space
>> reads.
>>
>> I've seen a failure that gives 0xffff in PCI_STATUS on a Samsung
>> "SM951 NVMe SAMSUNG 256GB" with firmware "BXW75D0Q".
>>
>> Reviewed-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>
> Totally fine with this, but just want to mention that even the MMIO read
> has caused problems when racing a pciehp hot plug event. A config read in
> this case is another opportunity for a completion timeout, unless I can
> get Bjorn to apply the patch series disabling config access on surprise
> removed devices. Or maybe our nvme health check polling implementation
> is misguided.
>

I've been meaning to complain about the keepalive polling kicking my
laptop out of idle every now and then :)  But yes, I think this
particular issue should be solved in the PCI layer.  Also, the PCI
layer should have a nice way to reset devices that go completely out
to lunch, too, IMO.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v3] nvme/pci: Log PCI_STATUS when the controller dies
  2016-12-02 16:58 [PATCH v3] nvme/pci: Log PCI_STATUS when the controller dies Andy Lutomirski
@ 2016-12-02 22:48 ` Keith Busch
  2016-12-02 22:47   ` Andy Lutomirski
  2016-12-03  4:16 ` Jens Axboe
  1 sibling, 1 reply; 4+ messages in thread
From: Keith Busch @ 2016-12-02 22:48 UTC (permalink / raw)


On Fri, Dec 02, 2016 at 08:58:57AM -0800, Andy Lutomirski wrote:
> When debugging nvme controller crashes, it's nice to know whether
> the controller died cleanly so that the failure is just reflected in
> CSTS, whether it died and put an error in PCI_STATUS, or whether it
> died so badly that it stopped responding to PCI configuration space
> reads.
> 
> I've seen a failure that gives 0xffff in PCI_STATUS on a Samsung
> "SM951 NVMe SAMSUNG 256GB" with firmware "BXW75D0Q".
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Totally fine with this, but just want to mention that even the MMIO read
has caused problems when racing a pciehp hot plug event. A config read in
this case is another opportunity for a completion timeout, unless I can
get Bjorn to apply the patch series disabling config access on surprise
removed devices. Or maybe our nvme health check polling implementation
is misguided.

Reviewed-by: Keith Busch <keith.busch@intel.com>
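
A rough sketch (not from this thread) of the conventional all-ones check a
driver can use to tell a surprise-removed controller from one that is merely
hung, before issuing further config reads; nvme_dev_seems_removed() is a
hypothetical helper name. Note that pci_device_is_present() itself re-reads
the vendor ID over config space, so this narrows the completion-timeout
window described above rather than closing it:

	/* Sketch only: distinguish a removed controller from a wedged one. */
	static bool nvme_dev_seems_removed(struct nvme_dev *dev)
	{
		/* MMIO reads to a surprise-removed PCIe device return all-ones. */
		if (readl(dev->bar + NVME_REG_CSTS) == ~0U)
			return true;

		/* Config read of the vendor ID; can still hit a completion timeout. */
		return !pci_device_is_present(to_pci_dev(dev->dev));
	}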

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH v3] nvme/pci: Log PCI_STATUS when the controller dies
  2016-12-02 16:58 [PATCH v3] nvme/pci: Log PCI_STATUS when the controller dies Andy Lutomirski
  2016-12-02 22:48 ` Keith Busch
@ 2016-12-03  4:16 ` Jens Axboe
  1 sibling, 0 replies; 4+ messages in thread
From: Jens Axboe @ 2016-12-03  4:16 UTC (permalink / raw)


On 12/02/2016 09:58 AM, Andy Lutomirski wrote:
> When debugging nvme controller crashes, it's nice to know whether
> the controller died cleanly so that the failure is just reflected in
> CSTS, whether it died and put an error in PCI_STATUS, or whether it
> died so badly that it stopped responding to PCI configuration space
> reads.
> 
> I've seen a failure that gives 0xffff in PCI_STATUS on a Samsung
> "SM951 NVMe SAMSUNG 256GB" with firmware "BXW75D0Q".

I have added this for 4.10. Note that I had to hand-apply, since your
patch did not apply to the 4.10 branch. Additionally, you had a white
space problem before the call to nvme_warn_reset() that I also fixed up.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2016-12-03  4:16 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-02 16:58 [PATCH v3] nvme/pci: Log PCI_STATUS when the controller dies Andy Lutomirski
2016-12-02 22:48 ` Keith Busch
2016-12-02 22:47   ` Andy Lutomirski
2016-12-03  4:16 ` Jens Axboe
