linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/5] nvme APST fixes/improvements for 4.11
@ 2017-04-20  3:02 Andy Lutomirski
  2017-04-20  3:02 ` [PATCH 1/5] nvme: Fix APST comment Andy Lutomirski
                   ` (5 more replies)
  0 siblings, 6 replies; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch, Andy Lutomirski

Sorry for waiting so long for this.  I was waiting for feedback from
Samsung, but they haven't root-caused the issue yet, and I should
have just done this from the beginning.

This series makes APST more debuggable and updates the quirk list.
The quirks I'm aware of are:

 - Samsung 950 series SSDs in Dell XPS 15 9550 and Precision 5510
   laptops (which are essentially the same laptop) can lose their
   PCIe link if they're allowed to use the deepest APST state.
   Samsung engineers have an affected system and are working on
   it.  The same exact SSDs in other machines (even an XPS 13)
   seem to work fine.

 - One Toshiba device malfunctions if APST is used at all.

One thing that improves my confidence that there aren't too many
more problems with APST is that Ubuntu has backported APST to Zesty,
so it's already gotten a bit of testing in a widely used (if very
new) release.

Andy Lutomirski (5):
  nvme: Fix APST comment
  nvme: Display raw APST configuration via DYNAMIC_DEBUG
  nvme: Add nvme_core.force_apst to ignore the NO_APST quirk
  nvme: Adjust the Samsung APST quirk
  nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA"

 drivers/nvme/host/core.c | 61 ++++++++++++++++++++++++++++++++++++++++--------
 drivers/nvme/host/nvme.h |  5 ++++
 drivers/nvme/host/pci.c  | 26 ++++++++++++++++++++-
 3 files changed, 81 insertions(+), 11 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/5] nvme: Fix APST comment
  2017-04-20  3:02 [PATCH 0/5] nvme APST fixes/improvements for 4.11 Andy Lutomirski
@ 2017-04-20  3:02 ` Andy Lutomirski
  2017-04-20  3:02 ` [PATCH 2/5] nvme: Display raw APST configuration via DYNAMIC_DEBUG Andy Lutomirski
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch, Andy Lutomirski

There was a typo in the description of the timeout heuristic.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9583a5f58a1d..4fc58866640e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1267,7 +1267,7 @@ static void nvme_configure_apst(struct nvme_ctrl *ctrl)
 	 * heuristic: we are willing to spend at most 2% of the time
 	 * transitioning between power states.  Therefore, when running
 	 * in any given state, we will enter the next lower-power
-	 * non-operational state after waiting 100 * (enlat + exlat)
+	 * non-operational state after waiting 50 * (enlat + exlat)
 	 * microseconds, as long as that state's total latency is under
 	 * the requested maximum latency.
 	 *
-- 
2.9.3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 2/5] nvme: Display raw APST configuration via DYNAMIC_DEBUG
  2017-04-20  3:02 [PATCH 0/5] nvme APST fixes/improvements for 4.11 Andy Lutomirski
  2017-04-20  3:02 ` [PATCH 1/5] nvme: Fix APST comment Andy Lutomirski
@ 2017-04-20  3:02 ` Andy Lutomirski
  2017-04-20  3:02 ` [PATCH 3/5] nvme: Add nvme_core.force_apst to ignore the NO_APST quirk Andy Lutomirski
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch, Andy Lutomirski

Debugging APST is currently a bit of a pain.  This gives optional
simple log messages that describe the APST state.

The easiest way to use this is probably with the nvme_core.dyndbg=+p
module parameter.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4fc58866640e..52b4e52b85a2 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1278,6 +1278,8 @@ static void nvme_configure_apst(struct nvme_ctrl *ctrl)
 
 	unsigned apste;
 	struct nvme_feat_auto_pst *table;
+	u64 max_lat_us = 0;
+	int max_ps = -1;
 	int ret;
 
 	/*
@@ -1299,6 +1301,7 @@ static void nvme_configure_apst(struct nvme_ctrl *ctrl)
 	if (ctrl->ps_max_latency_us == 0) {
 		/* Turn off APST. */
 		apste = 0;
+		dev_dbg(ctrl->device, "APST disabled\n");
 	} else {
 		__le64 target = cpu_to_le64(0);
 		int state;
@@ -1340,9 +1343,22 @@ static void nvme_configure_apst(struct nvme_ctrl *ctrl)
 
 			target = cpu_to_le64((state << 3) |
 					     (transition_ms << 8));
+
+			if (max_ps == -1)
+				max_ps = state;
+
+			if (total_latency_us > max_lat_us)
+				max_lat_us = total_latency_us;
 		}
 
 		apste = 1;
+
+		if (max_ps == -1) {
+			dev_dbg(ctrl->device, "APST enabled but no non-operational states are available\n");
+		} else {
+			dev_dbg(ctrl->device, "APST enabled: max PS = %d, max round-trip latency = %lluus, table = %*phN\n",
+				max_ps, max_lat_us, (int)sizeof(*table), table);
+		}
 	}
 
 	ret = nvme_set_features(ctrl, NVME_FEAT_AUTO_PST, apste,
-- 
2.9.3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 3/5] nvme: Add nvme_core.force_apst to ignore the NO_APST quirk
  2017-04-20  3:02 [PATCH 0/5] nvme APST fixes/improvements for 4.11 Andy Lutomirski
  2017-04-20  3:02 ` [PATCH 1/5] nvme: Fix APST comment Andy Lutomirski
  2017-04-20  3:02 ` [PATCH 2/5] nvme: Display raw APST configuration via DYNAMIC_DEBUG Andy Lutomirski
@ 2017-04-20  3:02 ` Andy Lutomirski
  2017-04-20  3:02 ` [PATCH 4/5] nvme: Adjust the Samsung APST quirk Andy Lutomirski
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch, Andy Lutomirski

We're probably going to be stuck quirking APST off on an over-broad
range of devices for 4.11.  Let's make it easy to override the quirk
for testing.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 52b4e52b85a2..b422996f88ed 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -62,6 +62,10 @@ module_param(default_ps_max_latency_us, ulong, 0644);
 MODULE_PARM_DESC(default_ps_max_latency_us,
 		 "max power saving latency for new devices; use PM QOS to change per device");
 
+static bool force_apst;
+module_param(force_apst, bool, 0644);
+MODULE_PARM_DESC(force_apst, "allow APST for newly enumerated devices even if quirked off");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -1519,7 +1523,16 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 
 	ctrl->npss = id->npss;
 	prev_apsta = ctrl->apsta;
-	ctrl->apsta = (ctrl->quirks & NVME_QUIRK_NO_APST) ? 0 : id->apsta;
+	if (ctrl->quirks & NVME_QUIRK_NO_APST) {
+		if (force_apst && id->apsta) {
+			dev_warn(ctrl->dev, "forcibly allowing APST due to nvme_core.force_apst -- use at your own risk\n");
+			ctrl->apsta = 1;
+		} else {
+			ctrl->apsta = 0;
+		}
+	} else {
+		ctrl->apsta = id->apsta;
+	}
 	memcpy(ctrl->psd, id->psd, sizeof(ctrl->psd));
 
 	if (ctrl->ops->is_fabrics) {
-- 
2.9.3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 4/5] nvme: Adjust the Samsung APST quirk
  2017-04-20  3:02 [PATCH 0/5] nvme APST fixes/improvements for 4.11 Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-04-20  3:02 ` [PATCH 3/5] nvme: Add nvme_core.force_apst to ignore the NO_APST quirk Andy Lutomirski
@ 2017-04-20  3:02 ` Andy Lutomirski
  2017-04-20  3:07   ` Jens Axboe
  2017-04-20  3:02 ` [PATCH 5/5] nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA" Andy Lutomirski
  2017-04-20  3:10 ` [PATCH 0/5] nvme APST fixes/improvements for 4.11 Jens Axboe
  5 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch, Andy Lutomirski, Niranjan Sivakumar

I got a couple more reports: the Samsung APST issues appears to
affect multiple 950-series devices in Dell XPS 15 9550 and Precision
5510 laptops.  Change the quirk: rather than blacklisting the
firmware on the first problematic SSD that was reported, disable
APST on all 144d:a802 devices if they're installed in the two
affected Dell models.  While we're at it, disable only the deepest
sleep state instead of all of them -- the reporters say that this is
sufficient to fix the problem.

(I have a device that appears to be entirely identical to one of the
affected devices, but I have a different Dell laptop, so it's not
the case that all Samsung devices with firmware BXW75D0Q are broken
under all circumstances.)

Samsung engineers have an affected system, and hopefully they'll
give us a better workaround some time soon.  In the mean time, this
should minimize regressions.

See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Niranjan Sivakumar <ns253@cornell.edu>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 23 +++++++++++++----------
 drivers/nvme/host/nvme.h |  5 +++++
 drivers/nvme/host/pci.c  | 26 +++++++++++++++++++++++++-
 3 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b422996f88ed..bec8c6973ae5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1323,6 +1323,14 @@ static void nvme_configure_apst(struct nvme_ctrl *ctrl)
 				table->entries[state] = target;
 
 			/*
+			 * Don't allow transitions to the deepest state
+			 * if it's quirked off.
+			 */
+			if (state == ctrl->npss &&
+			    (ctrl->quirks & NVME_QUIRK_NO_DEEPEST_PS))
+				continue;
+
+			/*
 			 * Is this state a useful non-operational state for
 			 * higher-power states to autonomously transition to?
 			 */
@@ -1407,16 +1415,6 @@ struct nvme_core_quirk_entry {
 };
 
 static const struct nvme_core_quirk_entry core_quirks[] = {
-	/*
-	 * Seen on a Samsung "SM951 NVMe SAMSUNG 256GB": using APST causes
-	 * the controller to go out to lunch.  It dies when the watchdog
-	 * timer reads CSTS and gets 0xffffffff.
-	 */
-	{
-		.vid = 0x144d,
-		.fr = "BXW75D0Q",
-		.quirks = NVME_QUIRK_NO_APST,
-	},
 };
 
 /* match is null-terminated but idstr is space-padded. */
@@ -1501,6 +1499,11 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		}
 	}
 
+	if (force_apst && (ctrl->quirks & NVME_QUIRK_NO_DEEPEST_PS)) {
+		dev_warn(ctrl->dev, "forcibly allowing all power states due to nvme_core.force_apst -- use at your own risk\n");
+		ctrl->quirks &= ~NVME_QUIRK_NO_DEEPEST_PS;
+	}
+
 	ctrl->oacs = le16_to_cpu(id->oacs);
 	ctrl->vid = le16_to_cpu(id->vid);
 	ctrl->oncs = le16_to_cpup(&id->oncs);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 2aa20e3e5675..ab2d6ec7eb5c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -83,6 +83,11 @@ enum nvme_quirks {
 	 * APST should not be used.
 	 */
 	NVME_QUIRK_NO_APST			= (1 << 4),
+
+	/*
+	 * The deepest sleep state should not be used.
+	 */
+	NVME_QUIRK_NO_DEEPEST_PS		= (1 << 5),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 26a5fd05fe88..5d309535abbd 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -19,6 +19,7 @@
 #include <linux/blk-mq-pci.h>
 #include <linux/cpu.h>
 #include <linux/delay.h>
+#include <linux/dmi.h>
 #include <linux/errno.h>
 #include <linux/fs.h>
 #include <linux/genhd.h>
@@ -1943,10 +1944,31 @@ static int nvme_dev_map(struct nvme_dev *dev)
 	return -ENODEV;
 }
 
+static unsigned long check_dell_samsung_bug(struct pci_dev *pdev)
+{
+	if (pdev->vendor == 0x144d && pdev->device == 0xa802) {
+		/*
+		 * Several Samsung devices seem to drop off the PCIe bus
+		 * randomly when APST is on and uses the deepest sleep state.
+		 * This has been observed on a Samsung "SM951 NVMe SAMSUNG
+		 * 256GB", a "PM951 NVMe SAMSUNG 512GB", and a "Samsung SSD
+		 * 950 PRO 256GB", but it seems to be restricted to two Dell
+		 * laptops.
+		 */
+		if (dmi_match(DMI_SYS_VENDOR, "Dell Inc.") &&
+		    (dmi_match(DMI_PRODUCT_NAME, "XPS 15 9550") ||
+		     dmi_match(DMI_PRODUCT_NAME, "Precision 5510")))
+			return NVME_QUIRK_NO_DEEPEST_PS;
+	}
+
+	return 0;
+}
+
 static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 {
 	int node, result = -ENOMEM;
 	struct nvme_dev *dev;
+	unsigned long quirks = id->driver_data;
 
 	node = dev_to_node(&pdev->dev);
 	if (node == NUMA_NO_NODE)
@@ -1978,8 +2000,10 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (result)
 		goto put_pci;
 
+	quirks |= check_dell_samsung_bug(pdev);
+
 	result = nvme_init_ctrl(&dev->ctrl, &pdev->dev, &nvme_pci_ctrl_ops,
-			id->driver_data);
+			quirks);
 	if (result)
 		goto release_pools;
 
-- 
2.9.3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 5/5] nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA"
  2017-04-20  3:02 [PATCH 0/5] nvme APST fixes/improvements for 4.11 Andy Lutomirski
                   ` (3 preceding siblings ...)
  2017-04-20  3:02 ` [PATCH 4/5] nvme: Adjust the Samsung APST quirk Andy Lutomirski
@ 2017-04-20  3:02 ` Andy Lutomirski
  2017-04-20  3:10 ` [PATCH 0/5] nvme APST fixes/improvements for 4.11 Jens Axboe
  5 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch, Andy Lutomirski

There's a report that it malfunctions with APST on.

See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184

Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bec8c6973ae5..5249027a76ca 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1415,6 +1415,15 @@ struct nvme_core_quirk_entry {
 };
 
 static const struct nvme_core_quirk_entry core_quirks[] = {
+	{
+		/*
+		 * This Toshiba device seems to die using any APST states.  See:
+		 * https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184/comments/11
+		 */
+		.vid = 0x1179,
+		.mn = "THNSF5256GPUK TOSHIBA",
+		.quirks = NVME_QUIRK_NO_APST,
+	}
 };
 
 /* match is null-terminated but idstr is space-padded. */
-- 
2.9.3

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/5] nvme: Adjust the Samsung APST quirk
  2017-04-20  3:02 ` [PATCH 4/5] nvme: Adjust the Samsung APST quirk Andy Lutomirski
@ 2017-04-20  3:07   ` Jens Axboe
  2017-04-20  3:51     ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2017-04-20  3:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch, Niranjan Sivakumar

On Wed, Apr 19 2017, Andy Lutomirski wrote:
> I got a couple more reports: the Samsung APST issues appears to
> affect multiple 950-series devices in Dell XPS 15 9550 and Precision
> 5510 laptops.  Change the quirk: rather than blacklisting the
> firmware on the first problematic SSD that was reported, disable
> APST on all 144d:a802 devices if they're installed in the two
> affected Dell models.  While we're at it, disable only the deepest
> sleep state instead of all of them -- the reporters say that this is
> sufficient to fix the problem.
> 
> (I have a device that appears to be entirely identical to one of the
> affected devices, but I have a different Dell laptop, so it's not
> the case that all Samsung devices with firmware BXW75D0Q are broken
> under all circumstances.)
> 
> Samsung engineers have an affected system, and hopefully they'll
> give us a better workaround some time soon.  In the mean time, this
> should minimize regressions.

Do we know for a fact that it only happens on those systems, and isn't
purely specific to the device?

At this point in time, I'd be much more comfortable completely disabling
APST on Samsung, period.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/5] nvme APST fixes/improvements for 4.11
  2017-04-20  3:02 [PATCH 0/5] nvme APST fixes/improvements for 4.11 Andy Lutomirski
                   ` (4 preceding siblings ...)
  2017-04-20  3:02 ` [PATCH 5/5] nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA" Andy Lutomirski
@ 2017-04-20  3:10 ` Jens Axboe
  2017-04-20  3:55   ` Andy Lutomirski
  5 siblings, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2017-04-20  3:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Kai-Heng Feng, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Keith Busch

On Wed, Apr 19 2017, Andy Lutomirski wrote:
> Sorry for waiting so long for this.  I was waiting for feedback from
> Samsung, but they haven't root-caused the issue yet, and I should
> have just done this from the beginning.
> 
> This series makes APST more debuggable and updates the quirk list.
> The quirks I'm aware of are:
> 
>  - Samsung 950 series SSDs in Dell XPS 15 9550 and Precision 5510
>    laptops (which are essentially the same laptop) can lose their
>    PCIe link if they're allowed to use the deepest APST state.
>    Samsung engineers have an affected system and are working on
>    it.  The same exact SSDs in other machines (even an XPS 13)
>    seem to work fine.
> 
>  - One Toshiba device malfunctions if APST is used at all.

You need to split this series in two, patches 1-3 can wait. For 4.11,
all we need to do is turn off APST on any device that potentially has
this problem.

> One thing that improves my confidence that there aren't too many
> more problems with APST is that Ubuntu has backported APST to Zesty,
> so it's already gotten a bit of testing in a widely used (if very
> new) release.

Honestly, I think the best path for 4.11 is to turn off APST by default,
make it opt-in instead. I don't share your optimism here, as I made
clear back from before we even merged this feature.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 4/5] nvme: Adjust the Samsung APST quirk
  2017-04-20  3:07   ` Jens Axboe
@ 2017-04-20  3:51     ` Andy Lutomirski
       [not found]       ` <CGME20170420043337uscas1p1614437f255a643cdb08e44c4fd43424f@uscas1p1.samsung.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andy Lutomirski, linux-kernel, Kai-Heng Feng, linux-nvme,
	Christoph Hellwig, Sagi Grimberg, Keith Busch,
	Niranjan Sivakumar

On Wed, Apr 19, 2017 at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On Wed, Apr 19 2017, Andy Lutomirski wrote:
>> I got a couple more reports: the Samsung APST issues appears to
>> affect multiple 950-series devices in Dell XPS 15 9550 and Precision
>> 5510 laptops.  Change the quirk: rather than blacklisting the
>> firmware on the first problematic SSD that was reported, disable
>> APST on all 144d:a802 devices if they're installed in the two
>> affected Dell models.  While we're at it, disable only the deepest
>> sleep state instead of all of them -- the reporters say that this is
>> sufficient to fix the problem.
>>
>> (I have a device that appears to be entirely identical to one of the
>> affected devices, but I have a different Dell laptop, so it's not
>> the case that all Samsung devices with firmware BXW75D0Q are broken
>> under all circumstances.)
>>
>> Samsung engineers have an affected system, and hopefully they'll
>> give us a better workaround some time soon.  In the mean time, this
>> should minimize regressions.
>
> Do we know for a fact that it only happens on those systems, and isn't
> purely specific to the device?

I have decent evidence.  All of the reports are from XPS 15 9550 or
Precision 5510, and Dell confirmed that they're basically the same
machine and run literally the same BIOS.  One of these reports is from
a device with exactly the same model and firmware as my SSD, and mine
is fine.  (I have a different laptop.)

>
> At this point in time, I'd be much more comfortable completely disabling
> APST on Samsung, period.
>

I'd be fine with doing that for 4.11 and then doing this for 4.12-rc1.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/5] nvme APST fixes/improvements for 4.11
  2017-04-20  3:10 ` [PATCH 0/5] nvme APST fixes/improvements for 4.11 Jens Axboe
@ 2017-04-20  3:55   ` Andy Lutomirski
  2017-04-20  4:52     ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  3:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andy Lutomirski, linux-kernel, Kai-Heng Feng, linux-nvme,
	Christoph Hellwig, Sagi Grimberg, Keith Busch

On Wed, Apr 19, 2017 at 8:10 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On Wed, Apr 19 2017, Andy Lutomirski wrote:
>> Sorry for waiting so long for this.  I was waiting for feedback from
>> Samsung, but they haven't root-caused the issue yet, and I should
>> have just done this from the beginning.
>>
>> This series makes APST more debuggable and updates the quirk list.
>> The quirks I'm aware of are:
>>
>>  - Samsung 950 series SSDs in Dell XPS 15 9550 and Precision 5510
>>    laptops (which are essentially the same laptop) can lose their
>>    PCIe link if they're allowed to use the deepest APST state.
>>    Samsung engineers have an affected system and are working on
>>    it.  The same exact SSDs in other machines (even an XPS 13)
>>    seem to work fine.
>>
>>  - One Toshiba device malfunctions if APST is used at all.
>
> You need to split this series in two, patches 1-3 can wait. For 4.11,
> all we need to do is turn off APST on any device that potentially has
> this problem.
>
>> One thing that improves my confidence that there aren't too many
>> more problems with APST is that Ubuntu has backported APST to Zesty,
>> so it's already gotten a bit of testing in a widely used (if very
>> new) release.
>
> Honestly, I think the best path for 4.11 is to turn off APST by default,
> make it opt-in instead. I don't share your optimism here, as I made
> clear back from before we even merged this feature.
>
>

I can make it so that force_apst=0 means no APST and force_apst=1 mean
yes APST and we could try again with a quirk list for 4.12.  There's a
decent chance that a few more weeks with Ubuntu having APST on will
shake out all the problems fairly quickly.

--Andy

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH 4/5] nvme: Adjust the Samsung APST quirk
       [not found]       ` <CGME20170420043337uscas1p1614437f255a643cdb08e44c4fd43424f@uscas1p1.samsung.com>
@ 2017-04-20  4:33         ` Judy Brock
  0 siblings, 0 replies; 14+ messages in thread
From: Judy Brock @ 2017-04-20  4:33 UTC (permalink / raw)
  To: Andy Lutomirski, Jens Axboe
  Cc: Sagi Grimberg, linux-kernel, linux-nvme, Keith Busch,
	Kai-Heng Feng, Christoph Hellwig, Niranjan Sivakumar

[Jens] Do we know for a fact that it only happens on those systems, and isn't 
> purely specific to the device?

[Andy] I have decent evidence.  All of the reports are from XPS 15 9550 or Precision 5510, and Dell confirmed that they're basically the same machine and run literally the same BIOS.

The answer as per the above as far as we know is "yes".
>
> At this point in time, I'd be much more comfortable completely 
> disabling APST on Samsung, period.
>

1) Why? The answer to the question above was "Yes". This has been reported exclusively on the two Dell models with that exact same BIOS. Additionally, there are reports of the device acting fine on other systems so it is not purely specific to the device.

We request that the quirk should be only on the affected Dell machines - there is no reason to completely disable APST on Samsung.

2) Samsung shared in the more private thread that we are seeing excessive recovery attempts on the PCIe bus - no PCIe TLPs seen, just ordered sets.  This looks like a signal integrity problem to us on the Dell side. We shared excerpt of PCIe trace on the offline thread.

3) We also shared a more extensive report with Dell today. We've asked them to look into it. 

4) There was at least one report of same symptom on a Toshiba device and Lenovo system that seemed to also disappear by avoiding PS4. So it seems it would be best to continue to try to get to the bottom of the problem (root cause) and quirk judiciously in the meantime.

Thanks,
Judy



-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-bounces@lists.infradead.org] On Behalf Of Andy Lutomirski
Sent: Wednesday, April 19, 2017 8:51 PM
To: Jens Axboe
Cc: Sagi Grimberg; linux-kernel@vger.kernel.org; linux-nvme; Keith Busch; Kai-Heng Feng; Andy Lutomirski; Christoph Hellwig; Niranjan Sivakumar
Subject: Re: [PATCH 4/5] nvme: Adjust the Samsung APST quirk

On Wed, Apr 19, 2017 at 8:07 PM, Jens Axboe <axboe@kernel.dk> wrote:
> On Wed, Apr 19 2017, Andy Lutomirski wrote:
>> I got a couple more reports: the Samsung APST issues appears to 
>> affect multiple 950-series devices in Dell XPS 15 9550 and Precision
>> 5510 laptops.  Change the quirk: rather than blacklisting the 
>> firmware on the first problematic SSD that was reported, disable APST 
>> on all 144d:a802 devices if they're installed in the two affected 
>> Dell models.  While we're at it, disable only the deepest sleep state 
>> instead of all of them -- the reporters say that this is sufficient 
>> to fix the problem.
>>
>> (I have a device that appears to be entirely identical to one of the 
>> affected devices, but I have a different Dell laptop, so it's not the 
>> case that all Samsung devices with firmware BXW75D0Q are broken under 
>> all circumstances.)
>>
>> Samsung engineers have an affected system, and hopefully they'll give 
>> us a better workaround some time soon.  In the mean time, this should 
>> minimize regressions.
>
> Do we know for a fact that it only happens on those systems, and isn't 
> purely specific to the device?

I have decent evidence.  All of the reports are from XPS 15 9550 or Precision 5510, and Dell confirmed that they're basically the same machine and run literally the same BIOS.  One of these reports is from a device with exactly the same model and firmware as my SSD, and mine is fine.  (I have a different laptop.)

>
> At this point in time, I'd be much more comfortable completely 
> disabling APST on Samsung, period.
>

I'd be fine with doing that for 4.11 and then doing this for 4.12-rc1.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/5] nvme APST fixes/improvements for 4.11
  2017-04-20  3:55   ` Andy Lutomirski
@ 2017-04-20  4:52     ` Andy Lutomirski
  2017-04-20  5:19       ` Christoph Hellwig
  0 siblings, 1 reply; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20  4:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Jens Axboe, linux-kernel, Kai-Heng Feng, linux-nvme,
	Christoph Hellwig, Sagi Grimberg, Keith Busch

On Wed, Apr 19, 2017 at 8:55 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Apr 19, 2017 at 8:10 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> On Wed, Apr 19 2017, Andy Lutomirski wrote:
>>> Sorry for waiting so long for this.  I was waiting for feedback from
>>> Samsung, but they haven't root-caused the issue yet, and I should
>>> have just done this from the beginning.
>>>
>>> This series makes APST more debuggable and updates the quirk list.
>>> The quirks I'm aware of are:
>>>
>>>  - Samsung 950 series SSDs in Dell XPS 15 9550 and Precision 5510
>>>    laptops (which are essentially the same laptop) can lose their
>>>    PCIe link if they're allowed to use the deepest APST state.
>>>    Samsung engineers have an affected system and are working on
>>>    it.  The same exact SSDs in other machines (even an XPS 13)
>>>    seem to work fine.
>>>
>>>  - One Toshiba device malfunctions if APST is used at all.
>>
>> You need to split this series in two, patches 1-3 can wait. For 4.11,
>> all we need to do is turn off APST on any device that potentially has
>> this problem.
>>
>>> One thing that improves my confidence that there aren't too many
>>> more problems with APST is that Ubuntu has backported APST to Zesty,
>>> so it's already gotten a bit of testing in a widely used (if very
>>> new) release.
>>
>> Honestly, I think the best path for 4.11 is to turn off APST by default,
>> make it opt-in instead. I don't share your optimism here, as I made
>> clear back from before we even merged this feature.
>>
>>
>
> I can make it so that force_apst=0 means no APST and force_apst=1 mean
> yes APST and we could try again with a quirk list for 4.12.  There's a
> decent chance that a few more weeks with Ubuntu having APST on will
> shake out all the problems fairly quickly.

Here's a more concrete and more sensible proposal:

For 4.11:

force_apst=0: Default.  APST off on all Samsung 950-like devices
regardless of what laptop and on the Toshiba device.
force_apst=1: Use APST except where known bad.  APST deepest state
disabled on Samsung 950-like devices on XPS 15 and Precision 5510.
APST off on the Toshiba device.
force_apst=2: APST fully on regardless of any quirks.

For 4.12-rc1: force_apst=0 works like force_apst=1, but we keep both
values for compatibility and in case we need to add another overly
broad quirk some day.

Would something like this make sense?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/5] nvme APST fixes/improvements for 4.11
  2017-04-20  4:52     ` Andy Lutomirski
@ 2017-04-20  5:19       ` Christoph Hellwig
  2017-04-20 16:01         ` Andy Lutomirski
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2017-04-20  5:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Jens Axboe, linux-kernel, Kai-Heng Feng,
	linux-nvme, Christoph Hellwig, Sagi Grimberg, Keith Busch

On Wed, Apr 19, 2017 at 09:52:17PM -0700, Andy Lutomirski wrote:
> > I can make it so that force_apst=0 means no APST and force_apst=1 mean
> > yes APST and we could try again with a quirk list for 4.12.  There's a
> > decent chance that a few more weeks with Ubuntu having APST on will
> > shake out all the problems fairly quickly.
> 
> Here's a more concrete and more sensible proposal:

Can we just have force_apst=on to force it on, force_apst=off to turn
it off, and leave it with that?  And yes, I mean the strings instead
of the weird numbers.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 0/5] nvme APST fixes/improvements for 4.11
  2017-04-20  5:19       ` Christoph Hellwig
@ 2017-04-20 16:01         ` Andy Lutomirski
  0 siblings, 0 replies; 14+ messages in thread
From: Andy Lutomirski @ 2017-04-20 16:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andy Lutomirski, Jens Axboe, linux-kernel, Kai-Heng Feng,
	linux-nvme, Sagi Grimberg, Keith Busch

On Wed, Apr 19, 2017 at 10:19 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Wed, Apr 19, 2017 at 09:52:17PM -0700, Andy Lutomirski wrote:
>> > I can make it so that force_apst=0 means no APST and force_apst=1 mean
>> > yes APST and we could try again with a quirk list for 4.12.  There's a
>> > decent chance that a few more weeks with Ubuntu having APST on will
>> > shake out all the problems fairly quickly.
>>
>> Here's a more concrete and more sensible proposal:
>
> Can we just have force_apst=on to force it on, force_apst=off to turn
> it off, and leave it with that?  And yes, I mean the strings instead
> of the weird numbers.

That's essentially what the patch I emailed here does -- it's a
regular bool param, so force_apst=1, force_apst=Y, and force_apst=y
all mean the same thing.  I can just leave it like that.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-04-20 16:02 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-20  3:02 [PATCH 0/5] nvme APST fixes/improvements for 4.11 Andy Lutomirski
2017-04-20  3:02 ` [PATCH 1/5] nvme: Fix APST comment Andy Lutomirski
2017-04-20  3:02 ` [PATCH 2/5] nvme: Display raw APST configuration via DYNAMIC_DEBUG Andy Lutomirski
2017-04-20  3:02 ` [PATCH 3/5] nvme: Add nvme_core.force_apst to ignore the NO_APST quirk Andy Lutomirski
2017-04-20  3:02 ` [PATCH 4/5] nvme: Adjust the Samsung APST quirk Andy Lutomirski
2017-04-20  3:07   ` Jens Axboe
2017-04-20  3:51     ` Andy Lutomirski
     [not found]       ` <CGME20170420043337uscas1p1614437f255a643cdb08e44c4fd43424f@uscas1p1.samsung.com>
2017-04-20  4:33         ` Judy Brock
2017-04-20  3:02 ` [PATCH 5/5] nvme: Quirk APST off on "THNSF5256GPUK TOSHIBA" Andy Lutomirski
2017-04-20  3:10 ` [PATCH 0/5] nvme APST fixes/improvements for 4.11 Jens Axboe
2017-04-20  3:55   ` Andy Lutomirski
2017-04-20  4:52     ` Andy Lutomirski
2017-04-20  5:19       ` Christoph Hellwig
2017-04-20 16:01         ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).