linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/3] nvme power saving
@ 2016-09-16 18:16 Andy Lutomirski
  2016-09-16 18:16 ` [PATCH v4 1/3] nvme/scsi: Remove power management support Andy Lutomirski
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-09-16 18:16 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, J Freyensee,
	Andy Lutomirski

Hi all-

Here's v4 of the APST patch set.  The biggest bikesheddable thing (I
think) is the scaling factor.  I currently have it hardcoded so that
we wait 50x the total latency before entering a power saving state.
On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
latency, 22ms exit latency) after 1200ms.  I have the default max
latency set to 25ms.
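
(Working through the numbers: state 3's entry plus exit latency is
0.5 + 5 = 5.5ms, and 50 * 5.5ms = 275ms; state 4's is 2 + 22 = 24ms,
and 50 * 24ms = 1200ms.  Both states qualify under the 25ms default
because the cap is compared against the total entry-plus-exit latency,
i.e. 5.5ms and 24ms respectively.)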

FWIW, in practice, the latency this introduces seems to be well
under 22ms, but my benchmark is a bit silly and I might have
measured it wrong.  I certainly haven't observed a slowdown just
using my laptop.

This time around, I changed the names of parameters after Jay
Freyensee got confused by the first try.  Now they are:

 - ps_max_latency_us in sysfs: actually controls it.
 - nvme_core.default_ps_max_latency_us: sets the default.

Yeah, they're mouthfuls, but they should be clearer now.
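
(Concretely, with illustrative paths: something like
'echo 30000 > /sys/class/nvme/nvme0/ps_max_latency_us' adjusts one
controller at runtime, and booting with
'nvme_core.default_ps_max_latency_us=30000' changes the value that
newly probed controllers start with.)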

Changes from v3:
 - Remove const from nvme_set_features()'s parameter.  My inner C++
   programmer cringes a little...

Changes from v2:
 - Rename the parameters.

Changes from v1:
 - Get rid of feature buffer alignment warnings.
 - Change the error message if NPSS is bogus.
 - Rename apst_max_latency_ns to apst_max_latency_us because module params
   don't like u64 or unsigned long long and I wanted to make it fit more
   comfortably in a ulong module param.  (And the nanoseconds were useless.)
 - Add a module parameter for the default max latency.

Andy Lutomirski (3):
  nvme/scsi: Remove power management support
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme: Enable autonomous power state transitions

 drivers/nvme/host/core.c | 180 ++++++++++++++++++++++++++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  10 ++-
 drivers/nvme/host/scsi.c |  80 ++-------------------
 include/linux/nvme.h     |   6 ++
 4 files changed, 192 insertions(+), 84 deletions(-)

-- 
2.7.4


* [PATCH v4 1/3] nvme/scsi: Remove power management support
  2016-09-16 18:16 [PATCH v4 0/3] nvme power saving Andy Lutomirski
@ 2016-09-16 18:16 ` Andy Lutomirski
  2016-09-16 23:37   ` J Freyensee
  2016-09-16 18:16 ` [PATCH v4 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features() Andy Lutomirski
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-09-16 18:16 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, J Freyensee,
	Andy Lutomirski

As far as I can tell, there is basically nothing correct about this
code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
power states, which is nonsense, because they're all just indices
into a table that software needs to parse.  It completely ignores
the distinction between operational and non-operational states.
And, until 4.8, if all of the above magically succeeded, it would
dereference a NULL pointer and OOPS.

Since this code appears to be useless, just delete it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/scsi.c | 74 ++----------------------------------------------
 1 file changed, 3 insertions(+), 71 deletions(-)

diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index e947e298a737..44009105f8c8 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -72,15 +72,6 @@ static int sg_version_num = 30534;	/* 2 digits for each component */
 #define ALL_LUNS_RETURNED				0x02
 #define ALL_WELL_KNOWN_LUNS_RETURNED			0x01
 #define RESTRICTED_LUNS_RETURNED			0x00
-#define NVME_POWER_STATE_START_VALID			0x00
-#define NVME_POWER_STATE_ACTIVE				0x01
-#define NVME_POWER_STATE_IDLE				0x02
-#define NVME_POWER_STATE_STANDBY			0x03
-#define NVME_POWER_STATE_LU_CONTROL			0x07
-#define POWER_STATE_0					0
-#define POWER_STATE_1					1
-#define POWER_STATE_2					2
-#define POWER_STATE_3					3
 #define DOWNLOAD_SAVE_ACTIVATE				0x05
 #define DOWNLOAD_SAVE_DEFER_ACTIVATE			0x0E
 #define ACTIVATE_DEFERRED_MICROCODE			0x0F
@@ -1229,64 +1220,6 @@ static void nvme_trans_fill_read_cap(u8 *response, struct nvme_id_ns *id_ns,
 
 /* Start Stop Unit Helper Functions */
 
-static int nvme_trans_power_state(struct nvme_ns *ns, struct sg_io_hdr *hdr,
-						u8 pc, u8 pcmod, u8 start)
-{
-	int res;
-	int nvme_sc;
-	struct nvme_id_ctrl *id_ctrl;
-	int lowest_pow_st;	/* max npss = lowest power consumption */
-	unsigned ps_desired = 0;
-
-	nvme_sc = nvme_identify_ctrl(ns->ctrl, &id_ctrl);
-	res = nvme_trans_status_code(hdr, nvme_sc);
-	if (res)
-		return res;
-
-	lowest_pow_st = max(POWER_STATE_0, (int)(id_ctrl->npss - 1));
-	kfree(id_ctrl);
-
-	switch (pc) {
-	case NVME_POWER_STATE_START_VALID:
-		/* Action unspecified if POWER CONDITION MODIFIER != 0 */
-		if (pcmod == 0 && start == 0x1)
-			ps_desired = POWER_STATE_0;
-		if (pcmod == 0 && start == 0x0)
-			ps_desired = lowest_pow_st;
-		break;
-	case NVME_POWER_STATE_ACTIVE:
-		/* Action unspecified if POWER CONDITION MODIFIER != 0 */
-		if (pcmod == 0)
-			ps_desired = POWER_STATE_0;
-		break;
-	case NVME_POWER_STATE_IDLE:
-		/* Action unspecified if POWER CONDITION MODIFIER != [0,1,2] */
-		if (pcmod == 0x0)
-			ps_desired = POWER_STATE_1;
-		else if (pcmod == 0x1)
-			ps_desired = POWER_STATE_2;
-		else if (pcmod == 0x2)
-			ps_desired = POWER_STATE_3;
-		break;
-	case NVME_POWER_STATE_STANDBY:
-		/* Action unspecified if POWER CONDITION MODIFIER != [0,1] */
-		if (pcmod == 0x0)
-			ps_desired = max(POWER_STATE_0, (lowest_pow_st - 2));
-		else if (pcmod == 0x1)
-			ps_desired = max(POWER_STATE_0, (lowest_pow_st - 1));
-		break;
-	case NVME_POWER_STATE_LU_CONTROL:
-	default:
-		res = nvme_trans_completion(hdr, SAM_STAT_CHECK_CONDITION,
-				ILLEGAL_REQUEST, SCSI_ASC_INVALID_CDB,
-				SCSI_ASCQ_CAUSE_NOT_REPORTABLE);
-		break;
-	}
-	nvme_sc = nvme_set_features(ns->ctrl, NVME_FEAT_POWER_MGMT, ps_desired, 0,
-				    NULL);
-	return nvme_trans_status_code(hdr, nvme_sc);
-}
-
 static int nvme_trans_send_activate_fw_cmd(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 					u8 buffer_id)
 {
@@ -2235,11 +2168,10 @@ static int nvme_trans_synchronize_cache(struct nvme_ns *ns,
 static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 							u8 *cmd)
 {
-	u8 immed, pcmod, pc, no_flush, start;
+	u8 immed, pcmod, no_flush, start;
 
 	immed = cmd[1] & 0x01;
 	pcmod = cmd[3] & 0x0f;
-	pc = (cmd[4] & 0xf0) >> 4;
 	no_flush = cmd[4] & 0x04;
 	start = cmd[4] & 0x01;
 
@@ -2254,8 +2186,8 @@ static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 			if (res)
 				return res;
 		}
-		/* Setup the expected power state transition */
-		return nvme_trans_power_state(ns, hdr, pc, pcmod, start);
+
+		return 0;
 	}
 }
 
-- 
2.7.4


* [PATCH v4 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  2016-09-16 18:16 [PATCH v4 0/3] nvme power saving Andy Lutomirski
  2016-09-16 18:16 ` [PATCH v4 1/3] nvme/scsi: Remove power management support Andy Lutomirski
@ 2016-09-16 18:16 ` Andy Lutomirski
  2016-09-16 18:16 ` [PATCH v4 3/3] nvme: Enable autonomous power state transitions Andy Lutomirski
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-09-16 18:16 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, J Freyensee,
	Andy Lutomirski

Any user I can imagine that needs a buffer at all will want to pass
a pointer directly.  There are currently no callers that use
buffers, so this change is painless, and it will make it much easier
to start using features that use buffers (e.g. APST).
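
As a hedged sketch of the kind of buffer-passing caller this enables
(struct nvme_feat_auto_pst is only introduced later in this series, and
error handling is trimmed):

	struct nvme_feat_auto_pst *table;
	int ret;

	/* Plain kernel buffer; __nvme_submit_sync_cmd() takes care of DMA. */
	table = kzalloc(sizeof(*table), GFP_KERNEL);
	if (!table)
		return -ENOMEM;

	ret = nvme_set_features(ctrl, NVME_FEAT_AUTO_PST, 1 /* APSTE */,
				table, sizeof(*table), NULL);
	kfree(table);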

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 14 ++++++--------
 drivers/nvme/host/nvme.h |  4 ++--
 drivers/nvme/host/scsi.c |  6 +++---
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2feacc70bf61..b92a7f767bfc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -597,7 +597,7 @@ int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 }
 
 int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
-					dma_addr_t dma_addr, u32 *result)
+		      void *buffer, size_t buflen, u32 *result)
 {
 	struct nvme_command c;
 	struct nvme_completion cqe;
@@ -606,10 +606,9 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 	memset(&c, 0, sizeof(c));
 	c.features.opcode = nvme_admin_get_features;
 	c.features.nsid = cpu_to_le32(nsid);
-	c.features.dptr.prp1 = cpu_to_le64(dma_addr);
 	c.features.fid = cpu_to_le32(fid);
 
-	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe, NULL, 0, 0,
+	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe, buffer, buflen, 0,
 			NVME_QID_ANY, 0, 0);
 	if (ret >= 0 && result)
 		*result = le32_to_cpu(cqe.result);
@@ -617,7 +616,7 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 }
 
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
-					dma_addr_t dma_addr, u32 *result)
+		      void *buffer, size_t buflen, u32 *result)
 {
 	struct nvme_command c;
 	struct nvme_completion cqe;
@@ -625,12 +624,11 @@ int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
 
 	memset(&c, 0, sizeof(c));
 	c.features.opcode = nvme_admin_set_features;
-	c.features.dptr.prp1 = cpu_to_le64(dma_addr);
 	c.features.fid = cpu_to_le32(fid);
 	c.features.dword11 = cpu_to_le32(dword11);
 
-	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe, NULL, 0, 0,
-			NVME_QID_ANY, 0, 0);
+	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe,
+			buffer, buflen, 0, NVME_QID_ANY, 0, 0);
 	if (ret >= 0 && result)
 		*result = le32_to_cpu(cqe.result);
 	return ret;
@@ -664,7 +662,7 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 	u32 result;
 	int status, nr_io_queues;
 
-	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
+	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
 			&result);
 	if (status < 0)
 		return status;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ab18b78102bf..c2151761ec5f 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -292,9 +292,9 @@ int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 		struct nvme_id_ns **id);
 int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log);
 int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
-			dma_addr_t dma_addr, u32 *result);
+		      void *buffer, size_t buflen, u32 *result);
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
-			dma_addr_t dma_addr, u32 *result);
+		      void *buffer, size_t buflen, u32 *result);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
 void nvme_start_keep_alive(struct nvme_ctrl *ctrl);
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index 44009105f8c8..c2a0a1c7d05d 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -906,7 +906,7 @@ static int nvme_trans_log_temperature(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	kfree(smart_log);
 
 	/* Get Features for Temp Threshold */
-	res = nvme_get_features(ns->ctrl, NVME_FEAT_TEMP_THRESH, 0, 0,
+	res = nvme_get_features(ns->ctrl, NVME_FEAT_TEMP_THRESH, 0, NULL, 0,
 								&feature_resp);
 	if (res != NVME_SC_SUCCESS)
 		temp_c_thresh = LOG_TEMP_UNKNOWN;
@@ -1039,7 +1039,7 @@ static int nvme_trans_fill_caching_page(struct nvme_ns *ns,
 	if (len < MODE_PAGE_CACHING_LEN)
 		return -EINVAL;
 
-	nvme_sc = nvme_get_features(ns->ctrl, NVME_FEAT_VOLATILE_WC, 0, 0,
+	nvme_sc = nvme_get_features(ns->ctrl, NVME_FEAT_VOLATILE_WC, 0, NULL, 0,
 								&feature_resp);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
@@ -1328,7 +1328,7 @@ static int nvme_trans_modesel_get_mp(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	case MODE_PAGE_CACHING:
 		dword11 = ((mode_page[2] & CACHING_MODE_PAGE_WCE_MASK) ? 1 : 0);
 		nvme_sc = nvme_set_features(ns->ctrl, NVME_FEAT_VOLATILE_WC,
-					    dword11, 0, NULL);
+					    dword11, NULL, 0, NULL);
 		res = nvme_trans_status_code(hdr, nvme_sc);
 		break;
 	case MODE_PAGE_CONTROL:
-- 
2.7.4


* [PATCH v4 3/3] nvme: Enable autonomous power state transitions
  2016-09-16 18:16 [PATCH v4 0/3] nvme power saving Andy Lutomirski
  2016-09-16 18:16 ` [PATCH v4 1/3] nvme/scsi: Remove power management support Andy Lutomirski
  2016-09-16 18:16 ` [PATCH v4 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features() Andy Lutomirski
@ 2016-09-16 18:16 ` Andy Lutomirski
  2016-09-17  0:49 ` [PATCH v4 0/3] nvme power saving J Freyensee
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Andy Lutomirski @ 2016-09-16 18:16 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, J Freyensee,
	Andy Lutomirski

NVMe devices can advertise multiple power states.  These states can
be either "operational" (the device is fully functional but possibly
slow) or "non-operational" (the device is asleep until woken up).
Some devices can automatically enter a non-operational state when
idle for a specified amount of time and then automatically wake back
up when needed.

The hardware configuration is a table.  For each state, an entry in
the table indicates the next deeper non-operational state, if any,
to autonomously transition to and the idle time required before
transitioning.

This patch teaches the driver to program APST so that each
successive non-operational state will be entered after an idle time
equal to 50 times the total latency (entry plus exit) associated with
that state.  A sysfs attribute 'ps_max_latency_us' gives the maximum
acceptable latency in ns; non-operational states with total latency
greater than this value will not be used.  As a special case,
ps_max_latency_us=0 will disable APST entirely.  On hardware without
APST support, ps_max_latency_us will not be exposed in sysfs.
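
As a worked example of what gets programmed (using the entry layout
from the code below: target power state in bits 3:7, idle time in
milliseconds in bits 8:31), take the cover letter's state 4 with 2ms
entry and 22ms exit latency: total latency is 24ms, the idle time
becomes 50 * 24ms = 1200ms, and the resulting table entry is
(4 << 3) | (1200 << 8).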

The ps_max_latency_us parameter for newly-probed devices is set by
the module parameter nvme_core.default_ps_max_latency_us.

In theory, the device can expose a "default" APST table, but this
doesn't seem to function correctly on my device (Samsung 950), nor
does it seem particularly useful.  There is also an optional
mechanism by which a configuration can be "saved" so it will be
automatically loaded on reset.  This can be configured from
userspace, but it doesn't seem useful to support in the driver.

On my laptop, enabling APST seems to save nearly 1W.

The hardware tables can be decoded in userspace with nvme-cli.
'nvme id-ctrl /dev/nvmeN' will show the power state table and
'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
configuration.

I called the parameters ps_max_latency_us instead of
apst_max_latency_us because we might support other power saving
modes (e.g. non-autonomous power state transitions or even runtime
D3) and the same parameter could control the maximum allowable
latency for these states as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   6 ++
 include/linux/nvme.h     |   6 ++
 3 files changed, 178 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index b92a7f767bfc..a143c72c3d0d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -56,6 +56,11 @@ EXPORT_SYMBOL_GPL(nvme_max_retries);
 static int nvme_char_major;
 module_param(nvme_char_major, int, 0);
 
+static unsigned long default_ps_max_latency_us = 25000;
+module_param(default_ps_max_latency_us, ulong, 0644);
+MODULE_PARM_DESC(default_ps_max_latency_us,
+		 "default max power saving latency; overridden per device in sysfs");
+
 static LIST_HEAD(nvme_ctrl_list);
 static DEFINE_SPINLOCK(dev_list_lock);
 
@@ -1205,6 +1210,98 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	blk_queue_write_cache(q, vwc, vwc);
 }
 
+static void nvme_configure_apst(struct nvme_ctrl *ctrl)
+{
+	/*
+	 * APST (Autonomous Power State Transition) lets us program a
+	 * table of power state transitions that the controller will
+	 * perform automatically.  We configure it with a simple
+	 * heuristic: we are willing to spend at most 2% of the time
+	 * transitioning between power states.  Therefore, when running
+	 * in any given state, we will enter the next lower-power
+	 * non-operational state after waiting 50 * (enlat + exlat)
+	 * microseconds, as long as that state's total latency is under
+	 * the requested maximum latency.
+	 *
+	 * We will not autonomously enter any non-operational state for
+	 * which the total latency exceeds ps_max_latency_us.  Users
+	 * can set ps_max_latency_us to zero to turn off APST.
+	 */
+
+	unsigned apste;
+	struct nvme_feat_auto_pst *table;
+	int ret;
+
+	if (!ctrl->apsta)
+		return;	/* APST isn't supported. */
+
+	if (ctrl->npss > 31) {
+		dev_warn(ctrl->device, "NPSS is invalid; not using APST\n");
+		return;
+	}
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return;
+
+	if (ctrl->ps_max_latency_us == 0) {
+		/* Turn off APST. */
+		apste = 0;
+	} else {
+		__le64 target = cpu_to_le64(0);
+		int state;
+
+		/*
+		 * Walk through all states from lowest- to highest-power.
+		 * According to the spec, lower-numbered states use more
+		 * power.  NPSS, despite the name, is the index of the
+		 * lowest-power state, not the number of states.
+		 */
+		for (state = (int)ctrl->npss; state >= 0; state--) {
+			u64 total_latency_us, transition_ms;
+
+			if (target)
+				table->entries[state] = target;
+
+			/*
+			 * Is this state a useful non-operational state for
+			 * higher-power states to autonomously transition to?
+			 */
+			if (!(ctrl->psd[state].flags & 2))
+				continue;  /* It's an operational state. */
+
+			total_latency_us =
+				(u64)le32_to_cpu(ctrl->psd[state].entry_lat) +
+				le32_to_cpu(ctrl->psd[state].exit_lat);
+			if (total_latency_us > ctrl->ps_max_latency_us)
+				continue;
+
+			/*
+			 * This state is good.  Use it as the APST idle
+			 * target for higher power states.
+			 */
+			transition_ms = total_latency_us + 19;
+			do_div(transition_ms, 20);
+			if (transition_ms > (1 << 24) - 1)
+				transition_ms = (1 << 24) - 1;
+
+			target = cpu_to_le64((state << 3) |
+					     (transition_ms << 8));
+		}
+
+		apste = 1;
+	}
+
+	ret = nvme_set_features(ctrl, NVME_FEAT_AUTO_PST, apste,
+				table, sizeof(*table), NULL);
+	if (ret)
+		dev_err(ctrl->device, "failed to set APST feature (%d)\n", ret);
+
+	kfree(table);
+}
+
+static struct attribute_group nvme_dev_dynamic_attrs_group;
+
 /*
  * Initialize the cached copies of the Identify data and various controller
  * register in our nvme_ctrl structure.  This should be called as soon as
@@ -1271,6 +1368,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ctrl->sgls = le32_to_cpu(id->sgls);
 	ctrl->kas = le16_to_cpu(id->kas);
 
+	ctrl->npss = id->npss;
+	ctrl->apsta = id->apsta;
+	memcpy(ctrl->psd, id->psd, sizeof(ctrl->psd));
+
 	if (ctrl->ops->is_fabrics) {
 		ctrl->icdoff = le16_to_cpu(id->icdoff);
 		ctrl->ioccsz = le32_to_cpu(id->ioccsz);
@@ -1294,6 +1395,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	}
 
 	kfree(id);
+
+	nvme_configure_apst(ctrl);
+
+	sysfs_update_group(&ctrl->device->kobj, &nvme_dev_dynamic_attrs_group);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(nvme_init_identify);
@@ -1564,6 +1669,41 @@ static ssize_t nvme_sysfs_show_address(struct device *dev,
 }
 static DEVICE_ATTR(address, S_IRUGO, nvme_sysfs_show_address, NULL);
 
+static ssize_t nvme_sysfs_show_ps_max_latency_us(
+	struct device *dev,
+	struct device_attribute *attr,
+	char *buf)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%llu\n", ctrl->ps_max_latency_us);
+}
+
+static ssize_t nvme_sysfs_store_ps_max_latency_us(
+	struct device *dev,
+	struct device_attribute *attr,
+	const char *buf, size_t size)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+	int ret;
+	u64 val;
+
+	ret = kstrtoull(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	if (ctrl->ps_max_latency_us != val) {
+		ctrl->ps_max_latency_us = val;
+		nvme_configure_apst(ctrl);
+	}
+
+	return size;
+}
+
+static DEVICE_ATTR(ps_max_latency_us, 0644,
+		   nvme_sysfs_show_ps_max_latency_us,
+		   nvme_sysfs_store_ps_max_latency_us);
+
 static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_reset_controller.attr,
 	&dev_attr_rescan_controller.attr,
@@ -1605,8 +1745,33 @@ static struct attribute_group nvme_dev_attrs_group = {
 	.is_visible	= nvme_dev_attrs_are_visible,
 };
 
+static struct attribute *nvme_dev_dynamic_attrs[] = {
+	&dev_attr_ps_max_latency_us.attr,
+	NULL
+};
+
+static umode_t nvme_dev_dynamic_attrs_are_visible(struct kobject *kobj,
+		struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	if (a == &dev_attr_ps_max_latency_us.attr) {
+		if (!ctrl->apsta)
+			return 0;
+	}
+
+	return a->mode;
+}
+
+static struct attribute_group nvme_dev_dynamic_attrs_group = {
+	.attrs		= nvme_dev_dynamic_attrs,
+	.is_visible	= nvme_dev_dynamic_attrs_are_visible,
+};
+
 static const struct attribute_group *nvme_dev_attr_groups[] = {
 	&nvme_dev_attrs_group,
+	&nvme_dev_dynamic_attrs_group,
 	NULL,
 };
 
@@ -1991,6 +2156,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	ctrl->quirks = quirks;
 	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
 	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
+	ctrl->ps_max_latency_us = default_ps_max_latency_us;
 
 	ret = nvme_set_instance(ctrl);
 	if (ret)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c2151761ec5f..16d51006afc0 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -129,13 +129,19 @@ struct nvme_ctrl {
 	u32 vs;
 	u32 sgls;
 	u16 kas;
+	u8 npss;
+	u8 apsta;
 	unsigned int kato;
 	bool subsystem;
 	unsigned long quirks;
+	struct nvme_id_power_state psd[32];
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
 
+	/* Power saving configuration */
+	u64 ps_max_latency_us;
+
 	/* Fabrics only */
 	u16 sqsize;
 	u32 ioccsz;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index d8b37bab2887..a76237dac4b0 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -543,6 +543,12 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+/* Features */
+
+struct nvme_feat_auto_pst {
+	__le64 entries[32];
+};
+
 /* Admin commands */
 
 enum nvme_admin_opcode {
-- 
2.7.4


* Re: [PATCH v4 1/3] nvme/scsi: Remove power management support
  2016-09-16 18:16 ` [PATCH v4 1/3] nvme/scsi: Remove power management support Andy Lutomirski
@ 2016-09-16 23:37   ` J Freyensee
  0 siblings, 0 replies; 19+ messages in thread
From: J Freyensee @ 2016-09-16 23:37 UTC (permalink / raw)
  To: Andy Lutomirski, Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel

On Fri, 2016-09-16 at 11:16 -0700, Andy Lutomirski wrote:
> As far as I can tell, there is basically nothing correct about this
> code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
> power states, which is nonsense, because they're all just indices
> into a table that software needs to parse.  It completely ignores
> the distinction between operational and non-operational states.
> And, until 4.8, if all of the above magically succeeded, it would
> dereference a NULL pointer and OOPS.
> 
> Since this code appears to be useless, just delete it.

Acked-by: Jay Freyensee <james_p_freyensee@linux.intel.com>

> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-16 18:16 [PATCH v4 0/3] nvme power saving Andy Lutomirski
                   ` (2 preceding siblings ...)
  2016-09-16 18:16 ` [PATCH v4 3/3] nvme: Enable autonomous power state transitions Andy Lutomirski
@ 2016-09-17  0:49 ` J Freyensee
  2016-09-22  0:11 ` Andy Lutomirski
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: J Freyensee @ 2016-09-17  0:49 UTC (permalink / raw)
  To: Andy Lutomirski, Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel

On Fri, 2016-09-16 at 11:16 -0700, Andy Lutomirski wrote:
> Hi all-
> 
> Here's v4 of the APST patch set.  The biggest bikesheddable thing (I
> think) is the scaling factor.  I currently have it hardcoded so that
> we wait 50x the total latency before entering a power saving state.
> On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
> latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
> latency, 22ms exit latency) after 1200ms.  I have the default max
> latency set to 25ms.
> 
> FWIW, in practice, the latency this introduces seems to be well
> under 22ms, but my benchmark is a bit silly and I might have
> measured it wrong.  I certainly haven't observed a slowdown just
> using my laptop.
> 
> This time around, I changed the names of parameters after Jay
> Frayensee got confused by the first try.  Now they are:
> 
>  - ps_max_latency_us in sysfs: actually controls it.
>  - nvme_core.default_ps_max_latency_us: sets the default.
> 
> Yeah, they're mouthfuls, but they should be clearer now.
> 

I took the patches and applied them to one of my NVMe fabric hosts on
my NVMe-over-Fabrics setup.  Basically, it doesn't test much other than
Andy's explanation that "ps_max_latency_us" does not appear in any of
/sys/block/nvmeXnY sysfs nodes (I have 7) so seems good to me on this
front.

Tested-by: Jay Freyensee <james_p_freyensee@linux.intel.com>
[jpf: defaults benign to NVMe-over-Fabrics]


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-16 18:16 [PATCH v4 0/3] nvme power saving Andy Lutomirski
                   ` (3 preceding siblings ...)
  2016-09-17  0:49 ` [PATCH v4 0/3] nvme power saving J Freyensee
@ 2016-09-22  0:11 ` Andy Lutomirski
  2016-09-22 13:21   ` Christoph Hellwig
  2016-09-22 14:23 ` Jens Axboe
  2016-09-23 23:42 ` Christoph Hellwig
  6 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-09-22  0:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Keith Busch, Jens Axboe, linux-nvme, Christoph Hellwig,
	linux-kernel, J Freyensee

On Fri, Sep 16, 2016 at 11:16 AM, Andy Lutomirski <luto@kernel.org> wrote:
> Hi all-
>
> Here's v4 of the APST patch set.  The biggest bikesheddable thing (I
> think) is the scaling factor.  I currently have it hardcoded so that
> we wait 50x the total latency before entering a power saving state.
> On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
> latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
> latency, 22ms exit latency) after 1200ms.  I have the default max
> latency set to 25ms.

Anything I can/should do to help this make it for 4.9? :)

--Andy


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22  0:11 ` Andy Lutomirski
@ 2016-09-22 13:21   ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-09-22 13:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Keith Busch, Jens Axboe, linux-nvme,
	Christoph Hellwig, linux-kernel, J Freyensee

On Wed, Sep 21, 2016 at 05:11:03PM -0700, Andy Lutomirski wrote:
> Anything I can/should do to help this make it for 4.9? :)

Maybe you should have added the Reviewed-by: tags you already got to
this repost? :)

Here we go again:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-16 18:16 [PATCH v4 0/3] nvme power saving Andy Lutomirski
                   ` (4 preceding siblings ...)
  2016-09-22  0:11 ` Andy Lutomirski
@ 2016-09-22 14:23 ` Jens Axboe
  2016-09-22 20:11   ` Andy Lutomirski
  2016-09-23 23:42 ` Christoph Hellwig
  6 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2016-09-22 14:23 UTC (permalink / raw)
  To: Andy Lutomirski, Keith Busch
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, J Freyensee

On 09/16/2016 12:16 PM, Andy Lutomirski wrote:
> Hi all-
>
> Here's v4 of the APST patch set.  The biggest bikesheddable thing (I
> think) is the scaling factor.  I currently have it hardcoded so that
> we wait 50x the total latency before entering a power saving state.
> On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
> latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
> latency, 22ms exit latency) after 1200ms.  I have the default max
> latency set to 25ms.
>
> FWIW, in practice, the latency this introduces seems to be well
> under 22ms, but my benchmark is a bit silly and I might have
> measured it wrong.  I certainly haven't observed a slowdown just
> using my laptop.
>
> This time around, I changed the names of parameters after Jay
> Frayensee got confused by the first try.  Now they are:
>
>  - ps_max_latency_us in sysfs: actually controls it.
>  - nvme_core.default_ps_max_latency_us: sets the default.
>
> Yeah, they're mouthfuls, but they should be clearer now.

The only thing I don't like about this is the fact that it's a driver
private thing. Similar to ALPM on SATA, it's yet another knob that needs
to be set. If we put it somewhere generic, then at least we could
potentially use it in a generic fashion.

Additionally, it should not be on by default.

-- 
Jens Axboe


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22 14:23 ` Jens Axboe
@ 2016-09-22 20:11   ` Andy Lutomirski
  2016-09-22 20:43     ` Jens Axboe
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-09-22 20:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andy Lutomirski, Keith Busch, linux-nvme, Christoph Hellwig,
	linux-kernel, J Freyensee

On Thu, Sep 22, 2016 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
>
> On 09/16/2016 12:16 PM, Andy Lutomirski wrote:
>>
>> Hi all-
>>
>> Here's v4 of the APST patch set.  The biggest bikesheddable thing (I
>> think) is the scaling factor.  I currently have it hardcoded so that
>> we wait 50x the total latency before entering a power saving state.
>> On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
>> latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
>> latency, 22ms exit latency) after 1200ms.  I have the default max
>> latency set to 25ms.
>>
>> FWIW, in practice, the latency this introduces seems to be well
>> under 22ms, but my benchmark is a bit silly and I might have
>> measured it wrong.  I certainly haven't observed a slowdown just
>> using my laptop.
>>
>> This time around, I changed the names of parameters after Jay
>> Frayensee got confused by the first try.  Now they are:
>>
>>  - ps_max_latency_us in sysfs: actually controls it.
>>  - nvme_core.default_ps_max_latency_us: sets the default.
>>
>> Yeah, they're mouthfuls, but they should be clearer now.
>
>
> The only thing I don't like about this is the fact that's it's a driver private thing. Similar to ALPM on SATA, it's yet another knob that needs to be set. It we put it somewhere generic, then at least we could potentially use it in a generic fashion.

Agreed.  I'm hoping to hear back from Rafael soon about the dev_pm_qos thing.

>
> Additionally, it should not be on by default.

I think I disagree with this.  Since we don't have anything like
laptop-mode AFAIK, I think we do want it on by default.  For the
server workloads that want to consume more idle power for faster
response when idle, I think the servers should be willing to make this
change, just like they need to disable overly deep C states, etc.
(Admittedly, unifying the configuration would be nice.)


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22 20:11   ` Andy Lutomirski
@ 2016-09-22 20:43     ` Jens Axboe
  2016-09-22 21:33       ` J Freyensee
  0 siblings, 1 reply; 19+ messages in thread
From: Jens Axboe @ 2016-09-22 20:43 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Keith Busch, linux-nvme, Christoph Hellwig,
	linux-kernel, J Freyensee

On 09/22/2016 02:11 PM, Andy Lutomirski wrote:
> On Thu, Sep 22, 2016 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
>>
>> On 09/16/2016 12:16 PM, Andy Lutomirski wrote:
>>>
>>> Hi all-
>>>
>>> Here's v4 of the APST patch set.  The biggest bikesheddable thing (I
>>> think) is the scaling factor.  I currently have it hardcoded so that
>>> we wait 50x the total latency before entering a power saving state.
>>> On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
>>> latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
>>> latency, 22ms exit latency) after 1200ms.  I have the default max
>>> latency set to 25ms.
>>>
>>> FWIW, in practice, the latency this introduces seems to be well
>>> under 22ms, but my benchmark is a bit silly and I might have
>>> measured it wrong.  I certainly haven't observed a slowdown just
>>> using my laptop.
>>>
>>> This time around, I changed the names of parameters after Jay
>>> Frayensee got confused by the first try.  Now they are:
>>>
>>>  - ps_max_latency_us in sysfs: actually controls it.
>>>  - nvme_core.default_ps_max_latency_us: sets the default.
>>>
>>> Yeah, they're mouthfuls, but they should be clearer now.
>>
>>
>> The only thing I don't like about this is the fact that's it's a driver private thing. Similar to ALPM on SATA, it's yet another knob that needs to be set. It we put it somewhere generic, then at least we could potentially use it in a generic fashion.
>
> Agreed.  I'm hoping to hear back from Rafael soon about the dev_pm_qos
> thing.
>
>>
>> Additionally, it should not be on by default.
>
> I think I disagree with this.  Since we don't have anything like
> laptop-mode AFAIK, I think we do want it on by default.  For the
> server workloads that want to consume more idle power for faster
> response when idle, I think the servers should be willing to make this
> change, just like they need to disable overly deep C states, etc.
> (Admittedly, unifying the configuration would be nice.)

I can see two reasons why we don't want it the default:

1) Changes like this have a tendency to cause issues on various types of
hardware. How many NVMe devices have you tested this on? ALPM on SATA
had a lot of initial problems, where it slowed down some SSDs unbearably.

2) Rolling out a new kernel and seeing a weird slowdown on some
workloads usually costs a LOT of time to investigate and finally get to
the bottom of. It's not that server setups don't want to make this
change, it's usually that they don't know about it until it's caused
some issue in production (eg slowdown, or otherwise).

Either one of those is enough, in my book, to default it to off. I ran
it on my laptop and saw no power saving wins, unfortunately, for what
it's worth.

-- 
Jens Axboe


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22 20:43     ` Jens Axboe
@ 2016-09-22 21:33       ` J Freyensee
  2016-09-22 22:15         ` Andy Lutomirski
  2016-09-22 22:16         ` Keith Busch
  0 siblings, 2 replies; 19+ messages in thread
From: J Freyensee @ 2016-09-22 21:33 UTC (permalink / raw)
  To: Jens Axboe, Andy Lutomirski
  Cc: linux-kernel, linux-nvme, Keith Busch, Andy Lutomirski,
	Christoph Hellwig

On Thu, 2016-09-22 at 14:43 -0600, Jens Axboe wrote:
> On 09/22/2016 02:11 PM, Andy Lutomirski wrote:
> > 
> > On Thu, Sep 22, 2016 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
> > > 
> > > 
> > > On 09/16/2016 12:16 PM, Andy Lutomirski wrote:
> > > > 
> > > > 
> > > > Hi all-
> > > > 
> > > > Here's v4 of the APST patch set.  The biggest bikesheddable
> > > > thing (I
> > > > think) is the scaling factor.  I currently have it hardcoded so
> > > > that
> > > > we wait 50x the total latency before entering a power saving
> > > > state.
> > > > On my Samsung 950, this means we enter state 3 (70mW, 0.5ms
> > > > entry
> > > > latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms
> > > > entry
> > > > latency, 22ms exit latency) after 1200ms.  I have the default
> > > > max
> > > > latency set to 25ms.
> > > > 
> > > > FWIW, in practice, the latency this introduces seems to be well
> > > > under 22ms, but my benchmark is a bit silly and I might have
> > > > measured it wrong.  I certainly haven't observed a slowdown
> > > > just
> > > > using my laptop.
> > > > 
> > > > This time around, I changed the names of parameters after Jay
> > > > Frayensee got confused by the first try.  Now they are:
> > > > 
> > > >  - ps_max_latency_us in sysfs: actually controls it.
> > > >  - nvme_core.default_ps_max_latency_us: sets the default.
> > > > 
> > > > Yeah, they're mouthfuls, but they should be clearer now.
> > > 
> > > 
> > > The only thing I don't like about this is the fact that's it's a
> > > driver private thing. Similar to ALPM on SATA, it's yet another
> > > knob that needs to be set. It we put it somewhere generic, then
> > > at least we could potentially use it in a generic fashion.
> > 
> > Agreed.  I'm hoping to hear back from Rafael soon about the
> > dev_pm_qos
> > thing.
> > 
> > > 
> > > 
> > > Additionally, it should not be on by default.
> > 
> > I think I disagree with this.  Since we don't have anything like
> > laptop-mode AFAIK, I think we do want it on by default.  For the
> > server workloads that want to consume more idle power for faster
> > response when idle, I think the servers should be willing to make
> > this
> > change, just like they need to disable overly deep C states, etc.
> > (Admittedly, unifying the configuration would be nice.)
> 
> I can see two reasons why we don't want it the default:
> 
> 1) Changes like this has a tendency to cause issues on various types
> of
> hardware. How many NVMe devices have you tested this on? ALPM on SATA
> had a lot of initial problems, where slowed down some SSDs unberably.

...and some SSDs don't even support this feature yet, so the number of
different NVMe devices available to test initially will most likely be
small (like the Fultondales I have, all I could check is to see if the
code broke anything if the device did not have this power-save
feature).

I agree with Jens, makes a lot of sense to start with this feature
'off'.

To 'advertise' the feature, maybe make the feature a new selection in
Kconfig?  Example, initially make it "EXPERIMENTAL", and later when
more devices implement this feature it can be integrated more tightly
into the NVMe solution and default to on.


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22 22:16         ` Keith Busch
@ 2016-09-22 22:07           ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-09-22 22:07 UTC (permalink / raw)
  To: Keith Busch, J Freyensee
  Cc: Andy Lutomirski, linux-kernel, linux-nvme, Andy Lutomirski,
	Christoph Hellwig

On 09/22/2016 04:16 PM, Keith Busch wrote:
> On Thu, Sep 22, 2016 at 02:33:36PM -0700, J Freyensee wrote:
>> ...and some SSDs don't even support this feature yet, so the number of
>> different NVMe devices available to test initially will most likely be
>> small (like the Fultondales I have, all I could check is to see if the
>> code broke anything if the device did not have this power-save
>> feature).
>>
>> I agree with Jens, makes a lot of sense to start with this feature
>> 'off'.
>>
>> To 'advertise' the feature, maybe make the feature a new selection in
>> Kconfig?  Example, initially make it "EXPERIMENTAL", and later when
>> more devices implement this feature it can be integrated more tightly
>> into the NVMe solution and default to on.
>
> Should we just leave the kernel out of this then? I bet we could script
> this feature in user space.

That actually might be the sanest approach. Then we can tie it into some
generic PM latency setup in the future, from the kernel, when it's
available.

That way we don't get left with some odd NVMe specific PM sysfs knob
that is exposed to userland.

-- 
Jens Axboe


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22 21:33       ` J Freyensee
@ 2016-09-22 22:15         ` Andy Lutomirski
  2016-10-28  0:06           ` Andy Lutomirski
  2016-09-22 22:16         ` Keith Busch
  1 sibling, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-09-22 22:15 UTC (permalink / raw)
  To: J Freyensee
  Cc: Jens Axboe, linux-kernel, linux-nvme, Keith Busch,
	Andy Lutomirski, Christoph Hellwig

On Thu, Sep 22, 2016 at 2:33 PM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
> On Thu, 2016-09-22 at 14:43 -0600, Jens Axboe wrote:
>> On 09/22/2016 02:11 PM, Andy Lutomirski wrote:
>> >
>> > On Thu, Sep 22, 2016 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
>> > >
>> > >
>> > > On 09/16/2016 12:16 PM, Andy Lutomirski wrote:
>> > > >
>> > > >
>> > > > Hi all-
>> > > >
>> > > > Here's v4 of the APST patch set.  The biggest bikesheddable
>> > > > thing (I
>> > > > think) is the scaling factor.  I currently have it hardcoded so
>> > > > that
>> > > > we wait 50x the total latency before entering a power saving
>> > > > state.
>> > > > On my Samsung 950, this means we enter state 3 (70mW, 0.5ms
>> > > > entry
>> > > > latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms
>> > > > entry
>> > > > latency, 22ms exit latency) after 1200ms.  I have the default
>> > > > max
>> > > > latency set to 25ms.
>> > > >
>> > > > FWIW, in practice, the latency this introduces seems to be well
>> > > > under 22ms, but my benchmark is a bit silly and I might have
>> > > > measured it wrong.  I certainly haven't observed a slowdown
>> > > > just
>> > > > using my laptop.
>> > > >
>> > > > This time around, I changed the names of parameters after Jay
>> > > > Frayensee got confused by the first try.  Now they are:
>> > > >
>> > > >  - ps_max_latency_us in sysfs: actually controls it.
>> > > >  - nvme_core.default_ps_max_latency_us: sets the default.
>> > > >
>> > > > Yeah, they're mouthfuls, but they should be clearer now.
>> > >
>> > >
>> > > The only thing I don't like about this is the fact that's it's a
>> > > driver private thing. Similar to ALPM on SATA, it's yet another
>> > > knob that needs to be set. It we put it somewhere generic, then
>> > > at least we could potentially use it in a generic fashion.
>> >
>> > Agreed.  I'm hoping to hear back from Rafael soon about the
>> > dev_pm_qos
>> > thing.
>> >
>> > >
>> > >
>> > > Additionally, it should not be on by default.
>> >
>> > I think I disagree with this.  Since we don't have anything like
>> > laptop-mode AFAIK, I think we do want it on by default.  For the
>> > server workloads that want to consume more idle power for faster
>> > response when idle, I think the servers should be willing to make
>> > this
>> > change, just like they need to disable overly deep C states, etc.
>> > (Admittedly, unifying the configuration would be nice.)
>>
>> I can see two reasons why we don't want it the default:
>>
>> 1) Changes like this has a tendency to cause issues on various types
>> of
>> hardware. How many NVMe devices have you tested this on? ALPM on SATA
>> had a lot of initial problems, where slowed down some SSDs unberably.

I'm reasonably optimistic that the NVMe situation will be a lot better
for a couple of reasons:

1. There's only one player involved.  With ALPM, the controller and
the drive need to cooperate on entering and leaving various idle
states.  With NVMe, the controller *is* the drive, so there's no issue
where a drive manufacturer might not have tested with the relevant
controller or vice versa.

2. Windows appears to use it.  I haven't tested directly, but the
Internet seems to think that Windows uses APST and maybe even manual
state transitions, and that NVMe power states are even mandatory for
Connected Standby logo compliance.

3. The feature is new.  NVMe 1.0 didn't support APST at all, so the
driver is unlikely to cause problems with older devices.

>
> ...and some SSDs don't even support this feature yet, so the number of
> different NVMe devices available to test initially will most likely be
> small (like the Fultondales I have, all I could check is to see if the
> code broke anything if the device did not have this power-save
> feature).
>
> I agree with Jens, makes a lot of sense to start with this feature
> 'off'.
>
> To 'advertise' the feature, maybe make the feature a new selection in
> Kconfig?  Example, initially make it "EXPERIMENTAL", and later when
> more devices implement this feature it can be integrated more tightly
> into the NVMe solution and default to on.
>

How about having a config option that's "default n" that changes the
default?  I could also add a log message when APST is first enabled on
a device to make it easier to notice a change.

--Andy


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22 21:33       ` J Freyensee
  2016-09-22 22:15         ` Andy Lutomirski
@ 2016-09-22 22:16         ` Keith Busch
  2016-09-22 22:07           ` Jens Axboe
  1 sibling, 1 reply; 19+ messages in thread
From: Keith Busch @ 2016-09-22 22:16 UTC (permalink / raw)
  To: J Freyensee
  Cc: Jens Axboe, Andy Lutomirski, linux-kernel, linux-nvme,
	Andy Lutomirski, Christoph Hellwig

On Thu, Sep 22, 2016 at 02:33:36PM -0700, J Freyensee wrote:
> ...and some SSDs don't even support this feature yet, so the number of
> different NVMe devices available to test initially will most likely be
> small (like the Fultondales I have, all I could check is to see if the
> code broke anything if the device did not have this power-save
> feature).
> 
> I agree with Jens, makes a lot of sense to start with this feature
> 'off'.
> 
> To 'advertise' the feature, maybe make the feature a new selection in
> Kconfig?  Example, initially make it "EXPERIMENTAL", and later when
> more devices implement this feature it can be integrated more tightly
> into the NVMe solution and default to on.

Should we just leave the kernel out of this then? I bet we could script
this feature in user space.


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-16 18:16 [PATCH v4 0/3] nvme power saving Andy Lutomirski
                   ` (5 preceding siblings ...)
  2016-09-22 14:23 ` Jens Axboe
@ 2016-09-23 23:42 ` Christoph Hellwig
  2016-09-24 16:55   ` Jens Axboe
  6 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2016-09-23 23:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Keith Busch, Jens Axboe, linux-nvme, Christoph Hellwig,
	linux-kernel, J Freyensee

Jens,

can we at least get patches 1 and 2 in while pondering the fate
of the right interface for patch 3?


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-23 23:42 ` Christoph Hellwig
@ 2016-09-24 16:55   ` Jens Axboe
  0 siblings, 0 replies; 19+ messages in thread
From: Jens Axboe @ 2016-09-24 16:55 UTC (permalink / raw)
  To: Christoph Hellwig, Andy Lutomirski
  Cc: Keith Busch, linux-nvme, linux-kernel, J Freyensee

On 09/23/2016 05:42 PM, Christoph Hellwig wrote:
> Jens,
>
> can we at least get patches 1 and 2 in while pondering the fate
> of the right interface for patch 3?
>

Yes definitely, I have no beef with the first two patches. I'll add them
for 4.9.

-- 
Jens Axboe


* Re: [PATCH v4 0/3] nvme power saving
  2016-09-22 22:15         ` Andy Lutomirski
@ 2016-10-28  0:06           ` Andy Lutomirski
  2016-10-28  5:29             ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Andy Lutomirski @ 2016-10-28  0:06 UTC (permalink / raw)
  To: J Freyensee
  Cc: Jens Axboe, linux-kernel, linux-nvme, Keith Busch,
	Andy Lutomirski, Christoph Hellwig

On Thu, Sep 22, 2016 at 3:15 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Thu, Sep 22, 2016 at 2:33 PM, J Freyensee
> <james_p_freyensee@linux.intel.com> wrote:
>> On Thu, 2016-09-22 at 14:43 -0600, Jens Axboe wrote:
>>> On 09/22/2016 02:11 PM, Andy Lutomirski wrote:
>>> >
>>> > On Thu, Sep 22, 2016 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
>>> > >
>>> > >
>>> > > On 09/16/2016 12:16 PM, Andy Lutomirski wrote:
>>> > > >
>>> > > >
>>> > > > Hi all-
>>> > > >
>>> > > > Here's v4 of the APST patch set.  The biggest bikesheddable
>>> > > > thing (I
>>> > > > think) is the scaling factor.  I currently have it hardcoded so
>>> > > > that
>>> > > > we wait 50x the total latency before entering a power saving
>>> > > > state.
>>> > > > On my Samsung 950, this means we enter state 3 (70mW, 0.5ms
>>> > > > entry
>>> > > > latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms
>>> > > > entry
>>> > > > latency, 22ms exit latency) after 1200ms.  I have the default
>>> > > > max
>>> > > > latency set to 25ms.
>>> > > >
>>> > > > FWIW, in practice, the latency this introduces seems to be well
>>> > > > under 22ms, but my benchmark is a bit silly and I might have
>>> > > > measured it wrong.  I certainly haven't observed a slowdown
>>> > > > just
>>> > > > using my laptop.
>>> > > >
>>> > > > This time around, I changed the names of parameters after Jay
>>> > > > Frayensee got confused by the first try.  Now they are:
>>> > > >
>>> > > >  - ps_max_latency_us in sysfs: actually controls it.
>>> > > >  - nvme_core.default_ps_max_latency_us: sets the default.
>>> > > >
>>> > > > Yeah, they're mouthfuls, but they should be clearer now.
>>> > >
>>> > >
>>> > > The only thing I don't like about this is the fact that's it's a
>>> > > driver private thing. Similar to ALPM on SATA, it's yet another
>>> > > knob that needs to be set. It we put it somewhere generic, then
>>> > > at least we could potentially use it in a generic fashion.
>>> >
>>> > Agreed.  I'm hoping to hear back from Rafael soon about the
>>> > dev_pm_qos
>>> > thing.
>>> >
>>> > >
>>> > >
>>> > > Additionally, it should not be on by default.
>>> >
>>> > I think I disagree with this.  Since we don't have anything like
>>> > laptop-mode AFAIK, I think we do want it on by default.  For the
>>> > server workloads that want to consume more idle power for faster
>>> > response when idle, I think the servers should be willing to make
>>> > this
>>> > change, just like they need to disable overly deep C states, etc.
>>> > (Admittedly, unifying the configuration would be nice.)
>>>
>>> I can see two reasons why we don't want it the default:
>>>
>>> 1) Changes like this has a tendency to cause issues on various types
>>> of
>>> hardware. How many NVMe devices have you tested this on? ALPM on SATA
>>> had a lot of initial problems, where slowed down some SSDs unberably.
>
> I'm reasonably optimistic that the NVMe situation will be a lot better
> for a couple of reasons:
>
> 1. There's only one player involved.  With ALPM, the controller and
> the drive need to cooperate on entering and leaving various idle
> states.  With NVMe, the controller *is* the drive, so there's no issue
> where a drive manufacturer might not have tested with the relevant
> controller or vice versa.
>
> 2. Windows appears to use it.  I haven't tested directly, but the
> Internet seems to think that Windows uses APST and maybe even manual
> state transitions, and that NVMe power states are even mandatory for
> Connected Standby logo compliance.
>
> 3. The feature is new.  NVMe 1.0 didn't support APST at all, so the
> driver is unlikely to cause problems with older drivers.
>
>>
>> ...and some SSDs don't even support this feature yet, so the number of
>> different NVMe devices available to test initially will most likely be
>> small (like the Fultondales I have, all I could check is to see if the
>> code broke anything if the device did not have this power-save
>> feature).
>>
>> I agree with Jens, makes a lot of sense to start with this feature
>> 'off'.
>>
>> To 'advertise' the feature, maybe make the feature a new selection in
>> Kconfig?  Example, initially make it "EXPERIMENTAL", and later when
>> more devices implement this feature it can be integrated more tightly
>> into the NVMe solution and default to on.
>>
>
> How about having a config option that's "default n" that changes the
> default?  I could also add a log message when APST is first enabled on
> a device to make it easier to notice a change.
>

It looks like there is at least one NVMe disk in existence (a
different Samsung device) that sporadically dies when APST is on.
This device appears to also sporadically die when APST is off, but it
lasts considerably longer before dying with APST off.

So here's what I'm tempted to do:

 - For devices that report NVMe version 1.2 support, APST is on by
default.  I hope this is safe.

 - For devices that don't report NVMe 1.2 or higher but do report
APSTA (which implies NVMe 1.1), then we can have a blacklist or a
whitelist.  A blacklist is nicer, but a whitelist is safer.

 - A sysfs and/or module control allows overriding this.

 - Implement dev_pm_qos latency control.  The chosen latency (if APST
is enabled) will be the lesser of the dev_pm_qos setting and a module
parameter.

How does that sound?
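
To make the shape of that concrete, here is a rough sketch
(apst_enabled and NVME_QUIRK_NO_APST are made-up names for
illustration, not code from this series; ctrl->vs, ctrl->apsta, and
ctrl->quirks already exist in the driver):

	bool apst_enabled = false;

	if (ctrl->apsta) {
		if (ctrl->vs >= 0x10200)	/* controller reports NVMe 1.2+ */
			apst_enabled = true;
		else if (!(ctrl->quirks & NVME_QUIRK_NO_APST))
			apst_enabled = true;	/* NVMe 1.1 + APSTA: blacklist */
	}

	/*
	 * The effective ps_max_latency_us would then be the lesser of the
	 * dev_pm_qos latency and the module parameter, with zero still
	 * disabling APST entirely.
	 */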


* Re: [PATCH v4 0/3] nvme power saving
  2016-10-28  0:06           ` Andy Lutomirski
@ 2016-10-28  5:29             ` Christoph Hellwig
  0 siblings, 0 replies; 19+ messages in thread
From: Christoph Hellwig @ 2016-10-28  5:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: J Freyensee, Jens Axboe, linux-kernel, linux-nvme, Keith Busch,
	Andy Lutomirski, Judy Brock-SSI

On Thu, Oct 27, 2016 at 05:06:16PM -0700, Andy Lutomirski wrote:
> It looks like there is at least one NVMe disk in existence (a
> different Samsung device) that sporadically dies when APST is on.
> This device appears to also sporadically die when APST is off, but it
> lasts considerably longer before dying with APST off.

Judy, can you help Andy to find someone in Samsung to report this
to?

> So here's what I'm tempted to do:
> 
>  - For devices that report NVMe version 1.2 support, APST is on by
> default.  I hope this is safe.

It should be safe.  That being said, NVMe is being driven more and more
into consumer markets, so we will inevitably find some device we need
to work around, but that's life.

>  - For devices that don't report NVMe 1.2 or higher but do report
> APSTA (which implies NVMe 1.1), then we can have a blacklist or a
> whitelist.  A blacklist is nicer, but a whitelist is safer.

We just had a discussion in the NVMe technical working group about
advertising features before claiming conformance to the spec version
where they appear.  The general consensus was that it should be safe.
I'm thus tempted
to start out with the blacklist.

>  - A sysfs and/or module control allows overriding this.
> 
>  - Implement dev_pm_qos latency control.  The chosen latency (if APST
> is enabled) will be the lesser of the dev_pm_qos setting and a module
> parameter.
> 
> How does that sound?

Great!

