linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] nvme power saving
@ 2016-08-29  9:25 Andy Lutomirski
  2016-08-29  9:25 ` [PATCH 1/3] nvme/scsi: Remove power management support Andy Lutomirski
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-29  9:25 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, Andy Lutomirski

Hi all-

Here's v1 of the APST patch set.  The biggest bikesheddable thing (I
think) is the scaling factor.  I currently have it hardcoded so that
we wait 50x the total latency before entering a power saving state.
On my Samsung 950, this means we enter state 3 (70mW, 0.5ms entry
latency, 5ms exit latency) after 275ms and state 4 (5mW, 2ms entry
latency, 22ms exit latency) after 1200ms.  I have the default max
latency set to 25ms.

FWIW, in practice, the latency this introduces seems to be well
under 22ms, but my benchmark is a bit silly and I might have
measured it wrong.  I certainly haven't observed a slowdown just
using my laptop.

Andy Lutomirski (3):
  nvme/scsi: Remove power management support
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme: Enable autonomous power state transitions

 drivers/nvme/host/core.c | 199 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/nvme/host/nvme.h |  10 ++-
 drivers/nvme/host/scsi.c |  80 ++-----------------
 include/linux/nvme.h     |   6 ++
 4 files changed, 211 insertions(+), 84 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] nvme/scsi: Remove power management support
  2016-08-29  9:25 [PATCH 0/3] nvme power saving Andy Lutomirski
@ 2016-08-29  9:25 ` Andy Lutomirski
  2016-08-29  9:25 ` [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features() Andy Lutomirski
  2016-08-29  9:25 ` [PATCH 3/3] nvme: Enable autonomous power state transitions Andy Lutomirski
  2 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-29  9:25 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, Andy Lutomirski

As far as I can tell, there is basically nothing correct about this
code.  It misinterprets npss (off-by-one).  It hardcodes a bunch of
power states, which is nonsense, because they're all just indices
into a table that software needs to parse.  It completely ignores
the distinction between operational and non-operational states.
And, until 4.8, if all of the above magically succeeded, it would
dereference a NULL pointer and OOPS.

Since this code appears to be useless, just delete it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/scsi.c | 74 ++----------------------------------------------
 1 file changed, 3 insertions(+), 71 deletions(-)

diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index e947e298a737..44009105f8c8 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -72,15 +72,6 @@ static int sg_version_num = 30534;	/* 2 digits for each component */
 #define ALL_LUNS_RETURNED				0x02
 #define ALL_WELL_KNOWN_LUNS_RETURNED			0x01
 #define RESTRICTED_LUNS_RETURNED			0x00
-#define NVME_POWER_STATE_START_VALID			0x00
-#define NVME_POWER_STATE_ACTIVE				0x01
-#define NVME_POWER_STATE_IDLE				0x02
-#define NVME_POWER_STATE_STANDBY			0x03
-#define NVME_POWER_STATE_LU_CONTROL			0x07
-#define POWER_STATE_0					0
-#define POWER_STATE_1					1
-#define POWER_STATE_2					2
-#define POWER_STATE_3					3
 #define DOWNLOAD_SAVE_ACTIVATE				0x05
 #define DOWNLOAD_SAVE_DEFER_ACTIVATE			0x0E
 #define ACTIVATE_DEFERRED_MICROCODE			0x0F
@@ -1229,64 +1220,6 @@ static void nvme_trans_fill_read_cap(u8 *response, struct nvme_id_ns *id_ns,
 
 /* Start Stop Unit Helper Functions */
 
-static int nvme_trans_power_state(struct nvme_ns *ns, struct sg_io_hdr *hdr,
-						u8 pc, u8 pcmod, u8 start)
-{
-	int res;
-	int nvme_sc;
-	struct nvme_id_ctrl *id_ctrl;
-	int lowest_pow_st;	/* max npss = lowest power consumption */
-	unsigned ps_desired = 0;
-
-	nvme_sc = nvme_identify_ctrl(ns->ctrl, &id_ctrl);
-	res = nvme_trans_status_code(hdr, nvme_sc);
-	if (res)
-		return res;
-
-	lowest_pow_st = max(POWER_STATE_0, (int)(id_ctrl->npss - 1));
-	kfree(id_ctrl);
-
-	switch (pc) {
-	case NVME_POWER_STATE_START_VALID:
-		/* Action unspecified if POWER CONDITION MODIFIER != 0 */
-		if (pcmod == 0 && start == 0x1)
-			ps_desired = POWER_STATE_0;
-		if (pcmod == 0 && start == 0x0)
-			ps_desired = lowest_pow_st;
-		break;
-	case NVME_POWER_STATE_ACTIVE:
-		/* Action unspecified if POWER CONDITION MODIFIER != 0 */
-		if (pcmod == 0)
-			ps_desired = POWER_STATE_0;
-		break;
-	case NVME_POWER_STATE_IDLE:
-		/* Action unspecified if POWER CONDITION MODIFIER != [0,1,2] */
-		if (pcmod == 0x0)
-			ps_desired = POWER_STATE_1;
-		else if (pcmod == 0x1)
-			ps_desired = POWER_STATE_2;
-		else if (pcmod == 0x2)
-			ps_desired = POWER_STATE_3;
-		break;
-	case NVME_POWER_STATE_STANDBY:
-		/* Action unspecified if POWER CONDITION MODIFIER != [0,1] */
-		if (pcmod == 0x0)
-			ps_desired = max(POWER_STATE_0, (lowest_pow_st - 2));
-		else if (pcmod == 0x1)
-			ps_desired = max(POWER_STATE_0, (lowest_pow_st - 1));
-		break;
-	case NVME_POWER_STATE_LU_CONTROL:
-	default:
-		res = nvme_trans_completion(hdr, SAM_STAT_CHECK_CONDITION,
-				ILLEGAL_REQUEST, SCSI_ASC_INVALID_CDB,
-				SCSI_ASCQ_CAUSE_NOT_REPORTABLE);
-		break;
-	}
-	nvme_sc = nvme_set_features(ns->ctrl, NVME_FEAT_POWER_MGMT, ps_desired, 0,
-				    NULL);
-	return nvme_trans_status_code(hdr, nvme_sc);
-}
-
 static int nvme_trans_send_activate_fw_cmd(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 					u8 buffer_id)
 {
@@ -2235,11 +2168,10 @@ static int nvme_trans_synchronize_cache(struct nvme_ns *ns,
 static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 							u8 *cmd)
 {
-	u8 immed, pcmod, pc, no_flush, start;
+	u8 immed, pcmod, no_flush, start;
 
 	immed = cmd[1] & 0x01;
 	pcmod = cmd[3] & 0x0f;
-	pc = (cmd[4] & 0xf0) >> 4;
 	no_flush = cmd[4] & 0x04;
 	start = cmd[4] & 0x01;
 
@@ -2254,8 +2186,8 @@ static int nvme_trans_start_stop(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 			if (res)
 				return res;
 		}
-		/* Setup the expected power state transition */
-		return nvme_trans_power_state(ns, hdr, pc, pcmod, start);
+
+		return 0;
 	}
 }
 
-- 
2.7.4

* [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  2016-08-29  9:25 [PATCH 0/3] nvme power saving Andy Lutomirski
  2016-08-29  9:25 ` [PATCH 1/3] nvme/scsi: Remove power management support Andy Lutomirski
@ 2016-08-29  9:25 ` Andy Lutomirski
  2016-08-29 16:27   ` Keith Busch
  2016-08-29  9:25 ` [PATCH 3/3] nvme: Enable autonomous power state transitions Andy Lutomirski
  2 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-29  9:25 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, Andy Lutomirski

Any user I can imagine that needs a buffer at all will want to pass
a pointer directly.  There are currently no callers that use
buffers, so this change is painless, and it will make it much easier
to start using features that use buffers (e.g. APST).

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 32 ++++++++++++++++++++++++--------
 drivers/nvme/host/nvme.h |  4 ++--
 drivers/nvme/host/scsi.c |  6 +++---
 3 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2feacc70bf61..3f7561ab54dc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -597,19 +597,25 @@ int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 }
 
 int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
-					dma_addr_t dma_addr, u32 *result)
+		      void *buffer, size_t buflen, u32 *result)
 {
 	struct nvme_command c;
 	struct nvme_completion cqe;
 	int ret;
 
+	/*
+	 * A controller "page" may be bigger than a Linux page, but we can
+	 * be conservative here.
+	 */
+	WARN_ONCE(((unsigned long)buffer & (PAGE_SIZE-1)) + buflen > PAGE_SIZE,
+		  "NVME feature crosses a page boundary\n");
+
 	memset(&c, 0, sizeof(c));
 	c.features.opcode = nvme_admin_get_features;
 	c.features.nsid = cpu_to_le32(nsid);
-	c.features.dptr.prp1 = cpu_to_le64(dma_addr);
 	c.features.fid = cpu_to_le32(fid);
 
-	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe, NULL, 0, 0,
+	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe, buffer, buflen, 0,
 			NVME_QID_ANY, 0, 0);
 	if (ret >= 0 && result)
 		*result = le32_to_cpu(cqe.result);
@@ -617,20 +623,30 @@ int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
 }
 
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
-					dma_addr_t dma_addr, u32 *result)
+		      const void *buffer, size_t buflen, u32 *result)
 {
 	struct nvme_command c;
 	struct nvme_completion cqe;
 	int ret;
 
+	/*
+	 * A controller "page" may be bigger than a Linux page, but we can
+	 * be conservative here.
+	 */
+	WARN_ONCE(((unsigned long)buffer & (PAGE_SIZE-1)) + buflen > PAGE_SIZE,
+		  "NVME feature crosses a page boundary\n");
+
 	memset(&c, 0, sizeof(c));
 	c.features.opcode = nvme_admin_set_features;
-	c.features.dptr.prp1 = cpu_to_le64(dma_addr);
 	c.features.fid = cpu_to_le32(fid);
 	c.features.dword11 = cpu_to_le32(dword11);
 
-	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe, NULL, 0, 0,
-			NVME_QID_ANY, 0, 0);
+	/*
+	 * Casting buffer to void* is safe here: __nvme_submit_sync_cmd knows
+	 * that we're writing because it decodes the opcode.
+	 */
+	ret = __nvme_submit_sync_cmd(dev->admin_q, &c, &cqe,
+			(void *)buffer, buflen, 0, NVME_QID_ANY, 0, 0);
 	if (ret >= 0 && result)
 		*result = le32_to_cpu(cqe.result);
 	return ret;
@@ -664,7 +680,7 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 	u32 result;
 	int status, nr_io_queues;
 
-	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, 0,
+	status = nvme_set_features(ctrl, NVME_FEAT_NUM_QUEUES, q_count, NULL, 0,
 			&result);
 	if (status < 0)
 		return status;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ab18b78102bf..383ae22e169e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -292,9 +292,9 @@ int nvme_identify_ns(struct nvme_ctrl *dev, unsigned nsid,
 		struct nvme_id_ns **id);
 int nvme_get_log_page(struct nvme_ctrl *dev, struct nvme_smart_log **log);
 int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned nsid,
-			dma_addr_t dma_addr, u32 *result);
+		      void *buffer, size_t buflen, u32 *result);
 int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
-			dma_addr_t dma_addr, u32 *result);
+		      const void *buffer, size_t buflen, u32 *result);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
 void nvme_start_keep_alive(struct nvme_ctrl *ctrl);
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/scsi.c b/drivers/nvme/host/scsi.c
index 44009105f8c8..c2a0a1c7d05d 100644
--- a/drivers/nvme/host/scsi.c
+++ b/drivers/nvme/host/scsi.c
@@ -906,7 +906,7 @@ static int nvme_trans_log_temperature(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	kfree(smart_log);
 
 	/* Get Features for Temp Threshold */
-	res = nvme_get_features(ns->ctrl, NVME_FEAT_TEMP_THRESH, 0, 0,
+	res = nvme_get_features(ns->ctrl, NVME_FEAT_TEMP_THRESH, 0, NULL, 0,
 								&feature_resp);
 	if (res != NVME_SC_SUCCESS)
 		temp_c_thresh = LOG_TEMP_UNKNOWN;
@@ -1039,7 +1039,7 @@ static int nvme_trans_fill_caching_page(struct nvme_ns *ns,
 	if (len < MODE_PAGE_CACHING_LEN)
 		return -EINVAL;
 
-	nvme_sc = nvme_get_features(ns->ctrl, NVME_FEAT_VOLATILE_WC, 0, 0,
+	nvme_sc = nvme_get_features(ns->ctrl, NVME_FEAT_VOLATILE_WC, 0, NULL, 0,
 								&feature_resp);
 	res = nvme_trans_status_code(hdr, nvme_sc);
 	if (res)
@@ -1328,7 +1328,7 @@ static int nvme_trans_modesel_get_mp(struct nvme_ns *ns, struct sg_io_hdr *hdr,
 	case MODE_PAGE_CACHING:
 		dword11 = ((mode_page[2] & CACHING_MODE_PAGE_WCE_MASK) ? 1 : 0);
 		nvme_sc = nvme_set_features(ns->ctrl, NVME_FEAT_VOLATILE_WC,
-					    dword11, 0, NULL);
+					    dword11, NULL, 0, NULL);
 		res = nvme_trans_status_code(hdr, nvme_sc);
 		break;
 	case MODE_PAGE_CONTROL:
-- 
2.7.4

* [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-08-29  9:25 [PATCH 0/3] nvme power saving Andy Lutomirski
  2016-08-29  9:25 ` [PATCH 1/3] nvme/scsi: Remove power management support Andy Lutomirski
  2016-08-29  9:25 ` [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features() Andy Lutomirski
@ 2016-08-29  9:25 ` Andy Lutomirski
  2016-08-29 15:07   ` J Freyensee
  2016-08-29 16:45   ` Keith Busch
  2 siblings, 2 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-29  9:25 UTC (permalink / raw)
  To: Keith Busch, Jens Axboe
  Cc: linux-nvme, Christoph Hellwig, linux-kernel, Andy Lutomirski

NVMe devices can advertise multiple power states.  These states can
be either "operational" (the device is fully functional but possibly
slow) or "non-operational" (the device is asleep until woken up).
Some devices can automatically enter a non-operational state when
idle for a specified amount of time and then automatically wake back
up when needed.

The hardware configuration is a table.  For each state, an entry in
the table indicates the next deeper non-operational state, if any,
to autonomously transition to and the idle time required before
transitioning.

This patch teaches the driver to program APST so that each
successive non-operational state will be entered after an idle time
equal to 50 times the total latency (entry plus exit) associated with
that state.  A sysfs attribute 'apst_max_latency_ns' gives the
maximum acceptable latency in ns; non-operational states with total
latency greater than this value will not be used.  As a special
case, apst_max_latency_ns=0 will disable APST entirely.

On hardware without APST support, apst_max_latency_ns will not be
exposed in sysfs.

In theory, the device can expose a "default" APST table, but this
doesn't seem to function correctly on my device (Samsung 950), nor
does it seem particularly useful.  There is also an optional
mechanism by which a configuration can be "saved" so it will be
automatically loaded on reset.  This can be configured from
userspace, but it doesn't seem useful to support in the driver.

On my laptop, enabling APST seems to save nearly 1W.

The hardware tables can be decoded in userspace with nvme-cli.
'nvme id-ctrl /dev/nvmeN' will show the power state table and
'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
configuration.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 drivers/nvme/host/core.c | 167 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   6 ++
 include/linux/nvme.h     |   6 ++
 3 files changed, 179 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3f7561ab54dc..042137ad2437 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1223,6 +1223,98 @@ static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
 	blk_queue_write_cache(q, vwc, vwc);
 }
 
+static void nvme_configure_apst(struct nvme_ctrl *ctrl)
+{
+	/*
+	 * APST (Autonomous Power State Transition) lets us program a
+	 * table of power state transitions that the controller will
+	 * perform automatically.  We configure it with a simple
+	 * heuristic: we are willing to spend at most 2% of the time
+	 * transitioning between power states.  Therefore, when running
+	 * in any given state, we will enter the next lower-power
+	 * non-operational state after waiting 50 * (enlat + exlat)
+	 * microseconds, as long as that state's total latency is under
+	 * the requested maximum latency.
+	 *
+	 * We will not autonomously enter any non-operational state for
+	 * which the total latency exceeds apst_max_latency_ns.  Users
+	 * can set apst_max_latency_ns to zero to turn off APST.
+	 */
+
+	unsigned apste;
+	struct nvme_feat_auto_pst *table;
+	int ret;
+
+	if (!ctrl->apsta)
+		return;	/* APST isn't supported. */
+
+	if (ctrl->npss > 31) {
+		dev_warn(ctrl->device, "NPSS is invalid; disabling APST\n");
+		return;
+	}
+
+	table = kzalloc(sizeof(*table), GFP_KERNEL);
+	if (!table)
+		return;
+
+	if (ctrl->apst_max_latency_ns == 0) {
+		/* Turn off APST. */
+		apste = 0;
+	} else {
+		__le64 target = cpu_to_le64(0);
+		int state;
+
+		/*
+		 * Walk through all states from lowest- to highest-power.
+		 * According to the spec, lower-numbered states use more
+		 * power.  NPSS, despite the name, is the index of the
+		 * lowest-power state, not the number of states.
+		 */
+		for (state = (int)ctrl->npss; state >= 0; state--) {
+			u64 total_latency_us, transition_ms;
+
+			if (target)
+				table->entries[state] = target;
+
+			/*
+			 * Is this state a useful non-operational state for
+			 * higher-power states to autonomously transition to?
+			 */
+			if (!(ctrl->psd[state].flags & 2))
+				continue;  /* It's an operational state. */
+
+			total_latency_us =
+				(u64)le32_to_cpu(ctrl->psd[state].entry_lat) +
+				le32_to_cpu(ctrl->psd[state].exit_lat);
+			if (total_latency_us * 1000 > ctrl->apst_max_latency_ns)
+				continue;
+
+			/*
+			 * This state is good.  Use it as the APST idle
+			 * target for higher power states.
+			 */
+			transition_ms = total_latency_us + 19;
+			do_div(transition_ms, 20);
+			if (transition_ms > (1 << 24) - 1)
+				transition_ms = (1 << 24) - 1;
+
+			target = cpu_to_le64((state << 3) |
+					     (transition_ms << 8));
+		}
+
+		apste = 1;
+	}
+
+	ret = nvme_set_features(ctrl, NVME_FEAT_AUTO_PST, apste,
+				table, sizeof(*table), NULL);
+	if (ret)
+		dev_err(ctrl->device, "failed to set APST feature (%d)\n", ret);
+
+	kfree(table);
+}
+
+static struct attribute_group nvme_dev_dynamic_attrs_group;
+
 /*
  * Initialize the cached copies of the Identify data and various controller
  * register in our nvme_ctrl structure.  This should be called as soon as
@@ -1289,6 +1381,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	ctrl->sgls = le32_to_cpu(id->sgls);
 	ctrl->kas = le16_to_cpu(id->kas);
 
+	ctrl->npss = id->npss;
+	ctrl->apsta = id->apsta;
+	memcpy(ctrl->psd, id->psd, sizeof(ctrl->psd));
+
 	if (ctrl->ops->is_fabrics) {
 		ctrl->icdoff = le16_to_cpu(id->icdoff);
 		ctrl->ioccsz = le32_to_cpu(id->ioccsz);
@@ -1312,6 +1408,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	}
 
 	kfree(id);
+
+	nvme_configure_apst(ctrl);
+
+	sysfs_update_group(&ctrl->device->kobj, &nvme_dev_dynamic_attrs_group);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(nvme_init_identify);
@@ -1582,6 +1682,41 @@ static ssize_t nvme_sysfs_show_address(struct device *dev,
 }
 static DEVICE_ATTR(address, S_IRUGO, nvme_sysfs_show_address, NULL);
 
+static ssize_t nvme_sysfs_show_apst_max_latency_ns(
+	struct device *dev,
+	struct device_attribute *attr,
+	char *buf)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	return snprintf(buf, PAGE_SIZE, "%llu\n", ctrl->apst_max_latency_ns);
+}
+
+static ssize_t nvme_sysfs_store_apst_max_latency_ns(
+	struct device *dev,
+	struct device_attribute *attr,
+	const char *buf, size_t size)
+{
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+	int ret;
+	u64 val;
+
+	ret = kstrtoull(buf, 10, &val);
+	if (ret)
+		return ret;
+
+	if (ctrl->apst_max_latency_ns != val) {
+		ctrl->apst_max_latency_ns = val;
+		nvme_configure_apst(ctrl);
+	}
+
+	return size;
+}
+
+static DEVICE_ATTR(apst_max_latency_ns, 0644,
+		   nvme_sysfs_show_apst_max_latency_ns,
+		   nvme_sysfs_store_apst_max_latency_ns);
+
 static struct attribute *nvme_dev_attrs[] = {
 	&dev_attr_reset_controller.attr,
 	&dev_attr_rescan_controller.attr,
@@ -1623,8 +1758,33 @@ static struct attribute_group nvme_dev_attrs_group = {
 	.is_visible	= nvme_dev_attrs_are_visible,
 };
 
+static struct attribute *nvme_dev_dynamic_attrs[] = {
+	&dev_attr_apst_max_latency_ns.attr,
+	NULL
+};
+
+static umode_t nvme_dev_dynamic_attrs_are_visible(struct kobject *kobj,
+		struct attribute *a, int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct nvme_ctrl *ctrl = dev_get_drvdata(dev);
+
+	if (a == &dev_attr_apst_max_latency_ns.attr) {
+		if (!ctrl->apsta)
+			return 0;
+	}
+
+	return a->mode;
+}
+
+static struct attribute_group nvme_dev_dynamic_attrs_group = {
+	.attrs		= nvme_dev_dynamic_attrs,
+	.is_visible	= nvme_dev_dynamic_attrs_are_visible,
+};
+
 static const struct attribute_group *nvme_dev_attr_groups[] = {
 	&nvme_dev_attrs_group,
+	&nvme_dev_dynamic_attrs_group,
 	NULL,
 };
 
@@ -2010,6 +2170,13 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	INIT_WORK(&ctrl->scan_work, nvme_scan_work);
 	INIT_WORK(&ctrl->async_event_work, nvme_async_event_work);
 
+	/*
+	 * By default, allow up to 25ms of APST-induced latency.  This will
+	 * have no effect on non-APST supporting controllers (i.e. any
+	 * controller with APSTA == 0).
+	 */
+	ctrl->apst_max_latency_ns = 25000000;
+
 	ret = nvme_set_instance(ctrl);
 	if (ret)
 		goto out;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 383ae22e169e..88cabd643bda 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -129,13 +129,19 @@ struct nvme_ctrl {
 	u32 vs;
 	u32 sgls;
 	u16 kas;
+	u8 npss;
+	u8 apsta;
 	unsigned int kato;
 	bool subsystem;
 	unsigned long quirks;
+	struct nvme_id_power_state psd[32];
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
 
+	/* APST configuration */
+	u64 apst_max_latency_ns;
+
 	/* Fabrics only */
 	u16 sqsize;
 	u32 ioccsz;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index d8b37bab2887..a76237dac4b0 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -543,6 +543,12 @@ struct nvme_dsm_range {
 	__le64			slba;
 };
 
+/* Features */
+
+struct nvme_feat_auto_pst {
+	__le64 entries[32];
+};
+
 /* Admin commands */
 
 enum nvme_admin_opcode {
-- 
2.7.4

* Re: [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-08-29  9:25 ` [PATCH 3/3] nvme: Enable autonomous power state transitions Andy Lutomirski
@ 2016-08-29 15:07   ` J Freyensee
  2016-08-29 23:16     ` Andy Lutomirski
  2016-08-29 16:45   ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: J Freyensee @ 2016-08-29 15:07 UTC (permalink / raw)
  To: Andy Lutomirski, Keith Busch, Jens Axboe
  Cc: Christoph Hellwig, linux-nvme, linux-kernel

On Mon, 2016-08-29 at 02:25 -0700, Andy Lutomirski wrote:
> NVMe devices can advertise multiple power states.  These states can
> be either "operational" (the device is fully functional but possibly
> slow) or "non-operational" (the device is asleep until woken up).
> Some devices can automatically enter a non-operational state when
> idle for a specified amount of time and then automatically wake back
> up when needed.
> 
> The hardware configuration is a table.  For each state, an entry in
> the table indicates the next deeper non-operational state, if any,
> to autonomously transition to and the idle time required before
> transitioning.
> 
> This patch teaches the driver to program APST so that each
> successive non-operational state will be entered after an idle time
> equal to 50 times the total latency (entry plus exit) associated with
> that state.  A sysfs attribute 'apst_max_latency_ns' gives the
> maximum acceptable latency in ns; non-operational states with total
> latency greater than this value will not be used.  As a special
> case, apst_max_latency_ns=0 will disable APST entirely.
> 
> On hardware without APST support, apst_max_latency_ns will not be
> exposed in sysfs.
> 
> In theory, the device can expose a "default" APST table, but this
> doesn't seem to function correctly on my device (Samsung 950), nor
> does it seem particularly useful.  There is also an optional
> mechanism by which a configuration can be "saved" so it will be
> automatically loaded on reset.  This can be configured from
> userspace, but it doesn't seem useful to support in the driver.
> 
> On my laptop, enabling APST seems to save nearly 1W.
> 
> The hardware tables can be decoded in userspace with nvme-cli.
> 'nvme id-ctrl /dev/nvmeN' will show the power state table and
> 'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
> configuration.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  drivers/nvme/host/core.c | 167
> +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h |   6 ++
>  include/linux/nvme.h     |   6 ++
>  3 files changed, 179 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 3f7561ab54dc..042137ad2437 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1223,6 +1223,98 @@ static void nvme_set_queue_limits(struct
> nvme_ctrl *ctrl,
>  	blk_queue_write_cache(q, vwc, vwc);
>  }
>  
> +static void nvme_configure_apst(struct nvme_ctrl *ctrl)
> +{
> +	/*
> +	 * APST (Autonomous Power State Transition) lets us program
> a
> +	 * table of power state transitions that the controller will
> +	 * perform automatically.  We configure it with a simple
> +	 * heuristic: we are willing to spend at most 2% of the time
> +	 * transitioning between power states.  Therefore, when
> running
> +	 * in any given state, we will enter the next lower-power
> +	 * non-operational state after waiting 50 * (enlat + exlat)
> +	 * microseconds, as long as that state's total latency is
> under
> +	 * the requested maximum latency.
> +	 *
> +	 * We will not autonomously enter any non-operational state
> for
> +	 * which the total latency exceeds
> apst_max_latency_ns.  Users
> +	 * can set apst_max_latency_ns to zero to turn off APST.
> +	 */
> +
> +	unsigned apste;
> +	struct nvme_feat_auto_pst *table;
> +	int ret;
> +
> +	if (!ctrl->apsta)
> +		return;	/* APST isn't supported. */
> +
> +	if (ctrl->npss > 31) {
> +		dev_warn(ctrl->device, "NPSS is invalid; disabling
> APST\n");

Quick question.  A little bit below, in a later if() block, apste is set
to 0 to turn off APST, which is then used in an nvme_set_features()
call to actually turn it off.  Wouldn't you want to also set apste to
zero here and call nvme_set_features() to "disable APST"?

I guess I'm a little confused by the error statement, "disabling APST",
when it doesn't seem like anything is being done to actually disable
APST; it's just an invalid value retrieved from the HW.

> +		return;
> +	}
> +
> +	table = kzalloc(sizeof(*table), GFP_KERNEL);
> +	if (!table)
> +		return;
> +
> +	if (ctrl->apst_max_latency_ns == 0) {
> +		/* Turn off APST. */
> +		apste = 0;
> +	} else {
> +		__le64 target = cpu_to_le64(0);
> +		int state;
> +
> +		/*
> +		 * Walk through all states from lowest- to highest-
> power.
> +		 * According to the spec, lower-numbered states use
> more
> +		 * power.  NPSS, despite the name, is the index of
> the
> +		 * lowest-power state, not the number of states.
> +		 */
> +		for (state = (int)ctrl->npss; state >= 0; state--) {
> +			u64 total_latency_us, transition_ms;
> +
> +			if (target)
> +				table->entries[state] = target;
> +
> +			/*
> +			 * Is this state a useful non-operational
> state for
> +			 * higher-power states to autonomously
> transition to?
> +			 */
> +			if (!(ctrl->psd[state].flags & 2))
> +				continue;  /* It's an operational
> state. */
> +
> +			total_latency_us =
> +				(u64)cpu_to_le32(ctrl-
> >psd[state].entry_lat) +
> +				+ cpu_to_le32(ctrl-
> >psd[state].exit_lat);
> +			if (total_latency_us * 1000 > ctrl-
> >apst_max_latency_ns)
> +				continue;
> +
> +			/*
> +			 * This state is good.  Use it as the APST
> idle
> +			 * target for higher power states.
> +			 */
> +			transition_ms = total_latency_us + 19;
> +			do_div(transition_ms, 20);
> +			if (transition_ms >= (1 << 24))
> +				transition_ms = (1 << 24);

Is it possible to use a macro for this bit shift, as it's used more
than once?

> +
> +			target = cpu_to_le64((state << 3) |
> +					     (transition_ms << 8));
> +		}
> 

snip...
.
.
.

> +	/*
> +	 * By default, allow up to 25ms of APST-induced
> latency.  This will
> +	 * have no effect on non-APST supporting controllers (i.e.
> any
> +	 * controller with APSTA == 0).
> +	 */
> +	ctrl->apst_max_latency_ns = 25000000;

Is it possible to make that a #define please?

Nice stuff!



* Re: [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  2016-08-29  9:25 ` [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features() Andy Lutomirski
@ 2016-08-29 16:27   ` Keith Busch
  2016-08-29 23:20     ` Andy Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2016-08-29 16:27 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Jens Axboe, linux-nvme, Christoph Hellwig, linux-kernel

On Mon, Aug 29, 2016 at 02:25:45AM -0700, Andy Lutomirski wrote:
> +	/*
> +	 * A controller "page" may be bigger than a Linux page, but we can
> +	 * be conservative here.
> +	 */

It is actually the other way around: the Linux page may be larger than the
controller's. We currently use the smallest possible controller page (4k)
regardless of the host's size due to limitations discovering the CPU's
DMA alignment. PPC was the first to encounter this problem with NVMe.

Otherwise, looks good.

* Re: [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-08-29  9:25 ` [PATCH 3/3] nvme: Enable autonomous power state transitions Andy Lutomirski
  2016-08-29 15:07   ` J Freyensee
@ 2016-08-29 16:45   ` Keith Busch
  2016-08-29 23:16     ` Andy Lutomirski
  1 sibling, 1 reply; 15+ messages in thread
From: Keith Busch @ 2016-08-29 16:45 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Jens Axboe, linux-nvme, Christoph Hellwig, linux-kernel

On Mon, Aug 29, 2016 at 02:25:46AM -0700, Andy Lutomirski wrote:
> +	/*
> +	 * By default, allow up to 25ms of APST-induced latency.  This will
> +	 * have no effect on non-APST supporting controllers (i.e. any
> +	 * controller with APSTA == 0).
> +	 */
> +	ctrl->apst_max_latency_ns = 25000000;

Any objection to making this a module parameter? 25ms default sounds
reasonable, but I would still like the option to not ever initialize
APST on a capable device.

* Re: [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-08-29 16:45   ` Keith Busch
@ 2016-08-29 23:16     ` Andy Lutomirski
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-29 23:16 UTC (permalink / raw)
  To: Keith Busch; +Cc: Christoph Hellwig, linux-nvme, linux-kernel, Jens Axboe

On Aug 29, 2016 9:35 AM, "Keith Busch" <keith.busch@intel.com> wrote:
>
> On Mon, Aug 29, 2016 at 02:25:46AM -0700, Andy Lutomirski wrote:
> > +     /*
> > +      * By default, allow up to 25ms of APST-induced latency.  This will
> > +      * have no effect on non-APST supporting controllers (i.e. any
> > +      * controller with APSTA == 0).
> > +      */
> > +     ctrl->apst_max_latency_ns = 25000000;
>
> Any objection to making this a module parameter? 25ms default sounds
> reasonable, but I would still like the option to not ever initialize
> APST on a capable device.

Will do.

* Re: [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-08-29 15:07   ` J Freyensee
@ 2016-08-29 23:16     ` Andy Lutomirski
  2016-08-30 20:21       ` Andy Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-29 23:16 UTC (permalink / raw)
  To: J Freyensee
  Cc: linux-nvme, Keith Busch, Christoph Hellwig, linux-kernel, Jens Axboe

On Aug 29, 2016 8:07 AM, "J Freyensee"
<james_p_freyensee@linux.intel.com> wrote:
>
> On Mon, 2016-08-29 at 02:25 -0700, Andy Lutomirski wrote:
> > NVME devices can advertise multiple power states.  These states can
> > be either "operational" (the device is fully functional but possibly
> > slow) or "non-operational" (the device is asleep until woken up).
> > Some devices can automatically enter a non-operational state when
> > idle for a specified amount of time and then automatically wake back
> > up when needed.
> >
> > The hardware configuration is a table.  For each state, an entry in
> > the table indicates the next deeper non-operational state, if any,
> > to autonomously transition to and the idle time required before
> > transitioning.
> >
> > This patch teaches the driver to program APST so that each
> > successive non-operational state will be entered after an idle time
> > equal to 100% of the total latency (entry plus exit) associated with
> > that state.  A sysfs attribute 'apst_max_latency_ns' gives the
> > maximum acceptable latency in ns; non-operational states with total
> > latency greater than this value will not be used.  As a special
> > case, apst_max_latency_ns=0 will disable APST entirely.
> >
> > On hardware without APST support, apst_max_latency_ns will not be
> > exposed in sysfs.
> >
> > In theory, the device can expose "default" APST table, but this
> > doesn't seem to function correctly on my device (Samsung 950), nor
> > does it seem particularly useful.  There is also an optional
> > mechanism by which a configuration can be "saved" so it will be
> > automatically loaded on reset.  This can be configured from
> > userspace, but it doesn't seem useful to support in the driver.
> >
> > On my laptop, enabling APST seems to save nearly 1W.
> >
> > The hardware tables can be decoded in userspace with nvme-cli.
> > 'nvme id-ctrl /dev/nvmeN' will show the power state table and
> > 'nvme get-feature -f 0x0c -H /dev/nvme0' will show the current APST
> > configuration.
> >
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > ---
> >  drivers/nvme/host/core.c | 167
> > +++++++++++++++++++++++++++++++++++++++++++++++
> >  drivers/nvme/host/nvme.h |   6 ++
> >  include/linux/nvme.h     |   6 ++
> >  3 files changed, 179 insertions(+)
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 3f7561ab54dc..042137ad2437 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -1223,6 +1223,98 @@ static void nvme_set_queue_limits(struct
> > nvme_ctrl *ctrl,
> >       blk_queue_write_cache(q, vwc, vwc);
> >  }
> >
> > +static void nvme_configure_apst(struct nvme_ctrl *ctrl)
> > +{
> > +     /*
> > +      * APST (Autonomous Power State Transition) lets us program a
> > +      * table of power state transitions that the controller will
> > +      * perform automatically.  We configure it with a simple
> > +      * heuristic: we are willing to spend at most 2% of the time
> > +      * transitioning between power states.  Therefore, when running
> > +      * in any given state, we will enter the next lower-power
> > +      * non-operational state after waiting 100 * (enlat + exlat)
> > +      * microseconds, as long as that state's total latency is under
> > +      * the requested maximum latency.
> > +      *
> > +      * We will not autonomously enter any non-operational state for
> > +      * which the total latency exceeds apst_max_latency_ns.  Users
> > +      * can set apst_max_latency_ns to zero to turn off APST.
> > +      */
> > +
> > +     unsigned apste;
> > +     struct nvme_feat_auto_pst *table;
> > +     int ret;
> > +
> > +     if (!ctrl->apsta)
> > +             return; /* APST isn't supported. */
> > +
> > +     if (ctrl->npss > 31) {
> > +             dev_warn(ctrl->device, "NPSS is invalid; disabling APST\n");
>
> Quick question.  A little bit below in a later if() block, apste is set
> to 0 to turn off APST, which is to be used later in a
> nvme_set_features() call to actually turn it off.  You wouldn't want to
> also set apste to zero too and call a nvme_set_features() to "disable
> APST"?
>
> I guess I'm a little confused on the error statement, "disabling APST",
> when it doesn't seem like anything is being done to actually disable
> APST, it's just more of an invalid state retrieved from the HW.

I guess that should be "not using APST" instead.

>
>
> > +             return;
> > +     }
> > +
> > +     table = kzalloc(sizeof(*table), GFP_KERNEL);
> > +     if (!table)
> > +             return;
> > +
> > +     if (ctrl->apst_max_latency_ns == 0) {
> > +             /* Turn off APST. */
> > +             apste = 0;
> > +     } else {
> > +             __le64 target = cpu_to_le64(0);
> > +             int state;
> > +
> > +             /*
> > +              * Walk through all states from lowest- to highest-power.
> > +              * According to the spec, lower-numbered states use more
> > +              * power.  NPSS, despite the name, is the index of the
> > +              * lowest-power state, not the number of states.
> > +              */
> > +             for (state = (int)ctrl->npss; state >= 0; state--) {
> > +                     u64 total_latency_us, transition_ms;
> > +
> > +                     if (target)
> > +                             table->entries[state] = target;
> > +
> > +                     /*
> > +                      * Is this state a useful non-operational state for
> > +                      * higher-power states to autonomously transition to?
> > +                      */
> > +                     if (!(ctrl->psd[state].flags & 2))
> > +                             continue;  /* It's an operational state. */
> > +
> > +                     total_latency_us =
> > +                             (u64)cpu_to_le32(ctrl->psd[state].entry_lat) +
> > +                             + cpu_to_le32(ctrl->psd[state].exit_lat);
> > +                     if (total_latency_us * 1000 > ctrl->apst_max_latency_ns)
> > +                             continue;
> > +
> > +                     /*
> > +                      * This state is good.  Use it as the APST idle
> > +                      * target for higher power states.
> > +                      */
> > +                     transition_ms = total_latency_us + 19;
> > +                     do_div(transition_ms, 20);
> > +                     if (transition_ms >= (1 << 24))
> > +                             transition_ms = (1 << 24);
>
> Is it possible to use a macro for this bit shift as its used more than
> once?

Sure, will do.

>
> > +
> > +                     target = cpu_to_le64((state << 3) |
> > +                                          (transition_ms << 8));
> > +             }
> >
>
> snip...
> .
> .
> .
>
> > +     /*
> > +      * By default, allow up to 25ms of APST-induced latency.  This will
> > +      * have no effect on non-APST supporting controllers (i.e. any
> > +      * controller with APSTA == 0).
> > +      */
> > +     ctrl->apst_max_latency_ns = 25000000;
>
> Is it possible to make that a #define please?

I'll make it a module parameter as Keith suggested.

>
> Nice stuff!
>
>
> >

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  2016-08-29 16:27   ` Keith Busch
@ 2016-08-29 23:20     ` Andy Lutomirski
  2016-08-30  6:36       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-29 23:20 UTC (permalink / raw)
  To: Keith Busch
  Cc: Andy Lutomirski, Jens Axboe, linux-nvme, Christoph Hellwig, linux-kernel

On Mon, Aug 29, 2016 at 9:27 AM, Keith Busch <keith.busch@intel.com> wrote:
> On Mon, Aug 29, 2016 at 02:25:45AM -0700, Andy Lutomirski wrote:
>> +     /*
>> +      * A controller "page" may be bigger than a Linux page, but we can
>> +      * be conservative here.
>> +      */
>
It is actually the other way around: the Linux page may be larger than the
> controller's. We currently use the smallest possible controller page (4k)
> regardless of the host's size due to limitations discovering the CPU's
> DMA alignment. PPC was the first to encounter this problem with NVMe.

The "Set Features" command (section 5.15) Figure 103 says:

If using PRPs, this field shall not be a pointer to a PRP List as the
data buffer may not cross more than one page boundary. If no data
structure is used as part of the specified feature, then this field is
not used.

Does the Linux driver use PRPs?  Do we need to worry about kmalloc
returning a buffer that spans a 4k boundary but does not span a Linux
page boundary?

--Andy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  2016-08-29 23:20     ` Andy Lutomirski
@ 2016-08-30  6:36       ` Christoph Hellwig
  2016-08-30 16:00         ` Andy Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2016-08-30  6:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Keith Busch, Andy Lutomirski, Jens Axboe, linux-nvme,
	Christoph Hellwig, linux-kernel

On Mon, Aug 29, 2016 at 04:20:43PM -0700, Andy Lutomirski wrote:
> The "Set Features" command (section 5.15) Figure 103 says:
> 
> If using PRPs, this field shall not be a pointer to a PRP List as the
> data buffer may not cross more than one page boundary. If no data
> structure is used as part of the specified feature, then this field is
> not used.
> 
> Does the Linux driver use PRPs?

The Linux PCIe driver always uses PRPs - and for admin command only
Fabrics can use SGLs anyway.

> Do we need to worry about kmalloc
> returning a buffer that spans a 4k boundary but does not span a Linux
> page boundary?

Isn't kmalloc supposed to return naturally aligned buffers?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  2016-08-30  6:36       ` Christoph Hellwig
@ 2016-08-30 16:00         ` Andy Lutomirski
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-30 16:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Andy Lutomirski, Jens Axboe, linux-nvme, linux-kernel

On Mon, Aug 29, 2016 at 11:36 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Aug 29, 2016 at 04:20:43PM -0700, Andy Lutomirski wrote:
>> The "Set Features" command (section 5.15) Figure 103 says:
>>
>> If using PRPs, this field shall not be a pointer to a PRP List as the
>> data buffer may not cross more than one page boundary. If no data
>> structure is used as part of the specified feature, then this field is
>> not used.
>>
>> Does the Linux driver use PRPs?
>
> The Linux PCIe driver always uses PRPs - and for admin command only
> Fabrics can use SGLs anyway.
>
>> Do we need to worry about kmalloc
>> returning a buffer that spans a 4k boundary but does not span a Linux
>> page boundary?
>
> Isn't kmalloc supposed to return naturally aligned buffers?

From brief inspection of the code, it looks like kmalloc always
returns a pointer aligned to a biggest power of two that can hold the
allocation except when it uses 96-byte or 192-byte alignment.  96 and
192 don't divide 4k.

However, I think this is all moot because I misunderstood the spec.  It says:

Data Pointer (DPTR): This field specifies the start of the data buffer.
Refer to Figure 11 for the definition of this field. If using PRPs, this
field shall not be a pointer to a PRP List as the data buffer may not
cross more than one page boundary. If no data structure is used as part
of the specified feature, then this field is not used.

It doesn't say "may not cross a page boundary" -- it says it may not
cross *more than one* page boundary.  I think that all it's trying to
say is that there aren't any features that have buffers larger than a
page, so no matter how they're aligned there are at most two PRP
entries, and two PRP entries can be expressed without a PRP List.

So I'm just going to remove the warning.

--Andy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-08-29 23:16     ` Andy Lutomirski
@ 2016-08-30 20:21       ` Andy Lutomirski
  2016-09-02 18:11         ` J Freyensee
  0 siblings, 1 reply; 15+ messages in thread
From: Andy Lutomirski @ 2016-08-30 20:21 UTC (permalink / raw)
  To: J Freyensee
  Cc: linux-nvme, Keith Busch, Christoph Hellwig, linux-kernel, Jens Axboe

On Mon, Aug 29, 2016 at 4:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Aug 29, 2016 8:07 AM, "J Freyensee"
> <james_p_freyensee@linux.intel.com> wrote:
>>
>> On Mon, 2016-08-29 at 02:25 -0700, Andy Lutomirski wrote:
>> > NVME devices can advertise multiple power states.  These states can
>> > be either "operational" (the device is fully functional but possibly
>> > slow) or "non-operational" (the device is asleep until woken up).
>> > Some devices can automatically enter a non-operational state when
>> > idle for a specified amount of time and then automatically wake back
>> > up when needed.
>> >

>>
>> > +     /*
>> > +      * By default, allow up to 25ms of APST-induced latency.  This will
>> > +      * have no effect on non-APST supporting controllers (i.e. any
>> > +      * controller with APSTA == 0).
>> > +      */
>> > +     ctrl->apst_max_latency_ns = 25000000;
>>
>> Is it possible to make that a #define please?
>
> I'll make it a module parameter as Keith suggested.

One question, though: should we call this and the sysfs parameter
apst_max_latency or should it be more generically
power_save_max_latency?  The idea is that we might want to support
non-autonomous transitions some day or even runtime D3.  Or maybe
those should be separately configured if used.

--Andy

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-08-30 20:21       ` Andy Lutomirski
@ 2016-09-02 18:11         ` J Freyensee
  2016-09-02 18:50           ` Andy Lutomirski
  0 siblings, 1 reply; 15+ messages in thread
From: J Freyensee @ 2016-09-02 18:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel


> > > 
> > > 
> > > > 
> > > > +     /*
> > > > +      * By default, allow up to 25ms of APST-induced latency.  This will
> > > > +      * have no effect on non-APST supporting controllers (i.e. any
> > > > +      * controller with APSTA == 0).
> > > > +      */
> > > > +     ctrl->apst_max_latency_ns = 25000000;
> > > 
> > > Is it possible to make that a #define please?
> > 
> > I'll make it a module parameter as Keith suggested.
> 
> One question, though: should we call this and the sysfs parameter
> apst_max_latency or should it be more generically
> power_save_max_latency?  The idea is that we might want to support
>> non-autonomous transitions some day or even runtime D3.  Or maybe
> those should be separately configured if used.

I read the spec and reviewed your latest patchset.  Personally, I like
having the field names from the NVMe spec in the names of the Linux
implementation because it makes it easier to find and relate the two.
So apst_max_latency makes more sense to me, as this is an 'apst'(e/a)
NVMe feature.

> 
> --Andy
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] nvme: Enable autonomous power state transitions
  2016-09-02 18:11         ` J Freyensee
@ 2016-09-02 18:50           ` Andy Lutomirski
  0 siblings, 0 replies; 15+ messages in thread
From: Andy Lutomirski @ 2016-09-02 18:50 UTC (permalink / raw)
  To: J Freyensee
  Cc: Keith Busch, Jens Axboe, Christoph Hellwig, linux-nvme, linux-kernel

On Fri, Sep 2, 2016 at 11:11 AM, J Freyensee
<james_p_freyensee@linux.intel.com> wrote:
>
>> > >
>> > >
>> > > >
>> > > > +     /*
>> > > > +      * By default, allow up to 25ms of APST-induced latency.  This will
>> > > > +      * have no effect on non-APST supporting controllers (i.e. any
>> > > > +      * controller with APSTA == 0).
>> > > > +      */
>> > > > +     ctrl->apst_max_latency_ns = 25000000;
>> > >
>> > > Is it possible to make that a #define please?
>> >
>> > I'll make it a module parameter as Keith suggested.
>>
>> One question, though: should we call this and the sysfs parameter
>> apst_max_latency or should it be more generically
>> power_save_max_latency?  The idea is that we might want to support
>> >> non-autonomous transitions some day or even runtime D3.  Or maybe
>> those should be separately configured if used.
>
> I read the spec and reviewed your latest patchset.  Personally for me I
> like having the field names from the NVMe spec in the names of the
> Linux implementation because it makes it easier to find and relate the
> two.  So apst_max_latency makes more sense to me, as this is a
> 'apst'(e/a) NVMe feature.
>

It's not really an APST feature, though -- it's just the maximum
(entry + exit) latency from the power state table.  So if we ever
supported non-APST power state transitions, we could use the same type
of policy.

I'm not really arguing for changing it, though, and I personally have
no plans to implement a non-autonomous policy.

--Andy

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-09-02 18:51 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-29  9:25 [PATCH 0/3] nvme power saving Andy Lutomirski
2016-08-29  9:25 ` [PATCH 1/3] nvme/scsi: Remove power management support Andy Lutomirski
2016-08-29  9:25 ` [PATCH 2/3] nvme: Pass pointers, not dma addresses, to nvme_get/set_features() Andy Lutomirski
2016-08-29 16:27   ` Keith Busch
2016-08-29 23:20     ` Andy Lutomirski
2016-08-30  6:36       ` Christoph Hellwig
2016-08-30 16:00         ` Andy Lutomirski
2016-08-29  9:25 ` [PATCH 3/3] nvme: Enable autonomous power state transitions Andy Lutomirski
2016-08-29 15:07   ` J Freyensee
2016-08-29 23:16     ` Andy Lutomirski
2016-08-30 20:21       ` Andy Lutomirski
2016-09-02 18:11         ` J Freyensee
2016-09-02 18:50           ` Andy Lutomirski
2016-08-29 16:45   ` Keith Busch
2016-08-29 23:16     ` Andy Lutomirski
