All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/3] nvme: add thermal zone devices
@ 2019-05-26 16:29 ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)
  To: linux-nvme, linux-pm
  Cc: Akinobu Mita, Zhang Rui, Eduardo Valentin, Daniel Lezcano,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

The NVMe controller reports up to nine temperature values in the SMART /
Health log page (the composite temperature and temperature sensor 1 through
temperature sensor 8).
The temperature threshold feature (Feature Identifier 04h) configures the
asynchronous event request command to complete when the temperature is
crossed its corresponding temperature threshold.

This provides these temperatures and thresholds via thermal zone devices.

* v3
- Change the type name of thermal zone devices from 'nvme_temp<sensor>' to
  'nvme<instance>_temp<sensor>'
- Pass a NULL to the status argument of nvme_set_feature()
- Change the name of symbolic link from 'nvme_temp<sensor>' to 'temp<sensor>'
- Don't make it fatal error if the device provides a response
- Don't register thermal zone for composite temperature if smart log reports
  zero value
- Move the thermal zones registration and unregistration into the core module.

* v2
- s/correspoinding/corresponding/ typo in commit log
- Borrowed nvme_get_features() from Keith's patch
- Temperature threshold notification is splitted into another patch
- Change the data type of 'sensor' to unsigned
- Add BUILD_BUG_ON for the array size of tzdev member in nvme_ctrl
- Add WARN_ON_ONCE for paranoid checks
- Fix off-by-one error in nvme_get_temp
- Validate 'sensor' where the value is actually used
- Define and utilize two enums related to the temperature threshold feature
- Remove hysteresis value for this trip point and don't utilize the under
  temperature threshold
- Print error message for thermal_zone_device_register() failure
- Add function comments for nvme_thermal_zones_{,un}register
- Suppress non-fatal errors from nvme_thermal_zones_register()
- Add comment about implemented temperature sensors 
- Instead of creating a new 'thermal_work', append async smart event's
  action to the existing async_event_work
- Add comment for tzdev member in nvme_ctrl
- Call nvme_thermal_zones_unregister() earlier than the last reference
  release


Akinobu Mita (2):
  nvme: add thermal zone devices
  nvme: notify thermal framework when temperature threshold events occur

Keith Busch (1):
  nvme: Export get and set features

 drivers/nvme/host/core.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  15 +++
 include/linux/nvme.h     |  12 ++
 3 files changed, 364 insertions(+), 3 deletions(-)

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Minwoo Im <minwoo.im.dev@gmail.com>
Cc: Kenneth Heitke <kenneth.heitke@intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
-- 
2.7.4


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 0/3] nvme: add thermal zone devices
@ 2019-05-26 16:29 ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)


The NVMe controller reports up to nine temperature values in the SMART /
Health log page (the composite temperature and temperature sensor 1 through
temperature sensor 8).
The temperature threshold feature (Feature Identifier 04h) configures the
asynchronous event request command to complete when the temperature is
crossed its corresponding temperature threshold.

This provides these temperatures and thresholds via thermal zone devices.

* v3
- Change the type name of thermal zone devices from 'nvme_temp<sensor>' to
  'nvme<instance>_temp<sensor>'
- Pass a NULL to the status argument of nvme_set_feature()
- Change the name of symbolic link from 'nvme_temp<sensor>' to 'temp<sensor>'
- Don't make it fatal error if the device provides a response
- Don't register thermal zone for composite temperature if smart log reports
  zero value
- Move the thermal zones registration and unregistration into the core module.

* v2
- s/correspoinding/corresponding/ typo in commit log
- Borrowed nvme_get_features() from Keith's patch
- Temperature threshold notification is splitted into another patch
- Change the data type of 'sensor' to unsigned
- Add BUILD_BUG_ON for the array size of tzdev member in nvme_ctrl
- Add WARN_ON_ONCE for paranoid checks
- Fix off-by-one error in nvme_get_temp
- Validate 'sensor' where the value is actually used
- Define and utilize two enums related to the temperature threshold feature
- Remove hysteresis value for this trip point and don't utilize the under
  temperature threshold
- Print error message for thermal_zone_device_register() failure
- Add function comments for nvme_thermal_zones_{,un}register
- Suppress non-fatal errors from nvme_thermal_zones_register()
- Add comment about implemented temperature sensors 
- Instead of creating a new 'thermal_work', append async smart event's
  action to the existing async_event_work
- Add comment for tzdev member in nvme_ctrl
- Call nvme_thermal_zones_unregister() earlier than the last reference
  release


Akinobu Mita (2):
  nvme: add thermal zone devices
  nvme: notify thermal framework when temperature threshold events occur

Keith Busch (1):
  nvme: Export get and set features

 drivers/nvme/host/core.c | 340 ++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/nvme/host/nvme.h |  15 +++
 include/linux/nvme.h     |  12 ++
 3 files changed, 364 insertions(+), 3 deletions(-)

Cc: Zhang Rui <rui.zhang at intel.com>
Cc: Eduardo Valentin <edubezval at gmail.com>
Cc: Daniel Lezcano <daniel.lezcano at linaro.org>
Cc: Keith Busch <keith.busch at intel.com>
Cc: Jens Axboe <axboe at fb.com>
Cc: Christoph Hellwig <hch at lst.de>
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: Minwoo Im <minwoo.im.dev at gmail.com>
Cc: Kenneth Heitke <kenneth.heitke at intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni at wdc.com>
-- 
2.7.4

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 1/3] nvme: Export get and set features
  2019-05-26 16:29 ` Akinobu Mita
@ 2019-05-26 16:29   ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)
  To: linux-nvme, linux-pm; +Cc: Keith Busch, Akinobu Mita

From: Keith Busch <keith.busch@intel.com>

Future use intends to make use of both, so export these functions. And
since their implementation is identical except for the opcode, provide a
new function that implement both.

[akinobu.mita@gmail.com>: fix line over 80 characters]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
 drivers/nvme/host/core.c | 24 +++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  6 ++++++
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c6a29a3..c950916 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1113,15 +1113,15 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
 	return id;
 }
 
-static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
-		      void *buffer, size_t buflen, u32 *result)
+static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
+		unsigned int dword11, void *buffer, size_t buflen, u32 *result)
 {
 	struct nvme_command c;
 	union nvme_result res;
 	int ret;
 
 	memset(&c, 0, sizeof(c));
-	c.features.opcode = nvme_admin_set_features;
+	c.features.opcode = op;
 	c.features.fid = cpu_to_le32(fid);
 	c.features.dword11 = cpu_to_le32(dword11);
 
@@ -1132,6 +1132,24 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
 	return ret;
 }
 
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result)
+{
+	return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
+			     buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_set_features);
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result)
+{
+	return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
+			     buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_get_features);
+
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 {
 	u32 q_count = (*count - 1) | ((*count - 1) << 16);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index de624ec..802aa19 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -460,6 +460,12 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		union nvme_result *result, void *buffer, unsigned bufflen,
 		unsigned timeout, int qid, int at_head,
 		blk_mq_req_flags_t flags, bool poll);
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result);
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 1/3] nvme: Export get and set features
@ 2019-05-26 16:29   ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)


From: Keith Busch <keith.busch@intel.com>

Future use intends to make use of both, so export these functions. And
since their implementation is identical except for the opcode, provide a
new function that implement both.

[akinobu.mita at gmail.com>: fix line over 80 characters]
Signed-off-by: Keith Busch <keith.busch at intel.com>
Signed-off-by: Akinobu Mita <akinobu.mita at gmail.com>
---
 drivers/nvme/host/core.c | 24 +++++++++++++++++++++---
 drivers/nvme/host/nvme.h |  6 ++++++
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c6a29a3..c950916 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1113,15 +1113,15 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
 	return id;
 }
 
-static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
-		      void *buffer, size_t buflen, u32 *result)
+static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
+		unsigned int dword11, void *buffer, size_t buflen, u32 *result)
 {
 	struct nvme_command c;
 	union nvme_result res;
 	int ret;
 
 	memset(&c, 0, sizeof(c));
-	c.features.opcode = nvme_admin_set_features;
+	c.features.opcode = op;
 	c.features.fid = cpu_to_le32(fid);
 	c.features.dword11 = cpu_to_le32(dword11);
 
@@ -1132,6 +1132,24 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
 	return ret;
 }
 
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result)
+{
+	return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
+			     buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_set_features);
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result)
+{
+	return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
+			     buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_get_features);
+
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 {
 	u32 q_count = (*count - 1) | ((*count - 1) << 16);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index de624ec..802aa19 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -460,6 +460,12 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		union nvme_result *result, void *buffer, unsigned bufflen,
 		unsigned timeout, int qid, int at_head,
 		blk_mq_req_flags_t flags, bool poll);
+int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result);
+int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
+		      unsigned int dword11, void *buffer, size_t buflen,
+		      u32 *result);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
  2019-05-26 16:29 ` Akinobu Mita
@ 2019-05-26 16:29   ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)
  To: linux-nvme, linux-pm
  Cc: Akinobu Mita, Zhang Rui, Eduardo Valentin, Daniel Lezcano,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

The NVMe controller reports up to nine temperature values in the SMART /
Health log page (the composite temperature and temperature sensor 1 through
temperature sensor 8).

This provides these temperatures via thermal zone devices.

Once the controller is identified, the thermal zone devices are created for
all implemented temperature sensors including the composite temperature.

/sys/class/thermal/thermal_zone[0-*]:
    |---type: 'nvme<instance>-temp<sensor>'
    |---temp: Temperature
    |---trip_point_0_temp: Over temperature threshold

The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
device.

On the other hand, the following symlinks to the thermal zone devices are
created in the nvme device sysfs directory.

- temp0: Composite temperature
- temp1: Temperature sensor 1
...
- temp8: Temperature sensor 8

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Minwoo Im <minwoo.im.dev@gmail.com>
Cc: Kenneth Heitke <kenneth.heitke@intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
* v3
- Change the type name of thermal zone devices from 'nvme_temp<sensor>' to
  'nvme<instance>_temp<sensor>'
- Pass a NULL to the status argument of nvme_set_feature()
- Change the name of symbolic link from 'nvme_temp<sensor>' to 'temp<sensor>'
- Don't make it fatal error if the device provides a response
- Don't register thermal zone for composite temperature if smart log reports
  zero value
- Move the thermal zones registration and unregistration into the core module.

 drivers/nvme/host/core.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   9 ++
 include/linux/nvme.h     |   5 +
 3 files changed, 302 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c950916..4c8271a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2200,6 +2200,289 @@ static void nvme_set_latency_tolerance(struct device *dev, s32 val)
 	}
 }
 
+#ifdef CONFIG_THERMAL
+
+static int nvme_get_temp(struct nvme_ctrl *ctrl, unsigned int sensor, int *temp)
+{
+	struct nvme_smart_log *log;
+	int ret;
+
+	BUILD_BUG_ON(ARRAY_SIZE(log->temp_sensor) + 1 !=
+		     ARRAY_SIZE(ctrl->tzdev));
+
+	if (WARN_ON_ONCE(sensor > ARRAY_SIZE(log->temp_sensor)))
+		return -EINVAL;
+
+	log = kzalloc(sizeof(*log), GFP_KERNEL);
+	if (!log)
+		return -ENOMEM;
+
+	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
+			   log, sizeof(*log), 0);
+	if (ret) {
+		ret = ret > 0 ? -EINVAL : ret;
+		goto free_log;
+	}
+
+	if (sensor)
+		*temp = le16_to_cpu(log->temp_sensor[sensor - 1]);
+	else
+		*temp = get_unaligned_le16(log->temperature);
+
+free_log:
+	kfree(log);
+
+	return ret;
+}
+
+static unsigned int nvme_tz_type_to_sensor(const char *type)
+{
+	int instance;
+	unsigned int sensor;
+
+	if (sscanf(type, "nvme%d_temp%u", &instance, &sensor) != 2)
+		return UINT_MAX;
+
+	return sensor;
+}
+
+#define KELVIN_TO_MILLICELSIUS(t) DECI_KELVIN_TO_MILLICELSIUS((t) * 10)
+#define MILLICELSIUS_TO_KELVIN(t) ((MILLICELSIUS_TO_DECI_KELVIN(t) + 5) / 10)
+
+static int nvme_tz_get_temp(struct thermal_zone_device *tzdev, int *temp)
+{
+	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
+	struct nvme_ctrl *ctrl = tzdev->devdata;
+	int ret;
+
+	ret = nvme_get_temp(ctrl, sensor, temp);
+	if (!ret)
+		*temp = KELVIN_TO_MILLICELSIUS(*temp);
+
+	return ret;
+}
+
+static int nvme_tz_get_trip_type(struct thermal_zone_device *tzdev,
+				 int trip, enum thermal_trip_type *type)
+{
+	*type = THERMAL_TRIP_ACTIVE;
+
+	return 0;
+}
+
+static int nvme_get_over_temp_thresh(struct nvme_ctrl *ctrl,
+				     unsigned int sensor, int *temp)
+{
+	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
+	int status;
+	int ret;
+
+	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
+		return -EINVAL;
+
+	ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
+				&status);
+	if (!ret)
+		*temp = status & NVME_TEMP_THRESH_MASK;
+
+	return ret > 0 ? -EINVAL : ret;
+}
+
+static int nvme_set_over_temp_thresh(struct nvme_ctrl *ctrl,
+				     unsigned int sensor, int temp)
+{
+	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
+	int ret;
+
+	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
+		return -EINVAL;
+
+	if (temp > NVME_TEMP_THRESH_MASK)
+		return -EINVAL;
+
+	threshold |= temp & NVME_TEMP_THRESH_MASK;
+
+	ret = nvme_set_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
+				NULL);
+
+	return ret > 0 ? -EINVAL : ret;
+}
+
+static int nvme_tz_get_trip_temp(struct thermal_zone_device *tzdev,
+				 int trip, int *temp)
+{
+	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
+	struct nvme_ctrl *ctrl = tzdev->devdata;
+	int ret;
+
+	ret = nvme_get_over_temp_thresh(ctrl, sensor, temp);
+	if (!ret)
+		*temp = KELVIN_TO_MILLICELSIUS(*temp);
+
+	return ret;
+}
+
+static int nvme_tz_set_trip_temp(struct thermal_zone_device *tzdev,
+				 int trip, int temp)
+{
+	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
+	struct nvme_ctrl *ctrl = tzdev->devdata;
+
+	temp = MILLICELSIUS_TO_KELVIN(temp);
+
+	return nvme_set_over_temp_thresh(ctrl, sensor, temp);
+}
+
+static struct thermal_zone_device_ops nvme_tz_ops = {
+	.get_temp = nvme_tz_get_temp,
+	.get_trip_type = nvme_tz_get_trip_type,
+	.get_trip_temp = nvme_tz_get_trip_temp,
+	.set_trip_temp = nvme_tz_set_trip_temp,
+};
+
+static struct thermal_zone_params nvme_tz_params = {
+	.governor_name = "user_space",
+	.no_hwmon = true,
+};
+
+static struct thermal_zone_device *
+nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
+{
+	struct thermal_zone_device *tzdev;
+	char name[THERMAL_NAME_LENGTH];
+	int ret;
+
+	snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
+
+	tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
+					     &nvme_tz_params, 0, 0);
+	if (IS_ERR(tzdev)) {
+		dev_err(ctrl->device,
+			"Failed to register thermal zone device: %ld\n",
+			PTR_ERR(tzdev));
+		return tzdev;
+	}
+
+	snprintf(name, sizeof(name), "temp%d", sensor);
+	ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
+				name);
+	if (ret)
+		goto device_unregister;
+
+	ret = sysfs_create_link(&tzdev->device.kobj,
+				&ctrl->ctrl_device.kobj, "device");
+	if (ret)
+		goto remove_link;
+
+	return tzdev;
+
+remove_link:
+	sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
+device_unregister:
+	thermal_zone_device_unregister(tzdev);
+
+	return ERR_PTR(ret);
+}
+
+/**
+ * nvme_thermal_zones_register() - register nvme thermal zone devices
+ * @ctrl: controller instance
+ *
+ * This function creates up to nine thermal zone devices for all implemented
+ * temperature sensors including the composite temperature.
+ * Each thermal zone device provides a single trip point temperature that is
+ * associated with an over temperature threshold.
+ */
+static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
+{
+	struct nvme_smart_log *log;
+	int ret;
+	int i;
+
+	log = kzalloc(sizeof(*log), GFP_KERNEL);
+	if (!log)
+		return 0; /* non-fatal error */
+
+	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
+			   log, sizeof(*log), 0);
+	if (ret) {
+		dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
+		/* If the device provided a response, then it's non-fatal */
+		if (ret > 0)
+			ret = 0;
+		goto free_log;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
+		struct thermal_zone_device *tzdev;
+		int temp;
+
+		if (i)
+			temp = le16_to_cpu(log->temp_sensor[i - 1]);
+		else
+			temp = get_unaligned_le16(log->temperature);
+
+		/*
+		 * All implemented temperature sensors report a non-zero value
+		 * in temperature sensor fields in the smart log page.
+		 */
+		if (!temp)
+			continue;
+		if (ctrl->tzdev[i])
+			continue;
+
+		tzdev = nvme_thermal_zone_register(ctrl, i);
+		if (!IS_ERR(tzdev))
+			ctrl->tzdev[i] = tzdev;
+	}
+
+free_log:
+	kfree(log);
+
+	return ret;
+}
+
+/**
+ * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
+ * @ctrl: controller instance
+ *
+ * This function removes the registered thermal zone devices and symlinks.
+ */
+static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
+		struct thermal_zone_device *tzdev = ctrl->tzdev[i];
+		char name[20];
+
+		if (!tzdev)
+			continue;
+
+		sysfs_remove_link(&tzdev->device.kobj, "device");
+
+		snprintf(name, sizeof(name), "temp%d", i);
+		sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
+
+		thermal_zone_device_unregister(tzdev);
+
+		ctrl->tzdev[i] = NULL;
+	}
+}
+
+#else
+
+static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
+{
+	return 0;
+}
+
+static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
+{
+}
+
+#endif /* CONFIG_THERMAL */
+
 struct nvme_core_quirk_entry {
 	/*
 	 * NVMe model and firmware strings are padded with spaces.  For
@@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	if (ret < 0)
 		return ret;
 
+	ret = nvme_thermal_zones_register(ctrl);
+	if (ret < 0)
+		return ret;
+
 	ctrl->identified = true;
 
 	return 0;
@@ -3756,6 +4043,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	nvme_thermal_zones_unregister(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 802aa19..53f0b24 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -15,6 +15,7 @@
 #include <linux/sed-opal.h>
 #include <linux/fault-inject.h>
 #include <linux/rcupdate.h>
+#include <linux/thermal.h>
 
 extern unsigned int nvme_io_timeout;
 #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
@@ -248,6 +249,14 @@ struct nvme_ctrl {
 
 	struct page *discard_page;
 	unsigned long discard_page_busy;
+
+#ifdef CONFIG_THERMAL
+	/*
+	 * tzdev[0]: composite temperature
+	 * tzdev[1-8]: temperature sensor 1 through 8
+	 */
+	struct thermal_zone_device *tzdev[9];
+#endif
 };
 
 enum nvme_iopolicy {
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 658ac75..54f0a13 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -780,6 +780,11 @@ struct nvme_write_zeroes_cmd {
 
 /* Features */
 
+enum {
+	NVME_TEMP_THRESH_MASK		= 0xffff,
+	NVME_TEMP_THRESH_SELECT_SHIFT	= 16,
+};
+
 struct nvme_feat_auto_pst {
 	__le64 entries[32];
 };
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-05-26 16:29   ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)


The NVMe controller reports up to nine temperature values in the SMART /
Health log page (the composite temperature and temperature sensor 1 through
temperature sensor 8).

This provides these temperatures via thermal zone devices.

Once the controller is identified, the thermal zone devices are created for
all implemented temperature sensors including the composite temperature.

/sys/class/thermal/thermal_zone[0-*]:
    |---type: 'nvme<instance>-temp<sensor>'
    |---temp: Temperature
    |---trip_point_0_temp: Over temperature threshold

The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
device.

On the other hand, the following symlinks to the thermal zone devices are
created in the nvme device sysfs directory.

- temp0: Composite temperature
- temp1: Temperature sensor 1
...
- temp8: Temperature sensor 8

Cc: Zhang Rui <rui.zhang at intel.com>
Cc: Eduardo Valentin <edubezval at gmail.com>
Cc: Daniel Lezcano <daniel.lezcano at linaro.org>
Cc: Keith Busch <keith.busch at intel.com>
Cc: Jens Axboe <axboe at fb.com>
Cc: Christoph Hellwig <hch at lst.de>
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: Minwoo Im <minwoo.im.dev at gmail.com>
Cc: Kenneth Heitke <kenneth.heitke at intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni at wdc.com>
Signed-off-by: Akinobu Mita <akinobu.mita at gmail.com>
---
* v3
- Change the type name of thermal zone devices from 'nvme_temp<sensor>' to
  'nvme<instance>_temp<sensor>'
- Pass a NULL to the status argument of nvme_set_feature()
- Change the name of symbolic link from 'nvme_temp<sensor>' to 'temp<sensor>'
- Don't make it fatal error if the device provides a response
- Don't register thermal zone for composite temperature if smart log reports
  zero value
- Move the thermal zones registration and unregistration into the core module.

 drivers/nvme/host/core.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |   9 ++
 include/linux/nvme.h     |   5 +
 3 files changed, 302 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c950916..4c8271a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2200,6 +2200,289 @@ static void nvme_set_latency_tolerance(struct device *dev, s32 val)
 	}
 }
 
+#ifdef CONFIG_THERMAL
+
+static int nvme_get_temp(struct nvme_ctrl *ctrl, unsigned int sensor, int *temp)
+{
+	struct nvme_smart_log *log;
+	int ret;
+
+	BUILD_BUG_ON(ARRAY_SIZE(log->temp_sensor) + 1 !=
+		     ARRAY_SIZE(ctrl->tzdev));
+
+	if (WARN_ON_ONCE(sensor > ARRAY_SIZE(log->temp_sensor)))
+		return -EINVAL;
+
+	log = kzalloc(sizeof(*log), GFP_KERNEL);
+	if (!log)
+		return -ENOMEM;
+
+	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
+			   log, sizeof(*log), 0);
+	if (ret) {
+		ret = ret > 0 ? -EINVAL : ret;
+		goto free_log;
+	}
+
+	if (sensor)
+		*temp = le16_to_cpu(log->temp_sensor[sensor - 1]);
+	else
+		*temp = get_unaligned_le16(log->temperature);
+
+free_log:
+	kfree(log);
+
+	return ret;
+}
+
+static unsigned int nvme_tz_type_to_sensor(const char *type)
+{
+	int instance;
+	unsigned int sensor;
+
+	if (sscanf(type, "nvme%d_temp%u", &instance, &sensor) != 2)
+		return UINT_MAX;
+
+	return sensor;
+}
+
+#define KELVIN_TO_MILLICELSIUS(t) DECI_KELVIN_TO_MILLICELSIUS((t) * 10)
+#define MILLICELSIUS_TO_KELVIN(t) ((MILLICELSIUS_TO_DECI_KELVIN(t) + 5) / 10)
+
+static int nvme_tz_get_temp(struct thermal_zone_device *tzdev, int *temp)
+{
+	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
+	struct nvme_ctrl *ctrl = tzdev->devdata;
+	int ret;
+
+	ret = nvme_get_temp(ctrl, sensor, temp);
+	if (!ret)
+		*temp = KELVIN_TO_MILLICELSIUS(*temp);
+
+	return ret;
+}
+
+static int nvme_tz_get_trip_type(struct thermal_zone_device *tzdev,
+				 int trip, enum thermal_trip_type *type)
+{
+	*type = THERMAL_TRIP_ACTIVE;
+
+	return 0;
+}
+
+static int nvme_get_over_temp_thresh(struct nvme_ctrl *ctrl,
+				     unsigned int sensor, int *temp)
+{
+	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
+	int status;
+	int ret;
+
+	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
+		return -EINVAL;
+
+	ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
+				&status);
+	if (!ret)
+		*temp = status & NVME_TEMP_THRESH_MASK;
+
+	return ret > 0 ? -EINVAL : ret;
+}
+
+static int nvme_set_over_temp_thresh(struct nvme_ctrl *ctrl,
+				     unsigned int sensor, int temp)
+{
+	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
+	int ret;
+
+	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
+		return -EINVAL;
+
+	if (temp > NVME_TEMP_THRESH_MASK)
+		return -EINVAL;
+
+	threshold |= temp & NVME_TEMP_THRESH_MASK;
+
+	ret = nvme_set_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
+				NULL);
+
+	return ret > 0 ? -EINVAL : ret;
+}
+
+static int nvme_tz_get_trip_temp(struct thermal_zone_device *tzdev,
+				 int trip, int *temp)
+{
+	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
+	struct nvme_ctrl *ctrl = tzdev->devdata;
+	int ret;
+
+	ret = nvme_get_over_temp_thresh(ctrl, sensor, temp);
+	if (!ret)
+		*temp = KELVIN_TO_MILLICELSIUS(*temp);
+
+	return ret;
+}
+
+static int nvme_tz_set_trip_temp(struct thermal_zone_device *tzdev,
+				 int trip, int temp)
+{
+	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
+	struct nvme_ctrl *ctrl = tzdev->devdata;
+
+	temp = MILLICELSIUS_TO_KELVIN(temp);
+
+	return nvme_set_over_temp_thresh(ctrl, sensor, temp);
+}
+
+static struct thermal_zone_device_ops nvme_tz_ops = {
+	.get_temp = nvme_tz_get_temp,
+	.get_trip_type = nvme_tz_get_trip_type,
+	.get_trip_temp = nvme_tz_get_trip_temp,
+	.set_trip_temp = nvme_tz_set_trip_temp,
+};
+
+static struct thermal_zone_params nvme_tz_params = {
+	.governor_name = "user_space",
+	.no_hwmon = true,
+};
+
+static struct thermal_zone_device *
+nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
+{
+	struct thermal_zone_device *tzdev;
+	char name[THERMAL_NAME_LENGTH];
+	int ret;
+
+	snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
+
+	tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
+					     &nvme_tz_params, 0, 0);
+	if (IS_ERR(tzdev)) {
+		dev_err(ctrl->device,
+			"Failed to register thermal zone device: %ld\n",
+			PTR_ERR(tzdev));
+		return tzdev;
+	}
+
+	snprintf(name, sizeof(name), "temp%d", sensor);
+	ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
+				name);
+	if (ret)
+		goto device_unregister;
+
+	ret = sysfs_create_link(&tzdev->device.kobj,
+				&ctrl->ctrl_device.kobj, "device");
+	if (ret)
+		goto remove_link;
+
+	return tzdev;
+
+remove_link:
+	sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
+device_unregister:
+	thermal_zone_device_unregister(tzdev);
+
+	return ERR_PTR(ret);
+}
+
+/**
+ * nvme_thermal_zones_register() - register nvme thermal zone devices
+ * @ctrl: controller instance
+ *
+ * This function creates up to nine thermal zone devices for all implemented
+ * temperature sensors including the composite temperature.
+ * Each thermal zone device provides a single trip point temperature that is
+ * associated with an over temperature threshold.
+ */
+static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
+{
+	struct nvme_smart_log *log;
+	int ret;
+	int i;
+
+	log = kzalloc(sizeof(*log), GFP_KERNEL);
+	if (!log)
+		return 0; /* non-fatal error */
+
+	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
+			   log, sizeof(*log), 0);
+	if (ret) {
+		dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
+		/* If the device provided a response, then it's non-fatal */
+		if (ret > 0)
+			ret = 0;
+		goto free_log;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
+		struct thermal_zone_device *tzdev;
+		int temp;
+
+		if (i)
+			temp = le16_to_cpu(log->temp_sensor[i - 1]);
+		else
+			temp = get_unaligned_le16(log->temperature);
+
+		/*
+		 * All implemented temperature sensors report a non-zero value
+		 * in temperature sensor fields in the smart log page.
+		 */
+		if (!temp)
+			continue;
+		if (ctrl->tzdev[i])
+			continue;
+
+		tzdev = nvme_thermal_zone_register(ctrl, i);
+		if (!IS_ERR(tzdev))
+			ctrl->tzdev[i] = tzdev;
+	}
+
+free_log:
+	kfree(log);
+
+	return ret;
+}
+
+/**
+ * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
+ * @ctrl: controller instance
+ *
+ * This function removes the registered thermal zone devices and symlinks.
+ */
+static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
+		struct thermal_zone_device *tzdev = ctrl->tzdev[i];
+		char name[20];
+
+		if (!tzdev)
+			continue;
+
+		sysfs_remove_link(&tzdev->device.kobj, "device");
+
+		snprintf(name, sizeof(name), "temp%d", i);
+		sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
+
+		thermal_zone_device_unregister(tzdev);
+
+		ctrl->tzdev[i] = NULL;
+	}
+}
+
+#else
+
+static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
+{
+	return 0;
+}
+
+static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
+{
+}
+
+#endif /* CONFIG_THERMAL */
+
 struct nvme_core_quirk_entry {
 	/*
 	 * NVMe model and firmware strings are padded with spaces.  For
@@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 	if (ret < 0)
 		return ret;
 
+	ret = nvme_thermal_zones_register(ctrl);
+	if (ret < 0)
+		return ret;
+
 	ctrl->identified = true;
 
 	return 0;
@@ -3756,6 +4043,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
 {
 	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
+	nvme_thermal_zones_unregister(ctrl);
 	flush_work(&ctrl->async_event_work);
 	cancel_work_sync(&ctrl->fw_act_work);
 }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 802aa19..53f0b24 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -15,6 +15,7 @@
 #include <linux/sed-opal.h>
 #include <linux/fault-inject.h>
 #include <linux/rcupdate.h>
+#include <linux/thermal.h>
 
 extern unsigned int nvme_io_timeout;
 #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
@@ -248,6 +249,14 @@ struct nvme_ctrl {
 
 	struct page *discard_page;
 	unsigned long discard_page_busy;
+
+#ifdef CONFIG_THERMAL
+	/*
+	 * tzdev[0]: composite temperature
+	 * tzdev[1-8]: temperature sensor 1 through 8
+	 */
+	struct thermal_zone_device *tzdev[9];
+#endif
 };
 
 enum nvme_iopolicy {
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 658ac75..54f0a13 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -780,6 +780,11 @@ struct nvme_write_zeroes_cmd {
 
 /* Features */
 
+enum {
+	NVME_TEMP_THRESH_MASK		= 0xffff,
+	NVME_TEMP_THRESH_SELECT_SHIFT	= 16,
+};
+
 struct nvme_feat_auto_pst {
 	__le64 entries[32];
 };
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
  2019-05-26 16:29 ` Akinobu Mita
@ 2019-05-26 16:29   ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)
  To: linux-nvme, linux-pm
  Cc: Akinobu Mita, Zhang Rui, Eduardo Valentin, Daniel Lezcano,
	Keith Busch, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

The NVMe controller supports the temperature threshold feature (Feature
Identifier 04h) that enables to configure the asynchronous event request
command to complete when the temperature is crossed its corresponding
temperature threshold.

This enables the reporting of asynchronous events from the controller when
the temperature reached or exceeded a temperature threshold.

In the case of the temperature threshold conditions, this notifies the
thermal framework.

Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Minwoo Im <minwoo.im.dev@gmail.com>
Cc: Kenneth Heitke <kenneth.heitke@intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
---
* v3
- No changes since v2

 drivers/nvme/host/core.c | 28 ++++++++++++++++++++++++++++
 include/linux/nvme.h     |  7 +++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4c8271a..26c8b59 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1186,6 +1186,9 @@ static void nvme_enable_aen(struct nvme_ctrl *ctrl)
 	u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
 	int status;
 
+	if (IS_ENABLED(CONFIG_THERMAL))
+		supported_aens |= NVME_SMART_CRIT_TEMPERATURE;
+
 	if (!supported_aens)
 		return;
 
@@ -2470,6 +2473,16 @@ static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
 	}
 }
 
+static void nvme_thermal_notify_framework(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
+		if (ctrl->tzdev[i])
+			thermal_notify_framework(ctrl->tzdev[i], 0);
+	}
+}
+
 #else
 
 static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
@@ -2481,6 +2494,10 @@ static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
 {
 }
 
+static void nvme_thermal_notify_framework(struct nvme_ctrl *ctrl)
+{
+}
+
 #endif /* CONFIG_THERMAL */
 
 struct nvme_core_quirk_entry {
@@ -3901,6 +3918,16 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
 
+static void nvme_handle_aen_smart(struct nvme_ctrl *ctrl, u32 result)
+{
+	u32 aer_type = result & NVME_AER_TYPE_MASK;
+	u32 aer_info = (result >> NVME_AER_INFO_SHIFT) & NVME_AER_INFO_MASK;
+
+	if (aer_type == NVME_AER_SMART &&
+	    aer_info == NVME_AER_SMART_TEMP_THRESH)
+		nvme_thermal_notify_framework(ctrl);
+}
+
 static void nvme_aen_uevent(struct nvme_ctrl *ctrl)
 {
 	char *envp[2] = { NULL, NULL };
@@ -3922,6 +3949,7 @@ static void nvme_async_event_work(struct work_struct *work)
 	struct nvme_ctrl *ctrl =
 		container_of(work, struct nvme_ctrl, async_event_work);
 
+	nvme_handle_aen_smart(ctrl, ctrl->aen_result);
 	nvme_aen_uevent(ctrl);
 	ctrl->ops->submit_async_event(ctrl);
 }
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 54f0a13..8e7d599 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -507,6 +507,7 @@ enum {
 };
 
 enum {
+	NVME_AER_TYPE_MASK		= 0x7,
 	NVME_AER_ERROR			= 0,
 	NVME_AER_SMART			= 1,
 	NVME_AER_NOTICE			= 2,
@@ -515,6 +516,12 @@ enum {
 };
 
 enum {
+	NVME_AER_INFO_SHIFT		= 8,
+	NVME_AER_INFO_MASK		= 0xff,
+	NVME_AER_SMART_TEMP_THRESH	= 0x01,
+};
+
+enum {
 	NVME_AER_NOTICE_NS_CHANGED	= 0x00,
 	NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
 	NVME_AER_NOTICE_ANA		= 0x03,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
@ 2019-05-26 16:29   ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-26 16:29 UTC (permalink / raw)


The NVMe controller supports the temperature threshold feature (Feature
Identifier 04h) that enables to configure the asynchronous event request
command to complete when the temperature is crossed its corresponding
temperature threshold.

This enables the reporting of asynchronous events from the controller when
the temperature reached or exceeded a temperature threshold.

In the case of the temperature threshold conditions, this notifies the
thermal framework.

Cc: Zhang Rui <rui.zhang at intel.com>
Cc: Eduardo Valentin <edubezval at gmail.com>
Cc: Daniel Lezcano <daniel.lezcano at linaro.org>
Cc: Keith Busch <keith.busch at intel.com>
Cc: Jens Axboe <axboe at fb.com>
Cc: Christoph Hellwig <hch at lst.de>
Cc: Sagi Grimberg <sagi at grimberg.me>
Cc: Minwoo Im <minwoo.im.dev at gmail.com>
Cc: Kenneth Heitke <kenneth.heitke at intel.com>
Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni at wdc.com>
Signed-off-by: Akinobu Mita <akinobu.mita at gmail.com>
---
* v3
- No changes since v2

 drivers/nvme/host/core.c | 28 ++++++++++++++++++++++++++++
 include/linux/nvme.h     |  7 +++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4c8271a..26c8b59 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1186,6 +1186,9 @@ static void nvme_enable_aen(struct nvme_ctrl *ctrl)
 	u32 result, supported_aens = ctrl->oaes & NVME_AEN_SUPPORTED;
 	int status;
 
+	if (IS_ENABLED(CONFIG_THERMAL))
+		supported_aens |= NVME_SMART_CRIT_TEMPERATURE;
+
 	if (!supported_aens)
 		return;
 
@@ -2470,6 +2473,16 @@ static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
 	}
 }
 
+static void nvme_thermal_notify_framework(struct nvme_ctrl *ctrl)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
+		if (ctrl->tzdev[i])
+			thermal_notify_framework(ctrl->tzdev[i], 0);
+	}
+}
+
 #else
 
 static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
@@ -2481,6 +2494,10 @@ static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
 {
 }
 
+static void nvme_thermal_notify_framework(struct nvme_ctrl *ctrl)
+{
+}
+
 #endif /* CONFIG_THERMAL */
 
 struct nvme_core_quirk_entry {
@@ -3901,6 +3918,16 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_remove_namespaces);
 
+static void nvme_handle_aen_smart(struct nvme_ctrl *ctrl, u32 result)
+{
+	u32 aer_type = result & NVME_AER_TYPE_MASK;
+	u32 aer_info = (result >> NVME_AER_INFO_SHIFT) & NVME_AER_INFO_MASK;
+
+	if (aer_type == NVME_AER_SMART &&
+	    aer_info == NVME_AER_SMART_TEMP_THRESH)
+		nvme_thermal_notify_framework(ctrl);
+}
+
 static void nvme_aen_uevent(struct nvme_ctrl *ctrl)
 {
 	char *envp[2] = { NULL, NULL };
@@ -3922,6 +3949,7 @@ static void nvme_async_event_work(struct work_struct *work)
 	struct nvme_ctrl *ctrl =
 		container_of(work, struct nvme_ctrl, async_event_work);
 
+	nvme_handle_aen_smart(ctrl, ctrl->aen_result);
 	nvme_aen_uevent(ctrl);
 	ctrl->ops->submit_async_event(ctrl);
 }
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 54f0a13..8e7d599 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -507,6 +507,7 @@ enum {
 };
 
 enum {
+	NVME_AER_TYPE_MASK		= 0x7,
 	NVME_AER_ERROR			= 0,
 	NVME_AER_SMART			= 1,
 	NVME_AER_NOTICE			= 2,
@@ -515,6 +516,12 @@ enum {
 };
 
 enum {
+	NVME_AER_INFO_SHIFT		= 8,
+	NVME_AER_INFO_MASK		= 0xff,
+	NVME_AER_SMART_TEMP_THRESH	= 0x01,
+};
+
+enum {
 	NVME_AER_NOTICE_NS_CHANGED	= 0x00,
 	NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
 	NVME_AER_NOTICE_ANA		= 0x03,
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 1/3] nvme: Export get and set features
  2019-05-26 16:29   ` Akinobu Mita
@ 2019-05-26 16:45     ` Chaitanya Kulkarni
  -1 siblings, 0 replies; 40+ messages in thread
From: Chaitanya Kulkarni @ 2019-05-26 16:45 UTC (permalink / raw)
  To: Akinobu Mita, linux-nvme, linux-pm; +Cc: Keith Busch

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

On 5/26/19 9:30 AM, Akinobu Mita wrote:
> From: Keith Busch <keith.busch@intel.com>
> 
> Future use intends to make use of both, so export these functions. And
> since their implementation is identical except for the opcode, provide a
> new function that implement both.
> 
> [akinobu.mita@gmail.com>: fix line over 80 characters]
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> ---
>   drivers/nvme/host/core.c | 24 +++++++++++++++++++++---
>   drivers/nvme/host/nvme.h |  6 ++++++
>   2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index c6a29a3..c950916 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1113,15 +1113,15 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
>   	return id;
>   }
>   
> -static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
> -		      void *buffer, size_t buflen, u32 *result)
> +static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
> +		unsigned int dword11, void *buffer, size_t buflen, u32 *result)
>   {
>   	struct nvme_command c;
>   	union nvme_result res;
>   	int ret;
>   
>   	memset(&c, 0, sizeof(c));
> -	c.features.opcode = nvme_admin_set_features;
> +	c.features.opcode = op;
>   	c.features.fid = cpu_to_le32(fid);
>   	c.features.dword11 = cpu_to_le32(dword11);
>   
> @@ -1132,6 +1132,24 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
>   	return ret;
>   }
>   
> +int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result)
> +{
> +	return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
> +			     buflen, result);
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_features);
> +
> +int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result)
> +{
> +	return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
> +			     buflen, result);
> +}
> +EXPORT_SYMBOL_GPL(nvme_get_features);
> +
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   {
>   	u32 q_count = (*count - 1) | ((*count - 1) << 16);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index de624ec..802aa19 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -460,6 +460,12 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
>   		union nvme_result *result, void *buffer, unsigned bufflen,
>   		unsigned timeout, int qid, int at_head,
>   		blk_mq_req_flags_t flags, bool poll);
> +int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result);
> +int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result);
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
>   void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
>   int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 1/3] nvme: Export get and set features
@ 2019-05-26 16:45     ` Chaitanya Kulkarni
  0 siblings, 0 replies; 40+ messages in thread
From: Chaitanya Kulkarni @ 2019-05-26 16:45 UTC (permalink / raw)


Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>

On 5/26/19 9:30 AM, Akinobu Mita wrote:
> From: Keith Busch <keith.busch at intel.com>
> 
> Future use intends to make use of both, so export these functions. And
> since their implementation is identical except for the opcode, provide a
> new function that implement both.
> 
> [akinobu.mita at gmail.com>: fix line over 80 characters]
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> Signed-off-by: Akinobu Mita <akinobu.mita at gmail.com>
> ---
>   drivers/nvme/host/core.c | 24 +++++++++++++++++++++---
>   drivers/nvme/host/nvme.h |  6 ++++++
>   2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index c6a29a3..c950916 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1113,15 +1113,15 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
>   	return id;
>   }
>   
> -static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
> -		      void *buffer, size_t buflen, u32 *result)
> +static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned int fid,
> +		unsigned int dword11, void *buffer, size_t buflen, u32 *result)
>   {
>   	struct nvme_command c;
>   	union nvme_result res;
>   	int ret;
>   
>   	memset(&c, 0, sizeof(c));
> -	c.features.opcode = nvme_admin_set_features;
> +	c.features.opcode = op;
>   	c.features.fid = cpu_to_le32(fid);
>   	c.features.dword11 = cpu_to_le32(dword11);
>   
> @@ -1132,6 +1132,24 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
>   	return ret;
>   }
>   
> +int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result)
> +{
> +	return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
> +			     buflen, result);
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_features);
> +
> +int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result)
> +{
> +	return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
> +			     buflen, result);
> +}
> +EXPORT_SYMBOL_GPL(nvme_get_features);
> +
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>   {
>   	u32 q_count = (*count - 1) | ((*count - 1) << 16);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index de624ec..802aa19 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -460,6 +460,12 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
>   		union nvme_result *result, void *buffer, unsigned bufflen,
>   		unsigned timeout, int qid, int at_head,
>   		blk_mq_req_flags_t flags, bool poll);
> +int nvme_set_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result);
> +int nvme_get_features(struct nvme_ctrl *dev, unsigned int fid,
> +		      unsigned int dword11, void *buffer, size_t buflen,
> +		      u32 *result);
>   int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
>   void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
>   int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-05-26 16:29   ` Akinobu Mita
@ 2019-05-29 15:15     ` Minwoo Im
  -1 siblings, 0 replies; 40+ messages in thread
From: Minwoo Im @ 2019-05-29 15:15 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Kenneth Heitke, Chaitanya Kulkarni

> +/**
> + * nvme_thermal_zones_register() - register nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function creates up to nine thermal zone devices for all implemented
> + * temperature sensors including the composite temperature.
> + * Each thermal zone device provides a single trip point temperature that is
> + * associated with an over temperature threshold.
> + */
> +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_smart_log *log;
> +	int ret;
> +	int i;
> +
> +	log = kzalloc(sizeof(*log), GFP_KERNEL);
> +	if (!log)
> +		return 0; /* non-fatal error */

Do we need to print something like warning here? If kzalloc() fails, it
would be good to be distinguished between the nvme failure and internal
failure like this.

> +
> +	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> +			   log, sizeof(*log), 0);
> +	if (ret) {
> +		dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> +		/* If the device provided a response, then it's non-fatal */
> +		if (ret > 0)
> +			ret = 0;

It seems like that nvme_init_identify() is just check the internal error
which is in negative value now as you have posted.  Why don't we just
return the value of "ret" itself without updating it to 0 ?

> +		goto free_log;
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev;
> +		int temp;
> +
> +		if (i)
> +			temp = le16_to_cpu(log->temp_sensor[i - 1]);
> +		else
> +			temp = get_unaligned_le16(log->temperature);
> +
> +		/*
> +		 * All implemented temperature sensors report a non-zero value
> +		 * in temperature sensor fields in the smart log page.
> +		 */
> +		if (!temp)
> +			continue;
> +		if (ctrl->tzdev[i])
> +			continue;
> +
> +		tzdev = nvme_thermal_zone_register(ctrl, i);
> +		if (!IS_ERR(tzdev))
> +			ctrl->tzdev[i] = tzdev;
> +	}
> +
> +free_log:
> +	kfree(log);
> +
> +	return ret;
> +}
> +
> +/**
> + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function removes the registered thermal zone devices and symlinks.
> + */
> +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> +		char name[20];

Simple query here :)

If we are not going to allow the name of link exceed the length of its
own device name like nvmeX_tempY, then can we THERMAL_NAME_LENGTH macro
here?  If the name of link is not exactly about the device name itself,
then it's fine.  What do you think about it ?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-05-29 15:15     ` Minwoo Im
  0 siblings, 0 replies; 40+ messages in thread
From: Minwoo Im @ 2019-05-29 15:15 UTC (permalink / raw)


> +/**
> + * nvme_thermal_zones_register() - register nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function creates up to nine thermal zone devices for all implemented
> + * temperature sensors including the composite temperature.
> + * Each thermal zone device provides a single trip point temperature that is
> + * associated with an over temperature threshold.
> + */
> +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_smart_log *log;
> +	int ret;
> +	int i;
> +
> +	log = kzalloc(sizeof(*log), GFP_KERNEL);
> +	if (!log)
> +		return 0; /* non-fatal error */

Do we need to print something like warning here? If kzalloc() fails, it
would be good to be distinguished between the nvme failure and internal
failure like this.

> +
> +	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> +			   log, sizeof(*log), 0);
> +	if (ret) {
> +		dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> +		/* If the device provided a response, then it's non-fatal */
> +		if (ret > 0)
> +			ret = 0;

It seems like that nvme_init_identify() is just check the internal error
which is in negative value now as you have posted.  Why don't we just
return the value of "ret" itself without updating it to 0 ?

> +		goto free_log;
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev;
> +		int temp;
> +
> +		if (i)
> +			temp = le16_to_cpu(log->temp_sensor[i - 1]);
> +		else
> +			temp = get_unaligned_le16(log->temperature);
> +
> +		/*
> +		 * All implemented temperature sensors report a non-zero value
> +		 * in temperature sensor fields in the smart log page.
> +		 */
> +		if (!temp)
> +			continue;
> +		if (ctrl->tzdev[i])
> +			continue;
> +
> +		tzdev = nvme_thermal_zone_register(ctrl, i);
> +		if (!IS_ERR(tzdev))
> +			ctrl->tzdev[i] = tzdev;
> +	}
> +
> +free_log:
> +	kfree(log);
> +
> +	return ret;
> +}
> +
> +/**
> + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function removes the registered thermal zone devices and symlinks.
> + */
> +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> +		char name[20];

Simple query here :)

If we are not going to allow the name of link exceed the length of its
own device name like nvmeX_tempY, then can we THERMAL_NAME_LENGTH macro
here?  If the name of link is not exactly about the device name itself,
then it's fine.  What do you think about it ?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 1/3] nvme: Export get and set features
  2019-05-26 16:29   ` Akinobu Mita
@ 2019-05-29 15:19     ` Minwoo Im
  -1 siblings, 0 replies; 40+ messages in thread
From: Minwoo Im @ 2019-05-29 15:19 UTC (permalink / raw)
  To: Akinobu Mita; +Cc: linux-nvme, linux-pm, Keith Busch

Looks good to me.

Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 1/3] nvme: Export get and set features
@ 2019-05-29 15:19     ` Minwoo Im
  0 siblings, 0 replies; 40+ messages in thread
From: Minwoo Im @ 2019-05-29 15:19 UTC (permalink / raw)


Looks good to me.

Reviewed-by: Minwoo Im <minwoo.im.dev at gmail.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-05-29 15:15     ` Minwoo Im
@ 2019-05-29 16:47       ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-29 16:47 UTC (permalink / raw)
  To: Minwoo Im
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Kenneth Heitke, Chaitanya Kulkarni

2019年5月30日(木) 0:15 Minwoo Im <minwoo.im.dev@gmail.com>:
>
> > +/**
> > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function creates up to nine thermal zone devices for all implemented
> > + * temperature sensors including the composite temperature.
> > + * Each thermal zone device provides a single trip point temperature that is
> > + * associated with an over temperature threshold.
> > + */
> > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > +{
> > +     struct nvme_smart_log *log;
> > +     int ret;
> > +     int i;
> > +
> > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > +     if (!log)
> > +             return 0; /* non-fatal error */
>
> Do we need to print something like warning here? If kzalloc() fails, it
> would be good to be distinguished between the nvme failure and internal
> failure like this.

We usually remove the error message on kmalloc failure because it's
redundant as long as we don't set __GFP_NOWARN.

> > +
> > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > +                        log, sizeof(*log), 0);
> > +     if (ret) {
> > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > +             /* If the device provided a response, then it's non-fatal */
> > +             if (ret > 0)
> > +                     ret = 0;
>
> It seems like that nvme_init_identify() is just check the internal error
> which is in negative value now as you have posted.  Why don't we just
> return the value of "ret" itself without updating it to 0 ?

Both ways work for me.

I personally prefer not to return (leak) the nvme status code from
foo_register() function.

> > +             goto free_log;
> > +     }
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev;
> > +             int temp;
> > +
> > +             if (i)
> > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > +             else
> > +                     temp = get_unaligned_le16(log->temperature);
> > +
> > +             /*
> > +              * All implemented temperature sensors report a non-zero value
> > +              * in temperature sensor fields in the smart log page.
> > +              */
> > +             if (!temp)
> > +                     continue;
> > +             if (ctrl->tzdev[i])
> > +                     continue;
> > +
> > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > +             if (!IS_ERR(tzdev))
> > +                     ctrl->tzdev[i] = tzdev;
> > +     }
> > +
> > +free_log:
> > +     kfree(log);
> > +
> > +     return ret;
> > +}
> > +
> > +/**
> > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function removes the registered thermal zone devices and symlinks.
> > + */
> > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > +             char name[20];
>
> Simple query here :)
>
> If we are not going to allow the name of link exceed the length of its
> own device name like nvmeX_tempY, then can we THERMAL_NAME_LENGTH macro
> here?  If the name of link is not exactly about the device name itself,
> then it's fine.  What do you think about it ?

Of course we can use THERMAL_NAME_LENGTH here.  But this char array is
used only for the symbolic link name.
It's not used for thermal cooling device type, thermal zone device type,
or thermal governor name.  So I just used a random constant to avoid
confusion.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-05-29 16:47       ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-05-29 16:47 UTC (permalink / raw)


2019?5?30?(?) 0:15 Minwoo Im <minwoo.im.dev at gmail.com>:
>
> > +/**
> > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function creates up to nine thermal zone devices for all implemented
> > + * temperature sensors including the composite temperature.
> > + * Each thermal zone device provides a single trip point temperature that is
> > + * associated with an over temperature threshold.
> > + */
> > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > +{
> > +     struct nvme_smart_log *log;
> > +     int ret;
> > +     int i;
> > +
> > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > +     if (!log)
> > +             return 0; /* non-fatal error */
>
> Do we need to print something like warning here? If kzalloc() fails, it
> would be good to be distinguished between the nvme failure and internal
> failure like this.

We usually remove the error message on kmalloc failure because it's
redundant as long as we don't set __GFP_NOWARN.

> > +
> > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > +                        log, sizeof(*log), 0);
> > +     if (ret) {
> > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > +             /* If the device provided a response, then it's non-fatal */
> > +             if (ret > 0)
> > +                     ret = 0;
>
> It seems like that nvme_init_identify() is just check the internal error
> which is in negative value now as you have posted.  Why don't we just
> return the value of "ret" itself without updating it to 0 ?

Both ways work for me.

I personally prefer not to return (leak) the nvme status code from
foo_register() function.

> > +             goto free_log;
> > +     }
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev;
> > +             int temp;
> > +
> > +             if (i)
> > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > +             else
> > +                     temp = get_unaligned_le16(log->temperature);
> > +
> > +             /*
> > +              * All implemented temperature sensors report a non-zero value
> > +              * in temperature sensor fields in the smart log page.
> > +              */
> > +             if (!temp)
> > +                     continue;
> > +             if (ctrl->tzdev[i])
> > +                     continue;
> > +
> > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > +             if (!IS_ERR(tzdev))
> > +                     ctrl->tzdev[i] = tzdev;
> > +     }
> > +
> > +free_log:
> > +     kfree(log);
> > +
> > +     return ret;
> > +}
> > +
> > +/**
> > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function removes the registered thermal zone devices and symlinks.
> > + */
> > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > +             char name[20];
>
> Simple query here :)
>
> If we are not going to allow the name of link exceed the length of its
> own device name like nvmeX_tempY, then can we THERMAL_NAME_LENGTH macro
> here?  If the name of link is not exactly about the device name itself,
> then it's fine.  What do you think about it ?

Of course we can use THERMAL_NAME_LENGTH here.  But this char array is
used only for the symbolic link name.
It's not used for thermal cooling device type, thermal zone device type,
or thermal governor name.  So I just used a random constant to avoid
confusion.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-05-29 16:47       ` Akinobu Mita
@ 2019-05-30 10:18         ` Minwoo Im
  -1 siblings, 0 replies; 40+ messages in thread
From: Minwoo Im @ 2019-05-30 10:18 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Kenneth Heitke, Chaitanya Kulkarni

On 19-05-30 01:47:08, Akinobu Mita wrote:
> 2019年5月30日(木) 0:15 Minwoo Im <minwoo.im.dev@gmail.com>:
> >
> > > +/**
> > > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function creates up to nine thermal zone devices for all implemented
> > > + * temperature sensors including the composite temperature.
> > > + * Each thermal zone device provides a single trip point temperature that is
> > > + * associated with an over temperature threshold.
> > > + */
> > > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > +{
> > > +     struct nvme_smart_log *log;
> > > +     int ret;
> > > +     int i;
> > > +
> > > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > > +     if (!log)
> > > +             return 0; /* non-fatal error */
> >
> > Do we need to print something like warning here? If kzalloc() fails, it
> > would be good to be distinguished between the nvme failure and internal
> > failure like this.
> 
> We usually remove the error message on kmalloc failure because it's
> redundant as long as we don't set __GFP_NOWARN.

I see what you point.

> 
> > > +
> > > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > > +                        log, sizeof(*log), 0);
> > > +     if (ret) {
> > > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > > +             /* If the device provided a response, then it's non-fatal */
> > > +             if (ret > 0)
> > > +                     ret = 0;
> >
> > It seems like that nvme_init_identify() is just check the internal error
> > which is in negative value now as you have posted.  Why don't we just
> > return the value of "ret" itself without updating it to 0 ?
> 
> Both ways work for me.
> 
> I personally prefer not to return (leak) the nvme status code from
> foo_register() function.

Okay.  In the perspective of registration, the nvme status might not be
needed to be returned.  But I think if we return the nvme status here,
it would be great for the later time when we need to figure out if the nvme
command has failed or not.

But, amyway, I'm fine with this :)

Thanks for your comment on this trivial query.

> 
> > > +             goto free_log;
> > > +     }
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev;
> > > +             int temp;
> > > +
> > > +             if (i)
> > > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > > +             else
> > > +                     temp = get_unaligned_le16(log->temperature);
> > > +
> > > +             /*
> > > +              * All implemented temperature sensors report a non-zero value
> > > +              * in temperature sensor fields in the smart log page.
> > > +              */
> > > +             if (!temp)
> > > +                     continue;
> > > +             if (ctrl->tzdev[i])
> > > +                     continue;
> > > +
> > > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > > +             if (!IS_ERR(tzdev))
> > > +                     ctrl->tzdev[i] = tzdev;
> > > +     }
> > > +
> > > +free_log:
> > > +     kfree(log);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +/**
> > > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function removes the registered thermal zone devices and symlinks.
> > > + */
> > > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > > +{
> > > +     int i;
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > > +             char name[20];
> >
> > Simple query here :)
> >
> > If we are not going to allow the name of link exceed the length of its
> > own device name like nvmeX_tempY, then can we THERMAL_NAME_LENGTH macro
> > here?  If the name of link is not exactly about the device name itself,
> > then it's fine.  What do you think about it ?
> 
> Of course we can use THERMAL_NAME_LENGTH here.  But this char array is
> used only for the symbolic link name.
> It's not used for thermal cooling device type, thermal zone device type,
> or thermal governor name.  So I just used a random constant to avoid
> confusion.

Agreed.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-05-30 10:18         ` Minwoo Im
  0 siblings, 0 replies; 40+ messages in thread
From: Minwoo Im @ 2019-05-30 10:18 UTC (permalink / raw)


On 19-05-30 01:47:08, Akinobu Mita wrote:
> 2019?5?30?(?) 0:15 Minwoo Im <minwoo.im.dev at gmail.com>:
> >
> > > +/**
> > > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function creates up to nine thermal zone devices for all implemented
> > > + * temperature sensors including the composite temperature.
> > > + * Each thermal zone device provides a single trip point temperature that is
> > > + * associated with an over temperature threshold.
> > > + */
> > > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > +{
> > > +     struct nvme_smart_log *log;
> > > +     int ret;
> > > +     int i;
> > > +
> > > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > > +     if (!log)
> > > +             return 0; /* non-fatal error */
> >
> > Do we need to print something like warning here? If kzalloc() fails, it
> > would be good to be distinguished between the nvme failure and internal
> > failure like this.
> 
> We usually remove the error message on kmalloc failure because it's
> redundant as long as we don't set __GFP_NOWARN.

I see what you point.

> 
> > > +
> > > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > > +                        log, sizeof(*log), 0);
> > > +     if (ret) {
> > > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > > +             /* If the device provided a response, then it's non-fatal */
> > > +             if (ret > 0)
> > > +                     ret = 0;
> >
> > It seems like that nvme_init_identify() is just check the internal error
> > which is in negative value now as you have posted.  Why don't we just
> > return the value of "ret" itself without updating it to 0 ?
> 
> Both ways work for me.
> 
> I personally prefer not to return (leak) the nvme status code from
> foo_register() function.

Okay.  In the perspective of registration, the nvme status might not be
needed to be returned.  But I think if we return the nvme status here,
it would be great for the later time when we need to figure out if the nvme
command has failed or not.

But, amyway, I'm fine with this :)

Thanks for your comment on this trivial query.

> 
> > > +             goto free_log;
> > > +     }
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev;
> > > +             int temp;
> > > +
> > > +             if (i)
> > > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > > +             else
> > > +                     temp = get_unaligned_le16(log->temperature);
> > > +
> > > +             /*
> > > +              * All implemented temperature sensors report a non-zero value
> > > +              * in temperature sensor fields in the smart log page.
> > > +              */
> > > +             if (!temp)
> > > +                     continue;
> > > +             if (ctrl->tzdev[i])
> > > +                     continue;
> > > +
> > > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > > +             if (!IS_ERR(tzdev))
> > > +                     ctrl->tzdev[i] = tzdev;
> > > +     }
> > > +
> > > +free_log:
> > > +     kfree(log);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +/**
> > > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function removes the registered thermal zone devices and symlinks.
> > > + */
> > > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > > +{
> > > +     int i;
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > > +             char name[20];
> >
> > Simple query here :)
> >
> > If we are not going to allow the name of link exceed the length of its
> > own device name like nvmeX_tempY, then can we THERMAL_NAME_LENGTH macro
> > here?  If the name of link is not exactly about the device name itself,
> > then it's fine.  What do you think about it ?
> 
> Of course we can use THERMAL_NAME_LENGTH here.  But this char array is
> used only for the symbolic link name.
> It's not used for thermal cooling device type, thermal zone device type,
> or thermal governor name.  So I just used a random constant to avoid
> confusion.

Agreed.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-05-26 16:29   ` Akinobu Mita
@ 2019-06-01  9:02     ` Christoph Hellwig
  -1 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-01  9:02 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

On Mon, May 27, 2019 at 01:29:02AM +0900, Akinobu Mita wrote:
> The NVMe controller reports up to nine temperature values in the SMART /
> Health log page (the composite temperature and temperature sensor 1 through
> temperature sensor 8).
> 
> This provides these temperatures via thermal zone devices.

Can you explain a bit more why we'd do this?  I shows up some sysfs
files, but we could easily do that with nvme-cli, too.  Is there some
greater benefit of this integration?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-01  9:02     ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-01  9:02 UTC (permalink / raw)


On Mon, May 27, 2019@01:29:02AM +0900, Akinobu Mita wrote:
> The NVMe controller reports up to nine temperature values in the SMART /
> Health log page (the composite temperature and temperature sensor 1 through
> temperature sensor 8).
> 
> This provides these temperatures via thermal zone devices.

Can you explain a bit more why we'd do this?  I shows up some sysfs
files, but we could easily do that with nvme-cli, too.  Is there some
greater benefit of this integration?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
  2019-05-26 16:29   ` Akinobu Mita
@ 2019-06-01  9:03     ` Christoph Hellwig
  -1 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-01  9:03 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Christoph Hellwig,
	Sagi Grimberg, Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

So before we allowed userspace to get smart AEN notifications through
uevent, and would expect userspace to clear the AEN.  I think this
could at least potentially break existing userspace by now doing that
in kernel space.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
@ 2019-06-01  9:03     ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-01  9:03 UTC (permalink / raw)


So before we allowed userspace to get smart AEN notifications through
uevent, and would expect userspace to clear the AEN.  I think this
could at least potentially break existing userspace by now doing that
in kernel space.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-06-01  9:02     ` Christoph Hellwig
@ 2019-06-02 13:19       ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-02 13:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Sagi Grimberg,
	Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

2019年6月1日(土) 18:03 Christoph Hellwig <hch@lst.de>:
>
> On Mon, May 27, 2019 at 01:29:02AM +0900, Akinobu Mita wrote:
> > The NVMe controller reports up to nine temperature values in the SMART /
> > Health log page (the composite temperature and temperature sensor 1 through
> > temperature sensor 8).
> >
> > This provides these temperatures via thermal zone devices.
>
> Can you explain a bit more why we'd do this?  I shows up some sysfs
> files, but we could easily do that with nvme-cli, too.  Is there some
> greater benefit of this integration?

As long as the user space thermal governor is used, there is nothing more
than that from a functional perspective.  And I suppose that this is used
with user_space governor (i.e. there is still some work to be able to bind
actual thermal cooling device).

The main purpose of this is to turn on a fan when overheated without
polling the device that could prevent the lower power state transitions.
But as you noted, we could do that with the existing AEN notifications
through uevent.

So frankly speaking, the benefit of this is providing generic thermal sysfs
interface and the tools like tmon (linux/tools/thermal/tmon) can show the
nvme temperatures.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-02 13:19       ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-02 13:19 UTC (permalink / raw)


2019?6?1?(?) 18:03 Christoph Hellwig <hch at lst.de>:
>
> On Mon, May 27, 2019@01:29:02AM +0900, Akinobu Mita wrote:
> > The NVMe controller reports up to nine temperature values in the SMART /
> > Health log page (the composite temperature and temperature sensor 1 through
> > temperature sensor 8).
> >
> > This provides these temperatures via thermal zone devices.
>
> Can you explain a bit more why we'd do this?  I shows up some sysfs
> files, but we could easily do that with nvme-cli, too.  Is there some
> greater benefit of this integration?

As long as the user space thermal governor is used, there is nothing more
than that from a functional perspective.  And I suppose that this is used
with user_space governor (i.e. there is still some work to be able to bind
actual thermal cooling device).

The main purpose of this is to turn on a fan when overheated without
polling the device that could prevent the lower power state transitions.
But as you noted, we could do that with the existing AEN notifications
through uevent.

So frankly speaking, the benefit of this is providing generic thermal sysfs
interface and the tools like tmon (linux/tools/thermal/tmon) can show the
nvme temperatures.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
  2019-06-01  9:03     ` Christoph Hellwig
@ 2019-06-02 13:46       ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-02 13:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Sagi Grimberg,
	Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

2019年6月1日(土) 18:04 Christoph Hellwig <hch@lst.de>:
>
> So before we allowed userspace to get smart AEN notifications through
> uevent, and would expect userspace to clear the AEN.  I think this
> could at least potentially break existing userspace by now doing that
> in kernel space.

This change unconditionally sets NVME_SMART_CRIT_TEMPERATURE flag in
nvme_enable_aen(), it could be problematic for existing userspace.
So it's better to provide a knob to enable/disable the event notification
and we can utilize get_mode()/set_mode() in the thermal_zone_device_ops.

Other than that, this change doesn't remove ctrl->aen_result, so existing
userspace still receives NVME_AEN=* uevents.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
@ 2019-06-02 13:46       ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-02 13:46 UTC (permalink / raw)


2019?6?1?(?) 18:04 Christoph Hellwig <hch at lst.de>:
>
> So before we allowed userspace to get smart AEN notifications through
> uevent, and would expect userspace to clear the AEN.  I think this
> could at least potentially break existing userspace by now doing that
> in kernel space.

This change unconditionally sets NVME_SMART_CRIT_TEMPERATURE flag in
nvme_enable_aen(), it could be problematic for existing userspace.
So it's better to provide a knob to enable/disable the event notification
and we can utilize get_mode()/set_mode() in the thermal_zone_device_ops.

Other than that, this change doesn't remove ctrl->aen_result, so existing
userspace still receives NVME_AEN=* uevents.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-05-26 16:29   ` Akinobu Mita
@ 2019-06-03  2:36     ` Eduardo Valentin
  -1 siblings, 0 replies; 40+ messages in thread
From: Eduardo Valentin @ 2019-06-03  2:36 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-nvme, linux-pm, Zhang Rui, Daniel Lezcano, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Minwoo Im,
	Kenneth Heitke, Chaitanya Kulkarni

Hey Mita,

On Mon, May 27, 2019 at 01:29:02AM +0900, Akinobu Mita wrote:
> The NVMe controller reports up to nine temperature values in the SMART /
> Health log page (the composite temperature and temperature sensor 1 through
> temperature sensor 8).
> 
> This provides these temperatures via thermal zone devices.
> 
> Once the controller is identified, the thermal zone devices are created for
> all implemented temperature sensors including the composite temperature.
> 
> /sys/class/thermal/thermal_zone[0-*]:
>     |---type: 'nvme<instance>-temp<sensor>'
>     |---temp: Temperature
>     |---trip_point_0_temp: Over temperature threshold
> 
> The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
> device.
> 
> On the other hand, the following symlinks to the thermal zone devices are
> created in the nvme device sysfs directory.
> 
> - temp0: Composite temperature
> - temp1: Temperature sensor 1
> ...
> - temp8: Temperature sensor 8
> 

These questions on V2 are still unanswered:
a. Can we get a more descriptive string into tz->type?
b. Can these APIs support DT probing too?

> Cc: Zhang Rui <rui.zhang@intel.com>
> Cc: Eduardo Valentin <edubezval@gmail.com>
> Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Jens Axboe <axboe@fb.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Minwoo Im <minwoo.im.dev@gmail.com>
> Cc: Kenneth Heitke <kenneth.heitke@intel.com>
> Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
> ---
> * v3
> - Change the type name of thermal zone devices from 'nvme_temp<sensor>' to
>   'nvme<instance>_temp<sensor>'
> - Pass a NULL to the status argument of nvme_set_feature()
> - Change the name of symbolic link from 'nvme_temp<sensor>' to 'temp<sensor>'
> - Don't make it fatal error if the device provides a response
> - Don't register thermal zone for composite temperature if smart log reports
>   zero value
> - Move the thermal zones registration and unregistration into the core module.
> 
>  drivers/nvme/host/core.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h |   9 ++
>  include/linux/nvme.h     |   5 +
>  3 files changed, 302 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index c950916..4c8271a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2200,6 +2200,289 @@ static void nvme_set_latency_tolerance(struct device *dev, s32 val)
>  	}
>  }
>  
> +#ifdef CONFIG_THERMAL
> +
> +static int nvme_get_temp(struct nvme_ctrl *ctrl, unsigned int sensor, int *temp)
> +{
> +	struct nvme_smart_log *log;
> +	int ret;
> +
> +	BUILD_BUG_ON(ARRAY_SIZE(log->temp_sensor) + 1 !=
> +		     ARRAY_SIZE(ctrl->tzdev));
> +
> +	if (WARN_ON_ONCE(sensor > ARRAY_SIZE(log->temp_sensor)))
> +		return -EINVAL;
> +
> +	log = kzalloc(sizeof(*log), GFP_KERNEL);
> +	if (!log)
> +		return -ENOMEM;
> +
> +	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> +			   log, sizeof(*log), 0);
> +	if (ret) {
> +		ret = ret > 0 ? -EINVAL : ret;
> +		goto free_log;
> +	}
> +
> +	if (sensor)
> +		*temp = le16_to_cpu(log->temp_sensor[sensor - 1]);
> +	else
> +		*temp = get_unaligned_le16(log->temperature);
> +
> +free_log:
> +	kfree(log);
> +
> +	return ret;
> +}
> +
> +static unsigned int nvme_tz_type_to_sensor(const char *type)
> +{
> +	int instance;
> +	unsigned int sensor;
> +
> +	if (sscanf(type, "nvme%d_temp%u", &instance, &sensor) != 2)
> +		return UINT_MAX;
> +
> +	return sensor;
> +}
> +
> +#define KELVIN_TO_MILLICELSIUS(t) DECI_KELVIN_TO_MILLICELSIUS((t) * 10)
> +#define MILLICELSIUS_TO_KELVIN(t) ((MILLICELSIUS_TO_DECI_KELVIN(t) + 5) / 10)
> +
> +static int nvme_tz_get_temp(struct thermal_zone_device *tzdev, int *temp)
> +{
> +	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
> +	struct nvme_ctrl *ctrl = tzdev->devdata;
> +	int ret;
> +
> +	ret = nvme_get_temp(ctrl, sensor, temp);
> +	if (!ret)
> +		*temp = KELVIN_TO_MILLICELSIUS(*temp);
> +
> +	return ret;
> +}
> +
> +static int nvme_tz_get_trip_type(struct thermal_zone_device *tzdev,
> +				 int trip, enum thermal_trip_type *type)
> +{
> +	*type = THERMAL_TRIP_ACTIVE;
> +
> +	return 0;
> +}
> +
> +static int nvme_get_over_temp_thresh(struct nvme_ctrl *ctrl,
> +				     unsigned int sensor, int *temp)
> +{
> +	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
> +	int status;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
> +		return -EINVAL;
> +
> +	ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
> +				&status);
> +	if (!ret)
> +		*temp = status & NVME_TEMP_THRESH_MASK;
> +
> +	return ret > 0 ? -EINVAL : ret;
> +}
> +
> +static int nvme_set_over_temp_thresh(struct nvme_ctrl *ctrl,
> +				     unsigned int sensor, int temp)
> +{
> +	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
> +		return -EINVAL;
> +
> +	if (temp > NVME_TEMP_THRESH_MASK)
> +		return -EINVAL;
> +
> +	threshold |= temp & NVME_TEMP_THRESH_MASK;
> +
> +	ret = nvme_set_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
> +				NULL);
> +
> +	return ret > 0 ? -EINVAL : ret;
> +}
> +
> +static int nvme_tz_get_trip_temp(struct thermal_zone_device *tzdev,
> +				 int trip, int *temp)
> +{
> +	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
> +	struct nvme_ctrl *ctrl = tzdev->devdata;
> +	int ret;
> +
> +	ret = nvme_get_over_temp_thresh(ctrl, sensor, temp);
> +	if (!ret)
> +		*temp = KELVIN_TO_MILLICELSIUS(*temp);
> +
> +	return ret;
> +}
> +
> +static int nvme_tz_set_trip_temp(struct thermal_zone_device *tzdev,
> +				 int trip, int temp)
> +{
> +	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
> +	struct nvme_ctrl *ctrl = tzdev->devdata;
> +
> +	temp = MILLICELSIUS_TO_KELVIN(temp);
> +
> +	return nvme_set_over_temp_thresh(ctrl, sensor, temp);
> +}
> +
> +static struct thermal_zone_device_ops nvme_tz_ops = {
> +	.get_temp = nvme_tz_get_temp,
> +	.get_trip_type = nvme_tz_get_trip_type,
> +	.get_trip_temp = nvme_tz_get_trip_temp,
> +	.set_trip_temp = nvme_tz_set_trip_temp,
> +};
> +
> +static struct thermal_zone_params nvme_tz_params = {
> +	.governor_name = "user_space",

Also,

Was there any particular reason why defaulting to user_space here?

> +	.no_hwmon = true,
> +};
> +
> +static struct thermal_zone_device *
> +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> +{
> +	struct thermal_zone_device *tzdev;
> +	char name[THERMAL_NAME_LENGTH];
> +	int ret;
> +
> +	snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> +
> +	tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> +					     &nvme_tz_params, 0, 0);
> +	if (IS_ERR(tzdev)) {
> +		dev_err(ctrl->device,
> +			"Failed to register thermal zone device: %ld\n",
> +			PTR_ERR(tzdev));
> +		return tzdev;
> +	}
> +
> +	snprintf(name, sizeof(name), "temp%d", sensor);
> +	ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> +				name);
> +	if (ret)
> +		goto device_unregister;
> +
> +	ret = sysfs_create_link(&tzdev->device.kobj,
> +				&ctrl->ctrl_device.kobj, "device");
> +	if (ret)
> +		goto remove_link;
> +
> +	return tzdev;
> +
> +remove_link:
> +	sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> +device_unregister:
> +	thermal_zone_device_unregister(tzdev);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +/**
> + * nvme_thermal_zones_register() - register nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function creates up to nine thermal zone devices for all implemented
> + * temperature sensors including the composite temperature.
> + * Each thermal zone device provides a single trip point temperature that is
> + * associated with an over temperature threshold.
> + */
> +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_smart_log *log;
> +	int ret;
> +	int i;
> +
> +	log = kzalloc(sizeof(*log), GFP_KERNEL);
> +	if (!log)
> +		return 0; /* non-fatal error */

I am not sure about this API design here. I would leave the error
handling and judging if this is fatal or not to the caller. 
I mean, if I ask to register a nvme thermal zone and I get
a 0 as response, I would assume the thermal zone exists from
now on, right?

> +
> +	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> +			   log, sizeof(*log), 0);
> +	if (ret) {
> +		dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> +		/* If the device provided a response, then it's non-fatal */
> +		if (ret > 0)
> +			ret = 0;
> +		goto free_log;

Once again, hiding errors is never a strategy that scales..

> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev;
> +		int temp;
> +
> +		if (i)
> +			temp = le16_to_cpu(log->temp_sensor[i - 1]);
> +		else
> +			temp = get_unaligned_le16(log->temperature);
> +
> +		/*
> +		 * All implemented temperature sensors report a non-zero value
> +		 * in temperature sensor fields in the smart log page.
> +		 */
> +		if (!temp)
> +			continue;
> +		if (ctrl->tzdev[i])
> +			continue;
> +
> +		tzdev = nvme_thermal_zone_register(ctrl, i);
> +		if (!IS_ERR(tzdev))
> +			ctrl->tzdev[i] = tzdev;

Here again, I would not hide errors, the API should propagate errors
if something goes wrong.

> +	}
> +
> +free_log:
> +	kfree(log);
> +
> +	return ret;
> +}
> +
> +/**
> + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function removes the registered thermal zone devices and symlinks.
> + */
> +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> +		char name[20];
> +
> +		if (!tzdev)
> +			continue;
> +
> +		sysfs_remove_link(&tzdev->device.kobj, "device");
> +
> +		snprintf(name, sizeof(name), "temp%d", i);
> +		sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> +
> +		thermal_zone_device_unregister(tzdev);
> +
> +		ctrl->tzdev[i] = NULL;
> +	}
> +}
> +
> +#else
> +
> +static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> +{
> +	return 0;
> +}
> +
> +static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> +{
> +}
> +
> +#endif /* CONFIG_THERMAL */
> +
>  struct nvme_core_quirk_entry {
>  	/*
>  	 * NVMe model and firmware strings are padded with spaces.  For
> @@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
>  	if (ret < 0)
>  		return ret;
>  
> +	ret = nvme_thermal_zones_register(ctrl);
> +	if (ret < 0)
> +		return ret;


I would definitely keep this code the way it is here in
nvme_init_identify(), but if I read this right, your
nvme_thermal_zones_register() will never return < 0, and therefore this
condition is never met.

I would suggest you simply propagate the errors and keep this piece
of code the way it is, propagating errors too.

> +
>  	ctrl->identified = true;
>  
>  	return 0;
> @@ -3756,6 +4043,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>  {
>  	nvme_mpath_stop(ctrl);
>  	nvme_stop_keep_alive(ctrl);
> +	nvme_thermal_zones_unregister(ctrl);
>  	flush_work(&ctrl->async_event_work);
>  	cancel_work_sync(&ctrl->fw_act_work);
>  }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 802aa19..53f0b24 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -15,6 +15,7 @@
>  #include <linux/sed-opal.h>
>  #include <linux/fault-inject.h>
>  #include <linux/rcupdate.h>
> +#include <linux/thermal.h>
>  
>  extern unsigned int nvme_io_timeout;
>  #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
> @@ -248,6 +249,14 @@ struct nvme_ctrl {
>  
>  	struct page *discard_page;
>  	unsigned long discard_page_busy;
> +
> +#ifdef CONFIG_THERMAL
> +	/*
> +	 * tzdev[0]: composite temperature
> +	 * tzdev[1-8]: temperature sensor 1 through 8
> +	 */
> +	struct thermal_zone_device *tzdev[9];
> +#endif
>  };
>  
>  enum nvme_iopolicy {
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 658ac75..54f0a13 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -780,6 +780,11 @@ struct nvme_write_zeroes_cmd {
>  
>  /* Features */
>  
> +enum {
> +	NVME_TEMP_THRESH_MASK		= 0xffff,
> +	NVME_TEMP_THRESH_SELECT_SHIFT	= 16,
> +};
> +
>  struct nvme_feat_auto_pst {
>  	__le64 entries[32];
>  };
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-03  2:36     ` Eduardo Valentin
  0 siblings, 0 replies; 40+ messages in thread
From: Eduardo Valentin @ 2019-06-03  2:36 UTC (permalink / raw)


Hey Mita,

On Mon, May 27, 2019@01:29:02AM +0900, Akinobu Mita wrote:
> The NVMe controller reports up to nine temperature values in the SMART /
> Health log page (the composite temperature and temperature sensor 1 through
> temperature sensor 8).
> 
> This provides these temperatures via thermal zone devices.
> 
> Once the controller is identified, the thermal zone devices are created for
> all implemented temperature sensors including the composite temperature.
> 
> /sys/class/thermal/thermal_zone[0-*]:
>     |---type: 'nvme<instance>-temp<sensor>'
>     |---temp: Temperature
>     |---trip_point_0_temp: Over temperature threshold
> 
> The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
> device.
> 
> On the other hand, the following symlinks to the thermal zone devices are
> created in the nvme device sysfs directory.
> 
> - temp0: Composite temperature
> - temp1: Temperature sensor 1
> ...
> - temp8: Temperature sensor 8
> 

These questions on V2 are still unanswered:
a. Can we get a more descriptive string into tz->type?
b. Can these APIs support DT probing too?

> Cc: Zhang Rui <rui.zhang at intel.com>
> Cc: Eduardo Valentin <edubezval at gmail.com>
> Cc: Daniel Lezcano <daniel.lezcano at linaro.org>
> Cc: Keith Busch <keith.busch at intel.com>
> Cc: Jens Axboe <axboe at fb.com>
> Cc: Christoph Hellwig <hch at lst.de>
> Cc: Sagi Grimberg <sagi at grimberg.me>
> Cc: Minwoo Im <minwoo.im.dev at gmail.com>
> Cc: Kenneth Heitke <kenneth.heitke at intel.com>
> Cc: Chaitanya Kulkarni <Chaitanya.Kulkarni at wdc.com>
> Signed-off-by: Akinobu Mita <akinobu.mita at gmail.com>
> ---
> * v3
> - Change the type name of thermal zone devices from 'nvme_temp<sensor>' to
>   'nvme<instance>_temp<sensor>'
> - Pass a NULL to the status argument of nvme_set_feature()
> - Change the name of symbolic link from 'nvme_temp<sensor>' to 'temp<sensor>'
> - Don't make it fatal error if the device provides a response
> - Don't register thermal zone for composite temperature if smart log reports
>   zero value
> - Move the thermal zones registration and unregistration into the core module.
> 
>  drivers/nvme/host/core.c | 288 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h |   9 ++
>  include/linux/nvme.h     |   5 +
>  3 files changed, 302 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index c950916..4c8271a 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2200,6 +2200,289 @@ static void nvme_set_latency_tolerance(struct device *dev, s32 val)
>  	}
>  }
>  
> +#ifdef CONFIG_THERMAL
> +
> +static int nvme_get_temp(struct nvme_ctrl *ctrl, unsigned int sensor, int *temp)
> +{
> +	struct nvme_smart_log *log;
> +	int ret;
> +
> +	BUILD_BUG_ON(ARRAY_SIZE(log->temp_sensor) + 1 !=
> +		     ARRAY_SIZE(ctrl->tzdev));
> +
> +	if (WARN_ON_ONCE(sensor > ARRAY_SIZE(log->temp_sensor)))
> +		return -EINVAL;
> +
> +	log = kzalloc(sizeof(*log), GFP_KERNEL);
> +	if (!log)
> +		return -ENOMEM;
> +
> +	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> +			   log, sizeof(*log), 0);
> +	if (ret) {
> +		ret = ret > 0 ? -EINVAL : ret;
> +		goto free_log;
> +	}
> +
> +	if (sensor)
> +		*temp = le16_to_cpu(log->temp_sensor[sensor - 1]);
> +	else
> +		*temp = get_unaligned_le16(log->temperature);
> +
> +free_log:
> +	kfree(log);
> +
> +	return ret;
> +}
> +
> +static unsigned int nvme_tz_type_to_sensor(const char *type)
> +{
> +	int instance;
> +	unsigned int sensor;
> +
> +	if (sscanf(type, "nvme%d_temp%u", &instance, &sensor) != 2)
> +		return UINT_MAX;
> +
> +	return sensor;
> +}
> +
> +#define KELVIN_TO_MILLICELSIUS(t) DECI_KELVIN_TO_MILLICELSIUS((t) * 10)
> +#define MILLICELSIUS_TO_KELVIN(t) ((MILLICELSIUS_TO_DECI_KELVIN(t) + 5) / 10)
> +
> +static int nvme_tz_get_temp(struct thermal_zone_device *tzdev, int *temp)
> +{
> +	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
> +	struct nvme_ctrl *ctrl = tzdev->devdata;
> +	int ret;
> +
> +	ret = nvme_get_temp(ctrl, sensor, temp);
> +	if (!ret)
> +		*temp = KELVIN_TO_MILLICELSIUS(*temp);
> +
> +	return ret;
> +}
> +
> +static int nvme_tz_get_trip_type(struct thermal_zone_device *tzdev,
> +				 int trip, enum thermal_trip_type *type)
> +{
> +	*type = THERMAL_TRIP_ACTIVE;
> +
> +	return 0;
> +}
> +
> +static int nvme_get_over_temp_thresh(struct nvme_ctrl *ctrl,
> +				     unsigned int sensor, int *temp)
> +{
> +	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
> +	int status;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
> +		return -EINVAL;
> +
> +	ret = nvme_get_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
> +				&status);
> +	if (!ret)
> +		*temp = status & NVME_TEMP_THRESH_MASK;
> +
> +	return ret > 0 ? -EINVAL : ret;
> +}
> +
> +static int nvme_set_over_temp_thresh(struct nvme_ctrl *ctrl,
> +				     unsigned int sensor, int temp)
> +{
> +	unsigned int threshold = sensor << NVME_TEMP_THRESH_SELECT_SHIFT;
> +	int ret;
> +
> +	if (WARN_ON_ONCE(sensor >= ARRAY_SIZE(ctrl->tzdev)))
> +		return -EINVAL;
> +
> +	if (temp > NVME_TEMP_THRESH_MASK)
> +		return -EINVAL;
> +
> +	threshold |= temp & NVME_TEMP_THRESH_MASK;
> +
> +	ret = nvme_set_features(ctrl, NVME_FEAT_TEMP_THRESH, threshold, NULL, 0,
> +				NULL);
> +
> +	return ret > 0 ? -EINVAL : ret;
> +}
> +
> +static int nvme_tz_get_trip_temp(struct thermal_zone_device *tzdev,
> +				 int trip, int *temp)
> +{
> +	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
> +	struct nvme_ctrl *ctrl = tzdev->devdata;
> +	int ret;
> +
> +	ret = nvme_get_over_temp_thresh(ctrl, sensor, temp);
> +	if (!ret)
> +		*temp = KELVIN_TO_MILLICELSIUS(*temp);
> +
> +	return ret;
> +}
> +
> +static int nvme_tz_set_trip_temp(struct thermal_zone_device *tzdev,
> +				 int trip, int temp)
> +{
> +	unsigned int sensor = nvme_tz_type_to_sensor(tzdev->type);
> +	struct nvme_ctrl *ctrl = tzdev->devdata;
> +
> +	temp = MILLICELSIUS_TO_KELVIN(temp);
> +
> +	return nvme_set_over_temp_thresh(ctrl, sensor, temp);
> +}
> +
> +static struct thermal_zone_device_ops nvme_tz_ops = {
> +	.get_temp = nvme_tz_get_temp,
> +	.get_trip_type = nvme_tz_get_trip_type,
> +	.get_trip_temp = nvme_tz_get_trip_temp,
> +	.set_trip_temp = nvme_tz_set_trip_temp,
> +};
> +
> +static struct thermal_zone_params nvme_tz_params = {
> +	.governor_name = "user_space",

Also,

Was there any particular reason why defaulting to user_space here?

> +	.no_hwmon = true,
> +};
> +
> +static struct thermal_zone_device *
> +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> +{
> +	struct thermal_zone_device *tzdev;
> +	char name[THERMAL_NAME_LENGTH];
> +	int ret;
> +
> +	snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> +
> +	tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> +					     &nvme_tz_params, 0, 0);
> +	if (IS_ERR(tzdev)) {
> +		dev_err(ctrl->device,
> +			"Failed to register thermal zone device: %ld\n",
> +			PTR_ERR(tzdev));
> +		return tzdev;
> +	}
> +
> +	snprintf(name, sizeof(name), "temp%d", sensor);
> +	ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> +				name);
> +	if (ret)
> +		goto device_unregister;
> +
> +	ret = sysfs_create_link(&tzdev->device.kobj,
> +				&ctrl->ctrl_device.kobj, "device");
> +	if (ret)
> +		goto remove_link;
> +
> +	return tzdev;
> +
> +remove_link:
> +	sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> +device_unregister:
> +	thermal_zone_device_unregister(tzdev);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +/**
> + * nvme_thermal_zones_register() - register nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function creates up to nine thermal zone devices for all implemented
> + * temperature sensors including the composite temperature.
> + * Each thermal zone device provides a single trip point temperature that is
> + * associated with an over temperature threshold.
> + */
> +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_smart_log *log;
> +	int ret;
> +	int i;
> +
> +	log = kzalloc(sizeof(*log), GFP_KERNEL);
> +	if (!log)
> +		return 0; /* non-fatal error */

I am not sure about this API design here. I would leave the error
handling and judging if this is fatal or not to the caller. 
I mean, if I ask to register a nvme thermal zone and I get
a 0 as response, I would assume the thermal zone exists from
now on, right?

> +
> +	ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> +			   log, sizeof(*log), 0);
> +	if (ret) {
> +		dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> +		/* If the device provided a response, then it's non-fatal */
> +		if (ret > 0)
> +			ret = 0;
> +		goto free_log;

Once again, hiding errors is never a strategy that scales..

> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev;
> +		int temp;
> +
> +		if (i)
> +			temp = le16_to_cpu(log->temp_sensor[i - 1]);
> +		else
> +			temp = get_unaligned_le16(log->temperature);
> +
> +		/*
> +		 * All implemented temperature sensors report a non-zero value
> +		 * in temperature sensor fields in the smart log page.
> +		 */
> +		if (!temp)
> +			continue;
> +		if (ctrl->tzdev[i])
> +			continue;
> +
> +		tzdev = nvme_thermal_zone_register(ctrl, i);
> +		if (!IS_ERR(tzdev))
> +			ctrl->tzdev[i] = tzdev;

Here again, I would not hide errors, the API should propagate errors
if something goes wrong.

> +	}
> +
> +free_log:
> +	kfree(log);
> +
> +	return ret;
> +}
> +
> +/**
> + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> + * @ctrl: controller instance
> + *
> + * This function removes the registered thermal zone devices and symlinks.
> + */
> +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> +		struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> +		char name[20];
> +
> +		if (!tzdev)
> +			continue;
> +
> +		sysfs_remove_link(&tzdev->device.kobj, "device");
> +
> +		snprintf(name, sizeof(name), "temp%d", i);
> +		sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> +
> +		thermal_zone_device_unregister(tzdev);
> +
> +		ctrl->tzdev[i] = NULL;
> +	}
> +}
> +
> +#else
> +
> +static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> +{
> +	return 0;
> +}
> +
> +static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> +{
> +}
> +
> +#endif /* CONFIG_THERMAL */
> +
>  struct nvme_core_quirk_entry {
>  	/*
>  	 * NVMe model and firmware strings are padded with spaces.  For
> @@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
>  	if (ret < 0)
>  		return ret;
>  
> +	ret = nvme_thermal_zones_register(ctrl);
> +	if (ret < 0)
> +		return ret;


I would definitely keep this code the way it is here in
nvme_init_identify(), but if I read this right, your
nvme_thermal_zones_register() will never return < 0, and therefore this
condition is never met.

I would suggest you simply propagate the errors and keep this piece
of code the way it is, propagating errors too.

> +
>  	ctrl->identified = true;
>  
>  	return 0;
> @@ -3756,6 +4043,7 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
>  {
>  	nvme_mpath_stop(ctrl);
>  	nvme_stop_keep_alive(ctrl);
> +	nvme_thermal_zones_unregister(ctrl);
>  	flush_work(&ctrl->async_event_work);
>  	cancel_work_sync(&ctrl->fw_act_work);
>  }
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 802aa19..53f0b24 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -15,6 +15,7 @@
>  #include <linux/sed-opal.h>
>  #include <linux/fault-inject.h>
>  #include <linux/rcupdate.h>
> +#include <linux/thermal.h>
>  
>  extern unsigned int nvme_io_timeout;
>  #define NVME_IO_TIMEOUT	(nvme_io_timeout * HZ)
> @@ -248,6 +249,14 @@ struct nvme_ctrl {
>  
>  	struct page *discard_page;
>  	unsigned long discard_page_busy;
> +
> +#ifdef CONFIG_THERMAL
> +	/*
> +	 * tzdev[0]: composite temperature
> +	 * tzdev[1-8]: temperature sensor 1 through 8
> +	 */
> +	struct thermal_zone_device *tzdev[9];
> +#endif
>  };
>  
>  enum nvme_iopolicy {
> diff --git a/include/linux/nvme.h b/include/linux/nvme.h
> index 658ac75..54f0a13 100644
> --- a/include/linux/nvme.h
> +++ b/include/linux/nvme.h
> @@ -780,6 +780,11 @@ struct nvme_write_zeroes_cmd {
>  
>  /* Features */
>  
> +enum {
> +	NVME_TEMP_THRESH_MASK		= 0xffff,
> +	NVME_TEMP_THRESH_SELECT_SHIFT	= 16,
> +};
> +
>  struct nvme_feat_auto_pst {
>  	__le64 entries[32];
>  };
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-06-03  2:36     ` Eduardo Valentin
@ 2019-06-03 15:20       ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-03 15:20 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: linux-nvme, linux-pm, Zhang Rui, Daniel Lezcano, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Minwoo Im,
	Kenneth Heitke, Chaitanya Kulkarni

2019年6月3日(月) 11:36 Eduardo Valentin <edubezval@gmail.com>:
>
> Hey Mita,
>
> On Mon, May 27, 2019 at 01:29:02AM +0900, Akinobu Mita wrote:
> > The NVMe controller reports up to nine temperature values in the SMART /
> > Health log page (the composite temperature and temperature sensor 1 through
> > temperature sensor 8).
> >
> > This provides these temperatures via thermal zone devices.
> >
> > Once the controller is identified, the thermal zone devices are created for
> > all implemented temperature sensors including the composite temperature.
> >
> > /sys/class/thermal/thermal_zone[0-*]:
> >     |---type: 'nvme<instance>-temp<sensor>'
> >     |---temp: Temperature
> >     |---trip_point_0_temp: Over temperature threshold
> >
> > The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
> > device.
> >
> > On the other hand, the following symlinks to the thermal zone devices are
> > created in the nvme device sysfs directory.
> >
> > - temp0: Composite temperature
> > - temp1: Temperature sensor 1
> > ...
> > - temp8: Temperature sensor 8
> >
>
> These questions on V2 are still unanswered:
> a. Can we get a more descriptive string into tz->type?

As I said in the other thread, the NVMe spec only says a controller
reports the composite temperature and temperature sensor 1 through 8.
What temperature sensor N means is vendor specific.

> b. Can these APIs support DT probing too?

OK. I can try, but I don't have arm or arm64 boards with PCIe for
testing with of-thermal.  So it'll be untested.

> > +static struct thermal_zone_params nvme_tz_params = {
> > +     .governor_name = "user_space",
>
> Also,
>
> Was there any particular reason why defaulting to user_space here?

I only tested with the user_space governor.  There is no cooling device
to bind with this.

> > +     .no_hwmon = true,
> > +};
> > +
> > +static struct thermal_zone_device *
> > +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> > +{
> > +     struct thermal_zone_device *tzdev;
> > +     char name[THERMAL_NAME_LENGTH];
> > +     int ret;
> > +
> > +     snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> > +
> > +     tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> > +                                          &nvme_tz_params, 0, 0);
> > +     if (IS_ERR(tzdev)) {
> > +             dev_err(ctrl->device,
> > +                     "Failed to register thermal zone device: %ld\n",
> > +                     PTR_ERR(tzdev));
> > +             return tzdev;
> > +     }
> > +
> > +     snprintf(name, sizeof(name), "temp%d", sensor);
> > +     ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> > +                             name);
> > +     if (ret)
> > +             goto device_unregister;
> > +
> > +     ret = sysfs_create_link(&tzdev->device.kobj,
> > +                             &ctrl->ctrl_device.kobj, "device");
> > +     if (ret)
> > +             goto remove_link;
> > +
> > +     return tzdev;
> > +
> > +remove_link:
> > +     sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > +device_unregister:
> > +     thermal_zone_device_unregister(tzdev);
> > +
> > +     return ERR_PTR(ret);
> > +}
> > +
> > +/**
> > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function creates up to nine thermal zone devices for all implemented
> > + * temperature sensors including the composite temperature.
> > + * Each thermal zone device provides a single trip point temperature that is
> > + * associated with an over temperature threshold.
> > + */
> > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > +{
> > +     struct nvme_smart_log *log;
> > +     int ret;
> > +     int i;
> > +
> > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > +     if (!log)
> > +             return 0; /* non-fatal error */
>
> I am not sure about this API design here. I would leave the error
> handling and judging if this is fatal or not to the caller.
> I mean, if I ask to register a nvme thermal zone and I get
> a 0 as response, I would assume the thermal zone exists from
> now on, right?

This routine is designed to return error code only when we're unable to
communicate with the device at all (i.e. nvme_submit_sync_cmd returns a
negative value).

We don't want to abandon device initialization just due to thermal zone
failures.  Because thermal zone device isn't mandatory to manage the
controllers, and the device like qemu doesn't provide smart log (according
to Keith).

> > +
> > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > +                        log, sizeof(*log), 0);
> > +     if (ret) {
> > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > +             /* If the device provided a response, then it's non-fatal */
> > +             if (ret > 0)
> > +                     ret = 0;
> > +             goto free_log;
>
> Once again, hiding errors is never a strategy that scales..
>
> > +     }
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev;
> > +             int temp;
> > +
> > +             if (i)
> > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > +             else
> > +                     temp = get_unaligned_le16(log->temperature);
> > +
> > +             /*
> > +              * All implemented temperature sensors report a non-zero value
> > +              * in temperature sensor fields in the smart log page.
> > +              */
> > +             if (!temp)
> > +                     continue;
> > +             if (ctrl->tzdev[i])
> > +                     continue;
> > +
> > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > +             if (!IS_ERR(tzdev))
> > +                     ctrl->tzdev[i] = tzdev;
>
> Here again, I would not hide errors, the API should propagate errors
> if something goes wrong.
>
> > +     }
> > +
> > +free_log:
> > +     kfree(log);
> > +
> > +     return ret;
> > +}
> > +
> > +/**
> > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function removes the registered thermal zone devices and symlinks.
> > + */
> > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > +             char name[20];
> > +
> > +             if (!tzdev)
> > +                     continue;
> > +
> > +             sysfs_remove_link(&tzdev->device.kobj, "device");
> > +
> > +             snprintf(name, sizeof(name), "temp%d", i);
> > +             sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > +
> > +             thermal_zone_device_unregister(tzdev);
> > +
> > +             ctrl->tzdev[i] = NULL;
> > +     }
> > +}
> > +
> > +#else
> > +
> > +static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_THERMAL */
> > +
> >  struct nvme_core_quirk_entry {
> >       /*
> >        * NVMe model and firmware strings are padded with spaces.  For
> > @@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> >       if (ret < 0)
> >               return ret;
> >
> > +     ret = nvme_thermal_zones_register(ctrl);
> > +     if (ret < 0)
> > +             return ret;
>
>
> I would definitely keep this code the way it is here in
> nvme_init_identify(), but if I read this right, your
> nvme_thermal_zones_register() will never return < 0, and therefore this
> condition is never met.

The nvme_get_log() can return a negative value.  Only in this case,
nvme_thermal_zones_register() returns error and abandon the device
initialization.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-03 15:20       ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-03 15:20 UTC (permalink / raw)


2019?6?3?(?) 11:36 Eduardo Valentin <edubezval at gmail.com>:
>
> Hey Mita,
>
> On Mon, May 27, 2019@01:29:02AM +0900, Akinobu Mita wrote:
> > The NVMe controller reports up to nine temperature values in the SMART /
> > Health log page (the composite temperature and temperature sensor 1 through
> > temperature sensor 8).
> >
> > This provides these temperatures via thermal zone devices.
> >
> > Once the controller is identified, the thermal zone devices are created for
> > all implemented temperature sensors including the composite temperature.
> >
> > /sys/class/thermal/thermal_zone[0-*]:
> >     |---type: 'nvme<instance>-temp<sensor>'
> >     |---temp: Temperature
> >     |---trip_point_0_temp: Over temperature threshold
> >
> > The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
> > device.
> >
> > On the other hand, the following symlinks to the thermal zone devices are
> > created in the nvme device sysfs directory.
> >
> > - temp0: Composite temperature
> > - temp1: Temperature sensor 1
> > ...
> > - temp8: Temperature sensor 8
> >
>
> These questions on V2 are still unanswered:
> a. Can we get a more descriptive string into tz->type?

As I said in the other thread, the NVMe spec only says a controller
reports the composite temperature and temperature sensor 1 through 8.
What temperature sensor N means is vendor specific.

> b. Can these APIs support DT probing too?

OK. I can try, but I don't have arm or arm64 boards with PCIe for
testing with of-thermal.  So it'll be untested.

> > +static struct thermal_zone_params nvme_tz_params = {
> > +     .governor_name = "user_space",
>
> Also,
>
> Was there any particular reason why defaulting to user_space here?

I only tested with the user_space governor.  There is no cooling device
to bind with this.

> > +     .no_hwmon = true,
> > +};
> > +
> > +static struct thermal_zone_device *
> > +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> > +{
> > +     struct thermal_zone_device *tzdev;
> > +     char name[THERMAL_NAME_LENGTH];
> > +     int ret;
> > +
> > +     snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> > +
> > +     tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> > +                                          &nvme_tz_params, 0, 0);
> > +     if (IS_ERR(tzdev)) {
> > +             dev_err(ctrl->device,
> > +                     "Failed to register thermal zone device: %ld\n",
> > +                     PTR_ERR(tzdev));
> > +             return tzdev;
> > +     }
> > +
> > +     snprintf(name, sizeof(name), "temp%d", sensor);
> > +     ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> > +                             name);
> > +     if (ret)
> > +             goto device_unregister;
> > +
> > +     ret = sysfs_create_link(&tzdev->device.kobj,
> > +                             &ctrl->ctrl_device.kobj, "device");
> > +     if (ret)
> > +             goto remove_link;
> > +
> > +     return tzdev;
> > +
> > +remove_link:
> > +     sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > +device_unregister:
> > +     thermal_zone_device_unregister(tzdev);
> > +
> > +     return ERR_PTR(ret);
> > +}
> > +
> > +/**
> > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function creates up to nine thermal zone devices for all implemented
> > + * temperature sensors including the composite temperature.
> > + * Each thermal zone device provides a single trip point temperature that is
> > + * associated with an over temperature threshold.
> > + */
> > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > +{
> > +     struct nvme_smart_log *log;
> > +     int ret;
> > +     int i;
> > +
> > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > +     if (!log)
> > +             return 0; /* non-fatal error */
>
> I am not sure about this API design here. I would leave the error
> handling and judging if this is fatal or not to the caller.
> I mean, if I ask to register a nvme thermal zone and I get
> a 0 as response, I would assume the thermal zone exists from
> now on, right?

This routine is designed to return error code only when we're unable to
communicate with the device at all (i.e. nvme_submit_sync_cmd returns a
negative value).

We don't want to abandon device initialization just due to thermal zone
failures.  Because thermal zone device isn't mandatory to manage the
controllers, and the device like qemu doesn't provide smart log (according
to Keith).

> > +
> > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > +                        log, sizeof(*log), 0);
> > +     if (ret) {
> > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > +             /* If the device provided a response, then it's non-fatal */
> > +             if (ret > 0)
> > +                     ret = 0;
> > +             goto free_log;
>
> Once again, hiding errors is never a strategy that scales..
>
> > +     }
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev;
> > +             int temp;
> > +
> > +             if (i)
> > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > +             else
> > +                     temp = get_unaligned_le16(log->temperature);
> > +
> > +             /*
> > +              * All implemented temperature sensors report a non-zero value
> > +              * in temperature sensor fields in the smart log page.
> > +              */
> > +             if (!temp)
> > +                     continue;
> > +             if (ctrl->tzdev[i])
> > +                     continue;
> > +
> > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > +             if (!IS_ERR(tzdev))
> > +                     ctrl->tzdev[i] = tzdev;
>
> Here again, I would not hide errors, the API should propagate errors
> if something goes wrong.
>
> > +     }
> > +
> > +free_log:
> > +     kfree(log);
> > +
> > +     return ret;
> > +}
> > +
> > +/**
> > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > + * @ctrl: controller instance
> > + *
> > + * This function removes the registered thermal zone devices and symlinks.
> > + */
> > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > +{
> > +     int i;
> > +
> > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > +             char name[20];
> > +
> > +             if (!tzdev)
> > +                     continue;
> > +
> > +             sysfs_remove_link(&tzdev->device.kobj, "device");
> > +
> > +             snprintf(name, sizeof(name), "temp%d", i);
> > +             sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > +
> > +             thermal_zone_device_unregister(tzdev);
> > +
> > +             ctrl->tzdev[i] = NULL;
> > +     }
> > +}
> > +
> > +#else
> > +
> > +static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > +{
> > +}
> > +
> > +#endif /* CONFIG_THERMAL */
> > +
> >  struct nvme_core_quirk_entry {
> >       /*
> >        * NVMe model and firmware strings are padded with spaces.  For
> > @@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> >       if (ret < 0)
> >               return ret;
> >
> > +     ret = nvme_thermal_zones_register(ctrl);
> > +     if (ret < 0)
> > +             return ret;
>
>
> I would definitely keep this code the way it is here in
> nvme_init_identify(), but if I read this right, your
> nvme_thermal_zones_register() will never return < 0, and therefore this
> condition is never met.

The nvme_get_log() can return a negative value.  Only in this case,
nvme_thermal_zones_register() returns error and abandon the device
initialization.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-06-02 13:19       ` Akinobu Mita
@ 2019-06-04  7:31         ` Christoph Hellwig
  -1 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-04  7:31 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Christoph Hellwig, linux-nvme, linux-pm, Zhang Rui,
	Eduardo Valentin, Daniel Lezcano, Keith Busch, Jens Axboe,
	Sagi Grimberg, Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

On Sun, Jun 02, 2019 at 10:19:08PM +0900, Akinobu Mita wrote:
> As long as the user space thermal governor is used, there is nothing more
> than that from a functional perspective.  And I suppose that this is used
> with user_space governor (i.e. there is still some work to be able to bind
> actual thermal cooling device).
> 
> The main purpose of this is to turn on a fan when overheated without
> polling the device that could prevent the lower power state transitions.
> But as you noted, we could do that with the existing AEN notifications
> through uevent.
> 
> So frankly speaking, the benefit of this is providing generic thermal sysfs
> interface and the tools like tmon (linux/tools/thermal/tmon) can show the
> nvme temperatures.

I'm just a little worried about bloating the nvme driver with features
that look kinda nifty but don't buy us much.  I'd rather keep at least
the core and PCIe drivers as minimal.  Now the thermal device support
is pretty small and except for the smart uevents I can't find anything
actually bad, but I'm not exactly excited either.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-04  7:31         ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-04  7:31 UTC (permalink / raw)


On Sun, Jun 02, 2019@10:19:08PM +0900, Akinobu Mita wrote:
> As long as the user space thermal governor is used, there is nothing more
> than that from a functional perspective.  And I suppose that this is used
> with user_space governor (i.e. there is still some work to be able to bind
> actual thermal cooling device).
> 
> The main purpose of this is to turn on a fan when overheated without
> polling the device that could prevent the lower power state transitions.
> But as you noted, we could do that with the existing AEN notifications
> through uevent.
> 
> So frankly speaking, the benefit of this is providing generic thermal sysfs
> interface and the tools like tmon (linux/tools/thermal/tmon) can show the
> nvme temperatures.

I'm just a little worried about bloating the nvme driver with features
that look kinda nifty but don't buy us much.  I'd rather keep at least
the core and PCIe drivers as minimal.  Now the thermal device support
is pretty small and except for the smart uevents I can't find anything
actually bad, but I'm not exactly excited either.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
  2019-06-02 13:46       ` Akinobu Mita
@ 2019-06-04  7:32         ` Christoph Hellwig
  -1 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-04  7:32 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: Christoph Hellwig, linux-nvme, linux-pm, Zhang Rui,
	Eduardo Valentin, Daniel Lezcano, Keith Busch, Jens Axboe,
	Sagi Grimberg, Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

On Sun, Jun 02, 2019 at 10:46:12PM +0900, Akinobu Mita wrote:
> 2019年6月1日(土) 18:04 Christoph Hellwig <hch@lst.de>:
> >
> > So before we allowed userspace to get smart AEN notifications through
> > uevent, and would expect userspace to clear the AEN.  I think this
> > could at least potentially break existing userspace by now doing that
> > in kernel space.
> 
> This change unconditionally sets NVME_SMART_CRIT_TEMPERATURE flag in
> nvme_enable_aen(), it could be problematic for existing userspace.
> So it's better to provide a knob to enable/disable the event notification
> and we can utilize get_mode()/set_mode() in the thermal_zone_device_ops.
> 
> Other than that, this change doesn't remove ctrl->aen_result, so existing
> userspace still receives NVME_AEN=* uevents.

And that is the problem.  Because for notifications we expect userspace to
read the log page and clear the event, while we are not doing it in kernel
space.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur
@ 2019-06-04  7:32         ` Christoph Hellwig
  0 siblings, 0 replies; 40+ messages in thread
From: Christoph Hellwig @ 2019-06-04  7:32 UTC (permalink / raw)


On Sun, Jun 02, 2019@10:46:12PM +0900, Akinobu Mita wrote:
> 2019?6?1?(?) 18:04 Christoph Hellwig <hch at lst.de>:
> >
> > So before we allowed userspace to get smart AEN notifications through
> > uevent, and would expect userspace to clear the AEN.  I think this
> > could at least potentially break existing userspace by now doing that
> > in kernel space.
> 
> This change unconditionally sets NVME_SMART_CRIT_TEMPERATURE flag in
> nvme_enable_aen(), it could be problematic for existing userspace.
> So it's better to provide a knob to enable/disable the event notification
> and we can utilize get_mode()/set_mode() in the thermal_zone_device_ops.
> 
> Other than that, this change doesn't remove ctrl->aen_result, so existing
> userspace still receives NVME_AEN=* uevents.

And that is the problem.  Because for notifications we expect userspace to
read the log page and clear the event, while we are not doing it in kernel
space.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-06-04  7:31         ` Christoph Hellwig
@ 2019-06-05 15:42           ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-05 15:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-nvme, linux-pm, Zhang Rui, Eduardo Valentin,
	Daniel Lezcano, Keith Busch, Jens Axboe, Sagi Grimberg,
	Minwoo Im, Kenneth Heitke, Chaitanya Kulkarni

2019年6月4日(火) 16:32 Christoph Hellwig <hch@lst.de>:
>
> On Sun, Jun 02, 2019 at 10:19:08PM +0900, Akinobu Mita wrote:
> > As long as the user space thermal governor is used, there is nothing more
> > than that from a functional perspective.  And I suppose that this is used
> > with user_space governor (i.e. there is still some work to be able to bind
> > actual thermal cooling device).
> >
> > The main purpose of this is to turn on a fan when overheated without
> > polling the device that could prevent the lower power state transitions.
> > But as you noted, we could do that with the existing AEN notifications
> > through uevent.
> >
> > So frankly speaking, the benefit of this is providing generic thermal sysfs
> > interface and the tools like tmon (linux/tools/thermal/tmon) can show the
> > nvme temperatures.
>
> I'm just a little worried about bloating the nvme driver with features
> that look kinda nifty but don't buy us much.  I'd rather keep at least
> the core and PCIe drivers as minimal.  Now the thermal device support
> is pretty small and except for the smart uevents I can't find anything
> actually bad, but I'm not exactly excited either.

I'll add thermal.c file in order to minimize the core driver changes and a
module parameter to disable the thermal zone support.

Device tree thermal zone will also be supported. It enables to use thermal
governor other than user_space governor.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-05 15:42           ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-05 15:42 UTC (permalink / raw)


2019?6?4?(?) 16:32 Christoph Hellwig <hch at lst.de>:
>
> On Sun, Jun 02, 2019@10:19:08PM +0900, Akinobu Mita wrote:
> > As long as the user space thermal governor is used, there is nothing more
> > than that from a functional perspective.  And I suppose that this is used
> > with user_space governor (i.e. there is still some work to be able to bind
> > actual thermal cooling device).
> >
> > The main purpose of this is to turn on a fan when overheated without
> > polling the device that could prevent the lower power state transitions.
> > But as you noted, we could do that with the existing AEN notifications
> > through uevent.
> >
> > So frankly speaking, the benefit of this is providing generic thermal sysfs
> > interface and the tools like tmon (linux/tools/thermal/tmon) can show the
> > nvme temperatures.
>
> I'm just a little worried about bloating the nvme driver with features
> that look kinda nifty but don't buy us much.  I'd rather keep at least
> the core and PCIe drivers as minimal.  Now the thermal device support
> is pretty small and except for the smart uevents I can't find anything
> actually bad, but I'm not exactly excited either.

I'll add thermal.c file in order to minimize the core driver changes and a
module parameter to disable the thermal zone support.

Device tree thermal zone will also be supported. It enables to use thermal
governor other than user_space governor.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-06-03 15:20       ` Akinobu Mita
@ 2019-06-06  4:05         ` Eduardo Valentin
  -1 siblings, 0 replies; 40+ messages in thread
From: Eduardo Valentin @ 2019-06-06  4:05 UTC (permalink / raw)
  To: Akinobu Mita
  Cc: linux-nvme, linux-pm, Zhang Rui, Daniel Lezcano, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Minwoo Im,
	Kenneth Heitke, Chaitanya Kulkarni

On Tue, Jun 04, 2019 at 12:20:19AM +0900, Akinobu Mita wrote:
> 2019年6月3日(月) 11:36 Eduardo Valentin <edubezval@gmail.com>:
> >
> > Hey Mita,
> >
> > On Mon, May 27, 2019 at 01:29:02AM +0900, Akinobu Mita wrote:
> > > The NVMe controller reports up to nine temperature values in the SMART /
> > > Health log page (the composite temperature and temperature sensor 1 through
> > > temperature sensor 8).
> > >
> > > This provides these temperatures via thermal zone devices.
> > >
> > > Once the controller is identified, the thermal zone devices are created for
> > > all implemented temperature sensors including the composite temperature.
> > >
> > > /sys/class/thermal/thermal_zone[0-*]:
> > >     |---type: 'nvme<instance>-temp<sensor>'
> > >     |---temp: Temperature
> > >     |---trip_point_0_temp: Over temperature threshold
> > >
> > > The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
> > > device.
> > >
> > > On the other hand, the following symlinks to the thermal zone devices are
> > > created in the nvme device sysfs directory.
> > >
> > > - temp0: Composite temperature
> > > - temp1: Temperature sensor 1
> > > ...
> > > - temp8: Temperature sensor 8
> > >
> >
> > These questions on V2 are still unanswered:
> > a. Can we get a more descriptive string into tz->type?
> 
> As I said in the other thread, the NVMe spec only says a controller
> reports the composite temperature and temperature sensor 1 through 8.
> What temperature sensor N means is vendor specific.

I see..

> 
> > b. Can these APIs support DT probing too?
> 
> OK. I can try, but I don't have arm or arm64 boards with PCIe for
> testing with of-thermal.  So it'll be untested.
> 

OK.  Lets see what comes.

> > > +static struct thermal_zone_params nvme_tz_params = {
> > > +     .governor_name = "user_space",
> >
> > Also,
> >
> > Was there any particular reason why defaulting to user_space here?
> 
> I only tested with the user_space governor.  There is no cooling device
> to bind with this.
> 

hmmm.. I see. But was there any particular reason to pick the thermal
subsystem here if you do not have any cooling? Or is there any specific
cooling that can be written in future?

If no cooling is expected, why not a simple hwmon?

> > > +     .no_hwmon = true,
> > > +};
> > > +
> > > +static struct thermal_zone_device *
> > > +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> > > +{
> > > +     struct thermal_zone_device *tzdev;
> > > +     char name[THERMAL_NAME_LENGTH];
> > > +     int ret;
> > > +
> > > +     snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> > > +
> > > +     tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> > > +                                          &nvme_tz_params, 0, 0);
> > > +     if (IS_ERR(tzdev)) {
> > > +             dev_err(ctrl->device,
> > > +                     "Failed to register thermal zone device: %ld\n",
> > > +                     PTR_ERR(tzdev));
> > > +             return tzdev;
> > > +     }
> > > +
> > > +     snprintf(name, sizeof(name), "temp%d", sensor);
> > > +     ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> > > +                             name);
> > > +     if (ret)
> > > +             goto device_unregister;
> > > +
> > > +     ret = sysfs_create_link(&tzdev->device.kobj,
> > > +                             &ctrl->ctrl_device.kobj, "device");
> > > +     if (ret)
> > > +             goto remove_link;
> > > +
> > > +     return tzdev;
> > > +
> > > +remove_link:
> > > +     sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > > +device_unregister:
> > > +     thermal_zone_device_unregister(tzdev);
> > > +
> > > +     return ERR_PTR(ret);
> > > +}
> > > +
> > > +/**
> > > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function creates up to nine thermal zone devices for all implemented
> > > + * temperature sensors including the composite temperature.
> > > + * Each thermal zone device provides a single trip point temperature that is
> > > + * associated with an over temperature threshold.
> > > + */
> > > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > +{
> > > +     struct nvme_smart_log *log;
> > > +     int ret;
> > > +     int i;
> > > +
> > > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > > +     if (!log)
> > > +             return 0; /* non-fatal error */
> >
> > I am not sure about this API design here. I would leave the error
> > handling and judging if this is fatal or not to the caller.
> > I mean, if I ask to register a nvme thermal zone and I get
> > a 0 as response, I would assume the thermal zone exists from
> > now on, right?
> 
> This routine is designed to return error code only when we're unable to
> communicate with the device at all (i.e. nvme_submit_sync_cmd returns a
> negative value).
> 
> We don't want to abandon device initialization just due to thermal zone
> failures.  Because thermal zone device isn't mandatory to manage the
> controllers, and the device like qemu doesn't provide smart log (according
> to Keith).

That is fair.. it is just really weird to continue your business when
the kernel cannot even allocate memory for you..

> 
> > > +
> > > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > > +                        log, sizeof(*log), 0);
> > > +     if (ret) {
> > > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > > +             /* If the device provided a response, then it's non-fatal */
> > > +             if (ret > 0)
> > > +                     ret = 0;
> > > +             goto free_log;
> >
> > Once again, hiding errors is never a strategy that scales..
> >
> > > +     }
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev;
> > > +             int temp;
> > > +
> > > +             if (i)
> > > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > > +             else
> > > +                     temp = get_unaligned_le16(log->temperature);
> > > +
> > > +             /*
> > > +              * All implemented temperature sensors report a non-zero value
> > > +              * in temperature sensor fields in the smart log page.
> > > +              */
> > > +             if (!temp)
> > > +                     continue;
> > > +             if (ctrl->tzdev[i])
> > > +                     continue;
> > > +
> > > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > > +             if (!IS_ERR(tzdev))
> > > +                     ctrl->tzdev[i] = tzdev;
> >
> > Here again, I would not hide errors, the API should propagate errors
> > if something goes wrong.
> >
> > > +     }
> > > +
> > > +free_log:
> > > +     kfree(log);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +/**
> > > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function removes the registered thermal zone devices and symlinks.
> > > + */
> > > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > > +{
> > > +     int i;
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > > +             char name[20];
> > > +
> > > +             if (!tzdev)
> > > +                     continue;
> > > +
> > > +             sysfs_remove_link(&tzdev->device.kobj, "device");
> > > +
> > > +             snprintf(name, sizeof(name), "temp%d", i);
> > > +             sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > > +
> > > +             thermal_zone_device_unregister(tzdev);
> > > +
> > > +             ctrl->tzdev[i] = NULL;
> > > +     }
> > > +}
> > > +
> > > +#else
> > > +
> > > +static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > +{
> > > +     return 0;
> > > +}
> > > +
> > > +static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > > +{
> > > +}
> > > +
> > > +#endif /* CONFIG_THERMAL */
> > > +
> > >  struct nvme_core_quirk_entry {
> > >       /*
> > >        * NVMe model and firmware strings are padded with spaces.  For
> > > @@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> > >       if (ret < 0)
> > >               return ret;
> > >
> > > +     ret = nvme_thermal_zones_register(ctrl);
> > > +     if (ret < 0)
> > > +             return ret;
> >
> >
> > I would definitely keep this code the way it is here in
> > nvme_init_identify(), but if I read this right, your
> > nvme_thermal_zones_register() will never return < 0, and therefore this
> > condition is never met.
> 
> The nvme_get_log() can return a negative value.  Only in this case,
> nvme_thermal_zones_register() returns error and abandon the device
> initialization.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-06  4:05         ` Eduardo Valentin
  0 siblings, 0 replies; 40+ messages in thread
From: Eduardo Valentin @ 2019-06-06  4:05 UTC (permalink / raw)


On Tue, Jun 04, 2019@12:20:19AM +0900, Akinobu Mita wrote:
> 2019?6?3?(?) 11:36 Eduardo Valentin <edubezval at gmail.com>:
> >
> > Hey Mita,
> >
> > On Mon, May 27, 2019@01:29:02AM +0900, Akinobu Mita wrote:
> > > The NVMe controller reports up to nine temperature values in the SMART /
> > > Health log page (the composite temperature and temperature sensor 1 through
> > > temperature sensor 8).
> > >
> > > This provides these temperatures via thermal zone devices.
> > >
> > > Once the controller is identified, the thermal zone devices are created for
> > > all implemented temperature sensors including the composite temperature.
> > >
> > > /sys/class/thermal/thermal_zone[0-*]:
> > >     |---type: 'nvme<instance>-temp<sensor>'
> > >     |---temp: Temperature
> > >     |---trip_point_0_temp: Over temperature threshold
> > >
> > > The thermal_zone[0-*] contains a 'device' symlink to the corresponding nvme
> > > device.
> > >
> > > On the other hand, the following symlinks to the thermal zone devices are
> > > created in the nvme device sysfs directory.
> > >
> > > - temp0: Composite temperature
> > > - temp1: Temperature sensor 1
> > > ...
> > > - temp8: Temperature sensor 8
> > >
> >
> > These questions on V2 are still unanswered:
> > a. Can we get a more descriptive string into tz->type?
> 
> As I said in the other thread, the NVMe spec only says a controller
> reports the composite temperature and temperature sensor 1 through 8.
> What temperature sensor N means is vendor specific.

I see..

> 
> > b. Can these APIs support DT probing too?
> 
> OK. I can try, but I don't have arm or arm64 boards with PCIe for
> testing with of-thermal.  So it'll be untested.
> 

OK.  Lets see what comes.

> > > +static struct thermal_zone_params nvme_tz_params = {
> > > +     .governor_name = "user_space",
> >
> > Also,
> >
> > Was there any particular reason why defaulting to user_space here?
> 
> I only tested with the user_space governor.  There is no cooling device
> to bind with this.
> 

hmmm.. I see. But was there any particular reason to pick the thermal
subsystem here if you do not have any cooling? Or is there any specific
cooling that can be written in future?

If no cooling is expected, why not a simple hwmon?

> > > +     .no_hwmon = true,
> > > +};
> > > +
> > > +static struct thermal_zone_device *
> > > +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> > > +{
> > > +     struct thermal_zone_device *tzdev;
> > > +     char name[THERMAL_NAME_LENGTH];
> > > +     int ret;
> > > +
> > > +     snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> > > +
> > > +     tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> > > +                                          &nvme_tz_params, 0, 0);
> > > +     if (IS_ERR(tzdev)) {
> > > +             dev_err(ctrl->device,
> > > +                     "Failed to register thermal zone device: %ld\n",
> > > +                     PTR_ERR(tzdev));
> > > +             return tzdev;
> > > +     }
> > > +
> > > +     snprintf(name, sizeof(name), "temp%d", sensor);
> > > +     ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> > > +                             name);
> > > +     if (ret)
> > > +             goto device_unregister;
> > > +
> > > +     ret = sysfs_create_link(&tzdev->device.kobj,
> > > +                             &ctrl->ctrl_device.kobj, "device");
> > > +     if (ret)
> > > +             goto remove_link;
> > > +
> > > +     return tzdev;
> > > +
> > > +remove_link:
> > > +     sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > > +device_unregister:
> > > +     thermal_zone_device_unregister(tzdev);
> > > +
> > > +     return ERR_PTR(ret);
> > > +}
> > > +
> > > +/**
> > > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function creates up to nine thermal zone devices for all implemented
> > > + * temperature sensors including the composite temperature.
> > > + * Each thermal zone device provides a single trip point temperature that is
> > > + * associated with an over temperature threshold.
> > > + */
> > > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > +{
> > > +     struct nvme_smart_log *log;
> > > +     int ret;
> > > +     int i;
> > > +
> > > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > > +     if (!log)
> > > +             return 0; /* non-fatal error */
> >
> > I am not sure about this API design here. I would leave the error
> > handling and judging if this is fatal or not to the caller.
> > I mean, if I ask to register a nvme thermal zone and I get
> > a 0 as response, I would assume the thermal zone exists from
> > now on, right?
> 
> This routine is designed to return error code only when we're unable to
> communicate with the device at all (i.e. nvme_submit_sync_cmd returns a
> negative value).
> 
> We don't want to abandon device initialization just due to thermal zone
> failures.  Because thermal zone device isn't mandatory to manage the
> controllers, and the device like qemu doesn't provide smart log (according
> to Keith).

That is fair.. it is just really weird to continue your business when
the kernel cannot even allocate memory for you..

> 
> > > +
> > > +     ret = nvme_get_log(ctrl, NVME_NSID_ALL, NVME_LOG_SMART, 0,
> > > +                        log, sizeof(*log), 0);
> > > +     if (ret) {
> > > +             dev_err(ctrl->device, "Failed to get SMART log: %d\n", ret);
> > > +             /* If the device provided a response, then it's non-fatal */
> > > +             if (ret > 0)
> > > +                     ret = 0;
> > > +             goto free_log;
> >
> > Once again, hiding errors is never a strategy that scales..
> >
> > > +     }
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev;
> > > +             int temp;
> > > +
> > > +             if (i)
> > > +                     temp = le16_to_cpu(log->temp_sensor[i - 1]);
> > > +             else
> > > +                     temp = get_unaligned_le16(log->temperature);
> > > +
> > > +             /*
> > > +              * All implemented temperature sensors report a non-zero value
> > > +              * in temperature sensor fields in the smart log page.
> > > +              */
> > > +             if (!temp)
> > > +                     continue;
> > > +             if (ctrl->tzdev[i])
> > > +                     continue;
> > > +
> > > +             tzdev = nvme_thermal_zone_register(ctrl, i);
> > > +             if (!IS_ERR(tzdev))
> > > +                     ctrl->tzdev[i] = tzdev;
> >
> > Here again, I would not hide errors, the API should propagate errors
> > if something goes wrong.
> >
> > > +     }
> > > +
> > > +free_log:
> > > +     kfree(log);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +/**
> > > + * nvme_thermal_zones_unregister() - unregister nvme thermal zone devices
> > > + * @ctrl: controller instance
> > > + *
> > > + * This function removes the registered thermal zone devices and symlinks.
> > > + */
> > > +static void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > > +{
> > > +     int i;
> > > +
> > > +     for (i = 0; i < ARRAY_SIZE(ctrl->tzdev); i++) {
> > > +             struct thermal_zone_device *tzdev = ctrl->tzdev[i];
> > > +             char name[20];
> > > +
> > > +             if (!tzdev)
> > > +                     continue;
> > > +
> > > +             sysfs_remove_link(&tzdev->device.kobj, "device");
> > > +
> > > +             snprintf(name, sizeof(name), "temp%d", i);
> > > +             sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > > +
> > > +             thermal_zone_device_unregister(tzdev);
> > > +
> > > +             ctrl->tzdev[i] = NULL;
> > > +     }
> > > +}
> > > +
> > > +#else
> > > +
> > > +static inline int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > +{
> > > +     return 0;
> > > +}
> > > +
> > > +static inline void nvme_thermal_zones_unregister(struct nvme_ctrl *ctrl)
> > > +{
> > > +}
> > > +
> > > +#endif /* CONFIG_THERMAL */
> > > +
> > >  struct nvme_core_quirk_entry {
> > >       /*
> > >        * NVMe model and firmware strings are padded with spaces.  For
> > > @@ -2754,6 +3037,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
> > >       if (ret < 0)
> > >               return ret;
> > >
> > > +     ret = nvme_thermal_zones_register(ctrl);
> > > +     if (ret < 0)
> > > +             return ret;
> >
> >
> > I would definitely keep this code the way it is here in
> > nvme_init_identify(), but if I read this right, your
> > nvme_thermal_zones_register() will never return < 0, and therefore this
> > condition is never met.
> 
> The nvme_get_log() can return a negative value.  Only in this case,
> nvme_thermal_zones_register() returns error and abandon the device
> initialization.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 2/3] nvme: add thermal zone devices
  2019-06-06  4:05         ` Eduardo Valentin
@ 2019-06-07 15:21           ` Akinobu Mita
  -1 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-07 15:21 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: linux-nvme, linux-pm, Zhang Rui, Daniel Lezcano, Keith Busch,
	Jens Axboe, Christoph Hellwig, Sagi Grimberg, Minwoo Im,
	Kenneth Heitke, Chaitanya Kulkarni

2019年6月6日(木) 13:05 Eduardo Valentin <edubezval@gmail.com>:

> > > > +static struct thermal_zone_params nvme_tz_params = {
> > > > +     .governor_name = "user_space",
> > >
> > > Also,
> > >
> > > Was there any particular reason why defaulting to user_space here?
> >
> > I only tested with the user_space governor.  There is no cooling device
> > to bind with this.
> >
>
> hmmm.. I see. But was there any particular reason to pick the thermal
> subsystem here if you do not have any cooling? Or is there any specific
> cooling that can be written in future?
>
> If no cooling is expected, why not a simple hwmon?

For example, we can use USB port powered cooling fan with USB hub that
supports per-port power switching.  It can be turned on and off by user
space tool.

> > > > +     .no_hwmon = true,
> > > > +};
> > > > +
> > > > +static struct thermal_zone_device *
> > > > +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> > > > +{
> > > > +     struct thermal_zone_device *tzdev;
> > > > +     char name[THERMAL_NAME_LENGTH];
> > > > +     int ret;
> > > > +
> > > > +     snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> > > > +
> > > > +     tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> > > > +                                          &nvme_tz_params, 0, 0);
> > > > +     if (IS_ERR(tzdev)) {
> > > > +             dev_err(ctrl->device,
> > > > +                     "Failed to register thermal zone device: %ld\n",
> > > > +                     PTR_ERR(tzdev));
> > > > +             return tzdev;
> > > > +     }
> > > > +
> > > > +     snprintf(name, sizeof(name), "temp%d", sensor);
> > > > +     ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> > > > +                             name);
> > > > +     if (ret)
> > > > +             goto device_unregister;
> > > > +
> > > > +     ret = sysfs_create_link(&tzdev->device.kobj,
> > > > +                             &ctrl->ctrl_device.kobj, "device");
> > > > +     if (ret)
> > > > +             goto remove_link;
> > > > +
> > > > +     return tzdev;
> > > > +
> > > > +remove_link:
> > > > +     sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > > > +device_unregister:
> > > > +     thermal_zone_device_unregister(tzdev);
> > > > +
> > > > +     return ERR_PTR(ret);
> > > > +}
> > > > +
> > > > +/**
> > > > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > > > + * @ctrl: controller instance
> > > > + *
> > > > + * This function creates up to nine thermal zone devices for all implemented
> > > > + * temperature sensors including the composite temperature.
> > > > + * Each thermal zone device provides a single trip point temperature that is
> > > > + * associated with an over temperature threshold.
> > > > + */
> > > > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > > +{
> > > > +     struct nvme_smart_log *log;
> > > > +     int ret;
> > > > +     int i;
> > > > +
> > > > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > > > +     if (!log)
> > > > +             return 0; /* non-fatal error */
> > >
> > > I am not sure about this API design here. I would leave the error
> > > handling and judging if this is fatal or not to the caller.
> > > I mean, if I ask to register a nvme thermal zone and I get
> > > a 0 as response, I would assume the thermal zone exists from
> > > now on, right?
> >
> > This routine is designed to return error code only when we're unable to
> > communicate with the device at all (i.e. nvme_submit_sync_cmd returns a
> > negative value).
> >
> > We don't want to abandon device initialization just due to thermal zone
> > failures.  Because thermal zone device isn't mandatory to manage the
> > controllers, and the device like qemu doesn't provide smart log (according
> > to Keith).
>
> That is fair.. it is just really weird to continue your business when
> the kernel cannot even allocate memory for you..

Right.  The system is not unlikely to continue working properly when under
page size allocation with GFP_KERNEL is failed, unless fault injection for
allocation is enabled.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 2/3] nvme: add thermal zone devices
@ 2019-06-07 15:21           ` Akinobu Mita
  0 siblings, 0 replies; 40+ messages in thread
From: Akinobu Mita @ 2019-06-07 15:21 UTC (permalink / raw)


2019?6?6?(?) 13:05 Eduardo Valentin <edubezval at gmail.com>:

> > > > +static struct thermal_zone_params nvme_tz_params = {
> > > > +     .governor_name = "user_space",
> > >
> > > Also,
> > >
> > > Was there any particular reason why defaulting to user_space here?
> >
> > I only tested with the user_space governor.  There is no cooling device
> > to bind with this.
> >
>
> hmmm.. I see. But was there any particular reason to pick the thermal
> subsystem here if you do not have any cooling? Or is there any specific
> cooling that can be written in future?
>
> If no cooling is expected, why not a simple hwmon?

For example, we can use USB port powered cooling fan with USB hub that
supports per-port power switching.  It can be turned on and off by user
space tool.

> > > > +     .no_hwmon = true,
> > > > +};
> > > > +
> > > > +static struct thermal_zone_device *
> > > > +nvme_thermal_zone_register(struct nvme_ctrl *ctrl, unsigned int sensor)
> > > > +{
> > > > +     struct thermal_zone_device *tzdev;
> > > > +     char name[THERMAL_NAME_LENGTH];
> > > > +     int ret;
> > > > +
> > > > +     snprintf(name, sizeof(name), "nvme%d_temp%u", ctrl->instance, sensor);
> > > > +
> > > > +     tzdev = thermal_zone_device_register(name, 1, 1, ctrl, &nvme_tz_ops,
> > > > +                                          &nvme_tz_params, 0, 0);
> > > > +     if (IS_ERR(tzdev)) {
> > > > +             dev_err(ctrl->device,
> > > > +                     "Failed to register thermal zone device: %ld\n",
> > > > +                     PTR_ERR(tzdev));
> > > > +             return tzdev;
> > > > +     }
> > > > +
> > > > +     snprintf(name, sizeof(name), "temp%d", sensor);
> > > > +     ret = sysfs_create_link(&ctrl->ctrl_device.kobj, &tzdev->device.kobj,
> > > > +                             name);
> > > > +     if (ret)
> > > > +             goto device_unregister;
> > > > +
> > > > +     ret = sysfs_create_link(&tzdev->device.kobj,
> > > > +                             &ctrl->ctrl_device.kobj, "device");
> > > > +     if (ret)
> > > > +             goto remove_link;
> > > > +
> > > > +     return tzdev;
> > > > +
> > > > +remove_link:
> > > > +     sysfs_remove_link(&ctrl->ctrl_device.kobj, name);
> > > > +device_unregister:
> > > > +     thermal_zone_device_unregister(tzdev);
> > > > +
> > > > +     return ERR_PTR(ret);
> > > > +}
> > > > +
> > > > +/**
> > > > + * nvme_thermal_zones_register() - register nvme thermal zone devices
> > > > + * @ctrl: controller instance
> > > > + *
> > > > + * This function creates up to nine thermal zone devices for all implemented
> > > > + * temperature sensors including the composite temperature.
> > > > + * Each thermal zone device provides a single trip point temperature that is
> > > > + * associated with an over temperature threshold.
> > > > + */
> > > > +static int nvme_thermal_zones_register(struct nvme_ctrl *ctrl)
> > > > +{
> > > > +     struct nvme_smart_log *log;
> > > > +     int ret;
> > > > +     int i;
> > > > +
> > > > +     log = kzalloc(sizeof(*log), GFP_KERNEL);
> > > > +     if (!log)
> > > > +             return 0; /* non-fatal error */
> > >
> > > I am not sure about this API design here. I would leave the error
> > > handling and judging if this is fatal or not to the caller.
> > > I mean, if I ask to register a nvme thermal zone and I get
> > > a 0 as response, I would assume the thermal zone exists from
> > > now on, right?
> >
> > This routine is designed to return error code only when we're unable to
> > communicate with the device at all (i.e. nvme_submit_sync_cmd returns a
> > negative value).
> >
> > We don't want to abandon device initialization just due to thermal zone
> > failures.  Because thermal zone device isn't mandatory to manage the
> > controllers, and the device like qemu doesn't provide smart log (according
> > to Keith).
>
> That is fair.. it is just really weird to continue your business when
> the kernel cannot even allocate memory for you..

Right.  The system is not unlikely to continue working properly when under
page size allocation with GFP_KERNEL is failed, unless fault injection for
allocation is enabled.

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2019-06-07 15:21 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-26 16:29 [PATCH v3 0/3] nvme: add thermal zone devices Akinobu Mita
2019-05-26 16:29 ` Akinobu Mita
2019-05-26 16:29 ` [PATCH v3 1/3] nvme: Export get and set features Akinobu Mita
2019-05-26 16:29   ` Akinobu Mita
2019-05-26 16:45   ` Chaitanya Kulkarni
2019-05-26 16:45     ` Chaitanya Kulkarni
2019-05-29 15:19   ` Minwoo Im
2019-05-29 15:19     ` Minwoo Im
2019-05-26 16:29 ` [PATCH v3 2/3] nvme: add thermal zone devices Akinobu Mita
2019-05-26 16:29   ` Akinobu Mita
2019-05-29 15:15   ` Minwoo Im
2019-05-29 15:15     ` Minwoo Im
2019-05-29 16:47     ` Akinobu Mita
2019-05-29 16:47       ` Akinobu Mita
2019-05-30 10:18       ` Minwoo Im
2019-05-30 10:18         ` Minwoo Im
2019-06-01  9:02   ` Christoph Hellwig
2019-06-01  9:02     ` Christoph Hellwig
2019-06-02 13:19     ` Akinobu Mita
2019-06-02 13:19       ` Akinobu Mita
2019-06-04  7:31       ` Christoph Hellwig
2019-06-04  7:31         ` Christoph Hellwig
2019-06-05 15:42         ` Akinobu Mita
2019-06-05 15:42           ` Akinobu Mita
2019-06-03  2:36   ` Eduardo Valentin
2019-06-03  2:36     ` Eduardo Valentin
2019-06-03 15:20     ` Akinobu Mita
2019-06-03 15:20       ` Akinobu Mita
2019-06-06  4:05       ` Eduardo Valentin
2019-06-06  4:05         ` Eduardo Valentin
2019-06-07 15:21         ` Akinobu Mita
2019-06-07 15:21           ` Akinobu Mita
2019-05-26 16:29 ` [PATCH v3 3/3] nvme: notify thermal framework when temperature threshold events occur Akinobu Mita
2019-05-26 16:29   ` Akinobu Mita
2019-06-01  9:03   ` Christoph Hellwig
2019-06-01  9:03     ` Christoph Hellwig
2019-06-02 13:46     ` Akinobu Mita
2019-06-02 13:46       ` Akinobu Mita
2019-06-04  7:32       ` Christoph Hellwig
2019-06-04  7:32         ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.