cgroups.vger.kernel.org archive mirror
* [PATCH v3 0/3] block,nvme: latency-based I/O scheduler
       [not found] <20240403141756.88233-1-hare@kernel.org>
@ 2024-05-09 20:43 ` John Meneghini
  2024-05-10  9:34   ` Niklas Cassel
  2024-05-09 20:43 ` [PATCH v3 1/3] block: track per-node I/O latency John Meneghini
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: John Meneghini @ 2024-05-09 20:43 UTC (permalink / raw)
  To: tj, josef, axboe, kbusch, hch, sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jmeneghi, jrani,
	randyj, aviv.coro

I'm re-issuing Hannes's latency patches in preparation for LSFMM.

Changes since V2:

I've done quite a bit of work cleaning up these patches. There were a
number of checkpatch.pl problems as well as some compile-time errors
when CONFIG_BLK_NODE_LATENCY was turned off. After the cleanup I
rebased these patches onto Ewan's "nvme: queue-depth multipath iopolicy"
patches, which allowed me to test both iopolicy changes together.

All of my test results, together with the scripts I used to generate these
graphs, are available at:

  https://github.com/johnmeneghini/iopolicy

Please use the scripts in this repository to do your own testing.

Changes since V1:

Hi all,

there have been several attempts to implement a latency-based I/O
scheduler for native NVMe multipath, all of which had their issues.

So it's time to start afresh, this time using the QoS framework
already present in the block layer.
The new implementation consists of two parts:
- a new 'blk-nlatency' QoS module, which is just a simple per-node
  latency tracker
- a 'latency' nvme I/O policy

Using the 'tiobench' fio script with a 512-byte blocksize I'm getting
the following latencies (in usecs) as a baseline:
- seq write: avg 186 stddev 331
- rand write: avg 4598 stddev 7903
- seq read: avg 149 stddev 65
- rand read: avg 150 stddev 68

Enabling the 'latency' iopolicy:
- seq write: avg 178 stddev 113
- rand write: avg 3427 stddev 6703
- seq read: avg 140 stddev 59
- rand read: avg 141 stddev 58

Setting the 'decay' parameter to 10:
- seq write: avg 182 stddev 65
- rand write: avg 2619 stddev 5894
- seq read: avg 142 stddev 57
- rand read: avg 140 stddev 57  

That's on a 32G FC testbed running against a brd target, with fio
running 48 threads. So the promise is met: latency
goes down, and we're even able to control the standard
deviation via the 'decay' parameter.

As usual, comments and reviews are welcome.

Changes to the original version:
- split the rqos debugfs entries
- Modify commit message to indicate latency
- rename to blk-nlatency

Hannes Reinecke (2):
  block: track per-node I/O latency
  nvme: add 'latency' iopolicy

John Meneghini (1):
  nvme: multipath: pr_notice when iopolicy changes

 MAINTAINERS                   |   1 +
 block/Kconfig                 |   9 +
 block/Makefile                |   1 +
 block/blk-mq-debugfs.c        |   2 +
 block/blk-nlatency.c          | 389 ++++++++++++++++++++++++++++++++++
 block/blk-rq-qos.h            |   6 +
 drivers/nvme/host/multipath.c |  73 ++++++-
 drivers/nvme/host/nvme.h      |   1 +
 include/linux/blk-mq.h        |  11 +
 9 files changed, 484 insertions(+), 9 deletions(-)
 create mode 100644 block/blk-nlatency.c

-- 
2.39.3



* [PATCH v3 1/3] block: track per-node I/O latency
       [not found] <20240403141756.88233-1-hare@kernel.org>
  2024-05-09 20:43 ` [PATCH v3 0/3] block,nvme: latency-based I/O scheduler John Meneghini
@ 2024-05-09 20:43 ` John Meneghini
  2024-05-10  7:11   ` Damien Le Moal
  2024-05-09 20:43 ` [PATCH v3 2/3] nvme: add 'latency' iopolicy John Meneghini
  2024-05-09 20:43 ` [PATCH v3 3/3] nvme: multipath: pr_notice when iopolicy changes John Meneghini
  3 siblings, 1 reply; 11+ messages in thread
From: John Meneghini @ 2024-05-09 20:43 UTC (permalink / raw)
  To: tj, josef, axboe, kbusch, hch, sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jmeneghi, jrani,
	randyj, aviv.coro

From: Hannes Reinecke <hare@kernel.org>

Add a new option 'BLK_NODE_LATENCY' to track per-node I/O latency.
This can be used by I/O schedulers to determine the 'best' queue
to send I/O to.

Signed-off-by: Hannes Reinecke <hare@kernel.org>

Cleaned up checkpatch warnings and updated MAINTAINERS.

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
---
 MAINTAINERS            |   1 +
 block/Kconfig          |   9 +
 block/Makefile         |   1 +
 block/blk-mq-debugfs.c |   2 +
 block/blk-nlatency.c   | 389 +++++++++++++++++++++++++++++++++++++++++
 block/blk-rq-qos.h     |   6 +
 include/linux/blk-mq.h |  11 ++
 7 files changed, 419 insertions(+)
 create mode 100644 block/blk-nlatency.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 7c121493f43d..a4634365c82f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5405,6 +5405,7 @@ F:	block/bfq-cgroup.c
 F:	block/blk-cgroup.c
 F:	block/blk-iocost.c
 F:	block/blk-iolatency.c
+F:	block/blk-nlatency.c
 F:	block/blk-throttle.c
 F:	include/linux/blk-cgroup.h
 
diff --git a/block/Kconfig b/block/Kconfig
index 1de4682d48cc..641ed39d609c 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -186,6 +186,15 @@ config BLK_CGROUP_IOPRIO
 	scheduler and block devices process requests. Only some I/O schedulers
 	and some block devices support I/O priorities.
 
+config BLK_NODE_LATENCY
+	bool "Track per-node I/O latency"
+	help
+	Enable per-node I/O latency tracking for multipathing. This uses the
+	blk-nodelat latency tracker to provide latencies for each node, and schedules
+	I/O on the path with the least latency for the submitting node. This can be
+	used by I/O schedulers to determine the node with the least latency. Currently
+	only supports nvme over fabrics devices.
+
 config BLK_DEBUG_FS
 	bool "Block layer debugging information in debugfs"
 	default y
diff --git a/block/Makefile b/block/Makefile
index 46ada9dc8bbf..9d2e71a3e36f 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
+obj-$(CONFIG_BLK_NODE_LATENCY) += blk-nlatency.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 94668e72ab09..cb38228b95d8 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -762,6 +762,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
 		return "latency";
 	case RQ_QOS_COST:
 		return "cost";
+	case RQ_QOS_NLAT:
+		return "node-latency";
 	}
 	return "unknown";
 }
diff --git a/block/blk-nlatency.c b/block/blk-nlatency.c
new file mode 100644
index 000000000000..219c3f636d76
--- /dev/null
+++ b/block/blk-nlatency.c
@@ -0,0 +1,389 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Per-node request latency tracking.
+ *
+ * Copyright (C) 2023 Hannes Reinecke
+ *
+ * A simple per-node latency tracker for use by I/O schedulers.
+ * Latencies are measured over 'win_usec' microseconds and stored per node.
+ * If the number of measurements falls below 'lowat' the measurement is
+ * assumed to be unreliable and will become 'stale'.
+ * These 'stale' latencies can be 'decayed', where during each measurement
+ * interval the 'stale' latency value is decreased by 'decay' percent.
+ * Once the 'stale' latency reaches zero it will be updated by the
+ * measured latency.
+ */
+#include <linux/kernel.h>
+#include <linux/blk_types.h>
+#include <linux/slab.h>
+
+#include "blk-stat.h"
+#include "blk-rq-qos.h"
+#include "blk.h"
+
+#define NLAT_DEFAULT_LOWAT 2
+#define NLAT_DEFAULT_DECAY 50
+
+struct rq_nlat {
+	struct rq_qos rqos;
+
+	u64 win_usec;		/* latency measurement window in microseconds */
+	unsigned int lowat;	/* Low Watermark latency measurement */
+	unsigned int decay;	/* Percentage for 'decaying' latencies */
+	bool enabled;
+
+	struct blk_stat_callback *cb;
+
+	unsigned int num;
+	u64 *latency;
+	unsigned int *samples;
+};
+
+static inline struct rq_nlat *RQNLAT(struct rq_qos *rqos)
+{
+	return container_of(rqos, struct rq_nlat, rqos);
+}
+
+static u64 nlat_default_latency_usec(struct request_queue *q)
+{
+	/*
+	 * We default to 2msec for non-rotational storage, and 75msec
+	 * for rotational storage.
+	 */
+	if (blk_queue_nonrot(q))
+		return 2000ULL;
+	else
+		return 75000ULL;
+}
+
+static void nlat_timer_fn(struct blk_stat_callback *cb)
+{
+	struct rq_nlat *nlat = cb->data;
+	int n;
+
+	for (n = 0; n < cb->buckets; n++) {
+		if (cb->stat[n].nr_samples < nlat->lowat) {
+			/*
+			 * 'decay' the latency by the specified
+			 * percentage to ensure the queues are
+			 * being tested to balance out temporary
+			 * latency spikes.
+			 */
+			nlat->latency[n] =
+				div64_u64(nlat->latency[n] * nlat->decay, 100);
+		} else
+			nlat->latency[n] = cb->stat[n].mean;
+		nlat->samples[n] = cb->stat[n].nr_samples;
+	}
+	if (nlat->enabled)
+		blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+}
+
+static int nlat_bucket_node(const struct request *rq)
+{
+	if (!rq->mq_ctx)
+		return -1;
+	return cpu_to_node(blk_mq_rq_cpu((struct request *)rq));
+}
+
+static void nlat_exit(struct rq_qos *rqos)
+{
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	blk_stat_remove_callback(nlat->rqos.disk->queue, nlat->cb);
+	blk_stat_free_callback(nlat->cb);
+	kfree(nlat->samples);
+	kfree(nlat->latency);
+	kfree(nlat);
+}
+
+#ifdef CONFIG_BLK_DEBUG_FS
+static int nlat_win_usec_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%llu\n", nlat->win_usec);
+	return 0;
+}
+
+static ssize_t nlat_win_usec_write(void *data, const char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	char val[16] = { };
+	u64 usec;
+	int err;
+
+	if (blk_queue_dying(nlat->rqos.disk->queue))
+		return -ENOENT;
+
+	if (count >= sizeof(val))
+		return -EINVAL;
+
+	if (copy_from_user(val, buf, count))
+		return -EFAULT;
+
+	err = kstrtoull(val, 10, &usec);
+	if (err)
+		return err;
+	blk_stat_deactivate(nlat->cb);
+	nlat->win_usec = usec;
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return count;
+}
+
+static int nlat_lowat_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%u\n", nlat->lowat);
+	return 0;
+}
+
+static ssize_t nlat_lowat_write(void *data, const char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	char val[16] = { };
+	unsigned int lowat;
+	int err;
+
+	if (blk_queue_dying(nlat->rqos.disk->queue))
+		return -ENOENT;
+
+	if (count >= sizeof(val))
+		return -EINVAL;
+
+	if (copy_from_user(val, buf, count))
+		return -EFAULT;
+
+	err = kstrtouint(val, 10, &lowat);
+	if (err)
+		return err;
+	blk_stat_deactivate(nlat->cb);
+	nlat->lowat = lowat;
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return count;
+}
+
+static int nlat_decay_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%u\n", nlat->decay);
+	return 0;
+}
+
+static ssize_t nlat_decay_write(void *data, const char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	char val[16] = { };
+	unsigned int decay;
+	int err;
+
+	if (blk_queue_dying(nlat->rqos.disk->queue))
+		return -ENOENT;
+
+	if (count >= sizeof(val))
+		return -EINVAL;
+
+	if (copy_from_user(val, buf, count))
+		return -EFAULT;
+
+	err = kstrtouint(val, 10, &decay);
+	if (err)
+		return err;
+	if (decay > 100)
+		return -EINVAL;
+	blk_stat_deactivate(nlat->cb);
+	nlat->decay = decay;
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return count;
+}
+
+static int nlat_enabled_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+
+	seq_printf(m, "%d\n", nlat->enabled);
+	return 0;
+}
+
+static int nlat_id_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+
+	seq_printf(m, "%u\n", rqos->id);
+	return 0;
+}
+
+static int nlat_latency_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	int n;
+
+	if (!nlat->enabled)
+		return 0;
+
+	for (n = 0; n < nlat->num; n++) {
+		if (n > 0)
+			seq_puts(m, " ");
+		seq_printf(m, "%llu", nlat->latency[n]);
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static int nlat_samples_show(void *data, struct seq_file *m)
+{
+	struct rq_qos *rqos = data;
+	struct rq_nlat *nlat = RQNLAT(rqos);
+	int n;
+
+	if (!nlat->enabled)
+		return 0;
+
+	for (n = 0; n < nlat->num; n++) {
+		if (n > 0)
+			seq_puts(m, " ");
+		seq_printf(m, "%u", nlat->samples[n]);
+	}
+	seq_puts(m, "\n");
+	return 0;
+}
+
+static const struct blk_mq_debugfs_attr nlat_debugfs_attrs[] = {
+	{"win_usec", 0600, nlat_win_usec_show, nlat_win_usec_write},
+	{"lowat", 0600, nlat_lowat_show, nlat_lowat_write},
+	{"decay", 0600, nlat_decay_show, nlat_decay_write},
+	{"enabled", 0400, nlat_enabled_show},
+	{"id", 0400, nlat_id_show},
+	{"latency", 0400, nlat_latency_show},
+	{"samples", 0400, nlat_samples_show},
+	{},
+};
+#endif
+
+static const struct rq_qos_ops nlat_rqos_ops = {
+	.exit = nlat_exit,
+#ifdef CONFIG_BLK_DEBUG_FS
+	.debugfs_attrs = nlat_debugfs_attrs,
+#endif
+};
+
+u64 blk_nlat_latency(struct gendisk *disk, int node)
+{
+	struct rq_qos *rqos;
+	struct rq_nlat *nlat;
+
+	rqos = nlat_rq_qos(disk->queue);
+	if (!rqos)
+		return 0;
+	nlat = RQNLAT(rqos);
+	if (node >= nlat->num)
+		return 0;
+
+	return div64_u64(nlat->latency[node], 1000);
+}
+EXPORT_SYMBOL_GPL(blk_nlat_latency);
+
+int blk_nlat_enable(struct gendisk *disk)
+{
+	struct rq_qos *rqos;
+	struct rq_nlat *nlat;
+
+	/* Latency tracking not enabled? */
+	rqos = nlat_rq_qos(disk->queue);
+	if (!rqos)
+		return -EINVAL;
+	nlat = RQNLAT(rqos);
+	if (nlat->enabled)
+		return 0;
+
+	/* Queue not registered? Maybe shutting down... */
+	if (!blk_queue_registered(disk->queue))
+		return -EAGAIN;
+
+	nlat->enabled = true;
+	memset(nlat->latency, 0, sizeof(u64) * nlat->num);
+	memset(nlat->samples, 0, sizeof(unsigned int) * nlat->num);
+	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(blk_nlat_enable);
+
+void blk_nlat_disable(struct gendisk *disk)
+{
+	struct rq_qos *rqos = nlat_rq_qos(disk->queue);
+	struct rq_nlat *nlat;
+
+	if (!rqos)
+		return;
+	nlat = RQNLAT(rqos);
+	if (nlat->enabled) {
+		blk_stat_deactivate(nlat->cb);
+		nlat->enabled = false;
+	}
+}
+EXPORT_SYMBOL_GPL(blk_nlat_disable);
+
+int blk_nlat_init(struct gendisk *disk)
+{
+	struct rq_nlat *nlat;
+	int ret = -ENOMEM;
+
+	nlat = kzalloc(sizeof(*nlat), GFP_KERNEL);
+	if (!nlat)
+		return -ENOMEM;
+
+	nlat->num = num_possible_nodes();
+	nlat->lowat = NLAT_DEFAULT_LOWAT;
+	nlat->decay = NLAT_DEFAULT_DECAY;
+	nlat->win_usec = nlat_default_latency_usec(disk->queue);
+
+	nlat->latency = kcalloc(nlat->num, sizeof(u64), GFP_KERNEL);
+	if (!nlat->latency)
+		goto err_free;
+	nlat->samples = kcalloc(nlat->num, sizeof(unsigned int), GFP_KERNEL);
+	if (!nlat->samples)
+		goto err_free;
+	nlat->cb = blk_stat_alloc_callback(nlat_timer_fn, nlat_bucket_node,
+					   nlat->num, nlat);
+	if (!nlat->cb)
+		goto err_free;
+
+	/*
+	 * Assign rwb and add the stats callback.
+	 */
+	mutex_lock(&disk->queue->rq_qos_mutex);
+	ret = rq_qos_add(&nlat->rqos, disk, RQ_QOS_NLAT, &nlat_rqos_ops);
+	mutex_unlock(&disk->queue->rq_qos_mutex);
+	if (ret)
+		goto err_free_cb;
+
+	blk_stat_add_callback(disk->queue, nlat->cb);
+
+	return 0;
+
+err_free_cb:
+	blk_stat_free_callback(nlat->cb);
+err_free:
+	kfree(nlat->samples);
+	kfree(nlat->latency);
+	kfree(nlat);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(blk_nlat_init);
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 37245c97ee61..2fc11ced0c00 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -17,6 +17,7 @@ enum rq_qos_id {
 	RQ_QOS_WBT,
 	RQ_QOS_LATENCY,
 	RQ_QOS_COST,
+	RQ_QOS_NLAT,
 };
 
 struct rq_wait {
@@ -79,6 +80,11 @@ static inline struct rq_qos *iolat_rq_qos(struct request_queue *q)
 	return rq_qos_id(q, RQ_QOS_LATENCY);
 }
 
+static inline struct rq_qos *nlat_rq_qos(struct request_queue *q)
+{
+	return rq_qos_id(q, RQ_QOS_NLAT);
+}
+
 static inline void rq_wait_init(struct rq_wait *rq_wait)
 {
 	atomic_set(&rq_wait->inflight, 0);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d3d8fd8e229b..1f3829627f1b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -1231,4 +1231,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
 }
 #endif /* CONFIG_BLK_DEV_ZONED */
 
+#ifdef CONFIG_BLK_NODE_LATENCY
+int blk_nlat_enable(struct gendisk *disk);
+void blk_nlat_disable(struct gendisk *disk);
+u64 blk_nlat_latency(struct gendisk *disk, int node);
+int blk_nlat_init(struct gendisk *disk);
+#else
+static inline int blk_nlat_enable(struct gendisk *disk) { return 0; }
+static inline void blk_nlat_disable(struct gendisk *disk) {}
+static inline u64 blk_nlat_latency(struct gendisk *disk, int node) { return 0; }
+static inline int blk_nlat_init(struct gendisk *disk) { return -EOPNOTSUPP; }
+#endif
 #endif /* BLK_MQ_H */
-- 
2.39.3



* [PATCH v3 2/3] nvme: add 'latency' iopolicy
       [not found] <20240403141756.88233-1-hare@kernel.org>
  2024-05-09 20:43 ` [PATCH v3 0/3] block,nvme: latency-based I/O scheduler John Meneghini
  2024-05-09 20:43 ` [PATCH v3 1/3] block: track per-node I/O latency John Meneghini
@ 2024-05-09 20:43 ` John Meneghini
  2024-05-10  7:17   ` Damien Le Moal
  2024-05-09 20:43 ` [PATCH v3 3/3] nvme: multipath: pr_notice when iopolicy changes John Meneghini
  3 siblings, 1 reply; 11+ messages in thread
From: John Meneghini @ 2024-05-09 20:43 UTC (permalink / raw)
  To: tj, josef, axboe, kbusch, hch, sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jmeneghi, jrani,
	randyj, aviv.coro

From: Hannes Reinecke <hare@kernel.org>

Add a latency-based I/O policy for multipathing. It uses the blk-nodelat
latency tracker to provide latencies for each node, and schedules
I/O on the path with the least latency for the submitting node.

Signed-off-by: Hannes Reinecke <hare@kernel.org>

Make this compile when CONFIG_BLK_NODE_LATENCY is not set.
Advertise the 'latency' iopolicy in modinfo.

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
---
 drivers/nvme/host/multipath.c | 63 ++++++++++++++++++++++++++++++-----
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 55 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index d916a5ddf5d4..e9330bb1990b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -18,6 +18,7 @@ static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
 	[NVME_IOPOLICY_QD]      = "queue-depth",
+	[NVME_IOPOLICY_LAT]	= "latency",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -32,6 +33,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_RR;
 	else if (!strncmp(val, "queue-depth", 11))
 		iopolicy = NVME_IOPOLICY_QD;
+#ifdef CONFIG_BLK_NODE_LATENCY
+	else if (!strncmp(val, "latency", 7))
+		iopolicy = NVME_IOPOLICY_LAT;
+#endif
 	else
 		return -EINVAL;
 
@@ -43,10 +48,36 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
 	return sprintf(buf, "%s\n", nvme_iopolicy_names[iopolicy]);
 }
 
+static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
+{
+	struct nvme_ns_head *h;
+	struct nvme_ns *ns;
+	bool enable = iopolicy == NVME_IOPOLICY_LAT;
+	int ret = 0;
+
+	mutex_lock(&subsys->lock);
+	list_for_each_entry(h, &subsys->nsheads, entry) {
+		list_for_each_entry_rcu(ns, &h->list, siblings) {
+			if (enable) {
+				ret = blk_nlat_enable(ns->disk);
+				if (ret)
+					break;
+			} else
+				blk_nlat_disable(ns->disk);
+		}
+	}
+	mutex_unlock(&subsys->lock);
+	return ret;
+}
+
 module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
 	&iopolicy, 0644);
 MODULE_PARM_DESC(iopolicy,
+#if defined(CONFIG_BLK_NODE_LATENCY)
+	"Default multipath I/O policy; 'numa' (default) , 'round-robin', 'queue-depth' or 'latency'");
+#else
 	"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
+#endif
 
 void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
 {
@@ -250,14 +281,16 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 {
 	int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
 	struct nvme_ns *found = NULL, *fallback = NULL, *ns;
+	int iopolicy = READ_ONCE(head->subsys->iopolicy);
 
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
 		if (nvme_path_is_disabled(ns))
 			continue;
 
-		if (ns->ctrl->numa_node != NUMA_NO_NODE &&
-		    READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
+		if (iopolicy == NVME_IOPOLICY_NUMA)
 			distance = node_distance(node, ns->ctrl->numa_node);
+		else if (iopolicy == NVME_IOPOLICY_LAT)
+			distance = blk_nlat_latency(ns->disk, node);
 		else
 			distance = LOCAL_DISTANCE;
 
@@ -381,8 +414,8 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 
 inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 {
-	int iopolicy = READ_ONCE(head->subsys->iopolicy);
 	int node;
+	int iopolicy = READ_ONCE(head->subsys->iopolicy);
 	struct nvme_ns *ns;
 
 	/*
@@ -401,8 +434,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 
 	if (iopolicy == NVME_IOPOLICY_RR)
 		return nvme_round_robin_path(head, node, ns);
-
-	if (unlikely(!nvme_path_is_optimized(ns)))
+	if (iopolicy == NVME_IOPOLICY_LAT ||
+	    unlikely(!nvme_path_is_optimized(ns)))
 		return __nvme_find_path(head, node);
 	return ns;
 }
@@ -872,15 +905,18 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
 {
 	struct nvme_subsystem *subsys =
 		container_of(dev, struct nvme_subsystem, dev);
-	int i;
+	int i, ret;
 
 	for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
 		if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
-			nvme_subsys_iopolicy_update(subsys, i);
-			return count;
+			ret = nvme_activate_iopolicy(subsys, i);
+			if (!ret) {
+				nvme_subsys_iopolicy_update(subsys, i);
+				return count;
+			}
+			return ret;
 		}
 	}
-
 	return -EINVAL;
 }
 SUBSYS_ATTR_RW(iopolicy, S_IRUGO | S_IWUSR,
@@ -916,6 +952,15 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
 
 void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
 {
+	if (!blk_nlat_init(ns->disk) &&
+	    READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_LAT) {
+		int ret = blk_nlat_enable(ns->disk);
+
+		if (unlikely(ret))
+			pr_warn("%s: Failed to enable latency tracking, error %d\n",
+				ns->disk->disk_name, ret);
+	}
+
 	if (nvme_ctrl_use_ana(ns->ctrl)) {
 		struct nvme_ana_group_desc desc = {
 			.grpid = anagrpid,
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index a557b4577c01..66bf003a6c48 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -411,6 +411,7 @@ enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
 	NVME_IOPOLICY_QD,
+	NVME_IOPOLICY_LAT,
 };
 
 struct nvme_subsystem {
-- 
2.39.3



* [PATCH v3 3/3] nvme: multipath: pr_notice when iopolicy changes
       [not found] <20240403141756.88233-1-hare@kernel.org>
                   ` (2 preceding siblings ...)
  2024-05-09 20:43 ` [PATCH v3 2/3] nvme: add 'latency' iopolicy John Meneghini
@ 2024-05-09 20:43 ` John Meneghini
  2024-05-10  7:19   ` Damien Le Moal
  3 siblings, 1 reply; 11+ messages in thread
From: John Meneghini @ 2024-05-09 20:43 UTC (permalink / raw)
  To: tj, josef, axboe, kbusch, hch, sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jmeneghi, jrani,
	randyj, aviv.coro

Send a pr_notice whenever the iopolicy on a subsystem
is changed. This is important for support reasons: it
is fully expected that users will be changing the iopolicy
with active I/O in progress.

Signed-off-by: John Meneghini <jmeneghi@redhat.com>
---
 drivers/nvme/host/multipath.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index e9330bb1990b..0286e44a081f 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -67,6 +67,10 @@ static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
 		}
 	}
 	mutex_unlock(&subsys->lock);
+
+	pr_notice("%s: %s enable %d status %d for subsysnqn %s\n", __func__,
+			nvme_iopolicy_names[iopolicy], enable, ret, subsys->subnqn);
+
 	return ret;
 }
 
@@ -890,6 +894,8 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
 {
 	struct nvme_ctrl *ctrl;
 
+	int old_iopolicy = READ_ONCE(subsys->iopolicy);
+
 	WRITE_ONCE(subsys->iopolicy, iopolicy);
 
 	mutex_lock(&nvme_subsystems_lock);
@@ -898,6 +904,10 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
 		nvme_mpath_clear_ctrl_paths(ctrl);
 	}
 	mutex_unlock(&nvme_subsystems_lock);
+
+	pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
+			nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
+			subsys->subnqn);
 }
 
 static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
-- 
2.39.3



* Re: [PATCH v3 1/3] block: track per-node I/O latency
  2024-05-09 20:43 ` [PATCH v3 1/3] block: track per-node I/O latency John Meneghini
@ 2024-05-10  7:11   ` Damien Le Moal
  2024-05-10  9:28     ` Niklas Cassel
  2024-05-10 10:00     ` Hannes Reinecke
  0 siblings, 2 replies; 11+ messages in thread
From: Damien Le Moal @ 2024-05-10  7:11 UTC (permalink / raw)
  To: John Meneghini, tj, josef, axboe, kbusch, hch, sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jrani, randyj, aviv.coro

On 5/10/24 05:43, John Meneghini wrote:
> From: Hannes Reinecke <hare@kernel.org>
> 
> Add a new option 'BLK_NODE_LATENCY' to track per-node I/O latency.
> This can be used by I/O schedulers to determine the 'best' queue
> to send I/O to.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> 
> Cleaned up checkpatch warnings and updated MAINTAINERS.

This note should be before Hannes' SoB. E.g.:

[John] Fixed checkpatch warnings and updated MAINTAINERS.

> 
> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
> ---
>  MAINTAINERS            |   1 +
>  block/Kconfig          |   9 +
>  block/Makefile         |   1 +
>  block/blk-mq-debugfs.c |   2 +
>  block/blk-nlatency.c   | 389 +++++++++++++++++++++++++++++++++++++++++
>  block/blk-rq-qos.h     |   6 +
>  include/linux/blk-mq.h |  11 ++
>  7 files changed, 419 insertions(+)
>  create mode 100644 block/blk-nlatency.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7c121493f43d..a4634365c82f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5405,6 +5405,7 @@ F:	block/bfq-cgroup.c
>  F:	block/blk-cgroup.c
>  F:	block/blk-iocost.c
>  F:	block/blk-iolatency.c
> +F:	block/blk-nlatency.c
>  F:	block/blk-throttle.c
>  F:	include/linux/blk-cgroup.h
>  
> diff --git a/block/Kconfig b/block/Kconfig
> index 1de4682d48cc..641ed39d609c 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -186,6 +186,15 @@ config BLK_CGROUP_IOPRIO
>  	scheduler and block devices process requests. Only some I/O schedulers
>  	and some block devices support I/O priorities.
>  
> +config BLK_NODE_LATENCY
> +	bool "Track per-node I/O latency"
> +	help
> +	Enable per-node I/O latency tracking for multipathing. This uses the
> +	blk-nodelat latency tracker to provide latencies for each node, and schedules
> +	I/O on the path with the least latency for the submitting node. This can be
> +	used by I/O schedulers to determine the node with the least latency. Currently
> +	only supports nvme over fabrics devices.
> +
>  config BLK_DEBUG_FS
>  	bool "Block layer debugging information in debugfs"
>  	default y
> diff --git a/block/Makefile b/block/Makefile
> index 46ada9dc8bbf..9d2e71a3e36f 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
>  obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
>  obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
>  obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
> +obj-$(CONFIG_BLK_NODE_LATENCY) += blk-nlatency.o

Let's keep the alignment please.

>  obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
>  obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
>  bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 94668e72ab09..cb38228b95d8 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -762,6 +762,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
>  		return "latency";
>  	case RQ_QOS_COST:
>  		return "cost";
> +	case RQ_QOS_NLAT:
> +		return "node-latency";
>  	}
>  	return "unknown";
>  }
> diff --git a/block/blk-nlatency.c b/block/blk-nlatency.c
> new file mode 100644
> index 000000000000..219c3f636d76
> --- /dev/null
> +++ b/block/blk-nlatency.c
> @@ -0,0 +1,389 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Per-node request latency tracking.
> + *
> + * Copyright (C) 2023 Hannes Reinecke
> + *
> + * A simple per-node latency tracker for use by I/O schedulers.
> + * Latencies are measured over 'win_usec' microseconds and stored per node.
> + * If the number of measurements falls below 'lowat' the measurement is
> + * assumed to be unreliable and will become 'stale'.
> + * These 'stale' latencies can be 'decayed', where during each measurement
> + * interval the 'stale' latency value is decreased by 'decay' percent.
> + * Once the 'stale' latency reaches zero it will be updated by the
> + * measured latency.
> + */
> +#include <linux/kernel.h>
> +#include <linux/blk_types.h>
> +#include <linux/slab.h>
> +
> +#include "blk-stat.h"
> +#include "blk-rq-qos.h"
> +#include "blk.h"
> +
> +#define NLAT_DEFAULT_LOWAT 2
> +#define NLAT_DEFAULT_DECAY 50
> +
> +struct rq_nlat {
> +	struct rq_qos rqos;
> +
> +	u64 win_usec;		/* latency measurement window in microseconds */

Using microseconds forces you to do costly multiplications and divisions by
1000. Why not keep things in nanoseconds?

> +	unsigned int lowat;	/* Low Watermark latency measurement */
> +	unsigned int decay;	/* Percentage for 'decaying' latencies */
> +	bool enabled;
> +
> +	struct blk_stat_callback *cb;
> +
> +	unsigned int num;
> +	u64 *latency;
> +	unsigned int *samples;
> +};
> +
> +static inline struct rq_nlat *RQNLAT(struct rq_qos *rqos)
> +{
> +	return container_of(rqos, struct rq_nlat, rqos);
> +}
> +
> +static u64 nlat_default_latency_usec(struct request_queue *q)
> +{
> +	/*
> +	 * We default to 2msec for non-rotational storage, and 75msec
> +	 * for rotational storage.
> +	 */
> +	if (blk_queue_nonrot(q))
> +		return 2000ULL;
> +	else

No need for this else.

> +		return 75000ULL;
> +}
> +
> +static void nlat_timer_fn(struct blk_stat_callback *cb)
> +{
> +	struct rq_nlat *nlat = cb->data;
> +	int n;
> +
> +	for (n = 0; n < cb->buckets; n++) {
> +		if (cb->stat[n].nr_samples < nlat->lowat) {
> +			/*
> +			 * 'decay' the latency by the specified
> +			 * percentage to ensure the queues are
> +			 * being tested to balance out temporary
> +			 * latency spikes.
> +			 */
> +			nlat->latency[n] =
> +				div64_u64(nlat->latency[n] * nlat->decay, 100);
> +		} else
> +			nlat->latency[n] = cb->stat[n].mean;

Missing the curly brackets around the else block.
Nit: n is a rather unusual name for a loop index. Why not the usual "i" ? Does
not matter much though.
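The decay arithmetic being discussed is simple enough to model in user space. The sketch below is illustrative only (the function name is hypothetical; the kernel path uses div64_u64()):

```c
#include <stdint.h>

/* Model of the decay step: when a bucket collected fewer than 'lowat'
 * samples in a window, its stored latency is reduced to 'decay' percent
 * of the previous value. Once it decays to zero, the next real
 * measurement replaces it. */
static uint64_t nlat_decay_step(uint64_t latency, unsigned int decay)
{
	/* the kernel does div64_u64(latency * decay, 100) */
	return latency * decay / 100;
}
```

With the default decay of 50, a stale value halves every measurement window (1000 -> 500 -> 250 -> ...), so a temporarily idle path is retried within a few windows instead of keeping its last latency forever.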

> +		nlat->samples[n] = cb->stat[n].nr_samples;
> +	}
> +	if (nlat->enabled)
> +		blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
> +}
> +
> +static int nlat_bucket_node(const struct request *rq)
> +{
> +	if (!rq->mq_ctx)
> +		return -1;
> +	return cpu_to_node(blk_mq_rq_cpu((struct request *)rq));
> +}
> +
> +static void nlat_exit(struct rq_qos *rqos)
> +{
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +
> +	blk_stat_remove_callback(nlat->rqos.disk->queue, nlat->cb);
> +	blk_stat_free_callback(nlat->cb);
> +	kfree(nlat->samples);
> +	kfree(nlat->latency);
> +	kfree(nlat);
> +}
> +
> +#ifdef CONFIG_BLK_DEBUG_FS
> +static int nlat_win_usec_show(void *data, struct seq_file *m)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +
> +	seq_printf(m, "%llu\n", nlat->win_usec);
> +	return 0;
> +}
> +
> +static ssize_t nlat_win_usec_write(void *data, const char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +	char val[16] = { };
> +	u64 usec;
> +	int err;
> +
> +	if (blk_queue_dying(nlat->rqos.disk->queue))
> +		return -ENOENT;
> +
> +	if (count >= sizeof(val))
> +		return -EINVAL;
> +
> +	if (copy_from_user(val, buf, count))
> +		return -EFAULT;
> +
> +	err = kstrtoull(val, 10, &usec);
> +	if (err)
> +		return err;
> +	blk_stat_deactivate(nlat->cb);
> +	nlat->win_usec = usec;
> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
> +
> +	return count;
> +}
> +
> +static int nlat_lowat_show(void *data, struct seq_file *m)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +
> +	seq_printf(m, "%u\n", nlat->lowat);
> +	return 0;
> +}
> +
> +static ssize_t nlat_lowat_write(void *data, const char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +	char val[16] = { };
> +	unsigned int lowat;
> +	int err;
> +
> +	if (blk_queue_dying(nlat->rqos.disk->queue))
> +		return -ENOENT;
> +
> +	if (count >= sizeof(val))
> +		return -EINVAL;
> +
> +	if (copy_from_user(val, buf, count))
> +		return -EFAULT;
> +
> +	err = kstrtouint(val, 10, &lowat);
> +	if (err)
> +		return err;
> +	blk_stat_deactivate(nlat->cb);
> +	nlat->lowat = lowat;
> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
> +
> +	return count;
> +}
> +
> +static int nlat_decay_show(void *data, struct seq_file *m)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +
> +	seq_printf(m, "%u\n", nlat->decay);
> +	return 0;
> +}
> +
> +static ssize_t nlat_decay_write(void *data, const char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +	char val[16] = { };
> +	unsigned int decay;
> +	int err;
> +
> +	if (blk_queue_dying(nlat->rqos.disk->queue))
> +		return -ENOENT;
> +
> +	if (count >= sizeof(val))
> +		return -EINVAL;
> +
> +	if (copy_from_user(val, buf, count))
> +		return -EFAULT;
> +
> +	err = kstrtouint(val, 10, &decay);
> +	if (err)
> +		return err;
> +	if (decay > 100)
> +		return -EINVAL;
> +	blk_stat_deactivate(nlat->cb);
> +	nlat->decay = decay;
> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
> +
> +	return count;
> +}
> +
> +static int nlat_enabled_show(void *data, struct seq_file *m)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +
> +	seq_printf(m, "%d\n", nlat->enabled);
> +	return 0;
> +}
> +
> +static int nlat_id_show(void *data, struct seq_file *m)
> +{
> +	struct rq_qos *rqos = data;
> +
> +	seq_printf(m, "%u\n", rqos->id);
> +	return 0;
> +}
> +
> +static int nlat_latency_show(void *data, struct seq_file *m)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +	int n;
> +
> +	if (!nlat->enabled)
> +		return 0;
> +
> +	for (n = 0; n < nlat->num; n++) {
> +		if (n > 0)
> +			seq_puts(m, " ");
> +		seq_printf(m, "%llu", nlat->latency[n]);
> +	}
> +	seq_puts(m, "\n");
> +	return 0;
> +}
> +
> +static int nlat_samples_show(void *data, struct seq_file *m)
> +{
> +	struct rq_qos *rqos = data;
> +	struct rq_nlat *nlat = RQNLAT(rqos);
> +	int n;
> +
> +	if (!nlat->enabled)
> +		return 0;
> +
> +	for (n = 0; n < nlat->num; n++) {
> +		if (n > 0)
> +			seq_puts(m, " ");
> +		seq_printf(m, "%u", nlat->samples[n]);
> +	}
> +	seq_puts(m, "\n");
> +	return 0;
> +}
> +
> +static const struct blk_mq_debugfs_attr nlat_debugfs_attrs[] = {
> +	{"win_usec", 0600, nlat_win_usec_show, nlat_win_usec_write},
> +	{"lowat", 0600, nlat_lowat_show, nlat_lowat_write},
> +	{"decay", 0600, nlat_decay_show, nlat_decay_write},
> +	{"enabled", 0400, nlat_enabled_show},
> +	{"id", 0400, nlat_id_show},
> +	{"latency", 0400, nlat_latency_show},
> +	{"samples", 0400, nlat_samples_show},
> +	{},
> +};
> +#endif
> +
> +static const struct rq_qos_ops nlat_rqos_ops = {
> +	.exit = nlat_exit,
> +#ifdef CONFIG_BLK_DEBUG_FS
> +	.debugfs_attrs = nlat_debugfs_attrs,
> +#endif
> +};
> +
> +u64 blk_nlat_latency(struct gendisk *disk, int node)
> +{
> +	struct rq_qos *rqos;
> +	struct rq_nlat *nlat;
> +
> +	rqos = nlat_rq_qos(disk->queue);
> +	if (!rqos)
> +		return 0;
> +	nlat = RQNLAT(rqos);
> +	if (node > nlat->num)
> +		return 0;
> +
> +	return div64_u64(nlat->latency[node], 1000);

See comment at the top. Why not keep everything in nanoseconds to avoid this
costly division ?
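One way to follow this suggestion, sketched here in user space with hypothetical names (not the patch's actual API), is to track the value in nanoseconds and convert only at the debugfs boundary, so the path-selection hot path returns the stored value without a 64-bit division:

```c
#include <stdint.h>

/* Illustrative sketch: latency kept in nanoseconds internally,
 * converted to microseconds only when shown to the user. */
struct nlat_model {
	uint64_t latency_ns;
};

/* hot path: no div64_u64() needed */
static uint64_t nlat_model_latency(const struct nlat_model *m)
{
	return m->latency_ns;
}

/* debugfs 'show' boundary: the conversion happens once, off the hot path */
static uint64_t nlat_model_latency_usec(const struct nlat_model *m)
{
	return m->latency_ns / 1000;
}
```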

> +}
> +EXPORT_SYMBOL_GPL(blk_nlat_latency);
> +
> +int blk_nlat_enable(struct gendisk *disk)
> +{
> +	struct rq_qos *rqos;
> +	struct rq_nlat *nlat;
> +
> +	/* Latency tracking not enabled? */
> +	rqos = nlat_rq_qos(disk->queue);
> +	if (!rqos)
> +		return -EINVAL;
> +	nlat = RQNLAT(rqos);
> +	if (nlat->enabled)
> +		return 0;
> +
> +	/* Queue not registered? Maybe shutting down... */
> +	if (!blk_queue_registered(disk->queue))
> +		return -EAGAIN;
> +
> +	nlat->enabled = true;
> +	memset(nlat->latency, 0, sizeof(u64) * nlat->num);
> +	memset(nlat->samples, 0, sizeof(unsigned int) * nlat->num);
> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(blk_nlat_enable);
> +
> +void blk_nlat_disable(struct gendisk *disk)
> +{
> +	struct rq_qos *rqos = nlat_rq_qos(disk->queue);
> +	struct rq_nlat *nlat;
> +
> +	if (!rqos)
> +		return;
> +	nlat = RQNLAT(rqos);
> +	if (nlat->enabled) {
> +		blk_stat_deactivate(nlat->cb);
> +		nlat->enabled = false;
> +	}
> +}
> +EXPORT_SYMBOL_GPL(blk_nlat_disable);
> +
> +int blk_nlat_init(struct gendisk *disk)
> +{
> +	struct rq_nlat *nlat;
> +	int ret = -ENOMEM;
> +
> +	nlat = kzalloc(sizeof(*nlat), GFP_KERNEL);
> +	if (!nlat)
> +		return -ENOMEM;
> +
> +	nlat->num = num_possible_nodes();
> +	nlat->lowat = NLAT_DEFAULT_LOWAT;
> +	nlat->decay = NLAT_DEFAULT_DECAY;
> +	nlat->win_usec = nlat_default_latency_usec(disk->queue);
> +
> +	nlat->latency = kcalloc(nlat->num, sizeof(u64), GFP_KERNEL);
> +	if (!nlat->latency)
> +		goto err_free;
> +	nlat->samples = kcalloc(nlat->num, sizeof(unsigned int), GFP_KERNEL);
> +	if (!nlat->samples)
> +		goto err_free;
> +	nlat->cb = blk_stat_alloc_callback(nlat_timer_fn, nlat_bucket_node,
> +					   nlat->num, nlat);
> +	if (!nlat->cb)
> +		goto err_free;
> +
> +	/*
> +	 * Assign rwb and add the stats callback.
> +	 */

This can be a single line comment.

> +	mutex_lock(&disk->queue->rq_qos_mutex);
> +	ret = rq_qos_add(&nlat->rqos, disk, RQ_QOS_NLAT, &nlat_rqos_ops);
> +	mutex_unlock(&disk->queue->rq_qos_mutex);
> +	if (ret)
> +		goto err_free_cb;
> +
> +	blk_stat_add_callback(disk->queue, nlat->cb);
> +
> +	return 0;
> +
> +err_free_cb:
> +	blk_stat_free_callback(nlat->cb);
> +err_free:
> +	kfree(nlat->samples);
> +	kfree(nlat->latency);
> +	kfree(nlat);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(blk_nlat_init);
> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
> index 37245c97ee61..2fc11ced0c00 100644
> --- a/block/blk-rq-qos.h
> +++ b/block/blk-rq-qos.h
> @@ -17,6 +17,7 @@ enum rq_qos_id {
>  	RQ_QOS_WBT,
>  	RQ_QOS_LATENCY,
>  	RQ_QOS_COST,
> +	RQ_QOS_NLAT,
>  };
>  
>  struct rq_wait {
> @@ -79,6 +80,11 @@ static inline struct rq_qos *iolat_rq_qos(struct request_queue *q)
>  	return rq_qos_id(q, RQ_QOS_LATENCY);
>  }
>  
> +static inline struct rq_qos *nlat_rq_qos(struct request_queue *q)
> +{
> +	return rq_qos_id(q, RQ_QOS_NLAT);
> +}
> +
>  static inline void rq_wait_init(struct rq_wait *rq_wait)
>  {
>  	atomic_set(&rq_wait->inflight, 0);
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index d3d8fd8e229b..1f3829627f1b 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -1231,4 +1231,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
>  }
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  
> +#ifdef CONFIG_BLK_NODE_LATENCY
> +int blk_nlat_enable(struct gendisk *disk);
> +void blk_nlat_disable(struct gendisk *disk);
> +u64 blk_nlat_latency(struct gendisk *disk, int node);
> +int blk_nlat_init(struct gendisk *disk);
> +#else
> +static inline int blk_nlat_enable(struct gendisk *disk) { return 0; }
> +static inline void blk_nlat_disable(struct gendisk *disk) {}
> +static inline u64 blk_nlat_latency(struct gendisk *disk, int node) { return 0; }
> +static inline int blk_nlat_init(struct gendisk *disk) { return -EOPNOTSUPP; }
> +#endif
>  #endif /* BLK_MQ_H */

-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/3] nvme: add 'latency' iopolicy
  2024-05-09 20:43 ` [PATCH v3 2/3] nvme: add 'latency' iopolicy John Meneghini
@ 2024-05-10  7:17   ` Damien Le Moal
  2024-05-10 10:03     ` Hannes Reinecke
  0 siblings, 1 reply; 11+ messages in thread
From: Damien Le Moal @ 2024-05-10  7:17 UTC (permalink / raw)
  To: John Meneghini, tj, josef, axboe, kbusch, hch, sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jrani, randyj, aviv.coro

On 5/10/24 05:43, John Meneghini wrote:
> From: Hannes Reinecke <hare@kernel.org>
> 
> Add a latency-based I/O policy for multipathing. It uses the blk-nodelat
> latency tracker to provide latencies for each node, and schedules
> I/O on the path with the least latency for the submitting node.
> 
> Signed-off-by: Hannes Reinecke <hare@kernel.org>
> 
> Make this compile when CONFIG_BLK_NODE_LATENCY is not set.
> Advertise the 'latency' iopolicy in modinfo.
> 
> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
> ---
>  drivers/nvme/host/multipath.c | 63 ++++++++++++++++++++++++++++++-----
>  drivers/nvme/host/nvme.h      |  1 +
>  2 files changed, 55 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index d916a5ddf5d4..e9330bb1990b 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -18,6 +18,7 @@ static const char *nvme_iopolicy_names[] = {
>  	[NVME_IOPOLICY_NUMA]	= "numa",
>  	[NVME_IOPOLICY_RR]	= "round-robin",
>  	[NVME_IOPOLICY_QD]      = "queue-depth",
> +	[NVME_IOPOLICY_LAT]	= "latency",
>  };
>  
>  static int iopolicy = NVME_IOPOLICY_NUMA;
> @@ -32,6 +33,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>  		iopolicy = NVME_IOPOLICY_RR;
>  	else if (!strncmp(val, "queue-depth", 11))
>  		iopolicy = NVME_IOPOLICY_QD;
> +#ifdef CONFIG_BLK_NODE_LATENCY
> +	else if (!strncmp(val, "latency", 7))
> +		iopolicy = NVME_IOPOLICY_LAT;
> +#endif
>  	else
>  		return -EINVAL;
>  
> @@ -43,10 +48,36 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>  	return sprintf(buf, "%s\n", nvme_iopolicy_names[iopolicy]);
>  }
>  
> +static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
> +{
> +	struct nvme_ns_head *h;
> +	struct nvme_ns *ns;
> +	bool enable = iopolicy == NVME_IOPOLICY_LAT;
> +	int ret = 0;
> +
> +	mutex_lock(&subsys->lock);
> +	list_for_each_entry(h, &subsys->nsheads, entry) {
> +		list_for_each_entry_rcu(ns, &h->list, siblings) {
> +			if (enable) {
> +				ret = blk_nlat_enable(ns->disk);
> +				if (ret)
> +					break;
> +			} else
> +				blk_nlat_disable(ns->disk);

Missing curly brackets for the else.

> +		}
> +	}
> +	mutex_unlock(&subsys->lock);
> +	return ret;
> +}
> +
>  module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>  	&iopolicy, 0644);
>  MODULE_PARM_DESC(iopolicy,
> +#if defined(CONFIG_BLK_NODE_LATENCY)

What is so special about the latency policy that it needs to be conditionally
defined ? I missed that point. Why not drop CONFIG_BLK_NODE_LATENCY ?

> +	"Default multipath I/O policy; 'numa' (default) , 'round-robin', 'queue-depth' or 'latency'");
> +#else
>  	"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
> +#endif
>  
>  void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>  {
> @@ -250,14 +281,16 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
>  {
>  	int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
>  	struct nvme_ns *found = NULL, *fallback = NULL, *ns;
> +	int iopolicy = READ_ONCE(head->subsys->iopolicy);
>  
>  	list_for_each_entry_rcu(ns, &head->list, siblings) {
>  		if (nvme_path_is_disabled(ns))
>  			continue;
>  
> -		if (ns->ctrl->numa_node != NUMA_NO_NODE &&
> -		    READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
> +		if (iopolicy == NVME_IOPOLICY_NUMA)
>  			distance = node_distance(node, ns->ctrl->numa_node);
> +		else if (iopolicy == NVME_IOPOLICY_LAT)
> +			distance = blk_nlat_latency(ns->disk, node);
>  		else
>  			distance = LOCAL_DISTANCE;
>  
> @@ -381,8 +414,8 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
>  
>  inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>  {
> -	int iopolicy = READ_ONCE(head->subsys->iopolicy);
>  	int node;
> +	int iopolicy = READ_ONCE(head->subsys->iopolicy);

No need to move this line.

>  	struct nvme_ns *ns;
>  
>  	/*
> @@ -401,8 +434,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>  
>  	if (iopolicy == NVME_IOPOLICY_RR)
>  		return nvme_round_robin_path(head, node, ns);
> -
> -	if (unlikely(!nvme_path_is_optimized(ns)))
> +	if (iopolicy == NVME_IOPOLICY_LAT ||
> +	    unlikely(!nvme_path_is_optimized(ns)))
>  		return __nvme_find_path(head, node);
>  	return ns;
>  }
> @@ -872,15 +905,18 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
>  {
>  	struct nvme_subsystem *subsys =
>  		container_of(dev, struct nvme_subsystem, dev);
> -	int i;
> +	int i, ret;
>  
>  	for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
>  		if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
> -			nvme_subsys_iopolicy_update(subsys, i);
> -			return count;
> +			ret = nvme_activate_iopolicy(subsys, i);
> +			if (!ret) {
> +				nvme_subsys_iopolicy_update(subsys, i);
> +				return count;
> +			}
> +			return ret;

It would be nicer to have this as:

			if (ret)
				break;
			nvme_subsys_iopolicy_update(subsys, i);
			return count;

>  		}
>  	}
> -

Whitespace-only change.

>  	return -EINVAL;

And "return ret;" here with ret initialized to -EINVAL when declared.
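Putting these comments together, the resulting control flow would look roughly like the stand-alone model below. The helpers are stubs standing in for the nvme functions, not the real implementation:

```c
#include <string.h>

#define MODEL_EINVAL 22
#define MODEL_NUM_POLICIES 4

/* Model of the suggested nvme_subsys_iopolicy_store() flow: ret starts
 * at -EINVAL, a failing activate breaks out of the loop to a single
 * return, and success updates the policy and returns count. */
static const char *model_names[MODEL_NUM_POLICIES] = {
	"numa", "round-robin", "queue-depth", "latency"
};

static int model_applied = -1;

static int model_activate(int i)
{
	return i == 3 ? -1 : 0;	/* pretend "latency" fails to enable */
}

static void model_update(int i)
{
	model_applied = i;
}

static int model_iopolicy_store(const char *buf, int count)
{
	int ret = -MODEL_EINVAL;
	int i;

	for (i = 0; i < MODEL_NUM_POLICIES; i++) {
		if (strcmp(buf, model_names[i]) == 0) {
			ret = model_activate(i);
			if (ret)
				break;
			model_update(i);
			return count;
		}
	}
	return ret;
}
```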

>  }
>  SUBSYS_ATTR_RW(iopolicy, S_IRUGO | S_IWUSR,
> @@ -916,6 +952,15 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>  
>  void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
>  {
> +	if (!blk_nlat_init(ns->disk) &&
> +	    READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_LAT) {
> +		int ret = blk_nlat_enable(ns->disk);
> +
> +		if (unlikely(ret))
> +			pr_warn("%s: Failed to enable latency tracking, error %d\n",
> +				ns->disk->disk_name, ret);
> +	}
> +
>  	if (nvme_ctrl_use_ana(ns->ctrl)) {
>  		struct nvme_ana_group_desc desc = {
>  			.grpid = anagrpid,
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index a557b4577c01..66bf003a6c48 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -411,6 +411,7 @@ enum nvme_iopolicy {
>  	NVME_IOPOLICY_NUMA,
>  	NVME_IOPOLICY_RR,
>  	NVME_IOPOLICY_QD,
> +	NVME_IOPOLICY_LAT,
>  };
>  
>  struct nvme_subsystem {

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v3 3/3] nvme: multipath: pr_notice when iopolicy changes
  2024-05-09 20:43 ` [PATCH v3 3/3] nvme: multipath: pr_notice when iopolicy changes John Meneghini
@ 2024-05-10  7:19   ` Damien Le Moal
  0 siblings, 0 replies; 11+ messages in thread
From: Damien Le Moal @ 2024-05-10  7:19 UTC (permalink / raw)
  To: John Meneghini, tj, josef, axboe, kbusch, hch, sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jrani, randyj, aviv.coro

On 5/10/24 05:43, John Meneghini wrote:
> Send a pr_notice whenever the iopolicy on a subsystem
> is changed. This is important for support reasons. It
> is fully expected that users will be changing the iopolicy
> with active IO in progress.
> 
> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
> ---
>  drivers/nvme/host/multipath.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index e9330bb1990b..0286e44a081f 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -67,6 +67,10 @@ static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
>  		}
>  	}
>  	mutex_unlock(&subsys->lock);
> +
> +	pr_notice("%s: %s enable %d status %d for subsysnqn %s\n", __func__,
> +			nvme_iopolicy_names[iopolicy], enable, ret, subsys->subnqn);
> +
>  	return ret;
>  }
>  
> @@ -890,6 +894,8 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
>  {
>  	struct nvme_ctrl *ctrl;
>  
> +	int old_iopolicy = READ_ONCE(subsys->iopolicy);

No need for the blank line before this.

> +
>  	WRITE_ONCE(subsys->iopolicy, iopolicy);
>  
>  	mutex_lock(&nvme_subsystems_lock);
> @@ -898,6 +904,10 @@ void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
>  		nvme_mpath_clear_ctrl_paths(ctrl);
>  	}
>  	mutex_unlock(&nvme_subsystems_lock);
> +
> +	pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
> +			nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],
> +			subsys->subnqn);

Using dev_notice(&subsys->dev, "...", ...); may be better. Same for the other
messages.

>  }
>  
>  static ssize_t nvme_subsys_iopolicy_store(struct device *dev,

-- 
Damien Le Moal
Western Digital Research



* Re: [PATCH v3 1/3] block: track per-node I/O latency
  2024-05-10  7:11   ` Damien Le Moal
@ 2024-05-10  9:28     ` Niklas Cassel
  2024-05-10 10:00     ` Hannes Reinecke
  1 sibling, 0 replies; 11+ messages in thread
From: Niklas Cassel @ 2024-05-10  9:28 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: John Meneghini, tj, josef, axboe, kbusch, hch, sagi, emilne,
	hare, linux-block, cgroups, linux-nvme, linux-kernel, jrani,
	randyj, aviv.coro

On Fri, May 10, 2024 at 04:11:10PM +0900, Damien Le Moal wrote:
> On 5/10/24 05:43, John Meneghini wrote:
> > From: Hannes Reinecke <hare@kernel.org>
> > 
> > Add a new option 'BLK_NODE_LATENCY' to track per-node I/O latency.
> > This can be used by I/O schedulers to determine the 'best' queue
> > to send I/O to.
> > 
> > Signed-off-by: Hannes Reinecke <hare@kernel.org>
> > 
> > Cleaned up checkpatch warnings and updated MAINTAINERS.
> 
> This note should be before Hannes SoB. E.g:
> 
> [John] Fixed checkpatch warnings and updated MAINTAINERS.

Not before, it shoud be after Hannes SoB.
(Between Hannes' Signed-off-by and John's Signed-off-by)

See this example:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a37e12bcab22efa05802f87baa0692365ae0ab4d


Kind regards,
Niklas


* Re: [PATCH v3 0/3] block,nvme: latency-based I/O scheduler
  2024-05-09 20:43 ` [PATCH v3 0/3] block,nvme: latency-based I/O scheduler John Meneghini
@ 2024-05-10  9:34   ` Niklas Cassel
  0 siblings, 0 replies; 11+ messages in thread
From: Niklas Cassel @ 2024-05-10  9:34 UTC (permalink / raw)
  To: John Meneghini
  Cc: tj, josef, axboe, kbusch, hch, sagi, emilne, hare, linux-block,
	cgroups, linux-nvme, linux-kernel, jrani, randyj, aviv.coro

On Thu, May 09, 2024 at 04:43:21PM -0400, John Meneghini wrote:
> I'm re-issuing Hannes's latency patches in preparation for LSFMM

Hello John,

Just a small note.

Please don't reply-to the previous version of the series (v2), when sending
out a v3.

It creates "an unmanageable forest of references in email clients".

See:
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#explicit-in-reply-to-headers

Instead just add the url to the v2 on lore.kernel.org.

See you at LSFMM!


Kind regards,
Niklas

> 
> Changes since V2:
> 
> I've done quite a bit of work cleaning up these patches. There were a
> number of checkpatch.pl problems as well as some compile time errors
> when config BLK_NODE_LATENCY was turned off. After the clean up I
> rebased these patches onto Ewan's "nvme: queue-depth multipath iopolicy"
> patches. This allowed me to test both iopolicy changes together. 
> 
> All of my test results, together with the scripts I used to generate these
> graphs, are available at:
> 
>   https://github.com/johnmeneghini/iopolicy
> 
> Please use the scripts in this repository to do your own testing.
> 
> Changes since V1:
> 
> Hi all,
> 
> there had been several attempts to implement a latency-based I/O
> scheduler for native nvme multipath, all of which had their issues.
> 
> So time to start afresh, this time using the QoS framework
> already present in the block layer.
> It consists of two parts:
> - a new 'blk-nlatency' QoS module, which is just a simple per-node
>   latency tracker
> - a 'latency' nvme I/O policy
> 
> Using the 'tiobench' fio script with 512 byte blocksize I'm getting
> the following latencies (in usecs) as a baseline:
> - seq write: avg 186 stddev 331
> - rand write: avg 4598 stddev 7903
> - seq read: avg 149 stddev 65
> - rand read: avg 150 stddev 68
> 
> Enabling the 'latency' iopolicy:
> - seq write: avg 178 stddev 113
> - rand write: avg 3427 stddev 6703
> - seq read: avg 140 stddev 59
> - rand read: avg 141 stddev 58
> 
> Setting the 'decay' parameter to 10:
> - seq write: avg 182 stddev 65
> - rand write: avg 2619 stddev 5894
> - seq read: avg 142 stddev 57
> - rand read: avg 140 stddev 57  
> 
> That's on a 32G FC testbed running against a brd target,
> fio running with 48 threads. So promises are met: latency
> goes down, and we're even able to control the standard
> deviation via the 'decay' parameter.
> 
> As usual, comments and reviews are welcome.
> 
> Changes to the original version:
> - split the rqos debugfs entries
> - Modify commit message to indicate latency
> - rename to blk-nlatency
> 
> Hannes Reinecke (2):
>   block: track per-node I/O latency
>   nvme: add 'latency' iopolicy
> 
> John Meneghini (1):
>   nvme: multipath: pr_notice when iopolicy changes
> 
>  MAINTAINERS                   |   1 +
>  block/Kconfig                 |   9 +
>  block/Makefile                |   1 +
>  block/blk-mq-debugfs.c        |   2 +
>  block/blk-nlatency.c          | 389 ++++++++++++++++++++++++++++++++++
>  block/blk-rq-qos.h            |   6 +
>  drivers/nvme/host/multipath.c |  73 ++++++-
>  drivers/nvme/host/nvme.h      |   1 +
>  include/linux/blk-mq.h        |  11 +
>  9 files changed, 484 insertions(+), 9 deletions(-)
>  create mode 100644 block/blk-nlatency.c
> 
> -- 
> 2.39.3
> 
> 


* Re: [PATCH v3 1/3] block: track per-node I/O latency
  2024-05-10  7:11   ` Damien Le Moal
  2024-05-10  9:28     ` Niklas Cassel
@ 2024-05-10 10:00     ` Hannes Reinecke
  1 sibling, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2024-05-10 10:00 UTC (permalink / raw)
  To: Damien Le Moal, John Meneghini, tj, josef, axboe, kbusch, hch,
	sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jrani, randyj, aviv.coro

On 5/10/24 09:11, Damien Le Moal wrote:
> On 5/10/24 05:43, John Meneghini wrote:
>> From: Hannes Reinecke <hare@kernel.org>
>>
>> Add a new option 'BLK_NODE_LATENCY' to track per-node I/O latency.
>> This can be used by I/O schedulers to determine the 'best' queue
>> to send I/O to.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>
>> Cleaned up checkpatch warnings and updated MAINTAINERS.
> 
> This note should be before Hannes SoB. E.g:
> 
> [John] Fixed checkpatch warnings and updated MAINTAINERS.
> 
>>
>> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
>> ---
>>   MAINTAINERS            |   1 +
>>   block/Kconfig          |   9 +
>>   block/Makefile         |   1 +
>>   block/blk-mq-debugfs.c |   2 +
>>   block/blk-nlatency.c   | 389 +++++++++++++++++++++++++++++++++++++++++
>>   block/blk-rq-qos.h     |   6 +
>>   include/linux/blk-mq.h |  11 ++
>>   7 files changed, 419 insertions(+)
>>   create mode 100644 block/blk-nlatency.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 7c121493f43d..a4634365c82f 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -5405,6 +5405,7 @@ F:	block/bfq-cgroup.c
>>   F:	block/blk-cgroup.c
>>   F:	block/blk-iocost.c
>>   F:	block/blk-iolatency.c
>> +F:	block/blk-nlatency.c
>>   F:	block/blk-throttle.c
>>   F:	include/linux/blk-cgroup.h
>>   
>> diff --git a/block/Kconfig b/block/Kconfig
>> index 1de4682d48cc..641ed39d609c 100644
>> --- a/block/Kconfig
>> +++ b/block/Kconfig
>> @@ -186,6 +186,15 @@ config BLK_CGROUP_IOPRIO
>>   	scheduler and block devices process requests. Only some I/O schedulers
>>   	and some block devices support I/O priorities.
>>   
>> +config BLK_NODE_LATENCY
>> +	bool "Track per-node I/O latency"
>> +	help
>> +	Enable per-node I/O latency tracking for multipathing. This uses the
>> +	blk-nodelat latency tracker to provide latencies for each node, and schedules
>> +	I/O on the path with the least latency for the submitting node. This can be
>> +	used by I/O schedulers to determine the node with the least latency. Currently
>> +	only supports nvme over fabrics devices.
>> +
>>   config BLK_DEBUG_FS
>>   	bool "Block layer debugging information in debugfs"
>>   	default y
>> diff --git a/block/Makefile b/block/Makefile
>> index 46ada9dc8bbf..9d2e71a3e36f 100644
>> --- a/block/Makefile
>> +++ b/block/Makefile
>> @@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
>>   obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
>>   obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
>>   obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
>> +obj-$(CONFIG_BLK_NODE_LATENCY) += blk-nlatency.o
> 
> Let's keep the alignment please.
> 
>>   obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
>>   obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
>>   bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>> index 94668e72ab09..cb38228b95d8 100644
>> --- a/block/blk-mq-debugfs.c
>> +++ b/block/blk-mq-debugfs.c
>> @@ -762,6 +762,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
>>   		return "latency";
>>   	case RQ_QOS_COST:
>>   		return "cost";
>> +	case RQ_QOS_NLAT:
>> +		return "node-latency";
>>   	}
>>   	return "unknown";
>>   }
>> diff --git a/block/blk-nlatency.c b/block/blk-nlatency.c
>> new file mode 100644
>> index 000000000000..219c3f636d76
>> --- /dev/null
>> +++ b/block/blk-nlatency.c
>> @@ -0,0 +1,389 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Per-node request latency tracking.
>> + *
>> + * Copyright (C) 2023 Hannes Reinecke
>> + *
>> + * A simple per-node latency tracker for use by I/O scheduler.
>> + * Latencies are measures over 'win_usec' microseconds and stored per node.
>> + * If the number of measurements falls below 'lowat' the measurement is
>> + * assumed to be unreliable and will become 'stale'.
>> + * These 'stale' latencies can be 'decayed', where during each measurement
>> + * interval the 'stale' latency value is decreased by 'decay' percent.
>> + * Once the 'stale' latency reaches zero it will be updated by the
>> + * measured latency.
>> + */
>> +#include <linux/kernel.h>
>> +#include <linux/blk_types.h>
>> +#include <linux/slab.h>
>> +
>> +#include "blk-stat.h"
>> +#include "blk-rq-qos.h"
>> +#include "blk.h"
>> +
>> +#define NLAT_DEFAULT_LOWAT 2
>> +#define NLAT_DEFAULT_DECAY 50
>> +
>> +struct rq_nlat {
>> +	struct rq_qos rqos;
>> +
>> +	u64 win_usec;		/* latency measurement window in microseconds */
> 
> Using microseconds forces you to do costly multiplications and divisions by
> 1000. Why not keep things in nanoseconds ?
> 
I wanted to keep the user interface simple; entering nanosecond values
is tedious. But sure, I can change it.

>> +	unsigned int lowat;	/* Low Watermark latency measurement */
>> +	unsigned int decay;	/* Percentage for 'decaying' latencies */
>> +	bool enabled;
>> +
>> +	struct blk_stat_callback *cb;
>> +
>> +	unsigned int num;
>> +	u64 *latency;
>> +	unsigned int *samples;
>> +};
>> +
>> +static inline struct rq_nlat *RQNLAT(struct rq_qos *rqos)
>> +{
>> +	return container_of(rqos, struct rq_nlat, rqos);
>> +}
>> +
>> +static u64 nlat_default_latency_usec(struct request_queue *q)
>> +{
>> +	/*
>> +	 * We default to 2msec for non-rotational storage, and 75msec
>> +	 * for rotational storage.
>> +	 */
>> +	if (blk_queue_nonrot(q))
>> +		return 2000ULL;
>> +	else
> 
> No need for this else.
> 
OK.

>> +		return 75000ULL;
>> +}
>> +
>> +static void nlat_timer_fn(struct blk_stat_callback *cb)
>> +{
>> +	struct rq_nlat *nlat = cb->data;
>> +	int n;
>> +
>> +	for (n = 0; n < cb->buckets; n++) {
>> +		if (cb->stat[n].nr_samples < nlat->lowat) {
>> +			/*
>> +			 * 'decay' the latency by the specified
>> +			 * percentage to ensure the queues are
>> +			 * being tested to balance out temporary
>> +			 * latency spikes.
>> +			 */
>> +			nlat->latency[n] =
>> +				div64_u64(nlat->latency[n] * nlat->decay, 100);
>> +		} else
>> +			nlat->latency[n] = cb->stat[n].mean;
> 
> Missing the curly brackets around the else block.
> Nit: n is a rather unusual name for a loop index. Why not the usual "i" ? Does
> not matter much though.
> 
There was a reason once ... but yeah, let's move to 'i'.

>> +		nlat->samples[n] = cb->stat[n].nr_samples;
>> +	}
>> +	if (nlat->enabled)
>> +		blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
>> +}
>> +
>> +static int nlat_bucket_node(const struct request *rq)
>> +{
>> +	if (!rq->mq_ctx)
>> +		return -1;
>> +	return cpu_to_node(blk_mq_rq_cpu((struct request *)rq));
>> +}
>> +
>> +static void nlat_exit(struct rq_qos *rqos)
>> +{
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +
>> +	blk_stat_remove_callback(nlat->rqos.disk->queue, nlat->cb);
>> +	blk_stat_free_callback(nlat->cb);
>> +	kfree(nlat->samples);
>> +	kfree(nlat->latency);
>> +	kfree(nlat);
>> +}
>> +
>> +#ifdef CONFIG_BLK_DEBUG_FS
>> +static int nlat_win_usec_show(void *data, struct seq_file *m)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +
>> +	seq_printf(m, "%llu\n", nlat->win_usec);
>> +	return 0;
>> +}
>> +
>> +static ssize_t nlat_win_usec_write(void *data, const char __user *buf,
>> +			size_t count, loff_t *ppos)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +	char val[16] = { };
>> +	u64 usec;
>> +	int err;
>> +
>> +	if (blk_queue_dying(nlat->rqos.disk->queue))
>> +		return -ENOENT;
>> +
>> +	if (count >= sizeof(val))
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(val, buf, count))
>> +		return -EFAULT;
>> +
>> +	err = kstrtoull(val, 10, &usec);
>> +	if (err)
>> +		return err;
>> +	blk_stat_deactivate(nlat->cb);
>> +	nlat->win_usec = usec;
>> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
>> +
>> +	return count;
>> +}
>> +
>> +static int nlat_lowat_show(void *data, struct seq_file *m)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +
>> +	seq_printf(m, "%u\n", nlat->lowat);
>> +	return 0;
>> +}
>> +
>> +static ssize_t nlat_lowat_write(void *data, const char __user *buf,
>> +			size_t count, loff_t *ppos)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +	char val[16] = { };
>> +	unsigned int lowat;
>> +	int err;
>> +
>> +	if (blk_queue_dying(nlat->rqos.disk->queue))
>> +		return -ENOENT;
>> +
>> +	if (count >= sizeof(val))
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(val, buf, count))
>> +		return -EFAULT;
>> +
>> +	err = kstrtouint(val, 10, &lowat);
>> +	if (err)
>> +		return err;
>> +	blk_stat_deactivate(nlat->cb);
>> +	nlat->lowat = lowat;
>> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
>> +
>> +	return count;
>> +}
>> +
>> +static int nlat_decay_show(void *data, struct seq_file *m)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +
>> +	seq_printf(m, "%u\n", nlat->decay);
>> +	return 0;
>> +}
>> +
>> +static ssize_t nlat_decay_write(void *data, const char __user *buf,
>> +			size_t count, loff_t *ppos)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +	char val[16] = { };
>> +	unsigned int decay;
>> +	int err;
>> +
>> +	if (blk_queue_dying(nlat->rqos.disk->queue))
>> +		return -ENOENT;
>> +
>> +	if (count >= sizeof(val))
>> +		return -EINVAL;
>> +
>> +	if (copy_from_user(val, buf, count))
>> +		return -EFAULT;
>> +
>> +	err = kstrtouint(val, 10, &decay);
>> +	if (err)
>> +		return err;
>> +	if (decay > 100)
>> +		return -EINVAL;
>> +	blk_stat_deactivate(nlat->cb);
>> +	nlat->decay = decay;
>> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
>> +
>> +	return count;
>> +}
>> +
>> +static int nlat_enabled_show(void *data, struct seq_file *m)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +
>> +	seq_printf(m, "%d\n", nlat->enabled);
>> +	return 0;
>> +}
>> +
>> +static int nlat_id_show(void *data, struct seq_file *m)
>> +{
>> +	struct rq_qos *rqos = data;
>> +
>> +	seq_printf(m, "%u\n", rqos->id);
>> +	return 0;
>> +}
>> +
>> +static int nlat_latency_show(void *data, struct seq_file *m)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +	int n;
>> +
>> +	if (!nlat->enabled)
>> +		return 0;
>> +
>> +	for (n = 0; n < nlat->num; n++) {
>> +		if (n > 0)
>> +			seq_puts(m, " ");
>> +		seq_printf(m, "%llu", nlat->latency[n]);
>> +	}
>> +	seq_puts(m, "\n");
>> +	return 0;
>> +}
>> +
>> +static int nlat_samples_show(void *data, struct seq_file *m)
>> +{
>> +	struct rq_qos *rqos = data;
>> +	struct rq_nlat *nlat = RQNLAT(rqos);
>> +	int n;
>> +
>> +	if (!nlat->enabled)
>> +		return 0;
>> +
>> +	for (n = 0; n < nlat->num; n++) {
>> +		if (n > 0)
>> +			seq_puts(m, " ");
>> +		seq_printf(m, "%u", nlat->samples[n]);
>> +	}
>> +	seq_puts(m, "\n");
>> +	return 0;
>> +}
>> +
>> +static const struct blk_mq_debugfs_attr nlat_debugfs_attrs[] = {
>> +	{"win_usec", 0600, nlat_win_usec_show, nlat_win_usec_write},
>> +	{"lowat", 0600, nlat_lowat_show, nlat_lowat_write},
>> +	{"decay", 0600, nlat_decay_show, nlat_decay_write},
>> +	{"enabled", 0400, nlat_enabled_show},
>> +	{"id", 0400, nlat_id_show},
>> +	{"latency", 0400, nlat_latency_show},
>> +	{"samples", 0400, nlat_samples_show},
>> +	{},
>> +};
>> +#endif
>> +
>> +static const struct rq_qos_ops nlat_rqos_ops = {
>> +	.exit = nlat_exit,
>> +#ifdef CONFIG_BLK_DEBUG_FS
>> +	.debugfs_attrs = nlat_debugfs_attrs,
>> +#endif
>> +};
>> +
>> +u64 blk_nlat_latency(struct gendisk *disk, int node)
>> +{
>> +	struct rq_qos *rqos;
>> +	struct rq_nlat *nlat;
>> +
>> +	rqos = nlat_rq_qos(disk->queue);
>> +	if (!rqos)
>> +		return 0;
>> +	nlat = RQNLAT(rqos);
>> +	if (node > nlat->num)
>> +		return 0;
>> +
>> +	return div64_u64(nlat->latency[node], 1000);
> 
> See comment at the top. Why not keep everything in nanoseconds to avoid this
> costly division?
> 
Agreed.

>> +}
>> +EXPORT_SYMBOL_GPL(blk_nlat_latency);
>> +
>> +int blk_nlat_enable(struct gendisk *disk)
>> +{
>> +	struct rq_qos *rqos;
>> +	struct rq_nlat *nlat;
>> +
>> +	/* Latency tracking not enabled? */
>> +	rqos = nlat_rq_qos(disk->queue);
>> +	if (!rqos)
>> +		return -EINVAL;
>> +	nlat = RQNLAT(rqos);
>> +	if (nlat->enabled)
>> +		return 0;
>> +
>> +	/* Queue not registered? Maybe shutting down... */
>> +	if (!blk_queue_registered(disk->queue))
>> +		return -EAGAIN;
>> +
>> +	nlat->enabled = true;
>> +	memset(nlat->latency, 0, sizeof(u64) * nlat->num);
>> +	memset(nlat->samples, 0, sizeof(unsigned int) * nlat->num);
>> +	blk_stat_activate_nsecs(nlat->cb, nlat->win_usec * 1000);
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(blk_nlat_enable);
>> +
>> +void blk_nlat_disable(struct gendisk *disk)
>> +{
>> +	struct rq_qos *rqos = nlat_rq_qos(disk->queue);
>> +	struct rq_nlat *nlat;
>> +
>> +	if (!rqos)
>> +		return;
>> +	nlat = RQNLAT(rqos);
>> +	if (nlat->enabled) {
>> +		blk_stat_deactivate(nlat->cb);
>> +		nlat->enabled = false;
>> +	}
>> +}
>> +EXPORT_SYMBOL_GPL(blk_nlat_disable);
>> +
>> +int blk_nlat_init(struct gendisk *disk)
>> +{
>> +	struct rq_nlat *nlat;
>> +	int ret = -ENOMEM;
>> +
>> +	nlat = kzalloc(sizeof(*nlat), GFP_KERNEL);
>> +	if (!nlat)
>> +		return -ENOMEM;
>> +
>> +	nlat->num = num_possible_nodes();
>> +	nlat->lowat = NLAT_DEFAULT_LOWAT;
>> +	nlat->decay = NLAT_DEFAULT_DECAY;
>> +	nlat->win_usec = nlat_default_latency_usec(disk->queue);
>> +
>> +	nlat->latency = kcalloc(nlat->num, sizeof(u64), GFP_KERNEL);
>> +	if (!nlat->latency)
>> +		goto err_free;
>> +	nlat->samples = kcalloc(nlat->num, sizeof(unsigned int), GFP_KERNEL);
>> +	if (!nlat->samples)
>> +		goto err_free;
>> +	nlat->cb = blk_stat_alloc_callback(nlat_timer_fn, nlat_bucket_node,
>> +					   nlat->num, nlat);
>> +	if (!nlat->cb)
>> +		goto err_free;
>> +
>> +	/*
>> +	 * Assign rwb and add the stats callback.
>> +	 */
> 
> This can be a single line comment.
> 
Ok.

>> +	mutex_lock(&disk->queue->rq_qos_mutex);
>> +	ret = rq_qos_add(&nlat->rqos, disk, RQ_QOS_NLAT, &nlat_rqos_ops);
>> +	mutex_unlock(&disk->queue->rq_qos_mutex);
>> +	if (ret)
>> +		goto err_free_cb;
>> +
>> +	blk_stat_add_callback(disk->queue, nlat->cb);
>> +
>> +	return 0;
>> +
>> +err_free_cb:
>> +	blk_stat_free_callback(nlat->cb);
>> +err_free:
>> +	kfree(nlat->samples);
>> +	kfree(nlat->latency);
>> +	kfree(nlat);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(blk_nlat_init);
>> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
>> index 37245c97ee61..2fc11ced0c00 100644
>> --- a/block/blk-rq-qos.h
>> +++ b/block/blk-rq-qos.h
>> @@ -17,6 +17,7 @@ enum rq_qos_id {
>>   	RQ_QOS_WBT,
>>   	RQ_QOS_LATENCY,
>>   	RQ_QOS_COST,
>> +	RQ_QOS_NLAT,
>>   };
>>   
>>   struct rq_wait {
>> @@ -79,6 +80,11 @@ static inline struct rq_qos *iolat_rq_qos(struct request_queue *q)
>>   	return rq_qos_id(q, RQ_QOS_LATENCY);
>>   }
>>   
>> +static inline struct rq_qos *nlat_rq_qos(struct request_queue *q)
>> +{
>> +	return rq_qos_id(q, RQ_QOS_NLAT);
>> +}
>> +
>>   static inline void rq_wait_init(struct rq_wait *rq_wait)
>>   {
>>   	atomic_set(&rq_wait->inflight, 0);
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index d3d8fd8e229b..1f3829627f1b 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -1231,4 +1231,15 @@ static inline bool blk_req_can_dispatch_to_zone(struct request *rq)
>>   }
>>   #endif /* CONFIG_BLK_DEV_ZONED */
>>   
>> +#ifdef CONFIG_BLK_NODE_LATENCY
>> +int blk_nlat_enable(struct gendisk *disk);
>> +void blk_nlat_disable(struct gendisk *disk);
>> +u64 blk_nlat_latency(struct gendisk *disk, int node);
>> +int blk_nlat_init(struct gendisk *disk);
>> +#else
>> +static inline int blk_nlat_enable(struct gendisk *disk) { return 0; }
>> +static inline void blk_nlat_disable(struct gendisk *disk) {}
>> +static inline u64 blk_nlat_latency(struct gendisk *disk, int node) { return 0; }
>> +static inline int blk_nlat_init(struct gendisk *disk) { return -EOPNOTSUPP; }
>> +#endif
>>   #endif /* BLK_MQ_H */
> 

Thanks for the review!

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 2/3] nvme: add 'latency' iopolicy
  2024-05-10  7:17   ` Damien Le Moal
@ 2024-05-10 10:03     ` Hannes Reinecke
  0 siblings, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2024-05-10 10:03 UTC (permalink / raw)
  To: Damien Le Moal, John Meneghini, tj, josef, axboe, kbusch, hch,
	sagi, emilne, hare
  Cc: linux-block, cgroups, linux-nvme, linux-kernel, jrani, randyj, aviv.coro

On 5/10/24 09:17, Damien Le Moal wrote:
> On 5/10/24 05:43, John Meneghini wrote:
>> From: Hannes Reinecke <hare@kernel.org>
>>
>> Add a latency-based I/O policy for multipathing. It uses the blk-nodelat
>> latency tracker to provide latencies for each node, and schedules
>> I/O on the path with the least latency for the submitting node.
>>
>> Signed-off-by: Hannes Reinecke <hare@kernel.org>
>>
>> Make this compile when CONFIG_BLK_NODE_LATENCY is not set.
>> Advertise the 'latency' iopolicy in modinfo.
>>
>> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
>> ---
>>   drivers/nvme/host/multipath.c | 63 ++++++++++++++++++++++++++++++-----
>>   drivers/nvme/host/nvme.h      |  1 +
>>   2 files changed, 55 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index d916a5ddf5d4..e9330bb1990b 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -18,6 +18,7 @@ static const char *nvme_iopolicy_names[] = {
>>   	[NVME_IOPOLICY_NUMA]	= "numa",
>>   	[NVME_IOPOLICY_RR]	= "round-robin",
>>   	[NVME_IOPOLICY_QD]      = "queue-depth",
>> +	[NVME_IOPOLICY_LAT]	= "latency",
>>   };
>>   
>>   static int iopolicy = NVME_IOPOLICY_NUMA;
>> @@ -32,6 +33,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
>>   		iopolicy = NVME_IOPOLICY_RR;
>>   	else if (!strncmp(val, "queue-depth", 11))
>>   		iopolicy = NVME_IOPOLICY_QD;
>> +#ifdef CONFIG_BLK_NODE_LATENCY
>> +	else if (!strncmp(val, "latency", 7))
>> +		iopolicy = NVME_IOPOLICY_LAT;
>> +#endif
>>   	else
>>   		return -EINVAL;
>>   
>> @@ -43,10 +48,36 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
>>   	return sprintf(buf, "%s\n", nvme_iopolicy_names[iopolicy]);
>>   }
>>   
>> +static int nvme_activate_iopolicy(struct nvme_subsystem *subsys, int iopolicy)
>> +{
>> +	struct nvme_ns_head *h;
>> +	struct nvme_ns *ns;
>> +	bool enable = iopolicy == NVME_IOPOLICY_LAT;
>> +	int ret = 0;
>> +
>> +	mutex_lock(&subsys->lock);
>> +	list_for_each_entry(h, &subsys->nsheads, entry) {
>> +		list_for_each_entry_rcu(ns, &h->list, siblings) {
>> +			if (enable) {
>> +				ret = blk_nlat_enable(ns->disk);
>> +				if (ret)
>> +					break;
>> +			} else
>> +				blk_nlat_disable(ns->disk);
> 
> Missing curly brackets for the else.
> 
Ok.

>> +		}
>> +	}
>> +	mutex_unlock(&subsys->lock);
>> +	return ret;
>> +}
>> +
>>   module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
>>   	&iopolicy, 0644);
>>   MODULE_PARM_DESC(iopolicy,
>> +#if defined(CONFIG_BLK_NODE_LATENCY)
> 
> What is so special about the latency policy that it needs to be conditionally
> defined? I missed that point. Why not drop CONFIG_BLK_NODE_LATENCY ?
> 
The 'latency' policy uses the blk-rqos infrastructure, which itself might
not be compiled in, so we don't want to give the user a false impression
here.
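I.e. with the tracker compiled out, the policy would silently do nothing. On
the Kconfig side this is roughly (sketch only; just the BLK_NODE_LATENCY
symbol name comes from the series, the prompt and help text here are
illustrative):

```
config BLK_NODE_LATENCY
	bool "Track per-node I/O latency"
	help
	  Track I/O latency per NUMA node using the rq-qos framework.
	  Required for the nvme 'latency' multipath iopolicy.
```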

>> +	"Default multipath I/O policy; 'numa' (default) , 'round-robin', 'queue-depth' or 'latency'");
>> +#else
>>   	"Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
>> +#endif
>>   
>>   void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
>>   {
>> @@ -250,14 +281,16 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
>>   {
>>   	int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
>>   	struct nvme_ns *found = NULL, *fallback = NULL, *ns;
>> +	int iopolicy = READ_ONCE(head->subsys->iopolicy);
>>   
>>   	list_for_each_entry_rcu(ns, &head->list, siblings) {
>>   		if (nvme_path_is_disabled(ns))
>>   			continue;
>>   
>> -		if (ns->ctrl->numa_node != NUMA_NO_NODE &&
>> -		    READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
>> +		if (iopolicy == NVME_IOPOLICY_NUMA)
>>   			distance = node_distance(node, ns->ctrl->numa_node);
>> +		else if (iopolicy == NVME_IOPOLICY_LAT)
>> +			distance = blk_nlat_latency(ns->disk, node);
>>   		else
>>   			distance = LOCAL_DISTANCE;
>>   
>> @@ -381,8 +414,8 @@ static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
>>   
>>   inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>   {
>> -	int iopolicy = READ_ONCE(head->subsys->iopolicy);
>>   	int node;
>> +	int iopolicy = READ_ONCE(head->subsys->iopolicy);
> 
> No need to move this line.
> 
Sure.

>>   	struct nvme_ns *ns;
>>   
>>   	/*
>> @@ -401,8 +434,8 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
>>   
>>   	if (iopolicy == NVME_IOPOLICY_RR)
>>   		return nvme_round_robin_path(head, node, ns);
>> -
>> -	if (unlikely(!nvme_path_is_optimized(ns)))
>> +	if (iopolicy == NVME_IOPOLICY_LAT ||
>> +	    unlikely(!nvme_path_is_optimized(ns)))
>>   		return __nvme_find_path(head, node);
>>   	return ns;
>>   }
>> @@ -872,15 +905,18 @@ static ssize_t nvme_subsys_iopolicy_store(struct device *dev,
>>   {
>>   	struct nvme_subsystem *subsys =
>>   		container_of(dev, struct nvme_subsystem, dev);
>> -	int i;
>> +	int i, ret;
>>   
>>   	for (i = 0; i < ARRAY_SIZE(nvme_iopolicy_names); i++) {
>>   		if (sysfs_streq(buf, nvme_iopolicy_names[i])) {
>> -			nvme_subsys_iopolicy_update(subsys, i);
>> -			return count;
>> +			ret = nvme_activate_iopolicy(subsys, i);
>> +			if (!ret) {
>> +				nvme_subsys_iopolicy_update(subsys, i);
>> +				return count;
>> +			}
>> +			return ret;
> 
> It would be nicer to have this as:
> 
> 			if (ret)
> 				break;
> 			nvme_subsys_iopolicy_update(subsys, i);
> 			return count;
> 

Ok.

>>   		}
>>   	}
>> -
> 
> whiteline change.
> 
>>   	return -EINVAL;
> 
> And "return ret;" here with ret initialized to -EINVAL when declared.
> 
Ok.

>>   }
>>   SUBSYS_ATTR_RW(iopolicy, S_IRUGO | S_IWUSR,
>> @@ -916,6 +952,15 @@ static int nvme_lookup_ana_group_desc(struct nvme_ctrl *ctrl,
>>   
>>   void nvme_mpath_add_disk(struct nvme_ns *ns, __le32 anagrpid)
>>   {
>> +	if (!blk_nlat_init(ns->disk) &&
>> +	    READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_LAT) {
>> +		int ret = blk_nlat_enable(ns->disk);
>> +
>> +		if (unlikely(ret))
>> +			pr_warn("%s: Failed to enable latency tracking, error %d\n",
>> +				ns->disk->disk_name, ret);
>> +	}
>> +
>>   	if (nvme_ctrl_use_ana(ns->ctrl)) {
>>   		struct nvme_ana_group_desc desc = {
>>   			.grpid = anagrpid,
>> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
>> index a557b4577c01..66bf003a6c48 100644
>> --- a/drivers/nvme/host/nvme.h
>> +++ b/drivers/nvme/host/nvme.h
>> @@ -411,6 +411,7 @@ enum nvme_iopolicy {
>>   	NVME_IOPOLICY_NUMA,
>>   	NVME_IOPOLICY_RR,
>>   	NVME_IOPOLICY_QD,
>> +	NVME_IOPOLICY_LAT,
>>   };
>>   
>>   struct nvme_subsystem {
> 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich


^ permalink raw reply	[flat|nested] 11+ messages in thread


Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
     [not found] <20240403141756.88233-1-hare@kernel.org>
2024-05-09 20:43 ` [PATCH v3 0/3] block,nvme: latency-based I/O scheduler John Meneghini
2024-05-10  9:34   ` Niklas Cassel
2024-05-09 20:43 ` [PATCH v3 1/3] block: track per-node I/O latency John Meneghini
2024-05-10  7:11   ` Damien Le Moal
2024-05-10  9:28     ` Niklas Cassel
2024-05-10 10:00     ` Hannes Reinecke
2024-05-09 20:43 ` [PATCH v3 2/3] nvme: add 'latency' iopolicy John Meneghini
2024-05-10  7:17   ` Damien Le Moal
2024-05-10 10:03     ` Hannes Reinecke
2024-05-09 20:43 ` [PATCH v3 3/3] nvme: multipath: pr_notice when iopolicy changes John Meneghini
2024-05-10  7:19   ` Damien Le Moal
