From: Rachit Agarwal <rach4x0r@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@lst.de>
Cc: Rachit Agarwal <rach4x0r@gmail.com>,
	linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
	linux-kernel@vger.kernel.org, Keith Busch <kbusch@kernel.org>,
	Ming Lei <ming.lei@redhat.com>,
	Jaehyun Hwang <jaehyun.hwang@cornell.edu>,
	Qizhe Cai <qc228@cornell.edu>,
	Midhul Vuppalapati <mvv25@cornell.edu>,
	Rachit Agarwal <ragarwal@cs.cornell.edu>,
	Sagi Grimberg <sagi@lightbitslabs.com>,
	Rachit Agarwal <ragarwal@cornell.edu>
Subject: [PATCH] iosched: Add i10 I/O Scheduler
Date: Thu, 12 Nov 2020 09:07:52 -0500
Message-ID: <20201112140752.1554-1-rach4x0r@gmail.com>

From: Rachit Agarwal <rach4x0r@gmail.com>

Hi All,

I/O batching is beneficial for optimizing IOPS and throughput for various
applications. For instance, several kernel block drivers would benefit from batching,
including mmc [1] and TCP-based storage drivers like nvme-tcp [2,3]. While we have
support for batching dispatch [4], we need an I/O scheduler to enable batching
efficiently. Such a scheduler is particularly interesting for disaggregated storage,
where the access latency of remote disaggregated storage may be higher than that of
local storage; batching can thus significantly help amortize the remote access latency
while increasing throughput.

This patch introduces the i10 I/O scheduler, which performs batching per hctx in terms
of #requests, #bytes, and a timeout (at microsecond granularity). i10 starts
dispatching only when #requests or #bytes reaches a default threshold or when the
timer expires. After that, batching dispatch [4] takes place, allowing batching
at device drivers along with "bd->last" and ".commit_rqs" (a driver-side sketch is
included below for illustration).
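
To make the driver-side pattern concrete, here is a minimal, hypothetical sketch
(not part of this patch). The ".queue_rq"/".commit_rqs" hooks and the "bd->last"
flag are the existing blk-mq interfaces mentioned above; all mydrv_* names are
made up for illustration only:

#include <linux/blk-mq.h>

/*
 * Hypothetical sketch: defer the doorbell (or socket flush) to the end of a
 * batch using bd->last and .commit_rqs. mydrv_stage_request()/mydrv_kick()
 * are assumed driver helpers, not real APIs.
 */
static blk_status_t mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
				   const struct blk_mq_queue_data *bd)
{
	mydrv_stage_request(hctx->driver_data, bd->rq);

	/* Kick the hardware only when blk-mq marks the last request */
	if (bd->last)
		mydrv_kick(hctx->driver_data);

	return BLK_STS_OK;
}

/* Called by blk-mq when a batch ends without bd->last being set */
static void mydrv_commit_rqs(struct blk_mq_hw_ctx *hctx)
{
	mydrv_kick(hctx->driver_data);
}

static const struct blk_mq_ops mydrv_mq_ops = {
	.queue_rq	= mydrv_queue_rq,
	.commit_rqs	= mydrv_commit_rqs,
};

Because i10 holds requests in its per-hctx queue until a threshold or the timer
fires, such drivers see larger batches and thus fewer doorbell/flush operations
per request.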

The i10 I/O scheduler builds upon the recent work in [6]. We have tested the i10 I/O
scheduler with the nvme-tcp optimizations [2,3] and batching dispatch [4], varying the
number of cores, read/write ratios, and request sizes, with both an NVMe SSD and a
RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvement in
IOPS per core over the "noop" I/O scheduler. These results are available at [5],
and many additional results are presented in [6].

While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target
of the i10 I/O scheduler is throughput maximization. Hence there is no latency target
and no need for a global tracking context, so a new scheduler is preferable to
building this functionality into an existing scheduler.

We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
64KB for #bytes, and 50us for the timeout). These default values are based on the
sensitivity tests in [6]. As future work, we plan to support adaptive batching according
to system load and to extend the scheduler to support isolation in multi-tenant
deployments (to simultaneously achieve low tail latency for latency-sensitive
applications and high throughput for throughput-bound applications).

References
[1] https://lore.kernel.org/linux-block/cover.1587888520.git.baolin.wang7@gmail.com/T/#mc48a8fb6069843827458f5fea722e1179d32af2a
[2] https://git.infradead.org/nvme.git/commit/122e5b9f3d370ae11e1502d14ff5c7ea9b144a76
[3] https://git.infradead.org/nvme.git/commit/86f0348ace1510d7ac25124b096fb88a6ab45270
[4] https://lore.kernel.org/linux-block/20200630102501.2238972-1-ming.lei@redhat.com/
[5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
[6] https://www.usenix.org/conference/nsdi20/presentation/hwang

Signed-off-by: Jaehyun Hwang <jaehyun.hwang@cornell.edu>
Signed-off-by: Qizhe Cai <qc228@cornell.edu>
Signed-off-by: Midhul Vuppalapati <mvv25@cornell.edu>
Signed-off-by: Rachit Agarwal <ragarwal@cornell.edu>
Signed-off-by: Sagi Grimberg <sagi@lightbitslabs.com>
---
 Documentation/block/i10-iosched.rst |  69 +++++
 block/Kconfig.iosched               |   8 +
 block/Makefile                      |   1 +
 block/i10-iosched.c                 | 421 ++++++++++++++++++++++++++++
 4 files changed, 499 insertions(+)
 create mode 100644 Documentation/block/i10-iosched.rst
 create mode 100644 block/i10-iosched.c

diff --git a/Documentation/block/i10-iosched.rst b/Documentation/block/i10-iosched.rst
new file mode 100644
index 000000000000..3db7ca3ed4c1
--- /dev/null
+++ b/Documentation/block/i10-iosched.rst
@@ -0,0 +1,69 @@
+==========================
+i10 I/O scheduler overview
+==========================
+
+I/O batching is beneficial for optimizing IOPS and throughput for various
+applications. For instance, several kernel block drivers would
+benefit from batching, including mmc [1] and TCP-based storage drivers like
+nvme-tcp [2,3]. While we have support for batching dispatch [4], we need an
+I/O scheduler to enable batching efficiently. Such a scheduler is particularly
+interesting for disaggregated storage, where the access latency of remote
+disaggregated storage may be higher than that of local storage; batching can
+thus significantly help amortize the remote access latency while increasing
+throughput.
+
+The i10 I/O scheduler performs batching per hctx in terms
+of #requests, #bytes, and a timeout (at microsecond granularity). i10 starts
+dispatching only when #requests or #bytes reaches a default threshold or when
+the timer expires. After that, batching dispatch [4] takes place, allowing
+batching at device drivers along with "bd->last" and ".commit_rqs".
+
+The i10 I/O scheduler builds upon the recent work in [6]. We have tested the
+i10 I/O scheduler with the nvme-tcp optimizations [2,3] and batching dispatch
+[4], varying the number of cores, read/write ratios, and request sizes, with
+both an NVMe SSD and a RAM block device. For NVMe SSDs, the i10 I/O scheduler
+achieves ~60% improvement in IOPS per core over the "noop" I/O scheduler. These
+results are available at [5], and many additional results are presented in [6].
+
+While other schedulers may also batch I/O (e.g., mq-deadline), the optimization
+target of the i10 I/O scheduler is throughput maximization. Hence there is no
+latency target and no need for a global tracking context, so a new scheduler is
+preferable to building this functionality into an existing scheduler.
+
+We currently use fixed default values as batching thresholds (e.g., 16 for
+#requests, 64KB for #bytes, and 50us for the timeout). These defaults are based
+on the sensitivity tests in [6]. As future work, we plan to support adaptive
+batching according to system load and to extend the scheduler to support
+isolation in multi-tenant deployments (to simultaneously achieve low tail
+latency for latency-sensitive applications and high throughput for throughput-bound applications).
+
+References
+[1] https://lore.kernel.org/linux-block/cover.1587888520.git.baolin.wang7@gmail.com/T/#mc48a8fb6069843827458f5fea722e1179d32af2a
+[2] https://git.infradead.org/nvme.git/commit/122e5b9f3d370ae11e1502d14ff5c7ea9b144a76
+[3] https://git.infradead.org/nvme.git/commit/86f0348ace1510d7ac25124b096fb88a6ab45270
+[4] https://lore.kernel.org/linux-block/20200630102501.2238972-1-ming.lei@redhat.com/
+[5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
+[6] https://www.usenix.org/conference/nsdi20/presentation/hwang
+
+==========================
+i10 I/O scheduler tunables
+==========================
+
+The three tunables for the i10 scheduler are the number of requests to batch
+(for both reads and writes), the number of bytes to batch (for writes), and a
+timeout value. i10 uses these thresholds to decide when to dispatch a batch.
+
+batch_nr
+--------
+Number of requests for batching read/write requests
+Default: 16
+
+batch_bytes
+-----------
+Number of bytes for batching write requests
+Default: 65536 (bytes)
+
+batch_timeout
+-------------
+Timeout value for batching (in microseconds)
+Default: 50 (us)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 2f2158e05a91..5b3623b19487 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -44,6 +44,14 @@ config BFQ_CGROUP_DEBUG
 	Enable some debugging help. Currently it exports additional stat
 	files in a cgroup which can be useful for debugging.
 
+config MQ_IOSCHED_I10
+	tristate "i10 I/O scheduler"
+	default y
+	help
+	  The i10 I/O Scheduler supports batching at BLK-MQ.
+	  Any device driver that benefits from batching
+	  (e.g., NVMe-over-TCP) can use this scheduler.
+
 endmenu
 
 endif
diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..27e0789589ea 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_I10)    += i10-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
 obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
 
diff --git a/block/i10-iosched.c b/block/i10-iosched.c
new file mode 100644
index 000000000000..b5451beaa66d
--- /dev/null
+++ b/block/i10-iosched.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The i10 I/O Scheduler - supports batching at blk-mq.
+ *	The main use case is disaggregated storage access
+ *	using NVMe-over-Fabric (e.g., NVMe-over-TCP device driver).
+ *
+ * An early version of the idea is described and evaluated in
+ * "TCP ≈ RDMA: CPU-efficient Remote Storage Access with i10",
+ * USENIX NSDI 2020.
+ *
+ * Copyright (C) 2020 Cornell University
+ *	Jaehyun Hwang <jaehyun.hwang@cornell.edu>
+ *	Qizhe Cai <qc228@cornell.edu>
+ *	Midhul Vuppalapati <mvv25@cornell.edu>
+ *	Rachit Agarwal <ragarwal@cornell.edu>
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-debugfs.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+
+/* Default batch size in number of requests */
+#define I10_DEF_BATCH_NR	16
+/* Default batch size in bytes (for write requests) */
+#define I10_DEF_BATCH_BYTES	65536
+/* Default timeout value for batching (us units) */
+#define I10_DEF_BATCH_TIMEOUT	50
+
+enum i10_state {
+	/* Batching state:
+	 * Do not run dispatching until we have
+	 * a certain amount of requests or a timer expires.
+	 */
+	I10_STATE_BATCH,
+
+	/* Dispatching state:
+	 * Run dispatching until all requests in the
+	 * scheduler's hctx ihq are dispatched.
+	 */
+	I10_STATE_DISPATCH,
+};
+
+struct i10_queue_data {
+	struct request_queue *q;
+
+	unsigned int	def_batch_nr;
+	unsigned int	def_batch_bytes;
+	unsigned int	def_batch_timeout;
+};
+
+struct i10_hctx_queue {
+	spinlock_t		lock;
+	struct list_head	rq_list;
+
+	struct blk_mq_hw_ctx	*hctx;
+
+	unsigned int	batch_nr;
+	unsigned int	batch_bytes;
+	unsigned int	batch_timeout;
+
+	unsigned int	qlen_nr;
+	unsigned int	qlen_bytes;
+
+	struct hrtimer	dispatch_timer;
+	enum i10_state	state;
+};
+
+static struct i10_queue_data *i10_queue_data_alloc(struct request_queue *q)
+{
+	struct i10_queue_data *iqd;
+
+	iqd = kzalloc_node(sizeof(*iqd), GFP_KERNEL, q->node);
+	if (!iqd)
+		return ERR_PTR(-ENOMEM);
+
+	iqd->q = q;
+	iqd->def_batch_nr = I10_DEF_BATCH_NR;
+	iqd->def_batch_bytes = I10_DEF_BATCH_BYTES;
+	iqd->def_batch_timeout = I10_DEF_BATCH_TIMEOUT;
+
+	return iqd;
+}
+
+static int i10_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+	struct i10_queue_data *iqd;
+	struct elevator_queue *eq;
+
+	eq = elevator_alloc(q, e);
+	if (!eq)
+		return -ENOMEM;
+
+	iqd = i10_queue_data_alloc(q);
+	if (IS_ERR(iqd)) {
+		kobject_put(&eq->kobj);
+		return PTR_ERR(iqd);
+	}
+
+	blk_stat_enable_accounting(q);
+
+	eq->elevator_data = iqd;
+	q->elevator = eq;
+
+	return 0;
+}
+
+static void i10_exit_sched(struct elevator_queue *e)
+{
+	struct i10_queue_data *iqd = e->elevator_data;
+
+	kfree(iqd);
+}
+
+static enum hrtimer_restart i10_hctx_timeout_handler(struct hrtimer *timer)
+{
+	struct i10_hctx_queue *ihq =
+		container_of(timer, struct i10_hctx_queue,
+			dispatch_timer);
+
+	ihq->state = I10_STATE_DISPATCH;
+	blk_mq_run_hw_queue(ihq->hctx, true);
+
+	return HRTIMER_NORESTART;
+}
+
+static void i10_hctx_queue_reset(struct i10_hctx_queue *ihq)
+{
+	ihq->qlen_nr = 0;
+	ihq->qlen_bytes = 0;
+	ihq->state = I10_STATE_BATCH;
+}
+
+static int i10_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	struct i10_hctx_queue *ihq;
+
+	ihq = kzalloc_node(sizeof(*ihq), GFP_KERNEL, hctx->numa_node);
+	if (!ihq)
+		return -ENOMEM;
+
+	spin_lock_init(&ihq->lock);
+	INIT_LIST_HEAD(&ihq->rq_list);
+
+	ihq->hctx = hctx;
+	ihq->batch_nr = 0;
+	ihq->batch_bytes = 0;
+	ihq->batch_timeout = 0;
+
+	hrtimer_init(&ihq->dispatch_timer,
+		CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	ihq->dispatch_timer.function = &i10_hctx_timeout_handler;
+
+	i10_hctx_queue_reset(ihq);
+
+	hctx->sched_data = ihq;
+
+	return 0;
+}
+
+static void i10_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+
+	hrtimer_cancel(&ihq->dispatch_timer);
+	kfree(hctx->sched_data);
+}
+
+static bool i10_hctx_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+		unsigned int nr_segs)
+{
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+	struct list_head *rq_list = &ihq->rq_list;
+	bool merged;
+
+	spin_lock(&ihq->lock);
+	merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
+	spin_unlock(&ihq->lock);
+
+	if (merged && bio_data_dir(bio) == WRITE)
+		ihq->qlen_bytes += bio->bi_iter.bi_size;
+
+	return merged;
+}
+
+/*
+ * The batch size can be adjusted dynamically on a per-hctx basis.
+ * Use per-hctx variables in that case.
+ */
+static inline unsigned int i10_hctx_batch_nr(struct blk_mq_hw_ctx *hctx)
+{
+	struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+
+	return ihq->batch_nr ?
+		ihq->batch_nr : iqd->def_batch_nr;
+}
+
+static inline unsigned int i10_hctx_batch_bytes(struct blk_mq_hw_ctx *hctx)
+{
+	struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+
+	return ihq->batch_bytes ?
+		ihq->batch_bytes : iqd->def_batch_bytes;
+}
+
+static inline unsigned int i10_hctx_batch_timeout(struct blk_mq_hw_ctx *hctx)
+{
+	struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+
+	return ihq->batch_timeout ?
+		ihq->batch_timeout : iqd->def_batch_timeout;
+}
+
+static void i10_hctx_insert_update(struct i10_hctx_queue *ihq,
+				struct request *rq)
+{
+	if (rq_data_dir(rq) == WRITE)
+		ihq->qlen_bytes += blk_rq_bytes(rq);
+	ihq->qlen_nr++;
+}
+
+static void i10_hctx_insert_requests(struct blk_mq_hw_ctx *hctx,
+				struct list_head *rq_list, bool at_head)
+{
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+	struct request *rq, *next;
+
+	list_for_each_entry_safe(rq, next, rq_list, queuelist) {
+		struct list_head *head = &ihq->rq_list;
+
+		spin_lock(&ihq->lock);
+		if (at_head)
+			list_move(&rq->queuelist, head);
+		else
+			list_move_tail(&rq->queuelist, head);
+		i10_hctx_insert_update(ihq, rq);
+		blk_mq_sched_request_inserted(rq);
+		spin_unlock(&ihq->lock);
+	}
+
+	/* Start a new timer */
+	if (ihq->state == I10_STATE_BATCH &&
+	   !hrtimer_active(&ihq->dispatch_timer))
+		hrtimer_start(&ihq->dispatch_timer,
+			ns_to_ktime(i10_hctx_batch_timeout(hctx)
+				* NSEC_PER_USEC),
+			HRTIMER_MODE_REL);
+}
+
+static struct request *i10_hctx_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+	struct request *rq;
+
+	spin_lock(&ihq->lock);
+	rq = list_first_entry_or_null(&ihq->rq_list,
+				struct request, queuelist);
+	if (rq)
+		list_del_init(&rq->queuelist);
+	else
+		i10_hctx_queue_reset(ihq);
+	spin_unlock(&ihq->lock);
+
+	return rq;
+}
+
+static inline bool i10_hctx_dispatch_now(struct blk_mq_hw_ctx *hctx)
+{
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+
+	return (ihq->qlen_nr >= i10_hctx_batch_nr(hctx)) ||
+		(ihq->qlen_bytes >= i10_hctx_batch_bytes(hctx));
+}
+
+/*
+ * Return true if we are in the dispatching state.
+ */
+static bool i10_hctx_has_work(struct blk_mq_hw_ctx *hctx)
+{
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+
+	if (ihq->state == I10_STATE_BATCH) {
+		if (i10_hctx_dispatch_now(hctx)) {
+			ihq->state = I10_STATE_DISPATCH;
+			if (hrtimer_active(&ihq->dispatch_timer))
+				hrtimer_cancel(&ihq->dispatch_timer);
+		}
+	}
+
+	return (ihq->state == I10_STATE_DISPATCH);
+}
+
+#define I10_DEF_BATCH_SHOW_STORE(name)					\
+static ssize_t i10_def_batch_##name##_show(struct elevator_queue *e,	\
+				char *page)				\
+{									\
+	struct i10_queue_data *iqd = e->elevator_data;			\
+									\
+	return sprintf(page, "%u\n", iqd->def_batch_##name);		\
+}									\
+									\
+static ssize_t i10_def_batch_##name##_store(struct elevator_queue *e,	\
+			const char *page, size_t count)			\
+{									\
+	struct i10_queue_data *iqd = e->elevator_data;			\
+	unsigned long long value;					\
+	int ret;							\
+									\
+	ret = kstrtoull(page, 10, &value);				\
+	if (ret)							\
+		return ret;						\
+									\
+	iqd->def_batch_##name = value;					\
+									\
+	return count;							\
+}
+I10_DEF_BATCH_SHOW_STORE(nr);
+I10_DEF_BATCH_SHOW_STORE(bytes);
+I10_DEF_BATCH_SHOW_STORE(timeout);
+#undef I10_DEF_BATCH_SHOW_STORE
+
+#define I10_SCHED_ATTR(name)	\
+	__ATTR(batch_##name, 0644, i10_def_batch_##name##_show, i10_def_batch_##name##_store)
+static struct elv_fs_entry i10_sched_attrs[] = {
+	I10_SCHED_ATTR(nr),
+	I10_SCHED_ATTR(bytes),
+	I10_SCHED_ATTR(timeout),
+	__ATTR_NULL
+};
+#undef I10_SCHED_ATTR
+
+#ifdef CONFIG_BLK_DEBUG_FS
+#define I10_DEBUGFS_SHOW(name)	\
+static int i10_hctx_batch_##name##_show(void *data, struct seq_file *m)	\
+{									\
+	struct blk_mq_hw_ctx *hctx = data;				\
+	struct i10_hctx_queue *ihq = hctx->sched_data;			\
+									\
+	seq_printf(m, "%u\n", ihq->batch_##name);			\
+	return 0;							\
+}									\
+									\
+static int i10_hctx_qlen_##name##_show(void *data, struct seq_file *m)	\
+{									\
+	struct blk_mq_hw_ctx *hctx = data;				\
+	struct i10_hctx_queue *ihq = hctx->sched_data;			\
+									\
+	seq_printf(m, "%u\n", ihq->qlen_##name);			\
+	return 0;							\
+}
+I10_DEBUGFS_SHOW(nr);
+I10_DEBUGFS_SHOW(bytes);
+#undef I10_DEBUGFS_SHOW
+
+static int i10_hctx_state_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	struct i10_hctx_queue *ihq = hctx->sched_data;
+
+	seq_printf(m, "%d\n", ihq->state);
+	return 0;
+}
+
+#define I10_HCTX_QUEUE_ATTR(name)					\
+	{"batch_" #name, 0400, i10_hctx_batch_##name##_show},		\
+	{"qlen_" #name, 0400, i10_hctx_qlen_##name##_show}
+static const struct blk_mq_debugfs_attr i10_hctx_debugfs_attrs[] = {
+	I10_HCTX_QUEUE_ATTR(nr),
+	I10_HCTX_QUEUE_ATTR(bytes),
+	{"state", 0400, i10_hctx_state_show},
+	{},
+};
+#undef I10_HCTX_QUEUE_ATTR
+#endif
+
+static struct elevator_type i10_sched = {
+	.ops = {
+		.init_sched = i10_init_sched,
+		.exit_sched = i10_exit_sched,
+		.init_hctx = i10_init_hctx,
+		.exit_hctx = i10_exit_hctx,
+		.bio_merge = i10_hctx_bio_merge,
+		.insert_requests = i10_hctx_insert_requests,
+		.dispatch_request = i10_hctx_dispatch_request,
+		.has_work = i10_hctx_has_work,
+	},
+#ifdef CONFIG_BLK_DEBUG_FS
+	.hctx_debugfs_attrs = i10_hctx_debugfs_attrs,
+#endif
+	.elevator_attrs = i10_sched_attrs,
+	.elevator_name = "i10",
+	.elevator_owner = THIS_MODULE,
+};
+
+static int __init i10_init(void)
+{
+	return elv_register(&i10_sched);
+}
+
+static void __exit i10_exit(void)
+{
+	elv_unregister(&i10_sched);
+}
+
+module_init(i10_init);
+module_exit(i10_exit);
+
+MODULE_AUTHOR("Jaehyun Hwang, Qizhe Cai, Midhul Vuppalapati, Rachit Agarwal");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("i10 I/O scheduler");
-- 
2.22.0


Thread overview: 29+ messages
2020-11-12 14:07 Rachit Agarwal [this message]
2020-11-12 14:07 ` [PATCH] iosched: Add i10 I/O Scheduler Rachit Agarwal
2020-11-12 18:02 ` Jens Axboe
2020-11-12 18:02   ` Jens Axboe
2020-11-13 20:34   ` Sagi Grimberg
2020-11-13 20:34     ` Sagi Grimberg
2020-11-13 21:03     ` Jens Axboe
2020-11-13 21:03       ` Jens Axboe
2020-11-13 21:23       ` Sagi Grimberg
2020-11-13 21:23         ` Sagi Grimberg
2020-11-13 21:26         ` Jens Axboe
2020-11-13 21:26           ` Jens Axboe
2020-11-13 21:36           ` Sagi Grimberg
2020-11-13 21:36             ` Sagi Grimberg
2020-11-13 21:44             ` Jens Axboe
2020-11-13 21:44               ` Jens Axboe
2020-11-13 21:56               ` Sagi Grimberg
2020-11-13 21:56                 ` Sagi Grimberg
     [not found]                 ` <CAKeUqKKHg1wD19pnwJEd8whubnuGVic_ZhDjebaq3kKmY9TtsQ@mail.gmail.com>
2020-11-30 19:20                   ` Sagi Grimberg
2020-11-30 19:20                     ` Sagi Grimberg
2020-12-28 14:28                   ` Rachit Agarwal
2021-01-11 18:15                     ` Rachit Agarwal
2021-01-11 18:15                       ` Rachit Agarwal
2020-11-16  8:41             ` Ming Lei
2020-11-16  8:41               ` Ming Lei
2020-11-13 14:59 ` Ming Lei
2020-11-13 14:59   ` Ming Lei
2020-11-13 20:58   ` Sagi Grimberg
2020-11-13 20:58     ` Sagi Grimberg
