linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/14] Improve I/O priority support
@ 2021-06-08 23:06 Bart Van Assche
  2021-06-08 23:06 ` [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
                   ` (13 more replies)
  0 siblings, 14 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche

Hi Jens,

A feature that is missing from the Linux kernel for storage devices that
support I/O priorities is to set the I/O priority in requests involving page
cache writeback. Since the identity of the process that triggers page cache
writeback is not known in the writeback code, the priority set by ioprio_set()
is ignored. However, an I/O cgroup is associated with writeback requests
by certain filesystems. Hence this patch series that implements the following
changes:
* Add an rq-qos policy that makes the I/O priority configurable per I/O cgroup
  and also that changes the I/O priority of requests to the lower of (request
  I/O priority, cgroup I/O priority).
* Introduce one queue per I/O priority in the mq-deadline scheduler.
* Dispatch the highest priority requests first.

Please consider this patch series for kernel v5.14.

Thanks,

Bart.

Changes compared to v1:
- Moved the code for assigning an I/O priority into a new rq-qos policy.
- Dropped patch "block/mq-deadline: Reduce the read expiry time for
  non-rotational media".
- Made sure that dd->async_depth >= 1.
- Implemented an aging mechanism such that lower priority requests are not
  postponed forever.

Bart Van Assche (14):
  block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init()
    calls
  block/blk-rq-qos: Move a function from a header file into a C file
  block: Introduce the ioprio rq-qos policy
  block/mq-deadline: Add several comments
  block/mq-deadline: Add two lockdep_assert_held() statements
  block/mq-deadline: Remove two local variables
  block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  block/mq-deadline: Improve compile-time argument checking
  block/mq-deadline: Improve the sysfs show and store macros
  block/mq-deadline: Reserve 25% of scheduler tags for synchronous
    requests
  block/mq-deadline: Add I/O priority support
  block/mq-deadline: Add cgroup support
  block/mq-deadline: Prioritize high-priority requests

 Documentation/admin-guide/cgroup-v2.rst |   26 +
 block/Kconfig                           |   19 +-
 block/Kconfig.iosched                   |    6 +
 block/Makefile                          |    3 +
 block/blk-cgroup.c                      |   14 +-
 block/blk-ioprio.c                      |  236 +++++
 block/blk-ioprio.h                      |   19 +
 block/blk-rq-qos.c                      |   15 +
 block/blk-rq-qos.h                      |   14 +-
 block/mq-deadline-cgroup.c              |  185 ++++
 block/mq-deadline-cgroup.h              |  112 +++
 block/mq-deadline-main.c                | 1171 +++++++++++++++++++++++
 block/mq-deadline.c                     |  815 ----------------
 13 files changed, 1798 insertions(+), 837 deletions(-)
 create mode 100644 block/blk-ioprio.c
 create mode 100644 block/blk-ioprio.h
 create mode 100644 block/mq-deadline-cgroup.c
 create mode 100644 block/mq-deadline-cgroup.h
 create mode 100644 block/mq-deadline-main.c
 delete mode 100644 block/mq-deadline.c


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-09  4:14   ` Damien Le Moal
                     ` (2 more replies)
  2021-06-08 23:06 ` [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

These entries were consecutive at the time of their introduction but are no
longer consecutive. Make these again consecutive. Additionally, modify the
help text since it refers to blk-mq and since the legacy block layer has
been removed.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/Kconfig | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index a2297edfdde8..6685578b2a20 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -133,6 +133,13 @@ config BLK_WBT
 	dynamically on an algorithm loosely based on CoDel, factoring in
 	the realtime performance of the disk.
 
+config BLK_WBT_MQ
+	bool "Enable writeback throttling by default"
+	default y
+	depends on BLK_WBT
+	help
+	Enable writeback throttling by default for request-based block devices.
+
 config BLK_CGROUP_IOLATENCY
 	bool "Enable support for latency based cgroup IO protection"
 	depends on BLK_CGROUP=y
@@ -155,13 +162,6 @@ config BLK_CGROUP_IOCOST
 	distributes IO capacity between different groups based on
 	their share of the overall weight distribution.
 
-config BLK_WBT_MQ
-	bool "Multiqueue writeback throttling"
-	default y
-	depends on BLK_WBT
-	help
-	Enable writeback throttling by default on multiqueue devices.
-
 config BLK_DEBUG_FS
 	bool "Block layer debugging information in debugfs"
 	default y

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
  2021-06-08 23:06 ` [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-09  4:19   ` Damien Le Moal
                     ` (2 more replies)
  2021-06-08 23:06 ` [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
                   ` (11 subsequent siblings)
  13 siblings, 3 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Tejun Heo, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

Before adding more calls in this function, simplify the error path.

Cc: Tejun Heo <tj@kernel.org>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-cgroup.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d169e2055158..3b0f6efaa2b6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1183,15 +1183,14 @@ int blkcg_init_queue(struct request_queue *q)
 	if (preloaded)
 		radix_tree_preload_end();
 
-	ret = blk_throtl_init(q);
+	ret = blk_iolatency_init(q);
 	if (ret)
 		goto err_destroy_all;
 
-	ret = blk_iolatency_init(q);
-	if (ret) {
-		blk_throtl_exit(q);
+	ret = blk_throtl_init(q);
+	if (ret)
 		goto err_destroy_all;
-	}
+
 	return 0;
 
 err_destroy_all:

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
  2021-06-08 23:06 ` [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
  2021-06-08 23:06 ` [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-09  4:22   ` Damien Le Moal
  2021-06-10  6:03   ` Hannes Reinecke
  2021-06-08 23:06 ` [PATCH 04/14] block: Introduce the ioprio rq-qos policy Bart Van Assche
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

rq_qos_id_to_name() is not in the hot path so move it from a .h into a .c
file.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-rq-qos.c | 13 +++++++++++++
 block/blk-rq-qos.h | 13 +------------
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 656460636ad3..61dccf584c68 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -2,6 +2,19 @@
 
 #include "blk-rq-qos.h"
 
+const char *rq_qos_id_to_name(enum rq_qos_id id)
+{
+	switch (id) {
+	case RQ_QOS_WBT:
+		return "wbt";
+	case RQ_QOS_LATENCY:
+		return "latency";
+	case RQ_QOS_COST:
+		return "cost";
+	}
+	return "unknown";
+}
+
 /*
  * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
  * false if 'v' + 1 would be bigger than 'below'.
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 2bc43e94f4c4..6b5f9ae69883 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -78,18 +78,7 @@ static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
 	return rq_qos_id(q, RQ_QOS_LATENCY);
 }
 
-static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
-{
-	switch (id) {
-	case RQ_QOS_WBT:
-		return "wbt";
-	case RQ_QOS_LATENCY:
-		return "latency";
-	case RQ_QOS_COST:
-		return "cost";
-	}
-	return "unknown";
-}
+const char *rq_qos_id_to_name(enum rq_qos_id id);
 
 static inline void rq_wait_init(struct rq_wait *rq_wait)
 {

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 04/14] block: Introduce the ioprio rq-qos policy
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (2 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-09  4:40   ` Damien Le Moal
  2021-06-10  6:20   ` Hannes Reinecke
  2021-06-08 23:06 ` [PATCH 05/14] block/mq-deadline: Add several comments Bart Van Assche
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

Introduce an rq-qos policy that assigns an I/O priority to requests based
on blk-cgroup configuration settings. This policy has the following
advantages over the ioprio_set() system call:
- This policy is cgroup based so it has all the advantages of cgroups.
- While ioprio_set() does not affect page cache writeback I/O, this rq-qos
  controller affects page cache writeback I/O for filesystems that support
  assiociating a cgroup with writeback I/O. See also
  Documentation/admin-guide/cgroup-v2.rst.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  26 +++
 block/Kconfig                           |   9 +
 block/Makefile                          |   1 +
 block/blk-cgroup.c                      |   5 +
 block/blk-ioprio.c                      | 236 ++++++++++++++++++++++++
 block/blk-ioprio.h                      |  19 ++
 block/blk-rq-qos.c                      |   2 +
 block/blk-rq-qos.h                      |   1 +
 8 files changed, 299 insertions(+)
 create mode 100644 block/blk-ioprio.c
 create mode 100644 block/blk-ioprio.h

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b1e81aa8598a..cf8c8073b474 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,6 +56,7 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
        5-3-3. IO Latency
          5-3-3-1. How IO Latency Throttling Works
          5-3-3-2. IO Latency Interface Files
+       5-3-4. IO Priority
      5-4. PID
        5-4-1. PID Interface Files
      5-5. Cpuset
@@ -1866,6 +1867,31 @@ IO Latency Interface Files
 		duration of time between evaluation events.  Windows only elapse
 		with IO activity.  Idle periods extend the most recent window.
 
+IO Priority
+~~~~~~~~~~~
+
+A single attribute controls the behavior of the I/O priority cgroup policy,
+namely the blkio.prio.class attribute. The following values are accepted for
+that attribute:
+
+  0
+	Do not modify the I/O priority class.
+
+  1
+	For requests that do not have an I/O priority class (IOPRIO_CLASS_NONE),
+	change the I/O priority class into 1 (IOPRIO_CLASS_RT). Do not modify
+	the I/O priority class of other requests.
+
+  2
+	For requests that do not have an I/O priority class, change the I/O
+	priority class into IOPRIO_CLASS_BE. For requests that have the
+	priority class RT, change it into BE (2). Do not modify
+	the I/O priority class of requests that have priority 3 (IDLE).
+
+  3
+	Change the I/O priority class of all requests into 3 (IDLE),
+	the lowest I/O priority class.
+
 PID
 ---
 
diff --git a/block/Kconfig b/block/Kconfig
index 6685578b2a20..319816058298 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -162,6 +162,15 @@ config BLK_CGROUP_IOCOST
 	distributes IO capacity between different groups based on
 	their share of the overall weight distribution.
 
+config BLK_CGROUP_IOPRIO
+	bool "Cgroup I/O controller for assigning I/O priority"
+	depends on BLK_CGROUP
+	help
+	Enable the .prio interface for assigning an I/O priority to requests.
+	The I/O priority affects the order in which an I/O scheduler and block
+	devices process requests. Only some I/O schedulers and some block
+	devices support I/O priorities.
+
 config BLK_DEBUG_FS
 	bool "Block layer debugging information in debugfs"
 	default y
diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..af3d044abaf1 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
 obj-$(CONFIG_BLK_CGROUP_RWSTAT)	+= blk-cgroup-rwstat.o
 obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
+obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3b0f6efaa2b6..7b06a5fa3cac 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -31,6 +31,7 @@
 #include <linux/tracehook.h>
 #include <linux/psi.h>
 #include "blk.h"
+#include "blk-ioprio.h"
 
 /*
  * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
@@ -1187,6 +1188,10 @@ int blkcg_init_queue(struct request_queue *q)
 	if (ret)
 		goto err_destroy_all;
 
+	ret = blk_ioprio_init(q);
+	if (ret)
+		goto err_destroy_all;
+
 	ret = blk_throtl_init(q);
 	if (ret)
 		goto err_destroy_all;
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
new file mode 100644
index 000000000000..d7292cdef67d
--- /dev/null
+++ b/block/blk-ioprio.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Block rq-qos policy for assigning an I/O priority to requests.
+ *
+ * Using an rq-qos policy for assigning I/O priority has two advantages over
+ * using the ioprio_set() system call:
+ *
+ * - This policy is cgroup based so it has all the advantages of cgroups.
+ * - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
+ *   controller affects page cache writeback I/O for filesystems that support
+ *   assiociating a cgroup with writeback I/O. See also
+ *   Documentation/admin-guide/cgroup-v2.rst.
+ */
+
+#include <linux/blk-cgroup.h>
+#include <linux/blk-mq.h>
+#include <linux/blk_types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include "blk-ioprio.h"
+#include "blk-rq-qos.h"
+
+/*
+ * The accepted I/O priority values are 0..IOPRIO_CLASS_MAX(3). 0 (default)
+ * means do not modify the I/O priority. Values 1..3 are used to restrict the
+ * I/O priority for a cgroup to the specified priority. See also
+ * <linux/ioprio.h>.
+ */
+#define IOPRIO_CLASS_MAX IOPRIO_CLASS_IDLE
+
+static struct blkcg_policy ioprio_policy;
+
+/**
+ * struct ioprio_blkg - Per (cgroup, request queue) data.
+ * @pd: blkg_policy_data structure.
+ */
+struct ioprio_blkg {
+	struct blkg_policy_data pd;
+};
+
+/**
+ * struct ioprio_blkcg - Per cgroup data.
+ * @cpd: blkcg_policy_data structure.
+ * @prio_class: One of the IOPRIO_CLASS_* values. See also <linux/ioprio.h>.
+ */
+struct ioprio_blkcg {
+	struct blkcg_policy_data cpd;
+	u8			 prio_class;
+};
+
+static inline struct ioprio_blkg *pd_to_ioprio(struct blkg_policy_data *pd)
+{
+	return pd ? container_of(pd, struct ioprio_blkg, pd) : NULL;
+}
+
+static struct ioprio_blkcg *
+ioprio_blkcg_from_css(struct cgroup_subsys_state *css)
+{
+	struct blkcg *blkcg = css_to_blkcg(css);
+	struct blkcg_policy_data *cpd = blkcg_to_cpd(blkcg, &ioprio_policy);
+
+	return container_of(cpd, struct ioprio_blkcg, cpd);
+}
+
+static struct ioprio_blkcg *ioprio_blkcg_from_bio(struct bio *bio)
+{
+	struct blkg_policy_data *pd;
+
+	pd = blkg_to_pd(bio->bi_blkg, &ioprio_policy);
+	if (!pd)
+		return NULL;
+
+	return container_of(blkcg_to_cpd(pd->blkg->blkcg, &ioprio_policy),
+			    struct ioprio_blkcg, cpd);
+}
+
+static u64 ioprio_show_prio_class(struct cgroup_subsys_state *css,
+				  struct cftype *cft)
+{
+	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
+
+	return blkcg->prio_class;
+}
+
+static int ioprio_set_prio_class(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val)
+{
+	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
+
+	if (val > IOPRIO_CLASS_MAX)
+		return -EINVAL;
+
+	blkcg->prio_class = val;
+	return 0;
+}
+
+static struct blkg_policy_data *ioprio_alloc_pd(gfp_t gfp,
+						struct request_queue *q,
+						struct blkcg *blkcg)
+{
+	struct ioprio_blkg *ioprio_blkg;
+
+	ioprio_blkg = kzalloc(sizeof(*ioprio_blkg), gfp);
+	if (!ioprio_blkg)
+		return NULL;
+
+	return &ioprio_blkg->pd;
+}
+
+static void ioprio_free_pd(struct blkg_policy_data *pd)
+{
+	struct ioprio_blkg *ioprio_blkg = pd_to_ioprio(pd);
+
+	kfree(ioprio_blkg);
+}
+
+static struct blkcg_policy_data *ioprio_alloc_cpd(gfp_t gfp)
+{
+	struct ioprio_blkcg *blkcg;
+
+	blkcg = kzalloc(sizeof(*blkcg), gfp);
+	if (!blkcg)
+		return NULL;
+	blkcg->prio_class = IOPRIO_CLASS_NONE;
+	return &blkcg->cpd;
+}
+
+static void ioprio_free_cpd(struct blkcg_policy_data *cpd)
+{
+	struct ioprio_blkcg *blkcg = container_of(cpd, typeof(*blkcg), cpd);
+
+	kfree(blkcg);
+}
+
+#define IOPRIO_ATTRS						\
+	{							\
+		.name		= "prio.class",			\
+		.read_u64	= ioprio_show_prio_class,	\
+		.write_u64	= ioprio_set_prio_class,	\
+	},							\
+	{ } /* sentinel */
+
+/* cgroup v2 attributes */
+static struct cftype ioprio_files[] = {
+	IOPRIO_ATTRS
+};
+
+/* cgroup v1 attributes */
+static struct cftype ioprio_legacy_files[] = {
+	IOPRIO_ATTRS
+};
+
+static struct blkcg_policy ioprio_policy = {
+	.dfl_cftypes	= ioprio_files,
+	.legacy_cftypes = ioprio_legacy_files,
+
+	.cpd_alloc_fn	= ioprio_alloc_cpd,
+	.cpd_free_fn	= ioprio_free_cpd,
+
+	.pd_alloc_fn	= ioprio_alloc_pd,
+	.pd_free_fn	= ioprio_free_pd,
+};
+
+struct blk_ioprio {
+	struct rq_qos rqos;
+};
+
+static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
+			       struct bio *bio)
+{
+	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_bio(bio);
+
+	/*
+	 * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
+	 * correspond to a lower priority. Hence, the max_t() below selects
+	 * the lower priority of bi_ioprio and the cgroup I/O priority class.
+	 * If the cgroup priority has been set to IOPRIO_CLASS_NONE == 0, the
+	 * bio I/O priority is not modified. If the bio I/O priority equals
+	 * IOPRIO_CLASS_NONE, the cgroup I/O priority is assigned to the bio.
+	 */
+	bio->bi_ioprio = max_t(u16, bio->bi_ioprio,
+			       IOPRIO_PRIO_VALUE(blkcg->prio_class, 0));
+}
+
+static void blkcg_ioprio_exit(struct rq_qos *rqos)
+{
+	struct blk_ioprio *blkioprio_blkg =
+		container_of(rqos, typeof(*blkioprio_blkg), rqos);
+
+	blkcg_deactivate_policy(rqos->q, &ioprio_policy);
+	kfree(blkioprio_blkg);
+}
+
+static struct rq_qos_ops blkcg_ioprio_ops = {
+	.track = blkcg_ioprio_track,
+	.exit = blkcg_ioprio_exit,
+};
+
+int blk_ioprio_init(struct request_queue *q)
+{
+	struct blk_ioprio *blkioprio_blkg;
+	struct rq_qos *rqos;
+	int ret;
+
+	blkioprio_blkg = kzalloc(sizeof(*blkioprio_blkg), GFP_KERNEL);
+	if (!blkioprio_blkg)
+		return -ENOMEM;
+
+	ret = blkcg_activate_policy(q, &ioprio_policy);
+	if (ret) {
+		kfree(blkioprio_blkg);
+		return ret;
+	}
+
+	rqos = &blkioprio_blkg->rqos;
+	rqos->id = RQ_QOS_IOPRIO;
+	rqos->ops = &blkcg_ioprio_ops;
+	rqos->q = q;
+
+	rq_qos_add(q, rqos);
+
+	return 0;
+}
+
+static int __init ioprio_init(void)
+{
+	return blkcg_policy_register(&ioprio_policy);
+}
+
+static void __exit ioprio_exit(void)
+{
+	blkcg_policy_unregister(&ioprio_policy);
+}
+
+module_init(ioprio_init);
+module_exit(ioprio_exit);
diff --git a/block/blk-ioprio.h b/block/blk-ioprio.h
new file mode 100644
index 000000000000..a7785c2f1aea
--- /dev/null
+++ b/block/blk-ioprio.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BLK_IOPRIO_H_
+#define _BLK_IOPRIO_H_
+
+#include <linux/kconfig.h>
+
+struct request_queue;
+
+#ifdef CONFIG_BLK_CGROUP_IOPRIO
+int blk_ioprio_init(struct request_queue *q);
+#else
+static inline int blk_ioprio_init(struct request_queue *q)
+{
+	return 0;
+}
+#endif
+
+#endif /* _BLK_IOPRIO_H_ */
diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
index 61dccf584c68..3a5400b11af7 100644
--- a/block/blk-rq-qos.c
+++ b/block/blk-rq-qos.c
@@ -11,6 +11,8 @@ const char *rq_qos_id_to_name(enum rq_qos_id id)
 		return "latency";
 	case RQ_QOS_COST:
 		return "cost";
+	case RQ_QOS_IOPRIO:
+		return "ioprio";
 	}
 	return "unknown";
 }
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 6b5f9ae69883..1d4c6e951aa6 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -16,6 +16,7 @@ enum rq_qos_id {
 	RQ_QOS_WBT,
 	RQ_QOS_LATENCY,
 	RQ_QOS_COST,
+	RQ_QOS_IOPRIO,
 };
 
 struct rq_wait {

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 05/14] block/mq-deadline: Add several comments
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (3 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 04/14] block: Introduce the ioprio rq-qos policy Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-10  6:21   ` Hannes Reinecke
  2021-06-08 23:06 ` [PATCH 06/14] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Johannes Thumshirn, Himanshu Madhani,
	Hannes Reinecke, Ming Lei

Make the code easier to read by adding more comments.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 8eea2cbf2bf4..31418e9ce9e2 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -139,6 +139,9 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 	}
 }
 
+/*
+ * Callback function that is invoked after @next has been merged into @req.
+ */
 static void dd_merged_requests(struct request_queue *q, struct request *req,
 			       struct request *next)
 {
@@ -375,6 +378,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 }
 
 /*
+ * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
+ *
  * One confusing aspect here is that we get called for a specific
  * hardware queue, but we may return a request that is for a
  * different hardware queue. This is because mq-deadline has shared
@@ -438,6 +443,10 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
 	return 0;
 }
 
+/*
+ * Try to merge @bio into an existing request. If @bio has been merged into
+ * an existing request, store the pointer to that request into *@rq.
+ */
 static int dd_request_merge(struct request_queue *q, struct request **rq,
 			    struct bio *bio)
 {
@@ -461,6 +470,10 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
 	return ELEVATOR_NO_MERGE;
 }
 
+/*
+ * Attempt to merge a bio into an existing request. This function is called
+ * before @bio is associated with a request.
+ */
 static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs)
 {
@@ -518,6 +531,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	}
 }
 
+/*
+ * Called from blk_mq_sched_insert_request() or blk_mq_sched_insert_requests().
+ */
 static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 			       struct list_head *list, bool at_head)
 {
@@ -544,6 +560,8 @@ static void dd_prepare_request(struct request *rq)
 }
 
 /*
+ * Callback from inside blk_mq_free_request().
+ *
  * For zoned block devices, write unlock the target zone of
  * completed write requests. Do this while holding the zone lock
  * spinlock so that the zone is never unlocked while deadline_fifo_request()

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 06/14] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (4 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 05/14] block/mq-deadline: Add several comments Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-10  6:21   ` Hannes Reinecke
  2021-06-08 23:06 ` [PATCH 07/14] block/mq-deadline: Remove two local variables Bart Van Assche
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn, Himanshu Madhani, Ming Lei

Document the locking strategy by adding two lockdep_assert_held()
statements.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 31418e9ce9e2..191ff5ce629c 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -279,6 +279,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	bool reads, writes;
 	int data_dir;
 
+	lockdep_assert_held(&dd->lock);
+
 	if (!list_empty(&dd->dispatch)) {
 		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
 		list_del_init(&rq->queuelist);
@@ -501,6 +503,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
 
+	lockdep_assert_held(&dd->lock);
+
 	/*
 	 * This may be a requeue of a write request that has locked its
 	 * target zone. If it is the case, this releases the zone lock.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 07/14] block/mq-deadline: Remove two local variables
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (5 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 06/14] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-10  6:22   ` Hannes Reinecke
  2021-06-08 23:06 ` [PATCH 08/14] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn, Himanshu Madhani, Ming Lei

Make __dd_dispatch_request() easier to read by removing two local
variables.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 191ff5ce629c..caa438f62a4d 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -276,7 +276,6 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
 static struct request *__dd_dispatch_request(struct deadline_data *dd)
 {
 	struct request *rq, *next_rq;
-	bool reads, writes;
 	int data_dir;
 
 	lockdep_assert_held(&dd->lock);
@@ -287,9 +286,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 		goto done;
 	}
 
-	reads = !list_empty(&dd->fifo_list[READ]);
-	writes = !list_empty(&dd->fifo_list[WRITE]);
-
 	/*
 	 * batches are currently reads XOR writes
 	 */
@@ -306,7 +302,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (reads) {
+	if (!list_empty(&dd->fifo_list[READ])) {
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
 
 		if (deadline_fifo_request(dd, WRITE) &&
@@ -322,7 +318,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (writes) {
+	if (!list_empty(&dd->fifo_list[WRITE])) {
 dispatch_writes:
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 08/14] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (6 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 07/14] block/mq-deadline: Remove two local variables Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-08 23:06 ` [PATCH 09/14] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Chaitanya Kulkarni, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn, Himanshu Madhani, Ming Lei

Change "queue" into "sched" to make the function names reflect better the
purpose of these functions.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index caa438f62a4d..d823ba7cb084 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -395,7 +395,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-static void dd_exit_queue(struct elevator_queue *e)
+static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
@@ -408,7 +408,7 @@ static void dd_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
@@ -800,8 +800,8 @@ static struct elevator_type mq_deadline = {
 		.requests_merged	= dd_merged_requests,
 		.request_merged		= dd_request_merged,
 		.has_work		= dd_has_work,
-		.init_sched		= dd_init_queue,
-		.exit_sched		= dd_exit_queue,
+		.init_sched		= dd_init_sched,
+		.exit_sched		= dd_exit_sched,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 09/14] block/mq-deadline: Improve compile-time argument checking
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (7 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 08/14] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-08 23:06 ` [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Chaitanya Kulkarni, Hannes Reinecke, Johannes Thumshirn,
	Himanshu Madhani, Damien Le Moal, Ming Lei

Modern compilers complain if an out-of-range value is passed to a function
argument that has an enumeration type. Let the compiler detect out-of-range
data direction arguments instead of verifying the data_dir argument at
runtime.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 96 +++++++++++++++++++++++----------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index d823ba7cb084..69126beff77d 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -35,6 +35,13 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
+enum dd_data_dir {
+	DD_READ		= READ,
+	DD_WRITE	= WRITE,
+};
+
+enum { DD_DIR_COUNT = 2 };
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -43,20 +50,20 @@ struct deadline_data {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];
-	struct list_head fifo_list[2];
+	struct rb_root sort_list[DD_DIR_COUNT];
+	struct list_head fifo_list[DD_DIR_COUNT];
 
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
-	struct request *next_rq[2];
+	struct request *next_rq[DD_DIR_COUNT];
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
-	int fifo_expire[2];
+	int fifo_expire[DD_DIR_COUNT];
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
@@ -97,7 +104,7 @@ deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
 static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
 	if (dd->next_rq[data_dir] == rq)
 		dd->next_rq[data_dir] = deadline_latter_request(rq);
@@ -169,10 +176,10 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
+	dd->next_rq[DD_READ] = NULL;
+	dd->next_rq[DD_WRITE] = NULL;
 	dd->next_rq[data_dir] = deadline_latter_request(rq);
 
 	/*
@@ -185,9 +192,10 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_data *dd,
+				      enum dd_data_dir data_dir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
 
 	/*
 	 * rq is expired!
@@ -203,19 +211,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
  * dispatch using arrival ordered lists.
  */
 static struct request *
-deadline_fifo_request(struct deadline_data *dd, int data_dir)
+deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
-		return NULL;
-
 	if (list_empty(&dd->fifo_list[data_dir]))
 		return NULL;
 
 	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
-	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
 	/*
@@ -223,7 +228,7 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
 	 * an unlocked target zone.
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
+	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
 		if (blk_req_can_dispatch_to_zone(rq))
 			goto out;
 	}
@@ -239,19 +244,16 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
  * dispatch using sector position sorted lists.
  */
 static struct request *
-deadline_next_request(struct deadline_data *dd, int data_dir)
+deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
-		return NULL;
-
 	rq = dd->next_rq[data_dir];
 	if (!rq)
 		return NULL;
 
-	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
 	/*
@@ -276,7 +278,7 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
 static struct request *__dd_dispatch_request(struct deadline_data *dd)
 {
 	struct request *rq, *next_rq;
-	int data_dir;
+	enum dd_data_dir data_dir;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -289,9 +291,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	rq = deadline_next_request(dd, WRITE);
+	rq = deadline_next_request(dd, DD_WRITE);
 	if (!rq)
-		rq = deadline_next_request(dd, READ);
+		rq = deadline_next_request(dd, DD_READ);
 
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
@@ -302,14 +304,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (!list_empty(&dd->fifo_list[READ])) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+	if (!list_empty(&dd->fifo_list[DD_READ])) {
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
 
-		if (deadline_fifo_request(dd, WRITE) &&
+		if (deadline_fifo_request(dd, DD_WRITE) &&
 		    (dd->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
-		data_dir = READ;
+		data_dir = DD_READ;
 
 		goto dispatch_find_request;
 	}
@@ -318,13 +320,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (!list_empty(&dd->fifo_list[WRITE])) {
+	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
 
 		dd->starved = 0;
 
-		data_dir = WRITE;
+		data_dir = DD_WRITE;
 
 		goto dispatch_find_request;
 	}
@@ -399,8 +401,8 @@ static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
+	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
 
 	kfree(dd);
 }
@@ -424,12 +426,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	}
 	eq->elevator_data = dd;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
-	dd->fifo_expire[READ] = read_expire;
-	dd->fifo_expire[WRITE] = write_expire;
+	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
+	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
+	dd->sort_list[DD_READ] = RB_ROOT;
+	dd->sort_list[DD_WRITE] = RB_ROOT;
+	dd->fifo_expire[DD_READ] = read_expire;
+	dd->fifo_expire[DD_WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
@@ -497,7 +499,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 {
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
 	lockdep_assert_held(&dd->lock);
 
@@ -585,7 +587,7 @@ static void dd_finish_request(struct request *rq)
 
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
-		if (!list_empty(&dd->fifo_list[WRITE]))
+		if (!list_empty(&dd->fifo_list[DD_WRITE]))
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
@@ -626,8 +628,8 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 		__data = jiffies_to_msecs(__data);			\
 	return deadline_var_show(__data, (page));			\
 }
-SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
-SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
 SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
 SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
 SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
@@ -649,8 +651,8 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
 		*(__PTR) = __data;					\
 	return count;							\
 }
-STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
 STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
 STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
 STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
@@ -717,8 +719,8 @@ static int deadline_##name##_next_rq_show(void *data,			\
 		__blk_mq_debugfs_rq_show(m, rq);			\
 	return 0;							\
 }
-DEADLINE_DEBUGFS_DDIR_ATTRS(READ, read)
-DEADLINE_DEBUGFS_DDIR_ATTRS(WRITE, write)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
 #undef DEADLINE_DEBUGFS_DDIR_ATTRS
 
 static int deadline_batching_show(void *data, struct seq_file *m)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (8 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 09/14] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
@ 2021-06-08 23:06 ` Bart Van Assche
  2021-06-09  4:46   ` Damien Le Moal
                     ` (2 more replies)
  2021-06-08 23:07 ` [PATCH 11/14] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests Bart Van Assche
                   ` (3 subsequent siblings)
  13 siblings, 3 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

Define separate macros for integers and jiffies to improve readability.
Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol()
since the former functions are the recommended functions.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 66 ++++++++++++++++++++-------------------------
 1 file changed, 29 insertions(+), 37 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 69126beff77d..1d1bb7a41d2a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -605,58 +605,50 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 /*
  * sysfs parts below
  */
-static ssize_t
-deadline_var_show(int var, char *page)
-{
-	return sprintf(page, "%d\n", var);
-}
-
-static void
-deadline_var_store(int *var, const char *page)
-{
-	char *p = (char *) page;
-
-	*var = simple_strtol(p, &p, 10);
-}
-
-#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+#define SHOW_INT(__FUNC, __VAR)						\
 static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 {									\
 	struct deadline_data *dd = e->elevator_data;			\
-	int __data = __VAR;						\
-	if (__CONV)							\
-		__data = jiffies_to_msecs(__data);			\
-	return deadline_var_show(__data, (page));			\
-}
-SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
-SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
-SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
-SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
-SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
-#undef SHOW_FUNCTION
+									\
+	return sysfs_emit((page), "%d\n", __VAR);			\
+}
+#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
+SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
+SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
+SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
+SHOW_INT(deadline_front_merges_show, dd->front_merges);
+SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
+#undef SHOW_INT
+#undef SHOW_JIFFIES
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
 static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
 {									\
 	struct deadline_data *dd = e->elevator_data;			\
-	int __data;							\
-	deadline_var_store(&__data, (page));				\
+	int __data, __ret;						\
+									\
+	__ret = kstrtoint((page), 0, &__data);				\
+	if (__ret < 0)							\
+		return __ret;						\
 	if (__data < (MIN))						\
 		__data = (MIN);						\
 	else if (__data > (MAX))					\
 		__data = (MAX);						\
-	if (__CONV)							\
-		*(__PTR) = msecs_to_jiffies(__data);			\
-	else								\
-		*(__PTR) = __data;					\
+	*(__PTR) = __CONV(__data);					\
 	return count;							\
 }
-STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
-STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
-STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
+#define STORE_INT(__FUNC, __PTR, MIN, MAX)				\
+	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, )
+#define STORE_JIFFIES(__FUNC, __PTR, MIN, MAX)				\
+	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
+STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
+STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
+STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
+STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
+STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
 #undef STORE_FUNCTION
+#undef STORE_INT
+#undef STORE_JIFFIES
 
 #define DD_ATTR(name) \
 	__ATTR(name, 0644, deadline_##name##_show, deadline_##name##_store)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 11/14] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (9 preceding siblings ...)
  2021-06-08 23:06 ` [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
@ 2021-06-08 23:07 ` Bart Van Assche
  2021-06-10  6:26   ` Hannes Reinecke
  2021-06-08 23:07 ` [PATCH 12/14] block/mq-deadline: Add I/O priority support Bart Van Assche
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

For interactive workloads it is important that synchronous requests are
not delayed. Hence reserve 25% of scheduler tags for synchronous requests.
This patch still allows asynchronous requests to fill the hardware queues
since blk_mq_init_sched() makes sure that the number of scheduler requests
is the double of the hardware queue depth. From blk_mq_init_sched():

	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
				   BLKDEV_MAX_RQ);

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 1d1bb7a41d2a..a7d0584437d1 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -67,6 +67,7 @@ struct deadline_data {
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
+	u32 async_depth;
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
@@ -397,6 +398,44 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
+/*
+ * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
+ * function is used by __blk_mq_get_tag().
+ */
+static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
+{
+	struct deadline_data *dd = data->q->elevator->elevator_data;
+
+	/* Do not throttle synchronous reads. */
+	if (op_is_sync(op) && !op_is_write(op))
+		return;
+
+	/*
+	 * Throttle asynchronous requests and writes such that these requests
+	 * do not block the allocation of synchronous requests.
+	 */
+	data->shallow_depth = dd->async_depth;
+}
+
+/* Called by blk_mq_update_nr_requests(). */
+static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct blk_mq_tags *tags = hctx->sched_tags;
+
+	dd->async_depth = max(1UL, 3 * q->nr_requests / 4);
+
+	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, dd->async_depth);
+}
+
+/* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */
+static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	dd_depth_updated(hctx);
+	return 0;
+}
+
 static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
@@ -733,6 +772,15 @@ static int deadline_starved_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int dd_async_depth_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	seq_printf(m, "%u\n", dd->async_depth);
+	return 0;
+}
+
 static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
 	__acquires(&dd->lock)
 {
@@ -775,6 +823,7 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 	DEADLINE_QUEUE_DDIR_ATTRS(write),
 	{"batching", 0400, deadline_batching_show},
 	{"starved", 0400, deadline_starved_show},
+	{"async_depth", 0400, dd_async_depth_show},
 	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
 	{},
 };
@@ -783,6 +832,8 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 
 static struct elevator_type mq_deadline = {
 	.ops = {
+		.depth_updated		= dd_depth_updated,
+		.limit_depth		= dd_limit_depth,
 		.insert_requests	= dd_insert_requests,
 		.dispatch_request	= dd_dispatch_request,
 		.prepare_request	= dd_prepare_request,
@@ -796,6 +847,7 @@ static struct elevator_type mq_deadline = {
 		.has_work		= dd_has_work,
 		.init_sched		= dd_init_sched,
 		.exit_sched		= dd_exit_sched,
+		.init_hctx		= dd_init_hctx,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 12/14] block/mq-deadline: Add I/O priority support
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (10 preceding siblings ...)
  2021-06-08 23:07 ` [PATCH 11/14] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests Bart Van Assche
@ 2021-06-08 23:07 ` Bart Van Assche
  2021-06-09  5:03   ` Damien Le Moal
  2021-06-10  6:35   ` Hannes Reinecke
  2021-06-08 23:07 ` [PATCH 13/14] block/mq-deadline: Add cgroup support Bart Van Assche
  2021-06-08 23:07 ` [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
  13 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

Maintain one dispatch list and FIFO one list per I/O priority class: RT, BE
and IDLE. Maintain statistics for each priority level. Split the debugfs
attributes per priority level as follows:

$ ls /sys/kernel/debug/block/.../sched/
async_depth  dispatch2        read_next_rq      write2_fifo_list
batching     read0_fifo_list  starved           write_next_rq
dispatch0    read1_fifo_list  write0_fifo_list
dispatch1    read2_fifo_list  write1_fifo_list

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 355 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 279 insertions(+), 76 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index a7d0584437d1..776ff49713c3 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -42,16 +42,40 @@ enum dd_data_dir {
 
 enum { DD_DIR_COUNT = 2 };
 
+enum dd_prio {
+	DD_RT_PRIO	= 0,
+	DD_BE_PRIO	= 1,
+	DD_IDLE_PRIO	= 2,
+	DD_PRIO_MAX	= 2,
+};
+
+enum { DD_PRIO_COUNT = 3 };
+
+/* I/O statistics per I/O priority. */
+struct io_stats_per_prio {
+	local_t inserted;
+	local_t merged;
+	local_t dispatched;
+	local_t completed;
+};
+
+/* I/O statistics for all I/O priorities (enum dd_prio). */
+struct io_stats {
+	struct io_stats_per_prio stats[DD_PRIO_COUNT];
+};
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
 
 	/*
-	 * requests (deadline_rq s) are present on both sort_list and fifo_list
+	 * Requests are present on both sort_list[] and fifo_list[][]. The
+	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
+	 * The second index is the data direction (rq_data_dir(rq)).
 	 */
 	struct rb_root sort_list[DD_DIR_COUNT];
-	struct list_head fifo_list[DD_DIR_COUNT];
+	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];
 
 	/*
 	 * next in sort order. read, write or both are NULL
@@ -60,6 +84,8 @@ struct deadline_data {
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
+	struct io_stats __percpu *stats;
+
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -71,7 +97,42 @@ struct deadline_data {
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
-	struct list_head dispatch;
+	struct list_head dispatch[DD_PRIO_COUNT];
+};
+
+/* Count one event of type 'event_type' and with I/O priority 'prio' */
+#define dd_count(dd, event_type, prio) do {				\
+	struct io_stats *io_stats = get_cpu_ptr((dd)->stats);		\
+									\
+	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
+	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
+	local_inc(&io_stats->stats[(prio)].event_type);			\
+	put_cpu_ptr(io_stats);						\
+} while (0)
+
+/*
+ * Returns the total number of dd_count(dd, event_type, prio) calls across all
+ * CPUs. No locking or barriers since it is fine if the returned sum is slightly
+ * outdated.
+ */
+#define dd_sum(dd, event_type, prio) ({					\
+	unsigned int cpu;						\
+	u32 sum = 0;							\
+									\
+	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
+	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
+	for_each_present_cpu(cpu)					\
+		sum += local_read(&per_cpu_ptr((dd)->stats, cpu)->	\
+				  stats[(prio)].event_type);		\
+	sum;								\
+})
+
+/* Maps an I/O priority class to a deadline scheduler priority. */
+static const enum dd_prio ioprio_class_to_prio[] = {
+	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
+	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
+	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
+	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
 };
 
 static inline struct rb_root *
@@ -147,12 +208,31 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 	}
 }
 
+/*
+ * Returns the I/O priority class (IOPRIO_CLASS_*) that has been assigned to a
+ * request.
+ */
+static u8 dd_rq_ioclass(struct request *rq)
+{
+	return IOPRIO_PRIO_CLASS(req_get_ioprio(rq));
+}
+
 /*
  * Callback function that is invoked after @next has been merged into @req.
  */
 static void dd_merged_requests(struct request_queue *q, struct request *req,
 			       struct request *next)
 {
+	struct deadline_data *dd = q->elevator->elevator_data;
+	const u8 ioprio_class = dd_rq_ioclass(next);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+
+	if (next->elv.priv[0]) {
+		dd_count(dd, merged, prio);
+	} else {
+		WARN_ON_ONCE(true);
+	}
+
 	/*
 	 * if next expires before rq, assign its expire time to rq
 	 * and move into next position (next will be deleted) in fifo
@@ -189,14 +269,21 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
 	deadline_remove_request(rq->q, rq);
 }
 
+/* Number of requests queued for a given priority level. */
+static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
+{
+	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+}
+
 /*
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
 static inline int deadline_check_fifo(struct deadline_data *dd,
+				      enum dd_prio prio,
 				      enum dd_data_dir data_dir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
 
 	/*
 	 * rq is expired!
@@ -212,15 +299,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd,
  * dispatch using arrival ordered lists.
  */
 static struct request *
-deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
+deadline_fifo_request(struct deadline_data *dd, enum dd_prio prio,
+		      enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (list_empty(&dd->fifo_list[data_dir]))
+	if (list_empty(&dd->fifo_list[prio][data_dir]))
 		return NULL;
 
-	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
 	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
@@ -229,7 +317,7 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 	 * an unlocked target zone.
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
+	list_for_each_entry(rq, &dd->fifo_list[prio][DD_WRITE], queuelist) {
 		if (blk_req_can_dispatch_to_zone(rq))
 			goto out;
 	}
@@ -245,7 +333,8 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
  * dispatch using sector position sorted lists.
  */
 static struct request *
-deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
+deadline_next_request(struct deadline_data *dd, enum dd_prio prio,
+		      enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
@@ -276,15 +365,18 @@ deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
  * deadline_dispatch_requests selects the best request according to
  * read/write expire, fifo_batch, etc
  */
-static struct request *__dd_dispatch_request(struct deadline_data *dd)
+static struct request *__dd_dispatch_request(struct deadline_data *dd,
+					     enum dd_prio prio)
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
+	u8 ioprio_class;
 
 	lockdep_assert_held(&dd->lock);
 
-	if (!list_empty(&dd->dispatch)) {
-		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
+	if (!list_empty(&dd->dispatch[prio])) {
+		rq = list_first_entry(&dd->dispatch[prio], struct request,
+				      queuelist);
 		list_del_init(&rq->queuelist);
 		goto done;
 	}
@@ -292,9 +384,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	rq = deadline_next_request(dd, DD_WRITE);
+	rq = deadline_next_request(dd, prio, DD_WRITE);
 	if (!rq)
-		rq = deadline_next_request(dd, DD_READ);
+		rq = deadline_next_request(dd, prio, DD_READ);
 
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
@@ -305,10 +397,10 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (!list_empty(&dd->fifo_list[DD_READ])) {
+	if (!list_empty(&dd->fifo_list[prio][DD_READ])) {
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
 
-		if (deadline_fifo_request(dd, DD_WRITE) &&
+		if (deadline_fifo_request(dd, prio, DD_WRITE) &&
 		    (dd->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
@@ -321,7 +413,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
+	if (!list_empty(&dd->fifo_list[prio][DD_WRITE])) {
 dispatch_writes:
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
 
@@ -338,14 +430,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	next_rq = deadline_next_request(dd, data_dir);
-	if (deadline_check_fifo(dd, data_dir) || !next_rq) {
+	next_rq = deadline_next_request(dd, prio, data_dir);
+	if (deadline_check_fifo(dd, prio, data_dir) || !next_rq) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = deadline_fifo_request(dd, data_dir);
+		rq = deadline_fifo_request(dd, prio, data_dir);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
@@ -370,6 +462,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	dd->batching++;
 	deadline_move_request(dd, rq);
 done:
+	ioprio_class = dd_rq_ioclass(rq);
+	prio = ioprio_class_to_prio[ioprio_class];
+	if (rq->elv.priv[0]) {
+		dd_count(dd, dispatched, prio);
+	} else {
+		WARN_ON_ONCE(true);
+	}
 	/*
 	 * If the request needs its target zone locked, do it.
 	 */
@@ -390,9 +489,14 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 	struct request *rq;
+	enum dd_prio prio;
 
 	spin_lock(&dd->lock);
-	rq = __dd_dispatch_request(dd);
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, prio);
+		if (rq)
+			break;
+	}
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -439,9 +543,14 @@ static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
+	enum dd_prio prio;
+
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_READ]));
+		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_WRITE]));
+	}
 
-	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
+	free_percpu(dd->stats);
 
 	kfree(dd);
 }
@@ -453,20 +562,29 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
+	enum dd_prio prio;
+	int ret = -ENOMEM;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
-		return -ENOMEM;
+		return ret;
 
 	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
-	if (!dd) {
-		kobject_put(&eq->kobj);
-		return -ENOMEM;
-	}
+	if (!dd)
+		goto put_eq;
+
 	eq->elevator_data = dd;
 
-	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
+	dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!dd->stats)
+		goto free_dd;
+
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_READ]);
+		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_WRITE]);
+		INIT_LIST_HEAD(&dd->dispatch[prio]);
+	}
 	dd->sort_list[DD_READ] = RB_ROOT;
 	dd->sort_list[DD_WRITE] = RB_ROOT;
 	dd->fifo_expire[DD_READ] = read_expire;
@@ -476,10 +594,16 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_batch = fifo_batch;
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
-	INIT_LIST_HEAD(&dd->dispatch);
 
 	q->elevator = eq;
 	return 0;
+
+free_dd:
+	kfree(dd);
+
+put_eq:
+	kobject_put(&eq->kobj);
+	return ret;
 }
 
 /*
@@ -539,6 +663,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const enum dd_data_dir data_dir = rq_data_dir(rq);
+	u16 ioprio = req_get_ioprio(rq);
+	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
+	enum dd_prio prio;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -548,13 +675,18 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
+	prio = ioprio_class_to_prio[ioprio_class];
+	dd_count(dd, inserted, prio);
+	WARN_ON_ONCE(rq->elv.priv[0]);
+	rq->elv.priv[0] = (void *)1ULL;
+
 	if (blk_mq_sched_try_insert_merge(q, rq))
 		return;
 
 	trace_block_rq_insert(rq);
 
 	if (at_head) {
-		list_add(&rq->queuelist, &dd->dispatch);
+		list_add(&rq->queuelist, &dd->dispatch[prio]);
 	} else {
 		deadline_add_rq_rb(dd, rq);
 
@@ -568,7 +700,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		 * set expire time and add to fifo list
 		 */
 		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
-		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+		list_add_tail(&rq->queuelist, &dd->fifo_list[prio][data_dir]);
 	}
 }
 
@@ -592,12 +724,10 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	spin_unlock(&dd->lock);
 }
 
-/*
- * Nothing to do here. This is defined only to ensure that .finish_request
- * method is called upon request completion.
- */
+/* Callback from inside blk_mq_rq_ctx_init(). */
 static void dd_prepare_request(struct request *rq)
 {
+	rq->elv.priv[0] = NULL;
 }
 
 /*
@@ -619,26 +749,41 @@ static void dd_prepare_request(struct request *rq)
 static void dd_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	const u8 ioprio_class = dd_rq_ioclass(rq);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+
+	if (rq->elv.priv[0])
+		dd_count(dd, completed, prio);
 
 	if (blk_queue_is_zoned(q)) {
-		struct deadline_data *dd = q->elevator->elevator_data;
 		unsigned long flags;
 
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
-		if (!list_empty(&dd->fifo_list[DD_WRITE]))
+		if (!list_empty(&dd->fifo_list[prio][DD_WRITE]))
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
 }
 
+static bool dd_has_work_for_prio(struct deadline_data *dd, enum dd_prio prio)
+{
+	return !list_empty_careful(&dd->dispatch[prio]) ||
+		!list_empty_careful(&dd->fifo_list[prio][DD_READ]) ||
+		!list_empty_careful(&dd->fifo_list[prio][DD_WRITE]);
+}
+
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+	enum dd_prio prio;
+
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++)
+		if (dd_has_work_for_prio(dd, prio))
+			return true;
 
-	return !list_empty_careful(&dd->dispatch) ||
-		!list_empty_careful(&dd->fifo_list[0]) ||
-		!list_empty_careful(&dd->fifo_list[1]);
+	return false;
 }
 
 /*
@@ -702,7 +847,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 };
 
 #ifdef CONFIG_BLK_DEBUG_FS
-#define DEADLINE_DEBUGFS_DDIR_ATTRS(ddir, name)				\
+#define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
 static void *deadline_##name##_fifo_start(struct seq_file *m,		\
 					  loff_t *pos)			\
 	__acquires(&dd->lock)						\
@@ -711,7 +856,7 @@ static void *deadline_##name##_fifo_start(struct seq_file *m,		\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
 	spin_lock(&dd->lock);						\
-	return seq_list_start(&dd->fifo_list[ddir], *pos);		\
+	return seq_list_start(&dd->fifo_list[prio][data_dir], *pos);	\
 }									\
 									\
 static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
@@ -720,7 +865,7 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
-	return seq_list_next(v, &dd->fifo_list[ddir], pos);		\
+	return seq_list_next(v, &dd->fifo_list[prio][data_dir], pos);	\
 }									\
 									\
 static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
@@ -737,22 +882,31 @@ static const struct seq_operations deadline_##name##_fifo_seq_ops = {	\
 	.next	= deadline_##name##_fifo_next,				\
 	.stop	= deadline_##name##_fifo_stop,				\
 	.show	= blk_mq_debugfs_rq_show,				\
-};									\
-									\
+};
+
+#define DEADLINE_DEBUGFS_NEXT_RQ(data_dir, name)			\
 static int deadline_##name##_next_rq_show(void *data,			\
 					  struct seq_file *m)		\
 {									\
 	struct request_queue *q = data;					\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
-	struct request *rq = dd->next_rq[ddir];				\
+	struct request *rq = dd->next_rq[data_dir];			\
 									\
 	if (rq)								\
 		__blk_mq_debugfs_rq_show(m, rq);			\
 	return 0;							\
 }
-DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
-DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
+
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_READ, read0)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_WRITE, write0)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_READ, read1)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_WRITE, write1)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_READ, read2)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_WRITE, write2)
+DEADLINE_DEBUGFS_NEXT_RQ(DD_READ, read)
+DEADLINE_DEBUGFS_NEXT_RQ(DD_WRITE, write)
 #undef DEADLINE_DEBUGFS_DDIR_ATTRS
+#undef DEADLINE_DEBUGFS_NEXT_RQ
 
 static int deadline_batching_show(void *data, struct seq_file *m)
 {
@@ -781,50 +935,99 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
 	return 0;
 }
 
-static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
-	__acquires(&dd->lock)
+static int dd_queued_show(void *data, struct seq_file *m)
 {
-	struct request_queue *q = m->private;
+	struct request_queue *q = data;
 	struct deadline_data *dd = q->elevator->elevator_data;
 
-	spin_lock(&dd->lock);
-	return seq_list_start(&dd->dispatch, *pos);
+	seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
+		   dd_queued(dd, DD_BE_PRIO),
+		   dd_queued(dd, DD_IDLE_PRIO));
+	return 0;
 }
 
-static void *deadline_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
+/* Number of requests owned by the block driver for a given priority. */
+static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
 {
-	struct request_queue *q = m->private;
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	return seq_list_next(v, &dd->dispatch, pos);
+	return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
+		- dd_sum(dd, completed, prio);
 }
 
-static void deadline_dispatch_stop(struct seq_file *m, void *v)
-	__releases(&dd->lock)
+static int dd_owned_by_driver_show(void *data, struct seq_file *m)
 {
-	struct request_queue *q = m->private;
+	struct request_queue *q = data;
 	struct deadline_data *dd = q->elevator->elevator_data;
 
-	spin_unlock(&dd->lock);
+	seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
+		   dd_owned_by_driver(dd, DD_BE_PRIO),
+		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
+	return 0;
 }
 
-static const struct seq_operations deadline_dispatch_seq_ops = {
-	.start	= deadline_dispatch_start,
-	.next	= deadline_dispatch_next,
-	.stop	= deadline_dispatch_stop,
-	.show	= blk_mq_debugfs_rq_show,
-};
+#define DEADLINE_DISPATCH_ATTR(prio)					\
+static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
+					     loff_t *pos)		\
+	__acquires(&dd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+									\
+	spin_lock(&dd->lock);						\
+	return seq_list_start(&dd->dispatch[prio], *pos);		\
+}									\
+									\
+static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
+					    void *v, loff_t *pos)	\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+									\
+	return seq_list_next(v, &dd->dispatch[prio], pos);		\
+}									\
+									\
+static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
+	__releases(&dd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+									\
+	spin_unlock(&dd->lock);						\
+}									\
+									\
+static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
+	.start	= deadline_dispatch##prio##_start,			\
+	.next	= deadline_dispatch##prio##_next,			\
+	.stop	= deadline_dispatch##prio##_stop,			\
+	.show	= blk_mq_debugfs_rq_show,				\
+}
+
+DEADLINE_DISPATCH_ATTR(0);
+DEADLINE_DISPATCH_ATTR(1);
+DEADLINE_DISPATCH_ATTR(2);
+#undef DEADLINE_DISPATCH_ATTR
 
-#define DEADLINE_QUEUE_DDIR_ATTRS(name)						\
-	{#name "_fifo_list", 0400, .seq_ops = &deadline_##name##_fifo_seq_ops},	\
+#define DEADLINE_QUEUE_DDIR_ATTRS(name)					\
+	{#name "_fifo_list", 0400,					\
+			.seq_ops = &deadline_##name##_fifo_seq_ops}
+#define DEADLINE_NEXT_RQ_ATTR(name)					\
 	{#name "_next_rq", 0400, deadline_##name##_next_rq_show}
 static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
-	DEADLINE_QUEUE_DDIR_ATTRS(read),
-	DEADLINE_QUEUE_DDIR_ATTRS(write),
+	DEADLINE_QUEUE_DDIR_ATTRS(read0),
+	DEADLINE_QUEUE_DDIR_ATTRS(write0),
+	DEADLINE_QUEUE_DDIR_ATTRS(read1),
+	DEADLINE_QUEUE_DDIR_ATTRS(write1),
+	DEADLINE_QUEUE_DDIR_ATTRS(read2),
+	DEADLINE_QUEUE_DDIR_ATTRS(write2),
+	DEADLINE_NEXT_RQ_ATTR(read),
+	DEADLINE_NEXT_RQ_ATTR(write),
 	{"batching", 0400, deadline_batching_show},
 	{"starved", 0400, deadline_starved_show},
 	{"async_depth", 0400, dd_async_depth_show},
-	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
+	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
+	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
+	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
+	{"owned_by_driver", 0400, dd_owned_by_driver_show},
+	{"queued", 0400, dd_queued_show},
 	{},
 };
 #undef DEADLINE_QUEUE_DDIR_ATTRS
@@ -874,6 +1077,6 @@ static void __exit deadline_exit(void)
 module_init(deadline_init);
 module_exit(deadline_exit);
 
-MODULE_AUTHOR("Jens Axboe");
+MODULE_AUTHOR("Jens Axboe, Damien Le Moal and Bart Van Assche");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("MQ deadline IO scheduler");

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 13/14] block/mq-deadline: Add cgroup support
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (11 preceding siblings ...)
  2021-06-08 23:07 ` [PATCH 12/14] block/mq-deadline: Add I/O priority support Bart Van Assche
@ 2021-06-08 23:07 ` Bart Van Assche
  2021-06-10  6:37   ` Hannes Reinecke
  2021-06-08 23:07 ` [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
  13 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

Maintain statistics per cgroup and export these to user space. These
statistics are essential for verifying whether the proper I/O priorities
have been assigned to requests. With this patch applied the output of
(cd /sys/fs/cgroup/blkio/ && ls) is as follows:
blkio.dd.dispatched_pri0  blkio.dd.inserted_pri3  cgroup.clone_children
blkio.dd.dispatched_pri1  blkio.dd.merged_pri0    cgroup.procs
blkio.dd.dispatched_pri2  blkio.dd.merged_pri1    cgroup.sane_behavior
blkio.dd.dispatched_pri3  blkio.dd.merged_pri2    notify_on_release
blkio.dd.inserted_pri0    blkio.dd.merged_pri3    release_agent
blkio.dd.inserted_pri1    blkio.prio.class        tasks
blkio.dd.inserted_pri2    blkio.reset_stats

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/Kconfig.iosched                       |   6 +
 block/Makefile                              |   2 +
 block/mq-deadline-cgroup.c                  | 185 ++++++++++++++++++++
 block/mq-deadline-cgroup.h                  | 112 ++++++++++++
 block/{mq-deadline.c => mq-deadline-main.c} |  85 +++++++--
 5 files changed, 376 insertions(+), 14 deletions(-)
 create mode 100644 block/mq-deadline-cgroup.c
 create mode 100644 block/mq-deadline-cgroup.h
 rename block/{mq-deadline.c => mq-deadline-main.c} (94%)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 2f2158e05a91..64053d67a97b 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -9,6 +9,12 @@ config MQ_IOSCHED_DEADLINE
 	help
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_DEADLINE_CGROUP
+       tristate
+       default y
+       depends on MQ_IOSCHED_DEADLINE
+       depends on BLK_CGROUP
+
 config MQ_IOSCHED_KYBER
 	tristate "Kyber I/O scheduler"
 	default y
diff --git a/block/Makefile b/block/Makefile
index af3d044abaf1..b9db5d4edfc8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,8 @@ obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+mq-deadline-y += mq-deadline-main.o
+mq-deadline-$(CONFIG_MQ_IOSCHED_DEADLINE_CGROUP)+= mq-deadline-cgroup.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
 obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
diff --git a/block/mq-deadline-cgroup.c b/block/mq-deadline-cgroup.c
new file mode 100644
index 000000000000..73b04e5a577f
--- /dev/null
+++ b/block/mq-deadline-cgroup.c
@@ -0,0 +1,185 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/blk-cgroup.h>
+#include <linux/ioprio.h>
+
+#include "mq-deadline-cgroup.h"
+
+static struct blkcg_policy dd_blkcg_policy;
+
+static struct dd_blkcg *dd_blkcg_from_css(struct cgroup_subsys_state *css)
+{
+	struct blkcg *blkcg = css_to_blkcg(css);
+	struct blkcg_policy_data *cpd = blkcg_to_cpd(blkcg, &dd_blkcg_policy);
+
+	return container_of(cpd, struct dd_blkcg, cpd);
+}
+
+/*
+ * Number of inserted but not completed requests for priority class
+ * cft->private.
+ */
+static u64 dd_show_inserted(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	struct dd_blkcg *blkcg = dd_blkcg_from_css(css);
+	const u8 prio = cft->private;
+
+	return ddcg_sum(blkcg, inserted, prio) -
+		ddcg_sum(blkcg, completed, prio);
+}
+
+/* Number of merged requests for priority class cft->private. */
+static u64 dd_show_merged(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct dd_blkcg *blkcg = dd_blkcg_from_css(css);
+	const u8 prio = cft->private;
+
+	return ddcg_sum(blkcg, merged, prio);
+}
+
+/*
+ * Number of dispatched but not completed requests for priority class
+ * cft->private.
+ */
+static u64 dd_show_dispatched(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	struct dd_blkcg *blkcg = dd_blkcg_from_css(css);
+	const u8 prio = cft->private;
+
+	return ddcg_sum(blkcg, dispatched, prio) +
+		ddcg_sum(blkcg, merged, prio) -
+		ddcg_sum(blkcg, completed, prio);
+}
+
+#define IOPRI_ATTRS(pfx, cft_flags, callback)		\
+	{						\
+		.name		= pfx "0",		\
+		.private	= IOPRIO_CLASS_NONE,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	},						\
+	{						\
+		.name		= pfx "1",		\
+		.private	= IOPRIO_CLASS_RT,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	},						\
+	{						\
+		.name		= pfx "2",		\
+		.private	= IOPRIO_CLASS_BE,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	},						\
+	{						\
+		.name		= pfx "3",		\
+		.private	= IOPRIO_CLASS_IDLE,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	}
+
+#define DD_ATTRS							\
+	IOPRI_ATTRS("dd.inserted_pri", 0, dd_show_inserted),		\
+	IOPRI_ATTRS("dd.merged_pri", 0, dd_show_merged),		\
+	IOPRI_ATTRS("dd.dispatched_pri", 0, dd_show_dispatched),	\
+	{ } /* sentinel */
+
+/* cgroup v2 attributes */
+static struct cftype dd_blkcg_files[] = {
+	DD_ATTRS
+};
+
+/* cgroup v1 attributes */
+static struct cftype dd_blkcg_legacy_files[] = {
+	DD_ATTRS
+};
+
+static struct blkcg_policy_data *dd_cpd_alloc(gfp_t gfp)
+{
+	struct dd_blkcg *pd;
+
+	pd = kzalloc(sizeof(*pd), gfp);
+	if (!pd)
+		return NULL;
+	pd->stats = alloc_percpu_gfp(typeof(*pd->stats),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!pd->stats) {
+		kfree(pd);
+		return NULL;
+	}
+	return &pd->cpd;
+}
+
+static void dd_cpd_free(struct blkcg_policy_data *cpd)
+{
+	struct dd_blkcg *dd_blkcg = container_of(cpd, typeof(*dd_blkcg), cpd);
+
+	free_percpu(dd_blkcg->stats);
+	kfree(dd_blkcg);
+}
+
+/*
+ * Convert an association between a block cgroup and a request queue into a
+ * pointer to the mq-deadline information associated with a (blkcg, queue) pair.
+ */
+struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
+{
+	struct blkg_policy_data *pd;
+
+	pd = blkg_to_pd(bio->bi_blkg, &dd_blkcg_policy);
+	if (!pd)
+		return NULL;
+
+	return container_of(blkcg_to_cpd(pd->blkg->blkcg, &dd_blkcg_policy),
+			    struct dd_blkcg, cpd);
+}
+
+static struct blkg_policy_data *dd_pd_alloc(gfp_t gfp, struct request_queue *q,
+					    struct blkcg *blkcg)
+{
+	struct dd_blkg *pd;
+
+	pd = kzalloc(sizeof(*pd), gfp);
+	if (!pd)
+		return NULL;
+	return &pd->pd;
+}
+
+static void dd_pd_free(struct blkg_policy_data *pd)
+{
+	struct dd_blkg *dd_blkg = container_of(pd, typeof(*dd_blkg), pd);
+
+	kfree(dd_blkg);
+}
+
+static struct blkcg_policy dd_blkcg_policy = {
+	.dfl_cftypes		= dd_blkcg_files,
+	.legacy_cftypes		= dd_blkcg_legacy_files,
+
+	.cpd_alloc_fn		= dd_cpd_alloc,
+	.cpd_free_fn		= dd_cpd_free,
+
+	.pd_alloc_fn		= dd_pd_alloc,
+	.pd_free_fn		= dd_pd_free,
+};
+
+int dd_activate_policy(struct request_queue *q)
+{
+	return blkcg_activate_policy(q, &dd_blkcg_policy);
+}
+
+void dd_deactivate_policy(struct request_queue *q)
+{
+	blkcg_deactivate_policy(q, &dd_blkcg_policy);
+}
+
+int __init dd_blkcg_init(void)
+{
+	return blkcg_policy_register(&dd_blkcg_policy);
+}
+
+void __exit dd_blkcg_exit(void)
+{
+	blkcg_policy_unregister(&dd_blkcg_policy);
+}
diff --git a/block/mq-deadline-cgroup.h b/block/mq-deadline-cgroup.h
new file mode 100644
index 000000000000..1e263557c881
--- /dev/null
+++ b/block/mq-deadline-cgroup.h
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#if !defined(_MQ_DEADLINE_CGROUP_H_)
+#define _MQ_DEADLINE_CGROUP_H_
+
+#include <linux/blk-cgroup.h>
+
+struct request_queue;
+
+/**
+ * struct io_stats_per_prio - I/O statistics per I/O priority class.
+ * @inserted: Number of inserted requests.
+ * @merged: Number of merged requests.
+ * @dispatched: Number of dispatched requests.
+ * @completed: Number of I/O completions.
+ */
+struct io_stats_per_prio {
+	local_t inserted;
+	local_t merged;
+	local_t dispatched;
+	local_t completed;
+};
+
+/* I/O statistics per I/O cgroup per I/O priority class (IOPRIO_CLASS_*). */
+struct blkcg_io_stats {
+	struct io_stats_per_prio stats[4];
+};
+
+/**
+ * struct dd_blkcg - Per cgroup data.
+ * @cpd: blkcg_policy_data structure.
+ * @stats: I/O statistics.
+ */
+struct dd_blkcg {
+	struct blkcg_policy_data cpd;	/* must be the first member */
+	struct blkcg_io_stats __percpu *stats;
+};
+
+/*
+ * Count one event of type 'event_type' and with I/O priority class
+ * 'prio_class'.
+ */
+#define ddcg_count(ddcg, event_type, prio_class) do {			\
+	struct blkcg_io_stats *io_stats = get_cpu_ptr((ddcg)->stats);	\
+									\
+	BUILD_BUG_ON(!__same_type((ddcg), struct dd_blkcg *));		\
+	BUILD_BUG_ON(!__same_type((prio_class), u8));			\
+	local_inc(&io_stats->stats[(prio_class)].event_type);		\
+	put_cpu_ptr(io_stats);						\
+} while (0)
+
+/*
+ * Returns the total number of ddcg_count(ddcg, event_type, prio_class) calls
+ * across all CPUs. No locking or barriers since it is fine if the returned
+ * sum is slightly outdated.
+ */
+#define ddcg_sum(ddcg, event_type, prio) ({				\
+	unsigned int cpu;						\
+	u32 sum = 0;							\
+									\
+	BUILD_BUG_ON(!__same_type((ddcg), struct dd_blkcg *));		\
+	BUILD_BUG_ON(!__same_type((prio), u8));				\
+	for_each_present_cpu(cpu)					\
+		sum += local_read(&per_cpu_ptr((ddcg)->stats, cpu)->	\
+				  stats[(prio)].event_type);		\
+	sum;								\
+})
+
+#ifdef CONFIG_BLK_CGROUP
+
+/**
+ * struct dd_blkg - Per (cgroup, request queue) data.
+ * @pd: blkg_policy_data structure.
+ */
+struct dd_blkg {
+	struct blkg_policy_data pd;	/* must be the first member */
+};
+
+struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio);
+int dd_activate_policy(struct request_queue *q);
+void dd_deactivate_policy(struct request_queue *q);
+int __init dd_blkcg_init(void);
+void __exit dd_blkcg_exit(void);
+
+#else /* CONFIG_BLK_CGROUP */
+
+static inline struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline int dd_activate_policy(struct request_queue *q)
+{
+	return 0;
+}
+
+static inline void dd_deactivate_policy(struct request_queue *q)
+{
+}
+
+static inline int dd_blkcg_init(void)
+{
+	return 0;
+}
+
+static inline void dd_blkcg_exit(void)
+{
+}
+
+#endif /* CONFIG_BLK_CGROUP */
+
+#endif /* _MQ_DEADLINE_CGROUP_H_ */
diff --git a/block/mq-deadline.c b/block/mq-deadline-main.c
similarity index 94%
rename from block/mq-deadline.c
rename to block/mq-deadline-main.c
index 776ff49713c3..1b2b6c1de5b8 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline-main.c
@@ -25,6 +25,7 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
 #include "blk-mq-sched.h"
+#include "mq-deadline-cgroup.h"
 
 /*
  * See Documentation/block/deadline-iosched.rst
@@ -51,14 +52,6 @@ enum dd_prio {
 
 enum { DD_PRIO_COUNT = 3 };
 
-/* I/O statistics per I/O priority. */
-struct io_stats_per_prio {
-	local_t inserted;
-	local_t merged;
-	local_t dispatched;
-	local_t completed;
-};
-
 /* I/O statistics for all I/O priorities (enum dd_prio). */
 struct io_stats {
 	struct io_stats_per_prio stats[DD_PRIO_COUNT];
@@ -69,6 +62,9 @@ struct deadline_data {
 	 * run time data
 	 */
 
+	/* Request queue that owns this data structure. */
+	struct request_queue *queue;
+
 	/*
 	 * Requests are present on both sort_list[] and fifo_list[][]. The
 	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
@@ -226,11 +222,15 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+	struct dd_blkcg *blkcg = next->elv.priv[0];
 
-	if (next->elv.priv[0]) {
+	if (blkcg) {
 		dd_count(dd, merged, prio);
+		ddcg_count(blkcg, merged, ioprio_class);
 	} else {
+#if defined(CONFIG_BLK_CGROUP)
 		WARN_ON_ONCE(true);
+#endif
 	}
 
 	/*
@@ -370,6 +370,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
+	struct dd_blkcg *blkcg;
 	u8 ioprio_class;
 
 	lockdep_assert_held(&dd->lock);
@@ -462,12 +463,16 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	dd->batching++;
 	deadline_move_request(dd, rq);
 done:
+	blkcg = rq->elv.priv[0];
 	ioprio_class = dd_rq_ioclass(rq);
 	prio = ioprio_class_to_prio[ioprio_class];
-	if (rq->elv.priv[0]) {
+	if (blkcg) {
 		dd_count(dd, dispatched, prio);
+		ddcg_count(blkcg, dispatched, ioprio_class);
 	} else {
+#if defined(CONFIG_BLK_CGROUP)
 		WARN_ON_ONCE(true);
+#endif
 	}
 	/*
 	 * If the request needs its target zone locked, do it.
@@ -545,6 +550,8 @@ static void dd_exit_sched(struct elevator_queue *e)
 	struct deadline_data *dd = e->elevator_data;
 	enum dd_prio prio;
 
+	dd_deactivate_policy(dd->queue);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_READ]));
 		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_WRITE]));
@@ -556,7 +563,7 @@ static void dd_exit_sched(struct elevator_queue *e)
 }
 
 /*
- * initialize elevator private data (deadline_data).
+ * Initialize elevator private data (deadline_data) and associate with blkcg.
  */
 static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
@@ -565,6 +572,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	enum dd_prio prio;
 	int ret = -ENOMEM;
 
+	/*
+	 * Initialization would be very tricky if the queue is not frozen,
+	 * hence the warning statement below.
+	 */
+	WARN_ON_ONCE(!percpu_ref_is_zero(&q->q_usage_counter));
+
 	eq = elevator_alloc(q, e);
 	if (!eq)
 		return ret;
@@ -580,6 +593,8 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (!dd->stats)
 		goto free_dd;
 
+	dd->queue = q;
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_READ]);
 		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_WRITE]);
@@ -595,9 +610,17 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
+	ret = dd_activate_policy(q);
+	if (ret)
+		goto free_stats;
+
+	ret = 0;
 	q->elevator = eq;
 	return 0;
 
+free_stats:
+	free_percpu(dd->stats);
+
 free_dd:
 	kfree(dd);
 
@@ -666,6 +689,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	u16 ioprio = req_get_ioprio(rq);
 	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
 	enum dd_prio prio;
+	struct dd_blkcg *blkcg;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -675,10 +699,25 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
+	/*
+	 * If a block cgroup has been associated with the submitter and if an
+	 * I/O priority has been set in the associated block cgroup, use the
+	 * lowest of the cgroup priority and the request priority for the
+	 * request. If no priority has been set in the request, use the cgroup
+	 * priority.
+	 */
+	blkcg = dd_blkcg_from_bio(rq->bio);
+	if (blkcg) {
+		ddcg_count(blkcg, inserted, ioprio_class);
+	} else {
+#if defined(CONFIG_BLK_CGROUP)
+		WARN_ON_ONCE(true);
+#endif
+	}
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd_count(dd, inserted, prio);
 	WARN_ON_ONCE(rq->elv.priv[0]);
-	rq->elv.priv[0] = (void *)1ULL;
+	rq->elv.priv[0] = blkcg;
 
 	if (blk_mq_sched_try_insert_merge(q, rq))
 		return;
@@ -750,11 +789,14 @@ static void dd_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct deadline_data *dd = q->elevator->elevator_data;
+	struct dd_blkcg *blkcg = rq->elv.priv[0];
 	const u8 ioprio_class = dd_rq_ioclass(rq);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 
-	if (rq->elv.priv[0])
+	if (blkcg) {
 		dd_count(dd, completed, prio);
+		ddcg_count(blkcg, completed, ioprio_class);
+	}
 
 	if (blk_queue_is_zoned(q)) {
 		unsigned long flags;
@@ -1066,11 +1108,26 @@ MODULE_ALIAS("mq-deadline-iosched");
 
 static int __init deadline_init(void)
 {
-	return elv_register(&mq_deadline);
+	int ret;
+
+	ret = elv_register(&mq_deadline);
+	if (ret)
+		goto out;
+	ret = dd_blkcg_init();
+	if (ret)
+		goto unreg;
+
+out:
+	return ret;
+
+unreg:
+	elv_unregister(&mq_deadline);
+	goto out;
 }
 
 static void __exit deadline_exit(void)
 {
+	dd_blkcg_exit();
 	elv_unregister(&mq_deadline);
 }
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests
  2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
                   ` (12 preceding siblings ...)
  2021-06-08 23:07 ` [PATCH 13/14] block/mq-deadline: Add cgroup support Bart Van Assche
@ 2021-06-08 23:07 ` Bart Van Assche
  2021-06-09  5:10   ` Damien Le Moal
  2021-06-10  6:43   ` Hannes Reinecke
  13 siblings, 2 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-08 23:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

While one or more requests with a certain I/O priority are pending, do not
dispatch lower priority requests. Dispatch lower priority requests anyway
after the "aging" time has expired.

This patch has been tested as follows:

modprobe scsi_debug ndelay=1000000 max_queue=16 &&
sd='' &&
while [ -z "$sd" ]; do
  sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done &&
echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
cd /sys/fs/cgroup/blkio/ &&
echo $$ >cgroup.procs &&
echo 2 >blkio.prio.class &&
mkdir -p hipri &&
cd hipri &&
echo 1 >blkio.prio.class &&
{ max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
echo $$ >cgroup.procs &&
max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt

Result:
* 11000 IOPS for the high-priority job
*   400 IOPS for the low-priority job

If the aging expiry time is changed from 100s into 0, the IOPS results change
into 6712 and 6796 IOPS.

The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline-main.c | 42 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
index 1b2b6c1de5b8..cecdb475a610 100644
--- a/block/mq-deadline-main.c
+++ b/block/mq-deadline-main.c
@@ -32,6 +32,11 @@
  */
 static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
 static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+/*
+ * Time after which to dispatch lower priority requests even if higher
+ * priority requests are pending.
+ */
+static const int aging_expire = 10 * HZ;
 static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
@@ -90,6 +95,7 @@ struct deadline_data {
 	int writes_starved;
 	int front_merges;
 	u32 async_depth;
+	int aging_expire;
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
@@ -363,10 +369,10 @@ deadline_next_request(struct deadline_data *dd, enum dd_prio prio,
 
 /*
  * deadline_dispatch_requests selects the best request according to
- * read/write expire, fifo_batch, etc
+ * read/write expire, fifo_batch, etc and with a start time <= @latest.
  */
 static struct request *__dd_dispatch_request(struct deadline_data *dd,
-					     enum dd_prio prio)
+					enum dd_prio prio, u64 latest_start_ns)
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
@@ -378,6 +384,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	if (!list_empty(&dd->dispatch[prio])) {
 		rq = list_first_entry(&dd->dispatch[prio], struct request,
 				      queuelist);
+		if (rq->start_time_ns > latest_start_ns)
+			return NULL;
 		list_del_init(&rq->queuelist);
 		goto done;
 	}
@@ -457,6 +465,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	dd->batching = 0;
 
 dispatch_request:
+	if (rq->start_time_ns > latest_start_ns)
+		return NULL;
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -493,15 +503,32 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
-	struct request *rq;
+	const u64 now_ns = ktime_get_ns();
+	struct request *rq = NULL;
 	enum dd_prio prio;
 
 	spin_lock(&dd->lock);
-	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
-		rq = __dd_dispatch_request(dd, prio);
+	/*
+	 * Start with dispatching requests whose deadline expired more than
+	 * aging_expire jiffies ago.
+	 */
+	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, prio, now_ns -
+					   jiffies_to_nsecs(dd->aging_expire));
 		if (rq)
+			goto unlock;
+	}
+	/*
+	 * Next, dispatch requests in priority order. Ignore lower priority
+	 * requests if any higher priority requests are pending.
+	 */
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, prio, now_ns);
+		if (rq || dd_queued(dd, prio))
 			break;
 	}
+
+unlock:
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -607,6 +634,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
+	dd->aging_expire = aging_expire;
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
@@ -725,6 +753,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	trace_block_rq_insert(rq);
 
 	if (at_head) {
+		rq->fifo_time = jiffies;
 		list_add(&rq->queuelist, &dd->dispatch[prio]);
 	} else {
 		deadline_add_rq_rb(dd, rq);
@@ -841,6 +870,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
 SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
 SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
+SHOW_JIFFIES(deadline_aging_expire_show, dd->aging_expire);
 SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
 SHOW_INT(deadline_front_merges_show, dd->front_merges);
 SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
@@ -869,6 +899,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
 	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
 STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
 STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
+STORE_JIFFIES(deadline_aging_expire_store, &dd->aging_expire, 0, INT_MAX);
 STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
 STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
 STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
@@ -885,6 +916,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
 	DD_ATTR(fifo_batch),
+	DD_ATTR(aging_expire),
 	__ATTR_NULL
 };
 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  2021-06-08 23:06 ` [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
@ 2021-06-09  4:14   ` Damien Le Moal
  2021-06-09  7:03   ` Johannes Thumshirn
  2021-06-10  6:02   ` Hannes Reinecke
  2 siblings, 0 replies; 45+ messages in thread
From: Damien Le Moal @ 2021-06-09  4:14 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/09 8:07, Bart Van Assche wrote:
> These entries were consecutive at the time of their introduction but are no
> longer consecutive. Make these again consecutive. Additionally, modify the
> help text since it refers to blk-mq and since the legacy block layer has
> been removed.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/Kconfig | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index a2297edfdde8..6685578b2a20 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -133,6 +133,13 @@ config BLK_WBT
>  	dynamically on an algorithm loosely based on CoDel, factoring in
>  	the realtime performance of the disk.
>  
> +config BLK_WBT_MQ
> +	bool "Enable writeback throttling by default"
> +	default y
> +	depends on BLK_WBT
> +	help
> +	Enable writeback throttling by default for request-based block devices.
> +
>  config BLK_CGROUP_IOLATENCY
>  	bool "Enable support for latency based cgroup IO protection"
>  	depends on BLK_CGROUP=y
> @@ -155,13 +162,6 @@ config BLK_CGROUP_IOCOST
>  	distributes IO capacity between different groups based on
>  	their share of the overall weight distribution.
>  
> -config BLK_WBT_MQ
> -	bool "Multiqueue writeback throttling"
> -	default y
> -	depends on BLK_WBT
> -	help
> -	Enable writeback throttling by default on multiqueue devices.
> -
>  config BLK_DEBUG_FS
>  	bool "Block layer debugging information in debugfs"
>  	default y
> 

Looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-08 23:06 ` [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
@ 2021-06-09  4:19   ` Damien Le Moal
  2021-06-09  7:05   ` Johannes Thumshirn
  2021-06-10  6:02   ` Hannes Reinecke
  2 siblings, 0 replies; 45+ messages in thread
From: Damien Le Moal @ 2021-06-09  4:19 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Tejun Heo,
	Hannes Reinecke, Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/09 8:07, Bart Van Assche wrote:
> Before adding more calls in this function, simplify the error path.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-cgroup.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d169e2055158..3b0f6efaa2b6 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1183,15 +1183,14 @@ int blkcg_init_queue(struct request_queue *q)
>  	if (preloaded)
>  		radix_tree_preload_end();
>  
> -	ret = blk_throtl_init(q);
> +	ret = blk_iolatency_init(q);
>  	if (ret)
>  		goto err_destroy_all;
>  
> -	ret = blk_iolatency_init(q);
> -	if (ret) {
> -		blk_throtl_exit(q);
> +	ret = blk_throtl_init(q);
> +	if (ret)
>  		goto err_destroy_all;
> -	}
> +
>  	return 0;
>  
>  err_destroy_all:
> 

Looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file
  2021-06-08 23:06 ` [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
@ 2021-06-09  4:22   ` Damien Le Moal
  2021-06-10  6:03   ` Hannes Reinecke
  1 sibling, 0 replies; 45+ messages in thread
From: Damien Le Moal @ 2021-06-09  4:22 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/09 8:07, Bart Van Assche wrote:
> rq_qos_id_to_name() is not in the hot path so move it from a .h into a .c
> file.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-rq-qos.c | 13 +++++++++++++
>  block/blk-rq-qos.h | 13 +------------
>  2 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> index 656460636ad3..61dccf584c68 100644
> --- a/block/blk-rq-qos.c
> +++ b/block/blk-rq-qos.c
> @@ -2,6 +2,19 @@
>  
>  #include "blk-rq-qos.h"
>  
> +const char *rq_qos_id_to_name(enum rq_qos_id id)
> +{
> +	switch (id) {
> +	case RQ_QOS_WBT:
> +		return "wbt";
> +	case RQ_QOS_LATENCY:
> +		return "latency";
> +	case RQ_QOS_COST:
> +		return "cost";
> +	}
> +	return "unknown";
> +}
> +
>  /*
>   * Increment 'v', if 'v' is below 'below'. Returns true if we succeeded,
>   * false if 'v' + 1 would be bigger than 'below'.
> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
> index 2bc43e94f4c4..6b5f9ae69883 100644
> --- a/block/blk-rq-qos.h
> +++ b/block/blk-rq-qos.h
> @@ -78,18 +78,7 @@ static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
>  	return rq_qos_id(q, RQ_QOS_LATENCY);
>  }
>  
> -static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
> -{
> -	switch (id) {
> -	case RQ_QOS_WBT:
> -		return "wbt";
> -	case RQ_QOS_LATENCY:
> -		return "latency";
> -	case RQ_QOS_COST:
> -		return "cost";
> -	}
> -	return "unknown";
> -}
> +const char *rq_qos_id_to_name(enum rq_qos_id id);
>  
>  static inline void rq_wait_init(struct rq_wait *rq_wait)
>  {
> 

Looks OK, but since the only caller is in blk-mq-debugfs.c, the function could
be moved there and be static, no ?

Anyway,

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] block: Introduce the ioprio rq-qos policy
  2021-06-08 23:06 ` [PATCH 04/14] block: Introduce the ioprio rq-qos policy Bart Van Assche
@ 2021-06-09  4:40   ` Damien Le Moal
  2021-06-09 16:48     ` Bart Van Assche
  2021-06-10  6:20   ` Hannes Reinecke
  1 sibling, 1 reply; 45+ messages in thread
From: Damien Le Moal @ 2021-06-09  4:40 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/09 8:07, Bart Van Assche wrote:
> Introduce an rq-qos policy that assigns an I/O priority to requests based
> on blk-cgroup configuration settings. This policy has the following
> advantages over the ioprio_set() system call:
> - This policy is cgroup based so it has all the advantages of cgroups.
> - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
>   controller affects page cache writeback I/O for filesystems that support
>   assiociating a cgroup with writeback I/O. See also
>   Documentation/admin-guide/cgroup-v2.rst.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  26 +++
>  block/Kconfig                           |   9 +
>  block/Makefile                          |   1 +
>  block/blk-cgroup.c                      |   5 +
>  block/blk-ioprio.c                      | 236 ++++++++++++++++++++++++
>  block/blk-ioprio.h                      |  19 ++
>  block/blk-rq-qos.c                      |   2 +
>  block/blk-rq-qos.h                      |   1 +
>  8 files changed, 299 insertions(+)
>  create mode 100644 block/blk-ioprio.c
>  create mode 100644 block/blk-ioprio.h
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index b1e81aa8598a..cf8c8073b474 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -56,6 +56,7 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
>         5-3-3. IO Latency
>           5-3-3-1. How IO Latency Throttling Works
>           5-3-3-2. IO Latency Interface Files
> +       5-3-4. IO Priority
>       5-4. PID
>         5-4-1. PID Interface Files
>       5-5. Cpuset
> @@ -1866,6 +1867,31 @@ IO Latency Interface Files
>  		duration of time between evaluation events.  Windows only elapse
>  		with IO activity.  Idle periods extend the most recent window.
>  
> +IO Priority
> +~~~~~~~~~~~
> +
> +A single attribute controls the behavior of the I/O priority cgroup policy,
> +namely the blkio.prio.class attribute. The following values are accepted for
> +that attribute:
> +
> +  0
> +	Do not modify the I/O priority class.
> +
> +  1
> +	For requests that do not have an I/O priority class (IOPRIO_CLASS_NONE),
> +	change the I/O priority class into 1 (IOPRIO_CLASS_RT). Do not modify
> +	the I/O priority class of other requests.
> +
> +  2
> +	For requests that do not have an I/O priority class, change the I/O
> +	priority class into IOPRIO_CLASS_BE. For requests that have the
> +	priority class RT, change it into BE (2). Do not modify
> +	the I/O priority class of requests that have priority 3 (IDLE).
> +
> +  3
> +	Change the I/O priority class of all requests into 3 (IDLE),
> +	the lowest I/O priority class.
> +
>  PID
>  ---
>  
> diff --git a/block/Kconfig b/block/Kconfig
> index 6685578b2a20..319816058298 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -162,6 +162,15 @@ config BLK_CGROUP_IOCOST
>  	distributes IO capacity between different groups based on
>  	their share of the overall weight distribution.
>  
> +config BLK_CGROUP_IOPRIO
> +	bool "Cgroup I/O controller for assigning I/O priority"
> +	depends on BLK_CGROUP
> +	help
> +	Enable the .prio interface for assigning an I/O priority to requests.
> +	The I/O priority affects the order in which an I/O scheduler and block
> +	devices process requests. Only some I/O schedulers and some block
> +	devices support I/O priorities.
> +
>  config BLK_DEBUG_FS
>  	bool "Block layer debugging information in debugfs"
>  	default y
> diff --git a/block/Makefile b/block/Makefile
> index 8d841f5f986f..af3d044abaf1 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -17,6 +17,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
>  obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
>  obj-$(CONFIG_BLK_CGROUP_RWSTAT)	+= blk-cgroup-rwstat.o
>  obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
> +obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
>  obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
>  obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
>  obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 3b0f6efaa2b6..7b06a5fa3cac 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -31,6 +31,7 @@
>  #include <linux/tracehook.h>
>  #include <linux/psi.h>
>  #include "blk.h"
> +#include "blk-ioprio.h"
>  
>  /*
>   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> @@ -1187,6 +1188,10 @@ int blkcg_init_queue(struct request_queue *q)
>  	if (ret)
>  		goto err_destroy_all;
>  
> +	ret = blk_ioprio_init(q);
> +	if (ret)
> +		goto err_destroy_all;
> +
>  	ret = blk_throtl_init(q);
>  	if (ret)
>  		goto err_destroy_all;
> diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
> new file mode 100644
> index 000000000000..d7292cdef67d
> --- /dev/null
> +++ b/block/blk-ioprio.c
> @@ -0,0 +1,236 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Block rq-qos policy for assigning an I/O priority to requests.
> + *
> + * Using an rq-qos policy for assigning I/O priority has two advantages over
> + * using the ioprio_set() system call:
> + *
> + * - This policy is cgroup based so it has all the advantages of cgroups.
> + * - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
> + *   controller affects page cache writeback I/O for filesystems that support
> + *   assiociating a cgroup with writeback I/O. See also
> + *   Documentation/admin-guide/cgroup-v2.rst.
> + */
> +
> +#include <linux/blk-cgroup.h>
> +#include <linux/blk-mq.h>
> +#include <linux/blk_types.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include "blk-ioprio.h"
> +#include "blk-rq-qos.h"
> +
> +/*
> + * The accepted I/O priority values are 0..IOPRIO_CLASS_MAX(3). 0 (default)
> + * means do not modify the I/O priority. Values 1..3 are used to restrict the
> + * I/O priority for a cgroup to the specified priority. See also
> + * <linux/ioprio.h>.
> + */
> +#define IOPRIO_CLASS_MAX IOPRIO_CLASS_IDLE
> +
> +static struct blkcg_policy ioprio_policy;
> +
> +/**
> + * struct ioprio_blkg - Per (cgroup, request queue) data.
> + * @pd: blkg_policy_data structure.
> + */
> +struct ioprio_blkg {
> +	struct blkg_policy_data pd;
> +};
> +
> +/**
> + * struct ioprio_blkcg - Per cgroup data.
> + * @cpd: blkcg_policy_data structure.
> + * @prio_class: One of the IOPRIO_CLASS_* values. See also <linux/ioprio.h>.
> + */
> +struct ioprio_blkcg {
> +	struct blkcg_policy_data cpd;
> +	u8			 prio_class;
> +};
> +
> +static inline struct ioprio_blkg *pd_to_ioprio(struct blkg_policy_data *pd)
> +{
> +	return pd ? container_of(pd, struct ioprio_blkg, pd) : NULL;
> +}
> +
> +static struct ioprio_blkcg *
> +ioprio_blkcg_from_css(struct cgroup_subsys_state *css)
> +{
> +	struct blkcg *blkcg = css_to_blkcg(css);
> +	struct blkcg_policy_data *cpd = blkcg_to_cpd(blkcg, &ioprio_policy);
> +
> +	return container_of(cpd, struct ioprio_blkcg, cpd);
> +}
> +
> +static struct ioprio_blkcg *ioprio_blkcg_from_bio(struct bio *bio)
> +{
> +	struct blkg_policy_data *pd;
> +
> +	pd = blkg_to_pd(bio->bi_blkg, &ioprio_policy);
> +	if (!pd)
> +		return NULL;
> +
> +	return container_of(blkcg_to_cpd(pd->blkg->blkcg, &ioprio_policy),
> +			    struct ioprio_blkcg, cpd);
> +}
> +
> +static u64 ioprio_show_prio_class(struct cgroup_subsys_state *css,
> +				  struct cftype *cft)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
> +
> +	return blkcg->prio_class;
> +}
> +
> +static int ioprio_set_prio_class(struct cgroup_subsys_state *css,
> +				 struct cftype *cft, u64 val)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
> +
> +	if (val > IOPRIO_CLASS_MAX)
> +		return -EINVAL;

Where is IOPRIO_CLASS_MAX defined ? I do not see it.
Why not use ioprio_valid() ?

> +
> +	blkcg->prio_class = val;
> +	return 0;
> +}
> +
> +static struct blkg_policy_data *ioprio_alloc_pd(gfp_t gfp,
> +						struct request_queue *q,
> +						struct blkcg *blkcg)
> +{
> +	struct ioprio_blkg *ioprio_blkg;
> +
> +	ioprio_blkg = kzalloc(sizeof(*ioprio_blkg), gfp);
> +	if (!ioprio_blkg)
> +		return NULL;
> +
> +	return &ioprio_blkg->pd;
> +}
> +
> +static void ioprio_free_pd(struct blkg_policy_data *pd)
> +{
> +	struct ioprio_blkg *ioprio_blkg = pd_to_ioprio(pd);
> +
> +	kfree(ioprio_blkg);
> +}
> +
> +static struct blkcg_policy_data *ioprio_alloc_cpd(gfp_t gfp)
> +{
> +	struct ioprio_blkcg *blkcg;
> +
> +	blkcg = kzalloc(sizeof(*blkcg), gfp);
> +	if (!blkcg)
> +		return NULL;
> +	blkcg->prio_class = IOPRIO_CLASS_NONE;
> +	return &blkcg->cpd;
> +}
> +
> +static void ioprio_free_cpd(struct blkcg_policy_data *cpd)
> +{
> +	struct ioprio_blkcg *blkcg = container_of(cpd, typeof(*blkcg), cpd);
> +
> +	kfree(blkcg);
> +}
> +
> +#define IOPRIO_ATTRS						\
> +	{							\
> +		.name		= "prio.class",			\
> +		.read_u64	= ioprio_show_prio_class,	\
> +		.write_u64	= ioprio_set_prio_class,	\
> +	},							\
> +	{ } /* sentinel */
> +
> +/* cgroup v2 attributes */
> +static struct cftype ioprio_files[] = {
> +	IOPRIO_ATTRS
> +};
> +
> +/* cgroup v1 attributes */
> +static struct cftype ioprio_legacy_files[] = {
> +	IOPRIO_ATTRS
> +};
> +
> +static struct blkcg_policy ioprio_policy = {
> +	.dfl_cftypes	= ioprio_files,
> +	.legacy_cftypes = ioprio_legacy_files,
> +
> +	.cpd_alloc_fn	= ioprio_alloc_cpd,
> +	.cpd_free_fn	= ioprio_free_cpd,
> +
> +	.pd_alloc_fn	= ioprio_alloc_pd,
> +	.pd_free_fn	= ioprio_free_pd,
> +};
> +
> +struct blk_ioprio {
> +	struct rq_qos rqos;
> +};
> +
> +static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
> +			       struct bio *bio)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_bio(bio);
> +
> +	/*
> +	 * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
> +	 * correspond to a lower priority. Hence, the max_t() below selects
> +	 * the lower priority of bi_ioprio and the cgroup I/O priority class.
> +	 * If the cgroup priority has been set to IOPRIO_CLASS_NONE == 0, the
> +	 * bio I/O priority is not modified. If the bio I/O priority equals
> +	 * IOPRIO_CLASS_NONE, the cgroup I/O priority is assigned to the bio.
> +	 */
> +	bio->bi_ioprio = max_t(u16, bio->bi_ioprio,
> +			       IOPRIO_PRIO_VALUE(blkcg->prio_class, 0));
> +}
> +
> +static void blkcg_ioprio_exit(struct rq_qos *rqos)
> +{
> +	struct blk_ioprio *blkioprio_blkg =
> +		container_of(rqos, typeof(*blkioprio_blkg), rqos);
> +
> +	blkcg_deactivate_policy(rqos->q, &ioprio_policy);
> +	kfree(blkioprio_blkg);
> +}
> +
> +static struct rq_qos_ops blkcg_ioprio_ops = {
> +	.track = blkcg_ioprio_track,
> +	.exit = blkcg_ioprio_exit,
> +};
> +
> +int blk_ioprio_init(struct request_queue *q)
> +{
> +	struct blk_ioprio *blkioprio_blkg;
> +	struct rq_qos *rqos;
> +	int ret;
> +
> +	blkioprio_blkg = kzalloc(sizeof(*blkioprio_blkg), GFP_KERNEL);
> +	if (!blkioprio_blkg)
> +		return -ENOMEM;
> +
> +	ret = blkcg_activate_policy(q, &ioprio_policy);
> +	if (ret) {
> +		kfree(blkioprio_blkg);
> +		return ret;
> +	}
> +
> +	rqos = &blkioprio_blkg->rqos;
> +	rqos->id = RQ_QOS_IOPRIO;
> +	rqos->ops = &blkcg_ioprio_ops;
> +	rqos->q = q;
> +
> +	rq_qos_add(q, rqos);
> +
> +	return 0;
> +}
> +
> +static int __init ioprio_init(void)
> +{
> +	return blkcg_policy_register(&ioprio_policy);
> +}
> +
> +static void __exit ioprio_exit(void)
> +{
> +	blkcg_policy_unregister(&ioprio_policy);
> +}
> +
> +module_init(ioprio_init);
> +module_exit(ioprio_exit);
> diff --git a/block/blk-ioprio.h b/block/blk-ioprio.h
> new file mode 100644
> index 000000000000..a7785c2f1aea
> --- /dev/null
> +++ b/block/blk-ioprio.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _BLK_IOPRIO_H_
> +#define _BLK_IOPRIO_H_
> +
> +#include <linux/kconfig.h>
> +
> +struct request_queue;
> +
> +#ifdef CONFIG_BLK_CGROUP_IOPRIO
> +int blk_ioprio_init(struct request_queue *q);
> +#else
> +static inline int blk_ioprio_init(struct request_queue *q)
> +{
> +	return 0;
> +}
> +#endif
> +
> +#endif /* _BLK_IOPRIO_H_ */
> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> index 61dccf584c68..3a5400b11af7 100644
> --- a/block/blk-rq-qos.c
> +++ b/block/blk-rq-qos.c
> @@ -11,6 +11,8 @@ const char *rq_qos_id_to_name(enum rq_qos_id id)
>  		return "latency";
>  	case RQ_QOS_COST:
>  		return "cost";
> +	case RQ_QOS_IOPRIO:
> +		return "ioprio";
>  	}
>  	return "unknown";
>  }
> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
> index 6b5f9ae69883..1d4c6e951aa6 100644
> --- a/block/blk-rq-qos.h
> +++ b/block/blk-rq-qos.h
> @@ -16,6 +16,7 @@ enum rq_qos_id {
>  	RQ_QOS_WBT,
>  	RQ_QOS_LATENCY,
>  	RQ_QOS_COST,
> +	RQ_QOS_IOPRIO,
>  };
>  
>  struct rq_wait {
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-08 23:06 ` [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
@ 2021-06-09  4:46   ` Damien Le Moal
  2021-06-09 16:54     ` Bart Van Assche
  2021-06-09  7:11   ` Johannes Thumshirn
  2021-06-10  6:23   ` Hannes Reinecke
  2 siblings, 1 reply; 45+ messages in thread
From: Damien Le Moal @ 2021-06-09  4:46 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/09 8:07, Bart Van Assche wrote:
> Define separate macros for integers and jiffies to improve readability.
> Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol()
> since the former functions are the recommended functions.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 66 ++++++++++++++++++++-------------------------
>  1 file changed, 29 insertions(+), 37 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 69126beff77d..1d1bb7a41d2a 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -605,58 +605,50 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>  /*
>   * sysfs parts below
>   */
> -static ssize_t
> -deadline_var_show(int var, char *page)
> -{
> -	return sprintf(page, "%d\n", var);
> -}
> -
> -static void
> -deadline_var_store(int *var, const char *page)
> -{
> -	char *p = (char *) page;
> -
> -	*var = simple_strtol(p, &p, 10);
> -}
> -
> -#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
> +#define SHOW_INT(__FUNC, __VAR)						\
>  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>  {									\
>  	struct deadline_data *dd = e->elevator_data;			\
> -	int __data = __VAR;						\
> -	if (__CONV)							\
> -		__data = jiffies_to_msecs(__data);			\
> -	return deadline_var_show(__data, (page));			\
> -}
> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
> -SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> -SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> -SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> -#undef SHOW_FUNCTION
> +									\
> +	return sysfs_emit((page), "%d\n", __VAR);			\
> +}
> +#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))

jiffies_to_msecs() returns an unsigned int but sysfs_emit() in SHOW_INT() uses a
%d format. That will cause problems, no ?

> +SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
> +SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
> +SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
> +SHOW_INT(deadline_front_merges_show, dd->front_merges);
> +SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
> +#undef SHOW_INT
> +#undef SHOW_JIFFIES
>  
>  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
>  static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
>  {									\
>  	struct deadline_data *dd = e->elevator_data;			\
> -	int __data;							\
> -	deadline_var_store(&__data, (page));				\
> +	int __data, __ret;						\
> +									\
> +	__ret = kstrtoint((page), 0, &__data);				\
> +	if (__ret < 0)							\
> +		return __ret;						\
>  	if (__data < (MIN))						\
>  		__data = (MIN);						\
>  	else if (__data > (MAX))					\
>  		__data = (MAX);						\
> -	if (__CONV)							\
> -		*(__PTR) = msecs_to_jiffies(__data);			\
> -	else								\
> -		*(__PTR) = __data;					\
> +	*(__PTR) = __CONV(__data);					\
>  	return count;							\
>  }
> -STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
> -STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
> -STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> -STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> -STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#define STORE_INT(__FUNC, __PTR, MIN, MAX)				\
> +	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, )
> +#define STORE_JIFFIES(__FUNC, __PTR, MIN, MAX)				\
> +	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
> +STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
> +STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
> +STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
> +STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
> +STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
>  #undef STORE_FUNCTION
> +#undef STORE_INT
> +#undef STORE_JIFFIES
>  
>  #define DD_ATTR(name) \
>  	__ATTR(name, 0644, deadline_##name##_show, deadline_##name##_store)
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/14] block/mq-deadline: Add I/O priority support
  2021-06-08 23:07 ` [PATCH 12/14] block/mq-deadline: Add I/O priority support Bart Van Assche
@ 2021-06-09  5:03   ` Damien Le Moal
  2021-06-09 17:25     ` Bart Van Assche
  2021-06-10  6:35   ` Hannes Reinecke
  1 sibling, 1 reply; 45+ messages in thread
From: Damien Le Moal @ 2021-06-09  5:03 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/09 8:07, Bart Van Assche wrote:
> Maintain one dispatch list and FIFO one list per I/O priority class: RT, BE

s/FIFO one/one FIFO

> and IDLE. Maintain statistics for each priority level. Split the debugfs
> attributes per priority level as follows:
> 
> $ ls /sys/kernel/debug/block/.../sched/
> async_depth  dispatch2        read_next_rq      write2_fifo_list
> batching     read0_fifo_list  starved           write_next_rq
> dispatch0    read1_fifo_list  write0_fifo_list
> dispatch1    read2_fifo_list  write1_fifo_list
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 355 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 279 insertions(+), 76 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index a7d0584437d1..776ff49713c3 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -42,16 +42,40 @@ enum dd_data_dir {
>  
>  enum { DD_DIR_COUNT = 2 };
>  
> +enum dd_prio {
> +	DD_RT_PRIO	= 0,
> +	DD_BE_PRIO	= 1,
> +	DD_IDLE_PRIO	= 2,
> +	DD_PRIO_MAX	= 2,
> +};
> +
> +enum { DD_PRIO_COUNT = 3 };
> +
> +/* I/O statistics per I/O priority. */
> +struct io_stats_per_prio {
> +	local_t inserted;
> +	local_t merged;
> +	local_t dispatched;
> +	local_t completed;
> +};
> +
> +/* I/O statistics for all I/O priorities (enum dd_prio). */
> +struct io_stats {
> +	struct io_stats_per_prio stats[DD_PRIO_COUNT];
> +};
> +
>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
>  
>  	/*
> -	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 * Requests are present on both sort_list[] and fifo_list[][]. The
> +	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
> +	 * The second index is the data direction (rq_data_dir(rq)).
>  	 */
>  	struct rb_root sort_list[DD_DIR_COUNT];
> -	struct list_head fifo_list[DD_DIR_COUNT];
> +	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];

WOuld it make sense to pack these 2 into a sub structure ? e.g.:

struct deadline_lists {
	struct rb_root sort_list;
	struct list_head fifo_list[DD_PRIO_COUNT];
};

struct deadline_data {
	...
	/*
	 * Requests are present on both sort_list[] and fifo_list[][]. The
	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
	 * The second index is the data direction (rq_data_dir(rq)).
 	 */
	struct deadline_lists	lists[DD_DIR_COUNT];

>  
>  	/*
>  	 * next in sort order. read, write or both are NULL
> @@ -60,6 +84,8 @@ struct deadline_data {
>  	unsigned int batching;		/* number of sequential requests made */
>  	unsigned int starved;		/* times reads have starved writes */
>  
> +	struct io_stats __percpu *stats;
> +
>  	/*
>  	 * settings that change how the i/o scheduler behaves
>  	 */
> @@ -71,7 +97,42 @@ struct deadline_data {
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> -	struct list_head dispatch;
> +	struct list_head dispatch[DD_PRIO_COUNT];
> +};
> +
> +/* Count one event of type 'event_type' and with I/O priority 'prio' */
> +#define dd_count(dd, event_type, prio) do {				\
> +	struct io_stats *io_stats = get_cpu_ptr((dd)->stats);		\
> +									\
> +	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
> +	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
> +	local_inc(&io_stats->stats[(prio)].event_type);			\
> +	put_cpu_ptr(io_stats);						\
> +} while (0)
> +
> +/*
> + * Returns the total number of dd_count(dd, event_type, prio) calls across all
> + * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> + * outdated.
> + */
> +#define dd_sum(dd, event_type, prio) ({					\
> +	unsigned int cpu;						\
> +	u32 sum = 0;							\
> +									\
> +	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
> +	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
> +	for_each_present_cpu(cpu)					\
> +		sum += local_read(&per_cpu_ptr((dd)->stats, cpu)->	\
> +				  stats[(prio)].event_type);		\
> +	sum;								\
> +})
> +
> +/* Maps an I/O priority class to a deadline scheduler priority. */
> +static const enum dd_prio ioprio_class_to_prio[] = {
> +	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
> +	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
>  };
>  
>  static inline struct rb_root *
> @@ -147,12 +208,31 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
>  	}
>  }
>  
> +/*
> + * Returns the I/O priority class (IOPRIO_CLASS_*) that has been assigned to a
> + * request.
> + */
> +static u8 dd_rq_ioclass(struct request *rq)
> +{
> +	return IOPRIO_PRIO_CLASS(req_get_ioprio(rq));
> +}
> +
>  /*
>   * Callback function that is invoked after @next has been merged into @req.
>   */
>  static void dd_merged_requests(struct request_queue *q, struct request *req,
>  			       struct request *next)
>  {
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const u8 ioprio_class = dd_rq_ioclass(next);
> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
> +
> +	if (next->elv.priv[0]) {
> +		dd_count(dd, merged, prio);
> +	} else {
> +		WARN_ON_ONCE(true);
> +	}

No need for the curly brackets I think.

> +
>  	/*
>  	 * if next expires before rq, assign its expire time to rq
>  	 * and move into next position (next will be deleted) in fifo
> @@ -189,14 +269,21 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
>  	deadline_remove_request(rq->q, rq);
>  }
>  
> +/* Number of requests queued for a given priority level. */
> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);

This also includes requests that are being executed on the device. Is that OK ?

> +}
> +
>  /*
>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>   */
>  static inline int deadline_check_fifo(struct deadline_data *dd,
> +				      enum dd_prio prio,
>  				      enum dd_data_dir data_dir)
>  {
> -	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>  
>  	/*
>  	 * rq is expired!
> @@ -212,15 +299,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd,
>   * dispatch using arrival ordered lists.
>   */
>  static struct request *
> -deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
> +deadline_fifo_request(struct deadline_data *dd, enum dd_prio prio,
> +		      enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
>  
> -	if (list_empty(&dd->fifo_list[data_dir]))
> +	if (list_empty(&dd->fifo_list[prio][data_dir]))
>  		return NULL;
>  
> -	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>  	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>  		return rq;
>  
> @@ -229,7 +317,7 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>  	 * an unlocked target zone.
>  	 */
>  	spin_lock_irqsave(&dd->zone_lock, flags);
> -	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
> +	list_for_each_entry(rq, &dd->fifo_list[prio][DD_WRITE], queuelist) {
>  		if (blk_req_can_dispatch_to_zone(rq))
>  			goto out;
>  	}
> @@ -245,7 +333,8 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   * dispatch using sector position sorted lists.
>   */
>  static struct request *
> -deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
> +deadline_next_request(struct deadline_data *dd, enum dd_prio prio,
> +		      enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
> @@ -276,15 +365,18 @@ deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   * deadline_dispatch_requests selects the best request according to
>   * read/write expire, fifo_batch, etc
>   */
> -static struct request *__dd_dispatch_request(struct deadline_data *dd)
> +static struct request *__dd_dispatch_request(struct deadline_data *dd,
> +					     enum dd_prio prio)
>  {
>  	struct request *rq, *next_rq;
>  	enum dd_data_dir data_dir;
> +	u8 ioprio_class;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> -	if (!list_empty(&dd->dispatch)) {
> -		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> +	if (!list_empty(&dd->dispatch[prio])) {
> +		rq = list_first_entry(&dd->dispatch[prio], struct request,
> +				      queuelist);
>  		list_del_init(&rq->queuelist);
>  		goto done;
>  	}
> @@ -292,9 +384,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * batches are currently reads XOR writes
>  	 */
> -	rq = deadline_next_request(dd, DD_WRITE);
> +	rq = deadline_next_request(dd, prio, DD_WRITE);
>  	if (!rq)
> -		rq = deadline_next_request(dd, DD_READ);
> +		rq = deadline_next_request(dd, prio, DD_READ);
>  
>  	if (rq && dd->batching < dd->fifo_batch)
>  		/* we have a next request are still entitled to batch */
> @@ -305,10 +397,10 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * data direction (read / write)
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[DD_READ])) {
> +	if (!list_empty(&dd->fifo_list[prio][DD_READ])) {
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
>  
> -		if (deadline_fifo_request(dd, DD_WRITE) &&
> +		if (deadline_fifo_request(dd, prio, DD_WRITE) &&
>  		    (dd->starved++ >= dd->writes_starved))
>  			goto dispatch_writes;
>  
> @@ -321,7 +413,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * there are either no reads or writes have been starved
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
> +	if (!list_empty(&dd->fifo_list[prio][DD_WRITE])) {
>  dispatch_writes:
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
>  
> @@ -338,14 +430,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * we are not running a batch, find best request for selected data_dir
>  	 */
> -	next_rq = deadline_next_request(dd, data_dir);
> -	if (deadline_check_fifo(dd, data_dir) || !next_rq) {
> +	next_rq = deadline_next_request(dd, prio, data_dir);
> +	if (deadline_check_fifo(dd, prio, data_dir) || !next_rq) {
>  		/*
>  		 * A deadline has expired, the last request was in the other
>  		 * direction, or we have run out of higher-sectored requests.
>  		 * Start again from the request with the earliest expiry time.
>  		 */
> -		rq = deadline_fifo_request(dd, data_dir);
> +		rq = deadline_fifo_request(dd, prio, data_dir);
>  	} else {
>  		/*
>  		 * The last req was the same dir and we have a next request in
> @@ -370,6 +462,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	dd->batching++;
>  	deadline_move_request(dd, rq);
>  done:
> +	ioprio_class = dd_rq_ioclass(rq);
> +	prio = ioprio_class_to_prio[ioprio_class];
> +	if (rq->elv.priv[0]) {
> +		dd_count(dd, dispatched, prio);
> +	} else {
> +		WARN_ON_ONCE(true);
> +	}

I do not think you need the curly brackets here.

>  	/*
>  	 * If the request needs its target zone locked, do it.
>  	 */
> @@ -390,9 +489,14 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>  	struct request *rq;
> +	enum dd_prio prio;
>  
>  	spin_lock(&dd->lock);
> -	rq = __dd_dispatch_request(dd);
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, prio);
> +		if (rq)
> +			break;
> +	}
>  	spin_unlock(&dd->lock);
>  
>  	return rq;
> @@ -439,9 +543,14 @@ static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>  static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
> +	enum dd_prio prio;
> +
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_READ]));
> +		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_WRITE]));
> +	}
>  
> -	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
> -	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
> +	free_percpu(dd->stats);
>  
>  	kfree(dd);
>  }
> @@ -453,20 +562,29 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  {
>  	struct deadline_data *dd;
>  	struct elevator_queue *eq;
> +	enum dd_prio prio;
> +	int ret = -ENOMEM;
>  
>  	eq = elevator_alloc(q, e);
>  	if (!eq)
> -		return -ENOMEM;
> +		return ret;
>  
>  	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> -	if (!dd) {
> -		kobject_put(&eq->kobj);
> -		return -ENOMEM;
> -	}
> +	if (!dd)
> +		goto put_eq;
> +
>  	eq->elevator_data = dd;
>  
> -	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
> -	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
> +	dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
> +				     GFP_KERNEL | __GFP_ZERO);
> +	if (!dd->stats)
> +		goto free_dd;
> +
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_READ]);
> +		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_WRITE]);
> +		INIT_LIST_HEAD(&dd->dispatch[prio]);
> +	}
>  	dd->sort_list[DD_READ] = RB_ROOT;
>  	dd->sort_list[DD_WRITE] = RB_ROOT;
>  	dd->fifo_expire[DD_READ] = read_expire;
> @@ -476,10 +594,16 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	dd->fifo_batch = fifo_batch;
>  	spin_lock_init(&dd->lock);
>  	spin_lock_init(&dd->zone_lock);
> -	INIT_LIST_HEAD(&dd->dispatch);
>  
>  	q->elevator = eq;
>  	return 0;
> +
> +free_dd:
> +	kfree(dd);
> +
> +put_eq:
> +	kobject_put(&eq->kobj);
> +	return ret;
>  }
>  
>  /*
> @@ -539,6 +663,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	struct request_queue *q = hctx->queue;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  	const enum dd_data_dir data_dir = rq_data_dir(rq);
> +	u16 ioprio = req_get_ioprio(rq);
> +	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
> +	enum dd_prio prio;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> @@ -548,13 +675,18 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	 */
>  	blk_req_zone_write_unlock(rq);
>  
> +	prio = ioprio_class_to_prio[ioprio_class];
> +	dd_count(dd, inserted, prio);
> +	WARN_ON_ONCE(rq->elv.priv[0]);
> +	rq->elv.priv[0] = (void *)1ULL;
> +
>  	if (blk_mq_sched_try_insert_merge(q, rq))
>  		return;
>  
>  	trace_block_rq_insert(rq);
>  
>  	if (at_head) {
> -		list_add(&rq->queuelist, &dd->dispatch);
> +		list_add(&rq->queuelist, &dd->dispatch[prio]);
>  	} else {
>  		deadline_add_rq_rb(dd, rq);
>  
> @@ -568,7 +700,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  		 * set expire time and add to fifo list
>  		 */
>  		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> -		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> +		list_add_tail(&rq->queuelist, &dd->fifo_list[prio][data_dir]);
>  	}
>  }
>  
> @@ -592,12 +724,10 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>  	spin_unlock(&dd->lock);
>  }
>  
> -/*
> - * Nothing to do here. This is defined only to ensure that .finish_request
> - * method is called upon request completion.
> - */
> +/* Callback from inside blk_mq_rq_ctx_init(). */
>  static void dd_prepare_request(struct request *rq)
>  {
> +	rq->elv.priv[0] = NULL;
>  }
>  
>  /*
> @@ -619,26 +749,41 @@ static void dd_prepare_request(struct request *rq)
>  static void dd_finish_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const u8 ioprio_class = dd_rq_ioclass(rq);
> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
> +
> +	if (rq->elv.priv[0])
> +		dd_count(dd, completed, prio);
>  
>  	if (blk_queue_is_zoned(q)) {
> -		struct deadline_data *dd = q->elevator->elevator_data;
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);
>  		blk_req_zone_write_unlock(rq);
> -		if (!list_empty(&dd->fifo_list[DD_WRITE]))
> +		if (!list_empty(&dd->fifo_list[prio][DD_WRITE]))
>  			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>  		spin_unlock_irqrestore(&dd->zone_lock, flags);
>  	}
>  }
>  
> +static bool dd_has_work_for_prio(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return !list_empty_careful(&dd->dispatch[prio]) ||
> +		!list_empty_careful(&dd->fifo_list[prio][DD_READ]) ||
> +		!list_empty_careful(&dd->fifo_list[prio][DD_WRITE]);
> +}
> +
>  static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	enum dd_prio prio;
> +
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++)
> +		if (dd_has_work_for_prio(dd, prio))
> +			return true;
>  
> -	return !list_empty_careful(&dd->dispatch) ||
> -		!list_empty_careful(&dd->fifo_list[0]) ||
> -		!list_empty_careful(&dd->fifo_list[1]);
> +	return false;
>  }
>  
>  /*
> @@ -702,7 +847,7 @@ static struct elv_fs_entry deadline_attrs[] = {
>  };
>  
>  #ifdef CONFIG_BLK_DEBUG_FS
> -#define DEADLINE_DEBUGFS_DDIR_ATTRS(ddir, name)				\
> +#define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
>  static void *deadline_##name##_fifo_start(struct seq_file *m,		\
>  					  loff_t *pos)			\
>  	__acquires(&dd->lock)						\
> @@ -711,7 +856,7 @@ static void *deadline_##name##_fifo_start(struct seq_file *m,		\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
>  									\
>  	spin_lock(&dd->lock);						\
> -	return seq_list_start(&dd->fifo_list[ddir], *pos);		\
> +	return seq_list_start(&dd->fifo_list[prio][data_dir], *pos);	\
>  }									\
>  									\
>  static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
> @@ -720,7 +865,7 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
>  	struct request_queue *q = m->private;				\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
>  									\
> -	return seq_list_next(v, &dd->fifo_list[ddir], pos);		\
> +	return seq_list_next(v, &dd->fifo_list[prio][data_dir], pos);	\
>  }									\
>  									\
>  static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
> @@ -737,22 +882,31 @@ static const struct seq_operations deadline_##name##_fifo_seq_ops = {	\
>  	.next	= deadline_##name##_fifo_next,				\
>  	.stop	= deadline_##name##_fifo_stop,				\
>  	.show	= blk_mq_debugfs_rq_show,				\
> -};									\
> -									\
> +};
> +
> +#define DEADLINE_DEBUGFS_NEXT_RQ(data_dir, name)			\
>  static int deadline_##name##_next_rq_show(void *data,			\
>  					  struct seq_file *m)		\
>  {									\
>  	struct request_queue *q = data;					\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
> -	struct request *rq = dd->next_rq[ddir];				\
> +	struct request *rq = dd->next_rq[data_dir];			\
>  									\
>  	if (rq)								\
>  		__blk_mq_debugfs_rq_show(m, rq);			\
>  	return 0;							\
>  }
> -DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
> -DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
> +
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_READ, read0)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_WRITE, write0)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_READ, read1)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_WRITE, write1)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_READ, read2)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_WRITE, write2)
> +DEADLINE_DEBUGFS_NEXT_RQ(DD_READ, read)
> +DEADLINE_DEBUGFS_NEXT_RQ(DD_WRITE, write)
>  #undef DEADLINE_DEBUGFS_DDIR_ATTRS
> +#undef DEADLINE_DEBUGFS_NEXT_RQ
>  
>  static int deadline_batching_show(void *data, struct seq_file *m)
>  {
> @@ -781,50 +935,99 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
>  	return 0;
>  }
>  
> -static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
> -	__acquires(&dd->lock)
> +static int dd_queued_show(void *data, struct seq_file *m)
>  {
> -	struct request_queue *q = m->private;
> +	struct request_queue *q = data;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  
> -	spin_lock(&dd->lock);
> -	return seq_list_start(&dd->dispatch, *pos);
> +	seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
> +		   dd_queued(dd, DD_BE_PRIO),
> +		   dd_queued(dd, DD_IDLE_PRIO));
> +	return 0;
>  }
>  
> -static void *deadline_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
> +/* Number of requests owned by the block driver for a given priority. */
> +static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
>  {
> -	struct request_queue *q = m->private;
> -	struct deadline_data *dd = q->elevator->elevator_data;
> -
> -	return seq_list_next(v, &dd->dispatch, pos);
> +	return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
> +		- dd_sum(dd, completed, prio);
>  }
>  
> -static void deadline_dispatch_stop(struct seq_file *m, void *v)
> -	__releases(&dd->lock)
> +static int dd_owned_by_driver_show(void *data, struct seq_file *m)
>  {
> -	struct request_queue *q = m->private;
> +	struct request_queue *q = data;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  
> -	spin_unlock(&dd->lock);
> +	seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> +		   dd_owned_by_driver(dd, DD_BE_PRIO),
> +		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
> +	return 0;
>  }
>  
> -static const struct seq_operations deadline_dispatch_seq_ops = {
> -	.start	= deadline_dispatch_start,
> -	.next	= deadline_dispatch_next,
> -	.stop	= deadline_dispatch_stop,
> -	.show	= blk_mq_debugfs_rq_show,
> -};
> +#define DEADLINE_DISPATCH_ATTR(prio)					\
> +static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
> +					     loff_t *pos)		\
> +	__acquires(&dd->lock)						\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	spin_lock(&dd->lock);						\
> +	return seq_list_start(&dd->dispatch[prio], *pos);		\
> +}									\
> +									\
> +static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
> +					    void *v, loff_t *pos)	\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	return seq_list_next(v, &dd->dispatch[prio], pos);		\
> +}									\
> +									\
> +static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
> +	__releases(&dd->lock)						\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	spin_unlock(&dd->lock);						\
> +}									\
> +									\
> +static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
> +	.start	= deadline_dispatch##prio##_start,			\
> +	.next	= deadline_dispatch##prio##_next,			\
> +	.stop	= deadline_dispatch##prio##_stop,			\
> +	.show	= blk_mq_debugfs_rq_show,				\
> +}
> +
> +DEADLINE_DISPATCH_ATTR(0);
> +DEADLINE_DISPATCH_ATTR(1);
> +DEADLINE_DISPATCH_ATTR(2);
> +#undef DEADLINE_DISPATCH_ATTR
>  
> -#define DEADLINE_QUEUE_DDIR_ATTRS(name)						\
> -	{#name "_fifo_list", 0400, .seq_ops = &deadline_##name##_fifo_seq_ops},	\
> +#define DEADLINE_QUEUE_DDIR_ATTRS(name)					\
> +	{#name "_fifo_list", 0400,					\
> +			.seq_ops = &deadline_##name##_fifo_seq_ops}
> +#define DEADLINE_NEXT_RQ_ATTR(name)					\
>  	{#name "_next_rq", 0400, deadline_##name##_next_rq_show}
>  static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
> -	DEADLINE_QUEUE_DDIR_ATTRS(read),
> -	DEADLINE_QUEUE_DDIR_ATTRS(write),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read0),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write0),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read1),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write1),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read2),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write2),
> +	DEADLINE_NEXT_RQ_ATTR(read),
> +	DEADLINE_NEXT_RQ_ATTR(write),
>  	{"batching", 0400, deadline_batching_show},
>  	{"starved", 0400, deadline_starved_show},
>  	{"async_depth", 0400, dd_async_depth_show},
> -	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
> +	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
> +	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
> +	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
> +	{"owned_by_driver", 0400, dd_owned_by_driver_show},
> +	{"queued", 0400, dd_queued_show},
>  	{},
>  };
>  #undef DEADLINE_QUEUE_DDIR_ATTRS
> @@ -874,6 +1077,6 @@ static void __exit deadline_exit(void)
>  module_init(deadline_init);
>  module_exit(deadline_exit);
>  
> -MODULE_AUTHOR("Jens Axboe");
> +MODULE_AUTHOR("Jens Axboe, Damien Le Moal and Bart Van Assche");
>  MODULE_LICENSE("GPL");
>  MODULE_DESCRIPTION("MQ deadline IO scheduler");
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests
  2021-06-08 23:07 ` [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
@ 2021-06-09  5:10   ` Damien Le Moal
  2021-06-10  6:43   ` Hannes Reinecke
  1 sibling, 0 replies; 45+ messages in thread
From: Damien Le Moal @ 2021-06-09  5:10 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/09 8:07, Bart Van Assche wrote:
> While one or more requests with a certain I/O priority are pending, do not
> dispatch lower priority requests. Dispatch lower priority requests anyway
> after the "aging" time has expired.
> 
> This patch has been tested as follows:
> 
> modprobe scsi_debug ndelay=1000000 max_queue=16 &&
> sd='' &&
> while [ -z "$sd" ]; do
>   sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done &&
> echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
> cd /sys/fs/cgroup/blkio/ &&
> echo $$ >cgroup.procs &&
> echo 2 >blkio.prio.class &&
> mkdir -p hipri &&
> cd hipri &&
> echo 1 >blkio.prio.class &&
> { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
> echo $$ >cgroup.procs &&
> max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt
> 
> Result:
> * 11000 IOPS for the high-priority job
> *   400 IOPS for the low-priority job
> 
> If the aging expiry time is changed from 100s into 0, the IOPS results change
> into 6712 and 6796 IOPS.
> 
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline-main.c | 42 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
> index 1b2b6c1de5b8..cecdb475a610 100644
> --- a/block/mq-deadline-main.c
> +++ b/block/mq-deadline-main.c
> @@ -32,6 +32,11 @@
>   */
>  static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
>  static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +/*
> + * Time after which to dispatch lower priority requests even if higher
> + * priority requests are pending.
> + */
> +static const int aging_expire = 10 * HZ;
>  static const int writes_starved = 2;    /* max times reads can starve a write */
>  static const int fifo_batch = 16;       /* # of sequential requests treated as one
>  				     by the above parameters. For throughput. */
> @@ -90,6 +95,7 @@ struct deadline_data {
>  	int writes_starved;
>  	int front_merges;
>  	u32 async_depth;
> +	int aging_expire;
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> @@ -363,10 +369,10 @@ deadline_next_request(struct deadline_data *dd, enum dd_prio prio,
>  
>  /*
>   * deadline_dispatch_requests selects the best request according to
> - * read/write expire, fifo_batch, etc
> + * read/write expire, fifo_batch, etc and with a start time <= @latest.
>   */
>  static struct request *__dd_dispatch_request(struct deadline_data *dd,
> -					     enum dd_prio prio)
> +					enum dd_prio prio, u64 latest_start_ns)
>  {
>  	struct request *rq, *next_rq;
>  	enum dd_data_dir data_dir;
> @@ -378,6 +384,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	if (!list_empty(&dd->dispatch[prio])) {
>  		rq = list_first_entry(&dd->dispatch[prio], struct request,
>  				      queuelist);
> +		if (rq->start_time_ns > latest_start_ns)
> +			return NULL;
>  		list_del_init(&rq->queuelist);
>  		goto done;
>  	}
> @@ -457,6 +465,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	dd->batching = 0;
>  
>  dispatch_request:
> +	if (rq->start_time_ns > latest_start_ns)
> +		return NULL;
>  	/*
>  	 * rq is the selected appropriate request.
>  	 */
> @@ -493,15 +503,32 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> -	struct request *rq;
> +	const u64 now_ns = ktime_get_ns();
> +	struct request *rq = NULL;
>  	enum dd_prio prio;
>  
>  	spin_lock(&dd->lock);
> -	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> -		rq = __dd_dispatch_request(dd, prio);
> +	/*
> +	 * Start with dispatching requests whose deadline expired more than
> +	 * aging_expire jiffies ago.
> +	 */
> +	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, prio, now_ns -
> +					   jiffies_to_nsecs(dd->aging_expire));
>  		if (rq)
> +			goto unlock;
> +	}
> +	/*
> +	 * Next, dispatch requests in priority order. Ignore lower priority
> +	 * requests if any higher priority requests are pending.
> +	 */
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, prio, now_ns);
> +		if (rq || dd_queued(dd, prio))
>  			break;
>  	}
> +
> +unlock:
>  	spin_unlock(&dd->lock);
>  
>  	return rq;
> @@ -607,6 +634,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	dd->writes_starved = writes_starved;
>  	dd->front_merges = 1;
>  	dd->fifo_batch = fifo_batch;
> +	dd->aging_expire = aging_expire;
>  	spin_lock_init(&dd->lock);
>  	spin_lock_init(&dd->zone_lock);
>  
> @@ -725,6 +753,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	trace_block_rq_insert(rq);
>  
>  	if (at_head) {
> +		rq->fifo_time = jiffies;
>  		list_add(&rq->queuelist, &dd->dispatch[prio]);
>  	} else {
>  		deadline_add_rq_rb(dd, rq);
> @@ -841,6 +870,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>  #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
>  SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
>  SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
> +SHOW_JIFFIES(deadline_aging_expire_show, dd->aging_expire);
>  SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
>  SHOW_INT(deadline_front_merges_show, dd->front_merges);
>  SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
> @@ -869,6 +899,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
>  	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
>  STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
>  STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
> +STORE_JIFFIES(deadline_aging_expire_store, &dd->aging_expire, 0, INT_MAX);
>  STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
>  STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
>  STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
> @@ -885,6 +916,7 @@ static struct elv_fs_entry deadline_attrs[] = {
>  	DD_ATTR(writes_starved),
>  	DD_ATTR(front_merges),
>  	DD_ATTR(fifo_batch),
> +	DD_ATTR(aging_expire),
>  	__ATTR_NULL
>  };
>  
> 

Looks good.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  2021-06-08 23:06 ` [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
  2021-06-09  4:14   ` Damien Le Moal
@ 2021-06-09  7:03   ` Johannes Thumshirn
  2021-06-10  6:02   ` Hannes Reinecke
  2 siblings, 0 replies; 45+ messages in thread
From: Johannes Thumshirn @ 2021-06-09  7:03 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Hannes Reinecke, Ming Lei, Himanshu Madhani

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-08 23:06 ` [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
  2021-06-09  4:19   ` Damien Le Moal
@ 2021-06-09  7:05   ` Johannes Thumshirn
  2021-06-10  6:02   ` Hannes Reinecke
  2 siblings, 0 replies; 45+ messages in thread
From: Johannes Thumshirn @ 2021-06-09  7:05 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Tejun Heo,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Himanshu Madhani

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-08 23:06 ` [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
  2021-06-09  4:46   ` Damien Le Moal
@ 2021-06-09  7:11   ` Johannes Thumshirn
  2021-06-09 16:59     ` Bart Van Assche
  2021-06-10  6:23   ` Hannes Reinecke
  2 siblings, 1 reply; 45+ messages in thread
From: Johannes Thumshirn @ 2021-06-09  7:11 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Hannes Reinecke, Ming Lei, Himanshu Madhani

On 09/06/2021 01:07, Bart Van Assche wrote:
> +#define SHOW_INT(__FUNC, __VAR)						\
>  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>  {									\
>  	struct deadline_data *dd = e->elevator_data;			\
> -	int __data = __VAR;						\
> -	if (__CONV)							\
> -		__data = jiffies_to_msecs(__data);			\
> -	return deadline_var_show(__data, (page));			\
> -}
> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
> -SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> -SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> -SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> -#undef SHOW_FUNCTION
> +									\
> +	return sysfs_emit((page), "%d\n", __VAR);			\
> +}

Nit: the () around page aren't needed

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] block: Introduce the ioprio rq-qos policy
  2021-06-09  4:40   ` Damien Le Moal
@ 2021-06-09 16:48     ` Bart Van Assche
  2021-06-10  0:29       ` Damien Le Moal
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-09 16:48 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/8/21 9:40 PM, Damien Le Moal wrote:
> On 2021/06/09 8:07, Bart Van Assche wrote:

[ ... ]

>> +/*
>> + * The accepted I/O priority values are 0..IOPRIO_CLASS_MAX(3). 0 (default)
>> + * means do not modify the I/O priority. Values 1..3 are used to restrict the
>> + * I/O priority for a cgroup to the specified priority. See also
>> + * <linux/ioprio.h>.
>> + */
>> +#define IOPRIO_CLASS_MAX IOPRIO_CLASS_IDLE

[ ... ]

>> +static int ioprio_set_prio_class(struct cgroup_subsys_state *css,
>> +				 struct cftype *cft, u64 val)
>> +{
>> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
>> +
>> +	if (val > IOPRIO_CLASS_MAX)
>> +		return -EINVAL;
> 
> Where is IOPRIO_CLASS_MAX defined ? I do not see it.

Near the start of the new block/blk-ioprio.c file.

> Why not use ioprio_valid() ?

The definition of that macro is as follows:

#define ioprio_valid(mask)	\
	(IOPRIO_PRIO_CLASS((mask)) != IOPRIO_CLASS_NONE)

So that macro accepts I/O priority classes 1..7 while the above code
accepts I/O priority classes 0..3. I think that ioprio_set_prio_class()
should only accept I/O priority class values 0..3.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-09  4:46   ` Damien Le Moal
@ 2021-06-09 16:54     ` Bart Van Assche
  2021-06-10  0:31       ` Damien Le Moal
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-09 16:54 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/8/21 9:46 PM, Damien Le Moal wrote:
> On 2021/06/09 8:07, Bart Van Assche wrote:
>> +#define SHOW_INT(__FUNC, __VAR)						\
>>  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>>  {									\
>>  	struct deadline_data *dd = e->elevator_data;			\
>> -	int __data = __VAR;						\
>> -	if (__CONV)							\
>> -		__data = jiffies_to_msecs(__data);			\
>> -	return deadline_var_show(__data, (page));			\
>> -}
>> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
>> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
>> -SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
>> -SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
>> -SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
>> -#undef SHOW_FUNCTION
>> +									\
>> +	return sysfs_emit((page), "%d\n", __VAR);			\
>> +}
>> +#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
> 
> jiffies_to_msecs() returns an unsigned int but sysfs_emit() in SHOW_INT() uses a
> %d format. That will cause problems, no ?

The corresponding store functions restrict values that represent a time
in jiffies to the interval [0, INT_MAX] milliseconds so I think that
using %d to format a time in milliseconds is fine.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-09  7:11   ` Johannes Thumshirn
@ 2021-06-09 16:59     ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-09 16:59 UTC (permalink / raw)
  To: Johannes Thumshirn, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Hannes Reinecke, Ming Lei, Himanshu Madhani

On 6/9/21 12:11 AM, Johannes Thumshirn wrote:
> On 09/06/2021 01:07, Bart Van Assche wrote:
>> +#define SHOW_INT(__FUNC, __VAR)						\
>>  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>>  {									\
>>  	struct deadline_data *dd = e->elevator_data;			\
>> -	int __data = __VAR;						\
>> -	if (__CONV)							\
>> -		__data = jiffies_to_msecs(__data);			\
>> -	return deadline_var_show(__data, (page));			\
>> -}
>> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
>> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
>> -SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
>> -SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
>> -SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
>> -#undef SHOW_FUNCTION
>> +									\
>> +	return sysfs_emit((page), "%d\n", __VAR);			\
>> +}
> 
> Nit: the () around page aren't needed

Right, I will remove these parentheses.

Thanks,

Bart.



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/14] block/mq-deadline: Add I/O priority support
  2021-06-09  5:03   ` Damien Le Moal
@ 2021-06-09 17:25     ` Bart Van Assche
  2021-06-10  0:39       ` Damien Le Moal
  0 siblings, 1 reply; 45+ messages in thread
From: Bart Van Assche @ 2021-06-09 17:25 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/8/21 10:03 PM, Damien Le Moal wrote:
> On 2021/06/09 8:07, Bart Van Assche wrote:
>>  struct deadline_data {
>>  	/*
>>  	 * run time data
>>  	 */
>>  
>>  	/*
>> -	 * requests (deadline_rq s) are present on both sort_list and fifo_list
>> +	 * Requests are present on both sort_list[] and fifo_list[][]. The
>> +	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
>> +	 * The second index is the data direction (rq_data_dir(rq)).
>>  	 */
>>  	struct rb_root sort_list[DD_DIR_COUNT];
>> -	struct list_head fifo_list[DD_DIR_COUNT];
>> +	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];
> 
> Would it make sense to pack these 2 into a sub structure ? e.g.:
> 
> struct deadline_lists {
> 	struct rb_root sort_list;
> 	struct list_head fifo_list[DD_PRIO_COUNT];
> };
> 
> struct deadline_data {
> 	...
> 	/*
> 	 * Requests are present on both sort_list[] and fifo_list[][]. The
> 	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
> 	 * The second index is the data direction (rq_data_dir(rq)).
>  	 */
> 	struct deadline_lists	lists[DD_DIR_COUNT];
 The deadline_fifo_request() function and several other functions
examine both directions so I think that grouping per direction would
complicate these functions. Grouping per I/O priority might help to make
these functions easier to read. Do you want me to look further into this?

>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	const u8 ioprio_class = dd_rq_ioclass(next);
>> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>> +
>> +	if (next->elv.priv[0]) {
>> +		dd_count(dd, merged, prio);
>> +	} else {
>> +		WARN_ON_ONCE(true);
>> +	}
> 
> No need for the curly brackets I think.

I can leave these out but you may want to know that leaving the curly
brackets out from this patch will make it necessary to introduce these
in the next patch in this series.

>> +/* Number of requests queued for a given priority level. */
>> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>> +{
>> +	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> 
> This also includes requests that are being executed on the device. Is that OK ?

Yes, and that's on purpose. Please note that this function is only used
to export statistics to user space.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] block: Introduce the ioprio rq-qos policy
  2021-06-09 16:48     ` Bart Van Assche
@ 2021-06-10  0:29       ` Damien Le Moal
  0 siblings, 0 replies; 45+ messages in thread
From: Damien Le Moal @ 2021-06-10  0:29 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/10 1:49, Bart Van Assche wrote:
> On 6/8/21 9:40 PM, Damien Le Moal wrote:
>> On 2021/06/09 8:07, Bart Van Assche wrote:
> 
> [ ... ]
> 
>>> +/*
>>> + * The accepted I/O priority values are 0..IOPRIO_CLASS_MAX(3). 0 (default)
>>> + * means do not modify the I/O priority. Values 1..3 are used to restrict the
>>> + * I/O priority for a cgroup to the specified priority. See also
>>> + * <linux/ioprio.h>.
>>> + */
>>> +#define IOPRIO_CLASS_MAX IOPRIO_CLASS_IDLE
> 
> [ ... ]
> 
>>> +static int ioprio_set_prio_class(struct cgroup_subsys_state *css,
>>> +				 struct cftype *cft, u64 val)
>>> +{
>>> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
>>> +
>>> +	if (val > IOPRIO_CLASS_MAX)
>>> +		return -EINVAL;
>>
>> Where is IOPRIO_CLASS_MAX defined ? I do not see it.
> 
> Near the start of the new block/blk-ioprio.c file.

OK. I missed that. Shouldn't this definition be moved to include/linux/ioprio.h
? It can be added as the last entry of the class enum.

> 
>> Why not use ioprio_valid() ?
> 
> The definition of that macro is as follows:
> 
> #define ioprio_valid(mask)	\
> 	(IOPRIO_PRIO_CLASS((mask)) != IOPRIO_CLASS_NONE)
> 
> So that macro accepts I/O priority classes 1..7 while the above code
> accepts I/O priority classes 0..3. I think that ioprio_set_prio_class()
> should only accept I/O priority class values 0..3.

Yes. I misread this macro definition. The name is actually confusing. It does
not check that the ioprio is valid, it only check that it is not NONE, meaning
that something is set, but that something may not be valid. We could probably
turn this macro into an inline function that does a better job at checking that
valid values are in the "mask" argument, which by the way is also badly named,
since that is not a mask at all, but an ioprio.



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-09 16:54     ` Bart Van Assche
@ 2021-06-10  0:31       ` Damien Le Moal
  0 siblings, 0 replies; 45+ messages in thread
From: Damien Le Moal @ 2021-06-10  0:31 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/10 1:54, Bart Van Assche wrote:
> On 6/8/21 9:46 PM, Damien Le Moal wrote:
>> On 2021/06/09 8:07, Bart Van Assche wrote:
>>> +#define SHOW_INT(__FUNC, __VAR)						\
>>>  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>>>  {									\
>>>  	struct deadline_data *dd = e->elevator_data;			\
>>> -	int __data = __VAR;						\
>>> -	if (__CONV)							\
>>> -		__data = jiffies_to_msecs(__data);			\
>>> -	return deadline_var_show(__data, (page));			\
>>> -}
>>> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
>>> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
>>> -SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
>>> -SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
>>> -SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
>>> -#undef SHOW_FUNCTION
>>> +									\
>>> +	return sysfs_emit((page), "%d\n", __VAR);			\
>>> +}
>>> +#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
>>
>> jiffies_to_msecs() returns an unsigned int but sysfs_emit() in SHOW_INT() uses a
>> %d format. That will cause problems, no ?
> 
> The corresponding store functions restrict values that represent a time
> in jiffies to the interval [0, INT_MAX] milliseconds so I think that
> using %d to format a time in milliseconds is fine.

Missed that part :) OK then.

> 
> Thanks,
> 
> Bart.
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/14] block/mq-deadline: Add I/O priority support
  2021-06-09 17:25     ` Bart Van Assche
@ 2021-06-10  0:39       ` Damien Le Moal
  0 siblings, 0 replies; 45+ messages in thread
From: Damien Le Moal @ 2021-06-10  0:39 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Hannes Reinecke,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 2021/06/10 2:25, Bart Van Assche wrote:
> On 6/8/21 10:03 PM, Damien Le Moal wrote:
>> On 2021/06/09 8:07, Bart Van Assche wrote:
>>>  struct deadline_data {
>>>  	/*
>>>  	 * run time data
>>>  	 */
>>>  
>>>  	/*
>>> -	 * requests (deadline_rq s) are present on both sort_list and fifo_list
>>> +	 * Requests are present on both sort_list[] and fifo_list[][]. The
>>> +	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
>>> +	 * The second index is the data direction (rq_data_dir(rq)).
>>>  	 */
>>>  	struct rb_root sort_list[DD_DIR_COUNT];
>>> -	struct list_head fifo_list[DD_DIR_COUNT];
>>> +	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];
>>
>> Would it make sense to pack these 2 into a sub structure ? e.g.:
>>
>> struct deadline_lists {
>> 	struct rb_root sort_list;
>> 	struct list_head fifo_list[DD_PRIO_COUNT];
>> };
>>
>> struct deadline_data {
>> 	...
>> 	/*
>> 	 * Requests are present on both sort_list[] and fifo_list[][]. The
>> 	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
>> 	 * The second index is the data direction (rq_data_dir(rq)).
>>  	 */
>> 	struct deadline_lists	lists[DD_DIR_COUNT];
>  The deadline_fifo_request() function and several other functions
> examine both directions so I think that grouping per direction would
> complicate these functions. Grouping per I/O priority might help to make
> these functions easier to read. Do you want me to look further into this?

If the code is simplified, yes, let's do it ! And Hannes will be happy as the
array of arrays will be removed :)

>>> +	struct deadline_data *dd = q->elevator->elevator_data;
>>> +	const u8 ioprio_class = dd_rq_ioclass(next);
>>> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>>> +
>>> +	if (next->elv.priv[0]) {
>>> +		dd_count(dd, merged, prio);
>>> +	} else {
>>> +		WARN_ON_ONCE(true);
>>> +	}
>>
>> No need for the curly brackets I think.
> 
> I can leave these out but you may want to know that leaving the curly
> brackets out from this patch will make it necessary to introduce these
> in the next patch in this series.
> 
>>> +/* Number of requests queued for a given priority level. */
>>> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>>> +{
>>> +	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
>>
>> This also includes requests that are being executed on the device. Is that OK ?
> 
> Yes, and that's on purpose. Please note that this function is only used
> to export statistics to user space.

Yes, I got it that it is only for stats shown to the user. I am just wondering
if we need to separate in-flight requests and queued requests. That should be
easy to do since you also have the dispatched count.
But again, not entirely sure if it is useful to have such level of detail. So I
guess it is OK as is for now. We can revisit later if needed.

> 
> Thanks,
> 
> Bart.
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  2021-06-08 23:06 ` [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
  2021-06-09  4:14   ` Damien Le Moal
  2021-06-09  7:03   ` Johannes Thumshirn
@ 2021-06-10  6:02   ` Hannes Reinecke
  2 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:02 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> These entries were consecutive at the time of their introduction but are no
> longer consecutive. Make these again consecutive. Additionally, modify the
> help text since it refers to blk-mq and since the legacy block layer has
> been removed.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/Kconfig | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-08 23:06 ` [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
  2021-06-09  4:19   ` Damien Le Moal
  2021-06-09  7:05   ` Johannes Thumshirn
@ 2021-06-10  6:02   ` Hannes Reinecke
  2 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:02 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Tejun Heo,
	Damien Le Moal, Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> Before adding more calls in this function, simplify the error path.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-cgroup.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d169e2055158..3b0f6efaa2b6 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1183,15 +1183,14 @@ int blkcg_init_queue(struct request_queue *q)
>  	if (preloaded)
>  		radix_tree_preload_end();
>  
> -	ret = blk_throtl_init(q);
> +	ret = blk_iolatency_init(q);
>  	if (ret)
>  		goto err_destroy_all;
>  
> -	ret = blk_iolatency_init(q);
> -	if (ret) {
> -		blk_throtl_exit(q);
> +	ret = blk_throtl_init(q);
> +	if (ret)
>  		goto err_destroy_all;
> -	}
> +
>  	return 0;
>  
>  err_destroy_all:
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file
  2021-06-08 23:06 ` [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
  2021-06-09  4:22   ` Damien Le Moal
@ 2021-06-10  6:03   ` Hannes Reinecke
  1 sibling, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:03 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> rq_qos_id_to_name() is not in the hot path so move it from a .h into a .c
> file.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-rq-qos.c | 13 +++++++++++++
>  block/blk-rq-qos.h | 13 +------------
>  2 files changed, 14 insertions(+), 12 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] block: Introduce the ioprio rq-qos policy
  2021-06-08 23:06 ` [PATCH 04/14] block: Introduce the ioprio rq-qos policy Bart Van Assche
  2021-06-09  4:40   ` Damien Le Moal
@ 2021-06-10  6:20   ` Hannes Reinecke
  2021-06-10 17:14     ` Bart Van Assche
  1 sibling, 1 reply; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:20 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> Introduce an rq-qos policy that assigns an I/O priority to requests based
> on blk-cgroup configuration settings. This policy has the following
> advantages over the ioprio_set() system call:
> - This policy is cgroup based so it has all the advantages of cgroups.
> - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
>   controller affects page cache writeback I/O for filesystems that support
>   assiociating a cgroup with writeback I/O. See also
>   Documentation/admin-guide/cgroup-v2.rst.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  26 +++
>  block/Kconfig                           |   9 +
>  block/Makefile                          |   1 +
>  block/blk-cgroup.c                      |   5 +
>  block/blk-ioprio.c                      | 236 ++++++++++++++++++++++++
>  block/blk-ioprio.h                      |  19 ++
>  block/blk-rq-qos.c                      |   2 +
>  block/blk-rq-qos.h                      |   1 +
>  8 files changed, 299 insertions(+)
>  create mode 100644 block/blk-ioprio.c
>  create mode 100644 block/blk-ioprio.h
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index b1e81aa8598a..cf8c8073b474 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -56,6 +56,7 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
>         5-3-3. IO Latency
>           5-3-3-1. How IO Latency Throttling Works
>           5-3-3-2. IO Latency Interface Files
> +       5-3-4. IO Priority
>       5-4. PID
>         5-4-1. PID Interface Files
>       5-5. Cpuset
> @@ -1866,6 +1867,31 @@ IO Latency Interface Files
>  		duration of time between evaluation events.  Windows only elapse
>  		with IO activity.  Idle periods extend the most recent window.
>  
> +IO Priority
> +~~~~~~~~~~~
> +
> +A single attribute controls the behavior of the I/O priority cgroup policy,
> +namely the blkio.prio.class attribute. The following values are accepted for
> +that attribute:
> +
> +  0
> +	Do not modify the I/O priority class.
> +
> +  1
> +	For requests that do not have an I/O priority class (IOPRIO_CLASS_NONE),
> +	change the I/O priority class into 1 (IOPRIO_CLASS_RT). Do not modify
> +	the I/O priority class of other requests.
> +
> +  2
> +	For requests that do not have an I/O priority class, change the I/O
> +	priority class into IOPRIO_CLASS_BE. For requests that have the
> +	priority class RT, change it into BE (2). Do not modify
> +	the I/O priority class of requests that have priority 3 (IDLE).
> +
> +  3
> +	Change the I/O priority class of all requests into 3 (IDLE),
> +	the lowest I/O priority class.
> +
>  PID
>  ---
>  

Can you make this descriptive names, like 'none', 'none-to-rt',
'be-not-idle', and 'idle-all'?
Having to read the documentation to figure out what the numbers mean is
quite cumbersome.

> diff --git a/block/Kconfig b/block/Kconfig
> index 6685578b2a20..319816058298 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -162,6 +162,15 @@ config BLK_CGROUP_IOCOST
>  	distributes IO capacity between different groups based on
>  	their share of the overall weight distribution.
>  
> +config BLK_CGROUP_IOPRIO
> +	bool "Cgroup I/O controller for assigning I/O priority"
> +	depends on BLK_CGROUP
> +	help
> +	Enable the .prio interface for assigning an I/O priority to requests.
> +	The I/O priority affects the order in which an I/O scheduler and block
> +	devices process requests. Only some I/O schedulers and some block
> +	devices support I/O priorities.
> +
>  config BLK_DEBUG_FS
>  	bool "Block layer debugging information in debugfs"
>  	default y
> diff --git a/block/Makefile b/block/Makefile
> index 8d841f5f986f..af3d044abaf1 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -17,6 +17,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
>  obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
>  obj-$(CONFIG_BLK_CGROUP_RWSTAT)	+= blk-cgroup-rwstat.o
>  obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
> +obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
>  obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
>  obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
>  obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 3b0f6efaa2b6..7b06a5fa3cac 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -31,6 +31,7 @@
>  #include <linux/tracehook.h>
>  #include <linux/psi.h>
>  #include "blk.h"
> +#include "blk-ioprio.h"
>  
>  /*
>   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> @@ -1187,6 +1188,10 @@ int blkcg_init_queue(struct request_queue *q)
>  	if (ret)
>  		goto err_destroy_all;
>  
> +	ret = blk_ioprio_init(q);
> +	if (ret)
> +		goto err_destroy_all;
> +
>  	ret = blk_throtl_init(q);
>  	if (ret)
>  		goto err_destroy_all;
> diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
> new file mode 100644
> index 000000000000..d7292cdef67d
> --- /dev/null
> +++ b/block/blk-ioprio.c
> @@ -0,0 +1,236 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Block rq-qos policy for assigning an I/O priority to requests.
> + *
> + * Using an rq-qos policy for assigning I/O priority has two advantages over
> + * using the ioprio_set() system call:
> + *
> + * - This policy is cgroup based so it has all the advantages of cgroups.
> + * - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
> + *   controller affects page cache writeback I/O for filesystems that support
> + *   assiociating a cgroup with writeback I/O. See also
> + *   Documentation/admin-guide/cgroup-v2.rst.
> + */
> +
> +#include <linux/blk-cgroup.h>
> +#include <linux/blk-mq.h>
> +#include <linux/blk_types.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include "blk-ioprio.h"
> +#include "blk-rq-qos.h"
> +
> +/*
> + * The accepted I/O priority values are 0..IOPRIO_CLASS_MAX(3). 0 (default)
> + * means do not modify the I/O priority. Values 1..3 are used to restrict the
> + * I/O priority for a cgroup to the specified priority. See also
> + * <linux/ioprio.h>.
> + */
> +#define IOPRIO_CLASS_MAX IOPRIO_CLASS_IDLE
> +

And I would really not use the IOPRIO_CLASS prefix here.
These values define a _policy_, if and when a specific class should be
used; they are not the class itself.
So please use a different prefix here for the various policies (which
you already have outlined in the documentation), eg

IOPRIO_POLICY_NONE
IOPRIO_POLICY_NONE_TO_RT
IOPRIO_POLICY_BE_NOT_IDLE
IOPRIO_POLICY_IDLE_ALL

> +static struct blkcg_policy ioprio_policy;
> +
> +/**
> + * struct ioprio_blkg - Per (cgroup, request queue) data.
> + * @pd: blkg_policy_data structure.
> + */
> +struct ioprio_blkg {
> +	struct blkg_policy_data pd;
> +};
> +
> +/**
> + * struct ioprio_blkcg - Per cgroup data.
> + * @cpd: blkcg_policy_data structure.
> + * @prio_class: One of the IOPRIO_CLASS_* values. See also <linux/ioprio.h>.
> + */
> +struct ioprio_blkcg {
> +	struct blkcg_policy_data cpd;
> +	u8			 prio_class;
> +};
> +
> +static inline struct ioprio_blkg *pd_to_ioprio(struct blkg_policy_data *pd)
> +{
> +	return pd ? container_of(pd, struct ioprio_blkg, pd) : NULL;
> +}
> +
> +static struct ioprio_blkcg *
> +ioprio_blkcg_from_css(struct cgroup_subsys_state *css)
> +{
> +	struct blkcg *blkcg = css_to_blkcg(css);
> +	struct blkcg_policy_data *cpd = blkcg_to_cpd(blkcg, &ioprio_policy);
> +
> +	return container_of(cpd, struct ioprio_blkcg, cpd);
> +}
> +
> +static struct ioprio_blkcg *ioprio_blkcg_from_bio(struct bio *bio)
> +{
> +	struct blkg_policy_data *pd;
> +
> +	pd = blkg_to_pd(bio->bi_blkg, &ioprio_policy);
> +	if (!pd)
> +		return NULL;
> +
> +	return container_of(blkcg_to_cpd(pd->blkg->blkcg, &ioprio_policy),
> +			    struct ioprio_blkcg, cpd);
> +}
> +
> +static u64 ioprio_show_prio_class(struct cgroup_subsys_state *css,
> +				  struct cftype *cft)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
> +
> +	return blkcg->prio_class;
> +}
> +
> +static int ioprio_set_prio_class(struct cgroup_subsys_state *css,
> +				 struct cftype *cft, u64 val)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(css);
> +
> +	if (val > IOPRIO_CLASS_MAX)
> +		return -EINVAL;
> +
> +	blkcg->prio_class = val;
> +	return 0;
> +}
> +

And these would need to be modified to accept and show the names, too.

> +static struct blkg_policy_data *ioprio_alloc_pd(gfp_t gfp,
> +						struct request_queue *q,
> +						struct blkcg *blkcg)
> +{
> +	struct ioprio_blkg *ioprio_blkg;
> +
> +	ioprio_blkg = kzalloc(sizeof(*ioprio_blkg), gfp);
> +	if (!ioprio_blkg)
> +		return NULL;
> +
> +	return &ioprio_blkg->pd;
> +}
> +
> +static void ioprio_free_pd(struct blkg_policy_data *pd)
> +{
> +	struct ioprio_blkg *ioprio_blkg = pd_to_ioprio(pd);
> +
> +	kfree(ioprio_blkg);
> +}
> +
> +static struct blkcg_policy_data *ioprio_alloc_cpd(gfp_t gfp)
> +{
> +	struct ioprio_blkcg *blkcg;
> +
> +	blkcg = kzalloc(sizeof(*blkcg), gfp);
> +	if (!blkcg)
> +		return NULL;
> +	blkcg->prio_class = IOPRIO_CLASS_NONE;
> +	return &blkcg->cpd;
> +}
> +
> +static void ioprio_free_cpd(struct blkcg_policy_data *cpd)
> +{
> +	struct ioprio_blkcg *blkcg = container_of(cpd, typeof(*blkcg), cpd);
> +
> +	kfree(blkcg);
> +}
> +
> +#define IOPRIO_ATTRS						\
> +	{							\
> +		.name		= "prio.class",			\
> +		.read_u64	= ioprio_show_prio_class,	\
> +		.write_u64	= ioprio_set_prio_class,	\
> +	},							\
> +	{ } /* sentinel */
> +
> +/* cgroup v2 attributes */
> +static struct cftype ioprio_files[] = {
> +	IOPRIO_ATTRS
> +};
> +
> +/* cgroup v1 attributes */
> +static struct cftype ioprio_legacy_files[] = {
> +	IOPRIO_ATTRS
> +};
> +
> +static struct blkcg_policy ioprio_policy = {
> +	.dfl_cftypes	= ioprio_files,
> +	.legacy_cftypes = ioprio_legacy_files,
> +
> +	.cpd_alloc_fn	= ioprio_alloc_cpd,
> +	.cpd_free_fn	= ioprio_free_cpd,
> +
> +	.pd_alloc_fn	= ioprio_alloc_pd,
> +	.pd_free_fn	= ioprio_free_pd,
> +};
> +
> +struct blk_ioprio {
> +	struct rq_qos rqos;
> +};
> +
> +static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
> +			       struct bio *bio)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_bio(bio);
> +
> +	/*
> +	 * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
> +	 * correspond to a lower priority. Hence, the max_t() below selects
> +	 * the lower priority of bi_ioprio and the cgroup I/O priority class.
> +	 * If the cgroup priority has been set to IOPRIO_CLASS_NONE == 0, the
> +	 * bio I/O priority is not modified. If the bio I/O priority equals
> +	 * IOPRIO_CLASS_NONE, the cgroup I/O priority is assigned to the bio.
> +	 */
> +	bio->bi_ioprio = max_t(u16, bio->bi_ioprio,
> +			       IOPRIO_PRIO_VALUE(blkcg->prio_class, 0));
> +}

Sheesh. Now that is cheeky.
First defining a (conceptually) complex policy setting (where people
wonder where these policies came from), which then devolve into a simple
max() setting of the priority value.
This _really_ could do with a better explanation in the documentation,
as then it's far easier to understand _why_ certain policies override
others.
IE this comment belongs in the documentation, as explanation of the
underlying mechanics of the ioprio policies.

> +
> +static void blkcg_ioprio_exit(struct rq_qos *rqos)
> +{
> +	struct blk_ioprio *blkioprio_blkg =
> +		container_of(rqos, typeof(*blkioprio_blkg), rqos);
> +
> +	blkcg_deactivate_policy(rqos->q, &ioprio_policy);
> +	kfree(blkioprio_blkg);
> +}
> +
> +static struct rq_qos_ops blkcg_ioprio_ops = {
> +	.track = blkcg_ioprio_track,
> +	.exit = blkcg_ioprio_exit,
> +};
> +
> +int blk_ioprio_init(struct request_queue *q)
> +{
> +	struct blk_ioprio *blkioprio_blkg;
> +	struct rq_qos *rqos;
> +	int ret;
> +
> +	blkioprio_blkg = kzalloc(sizeof(*blkioprio_blkg), GFP_KERNEL);
> +	if (!blkioprio_blkg)
> +		return -ENOMEM;
> +
> +	ret = blkcg_activate_policy(q, &ioprio_policy);
> +	if (ret) {
> +		kfree(blkioprio_blkg);
> +		return ret;
> +	}
> +
> +	rqos = &blkioprio_blkg->rqos;
> +	rqos->id = RQ_QOS_IOPRIO;
> +	rqos->ops = &blkcg_ioprio_ops;
> +	rqos->q = q;
> +
> +	rq_qos_add(q, rqos);
> +
> +	return 0;
> +}
> +
> +static int __init ioprio_init(void)
> +{
> +	return blkcg_policy_register(&ioprio_policy);
> +}
> +
> +static void __exit ioprio_exit(void)
> +{
> +	blkcg_policy_unregister(&ioprio_policy);
> +}
> +
> +module_init(ioprio_init);
> +module_exit(ioprio_exit);
> diff --git a/block/blk-ioprio.h b/block/blk-ioprio.h
> new file mode 100644
> index 000000000000..a7785c2f1aea
> --- /dev/null
> +++ b/block/blk-ioprio.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _BLK_IOPRIO_H_
> +#define _BLK_IOPRIO_H_
> +
> +#include <linux/kconfig.h>
> +
> +struct request_queue;
> +
> +#ifdef CONFIG_BLK_CGROUP_IOPRIO
> +int blk_ioprio_init(struct request_queue *q);
> +#else
> +static inline int blk_ioprio_init(struct request_queue *q)
> +{
> +	return 0;
> +}
> +#endif
> +
> +#endif /* _BLK_IOPRIO_H_ */
> diff --git a/block/blk-rq-qos.c b/block/blk-rq-qos.c
> index 61dccf584c68..3a5400b11af7 100644
> --- a/block/blk-rq-qos.c
> +++ b/block/blk-rq-qos.c
> @@ -11,6 +11,8 @@ const char *rq_qos_id_to_name(enum rq_qos_id id)
>  		return "latency";
>  	case RQ_QOS_COST:
>  		return "cost";
> +	case RQ_QOS_IOPRIO:
> +		return "ioprio";
>  	}
>  	return "unknown";
>  }
> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
> index 6b5f9ae69883..1d4c6e951aa6 100644
> --- a/block/blk-rq-qos.h
> +++ b/block/blk-rq-qos.h
> @@ -16,6 +16,7 @@ enum rq_qos_id {
>  	RQ_QOS_WBT,
>  	RQ_QOS_LATENCY,
>  	RQ_QOS_COST,
> +	RQ_QOS_IOPRIO,
>  };
>  
>  struct rq_wait {
> 
But other than that: good work, I like it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 05/14] block/mq-deadline: Add several comments
  2021-06-08 23:06 ` [PATCH 05/14] block/mq-deadline: Add several comments Bart Van Assche
@ 2021-06-10  6:21   ` Hannes Reinecke
  0 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:21 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Johannes Thumshirn, Himanshu Madhani, Ming Lei

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> Make the code easier to read by adding more comments.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 06/14] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-06-08 23:06 ` [PATCH 06/14] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
@ 2021-06-10  6:21   ` Hannes Reinecke
  0 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:21 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Chaitanya Kulkarni,
	Damien Le Moal, Johannes Thumshirn, Himanshu Madhani, Ming Lei

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> Document the locking strategy by adding two lockdep_assert_held()
> statements.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 07/14] block/mq-deadline: Remove two local variables
  2021-06-08 23:06 ` [PATCH 07/14] block/mq-deadline: Remove two local variables Bart Van Assche
@ 2021-06-10  6:22   ` Hannes Reinecke
  0 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:22 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Chaitanya Kulkarni,
	Damien Le Moal, Johannes Thumshirn, Himanshu Madhani, Ming Lei

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> Make __dd_dispatch_request() easier to read by removing two local
> variables.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-08 23:06 ` [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
  2021-06-09  4:46   ` Damien Le Moal
  2021-06-09  7:11   ` Johannes Thumshirn
@ 2021-06-10  6:23   ` Hannes Reinecke
  2 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:23 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:06 AM, Bart Van Assche wrote:
> Define separate macros for integers and jiffies to improve readability.
> Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol()
> since the former functions are the recommended functions.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 66 ++++++++++++++++++++-------------------------
>  1 file changed, 29 insertions(+), 37 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 11/14] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests
  2021-06-08 23:07 ` [PATCH 11/14] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests Bart Van Assche
@ 2021-06-10  6:26   ` Hannes Reinecke
  0 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:26 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:07 AM, Bart Van Assche wrote:
> For interactive workloads it is important that synchronous requests are
> not delayed. Hence reserve 25% of scheduler tags for synchronous requests.
> This patch still allows asynchronous requests to fill the hardware queues
> since blk_mq_init_sched() makes sure that the number of scheduler requests
> is the double of the hardware queue depth. From blk_mq_init_sched():
> 
> 	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
> 				   BLKDEV_MAX_RQ);
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 52 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 52 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 1d1bb7a41d2a..a7d0584437d1 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -67,6 +67,7 @@ struct deadline_data {
>  	int fifo_batch;
>  	int writes_starved;
>  	int front_merges;
> +	u32 async_depth;
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> @@ -397,6 +398,44 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  	return rq;
>  }
>  
> +/*
> + * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
> + * function is used by __blk_mq_get_tag().
> + */
> +static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
> +{
> +	struct deadline_data *dd = data->q->elevator->elevator_data;
> +
> +	/* Do not throttle synchronous reads. */
> +	if (op_is_sync(op) && !op_is_write(op))
> +		return;
> +
> +	/*
> +	 * Throttle asynchronous requests and writes such that these requests
> +	 * do not block the allocation of synchronous requests.
> +	 */
> +	data->shallow_depth = dd->async_depth;
> +}
> +
> +/* Called by blk_mq_update_nr_requests(). */
> +static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	struct blk_mq_tags *tags = hctx->sched_tags;
> +
> +	dd->async_depth = max(1UL, 3 * q->nr_requests / 4);
> +
> +	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, dd->async_depth);
> +}
> +
> +/* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */
> +static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
> +{
> +	dd_depth_updated(hctx);
> +	return 0;
> +}
> +
>  static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
> @@ -733,6 +772,15 @@ static int deadline_starved_show(void *data, struct seq_file *m)
>  	return 0;
>  }
>  
> +static int dd_async_depth_show(void *data, struct seq_file *m)
> +{
> +	struct request_queue *q = data;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	seq_printf(m, "%u\n", dd->async_depth);
> +	return 0;
> +}
> +
>  static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
>  	__acquires(&dd->lock)
>  {
> @@ -775,6 +823,7 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
>  	DEADLINE_QUEUE_DDIR_ATTRS(write),
>  	{"batching", 0400, deadline_batching_show},
>  	{"starved", 0400, deadline_starved_show},
> +	{"async_depth", 0400, dd_async_depth_show},
>  	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
>  	{},
>  };
> @@ -783,6 +832,8 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
>  
>  static struct elevator_type mq_deadline = {
>  	.ops = {
> +		.depth_updated		= dd_depth_updated,
> +		.limit_depth		= dd_limit_depth,
>  		.insert_requests	= dd_insert_requests,
>  		.dispatch_request	= dd_dispatch_request,
>  		.prepare_request	= dd_prepare_request,
> @@ -796,6 +847,7 @@ static struct elevator_type mq_deadline = {
>  		.has_work		= dd_has_work,
>  		.init_sched		= dd_init_sched,
>  		.exit_sched		= dd_exit_sched,
> +		.init_hctx		= dd_init_hctx,
>  	},
>  
>  #ifdef CONFIG_BLK_DEBUG_FS
> 
Can't say I like heuristics. Might be needed, but there will always be
use-cases where the heuristics fail. Can't we make this value
configurable via sysfs?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 12/14] block/mq-deadline: Add I/O priority support
  2021-06-08 23:07 ` [PATCH 12/14] block/mq-deadline: Add I/O priority support Bart Van Assche
  2021-06-09  5:03   ` Damien Le Moal
@ 2021-06-10  6:35   ` Hannes Reinecke
  1 sibling, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:35 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:07 AM, Bart Van Assche wrote:
> Maintain one dispatch list and FIFO one list per I/O priority class: RT, BE
> and IDLE. Maintain statistics for each priority level. Split the debugfs
> attributes per priority level as follows:
> 
> $ ls /sys/kernel/debug/block/.../sched/
> async_depth  dispatch2        read_next_rq      write2_fifo_list
> batching     read0_fifo_list  starved           write_next_rq
> dispatch0    read1_fifo_list  write0_fifo_list
> dispatch1    read2_fifo_list  write1_fifo_list
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 355 ++++++++++++++++++++++++++++++++++----------
>  1 file changed, 279 insertions(+), 76 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index a7d0584437d1..776ff49713c3 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -42,16 +42,40 @@ enum dd_data_dir {
>  
>  enum { DD_DIR_COUNT = 2 };
>  
> +enum dd_prio {
> +	DD_RT_PRIO	= 0,
> +	DD_BE_PRIO	= 1,
> +	DD_IDLE_PRIO	= 2,
> +	DD_PRIO_MAX	= 2,
> +};
> +
> +enum { DD_PRIO_COUNT = 3 };
> +
> +/* I/O statistics per I/O priority. */
> +struct io_stats_per_prio {
> +	local_t inserted;
> +	local_t merged;
> +	local_t dispatched;
> +	local_t completed;
> +};
> +
> +/* I/O statistics for all I/O priorities (enum dd_prio). */
> +struct io_stats {
> +	struct io_stats_per_prio stats[DD_PRIO_COUNT];
> +};
> +
>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
>  
>  	/*
> -	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 * Requests are present on both sort_list[] and fifo_list[][]. The
> +	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
> +	 * The second index is the data direction (rq_data_dir(rq)).
>  	 */
>  	struct rb_root sort_list[DD_DIR_COUNT];
> -	struct list_head fifo_list[DD_DIR_COUNT];
> +	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];
>  

Damien already commented on this; I am really not a fan of arrays of
arrays. Make this a simple array and calculate the offset accordingly.

>  	/*
>  	 * next in sort order. read, write or both are NULL
> @@ -60,6 +84,8 @@ struct deadline_data {
>  	unsigned int batching;		/* number of sequential requests made */
>  	unsigned int starved;		/* times reads have starved writes */
>  
> +	struct io_stats __percpu *stats;
> +
>  	/*
>  	 * settings that change how the i/o scheduler behaves
>  	 */
> @@ -71,7 +97,42 @@ struct deadline_data {
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> -	struct list_head dispatch;
> +	struct list_head dispatch[DD_PRIO_COUNT];
> +};
> +
> +/* Count one event of type 'event_type' and with I/O priority 'prio' */
> +#define dd_count(dd, event_type, prio) do {				\
> +	struct io_stats *io_stats = get_cpu_ptr((dd)->stats);		\
> +									\
> +	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
> +	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
> +	local_inc(&io_stats->stats[(prio)].event_type);			\
> +	put_cpu_ptr(io_stats);						\
> +} while (0)
> +
> +/*
> + * Returns the total number of dd_count(dd, event_type, prio) calls across all
> + * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> + * outdated.
> + */
> +#define dd_sum(dd, event_type, prio) ({					\
> +	unsigned int cpu;						\
> +	u32 sum = 0;							\
> +									\
> +	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
> +	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
> +	for_each_present_cpu(cpu)					\
> +		sum += local_read(&per_cpu_ptr((dd)->stats, cpu)->	\
> +				  stats[(prio)].event_type);		\
> +	sum;								\
> +})
> +
> +/* Maps an I/O priority class to a deadline scheduler priority. */
> +static const enum dd_prio ioprio_class_to_prio[] = {
> +	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
> +	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
>  };
>  
>  static inline struct rb_root *
> @@ -147,12 +208,31 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
>  	}
>  }
>  
> +/*
> + * Returns the I/O priority class (IOPRIO_CLASS_*) that has been assigned to a
> + * request.
> + */
> +static u8 dd_rq_ioclass(struct request *rq)
> +{
> +	return IOPRIO_PRIO_CLASS(req_get_ioprio(rq));
> +}
> +
>  /*
>   * Callback function that is invoked after @next has been merged into @req.
>   */
>  static void dd_merged_requests(struct request_queue *q, struct request *req,
>  			       struct request *next)
>  {
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const u8 ioprio_class = dd_rq_ioclass(next);
> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
> +
> +	if (next->elv.priv[0]) {
> +		dd_count(dd, merged, prio);
> +	} else {
> +		WARN_ON_ONCE(true);
> +	}
> +
>  	/*
>  	 * if next expires before rq, assign its expire time to rq
>  	 * and move into next position (next will be deleted) in fifo
> @@ -189,14 +269,21 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
>  	deadline_remove_request(rq->q, rq);
>  }
>  
> +/* Number of requests queued for a given priority level. */
> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +}
> +
>  /*
>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>   */
>  static inline int deadline_check_fifo(struct deadline_data *dd,
> +				      enum dd_prio prio,
>  				      enum dd_data_dir data_dir)
>  {
> -	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);

Same here for the arrays of arrays sentiment. Can't you introduce an
accessor function?

>  
>  	/*
>  	 * rq is expired!
> @@ -212,15 +299,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd,
>   * dispatch using arrival ordered lists.
>   */
>  static struct request *
> -deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
> +deadline_fifo_request(struct deadline_data *dd, enum dd_prio prio,
> +		      enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
>  
> -	if (list_empty(&dd->fifo_list[data_dir]))
> +	if (list_empty(&dd->fifo_list[prio][data_dir]))
>  		return NULL;
>  
> -	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>  	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>  		return rq;
>  
> @@ -229,7 +317,7 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>  	 * an unlocked target zone.
>  	 */
>  	spin_lock_irqsave(&dd->zone_lock, flags);
> -	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
> +	list_for_each_entry(rq, &dd->fifo_list[prio][DD_WRITE], queuelist) {
>  		if (blk_req_can_dispatch_to_zone(rq))
>  			goto out;
>  	}
> @@ -245,7 +333,8 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   * dispatch using sector position sorted lists.
>   */
>  static struct request *
> -deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
> +deadline_next_request(struct deadline_data *dd, enum dd_prio prio,
> +		      enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
> @@ -276,15 +365,18 @@ deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   * deadline_dispatch_requests selects the best request according to
>   * read/write expire, fifo_batch, etc
>   */
> -static struct request *__dd_dispatch_request(struct deadline_data *dd)
> +static struct request *__dd_dispatch_request(struct deadline_data *dd,
> +					     enum dd_prio prio)
>  {
>  	struct request *rq, *next_rq;
>  	enum dd_data_dir data_dir;
> +	u8 ioprio_class;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> -	if (!list_empty(&dd->dispatch)) {
> -		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> +	if (!list_empty(&dd->dispatch[prio])) {
> +		rq = list_first_entry(&dd->dispatch[prio], struct request,
> +				      queuelist);
>  		list_del_init(&rq->queuelist);
>  		goto done;
>  	}
> @@ -292,9 +384,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * batches are currently reads XOR writes
>  	 */
> -	rq = deadline_next_request(dd, DD_WRITE);
> +	rq = deadline_next_request(dd, prio, DD_WRITE);
>  	if (!rq)
> -		rq = deadline_next_request(dd, DD_READ);
> +		rq = deadline_next_request(dd, prio, DD_READ);
>  
>  	if (rq && dd->batching < dd->fifo_batch)
>  		/* we have a next request are still entitled to batch */
> @@ -305,10 +397,10 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * data direction (read / write)
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[DD_READ])) {
> +	if (!list_empty(&dd->fifo_list[prio][DD_READ])) {
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
>  
> -		if (deadline_fifo_request(dd, DD_WRITE) &&
> +		if (deadline_fifo_request(dd, prio, DD_WRITE) &&
>  		    (dd->starved++ >= dd->writes_starved))
>  			goto dispatch_writes;
>  
> @@ -321,7 +413,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * there are either no reads or writes have been starved
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
> +	if (!list_empty(&dd->fifo_list[prio][DD_WRITE])) {
>  dispatch_writes:
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
>  
> @@ -338,14 +430,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * we are not running a batch, find best request for selected data_dir
>  	 */
> -	next_rq = deadline_next_request(dd, data_dir);
> -	if (deadline_check_fifo(dd, data_dir) || !next_rq) {
> +	next_rq = deadline_next_request(dd, prio, data_dir);
> +	if (deadline_check_fifo(dd, prio, data_dir) || !next_rq) {
>  		/*
>  		 * A deadline has expired, the last request was in the other
>  		 * direction, or we have run out of higher-sectored requests.
>  		 * Start again from the request with the earliest expiry time.
>  		 */
> -		rq = deadline_fifo_request(dd, data_dir);
> +		rq = deadline_fifo_request(dd, prio, data_dir);
>  	} else {
>  		/*
>  		 * The last req was the same dir and we have a next request in
> @@ -370,6 +462,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	dd->batching++;
>  	deadline_move_request(dd, rq);
>  done:
> +	ioprio_class = dd_rq_ioclass(rq);
> +	prio = ioprio_class_to_prio[ioprio_class];
> +	if (rq->elv.priv[0]) {
> +		dd_count(dd, dispatched, prio);
> +	} else {
> +		WARN_ON_ONCE(true);
> +	}
>  	/*
>  	 * If the request needs its target zone locked, do it.
>  	 */
> @@ -390,9 +489,14 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>  	struct request *rq;
> +	enum dd_prio prio;
>  
>  	spin_lock(&dd->lock);
> -	rq = __dd_dispatch_request(dd);
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, prio);
> +		if (rq)
> +			break;
> +	}
>  	spin_unlock(&dd->lock);
>  
>  	return rq;
> @@ -439,9 +543,14 @@ static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>  static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
> +	enum dd_prio prio;
> +
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_READ]));
> +		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_WRITE]));
> +	}
>  
> -	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
> -	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
> +	free_percpu(dd->stats);
>  
>  	kfree(dd);
>  }
> @@ -453,20 +562,29 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  {
>  	struct deadline_data *dd;
>  	struct elevator_queue *eq;
> +	enum dd_prio prio;
> +	int ret = -ENOMEM;
>  
>  	eq = elevator_alloc(q, e);
>  	if (!eq)
> -		return -ENOMEM;
> +		return ret;
>  
>  	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
> -	if (!dd) {
> -		kobject_put(&eq->kobj);
> -		return -ENOMEM;
> -	}
> +	if (!dd)
> +		goto put_eq;
> +
>  	eq->elevator_data = dd;
>  
> -	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
> -	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
> +	dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
> +				     GFP_KERNEL | __GFP_ZERO);
> +	if (!dd->stats)
> +		goto free_dd;
> +
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_READ]);
> +		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_WRITE]);
> +		INIT_LIST_HEAD(&dd->dispatch[prio]);
> +	}
>  	dd->sort_list[DD_READ] = RB_ROOT;
>  	dd->sort_list[DD_WRITE] = RB_ROOT;
>  	dd->fifo_expire[DD_READ] = read_expire;
> @@ -476,10 +594,16 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	dd->fifo_batch = fifo_batch;
>  	spin_lock_init(&dd->lock);
>  	spin_lock_init(&dd->zone_lock);
> -	INIT_LIST_HEAD(&dd->dispatch);
>  
>  	q->elevator = eq;
>  	return 0;
> +
> +free_dd:
> +	kfree(dd);
> +
> +put_eq:
> +	kobject_put(&eq->kobj);
> +	return ret;
>  }
>  
>  /*
> @@ -539,6 +663,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	struct request_queue *q = hctx->queue;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  	const enum dd_data_dir data_dir = rq_data_dir(rq);
> +	u16 ioprio = req_get_ioprio(rq);
> +	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
> +	enum dd_prio prio;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> @@ -548,13 +675,18 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	 */
>  	blk_req_zone_write_unlock(rq);
>  
> +	prio = ioprio_class_to_prio[ioprio_class];
> +	dd_count(dd, inserted, prio);
> +	WARN_ON_ONCE(rq->elv.priv[0]);
> +	rq->elv.priv[0] = (void *)1ULL;
> +
>  	if (blk_mq_sched_try_insert_merge(q, rq))
>  		return;
>  
>  	trace_block_rq_insert(rq);
>  
>  	if (at_head) {
> -		list_add(&rq->queuelist, &dd->dispatch);
> +		list_add(&rq->queuelist, &dd->dispatch[prio]);
>  	} else {
>  		deadline_add_rq_rb(dd, rq);
>  
> @@ -568,7 +700,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  		 * set expire time and add to fifo list
>  		 */
>  		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> -		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> +		list_add_tail(&rq->queuelist, &dd->fifo_list[prio][data_dir]);
>  	}
>  }
>  
> @@ -592,12 +724,10 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>  	spin_unlock(&dd->lock);
>  }
>  
> -/*
> - * Nothing to do here. This is defined only to ensure that .finish_request
> - * method is called upon request completion.
> - */
> +/* Callback from inside blk_mq_rq_ctx_init(). */
>  static void dd_prepare_request(struct request *rq)
>  {
> +	rq->elv.priv[0] = NULL;
>  }
>  
>  /*
> @@ -619,26 +749,41 @@ static void dd_prepare_request(struct request *rq)
>  static void dd_finish_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const u8 ioprio_class = dd_rq_ioclass(rq);
> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
> +
> +	if (rq->elv.priv[0])
> +		dd_count(dd, completed, prio);
>  
>  	if (blk_queue_is_zoned(q)) {
> -		struct deadline_data *dd = q->elevator->elevator_data;
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);
>  		blk_req_zone_write_unlock(rq);
> -		if (!list_empty(&dd->fifo_list[DD_WRITE]))
> +		if (!list_empty(&dd->fifo_list[prio][DD_WRITE]))
>  			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>  		spin_unlock_irqrestore(&dd->zone_lock, flags);
>  	}
>  }
>  
> +static bool dd_has_work_for_prio(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return !list_empty_careful(&dd->dispatch[prio]) ||
> +		!list_empty_careful(&dd->fifo_list[prio][DD_READ]) ||
> +		!list_empty_careful(&dd->fifo_list[prio][DD_WRITE]);
> +}
> +
>  static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	enum dd_prio prio;
> +
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++)
> +		if (dd_has_work_for_prio(dd, prio))
> +			return true;
>  
> -	return !list_empty_careful(&dd->dispatch) ||
> -		!list_empty_careful(&dd->fifo_list[0]) ||
> -		!list_empty_careful(&dd->fifo_list[1]);
> +	return false;
>  }
>  
>  /*
> @@ -702,7 +847,7 @@ static struct elv_fs_entry deadline_attrs[] = {
>  };
>  
>  #ifdef CONFIG_BLK_DEBUG_FS
> -#define DEADLINE_DEBUGFS_DDIR_ATTRS(ddir, name)				\
> +#define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
>  static void *deadline_##name##_fifo_start(struct seq_file *m,		\
>  					  loff_t *pos)			\
>  	__acquires(&dd->lock)						\
> @@ -711,7 +856,7 @@ static void *deadline_##name##_fifo_start(struct seq_file *m,		\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
>  									\
>  	spin_lock(&dd->lock);						\
> -	return seq_list_start(&dd->fifo_list[ddir], *pos);		\
> +	return seq_list_start(&dd->fifo_list[prio][data_dir], *pos);	\
>  }									\
>  									\
>  static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
> @@ -720,7 +865,7 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
>  	struct request_queue *q = m->private;				\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
>  									\
> -	return seq_list_next(v, &dd->fifo_list[ddir], pos);		\
> +	return seq_list_next(v, &dd->fifo_list[prio][data_dir], pos);	\
>  }									\
>  									\
>  static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
> @@ -737,22 +882,31 @@ static const struct seq_operations deadline_##name##_fifo_seq_ops = {	\
>  	.next	= deadline_##name##_fifo_next,				\
>  	.stop	= deadline_##name##_fifo_stop,				\
>  	.show	= blk_mq_debugfs_rq_show,				\
> -};									\
> -									\
> +};
> +
> +#define DEADLINE_DEBUGFS_NEXT_RQ(data_dir, name)			\
>  static int deadline_##name##_next_rq_show(void *data,			\
>  					  struct seq_file *m)		\
>  {									\
>  	struct request_queue *q = data;					\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
> -	struct request *rq = dd->next_rq[ddir];				\
> +	struct request *rq = dd->next_rq[data_dir];			\
>  									\
>  	if (rq)								\
>  		__blk_mq_debugfs_rq_show(m, rq);			\
>  	return 0;							\
>  }
> -DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
> -DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
> +
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_READ, read0)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_WRITE, write0)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_READ, read1)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_WRITE, write1)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_READ, read2)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_WRITE, write2)
> +DEADLINE_DEBUGFS_NEXT_RQ(DD_READ, read)
> +DEADLINE_DEBUGFS_NEXT_RQ(DD_WRITE, write)
>  #undef DEADLINE_DEBUGFS_DDIR_ATTRS
> +#undef DEADLINE_DEBUGFS_NEXT_RQ
>  
>  static int deadline_batching_show(void *data, struct seq_file *m)
>  {
> @@ -781,50 +935,99 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
>  	return 0;
>  }
>  
> -static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
> -	__acquires(&dd->lock)
> +static int dd_queued_show(void *data, struct seq_file *m)
>  {
> -	struct request_queue *q = m->private;
> +	struct request_queue *q = data;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  
> -	spin_lock(&dd->lock);
> -	return seq_list_start(&dd->dispatch, *pos);
> +	seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
> +		   dd_queued(dd, DD_BE_PRIO),
> +		   dd_queued(dd, DD_IDLE_PRIO));
> +	return 0;
>  }
>  
> -static void *deadline_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
> +/* Number of requests owned by the block driver for a given priority. */
> +static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
>  {
> -	struct request_queue *q = m->private;
> -	struct deadline_data *dd = q->elevator->elevator_data;
> -
> -	return seq_list_next(v, &dd->dispatch, pos);
> +	return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
> +		- dd_sum(dd, completed, prio);
>  }
>  
> -static void deadline_dispatch_stop(struct seq_file *m, void *v)
> -	__releases(&dd->lock)
> +static int dd_owned_by_driver_show(void *data, struct seq_file *m)
>  {
> -	struct request_queue *q = m->private;
> +	struct request_queue *q = data;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  
> -	spin_unlock(&dd->lock);
> +	seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> +		   dd_owned_by_driver(dd, DD_BE_PRIO),
> +		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
> +	return 0;
>  }
>  
> -static const struct seq_operations deadline_dispatch_seq_ops = {
> -	.start	= deadline_dispatch_start,
> -	.next	= deadline_dispatch_next,
> -	.stop	= deadline_dispatch_stop,
> -	.show	= blk_mq_debugfs_rq_show,
> -};
> +#define DEADLINE_DISPATCH_ATTR(prio)					\
> +static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
> +					     loff_t *pos)		\
> +	__acquires(&dd->lock)						\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	spin_lock(&dd->lock);						\
> +	return seq_list_start(&dd->dispatch[prio], *pos);		\
> +}									\
> +									\
> +static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
> +					    void *v, loff_t *pos)	\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	return seq_list_next(v, &dd->dispatch[prio], pos);		\
> +}									\
> +									\
> +static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
> +	__releases(&dd->lock)						\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	spin_unlock(&dd->lock);						\
> +}									\
> +									\
> +static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
> +	.start	= deadline_dispatch##prio##_start,			\
> +	.next	= deadline_dispatch##prio##_next,			\
> +	.stop	= deadline_dispatch##prio##_stop,			\
> +	.show	= blk_mq_debugfs_rq_show,				\
> +}
> +
> +DEADLINE_DISPATCH_ATTR(0);
> +DEADLINE_DISPATCH_ATTR(1);
> +DEADLINE_DISPATCH_ATTR(2);
> +#undef DEADLINE_DISPATCH_ATTR
>  
> -#define DEADLINE_QUEUE_DDIR_ATTRS(name)						\
> -	{#name "_fifo_list", 0400, .seq_ops = &deadline_##name##_fifo_seq_ops},	\
> +#define DEADLINE_QUEUE_DDIR_ATTRS(name)					\
> +	{#name "_fifo_list", 0400,					\
> +			.seq_ops = &deadline_##name##_fifo_seq_ops}
> +#define DEADLINE_NEXT_RQ_ATTR(name)					\
>  	{#name "_next_rq", 0400, deadline_##name##_next_rq_show}
>  static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
> -	DEADLINE_QUEUE_DDIR_ATTRS(read),
> -	DEADLINE_QUEUE_DDIR_ATTRS(write),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read0),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write0),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read1),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write1),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read2),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write2),
> +	DEADLINE_NEXT_RQ_ATTR(read),
> +	DEADLINE_NEXT_RQ_ATTR(write),
>  	{"batching", 0400, deadline_batching_show},
>  	{"starved", 0400, deadline_starved_show},
>  	{"async_depth", 0400, dd_async_depth_show},
> -	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
> +	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
> +	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
> +	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
> +	{"owned_by_driver", 0400, dd_owned_by_driver_show},
> +	{"queued", 0400, dd_queued_show},
>  	{},
>  };
>  #undef DEADLINE_QUEUE_DDIR_ATTRS
> @@ -874,6 +1077,6 @@ static void __exit deadline_exit(void)
>  module_init(deadline_init);
>  module_exit(deadline_exit);
>  
> -MODULE_AUTHOR("Jens Axboe");
> +MODULE_AUTHOR("Jens Axboe, Damien Le Moal and Bart Van Assche");
>  MODULE_LICENSE("GPL");
>  MODULE_DESCRIPTION("MQ deadline IO scheduler");
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 13/14] block/mq-deadline: Add cgroup support
  2021-06-08 23:07 ` [PATCH 13/14] block/mq-deadline: Add cgroup support Bart Van Assche
@ 2021-06-10  6:37   ` Hannes Reinecke
  0 siblings, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:37 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:07 AM, Bart Van Assche wrote:
> Maintain statistics per cgroup and export these to user space. These
> statistics are essential for verifying whether the proper I/O priorities
> have been assigned to requests. With this patch applied the output of
> (cd /sys/fs/cgroup/blkio/ && ls) is as follows:
> blkio.dd.dispatched_pri0  blkio.dd.inserted_pri3  cgroup.clone_children
> blkio.dd.dispatched_pri1  blkio.dd.merged_pri0    cgroup.procs
> blkio.dd.dispatched_pri2  blkio.dd.merged_pri1    cgroup.sane_behavior
> blkio.dd.dispatched_pri3  blkio.dd.merged_pri2    notify_on_release
> blkio.dd.inserted_pri0    blkio.dd.merged_pri3    release_agent
> blkio.dd.inserted_pri1    blkio.prio.class        tasks
> blkio.dd.inserted_pri2    blkio.reset_stats
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/Kconfig.iosched                       |   6 +
>  block/Makefile                              |   2 +
>  block/mq-deadline-cgroup.c                  | 185 ++++++++++++++++++++
>  block/mq-deadline-cgroup.h                  | 112 ++++++++++++
>  block/{mq-deadline.c => mq-deadline-main.c} |  85 +++++++--
>  5 files changed, 376 insertions(+), 14 deletions(-)
>  create mode 100644 block/mq-deadline-cgroup.c
>  create mode 100644 block/mq-deadline-cgroup.h
>  rename block/{mq-deadline.c => mq-deadline-main.c} (94%)
> Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests
  2021-06-08 23:07 ` [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
  2021-06-09  5:10   ` Damien Le Moal
@ 2021-06-10  6:43   ` Hannes Reinecke
  1 sibling, 0 replies; 45+ messages in thread
From: Hannes Reinecke @ 2021-06-10  6:43 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 1:07 AM, Bart Van Assche wrote:
> While one or more requests with a certain I/O priority are pending, do not
> dispatch lower priority requests. Dispatch lower priority requests anyway
> after the "aging" time has expired.
> 
> This patch has been tested as follows:
> 
> modprobe scsi_debug ndelay=1000000 max_queue=16 &&
> sd='' &&
> while [ -z "$sd" ]; do
>   sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done &&
> echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
> cd /sys/fs/cgroup/blkio/ &&
> echo $$ >cgroup.procs &&
> echo 2 >blkio.prio.class &&
> mkdir -p hipri &&
> cd hipri &&
> echo 1 >blkio.prio.class &&
> { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
> echo $$ >cgroup.procs &&
> max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt
> 
> Result:
> * 11000 IOPS for the high-priority job
> *   400 IOPS for the low-priority job
> 
> If the aging expiry time is changed from 100s into 0, the IOPS results change
> into 6712 and 6796 IOPS.
> 
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline-main.c | 42 +++++++++++++++++++++++++++++++++++-----
>  1 file changed, 37 insertions(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH 04/14] block: Introduce the ioprio rq-qos policy
  2021-06-10  6:20   ` Hannes Reinecke
@ 2021-06-10 17:14     ` Bart Van Assche
  0 siblings, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2021-06-10 17:14 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Ming Lei, Johannes Thumshirn, Himanshu Madhani

On 6/9/21 11:20 PM, Hannes Reinecke wrote:
> On 6/9/21 1:06 AM, Bart Van Assche wrote:
>> +static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
>> +			       struct bio *bio)
>> +{
>> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_bio(bio);
>> +
>> +	/*
>> +	 * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
>> +	 * correspond to a lower priority. Hence, the max_t() below selects
>> +	 * the lower priority of bi_ioprio and the cgroup I/O priority class.
>> +	 * If the cgroup priority has been set to IOPRIO_CLASS_NONE == 0, the
>> +	 * bio I/O priority is not modified. If the bio I/O priority equals
>> +	 * IOPRIO_CLASS_NONE, the cgroup I/O priority is assigned to the bio.
>> +	 */
>> +	bio->bi_ioprio = max_t(u16, bio->bi_ioprio,
>> +			       IOPRIO_PRIO_VALUE(blkcg->prio_class, 0));
>> +}
> 
> Sheesh. Now that is cheeky.
> First defining a (conceptually) complex policy setting (where people
> wonder where these policies came from), which then devolve into a simple
> max() setting of the priority value.
> This _really_ could do with a better explanation in the documentation,
> as then it's far easier to understand _why_ certain policies override
> others.
> IE this comment belongs in the documentation, as explanation of the
> underlying mechanics of the ioprio policies.

Hi Hannes,

blkcg_ioprio_track() is called from the hot path so I want this function
to be fast. Since the desired behavior can be implemented with a max() I
chose max() instead of e.g. using a lookup table.

I will address your comments about the documentation and the code.

Bart.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2021-06-10 17:14 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-08 23:06 [PATCH 00/14] Improve I/O priority support Bart Van Assche
2021-06-08 23:06 ` [PATCH 01/14] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
2021-06-09  4:14   ` Damien Le Moal
2021-06-09  7:03   ` Johannes Thumshirn
2021-06-10  6:02   ` Hannes Reinecke
2021-06-08 23:06 ` [PATCH 02/14] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
2021-06-09  4:19   ` Damien Le Moal
2021-06-09  7:05   ` Johannes Thumshirn
2021-06-10  6:02   ` Hannes Reinecke
2021-06-08 23:06 ` [PATCH 03/14] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
2021-06-09  4:22   ` Damien Le Moal
2021-06-10  6:03   ` Hannes Reinecke
2021-06-08 23:06 ` [PATCH 04/14] block: Introduce the ioprio rq-qos policy Bart Van Assche
2021-06-09  4:40   ` Damien Le Moal
2021-06-09 16:48     ` Bart Van Assche
2021-06-10  0:29       ` Damien Le Moal
2021-06-10  6:20   ` Hannes Reinecke
2021-06-10 17:14     ` Bart Van Assche
2021-06-08 23:06 ` [PATCH 05/14] block/mq-deadline: Add several comments Bart Van Assche
2021-06-10  6:21   ` Hannes Reinecke
2021-06-08 23:06 ` [PATCH 06/14] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
2021-06-10  6:21   ` Hannes Reinecke
2021-06-08 23:06 ` [PATCH 07/14] block/mq-deadline: Remove two local variables Bart Van Assche
2021-06-10  6:22   ` Hannes Reinecke
2021-06-08 23:06 ` [PATCH 08/14] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
2021-06-08 23:06 ` [PATCH 09/14] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
2021-06-08 23:06 ` [PATCH 10/14] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
2021-06-09  4:46   ` Damien Le Moal
2021-06-09 16:54     ` Bart Van Assche
2021-06-10  0:31       ` Damien Le Moal
2021-06-09  7:11   ` Johannes Thumshirn
2021-06-09 16:59     ` Bart Van Assche
2021-06-10  6:23   ` Hannes Reinecke
2021-06-08 23:07 ` [PATCH 11/14] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests Bart Van Assche
2021-06-10  6:26   ` Hannes Reinecke
2021-06-08 23:07 ` [PATCH 12/14] block/mq-deadline: Add I/O priority support Bart Van Assche
2021-06-09  5:03   ` Damien Le Moal
2021-06-09 17:25     ` Bart Van Assche
2021-06-10  0:39       ` Damien Le Moal
2021-06-10  6:35   ` Hannes Reinecke
2021-06-08 23:07 ` [PATCH 13/14] block/mq-deadline: Add cgroup support Bart Van Assche
2021-06-10  6:37   ` Hannes Reinecke
2021-06-08 23:07 ` [PATCH 14/14] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
2021-06-09  5:10   ` Damien Le Moal
2021-06-10  6:43   ` Hannes Reinecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).