linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 00/16] Improve I/O priority support
@ 2021-06-18  0:44 Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
                   ` (16 more replies)
  0 siblings, 17 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche

Hi Jens,

A feature that is missing from the Linux kernel for storage devices that
support I/O priorities is to set the I/O priority in requests involving page
cache writeback. Since the identity of the process that triggers page cache
writeback is not known in the writeback code, the priority set by ioprio_set()
is ignored. However, an I/O cgroup is associated with writeback requests
by certain filesystems. Hence this patch series that implements the following
changes:
* Add an rq-qos policy that makes the I/O priority configurable per I/O cgroup
  and also that changes the I/O priority of requests to the lower of (request
  I/O priority, cgroup I/O priority).
* Introduce one queue per I/O priority in the mq-deadline scheduler.
* Dispatch the highest priority requests first.

Please consider this patch series for kernel v5.14.

Thanks,

Bart.

Changes compared to v2:
- For the blk-ioprio rq-qos policy, switched from numeric to textual policy
  names.
- Moved rq_qos_id_to_name() into debugfs code.
- Moved the mq-deadline I/O statistics into io.stat.
- Introduced the dd_per_prio data structure.
- Switched from a single sort list to one sort list per I/O priority.
- Removed the WARN_ON_ONCE(blkcg == NULL) statements.

Changes compared to v1:
- Moved the code for assigning an I/O priority into a new rq-qos policy.
- Dropped patch "block/mq-deadline: Reduce the read expiry time for
  non-rotational media".
- Made sure that dd->async_depth >= 1.
- Implemented an aging mechanism such that lower priority requests are not
  postponed forever.

Bart Van Assche (16):
  block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init()
    calls
  block/blk-rq-qos: Move a function from a header file into a C file
  block: Introduce the ioprio rq-qos policy
  block/mq-deadline: Add several comments
  block/mq-deadline: Add two lockdep_assert_held() statements
  block/mq-deadline: Remove two local variables
  block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  block/mq-deadline: Improve compile-time argument checking
  block/mq-deadline: Improve the sysfs show and store macros
  block/mq-deadline: Reserve 25% of scheduler tags for synchronous
    requests
  block/mq-deadline: Micro-optimize the batching algorithm
  block/mq-deadline: Add I/O priority support
  block/mq-deadline: Track I/O statistics
  block/mq-deadline: Add cgroup support
  block/mq-deadline: Prioritize high-priority requests

 Documentation/admin-guide/cgroup-v2.rst |   55 ++
 block/Kconfig                           |   19 +-
 block/Kconfig.iosched                   |    6 +
 block/Makefile                          |    3 +
 block/blk-cgroup.c                      |   14 +-
 block/blk-ioprio.c                      |  262 +++++
 block/blk-ioprio.h                      |   19 +
 block/blk-mq-debugfs.c                  |   15 +
 block/blk-rq-qos.h                      |   14 +-
 block/mq-deadline-cgroup.c              |  126 +++
 block/mq-deadline-cgroup.h              |  114 +++
 block/mq-deadline-main.c                | 1173 +++++++++++++++++++++++
 block/mq-deadline.c                     |  815 ----------------
 13 files changed, 1797 insertions(+), 838 deletions(-)
 create mode 100644 block/blk-ioprio.c
 create mode 100644 block/blk-ioprio.h
 create mode 100644 block/mq-deadline-cgroup.c
 create mode 100644 block/mq-deadline-cgroup.h
 create mode 100644 block/mq-deadline-main.c
 delete mode 100644 block/mq-deadline.c


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 17:09   ` Adam Manzanares
  2021-06-18 19:49   ` Himanshu Madhani
  2021-06-18  0:44 ` [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Ming Lei, Himanshu Madhani

These entries were consecutive at the time of their introduction but are no
longer consecutive. Make these again consecutive. Additionally, modify the
help text since it refers to blk-mq and since the legacy block layer has
been removed.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/Kconfig | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/Kconfig b/block/Kconfig
index a2297edfdde8..6685578b2a20 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -133,6 +133,13 @@ config BLK_WBT
 	dynamically on an algorithm loosely based on CoDel, factoring in
 	the realtime performance of the disk.
 
+config BLK_WBT_MQ
+	bool "Enable writeback throttling by default"
+	default y
+	depends on BLK_WBT
+	help
+	Enable writeback throttling by default for request-based block devices.
+
 config BLK_CGROUP_IOLATENCY
 	bool "Enable support for latency based cgroup IO protection"
 	depends on BLK_CGROUP=y
@@ -155,13 +162,6 @@ config BLK_CGROUP_IOCOST
 	distributes IO capacity between different groups based on
 	their share of the overall weight distribution.
 
-config BLK_WBT_MQ
-	bool "Multiqueue writeback throttling"
-	default y
-	depends on BLK_WBT
-	help
-	Enable writeback throttling by default on multiqueue devices.
-
 config BLK_DEBUG_FS
 	bool "Block layer debugging information in debugfs"
 	default y

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 17:15   ` Adam Manzanares
                     ` (2 more replies)
  2021-06-18  0:44 ` [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
                   ` (14 subsequent siblings)
  16 siblings, 3 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Tejun Heo, Ming Lei, Himanshu Madhani

Before adding more calls in this function, simplify the error path.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-cgroup.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d169e2055158..3b0f6efaa2b6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1183,15 +1183,14 @@ int blkcg_init_queue(struct request_queue *q)
 	if (preloaded)
 		radix_tree_preload_end();
 
-	ret = blk_throtl_init(q);
+	ret = blk_iolatency_init(q);
 	if (ret)
 		goto err_destroy_all;
 
-	ret = blk_iolatency_init(q);
-	if (ret) {
-		blk_throtl_exit(q);
+	ret = blk_throtl_init(q);
+	if (ret)
 		goto err_destroy_all;
-	}
+
 	return 0;
 
 err_destroy_all:

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 17:22   ` Adam Manzanares
  2021-06-18 19:49   ` Himanshu Madhani
  2021-06-18  0:44 ` [PATCH v3 04/16] block: Introduce the ioprio rq-qos policy Bart Van Assche
                   ` (13 subsequent siblings)
  16 siblings, 2 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

rq_qos_id_to_name() is only used in blk-mq-debugfs.c so move that function
into in blk-mq-debugfs.c.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-debugfs.c | 13 +++++++++++++
 block/blk-rq-qos.h     | 13 -------------
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 2a75bc7401df..6ac1c86f62ef 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -937,6 +937,19 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
 	q->sched_debugfs_dir = NULL;
 }
 
+static const char *rq_qos_id_to_name(enum rq_qos_id id)
+{
+	switch (id) {
+	case RQ_QOS_WBT:
+		return "wbt";
+	case RQ_QOS_LATENCY:
+		return "latency";
+	case RQ_QOS_COST:
+		return "cost";
+	}
+	return "unknown";
+}
+
 void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
 {
 	debugfs_remove_recursive(rqos->debugfs_dir);
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 2bcb3495e376..a77afbdd472c 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -79,19 +79,6 @@ static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
 	return rq_qos_id(q, RQ_QOS_LATENCY);
 }
 
-static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
-{
-	switch (id) {
-	case RQ_QOS_WBT:
-		return "wbt";
-	case RQ_QOS_LATENCY:
-		return "latency";
-	case RQ_QOS_COST:
-		return "cost";
-	}
-	return "unknown";
-}
-
 static inline void rq_wait_init(struct rq_wait *rq_wait)
 {
 	atomic_set(&rq_wait->inflight, 0);

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 04/16] block: Introduce the ioprio rq-qos policy
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (2 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 22:02   ` Adam Manzanares
  2021-06-18  0:44 ` [PATCH v3 05/16] block/mq-deadline: Add several comments Bart Van Assche
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

Introduce an rq-qos policy that assigns an I/O priority to requests based
on blk-cgroup configuration settings. This policy has the following
advantages over the ioprio_set() system call:
- This policy is cgroup based so it has all the advantages of cgroups.
- While ioprio_set() does not affect page cache writeback I/O, this rq-qos
  controller affects page cache writeback I/O for filesystems that support
  assiociating a cgroup with writeback I/O. See also
  Documentation/admin-guide/cgroup-v2.rst.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 Documentation/admin-guide/cgroup-v2.rst |  55 +++++
 block/Kconfig                           |   9 +
 block/Makefile                          |   1 +
 block/blk-cgroup.c                      |   5 +
 block/blk-ioprio.c                      | 262 ++++++++++++++++++++++++
 block/blk-ioprio.h                      |  19 ++
 block/blk-mq-debugfs.c                  |   2 +
 block/blk-rq-qos.h                      |   1 +
 8 files changed, 354 insertions(+)
 create mode 100644 block/blk-ioprio.c
 create mode 100644 block/blk-ioprio.h

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b1e81aa8598a..4e59925e6583 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -56,6 +56,7 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
        5-3-3. IO Latency
          5-3-3-1. How IO Latency Throttling Works
          5-3-3-2. IO Latency Interface Files
+       5-3-4. IO Priority
      5-4. PID
        5-4-1. PID Interface Files
      5-5. Cpuset
@@ -1866,6 +1867,60 @@ IO Latency Interface Files
 		duration of time between evaluation events.  Windows only elapse
 		with IO activity.  Idle periods extend the most recent window.
 
+IO Priority
+~~~~~~~~~~~
+
+A single attribute controls the behavior of the I/O priority cgroup policy,
+namely the blkio.prio.class attribute. The following values are accepted for
+that attribute:
+
+  no-change
+	Do not modify the I/O priority class.
+
+  none-to-rt
+	For requests that do not have an I/O priority class (NONE),
+	change the I/O priority class into RT. Do not modify
+	the I/O priority class of other requests.
+
+  restrict-to-be
+	For requests that do not have an I/O priority class or that have I/O
+	priority class RT, change it into BE. Do not modify the I/O priority
+	class of requests that have priority class IDLE.
+
+  idle
+	Change the I/O priority class of all requests into IDLE, the lowest
+	I/O priority class.
+
+The following numerical values are associated with the I/O priority policies:
+
++-------------+---+
+| no-change   | 0 |
++-------------+---+
+| none-to-rt  | 1 |
++-------------+---+
+| rt-to-be    | 2 |
++-------------+---+
+| all-to-idle | 3 |
++-------------+---+
+
+The numerical value that corresponds to each I/O priority class is as follows:
+
++-------------------------------+---+
+| IOPRIO_CLASS_NONE             | 0 |
++-------------------------------+---+
+| IOPRIO_CLASS_RT (real-time)   | 1 |
++-------------------------------+---+
+| IOPRIO_CLASS_BE (best effort) | 2 |
++-------------------------------+---+
+| IOPRIO_CLASS_IDLE             | 3 |
++-------------------------------+---+
+
+The algorithm to set the I/O priority class for a request is as follows:
+
+- Translate the I/O priority class policy into a number.
+- Change the request I/O priority class into the maximum of the I/O priority
+  class policy number and the numerical I/O priority class.
+
 PID
 ---
 
diff --git a/block/Kconfig b/block/Kconfig
index 6685578b2a20..e71c63eaaf52 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -162,6 +162,15 @@ config BLK_CGROUP_IOCOST
 	distributes IO capacity between different groups based on
 	their share of the overall weight distribution.
 
+config BLK_CGROUP_IOPRIO
+	bool "Cgroup I/O controller for assigning an I/O priority class"
+	depends on BLK_CGROUP
+	help
+	Enable the .prio interface for assigning an I/O priority class to
+	requests. The I/O priority class affects the order in which an I/O
+	scheduler and block devices process requests. Only some I/O schedulers
+	and some block devices support I/O priorities.
+
 config BLK_DEBUG_FS
 	bool "Block layer debugging information in debugfs"
 	default y
diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..af3d044abaf1 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -17,6 +17,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
 obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
 obj-$(CONFIG_BLK_CGROUP_RWSTAT)	+= blk-cgroup-rwstat.o
 obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
+obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3b0f6efaa2b6..7b06a5fa3cac 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -31,6 +31,7 @@
 #include <linux/tracehook.h>
 #include <linux/psi.h>
 #include "blk.h"
+#include "blk-ioprio.h"
 
 /*
  * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
@@ -1187,6 +1188,10 @@ int blkcg_init_queue(struct request_queue *q)
 	if (ret)
 		goto err_destroy_all;
 
+	ret = blk_ioprio_init(q);
+	if (ret)
+		goto err_destroy_all;
+
 	ret = blk_throtl_init(q);
 	if (ret)
 		goto err_destroy_all;
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
new file mode 100644
index 000000000000..332a07761bf8
--- /dev/null
+++ b/block/blk-ioprio.c
@@ -0,0 +1,262 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Block rq-qos policy for assigning an I/O priority class to requests.
+ *
+ * Using an rq-qos policy for assigning I/O priority class has two advantages
+ * over using the ioprio_set() system call:
+ *
+ * - This policy is cgroup based so it has all the advantages of cgroups.
+ * - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
+ *   controller affects page cache writeback I/O for filesystems that support
+ *   assiociating a cgroup with writeback I/O. See also
+ *   Documentation/admin-guide/cgroup-v2.rst.
+ */
+
+#include <linux/blk-cgroup.h>
+#include <linux/blk-mq.h>
+#include <linux/blk_types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include "blk-ioprio.h"
+#include "blk-rq-qos.h"
+
+/**
+ * enum prio_policy - I/O priority class policy.
+ * @POLICY_NO_CHANGE: (default) do not modify the I/O priority class.
+ * @POLICY_NONE_TO_RT: modify IOPRIO_CLASS_NONE into IOPRIO_CLASS_RT.
+ * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
+ *		IOPRIO_CLASS_BE.
+ * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
+ *
+ * See also <linux/ioprio.h>.
+ */
+enum prio_policy {
+	POLICY_NO_CHANGE	= 0,
+	POLICY_NONE_TO_RT	= 1,
+	POLICY_RESTRICT_TO_BE	= 2,
+	POLICY_ALL_TO_IDLE	= 3,
+};
+
+static const char *policy_name[] = {
+	[POLICY_NO_CHANGE]	= "no-change",
+	[POLICY_NONE_TO_RT]	= "none-to-rt",
+	[POLICY_RESTRICT_TO_BE]	= "restrict-to-be",
+	[POLICY_ALL_TO_IDLE]	= "idle",
+};
+
+static struct blkcg_policy ioprio_policy;
+
+/**
+ * struct ioprio_blkg - Per (cgroup, request queue) data.
+ * @pd: blkg_policy_data structure.
+ */
+struct ioprio_blkg {
+	struct blkg_policy_data pd;
+};
+
+/**
+ * struct ioprio_blkcg - Per cgroup data.
+ * @cpd: blkcg_policy_data structure.
+ * @prio_policy: One of the IOPRIO_CLASS_* values. See also <linux/ioprio.h>.
+ */
+struct ioprio_blkcg {
+	struct blkcg_policy_data cpd;
+	enum prio_policy	 prio_policy;
+};
+
+static inline struct ioprio_blkg *pd_to_ioprio(struct blkg_policy_data *pd)
+{
+	return pd ? container_of(pd, struct ioprio_blkg, pd) : NULL;
+}
+
+static struct ioprio_blkcg *blkcg_to_ioprio_blkcg(struct blkcg *blkcg)
+{
+	return container_of(blkcg_to_cpd(blkcg, &ioprio_policy),
+			    struct ioprio_blkcg, cpd);
+}
+
+static struct ioprio_blkcg *
+ioprio_blkcg_from_css(struct cgroup_subsys_state *css)
+{
+	return blkcg_to_ioprio_blkcg(css_to_blkcg(css));
+}
+
+static struct ioprio_blkcg *ioprio_blkcg_from_bio(struct bio *bio)
+{
+	struct blkg_policy_data *pd = blkg_to_pd(bio->bi_blkg, &ioprio_policy);
+
+	if (!pd)
+		return NULL;
+
+	return blkcg_to_ioprio_blkcg(pd->blkg->blkcg);
+}
+
+static int ioprio_show_prio_policy(struct seq_file *sf, void *v)
+{
+	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(seq_css(sf));
+
+	seq_printf(sf, "%s\n", policy_name[blkcg->prio_policy]);
+	return 0;
+}
+
+static ssize_t ioprio_set_prio_policy(struct kernfs_open_file *of, char *buf,
+				      size_t nbytes, loff_t off)
+{
+	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(of_css(of));
+	int ret;
+
+	if (off != 0)
+		return -EIO;
+	/* kernfs_fop_write_iter() terminates 'buf' with '\0'. */
+	ret = sysfs_match_string(policy_name, buf);
+	if (ret < 0)
+		return ret;
+	blkcg->prio_policy = ret;
+
+	return nbytes;
+}
+
+static struct blkg_policy_data *
+ioprio_alloc_pd(gfp_t gfp, struct request_queue *q, struct blkcg *blkcg)
+{
+	struct ioprio_blkg *ioprio_blkg;
+
+	ioprio_blkg = kzalloc(sizeof(*ioprio_blkg), gfp);
+	if (!ioprio_blkg)
+		return NULL;
+
+	return &ioprio_blkg->pd;
+}
+
+static void ioprio_free_pd(struct blkg_policy_data *pd)
+{
+	struct ioprio_blkg *ioprio_blkg = pd_to_ioprio(pd);
+
+	kfree(ioprio_blkg);
+}
+
+static struct blkcg_policy_data *ioprio_alloc_cpd(gfp_t gfp)
+{
+	struct ioprio_blkcg *blkcg;
+
+	blkcg = kzalloc(sizeof(*blkcg), gfp);
+	if (!blkcg)
+		return NULL;
+	blkcg->prio_policy = POLICY_NO_CHANGE;
+	return &blkcg->cpd;
+}
+
+static void ioprio_free_cpd(struct blkcg_policy_data *cpd)
+{
+	struct ioprio_blkcg *blkcg = container_of(cpd, typeof(*blkcg), cpd);
+
+	kfree(blkcg);
+}
+
+#define IOPRIO_ATTRS						\
+	{							\
+		.name		= "prio.class",			\
+		.seq_show	= ioprio_show_prio_policy,	\
+		.write		= ioprio_set_prio_policy,	\
+	},							\
+	{ } /* sentinel */
+
+/* cgroup v2 attributes */
+static struct cftype ioprio_files[] = {
+	IOPRIO_ATTRS
+};
+
+/* cgroup v1 attributes */
+static struct cftype ioprio_legacy_files[] = {
+	IOPRIO_ATTRS
+};
+
+static struct blkcg_policy ioprio_policy = {
+	.dfl_cftypes	= ioprio_files,
+	.legacy_cftypes = ioprio_legacy_files,
+
+	.cpd_alloc_fn	= ioprio_alloc_cpd,
+	.cpd_free_fn	= ioprio_free_cpd,
+
+	.pd_alloc_fn	= ioprio_alloc_pd,
+	.pd_free_fn	= ioprio_free_pd,
+};
+
+struct blk_ioprio {
+	struct rq_qos rqos;
+};
+
+static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
+			       struct bio *bio)
+{
+	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_bio(bio);
+
+	/*
+	 * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
+	 * correspond to a lower priority. Hence, the max_t() below selects
+	 * the lower priority of bi_ioprio and the cgroup I/O priority class.
+	 * If the cgroup policy has been set to POLICY_NO_CHANGE == 0, the
+	 * bio I/O priority is not modified. If the bio I/O priority equals
+	 * IOPRIO_CLASS_NONE, the cgroup I/O priority is assigned to the bio.
+	 */
+	bio->bi_ioprio = max_t(u16, bio->bi_ioprio,
+			       IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0));
+}
+
+static void blkcg_ioprio_exit(struct rq_qos *rqos)
+{
+	struct blk_ioprio *blkioprio_blkg =
+		container_of(rqos, typeof(*blkioprio_blkg), rqos);
+
+	blkcg_deactivate_policy(rqos->q, &ioprio_policy);
+	kfree(blkioprio_blkg);
+}
+
+static struct rq_qos_ops blkcg_ioprio_ops = {
+	.track	= blkcg_ioprio_track,
+	.exit	= blkcg_ioprio_exit,
+};
+
+int blk_ioprio_init(struct request_queue *q)
+{
+	struct blk_ioprio *blkioprio_blkg;
+	struct rq_qos *rqos;
+	int ret;
+
+	blkioprio_blkg = kzalloc(sizeof(*blkioprio_blkg), GFP_KERNEL);
+	if (!blkioprio_blkg)
+		return -ENOMEM;
+
+	ret = blkcg_activate_policy(q, &ioprio_policy);
+	if (ret) {
+		kfree(blkioprio_blkg);
+		return ret;
+	}
+
+	rqos = &blkioprio_blkg->rqos;
+	rqos->id = RQ_QOS_IOPRIO;
+	rqos->ops = &blkcg_ioprio_ops;
+	rqos->q = q;
+
+	/*
+	 * Registering the rq-qos policy after activating the blk-cgroup
+	 * policy guarantees that ioprio_blkcg_from_bio(bio) != NULL in the
+	 * rq-qos callbacks.
+	 */
+	rq_qos_add(q, rqos);
+
+	return 0;
+}
+
+static int __init ioprio_init(void)
+{
+	return blkcg_policy_register(&ioprio_policy);
+}
+
+static void __exit ioprio_exit(void)
+{
+	blkcg_policy_unregister(&ioprio_policy);
+}
+
+module_init(ioprio_init);
+module_exit(ioprio_exit);
diff --git a/block/blk-ioprio.h b/block/blk-ioprio.h
new file mode 100644
index 000000000000..a7785c2f1aea
--- /dev/null
+++ b/block/blk-ioprio.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _BLK_IOPRIO_H_
+#define _BLK_IOPRIO_H_
+
+#include <linux/kconfig.h>
+
+struct request_queue;
+
+#ifdef CONFIG_BLK_CGROUP_IOPRIO
+int blk_ioprio_init(struct request_queue *q);
+#else
+static inline int blk_ioprio_init(struct request_queue *q)
+{
+	return 0;
+}
+#endif
+
+#endif /* _BLK_IOPRIO_H_ */
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6ac1c86f62ef..4b66d2776eda 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -946,6 +946,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
 		return "latency";
 	case RQ_QOS_COST:
 		return "cost";
+	case RQ_QOS_IOPRIO:
+		return "ioprio";
 	}
 	return "unknown";
 }
diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index a77afbdd472c..f000f83e0621 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -17,6 +17,7 @@ enum rq_qos_id {
 	RQ_QOS_WBT,
 	RQ_QOS_LATENCY,
 	RQ_QOS_COST,
+	RQ_QOS_IOPRIO,
 };
 
 struct rq_wait {

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 05/16] block/mq-deadline: Add several comments
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (3 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 04/16] block: Introduce the ioprio rq-qos policy Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 22:06   ` Adam Manzanares
  2021-06-18  0:44 ` [PATCH v3 06/16] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Johannes Thumshirn,
	Himanshu Madhani, Hannes Reinecke, Ming Lei

Make the code easier to read by adding more comments.

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 8eea2cbf2bf4..31418e9ce9e2 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -139,6 +139,9 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 	}
 }
 
+/*
+ * Callback function that is invoked after @next has been merged into @req.
+ */
 static void dd_merged_requests(struct request_queue *q, struct request *req,
 			       struct request *next)
 {
@@ -375,6 +378,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 }
 
 /*
+ * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
+ *
  * One confusing aspect here is that we get called for a specific
  * hardware queue, but we may return a request that is for a
  * different hardware queue. This is because mq-deadline has shared
@@ -438,6 +443,10 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
 	return 0;
 }
 
+/*
+ * Try to merge @bio into an existing request. If @bio has been merged into
+ * an existing request, store the pointer to that request into *@rq.
+ */
 static int dd_request_merge(struct request_queue *q, struct request **rq,
 			    struct bio *bio)
 {
@@ -461,6 +470,10 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
 	return ELEVATOR_NO_MERGE;
 }
 
+/*
+ * Attempt to merge a bio into an existing request. This function is called
+ * before @bio is associated with a request.
+ */
 static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs)
 {
@@ -518,6 +531,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	}
 }
 
+/*
+ * Called from blk_mq_sched_insert_request() or blk_mq_sched_insert_requests().
+ */
 static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 			       struct list_head *list, bool at_head)
 {
@@ -544,6 +560,8 @@ static void dd_prepare_request(struct request *rq)
 }
 
 /*
+ * Callback from inside blk_mq_free_request().
+ *
  * For zoned block devices, write unlock the target zone of
  * completed write requests. Do this while holding the zone lock
  * spinlock so that the zone is never unlocked while deadline_fifo_request()

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 06/16] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (4 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 05/16] block/mq-deadline: Add several comments Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 22:09   ` Adam Manzanares
  2021-06-18  0:44 ` [PATCH v3 07/16] block/mq-deadline: Remove two local variables Bart Van Assche
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal,
	Hannes Reinecke, Johannes Thumshirn, Himanshu Madhani, Ming Lei

Document the locking strategy by adding two lockdep_assert_held()
statements.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 31418e9ce9e2..191ff5ce629c 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -279,6 +279,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	bool reads, writes;
 	int data_dir;
 
+	lockdep_assert_held(&dd->lock);
+
 	if (!list_empty(&dd->dispatch)) {
 		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
 		list_del_init(&rq->queuelist);
@@ -501,6 +503,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
 
+	lockdep_assert_held(&dd->lock);
+
 	/*
 	 * This may be a requeue of a write request that has locked its
 	 * target zone. If it is the case, this releases the zone lock.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 07/16] block/mq-deadline: Remove two local variables
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (5 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 06/16] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 22:16   ` Adam Manzanares
  2021-06-18  0:44 ` [PATCH v3 08/16] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal,
	Hannes Reinecke, Johannes Thumshirn, Himanshu Madhani, Ming Lei

Make __dd_dispatch_request() easier to read by removing two local
variables.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 191ff5ce629c..caa438f62a4d 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -276,7 +276,6 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
 static struct request *__dd_dispatch_request(struct deadline_data *dd)
 {
 	struct request *rq, *next_rq;
-	bool reads, writes;
 	int data_dir;
 
 	lockdep_assert_held(&dd->lock);
@@ -287,9 +286,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 		goto done;
 	}
 
-	reads = !list_empty(&dd->fifo_list[READ]);
-	writes = !list_empty(&dd->fifo_list[WRITE]);
-
 	/*
 	 * batches are currently reads XOR writes
 	 */
@@ -306,7 +302,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (reads) {
+	if (!list_empty(&dd->fifo_list[READ])) {
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
 
 		if (deadline_fifo_request(dd, WRITE) &&
@@ -322,7 +318,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (writes) {
+	if (!list_empty(&dd->fifo_list[WRITE])) {
 dispatch_writes:
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 08/16] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (6 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 07/16] block/mq-deadline: Remove two local variables Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 22:18   ` Adam Manzanares
  2021-06-18  0:44 ` [PATCH v3 09/16] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Chaitanya Kulkarni, Damien Le Moal,
	Hannes Reinecke, Johannes Thumshirn, Himanshu Madhani, Ming Lei

Change "queue" into "sched" to make the function names reflect better the
purpose of these functions.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index caa438f62a4d..d823ba7cb084 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -395,7 +395,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-static void dd_exit_queue(struct elevator_queue *e)
+static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
@@ -408,7 +408,7 @@ static void dd_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
@@ -800,8 +800,8 @@ static struct elevator_type mq_deadline = {
 		.requests_merged	= dd_merged_requests,
 		.request_merged		= dd_request_merged,
 		.has_work		= dd_has_work,
-		.init_sched		= dd_init_queue,
-		.exit_sched		= dd_exit_queue,
+		.init_sched		= dd_init_sched,
+		.exit_sched		= dd_exit_sched,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 09/16] block/mq-deadline: Improve compile-time argument checking
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (7 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 08/16] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 22:30   ` Adam Manzanares
  2021-06-18  0:44 ` [PATCH v3 10/16] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Chaitanya Kulkarni, Hannes Reinecke,
	Johannes Thumshirn, Himanshu Madhani, Damien Le Moal, Ming Lei

Modern compilers complain if an out-of-range value is passed to a function
argument that has an enumeration type. Let the compiler detect out-of-range
data direction arguments instead of verifying the data_dir argument at
runtime.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 96 +++++++++++++++++++++++----------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index d823ba7cb084..69126beff77d 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -35,6 +35,13 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
+enum dd_data_dir {
+	DD_READ		= READ,
+	DD_WRITE	= WRITE,
+};
+
+enum { DD_DIR_COUNT = 2 };
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -43,20 +50,20 @@ struct deadline_data {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];
-	struct list_head fifo_list[2];
+	struct rb_root sort_list[DD_DIR_COUNT];
+	struct list_head fifo_list[DD_DIR_COUNT];
 
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
-	struct request *next_rq[2];
+	struct request *next_rq[DD_DIR_COUNT];
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
-	int fifo_expire[2];
+	int fifo_expire[DD_DIR_COUNT];
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
@@ -97,7 +104,7 @@ deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
 static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
 	if (dd->next_rq[data_dir] == rq)
 		dd->next_rq[data_dir] = deadline_latter_request(rq);
@@ -169,10 +176,10 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
+	dd->next_rq[DD_READ] = NULL;
+	dd->next_rq[DD_WRITE] = NULL;
 	dd->next_rq[data_dir] = deadline_latter_request(rq);
 
 	/*
@@ -185,9 +192,10 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_data *dd,
+				      enum dd_data_dir data_dir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
 
 	/*
 	 * rq is expired!
@@ -203,19 +211,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
  * dispatch using arrival ordered lists.
  */
 static struct request *
-deadline_fifo_request(struct deadline_data *dd, int data_dir)
+deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
-		return NULL;
-
 	if (list_empty(&dd->fifo_list[data_dir]))
 		return NULL;
 
 	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
-	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
 	/*
@@ -223,7 +228,7 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
 	 * an unlocked target zone.
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
+	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
 		if (blk_req_can_dispatch_to_zone(rq))
 			goto out;
 	}
@@ -239,19 +244,16 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
  * dispatch using sector position sorted lists.
  */
 static struct request *
-deadline_next_request(struct deadline_data *dd, int data_dir)
+deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
-		return NULL;
-
 	rq = dd->next_rq[data_dir];
 	if (!rq)
 		return NULL;
 
-	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
 	/*
@@ -276,7 +278,7 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
 static struct request *__dd_dispatch_request(struct deadline_data *dd)
 {
 	struct request *rq, *next_rq;
-	int data_dir;
+	enum dd_data_dir data_dir;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -289,9 +291,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	rq = deadline_next_request(dd, WRITE);
+	rq = deadline_next_request(dd, DD_WRITE);
 	if (!rq)
-		rq = deadline_next_request(dd, READ);
+		rq = deadline_next_request(dd, DD_READ);
 
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
@@ -302,14 +304,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (!list_empty(&dd->fifo_list[READ])) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+	if (!list_empty(&dd->fifo_list[DD_READ])) {
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
 
-		if (deadline_fifo_request(dd, WRITE) &&
+		if (deadline_fifo_request(dd, DD_WRITE) &&
 		    (dd->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
-		data_dir = READ;
+		data_dir = DD_READ;
 
 		goto dispatch_find_request;
 	}
@@ -318,13 +320,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (!list_empty(&dd->fifo_list[WRITE])) {
+	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
 
 		dd->starved = 0;
 
-		data_dir = WRITE;
+		data_dir = DD_WRITE;
 
 		goto dispatch_find_request;
 	}
@@ -399,8 +401,8 @@ static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
+	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
 
 	kfree(dd);
 }
@@ -424,12 +426,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	}
 	eq->elevator_data = dd;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
-	dd->fifo_expire[READ] = read_expire;
-	dd->fifo_expire[WRITE] = write_expire;
+	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
+	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
+	dd->sort_list[DD_READ] = RB_ROOT;
+	dd->sort_list[DD_WRITE] = RB_ROOT;
+	dd->fifo_expire[DD_READ] = read_expire;
+	dd->fifo_expire[DD_WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
@@ -497,7 +499,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 {
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
 	lockdep_assert_held(&dd->lock);
 
@@ -585,7 +587,7 @@ static void dd_finish_request(struct request *rq)
 
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
-		if (!list_empty(&dd->fifo_list[WRITE]))
+		if (!list_empty(&dd->fifo_list[DD_WRITE]))
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
@@ -626,8 +628,8 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 		__data = jiffies_to_msecs(__data);			\
 	return deadline_var_show(__data, (page));			\
 }
-SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
-SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
 SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
 SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
 SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
@@ -649,8 +651,8 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
 		*(__PTR) = __data;					\
 	return count;							\
 }
-STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
 STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
 STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
 STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
@@ -717,8 +719,8 @@ static int deadline_##name##_next_rq_show(void *data,			\
 		__blk_mq_debugfs_rq_show(m, rq);			\
 	return 0;							\
 }
-DEADLINE_DEBUGFS_DDIR_ATTRS(READ, read)
-DEADLINE_DEBUGFS_DDIR_ATTRS(WRITE, write)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
 #undef DEADLINE_DEBUGFS_DDIR_ATTRS
 
 static int deadline_batching_show(void *data, struct seq_file *m)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 10/16] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (8 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 09/16] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18 23:07   ` Adam Manzanares
  2021-06-18  0:44 ` [PATCH v3 11/16] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests Bart Van Assche
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

Define separate macros for integers and jiffies to improve readability.
Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol().
The former functions are the recommended functions.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 66 ++++++++++++++++++++-------------------------
 1 file changed, 29 insertions(+), 37 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 69126beff77d..f92224ff0256 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -605,58 +605,50 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 /*
  * sysfs parts below
  */
-static ssize_t
-deadline_var_show(int var, char *page)
-{
-	return sprintf(page, "%d\n", var);
-}
-
-static void
-deadline_var_store(int *var, const char *page)
-{
-	char *p = (char *) page;
-
-	*var = simple_strtol(p, &p, 10);
-}
-
-#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
+#define SHOW_INT(__FUNC, __VAR)						\
 static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 {									\
 	struct deadline_data *dd = e->elevator_data;			\
-	int __data = __VAR;						\
-	if (__CONV)							\
-		__data = jiffies_to_msecs(__data);			\
-	return deadline_var_show(__data, (page));			\
-}
-SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
-SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
-SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
-SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
-SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
-#undef SHOW_FUNCTION
+									\
+	return sysfs_emit(page, "%d\n", __VAR);				\
+}
+#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
+SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
+SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
+SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
+SHOW_INT(deadline_front_merges_show, dd->front_merges);
+SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
+#undef SHOW_INT
+#undef SHOW_JIFFIES
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
 static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
 {									\
 	struct deadline_data *dd = e->elevator_data;			\
-	int __data;							\
-	deadline_var_store(&__data, (page));				\
+	int __data, __ret;						\
+									\
+	__ret = kstrtoint(page, 0, &__data);				\
+	if (__ret < 0)							\
+		return __ret;						\
 	if (__data < (MIN))						\
 		__data = (MIN);						\
 	else if (__data > (MAX))					\
 		__data = (MAX);						\
-	if (__CONV)							\
-		*(__PTR) = msecs_to_jiffies(__data);			\
-	else								\
-		*(__PTR) = __data;					\
+	*(__PTR) = __CONV(__data);					\
 	return count;							\
 }
-STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
-STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
-STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
+#define STORE_INT(__FUNC, __PTR, MIN, MAX)				\
+	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, )
+#define STORE_JIFFIES(__FUNC, __PTR, MIN, MAX)				\
+	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
+STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
+STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
+STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
+STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
+STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
 #undef STORE_FUNCTION
+#undef STORE_INT
+#undef STORE_JIFFIES
 
 #define DD_ATTR(name) \
 	__ATTR(name, 0644, deadline_##name##_show, deadline_##name##_store)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 11/16] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (9 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 10/16] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 12/16] block/mq-deadline: Micro-optimize the batching algorithm Bart Van Assche
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

For interactive workloads it is important that synchronous requests are
not delayed. Hence reserve 25% of scheduler tags for synchronous requests.
This patch still allows asynchronous requests to fill the hardware queues
since blk_mq_init_sched() makes sure that the number of scheduler requests
is the double of the hardware queue depth. From blk_mq_init_sched():

	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
				   BLKDEV_MAX_RQ);

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 55 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f92224ff0256..44da481c3fea 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -67,6 +67,7 @@ struct deadline_data {
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
+	u32 async_depth;
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
@@ -397,6 +398,44 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
+/*
+ * Called by __blk_mq_alloc_request(). The shallow_depth value set by this
+ * function is used by __blk_mq_get_tag().
+ */
+static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
+{
+	struct deadline_data *dd = data->q->elevator->elevator_data;
+
+	/* Do not throttle synchronous reads. */
+	if (op_is_sync(op) && !op_is_write(op))
+		return;
+
+	/*
+	 * Throttle asynchronous requests and writes such that these requests
+	 * do not block the allocation of synchronous requests.
+	 */
+	data->shallow_depth = dd->async_depth;
+}
+
+/* Called by blk_mq_update_nr_requests(). */
+static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct blk_mq_tags *tags = hctx->sched_tags;
+
+	dd->async_depth = max(1UL, 3 * q->nr_requests / 4);
+
+	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, dd->async_depth);
+}
+
+/* Called by blk_mq_init_hctx() and blk_mq_init_sched(). */
+static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	dd_depth_updated(hctx);
+	return 0;
+}
+
 static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
@@ -617,6 +656,7 @@ SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
 SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
 SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
 SHOW_INT(deadline_front_merges_show, dd->front_merges);
+SHOW_INT(deadline_async_depth_show, dd->front_merges);
 SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
 #undef SHOW_INT
 #undef SHOW_JIFFIES
@@ -645,6 +685,7 @@ STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX)
 STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
 STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
 STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
+STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
 STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
 #undef STORE_FUNCTION
 #undef STORE_INT
@@ -658,6 +699,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(write_expire),
 	DD_ATTR(writes_starved),
 	DD_ATTR(front_merges),
+	DD_ATTR(async_depth),
 	DD_ATTR(fifo_batch),
 	__ATTR_NULL
 };
@@ -733,6 +775,15 @@ static int deadline_starved_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int dd_async_depth_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	seq_printf(m, "%u\n", dd->async_depth);
+	return 0;
+}
+
 static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
 	__acquires(&dd->lock)
 {
@@ -775,6 +826,7 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 	DEADLINE_QUEUE_DDIR_ATTRS(write),
 	{"batching", 0400, deadline_batching_show},
 	{"starved", 0400, deadline_starved_show},
+	{"async_depth", 0400, dd_async_depth_show},
 	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
 	{},
 };
@@ -783,6 +835,8 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 
 static struct elevator_type mq_deadline = {
 	.ops = {
+		.depth_updated		= dd_depth_updated,
+		.limit_depth		= dd_limit_depth,
 		.insert_requests	= dd_insert_requests,
 		.dispatch_request	= dd_dispatch_request,
 		.prepare_request	= dd_prepare_request,
@@ -796,6 +850,7 @@ static struct elevator_type mq_deadline = {
 		.has_work		= dd_has_work,
 		.init_sched		= dd_init_sched,
 		.exit_sched		= dd_exit_sched,
+		.init_hctx		= dd_init_hctx,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 12/16] block/mq-deadline: Micro-optimize the batching algorithm
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (10 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 11/16] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 13/16] block/mq-deadline: Add I/O priority support Bart Van Assche
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

When dispatching the first request of a batch, the deadline_move_request()
call clears .next_rq[] for the opposite data direction. .next_rq[] is not
restored when changing data direction. Fix this by not clearing .next_rq[]
and by keeping track of the data direction of a batch in a variable instead.

This patch is a micro-optimization because:
- The number of deadline_next_request() calls for the read direction is
  halved.
- The number of times that deadline_next_request() returns NULL is reduced.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 44da481c3fea..b09ae1f332a2 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -53,6 +53,8 @@ struct deadline_data {
 	struct rb_root sort_list[DD_DIR_COUNT];
 	struct list_head fifo_list[DD_DIR_COUNT];
 
+	/* Data direction of latest dispatched request. */
+	enum dd_data_dir last_dir;
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
@@ -179,8 +181,6 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
 	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
-	dd->next_rq[DD_READ] = NULL;
-	dd->next_rq[DD_WRITE] = NULL;
 	dd->next_rq[data_dir] = deadline_latter_request(rq);
 
 	/*
@@ -292,10 +292,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	rq = deadline_next_request(dd, DD_WRITE);
-	if (!rq)
-		rq = deadline_next_request(dd, DD_READ);
-
+	rq = deadline_next_request(dd, dd->last_dir);
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
@@ -361,6 +358,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	if (!rq)
 		return NULL;
 
+	dd->last_dir = data_dir;
 	dd->batching = 0;
 
 dispatch_request:
@@ -473,6 +471,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_expire[DD_WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;
+	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 13/16] block/mq-deadline: Add I/O priority support
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (11 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 12/16] block/mq-deadline: Micro-optimize the batching algorithm Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 14/16] block/mq-deadline: Track I/O statistics Bart Van Assche
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE
and IDLE. Maintain statistics for each priority level. Split the debugfs
attributes per priority level as follows:

$ ls /sys/kernel/debug/block/.../sched/
async_depth  dispatch2        read_next_rq      write2_fifo_list
batching     read0_fifo_list  starved           write_next_rq
dispatch0    read1_fifo_list  write0_fifo_list
dispatch1    read2_fifo_list  write1_fifo_list

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 342 +++++++++++++++++++++++++++++---------------
 1 file changed, 228 insertions(+), 114 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b09ae1f332a2..aba672a5be1e 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -42,23 +42,36 @@ enum dd_data_dir {
 
 enum { DD_DIR_COUNT = 2 };
 
+enum dd_prio {
+	DD_RT_PRIO	= 0,
+	DD_BE_PRIO	= 1,
+	DD_IDLE_PRIO	= 2,
+	DD_PRIO_MAX	= 2,
+};
+
+enum { DD_PRIO_COUNT = 3 };
+
+/*
+ * Deadline scheduler data per I/O priority (enum dd_prio). Requests are
+ * present on both sort_list[] and fifo_list[].
+ */
+struct dd_per_prio {
+	struct list_head dispatch;
+	struct rb_root sort_list[DD_DIR_COUNT];
+	struct list_head fifo_list[DD_DIR_COUNT];
+	/* Next request in FIFO order. Read, write or both are NULL. */
+	struct request *next_rq[DD_DIR_COUNT];
+};
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
 
-	/*
-	 * requests (deadline_rq s) are present on both sort_list and fifo_list
-	 */
-	struct rb_root sort_list[DD_DIR_COUNT];
-	struct list_head fifo_list[DD_DIR_COUNT];
+	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
 	enum dd_data_dir last_dir;
-	/*
-	 * next in sort order. read, write or both are NULL
-	 */
-	struct request *next_rq[DD_DIR_COUNT];
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
@@ -73,13 +86,29 @@ struct deadline_data {
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
-	struct list_head dispatch;
+};
+
+/* Maps an I/O priority class to a deadline scheduler priority. */
+static const enum dd_prio ioprio_class_to_prio[] = {
+	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
+	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
+	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
+	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
 };
 
 static inline struct rb_root *
-deadline_rb_root(struct deadline_data *dd, struct request *rq)
+deadline_rb_root(struct dd_per_prio *per_prio, struct request *rq)
 {
-	return &dd->sort_list[rq_data_dir(rq)];
+	return &per_prio->sort_list[rq_data_dir(rq)];
+}
+
+/*
+ * Returns the I/O priority class (IOPRIO_CLASS_*) that has been assigned to a
+ * request.
+ */
+static u8 dd_rq_ioclass(struct request *rq)
+{
+	return IOPRIO_PRIO_CLASS(req_get_ioprio(rq));
 }
 
 /*
@@ -97,38 +126,38 @@ deadline_latter_request(struct request *rq)
 }
 
 static void
-deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
+deadline_add_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
 {
-	struct rb_root *root = deadline_rb_root(dd, rq);
+	struct rb_root *root = deadline_rb_root(per_prio, rq);
 
 	elv_rb_add(root, rq);
 }
 
 static inline void
-deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
+deadline_del_rq_rb(struct dd_per_prio *per_prio, struct request *rq)
 {
 	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
-	if (dd->next_rq[data_dir] == rq)
-		dd->next_rq[data_dir] = deadline_latter_request(rq);
+	if (per_prio->next_rq[data_dir] == rq)
+		per_prio->next_rq[data_dir] = deadline_latter_request(rq);
 
-	elv_rb_del(deadline_rb_root(dd, rq), rq);
+	elv_rb_del(deadline_rb_root(per_prio, rq), rq);
 }
 
 /*
  * remove rq from rbtree and fifo.
  */
-static void deadline_remove_request(struct request_queue *q, struct request *rq)
+static void deadline_remove_request(struct request_queue *q,
+				    struct dd_per_prio *per_prio,
+				    struct request *rq)
 {
-	struct deadline_data *dd = q->elevator->elevator_data;
-
 	list_del_init(&rq->queuelist);
 
 	/*
 	 * We might not be on the rbtree, if we are doing an insert merge
 	 */
 	if (!RB_EMPTY_NODE(&rq->rb_node))
-		deadline_del_rq_rb(dd, rq);
+		deadline_del_rq_rb(per_prio, rq);
 
 	elv_rqhash_del(q, rq);
 	if (q->last_merge == rq)
@@ -139,13 +168,16 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 			      enum elv_merge type)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
+	const u8 ioprio_class = dd_rq_ioclass(req);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
 	/*
 	 * if the merge was a front merge, we need to reposition request
 	 */
 	if (type == ELEVATOR_FRONT_MERGE) {
-		elv_rb_del(deadline_rb_root(dd, req), req);
-		deadline_add_rq_rb(dd, req);
+		elv_rb_del(deadline_rb_root(per_prio, req), req);
+		deadline_add_rq_rb(per_prio, req);
 	}
 }
 
@@ -155,6 +187,9 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 static void dd_merged_requests(struct request_queue *q, struct request *req,
 			       struct request *next)
 {
+	const u8 ioprio_class = dd_rq_ioclass(next);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+
 	/*
 	 * if next expires before rq, assign its expire time to rq
 	 * and move into next position (next will be deleted) in fifo
@@ -170,33 +205,34 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 	/*
 	 * kill knowledge of next, this one is a goner
 	 */
-	deadline_remove_request(q, next);
+	deadline_remove_request(q, &dd->per_prio[prio], next);
 }
 
 /*
  * move an entry to dispatch queue
  */
 static void
-deadline_move_request(struct deadline_data *dd, struct request *rq)
+deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
+		      struct request *rq)
 {
 	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
-	dd->next_rq[data_dir] = deadline_latter_request(rq);
+	per_prio->next_rq[data_dir] = deadline_latter_request(rq);
 
 	/*
 	 * take it off the sort and fifo list
 	 */
-	deadline_remove_request(rq->q, rq);
+	deadline_remove_request(rq->q, per_prio, rq);
 }
 
 /*
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd,
+static inline int deadline_check_fifo(struct dd_per_prio *per_prio,
 				      enum dd_data_dir data_dir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	struct request *rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
 
 	/*
 	 * rq is expired!
@@ -212,15 +248,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd,
  * dispatch using arrival ordered lists.
  */
 static struct request *
-deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
+deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
+		      enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (list_empty(&dd->fifo_list[data_dir]))
+	if (list_empty(&per_prio->fifo_list[data_dir]))
 		return NULL;
 
-	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
 	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
@@ -229,7 +266,7 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 	 * an unlocked target zone.
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
+	list_for_each_entry(rq, &per_prio->fifo_list[DD_WRITE], queuelist) {
 		if (blk_req_can_dispatch_to_zone(rq))
 			goto out;
 	}
@@ -245,12 +282,13 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
  * dispatch using sector position sorted lists.
  */
 static struct request *
-deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
+deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
+		      enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	rq = dd->next_rq[data_dir];
+	rq = per_prio->next_rq[data_dir];
 	if (!rq)
 		return NULL;
 
@@ -276,15 +314,17 @@ deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
  * deadline_dispatch_requests selects the best request according to
  * read/write expire, fifo_batch, etc
  */
-static struct request *__dd_dispatch_request(struct deadline_data *dd)
+static struct request *__dd_dispatch_request(struct deadline_data *dd,
+					     struct dd_per_prio *per_prio)
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
 
 	lockdep_assert_held(&dd->lock);
 
-	if (!list_empty(&dd->dispatch)) {
-		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
+	if (!list_empty(&per_prio->dispatch)) {
+		rq = list_first_entry(&per_prio->dispatch, struct request,
+				      queuelist);
 		list_del_init(&rq->queuelist);
 		goto done;
 	}
@@ -292,7 +332,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	rq = deadline_next_request(dd, dd->last_dir);
+	rq = deadline_next_request(dd, per_prio, dd->last_dir);
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
 		goto dispatch_request;
@@ -302,10 +342,10 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (!list_empty(&dd->fifo_list[DD_READ])) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
+	if (!list_empty(&per_prio->fifo_list[DD_READ])) {
+		BUG_ON(RB_EMPTY_ROOT(&per_prio->sort_list[DD_READ]));
 
-		if (deadline_fifo_request(dd, DD_WRITE) &&
+		if (deadline_fifo_request(dd, per_prio, DD_WRITE) &&
 		    (dd->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
@@ -318,9 +358,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
+	if (!list_empty(&per_prio->fifo_list[DD_WRITE])) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&per_prio->sort_list[DD_WRITE]));
 
 		dd->starved = 0;
 
@@ -335,14 +375,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	next_rq = deadline_next_request(dd, data_dir);
-	if (deadline_check_fifo(dd, data_dir) || !next_rq) {
+	next_rq = deadline_next_request(dd, per_prio, data_dir);
+	if (deadline_check_fifo(per_prio, data_dir) || !next_rq) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = deadline_fifo_request(dd, data_dir);
+		rq = deadline_fifo_request(dd, per_prio, data_dir);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
@@ -366,7 +406,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * rq is the selected appropriate request.
 	 */
 	dd->batching++;
-	deadline_move_request(dd, rq);
+	deadline_move_request(dd, per_prio, rq);
 done:
 	/*
 	 * If the request needs its target zone locked, do it.
@@ -388,9 +428,14 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 	struct request *rq;
+	enum dd_prio prio;
 
 	spin_lock(&dd->lock);
-	rq = __dd_dispatch_request(dd);
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
+		if (rq)
+			break;
+	}
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -437,9 +482,14 @@ static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
+	enum dd_prio prio;
 
-	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		struct dd_per_prio *per_prio = &dd->per_prio[prio];
+
+		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
+		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
+	}
 
 	kfree(dd);
 }
@@ -451,22 +501,28 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
+	enum dd_prio prio;
+	int ret = -ENOMEM;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
-		return -ENOMEM;
+		return ret;
 
 	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
-	if (!dd) {
-		kobject_put(&eq->kobj);
-		return -ENOMEM;
-	}
+	if (!dd)
+		goto put_eq;
+
 	eq->elevator_data = dd;
 
-	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
-	dd->sort_list[DD_READ] = RB_ROOT;
-	dd->sort_list[DD_WRITE] = RB_ROOT;
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		struct dd_per_prio *per_prio = &dd->per_prio[prio];
+
+		INIT_LIST_HEAD(&per_prio->dispatch);
+		INIT_LIST_HEAD(&per_prio->fifo_list[DD_READ]);
+		INIT_LIST_HEAD(&per_prio->fifo_list[DD_WRITE]);
+		per_prio->sort_list[DD_READ] = RB_ROOT;
+		per_prio->sort_list[DD_WRITE] = RB_ROOT;
+	}
 	dd->fifo_expire[DD_READ] = read_expire;
 	dd->fifo_expire[DD_WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
@@ -475,10 +531,13 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_batch = fifo_batch;
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
-	INIT_LIST_HEAD(&dd->dispatch);
 
 	q->elevator = eq;
 	return 0;
+
+put_eq:
+	kobject_put(&eq->kobj);
+	return ret;
 }
 
 /*
@@ -489,13 +548,16 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
 			    struct bio *bio)
 {
 	struct deadline_data *dd = q->elevator->elevator_data;
+	const u8 ioprio_class = IOPRIO_PRIO_CLASS(bio->bi_ioprio);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];
 	sector_t sector = bio_end_sector(bio);
 	struct request *__rq;
 
 	if (!dd->front_merges)
 		return ELEVATOR_NO_MERGE;
 
-	__rq = elv_rb_find(&dd->sort_list[bio_data_dir(bio)], sector);
+	__rq = elv_rb_find(&per_prio->sort_list[bio_data_dir(bio)], sector);
 	if (__rq) {
 		BUG_ON(sector != blk_rq_pos(__rq));
 
@@ -538,6 +600,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const enum dd_data_dir data_dir = rq_data_dir(rq);
+	u16 ioprio = req_get_ioprio(rq);
+	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
+	struct dd_per_prio *per_prio;
+	enum dd_prio prio;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -547,15 +613,18 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
+	prio = ioprio_class_to_prio[ioprio_class];
+
 	if (blk_mq_sched_try_insert_merge(q, rq))
 		return;
 
 	trace_block_rq_insert(rq);
 
+	per_prio = &dd->per_prio[prio];
 	if (at_head) {
-		list_add(&rq->queuelist, &dd->dispatch);
+		list_add(&rq->queuelist, &per_prio->dispatch);
 	} else {
-		deadline_add_rq_rb(dd, rq);
+		deadline_add_rq_rb(per_prio, rq);
 
 		if (rq_mergeable(rq)) {
 			elv_rqhash_add(q, rq);
@@ -567,7 +636,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		 * set expire time and add to fifo list
 		 */
 		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
-		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+		list_add_tail(&rq->queuelist, &per_prio->fifo_list[data_dir]);
 	}
 }
 
@@ -618,26 +687,39 @@ static void dd_prepare_request(struct request *rq)
 static void dd_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	const u8 ioprio_class = dd_rq_ioclass(rq);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
 	if (blk_queue_is_zoned(q)) {
-		struct deadline_data *dd = q->elevator->elevator_data;
 		unsigned long flags;
 
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
-		if (!list_empty(&dd->fifo_list[DD_WRITE]))
+		if (!list_empty(&per_prio->fifo_list[DD_WRITE]))
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
 }
 
+static bool dd_has_work_for_prio(struct dd_per_prio *per_prio)
+{
+	return !list_empty_careful(&per_prio->dispatch) ||
+		!list_empty_careful(&per_prio->fifo_list[DD_READ]) ||
+		!list_empty_careful(&per_prio->fifo_list[DD_WRITE]);
+}
+
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+	enum dd_prio prio;
+
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++)
+		if (dd_has_work_for_prio(&dd->per_prio[prio]))
+			return true;
 
-	return !list_empty_careful(&dd->dispatch) ||
-		!list_empty_careful(&dd->fifo_list[0]) ||
-		!list_empty_careful(&dd->fifo_list[1]);
+	return false;
 }
 
 /*
@@ -704,16 +786,17 @@ static struct elv_fs_entry deadline_attrs[] = {
 };
 
 #ifdef CONFIG_BLK_DEBUG_FS
-#define DEADLINE_DEBUGFS_DDIR_ATTRS(ddir, name)				\
+#define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
 static void *deadline_##name##_fifo_start(struct seq_file *m,		\
 					  loff_t *pos)			\
 	__acquires(&dd->lock)						\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
 									\
 	spin_lock(&dd->lock);						\
-	return seq_list_start(&dd->fifo_list[ddir], *pos);		\
+	return seq_list_start(&per_prio->fifo_list[data_dir], *pos);	\
 }									\
 									\
 static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
@@ -721,8 +804,9 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
 {									\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
 									\
-	return seq_list_next(v, &dd->fifo_list[ddir], pos);		\
+	return seq_list_next(v, &per_prio->fifo_list[data_dir], pos);	\
 }									\
 									\
 static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
@@ -746,14 +830,20 @@ static int deadline_##name##_next_rq_show(void *data,			\
 {									\
 	struct request_queue *q = data;					\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
-	struct request *rq = dd->next_rq[ddir];				\
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
+	struct request *rq = per_prio->next_rq[data_dir];		\
 									\
 	if (rq)								\
 		__blk_mq_debugfs_rq_show(m, rq);			\
 	return 0;							\
 }
-DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
-DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
+
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_READ, read0);
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_WRITE, write0);
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_READ, read1);
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_WRITE, write1);
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_READ, read2);
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_WRITE, write2);
 #undef DEADLINE_DEBUGFS_DDIR_ATTRS
 
 static int deadline_batching_show(void *data, struct seq_file *m)
@@ -783,50 +873,74 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
 	return 0;
 }
 
-static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
-	__acquires(&dd->lock)
-{
-	struct request_queue *q = m->private;
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	spin_lock(&dd->lock);
-	return seq_list_start(&dd->dispatch, *pos);
-}
-
-static void *deadline_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
-{
-	struct request_queue *q = m->private;
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	return seq_list_next(v, &dd->dispatch, pos);
-}
-
-static void deadline_dispatch_stop(struct seq_file *m, void *v)
-	__releases(&dd->lock)
-{
-	struct request_queue *q = m->private;
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	spin_unlock(&dd->lock);
+#define DEADLINE_DISPATCH_ATTR(prio)					\
+static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
+					     loff_t *pos)		\
+	__acquires(&dd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
+									\
+	spin_lock(&dd->lock);						\
+	return seq_list_start(&per_prio->dispatch, *pos);		\
+}									\
+									\
+static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
+					    void *v, loff_t *pos)	\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+	struct dd_per_prio *per_prio = &dd->per_prio[prio];		\
+									\
+	return seq_list_next(v, &per_prio->dispatch, pos);		\
+}									\
+									\
+static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
+	__releases(&dd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+									\
+	spin_unlock(&dd->lock);						\
+}									\
+									\
+static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
+	.start	= deadline_dispatch##prio##_start,			\
+	.next	= deadline_dispatch##prio##_next,			\
+	.stop	= deadline_dispatch##prio##_stop,			\
+	.show	= blk_mq_debugfs_rq_show,				\
 }
 
-static const struct seq_operations deadline_dispatch_seq_ops = {
-	.start	= deadline_dispatch_start,
-	.next	= deadline_dispatch_next,
-	.stop	= deadline_dispatch_stop,
-	.show	= blk_mq_debugfs_rq_show,
-};
+DEADLINE_DISPATCH_ATTR(0);
+DEADLINE_DISPATCH_ATTR(1);
+DEADLINE_DISPATCH_ATTR(2);
+#undef DEADLINE_DISPATCH_ATTR
 
-#define DEADLINE_QUEUE_DDIR_ATTRS(name)						\
-	{#name "_fifo_list", 0400, .seq_ops = &deadline_##name##_fifo_seq_ops},	\
+#define DEADLINE_QUEUE_DDIR_ATTRS(name)					\
+	{#name "_fifo_list", 0400,					\
+			.seq_ops = &deadline_##name##_fifo_seq_ops}
+#define DEADLINE_NEXT_RQ_ATTR(name)					\
 	{#name "_next_rq", 0400, deadline_##name##_next_rq_show}
 static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
-	DEADLINE_QUEUE_DDIR_ATTRS(read),
-	DEADLINE_QUEUE_DDIR_ATTRS(write),
+	DEADLINE_QUEUE_DDIR_ATTRS(read0),
+	DEADLINE_QUEUE_DDIR_ATTRS(write0),
+	DEADLINE_QUEUE_DDIR_ATTRS(read1),
+	DEADLINE_QUEUE_DDIR_ATTRS(write1),
+	DEADLINE_QUEUE_DDIR_ATTRS(read2),
+	DEADLINE_QUEUE_DDIR_ATTRS(write2),
+	DEADLINE_NEXT_RQ_ATTR(read0),
+	DEADLINE_NEXT_RQ_ATTR(write0),
+	DEADLINE_NEXT_RQ_ATTR(read1),
+	DEADLINE_NEXT_RQ_ATTR(write1),
+	DEADLINE_NEXT_RQ_ATTR(read2),
+	DEADLINE_NEXT_RQ_ATTR(write2),
 	{"batching", 0400, deadline_batching_show},
 	{"starved", 0400, deadline_starved_show},
 	{"async_depth", 0400, dd_async_depth_show},
-	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
+	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
+	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
+	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
 	{},
 };
 #undef DEADLINE_QUEUE_DDIR_ATTRS
@@ -876,6 +990,6 @@ static void __exit deadline_exit(void)
 module_init(deadline_init);
 module_exit(deadline_exit);
 
-MODULE_AUTHOR("Jens Axboe");
+MODULE_AUTHOR("Jens Axboe, Damien Le Moal and Bart Van Assche");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("MQ deadline IO scheduler");

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 14/16] block/mq-deadline: Track I/O statistics
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (12 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 13/16] block/mq-deadline: Add I/O priority support Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 15/16] block/mq-deadline: Add cgroup support Bart Van Assche
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

Track I/O statistics per I/O priority and export these statistics to
debugfs. These statistics help developers of the deadline scheduler.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 100 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index aba672a5be1e..04d9d6b3745b 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -51,6 +51,19 @@ enum dd_prio {
 
 enum { DD_PRIO_COUNT = 3 };
 
+/* I/O statistics per I/O priority. */
+struct io_stats_per_prio {
+	local_t inserted;
+	local_t merged;
+	local_t dispatched;
+	local_t completed;
+};
+
+/* I/O statistics for all I/O priorities (enum dd_prio). */
+struct io_stats {
+	struct io_stats_per_prio stats[DD_PRIO_COUNT];
+};
+
 /*
  * Deadline scheduler data per I/O priority (enum dd_prio). Requests are
  * present on both sort_list[] and fifo_list[].
@@ -75,6 +88,8 @@ struct deadline_data {
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
+	struct io_stats __percpu *stats;
+
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -88,6 +103,33 @@ struct deadline_data {
 	spinlock_t zone_lock;
 };
 
+/* Count one event of type 'event_type' and with I/O priority 'prio' */
+#define dd_count(dd, event_type, prio) do {				\
+	struct io_stats *io_stats = get_cpu_ptr((dd)->stats);		\
+									\
+	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
+	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
+	local_inc(&io_stats->stats[(prio)].event_type);			\
+	put_cpu_ptr(io_stats);						\
+} while (0)
+
+/*
+ * Returns the total number of dd_count(dd, event_type, prio) calls across all
+ * CPUs. No locking or barriers since it is fine if the returned sum is slightly
+ * outdated.
+ */
+#define dd_sum(dd, event_type, prio) ({					\
+	unsigned int cpu;						\
+	u32 sum = 0;							\
+									\
+	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
+	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
+	for_each_present_cpu(cpu)					\
+		sum += local_read(&per_cpu_ptr((dd)->stats, cpu)->	\
+				  stats[(prio)].event_type);		\
+	sum;								\
+})
+
 /* Maps an I/O priority class to a deadline scheduler priority. */
 static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
@@ -187,9 +229,12 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 static void dd_merged_requests(struct request_queue *q, struct request *req,
 			       struct request *next)
 {
+	struct deadline_data *dd = q->elevator->elevator_data;
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 
+	dd_count(dd, merged, prio);
+
 	/*
 	 * if next expires before rq, assign its expire time to rq
 	 * and move into next position (next will be deleted) in fifo
@@ -225,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	deadline_remove_request(rq->q, per_prio, rq);
 }
 
+/* Number of requests queued for a given priority level. */
+static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
+{
+	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+}
+
 /*
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
@@ -319,6 +370,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
+	enum dd_prio prio;
+	u8 ioprio_class;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -408,6 +461,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	dd->batching++;
 	deadline_move_request(dd, per_prio, rq);
 done:
+	ioprio_class = dd_rq_ioclass(rq);
+	prio = ioprio_class_to_prio[ioprio_class];
+	dd_count(dd, dispatched, prio);
 	/*
 	 * If the request needs its target zone locked, do it.
 	 */
@@ -491,6 +547,8 @@ static void dd_exit_sched(struct elevator_queue *e)
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
 	}
 
+	free_percpu(dd->stats);
+
 	kfree(dd);
 }
 
@@ -514,6 +572,11 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	eq->elevator_data = dd;
 
+	dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!dd->stats)
+		goto free_dd;
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -535,6 +598,9 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 	return 0;
 
+free_dd:
+	kfree(dd);
+
 put_eq:
 	kobject_put(&eq->kobj);
 	return ret;
@@ -614,6 +680,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_req_zone_write_unlock(rq);
 
 	prio = ioprio_class_to_prio[ioprio_class];
+	dd_count(dd, inserted, prio);
 
 	if (blk_mq_sched_try_insert_merge(q, rq))
 		return;
@@ -692,6 +759,8 @@ static void dd_finish_request(struct request *rq)
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
+	dd_count(dd, completed, prio);
+
 	if (blk_queue_is_zoned(q)) {
 		unsigned long flags;
 
@@ -873,6 +942,35 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int dd_queued_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
+		   dd_queued(dd, DD_BE_PRIO),
+		   dd_queued(dd, DD_IDLE_PRIO));
+	return 0;
+}
+
+/* Number of requests owned by the block driver for a given priority. */
+static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
+{
+	return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
+		- dd_sum(dd, completed, prio);
+}
+
+static int dd_owned_by_driver_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
+		   dd_owned_by_driver(dd, DD_BE_PRIO),
+		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
+	return 0;
+}
+
 #define DEADLINE_DISPATCH_ATTR(prio)					\
 static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
 					     loff_t *pos)		\
@@ -941,6 +1039,8 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
 	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
 	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
+	{"owned_by_driver", 0400, dd_owned_by_driver_show},
+	{"queued", 0400, dd_queued_show},
 	{},
 };
 #undef DEADLINE_QUEUE_DDIR_ATTRS

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 15/16] block/mq-deadline: Add cgroup support
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (13 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 14/16] block/mq-deadline: Track I/O statistics Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-06-18  0:44 ` [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
  2021-06-21 16:06 ` [PATCH v3 00/16] Improve I/O priority support Jens Axboe
  16 siblings, 0 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

Maintain statistics per cgroup and export these to user space. These
statistics are essential for verifying whether the proper I/O priorities
have been assigned to requests. An example of the statistics data with
this patch applied:

$ cat /sys/fs/cgroup/io.stat
11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/Kconfig.iosched                       |   6 +
 block/Makefile                              |   2 +
 block/mq-deadline-cgroup.c                  | 126 ++++++++++++++++++++
 block/mq-deadline-cgroup.h                  | 114 ++++++++++++++++++
 block/{mq-deadline.c => mq-deadline-main.c} |  74 +++++++++---
 5 files changed, 308 insertions(+), 14 deletions(-)
 create mode 100644 block/mq-deadline-cgroup.c
 create mode 100644 block/mq-deadline-cgroup.h
 rename block/{mq-deadline.c => mq-deadline-main.c} (95%)

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 2f2158e05a91..64053d67a97b 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -9,6 +9,12 @@ config MQ_IOSCHED_DEADLINE
 	help
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_DEADLINE_CGROUP
+       tristate
+       default y
+       depends on MQ_IOSCHED_DEADLINE
+       depends on BLK_CGROUP
+
 config MQ_IOSCHED_KYBER
 	tristate "Kyber I/O scheduler"
 	default y
diff --git a/block/Makefile b/block/Makefile
index af3d044abaf1..b9db5d4edfc8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,8 @@ obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+mq-deadline-y += mq-deadline-main.o
+mq-deadline-$(CONFIG_MQ_IOSCHED_DEADLINE_CGROUP)+= mq-deadline-cgroup.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
 obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
diff --git a/block/mq-deadline-cgroup.c b/block/mq-deadline-cgroup.c
new file mode 100644
index 000000000000..3b4bfddec39f
--- /dev/null
+++ b/block/mq-deadline-cgroup.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/blk-cgroup.h>
+#include <linux/ioprio.h>
+
+#include "mq-deadline-cgroup.h"
+
+static struct blkcg_policy dd_blkcg_policy;
+
+static struct blkcg_policy_data *dd_cpd_alloc(gfp_t gfp)
+{
+	struct dd_blkcg *pd;
+
+	pd = kzalloc(sizeof(*pd), gfp);
+	if (!pd)
+		return NULL;
+	pd->stats = alloc_percpu_gfp(typeof(*pd->stats),
+				     GFP_KERNEL | __GFP_ZERO);
+	if (!pd->stats) {
+		kfree(pd);
+		return NULL;
+	}
+	return &pd->cpd;
+}
+
+static void dd_cpd_free(struct blkcg_policy_data *cpd)
+{
+	struct dd_blkcg *dd_blkcg = container_of(cpd, typeof(*dd_blkcg), cpd);
+
+	free_percpu(dd_blkcg->stats);
+	kfree(dd_blkcg);
+}
+
+static struct dd_blkcg *dd_blkcg_from_pd(struct blkg_policy_data *pd)
+{
+	return container_of(blkcg_to_cpd(pd->blkg->blkcg, &dd_blkcg_policy),
+			    struct dd_blkcg, cpd);
+}
+
+/*
+ * Convert an association between a block cgroup and a request queue into a
+ * pointer to the mq-deadline information associated with a (blkcg, queue) pair.
+ */
+struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
+{
+	struct blkg_policy_data *pd;
+
+	pd = blkg_to_pd(bio->bi_blkg, &dd_blkcg_policy);
+	if (!pd)
+		return NULL;
+
+	return dd_blkcg_from_pd(pd);
+}
+
+static size_t dd_pd_stat(struct blkg_policy_data *pd, char *buf, size_t size)
+{
+	static const char *const prio_class_name[] = {
+		[IOPRIO_CLASS_NONE]	= "NONE",
+		[IOPRIO_CLASS_RT]	= "RT",
+		[IOPRIO_CLASS_BE]	= "BE",
+		[IOPRIO_CLASS_IDLE]	= "IDLE",
+	};
+	struct dd_blkcg *blkcg = dd_blkcg_from_pd(pd);
+	int res = 0;
+	u8 prio;
+
+	for (prio = 0; prio < ARRAY_SIZE(blkcg->stats->stats); prio++)
+		res += scnprintf(buf + res, size - res,
+			" [%s] dispatched=%u inserted=%u merged=%u",
+			prio_class_name[prio],
+			ddcg_sum(blkcg, dispatched, prio) +
+			ddcg_sum(blkcg, merged, prio) -
+			ddcg_sum(blkcg, completed, prio),
+			ddcg_sum(blkcg, inserted, prio) -
+			ddcg_sum(blkcg, completed, prio),
+			ddcg_sum(blkcg, merged, prio));
+
+	return res;
+}
+
+static struct blkg_policy_data *dd_pd_alloc(gfp_t gfp, struct request_queue *q,
+					    struct blkcg *blkcg)
+{
+	struct dd_blkg *pd;
+
+	pd = kzalloc(sizeof(*pd), gfp);
+	if (!pd)
+		return NULL;
+	return &pd->pd;
+}
+
+static void dd_pd_free(struct blkg_policy_data *pd)
+{
+	struct dd_blkg *dd_blkg = container_of(pd, typeof(*dd_blkg), pd);
+
+	kfree(dd_blkg);
+}
+
+static struct blkcg_policy dd_blkcg_policy = {
+	.cpd_alloc_fn		= dd_cpd_alloc,
+	.cpd_free_fn		= dd_cpd_free,
+
+	.pd_alloc_fn		= dd_pd_alloc,
+	.pd_free_fn		= dd_pd_free,
+	.pd_stat_fn		= dd_pd_stat,
+};
+
+int dd_activate_policy(struct request_queue *q)
+{
+	return blkcg_activate_policy(q, &dd_blkcg_policy);
+}
+
+void dd_deactivate_policy(struct request_queue *q)
+{
+	blkcg_deactivate_policy(q, &dd_blkcg_policy);
+}
+
+int __init dd_blkcg_init(void)
+{
+	return blkcg_policy_register(&dd_blkcg_policy);
+}
+
+void __exit dd_blkcg_exit(void)
+{
+	blkcg_policy_unregister(&dd_blkcg_policy);
+}
diff --git a/block/mq-deadline-cgroup.h b/block/mq-deadline-cgroup.h
new file mode 100644
index 000000000000..0143fd74f3ce
--- /dev/null
+++ b/block/mq-deadline-cgroup.h
@@ -0,0 +1,114 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#if !defined(_MQ_DEADLINE_CGROUP_H_)
+#define _MQ_DEADLINE_CGROUP_H_
+
+#include <linux/blk-cgroup.h>
+
+struct request_queue;
+
+/**
+ * struct io_stats_per_prio - I/O statistics per I/O priority class.
+ * @inserted: Number of inserted requests.
+ * @merged: Number of merged requests.
+ * @dispatched: Number of dispatched requests.
+ * @completed: Number of I/O completions.
+ */
+struct io_stats_per_prio {
+	local_t inserted;
+	local_t merged;
+	local_t dispatched;
+	local_t completed;
+};
+
+/* I/O statistics per I/O cgroup per I/O priority class (IOPRIO_CLASS_*). */
+struct blkcg_io_stats {
+	struct io_stats_per_prio stats[4];
+};
+
+/**
+ * struct dd_blkcg - Per cgroup data.
+ * @cpd: blkcg_policy_data structure.
+ * @stats: I/O statistics.
+ */
+struct dd_blkcg {
+	struct blkcg_policy_data cpd;	/* must be the first member */
+	struct blkcg_io_stats __percpu *stats;
+};
+
+/*
+ * Count one event of type 'event_type' and with I/O priority class
+ * 'prio_class'.
+ */
+#define ddcg_count(ddcg, event_type, prio_class) do {			\
+if (ddcg) {								\
+	struct blkcg_io_stats *io_stats = get_cpu_ptr((ddcg)->stats);	\
+									\
+	BUILD_BUG_ON(!__same_type((ddcg), struct dd_blkcg *));		\
+	BUILD_BUG_ON(!__same_type((prio_class), u8));			\
+	local_inc(&io_stats->stats[(prio_class)].event_type);		\
+	put_cpu_ptr(io_stats);						\
+}									\
+} while (0)
+
+/*
+ * Returns the total number of ddcg_count(ddcg, event_type, prio_class) calls
+ * across all CPUs. No locking or barriers since it is fine if the returned
+ * sum is slightly outdated.
+ */
+#define ddcg_sum(ddcg, event_type, prio) ({				\
+	unsigned int cpu;						\
+	u32 sum = 0;							\
+									\
+	BUILD_BUG_ON(!__same_type((ddcg), struct dd_blkcg *));		\
+	BUILD_BUG_ON(!__same_type((prio), u8));				\
+	for_each_present_cpu(cpu)					\
+		sum += local_read(&per_cpu_ptr((ddcg)->stats, cpu)->	\
+				  stats[(prio)].event_type);		\
+	sum;								\
+})
+
+#ifdef CONFIG_BLK_CGROUP
+
+/**
+ * struct dd_blkg - Per (cgroup, request queue) data.
+ * @pd: blkg_policy_data structure.
+ */
+struct dd_blkg {
+	struct blkg_policy_data pd;	/* must be the first member */
+};
+
+struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio);
+int dd_activate_policy(struct request_queue *q);
+void dd_deactivate_policy(struct request_queue *q);
+int __init dd_blkcg_init(void);
+void __exit dd_blkcg_exit(void);
+
+#else /* CONFIG_BLK_CGROUP */
+
+static inline struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline int dd_activate_policy(struct request_queue *q)
+{
+	return 0;
+}
+
+static inline void dd_deactivate_policy(struct request_queue *q)
+{
+}
+
+static inline int dd_blkcg_init(void)
+{
+	return 0;
+}
+
+static inline void dd_blkcg_exit(void)
+{
+}
+
+#endif /* CONFIG_BLK_CGROUP */
+
+#endif /* _MQ_DEADLINE_CGROUP_H_ */
diff --git a/block/mq-deadline.c b/block/mq-deadline-main.c
similarity index 95%
rename from block/mq-deadline.c
rename to block/mq-deadline-main.c
index 04d9d6b3745b..58a401ea8f56 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline-main.c
@@ -25,6 +25,7 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
 #include "blk-mq-sched.h"
+#include "mq-deadline-cgroup.h"
 
 /*
  * See Documentation/block/deadline-iosched.rst
@@ -51,14 +52,6 @@ enum dd_prio {
 
 enum { DD_PRIO_COUNT = 3 };
 
-/* I/O statistics per I/O priority. */
-struct io_stats_per_prio {
-	local_t inserted;
-	local_t merged;
-	local_t dispatched;
-	local_t completed;
-};
-
 /* I/O statistics for all I/O priorities (enum dd_prio). */
 struct io_stats {
 	struct io_stats_per_prio stats[DD_PRIO_COUNT];
@@ -81,6 +74,9 @@ struct deadline_data {
 	 * run time data
 	 */
 
+	/* Request queue that owns this data structure. */
+	struct request_queue *queue;
+
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
@@ -232,8 +228,10 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+	struct dd_blkcg *blkcg = next->elv.priv[0];
 
 	dd_count(dd, merged, prio);
+	ddcg_count(blkcg, merged, ioprio_class);
 
 	/*
 	 * if next expires before rq, assign its expire time to rq
@@ -370,6 +368,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
+	struct dd_blkcg *blkcg;
 	enum dd_prio prio;
 	u8 ioprio_class;
 
@@ -464,6 +463,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	ioprio_class = dd_rq_ioclass(rq);
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd_count(dd, dispatched, prio);
+	blkcg = rq->elv.priv[0];
+	ddcg_count(blkcg, dispatched, ioprio_class);
 	/*
 	 * If the request needs its target zone locked, do it.
 	 */
@@ -540,6 +541,8 @@ static void dd_exit_sched(struct elevator_queue *e)
 	struct deadline_data *dd = e->elevator_data;
 	enum dd_prio prio;
 
+	dd_deactivate_policy(dd->queue);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -553,7 +556,7 @@ static void dd_exit_sched(struct elevator_queue *e)
 }
 
 /*
- * initialize elevator private data (deadline_data).
+ * Initialize elevator private data (deadline_data) and associate with blkcg.
  */
 static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
@@ -562,6 +565,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	enum dd_prio prio;
 	int ret = -ENOMEM;
 
+	/*
+	 * Initialization would be very tricky if the queue is not frozen,
+	 * hence the warning statement below.
+	 */
+	WARN_ON_ONCE(!percpu_ref_is_zero(&q->q_usage_counter));
+
 	eq = elevator_alloc(q, e);
 	if (!eq)
 		return ret;
@@ -577,6 +586,8 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	if (!dd->stats)
 		goto free_dd;
 
+	dd->queue = q;
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -595,9 +606,17 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
+	ret = dd_activate_policy(q);
+	if (ret)
+		goto free_stats;
+
+	ret = 0;
 	q->elevator = eq;
 	return 0;
 
+free_stats:
+	free_percpu(dd->stats);
+
 free_dd:
 	kfree(dd);
 
@@ -670,6 +689,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
 	struct dd_per_prio *per_prio;
 	enum dd_prio prio;
+	struct dd_blkcg *blkcg;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -679,8 +699,19 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
+	/*
+	 * If a block cgroup has been associated with the submitter and if an
+	 * I/O priority has been set in the associated block cgroup, use the
+	 * lowest of the cgroup priority and the request priority for the
+	 * request. If no priority has been set in the request, use the cgroup
+	 * priority.
+	 */
 	prio = ioprio_class_to_prio[ioprio_class];
 	dd_count(dd, inserted, prio);
+	blkcg = dd_blkcg_from_bio(rq->bio);
+	ddcg_count(blkcg, inserted, ioprio_class);
+	WARN_ON_ONCE(rq->elv.priv[0]);
+	rq->elv.priv[0] = blkcg;
 
 	if (blk_mq_sched_try_insert_merge(q, rq))
 		return;
@@ -727,12 +758,10 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	spin_unlock(&dd->lock);
 }
 
-/*
- * Nothing to do here. This is defined only to ensure that .finish_request
- * method is called upon request completion.
- */
+/* Callback from inside blk_mq_rq_ctx_init(). */
 static void dd_prepare_request(struct request *rq)
 {
+	rq->elv.priv[0] = NULL;
 }
 
 /*
@@ -755,11 +784,13 @@ static void dd_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct deadline_data *dd = q->elevator->elevator_data;
+	struct dd_blkcg *blkcg = rq->elv.priv[0];
 	const u8 ioprio_class = dd_rq_ioclass(rq);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 	struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
 	dd_count(dd, completed, prio);
+	ddcg_count(blkcg, completed, ioprio_class);
 
 	if (blk_queue_is_zoned(q)) {
 		unsigned long flags;
@@ -1079,11 +1110,26 @@ MODULE_ALIAS("mq-deadline-iosched");
 
 static int __init deadline_init(void)
 {
-	return elv_register(&mq_deadline);
+	int ret;
+
+	ret = elv_register(&mq_deadline);
+	if (ret)
+		goto out;
+	ret = dd_blkcg_init();
+	if (ret)
+		goto unreg;
+
+out:
+	return ret;
+
+unreg:
+	elv_unregister(&mq_deadline);
+	goto out;
 }
 
 static void __exit deadline_exit(void)
 {
+	dd_blkcg_exit();
 	elv_unregister(&mq_deadline);
 }
 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (14 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 15/16] block/mq-deadline: Add cgroup support Bart Van Assche
@ 2021-06-18  0:44 ` Bart Van Assche
  2021-08-20  0:45   ` Niklas Cassel
  2021-06-21 16:06 ` [PATCH v3 00/16] Improve I/O priority support Jens Axboe
  16 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-06-18  0:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

While one or more requests with a certain I/O priority are pending, do not
dispatch lower priority requests. Dispatch lower priority requests anyway
after the "aging" time has expired.

This patch has been tested as follows:

modprobe scsi_debug ndelay=1000000 max_queue=16 &&
sd='' &&
while [ -z "$sd" ]; do
  sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done &&
echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
cd /sys/fs/cgroup/blkio/ &&
echo $$ >cgroup.procs &&
echo restrict-to-be >blkio.prio.class &&
mkdir -p hipri &&
cd hipri &&
echo none-to-rt >blkio.prio.class &&
{ max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
echo $$ >cgroup.procs &&
max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt

Result:
* 11000 IOPS for the high-priority job
*    40 IOPS for the low-priority job

If the aging expiry time is changed from 100s into 0, the IOPS results change
into 6712 and 6796 IOPS.

The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline-main.c | 42 +++++++++++++++++++++++++++++++++++-----
 1 file changed, 37 insertions(+), 5 deletions(-)

diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
index 58a401ea8f56..4815e536091f 100644
--- a/block/mq-deadline-main.c
+++ b/block/mq-deadline-main.c
@@ -32,6 +32,11 @@
  */
 static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
 static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+/*
+ * Time after which to dispatch lower priority requests even if higher
+ * priority requests are pending.
+ */
+static const int aging_expire = 10 * HZ;
 static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
@@ -94,6 +99,7 @@ struct deadline_data {
 	int writes_starved;
 	int front_merges;
 	u32 async_depth;
+	int aging_expire;
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
@@ -361,10 +367,11 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 
 /*
  * deadline_dispatch_requests selects the best request according to
- * read/write expire, fifo_batch, etc
+ * read/write expire, fifo_batch, etc and with a start time <= @latest.
  */
 static struct request *__dd_dispatch_request(struct deadline_data *dd,
-					     struct dd_per_prio *per_prio)
+					     struct dd_per_prio *per_prio,
+					     u64 latest_start_ns)
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
@@ -377,6 +384,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	if (!list_empty(&per_prio->dispatch)) {
 		rq = list_first_entry(&per_prio->dispatch, struct request,
 				      queuelist);
+		if (rq->start_time_ns > latest_start_ns)
+			return NULL;
 		list_del_init(&rq->queuelist);
 		goto done;
 	}
@@ -454,6 +463,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	dd->batching = 0;
 
 dispatch_request:
+	if (rq->start_time_ns > latest_start_ns)
+		return NULL;
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -484,15 +495,32 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
-	struct request *rq;
+	const u64 now_ns = ktime_get_ns();
+	struct request *rq = NULL;
 	enum dd_prio prio;
 
 	spin_lock(&dd->lock);
-	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
-		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
+	/*
+	 * Start with dispatching requests whose deadline expired more than
+	 * aging_expire jiffies ago.
+	 */
+	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now_ns -
+					   jiffies_to_nsecs(dd->aging_expire));
 		if (rq)
+			goto unlock;
+	}
+	/*
+	 * Next, dispatch requests in priority order. Ignore lower priority
+	 * requests if any higher priority requests are pending.
+	 */
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now_ns);
+		if (rq || dd_queued(dd, prio))
 			break;
 	}
+
+unlock:
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -603,6 +631,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->front_merges = 1;
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
+	dd->aging_expire = aging_expire;
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
@@ -835,6 +864,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
 SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
 SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
+SHOW_JIFFIES(deadline_aging_expire_show, dd->aging_expire);
 SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
 SHOW_INT(deadline_front_merges_show, dd->front_merges);
 SHOW_INT(deadline_async_depth_show, dd->front_merges);
@@ -864,6 +894,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
 	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
 STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
 STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
+STORE_JIFFIES(deadline_aging_expire_store, &dd->aging_expire, 0, INT_MAX);
 STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
 STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
 STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
@@ -882,6 +913,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(front_merges),
 	DD_ATTR(async_depth),
 	DD_ATTR(fifo_batch),
+	DD_ATTR(aging_expire),
 	__ATTR_NULL
 };
 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  2021-06-18  0:44 ` [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
@ 2021-06-18 17:09   ` Adam Manzanares
  2021-06-18 19:49   ` Himanshu Madhani
  1 sibling, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 17:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Ming Lei,
	Himanshu Madhani

On Thu, Jun 17, 2021 at 05:44:41PM -0700, Bart Van Assche wrote:
> These entries were consecutive at the time of their introduction but are no
> longer consecutive. Make these again consecutive. Additionally, modify the
> help text since it refers to blk-mq and since the legacy block layer has
> been removed.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/Kconfig | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index a2297edfdde8..6685578b2a20 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -133,6 +133,13 @@ config BLK_WBT
>  	dynamically on an algorithm loosely based on CoDel, factoring in
>  	the realtime performance of the disk.
>  
> +config BLK_WBT_MQ
> +	bool "Enable writeback throttling by default"
> +	default y
> +	depends on BLK_WBT
> +	help
> +	Enable writeback throttling by default for request-based block devices.
> +
>  config BLK_CGROUP_IOLATENCY
>  	bool "Enable support for latency based cgroup IO protection"
>  	depends on BLK_CGROUP=y
> @@ -155,13 +162,6 @@ config BLK_CGROUP_IOCOST
>  	distributes IO capacity between different groups based on
>  	their share of the overall weight distribution.
>  
> -config BLK_WBT_MQ
> -	bool "Multiqueue writeback throttling"
> -	default y
> -	depends on BLK_WBT
> -	help
> -	Enable writeback throttling by default on multiqueue devices.
> -
>  config BLK_DEBUG_FS
>  	bool "Block layer debugging information in debugfs"
>  	default y

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-18  0:44 ` [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
@ 2021-06-18 17:15   ` Adam Manzanares
  2021-06-18 19:49   ` Himanshu Madhani
  2021-06-21 14:24   ` Tejun Heo
  2 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 17:15 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Tejun Heo,
	Ming Lei, Himanshu Madhani

On Thu, Jun 17, 2021 at 05:44:42PM -0700, Bart Van Assche wrote:
> Before adding more calls in this function, simplify the error path.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-cgroup.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d169e2055158..3b0f6efaa2b6 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1183,15 +1183,14 @@ int blkcg_init_queue(struct request_queue *q)
>  	if (preloaded)
>  		radix_tree_preload_end();
>  
> -	ret = blk_throtl_init(q);
> +	ret = blk_iolatency_init(q);
>  	if (ret)
>  		goto err_destroy_all;
>  
> -	ret = blk_iolatency_init(q);
> -	if (ret) {
> -		blk_throtl_exit(q);
> +	ret = blk_throtl_init(q);
> +	if (ret)
>  		goto err_destroy_all;
> -	}
> +
>  	return 0;
>  
>  err_destroy_all:

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file
  2021-06-18  0:44 ` [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
@ 2021-06-18 17:22   ` Adam Manzanares
  2021-06-18 19:49   ` Himanshu Madhani
  1 sibling, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 17:22 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

On Thu, Jun 17, 2021 at 05:44:43PM -0700, Bart Van Assche wrote:
> rq_qos_id_to_name() is only used in blk-mq-debugfs.c so move that function
> into in blk-mq-debugfs.c.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-mq-debugfs.c | 13 +++++++++++++
>  block/blk-rq-qos.h     | 13 -------------
>  2 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 2a75bc7401df..6ac1c86f62ef 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -937,6 +937,19 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
>  	q->sched_debugfs_dir = NULL;
>  }
>  
> +static const char *rq_qos_id_to_name(enum rq_qos_id id)
> +{
> +	switch (id) {
> +	case RQ_QOS_WBT:
> +		return "wbt";
> +	case RQ_QOS_LATENCY:
> +		return "latency";
> +	case RQ_QOS_COST:
> +		return "cost";
> +	}
> +	return "unknown";
> +}
> +
>  void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
>  {
>  	debugfs_remove_recursive(rqos->debugfs_dir);
> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
> index 2bcb3495e376..a77afbdd472c 100644
> --- a/block/blk-rq-qos.h
> +++ b/block/blk-rq-qos.h
> @@ -79,19 +79,6 @@ static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
>  	return rq_qos_id(q, RQ_QOS_LATENCY);
>  }
>  
> -static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
> -{
> -	switch (id) {
> -	case RQ_QOS_WBT:
> -		return "wbt";
> -	case RQ_QOS_LATENCY:
> -		return "latency";
> -	case RQ_QOS_COST:
> -		return "cost";
> -	}
> -	return "unknown";
> -}
> -
>  static inline void rq_wait_init(struct rq_wait *rq_wait)
>  {
>  	atomic_set(&rq_wait->inflight, 0);

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file
  2021-06-18  0:44 ` [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
  2021-06-18 17:22   ` Adam Manzanares
@ 2021-06-18 19:49   ` Himanshu Madhani
  1 sibling, 0 replies; 40+ messages in thread
From: Himanshu Madhani @ 2021-06-18 19:49 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn



On 6/17/21 7:44 PM, Bart Van Assche wrote:
> rq_qos_id_to_name() is only used in blk-mq-debugfs.c so move that function
> into in blk-mq-debugfs.c.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/blk-mq-debugfs.c | 13 +++++++++++++
>   block/blk-rq-qos.h     | 13 -------------
>   2 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 2a75bc7401df..6ac1c86f62ef 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -937,6 +937,19 @@ void blk_mq_debugfs_unregister_sched(struct request_queue *q)
>   	q->sched_debugfs_dir = NULL;
>   }
>   
> +static const char *rq_qos_id_to_name(enum rq_qos_id id)
> +{
> +	switch (id) {
> +	case RQ_QOS_WBT:
> +		return "wbt";
> +	case RQ_QOS_LATENCY:
> +		return "latency";
> +	case RQ_QOS_COST:
> +		return "cost";
> +	}
> +	return "unknown";
> +}
> +
>   void blk_mq_debugfs_unregister_rqos(struct rq_qos *rqos)
>   {
>   	debugfs_remove_recursive(rqos->debugfs_dir);
> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
> index 2bcb3495e376..a77afbdd472c 100644
> --- a/block/blk-rq-qos.h
> +++ b/block/blk-rq-qos.h
> @@ -79,19 +79,6 @@ static inline struct rq_qos *blkcg_rq_qos(struct request_queue *q)
>   	return rq_qos_id(q, RQ_QOS_LATENCY);
>   }
>   
> -static inline const char *rq_qos_id_to_name(enum rq_qos_id id)
> -{
> -	switch (id) {
> -	case RQ_QOS_WBT:
> -		return "wbt";
> -	case RQ_QOS_LATENCY:
> -		return "latency";
> -	case RQ_QOS_COST:
> -		return "cost";
> -	}
> -	return "unknown";
> -}
> -
>   static inline void rq_wait_init(struct rq_wait *rq_wait)
>   {
>   	atomic_set(&rq_wait->inflight, 0);
> 

Looks Good.

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-18  0:44 ` [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
  2021-06-18 17:15   ` Adam Manzanares
@ 2021-06-18 19:49   ` Himanshu Madhani
  2021-06-21 14:24   ` Tejun Heo
  2 siblings, 0 replies; 40+ messages in thread
From: Himanshu Madhani @ 2021-06-18 19:49 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Tejun Heo,
	Ming Lei



On 6/17/21 7:44 PM, Bart Van Assche wrote:
> Before adding more calls in this function, simplify the error path.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/blk-cgroup.c | 9 ++++-----
>   1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index d169e2055158..3b0f6efaa2b6 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1183,15 +1183,14 @@ int blkcg_init_queue(struct request_queue *q)
>   	if (preloaded)
>   		radix_tree_preload_end();
>   
> -	ret = blk_throtl_init(q);
> +	ret = blk_iolatency_init(q);
>   	if (ret)
>   		goto err_destroy_all;
>   
> -	ret = blk_iolatency_init(q);
> -	if (ret) {
> -		blk_throtl_exit(q);
> +	ret = blk_throtl_init(q);
> +	if (ret)
>   		goto err_destroy_all;
> -	}
> +
>   	return 0;
>   
>   err_destroy_all:
> 

Looks Good.

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive
  2021-06-18  0:44 ` [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
  2021-06-18 17:09   ` Adam Manzanares
@ 2021-06-18 19:49   ` Himanshu Madhani
  1 sibling, 0 replies; 40+ messages in thread
From: Himanshu Madhani @ 2021-06-18 19:49 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Ming Lei



On 6/17/21 7:44 PM, Bart Van Assche wrote:
> These entries were consecutive at the time of their introduction but are no
> longer consecutive. Make these again consecutive. Additionally, modify the
> help text since it refers to blk-mq and since the legacy block layer has
> been removed.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/Kconfig | 14 +++++++-------
>   1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/block/Kconfig b/block/Kconfig
> index a2297edfdde8..6685578b2a20 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -133,6 +133,13 @@ config BLK_WBT
>   	dynamically on an algorithm loosely based on CoDel, factoring in
>   	the realtime performance of the disk.
>   
> +config BLK_WBT_MQ
> +	bool "Enable writeback throttling by default"
> +	default y
> +	depends on BLK_WBT
> +	help
> +	Enable writeback throttling by default for request-based block devices.
> +
>   config BLK_CGROUP_IOLATENCY
>   	bool "Enable support for latency based cgroup IO protection"
>   	depends on BLK_CGROUP=y
> @@ -155,13 +162,6 @@ config BLK_CGROUP_IOCOST
>   	distributes IO capacity between different groups based on
>   	their share of the overall weight distribution.
>   
> -config BLK_WBT_MQ
> -	bool "Multiqueue writeback throttling"
> -	default y
> -	depends on BLK_WBT
> -	help
> -	Enable writeback throttling by default on multiqueue devices.
> -
>   config BLK_DEBUG_FS
>   	bool "Block layer debugging information in debugfs"
>   	default y
> 

Looks Good.

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 04/16] block: Introduce the ioprio rq-qos policy
  2021-06-18  0:44 ` [PATCH v3 04/16] block: Introduce the ioprio rq-qos policy Bart Van Assche
@ 2021-06-18 22:02   ` Adam Manzanares
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 22:02 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

On Thu, Jun 17, 2021 at 05:44:44PM -0700, Bart Van Assche wrote:
> Introduce an rq-qos policy that assigns an I/O priority to requests based
> on blk-cgroup configuration settings. This policy has the following
> advantages over the ioprio_set() system call:
> - This policy is cgroup based so it has all the advantages of cgroups.
> - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
>   controller affects page cache writeback I/O for filesystems that support
>   assiociating a cgroup with writeback I/O. See also
>   Documentation/admin-guide/cgroup-v2.rst.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  55 +++++
>  block/Kconfig                           |   9 +
>  block/Makefile                          |   1 +
>  block/blk-cgroup.c                      |   5 +
>  block/blk-ioprio.c                      | 262 ++++++++++++++++++++++++
>  block/blk-ioprio.h                      |  19 ++
>  block/blk-mq-debugfs.c                  |   2 +
>  block/blk-rq-qos.h                      |   1 +
>  8 files changed, 354 insertions(+)
>  create mode 100644 block/blk-ioprio.c
>  create mode 100644 block/blk-ioprio.h
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index b1e81aa8598a..4e59925e6583 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -56,6 +56,7 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
>         5-3-3. IO Latency
>           5-3-3-1. How IO Latency Throttling Works
>           5-3-3-2. IO Latency Interface Files
> +       5-3-4. IO Priority
>       5-4. PID
>         5-4-1. PID Interface Files
>       5-5. Cpuset
> @@ -1866,6 +1867,60 @@ IO Latency Interface Files
>  		duration of time between evaluation events.  Windows only elapse
>  		with IO activity.  Idle periods extend the most recent window.
>  
> +IO Priority
> +~~~~~~~~~~~
> +
> +A single attribute controls the behavior of the I/O priority cgroup policy,
> +namely the blkio.prio.class attribute. The following values are accepted for
> +that attribute:
> +
> +  no-change
> +	Do not modify the I/O priority class.
> +
> +  none-to-rt
> +	For requests that do not have an I/O priority class (NONE),
> +	change the I/O priority class into RT. Do not modify
> +	the I/O priority class of other requests.
> +
> +  restrict-to-be
> +	For requests that do not have an I/O priority class or that have I/O
> +	priority class RT, change it into BE. Do not modify the I/O priority
> +	class of requests that have priority class IDLE.
> +
> +  idle
> +	Change the I/O priority class of all requests into IDLE, the lowest
> +	I/O priority class.
> +
> +The following numerical values are associated with the I/O priority policies:
> +
> ++-------------+---+
> +| no-change   | 0 |
> ++-------------+---+
> +| none-to-rt  | 1 |
> ++-------------+---+
> +| rt-to-be    | 2 |
> ++-------------+---+
> +| all-to-idle | 3 |
> ++-------------+---+
> +
> +The numerical value that corresponds to each I/O priority class is as follows:
> +
> ++-------------------------------+---+
> +| IOPRIO_CLASS_NONE             | 0 |
> ++-------------------------------+---+
> +| IOPRIO_CLASS_RT (real-time)   | 1 |
> ++-------------------------------+---+
> +| IOPRIO_CLASS_BE (best effort) | 2 |
> ++-------------------------------+---+
> +| IOPRIO_CLASS_IDLE             | 3 |
> ++-------------------------------+---+
> +
> +The algorithm to set the I/O priority class for a request is as follows:
> +
> +- Translate the I/O priority class policy into a number.
> +- Change the request I/O priority class into the maximum of the I/O priority
> +  class policy number and the numerical I/O priority class.
> +
>  PID
>  ---
>  
> diff --git a/block/Kconfig b/block/Kconfig
> index 6685578b2a20..e71c63eaaf52 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -162,6 +162,15 @@ config BLK_CGROUP_IOCOST
>  	distributes IO capacity between different groups based on
>  	their share of the overall weight distribution.
>  
> +config BLK_CGROUP_IOPRIO
> +	bool "Cgroup I/O controller for assigning an I/O priority class"
> +	depends on BLK_CGROUP
> +	help
> +	Enable the .prio interface for assigning an I/O priority class to
> +	requests. The I/O priority class affects the order in which an I/O
> +	scheduler and block devices process requests. Only some I/O schedulers
> +	and some block devices support I/O priorities.
> +
>  config BLK_DEBUG_FS
>  	bool "Block layer debugging information in debugfs"
>  	default y
> diff --git a/block/Makefile b/block/Makefile
> index 8d841f5f986f..af3d044abaf1 100644
> --- a/block/Makefile
> +++ b/block/Makefile
> @@ -17,6 +17,7 @@ obj-$(CONFIG_BLK_DEV_BSGLIB)	+= bsg-lib.o
>  obj-$(CONFIG_BLK_CGROUP)	+= blk-cgroup.o
>  obj-$(CONFIG_BLK_CGROUP_RWSTAT)	+= blk-cgroup-rwstat.o
>  obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
> +obj-$(CONFIG_BLK_CGROUP_IOPRIO)	+= blk-ioprio.o
>  obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
>  obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
>  obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 3b0f6efaa2b6..7b06a5fa3cac 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -31,6 +31,7 @@
>  #include <linux/tracehook.h>
>  #include <linux/psi.h>
>  #include "blk.h"
> +#include "blk-ioprio.h"
>  
>  /*
>   * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
> @@ -1187,6 +1188,10 @@ int blkcg_init_queue(struct request_queue *q)
>  	if (ret)
>  		goto err_destroy_all;
>  
> +	ret = blk_ioprio_init(q);
> +	if (ret)
> +		goto err_destroy_all;
> +
>  	ret = blk_throtl_init(q);
>  	if (ret)
>  		goto err_destroy_all;
> diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
> new file mode 100644
> index 000000000000..332a07761bf8
> --- /dev/null
> +++ b/block/blk-ioprio.c
> @@ -0,0 +1,262 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Block rq-qos policy for assigning an I/O priority class to requests.
> + *
> + * Using an rq-qos policy for assigning I/O priority class has two advantages
> + * over using the ioprio_set() system call:
> + *
> + * - This policy is cgroup based so it has all the advantages of cgroups.
> + * - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
> + *   controller affects page cache writeback I/O for filesystems that support
> + *   assiociating a cgroup with writeback I/O. See also
> + *   Documentation/admin-guide/cgroup-v2.rst.
> + */
> +
> +#include <linux/blk-cgroup.h>
> +#include <linux/blk-mq.h>
> +#include <linux/blk_types.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include "blk-ioprio.h"
> +#include "blk-rq-qos.h"
> +
> +/**
> + * enum prio_policy - I/O priority class policy.
> + * @POLICY_NO_CHANGE: (default) do not modify the I/O priority class.
> + * @POLICY_NONE_TO_RT: modify IOPRIO_CLASS_NONE into IOPRIO_CLASS_RT.
> + * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
> + *		IOPRIO_CLASS_BE.
> + * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
> + *
> + * See also <linux/ioprio.h>.
> + */
> +enum prio_policy {
> +	POLICY_NO_CHANGE	= 0,
> +	POLICY_NONE_TO_RT	= 1,
> +	POLICY_RESTRICT_TO_BE	= 2,
> +	POLICY_ALL_TO_IDLE	= 3,
> +};
> +
> +static const char *policy_name[] = {
> +	[POLICY_NO_CHANGE]	= "no-change",
> +	[POLICY_NONE_TO_RT]	= "none-to-rt",
> +	[POLICY_RESTRICT_TO_BE]	= "restrict-to-be",
> +	[POLICY_ALL_TO_IDLE]	= "idle",
> +};
> +
> +static struct blkcg_policy ioprio_policy;
> +
> +/**
> + * struct ioprio_blkg - Per (cgroup, request queue) data.
> + * @pd: blkg_policy_data structure.
> + */
> +struct ioprio_blkg {
> +	struct blkg_policy_data pd;
> +};
> +
> +/**
> + * struct ioprio_blkcg - Per cgroup data.
> + * @cpd: blkcg_policy_data structure.
> + * @prio_policy: One of the IOPRIO_CLASS_* values. See also <linux/ioprio.h>.
> + */
> +struct ioprio_blkcg {
> +	struct blkcg_policy_data cpd;
> +	enum prio_policy	 prio_policy;
> +};
> +
> +static inline struct ioprio_blkg *pd_to_ioprio(struct blkg_policy_data *pd)
> +{
> +	return pd ? container_of(pd, struct ioprio_blkg, pd) : NULL;
> +}
> +
> +static struct ioprio_blkcg *blkcg_to_ioprio_blkcg(struct blkcg *blkcg)
> +{
> +	return container_of(blkcg_to_cpd(blkcg, &ioprio_policy),
> +			    struct ioprio_blkcg, cpd);
> +}
> +
> +static struct ioprio_blkcg *
> +ioprio_blkcg_from_css(struct cgroup_subsys_state *css)
> +{
> +	return blkcg_to_ioprio_blkcg(css_to_blkcg(css));
> +}
> +
> +static struct ioprio_blkcg *ioprio_blkcg_from_bio(struct bio *bio)
> +{
> +	struct blkg_policy_data *pd = blkg_to_pd(bio->bi_blkg, &ioprio_policy);
> +
> +	if (!pd)
> +		return NULL;
> +
> +	return blkcg_to_ioprio_blkcg(pd->blkg->blkcg);
> +}
> +
> +static int ioprio_show_prio_policy(struct seq_file *sf, void *v)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(seq_css(sf));
> +
> +	seq_printf(sf, "%s\n", policy_name[blkcg->prio_policy]);
> +	return 0;
> +}
> +
> +static ssize_t ioprio_set_prio_policy(struct kernfs_open_file *of, char *buf,
> +				      size_t nbytes, loff_t off)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_css(of_css(of));
> +	int ret;
> +
> +	if (off != 0)
> +		return -EIO;
> +	/* kernfs_fop_write_iter() terminates 'buf' with '\0'. */
> +	ret = sysfs_match_string(policy_name, buf);
> +	if (ret < 0)
> +		return ret;
> +	blkcg->prio_policy = ret;
> +
> +	return nbytes;
> +}
> +
> +static struct blkg_policy_data *
> +ioprio_alloc_pd(gfp_t gfp, struct request_queue *q, struct blkcg *blkcg)
> +{
> +	struct ioprio_blkg *ioprio_blkg;
> +
> +	ioprio_blkg = kzalloc(sizeof(*ioprio_blkg), gfp);
> +	if (!ioprio_blkg)
> +		return NULL;
> +
> +	return &ioprio_blkg->pd;
> +}
> +
> +static void ioprio_free_pd(struct blkg_policy_data *pd)
> +{
> +	struct ioprio_blkg *ioprio_blkg = pd_to_ioprio(pd);
> +
> +	kfree(ioprio_blkg);
> +}
> +
> +static struct blkcg_policy_data *ioprio_alloc_cpd(gfp_t gfp)
> +{
> +	struct ioprio_blkcg *blkcg;
> +
> +	blkcg = kzalloc(sizeof(*blkcg), gfp);
> +	if (!blkcg)
> +		return NULL;
> +	blkcg->prio_policy = POLICY_NO_CHANGE;
> +	return &blkcg->cpd;
> +}
> +
> +static void ioprio_free_cpd(struct blkcg_policy_data *cpd)
> +{
> +	struct ioprio_blkcg *blkcg = container_of(cpd, typeof(*blkcg), cpd);
> +
> +	kfree(blkcg);
> +}
> +
> +#define IOPRIO_ATTRS						\
> +	{							\
> +		.name		= "prio.class",			\
> +		.seq_show	= ioprio_show_prio_policy,	\
> +		.write		= ioprio_set_prio_policy,	\
> +	},							\
> +	{ } /* sentinel */
> +
> +/* cgroup v2 attributes */
> +static struct cftype ioprio_files[] = {
> +	IOPRIO_ATTRS
> +};
> +
> +/* cgroup v1 attributes */
> +static struct cftype ioprio_legacy_files[] = {
> +	IOPRIO_ATTRS
> +};
> +
> +static struct blkcg_policy ioprio_policy = {
> +	.dfl_cftypes	= ioprio_files,
> +	.legacy_cftypes = ioprio_legacy_files,
> +
> +	.cpd_alloc_fn	= ioprio_alloc_cpd,
> +	.cpd_free_fn	= ioprio_free_cpd,
> +
> +	.pd_alloc_fn	= ioprio_alloc_pd,
> +	.pd_free_fn	= ioprio_free_pd,
> +};
> +
> +struct blk_ioprio {
> +	struct rq_qos rqos;
> +};
> +
> +static void blkcg_ioprio_track(struct rq_qos *rqos, struct request *rq,
> +			       struct bio *bio)
> +{
> +	struct ioprio_blkcg *blkcg = ioprio_blkcg_from_bio(bio);
> +
> +	/*
> +	 * Except for IOPRIO_CLASS_NONE, higher I/O priority numbers
> +	 * correspond to a lower priority. Hence, the max_t() below selects
> +	 * the lower priority of bi_ioprio and the cgroup I/O priority class.
> +	 * If the cgroup policy has been set to POLICY_NO_CHANGE == 0, the
> +	 * bio I/O priority is not modified. If the bio I/O priority equals
> +	 * IOPRIO_CLASS_NONE, the cgroup I/O priority is assigned to the bio.
> +	 */
> +	bio->bi_ioprio = max_t(u16, bio->bi_ioprio,
> +			       IOPRIO_PRIO_VALUE(blkcg->prio_policy, 0));
> +}
> +
> +static void blkcg_ioprio_exit(struct rq_qos *rqos)
> +{
> +	struct blk_ioprio *blkioprio_blkg =
> +		container_of(rqos, typeof(*blkioprio_blkg), rqos);
> +
> +	blkcg_deactivate_policy(rqos->q, &ioprio_policy);
> +	kfree(blkioprio_blkg);
> +}
> +
> +static struct rq_qos_ops blkcg_ioprio_ops = {
> +	.track	= blkcg_ioprio_track,
> +	.exit	= blkcg_ioprio_exit,
> +};
> +
> +int blk_ioprio_init(struct request_queue *q)
> +{
> +	struct blk_ioprio *blkioprio_blkg;
> +	struct rq_qos *rqos;
> +	int ret;
> +
> +	blkioprio_blkg = kzalloc(sizeof(*blkioprio_blkg), GFP_KERNEL);
> +	if (!blkioprio_blkg)
> +		return -ENOMEM;
> +
> +	ret = blkcg_activate_policy(q, &ioprio_policy);
> +	if (ret) {
> +		kfree(blkioprio_blkg);
> +		return ret;
> +	}
> +
> +	rqos = &blkioprio_blkg->rqos;
> +	rqos->id = RQ_QOS_IOPRIO;
> +	rqos->ops = &blkcg_ioprio_ops;
> +	rqos->q = q;
> +
> +	/*
> +	 * Registering the rq-qos policy after activating the blk-cgroup
> +	 * policy guarantees that ioprio_blkcg_from_bio(bio) != NULL in the
> +	 * rq-qos callbacks.
> +	 */
> +	rq_qos_add(q, rqos);
> +
> +	return 0;
> +}
> +
> +static int __init ioprio_init(void)
> +{
> +	return blkcg_policy_register(&ioprio_policy);
> +}
> +
> +static void __exit ioprio_exit(void)
> +{
> +	blkcg_policy_unregister(&ioprio_policy);
> +}
> +
> +module_init(ioprio_init);
> +module_exit(ioprio_exit);
> diff --git a/block/blk-ioprio.h b/block/blk-ioprio.h
> new file mode 100644
> index 000000000000..a7785c2f1aea
> --- /dev/null
> +++ b/block/blk-ioprio.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef _BLK_IOPRIO_H_
> +#define _BLK_IOPRIO_H_
> +
> +#include <linux/kconfig.h>
> +
> +struct request_queue;
> +
> +#ifdef CONFIG_BLK_CGROUP_IOPRIO
> +int blk_ioprio_init(struct request_queue *q);
> +#else
> +static inline int blk_ioprio_init(struct request_queue *q)
> +{
> +	return 0;
> +}
> +#endif
> +
> +#endif /* _BLK_IOPRIO_H_ */
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 6ac1c86f62ef..4b66d2776eda 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -946,6 +946,8 @@ static const char *rq_qos_id_to_name(enum rq_qos_id id)
>  		return "latency";
>  	case RQ_QOS_COST:
>  		return "cost";
> +	case RQ_QOS_IOPRIO:
> +		return "ioprio";
>  	}
>  	return "unknown";
>  }
> diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
> index a77afbdd472c..f000f83e0621 100644
> --- a/block/blk-rq-qos.h
> +++ b/block/blk-rq-qos.h
> @@ -17,6 +17,7 @@ enum rq_qos_id {
>  	RQ_QOS_WBT,
>  	RQ_QOS_LATENCY,
>  	RQ_QOS_COST,
> +	RQ_QOS_IOPRIO,
>  };
>  
>  struct rq_wait {


Looks good and is working as expected, here are some IOPs from a local
loopback device using this feature and the BFQ scheduler

APP     CGRP_PRIO_ON    CGRP_PRIO_OFF

IDLE    51.2K           67.2K
RT      93.6K           67.4

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>
Tested by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 05/16] block/mq-deadline: Add several comments
  2021-06-18  0:44 ` [PATCH v3 05/16] block/mq-deadline: Add several comments Bart Van Assche
@ 2021-06-18 22:06   ` Adam Manzanares
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 22:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Johannes Thumshirn, Himanshu Madhani,
	Hannes Reinecke, Ming Lei

On Thu, Jun 17, 2021 at 05:44:45PM -0700, Bart Van Assche wrote:
> Make the code easier to read by adding more comments.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 8eea2cbf2bf4..31418e9ce9e2 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -139,6 +139,9 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
>  	}
>  }
>  
> +/*
> + * Callback function that is invoked after @next has been merged into @req.
> + */
>  static void dd_merged_requests(struct request_queue *q, struct request *req,
>  			       struct request *next)
>  {
> @@ -375,6 +378,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  }
>  
>  /*
> + * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
> + *
>   * One confusing aspect here is that we get called for a specific
>   * hardware queue, but we may return a request that is for a
>   * different hardware queue. This is because mq-deadline has shared
> @@ -438,6 +443,10 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
>  	return 0;
>  }
>  
> +/*
> + * Try to merge @bio into an existing request. If @bio has been merged into
> + * an existing request, store the pointer to that request into *@rq.
> + */
>  static int dd_request_merge(struct request_queue *q, struct request **rq,
>  			    struct bio *bio)
>  {
> @@ -461,6 +470,10 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
>  	return ELEVATOR_NO_MERGE;
>  }
>  
> +/*
> + * Attempt to merge a bio into an existing request. This function is called
> + * before @bio is associated with a request.
> + */
>  static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
>  		unsigned int nr_segs)
>  {
> @@ -518,6 +531,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	}
>  }
>  
> +/*
> + * Called from blk_mq_sched_insert_request() or blk_mq_sched_insert_requests().
> + */
>  static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>  			       struct list_head *list, bool at_head)
>  {
> @@ -544,6 +560,8 @@ static void dd_prepare_request(struct request *rq)
>  }
>  
>  /*
> + * Callback from inside blk_mq_free_request().
> + *
>   * For zoned block devices, write unlock the target zone of
>   * completed write requests. Do this while holding the zone lock
>   * spinlock so that the zone is never unlocked while deadline_fifo_request()

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 06/16] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-06-18  0:44 ` [PATCH v3 06/16] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
@ 2021-06-18 22:09   ` Adam Manzanares
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 22:09 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Chaitanya Kulkarni, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn, Himanshu Madhani, Ming Lei

On Thu, Jun 17, 2021 at 05:44:46PM -0700, Bart Van Assche wrote:
> Document the locking strategy by adding two lockdep_assert_held()
> statements.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 31418e9ce9e2..191ff5ce629c 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -279,6 +279,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	bool reads, writes;
>  	int data_dir;
>  
> +	lockdep_assert_held(&dd->lock);
> +
>  	if (!list_empty(&dd->dispatch)) {
>  		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
>  		list_del_init(&rq->queuelist);
> @@ -501,6 +503,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  	const int data_dir = rq_data_dir(rq);
>  
> +	lockdep_assert_held(&dd->lock);
> +
>  	/*
>  	 * This may be a requeue of a write request that has locked its
>  	 * target zone. If it is the case, this releases the zone lock.

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 07/16] block/mq-deadline: Remove two local variables
  2021-06-18  0:44 ` [PATCH v3 07/16] block/mq-deadline: Remove two local variables Bart Van Assche
@ 2021-06-18 22:16   ` Adam Manzanares
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 22:16 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Chaitanya Kulkarni, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn, Himanshu Madhani, Ming Lei

On Thu, Jun 17, 2021 at 05:44:47PM -0700, Bart Van Assche wrote:
> Make __dd_dispatch_request() easier to read by removing two local
> variables.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 191ff5ce629c..caa438f62a4d 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -276,7 +276,6 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
>  static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  {
>  	struct request *rq, *next_rq;
> -	bool reads, writes;
>  	int data_dir;
>  
>  	lockdep_assert_held(&dd->lock);
> @@ -287,9 +286,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  		goto done;
>  	}
>  
> -	reads = !list_empty(&dd->fifo_list[READ]);
> -	writes = !list_empty(&dd->fifo_list[WRITE]);
> -
>  	/*
>  	 * batches are currently reads XOR writes
>  	 */
> @@ -306,7 +302,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * data direction (read / write)
>  	 */
>  
> -	if (reads) {
> +	if (!list_empty(&dd->fifo_list[READ])) {
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
>  
>  		if (deadline_fifo_request(dd, WRITE) &&
> @@ -322,7 +318,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * there are either no reads or writes have been starved
>  	 */
>  
> -	if (writes) {
> +	if (!list_empty(&dd->fifo_list[WRITE])) {
>  dispatch_writes:
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> 

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 08/16] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-06-18  0:44 ` [PATCH v3 08/16] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
@ 2021-06-18 22:18   ` Adam Manzanares
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 22:18 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Chaitanya Kulkarni, Damien Le Moal, Hannes Reinecke,
	Johannes Thumshirn, Himanshu Madhani, Ming Lei

On Thu, Jun 17, 2021 at 05:44:48PM -0700, Bart Van Assche wrote:
> Change "queue" into "sched" to make the function names reflect better the
> purpose of these functions.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index caa438f62a4d..d823ba7cb084 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -395,7 +395,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  	return rq;
>  }
>  
> -static void dd_exit_queue(struct elevator_queue *e)
> +static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
>  
> @@ -408,7 +408,7 @@ static void dd_exit_queue(struct elevator_queue *e)
>  /*
>   * initialize elevator private data (deadline_data).
>   */
> -static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  {
>  	struct deadline_data *dd;
>  	struct elevator_queue *eq;
> @@ -800,8 +800,8 @@ static struct elevator_type mq_deadline = {
>  		.requests_merged	= dd_merged_requests,
>  		.request_merged		= dd_request_merged,
>  		.has_work		= dd_has_work,
> -		.init_sched		= dd_init_queue,
> -		.exit_sched		= dd_exit_queue,
> +		.init_sched		= dd_init_sched,
> +		.exit_sched		= dd_exit_sched,
>  	},
>  
>  #ifdef CONFIG_BLK_DEBUG_FS


Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 09/16] block/mq-deadline: Improve compile-time argument checking
  2021-06-18  0:44 ` [PATCH v3 09/16] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
@ 2021-06-18 22:30   ` Adam Manzanares
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 22:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Chaitanya Kulkarni, Hannes Reinecke, Johannes Thumshirn,
	Himanshu Madhani, Damien Le Moal, Ming Lei

On Thu, Jun 17, 2021 at 05:44:49PM -0700, Bart Van Assche wrote:
> Modern compilers complain if an out-of-range value is passed to a function
> argument that has an enumeration type. Let the compiler detect out-of-range
> data direction arguments instead of verifying the data_dir argument at
> runtime.
> 
> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 96 +++++++++++++++++++++++----------------------
>  1 file changed, 49 insertions(+), 47 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index d823ba7cb084..69126beff77d 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -35,6 +35,13 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
>  static const int fifo_batch = 16;       /* # of sequential requests treated as one
>  				     by the above parameters. For throughput. */
>  
> +enum dd_data_dir {
> +	DD_READ		= READ,
> +	DD_WRITE	= WRITE,
> +};
> +
> +enum { DD_DIR_COUNT = 2 };
> +
>  struct deadline_data {
>  	/*
>  	 * run time data
> @@ -43,20 +50,20 @@ struct deadline_data {
>  	/*
>  	 * requests (deadline_rq s) are present on both sort_list and fifo_list
>  	 */
> -	struct rb_root sort_list[2];
> -	struct list_head fifo_list[2];
> +	struct rb_root sort_list[DD_DIR_COUNT];
> +	struct list_head fifo_list[DD_DIR_COUNT];
>  
>  	/*
>  	 * next in sort order. read, write or both are NULL
>  	 */
> -	struct request *next_rq[2];
> +	struct request *next_rq[DD_DIR_COUNT];
>  	unsigned int batching;		/* number of sequential requests made */
>  	unsigned int starved;		/* times reads have starved writes */
>  
>  	/*
>  	 * settings that change how the i/o scheduler behaves
>  	 */
> -	int fifo_expire[2];
> +	int fifo_expire[DD_DIR_COUNT];
>  	int fifo_batch;
>  	int writes_starved;
>  	int front_merges;
> @@ -97,7 +104,7 @@ deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
>  static inline void
>  deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
>  {
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>  
>  	if (dd->next_rq[data_dir] == rq)
>  		dd->next_rq[data_dir] = deadline_latter_request(rq);
> @@ -169,10 +176,10 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
>  static void
>  deadline_move_request(struct deadline_data *dd, struct request *rq)
>  {
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>  
> -	dd->next_rq[READ] = NULL;
> -	dd->next_rq[WRITE] = NULL;
> +	dd->next_rq[DD_READ] = NULL;
> +	dd->next_rq[DD_WRITE] = NULL;
>  	dd->next_rq[data_dir] = deadline_latter_request(rq);
>  
>  	/*
> @@ -185,9 +192,10 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>   */
> -static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +static inline int deadline_check_fifo(struct deadline_data *dd,
> +				      enum dd_data_dir data_dir)
>  {
> -	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
>  
>  	/*
>  	 * rq is expired!
> @@ -203,19 +211,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
>   * dispatch using arrival ordered lists.
>   */
>  static struct request *
> -deadline_fifo_request(struct deadline_data *dd, int data_dir)
> +deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
>  
> -	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
> -		return NULL;
> -
>  	if (list_empty(&dd->fifo_list[data_dir]))
>  		return NULL;
>  
>  	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> -	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>  		return rq;
>  
>  	/*
> @@ -223,7 +228,7 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
>  	 * an unlocked target zone.
>  	 */
>  	spin_lock_irqsave(&dd->zone_lock, flags);
> -	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
> +	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
>  		if (blk_req_can_dispatch_to_zone(rq))
>  			goto out;
>  	}
> @@ -239,19 +244,16 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
>   * dispatch using sector position sorted lists.
>   */
>  static struct request *
> -deadline_next_request(struct deadline_data *dd, int data_dir)
> +deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
>  
> -	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
> -		return NULL;
> -
>  	rq = dd->next_rq[data_dir];
>  	if (!rq)
>  		return NULL;
>  
> -	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>  		return rq;
>  
>  	/*
> @@ -276,7 +278,7 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
>  static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  {
>  	struct request *rq, *next_rq;
> -	int data_dir;
> +	enum dd_data_dir data_dir;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> @@ -289,9 +291,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * batches are currently reads XOR writes
>  	 */
> -	rq = deadline_next_request(dd, WRITE);
> +	rq = deadline_next_request(dd, DD_WRITE);
>  	if (!rq)
> -		rq = deadline_next_request(dd, READ);
> +		rq = deadline_next_request(dd, DD_READ);
>  
>  	if (rq && dd->batching < dd->fifo_batch)
>  		/* we have a next request are still entitled to batch */
> @@ -302,14 +304,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * data direction (read / write)
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[READ])) {
> -		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +	if (!list_empty(&dd->fifo_list[DD_READ])) {
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
>  
> -		if (deadline_fifo_request(dd, WRITE) &&
> +		if (deadline_fifo_request(dd, DD_WRITE) &&
>  		    (dd->starved++ >= dd->writes_starved))
>  			goto dispatch_writes;
>  
> -		data_dir = READ;
> +		data_dir = DD_READ;
>  
>  		goto dispatch_find_request;
>  	}
> @@ -318,13 +320,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * there are either no reads or writes have been starved
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[WRITE])) {
> +	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
>  dispatch_writes:
> -		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
>  
>  		dd->starved = 0;
>  
> -		data_dir = WRITE;
> +		data_dir = DD_WRITE;
>  
>  		goto dispatch_find_request;
>  	}
> @@ -399,8 +401,8 @@ static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
>  
> -	BUG_ON(!list_empty(&dd->fifo_list[READ]));
> -	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
> +	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
>  
>  	kfree(dd);
>  }
> @@ -424,12 +426,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	}
>  	eq->elevator_data = dd;
>  
> -	INIT_LIST_HEAD(&dd->fifo_list[READ]);
> -	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> -	dd->sort_list[READ] = RB_ROOT;
> -	dd->sort_list[WRITE] = RB_ROOT;
> -	dd->fifo_expire[READ] = read_expire;
> -	dd->fifo_expire[WRITE] = write_expire;
> +	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
> +	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
> +	dd->sort_list[DD_READ] = RB_ROOT;
> +	dd->sort_list[DD_WRITE] = RB_ROOT;
> +	dd->fifo_expire[DD_READ] = read_expire;
> +	dd->fifo_expire[DD_WRITE] = write_expire;
>  	dd->writes_starved = writes_starved;
>  	dd->front_merges = 1;
>  	dd->fifo_batch = fifo_batch;
> @@ -497,7 +499,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  {
>  	struct request_queue *q = hctx->queue;
>  	struct deadline_data *dd = q->elevator->elevator_data;
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>  
>  	lockdep_assert_held(&dd->lock);
>  
> @@ -585,7 +587,7 @@ static void dd_finish_request(struct request *rq)
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);
>  		blk_req_zone_write_unlock(rq);
> -		if (!list_empty(&dd->fifo_list[WRITE]))
> +		if (!list_empty(&dd->fifo_list[DD_WRITE]))
>  			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>  		spin_unlock_irqrestore(&dd->zone_lock, flags);
>  	}
> @@ -626,8 +628,8 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>  		__data = jiffies_to_msecs(__data);			\
>  	return deadline_var_show(__data, (page));			\
>  }
> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
>  SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
>  SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
>  SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> @@ -649,8 +651,8 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
>  		*(__PTR) = __data;					\
>  	return count;							\
>  }
> -STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> -STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
>  STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
>  STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
>  STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> @@ -717,8 +719,8 @@ static int deadline_##name##_next_rq_show(void *data,			\
>  		__blk_mq_debugfs_rq_show(m, rq);			\
>  	return 0;							\
>  }
> -DEADLINE_DEBUGFS_DDIR_ATTRS(READ, read)
> -DEADLINE_DEBUGFS_DDIR_ATTRS(WRITE, write)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
>  #undef DEADLINE_DEBUGFS_DDIR_ATTRS
>  
>  static int deadline_batching_show(void *data, struct seq_file *m)

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 10/16] block/mq-deadline: Improve the sysfs show and store macros
  2021-06-18  0:44 ` [PATCH v3 10/16] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
@ 2021-06-18 23:07   ` Adam Manzanares
  0 siblings, 0 replies; 40+ messages in thread
From: Adam Manzanares @ 2021-06-18 23:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

On Thu, Jun 17, 2021 at 05:44:50PM -0700, Bart Van Assche wrote:
> Define separate macros for integers and jiffies to improve readability.
> Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol().
> The former functions are the recommended functions.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 66 ++++++++++++++++++++-------------------------
>  1 file changed, 29 insertions(+), 37 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 69126beff77d..f92224ff0256 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -605,58 +605,50 @@ static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>  /*
>   * sysfs parts below
>   */
> -static ssize_t
> -deadline_var_show(int var, char *page)
> -{
> -	return sprintf(page, "%d\n", var);
> -}
> -
> -static void
> -deadline_var_store(int *var, const char *page)
> -{
> -	char *p = (char *) page;
> -
> -	*var = simple_strtol(p, &p, 10);
> -}
> -
> -#define SHOW_FUNCTION(__FUNC, __VAR, __CONV)				\
> +#define SHOW_INT(__FUNC, __VAR)						\
>  static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>  {									\
>  	struct deadline_data *dd = e->elevator_data;			\
> -	int __data = __VAR;						\
> -	if (__CONV)							\
> -		__data = jiffies_to_msecs(__data);			\
> -	return deadline_var_show(__data, (page));			\
> -}
> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
> -SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
> -SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
> -SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> -#undef SHOW_FUNCTION
> +									\
> +	return sysfs_emit(page, "%d\n", __VAR);				\
> +}
> +#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
> +SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
> +SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
> +SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
> +SHOW_INT(deadline_front_merges_show, dd->front_merges);
> +SHOW_INT(deadline_fifo_batch_show, dd->fifo_batch);
> +#undef SHOW_INT
> +#undef SHOW_JIFFIES
>  
>  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
>  static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)	\
>  {									\
>  	struct deadline_data *dd = e->elevator_data;			\
> -	int __data;							\
> -	deadline_var_store(&__data, (page));				\
> +	int __data, __ret;						\
> +									\
> +	__ret = kstrtoint(page, 0, &__data);				\
> +	if (__ret < 0)							\
> +		return __ret;						\
>  	if (__data < (MIN))						\
>  		__data = (MIN);						\
>  	else if (__data > (MAX))					\
>  		__data = (MAX);						\
> -	if (__CONV)							\
> -		*(__PTR) = msecs_to_jiffies(__data);			\
> -	else								\
> -		*(__PTR) = __data;					\
> +	*(__PTR) = __CONV(__data);					\
>  	return count;							\
>  }
> -STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
> -STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
> -STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
> -STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
> -STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> +#define STORE_INT(__FUNC, __PTR, MIN, MAX)				\
> +	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, )
> +#define STORE_JIFFIES(__FUNC, __PTR, MIN, MAX)				\
> +	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
> +STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
> +STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
> +STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
> +STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
> +STORE_INT(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX);
>  #undef STORE_FUNCTION
> +#undef STORE_INT
> +#undef STORE_JIFFIES
>  
>  #define DD_ATTR(name) \
>  	__ATTR(name, 0644, deadline_##name##_show, deadline_##name##_store)

Looks good.

Reviewed by: Adam Manzanares <a.manzanares@samsung.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls
  2021-06-18  0:44 ` [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
  2021-06-18 17:15   ` Adam Manzanares
  2021-06-18 19:49   ` Himanshu Madhani
@ 2021-06-21 14:24   ` Tejun Heo
  2 siblings, 0 replies; 40+ messages in thread
From: Tejun Heo @ 2021-06-21 14:24 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Johannes Thumshirn,
	Hannes Reinecke, Ming Lei, Himanshu Madhani

On Thu, Jun 17, 2021 at 05:44:42PM -0700, Bart Van Assche wrote:
> Before adding more calls in this function, simplify the error path.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 00/16] Improve I/O priority support
  2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
                   ` (15 preceding siblings ...)
  2021-06-18  0:44 ` [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
@ 2021-06-21 16:06 ` Jens Axboe
  16 siblings, 0 replies; 40+ messages in thread
From: Jens Axboe @ 2021-06-21 16:06 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares

On 6/17/21 6:44 PM, Bart Van Assche wrote:
> Hi Jens,
> 
> A feature that is missing from the Linux kernel for storage devices that
> support I/O priorities is to set the I/O priority in requests involving page
> cache writeback. Since the identity of the process that triggers page cache
> writeback is not known in the writeback code, the priority set by ioprio_set()
> is ignored. However, an I/O cgroup is associated with writeback requests
> by certain filesystems. Hence this patch series that implements the following
> changes:
> * Add an rq-qos policy that makes the I/O priority configurable per I/O cgroup
>   and also that changes the I/O priority of requests to the lower of (request
>   I/O priority, cgroup I/O priority).
> * Introduce one queue per I/O priority in the mq-deadline scheduler.
> * Dispatch the highest priority requests first.
> 
> Please consider this patch series for kernel v5.14.

Applied, thanks Bart.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-06-18  0:44 ` [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
@ 2021-08-20  0:45   ` Niklas Cassel
  2021-08-20 18:04     ` Bart Van Assche
  0 siblings, 1 reply; 40+ messages in thread
From: Niklas Cassel @ 2021-08-20  0:45 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

On Thu, Jun 17, 2021 at 05:44:56PM -0700, Bart Van Assche wrote:
> While one or more requests with a certain I/O priority are pending, do not
> dispatch lower priority requests. Dispatch lower priority requests anyway
> after the "aging" time has expired.
> 
> This patch has been tested as follows:
> 
> modprobe scsi_debug ndelay=1000000 max_queue=16 &&
> sd='' &&
> while [ -z "$sd" ]; do
>   sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done &&
> echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
> cd /sys/fs/cgroup/blkio/ &&
> echo $$ >cgroup.procs &&
> echo restrict-to-be >blkio.prio.class &&
> mkdir -p hipri &&
> cd hipri &&
> echo none-to-rt >blkio.prio.class &&
> { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
> echo $$ >cgroup.procs &&
> max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt
> 
> Result:
> * 11000 IOPS for the high-priority job
> *    40 IOPS for the low-priority job
> 
> If the aging expiry time is changed from 100s into 0, the IOPS results change
> into 6712 and 6796 IOPS.
> 
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---

(snip)

> @@ -484,15 +495,32 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> -	struct request *rq;
> +	const u64 now_ns = ktime_get_ns();
> +	struct request *rq = NULL;
>  	enum dd_prio prio;
>  
>  	spin_lock(&dd->lock);
> -	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> -		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
> +	/*
> +	 * Start with dispatching requests whose deadline expired more than
> +	 * aging_expire jiffies ago.
> +	 */
> +	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now_ns -
> +					   jiffies_to_nsecs(dd->aging_expire));
>  		if (rq)
> +			goto unlock;
> +	}
> +	/*
> +	 * Next, dispatch requests in priority order. Ignore lower priority
> +	 * requests if any higher priority requests are pending.
> +	 */
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now_ns);
> +		if (rq || dd_queued(dd, prio))
>  			break;
>  	}

Hello Bart,

I'm seeing some unwanted behavior since this patch was introduced.

If I have two processes submitting IO, one with IO class RT,
and one with IO class IDLE.

What happens is that in the "Next, dispatch requests in priority order."
for-loop, even when there are no RT requests to dispatch (

Here are some trace_printk's that I've added:

             fio-1493    [029] ...1   254.606782: dd_insert_requests: dd prio: 0 prio class: 1
First RT req inserted.

             fio-1493    [029] ...3   254.606786: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.606786: __dd_dispatch_request: dd prio: 0 prio class: 1
The first RT req gets dispatched immediately.

             fio-1493    [029] ...3   254.606791: dd_dispatch_request: rq ? ffff888144ff6800 queued ? 1
             fio-1493    [029] ...1   254.607005: dd_insert_requests: dd prio: 2 prio class: 3
First IDLE req inserted. Will be put on hold (for a max of aging_expire).

             fio-1493    [029] ...3   254.607007: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607008: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
             fio-1493    [029] ...1   254.607132: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607134: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607134: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
             fio-1493    [029] ...1   254.607261: dd_insert_requests: dd prio: 0 prio class: 1
Second RT req is inserted.

             fio-1493    [029] ...3   254.607262: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607263: __dd_dispatch_request: dd prio: 0 prio class: 1
Second RT req gets dispatched immediately.

             fio-1493    [029] ...3   254.607263: dd_dispatch_request: rq ? ffff888144ff71c0 queued ? 2
             fio-1493    [029] ...3   254.607264: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607265: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...3   254.607274: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607274: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...1   254.607439: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607440: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607441: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...1   254.607618: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607619: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607620: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ...1   254.607789: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607790: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607791: dd_dispatch_request: rq ? 0000000000000000 queued ? 2
             fio-1493    [029] ..s.   254.607860: dd_finish_request: dd prio: 0 prio class: 1
First RT req is finished. IDLE requests are still on hold, since the second RT req hasn't finished yet.

             fio-1493    [029] ...1   254.607978: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.607980: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.607980: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
             fio-1493    [029] ...1   254.608021: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.608022: dd_dispatch_request: checking dd_prio: 0
             fio-1493    [029] ...3   254.608023: dd_dispatch_request: rq ? 0000000000000000 queued ? 1
          <idle>-0       [029] .Ns1   254.608657: dd_finish_request: dd prio: 0 prio class: 1
Second RT req is finished.

             fio-1493    [029] ...1   254.608672: dd_insert_requests: dd prio: 2 prio class: 3
             fio-1493    [029] ...3   254.608674: dd_dispatch_request: checking dd_prio: 0
There are no RT reqs waiting to get dispatched, and no RT reqs in flight.

             fio-1493    [029] ...3   254.608675: dd_dispatch_request: checking dd_prio: 1
Therefore the loop continues to examine the next dd_prio (BE)

             fio-1493    [029] ...3   254.608675: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
There are no BE reqs waiting to get dispatched, however there are 4294967100 reqs in flight.

Since it says that there are BE reqs in flight, we never continue the loop to examine IDLE reqs.
(This number is obviously bogus.)
If there wasn't any BE reqs in flight, the loop would have moved on, and dispatched the IDLE reqs
at this point in time.
(Since there isn't any RT reqs to handle.)

   kworker/29:1H-464     [029] ...2   254.612489: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.612490: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.612491: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.616483: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.616484: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.616485: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.620479: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.620480: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.620481: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.624481: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.624482: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.624482: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.628478: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.628479: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.628480: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.632478: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.632479: dd_dispatch_request: checking dd_prio: 1
   kworker/29:1H-464     [029] ...2   254.632480: dd_dispatch_request: rq ? 0000000000000000 queued ? 4294967100
   kworker/29:1H-464     [029] ...2   254.636477: dd_dispatch_request: checking dd_prio: 0
   kworker/29:1H-464     [029] ...2   254.636477: dd_dispatch_request: checking dd_prio: 1

...

   kworker/29:1H-464     [029] ...2   264.607254: __dd_dispatch_request: dd prio: 2 prio class: 3
Instead, because of the broken accouting of BE reqs, we have to wait 10 seconds,
until the aging_expire timeout occurs, until the first IDLE request is finally dispatched.
If it wasn't for the broken BE accounting, this first IDLE req would have been dispatched
as soon as the second RT req finished.



So, what is wrong here?

1) The "Next, dispatch requests in priority order." for-loop does:

 if (rq || dd_queued(dd, prio))

To check if it should process requests for the next priority class or not.

dd_queued() calls dd_sum() which has this comment:

/*                                                                                          
 * Returns the total number of dd_count(dd, event_type, prio) calls across all              
 * CPUs. No locking or barriers since it is fine if the returned sum is slightly            
 * outdated.                                                                                
 */

Perhaps not so got to use an accounting that is not accurate to determine
if we should process IOs belonging to a certain priority class or not.

Perhaps we could use e.g. atomics instead of per cpu counters without
locking?


2) I added even more trace_printk's and I saw this:

#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
  kworker/u64:11-628     [026] ....    13.650123: dd_finish_request: dd prio: 1 prio class: 0
  kworker/u64:11-628     [026] ....    13.650125: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:3-289     [023] ....    13.650217: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:3-289     [023] ....    13.650219: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:0-7       [030] ....    13.650226: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:0-7       [030] ....    13.650229: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:2-288     [014] ....    13.650365: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.650367: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:8-294     [024] ....    13.650405: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650407: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:9-296     [018] ....    13.650418: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.650420: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:7-293     [031] ....    13.650444: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:7-293     [031] ....    13.650446: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:4-290     [001] ....    13.650491: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:4-290     [001] ....    13.650493: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
   kworker/u64:8-294     [024] ....    13.650588: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650588: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:2-288     [014] ....    13.650621: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.650622: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:7-293     [031] ....    13.650642: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:7-293     [031] ....    13.650643: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:9-296     [018] ....    13.650653: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.650654: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:4-290     [001] ....    13.650690: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:4-290     [001] ....    13.650691: dd_queued_print: ins: 0 comp: 2 queued: 4294967294
   kworker/u64:8-294     [024] ....    13.650764: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650765: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:2-288     [014] ....    13.650774: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.650774: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:9-296     [018] ....    13.650800: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.650801: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:7-293     [031] ....    13.650833: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:7-293     [031] ....    13.650834: dd_queued_print: ins: 0 comp: 3 queued: 4294967293
   kworker/u64:8-294     [024] ....    13.650865: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:8-294     [024] ....    13.650866: dd_queued_print: ins: 0 comp: 4 queued: 4294967292
...
   kworker/u64:2-288     [014] ....    13.655333: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ....    13.655334: dd_queued_print: ins: 0 comp: 19 queued: 4294967277
   kworker/u64:9-296     [018] ....    13.655382: dd_finish_request: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ....    13.655383: dd_queued_print: ins: 0 comp: 19 queued: 4294967277
   kworker/u64:4-290     [001] ...1    13.655564: dd_insert_requests: dd prio: 1 prio class: 0
Here comes the first insert...

   kworker/u64:4-290     [001] ...1    13.655565: dd_queued_print: ins: 1 comp: 12 queued: 4294967285
   kworker/u64:9-296     [018] ...1    13.655565: dd_insert_requests: dd prio: 1 prio class: 0
   kworker/u64:2-288     [014] ...1    13.655565: dd_insert_requests: dd prio: 1 prio class: 0
   kworker/u64:9-296     [018] ...1    13.655566: dd_queued_print: ins: 1 comp: 19 queued: 4294967278
   kworker/u64:2-288     [014] ...1    13.655566: dd_queued_print: ins: 1 comp: 19 queued: 4294967278
   kworker/u64:8-294     [024] ....    13.655662: dd_finish_request: dd prio: 1 prio class: 0

What appears to be happening here is that dd_finish_request() gets called a bunch of times,
without any preceeding dd_insert_requests() call.

Reading the comment above dd_finish_request():

 * Callback from inside blk_mq_free_request().

Could it be that this callback is done on certain requests that was never
sent down to mq-deadline?
Perhaps blk_mq_request_bypass_insert() or blk_mq_try_issue_directly() was
called, and therefore dd_insert_requests() was never called for some of the
ealiest requests in the system, but since e->type->ops.finish_request() is
set, dd_finish_request() gets called on free anyway.

Since dd_queued() is defined as:
	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
And since we can see that we have several calls to dd_finish_request()
that has increased the completed counter, dd_queued() returns a
very high value, since 0 - 19 = 4294967277.

This is probably the bug that causes the bogus accouting of BE reqs.
However, looking at the comment for dd_sum(), it also doesn't feel good
to rely on something that is "slightly outdated" to determine if we
should process a whole io class or not.
Letting requests wait for 10 seconds when there are no other outstanding
requests in the scheduler doesn't seem like the right thing to do.

Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-08-20  0:45   ` Niklas Cassel
@ 2021-08-20 18:04     ` Bart Van Assche
  2021-08-20 23:05       ` Niklas Cassel
  0 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-08-20 18:04 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

On 8/19/21 5:45 PM, Niklas Cassel wrote:
 > dd_queued() calls dd_sum() which has this comment:
 >
 > /*
 >   * Returns the total number of dd_count(dd, event_type, prio) calls across all
 >   * CPUs. No locking or barriers since it is fine if the returned sum is slightly
 >   * outdated.
 >   */
 >
 > Perhaps not so got to use an accounting that is not accurate to determine
 > if we should process IOs belonging to a certain priority class or not.
 >
 > Perhaps we could use e.g. atomics instead of per cpu counters without
 > locking?

First of all, thanks for the detailed report.

Using atomics is an option but an option we should only choose if there are no
better options since every atomic operation in the hot path has a measurable
negative performance impact.

 >    kworker/u64:11-628     [026] ....    13.650123: dd_finish_request: dd prio: 1 prio class: 0
 >    kworker/u64:11-628     [026] ....    13.650125: dd_queued_print: ins: 0 comp: 1 queued: 4294967295

4294967295 is the unsigned representation of -1. This indicates a bug - the
"queued" number should never be negative.

 > What appears to be happening here is that dd_finish_request() gets called a bunch of times,
 > without any preceeding dd_insert_requests() call.
 >
 > Reading the comment above dd_finish_request():
 >
 >   * Callback from inside blk_mq_free_request().
 >
 > Could it be that this callback is done on certain requests that was never
 > sent down to mq-deadline?
 > Perhaps blk_mq_request_bypass_insert() or blk_mq_try_issue_directly() was
 > called, and therefore dd_insert_requests() was never called for some of the
 > ealiest requests in the system, but since e->type->ops.finish_request() is
 > set, dd_finish_request() gets called on free anyway.
 >
 > Since dd_queued() is defined as:
 > 	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
 > And since we can see that we have several calls to dd_finish_request()
 > that has increased the completed counter, dd_queued() returns a
 > very high value, since 0 - 19 = 4294967277.
 >
 > This is probably the bug that causes the bogus accouting of BE reqs.
 > However, looking at the comment for dd_sum(), it also doesn't feel good
 > to rely on something that is "slightly outdated" to determine if we
 > should process a whole io class or not.
 > Letting requests wait for 10 seconds when there are no other outstanding
 > requests in the scheduler doesn't seem like the right thing to do.

The "slightly outdated" in that comment is not what causes the I/O delays -
these are caused by updating statistics in dd_finish_request() for requests
that have not been seen by dd_insert_requests(). Please note that
dd_insert_request() and dd_dispatch_request() access the I/O statistics
while dd->lock is held. Only dd_finish_request() updates the I/O statistics
without holding dd->lock. So the dd_queued() call from inside
dd_dispatch_request() can return a number that is too big but not a number
that is too small. Hence, I don't think that updating the I/O statistics
without locking in the deadline scheduler can cause an I/O delay.

Does the patch below help?

Thanks,

Bart.


Subject: [PATCH] mq-deadline: Fix request accounting

The block layer may call the I/O scheduler .finish_request() callback
without having called the .insert_requests() callback. Make sure that the
mq-deadline I/O statistics are correct if the block layer inserts an I/O
request that bypasses the I/O scheduler. This patch prevents that lower
priority I/O is delayed longer than necessary for mixed I/O priority
workloads.

Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
  block/mq-deadline-main.c | 14 ++++++++++++--
  1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
index 294be0c0db65..933be9c82ec4 100644
--- a/block/mq-deadline-main.c
+++ b/block/mq-deadline-main.c
@@ -743,6 +743,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  	blkcg = dd_blkcg_from_bio(rq->bio);
  	ddcg_count(blkcg, inserted, ioprio_class);
  	rq->elv.priv[0] = blkcg;
+	rq->elv.priv[1] = (void *)(uintptr_t)1;

  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
  		blk_mq_free_requests(&free);
@@ -795,6 +796,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
  static void dd_prepare_request(struct request *rq)
  {
  	rq->elv.priv[0] = NULL;
+	rq->elv.priv[1] = NULL;
  }

  /*
@@ -822,8 +824,16 @@ static void dd_finish_request(struct request *rq)
  	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
  	struct dd_per_prio *per_prio = &dd->per_prio[prio];

-	dd_count(dd, completed, prio);
-	ddcg_count(blkcg, completed, ioprio_class);
+	/*
+	 * The block layer core may call dd_finish_request() without having
+	 * called dd_insert_requests(). Hence only update statistics for
+	 * requests for which dd_insert_requests() has been called. See also
+	 * blk_mq_request_bypass_insert().
+	 */
+	if (rq->elv.priv[1]) {
+		dd_count(dd, completed, prio);
+		ddcg_count(blkcg, completed, ioprio_class);
+	}

  	if (blk_queue_is_zoned(q)) {
  		unsigned long flags;

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-08-20 18:04     ` Bart Van Assche
@ 2021-08-20 23:05       ` Niklas Cassel
  2021-08-20 23:38         ` Bart Van Assche
  0 siblings, 1 reply; 40+ messages in thread
From: Niklas Cassel @ 2021-08-20 23:05 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

On Fri, Aug 20, 2021 at 11:04:32AM -0700, Bart Van Assche wrote:
> On 8/19/21 5:45 PM, Niklas Cassel wrote:
> > dd_queued() calls dd_sum() which has this comment:
> >
> > /*
> >   * Returns the total number of dd_count(dd, event_type, prio) calls across all
> >   * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> >   * outdated.
> >   */
> >
> > Perhaps not so got to use an accounting that is not accurate to determine
> > if we should process IOs belonging to a certain priority class or not.
> >
> > Perhaps we could use e.g. atomics instead of per cpu counters without
> > locking?
> 
> First of all, thanks for the detailed report.
> 
> Using atomics is an option but an option we should only choose if there are no
> better options since every atomic operation in the hot path has a measurable
> negative performance impact.
> 
> >    kworker/u64:11-628     [026] ....    13.650123: dd_finish_request: dd prio: 1 prio class: 0
> >    kworker/u64:11-628     [026] ....    13.650125: dd_queued_print: ins: 0 comp: 1 queued: 4294967295
> 
> 4294967295 is the unsigned representation of -1. This indicates a bug - the
> "queued" number should never be negative.
> 
> > What appears to be happening here is that dd_finish_request() gets called a bunch of times,
> > without any preceeding dd_insert_requests() call.
> >
> > Reading the comment above dd_finish_request():
> >
> >   * Callback from inside blk_mq_free_request().
> >
> > Could it be that this callback is done on certain requests that was never
> > sent down to mq-deadline?
> > Perhaps blk_mq_request_bypass_insert() or blk_mq_try_issue_directly() was
> > called, and therefore dd_insert_requests() was never called for some of the
> > ealiest requests in the system, but since e->type->ops.finish_request() is
> > set, dd_finish_request() gets called on free anyway.
> >
> > Since dd_queued() is defined as:
> > 	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> > And since we can see that we have several calls to dd_finish_request()
> > that has increased the completed counter, dd_queued() returns a
> > very high value, since 0 - 19 = 4294967277.
> >
> > This is probably the bug that causes the bogus accouting of BE reqs.
> > However, looking at the comment for dd_sum(), it also doesn't feel good
> > to rely on something that is "slightly outdated" to determine if we
> > should process a whole io class or not.
> > Letting requests wait for 10 seconds when there are no other outstanding
> > requests in the scheduler doesn't seem like the right thing to do.
> 
> The "slightly outdated" in that comment is not what causes the I/O delays -
> these are caused by updating statistics in dd_finish_request() for requests
> that have not been seen by dd_insert_requests(). Please note that
> dd_insert_request() and dd_dispatch_request() access the I/O statistics
> while dd->lock is held. Only dd_finish_request() updates the I/O statistics
> without holding dd->lock. So the dd_queued() call from inside
> dd_dispatch_request() can return a number that is too big but not a number
> that is too small. Hence, I don't think that updating the I/O statistics
> without locking in the deadline scheduler can cause an I/O delay.
> 
> Does the patch below help?
> 
> Thanks,
> 
> Bart.
> 
> 
> Subject: [PATCH] mq-deadline: Fix request accounting
> 
> The block layer may call the I/O scheduler .finish_request() callback
> without having called the .insert_requests() callback. Make sure that the
> mq-deadline I/O statistics are correct if the block layer inserts an I/O
> request that bypasses the I/O scheduler. This patch prevents that lower
> priority I/O is delayed longer than necessary for mixed I/O priority
> workloads.
> 
> Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
> Reported-by: Niklas Cassel <Niklas.Cassel@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline-main.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/block/mq-deadline-main.c b/block/mq-deadline-main.c
> index 294be0c0db65..933be9c82ec4 100644
> --- a/block/mq-deadline-main.c
> +++ b/block/mq-deadline-main.c
> @@ -743,6 +743,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	blkcg = dd_blkcg_from_bio(rq->bio);
>  	ddcg_count(blkcg, inserted, ioprio_class);
>  	rq->elv.priv[0] = blkcg;
> +	rq->elv.priv[1] = (void *)(uintptr_t)1;
> 
>  	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>  		blk_mq_free_requests(&free);
> @@ -795,6 +796,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>  static void dd_prepare_request(struct request *rq)
>  {
>  	rq->elv.priv[0] = NULL;
> +	rq->elv.priv[1] = NULL;
>  }
> 
>  /*
> @@ -822,8 +824,16 @@ static void dd_finish_request(struct request *rq)
>  	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>  	struct dd_per_prio *per_prio = &dd->per_prio[prio];
> 
> -	dd_count(dd, completed, prio);
> -	ddcg_count(blkcg, completed, ioprio_class);
> +	/*
> +	 * The block layer core may call dd_finish_request() without having
> +	 * called dd_insert_requests(). Hence only update statistics for
> +	 * requests for which dd_insert_requests() has been called. See also
> +	 * blk_mq_request_bypass_insert().
> +	 */
> +	if (rq->elv.priv[1]) {
> +		dd_count(dd, completed, prio);
> +		ddcg_count(blkcg, completed, ioprio_class);
> +	}
> 
>  	if (blk_queue_is_zoned(q)) {
>  		unsigned long flags;

Hello Bart,


Thank you for your patch!
I tested it, and it does solve my problem.

I've been thinking more about this problem.
The problem is seen on a SATA zoned drive.

These drives have mq-deadline set as default by the
blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE) call in
drivers/scsi/sd_zbc.c:sd_zbc_read_zones()

This triggers block/elevator.c:elevator_init_mq() to initialize
"mq-deadline" as default scheduler for these devices.

I think that the problem might because that drivers/scsi/sd_zbc.c
has created the request_queue and submitted requests, before the call
to elevator_init_mq() is done.

elevator_init_mq() will set q->elevator->type->ops, so once that is set,
blk_mq_free_request() will call e->type->ops.finish_request(rq),
regardless if the request was inserted through the recently initialized
scheduler or not.

While I'm perfectly happy with your fix, would it perhaps be possible
to do the fix in block/elevator.c instead, so that we don't need to
do the same type of check that you did, in each and every single
io scheduler?

Looking at block/elevator.c:elevator_init_mq(), it seems to do:

blk_mq_freeze_queue()
blk_mq_quiesce_queue()

blk_mq_init_sched()

blk_mq_unquiesce_queue()
blk_mq_unfreeze_queue()

This obviously isn't enough to avoid the bug that we are seeing,
but could perhaps a more general fix be to flush/wait until all
in-flight requests have completed, and then free them, and then
set q->elevator->type->ops. That way, all requests inserted after
the io scheduler has been initialized, will have gone through the
io scheduler. So all finish_request() calls should have a
matching insert_request() call. What do you think?


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-08-20 23:05       ` Niklas Cassel
@ 2021-08-20 23:38         ` Bart Van Assche
  2021-08-23  7:36           ` Niklas Cassel
  0 siblings, 1 reply; 40+ messages in thread
From: Bart Van Assche @ 2021-08-20 23:38 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

On 8/20/21 4:05 PM, Niklas Cassel wrote:
> Thank you for your patch!
> I tested it, and it does solve my problem.

That's quick. Thanks!

> I've been thinking more about this problem.
> The problem is seen on a SATA zoned drive.
> 
> These drives have mq-deadline set as default by the
> blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE) call in
> drivers/scsi/sd_zbc.c:sd_zbc_read_zones()
> 
> This triggers block/elevator.c:elevator_init_mq() to initialize
> "mq-deadline" as default scheduler for these devices.
> 
> I think that the problem might because that drivers/scsi/sd_zbc.c
> has created the request_queue and submitted requests, before the call
> to elevator_init_mq() is done.
> 
> elevator_init_mq() will set q->elevator->type->ops, so once that is set,
> blk_mq_free_request() will call e->type->ops.finish_request(rq),
> regardless if the request was inserted through the recently initialized
> scheduler or not.
> 
> While I'm perfectly happy with your fix, would it perhaps be possible
> to do the fix in block/elevator.c instead, so that we don't need to
> do the same type of check that you did, in each and every single
> io scheduler?
> 
> Looking at block/elevator.c:elevator_init_mq(), it seems to do:
> 
> blk_mq_freeze_queue()
> blk_mq_quiesce_queue()
> 
> blk_mq_init_sched()
> 
> blk_mq_unquiesce_queue()
> blk_mq_unfreeze_queue()
> 
> This obviously isn't enough to avoid the bug that we are seeing,
> but could perhaps a more general fix be to flush/wait until all
> in-flight requests have completed, and then free them, and then
> set q->elevator->type->ops. That way, all requests inserted after
> the io scheduler has been initialized, will have gone through the
> io scheduler. So all finish_request() calls should have a
> matching insert_request() call. What do you think?

q->elevator is set from inside the I/O scheduler's init_sched callback 
and that callback is called with the request queue frozen. Freezing 
happens by calling blk_mq_freeze_queue() and that function waits until 
all previously submitted requests have finished. So I don't think that 
the race described above can happen.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-08-20 23:38         ` Bart Van Assche
@ 2021-08-23  7:36           ` Niklas Cassel
  2021-08-23 17:15             ` Bart Van Assche
  0 siblings, 1 reply; 40+ messages in thread
From: Niklas Cassel @ 2021-08-23  7:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

On Fri, Aug 20, 2021 at 04:38:24PM -0700, Bart Van Assche wrote:
> On 8/20/21 4:05 PM, Niklas Cassel wrote:
> > Thank you for your patch!
> > I tested it, and it does solve my problem.
> 
> That's quick. Thanks!

Thank you for the patch!

> 
> > I've been thinking more about this problem.
> > The problem is seen on a SATA zoned drive.
> > 
> > These drives have mq-deadline set as default by the
> > blk_queue_required_elevator_features(q, ELEVATOR_F_ZBD_SEQ_WRITE) call in
> > drivers/scsi/sd_zbc.c:sd_zbc_read_zones()
> > 
> > This triggers block/elevator.c:elevator_init_mq() to initialize
> > "mq-deadline" as default scheduler for these devices.
> > 
> > I think that the problem might because that drivers/scsi/sd_zbc.c
> > has created the request_queue and submitted requests, before the call
> > to elevator_init_mq() is done.
> > 
> > elevator_init_mq() will set q->elevator->type->ops, so once that is set,
> > blk_mq_free_request() will call e->type->ops.finish_request(rq),
> > regardless if the request was inserted through the recently initialized
> > scheduler or not.
> > 
> > While I'm perfectly happy with your fix, would it perhaps be possible
> > to do the fix in block/elevator.c instead, so that we don't need to
> > do the same type of check that you did, in each and every single
> > io scheduler?
> > 
> > Looking at block/elevator.c:elevator_init_mq(), it seems to do:
> > 
> > blk_mq_freeze_queue()
> > blk_mq_quiesce_queue()
> > 
> > blk_mq_init_sched()
> > 
> > blk_mq_unquiesce_queue()
> > blk_mq_unfreeze_queue()
> > 
> > This obviously isn't enough to avoid the bug that we are seeing,
> > but could perhaps a more general fix be to flush/wait until all
> > in-flight requests have completed, and then free them, and then
> > set q->elevator->type->ops. That way, all requests inserted after
> > the io scheduler has been initialized, will have gone through the
> > io scheduler. So all finish_request() calls should have a
> > matching insert_request() call. What do you think?
> 
> q->elevator is set from inside the I/O scheduler's init_sched callback and
> that callback is called with the request queue frozen. Freezing happens by
> calling blk_mq_freeze_queue() and that function waits until all previously
> submitted requests have finished. So I don't think that the race described
> above can happen.

I see.
I was mainly thinking that it should be possible to do a generic fix,
such that we eventually won't need a similar fix as yours in all the
different I/O schedulers.

However, looking at e.g. BFQ it does appear to have something similar
to your fix already:

#define RQ_BFQQ(rq)             ((rq)->elv.priv[1])

bfq_finish_requeue_request()
	struct bfq_queue *bfqq = RQ_BFQQ(rq);

	...

        if (!rq->elv.icq || !bfqq)
                return;



So your proposed fix should also be fine.

However, it does not apply on top of Torvalds master or Jens's for-next
branch because they both have reverted your cgroup support patch.

If you rebase your fix and send it out, I will be happy to send out
a Reviewed-by/Tested-by.


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-08-23  7:36           ` Niklas Cassel
@ 2021-08-23 17:15             ` Bart Van Assche
  2021-08-23 23:01               ` Damien Le Moal
  2021-08-24 21:33               ` Niklas Cassel
  0 siblings, 2 replies; 40+ messages in thread
From: Bart Van Assche @ 2021-08-23 17:15 UTC (permalink / raw)
  To: Niklas Cassel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

On 8/23/21 12:36 AM, Niklas Cassel wrote:
> I was mainly thinking that it should be possible to do a generic fix,
> such that we eventually won't need a similar fix as yours in all the
> different I/O schedulers.

Coming up with a generic fix would be great but I have not yet found an 
elegant approach ...

Another question is what the impact is of scheduler bypass on zoned 
block devices? Is the zone locking performed by the mq-deadline 
scheduler for writes to zoned block devices compatible with I/O 
scheduler bypass?

> However, it does not apply on top of Torvalds master or Jens's for-next
> branch because they both have reverted your cgroup support patch.
> 
> If you rebase your fix and send it out, I will be happy to send out
> a Reviewed-by/Tested-by.

I will rebase, retest and resend my patch.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-08-23 17:15             ` Bart Van Assche
@ 2021-08-23 23:01               ` Damien Le Moal
  2021-08-24 21:33               ` Niklas Cassel
  1 sibling, 0 replies; 40+ messages in thread
From: Damien Le Moal @ 2021-08-23 23:01 UTC (permalink / raw)
  To: Bart Van Assche, Niklas Cassel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Hannes Reinecke, Ming Lei, Johannes Thumshirn,
	Himanshu Madhani

On 2021/08/24 2:15, Bart Van Assche wrote:
> On 8/23/21 12:36 AM, Niklas Cassel wrote:
>> I was mainly thinking that it should be possible to do a generic fix,
>> such that we eventually won't need a similar fix as yours in all the
>> different I/O schedulers.
> 
> Coming up with a generic fix would be great but I have not yet found an 
> elegant approach ...
> 
> Another question is what the impact is of scheduler bypass on zoned 
> block devices? Is the zone locking performed by the mq-deadline 
> scheduler for writes to zoned block devices compatible with I/O 
> scheduler bypass?

Without mq-deadline, in the general case, regular writes to the same zone may
end up being reordered and IO errors will follow. Only zone append writes can
survive the scheduler bypass as zone write locking in that case is done at the
scsi disk driver level using the dispatch queue. So bypassing the scheduler for
writes can work only with very special cases, namely, the user issuign small
request that are never split and at QD=1 at most per zone. Any other workload
(larger requests and/or higher write QD per zone) can easily trigger write
errors (that is fairly easy to check).

> 
>> However, it does not apply on top of Torvalds master or Jens's for-next
>> branch because they both have reverted your cgroup support patch.
>>
>> If you rebase your fix and send it out, I will be happy to send out
>> a Reviewed-by/Tested-by.
> 
> I will rebase, retest and resend my patch.
> 
> Thanks,
> 
> Bart.
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests
  2021-08-23 17:15             ` Bart Van Assche
  2021-08-23 23:01               ` Damien Le Moal
@ 2021-08-24 21:33               ` Niklas Cassel
  1 sibling, 0 replies; 40+ messages in thread
From: Niklas Cassel @ 2021-08-24 21:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Adam Manzanares, Damien Le Moal, Hannes Reinecke, Ming Lei,
	Johannes Thumshirn, Himanshu Madhani

On Mon, Aug 23, 2021 at 10:15:39AM -0700, Bart Van Assche wrote:
> On 8/23/21 12:36 AM, Niklas Cassel wrote:
> > I was mainly thinking that it should be possible to do a generic fix,
> > such that we eventually won't need a similar fix as yours in all the
> > different I/O schedulers.
> 
> Coming up with a generic fix would be great but I have not yet found an
> elegant approach ...
> 
> Another question is what the impact is of scheduler bypass on zoned block
> devices? Is the zone locking performed by the mq-deadline scheduler for
> writes to zoned block devices compatible with I/O scheduler bypass?

If anyone is curious of how the stack trace looks:

#0  dd_finish_request (rq=0xffff8881051b0000) at block/mq-deadline.c:790
#1  0xffffffff81741fcf in blk_mq_free_request (rq=rq@entry=0xffff8881051b0000)
    at block/blk-mq.c:516
#2  0xffffffff8172e4fa in blk_put_request (req=req@entry=0xffff8881051b0000)
    at block/blk-core.c:644
#3  0xffffffff819c51c2 in __scsi_execute (sdev=0xffff888106064000,
    cmd=cmd@entry=0xffffc900002c78d8 "", data_direction=data_direction@entry=3,
    buffer=buffer@entry=0x0 <fixed_percpu_data>, bufflen=bufflen@entry=0,
    sense=sense@entry=0x0 <fixed_percpu_data>, sshdr=0xffffc900002c7878, timeout=30000, retries=5,
    flags=0, rq_flags=0, resid=0x0 <fixed_percpu_data>) at drivers/scsi/scsi_lib.c:260
#4  0xffffffff819e12b6 in scsi_execute_req (resid=0x0 <fixed_percpu_data>, retries=5,
    timeout=30000, sshdr=0xffffc900002c7878, bufflen=0, buffer=0x0 <fixed_percpu_data>,
    data_direction=3, cmd=0xffffc900002c78d8 "", sdev=<optimized out>)
    at ./include/scsi/scsi_device.h:463
#5  sd_spinup_disk (sdkp=<optimized out>) at drivers/scsi/sd.c:2177
#6  sd_revalidate_disk (disk=<optimized out>) at drivers/scsi/sd.c:3302
#7  0xffffffff819e479d in sd_open (bdev=0xffff888102b45800, mode=1) at drivers/scsi/sd.c:1443
#8  0xffffffff81422285 in blkdev_get_whole (bdev=bdev@entry=0xffff888102b45800, mode=mode@entry=1)
    at fs/block_dev.c:1253
#9  0xffffffff8142348a in blkdev_get_by_dev (dev=<optimized out>, mode=mode@entry=1,
    holder=holder@entry=0x0 <fixed_percpu_data>) at fs/block_dev.c:1417
#10 0xffffffff8175632c in disk_scan_partitions (disk=0xffff88810514dc00) at block/genhd.c:388
#11 register_disk (groups=<optimized out>, disk=0xffff88810514dc00, parent=0xffff888106064268)
    at block/genhd.c:435
#12 __device_add_disk (parent=parent@entry=0xffff888106064268, disk=disk@entry=0xffff88810514dc00,
    groups=groups@entry=0x0 <fixed_percpu_data>, register_queue=register_queue@entry=true)
    at block/genhd.c:527
#13 0xffffffff8175640f in device_add_disk (parent=parent@entry=0xffff888106064268,
    disk=disk@entry=0xffff88810514dc00, groups=groups@entry=0x0 <fixed_percpu_data>)
    at block/genhd.c:548
#14 0xffffffff819e4d0f in sd_probe (dev=0xffff888106064268) at drivers/scsi/sd.c:3581


This stack trace is simply the first one, it can be traced back to
sd_revalidate_disk(). All the other dd_finish_request() calls (which doesn't
have a matching insert) also originate from sd_revalidate_disk().

Like we suspected, this is because of scheduler bypass.

E.g.
sd_revalidate_disk() -> read_capacity_16() -> __scsi_execute() ->
blk_execute_rq() -> blk_execute_rq_nowait() -> blk_mq_sched_insert_request()
-> blk_mq_sched_bypass_insert() -> blk_mq_request_bypass_insert()

__scsi_execute() sets req op to REQ_OP_DRV_OUT or REQ_OP_DRV_IN.

blk_mq_sched_insert_request() bypasses the scheduler when
blk_mq_sched_bypass_insert() returns true, which it does if
blk_rq_is_passthrough().
blk_rq_is_passthrough() returns true if req op is REQ_OP_DRV_OUT
or REQ_OP_DRV_IN.

Basically __scsi_execute() is the equivalent of __nvme_submit_sync_cmd(),
but with a worse name :)


"Is the zone locking performed by the mq-deadline scheduler for
writes to zoned block devices compatible with I/O scheduler bypass?"

Since sd_revalidate_disk() doesn't do any writes, everything is fine.
(And like Damien said, if any kernel code did passthrough writes,
we would have seen errors from the drive a long time ago.)

Yes, a user submitting passthrough writes can of course do "bad things",
but that is expected :)


Kind regards,
Niklas

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2021-08-24 21:34 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-18  0:44 [PATCH v3 00/16] Improve I/O priority support Bart Van Assche
2021-06-18  0:44 ` [PATCH v3 01/16] block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive Bart Van Assche
2021-06-18 17:09   ` Adam Manzanares
2021-06-18 19:49   ` Himanshu Madhani
2021-06-18  0:44 ` [PATCH v3 02/16] block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls Bart Van Assche
2021-06-18 17:15   ` Adam Manzanares
2021-06-18 19:49   ` Himanshu Madhani
2021-06-21 14:24   ` Tejun Heo
2021-06-18  0:44 ` [PATCH v3 03/16] block/blk-rq-qos: Move a function from a header file into a C file Bart Van Assche
2021-06-18 17:22   ` Adam Manzanares
2021-06-18 19:49   ` Himanshu Madhani
2021-06-18  0:44 ` [PATCH v3 04/16] block: Introduce the ioprio rq-qos policy Bart Van Assche
2021-06-18 22:02   ` Adam Manzanares
2021-06-18  0:44 ` [PATCH v3 05/16] block/mq-deadline: Add several comments Bart Van Assche
2021-06-18 22:06   ` Adam Manzanares
2021-06-18  0:44 ` [PATCH v3 06/16] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
2021-06-18 22:09   ` Adam Manzanares
2021-06-18  0:44 ` [PATCH v3 07/16] block/mq-deadline: Remove two local variables Bart Van Assche
2021-06-18 22:16   ` Adam Manzanares
2021-06-18  0:44 ` [PATCH v3 08/16] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
2021-06-18 22:18   ` Adam Manzanares
2021-06-18  0:44 ` [PATCH v3 09/16] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
2021-06-18 22:30   ` Adam Manzanares
2021-06-18  0:44 ` [PATCH v3 10/16] block/mq-deadline: Improve the sysfs show and store macros Bart Van Assche
2021-06-18 23:07   ` Adam Manzanares
2021-06-18  0:44 ` [PATCH v3 11/16] block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests Bart Van Assche
2021-06-18  0:44 ` [PATCH v3 12/16] block/mq-deadline: Micro-optimize the batching algorithm Bart Van Assche
2021-06-18  0:44 ` [PATCH v3 13/16] block/mq-deadline: Add I/O priority support Bart Van Assche
2021-06-18  0:44 ` [PATCH v3 14/16] block/mq-deadline: Track I/O statistics Bart Van Assche
2021-06-18  0:44 ` [PATCH v3 15/16] block/mq-deadline: Add cgroup support Bart Van Assche
2021-06-18  0:44 ` [PATCH v3 16/16] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
2021-08-20  0:45   ` Niklas Cassel
2021-08-20 18:04     ` Bart Van Assche
2021-08-20 23:05       ` Niklas Cassel
2021-08-20 23:38         ` Bart Van Assche
2021-08-23  7:36           ` Niklas Cassel
2021-08-23 17:15             ` Bart Van Assche
2021-08-23 23:01               ` Damien Le Moal
2021-08-24 21:33               ` Niklas Cassel
2021-06-21 16:06 ` [PATCH v3 00/16] Improve I/O priority support Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).