linux-block.vger.kernel.org archive mirror
* [PATCH 0/9] Improve I/O priority support
@ 2021-05-27  1:01 Bart Van Assche
  2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
                   ` (11 more replies)
  0 siblings, 12 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche

Hi Jens,

A feature that the Linux kernel is missing for storage devices that support
I/O priorities is the ability to set the I/O priority of requests involved in
page cache writeback. Since the writeback code does not know the identity of
the process that triggered page cache writeback, the priority set by
ioprio_set() is ignored. However, certain filesystems associate an I/O cgroup
with writeback requests. Hence this patch series, which implements the
following changes for the mq-deadline scheduler:
* Make the I/O priority configurable per I/O cgroup.
* Change the I/O priority of requests to the lower of (request I/O priority,
  cgroup I/O priority); see the sketch after this list.
* Introduce one queue per I/O priority in the mq-deadline scheduler.
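
As an illustration of the second bullet, here is a minimal sketch of how the
priority aggregation could look. The helper below is hypothetical and only
serves to illustrate the idea; it is not taken from the patches themselves.
It relies on the IOPRIO_CLASS_* numbering in which a numerically larger class
value means a lower priority:

	/*
	 * Hypothetical helper: return the lower of the request and cgroup
	 * I/O priority classes. IOPRIO_CLASS_NONE (0) means "not set"; for
	 * the other classes a larger value means a lower priority
	 * (RT=1 < BE=2 < IDLE=3).
	 */
	static u8 dd_effective_ioclass(u8 rq_class, u8 cgroup_class)
	{
		if (cgroup_class == IOPRIO_CLASS_NONE)
			return rq_class;
		if (rq_class == IOPRIO_CLASS_NONE)
			return cgroup_class;
		return max(rq_class, cgroup_class);
	}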

Please consider this patch series for kernel v5.14.

Thanks,

Bart.

Bart Van Assche (9):
  block/mq-deadline: Add several comments
  block/mq-deadline: Add two lockdep_assert_held() statements
  block/mq-deadline: Remove two local variables
  block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  block/mq-deadline: Improve compile-time argument checking
  block/mq-deadline: Reduce the read expiry time for non-rotational
    media
  block/mq-deadline: Reserve 25% of tags for synchronous requests
  block/mq-deadline: Add I/O priority support
  block/mq-deadline: Add cgroup support

 block/Kconfig.iosched      |   6 +
 block/Makefile             |   1 +
 block/mq-deadline-cgroup.c | 206 +++++++++++++++
 block/mq-deadline-cgroup.h |  73 ++++++
 block/mq-deadline.c        | 524 +++++++++++++++++++++++++++++--------
 5 files changed, 695 insertions(+), 115 deletions(-)
 create mode 100644 block/mq-deadline-cgroup.c
 create mode 100644 block/mq-deadline-cgroup.h



* [PATCH 1/9] block/mq-deadline: Add several comments
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  3:03   ` Damien Le Moal
                     ` (3 more replies)
  2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
                   ` (10 subsequent siblings)
  11 siblings, 4 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Make the code easier to read by adding more comments.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 8eea2cbf2bf4..64cabbc157ea 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -139,6 +139,9 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 	}
 }
 
+/*
+ * Callback function that is invoked after @next has been merged into @req.
+ */
 static void dd_merged_requests(struct request_queue *q, struct request *req,
 			       struct request *next)
 {
@@ -375,6 +378,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 }
 
 /*
+ * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
+ *
  * One confusing aspect here is that we get called for a specific
  * hardware queue, but we may return a request that is for a
  * different hardware queue. This is because mq-deadline has shared
@@ -438,6 +443,10 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
 	return 0;
 }
 
+/*
+ * Try to merge @bio into an existing request. If @bio has been merged into
+ * an existing request, store the pointer to that request into *@rq.
+ */
 static int dd_request_merge(struct request_queue *q, struct request **rq,
 			    struct bio *bio)
 {
@@ -461,6 +470,10 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
 	return ELEVATOR_NO_MERGE;
 }
 
+/*
+ * Attempt to merge a bio into an existing request. This function is called
+ * before @bio is associated with a request.
+ */
 static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs)
 {
@@ -518,6 +531,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	}
 }
 
+/*
+ * Called from blk_mq_sched_insert_request() or blk_mq_sched_insert_requests().
+ */
 static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 			       struct list_head *list, bool at_head)
 {
@@ -544,6 +560,8 @@ static void dd_prepare_request(struct request *rq)
 }
 
 /*
+ * Callback function called from inside blk_mq_free_request().
+ *
  * For zoned block devices, write unlock the target zone of
  * completed write requests. Do this while holding the zone lock
  * spinlock so that the zone is never unlocked while deadline_fifo_request()


* [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
  2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  2:25   ` Chaitanya Kulkarni
                     ` (4 more replies)
  2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
                   ` (9 subsequent siblings)
  11 siblings, 5 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Document the locking strategy by adding two lockdep_assert_held()
statements.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 64cabbc157ea..4da0bd3b9827 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -279,6 +279,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	bool reads, writes;
 	int data_dir;
 
+	lockdep_assert_held(&dd->lock);
+
 	if (!list_empty(&dd->dispatch)) {
 		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
 		list_del_init(&rq->queuelist);
@@ -501,6 +503,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const int data_dir = rq_data_dir(rq);
 
+	lockdep_assert_held(&dd->lock);
+
 	/*
 	 * This may be a requeue of a write request that has locked its
 	 * target zone. If it is the case, this releases the zone lock.


* [PATCH 3/9] block/mq-deadline: Remove two local variables
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
  2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
  2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  2:26   ` Chaitanya Kulkarni
                     ` (4 more replies)
  2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
                   ` (8 subsequent siblings)
  11 siblings, 5 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Make __dd_dispatch_request() easier to read by removing two local
variables.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 4da0bd3b9827..cc9d0fc72db2 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -276,7 +276,6 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
 static struct request *__dd_dispatch_request(struct deadline_data *dd)
 {
 	struct request *rq, *next_rq;
-	bool reads, writes;
 	int data_dir;
 
 	lockdep_assert_held(&dd->lock);
@@ -287,9 +286,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 		goto done;
 	}
 
-	reads = !list_empty(&dd->fifo_list[READ]);
-	writes = !list_empty(&dd->fifo_list[WRITE]);
-
 	/*
 	 * batches are currently reads XOR writes
 	 */
@@ -306,7 +302,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (reads) {
+	if (!list_empty(&dd->fifo_list[READ])) {
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
 
 		if (deadline_fifo_request(dd, WRITE) &&
@@ -322,7 +318,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (writes) {
+	if (!list_empty(&dd->fifo_list[WRITE])) {
 dispatch_writes:
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
 


* [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (2 preceding siblings ...)
  2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  2:27   ` Chaitanya Kulkarni
                     ` (4 more replies)
  2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
                   ` (7 subsequent siblings)
  11 siblings, 5 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Change "queue" into "sched" to make the function names reflect better the
purpose of these functions.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index cc9d0fc72db2..c4f51e7884fb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -395,7 +395,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
-static void dd_exit_queue(struct elevator_queue *e)
+static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
@@ -408,7 +408,7 @@ static void dd_exit_queue(struct elevator_queue *e)
 /*
  * initialize elevator private data (deadline_data).
  */
-static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
+static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
@@ -800,8 +800,8 @@ static struct elevator_type mq_deadline = {
 		.requests_merged	= dd_merged_requests,
 		.request_merged		= dd_request_merged,
 		.has_work		= dd_has_work,
-		.init_sched		= dd_init_queue,
-		.exit_sched		= dd_exit_queue,
+		.init_sched		= dd_init_sched,
+		.exit_sched		= dd_exit_sched,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS


* [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (3 preceding siblings ...)
  2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  2:28   ` Chaitanya Kulkarni
                     ` (4 more replies)
  2021-05-27  1:01 ` [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media Bart Van Assche
                   ` (6 subsequent siblings)
  11 siblings, 5 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Modern compilers complain if an out-of-range value is passed to a function
argument that has an enumeration type. Let the compiler detect out-of-range
data direction arguments instead of verifying the data_dir argument at
runtime.
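
As an illustration (hypothetical caller, not part of this patch): passing a
value of an unrelated enumeration type, e.g.

	enum foo { FOO_BAR };
	rq = deadline_fifo_request(dd, FOO_BAR);

can be reported at compile time by compilers that warn about implicit
conversions between different enumeration types (for example clang's
-Wenum-conversion), whereas the removed WARN_ON_ONCE() checks could only
catch an invalid data_dir value at runtime.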

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 96 +++++++++++++++++++++++----------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index c4f51e7884fb..dfbc6b77fa71 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -35,6 +35,13 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
 
+enum dd_data_dir {
+	DD_READ		= READ,
+	DD_WRITE	= WRITE,
+} __packed;
+
+enum { DD_DIR_COUNT = 2 };
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -43,20 +50,20 @@ struct deadline_data {
 	/*
 	 * requests (deadline_rq s) are present on both sort_list and fifo_list
 	 */
-	struct rb_root sort_list[2];
-	struct list_head fifo_list[2];
+	struct rb_root sort_list[DD_DIR_COUNT];
+	struct list_head fifo_list[DD_DIR_COUNT];
 
 	/*
 	 * next in sort order. read, write or both are NULL
 	 */
-	struct request *next_rq[2];
+	struct request *next_rq[DD_DIR_COUNT];
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
-	int fifo_expire[2];
+	int fifo_expire[DD_DIR_COUNT];
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
@@ -97,7 +104,7 @@ deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
 static inline void
 deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
 	if (dd->next_rq[data_dir] == rq)
 		dd->next_rq[data_dir] = deadline_latter_request(rq);
@@ -169,10 +176,10 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 static void
 deadline_move_request(struct deadline_data *dd, struct request *rq)
 {
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
-	dd->next_rq[READ] = NULL;
-	dd->next_rq[WRITE] = NULL;
+	dd->next_rq[DD_READ] = NULL;
+	dd->next_rq[DD_WRITE] = NULL;
 	dd->next_rq[data_dir] = deadline_latter_request(rq);
 
 	/*
@@ -185,9 +192,10 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
-static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
+static inline int deadline_check_fifo(struct deadline_data *dd,
+				      enum dd_data_dir data_dir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
+	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
 
 	/*
 	 * rq is expired!
@@ -203,19 +211,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
  * dispatch using arrival ordered lists.
  */
 static struct request *
-deadline_fifo_request(struct deadline_data *dd, int data_dir)
+deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
-		return NULL;
-
 	if (list_empty(&dd->fifo_list[data_dir]))
 		return NULL;
 
 	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
-	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
 	/*
@@ -223,7 +228,7 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
 	 * an unlocked target zone.
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
+	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
 		if (blk_req_can_dispatch_to_zone(rq))
 			goto out;
 	}
@@ -239,19 +244,16 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
  * dispatch using sector position sorted lists.
  */
 static struct request *
-deadline_next_request(struct deadline_data *dd, int data_dir)
+deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
-		return NULL;
-
 	rq = dd->next_rq[data_dir];
 	if (!rq)
 		return NULL;
 
-	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
 	/*
@@ -276,7 +278,7 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
 static struct request *__dd_dispatch_request(struct deadline_data *dd)
 {
 	struct request *rq, *next_rq;
-	int data_dir;
+	enum dd_data_dir data_dir;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -289,9 +291,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	rq = deadline_next_request(dd, WRITE);
+	rq = deadline_next_request(dd, DD_WRITE);
 	if (!rq)
-		rq = deadline_next_request(dd, READ);
+		rq = deadline_next_request(dd, DD_READ);
 
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
@@ -302,14 +304,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (!list_empty(&dd->fifo_list[READ])) {
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+	if (!list_empty(&dd->fifo_list[DD_READ])) {
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
 
-		if (deadline_fifo_request(dd, WRITE) &&
+		if (deadline_fifo_request(dd, DD_WRITE) &&
 		    (dd->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
-		data_dir = READ;
+		data_dir = DD_READ;
 
 		goto dispatch_find_request;
 	}
@@ -318,13 +320,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (!list_empty(&dd->fifo_list[WRITE])) {
+	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
 dispatch_writes:
-		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
 
 		dd->starved = 0;
 
-		data_dir = WRITE;
+		data_dir = DD_WRITE;
 
 		goto dispatch_find_request;
 	}
@@ -399,8 +401,8 @@ static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
 
-	BUG_ON(!list_empty(&dd->fifo_list[READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
+	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
 
 	kfree(dd);
 }
@@ -424,12 +426,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	}
 	eq->elevator_data = dd;
 
-	INIT_LIST_HEAD(&dd->fifo_list[READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
-	dd->sort_list[READ] = RB_ROOT;
-	dd->sort_list[WRITE] = RB_ROOT;
-	dd->fifo_expire[READ] = read_expire;
-	dd->fifo_expire[WRITE] = write_expire;
+	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
+	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
+	dd->sort_list[DD_READ] = RB_ROOT;
+	dd->sort_list[DD_WRITE] = RB_ROOT;
+	dd->fifo_expire[DD_READ] = read_expire;
+	dd->fifo_expire[DD_WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;
 	dd->fifo_batch = fifo_batch;
@@ -497,7 +499,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 {
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
-	const int data_dir = rq_data_dir(rq);
+	const enum dd_data_dir data_dir = rq_data_dir(rq);
 
 	lockdep_assert_held(&dd->lock);
 
@@ -585,7 +587,7 @@ static void dd_finish_request(struct request *rq)
 
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
-		if (!list_empty(&dd->fifo_list[WRITE]))
+		if (!list_empty(&dd->fifo_list[DD_WRITE]))
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
@@ -626,8 +628,8 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 		__data = jiffies_to_msecs(__data);			\
 	return deadline_var_show(__data, (page));			\
 }
-SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
-SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
 SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
 SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
 SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
@@ -649,8 +651,8 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
 		*(__PTR) = __data;					\
 	return count;							\
 }
-STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
 STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
 STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
 STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
@@ -717,8 +719,8 @@ static int deadline_##name##_next_rq_show(void *data,			\
 		__blk_mq_debugfs_rq_show(m, rq);			\
 	return 0;							\
 }
-DEADLINE_DEBUGFS_DDIR_ATTRS(READ, read)
-DEADLINE_DEBUGFS_DDIR_ATTRS(WRITE, write)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
 #undef DEADLINE_DEBUGFS_DDIR_ATTRS
 
 static int deadline_batching_show(void *data, struct seq_file *m)


* [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (4 preceding siblings ...)
  2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  2:30   ` Chaitanya Kulkarni
                     ` (3 more replies)
  2021-05-27  1:01 ` [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests Bart Van Assche
                   ` (5 subsequent siblings)
  11 siblings, 4 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Rotational media (hard disks) benefit more from merging requests than
non-rotational media. A longer read expiry time gives requests more time to
be merged, but it also increases read latency. Hence reduce the read expiry
time for non-rotational media in order to reduce read latency.
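
Concretely, dd->fifo_expire[DD_READ] changes from HZ/2 jiffies (500 ms) to a
single jiffy (1000/HZ ms, e.g. 4 ms with HZ=250) when blk_queue_nonrot()
reports a non-rotational queue.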

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index dfbc6b77fa71..2ab844a4b6b5 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -29,7 +29,9 @@
 /*
  * See Documentation/block/deadline-iosched.rst
  */
-static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
+/* max time before a read is submitted. */
+static const int read_expire_rot = HZ / 2;
+static const int read_expire_nonrot = 1;
 static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
 static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
@@ -430,7 +432,8 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
 	dd->sort_list[DD_READ] = RB_ROOT;
 	dd->sort_list[DD_WRITE] = RB_ROOT;
-	dd->fifo_expire[DD_READ] = read_expire;
+	dd->fifo_expire[DD_READ] = blk_queue_nonrot(q) ? read_expire_nonrot :
+		read_expire_rot;
 	dd->fifo_expire[DD_WRITE] = write_expire;
 	dd->writes_starved = writes_starved;
 	dd->front_merges = 1;


* [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (5 preceding siblings ...)
  2021-05-27  1:01 ` [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  3:33   ` Damien Le Moal
  2021-05-27  6:54   ` Hannes Reinecke
  2021-05-27  1:01 ` [PATCH 8/9] block/mq-deadline: Add I/O priority support Bart Van Assche
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

For interactive workloads it is important that synchronous requests are
not delayed. Hence reserve 25% of tags for synchronous requests. This patch
still allows asynchronous requests to fill the hardware queues since
blk_mq_init_sched() makes sure that the number of scheduler requests is
double the hardware queue depth. From blk_mq_init_sched():

	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
				   BLKDEV_MAX_RQ);
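
As an example, assuming a hardware queue depth of 64: q->nr_requests becomes
2 * min(64, BLKDEV_MAX_RQ) = 128, so dd->async_depth becomes 3 * 128 / 4 = 96.
Asynchronous requests can hence still occupy all 64 hardware queue entries
while 32 scheduler tags remain reserved for synchronous requests.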

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2ab844a4b6b5..81f487d77e09 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -69,6 +69,7 @@ struct deadline_data {
 	int fifo_batch;
 	int writes_starved;
 	int front_merges;
+	u32 async_depth;
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
@@ -399,6 +400,38 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	return rq;
 }
 
+static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
+{
+	struct deadline_data *dd = data->q->elevator->elevator_data;
+
+	/* Do not throttle synchronous reads. */
+	if (op_is_sync(op) && !op_is_write(op))
+		return;
+
+	/*
+	 * Throttle asynchronous requests and writes such that these requests
+	 * do not block the allocation of synchronous requests.
+	 */
+	data->shallow_depth = dd->async_depth;
+}
+
+static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	struct blk_mq_tags *tags = hctx->sched_tags;
+
+	dd->async_depth = 3 * q->nr_requests / 4;
+
+	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, dd->async_depth);
+}
+
+static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+	dd_depth_updated(hctx);
+	return 0;
+}
+
 static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
@@ -744,6 +777,15 @@ static int deadline_starved_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+static int dd_async_depth_show(void *data, struct seq_file *m)
+{
+	struct request_queue *q = data;
+	struct deadline_data *dd = q->elevator->elevator_data;
+
+	seq_printf(m, "%u\n", dd->async_depth);
+	return 0;
+}
+
 static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
 	__acquires(&dd->lock)
 {
@@ -786,6 +828,7 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 	DEADLINE_QUEUE_DDIR_ATTRS(write),
 	{"batching", 0400, deadline_batching_show},
 	{"starved", 0400, deadline_starved_show},
+	{"async_depth", 0400, dd_async_depth_show},
 	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
 	{},
 };
@@ -794,6 +837,8 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
 
 static struct elevator_type mq_deadline = {
 	.ops = {
+		.depth_updated		= dd_depth_updated,
+		.limit_depth		= dd_limit_depth,
 		.insert_requests	= dd_insert_requests,
 		.dispatch_request	= dd_dispatch_request,
 		.prepare_request	= dd_prepare_request,
@@ -807,6 +852,7 @@ static struct elevator_type mq_deadline = {
 		.has_work		= dd_has_work,
 		.init_sched		= dd_init_sched,
 		.exit_sched		= dd_exit_sched,
+		.init_hctx		= dd_init_hctx,
 	},
 
 #ifdef CONFIG_BLK_DEBUG_FS


* [PATCH 8/9] block/mq-deadline: Add I/O priority support
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (6 preceding siblings ...)
  2021-05-27  1:01 ` [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  3:48   ` Damien Le Moal
  2021-05-27  7:07   ` Hannes Reinecke
  2021-05-27  1:01 ` [PATCH 9/9] block/mq-deadline: Add cgroup support Bart Van Assche
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Maintain one FIFO list per I/O priority class (RT, BE and IDLE) and per data
direction. Only dispatch requests for a lower priority class after all higher
priority requests have finished. Maintain statistics for each priority level.
Split the debugfs attributes per priority level as follows:

$ ls /sys/kernel/debug/block/.../sched/
async_depth  dispatch2        read_next_rq      write2_fifo_list
batching     read0_fifo_list  starved           write_next_rq
dispatch0    read1_fifo_list  write0_fifo_list
dispatch1    read2_fifo_list  write1_fifo_list
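
The numeric suffix corresponds to the dd_prio values introduced below (0 = RT,
1 = BE, 2 = IDLE), so reads issued by a process running under e.g.
"ionice -c 1" should end up in read0_fifo_list, while reads from an idle-class
process ("ionice -c 3") should end up in read2_fifo_list.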

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 293 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 223 insertions(+), 70 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 81f487d77e09..5a703e1228fa 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -44,16 +44,27 @@ enum dd_data_dir {
 
 enum { DD_DIR_COUNT = 2 };
 
+enum dd_prio {
+	DD_RT_PRIO	= 0,
+	DD_BE_PRIO	= 1,
+	DD_IDLE_PRIO	= 2,
+	DD_PRIO_MAX	= 2,
+} __packed;
+
+enum { DD_PRIO_COUNT = 3 };
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
 
 	/*
-	 * requests (deadline_rq s) are present on both sort_list and fifo_list
+	 * Requests are present on both sort_list[] and fifo_list[][]. The
+	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
+	 * The second index is the data direction (rq_data_dir(rq)).
 	 */
 	struct rb_root sort_list[DD_DIR_COUNT];
-	struct list_head fifo_list[DD_DIR_COUNT];
+	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];
 
 	/*
 	 * next in sort order. read, write or both are NULL
@@ -62,6 +73,12 @@ struct deadline_data {
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
+	/* statistics */
+	atomic_t inserted[DD_PRIO_COUNT];
+	atomic_t dispatched[DD_PRIO_COUNT];
+	atomic_t merged[DD_PRIO_COUNT];
+	atomic_t completed[DD_PRIO_COUNT];
+
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -73,7 +90,15 @@ struct deadline_data {
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
-	struct list_head dispatch;
+	struct list_head dispatch[DD_PRIO_COUNT];
+};
+
+/* Maps an I/O priority class to a deadline scheduler priority. */
+static const enum dd_prio ioprio_class_to_prio[] = {
+	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
+	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
+	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
+	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
 };
 
 static inline struct rb_root *
@@ -149,12 +174,28 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
 	}
 }
 
+/* Returns the I/O class that has been assigned by dd_insert_request(). */
+static u8 dd_rq_ioclass(struct request *rq)
+{
+	return (uintptr_t)rq->elv.priv[1];
+}
+
 /*
  * Callback function that is invoked after @next has been merged into @req.
  */
 static void dd_merged_requests(struct request_queue *q, struct request *req,
 			       struct request *next)
 {
+	struct deadline_data *dd = q->elevator->elevator_data;
+	const u8 ioprio_class = dd_rq_ioclass(next);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+
+	if (next->elv.priv[0]) {
+		atomic_inc(&dd->merged[prio]);
+	} else {
+		WARN_ON_ONCE(true);
+	}
+
 	/*
 	 * if next expires before rq, assign its expire time to rq
 	 * and move into next position (next will be deleted) in fifo
@@ -191,14 +232,22 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
 	deadline_remove_request(rq->q, rq);
 }
 
+/* Number of requests queued for a given priority level. */
+static u64 dd_queued(struct deadline_data *dd, enum dd_prio prio)
+{
+	return atomic_read(&dd->inserted[prio]) -
+		atomic_read(&dd->completed[prio]);
+}
+
 /*
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
  */
 static inline int deadline_check_fifo(struct deadline_data *dd,
+				      enum dd_prio prio,
 				      enum dd_data_dir data_dir)
 {
-	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
 
 	/*
 	 * rq is expired!
@@ -214,15 +263,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd,
  * dispatch using arrival ordered lists.
  */
 static struct request *
-deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
+deadline_fifo_request(struct deadline_data *dd, enum dd_prio prio,
+		      enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
 
-	if (list_empty(&dd->fifo_list[data_dir]))
+	if (list_empty(&dd->fifo_list[prio][data_dir]))
 		return NULL;
 
-	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
+	rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
 	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
 		return rq;
 
@@ -231,7 +281,7 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
 	 * an unlocked target zone.
 	 */
 	spin_lock_irqsave(&dd->zone_lock, flags);
-	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
+	list_for_each_entry(rq, &dd->fifo_list[prio][DD_WRITE], queuelist) {
 		if (blk_req_can_dispatch_to_zone(rq))
 			goto out;
 	}
@@ -247,7 +297,8 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
  * dispatch using sector position sorted lists.
  */
 static struct request *
-deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
+deadline_next_request(struct deadline_data *dd, enum dd_prio prio,
+		      enum dd_data_dir data_dir)
 {
 	struct request *rq;
 	unsigned long flags;
@@ -278,15 +329,18 @@ deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
  * deadline_dispatch_requests selects the best request according to
  * read/write expire, fifo_batch, etc
  */
-static struct request *__dd_dispatch_request(struct deadline_data *dd)
+static struct request *__dd_dispatch_request(struct deadline_data *dd,
+					     enum dd_prio prio)
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
+	u8 ioprio_class;
 
 	lockdep_assert_held(&dd->lock);
 
-	if (!list_empty(&dd->dispatch)) {
-		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
+	if (!list_empty(&dd->dispatch[prio])) {
+		rq = list_first_entry(&dd->dispatch[prio], struct request,
+				      queuelist);
 		list_del_init(&rq->queuelist);
 		goto done;
 	}
@@ -294,9 +348,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * batches are currently reads XOR writes
 	 */
-	rq = deadline_next_request(dd, DD_WRITE);
+	rq = deadline_next_request(dd, prio, DD_WRITE);
 	if (!rq)
-		rq = deadline_next_request(dd, DD_READ);
+		rq = deadline_next_request(dd, prio, DD_READ);
 
 	if (rq && dd->batching < dd->fifo_batch)
 		/* we have a next request are still entitled to batch */
@@ -307,10 +361,10 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * data direction (read / write)
 	 */
 
-	if (!list_empty(&dd->fifo_list[DD_READ])) {
+	if (!list_empty(&dd->fifo_list[prio][DD_READ])) {
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
 
-		if (deadline_fifo_request(dd, DD_WRITE) &&
+		if (deadline_fifo_request(dd, prio, DD_WRITE) &&
 		    (dd->starved++ >= dd->writes_starved))
 			goto dispatch_writes;
 
@@ -323,7 +377,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	 * there are either no reads or writes have been starved
 	 */
 
-	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
+	if (!list_empty(&dd->fifo_list[prio][DD_WRITE])) {
 dispatch_writes:
 		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
 
@@ -340,14 +394,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	/*
 	 * we are not running a batch, find best request for selected data_dir
 	 */
-	next_rq = deadline_next_request(dd, data_dir);
-	if (deadline_check_fifo(dd, data_dir) || !next_rq) {
+	next_rq = deadline_next_request(dd, prio, data_dir);
+	if (deadline_check_fifo(dd, prio, data_dir) || !next_rq) {
 		/*
 		 * A deadline has expired, the last request was in the other
 		 * direction, or we have run out of higher-sectored requests.
 		 * Start again from the request with the earliest expiry time.
 		 */
-		rq = deadline_fifo_request(dd, data_dir);
+		rq = deadline_fifo_request(dd, prio, data_dir);
 	} else {
 		/*
 		 * The last req was the same dir and we have a next request in
@@ -372,6 +426,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
 	dd->batching++;
 	deadline_move_request(dd, rq);
 done:
+	ioprio_class = dd_rq_ioclass(rq);
+	prio = ioprio_class_to_prio[ioprio_class];
+	if (rq->elv.priv[0]) {
+		atomic_inc(&dd->dispatched[prio]);
+	} else {
+		WARN_ON_ONCE(true);
+	}
 	/*
 	 * If the request needs its target zone locked, do it.
 	 */
@@ -392,9 +453,14 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 	struct request *rq;
+	enum dd_prio prio;
 
 	spin_lock(&dd->lock);
-	rq = __dd_dispatch_request(dd);
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, prio);
+		if (rq || dd_queued(dd, prio))
+			break;
+	}
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -435,9 +501,12 @@ static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 static void dd_exit_sched(struct elevator_queue *e)
 {
 	struct deadline_data *dd = e->elevator_data;
+	enum dd_prio prio;
 
-	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
-	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_READ]));
+		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_WRITE]));
+	}
 
 	kfree(dd);
 }
@@ -449,6 +518,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
+	enum dd_prio prio;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
@@ -461,8 +531,11 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	}
 	eq->elevator_data = dd;
 
-	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
-	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
+		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_READ]);
+		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_WRITE]);
+		INIT_LIST_HEAD(&dd->dispatch[prio]);
+	}
 	dd->sort_list[DD_READ] = RB_ROOT;
 	dd->sort_list[DD_WRITE] = RB_ROOT;
 	dd->fifo_expire[DD_READ] = blk_queue_nonrot(q) ? read_expire_nonrot :
@@ -473,7 +546,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->fifo_batch = fifo_batch;
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
-	INIT_LIST_HEAD(&dd->dispatch);
 
 	q->elevator = eq;
 	return 0;
@@ -536,6 +608,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const enum dd_data_dir data_dir = rq_data_dir(rq);
+	u16 ioprio = req_get_ioprio(rq);
+	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
+	enum dd_prio prio;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -545,13 +620,19 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
+	prio = ioprio_class_to_prio[ioprio_class];
+	atomic_inc(&dd->inserted[prio]);
+	WARN_ON_ONCE(rq->elv.priv[0]);
+	rq->elv.priv[0] = (void *)1ULL;
+	rq->elv.priv[1] = (void *)(uintptr_t)ioprio_class;
+
 	if (blk_mq_sched_try_insert_merge(q, rq))
 		return;
 
 	trace_block_rq_insert(rq);
 
 	if (at_head) {
-		list_add(&rq->queuelist, &dd->dispatch);
+		list_add(&rq->queuelist, &dd->dispatch[prio]);
 	} else {
 		deadline_add_rq_rb(dd, rq);
 
@@ -565,7 +646,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 		 * set expire time and add to fifo list
 		 */
 		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
-		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
+		list_add_tail(&rq->queuelist, &dd->fifo_list[prio][data_dir]);
 	}
 }
 
@@ -589,12 +670,10 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	spin_unlock(&dd->lock);
 }
 
-/*
- * Nothing to do here. This is defined only to ensure that .finish_request
- * method is called upon request completion.
- */
+/* Callback function called from inside blk_mq_rq_ctx_init(). */
 static void dd_prepare_request(struct request *rq)
 {
+	rq->elv.priv[0] = NULL;
 }
 
 /*
@@ -616,26 +695,41 @@ static void dd_prepare_request(struct request *rq)
 static void dd_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	struct deadline_data *dd = q->elevator->elevator_data;
+	const u8 ioprio_class = dd_rq_ioclass(rq);
+	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+
+	if (rq->elv.priv[0])
+		atomic_inc(&dd->completed[prio]);
 
 	if (blk_queue_is_zoned(q)) {
-		struct deadline_data *dd = q->elevator->elevator_data;
 		unsigned long flags;
 
 		spin_lock_irqsave(&dd->zone_lock, flags);
 		blk_req_zone_write_unlock(rq);
-		if (!list_empty(&dd->fifo_list[DD_WRITE]))
+		if (!list_empty(&dd->fifo_list[prio][DD_WRITE]))
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
 }
 
+static bool dd_has_work_for_prio(struct deadline_data *dd, enum dd_prio prio)
+{
+	return !list_empty_careful(&dd->dispatch[prio]) ||
+		!list_empty_careful(&dd->fifo_list[prio][DD_READ]) ||
+		!list_empty_careful(&dd->fifo_list[prio][DD_WRITE]);
+}
+
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+	enum dd_prio prio;
+
+	for (prio = 0; prio <= DD_PRIO_MAX; prio++)
+		if (dd_has_work_for_prio(dd, prio))
+			return true;
 
-	return !list_empty_careful(&dd->dispatch) ||
-		!list_empty_careful(&dd->fifo_list[0]) ||
-		!list_empty_careful(&dd->fifo_list[1]);
+	return false;
 }
 
 /*
@@ -707,7 +801,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 };
 
 #ifdef CONFIG_BLK_DEBUG_FS
-#define DEADLINE_DEBUGFS_DDIR_ATTRS(ddir, name)				\
+#define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
 static void *deadline_##name##_fifo_start(struct seq_file *m,		\
 					  loff_t *pos)			\
 	__acquires(&dd->lock)						\
@@ -716,7 +810,7 @@ static void *deadline_##name##_fifo_start(struct seq_file *m,		\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
 	spin_lock(&dd->lock);						\
-	return seq_list_start(&dd->fifo_list[ddir], *pos);		\
+	return seq_list_start(&dd->fifo_list[prio][data_dir], *pos);	\
 }									\
 									\
 static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
@@ -725,7 +819,7 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
 	struct request_queue *q = m->private;				\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
 									\
-	return seq_list_next(v, &dd->fifo_list[ddir], pos);		\
+	return seq_list_next(v, &dd->fifo_list[prio][data_dir], pos);	\
 }									\
 									\
 static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
@@ -742,22 +836,31 @@ static const struct seq_operations deadline_##name##_fifo_seq_ops = {	\
 	.next	= deadline_##name##_fifo_next,				\
 	.stop	= deadline_##name##_fifo_stop,				\
 	.show	= blk_mq_debugfs_rq_show,				\
-};									\
-									\
+};
+
+#define DEADLINE_DEBUGFS_NEXT_RQ(data_dir, name)			\
 static int deadline_##name##_next_rq_show(void *data,			\
 					  struct seq_file *m)		\
 {									\
 	struct request_queue *q = data;					\
 	struct deadline_data *dd = q->elevator->elevator_data;		\
-	struct request *rq = dd->next_rq[ddir];				\
+	struct request *rq = dd->next_rq[data_dir];			\
 									\
 	if (rq)								\
 		__blk_mq_debugfs_rq_show(m, rq);			\
 	return 0;							\
 }
-DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
-DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
+
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_READ, read0)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_WRITE, write0)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_READ, read1)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_WRITE, write1)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_READ, read2)
+DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_WRITE, write2)
+DEADLINE_DEBUGFS_NEXT_RQ(DD_READ, read)
+DEADLINE_DEBUGFS_NEXT_RQ(DD_WRITE, write)
 #undef DEADLINE_DEBUGFS_DDIR_ATTRS
+#undef DEADLINE_DEBUGFS_NEXT_RQ
 
 static int deadline_batching_show(void *data, struct seq_file *m)
 {
@@ -786,50 +889,100 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
 	return 0;
 }
 
-static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
-	__acquires(&dd->lock)
+static int dd_queued_show(void *data, struct seq_file *m)
 {
-	struct request_queue *q = m->private;
+	struct request_queue *q = data;
 	struct deadline_data *dd = q->elevator->elevator_data;
 
-	spin_lock(&dd->lock);
-	return seq_list_start(&dd->dispatch, *pos);
+	seq_printf(m, "%llu %llu %llu\n", dd_queued(dd, DD_RT_PRIO),
+		   dd_queued(dd, DD_BE_PRIO),
+		   dd_queued(dd, DD_IDLE_PRIO));
+	return 0;
 }
 
-static void *deadline_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
+/* Number of requests owned by the block driver for a given priority. */
+static u64 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
 {
-	struct request_queue *q = m->private;
-	struct deadline_data *dd = q->elevator->elevator_data;
-
-	return seq_list_next(v, &dd->dispatch, pos);
+	return atomic_read(&dd->dispatched[prio]) +
+		atomic_read(&dd->merged[prio]) -
+		atomic_read(&dd->completed[prio]);
 }
 
-static void deadline_dispatch_stop(struct seq_file *m, void *v)
-	__releases(&dd->lock)
+static int dd_owned_by_driver_show(void *data, struct seq_file *m)
 {
-	struct request_queue *q = m->private;
+	struct request_queue *q = data;
 	struct deadline_data *dd = q->elevator->elevator_data;
 
-	spin_unlock(&dd->lock);
+	seq_printf(m, "%llu %llu %llu\n", dd_owned_by_driver(dd, DD_RT_PRIO),
+		   dd_owned_by_driver(dd, DD_BE_PRIO),
+		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
+	return 0;
 }
 
-static const struct seq_operations deadline_dispatch_seq_ops = {
-	.start	= deadline_dispatch_start,
-	.next	= deadline_dispatch_next,
-	.stop	= deadline_dispatch_stop,
-	.show	= blk_mq_debugfs_rq_show,
+#define DEADLINE_DISPATCH_ATTR(prio)					\
+static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
+					     loff_t *pos)		\
+	__acquires(&dd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+									\
+	spin_lock(&dd->lock);						\
+	return seq_list_start(&dd->dispatch[prio], *pos);		\
+}									\
+									\
+static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
+					    void *v, loff_t *pos)	\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+									\
+	return seq_list_next(v, &dd->dispatch[prio], pos);		\
+}									\
+									\
+static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
+	__releases(&dd->lock)						\
+{									\
+	struct request_queue *q = m->private;				\
+	struct deadline_data *dd = q->elevator->elevator_data;		\
+									\
+	spin_unlock(&dd->lock);						\
+}									\
+									\
+static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
+	.start	= deadline_dispatch##prio##_start,			\
+	.next	= deadline_dispatch##prio##_next,			\
+	.stop	= deadline_dispatch##prio##_stop,			\
+	.show	= blk_mq_debugfs_rq_show,				\
 };
 
-#define DEADLINE_QUEUE_DDIR_ATTRS(name)						\
-	{#name "_fifo_list", 0400, .seq_ops = &deadline_##name##_fifo_seq_ops},	\
+DEADLINE_DISPATCH_ATTR(0);
+DEADLINE_DISPATCH_ATTR(1);
+DEADLINE_DISPATCH_ATTR(2);
+#undef DEADLINE_DISPATCH_ATTR
+
+#define DEADLINE_QUEUE_DDIR_ATTRS(name)					\
+	{#name "_fifo_list", 0400,					\
+			.seq_ops = &deadline_##name##_fifo_seq_ops}
+#define DEADLINE_NEXT_RQ_ATTR(name)					\
 	{#name "_next_rq", 0400, deadline_##name##_next_rq_show}
 static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
-	DEADLINE_QUEUE_DDIR_ATTRS(read),
-	DEADLINE_QUEUE_DDIR_ATTRS(write),
+	DEADLINE_QUEUE_DDIR_ATTRS(read0),
+	DEADLINE_QUEUE_DDIR_ATTRS(write0),
+	DEADLINE_QUEUE_DDIR_ATTRS(read1),
+	DEADLINE_QUEUE_DDIR_ATTRS(write1),
+	DEADLINE_QUEUE_DDIR_ATTRS(read2),
+	DEADLINE_QUEUE_DDIR_ATTRS(write2),
+	DEADLINE_NEXT_RQ_ATTR(read),
+	DEADLINE_NEXT_RQ_ATTR(write),
 	{"batching", 0400, deadline_batching_show},
 	{"starved", 0400, deadline_starved_show},
 	{"async_depth", 0400, dd_async_depth_show},
-	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
+	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
+	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
+	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
+	{"owned_by_driver", 0400, dd_owned_by_driver_show},
+	{"queued", 0400, dd_queued_show},
 	{},
 };
 #undef DEADLINE_QUEUE_DDIR_ATTRS
@@ -879,6 +1032,6 @@ static void __exit deadline_exit(void)
 module_init(deadline_init);
 module_exit(deadline_exit);
 
-MODULE_AUTHOR("Jens Axboe");
+MODULE_AUTHOR("Jens Axboe, Damien Le Moal and Bart Van Assche");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("MQ deadline IO scheduler");


* [PATCH 9/9] block/mq-deadline: Add cgroup support
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (7 preceding siblings ...)
  2021-05-27  1:01 ` [PATCH 8/9] block/mq-deadline: Add I/O priority support Bart Van Assche
@ 2021-05-27  1:01 ` Bart Van Assche
  2021-05-27  7:09   ` Hannes Reinecke
  2021-05-27  6:25 ` [PATCH 0/9] Improve I/O priority support Wang Jianchao
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27  1:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Bart Van Assche, Damien Le Moal, Ming Lei

Add support for configuring I/O priority per block cgroup. Assign the lowest
of the following two priorities to a request: rq->ioprio and blkcg->ioprio.
Maintain statistics per cgroup and make these available in sysfs.

This patch has been tested as follows:

SHELL 1

modprobe scsi_debug ndelay=1000000 max_queue=16 &&
while [ -z "$sd" ]; do sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*); done &&
cd /sys/fs/cgroup/blkio/ &&
echo 2 >blkio.dd.prio &&
mkdir -p hipri &&
cd hipri &&
echo 1 >blkio.dd.prio &&
echo $$ >cgroup.procs &&
max-iops -a1 -d32 -j1 -e mq-deadline $sd

SHELL 2

sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) &&
max-iops -a1 -d32 -j1 -e mq-deadline $sd

Result:
* 12000 IOPS in shell 1
*  2000 IOPS in shell 2

The max-iops script runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d}
--iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2))
--name=${positional_argument_1} --filename=${positional_argument_1}

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/Kconfig.iosched      |   6 ++
 block/Makefile             |   1 +
 block/mq-deadline-cgroup.c | 206 +++++++++++++++++++++++++++++++++++++
 block/mq-deadline-cgroup.h |  73 +++++++++++++
 block/mq-deadline.c        |  96 ++++++++++++++---
 5 files changed, 370 insertions(+), 12 deletions(-)
 create mode 100644 block/mq-deadline-cgroup.c
 create mode 100644 block/mq-deadline-cgroup.h

diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 2f2158e05a91..64053d67a97b 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -9,6 +9,12 @@ config MQ_IOSCHED_DEADLINE
 	help
 	  MQ version of the deadline IO scheduler.
 
+config MQ_IOSCHED_DEADLINE_CGROUP
+       tristate
+       default y
+       depends on MQ_IOSCHED_DEADLINE
+       depends on BLK_CGROUP
+
 config MQ_IOSCHED_KYBER
 	tristate "Kyber I/O scheduler"
 	default y
diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..7d5763d5f988 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -20,6 +20,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)	+= blk-throttle.o
 obj-$(CONFIG_BLK_CGROUP_IOLATENCY)	+= blk-iolatency.o
 obj-$(CONFIG_BLK_CGROUP_IOCOST)	+= blk-iocost.o
 obj-$(CONFIG_MQ_IOSCHED_DEADLINE)	+= mq-deadline.o
+obj-$(CONFIG_MQ_IOSCHED_DEADLINE_CGROUP)+= mq-deadline-cgroup.o
 obj-$(CONFIG_MQ_IOSCHED_KYBER)	+= kyber-iosched.o
 bfq-y				:= bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
 obj-$(CONFIG_IOSCHED_BFQ)	+= bfq.o
diff --git a/block/mq-deadline-cgroup.c b/block/mq-deadline-cgroup.c
new file mode 100644
index 000000000000..3983795dda60
--- /dev/null
+++ b/block/mq-deadline-cgroup.c
@@ -0,0 +1,206 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/blk-cgroup.h>
+#include <linux/ioprio.h>
+
+#include "mq-deadline-cgroup.h"
+
+static struct blkcg_policy dd_blkcg_policy;
+
+static struct dd_blkcg *dd_blkcg_from_css(struct cgroup_subsys_state *css)
+{
+	struct blkcg *blkcg = css_to_blkcg(css);
+	struct blkcg_policy_data *cpd = blkcg_to_cpd(blkcg, &dd_blkcg_policy);
+
+	return container_of(cpd, struct dd_blkcg, cpd);
+}
+
+static u64 dd_show_prio(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	return dd_blkcg_from_css(css)->prio;
+}
+
+static int dd_set_prio(struct cgroup_subsys_state *css, struct cftype *cft,
+		       u64 val)
+{
+	struct dd_blkcg *dd_blkcg = dd_blkcg_from_css(css);
+
+	if (val < 0 || val > 3)
+		return -EINVAL;
+	dd_blkcg->prio = val;
+	return 0;
+}
+
+/*
+ * Total number of inserted requests not associated with any cgroup and
+ * with I/O priority cft->private.
+ */
+static u64 dd_show_nocg(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	const u8 prio = cft->private;
+
+	return atomic_read(&nocg_inserted[prio]);
+}
+
+/* Number of inserted but not completed requests for priority cft->private. */
+static u64 dd_show_inserted(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	struct dd_blkcg *blkcg = dd_blkcg_from_css(css);
+	const u8 prio = cft->private;
+
+	return atomic_read(&blkcg->inserted[prio]) -
+		atomic_read(&blkcg->completed[prio]);
+}
+
+static u64 dd_show_merged(struct cgroup_subsys_state *css, struct cftype *cft)
+{
+	struct dd_blkcg *blkcg = dd_blkcg_from_css(css);
+	const u8 prio = cft->private;
+
+	return atomic_read(&blkcg->merged[prio]);
+}
+
+/* Number of dispatched but not completed requests for priority cft->private. */
+static u64 dd_show_dispatched(struct cgroup_subsys_state *css,
+			      struct cftype *cft)
+{
+	struct dd_blkcg *blkcg = dd_blkcg_from_css(css);
+	const u8 prio = cft->private;
+
+	return atomic_read(&blkcg->dispatched[prio]) +
+		atomic_read(&blkcg->merged[prio]) -
+		atomic_read(&blkcg->completed[prio]);
+}
+
+#define IOPRI_ATTRS(pfx, cft_flags, callback)		\
+	{						\
+		.name		= pfx "0",		\
+		.private	= IOPRIO_CLASS_NONE,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	},						\
+	{						\
+		.name		= pfx "1",		\
+		.private	= IOPRIO_CLASS_RT,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	},						\
+	{						\
+		.name		= pfx "2",		\
+		.private	= IOPRIO_CLASS_BE,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	},						\
+	{						\
+		.name		= pfx "3",		\
+		.private	= IOPRIO_CLASS_IDLE,	\
+		.flags		= cft_flags,		\
+		.read_u64	= callback,		\
+	}
+
+#define DD_ATTRS							\
+	{								\
+		.name		= "dd.prio",				\
+		.read_u64	= dd_show_prio,				\
+		.write_u64	= dd_set_prio,				\
+	},								\
+	IOPRI_ATTRS("dd.nocg_pri", CFTYPE_ONLY_ON_ROOT, dd_show_nocg),	\
+	IOPRI_ATTRS("dd.inserted_pri", 0, dd_show_inserted),		\
+	IOPRI_ATTRS("dd.merged_pri", 0, dd_show_merged),		\
+	IOPRI_ATTRS("dd.dispatched_pri", 0, dd_show_dispatched),	\
+	{ } /* sentinel */
+
+/* cgroup v2 attributes */
+static struct cftype dd_blkcg_files[] = {
+	DD_ATTRS
+};
+
+/* cgroup v1 attributes */
+static struct cftype dd_blkcg_legacy_files[] = {
+	DD_ATTRS
+};
+
+static struct blkcg_policy_data *dd_cpd_alloc(gfp_t gfp)
+{
+	struct dd_blkcg *pd;
+
+	pd = kzalloc(sizeof(*pd), gfp);
+	if (!pd)
+		return NULL;
+	pd->prio = IOPRIO_CLASS_NONE;
+	return &pd->cpd;
+}
+
+static void dd_cpd_free(struct blkcg_policy_data *cpd)
+{
+	struct dd_blkcg *dd_blkcg = container_of(cpd, typeof(*dd_blkcg), cpd);
+
+	kfree(dd_blkcg);
+}
+
+/*
+ * Convert an association between a block cgroup and a request queue into a
+ * pointer to the mq-deadline information associated with a (blkcg, queue) pair.
+ */
+struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
+{
+	struct blkg_policy_data *pd;
+
+	pd = blkg_to_pd(bio->bi_blkg, &dd_blkcg_policy);
+	if (!pd)
+		return NULL;
+
+	return container_of(blkcg_to_cpd(pd->blkg->blkcg, &dd_blkcg_policy),
+			    struct dd_blkcg, cpd);
+}
+
+static struct blkg_policy_data *dd_pd_alloc(gfp_t gfp, struct request_queue *q,
+					    struct blkcg *blkcg)
+{
+	struct dd_blkg *pd;
+
+	pd = kzalloc(sizeof(*pd), gfp);
+	if (!pd)
+		return NULL;
+	return &pd->pd;
+}
+
+static void dd_pd_free(struct blkg_policy_data *pd)
+{
+	struct dd_blkg *dd_blkg = container_of(pd, typeof(*dd_blkg), pd);
+
+	kfree(dd_blkg);
+}
+
+static struct blkcg_policy dd_blkcg_policy = {
+	.dfl_cftypes		= dd_blkcg_files,
+	.legacy_cftypes		= dd_blkcg_legacy_files,
+
+	.cpd_alloc_fn		= dd_cpd_alloc,
+	.cpd_free_fn		= dd_cpd_free,
+
+	.pd_alloc_fn		= dd_pd_alloc,
+	.pd_free_fn		= dd_pd_free,
+};
+
+int dd_activate_policy(struct request_queue *q)
+{
+	return blkcg_activate_policy(q, &dd_blkcg_policy);
+}
+
+void dd_deactivate_policy(struct request_queue *q)
+{
+	blkcg_deactivate_policy(q, &dd_blkcg_policy);
+}
+
+int __init dd_blkcg_init(void)
+{
+	return blkcg_policy_register(&dd_blkcg_policy);
+}
+
+void __exit dd_blkcg_exit(void)
+{
+	blkcg_policy_unregister(&dd_blkcg_policy);
+}
diff --git a/block/mq-deadline-cgroup.h b/block/mq-deadline-cgroup.h
new file mode 100644
index 000000000000..0996a1a28412
--- /dev/null
+++ b/block/mq-deadline-cgroup.h
@@ -0,0 +1,73 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#if !defined(_MQ_DEADLINE_CGROUP_H_)
+#define _MQ_DEADLINE_CGROUP_H_
+
+#include <linux/blk-cgroup.h>
+
+struct request_queue;
+
+extern atomic_t nocg_inserted[4];
+
+/**
+ * struct dd_blkcg - Per cgroup data.
+ * @cpd: blkcg_policy_data structure.
+ * @prio: One of the IOPRIO_CLASS_* values. See also <linux/ioprio.h>.
+ * @inserted: Number of inserted requests for the given IOPRIO_CLASS_*.
+ * @merged: Number of merged requests for the given I/O priority level.
+ * @dispatched: Number of dispatched requests for the given IOPRIO_CLASS_*.
+ * @completed: Number of completions per I/O priority level.
+ */
+struct dd_blkcg {
+	struct blkcg_policy_data cpd;	/* must be the first member */
+	u8 prio;
+	atomic_t inserted[4];
+	atomic_t merged[4];
+	atomic_t dispatched[4];
+	atomic_t completed[4];
+};
+
+#ifdef CONFIG_BLK_CGROUP
+
+/**
+ * struct dd_blkg - Per (cgroup, request queue) data.
+ * @pd: blkg_policy_data structure.
+ */
+struct dd_blkg {
+	struct blkg_policy_data pd;	/* must be the first member */
+};
+
+struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio);
+int dd_activate_policy(struct request_queue *q);
+void dd_deactivate_policy(struct request_queue *q);
+int __init dd_blkcg_init(void);
+void __exit dd_blkcg_exit(void);
+
+#else /* CONFIG_BLK_CGROUP */
+
+static inline struct dd_blkcg *dd_blkcg_from_bio(struct bio *bio)
+{
+	return NULL;
+}
+
+static inline int dd_activate_policy(struct request_queue *q)
+{
+	return 0;
+}
+
+static inline void dd_deactivate_policy(struct request_queue *q)
+{
+}
+
+static inline int dd_blkcg_init(void)
+{
+	return 0;
+}
+
+static inline void dd_blkcg_exit(void)
+{
+}
+
+#endif /* CONFIG_BLK_CGROUP */
+
+#endif /* _MQ_DEADLINE_CGROUP_H_ */
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 5a703e1228fa..0aa14cc302fd 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -25,6 +25,7 @@
 #include "blk-mq-debugfs.h"
 #include "blk-mq-tag.h"
 #include "blk-mq-sched.h"
+#include "mq-deadline-cgroup.h"
 
 /*
  * See Documentation/block/deadline-iosched.rst
@@ -58,6 +59,9 @@ struct deadline_data {
 	 * run time data
 	 */
 
+	/* Request queue that owns this data structure. */
+	struct request_queue *queue;
+
 	/*
 	 * Requests are present on both sort_list[] and fifo_list[][]. The
 	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
@@ -101,6 +105,9 @@ static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
 };
 
+/* To track requests not associated with any cgroup. */
+atomic_t nocg_inserted[4];
+
 static inline struct rb_root *
 deadline_rb_root(struct deadline_data *dd, struct request *rq)
 {
@@ -189,9 +196,11 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 	struct deadline_data *dd = q->elevator->elevator_data;
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
+	struct dd_blkcg *blkcg = next->elv.priv[0];
 
-	if (next->elv.priv[0]) {
+	if (blkcg) {
 		atomic_inc(&dd->merged[prio]);
+		atomic_inc(&blkcg->merged[ioprio_class]);
 	} else {
 		WARN_ON_ONCE(true);
 	}
@@ -334,6 +343,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
+	struct dd_blkcg *blkcg;
 	u8 ioprio_class;
 
 	lockdep_assert_held(&dd->lock);
@@ -426,10 +436,12 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	dd->batching++;
 	deadline_move_request(dd, rq);
 done:
+	blkcg = rq->elv.priv[0];
 	ioprio_class = dd_rq_ioclass(rq);
 	prio = ioprio_class_to_prio[ioprio_class];
-	if (rq->elv.priv[0]) {
+	if (blkcg) {
 		atomic_inc(&dd->dispatched[prio]);
+		atomic_inc(&blkcg->dispatched[ioprio_class]);
 	} else {
 		WARN_ON_ONCE(true);
 	}
@@ -503,6 +515,8 @@ static void dd_exit_sched(struct elevator_queue *e)
 	struct deadline_data *dd = e->elevator_data;
 	enum dd_prio prio;
 
+	dd_deactivate_policy(dd->queue);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_READ]));
 		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_WRITE]));
@@ -512,25 +526,33 @@ static void dd_exit_sched(struct elevator_queue *e)
 }
 
 /*
- * initialize elevator private data (deadline_data).
+ * Initialize elevator private data (deadline_data) and associate with blkcg.
  */
 static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 {
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
 	enum dd_prio prio;
+	int ret = -ENOMEM;
+
+	/*
+	 * Initialization would be very tricky if the queue is not frozen,
+	 * hence the warning statement below.
+	 */
+	WARN_ON_ONCE(!percpu_ref_is_zero(&q->q_usage_counter));
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
-		return -ENOMEM;
+		goto out;
 
 	dd = kzalloc_node(sizeof(*dd), GFP_KERNEL, q->node);
-	if (!dd) {
-		kobject_put(&eq->kobj);
-		return -ENOMEM;
-	}
+	if (!dd)
+		goto put_eq;
+
 	eq->elevator_data = dd;
 
+	dd->queue = q;
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_READ]);
 		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_WRITE]);
@@ -547,8 +569,23 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
+	ret = dd_activate_policy(q);
+	if (ret)
+		goto free_dd;
+
+	ret = 0;
 	q->elevator = eq;
-	return 0;
+
+out:
+	return ret;
+
+free_dd:
+	eq->elevator_data = NULL;
+	kfree(dd);
+
+put_eq:
+	kobject_put(&eq->kobj);
+	goto out;
 }
 
 /*
@@ -611,6 +648,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	u16 ioprio = req_get_ioprio(rq);
 	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
 	enum dd_prio prio;
+	struct dd_blkcg *blkcg;
 
 	lockdep_assert_held(&dd->lock);
 
@@ -620,10 +658,27 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	 */
 	blk_req_zone_write_unlock(rq);
 
+	/*
+	 * If a block cgroup has been associated with the submitter and if an
+	 * I/O priority has been set in the associated block cgroup, use the
+	 * lowest of the cgroup priority and the request priority for the
+	 * request. If no priority has been set in the request, use the cgroup
+	 * priority.
+	 */
+	blkcg = dd_blkcg_from_bio(rq->bio);
+	if (blkcg) {
+		ioprio_class = max(ioprio_class, blkcg->prio);
+		rq->ioprio = IOPRIO_PRIO_VALUE(ioprio_class,
+					       IOPRIO_PRIO_DATA(ioprio));
+		atomic_inc(&blkcg->inserted[ioprio_class]);
+	} else {
+		WARN_ON_ONCE(true);
+		atomic_inc(&nocg_inserted[ioprio_class]);
+	}
 	prio = ioprio_class_to_prio[ioprio_class];
 	atomic_inc(&dd->inserted[prio]);
 	WARN_ON_ONCE(rq->elv.priv[0]);
-	rq->elv.priv[0] = (void *)1ULL;
+	rq->elv.priv[0] = blkcg;
 	rq->elv.priv[1] = (void *)(uintptr_t)ioprio_class;
 
 	if (blk_mq_sched_try_insert_merge(q, rq))
@@ -696,11 +751,14 @@ static void dd_finish_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
 	struct deadline_data *dd = q->elevator->elevator_data;
+	struct dd_blkcg *blkcg = rq->elv.priv[0];
 	const u8 ioprio_class = dd_rq_ioclass(rq);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 
-	if (rq->elv.priv[0])
+	if (blkcg) {
 		atomic_inc(&dd->completed[prio]);
+		atomic_inc(&blkcg->completed[ioprio_class]);
+	}
 
 	if (blk_queue_is_zoned(q)) {
 		unsigned long flags;
@@ -1021,11 +1079,25 @@ MODULE_ALIAS("mq-deadline-iosched");
 
 static int __init deadline_init(void)
 {
-	return elv_register(&mq_deadline);
+	int ret;
+	ret = elv_register(&mq_deadline);
+	if (ret)
+		goto out;
+	ret = dd_blkcg_init();
+	if (ret)
+		goto unreg;
+
+out:
+	return ret;
+
+unreg:
+	elv_unregister(&mq_deadline);
+	goto out;
 }
 
 static void __exit deadline_exit(void)
 {
+	dd_blkcg_exit();
 	elv_unregister(&mq_deadline);
 }
 

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
@ 2021-05-27  2:25   ` Chaitanya Kulkarni
  2021-05-27  3:09   ` Damien Le Moal
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Chaitanya Kulkarni @ 2021-05-27  2:25 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/26/21 18:02, Bart Van Assche wrote:
> Document the locking strategy by adding two lockdep_assert_held()
> statements.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/9] block/mq-deadline: Remove two local variables
  2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
@ 2021-05-27  2:26   ` Chaitanya Kulkarni
  2021-05-27  3:11   ` Damien Le Moal
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Chaitanya Kulkarni @ 2021-05-27  2:26 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/26/21 18:02, Bart Van Assche wrote:
> Make __dd_dispatch_request() easier to read by removing two local
> variables.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
@ 2021-05-27  2:27   ` Chaitanya Kulkarni
  2021-05-27  3:13   ` Damien Le Moal
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Chaitanya Kulkarni @ 2021-05-27  2:27 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/26/21 18:02, Bart Van Assche wrote:
> Change "queue" into "sched" to make the function names reflect better the
> purpose of these functions.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
@ 2021-05-27  2:28   ` Chaitanya Kulkarni
  2021-05-27  3:24   ` Damien Le Moal
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Chaitanya Kulkarni @ 2021-05-27  2:28 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/26/21 18:02, Bart Van Assche wrote:
> Modern compilers complain if an out-of-range value is passed to a function
> argument that has an enumeration type. Let the compiler detect out-of-range
> data direction arguments instead of verifying the data_dir argument at
> runtime.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media
  2021-05-27  1:01 ` [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media Bart Van Assche
@ 2021-05-27  2:30   ` Chaitanya Kulkarni
  2021-05-27  3:27   ` Damien Le Moal
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 65+ messages in thread
From: Chaitanya Kulkarni @ 2021-05-27  2:30 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/26/21 18:02, Bart Van Assche wrote:
> -	dd->fifo_expire[DD_READ] = read_expire;
> +	dd->fifo_expire[DD_READ] = blk_queue_nonrot(q) ? read_expire_nonrot :
> +		read_expire_rot;

I've got comments about not using  ?:.
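
For illustration only, the same assignment without ?: could look like:

	if (blk_queue_nonrot(q))
		dd->fifo_expire[DD_READ] = read_expire_nonrot;
	else
		dd->fifo_expire[DD_READ] = read_expire_rot;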

Apart from that looks good.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/9] block/mq-deadline: Add several comments
  2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
@ 2021-05-27  3:03   ` Damien Le Moal
  2021-05-27  6:45   ` Hannes Reinecke
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:03 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:01, Bart Van Assche wrote:
> Make the code easier to read by adding more comments.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
  2021-05-27  2:25   ` Chaitanya Kulkarni
@ 2021-05-27  3:09   ` Damien Le Moal
  2021-05-27  6:46   ` Hannes Reinecke
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:09 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:02, Bart Van Assche wrote:
> Document the locking strategy by adding two lockdep_assert_held()
> statements.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 64cabbc157ea..4da0bd3b9827 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -279,6 +279,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	bool reads, writes;
>  	int data_dir;
>  
> +	lockdep_assert_held(&dd->lock);
> +
>  	if (!list_empty(&dd->dispatch)) {
>  		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
>  		list_del_init(&rq->queuelist);
> @@ -501,6 +503,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  	const int data_dir = rq_data_dir(rq);
>  
> +	lockdep_assert_held(&dd->lock);
> +
>  	/*
>  	 * This may be a requeue of a write request that has locked its
>  	 * target zone. If it is the case, this releases the zone lock.
> 

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/9] block/mq-deadline: Remove two local variables
  2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
  2021-05-27  2:26   ` Chaitanya Kulkarni
@ 2021-05-27  3:11   ` Damien Le Moal
  2021-05-27  6:46   ` Hannes Reinecke
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:11 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:01, Bart Van Assche wrote:
> Make __dd_dispatch_request() easier to read by removing two local
> variables.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 4da0bd3b9827..cc9d0fc72db2 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -276,7 +276,6 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
>  static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  {
>  	struct request *rq, *next_rq;
> -	bool reads, writes;
>  	int data_dir;
>  
>  	lockdep_assert_held(&dd->lock);
> @@ -287,9 +286,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  		goto done;
>  	}
>  
> -	reads = !list_empty(&dd->fifo_list[READ]);
> -	writes = !list_empty(&dd->fifo_list[WRITE]);
> -
>  	/*
>  	 * batches are currently reads XOR writes
>  	 */
> @@ -306,7 +302,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * data direction (read / write)
>  	 */
>  
> -	if (reads) {
> +	if (!list_empty(&dd->fifo_list[READ])) {
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
>  
>  		if (deadline_fifo_request(dd, WRITE) &&
> @@ -322,7 +318,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * there are either no reads or writes have been starved
>  	 */
>  
> -	if (writes) {
> +	if (!list_empty(&dd->fifo_list[WRITE])) {
>  dispatch_writes:
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
>  
> 

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
  2021-05-27  2:27   ` Chaitanya Kulkarni
@ 2021-05-27  3:13   ` Damien Le Moal
  2021-05-27 19:33     ` Bart Van Assche
  2021-05-27  6:47   ` Hannes Reinecke
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:13 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:01, Bart Van Assche wrote:
> Change "queue" into "sched" to make the function names reflect better the
> purpose of these functions.

Nit: may be change this to:

Change "queue" into "sched" to make the function names match the elevator type
operation names.

> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index cc9d0fc72db2..c4f51e7884fb 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -395,7 +395,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  	return rq;
>  }
>  
> -static void dd_exit_queue(struct elevator_queue *e)
> +static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
>  
> @@ -408,7 +408,7 @@ static void dd_exit_queue(struct elevator_queue *e)
>  /*
>   * initialize elevator private data (deadline_data).
>   */
> -static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  {
>  	struct deadline_data *dd;
>  	struct elevator_queue *eq;
> @@ -800,8 +800,8 @@ static struct elevator_type mq_deadline = {
>  		.requests_merged	= dd_merged_requests,
>  		.request_merged		= dd_request_merged,
>  		.has_work		= dd_has_work,
> -		.init_sched		= dd_init_queue,
> -		.exit_sched		= dd_exit_queue,
> +		.init_sched		= dd_init_sched,
> +		.exit_sched		= dd_exit_sched,
>  	},
>  
>  #ifdef CONFIG_BLK_DEBUG_FS
> 

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
  2021-05-27  2:28   ` Chaitanya Kulkarni
@ 2021-05-27  3:24   ` Damien Le Moal
  2021-05-27 19:38     ` Bart Van Assche
  2021-05-27  6:48   ` Hannes Reinecke
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:24 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:02, Bart Van Assche wrote:
> Modern compilers complain if an out-of-range value is passed to a function
> argument that has an enumeration type. Let the compiler detect out-of-range
> data direction arguments instead of verifying the data_dir argument at
> runtime.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 96 +++++++++++++++++++++++----------------------
>  1 file changed, 49 insertions(+), 47 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index c4f51e7884fb..dfbc6b77fa71 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -35,6 +35,13 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
>  static const int fifo_batch = 16;       /* # of sequential requests treated as one
>  				     by the above parameters. For throughput. */
>  
> +enum dd_data_dir {
> +	DD_READ		= READ,
> +	DD_WRITE	= WRITE,
> +} __packed;

Why the "__packed" here ?

> +
> +enum { DD_DIR_COUNT = 2 };

Why not a simple #define for this one ?
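
Just as an illustration, the #define form would be:

	#define DD_DIR_COUNT	2

(keeping it as an enum constant is fine too; enum constants at least show up
in debug info, so either form works).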

> +
>  struct deadline_data {
>  	/*
>  	 * run time data
> @@ -43,20 +50,20 @@ struct deadline_data {
>  	/*
>  	 * requests (deadline_rq s) are present on both sort_list and fifo_list
>  	 */
> -	struct rb_root sort_list[2];
> -	struct list_head fifo_list[2];
> +	struct rb_root sort_list[DD_DIR_COUNT];
> +	struct list_head fifo_list[DD_DIR_COUNT];
>  
>  	/*
>  	 * next in sort order. read, write or both are NULL
>  	 */
> -	struct request *next_rq[2];
> +	struct request *next_rq[DD_DIR_COUNT];
>  	unsigned int batching;		/* number of sequential requests made */
>  	unsigned int starved;		/* times reads have starved writes */
>  
>  	/*
>  	 * settings that change how the i/o scheduler behaves
>  	 */
> -	int fifo_expire[2];
> +	int fifo_expire[DD_DIR_COUNT];
>  	int fifo_batch;
>  	int writes_starved;
>  	int front_merges;
> @@ -97,7 +104,7 @@ deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
>  static inline void
>  deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
>  {
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>  
>  	if (dd->next_rq[data_dir] == rq)
>  		dd->next_rq[data_dir] = deadline_latter_request(rq);
> @@ -169,10 +176,10 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
>  static void
>  deadline_move_request(struct deadline_data *dd, struct request *rq)
>  {
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>  
> -	dd->next_rq[READ] = NULL;
> -	dd->next_rq[WRITE] = NULL;
> +	dd->next_rq[DD_READ] = NULL;
> +	dd->next_rq[DD_WRITE] = NULL;
>  	dd->next_rq[data_dir] = deadline_latter_request(rq);
>  
>  	/*
> @@ -185,9 +192,10 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>   */
> -static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +static inline int deadline_check_fifo(struct deadline_data *dd,
> +				      enum dd_data_dir data_dir)
>  {
> -	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
>  
>  	/*
>  	 * rq is expired!
> @@ -203,19 +211,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
>   * dispatch using arrival ordered lists.
>   */
>  static struct request *
> -deadline_fifo_request(struct deadline_data *dd, int data_dir)
> +deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
>  
> -	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
> -		return NULL;
> -
>  	if (list_empty(&dd->fifo_list[data_dir]))
>  		return NULL;
>  
>  	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> -	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>  		return rq;
>  
>  	/*
> @@ -223,7 +228,7 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
>  	 * an unlocked target zone.
>  	 */
>  	spin_lock_irqsave(&dd->zone_lock, flags);
> -	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
> +	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
>  		if (blk_req_can_dispatch_to_zone(rq))
>  			goto out;
>  	}
> @@ -239,19 +244,16 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
>   * dispatch using sector position sorted lists.
>   */
>  static struct request *
> -deadline_next_request(struct deadline_data *dd, int data_dir)
> +deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
>  
> -	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
> -		return NULL;
> -
>  	rq = dd->next_rq[data_dir];
>  	if (!rq)
>  		return NULL;
>  
> -	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>  		return rq;
>  
>  	/*
> @@ -276,7 +278,7 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
>  static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  {
>  	struct request *rq, *next_rq;
> -	int data_dir;
> +	enum dd_data_dir data_dir;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> @@ -289,9 +291,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * batches are currently reads XOR writes
>  	 */
> -	rq = deadline_next_request(dd, WRITE);
> +	rq = deadline_next_request(dd, DD_WRITE);
>  	if (!rq)
> -		rq = deadline_next_request(dd, READ);
> +		rq = deadline_next_request(dd, DD_READ);
>  
>  	if (rq && dd->batching < dd->fifo_batch)
>  		/* we have a next request are still entitled to batch */
> @@ -302,14 +304,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * data direction (read / write)
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[READ])) {
> -		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +	if (!list_empty(&dd->fifo_list[DD_READ])) {
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
>  
> -		if (deadline_fifo_request(dd, WRITE) &&
> +		if (deadline_fifo_request(dd, DD_WRITE) &&
>  		    (dd->starved++ >= dd->writes_starved))
>  			goto dispatch_writes;
>  
> -		data_dir = READ;
> +		data_dir = DD_READ;
>  
>  		goto dispatch_find_request;
>  	}
> @@ -318,13 +320,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * there are either no reads or writes have been starved
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[WRITE])) {
> +	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
>  dispatch_writes:
> -		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
>  
>  		dd->starved = 0;
>  
> -		data_dir = WRITE;
> +		data_dir = DD_WRITE;
>  
>  		goto dispatch_find_request;
>  	}
> @@ -399,8 +401,8 @@ static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
>  
> -	BUG_ON(!list_empty(&dd->fifo_list[READ]));
> -	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
> +	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
>  
>  	kfree(dd);
>  }
> @@ -424,12 +426,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	}
>  	eq->elevator_data = dd;
>  
> -	INIT_LIST_HEAD(&dd->fifo_list[READ]);
> -	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> -	dd->sort_list[READ] = RB_ROOT;
> -	dd->sort_list[WRITE] = RB_ROOT;
> -	dd->fifo_expire[READ] = read_expire;
> -	dd->fifo_expire[WRITE] = write_expire;
> +	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
> +	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
> +	dd->sort_list[DD_READ] = RB_ROOT;
> +	dd->sort_list[DD_WRITE] = RB_ROOT;
> +	dd->fifo_expire[DD_READ] = read_expire;
> +	dd->fifo_expire[DD_WRITE] = write_expire;
>  	dd->writes_starved = writes_starved;
>  	dd->front_merges = 1;
>  	dd->fifo_batch = fifo_batch;
> @@ -497,7 +499,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  {
>  	struct request_queue *q = hctx->queue;
>  	struct deadline_data *dd = q->elevator->elevator_data;
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
> 
>  	lockdep_assert_held(&dd->lock);
>  
> @@ -585,7 +587,7 @@ static void dd_finish_request(struct request *rq)
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);
>  		blk_req_zone_write_unlock(rq);
> -		if (!list_empty(&dd->fifo_list[WRITE]))
> +		if (!list_empty(&dd->fifo_list[DD_WRITE]))
>  			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>  		spin_unlock_irqrestore(&dd->zone_lock, flags);
>  	}
> @@ -626,8 +628,8 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>  		__data = jiffies_to_msecs(__data);			\
>  	return deadline_var_show(__data, (page));			\
>  }
> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
>  SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
>  SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
>  SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> @@ -649,8 +651,8 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
>  		*(__PTR) = __data;					\
>  	return count;							\
>  }
> -STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> -STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
>  STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
>  STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
>  STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> @@ -717,8 +719,8 @@ static int deadline_##name##_next_rq_show(void *data,			\
>  		__blk_mq_debugfs_rq_show(m, rq);			\
>  	return 0;							\
>  }
> -DEADLINE_DEBUGFS_DDIR_ATTRS(READ, read)
> -DEADLINE_DEBUGFS_DDIR_ATTRS(WRITE, write)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
>  #undef DEADLINE_DEBUGFS_DDIR_ATTRS
>  
>  static int deadline_batching_show(void *data, struct seq_file *m)
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media
  2021-05-27  1:01 ` [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media Bart Van Assche
  2021-05-27  2:30   ` Chaitanya Kulkarni
@ 2021-05-27  3:27   ` Damien Le Moal
  2021-05-27 19:43     ` Bart Van Assche
  2021-05-27  6:52   ` Hannes Reinecke
  2021-05-27 15:20   ` Himanshu Madhani
  3 siblings, 1 reply; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:27 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:02, Bart Van Assche wrote:
> Rotational media (hard disks) benefit more from merging requests than
> non-rotational media. Reduce the read expire time for non-rotational media

"Reduce the read expire time for non-rotational..." -> "Reduce the default read
expire time for non-rotational..."

> to reduce read latency.

I am not against this, but I wonder if we should not let udev do that with some
rules instead of adding totally arbitrary values here...

> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index dfbc6b77fa71..2ab844a4b6b5 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -29,7 +29,9 @@
>  /*
>   * See Documentation/block/deadline-iosched.rst
>   */
> -static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
> +/* max time before a read is submitted. */
> +static const int read_expire_rot = HZ / 2;
> +static const int read_expire_nonrot = 1;
>  static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
>  static const int writes_starved = 2;    /* max times reads can starve a write */
>  static const int fifo_batch = 16;       /* # of sequential requests treated as one
> @@ -430,7 +432,8 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
>  	dd->sort_list[DD_READ] = RB_ROOT;
>  	dd->sort_list[DD_WRITE] = RB_ROOT;
> -	dd->fifo_expire[DD_READ] = read_expire;
> +	dd->fifo_expire[DD_READ] = blk_queue_nonrot(q) ? read_expire_nonrot :
> +		read_expire_rot;
>  	dd->fifo_expire[DD_WRITE] = write_expire;
>  	dd->writes_starved = writes_starved;
>  	dd->front_merges = 1;
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests
  2021-05-27  1:01 ` [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests Bart Van Assche
@ 2021-05-27  3:33   ` Damien Le Moal
  2021-05-27 20:00     ` Bart Van Assche
  2021-05-27  6:54   ` Hannes Reinecke
  1 sibling, 1 reply; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:33 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:02, Bart Van Assche wrote:
> For interactive workloads it is important that synchronous requests are
> not delayed. Hence reserve 25% of tags for synchronous requests. This patch

s/tags/scheduler tags

to be clear that we are not talking about the device tags. Same in the patch
title, maybe.

> still allows asynchronous requests to fill the hardware queues since
> blk_mq_init_sched() makes sure that the number of scheduler requests is the
> double of the hardware queue depth. From blk_mq_init_sched():
> 
> 	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
> 				   BLKDEV_MAX_RQ);
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 2ab844a4b6b5..81f487d77e09 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -69,6 +69,7 @@ struct deadline_data {
>  	int fifo_batch;
>  	int writes_starved;
>  	int front_merges;
> +	u32 async_depth;
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> @@ -399,6 +400,38 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  	return rq;
>  }
>  
> +static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)

Similar to what you did in patch 1, maybe add a comment about this operation
and when it is called ?

> +{
> +	struct deadline_data *dd = data->q->elevator->elevator_data;
> +
> +	/* Do not throttle synchronous reads. */
> +	if (op_is_sync(op) && !op_is_write(op))
> +		return;
> +
> +	/*
> +	 * Throttle asynchronous requests and writes such that these requests
> +	 * do not block the allocation of synchronous requests.
> +	 */
> +	data->shallow_depth = dd->async_depth;
> +}
> +
> +static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	struct blk_mq_tags *tags = hctx->sched_tags;
> +
> +	dd->async_depth = 3 * q->nr_requests / 4;

I think that nr_requests is always at least 2, but it may be good to have a
sanity check here that we do not end up with async_depth == 0, no ?
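
Something along these lines (illustrative only) would guarantee a non-zero
async_depth:

	dd->async_depth = max_t(u32, 1, 3 * q->nr_requests / 4);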

> +
> +	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, dd->async_depth);
> +}
> +
> +static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
> +{
> +	dd_depth_updated(hctx);
> +	return 0;
> +}
> +
>  static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
> @@ -744,6 +777,15 @@ static int deadline_starved_show(void *data, struct seq_file *m)
>  	return 0;
>  }
>  
> +static int dd_async_depth_show(void *data, struct seq_file *m)
> +{
> +	struct request_queue *q = data;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +
> +	seq_printf(m, "%u\n", dd->async_depth);
> +	return 0;
> +}
> +
>  static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
>  	__acquires(&dd->lock)
>  {
> @@ -786,6 +828,7 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
>  	DEADLINE_QUEUE_DDIR_ATTRS(write),
>  	{"batching", 0400, deadline_batching_show},
>  	{"starved", 0400, deadline_starved_show},
> +	{"async_depth", 0400, dd_async_depth_show},
>  	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
>  	{},
>  };
> @@ -794,6 +837,8 @@ static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
>  
>  static struct elevator_type mq_deadline = {
>  	.ops = {
> +		.depth_updated		= dd_depth_updated,
> +		.limit_depth		= dd_limit_depth,
>  		.insert_requests	= dd_insert_requests,
>  		.dispatch_request	= dd_dispatch_request,
>  		.prepare_request	= dd_prepare_request,
> @@ -807,6 +852,7 @@ static struct elevator_type mq_deadline = {
>  		.has_work		= dd_has_work,
>  		.init_sched		= dd_init_sched,
>  		.exit_sched		= dd_exit_sched,
> +		.init_hctx		= dd_init_hctx,
>  	},
>  
>  #ifdef CONFIG_BLK_DEBUG_FS
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 8/9] block/mq-deadline: Add I/O priority support
  2021-05-27  1:01 ` [PATCH 8/9] block/mq-deadline: Add I/O priority support Bart Van Assche
@ 2021-05-27  3:48   ` Damien Le Moal
  2021-05-27 20:12     ` Bart Van Assche
  2021-05-27  7:07   ` Hannes Reinecke
  1 sibling, 1 reply; 65+ messages in thread
From: Damien Le Moal @ 2021-05-27  3:48 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/27 10:02, Bart Van Assche wrote:
> Maintain one FIFO list per I/O priority: RT, BE and IDLE. Only dispatch

I/O priority -> I/O priority class. Since there is the prio level too, I think
it is better to clarify that this acts only on the class, not the level. It may
be good to mention that the prio level is ignored.
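
For example (illustrative, using the ioprio helpers the patch already relies
on), an ioprio value encodes both a class and a level, and only the former is
used here:

	u16 ioprio = req_get_ioprio(rq);
	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);	/* NONE, RT, BE or IDLE */
	u8 ioprio_level = IOPRIO_PRIO_DATA(ioprio);	/* 0..7, not used by this patch */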

> requests for a lower priority after all higher priority requests have
> finished. Maintain statistics for each priority level. Split the debugfs
> attributes per priority level as follows:

s/priority level/priority class

> 
> $ ls /sys/kernel/debug/block/.../sched/
> async_depth  dispatch2        read_next_rq      write2_fifo_list
> batching     read0_fifo_list  starved           write_next_rq
> dispatch0    read1_fifo_list  write0_fifo_list
> dispatch1    read2_fifo_list  write1_fifo_list
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 293 +++++++++++++++++++++++++++++++++-----------
>  1 file changed, 223 insertions(+), 70 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 81f487d77e09..5a703e1228fa 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -44,16 +44,27 @@ enum dd_data_dir {
>  
>  enum { DD_DIR_COUNT = 2 };
>  
> +enum dd_prio {
> +	DD_RT_PRIO	= 0,
> +	DD_BE_PRIO	= 1,
> +	DD_IDLE_PRIO	= 2,
> +	DD_PRIO_MAX	= 2,
> +} __packed;

Why __packed ?

> +
> +enum { DD_PRIO_COUNT = 3 };

#define ?

> +
>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
>  
>  	/*
> -	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 * Requests are present on both sort_list[] and fifo_list[][]. The
> +	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
> +	 * The second index is the data direction (rq_data_dir(rq)).
>  	 */
>  	struct rb_root sort_list[DD_DIR_COUNT];
> -	struct list_head fifo_list[DD_DIR_COUNT];
> +	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];
>  
>  	/*
>  	 * next in sort order. read, write or both are NULL
> @@ -62,6 +73,12 @@ struct deadline_data {
>  	unsigned int batching;		/* number of sequential requests made */
>  	unsigned int starved;		/* times reads have starved writes */
>  
> +	/* statistics */
> +	atomic_t inserted[DD_PRIO_COUNT];
> +	atomic_t dispatched[DD_PRIO_COUNT];
> +	atomic_t merged[DD_PRIO_COUNT];
> +	atomic_t completed[DD_PRIO_COUNT];
> +
>  	/*
>  	 * settings that change how the i/o scheduler behaves
>  	 */
> @@ -73,7 +90,15 @@ struct deadline_data {
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> -	struct list_head dispatch;
> +	struct list_head dispatch[DD_PRIO_COUNT];
> +};
> +
> +/* Maps an I/O priority class to a deadline scheduler priority. */
> +static const enum dd_prio ioprio_class_to_prio[] = {
> +	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
> +	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
>  };
>  
>  static inline struct rb_root *
> @@ -149,12 +174,28 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
>  	}
>  }
>  
> +/* Returns the I/O class that has been assigned by dd_insert_request(). */
> +static u8 dd_rq_ioclass(struct request *rq)
> +{
> +	return (uintptr_t)rq->elv.priv[1];
> +}
> +
>  /*
>   * Callback function that is invoked after @next has been merged into @req.
>   */
>  static void dd_merged_requests(struct request_queue *q, struct request *req,
>  			       struct request *next)
>  {
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const u8 ioprio_class = dd_rq_ioclass(next);
> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
> +
> +	if (next->elv.priv[0]) {
> +		atomic_inc(&dd->merged[prio]);
> +	} else {
> +		WARN_ON_ONCE(true);
> +	}

I do not think you need the curly braces here.

> +
>  	/*
>  	 * if next expires before rq, assign its expire time to rq
>  	 * and move into next position (next will be deleted) in fifo
> @@ -191,14 +232,22 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
>  	deadline_remove_request(rq->q, rq);
>  }
>  
> +/* Number of requests queued for a given priority level. */
> +static u64 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return atomic_read(&dd->inserted[prio]) -
> +		atomic_read(&dd->completed[prio]);
> +}
> +
>  /*
>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>   */
>  static inline int deadline_check_fifo(struct deadline_data *dd,
> +				      enum dd_prio prio,
>  				      enum dd_data_dir data_dir)
>  {
> -	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>  
>  	/*
>  	 * rq is expired!
> @@ -214,15 +263,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd,
>   * dispatch using arrival ordered lists.
>   */
>  static struct request *
> -deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
> +deadline_fifo_request(struct deadline_data *dd, enum dd_prio prio,
> +		      enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
>  
> -	if (list_empty(&dd->fifo_list[data_dir]))
> +	if (list_empty(&dd->fifo_list[prio][data_dir]))
>  		return NULL;
>  
> -	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>  	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>  		return rq;
>  
> @@ -231,7 +281,7 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>  	 * an unlocked target zone.
>  	 */
>  	spin_lock_irqsave(&dd->zone_lock, flags);
> -	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
> +	list_for_each_entry(rq, &dd->fifo_list[prio][DD_WRITE], queuelist) {
>  		if (blk_req_can_dispatch_to_zone(rq))
>  			goto out;
>  	}
> @@ -247,7 +297,8 @@ deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   * dispatch using sector position sorted lists.
>   */
>  static struct request *
> -deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
> +deadline_next_request(struct deadline_data *dd, enum dd_prio prio,
> +		      enum dd_data_dir data_dir)
>  {
>  	struct request *rq;
>  	unsigned long flags;
> @@ -278,15 +329,18 @@ deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   * deadline_dispatch_requests selects the best request according to
>   * read/write expire, fifo_batch, etc
>   */
> -static struct request *__dd_dispatch_request(struct deadline_data *dd)
> +static struct request *__dd_dispatch_request(struct deadline_data *dd,
> +					     enum dd_prio prio)
>  {
>  	struct request *rq, *next_rq;
>  	enum dd_data_dir data_dir;
> +	u8 ioprio_class;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> -	if (!list_empty(&dd->dispatch)) {
> -		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
> +	if (!list_empty(&dd->dispatch[prio])) {
> +		rq = list_first_entry(&dd->dispatch[prio], struct request,
> +				      queuelist);
>  		list_del_init(&rq->queuelist);
>  		goto done;
>  	}
> @@ -294,9 +348,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * batches are currently reads XOR writes
>  	 */
> -	rq = deadline_next_request(dd, DD_WRITE);
> +	rq = deadline_next_request(dd, prio, DD_WRITE);
>  	if (!rq)
> -		rq = deadline_next_request(dd, DD_READ);
> +		rq = deadline_next_request(dd, prio, DD_READ);
>  
>  	if (rq && dd->batching < dd->fifo_batch)
>  		/* we have a next request are still entitled to batch */
> @@ -307,10 +361,10 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * data direction (read / write)
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[DD_READ])) {
> +	if (!list_empty(&dd->fifo_list[prio][DD_READ])) {
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
>  
> -		if (deadline_fifo_request(dd, DD_WRITE) &&
> +		if (deadline_fifo_request(dd, prio, DD_WRITE) &&
>  		    (dd->starved++ >= dd->writes_starved))
>  			goto dispatch_writes;
>  
> @@ -323,7 +377,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	 * there are either no reads or writes have been starved
>  	 */
>  
> -	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
> +	if (!list_empty(&dd->fifo_list[prio][DD_WRITE])) {
>  dispatch_writes:
>  		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
>  
> @@ -340,14 +394,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	/*
>  	 * we are not running a batch, find best request for selected data_dir
>  	 */
> -	next_rq = deadline_next_request(dd, data_dir);
> -	if (deadline_check_fifo(dd, data_dir) || !next_rq) {
> +	next_rq = deadline_next_request(dd, prio, data_dir);
> +	if (deadline_check_fifo(dd, prio, data_dir) || !next_rq) {
>  		/*
>  		 * A deadline has expired, the last request was in the other
>  		 * direction, or we have run out of higher-sectored requests.
>  		 * Start again from the request with the earliest expiry time.
>  		 */
> -		rq = deadline_fifo_request(dd, data_dir);
> +		rq = deadline_fifo_request(dd, prio, data_dir);
>  	} else {
>  		/*
>  		 * The last req was the same dir and we have a next request in
> @@ -372,6 +426,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  	dd->batching++;
>  	deadline_move_request(dd, rq);
>  done:
> +	ioprio_class = dd_rq_ioclass(rq);
> +	prio = ioprio_class_to_prio[ioprio_class];
> +	if (rq->elv.priv[0]) {
> +		atomic_inc(&dd->dispatched[prio]);
> +	} else {
> +		WARN_ON_ONCE(true);
> +	}

No need for the curly braces.

>  	/*
>  	 * If the request needs its target zone locked, do it.
>  	 */
> @@ -392,9 +453,14 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>  	struct request *rq;
> +	enum dd_prio prio;
>  
>  	spin_lock(&dd->lock);
> -	rq = __dd_dispatch_request(dd);
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, prio);
> +		if (rq || dd_queued(dd, prio))
> +			break;
> +	}
>  	spin_unlock(&dd->lock);


Unless I missed something, this means that the aging (fifo list expire)
mechanism is per prio list but there is no aging between the prio classes. This
means that an application doing lots of RT direct IOs will completely starve
other prio classes and hang the applications using them.

I think we need aging between the prio classes too to avoid that.
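
As a purely hypothetical sketch (not something this patch implements), the
dispatch code could check whether a lower class holds an expired request
before serving a higher class, e.g.:

	/* Illustrative only: does any lower-priority class have an expired request? */
	static bool dd_lower_prio_expired(struct deadline_data *dd, enum dd_prio prio)
	{
		enum dd_prio p;
		enum dd_data_dir dir;

		for (p = prio + 1; p <= DD_PRIO_MAX; p++)
			for (dir = DD_READ; dir <= DD_WRITE; dir++)
				if (!list_empty(&dd->fifo_list[p][dir]) &&
				    deadline_check_fifo(dd, p, dir))
					return true;
		return false;
	}

and fall back to that class when it returns true.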

>  
>  	return rq;
> @@ -435,9 +501,12 @@ static int dd_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
>  static void dd_exit_sched(struct elevator_queue *e)
>  {
>  	struct deadline_data *dd = e->elevator_data;
> +	enum dd_prio prio;
>  
> -	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
> -	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_READ]));
> +		WARN_ON_ONCE(!list_empty(&dd->fifo_list[prio][DD_WRITE]));
> +	}
>  
>  	kfree(dd);
>  }
> @@ -449,6 +518,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  {
>  	struct deadline_data *dd;
>  	struct elevator_queue *eq;
> +	enum dd_prio prio;
>  
>  	eq = elevator_alloc(q, e);
>  	if (!eq)
> @@ -461,8 +531,11 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	}
>  	eq->elevator_data = dd;
>  
> -	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
> -	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> +		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_READ]);
> +		INIT_LIST_HEAD(&dd->fifo_list[prio][DD_WRITE]);
> +		INIT_LIST_HEAD(&dd->dispatch[prio]);
> +	}
>  	dd->sort_list[DD_READ] = RB_ROOT;
>  	dd->sort_list[DD_WRITE] = RB_ROOT;
>  	dd->fifo_expire[DD_READ] = blk_queue_nonrot(q) ? read_expire_nonrot :
> @@ -473,7 +546,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	dd->fifo_batch = fifo_batch;
>  	spin_lock_init(&dd->lock);
>  	spin_lock_init(&dd->zone_lock);
> -	INIT_LIST_HEAD(&dd->dispatch);
>  
>  	q->elevator = eq;
>  	return 0;
> @@ -536,6 +608,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	struct request_queue *q = hctx->queue;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  	const enum dd_data_dir data_dir = rq_data_dir(rq);
> +	u16 ioprio = req_get_ioprio(rq);
> +	u8 ioprio_class = IOPRIO_PRIO_CLASS(ioprio);
> +	enum dd_prio prio;
>  
>  	lockdep_assert_held(&dd->lock);
>  
> @@ -545,13 +620,19 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	 */
>  	blk_req_zone_write_unlock(rq);
>  
> +	prio = ioprio_class_to_prio[ioprio_class];
> +	atomic_inc(&dd->inserted[prio]);
> +	WARN_ON_ONCE(rq->elv.priv[0]);
> +	rq->elv.priv[0] = (void *)1ULL;
> +	rq->elv.priv[1] = (void *)(uintptr_t)ioprio_class;
> +
>  	if (blk_mq_sched_try_insert_merge(q, rq))
>  		return;
>  
>  	trace_block_rq_insert(rq);
>  
>  	if (at_head) {
> -		list_add(&rq->queuelist, &dd->dispatch);
> +		list_add(&rq->queuelist, &dd->dispatch[prio]);
>  	} else {
>  		deadline_add_rq_rb(dd, rq);
>  
> @@ -565,7 +646,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  		 * set expire time and add to fifo list
>  		 */
>  		rq->fifo_time = jiffies + dd->fifo_expire[data_dir];
> -		list_add_tail(&rq->queuelist, &dd->fifo_list[data_dir]);
> +		list_add_tail(&rq->queuelist, &dd->fifo_list[prio][data_dir]);
>  	}
>  }
>  
> @@ -589,12 +670,10 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>  	spin_unlock(&dd->lock);
>  }
>  
> -/*
> - * Nothing to do here. This is defined only to ensure that .finish_request
> - * method is called upon request completion.
> - */
> +/* Callback function called from inside blk_mq_rq_ctx_init(). */
>  static void dd_prepare_request(struct request *rq)
>  {
> +	rq->elv.priv[0] = NULL;
>  }
>  
>  /*
> @@ -616,26 +695,41 @@ static void dd_prepare_request(struct request *rq)
>  static void dd_finish_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const u8 ioprio_class = dd_rq_ioclass(rq);
> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
> +
> +	if (rq->elv.priv[0])
> +		atomic_inc(&dd->completed[prio]);
>  
>  	if (blk_queue_is_zoned(q)) {
> -		struct deadline_data *dd = q->elevator->elevator_data;
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);
>  		blk_req_zone_write_unlock(rq);
> -		if (!list_empty(&dd->fifo_list[DD_WRITE]))
> +		if (!list_empty(&dd->fifo_list[prio][DD_WRITE]))
>  			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>  		spin_unlock_irqrestore(&dd->zone_lock, flags);
>  	}
>  }
>  
> +static bool dd_has_work_for_prio(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return !list_empty_careful(&dd->dispatch[prio]) ||
> +		!list_empty_careful(&dd->fifo_list[prio][DD_READ]) ||
> +		!list_empty_careful(&dd->fifo_list[prio][DD_WRITE]);
> +}
> +
>  static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	enum dd_prio prio;
> +
> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++)
> +		if (dd_has_work_for_prio(dd, prio))
> +			return true;
>  
> -	return !list_empty_careful(&dd->dispatch) ||
> -		!list_empty_careful(&dd->fifo_list[0]) ||
> -		!list_empty_careful(&dd->fifo_list[1]);
> +	return false;
>  }
>  
>  /*
> @@ -707,7 +801,7 @@ static struct elv_fs_entry deadline_attrs[] = {
>  };
>  
>  #ifdef CONFIG_BLK_DEBUG_FS
> -#define DEADLINE_DEBUGFS_DDIR_ATTRS(ddir, name)				\
> +#define DEADLINE_DEBUGFS_DDIR_ATTRS(prio, data_dir, name)		\
>  static void *deadline_##name##_fifo_start(struct seq_file *m,		\
>  					  loff_t *pos)			\
>  	__acquires(&dd->lock)						\
> @@ -716,7 +810,7 @@ static void *deadline_##name##_fifo_start(struct seq_file *m,		\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
>  									\
>  	spin_lock(&dd->lock);						\
> -	return seq_list_start(&dd->fifo_list[ddir], *pos);		\
> +	return seq_list_start(&dd->fifo_list[prio][data_dir], *pos);	\
>  }									\
>  									\
>  static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
> @@ -725,7 +819,7 @@ static void *deadline_##name##_fifo_next(struct seq_file *m, void *v,	\
>  	struct request_queue *q = m->private;				\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
>  									\
> -	return seq_list_next(v, &dd->fifo_list[ddir], pos);		\
> +	return seq_list_next(v, &dd->fifo_list[prio][data_dir], pos);	\
>  }									\
>  									\
>  static void deadline_##name##_fifo_stop(struct seq_file *m, void *v)	\
> @@ -742,22 +836,31 @@ static const struct seq_operations deadline_##name##_fifo_seq_ops = {	\
>  	.next	= deadline_##name##_fifo_next,				\
>  	.stop	= deadline_##name##_fifo_stop,				\
>  	.show	= blk_mq_debugfs_rq_show,				\
> -};									\
> -									\
> +};
> +
> +#define DEADLINE_DEBUGFS_NEXT_RQ(data_dir, name)			\
>  static int deadline_##name##_next_rq_show(void *data,			\
>  					  struct seq_file *m)		\
>  {									\
>  	struct request_queue *q = data;					\
>  	struct deadline_data *dd = q->elevator->elevator_data;		\
> -	struct request *rq = dd->next_rq[ddir];				\
> +	struct request *rq = dd->next_rq[data_dir];			\
>  									\
>  	if (rq)								\
>  		__blk_mq_debugfs_rq_show(m, rq);			\
>  	return 0;							\
>  }
> -DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
> -DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
> +
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_READ, read0)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_RT_PRIO, DD_WRITE, write0)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_READ, read1)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_BE_PRIO, DD_WRITE, write1)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_READ, read2)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_IDLE_PRIO, DD_WRITE, write2)
> +DEADLINE_DEBUGFS_NEXT_RQ(DD_READ, read)
> +DEADLINE_DEBUGFS_NEXT_RQ(DD_WRITE, write)
>  #undef DEADLINE_DEBUGFS_DDIR_ATTRS
> +#undef DEADLINE_DEBUGFS_NEXT_RQ
>  
>  static int deadline_batching_show(void *data, struct seq_file *m)
>  {
> @@ -786,50 +889,100 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
>  	return 0;
>  }
>  
> -static void *deadline_dispatch_start(struct seq_file *m, loff_t *pos)
> -	__acquires(&dd->lock)
> +static int dd_queued_show(void *data, struct seq_file *m)
>  {
> -	struct request_queue *q = m->private;
> +	struct request_queue *q = data;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  
> -	spin_lock(&dd->lock);
> -	return seq_list_start(&dd->dispatch, *pos);
> +	seq_printf(m, "%llu %llu %llu\n", dd_queued(dd, DD_RT_PRIO),
> +		   dd_queued(dd, DD_BE_PRIO),
> +		   dd_queued(dd, DD_IDLE_PRIO));
> +	return 0;
>  }
>  
> -static void *deadline_dispatch_next(struct seq_file *m, void *v, loff_t *pos)
> +/* Number of requests owned by the block driver for a given priority. */
> +static u64 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
>  {
> -	struct request_queue *q = m->private;
> -	struct deadline_data *dd = q->elevator->elevator_data;
> -
> -	return seq_list_next(v, &dd->dispatch, pos);
> +	return atomic_read(&dd->dispatched[prio]) +
> +		atomic_read(&dd->merged[prio]) -
> +		atomic_read(&dd->completed[prio]);
>  }
>  
> -static void deadline_dispatch_stop(struct seq_file *m, void *v)
> -	__releases(&dd->lock)
> +static int dd_owned_by_driver_show(void *data, struct seq_file *m)
>  {
> -	struct request_queue *q = m->private;
> +	struct request_queue *q = data;
>  	struct deadline_data *dd = q->elevator->elevator_data;
>  
> -	spin_unlock(&dd->lock);
> +	seq_printf(m, "%llu %llu %llu\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> +		   dd_owned_by_driver(dd, DD_BE_PRIO),
> +		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
> +	return 0;
>  }
>  
> -static const struct seq_operations deadline_dispatch_seq_ops = {
> -	.start	= deadline_dispatch_start,
> -	.next	= deadline_dispatch_next,
> -	.stop	= deadline_dispatch_stop,
> -	.show	= blk_mq_debugfs_rq_show,
> +#define DEADLINE_DISPATCH_ATTR(prio)					\
> +static void *deadline_dispatch##prio##_start(struct seq_file *m,	\
> +					     loff_t *pos)		\
> +	__acquires(&dd->lock)						\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	spin_lock(&dd->lock);						\
> +	return seq_list_start(&dd->dispatch[prio], *pos);		\
> +}									\
> +									\
> +static void *deadline_dispatch##prio##_next(struct seq_file *m,		\
> +					    void *v, loff_t *pos)	\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	return seq_list_next(v, &dd->dispatch[prio], pos);		\
> +}									\
> +									\
> +static void deadline_dispatch##prio##_stop(struct seq_file *m, void *v)	\
> +	__releases(&dd->lock)						\
> +{									\
> +	struct request_queue *q = m->private;				\
> +	struct deadline_data *dd = q->elevator->elevator_data;		\
> +									\
> +	spin_unlock(&dd->lock);						\
> +}									\
> +									\
> +static const struct seq_operations deadline_dispatch##prio##_seq_ops = { \
> +	.start	= deadline_dispatch##prio##_start,			\
> +	.next	= deadline_dispatch##prio##_next,			\
> +	.stop	= deadline_dispatch##prio##_stop,			\
> +	.show	= blk_mq_debugfs_rq_show,				\
>  };
>  
> -#define DEADLINE_QUEUE_DDIR_ATTRS(name)						\
> -	{#name "_fifo_list", 0400, .seq_ops = &deadline_##name##_fifo_seq_ops},	\
> +DEADLINE_DISPATCH_ATTR(0);
> +DEADLINE_DISPATCH_ATTR(1);
> +DEADLINE_DISPATCH_ATTR(2);
> +#undef DEADLINE_DISPATCH_ATTR
> +
> +#define DEADLINE_QUEUE_DDIR_ATTRS(name)					\
> +	{#name "_fifo_list", 0400,					\
> +			.seq_ops = &deadline_##name##_fifo_seq_ops}
> +#define DEADLINE_NEXT_RQ_ATTR(name)					\
>  	{#name "_next_rq", 0400, deadline_##name##_next_rq_show}
>  static const struct blk_mq_debugfs_attr deadline_queue_debugfs_attrs[] = {
> -	DEADLINE_QUEUE_DDIR_ATTRS(read),
> -	DEADLINE_QUEUE_DDIR_ATTRS(write),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read0),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write0),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read1),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write1),
> +	DEADLINE_QUEUE_DDIR_ATTRS(read2),
> +	DEADLINE_QUEUE_DDIR_ATTRS(write2),
> +	DEADLINE_NEXT_RQ_ATTR(read),
> +	DEADLINE_NEXT_RQ_ATTR(write),
>  	{"batching", 0400, deadline_batching_show},
>  	{"starved", 0400, deadline_starved_show},
>  	{"async_depth", 0400, dd_async_depth_show},
> -	{"dispatch", 0400, .seq_ops = &deadline_dispatch_seq_ops},
> +	{"dispatch0", 0400, .seq_ops = &deadline_dispatch0_seq_ops},
> +	{"dispatch1", 0400, .seq_ops = &deadline_dispatch1_seq_ops},
> +	{"dispatch2", 0400, .seq_ops = &deadline_dispatch2_seq_ops},
> +	{"owned_by_driver", 0400, dd_owned_by_driver_show},
> +	{"queued", 0400, dd_queued_show},
>  	{},
>  };
>  #undef DEADLINE_QUEUE_DDIR_ATTRS
> @@ -879,6 +1032,6 @@ static void __exit deadline_exit(void)
>  module_init(deadline_init);
>  module_exit(deadline_exit);
>  
> -MODULE_AUTHOR("Jens Axboe");
> +MODULE_AUTHOR("Jens Axboe, Damien Le Moal and Bart Van Assche");
>  MODULE_LICENSE("GPL");
>  MODULE_DESCRIPTION("MQ deadline IO scheduler");
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (8 preceding siblings ...)
  2021-05-27  1:01 ` [PATCH 9/9] block/mq-deadline: Add cgroup support Bart Van Assche
@ 2021-05-27  6:25 ` Wang Jianchao
  2021-05-27  8:05   ` Wang Jianchao
  2021-05-27  8:56 ` Johannes Thumshirn
  2021-05-27 17:58 ` Adam Manzanares
  11 siblings, 1 reply; 65+ messages in thread
From: Wang Jianchao @ 2021-05-27  6:25 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares



On 2021/5/27 9:01 AM, Bart Van Assche wrote:
> Hi Jens,
> 
> A feature that is missing from the Linux kernel for storage devices that
> support I/O priorities is to set the I/O priority in requests involving page
> cache writeback. Since the identity of the process that triggers page cache
> writeback is not known in the writeback code, the priority set by ioprio_set()
> is ignored. However, an I/O cgroup is associated with writeback requests
> by certain filesystems. Hence this patch series that implements the following
> changes for the mq-deadline scheduler:
Hi Bart

How about implementing this feature on top of the rq-qos framework?
Maybe it would be better to keep mq-deadline a pure io-scheduler.

Best regards
Jianchao

> * Make the I/O priority configurable per I/O cgroup.
> * Change the I/O priority of requests to the lower of (request I/O priority,
>   cgroup I/O priority).
> * Introduce one queue per I/O priority in the mq-deadline scheduler.
> 
> Please consider this patch series for kernel v5.14.
> 
> Thanks,
> 
> Bart.
> 
> Bart Van Assche (9):
>   block/mq-deadline: Add several comments
>   block/mq-deadline: Add two lockdep_assert_held() statements
>   block/mq-deadline: Remove two local variables
>   block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
>   block/mq-deadline: Improve compile-time argument checking
>   block/mq-deadline: Reduce the read expiry time for non-rotational
>     media
>   block/mq-deadline: Reserve 25% of tags for synchronous requests
>   block/mq-deadline: Add I/O priority support
>   block/mq-deadline: Add cgroup support
> 
>  block/Kconfig.iosched      |   6 +
>  block/Makefile             |   1 +
>  block/mq-deadline-cgroup.c | 206 +++++++++++++++
>  block/mq-deadline-cgroup.h |  73 ++++++
>  block/mq-deadline.c        | 524 +++++++++++++++++++++++++++++--------
>  5 files changed, 695 insertions(+), 115 deletions(-)
>  create mode 100644 block/mq-deadline-cgroup.c
>  create mode 100644 block/mq-deadline-cgroup.h
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/9] block/mq-deadline: Add several comments
  2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
  2021-05-27  3:03   ` Damien Le Moal
@ 2021-05-27  6:45   ` Hannes Reinecke
  2021-05-27 19:30     ` Bart Van Assche
  2021-05-27  8:43   ` Johannes Thumshirn
  2021-05-27 15:13   ` Himanshu Madhani
  3 siblings, 1 reply; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  6:45 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Make the code easier to read by adding more comments.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 8eea2cbf2bf4..64cabbc157ea 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -139,6 +139,9 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
>  	}
>  }
>  
> +/*
> + * Callback function that is invoked after @next has been merged into @req.
> + */
>  static void dd_merged_requests(struct request_queue *q, struct request *req,
>  			       struct request *next)
>  {
> @@ -375,6 +378,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>  }
>  
>  /*
> + * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
> + *
>   * One confusing aspect here is that we get called for a specific
>   * hardware queue, but we may return a request that is for a
>   * different hardware queue. This is because mq-deadline has shared
> @@ -438,6 +443,10 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
>  	return 0;
>  }
>  
> +/*
> + * Try to merge @bio into an existing request. If @bio has been merged into
> + * an existing request, store the pointer to that request into *@rq.
> + */
>  static int dd_request_merge(struct request_queue *q, struct request **rq,
>  			    struct bio *bio)
>  {
> @@ -461,6 +470,10 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
>  	return ELEVATOR_NO_MERGE;
>  }
>  
> +/*
> + * Attempt to merge a bio into an existing request. This function is called
> + * before @bio is associated with a request.
> + */
>  static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
>  		unsigned int nr_segs)
>  {
> @@ -518,6 +531,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>  	}
>  }
>  
> +/*
> + * Called from blk_mq_sched_insert_request() or blk_mq_sched_insert_requests().
> + */
>  static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>  			       struct list_head *list, bool at_head)
>  {
> @@ -544,6 +560,8 @@ static void dd_prepare_request(struct request *rq)
>  }
>  
>  /*
> + * Callback function called from inside blk_mq_free_request().
> + *
>   * For zoned block devices, write unlock the target zone of
>   * completed write requests. Do this while holding the zone lock
>   * spinlock so that the zone is never unlocked while deadline_fifo_request()
> 
Shouldn't these be kernel-doc comments?
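
For example, the first comment above would look roughly like this in
kernel-doc form (sketch only; the parameter descriptions are my guesses):

	/**
	 * dd_merged_requests() - callback invoked after @next has been merged into @req
	 * @q: request queue that owns both requests
	 * @req: request that @next was merged into
	 * @next: request that has been merged and is about to be freed
	 */
	static void dd_merged_requests(struct request_queue *q, struct request *req,
				       struct request *next)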

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
  2021-05-27  2:25   ` Chaitanya Kulkarni
  2021-05-27  3:09   ` Damien Le Moal
@ 2021-05-27  6:46   ` Hannes Reinecke
  2021-05-27  8:44   ` Johannes Thumshirn
  2021-05-27 15:14   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  6:46 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Document the locking strategy by adding two lockdep_assert_held()
> statements.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/9] block/mq-deadline: Remove two local variables
  2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
  2021-05-27  2:26   ` Chaitanya Kulkarni
  2021-05-27  3:11   ` Damien Le Moal
@ 2021-05-27  6:46   ` Hannes Reinecke
  2021-05-27  8:44   ` Johannes Thumshirn
  2021-05-27 15:15   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  6:46 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Make __dd_dispatch_request() easier to read by removing two local
> variables.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
  2021-05-27  2:27   ` Chaitanya Kulkarni
  2021-05-27  3:13   ` Damien Le Moal
@ 2021-05-27  6:47   ` Hannes Reinecke
  2021-05-27  8:44   ` Johannes Thumshirn
  2021-05-27 15:16   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  6:47 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Change "queue" into "sched" to make the function names reflect better the
> purpose of these functions.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
  2021-05-27  2:28   ` Chaitanya Kulkarni
  2021-05-27  3:24   ` Damien Le Moal
@ 2021-05-27  6:48   ` Hannes Reinecke
  2021-05-27  8:49   ` Johannes Thumshirn
  2021-05-27 15:19   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  6:48 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Modern compilers complain if an out-of-range value is passed to a function
> argument that has an enumeration type. Let the compiler detect out-of-range
> data direction arguments instead of verifying the data_dir argument at
> runtime.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 96 +++++++++++++++++++++++----------------------
>  1 file changed, 49 insertions(+), 47 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media
  2021-05-27  1:01 ` [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media Bart Van Assche
  2021-05-27  2:30   ` Chaitanya Kulkarni
  2021-05-27  3:27   ` Damien Le Moal
@ 2021-05-27  6:52   ` Hannes Reinecke
  2021-05-27 15:20   ` Himanshu Madhani
  3 siblings, 0 replies; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  6:52 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Rotational media (hard disks) benefit more from merging requests than
> non-rotational media. Reduce the read expire time for non-rotational media
> to reduce read latency.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
Do we get numbers for this? Or is it just code review?
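
Something like the following fio run on a non-rotational device, with and
without the patch, would do (example invocation only; the device path and job
parameters are placeholders, and the mixed read/write load is there so that
reads actually compete with writes):

	fio --name=readlat --filename=/dev/nvme0n1 --direct=1 --rw=randrw \
	    --rwmixread=70 --bs=4k --iodepth=32 --numjobs=1 --runtime=60 \
	    --time_based --ioscheduler=mq-deadline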

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests
  2021-05-27  1:01 ` [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests Bart Van Assche
  2021-05-27  3:33   ` Damien Le Moal
@ 2021-05-27  6:54   ` Hannes Reinecke
  2021-05-27 19:55     ` Bart Van Assche
  1 sibling, 1 reply; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  6:54 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> For interactive workloads it is important that synchronous requests are
> not delayed. Hence reserve 25% of tags for synchronous requests. This patch
> still allows asynchronous requests to fill the hardware queues since
> blk_mq_init_sched() makes sure that the number of scheduler requests is the
> double of the hardware queue depth. From blk_mq_init_sched():
> 
> 	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
> 				   BLKDEV_MAX_RQ);
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 46 insertions(+)
> 
I wonder if that's a good idea in general ... I'm thinking of poor SATA
drives which only have 31 tags; setting aside 8 of them for specific
use-cases does make a difference one would think.
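
A rough sketch of the arithmetic, assuming the series limits asynchronous
requests to 3/4 of the scheduler tag depth rather than carving up the device
tags themselves:

	/* SATA queue depth 31 -> nr_requests = 2 * 31 = 62 scheduler tags */
	async_depth = 3 * 62 / 4;	/* = 46 tags still usable by async I/O */

so the 31 device tags are not partitioned; only allocation of the (doubled)
scheduler tags is throttled for asynchronous requests.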

Do you have testcases for this?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 8/9] block/mq-deadline: Add I/O priority support
  2021-05-27  1:01 ` [PATCH 8/9] block/mq-deadline: Add I/O priority support Bart Van Assche
  2021-05-27  3:48   ` Damien Le Moal
@ 2021-05-27  7:07   ` Hannes Reinecke
  2021-05-27 20:23     ` Bart Van Assche
  1 sibling, 1 reply; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  7:07 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Maintain one FIFO list per I/O priority: RT, BE and IDLE. Only dispatch
> requests for a lower priority after all higher priority requests have
> finished. Maintain statistics for each priority level. Split the debugfs
> attributes per priority level as follows:
> 
> $ ls /sys/kernel/debug/block/.../sched/
> async_depth  dispatch2        read_next_rq      write2_fifo_list
> batching     read0_fifo_list  starved           write_next_rq
> dispatch0    read1_fifo_list  write0_fifo_list
> dispatch1    read2_fifo_list  write1_fifo_list
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 293 +++++++++++++++++++++++++++++++++-----------
>  1 file changed, 223 insertions(+), 70 deletions(-)
> 
Interesting question, though, wrt request merging and realtime
priority: what takes priority?
Is the realtime priority more important than request merging?

And also the ioprio document states that there are 8 levels of class
data, determining how much time each class should spend on disk access.
Have these been taken into consideration?
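
For reference, the class data is carried in the low bits of the ioprio value;
a sketch of how a level could be read out if the scheduler ever wanted to
honor it (the series as posted only looks at the class):

	unsigned short ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 4);
	int class = IOPRIO_PRIO_CLASS(ioprio);	/* IOPRIO_CLASS_BE */
	int level = IOPRIO_PRIO_DATA(ioprio);	/* 4, one of the 8 levels 0-7 */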

> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 81f487d77e09..5a703e1228fa 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -44,16 +44,27 @@ enum dd_data_dir {
>  
>  enum { DD_DIR_COUNT = 2 };
>  
> +enum dd_prio {
> +	DD_RT_PRIO	= 0,
> +	DD_BE_PRIO	= 1,
> +	DD_IDLE_PRIO	= 2,
> +	DD_PRIO_MAX	= 2,
> +} __packed;
> +
> +enum { DD_PRIO_COUNT = 3 };
> +
>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
>  
>  	/*
> -	 * requests (deadline_rq s) are present on both sort_list and fifo_list
> +	 * Requests are present on both sort_list[] and fifo_list[][]. The
> +	 * first index of fifo_list[][] is the I/O priority class (DD_*_PRIO).
> +	 * The second index is the data direction (rq_data_dir(rq)).
>  	 */
>  	struct rb_root sort_list[DD_DIR_COUNT];
> -	struct list_head fifo_list[DD_DIR_COUNT];
> +	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];
>  
>  	/*
>  	 * next in sort order. read, write or both are NULL
> @@ -62,6 +73,12 @@ struct deadline_data {
>  	unsigned int batching;		/* number of sequential requests made */
>  	unsigned int starved;		/* times reads have starved writes */
>  
> +	/* statistics */
> +	atomic_t inserted[DD_PRIO_COUNT];
> +	atomic_t dispatched[DD_PRIO_COUNT];
> +	atomic_t merged[DD_PRIO_COUNT];
> +	atomic_t completed[DD_PRIO_COUNT];
> +
>  	/*
>  	 * settings that change how the i/o scheduler behaves
>  	 */
> @@ -73,7 +90,15 @@ struct deadline_data {
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> -	struct list_head dispatch;
> +	struct list_head dispatch[DD_PRIO_COUNT];
> +};
> +
> +/* Maps an I/O priority class to a deadline scheduler priority. */
> +static const enum dd_prio ioprio_class_to_prio[] = {
> +	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
> +	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
> +	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
>  };
>  
>  static inline struct rb_root *
> @@ -149,12 +174,28 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
>  	}
>  }
>  
> +/* Returns the I/O class that has been assigned by dd_insert_request(). */
> +static u8 dd_rq_ioclass(struct request *rq)
> +{
> +	return (uintptr_t)rq->elv.priv[1];
> +}
> +
>  /*
>   * Callback function that is invoked after @next has been merged into @req.
>   */
>  static void dd_merged_requests(struct request_queue *q, struct request *req,
>  			       struct request *next)
>  {
> +	struct deadline_data *dd = q->elevator->elevator_data;
> +	const u8 ioprio_class = dd_rq_ioclass(next);
> +	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
> +
> +	if (next->elv.priv[0]) {
> +		atomic_inc(&dd->merged[prio]);
> +	} else {
> +		WARN_ON_ONCE(true);
> +	}
> +
>  	/*
>  	 * if next expires before rq, assign its expire time to rq
>  	 * and move into next position (next will be deleted) in fifo
> @@ -191,14 +232,22 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
>  	deadline_remove_request(rq->q, rq);
>  }
>  
> +/* Number of requests queued for a given priority level. */
> +static u64 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return atomic_read(&dd->inserted[prio]) -
> +		atomic_read(&dd->completed[prio]);
> +}
> +
>  /*
>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>   */
>  static inline int deadline_check_fifo(struct deadline_data *dd,
> +				      enum dd_prio prio,
>  				      enum dd_data_dir data_dir)
>  {
> -	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>  
>  	/*
>  	 * rq is expired!

I am _so_ not a fan of arrays in C.
Can't you make this an accessor and use
dd->fifo_list[prio * 2 + data_dir] ?
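
Something like this, I guess (sketch of the suggestion only, not code from the
series):

	/* Flat array instead of fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT]. */
	struct list_head fifo_list[DD_PRIO_COUNT * DD_DIR_COUNT];

	static inline struct list_head *dd_fifo_list(struct deadline_data *dd,
						     enum dd_prio prio,
						     enum dd_data_dir data_dir)
	{
		return &dd->fifo_list[prio * DD_DIR_COUNT + data_dir];
	}

	/* e.g. list_add_tail(&rq->queuelist, dd_fifo_list(dd, prio, data_dir)); */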

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 9/9] block/mq-deadline: Add cgroup support
  2021-05-27  1:01 ` [PATCH 9/9] block/mq-deadline: Add cgroup support Bart Van Assche
@ 2021-05-27  7:09   ` Hannes Reinecke
  0 siblings, 0 replies; 65+ messages in thread
From: Hannes Reinecke @ 2021-05-27  7:09 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 3:01 AM, Bart Van Assche wrote:
> Add support for configuring I/O priority per block cgroup. Assign the lowest
> of the following two priorities to a request: rq->ioprio and blkcg->ioprio.
> Maintain statistics per cgroup and make these available in sysfs.
> 
> This patch has been tested as follows:
> 
> SHELL 1
> 
> modprobe scsi_debug ndelay=1000000 max_queue=16 &&
> while [ -z "$sd" ]; do sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*); done &&
> cd /sys/fs/cgroup/blkio/ &&
> echo 2 >blkio.dd.prio &&
> mkdir -p hipri &&
> cd hipri &&
> echo 1 >blkio.dd.prio &&
> echo $$ >cgroup.procs &&
> max-iops -a1 -d32 -j1 -e mq-deadline $sd
> 
> SHELL 2
> 
> sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*) &&
> max-iops -a1 -d32 -j1 -e mq-deadline $sd
> 
> Result:
> * 12000 IOPS in shell 1
> *  2000 IOPS in shell 2
> 
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d}
> --iodepth_batch_submit=${arg_a} --iodepth_batch_complete=$((arg_d / 2))
> --name=${positional_argument_1} --filename=${positional_argument_1}
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/Kconfig.iosched      |   6 ++
>  block/Makefile             |   1 +
>  block/mq-deadline-cgroup.c | 206 +++++++++++++++++++++++++++++++++++++
>  block/mq-deadline-cgroup.h |  73 +++++++++++++
>  block/mq-deadline.c        |  96 ++++++++++++++---
>  5 files changed, 370 insertions(+), 12 deletions(-)
>  create mode 100644 block/mq-deadline-cgroup.c
>  create mode 100644 block/mq-deadline-cgroup.h
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27  6:25 ` [PATCH 0/9] Improve I/O priority support Wang Jianchao
@ 2021-05-27  8:05   ` Wang Jianchao
  2021-05-27 18:40     ` Bart Van Assche
  0 siblings, 1 reply; 65+ messages in thread
From: Wang Jianchao @ 2021-05-27  8:05 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares



On 2021/5/27 2:25 PM, Wang Jianchao wrote:
> 
> 
> On 2021/5/27 9:01 AM, Bart Van Assche wrote:
>> Hi Jens,
>>
>> A feature that is missing from the Linux kernel for storage devices that
>> support I/O priorities is to set the I/O priority in requests involving page
>> cache writeback. Since the identity of the process that triggers page cache
>> writeback is not known in the writeback code, the priority set by ioprio_set()
>> is ignored. However, an I/O cgroup is associated with writeback requests
>> by certain filesystems. Hence this patch series that implements the following
>> changes for the mq-deadline scheduler:
> Hi Bart
> 
> How about implementing this feature on top of the rq-qos framework?
> Maybe it would be better to keep mq-deadline a pure io-scheduler.

In addition, it would be more flexible to combine different io-schedulers with
rq-qos policies if we treat io priority as a new rq-qos policy ;)

> 
> Best regards
> Jianchao
> 
>> * Make the I/O priority configurable per I/O cgroup.
>> * Change the I/O priority of requests to the lower of (request I/O priority,
>>   cgroup I/O priority).
>> * Introduce one queue per I/O priority in the mq-deadline scheduler.
>>
>> Please consider this patch series for kernel v5.14.
>>
>> Thanks,
>>
>> Bart.
>>
>> Bart Van Assche (9):
>>   block/mq-deadline: Add several comments
>>   block/mq-deadline: Add two lockdep_assert_held() statements
>>   block/mq-deadline: Remove two local variables
>>   block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
>>   block/mq-deadline: Improve compile-time argument checking
>>   block/mq-deadline: Reduce the read expiry time for non-rotational
>>     media
>>   block/mq-deadline: Reserve 25% of tags for synchronous requests
>>   block/mq-deadline: Add I/O priority support
>>   block/mq-deadline: Add cgroup support
>>
>>  block/Kconfig.iosched      |   6 +
>>  block/Makefile             |   1 +
>>  block/mq-deadline-cgroup.c | 206 +++++++++++++++
>>  block/mq-deadline-cgroup.h |  73 ++++++
>>  block/mq-deadline.c        | 524 +++++++++++++++++++++++++++++--------
>>  5 files changed, 695 insertions(+), 115 deletions(-)
>>  create mode 100644 block/mq-deadline-cgroup.c
>>  create mode 100644 block/mq-deadline-cgroup.h
>>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/9] block/mq-deadline: Add several comments
  2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
  2021-05-27  3:03   ` Damien Le Moal
  2021-05-27  6:45   ` Hannes Reinecke
@ 2021-05-27  8:43   ` Johannes Thumshirn
  2021-05-27 15:13   ` Himanshu Madhani
  3 siblings, 0 replies; 65+ messages in thread
From: Johannes Thumshirn @ 2021-05-27  8:43 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
                     ` (2 preceding siblings ...)
  2021-05-27  6:46   ` Hannes Reinecke
@ 2021-05-27  8:44   ` Johannes Thumshirn
  2021-05-27 15:14   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Johannes Thumshirn @ 2021-05-27  8:44 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/9] block/mq-deadline: Remove two local variables
  2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
                     ` (2 preceding siblings ...)
  2021-05-27  6:46   ` Hannes Reinecke
@ 2021-05-27  8:44   ` Johannes Thumshirn
  2021-05-27 15:15   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Johannes Thumshirn @ 2021-05-27  8:44 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
                     ` (2 preceding siblings ...)
  2021-05-27  6:47   ` Hannes Reinecke
@ 2021-05-27  8:44   ` Johannes Thumshirn
  2021-05-27 15:16   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Johannes Thumshirn @ 2021-05-27  8:44 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
                     ` (2 preceding siblings ...)
  2021-05-27  6:48   ` Hannes Reinecke
@ 2021-05-27  8:49   ` Johannes Thumshirn
  2021-05-27 15:19   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Johannes Thumshirn @ 2021-05-27  8:49 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 27/05/2021 03:02, Bart Van Assche wrote:
> +enum dd_data_dir {
> +	DD_READ		= READ,
> +	DD_WRITE	= WRITE,
> +} __packed;


I don't think packing the enum is of any value here.
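
(For context: __packed only shrinks the enum from int size down to one byte,
which would matter if the enum were embedded in a size-sensitive structure;
minimal illustration, not kernel code:)

	enum dd_dir_packed { P_READ, P_WRITE } __attribute__((__packed__));

	_Static_assert(sizeof(enum dd_dir_packed) == 1,
		       "packed enum collapses to one byte");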

Otherwise
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (9 preceding siblings ...)
  2021-05-27  6:25 ` [PATCH 0/9] Improve I/O priority support Wang Jianchao
@ 2021-05-27  8:56 ` Johannes Thumshirn
  2021-05-27 17:23   ` Chaitanya Kulkarni
  2021-05-27 17:58 ` Adam Manzanares
  11 siblings, 1 reply; 65+ messages in thread
From: Johannes Thumshirn @ 2021-05-27  8:56 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares

On 27/05/2021 03:01, Bart Van Assche wrote:
> Hi Jens,
> 
> A feature that is missing from the Linux kernel for storage devices that
> support I/O priorities is to set the I/O priority in requests involving page
> cache writeback. Since the identity of the process that triggers page cache
> writeback is not known in the writeback code, the priority set by ioprio_set()
> is ignored. However, an I/O cgroup is associated with writeback requests
> by certain filesystems. Hence this patch series that implements the following
> changes for the mq-deadline scheduler:
> * Make the I/O priority configurable per I/O cgroup.
> * Change the I/O priority of requests to the lower of (request I/O priority,
>   cgroup I/O priority).
> * Introduce one queue per I/O priority in the mq-deadline scheduler.
> 
> Please consider this patch series for kernel v5.14.
> 
> Thanks,
> 
> Bart.
> 
> Bart Van Assche (9):
>   block/mq-deadline: Add several comments
>   block/mq-deadline: Add two lockdep_assert_held() statements
>   block/mq-deadline: Remove two local variables
>   block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
>   block/mq-deadline: Improve compile-time argument checking
>

I think the above 5 patches can go in independently as cleanups.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/9] block/mq-deadline: Add several comments
  2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
                     ` (2 preceding siblings ...)
  2021-05-27  8:43   ` Johannes Thumshirn
@ 2021-05-27 15:13   ` Himanshu Madhani
  3 siblings, 0 replies; 65+ messages in thread
From: Himanshu Madhani @ 2021-05-27 15:13 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei



On 5/26/21 8:01 PM, Bart Van Assche wrote:
> Make the code easier to read by adding more comments.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 8eea2cbf2bf4..64cabbc157ea 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -139,6 +139,9 @@ static void dd_request_merged(struct request_queue *q, struct request *req,
>   	}
>   }
>   
> +/*
> + * Callback function that is invoked after @next has been merged into @req.
> + */
>   static void dd_merged_requests(struct request_queue *q, struct request *req,
>   			       struct request *next)
>   {
> @@ -375,6 +378,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   }
>   
>   /*
> + * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
> + *
>    * One confusing aspect here is that we get called for a specific
>    * hardware queue, but we may return a request that is for a
>    * different hardware queue. This is because mq-deadline has shared
> @@ -438,6 +443,10 @@ static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
>   	return 0;
>   }
>   
> +/*
> + * Try to merge @bio into an existing request. If @bio has been merged into
> + * an existing request, store the pointer to that request into *@rq.
> + */
>   static int dd_request_merge(struct request_queue *q, struct request **rq,
>   			    struct bio *bio)
>   {
> @@ -461,6 +470,10 @@ static int dd_request_merge(struct request_queue *q, struct request **rq,
>   	return ELEVATOR_NO_MERGE;
>   }
>   
> +/*
> + * Attempt to merge a bio into an existing request. This function is called
> + * before @bio is associated with a request.
> + */
>   static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
>   		unsigned int nr_segs)
>   {
> @@ -518,6 +531,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   	}
>   }
>   
> +/*
> + * Called from blk_mq_sched_insert_request() or blk_mq_sched_insert_requests().
> + */
>   static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
>   			       struct list_head *list, bool at_head)
>   {
> @@ -544,6 +560,8 @@ static void dd_prepare_request(struct request *rq)
>   }
>   
>   /*
> + * Callback function called from inside blk_mq_free_request().
> + *
>    * For zoned block devices, write unlock the target zone of
>    * completed write requests. Do this while holding the zone lock
>    * spinlock so that the zone is never unlocked while deadline_fifo_request()
> 

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements
  2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
                     ` (3 preceding siblings ...)
  2021-05-27  8:44   ` Johannes Thumshirn
@ 2021-05-27 15:14   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Himanshu Madhani @ 2021-05-27 15:14 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei



On 5/26/21 8:01 PM, Bart Van Assche wrote:
> Document the locking strategy by adding two lockdep_assert_held()
> statements.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 64cabbc157ea..4da0bd3b9827 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -279,6 +279,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   	bool reads, writes;
>   	int data_dir;
>   
> +	lockdep_assert_held(&dd->lock);
> +
>   	if (!list_empty(&dd->dispatch)) {
>   		rq = list_first_entry(&dd->dispatch, struct request, queuelist);
>   		list_del_init(&rq->queuelist);
> @@ -501,6 +503,8 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   	struct deadline_data *dd = q->elevator->elevator_data;
>   	const int data_dir = rq_data_dir(rq);
>   
> +	lockdep_assert_held(&dd->lock);
> +
>   	/*
>   	 * This may be a requeue of a write request that has locked its
>   	 * target zone. If it is the case, this releases the zone lock.
> 

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 3/9] block/mq-deadline: Remove two local variables
  2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
                     ` (3 preceding siblings ...)
  2021-05-27  8:44   ` Johannes Thumshirn
@ 2021-05-27 15:15   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Himanshu Madhani @ 2021-05-27 15:15 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei



On 5/26/21 8:01 PM, Bart Van Assche wrote:
> Make __dd_dispatch_request() easier to read by removing two local
> variables.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 8 ++------
>   1 file changed, 2 insertions(+), 6 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 4da0bd3b9827..cc9d0fc72db2 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -276,7 +276,6 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
>   static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   {
>   	struct request *rq, *next_rq;
> -	bool reads, writes;
>   	int data_dir;
>   
>   	lockdep_assert_held(&dd->lock);
> @@ -287,9 +286,6 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   		goto done;
>   	}
>   
> -	reads = !list_empty(&dd->fifo_list[READ]);
> -	writes = !list_empty(&dd->fifo_list[WRITE]);
> -
>   	/*
>   	 * batches are currently reads XOR writes
>   	 */
> @@ -306,7 +302,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   	 * data direction (read / write)
>   	 */
>   
> -	if (reads) {
> +	if (!list_empty(&dd->fifo_list[READ])) {
>   		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
>   
>   		if (deadline_fifo_request(dd, WRITE) &&
> @@ -322,7 +318,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   	 * there are either no reads or writes have been starved
>   	 */
>   
> -	if (writes) {
> +	if (!list_empty(&dd->fifo_list[WRITE])) {
>   dispatch_writes:
>   		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
>   
> 

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
                     ` (3 preceding siblings ...)
  2021-05-27  8:44   ` Johannes Thumshirn
@ 2021-05-27 15:16   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Himanshu Madhani @ 2021-05-27 15:16 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei



On 5/26/21 8:01 PM, Bart Van Assche wrote:
> Change "queue" into "sched" to make the function names reflect better the
> purpose of these functions.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index cc9d0fc72db2..c4f51e7884fb 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -395,7 +395,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>   	return rq;
>   }
>   
> -static void dd_exit_queue(struct elevator_queue *e)
> +static void dd_exit_sched(struct elevator_queue *e)
>   {
>   	struct deadline_data *dd = e->elevator_data;
>   
> @@ -408,7 +408,7 @@ static void dd_exit_queue(struct elevator_queue *e)
>   /*
>    * initialize elevator private data (deadline_data).
>    */
> -static int dd_init_queue(struct request_queue *q, struct elevator_type *e)
> +static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>   {
>   	struct deadline_data *dd;
>   	struct elevator_queue *eq;
> @@ -800,8 +800,8 @@ static struct elevator_type mq_deadline = {
>   		.requests_merged	= dd_merged_requests,
>   		.request_merged		= dd_request_merged,
>   		.has_work		= dd_has_work,
> -		.init_sched		= dd_init_queue,
> -		.exit_sched		= dd_exit_queue,
> +		.init_sched		= dd_init_sched,
> +		.exit_sched		= dd_exit_sched,
>   	},
>   
>   #ifdef CONFIG_BLK_DEBUG_FS
> 

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
                     ` (3 preceding siblings ...)
  2021-05-27  8:49   ` Johannes Thumshirn
@ 2021-05-27 15:19   ` Himanshu Madhani
  4 siblings, 0 replies; 65+ messages in thread
From: Himanshu Madhani @ 2021-05-27 15:19 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei



On 5/26/21 8:01 PM, Bart Van Assche wrote:
> Modern compilers complain if an out-of-range value is passed to a function
> argument that has an enumeration type. Let the compiler detect out-of-range
> data direction arguments instead of verifying the data_dir argument at
> runtime.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 96 +++++++++++++++++++++++----------------------
>   1 file changed, 49 insertions(+), 47 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index c4f51e7884fb..dfbc6b77fa71 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -35,6 +35,13 @@ static const int writes_starved = 2;    /* max times reads can starve a write */
>   static const int fifo_batch = 16;       /* # of sequential requests treated as one
>   				     by the above parameters. For throughput. */
>   
> +enum dd_data_dir {
> +	DD_READ		= READ,
> +	DD_WRITE	= WRITE,
> +} __packed;
> +
> +enum { DD_DIR_COUNT = 2 };
> +
>   struct deadline_data {
>   	/*
>   	 * run time data
> @@ -43,20 +50,20 @@ struct deadline_data {
>   	/*
>   	 * requests (deadline_rq s) are present on both sort_list and fifo_list
>   	 */
> -	struct rb_root sort_list[2];
> -	struct list_head fifo_list[2];
> +	struct rb_root sort_list[DD_DIR_COUNT];
> +	struct list_head fifo_list[DD_DIR_COUNT];
>   
>   	/*
>   	 * next in sort order. read, write or both are NULL
>   	 */
> -	struct request *next_rq[2];
> +	struct request *next_rq[DD_DIR_COUNT];
>   	unsigned int batching;		/* number of sequential requests made */
>   	unsigned int starved;		/* times reads have starved writes */
>   
>   	/*
>   	 * settings that change how the i/o scheduler behaves
>   	 */
> -	int fifo_expire[2];
> +	int fifo_expire[DD_DIR_COUNT];
>   	int fifo_batch;
>   	int writes_starved;
>   	int front_merges;
> @@ -97,7 +104,7 @@ deadline_add_rq_rb(struct deadline_data *dd, struct request *rq)
>   static inline void
>   deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
>   {
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>   
>   	if (dd->next_rq[data_dir] == rq)
>   		dd->next_rq[data_dir] = deadline_latter_request(rq);
> @@ -169,10 +176,10 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
>   static void
>   deadline_move_request(struct deadline_data *dd, struct request *rq)
>   {
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>   
> -	dd->next_rq[READ] = NULL;
> -	dd->next_rq[WRITE] = NULL;
> +	dd->next_rq[DD_READ] = NULL;
> +	dd->next_rq[DD_WRITE] = NULL;
>   	dd->next_rq[data_dir] = deadline_latter_request(rq);
>   
>   	/*
> @@ -185,9 +192,10 @@ deadline_move_request(struct deadline_data *dd, struct request *rq)
>    * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>    * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>    */
> -static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
> +static inline int deadline_check_fifo(struct deadline_data *dd,
> +				      enum dd_data_dir data_dir)
>   {
> -	struct request *rq = rq_entry_fifo(dd->fifo_list[ddir].next);
> +	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
>   
>   	/*
>   	 * rq is expired!
> @@ -203,19 +211,16 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
>    * dispatch using arrival ordered lists.
>    */
>   static struct request *
> -deadline_fifo_request(struct deadline_data *dd, int data_dir)
> +deadline_fifo_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   {
>   	struct request *rq;
>   	unsigned long flags;
>   
> -	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
> -		return NULL;
> -
>   	if (list_empty(&dd->fifo_list[data_dir]))
>   		return NULL;
>   
>   	rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
> -	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>   		return rq;
>   
>   	/*
> @@ -223,7 +228,7 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
>   	 * an unlocked target zone.
>   	 */
>   	spin_lock_irqsave(&dd->zone_lock, flags);
> -	list_for_each_entry(rq, &dd->fifo_list[WRITE], queuelist) {
> +	list_for_each_entry(rq, &dd->fifo_list[DD_WRITE], queuelist) {
>   		if (blk_req_can_dispatch_to_zone(rq))
>   			goto out;
>   	}
> @@ -239,19 +244,16 @@ deadline_fifo_request(struct deadline_data *dd, int data_dir)
>    * dispatch using sector position sorted lists.
>    */
>   static struct request *
> -deadline_next_request(struct deadline_data *dd, int data_dir)
> +deadline_next_request(struct deadline_data *dd, enum dd_data_dir data_dir)
>   {
>   	struct request *rq;
>   	unsigned long flags;
>   
> -	if (WARN_ON_ONCE(data_dir != READ && data_dir != WRITE))
> -		return NULL;
> -
>   	rq = dd->next_rq[data_dir];
>   	if (!rq)
>   		return NULL;
>   
> -	if (data_dir == READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
>   		return rq;
>   
>   	/*
> @@ -276,7 +278,7 @@ deadline_next_request(struct deadline_data *dd, int data_dir)
>   static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   {
>   	struct request *rq, *next_rq;
> -	int data_dir;
> +	enum dd_data_dir data_dir;
>   
>   	lockdep_assert_held(&dd->lock);
>   
> @@ -289,9 +291,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   	/*
>   	 * batches are currently reads XOR writes
>   	 */
> -	rq = deadline_next_request(dd, WRITE);
> +	rq = deadline_next_request(dd, DD_WRITE);
>   	if (!rq)
> -		rq = deadline_next_request(dd, READ);
> +		rq = deadline_next_request(dd, DD_READ);
>   
>   	if (rq && dd->batching < dd->fifo_batch)
>   		/* we have a next request are still entitled to batch */
> @@ -302,14 +304,14 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   	 * data direction (read / write)
>   	 */
>   
> -	if (!list_empty(&dd->fifo_list[READ])) {
> -		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
> +	if (!list_empty(&dd->fifo_list[DD_READ])) {
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_READ]));
>   
> -		if (deadline_fifo_request(dd, WRITE) &&
> +		if (deadline_fifo_request(dd, DD_WRITE) &&
>   		    (dd->starved++ >= dd->writes_starved))
>   			goto dispatch_writes;
>   
> -		data_dir = READ;
> +		data_dir = DD_READ;
>   
>   		goto dispatch_find_request;
>   	}
> @@ -318,13 +320,13 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd)
>   	 * there are either no reads or writes have been starved
>   	 */
>   
> -	if (!list_empty(&dd->fifo_list[WRITE])) {
> +	if (!list_empty(&dd->fifo_list[DD_WRITE])) {
>   dispatch_writes:
> -		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
> +		BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[DD_WRITE]));
>   
>   		dd->starved = 0;
>   
> -		data_dir = WRITE;
> +		data_dir = DD_WRITE;
>   
>   		goto dispatch_find_request;
>   	}
> @@ -399,8 +401,8 @@ static void dd_exit_sched(struct elevator_queue *e)
>   {
>   	struct deadline_data *dd = e->elevator_data;
>   
> -	BUG_ON(!list_empty(&dd->fifo_list[READ]));
> -	BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
> +	BUG_ON(!list_empty(&dd->fifo_list[DD_READ]));
> +	BUG_ON(!list_empty(&dd->fifo_list[DD_WRITE]));
>   
>   	kfree(dd);
>   }
> @@ -424,12 +426,12 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>   	}
>   	eq->elevator_data = dd;
>   
> -	INIT_LIST_HEAD(&dd->fifo_list[READ]);
> -	INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
> -	dd->sort_list[READ] = RB_ROOT;
> -	dd->sort_list[WRITE] = RB_ROOT;
> -	dd->fifo_expire[READ] = read_expire;
> -	dd->fifo_expire[WRITE] = write_expire;
> +	INIT_LIST_HEAD(&dd->fifo_list[DD_READ]);
> +	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
> +	dd->sort_list[DD_READ] = RB_ROOT;
> +	dd->sort_list[DD_WRITE] = RB_ROOT;
> +	dd->fifo_expire[DD_READ] = read_expire;
> +	dd->fifo_expire[DD_WRITE] = write_expire;
>   	dd->writes_starved = writes_starved;
>   	dd->front_merges = 1;
>   	dd->fifo_batch = fifo_batch;
> @@ -497,7 +499,7 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   {
>   	struct request_queue *q = hctx->queue;
>   	struct deadline_data *dd = q->elevator->elevator_data;
> -	const int data_dir = rq_data_dir(rq);
> +	const enum dd_data_dir data_dir = rq_data_dir(rq);
>   
>   	lockdep_assert_held(&dd->lock);
>   
> @@ -585,7 +587,7 @@ static void dd_finish_request(struct request *rq)
>   
>   		spin_lock_irqsave(&dd->zone_lock, flags);
>   		blk_req_zone_write_unlock(rq);
> -		if (!list_empty(&dd->fifo_list[WRITE]))
> +		if (!list_empty(&dd->fifo_list[DD_WRITE]))
>   			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>   		spin_unlock_irqrestore(&dd->zone_lock, flags);
>   	}
> @@ -626,8 +628,8 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>   		__data = jiffies_to_msecs(__data);			\
>   	return deadline_var_show(__data, (page));			\
>   }
> -SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
> -SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
> +SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[DD_READ], 1);
> +SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[DD_WRITE], 1);
>   SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
>   SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
>   SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
> @@ -649,8 +651,8 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
>   		*(__PTR) = __data;					\
>   	return count;							\
>   }
> -STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
> -STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX, 1);
> +STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX, 1);
>   STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
>   STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
>   STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
> @@ -717,8 +719,8 @@ static int deadline_##name##_next_rq_show(void *data,			\
>   		__blk_mq_debugfs_rq_show(m, rq);			\
>   	return 0;							\
>   }
> -DEADLINE_DEBUGFS_DDIR_ATTRS(READ, read)
> -DEADLINE_DEBUGFS_DDIR_ATTRS(WRITE, write)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_READ, read)
> +DEADLINE_DEBUGFS_DDIR_ATTRS(DD_WRITE, write)
>   #undef DEADLINE_DEBUGFS_DDIR_ATTRS
>   
>   static int deadline_batching_show(void *data, struct seq_file *m)
> 

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media
  2021-05-27  1:01 ` [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media Bart Van Assche
                     ` (2 preceding siblings ...)
  2021-05-27  6:52   ` Hannes Reinecke
@ 2021-05-27 15:20   ` Himanshu Madhani
  3 siblings, 0 replies; 65+ messages in thread
From: Himanshu Madhani @ 2021-05-27 15:20 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei



On 5/26/21 8:01 PM, Bart Van Assche wrote:
> Rotational media (hard disks) benefit more from merging requests than
> non-rotational media. Reduce the read expire time for non-rotational media
> to reduce read latency.
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index dfbc6b77fa71..2ab844a4b6b5 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -29,7 +29,9 @@
>   /*
>    * See Documentation/block/deadline-iosched.rst
>    */
> -static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
> +/* max time before a read is submitted. */
> +static const int read_expire_rot = HZ / 2;
> +static const int read_expire_nonrot = 1;
>   static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
>   static const int writes_starved = 2;    /* max times reads can starve a write */
>   static const int fifo_batch = 16;       /* # of sequential requests treated as one
> @@ -430,7 +432,8 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>   	INIT_LIST_HEAD(&dd->fifo_list[DD_WRITE]);
>   	dd->sort_list[DD_READ] = RB_ROOT;
>   	dd->sort_list[DD_WRITE] = RB_ROOT;
> -	dd->fifo_expire[DD_READ] = read_expire;
> +	dd->fifo_expire[DD_READ] = blk_queue_nonrot(q) ? read_expire_nonrot :
> +		read_expire_rot;
>   	dd->fifo_expire[DD_WRITE] = write_expire;
>   	dd->writes_starved = writes_starved;
>   	dd->front_merges = 1;
> 

Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>

-- 
Himanshu Madhani                                Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27  8:56 ` Johannes Thumshirn
@ 2021-05-27 17:23   ` Chaitanya Kulkarni
  2021-05-27 19:00     ` Bart Van Assche
  0 siblings, 1 reply; 65+ messages in thread
From: Chaitanya Kulkarni @ 2021-05-27 17:23 UTC (permalink / raw)
  To: Johannes Thumshirn, Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares

On 5/27/21 01:56, Johannes Thumshirn wrote:
> On 27/05/2021 03:01, Bart Van Assche wrote:
>> Hi Jens,
>>
>> A feature that is missing from the Linux kernel for storage devices that
>> support I/O priorities is to set the I/O priority in requests involving page
>> cache writeback. Since the identity of the process that triggers page cache
>> writeback is not known in the writeback code, the priority set by ioprio_set()
>> is ignored. However, an I/O cgroup is associated with writeback requests
>> by certain filesystems. Hence this patch series that implements the following
>> changes for the mq-deadline scheduler:
>> * Make the I/O priority configurable per I/O cgroup.
>> * Change the I/O priority of requests to the lower of (request I/O priority,
>>   cgroup I/O priority).
>> * Introduce one queue per I/O priority in the mq-deadline scheduler.
>>
>> Please consider this patch series for kernel v5.14.
>>
>> Thanks,
>>
>> Bart.
>>
>> Bart Van Assche (9):
>>   block/mq-deadline: Add several comments
>>   block/mq-deadline: Add two lockdep_assert_held() statements
>>   block/mq-deadline: Remove two local variables
>>   block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
>>   block/mq-deadline: Improve compile-time argument checking
>>
> I think the above 5 patches can go in independently as cleanups.
>

Yes, they seem to be purely cleanups from a review point of view.



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
                   ` (10 preceding siblings ...)
  2021-05-27  8:56 ` Johannes Thumshirn
@ 2021-05-27 17:58 ` Adam Manzanares
  2021-05-27 18:53   ` Bart Van Assche
  11 siblings, 1 reply; 65+ messages in thread
From: Adam Manzanares @ 2021-05-27 17:58 UTC (permalink / raw)
  To: axboe, bvanassche; +Cc: hch, jaegeuk, linux-block

On Wed, 2021-05-26 at 18:01 -0700, Bart Van Assche wrote:
> Hi Jens,
> 
> A feature that is missing from the Linux kernel for storage devices
> that
> support I/O priorities is to set the I/O priority in requests
> involving page
> cache writeback. 

Hello Bart,

When I worked in this area, the goal was to control tail latencies for a
prioritized app. Once the page cache is involved, app control over IO is
handed off, which suggests that the priorities passed down to the device
aren't as meaningful anymore.

Does passing the priority to the device make an impact on performance
in your test case? If not, could BFQ be seen as a viable alternative?

Take care,
Adam

> Since the identity of the process that triggers page cache
> writeback is not known in the writeback code, the priority set by
> ioprio_set()
> is ignored. However, an I/O cgroup is associated with writeback
> requests
> by certain filesystems. Hence this patch series that implements the
> following
> changes for the mq-deadline scheduler:
> * Make the I/O priority configurable per I/O cgroup.
> * Change the I/O priority of requests to the lower of (request I/O
> priority,
>   cgroup I/O priority).
> * Introduce one queue per I/O priority in the mq-deadline scheduler.
> 
> Please consider this patch series for kernel v5.14.
> 
> Thanks,
> 
> Bart.
> 
> Bart Van Assche (9):
>   block/mq-deadline: Add several comments
>   block/mq-deadline: Add two lockdep_assert_held() statements
>   block/mq-deadline: Remove two local variables
>   block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
>   block/mq-deadline: Improve compile-time argument checking
>   block/mq-deadline: Reduce the read expiry time for non-rotational
>     media
>   block/mq-deadline: Reserve 25% of tags for synchronous requests
>   block/mq-deadline: Add I/O priority support
>   block/mq-deadline: Add cgroup support
> 
>  block/Kconfig.iosched      |   6 +
>  block/Makefile             |   1 +
>  block/mq-deadline-cgroup.c | 206 +++++++++++++++
>  block/mq-deadline-cgroup.h |  73 ++++++
>  block/mq-deadline.c        | 524 +++++++++++++++++++++++++++++------
> --
>  5 files changed, 695 insertions(+), 115 deletions(-)
>  create mode 100644 block/mq-deadline-cgroup.c
>  create mode 100644 block/mq-deadline-cgroup.h
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27  8:05   ` Wang Jianchao
@ 2021-05-27 18:40     ` Bart Van Assche
  2021-05-28  2:05       ` Wang Jianchao
  0 siblings, 1 reply; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 18:40 UTC (permalink / raw)
  To: Wang Jianchao, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares

On 5/27/21 1:05 AM, Wang Jianchao wrote:
> On 2021/5/27 2:25 PM, Wang Jianchao wrote:
>> On 2021/5/27 9:01 AM, Bart Van Assche wrote:
>>> A feature that is missing from the Linux kernel for storage devices that
>>> support I/O priorities is to set the I/O priority in requests involving page
>>> cache writeback. Since the identity of the process that triggers page cache
>>> writeback is not known in the writeback code, the priority set by ioprio_set()
>>> is ignored. However, an I/O cgroup is associated with writeback requests
>>> by certain filesystems. Hence this patch series that implements the following
>>> changes for the mq-deadline scheduler:
>>
>> How about implement this feature based on the rq-qos framework ?
>> Maybe it is better to make mq-deadline a pure io-scheduler.
> 
> In addition, it is more flexible to combine different io-scheduler and rq-qos policy
> if we take io priority as a new policy ;)

Hi Jianchao,

That's an interesting question. I'm in favor of flexibility. However,
I'm not sure whether it would be possible to combine an rq-qos policy
that submits high priority requests first with an I/O scheduler that
ignores I/O priorities. Since a typical I/O scheduler reorders requests,
such a combination would lead to requests being submitted in the wrong
order to storage devices. In other words, when using an I/O scheduler,
proper support for I/O priority in the I/O scheduler is essential. The
BFQ I/O scheduler is still maturing. The purpose of the Kyber I/O
scheduler is to control latency by reducing the queue depth without
differentiating between requests of the same type. The mq-deadline
scheduler is already being used in combination with storage devices that
support I/O priorities in their controller. Hence the choice for the
mq-deadline scheduler.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27 17:58 ` Adam Manzanares
@ 2021-05-27 18:53   ` Bart Van Assche
  2021-05-27 22:45     ` Adam Manzanares
  0 siblings, 1 reply; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 18:53 UTC (permalink / raw)
  To: Adam Manzanares, axboe; +Cc: hch, jaegeuk, linux-block

On 5/27/21 10:58 AM, Adam Manzanares wrote:
> On Wed, 2021-05-26 at 18:01 -0700, Bart Van Assche wrote:
>> A feature that is missing from the Linux kernel for storage
>> devices that support I/O priorities is to set the I/O priority in
>> requests involving page cache writeback.
> 
> When I worked in this area the goal was to control tail latencies for
> a prioritized app. Once the page cache is involved app control over
> IO is handed off suggesting that the priorities passed down to the
> device aren't as meaningful anymore.
> 
> Is passing the priority to the device making an impact to
> performance with your test case? If not, could BFQ be seen as a
> viable alternative.

Hi Adam,

As we all know, complexity is the enemy of reliability. BFQ is
complicated, so I am hesitant to use BFQ in a context where reliability
is important. Additionally, the BFQ scheduler uses heuristics to detect
the application type. As became clear recently, these heuristics can be
misled easily and fixing this is nontrivial (see also
https://lore.kernel.org/linux-block/20210521131034.GL18952@quack2.suse.cz/).
I want the application launcher to create one cgroup for foreground
applications and another cgroup for background applications. Configuring
different I/O priorities per cgroup is an easy and reliable approach to
communicate information about the application type and latency
expectations to the block layer.

Some database applications use buffered I/O and flush that data by
calling fsync(). We want to support such applications. So it seems like
we have a different opinion about whether or not an I/O priority should
be assigned to I/O that is the result of page cache writeback?

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27 17:23   ` Chaitanya Kulkarni
@ 2021-05-27 19:00     ` Bart Van Assche
  0 siblings, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 19:00 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Johannes Thumshirn, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares

On 5/27/21 10:23 AM, Chaitanya Kulkarni wrote:
> On 5/27/21 01:56, Johannes Thumshirn wrote:
>> On 27/05/2021 03:01, Bart Van Assche wrote:
>>> Bart Van Assche (9):
>>>   block/mq-deadline: Add several comments
>>>   block/mq-deadline: Add two lockdep_assert_held() statements
>>>   block/mq-deadline: Remove two local variables
>>>   block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
>>>   block/mq-deadline: Improve compile-time argument checking
>>>
>> I think the above 5 patches can go in independently as cleanups.
> 
> Yes they seems purely cleanup from review point of view.

That's right. This patch series has been organized such that the cleanup
patches occur first.

I have posted all 9 patches as a series to make it easier to review and
apply all these patches.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/9] block/mq-deadline: Add several comments
  2021-05-27  6:45   ` Hannes Reinecke
@ 2021-05-27 19:30     ` Bart Van Assche
  0 siblings, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 19:30 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/26/21 11:45 PM, Hannes Reinecke wrote:
> On 5/27/21 3:01 AM, Bart Van Assche wrote:
>>  /*
>> + * Callback function called from inside blk_mq_free_request().
>> + *
>>   * For zoned block devices, write unlock the target zone of
>>   * completed write requests. Do this while holding the zone lock
>>   * spinlock so that the zone is never unlocked while deadline_fifo_request()
>>
> Shouldn't these be kernel-doc comments?

Hi Hannes,

When using the kernel-doc format it is required to document the meaning
of all function arguments. It seems to me that it is easy to guess what
the meaning of the function arguments is in the deadline scheduler?
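
For comparison, the kernel-doc form of the comment quoted above would look
roughly like this (a sketch; the @rq description is illustrative and assumes
the comment belongs to dd_finish_request()):

/**
 * dd_finish_request - callback called from inside blk_mq_free_request()
 * @rq: the request that is being freed
 *
 * For zoned block devices, write unlock the target zone of completed
 * write requests.
 */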

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue()
  2021-05-27  3:13   ` Damien Le Moal
@ 2021-05-27 19:33     ` Bart Van Assche
  0 siblings, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 19:33 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 5/26/21 8:13 PM, Damien Le Moal wrote:
> On 2021/05/27 10:01, Bart Van Assche wrote:
>> Change "queue" into "sched" to make the function names reflect better the
>> purpose of these functions.
> 
> Nit: may be change this to:
> 
> Change "queue" into "sched" to make the function names match the elevator type
> operation names.

Hi Damien,

The functions dd_init_sched() and dd_exit_sched() initialize a scheduler
data structure. To me the names dd_init_queue() / dd_exit_queue()
suggest that something in the request queue is initialized, which is not
the case. Is this sufficient as clarification, or do you still want me to
change the patch description?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27  3:24   ` Damien Le Moal
@ 2021-05-27 19:38     ` Bart Van Assche
  2021-05-28  1:42       ` Damien Le Moal
  0 siblings, 1 reply; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 19:38 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 5/26/21 8:24 PM, Damien Le Moal wrote:
> On 2021/05/27 10:02, Bart Van Assche wrote:
>> +enum dd_data_dir {
>> +	DD_READ		= READ,
>> +	DD_WRITE	= WRITE,
>> +} __packed;
> 
> Why the "__packed" here ?

I agree that using __packed here does not have any advantage. I will
leave it out.

>> +
>> +enum { DD_DIR_COUNT = 2 };
> 
> Why not a simple #define for this one ?

I only use the C preprocessor if there is no other solution. As far as I
know there are no rules in the kernel coding style guidelines with
regard to whether to use a #define or an enum to define a constant?
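
For illustration, the two alternatives being discussed define the same
constant; only the mechanism differs (a sketch):

/* Either one defines the integer constant 2. */
enum { DD_DIR_COUNT = 2 };	/* enum constant: a real symbol, visible to the compiler and debugger */
/* #define DD_DIR_COUNT 2 */	/* macro: plain textual substitution by the preprocessor */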

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media
  2021-05-27  3:27   ` Damien Le Moal
@ 2021-05-27 19:43     ` Bart Van Assche
  0 siblings, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 19:43 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Ming Lei, Hannes Reinecke

On 5/26/21 8:27 PM, Damien Le Moal wrote:
> On 2021/05/27 10:02, Bart Van Assche wrote:
>> Rotational media (hard disks) benefit more from merging requests than
>> non-rotational media. Reduce the read expire time for non-rotational media
> 
> "Reduce the read expire time for non-rotational..." -> "Reduce the default read
> expire time for non-rotational..."
> 
>> to reduce read latency.
> 
> I am not against this, but I wonder if we should not let udev do that with some
> rules instead of adding totally arbitrary values here...

Hi Damien and Hannes,

I agree that it's better to configure the expiry time via udev, so I'm
considering leaving this patch out.
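
As an illustration of that approach, a udev rule along the following lines
could lower the mq-deadline read expiry on non-rotational devices (a sketch;
the 1 ms value is arbitrary and the iosched attribute path assumes that
mq-deadline is the active scheduler):

ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/rotational}=="0", \
	ATTR{queue/iosched/read_expire}="1"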

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests
  2021-05-27  6:54   ` Hannes Reinecke
@ 2021-05-27 19:55     ` Bart Van Assche
  0 siblings, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 19:55 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/26/21 11:54 PM, Hannes Reinecke wrote:
> On 5/27/21 3:01 AM, Bart Van Assche wrote:
>> For interactive workloads it is important that synchronous requests are
>> not delayed. Hence reserve 25% of tags for synchronous requests. This patch
>> still allows asynchronous requests to fill the hardware queues since
>> blk_mq_init_sched() makes sure that the number of scheduler requests is the
>> double of the hardware queue depth. From blk_mq_init_sched():
>>
>> 	q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
>> 				   BLKDEV_MAX_RQ);
>
> I wonder if that's a good idea in general ... I'm thinking of poor SATA
> drives which only have 31 tags; setting aside 8 of them for specific
> use-cases does make a difference one would think.
> 
> Do you have testcases for this?

Hi Hannes,

The mq-deadline scheduler is the only scheduler that does not yet set
aside tags for synchronous requests. BFQ and Kyber both implement the
.limit_depth I/O scheduler callback function.

Yes, I have a test case for this, namely SCSI UFS devices. The UFS
device in my test setup supports a single I/O queue and limits the
number of outstanding SCSI commands to 32 (UFSHCD_CAN_QUEUE).

Please note that the limit mentioned above is still above the number of
controller tags. For e.g. UFS q->tag_set->queue_depth == 32 and
attaching an I/O scheduler increases nr_requests from 32 to 64.
Reserving 25% of tags for synchronous requests leaves 3 * 64 / 4 = 48
scheduler tags for asynchronous requests.
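
For readers unfamiliar with the mechanism, a .limit_depth callback along the
lines of what BFQ and Kyber implement looks roughly like this (a sketch, not
necessarily the exact code in patch 7/9):

/* Cap the number of scheduler tags that asynchronous requests may use. */
static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
{
	struct deadline_data *dd = data->q->elevator->elevator_data;

	/* Synchronous requests may use the full scheduler tag depth. */
	if (op_is_sync(op))
		return;

	/* Asynchronous requests are limited to async_depth scheduler tags. */
	data->shallow_depth = dd->async_depth;
}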

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests
  2021-05-27  3:33   ` Damien Le Moal
@ 2021-05-27 20:00     ` Bart Van Assche
  0 siblings, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 20:00 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 5/26/21 8:33 PM, Damien Le Moal wrote:
> On 2021/05/27 10:02, Bart Van Assche wrote:
>> For interactive workloads it is important that synchronous requests are
>> not delayed. Hence reserve 25% of tags for synchronous requests. This patch
> 
> s/tags/scheduler tags
> 
> to be clear that we are not talking about the device tags. Same in the patch
> title may be.

OK.

>> +static void dd_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
> 
> Similarly as you did in patch 1, may be add a comment about this operation and
> when it is called ?

Will do.

>> +static void dd_depth_updated(struct blk_mq_hw_ctx *hctx)
>> +{
>> +	struct request_queue *q = hctx->queue;
>> +	struct deadline_data *dd = q->elevator->elevator_data;
>> +	struct blk_mq_tags *tags = hctx->sched_tags;
>> +
>> +	dd->async_depth = 3 * q->nr_requests / 4;
> 
> I think that nr_requests is always at least 2, but it may be good to have a
> sanity check here that we do not end up with async_depth == 0, no ?

OK, I will add a check.
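
A minimal form of that check could look like this (a sketch of one possible
implementation):

	/* Ensure async_depth never ends up as zero. */
	dd->async_depth = max_t(unsigned int, 1, 3 * q->nr_requests / 4);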

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 8/9] block/mq-deadline: Add I/O priority support
  2021-05-27  3:48   ` Damien Le Moal
@ 2021-05-27 20:12     ` Bart Van Assche
  0 siblings, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 20:12 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 5/26/21 8:48 PM, Damien Le Moal wrote:
> On 2021/05/27 10:02, Bart Van Assche wrote:
>> +	if (next->elv.priv[0]) {
>> +		atomic_inc(&dd->merged[prio]);
>> +	} else {
>> +		WARN_ON_ONCE(true);
>> +	}
> 
> I do not think you need the curly braces here.

These are present because patch 9/9 adds more code in this if-statement
and to improve readability of patch 9/9.

>> @@ -392,9 +453,14 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>>  {
>>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>>  	struct request *rq;
>> +	enum dd_prio prio;
>>  
>>  	spin_lock(&dd->lock);
>> -	rq = __dd_dispatch_request(dd);
>> +	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
>> +		rq = __dd_dispatch_request(dd, prio);
>> +		if (rq || dd_queued(dd, prio))
>> +			break;
>> +	}
>>  	spin_unlock(&dd->lock);
> 
> Unless I missed something, this means that the aging (fifo list expire)
> mechanism is per prio list but there is no aging between the prio classes. This
> means that an application doing lots of RT direct IOs will completely starve
> other prio classes and hang the applications using them.
> 
> I think we need aging between the prio classes too to avoid that.

OK, I will add an aging mechanism.

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 8/9] block/mq-deadline: Add I/O priority support
  2021-05-27  7:07   ` Hannes Reinecke
@ 2021-05-27 20:23     ` Bart Van Assche
  2021-05-28  1:40       ` Damien Le Moal
  0 siblings, 1 reply; 65+ messages in thread
From: Bart Van Assche @ 2021-05-27 20:23 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Damien Le Moal, Ming Lei

On 5/27/21 12:07 AM, Hannes Reinecke wrote:
> On 5/27/21 3:01 AM, Bart Van Assche wrote:
>> Maintain one FIFO list per I/O priority: RT, BE and IDLE. Only dispatch
>> requests for a lower priority after all higher priority requests have
>> finished. Maintain statistics for each priority level. Split the debugfs
>> attributes per priority level as follows:
>>
>> $ ls /sys/kernel/debug/block/.../sched/
>> async_depth  dispatch2        read_next_rq      write2_fifo_list
>> batching     read0_fifo_list  starved           write_next_rq
>> dispatch0    read1_fifo_list  write0_fifo_list
>> dispatch1    read2_fifo_list  write1_fifo_list
>
> Interesting question, though, wrt to request merging and realtime
> priority: what takes priority?
> Is the realtime priority more important than request merging?

We plan to use two I/O priorities: one for foreground applications and
one for background applications. We want to lower the application
startup time if background I/O is ongoing. The code that associates
different cgroups with foreground and background applications is already
available. We care more about foreground I/O being prioritized over
background I/O than about foreground I/O being real-time.

> And also the ioprio document states that there are 8 levels of class
> data, determining how much time each class should spend on disk access.
> Have these been taken into consideration?

My understanding is that the ioprio document was written before any I/O
controllers implemented I/O priorities. I'm not sure whether I/O
controllers will ever implement more than two I/O priorities. See also
commit 8e061784b51e ("ata: Enabling ATA Command Priorities"). A paper
about the benefits of I/O controllers supporting I/O priorities is
available at
https://www.usenix.org/system/files/conference/hotstorage17/hotstorage17-paper-manzanares.pdf.

>>  /*
>>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>>   */
>>  static inline int deadline_check_fifo(struct deadline_data *dd,
>> +				      enum dd_prio prio,
>>  				      enum dd_data_dir data_dir)
>>  {
>> -	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
>> +	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>>  
>>  	/*
>>  	 * rq is expired!
> 
> I am _so_ not a fan of arrays in C.
> Can't you make this an accessor and use
> dd->fifo_list[prio * 2 + data_dir] ?

That's possible, but if the compiler can translate [prio][data_dir] into
[prio * 2 + data_dir], should I really do this myself instead of letting
the compiler generate that transformation?
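
To make the comparison concrete, here is a sketch of both forms
(DD_PRIO_COUNT is an assumed name for the number of priority classes, not
necessarily the identifier used in the patch):

	/* Two-dimensional indexing as used in the patch: */
	struct list_head fifo_list[DD_PRIO_COUNT][DD_DIR_COUNT];

	rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);

	/*
	 * Manually flattened form suggested in the review; the compiler
	 * generates the same address arithmetic for both:
	 */
	struct list_head fifo_list[DD_PRIO_COUNT * DD_DIR_COUNT];

	rq = rq_entry_fifo(dd->fifo_list[prio * DD_DIR_COUNT + data_dir].next);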

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27 18:53   ` Bart Van Assche
@ 2021-05-27 22:45     ` Adam Manzanares
  0 siblings, 0 replies; 65+ messages in thread
From: Adam Manzanares @ 2021-05-27 22:45 UTC (permalink / raw)
  To: axboe, bvanassche; +Cc: hch, jaegeuk, linux-block

On Thu, 2021-05-27 at 11:53 -0700, Bart Van Assche wrote:
> On 5/27/21 10:58 AM, Adam Manzanares wrote:
> > On Wed, 2021-05-26 at 18:01 -0700, Bart Van Assche wrote:
> > > A feature that is missing from the Linux kernel for storage
> > > devices that support I/O priorities is to set the I/O priority in
> > > requests involving page cache writeback.
> > 
> > When I worked in this area the goal was to control tail latencies
> > for
> > a prioritized app. Once the page cache is involved app control over
> > IO is handed off suggesting that the priorities passed down to the
> > device aren't as meaningful anymore.
> > 
> > Is passing the priority to the device making an impact to
> > performance with your test case? If not, could BFQ be seen as a
> > viable alternative.
> 
> Hi Adam,
> 
> As we all know complexity is the enemy of reliability. BFQ is
> complicated so I am hesitant to use BFQ in a context where
> reliability
> is important. Additionally, the BFQ scheduler uses heuristics to
> detect
> the application type. As became clear recently, there heuristics can
> be
> misled easily and fixing this is nontrivial (see also
> https://lore.kernel.org/linux-block/20210521131034.GL18952@quack2.suse.cz/
> ).
> I want the application launcher to create one cgroup for foreground
> applications and another cgroup for background applications.
> Configuring
> different I/O priorities per cgroup is an easy and reliable approach
> to
> communicate information about the application type and latency
> expectations to the block layer.

Ok, now I know why you are staying away from BFQ. Thanks for sharing
this information.

> 
> Some database applications use buffered I/O and flush that data by
> calling fsync(). We want to support such applications. So it seems
> like
> we have a different opinion about whether or not an I/O priority
> should
> be assigned to I/O that is the result of page cache writeback?

Not necessarily, I am just a bit cautious because in my experience
prioritized I/O has benefits as well as limitations. You just have to
make sure that the page cache writeback does not push the hardware to a
point where the I/O prioritization is no longer beneficial.  

> Thanks,
> 
> Bart.
> 


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 8/9] block/mq-deadline: Add I/O priority support
  2021-05-27 20:23     ` Bart Van Assche
@ 2021-05-28  1:40       ` Damien Le Moal
  0 siblings, 0 replies; 65+ messages in thread
From: Damien Le Moal @ 2021-05-28  1:40 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/28 5:23, Bart Van Assche wrote:
> On 5/27/21 12:07 AM, Hannes Reinecke wrote:
>> On 5/27/21 3:01 AM, Bart Van Assche wrote:
>>> Maintain one FIFO list per I/O priority: RT, BE and IDLE. Only dispatch
>>> requests for a lower priority after all higher priority requests have
>>> finished. Maintain statistics for each priority level. Split the debugfs
>>> attributes per priority level as follows:
>>>
>>> $ ls /sys/kernel/debug/block/.../sched/
>>> async_depth  dispatch2        read_next_rq      write2_fifo_list
>>> batching     read0_fifo_list  starved           write_next_rq
>>> dispatch0    read1_fifo_list  write0_fifo_list
>>> dispatch1    read2_fifo_list  write1_fifo_list
>>
>> Interesting question, though, wrt to request merging and realtime
>> priority: what takes priority?
>> Is the realtime priority more important than request merging?
> 
> We plan to use two I/O priorities: one for foreground applications and
> one for background applications. We want to lower the application
> startup time if background I/O is ongoing. The code that associates
> different cgroups with foreground and background applications is already
> available. We care more about foreground I/O being prioritized over
> background I/O than about foreground I/O being real-time.
> 
>> And also the ioprio document states that there are 8 levels of class
>> data, determining how much time each class should spend on disk access.
>> Have these been taken into consideration?
> 
> My understanding is that the ioprio document was written before any I/O
> controllers implemented I/O priorities. I'm not sure whether I/O
> controllers will ever implement more than two I/O priorities. See also
> commit 8e061784b51e ("ata: Enabling ATA Command Priorities"). A paper
> about the benefits of I/O controllers supporting I/O priorities is
> available at
> https://www.usenix.org/system/files/conference/hotstorage17/hotstorage17-paper-manzanares.pdf.

ATA NCQ has only 2 bits, defining 3 different priority levels, one of which is
not really a priority but rather a latency limitation. So yes, basically, ATA
supports only high-priority and no-priority: 2 levels only, with all prio levels
of the RT class mapping to the device's high-priority level and everything else
mapping to the device's no-priority level.
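
Paraphrased, that mapping looks roughly like this in the ATA taskfile setup
path (a sketch based on commit 8e061784b51e; the ATA_PRIO_HIGH / ATA_SHIFT_PRIO
names and the hob_nsect field are given from memory and should be treated as
illustrative):

	/* Only requests in the RT priority class map to the NCQ high-priority bit. */
	if (IOPRIO_PRIO_CLASS(ioprio) == IOPRIO_CLASS_RT)
		tf->hob_nsect |= ATA_PRIO_HIGH << ATA_SHIFT_PRIO;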

That said, Command Duration Limits has been standardized and I was planning to
introduce support for it using a new priority class IOPRIO_CLASS_DURATION with
the priority level indicating the duration limit to apply to the IOs. In this
case, differentiation between the different priority levels of that class may be
necessary at the scheduler level. I will probably revisit this when I send
command duration limit support.

> 
>>>  /*
>>>   * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>>>   * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
>>>   */
>>>  static inline int deadline_check_fifo(struct deadline_data *dd,
>>> +				      enum dd_prio prio,
>>>  				      enum dd_data_dir data_dir)
>>>  {
>>> -	struct request *rq = rq_entry_fifo(dd->fifo_list[data_dir].next);
>>> +	struct request *rq = rq_entry_fifo(dd->fifo_list[prio][data_dir].next);
>>>  
>>>  	/*
>>>  	 * rq is expired!
>>
>> I am _so_ not a fan of arrays in C.
>> Can't you make this an accessor and use
>> dd->fifo_list[prio * 2 + data_dir] ?
> 
> That's possible, but if the compiler can translate [prio][data_dir] into
> [prio * 2 + data_dir], should I really do this myself instead of letting
> the compiler generate that transformation?
> 
> Thanks,
> 
> Bart.
> 


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking
  2021-05-27 19:38     ` Bart Van Assche
@ 2021-05-28  1:42       ` Damien Le Moal
  0 siblings, 0 replies; 65+ messages in thread
From: Damien Le Moal @ 2021-05-28  1:42 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares, Ming Lei

On 2021/05/28 4:38, Bart Van Assche wrote:
> On 5/26/21 8:24 PM, Damien Le Moal wrote:
>> On 2021/05/27 10:02, Bart Van Assche wrote:
>>> +enum dd_data_dir {
>>> +	DD_READ		= READ,
>>> +	DD_WRITE	= WRITE,
>>> +} __packed;
>>
>> Why the "__packed" here ?
> 
> I agree that using __packed here does not have any advantage. I will
> leave it out.
> 
>>> +
>>> +enum { DD_DIR_COUNT = 2 };
>>
>> Why not a simple #define for this one ?
> 
> I only use the C preprocessor if there is no other solution. As far as I
> know there are no rules in the kernel coding style guidelines with
> regard whether to use a #define or an enum to define a constant?

It is fine as is. I just find it a little too verbose compared to a simple macro
definition. Just a matter of taste I guess :)



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-27 18:40     ` Bart Van Assche
@ 2021-05-28  2:05       ` Wang Jianchao
  2021-05-28  8:43         ` Paolo Valente
  2021-05-28 16:28         ` Bart Van Assche
  0 siblings, 2 replies; 65+ messages in thread
From: Wang Jianchao @ 2021-05-28  2:05 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares



On 2021/5/28 2:40 AM, Bart Van Assche wrote:
> On 5/27/21 1:05 AM, Wang Jianchao wrote:
>> On 2021/5/27 2:25 PM, Wang Jianchao wrote:
>>> On 2021/5/27 9:01 AM, Bart Van Assche wrote:
>>>> A feature that is missing from the Linux kernel for storage devices that
>>>> support I/O priorities is to set the I/O priority in requests involving page
>>>> cache writeback. Since the identity of the process that triggers page cache
>>>> writeback is not known in the writeback code, the priority set by ioprio_set()
>>>> is ignored. However, an I/O cgroup is associated with writeback requests
>>>> by certain filesystems. Hence this patch series that implements the following
>>>> changes for the mq-deadline scheduler:
>>>
>>> How about implement this feature based on the rq-qos framework ?
>>> Maybe it is better to make mq-deadline a pure io-scheduler.
>>
>> In addition, it is more flexible to combine different io-scheduler and rq-qos policy
>> if we take io priority as a new policy ;)
> 
> Hi Jianchao,
> 
> That's an interesting question. I'm in favor of flexibility. However,
> I'm not sure whether it would be possible to combine an rq-qos policy
> that submits high priority requests first with an I/O scheduler that
> ignores I/O priorities. Since a typical I/O scheduler reorders requests,
> such a combination would lead to requests being submitted in the wrong
> order to storage devices. In other words, when using an I/O scheduler,
> proper support for I/O priority in the I/O scheduler is essential. The

Hi Bart,

Does it really matter that we issue IO requests in the order of io priority?

Given a device with a 32-deep queue that doesn't support io priority, even if
we issue the requests in the order of io priority, will the fw of the device
handle them in the same order? In other words, we always issue 32 io requests
to the device at a time, and then the fw of the device decides how to handle
them. The order of dispatching requests from the queue may only matter when
the device's queue is full and we have a deeper queue in the io scheduler. At
that point, maybe the user needs to check why their application saturates the
block device.

As for the qos policy of io priority, it seems similar to wbt, in which read,
sync write and background write have different priorities. Since we always
want the io with higher priority to be served more by the device, adapting the
queue depth per io priority may work ;)

Best regards
Jianchao

> BFQ I/O scheduler is still maturing. The purpose of the Kyber I/O
> scheduler is to control latency by reducing the queue depth without
> differentiating between requests of the same type. The mq-deadline
> scheduler is already being used in combination with storage devices that
> support I/O priorities in their controller. Hence the choice for the
> mq-deadline scheduler.
> 
> Thanks,
> 
> Bart.
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-28  2:05       ` Wang Jianchao
@ 2021-05-28  8:43         ` Paolo Valente
  2021-05-28 16:28         ` Bart Van Assche
  1 sibling, 0 replies; 65+ messages in thread
From: Paolo Valente @ 2021-05-28  8:43 UTC (permalink / raw)
  To: Wang Jianchao
  Cc: Bart Van Assche, Jens Axboe, linux-block, Christoph Hellwig,
	Jaegeuk Kim, Adam Manzanares



> Il giorno 28 mag 2021, alle ore 04:05, Wang Jianchao <jianchao.wan9@gmail.com> ha scritto:
> 
> 
> 
> On 2021/5/28 2:40 AM, Bart Van Assche wrote:
>> On 5/27/21 1:05 AM, Wang Jianchao wrote:
>>> On 2021/5/27 2:25 PM, Wang Jianchao wrote:
>>>> On 2021/5/27 9:01 AM, Bart Van Assche wrote:
>>>>> A feature that is missing from the Linux kernel for storage devices that
>>>>> support I/O priorities is to set the I/O priority in requests involving page
>>>>> cache writeback. Since the identity of the process that triggers page cache
>>>>> writeback is not known in the writeback code, the priority set by ioprio_set()
>>>>> is ignored. However, an I/O cgroup is associated with writeback requests
>>>>> by certain filesystems. Hence this patch series that implements the following
>>>>> changes for the mq-deadline scheduler:
>>>> 
>>>> How about implement this feature based on the rq-qos framework ?
>>>> Maybe it is better to make mq-deadline a pure io-scheduler.
>>> 
>>> In addition, it is more flexible to combine different io-scheduler and rq-qos policy
>>> if we take io priority as a new policy ;)
>> 
>> Hi Jianchao,
>> 
>> That's an interesting question. I'm in favor of flexibility. However,
>> I'm not sure whether it would be possible to combine an rq-qos policy
>> that submits high priority requests first with an I/O scheduler that
>> ignores I/O priorities. Since a typical I/O scheduler reorders requests,
>> such a combination would lead to requests being submitted in the wrong
>> order to storage devices. In other words, when using an I/O scheduler,
>> proper support for I/O priority in the I/O scheduler is essential. The
> 
> Hi Bart,
> 
> Does it really matter that issue IO request by the order of io priority ?
> 
> Given a device with a 32 depth queue and doesn't support io priority, even if
> we issue the request by the order of io priority, will the fw of device handle
> them by the same order ? Or in the other word, we always issue 32 io requests
> to device one time and then the fw of device decides how to handle them.
> The order of dispatching request from queue may only work when the device's
> queue is full and we have a deeper queue in io scheduler. And at this moment,
> maybe the user needs to check why their application saturates the block device.
> 
> As for the qos policy of io priority, it seems similar with wbt in which read,
> sync write and background write have different priority. Since we always want
> the io with higher priority to be served more by the device, adapting the depth
> of queue of different io priority may work ;)
> 

That's exactly what BFQ does, as this is the only option to truly
control I/O in a device with internal queueing :)

Thanks,
Paolo

> Best regards
> Jianchao
> 
>> BFQ I/O scheduler is still maturing. The purpose of the Kyber I/O
>> scheduler is to control latency by reducing the queue depth without
>> differentiating between requests of the same type. The mq-deadline
>> scheduler is already being used in combination with storage devices that
>> support I/O priorities in their controller. Hence the choice for the
>> mq-deadline scheduler.
>> 
>> Thanks,
>> 
>> Bart.


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 0/9] Improve I/O priority support
  2021-05-28  2:05       ` Wang Jianchao
  2021-05-28  8:43         ` Paolo Valente
@ 2021-05-28 16:28         ` Bart Van Assche
  1 sibling, 0 replies; 65+ messages in thread
From: Bart Van Assche @ 2021-05-28 16:28 UTC (permalink / raw)
  To: Wang Jianchao, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Adam Manzanares,
	Paolo Valente

On 5/27/21 7:05 PM, Wang Jianchao wrote:
> Does it really matter that issue IO request by the order of io priority ?
> 
> Given a device with a 32 depth queue and doesn't support io priority, even if
> we issue the request by the order of io priority, will the fw of device handle
> them by the same order ? Or in the other word, we always issue 32 io requests
> to device one time and then the fw of device decides how to handle them.
> The order of dispatching request from queue may only work when the device's
> queue is full and we have a deeper queue in io scheduler. And at this moment,
> maybe the user needs to check why their application saturates the block device.
> 
> As for the qos policy of io priority, it seems similar with wbt in which read,
> sync write and background write have different priority. Since we always want
> the io with higher priority to be served more by the device, adapting the depth
> of queue of different io priority may work ;)

Hi Jianchao,

Our conclusion from the extensive measurements we ran is that we cannot
reach our latency goals for high-priority I/O without I/O priority
support in the storage device. This is why we are working with device
vendors on implementing I/O priority support in the storage devices that
matter to us.

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2021-05-28 16:28 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-27  1:01 [PATCH 0/9] Improve I/O priority support Bart Van Assche
2021-05-27  1:01 ` [PATCH 1/9] block/mq-deadline: Add several comments Bart Van Assche
2021-05-27  3:03   ` Damien Le Moal
2021-05-27  6:45   ` Hannes Reinecke
2021-05-27 19:30     ` Bart Van Assche
2021-05-27  8:43   ` Johannes Thumshirn
2021-05-27 15:13   ` Himanshu Madhani
2021-05-27  1:01 ` [PATCH 2/9] block/mq-deadline: Add two lockdep_assert_held() statements Bart Van Assche
2021-05-27  2:25   ` Chaitanya Kulkarni
2021-05-27  3:09   ` Damien Le Moal
2021-05-27  6:46   ` Hannes Reinecke
2021-05-27  8:44   ` Johannes Thumshirn
2021-05-27 15:14   ` Himanshu Madhani
2021-05-27  1:01 ` [PATCH 3/9] block/mq-deadline: Remove two local variables Bart Van Assche
2021-05-27  2:26   ` Chaitanya Kulkarni
2021-05-27  3:11   ` Damien Le Moal
2021-05-27  6:46   ` Hannes Reinecke
2021-05-27  8:44   ` Johannes Thumshirn
2021-05-27 15:15   ` Himanshu Madhani
2021-05-27  1:01 ` [PATCH 4/9] block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() Bart Van Assche
2021-05-27  2:27   ` Chaitanya Kulkarni
2021-05-27  3:13   ` Damien Le Moal
2021-05-27 19:33     ` Bart Van Assche
2021-05-27  6:47   ` Hannes Reinecke
2021-05-27  8:44   ` Johannes Thumshirn
2021-05-27 15:16   ` Himanshu Madhani
2021-05-27  1:01 ` [PATCH 5/9] block/mq-deadline: Improve compile-time argument checking Bart Van Assche
2021-05-27  2:28   ` Chaitanya Kulkarni
2021-05-27  3:24   ` Damien Le Moal
2021-05-27 19:38     ` Bart Van Assche
2021-05-28  1:42       ` Damien Le Moal
2021-05-27  6:48   ` Hannes Reinecke
2021-05-27  8:49   ` Johannes Thumshirn
2021-05-27 15:19   ` Himanshu Madhani
2021-05-27  1:01 ` [PATCH 6/9] block/mq-deadline: Reduce the read expiry time for non-rotational media Bart Van Assche
2021-05-27  2:30   ` Chaitanya Kulkarni
2021-05-27  3:27   ` Damien Le Moal
2021-05-27 19:43     ` Bart Van Assche
2021-05-27  6:52   ` Hannes Reinecke
2021-05-27 15:20   ` Himanshu Madhani
2021-05-27  1:01 ` [PATCH 7/9] block/mq-deadline: Reserve 25% of tags for synchronous requests Bart Van Assche
2021-05-27  3:33   ` Damien Le Moal
2021-05-27 20:00     ` Bart Van Assche
2021-05-27  6:54   ` Hannes Reinecke
2021-05-27 19:55     ` Bart Van Assche
2021-05-27  1:01 ` [PATCH 8/9] block/mq-deadline: Add I/O priority support Bart Van Assche
2021-05-27  3:48   ` Damien Le Moal
2021-05-27 20:12     ` Bart Van Assche
2021-05-27  7:07   ` Hannes Reinecke
2021-05-27 20:23     ` Bart Van Assche
2021-05-28  1:40       ` Damien Le Moal
2021-05-27  1:01 ` [PATCH 9/9] block/mq-deadline: Add cgroup support Bart Van Assche
2021-05-27  7:09   ` Hannes Reinecke
2021-05-27  6:25 ` [PATCH 0/9] Improve I/O priority support Wang Jianchao
2021-05-27  8:05   ` Wang Jianchao
2021-05-27 18:40     ` Bart Van Assche
2021-05-28  2:05       ` Wang Jianchao
2021-05-28  8:43         ` Paolo Valente
2021-05-28 16:28         ` Bart Van Assche
2021-05-27  8:56 ` Johannes Thumshirn
2021-05-27 17:23   ` Chaitanya Kulkarni
2021-05-27 19:00     ` Bart Van Assche
2021-05-27 17:58 ` Adam Manzanares
2021-05-27 18:53   ` Bart Van Assche
2021-05-27 22:45     ` Adam Manzanares
