linux-block.vger.kernel.org archive mirror
* [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler
@ 2021-09-27 22:03 Bart Van Assche
  2021-09-27 22:03 ` [PATCH v2 1/4] block/mq-deadline: Improve request accounting further Bart Van Assche
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Bart Van Assche @ 2021-09-27 22:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche

Hi Jens,

This patch series includes the following changes compared to the v5.14-rc7
implementation:
- Fix request accounting by counting requeued requests once.
- Test correctness of the request accounting code by triggering a kernel
  warning from inside dd_exit_sched() if an inconsistency has been detected.
- Switch from per-CPU counters to individual counters.
- Restore I/O priority support in the mq-deadline scheduler. The performance
  measurements in the description of patch 4/4 show that the performance
  regressions in the previous version of this patch have been fixed. This has
  been achieved by using 'jiffies' instead of ktime_get() and also by skipping
  the aging mechanism if all queued requests have the same I/O priority.

Please consider this patch series for kernel v5.16.

Thanks,

Bart.

Changes compared to v1:
- Renamed 'aging_expire' to 'prio_aging_expire'.
- Renamed dd_dispatch_aged_requests() to dd_dispatch_prio_aged_requests().
- Adjusted a source code comment.

Bart Van Assche (4):
  block/mq-deadline: Improve request accounting further
  block/mq-deadline: Add an invariant check
  block/mq-deadline: Stop using per-CPU counters
  block/mq-deadline: Prioritize high-priority requests

 block/mq-deadline.c | 220 ++++++++++++++++++++++++++++----------------
 1 file changed, 143 insertions(+), 77 deletions(-)



* [PATCH v2 1/4] block/mq-deadline: Improve request accounting further
  2021-09-27 22:03 [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
@ 2021-09-27 22:03 ` Bart Van Assche
  2021-09-28  5:45   ` Hannes Reinecke
  2021-09-27 22:03 ` [PATCH v2 2/4] block/mq-deadline: Add an invariant check Bart Van Assche
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2021-09-27 22:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

The scheduler .insert_requests() callback is called when a request is
queued for the first time and also when it is requeued. Only count a
request the first time it is queued. Additionally, since the mq-deadline
scheduler only performs zone locking for requests that have been
inserted, skip the zone unlock code for requests that have not been
inserted into the mq-deadline scheduler.
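
As an illustration of the accounting bug, consider this hypothetical
sequence of events for a single request (a sketch; the exact requeue path
depends on the driver):

  dd_insert_request(rq)   /* inserted++ -> inserted == 1 */
  /* driver runs out of resources; the block layer requeues rq */
  dd_insert_request(rq)   /* inserted++ -> inserted == 2 (double count) */
  dd_finish_request(rq)   /* completed++ -> completed == 1 */

The difference inserted - completed is now 1 although no request is queued
anymore, which is exactly the inconsistency that the invariant check added
by patch 2/4 in this series reports.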

Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 47f042fa6a68..c27b4347ca91 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_req_zone_write_unlock(rq);
 
 	prio = ioprio_class_to_prio[ioprio_class];
-	dd_count(dd, inserted, prio);
-	rq->elv.priv[0] = (void *)(uintptr_t)1;
+	if (!rq->elv.priv[0]) {
+		dd_count(dd, inserted, prio);
+		rq->elv.priv[0] = (void *)(uintptr_t)1;
+	}
 
 	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
 		blk_mq_free_requests(&free);
@@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
 
 	/*
 	 * The block layer core may call dd_finish_request() without having
-	 * called dd_insert_requests(). Hence only update statistics for
-	 * requests for which dd_insert_requests() has been called. See also
-	 * blk_mq_request_bypass_insert().
+	 * called dd_insert_requests(). Skip requests that bypassed I/O
+	 * scheduling. See also blk_mq_request_bypass_insert().
 	 */
-	if (rq->elv.priv[0])
-		dd_count(dd, completed, prio);
+	if (!rq->elv.priv[0])
+		return;
+
+	dd_count(dd, completed, prio);
 
 	if (blk_queue_is_zoned(q)) {
 		unsigned long flags;


* [PATCH v2 2/4] block/mq-deadline: Add an invariant check
  2021-09-27 22:03 [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
  2021-09-27 22:03 ` [PATCH v2 1/4] block/mq-deadline: Improve request accounting further Bart Van Assche
@ 2021-09-27 22:03 ` Bart Van Assche
  2021-09-28  5:47   ` Hannes Reinecke
  2021-09-27 22:03 ` [PATCH v2 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2021-09-27 22:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

Check a statistics invariant at module unload time. When running
blktests, the invariant is verified every time a request queue is
removed and hence is verified at least once per test.
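
The invariant itself: once a request queue has been drained, every inserted
request must also have completed. Restated per priority level as an
equation (this mirrors the dd_queued() check below rather than adding new
code):

  dd_queued(dd, prio) = inserted - completed = 0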

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index c27b4347ca91..2586b3f8c7e9 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	deadline_remove_request(rq->q, per_prio, rq);
 }
 
+/* Number of requests queued for a given priority level. */
+static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
+{
+	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+}
+
 /*
  * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
  * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
@@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
 
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
+		WARN_ONCE(dd_queued(dd, prio) != 0,
+			  "statistics for priority %d: i %u m %u d %u c %u\n",
+			  prio, dd_sum(dd, inserted, prio),
+			  dd_sum(dd, merged, prio),
+			  dd_sum(dd, dispatched, prio),
+			  dd_sum(dd, completed, prio));
 	}
 
 	free_percpu(dd->stats);
@@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
 	return 0;
 }
 
-/* Number of requests queued for a given priority level. */
-static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
-{
-	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
-}
-
 static int dd_queued_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;


* [PATCH v2 3/4] block/mq-deadline: Stop using per-CPU counters
  2021-09-27 22:03 [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
  2021-09-27 22:03 ` [PATCH v2 1/4] block/mq-deadline: Improve request accounting further Bart Van Assche
  2021-09-27 22:03 ` [PATCH v2 2/4] block/mq-deadline: Add an invariant check Bart Van Assche
@ 2021-09-27 22:03 ` Bart Van Assche
  2021-09-28  5:51   ` Hannes Reinecke
  2021-09-27 22:03 ` [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
  2021-09-27 23:14 ` [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Jens Axboe
  4 siblings, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2021-09-27 22:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

Calculating the sum over all CPUs of per-CPU counters frequently is
inefficient. Hence switch from per-CPU to individual counters. Three
counters are protected by the mq-deadline spinlock since these are
only accessed from contexts that already hold that spinlock. The fourth
counter is atomic because protecting it with the mq-deadline spinlock
would trigger lock contention.
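
To illustrate the difference in the read path, a statistics read before and
after this patch looks roughly as follows (a sketch derived from the code
below, not a literal excerpt):

  /* Before: every statistics read loops over all CPUs. */
  unsigned int cpu;
  u32 sum = 0;

  for_each_present_cpu(cpu)
          sum += local_read(&per_cpu_ptr(dd->stats, cpu)->stats[prio].inserted);

  /* After: a single load, serialized by the scheduler lock. */
  lockdep_assert_held(&dd->lock);
  queued = dd->per_prio[prio].stats.inserted -
           atomic_read(&dd->per_prio[prio].stats.completed);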

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
 1 file changed, 56 insertions(+), 68 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 2586b3f8c7e9..b262f40f32c0 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -51,17 +51,16 @@ enum dd_prio {
 
 enum { DD_PRIO_COUNT = 3 };
 
-/* I/O statistics per I/O priority. */
+/*
+ * I/O statistics per I/O priority. It is fine if these counters overflow.
+ * What matters is that these counters are at least as wide as
+ * log2(max_outstanding_requests).
+ */
 struct io_stats_per_prio {
-	local_t inserted;
-	local_t merged;
-	local_t dispatched;
-	local_t completed;
-};
-
-/* I/O statistics for all I/O priorities (enum dd_prio). */
-struct io_stats {
-	struct io_stats_per_prio stats[DD_PRIO_COUNT];
+	uint32_t inserted;
+	uint32_t merged;
+	uint32_t dispatched;
+	atomic_t completed;
 };
 
 /*
@@ -74,6 +73,7 @@ struct dd_per_prio {
 	struct list_head fifo_list[DD_DIR_COUNT];
 	/* Next request in FIFO order. Read, write or both are NULL. */
 	struct request *next_rq[DD_DIR_COUNT];
+	struct io_stats_per_prio stats;
 };
 
 struct deadline_data {
@@ -88,8 +88,6 @@ struct deadline_data {
 	unsigned int batching;		/* number of sequential requests made */
 	unsigned int starved;		/* times reads have starved writes */
 
-	struct io_stats __percpu *stats;
-
 	/*
 	 * settings that change how the i/o scheduler behaves
 	 */
@@ -103,33 +101,6 @@ struct deadline_data {
 	spinlock_t zone_lock;
 };
 
-/* Count one event of type 'event_type' and with I/O priority 'prio' */
-#define dd_count(dd, event_type, prio) do {				\
-	struct io_stats *io_stats = get_cpu_ptr((dd)->stats);		\
-									\
-	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
-	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
-	local_inc(&io_stats->stats[(prio)].event_type);			\
-	put_cpu_ptr(io_stats);						\
-} while (0)
-
-/*
- * Returns the total number of dd_count(dd, event_type, prio) calls across all
- * CPUs. No locking or barriers since it is fine if the returned sum is slightly
- * outdated.
- */
-#define dd_sum(dd, event_type, prio) ({					\
-	unsigned int cpu;						\
-	u32 sum = 0;							\
-									\
-	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
-	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
-	for_each_present_cpu(cpu)					\
-		sum += local_read(&per_cpu_ptr((dd)->stats, cpu)->	\
-				  stats[(prio)].event_type);		\
-	sum;								\
-})
-
 /* Maps an I/O priority class to a deadline scheduler priority. */
 static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
@@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
 	const u8 ioprio_class = dd_rq_ioclass(next);
 	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
 
-	dd_count(dd, merged, prio);
+	lockdep_assert_held(&dd->lock);
+
+	dd->per_prio[prio].stats.merged++;
 
 	/*
 	 * if next expires before rq, assign its expire time to rq
@@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 /* Number of requests queued for a given priority level. */
 static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
 {
-	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
+
+	lockdep_assert_held(&dd->lock);
+
+	return stats->inserted - atomic_read(&stats->completed);
 }
 
 /*
@@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 done:
 	ioprio_class = dd_rq_ioclass(rq);
 	prio = ioprio_class_to_prio[ioprio_class];
-	dd_count(dd, dispatched, prio);
+	dd->per_prio[prio].stats.dispatched++;
 	/*
 	 * If the request needs its target zone locked, do it.
 	 */
@@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
 
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
+		const struct io_stats_per_prio *stats = &per_prio->stats;
+		uint32_t queued;
 
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
 		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
-		WARN_ONCE(dd_queued(dd, prio) != 0,
+
+		spin_lock(&dd->lock);
+		queued = dd_queued(dd, prio);
+		spin_unlock(&dd->lock);
+
+		WARN_ONCE(queued != 0,
 			  "statistics for priority %d: i %u m %u d %u c %u\n",
-			  prio, dd_sum(dd, inserted, prio),
-			  dd_sum(dd, merged, prio),
-			  dd_sum(dd, dispatched, prio),
-			  dd_sum(dd, completed, prio));
+			  prio, stats->inserted, stats->merged,
+			  stats->dispatched, atomic_read(&stats->completed));
 	}
 
-	free_percpu(dd->stats);
-
 	kfree(dd);
 }
 
@@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	eq->elevator_data = dd;
 
-	dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
-				     GFP_KERNEL | __GFP_ZERO);
-	if (!dd->stats)
-		goto free_dd;
-
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	q->elevator = eq;
 	return 0;
 
-free_dd:
-	kfree(dd);
-
 put_eq:
 	kobject_put(&eq->kobj);
 	return ret;
@@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_req_zone_write_unlock(rq);
 
 	prio = ioprio_class_to_prio[ioprio_class];
+	per_prio = &dd->per_prio[prio];
 	if (!rq->elv.priv[0]) {
-		dd_count(dd, inserted, prio);
+		per_prio->stats.inserted++;
 		rq->elv.priv[0] = (void *)(uintptr_t)1;
 	}
 
@@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 
 	trace_block_rq_insert(rq);
 
-	per_prio = &dd->per_prio[prio];
 	if (at_head) {
 		list_add(&rq->queuelist, &per_prio->dispatch);
 	} else {
@@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
 	if (!rq->elv.priv[0])
 		return;
 
-	dd_count(dd, completed, prio);
+	atomic_inc(&per_prio->stats.completed);
 
 	if (blk_queue_is_zoned(q)) {
 		unsigned long flags;
@@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
 	struct deadline_data *dd = q->elevator->elevator_data;
+	u32 rt, be, idle;
+
+	spin_lock(&dd->lock);
+	rt = dd_queued(dd, DD_RT_PRIO);
+	be = dd_queued(dd, DD_BE_PRIO);
+	idle = dd_queued(dd, DD_IDLE_PRIO);
+	spin_unlock(&dd->lock);
+
+	seq_printf(m, "%u %u %u\n", rt, be, idle);
 
-	seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
-		   dd_queued(dd, DD_BE_PRIO),
-		   dd_queued(dd, DD_IDLE_PRIO));
 	return 0;
 }
 
 /* Number of requests owned by the block driver for a given priority. */
 static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
 {
-	return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
-		- dd_sum(dd, completed, prio);
+	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
+
+	lockdep_assert_held(&dd->lock);
+
+	return stats->dispatched + stats->merged -
+		atomic_read(&stats->completed);
 }
 
 static int dd_owned_by_driver_show(void *data, struct seq_file *m)
 {
 	struct request_queue *q = data;
 	struct deadline_data *dd = q->elevator->elevator_data;
+	u32 rt, be, idle;
+
+	spin_lock(&dd->lock);
+	rt = dd_owned_by_driver(dd, DD_RT_PRIO);
+	be = dd_owned_by_driver(dd, DD_BE_PRIO);
+	idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
+	spin_unlock(&dd->lock);
+
+	seq_printf(m, "%u %u %u\n", rt, be, idle);
 
-	seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
-		   dd_owned_by_driver(dd, DD_BE_PRIO),
-		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
 	return 0;
 }
 


* [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2021-09-27 22:03 [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
                   ` (2 preceding siblings ...)
  2021-09-27 22:03 ` [PATCH v2 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
@ 2021-09-27 22:03 ` Bart Van Assche
  2021-09-27 22:53   ` Damien Le Moal
                     ` (3 more replies)
  2021-09-27 23:14 ` [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Jens Axboe
  4 siblings, 4 replies; 21+ messages in thread
From: Bart Van Assche @ 2021-09-27 22:03 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
Prioritize high-priority requests""), this patch uses 'jiffies' instead
of ktime_get() in the code for aging lower priority requests.
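
The aging check only needs jiffies granularity, so it can use the standard
time_after() helper instead of calling ktime_get() in the dispatch path.
In sketch form (simplified from the started_after() helper added below):

  unsigned long latest_start = jiffies - dd->prio_aging_expire;
  unsigned long start_time = (unsigned long)rq->fifo_time -
                             dd->fifo_expire[rq_data_dir(rq)];

  /* Skip rq if it was inserted after @latest_start. */
  if (time_after(start_time, latest_start))
          return NULL;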

This patch has been tested as follows:

Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: 555 K IOPS.

Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: about 380 K IOPS.

Ran the following script:

set -e
scriptdir=$(dirname "$0")
if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
modprobe scsi_debug ndelay=1000000 max_queue=16
sd=''
while [ -z "$sd" ]; do
  sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done
echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire"
if [ -e /sys/fs/cgroup/io.prio.class ]; then
  cd /sys/fs/cgroup
  echo restrict-to-be >io.prio.class
  echo +io > cgroup.subtree_control
else
  cd /sys/fs/cgroup/blkio/
  echo restrict-to-be >blkio.prio.class
fi
echo $$ >cgroup.procs
mkdir -p hipri
cd hipri
if [ -e io.prio.class ]; then
  echo none-to-rt >io.prio.class
else
  echo none-to-rt >blkio.prio.class
fi
{ "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
echo $$ >cgroup.procs
"${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt

Result:
* 11000 IOPS for the high-priority job
*    40 IOPS for the low-priority job

If the prio aging expiry time is changed from 100s to 0, the IOPS results
change to 6712 and 6796 IOPS.

The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/mq-deadline.c | 77 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b262f40f32c0..bb723478baf1 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -31,6 +31,11 @@
  */
 static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
 static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+/*
+ * Time after which to dispatch lower priority requests even if higher
+ * priority requests are pending.
+ */
+static const int prio_aging_expire = 10 * HZ;
 static const int writes_starved = 2;    /* max times reads can starve a write */
 static const int fifo_batch = 16;       /* # of sequential requests treated as one
 				     by the above parameters. For throughput. */
@@ -96,6 +101,7 @@ struct deadline_data {
 	int writes_starved;
 	int front_merges;
 	u32 async_depth;
+	int prio_aging_expire;
 
 	spinlock_t lock;
 	spinlock_t zone_lock;
@@ -338,12 +344,27 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	return rq;
 }
 
+/*
+ * Returns true if and only if @rq started after @latest_start where
+ * @latest_start is in jiffies.
+ */
+static bool started_after(struct deadline_data *dd, struct request *rq,
+			  unsigned long latest_start)
+{
+	unsigned long start_time = (unsigned long)rq->fifo_time;
+
+	start_time -= dd->fifo_expire[rq_data_dir(rq)];
+
+	return time_after(start_time, latest_start);
+}
+
 /*
  * deadline_dispatch_requests selects the best request according to
- * read/write expire, fifo_batch, etc
+ * read/write expire, fifo_batch, etc and with a start time <= @latest.
  */
 static struct request *__dd_dispatch_request(struct deadline_data *dd,
-					     struct dd_per_prio *per_prio)
+					     struct dd_per_prio *per_prio,
+					     unsigned long latest_start)
 {
 	struct request *rq, *next_rq;
 	enum dd_data_dir data_dir;
@@ -355,6 +376,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	if (!list_empty(&per_prio->dispatch)) {
 		rq = list_first_entry(&per_prio->dispatch, struct request,
 				      queuelist);
+		if (started_after(dd, rq, latest_start))
+			return NULL;
 		list_del_init(&rq->queuelist);
 		goto done;
 	}
@@ -432,6 +455,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	dd->batching = 0;
 
 dispatch_request:
+	if (started_after(dd, rq, latest_start))
+		return NULL;
+
 	/*
 	 * rq is the selected appropriate request.
 	 */
@@ -449,6 +475,34 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	return rq;
 }
 
+/*
+ * Check whether there are any requests with priority other than DD_RT_PRIO
+ * that were inserted more than prio_aging_expire jiffies ago.
+ */
+static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
+						      unsigned long now)
+{
+	struct request *rq;
+	enum dd_prio prio;
+	int prio_cnt;
+
+	lockdep_assert_held(&dd->lock);
+
+	prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
+		   !!dd_queued(dd, DD_IDLE_PRIO);
+	if (prio_cnt < 2)
+		return NULL;
+
+	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
+		rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
+					   now - dd->prio_aging_expire);
+		if (rq)
+			return rq;
+	}
+
+	return NULL;
+}
+
 /*
  * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
  *
@@ -460,15 +514,26 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
 static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
+	const unsigned long now = jiffies;
 	struct request *rq;
 	enum dd_prio prio;
 
 	spin_lock(&dd->lock);
+	rq = dd_dispatch_prio_aged_requests(dd, now);
+	if (rq)
+		goto unlock;
+
+	/*
+	 * Next, dispatch requests in priority order. Ignore lower priority
+	 * requests if any higher priority requests are pending.
+	 */
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
-		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
-		if (rq)
+		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
+		if (rq || dd_queued(dd, prio))
 			break;
 	}
+
+unlock:
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -573,6 +638,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->front_merges = 1;
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
+	dd->prio_aging_expire = prio_aging_expire;
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
 
@@ -796,6 +862,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
 #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
 SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
 SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
+SHOW_JIFFIES(deadline_prio_aging_expire_show, dd->prio_aging_expire);
 SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
 SHOW_INT(deadline_front_merges_show, dd->front_merges);
 SHOW_INT(deadline_async_depth_show, dd->front_merges);
@@ -825,6 +892,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
 	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
 STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
 STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
+STORE_JIFFIES(deadline_prio_aging_expire_store, &dd->prio_aging_expire, 0, INT_MAX);
 STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
 STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
 STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
@@ -843,6 +911,7 @@ static struct elv_fs_entry deadline_attrs[] = {
 	DD_ATTR(front_merges),
 	DD_ATTR(async_depth),
 	DD_ATTR(fifo_batch),
+	DD_ATTR(prio_aging_expire),
 	__ATTR_NULL
 };
 


* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2021-09-27 22:03 ` [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
@ 2021-09-27 22:53   ` Damien Le Moal
  2021-09-27 23:12     ` Bart Van Assche
  2021-09-28  5:53   ` Hannes Reinecke
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Damien Le Moal @ 2021-09-27 22:53 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Niklas Cassel, Hannes Reinecke

On 2021/09/28 7:03, Bart Van Assche wrote:
> In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
> Prioritize high-priority requests""), this patch uses 'jiffies' instead
> of ktime_get() in the code for aging lower priority requests.
> 
> This patch has been tested as follows:
> 
> Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: 555 K IOPS.
> 
> Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: about 380 K IOPS.
> 
> Ran the following script:
> 
> set -e
> scriptdir=$(dirname "$0")
> if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
> modprobe scsi_debug ndelay=1000000 max_queue=16
> sd=''
> while [ -z "$sd" ]; do
>   sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done
> echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire"
> if [ -e /sys/fs/cgroup/io.prio.class ]; then
>   cd /sys/fs/cgroup
>   echo restrict-to-be >io.prio.class
>   echo +io > cgroup.subtree_control
> else
>   cd /sys/fs/cgroup/blkio/
>   echo restrict-to-be >blkio.prio.class
> fi
> echo $$ >cgroup.procs
> mkdir -p hipri
> cd hipri
> if [ -e io.prio.class ]; then
>   echo none-to-rt >io.prio.class
> else
>   echo none-to-rt >blkio.prio.class
> fi
> { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
> echo $$ >cgroup.procs
> "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
> 
> Result:
> * 11000 IOPS for the high-priority job
> *    40 IOPS for the low-priority job
> 
> If the prio aging expiry time is changed from 100s to 0, the IOPS results
> change to 6712 and 6796 IOPS.
> 
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/mq-deadline.c | 77 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 73 insertions(+), 4 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b262f40f32c0..bb723478baf1 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -31,6 +31,11 @@
>   */
>  static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
>  static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +/*
> + * Time after which to dispatch lower priority requests even if higher
> + * priority requests are pending.
> + */
> +static const int prio_aging_expire = 10 * HZ;
>  static const int writes_starved = 2;    /* max times reads can starve a write */
>  static const int fifo_batch = 16;       /* # of sequential requests treated as one
>  				     by the above parameters. For throughput. */
> @@ -96,6 +101,7 @@ struct deadline_data {
>  	int writes_starved;
>  	int front_merges;
>  	u32 async_depth;
> +	int prio_aging_expire;
>  
>  	spinlock_t lock;
>  	spinlock_t zone_lock;
> @@ -338,12 +344,27 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  	return rq;
>  }
>  
> +/*
> + * Returns true if and only if @rq started after @latest_start where
> + * @latest_start is in jiffies.
> + */
> +static bool started_after(struct deadline_data *dd, struct request *rq,
> +			  unsigned long latest_start)
> +{
> +	unsigned long start_time = (unsigned long)rq->fifo_time;
> +
> +	start_time -= dd->fifo_expire[rq_data_dir(rq)];
> +
> +	return time_after(start_time, latest_start);
> +}
> +
>  /*
>   * deadline_dispatch_requests selects the best request according to
> - * read/write expire, fifo_batch, etc
> + * read/write expire, fifo_batch, etc and with a start time <= @latest.

s/@latest/@latest_start ?

>   */
>  static struct request *__dd_dispatch_request(struct deadline_data *dd,
> -					     struct dd_per_prio *per_prio)
> +					     struct dd_per_prio *per_prio,
> +					     unsigned long latest_start)
>  {
>  	struct request *rq, *next_rq;
>  	enum dd_data_dir data_dir;
> @@ -355,6 +376,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	if (!list_empty(&per_prio->dispatch)) {
>  		rq = list_first_entry(&per_prio->dispatch, struct request,
>  				      queuelist);
> +		if (started_after(dd, rq, latest_start))
> +			return NULL;
>  		list_del_init(&rq->queuelist);
>  		goto done;
>  	}
> @@ -432,6 +455,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	dd->batching = 0;
>  
>  dispatch_request:
> +	if (started_after(dd, rq, latest_start))
> +		return NULL;
> +
>  	/*
>  	 * rq is the selected appropriate request.
>  	 */
> @@ -449,6 +475,34 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	return rq;
>  }
>  
> +/*
> + * Check whether there are any requests with priority other than DD_RT_PRIO
> + * that were inserted more than prio_aging_expire jiffies ago.
> + */
> +static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
> +						      unsigned long now)
> +{
> +	struct request *rq;
> +	enum dd_prio prio;
> +	int prio_cnt;
> +
> +	lockdep_assert_held(&dd->lock);
> +
> +	prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
> +		   !!dd_queued(dd, DD_IDLE_PRIO);
> +	if (prio_cnt < 2)
> +		return NULL;
> +
> +	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
> +					   now - dd->prio_aging_expire);
> +		if (rq)
> +			return rq;
> +	}
> +
> +	return NULL;
> +}
> +
>  /*
>   * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
>   *
> @@ -460,15 +514,26 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  {
>  	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	const unsigned long now = jiffies;
>  	struct request *rq;
>  	enum dd_prio prio;
>  
>  	spin_lock(&dd->lock);
> +	rq = dd_dispatch_prio_aged_requests(dd, now);
> +	if (rq)
> +		goto unlock;
> +
> +	/*
> +	 * Next, dispatch requests in priority order. Ignore lower priority
> +	 * requests if any higher priority requests are pending.
> +	 */
>  	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> -		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
> -		if (rq)
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
> +		if (rq || dd_queued(dd, prio))
>  			break;
>  	}
> +
> +unlock:
>  	spin_unlock(&dd->lock);
>  
>  	return rq;
> @@ -573,6 +638,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>  	dd->front_merges = 1;
>  	dd->last_dir = DD_WRITE;
>  	dd->fifo_batch = fifo_batch;
> +	dd->prio_aging_expire = prio_aging_expire;
>  	spin_lock_init(&dd->lock);
>  	spin_lock_init(&dd->zone_lock);
>  
> @@ -796,6 +862,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>  #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
>  SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
>  SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
> +SHOW_JIFFIES(deadline_prio_aging_expire_show, dd->prio_aging_expire);
>  SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
>  SHOW_INT(deadline_front_merges_show, dd->front_merges);
>  SHOW_INT(deadline_async_depth_show, dd->front_merges);
> @@ -825,6 +892,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
>  	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
>  STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
>  STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
> +STORE_JIFFIES(deadline_prio_aging_expire_store, &dd->prio_aging_expire, 0, INT_MAX);
>  STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
>  STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
>  STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
> @@ -843,6 +911,7 @@ static struct elv_fs_entry deadline_attrs[] = {
>  	DD_ATTR(front_merges),
>  	DD_ATTR(async_depth),
>  	DD_ATTR(fifo_batch),
> +	DD_ATTR(prio_aging_expire),
>  	__ATTR_NULL
>  };
>  
> 

Apart from the nit above, looks good to me.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2021-09-27 22:53   ` Damien Le Moal
@ 2021-09-27 23:12     ` Bart Van Assche
  0 siblings, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2021-09-27 23:12 UTC (permalink / raw)
  To: Damien Le Moal, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Niklas Cassel, Hannes Reinecke

On 9/27/21 3:53 PM, Damien Le Moal wrote:
> On 2021/09/28 7:03, Bart Van Assche wrote:
>>   /*
>>    * deadline_dispatch_requests selects the best request according to
>> - * read/write expire, fifo_batch, etc
>> + * read/write expire, fifo_batch, etc and with a start time <= @latest.
> 
> s/@latest/@latest_start ?
> 
>>    */
>>   static struct request *__dd_dispatch_request(struct deadline_data *dd,
>> -					     struct dd_per_prio *per_prio)
>> +					     struct dd_per_prio *per_prio,
>> +					     unsigned long latest_start)

Right, the comment above this function needs to be updated.
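
Something like this for v3:

 * deadline_dispatch_requests selects the best request according to
 * read/write expire, fifo_batch, etc and with a start time <= @latest_start.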

Bart.


* Re: [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler
  2021-09-27 22:03 [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
                   ` (3 preceding siblings ...)
  2021-09-27 22:03 ` [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
@ 2021-09-27 23:14 ` Jens Axboe
  4 siblings, 0 replies; 21+ messages in thread
From: Jens Axboe @ 2021-09-27 23:14 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-block, Christoph Hellwig, Jaegeuk Kim

On 9/27/21 4:03 PM, Bart Van Assche wrote:
> Hi Jens,
> 
> This patch series includes the following changes compared to the v5.14-rc7
> implementation:
> - Fix request accounting by counting requeued requests once.
> - Test correctness of the request accounting code by triggering a kernel
>   warning from inside dd_exit_sched() if an inconsistency has been detected.
> - Switch from per-CPU counters to individual counters.
> - Restore I/O priority support in the mq-deadline scheduler. The performance
>   measurements in the description of patch 4/4 show that the performance
>   regressions in the previous version of this patch have been fixed. This has
>   been achieved by using 'jiffies' instead of ktime_get() and also by skipping
>   the aging mechanism if all queued requests have the same I/O priority.
> 
> Please consider this patch series for kernel v5.16.

Applied, thanks.

-- 
Jens Axboe



* Re: [PATCH v2 1/4] block/mq-deadline: Improve request accounting further
  2021-09-27 22:03 ` [PATCH v2 1/4] block/mq-deadline: Improve request accounting further Bart Van Assche
@ 2021-09-28  5:45   ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2021-09-28  5:45 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Niklas Cassel

On 9/28/21 12:03 AM, Bart Van Assche wrote:
> The scheduler .insert_requests() callback is called when a request is
> queued for the first time and also when it is requeued. Only count a
> request the first time it is queued. Additionally, since the mq-deadline
> scheduler only performs zone locking for requests that have been
> inserted, skip the zone unlock code for requests that have not been
> inserted into the mq-deadline scheduler.
> 
> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 17 ++++++++++-------
>   1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 47f042fa6a68..c27b4347ca91 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   	blk_req_zone_write_unlock(rq);
>   
>   	prio = ioprio_class_to_prio[ioprio_class];
> -	dd_count(dd, inserted, prio);
> -	rq->elv.priv[0] = (void *)(uintptr_t)1;
> +	if (!rq->elv.priv[0]) {
> +		dd_count(dd, inserted, prio);
> +		rq->elv.priv[0] = (void *)(uintptr_t)1;
> +	}
>   
>   	if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
>   		blk_mq_free_requests(&free);
> @@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
>   
>   	/*
>   	 * The block layer core may call dd_finish_request() without having
> -	 * called dd_insert_requests(). Hence only update statistics for
> -	 * requests for which dd_insert_requests() has been called. See also
> -	 * blk_mq_request_bypass_insert().
> +	 * called dd_insert_requests(). Skip requests that bypassed I/O
> +	 * scheduling. See also blk_mq_request_bypass_insert().
>   	 */
> -	if (rq->elv.priv[0])
> -		dd_count(dd, completed, prio);
> +	if (!rq->elv.priv[0])
> +		return;
> +
> +	dd_count(dd, completed, prio);
>   
>   	if (blk_queue_is_zoned(q)) {
>   		unsigned long flags;
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH v2 2/4] block/mq-deadline: Add an invariant check
  2021-09-27 22:03 ` [PATCH v2 2/4] block/mq-deadline: Add an invariant check Bart Van Assche
@ 2021-09-28  5:47   ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2021-09-28  5:47 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Niklas Cassel

On 9/28/21 12:03 AM, Bart Van Assche wrote:
> Check a statistics invariant at module unload time. When running
> blktests, the invariant is verified every time a request queue is
> removed and hence is verified at least once per test.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 18 ++++++++++++------
>   1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index c27b4347ca91..2586b3f8c7e9 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>   	deadline_remove_request(rq->q, per_prio, rq);
>   }
>   
> +/* Number of requests queued for a given priority level. */
> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> +	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +}
> +
>   /*
>    * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
>    * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> @@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
>   
>   		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
>   		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> +		WARN_ONCE(dd_queued(dd, prio) != 0,
> +			  "statistics for priority %d: i %u m %u d %u c %u\n",
> +			  prio, dd_sum(dd, inserted, prio),
> +			  dd_sum(dd, merged, prio),
> +			  dd_sum(dd, dispatched, prio),
> +			  dd_sum(dd, completed, prio));
>   	}
>   
>   	free_percpu(dd->stats);
> @@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
>   	return 0;
>   }
>   
> -/* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> -{
> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> -}
> -
>   static int dd_queued_show(void *data, struct seq_file *m)
>   {
>   	struct request_queue *q = data;
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH v2 3/4] block/mq-deadline: Stop using per-CPU counters
  2021-09-27 22:03 ` [PATCH v2 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
@ 2021-09-28  5:51   ` Hannes Reinecke
  2021-09-28 17:35     ` Bart Van Assche
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2021-09-28  5:51 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Niklas Cassel

On 9/28/21 12:03 AM, Bart Van Assche wrote:
> Calculating the sum over all CPUs of per-CPU counters frequently is
> inefficient. Hence switch from per-CPU to individual counters. Three
> counters are protected by the mq-deadline spinlock since these are
> only accessed from contexts that already hold that spinlock. The fourth
> counter is atomic because protecting it with the mq-deadline spinlock
> would trigger lock contention.
> 
> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
> Reviewed-by: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
>   1 file changed, 56 insertions(+), 68 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 2586b3f8c7e9..b262f40f32c0 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -51,17 +51,16 @@ enum dd_prio {
>   
>   enum { DD_PRIO_COUNT = 3 };
>   
> -/* I/O statistics per I/O priority. */
> +/*
> + * I/O statistics per I/O priority. It is fine if these counters overflow.
> + * What matters is that these counters are at least as wide as
> + * log2(max_outstanding_requests).
> + */
>   struct io_stats_per_prio {
> -	local_t inserted;
> -	local_t merged;
> -	local_t dispatched;
> -	local_t completed;
> -};
> -
> -/* I/O statistics for all I/O priorities (enum dd_prio). */
> -struct io_stats {
> -	struct io_stats_per_prio stats[DD_PRIO_COUNT];
> +	uint32_t inserted;
> +	uint32_t merged;
> +	uint32_t dispatched;
> +	atomic_t completed;
>   };
>   
>   /*
> @@ -74,6 +73,7 @@ struct dd_per_prio {
>   	struct list_head fifo_list[DD_DIR_COUNT];
>   	/* Next request in FIFO order. Read, write or both are NULL. */
>   	struct request *next_rq[DD_DIR_COUNT];
> +	struct io_stats_per_prio stats;
>   };
>   
>   struct deadline_data {
> @@ -88,8 +88,6 @@ struct deadline_data {
>   	unsigned int batching;		/* number of sequential requests made */
>   	unsigned int starved;		/* times reads have starved writes */
>   
> -	struct io_stats __percpu *stats;
> -
>   	/*
>   	 * settings that change how the i/o scheduler behaves
>   	 */
> @@ -103,33 +101,6 @@ struct deadline_data {
>   	spinlock_t zone_lock;
>   };
>   
> -/* Count one event of type 'event_type' and with I/O priority 'prio' */
> -#define dd_count(dd, event_type, prio) do {				\
> -	struct io_stats *io_stats = get_cpu_ptr((dd)->stats);		\
> -									\
> -	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
> -	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
> -	local_inc(&io_stats->stats[(prio)].event_type);			\
> -	put_cpu_ptr(io_stats);						\
> -} while (0)
> -
> -/*
> - * Returns the total number of dd_count(dd, event_type, prio) calls across all
> - * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> - * outdated.
> - */
> -#define dd_sum(dd, event_type, prio) ({					\
> -	unsigned int cpu;						\
> -	u32 sum = 0;							\
> -									\
> -	BUILD_BUG_ON(!__same_type((dd), struct deadline_data *));	\
> -	BUILD_BUG_ON(!__same_type((prio), enum dd_prio));		\
> -	for_each_present_cpu(cpu)					\
> -		sum += local_read(&per_cpu_ptr((dd)->stats, cpu)->	\
> -				  stats[(prio)].event_type);		\
> -	sum;								\
> -})
> -
>   /* Maps an I/O priority class to a deadline scheduler priority. */
>   static const enum dd_prio ioprio_class_to_prio[] = {
>   	[IOPRIO_CLASS_NONE]	= DD_BE_PRIO,
> @@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
>   	const u8 ioprio_class = dd_rq_ioclass(next);
>   	const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>   
> -	dd_count(dd, merged, prio);
> +	lockdep_assert_held(&dd->lock);
> +
> +	dd->per_prio[prio].stats.merged++;
>   

Why don't you convert the macro 'dd_count()' to work with the new structure?
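
Something like this, maybe (untested; the atomic 'completed' counter would
need to be special-cased):

#define dd_count(dd, event_type, prio) \
	((dd)->per_prio[(prio)].stats.event_type++)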

>   	/*
>   	 * if next expires before rq, assign its expire time to rq
> @@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>   /* Number of requests queued for a given priority level. */
>   static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
>   {
> -	return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> +	lockdep_assert_held(&dd->lock);
> +
> +	return stats->inserted - atomic_read(&stats->completed);
>   }
>   
>   /*
> @@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>   done:
>   	ioprio_class = dd_rq_ioclass(rq);
>   	prio = ioprio_class_to_prio[ioprio_class];
> -	dd_count(dd, dispatched, prio);
> +	dd->per_prio[prio].stats.dispatched++;

Same here.

>   	/*
>   	 * If the request needs its target zone locked, do it.
>   	 */
> @@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
>   
>   	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
>   		struct dd_per_prio *per_prio = &dd->per_prio[prio];
> +		const struct io_stats_per_prio *stats = &per_prio->stats;
> +		uint32_t queued;
>   
>   		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
>   		WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> -		WARN_ONCE(dd_queued(dd, prio) != 0,
> +
> +		spin_lock(&dd->lock);
> +		queued = dd_queued(dd, prio);
> +		spin_unlock(&dd->lock);
> +
> +		WARN_ONCE(queued != 0,
>   			  "statistics for priority %d: i %u m %u d %u c %u\n",
> -			  prio, dd_sum(dd, inserted, prio),
> -			  dd_sum(dd, merged, prio),
> -			  dd_sum(dd, dispatched, prio),
> -			  dd_sum(dd, completed, prio));
> +			  prio, stats->inserted, stats->merged,
> +			  stats->dispatched, atomic_read(&stats->completed));
>   	}
>   
> -	free_percpu(dd->stats);
> -
>   	kfree(dd);
>   }
>   
> @@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>   
>   	eq->elevator_data = dd;
>   
> -	dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
> -				     GFP_KERNEL | __GFP_ZERO);
> -	if (!dd->stats)
> -		goto free_dd;
> -
>   	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
>   		struct dd_per_prio *per_prio = &dd->per_prio[prio];
>   
> @@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>   	q->elevator = eq;
>   	return 0;
>   
> -free_dd:
> -	kfree(dd);
> -
>   put_eq:
>   	kobject_put(&eq->kobj);
>   	return ret;
> @@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   	blk_req_zone_write_unlock(rq);
>   
>   	prio = ioprio_class_to_prio[ioprio_class];
> +	per_prio = &dd->per_prio[prio];
>   	if (!rq->elv.priv[0]) {
> -		dd_count(dd, inserted, prio);
> +		per_prio->stats.inserted++;

And here.

>   		rq->elv.priv[0] = (void *)(uintptr_t)1;
>   	}
>   
> @@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>   
>   	trace_block_rq_insert(rq);
>   
> -	per_prio = &dd->per_prio[prio];
>   	if (at_head) {
>   		list_add(&rq->queuelist, &per_prio->dispatch);
>   	} else {
> @@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
>   	if (!rq->elv.priv[0])
>   		return;
>   
> -	dd_count(dd, completed, prio);
> +	atomic_inc(&per_prio->stats.completed);
>   
>   	if (blk_queue_is_zoned(q)) {
>   		unsigned long flags;
> @@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
>   {
>   	struct request_queue *q = data;
>   	struct deadline_data *dd = q->elevator->elevator_data;
> +	u32 rt, be, idle;
> +
> +	spin_lock(&dd->lock);
> +	rt = dd_queued(dd, DD_RT_PRIO);
> +	be = dd_queued(dd, DD_BE_PRIO);
> +	idle = dd_queued(dd, DD_IDLE_PRIO);
> +	spin_unlock(&dd->lock);
> +
> +	seq_printf(m, "%u %u %u\n", rt, be, idle);
>   
> -	seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
> -		   dd_queued(dd, DD_BE_PRIO),
> -		   dd_queued(dd, DD_IDLE_PRIO));
>   	return 0;
>   }
>   
>   /* Number of requests owned by the block driver for a given priority. */
>   static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
>   {
> -	return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
> -		- dd_sum(dd, completed, prio);
> +	const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> +	lockdep_assert_held(&dd->lock);
> +
> +	return stats->dispatched + stats->merged -
> +		atomic_read(&stats->completed);
>   }
>   
>   static int dd_owned_by_driver_show(void *data, struct seq_file *m)
>   {
>   	struct request_queue *q = data;
>   	struct deadline_data *dd = q->elevator->elevator_data;
> +	u32 rt, be, idle;
> +
> +	spin_lock(&dd->lock);
> +	rt = dd_owned_by_driver(dd, DD_RT_PRIO);
> +	be = dd_owned_by_driver(dd, DD_BE_PRIO);
> +	idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
> +	spin_unlock(&dd->lock);
> +
> +	seq_printf(m, "%u %u %u\n", rt, be, idle);
>   
> -	seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> -		   dd_owned_by_driver(dd, DD_BE_PRIO),
> -		   dd_owned_by_driver(dd, DD_IDLE_PRIO));
>   	return 0;
>   }
>   
> 
Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2021-09-27 22:03 ` [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
  2021-09-27 22:53   ` Damien Le Moal
@ 2021-09-28  5:53   ` Hannes Reinecke
  2021-09-28 10:36   ` Niklas Cassel
  2022-03-25 12:33   ` Oleksij Rempel
  3 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2021-09-28  5:53 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Niklas Cassel

On 9/28/21 12:03 AM, Bart Van Assche wrote:
> In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
> Prioritize high-priority requests""), this patch uses 'jiffies' instead
> of ktime_get() in the code for aging lower priority requests.
> 
> This patch has been tested as follows:
> 
> Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: 555 K IOPS.
> 
> Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: about 380 K IOPS.
> 
> Ran the following script:
> 
> set -e
> scriptdir=$(dirname "$0")
> if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
> modprobe scsi_debug ndelay=1000000 max_queue=16
> sd=''
> while [ -z "$sd" ]; do
>    sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done
> echo $((100*1000)) > "/sys/block/$sd/queue/iosched/prio_aging_expire"
> if [ -e /sys/fs/cgroup/io.prio.class ]; then
>    cd /sys/fs/cgroup
>    echo restrict-to-be >io.prio.class
>    echo +io > cgroup.subtree_control
> else
>    cd /sys/fs/cgroup/blkio/
>    echo restrict-to-be >blkio.prio.class
> fi
> echo $$ >cgroup.procs
> mkdir -p hipri
> cd hipri
> if [ -e io.prio.class ]; then
>    echo none-to-rt >io.prio.class
> else
>    echo none-to-rt >blkio.prio.class
> fi
> { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
> echo $$ >cgroup.procs
> "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
> 
> Result:
> * 11000 IOPS for the high-priority job
> *    40 IOPS for the low-priority job
> 
> If the prio aging expiry time is changed from 100s to 0, the IOPS results
> change to 6712 and 6796 IOPS.
> 
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   block/mq-deadline.c | 77 ++++++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 73 insertions(+), 4 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b262f40f32c0..bb723478baf1 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -31,6 +31,11 @@
>    */
>   static const int read_expire = HZ / 2;  /* max time before a read is submitted. */
>   static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +/*
> + * Time after which to dispatch lower priority requests even if higher
> + * priority requests are pending.
> + */
> +static const int prio_aging_expire = 10 * HZ;
>   static const int writes_starved = 2;    /* max times reads can starve a write */
>   static const int fifo_batch = 16;       /* # of sequential requests treated as one
>   				     by the above parameters. For throughput. */
> @@ -96,6 +101,7 @@ struct deadline_data {
>   	int writes_starved;
>   	int front_merges;
>   	u32 async_depth;
> +	int prio_aging_expire;
>   
>   	spinlock_t lock;
>   	spinlock_t zone_lock;
> @@ -338,12 +344,27 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>   	return rq;
>   }
>   
> +/*
> + * Returns true if and only if @rq started after @latest_start where
> + * @latest_start is in jiffies.
> + */
> +static bool started_after(struct deadline_data *dd, struct request *rq,
> +			  unsigned long latest_start)
> +{
> +	unsigned long start_time = (unsigned long)rq->fifo_time;
> +
> +	start_time -= dd->fifo_expire[rq_data_dir(rq)];
> +
> +	return time_after(start_time, latest_start);
> +}
> +
>   /*
>    * deadline_dispatch_requests selects the best request according to
> - * read/write expire, fifo_batch, etc
> + * read/write expire, fifo_batch, etc and with a start time <= @latest.
>    */
>   static struct request *__dd_dispatch_request(struct deadline_data *dd,
> -					     struct dd_per_prio *per_prio)
> +					     struct dd_per_prio *per_prio,
> +					     unsigned long latest_start)
>   {
>   	struct request *rq, *next_rq;
>   	enum dd_data_dir data_dir;
> @@ -355,6 +376,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>   	if (!list_empty(&per_prio->dispatch)) {
>   		rq = list_first_entry(&per_prio->dispatch, struct request,
>   				      queuelist);
> +		if (started_after(dd, rq, latest_start))
> +			return NULL;
>   		list_del_init(&rq->queuelist);
>   		goto done;
>   	}
> @@ -432,6 +455,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>   	dd->batching = 0;
>   
>   dispatch_request:
> +	if (started_after(dd, rq, latest_start))
> +		return NULL;
> +
>   	/*
>   	 * rq is the selected appropriate request.
>   	 */
> @@ -449,6 +475,34 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>   	return rq;
>   }
>   
> +/*
> + * Check whether there are any requests with priority other than DD_RT_PRIO
> + * that were inserted more than prio_aging_expire jiffies ago.
> + */
> +static struct request *dd_dispatch_prio_aged_requests(struct deadline_data *dd,
> +						      unsigned long now)
> +{
> +	struct request *rq;
> +	enum dd_prio prio;
> +	int prio_cnt;
> +
> +	lockdep_assert_held(&dd->lock);
> +
> +	prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
> +		   !!dd_queued(dd, DD_IDLE_PRIO);
> +	if (prio_cnt < 2)
> +		return NULL;
> +
> +	for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
> +					   now - dd->prio_aging_expire);
> +		if (rq)
> +			return rq;
> +	}
> +
> +	return NULL;
> +}
> +
>   /*
>    * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
>    *
> @@ -460,15 +514,26 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>   static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>   {
>   	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> +	const unsigned long now = jiffies;
>   	struct request *rq;
>   	enum dd_prio prio;
>   
>   	spin_lock(&dd->lock);
> +	rq = dd_dispatch_prio_aged_requests(dd, now);
> +	if (rq)
> +		goto unlock;
> +
> +	/*
> +	 * Next, dispatch requests in priority order. Ignore lower priority
> +	 * requests if any higher priority requests are pending.
> +	 */
>   	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> -		rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
> -		if (rq)
> +		rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
> +		if (rq || dd_queued(dd, prio))
>   			break;
>   	}
> +
> +unlock:
>   	spin_unlock(&dd->lock);
>   
>   	return rq;
> @@ -573,6 +638,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>   	dd->front_merges = 1;
>   	dd->last_dir = DD_WRITE;
>   	dd->fifo_batch = fifo_batch;
> +	dd->prio_aging_expire = prio_aging_expire;
>   	spin_lock_init(&dd->lock);
>   	spin_lock_init(&dd->zone_lock);
>   
> @@ -796,6 +862,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page)		\
>   #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
>   SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
>   SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
> +SHOW_JIFFIES(deadline_prio_aging_expire_show, dd->prio_aging_expire);
>   SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
>   SHOW_INT(deadline_front_merges_show, dd->front_merges);
>   SHOW_INT(deadline_async_depth_show, dd->front_merges);
> @@ -825,6 +892,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
>   	STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
>   STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
>   STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
> +STORE_JIFFIES(deadline_prio_aging_expire_store, &dd->prio_aging_expire, 0, INT_MAX);
>   STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
>   STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
>   STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
> @@ -843,6 +911,7 @@ static struct elv_fs_entry deadline_attrs[] = {
>   	DD_ATTR(front_merges),
>   	DD_ATTR(async_depth),
>   	DD_ATTR(fifo_batch),
> +	DD_ATTR(prio_aging_expire),
>   	__ATTR_NULL
>   };
>   
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2021-09-27 22:03 ` [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
  2021-09-27 22:53   ` Damien Le Moal
  2021-09-28  5:53   ` Hannes Reinecke
@ 2021-09-28 10:36   ` Niklas Cassel
  2022-03-25 12:33   ` Oleksij Rempel
  3 siblings, 0 replies; 21+ messages in thread
From: Niklas Cassel @ 2021-09-28 10:36 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Hannes Reinecke

On Mon, Sep 27, 2021 at 03:03:28PM -0700, Bart Van Assche wrote:
> In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
> Prioritize high-priority requests""), this patch uses 'jiffies' instead
> of ktime_get() in the code for aging lower priority requests.

Considering that this is basically a revert of a revert,
except for the ktime_get() to jiffies change, I think that
you should state the reason for this change in the change log.

> [...]

With an elaborated change log, this looks good to me:

Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/4] block/mq-deadline: Stop using per-CPU counters
  2021-09-28  5:51   ` Hannes Reinecke
@ 2021-09-28 17:35     ` Bart Van Assche
  0 siblings, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2021-09-28 17:35 UTC (permalink / raw)
  To: Hannes Reinecke, Jens Axboe
  Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Damien Le Moal,
	Niklas Cassel

On 9/27/21 10:51 PM, Hannes Reinecke wrote:
> On 9/28/21 12:03 AM, Bart Van Assche wrote:
>> +    dd->per_prio[prio].stats.merged++;
> 
> Why don't you convert the macro 'dd_count()' to work with the new structure?

Hi Hannes,

The dd_count() macro would look as follows if it had been kept:

#define dd_count(dd, event_type, prio) (dd)->per_prio[(prio)].stats.event_type++

I prefer to open-code such a macro since I think that the open-coded version
is at least as easy to read as the converted dd_count() macro would be.

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2021-09-27 22:03 ` [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
                     ` (2 preceding siblings ...)
  2021-09-28 10:36   ` Niklas Cassel
@ 2022-03-25 12:33   ` Oleksij Rempel
  2022-03-25 13:07     ` Oleksij Rempel
  2022-03-25 17:05     ` Bart Van Assche
  3 siblings, 2 replies; 21+ messages in thread
From: Oleksij Rempel @ 2022-03-25 12:33 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

Hello Bart,

I see a regression on iMX6 + a USB storage device with this patch.
After plugging in a USB flash drive (in my case a USB3 Intenso 32GB) and
running mount /dev/sda1 /mnt, the command never exits.

Reverting this patch (322cff70d46) on top of v5.17 solves it for me.

How can I help you to debug this issue?
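
If it helps, the revert I tested can be reproduced with something like
this (a sketch; the commit id is the one mentioned above, and the kernel
build/boot steps for the iMX6 board are omitted):

  # Build v5.17 with the suspected commit reverted (assuming the revert
  # still applies cleanly on top of the release tag).
  git checkout -b test-revert v5.17
  git revert 322cff70d46
  # After booting the patched kernel, retry the command that hangs:
  mount /dev/sda1 /mnt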

Regards,
Oleksij

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2022-03-25 12:33   ` Oleksij Rempel
@ 2022-03-25 13:07     ` Oleksij Rempel
  2022-03-25 17:05     ` Bart Van Assche
  1 sibling, 0 replies; 21+ messages in thread
From: Oleksij Rempel @ 2022-03-25 13:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke, kernel

On Fri, Mar 25, 2022 at 01:33:20PM +0100, Oleksij Rempel wrote:
> Hello Bart,
> 
> I see a regression on iMX6 + a USB storage device with this patch.
> After plugging in a USB flash drive (in my case a USB3 Intenso 32GB) and
> running mount /dev/sda1 /mnt, the command never exits.
> 
> Reverting this patch (322cff70d46) on top of v5.17 solves it for me.
> 
> How can I help you to debug this issue?

I hope this can help:
- it seems to be reproducible only with some types of USB storage devices
  (a USB3 device on a USB2 host?)
- mount takes a long time, but eventually it completes
- this is the backtrace I get via sysrq:

 kernel: Exception stack(0x83a59fb0 to 0x83a59ff8)
 kernel: 9fa0:                                     00000000 00000000 00000000 00000000
 kernel: 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 kernel: 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000
 kernel:  r10:00000000 r9:00000000 r8:00000000 r7:00000000 r6:00000000 r5:80147ea0
 kernel:  r4:83a4cec0 r3:8118cd24
 kernel: task:mount           state:D stack:    0 pid:  282 ppid:   275 flags:0x00000000
 kernel: Backtrace: 
 kernel:  __schedule from schedule+0x84/0xc0
 kernel:  r10:82ebdd94 r9:00000102 r8:82ad3600 r7:81203ea4 r6:00000000 r5:00000002
 kernel:  r4:82ad3600
 kernel:  schedule from io_schedule+0x20/0x2c
 kernel:  r5:00000002 r4:00000000
 kernel:  io_schedule from folio_wait_bit_common+0x18c/0x220
 kernel:  r5:00000002 r4:9bc2e580
 kernel:  folio_wait_bit_common from folio_put_wait_locked+0x24/0x28
 kernel:  r10:00000003 r9:825124f0 r8:00000000 r7:825124e0 r6:9bc2e580 r5:82ebdf00
 kernel:  r4:00080001
 kernel:  folio_put_wait_locked from filemap_read+0x4f0/0x824
 kernel:  filemap_read from blkdev_read_iter+0x144/0x19c
 kernel:  r10:00000003 r9:82ad3600 r8:00000000 r7:82ebdee8 r6:82ebdf00 r5:00000040
 kernel:  r4:00000000
 kernel:  blkdev_read_iter from vfs_read+0x138/0x188
 kernel:  r8:00000040 r7:01649af0 r6:82ebdf58 r5:82ddfe40 r4:00000000
 kernel:  vfs_read from ksys_read+0x84/0xd8
 kernel:  r8:00000040 r7:82ebdf64 r6:82ebdf58 r5:01649af0 r4:82ddfe40
 kernel:  ksys_read from sys_read+0x18/0x1c
 kernel:  r8:801002a4 r7:00000003 r6:76fce840 r5:708f0000 r4:01649a00
 kernel:  sys_read from ret_fast_syscall+0x0/0x1c
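
The dump above was captured via magic sysrq; a minimal sketch of the
commands (assuming CONFIG_MAGIC_SYSRQ is enabled):

  # Allow all sysrq functions, then request a dump of blocked (state D)
  # tasks; the backtraces end up in the kernel log.
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 60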


-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2022-03-25 12:33   ` Oleksij Rempel
  2022-03-25 13:07     ` Oleksij Rempel
@ 2022-03-25 17:05     ` Bart Van Assche
  2022-03-26  8:57       ` Oleksij Rempel
  1 sibling, 1 reply; 21+ messages in thread
From: Bart Van Assche @ 2022-03-25 17:05 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

On 3/25/22 05:33, Oleksij Rempel wrote:
> I see a regression on iMX6 + a USB storage device with this patch.
> After plugging in a USB flash drive (in my case a USB3 Intenso 32GB) and
> running mount /dev/sda1 /mnt, the command never exits.
> 
> Reverting this patch (322cff70d46) on top of v5.17 solves it for me.
> 
> How can I help you to debug this issue?

That is unexpected. Is there perhaps something special about the USB
stick for which the hang occurs, e.g. does it only support a low queue
depth? Can you please share the output of the following command after
inserting the USB stick that triggers the hang?

(cd /sys/class && grep -aH . scsi_host/*/can_queue scsi_device/*/device/{blacklist,inquiry,model,queue*,vendor}) | sort

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2022-03-25 17:05     ` Bart Van Assche
@ 2022-03-26  8:57       ` Oleksij Rempel
  2022-03-28  2:20         ` Bart Van Assche
  0 siblings, 1 reply; 21+ messages in thread
From: Oleksij Rempel @ 2022-03-26  8:57 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

On Fri, Mar 25, 2022 at 10:05:43AM -0700, Bart Van Assche wrote:
> On 3/25/22 05:33, Oleksij Rempel wrote:
> > I see a regression on iMX6 + a USB storage device with this patch.
> > After plugging in a USB flash drive (in my case a USB3 Intenso 32GB) and
> > running mount /dev/sda1 /mnt, the command never exits.
> > 
> > Reverting this patch (322cff70d46) on top of v5.17 solves it for me.
> > 
> > How can I help you to debug this issue?
> 
> That is unexpected. Is there perhaps something special about the USB
> stick for which the hang occurs, e.g. does it only support a low queue
> depth? Can you please share the output of the following command after
> inserting the USB stick that triggers the hang?
> 
> (cd /sys/class && grep -aH . scsi_host/*/can_queue scsi_device/*/device/{blacklist,inquiry,model,queue*,vendor}) | sort

Here it is:
root@test:/sys/class cat scsi_host/*/can_queue
1
root@test:/sys/class cat scsi_device/*/device/blacklist
root@test:/sys/class cat scsi_device/*/device/inquiry
Intenso Speed Line      1.00
root@test:/sys/class cat scsi_device/*/device/model
Speed Line      
root@test:/sys/class cat scsi_device/*/device/queue*
1
none
root@test:/sys/class cat scsi_device/*/device/vendor
Intenso 

I do not know if there is something special about it. If needed, I can
take it apart to see which controller is used.

Regards,
Oleksij
-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2022-03-26  8:57       ` Oleksij Rempel
@ 2022-03-28  2:20         ` Bart Van Assche
  2022-03-28 22:24           ` Bart Van Assche
  2022-03-29  4:49           ` Bart Van Assche
  0 siblings, 2 replies; 21+ messages in thread
From: Bart Van Assche @ 2022-03-28  2:20 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

On 3/26/22 01:57, Oleksij Rempel wrote:
> On Fri, Mar 25, 2022 at 10:05:43AM -0700, Bart Van Assche wrote:
>> On 3/25/22 05:33, Oleksij Rempel wrote:
>>> I see a regression on iMX6 + a USB storage device with this patch.
>>> After plugging in a USB flash drive (in my case a USB3 Intenso 32GB) and
>>> running mount /dev/sda1 /mnt, the command never exits.
>>>
>>> Reverting this patch (322cff70d46) on top of v5.17 solves it for me.
>>>
>>> How can I help you to debug this issue?
>>
>> That is unexpected. Is there perhaps something special about the USB
>> stick for which the hang occurs, e.g. does it only support a low queue
>> depth? Can you please share the output of the following command after
>> inserting the USB stick that triggers the hang?
>>
>> (cd /sys/class && grep -aH . scsi_host/*/can_queue scsi_device/*/device/{blacklist,inquiry,model,queue*,vendor}) | sort
> 
> Here it is:
> root@test:/sys/class cat scsi_host/*/can_queue
> 1
> root@test:/sys/class cat scsi_device/*/device/blacklist
> root@test:/sys/class cat scsi_device/*/device/inquiry
> Intenso Speed Line      1.00
> root@test:/sys/class cat scsi_device/*/device/model
> Speed Line
> root@test:/sys/class cat scsi_device/*/device/queue*
> 1
> none
> root@test:/sys/class cat scsi_device/*/device/vendor
> Intenso
> 
> I do not know if there is something special about it. If needed, I can
> take it apart to see which controller is used.

Thanks, this helps a lot. can_queue = 1, as I suspected. In a quick
test I noticed that I/O is about 10x slower with queue depth 1 and the
v5.17 mq-deadline scheduler. I will take a closer look at this tomorrow.
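
A quick way to approximate such a queue-depth-1 device is null_blk (a
sketch; these parameters are an assumption, not necessarily the exact
test I ran):

  # Create a blk-mq null block device with a hardware queue depth of 1.
  modprobe null_blk queue_mode=2 hw_queue_depth=1
  echo mq-deadline > /sys/block/nullb0/queue/scheduler
  fio --name=qd1 --filename=/dev/nullb0 --direct=1 --rw=read --bs=4K \
      --ioengine=libaio --iodepth=1 --runtime=30 --time_based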

Bart.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2022-03-28  2:20         ` Bart Van Assche
@ 2022-03-28 22:24           ` Bart Van Assche
  2022-03-29  4:49           ` Bart Van Assche
  1 sibling, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2022-03-28 22:24 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

On 3/27/22 19:20, Bart Van Assche wrote:
> Thanks, this helps a lot. can_queue = 1, as I suspected. In a quick
> test I noticed that I/O is about 10x slower with queue depth 1 and the
> v5.17 mq-deadline scheduler. I will take a closer look at this tomorrow.

An update: I can reproduce the slowness with (a) the patch at the start of
this thread reverted and (b) both the BFQ and the Kyber I/O schedulers. So
it seems that a regression related to queue-depth-1 handling has been
introduced. I will try to bisect this.
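
The bisect I have in mind looks roughly like this (a sketch; the
good/bad endpoints are assumptions that still need to be verified):

  git bisect start
  git bisect bad v5.16         # I/O schedulers slow down the workload
  git bisect good v5.13        # I/O schedulers still speed it up
  # Rebuild, rerun the queue-depth-1 workload, and mark each step with
  # 'git bisect good' or 'git bisect bad' until the culprit is found.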

Bart.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests
  2022-03-28  2:20         ` Bart Van Assche
  2022-03-28 22:24           ` Bart Van Assche
@ 2022-03-29  4:49           ` Bart Van Assche
  1 sibling, 0 replies; 21+ messages in thread
From: Bart Van Assche @ 2022-03-29  4:49 UTC (permalink / raw)
  To: Oleksij Rempel
  Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
	Damien Le Moal, Niklas Cassel, Hannes Reinecke

On 3/27/22 19:20, Bart Van Assche wrote:
> Thanks, this helps a lot. can_queue = 1, as I suspected. In a quick
> test I noticed that I/O is about 10x slower with queue depth 1 and the
> v5.17 mq-deadline scheduler. I will take a closer look at this tomorrow.

(replying to my own email)

Please take a look at the two new tests in
https://github.com/osandov/blktests/pull/87. The results of those tests for
kernel v5.13 show that enabling an I/O scheduler speeds up the I/O workload
except when using BFQ (times are in centiseconds):

# cat /home/bart/software/blktests/results/nodev/block/032.full
bfq: 465 vs 452: pass
kyber: 243 vs 452: pass
mq-deadline: 230 vs 452: pass

The results for kernel v5.16 show that enabling an I/O scheduler slows down
the I/O workload by up to 2.2x:

# cat /home/bart/software/blktests/results/nodev/block/032.full
bfq: 920 vs 419: fail
kyber: 732 vs 419: fail
mq-deadline: 751 vs 419: fail

In other words, this is not an mq-deadline issue.
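
For anyone who wants to reproduce these numbers, the invocation is
roughly as follows (a sketch; it assumes the tests from the pull request
above have been merged into a blktests checkout):

  cd blktests
  ./check block/032          # the new test added by the pull request
  cat results/nodev/block/032.full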

Bart.

^ permalink raw reply	[flat|nested] 21+ messages in thread

Thread overview: 21+ messages
2021-09-27 22:03 [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
2021-09-27 22:03 ` [PATCH v2 1/4] block/mq-deadline: Improve request accounting further Bart Van Assche
2021-09-28  5:45   ` Hannes Reinecke
2021-09-27 22:03 ` [PATCH v2 2/4] block/mq-deadline: Add an invariant check Bart Van Assche
2021-09-28  5:47   ` Hannes Reinecke
2021-09-27 22:03 ` [PATCH v2 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
2021-09-28  5:51   ` Hannes Reinecke
2021-09-28 17:35     ` Bart Van Assche
2021-09-27 22:03 ` [PATCH v2 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
2021-09-27 22:53   ` Damien Le Moal
2021-09-27 23:12     ` Bart Van Assche
2021-09-28  5:53   ` Hannes Reinecke
2021-09-28 10:36   ` Niklas Cassel
2022-03-25 12:33   ` Oleksij Rempel
2022-03-25 13:07     ` Oleksij Rempel
2022-03-25 17:05     ` Bart Van Assche
2022-03-26  8:57       ` Oleksij Rempel
2022-03-28  2:20         ` Bart Van Assche
2022-03-28 22:24           ` Bart Van Assche
2022-03-29  4:49           ` Bart Van Assche
2021-09-27 23:14 ` [PATCH v2 0/4] Restore I/O priority support in the mq-deadline scheduler Jens Axboe
