* [PATCH 0/4] Restore I/O priority support in the mq-deadline scheduler
From: Bart Van Assche @ 2021-09-23 23:23 UTC
To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche
Hi Jens,
This patch series includes the following changes:
- Fix request accounting by counting requeued requests once.
- Test correctness of the request accounting code by triggering a kernel
warning from inside dd_exit_sched() if an inconsistency has been detected.
- Switch from per-CPU counters to individual counters.
- Restore I/O priority support in the mq-deadline scheduler. The performance
measurements in the description of patch 4/4 show that the performance
regressions in the previous version of this patch have been fixed. This has
been achieved by using 'jiffies' instead of ktime_get() and also by skipping
the aging mechanism if all queued requests have the same I/O priority. A
sketch of the difference between the two clock sources follows below.
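As a rough illustration (a minimal sketch, not the patch code: a jiffies
deadline needs only a wrap-safe integer comparison, while a ktime_get()
based check pays for a clock read on every call):

#include <linux/jiffies.h>
#include <linux/ktime.h>

/* Aging check with jiffies: a single wrap-safe integer comparison. */
static bool aged_jiffies(unsigned long start, unsigned long expire)
{
	return time_after(jiffies, start + expire);
}

/* The same check with ktime_t: every call reads the clock. */
static bool aged_ktime(ktime_t start, s64 expire_ns)
{
	return ktime_after(ktime_get(), ktime_add_ns(start, expire_ns));
}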
Please consider this patch series for kernel v5.16.
Thanks,
Bart.
Bart Van Assche (4):
block/mq-deadline: Improve request accounting further
block/mq-deadline: Add an invariant check
block/mq-deadline: Stop using per-CPU counters
block/mq-deadline: Prioritize high-priority requests
block/mq-deadline.c | 221 ++++++++++++++++++++++++++++----------------
1 file changed, 143 insertions(+), 78 deletions(-)
* [PATCH 1/4] block/mq-deadline: Improve request accounting further
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
The scheduler .insert_requests() callback is called when a request is
queued for the first time and also when it is requeued. Only count a
request the first time it is queued. Additionally, since the mq-deadline
scheduler only performs zone locking for requests that have been
inserted, skip the zone unlock code for requests that have not been
inserted into the mq-deadline scheduler.
Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 7f3c3932b723..b1175e4560ad 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
blk_req_zone_write_unlock(rq);
prio = ioprio_class_to_prio[ioprio_class];
- dd_count(dd, inserted, prio);
- rq->elv.priv[0] = (void *)(uintptr_t)1;
+ if (!rq->elv.priv[0]) {
+ dd_count(dd, inserted, prio);
+ rq->elv.priv[0] = (void *)(uintptr_t)1;
+ }
if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
blk_mq_free_requests(&free);
@@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
/*
* The block layer core may call dd_finish_request() without having
- * called dd_insert_requests(). Hence only update statistics for
- * requests for which dd_insert_requests() has been called. See also
- * blk_mq_request_bypass_insert().
+ * called dd_insert_requests(). Skip requests that bypassed I/O
+ * scheduling. See also blk_mq_request_bypass_insert().
*/
- if (rq->elv.priv[0])
- dd_count(dd, completed, prio);
+ if (!rq->elv.priv[0])
+ return;
+
+ dd_count(dd, completed, prio);
if (blk_queue_is_zoned(q)) {
unsigned long flags;
* [PATCH 2/4] block/mq-deadline: Add an invariant check
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
Check a statistics invariant at module unload time. When running
blktests, the invariant is verified every time a request queue is
removed and hence is verified at least once per test.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b1175e4560ad..6deb4306bcf3 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
deadline_remove_request(rq->q, per_prio, rq);
}
+/* Number of requests queued for a given priority level. */
+static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
+{
+ return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+}
+
/*
* deadline_check_fifo returns 0 if there are no expired requests on the fifo,
* 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
@@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
+ WARN_ONCE(dd_queued(dd, prio) != 0,
+ "statistics for priority %d: i %u m %u d %u c %u\n",
+ prio, dd_sum(dd, inserted, prio),
+ dd_sum(dd, merged, prio),
+ dd_sum(dd, dispatched, prio),
+ dd_sum(dd, completed, prio));
}
free_percpu(dd->stats);
@@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
return 0;
}
-/* Number of requests queued for a given priority level. */
-static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
-{
- return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
-}
-
static int dd_queued_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
* [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
Frequently calculating the sum over all CPUs of per-CPU counters is
inefficient. Hence switch from per-CPU to individual counters. Three
counters are protected by the mq-deadline spinlock since these are
only accessed from contexts that already hold that spinlock. The fourth
counter is atomic because protecting it with the mq-deadline spinlock
would trigger lock contention.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
1 file changed, 56 insertions(+), 68 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6deb4306bcf3..b0cfc84a4e6b 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -51,17 +51,16 @@ enum dd_prio {
enum { DD_PRIO_COUNT = 3 };
-/* I/O statistics per I/O priority. */
+/*
+ * I/O statistics per I/O priority. It is fine if these counters overflow.
+ * What matters is that these counters are at least as wide as
+ * log2(max_outstanding_requests).
+ */
struct io_stats_per_prio {
- local_t inserted;
- local_t merged;
- local_t dispatched;
- local_t completed;
-};
-
-/* I/O statistics for all I/O priorities (enum dd_prio). */
-struct io_stats {
- struct io_stats_per_prio stats[DD_PRIO_COUNT];
+ uint32_t inserted;
+ uint32_t merged;
+ uint32_t dispatched;
+ atomic_t completed;
};
/*
@@ -74,6 +73,7 @@ struct dd_per_prio {
struct list_head fifo_list[DD_DIR_COUNT];
/* Next request in FIFO order. Read, write or both are NULL. */
struct request *next_rq[DD_DIR_COUNT];
+ struct io_stats_per_prio stats;
};
struct deadline_data {
@@ -88,8 +88,6 @@ struct deadline_data {
unsigned int batching; /* number of sequential requests made */
unsigned int starved; /* times reads have starved writes */
- struct io_stats __percpu *stats;
-
/*
* settings that change how the i/o scheduler behaves
*/
@@ -103,33 +101,6 @@ struct deadline_data {
spinlock_t zone_lock;
};
-/* Count one event of type 'event_type' and with I/O priority 'prio' */
-#define dd_count(dd, event_type, prio) do { \
- struct io_stats *io_stats = get_cpu_ptr((dd)->stats); \
- \
- BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
- BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
- local_inc(&io_stats->stats[(prio)].event_type); \
- put_cpu_ptr(io_stats); \
-} while (0)
-
-/*
- * Returns the total number of dd_count(dd, event_type, prio) calls across all
- * CPUs. No locking or barriers since it is fine if the returned sum is slightly
- * outdated.
- */
-#define dd_sum(dd, event_type, prio) ({ \
- unsigned int cpu; \
- u32 sum = 0; \
- \
- BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
- BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
- for_each_present_cpu(cpu) \
- sum += local_read(&per_cpu_ptr((dd)->stats, cpu)-> \
- stats[(prio)].event_type); \
- sum; \
-})
-
/* Maps an I/O priority class to a deadline scheduler priority. */
static const enum dd_prio ioprio_class_to_prio[] = {
[IOPRIO_CLASS_NONE] = DD_BE_PRIO,
@@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
const u8 ioprio_class = dd_rq_ioclass(next);
const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
- dd_count(dd, merged, prio);
+ lockdep_assert_held(&dd->lock);
+
+ dd->per_prio[prio].stats.merged++;
/*
* if next expires before rq, assign its expire time to rq
@@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
/* Number of requests queued for a given priority level. */
static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
- return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+ const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
+
+ lockdep_assert_held(&dd->lock);
+
+ return stats->inserted - atomic_read(&stats->completed);
}
/*
@@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
done:
ioprio_class = dd_rq_ioclass(rq);
prio = ioprio_class_to_prio[ioprio_class];
- dd_count(dd, dispatched, prio);
+ dd->per_prio[prio].stats.dispatched++;
/*
* If the request needs its target zone locked, do it.
*/
@@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
+ const struct io_stats_per_prio *stats = &per_prio->stats;
+ uint32_t queued;
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
- WARN_ONCE(dd_queued(dd, prio) != 0,
+
+ spin_lock(&dd->lock);
+ queued = dd_queued(dd, prio);
+ spin_unlock(&dd->lock);
+
+ WARN_ONCE(queued,
"statistics for priority %d: i %u m %u d %u c %u\n",
- prio, dd_sum(dd, inserted, prio),
- dd_sum(dd, merged, prio),
- dd_sum(dd, dispatched, prio),
- dd_sum(dd, completed, prio));
+ prio, stats->inserted, stats->merged,
+ stats->dispatched, atomic_read(&stats->completed));
}
- free_percpu(dd->stats);
-
kfree(dd);
}
@@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
eq->elevator_data = dd;
- dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
- GFP_KERNEL | __GFP_ZERO);
- if (!dd->stats)
- goto free_dd;
-
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
q->elevator = eq;
return 0;
-free_dd:
- kfree(dd);
-
put_eq:
kobject_put(&eq->kobj);
return ret;
@@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
blk_req_zone_write_unlock(rq);
prio = ioprio_class_to_prio[ioprio_class];
+ per_prio = &dd->per_prio[prio];
if (!rq->elv.priv[0]) {
- dd_count(dd, inserted, prio);
+ per_prio->stats.inserted++;
rq->elv.priv[0] = (void *)(uintptr_t)1;
}
@@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
trace_block_rq_insert(rq);
- per_prio = &dd->per_prio[prio];
if (at_head) {
list_add(&rq->queuelist, &per_prio->dispatch);
} else {
@@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
if (!rq->elv.priv[0])
return;
- dd_count(dd, completed, prio);
+ atomic_inc(&per_prio->stats.completed);
if (blk_queue_is_zoned(q)) {
unsigned long flags;
@@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
struct deadline_data *dd = q->elevator->elevator_data;
+ u32 rt, be, idle;
+
+ spin_lock(&dd->lock);
+ rt = dd_queued(dd, DD_RT_PRIO);
+ be = dd_queued(dd, DD_BE_PRIO);
+ idle = dd_queued(dd, DD_IDLE_PRIO);
+ spin_unlock(&dd->lock);
+
+ seq_printf(m, "%u %u %u\n", rt, be, idle);
- seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
- dd_queued(dd, DD_BE_PRIO),
- dd_queued(dd, DD_IDLE_PRIO));
return 0;
}
/* Number of requests owned by the block driver for a given priority. */
static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
{
- return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
- - dd_sum(dd, completed, prio);
+ const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
+
+ lockdep_assert_held(&dd->lock);
+
+ return stats->dispatched + stats->merged -
+ atomic_read(&stats->completed);
}
static int dd_owned_by_driver_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
struct deadline_data *dd = q->elevator->elevator_data;
+ u32 rt, be, idle;
+
+ spin_lock(&dd->lock);
+ rt = dd_owned_by_driver(dd, DD_RT_PRIO);
+ be = dd_owned_by_driver(dd, DD_BE_PRIO);
+ idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
+ spin_unlock(&dd->lock);
+
+ seq_printf(m, "%u %u %u\n", rt, be, idle);
- seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
- dd_owned_by_driver(dd, DD_BE_PRIO),
- dd_owned_by_driver(dd, DD_IDLE_PRIO));
return 0;
}
* [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
Prioritize high-priority requests""), this patch uses 'jiffies' instead
of ktime_get() in the code for aging lower priority requests.
This patch has been tested as follows:
Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: 555 K IOPS.
Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: about 380 K IOPS.
Ran the following script:
set -e
scriptdir=$(dirname "$0")
if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
modprobe scsi_debug ndelay=1000000 max_queue=16
sd=''
while [ -z "$sd" ]; do
sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done
echo $((100*1000)) > "/sys/block/$sd/queue/iosched/aging_expire"
if [ -e /sys/fs/cgroup/io.prio.class ]; then
cd /sys/fs/cgroup
echo restrict-to-be >io.prio.class
echo +io > cgroup.subtree_control
else
cd /sys/fs/cgroup/blkio/
echo restrict-to-be >blkio.prio.class
fi
echo $$ >cgroup.procs
mkdir -p hipri
cd hipri
if [ -e io.prio.class ]; then
echo none-to-rt >io.prio.class
else
echo none-to-rt >blkio.prio.class
fi
{ "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
echo $$ >cgroup.procs
"${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
Result:
* 11000 IOPS for the high-priority job
* 40 IOPS for the low-priority job
If the aging expiry time is changed from 100 s to 0, the IOPS results change
to 6712 and 6796 IOPS.
The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 78 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 73 insertions(+), 5 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b0cfc84a4e6b..37440786fdbb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -31,6 +31,11 @@
*/
static const int read_expire = HZ / 2; /* max time before a read is submitted. */
static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+/*
+ * Time after which to dispatch lower priority requests even if higher
+ * priority requests are pending.
+ */
+static const int aging_expire = 10 * HZ;
static const int writes_starved = 2; /* max times reads can starve a write */
static const int fifo_batch = 16; /* # of sequential requests treated as one
by the above parameters. For throughput. */
@@ -96,6 +101,7 @@ struct deadline_data {
int writes_starved;
int front_merges;
u32 async_depth;
+ int aging_expire;
spinlock_t lock;
spinlock_t zone_lock;
@@ -338,12 +344,27 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
return rq;
}
+/*
+ * Returns true if and only if @rq started after @latest_start where
+ * @latest_start is in jiffies.
+ */
+static bool started_after(struct deadline_data *dd, struct request *rq,
+ unsigned long latest_start)
+{
+ unsigned long start_time = (unsigned long)rq->fifo_time;
+
+ start_time -= dd->fifo_expire[rq_data_dir(rq)];
+
+ return time_after(start_time, latest_start);
+}
+
/*
* deadline_dispatch_requests selects the best request according to
- * read/write expire, fifo_batch, etc
+ * read/write expire, fifo_batch, etc and with a start time <= @latest.
*/
static struct request *__dd_dispatch_request(struct deadline_data *dd,
- struct dd_per_prio *per_prio)
+ struct dd_per_prio *per_prio,
+ unsigned long latest_start)
{
struct request *rq, *next_rq;
enum dd_data_dir data_dir;
@@ -355,6 +376,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
if (!list_empty(&per_prio->dispatch)) {
rq = list_first_entry(&per_prio->dispatch, struct request,
queuelist);
+ if (started_after(dd, rq, latest_start))
+ return NULL;
list_del_init(&rq->queuelist);
goto done;
}
@@ -432,6 +455,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
dd->batching = 0;
dispatch_request:
+ if (started_after(dd, rq, latest_start))
+ return NULL;
/*
* rq is the selected appropriate request.
*/
@@ -449,6 +474,34 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
return rq;
}
+/*
+ * Check whether there are any requests with a deadline that expired more than
+ * aging_expire jiffies ago.
+ */
+static struct request *dd_dispatch_aged_requests(struct deadline_data *dd,
+ unsigned long now)
+{
+ struct request *rq;
+ enum dd_prio prio;
+ int prio_cnt;
+
+ lockdep_assert_held(&dd->lock);
+
+ prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
+ !!dd_queued(dd, DD_IDLE_PRIO);
+ if (prio_cnt < 2)
+ return NULL;
+
+ for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
+ rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
+ now - dd->aging_expire);
+ if (rq)
+ return rq;
+ }
+
+ return NULL;
+}
+
/*
* Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
*
@@ -460,15 +513,26 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
struct deadline_data *dd = hctx->queue->elevator->elevator_data;
- struct request *rq;
+ const unsigned long now = jiffies;
+ struct request *rq = NULL;
enum dd_prio prio;
spin_lock(&dd->lock);
+ rq = dd_dispatch_aged_requests(dd, now);
+ if (rq)
+ goto unlock;
+
+ /*
+ * Next, dispatch requests in priority order. Ignore lower priority
+ * requests if any higher priority requests are pending.
+ */
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
- rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
- if (rq)
+ rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
+ if (rq || dd_queued(dd, prio))
break;
}
+
+unlock:
spin_unlock(&dd->lock);
return rq;
@@ -573,6 +637,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
dd->front_merges = 1;
dd->last_dir = DD_WRITE;
dd->fifo_batch = fifo_batch;
+ dd->aging_expire = aging_expire;
spin_lock_init(&dd->lock);
spin_lock_init(&dd->zone_lock);
@@ -796,6 +861,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page) \
#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
+SHOW_JIFFIES(deadline_aging_expire_show, dd->aging_expire);
SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
SHOW_INT(deadline_front_merges_show, dd->front_merges);
SHOW_INT(deadline_async_depth_show, dd->front_merges);
@@ -825,6 +891,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
+STORE_JIFFIES(deadline_aging_expire_store, &dd->aging_expire, 0, INT_MAX);
STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
@@ -843,6 +910,7 @@ static struct elv_fs_entry deadline_attrs[] = {
DD_ATTR(front_merges),
DD_ATTR(async_depth),
DD_ATTR(fifo_batch),
+ DD_ATTR(aging_expire),
__ATTR_NULL
};
* Re: [PATCH 1/4] block/mq-deadline: Improve request accounting further
From: Damien Le Moal @ 2021-09-24 10:54 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> The scheduler .insert_requests() callback is called when a request is
> queued for the first time and also when it is requeued. Only count a
> request the first time it is queued. Additionally, since the mq-deadline
> scheduler only performs zone locking for requests that have been
> inserted, skip the zone unlock code for requests that have not been
> inserted into the mq-deadline scheduler.
>
> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 7f3c3932b723..b1175e4560ad 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, inserted, prio);
> - rq->elv.priv[0] = (void *)(uintptr_t)1;
> + if (!rq->elv.priv[0]) {
> + dd_count(dd, inserted, prio);
> + rq->elv.priv[0] = (void *)(uintptr_t)1;
> + }
>
> if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
> blk_mq_free_requests(&free);
> @@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
>
> /*
> * The block layer core may call dd_finish_request() without having
> - * called dd_insert_requests(). Hence only update statistics for
> - * requests for which dd_insert_requests() has been called. See also
> - * blk_mq_request_bypass_insert().
> + * called dd_insert_requests(). Skip requests that bypassed I/O
> + * scheduling. See also blk_mq_request_bypass_insert().
> */
> - if (rq->elv.priv[0])
> - dd_count(dd, completed, prio);
> + if (!rq->elv.priv[0])
> + return;
> +
> + dd_count(dd, completed, prio);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
>
Looks good to me.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 2/4] block/mq-deadline: Add an invariant check
From: Damien Le Moal @ 2021-09-24 10:54 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> Check a statistics invariant at module unload time. When running
> blktests, the invariant is verified every time a request queue is
> removed and hence is verified at least once per test.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b1175e4560ad..6deb4306bcf3 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> deadline_remove_request(rq->q, per_prio, rq);
> }
>
> +/* Number of requests queued for a given priority level. */
> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> + return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +}
> +
> /*
> * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> @@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> + WARN_ONCE(dd_queued(dd, prio) != 0,
> + "statistics for priority %d: i %u m %u d %u c %u\n",
> + prio, dd_sum(dd, inserted, prio),
> + dd_sum(dd, merged, prio),
> + dd_sum(dd, dispatched, prio),
> + dd_sum(dd, completed, prio));
> }
>
> free_percpu(dd->stats);
> @@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
> return 0;
> }
>
> -/* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> -{
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> -}
> -
> static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
>
Looks good.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Damien Le Moal @ 2021-09-24 10:58 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> Frequently calculating the sum over all CPUs of per-CPU counters is
> inefficient. Hence switch from per-CPU to individual counters. Three
> counters are protected by the mq-deadline spinlock since these are
> only accessed from contexts that already hold that spinlock. The fourth
> counter is atomic because protecting it with the mq-deadline spinlock
> would trigger lock contention.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
> 1 file changed, 56 insertions(+), 68 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 6deb4306bcf3..b0cfc84a4e6b 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -51,17 +51,16 @@ enum dd_prio {
>
> enum { DD_PRIO_COUNT = 3 };
>
> -/* I/O statistics per I/O priority. */
> +/*
> + * I/O statistics per I/O priority. It is fine if these counters overflow.
> + * What matters is that these counters are at least as wide as
> + * log2(max_outstanding_requests).
> + */
> struct io_stats_per_prio {
> - local_t inserted;
> - local_t merged;
> - local_t dispatched;
> - local_t completed;
> -};
> -
> -/* I/O statistics for all I/O priorities (enum dd_prio). */
> -struct io_stats {
> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
> + uint32_t inserted;
> + uint32_t merged;
> + uint32_t dispatched;
> + atomic_t completed;
Why not use 64-bits types (regular unsigned long long and atomic64_t) ?
> };
>
> /*
> @@ -74,6 +73,7 @@ struct dd_per_prio {
> struct list_head fifo_list[DD_DIR_COUNT];
> /* Next request in FIFO order. Read, write or both are NULL. */
> struct request *next_rq[DD_DIR_COUNT];
> + struct io_stats_per_prio stats;
> };
>
> struct deadline_data {
> @@ -88,8 +88,6 @@ struct deadline_data {
> unsigned int batching; /* number of sequential requests made */
> unsigned int starved; /* times reads have starved writes */
>
> - struct io_stats __percpu *stats;
> -
> /*
> * settings that change how the i/o scheduler behaves
> */
> @@ -103,33 +101,6 @@ struct deadline_data {
> spinlock_t zone_lock;
> };
>
> -/* Count one event of type 'event_type' and with I/O priority 'prio' */
> -#define dd_count(dd, event_type, prio) do { \
> - struct io_stats *io_stats = get_cpu_ptr((dd)->stats); \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - local_inc(&io_stats->stats[(prio)].event_type); \
> - put_cpu_ptr(io_stats); \
> -} while (0)
> -
> -/*
> - * Returns the total number of dd_count(dd, event_type, prio) calls across all
> - * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> - * outdated.
> - */
> -#define dd_sum(dd, event_type, prio) ({ \
> - unsigned int cpu; \
> - u32 sum = 0; \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - for_each_present_cpu(cpu) \
> - sum += local_read(&per_cpu_ptr((dd)->stats, cpu)-> \
> - stats[(prio)].event_type); \
> - sum; \
> -})
> -
> /* Maps an I/O priority class to a deadline scheduler priority. */
> static const enum dd_prio ioprio_class_to_prio[] = {
> [IOPRIO_CLASS_NONE] = DD_BE_PRIO,
> @@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
> const u8 ioprio_class = dd_rq_ioclass(next);
> const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>
> - dd_count(dd, merged, prio);
> + lockdep_assert_held(&dd->lock);
> +
> + dd->per_prio[prio].stats.merged++;
>
> /*
> * if next expires before rq, assign its expire time to rq
> @@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> /* Number of requests queued for a given priority level. */
> static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->inserted - atomic_read(&stats->completed);
> }
>
> /*
> @@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> done:
> ioprio_class = dd_rq_ioclass(rq);
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, dispatched, prio);
> + dd->per_prio[prio].stats.dispatched++;
> /*
> * If the request needs its target zone locked, do it.
> */
> @@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
> + const struct io_stats_per_prio *stats = &per_prio->stats;
> + uint32_t queued;
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> - WARN_ONCE(dd_queued(dd, prio) != 0,
> +
> + spin_lock(&dd->lock);
> + queued = dd_queued(dd, prio);
> + spin_unlock(&dd->lock);
> +
> + WARN_ONCE(queued,
> "statistics for priority %d: i %u m %u d %u c %u\n",
> - prio, dd_sum(dd, inserted, prio),
> - dd_sum(dd, merged, prio),
> - dd_sum(dd, dispatched, prio),
> - dd_sum(dd, completed, prio));
> + prio, stats->inserted, stats->merged,
> + stats->dispatched, atomic_read(&stats->completed));
> }
>
> - free_percpu(dd->stats);
> -
> kfree(dd);
> }
>
> @@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>
> eq->elevator_data = dd;
>
> - dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
> - GFP_KERNEL | __GFP_ZERO);
> - if (!dd->stats)
> - goto free_dd;
> -
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
>
> @@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
> q->elevator = eq;
> return 0;
>
> -free_dd:
> - kfree(dd);
> -
> put_eq:
> kobject_put(&eq->kobj);
> return ret;
> @@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> + per_prio = &dd->per_prio[prio];
> if (!rq->elv.priv[0]) {
> - dd_count(dd, inserted, prio);
> + per_prio->stats.inserted++;
> rq->elv.priv[0] = (void *)(uintptr_t)1;
> }
>
> @@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
> trace_block_rq_insert(rq);
>
> - per_prio = &dd->per_prio[prio];
> if (at_head) {
> list_add(&rq->queuelist, &per_prio->dispatch);
> } else {
> @@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
> if (!rq->elv.priv[0])
> return;
>
> - dd_count(dd, completed, prio);
> + atomic_inc(&per_prio->stats.completed);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
> @@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_queued(dd, DD_RT_PRIO);
> + be = dd_queued(dd, DD_BE_PRIO);
> + idle = dd_queued(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
> - dd_queued(dd, DD_BE_PRIO),
> - dd_queued(dd, DD_IDLE_PRIO));
> return 0;
> }
>
> /* Number of requests owned by the block driver for a given priority. */
> static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
> - - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->dispatched + stats->merged -
> + atomic_read(&stats->completed);
> }
>
> static int dd_owned_by_driver_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_owned_by_driver(dd, DD_RT_PRIO);
> + be = dd_owned_by_driver(dd, DD_BE_PRIO);
> + idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> - dd_owned_by_driver(dd, DD_BE_PRIO),
> - dd_owned_by_driver(dd, DD_IDLE_PRIO));
> return 0;
> }
>
>
Apart from the nit/comment above, looks OK to me.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests
From: Damien Le Moal @ 2021-09-24 11:08 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
> Prioritize high-priority requests""), this patch uses 'jiffies' instead
> of ktime_get() in the code for aging lower priority requests.
>
> This patch has been tested as follows:
>
> Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: 555 K IOPS.
>
> Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: about 380 K IOPS.
>
> Ran the following script:
>
> set -e
> scriptdir=$(dirname "$0")
> if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
> modprobe scsi_debug ndelay=1000000 max_queue=16
> sd=''
> while [ -z "$sd" ]; do
> sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done
> echo $((100*1000)) > "/sys/block/$sd/queue/iosched/aging_expire"
> if [ -e /sys/fs/cgroup/io.prio.class ]; then
> cd /sys/fs/cgroup
> echo restrict-to-be >io.prio.class
> echo +io > cgroup.subtree_control
> else
> cd /sys/fs/cgroup/blkio/
> echo restrict-to-be >blkio.prio.class
> fi
> echo $$ >cgroup.procs
> mkdir -p hipri
> cd hipri
> if [ -e io.prio.class ]; then
> echo none-to-rt >io.prio.class
> else
> echo none-to-rt >blkio.prio.class
> fi
> { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
> echo $$ >cgroup.procs
> "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
>
> Result:
> * 11000 IOPS for the high-priority job
> * 40 IOPS for the low-priority job
>
> If the aging expiry time is changed from 100 s to 0, the IOPS results change
> to 6712 and 6796 IOPS.
>
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 78 ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 73 insertions(+), 5 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b0cfc84a4e6b..37440786fdbb 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -31,6 +31,11 @@
> */
> static const int read_expire = HZ / 2; /* max time before a read is submitted. */
> static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +/*
> + * Time after which to dispatch lower priority requests even if higher
> + * priority requests are pending.
> + */
> +static const int aging_expire = 10 * HZ;
What about calling this prio_starved, to be consistent with writes_starved?
Or at least let's call it prio_aging_expire to show that it is for priority
levels, and not for requests within a priority queue.
> static const int writes_starved = 2; /* max times reads can starve a write */
> static const int fifo_batch = 16; /* # of sequential requests treated as one
> by the above parameters. For throughput. */
> @@ -96,6 +101,7 @@ struct deadline_data {
> int writes_starved;
> int front_merges;
> u32 async_depth;
> + int aging_expire;
>
> spinlock_t lock;
> spinlock_t zone_lock;
> @@ -338,12 +344,27 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> return rq;
> }
>
> +/*
> + * Returns true if and only if @rq started after @latest_start where
> + * @latest_start is in jiffies.
> + */
> +static bool started_after(struct deadline_data *dd, struct request *rq,
> + unsigned long latest_start)
> +{
> + unsigned long start_time = (unsigned long)rq->fifo_time;
> +
> + start_time -= dd->fifo_expire[rq_data_dir(rq)];
> +
> + return time_after(start_time, latest_start);
> +}
> +
> /*
> * deadline_dispatch_requests selects the best request according to
> - * read/write expire, fifo_batch, etc
> + * read/write expire, fifo_batch, etc and with a start time <= @latest.
> */
> static struct request *__dd_dispatch_request(struct deadline_data *dd,
> - struct dd_per_prio *per_prio)
> + struct dd_per_prio *per_prio,
> + unsigned long latest_start)
> {
> struct request *rq, *next_rq;
> enum dd_data_dir data_dir;
> @@ -355,6 +376,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> if (!list_empty(&per_prio->dispatch)) {
> rq = list_first_entry(&per_prio->dispatch, struct request,
> queuelist);
> + if (started_after(dd, rq, latest_start))
> + return NULL;
> list_del_init(&rq->queuelist);
> goto done;
> }
> @@ -432,6 +455,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> dd->batching = 0;
>
> dispatch_request:
> + if (started_after(dd, rq, latest_start))
> + return NULL;
Nit: add a blank line here? (for aesthetics :))
> /*
> * rq is the selected appropriate request.
> */
> @@ -449,6 +474,34 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> return rq;
> }
>
> +/*
> + * Check whether there are any requests with a deadline that expired more than
> + * aging_expire jiffies ago.
Hmm... requests do not have a "deadline", so this is a little hard to
understand. What about something like this:
* Check whether there are any requests at a low priority level inserted
* more than aging_expire jiffies ago.
> + */
> +static struct request *dd_dispatch_aged_requests(struct deadline_data *dd,
> + unsigned long now)
Same remark as for the aging_expire variable: it may be good to have prio in the
name of this function. Something like:
dd_dispatch_prio_starved_requests()?
> +{
> + struct request *rq;
> + enum dd_prio prio;
> + int prio_cnt;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
> + !!dd_queued(dd, DD_IDLE_PRIO);
> + if (prio_cnt < 2)
> + return NULL;
> +
> + for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> + rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
> + now - dd->aging_expire);
> + if (rq)
> + return rq;
> + }
> +
> + return NULL;
> +}
> +
> /*
> * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
> *
> @@ -460,15 +513,26 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> {
> struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> - struct request *rq;
> + const unsigned long now = jiffies;
> + struct request *rq = NULL;
> enum dd_prio prio;
>
> spin_lock(&dd->lock);
Nit: add a blank line here to have the spin_lock() stand out?
Easier to notice it that way...
> + rq = dd_dispatch_aged_requests(dd, now);
> + if (rq)
> + goto unlock;
> +
> + /*
> + * Next, dispatch requests in priority order. Ignore lower priority
> + * requests if any higher priority requests are pending.
> + */
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> - rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
> - if (rq)
> + rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
> + if (rq || dd_queued(dd, prio))
> break;
> }
> +
> +unlock:
> spin_unlock(&dd->lock);
>
> return rq;
> @@ -573,6 +637,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
> dd->front_merges = 1;
> dd->last_dir = DD_WRITE;
> dd->fifo_batch = fifo_batch;
> + dd->aging_expire = aging_expire;
> spin_lock_init(&dd->lock);
> spin_lock_init(&dd->zone_lock);
>
> @@ -796,6 +861,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page) \
> #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
> SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
> SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
> +SHOW_JIFFIES(deadline_aging_expire_show, dd->aging_expire);
> SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
> SHOW_INT(deadline_front_merges_show, dd->front_merges);
> SHOW_INT(deadline_async_depth_show, dd->front_merges);
> @@ -825,6 +891,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
> STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
> STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
> STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
> +STORE_JIFFIES(deadline_aging_expire_store, &dd->aging_expire, 0, INT_MAX);
> STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
> STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
> STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
> @@ -843,6 +910,7 @@ static struct elv_fs_entry deadline_attrs[] = {
> DD_ATTR(front_merges),
> DD_ATTR(async_depth),
> DD_ATTR(fifo_batch),
> + DD_ATTR(aging_expire),
> __ATTR_NULL
> };
>
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Bart Van Assche @ 2021-09-25 2:59 UTC
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 9/24/21 03:58, Damien Le Moal wrote:
> On 2021/09/24 8:27, Bart Van Assche wrote:
>> -/* I/O statistics for all I/O priorities (enum dd_prio). */
>> -struct io_stats {
>> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
>> + uint32_t inserted;
>> + uint32_t merged;
>> + uint32_t dispatched;
>> + atomic_t completed;
>
> Why not use 64-bits types (regular unsigned long long and atomic64_t) ?
Even 64-bit counters can overflow. Using 32-bit counters makes it easier to
trigger an overflow of these counters.
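As a back-of-the-envelope illustration (not part of the patch; the 555 K
IOPS figure is taken from the patch 4/4 description):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const double iops = 555000.0;	/* QD=1 nullb result from patch 4/4 */

	/* Time until a free-running counter of each width wraps around. */
	printf("u32 wraps after %.1f hours\n",
	       4294967296.0 / iops / 3600.0);
	printf("u64 wraps after %.2e years\n",
	       18446744073709551616.0 / iops / 3600.0 / 24.0 / 365.0);
	return 0;
}

At that rate a u32 counter wraps after about 2.1 hours, while a u64 counter
would take on the order of a million years, so u32 wraparound is testable in
practice.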
Thanks,
Bart.
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Damien Le Moal @ 2021-09-27 0:19 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/25 11:59, Bart Van Assche wrote:
> On 9/24/21 03:58, Damien Le Moal wrote:
>> On 2021/09/24 8:27, Bart Van Assche wrote:
>>> -/* I/O statistics for all I/O priorities (enum dd_prio). */
>>> -struct io_stats {
>>> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
>>> + uint32_t inserted;
>>> + uint32_t merged;
>>> + uint32_t dispatched;
>>> + atomic_t completed;
>>
>> Why not use 64-bits types (regular unsigned long long and atomic64_t) ?
>
> Even 64-bit counters can overflow. Using 32-bit counters makes it easier to
> trigger an overflow of these counters.
I was thinking more about the speed of additions/subtractions on 64-bit
arches, which are a large part (the majority?) of mq-deadline users. Not sure
whether there is a difference in speed between 32-bit and 64-bit simple math
on a 64-bit arch, though. Probably not.
Another thing: in patch 3, you are actually not handling the overflows. So
dd_queued() may return some very weird number (temporarily) when the inserted
count overflows before the completed count does. Since
dd_dispatch_aged_requests() does not care about the actual value of dd_queued(),
only whether it is 0, I am not 100% sure it is useful to fix. Except maybe
for the sysfs attributes?
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Bart Van Assche @ 2021-09-27 2:38 UTC
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 9/26/21 17:19, Damien Le Moal wrote:
> Another thing: in patch 3, you are actually not handling the overflows. So
> dd_queued() may return some very weird number (temporarily) when the inserted
> count overflows before the completed count does. Since
> dd_dispatch_aged_requests() does not care about the actual value of dd_queued(),
> only if it is 0 or not, I am not 100% sure if it is useful to fix. Except maybe
> for sysfs attributes ?
Hmm ... I'm not following. I think it can be proven mathematically that
dd_queued() returns a number in the range [0, max_queued_requests). Here is
an example:
* inserted = 1 (wrapped around).
* completed = UINT32_MAX - 1 (about to wrap but has not yet wrapped around).
* dd_queued() returns 2.
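A minimal userspace sketch of this modulo-2^32 arithmetic (illustrative
only; note that, as pointed out below, the difference in this example is
actually 3):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t inserted = 1;			/* wrapped around */
	uint32_t completed = UINT32_MAX - 1;	/* not yet wrapped */

	/* Unsigned subtraction is modulo 2^32, so the result is the true
	 * number of queued requests even across a wrap: this prints 3. */
	printf("queued = %u\n", inserted - completed);
	return 0;
}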
Bart.
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Damien Le Moal @ 2021-09-27 3:41 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/27 11:38, Bart Van Assche wrote:
> On 9/26/21 17:19, Damien Le Moal wrote:
>> Another thing: in patch 3, you are actually not handling the overflows. So
>> dd_queued() may return some very weird number (temporarily) when the inserted
>> count overflows before the completed count does. Since
>> dd_dispatch_aged_requests() does not care about the actual value of dd_queued(),
>> only if it is 0 or not, I am not 100% sure if it is useful to fix. Except maybe
>> for sysfs attributes ?
>
> Hmm ... I'm not following. I think it can be proven mathematically that
> dd_queued() returns a number in the range [0, max_queued_requests). Here is
> an example:
> * inserted = 1 (wrapped around).
> * completed = UINT32_MAX - 1 (about to wrap but has not yet wrapped around).
> * dd_queued() returns 2.
You mean 3, right ?
But yes... Slow Monday, I need more coffee :)
Cheers.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/4] block/mq-deadline: Improve request accounting further
From: Niklas Cassel @ 2021-09-27 15:53 UTC
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
Damien Le Moal, Hannes Reinecke
On Thu, Sep 23, 2021 at 04:26:52PM -0700, Bart Van Assche wrote:
> The scheduler .insert_requests() callback is called when a request is
> queued for the first time and also when it is requeued. Only count a
> request the first time it is queued. Additionally, since the mq-deadline
> scheduler only performs zone locking for requests that have been
> inserted, skip the zone unlock code for requests that have not been
> inserted into the mq-deadline scheduler.
>
> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 7f3c3932b723..b1175e4560ad 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, inserted, prio);
> - rq->elv.priv[0] = (void *)(uintptr_t)1;
> + if (!rq->elv.priv[0]) {
> + dd_count(dd, inserted, prio);
> + rq->elv.priv[0] = (void *)(uintptr_t)1;
> + }
>
> if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
> blk_mq_free_requests(&free);
> @@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
>
> /*
> * The block layer core may call dd_finish_request() without having
> - * called dd_insert_requests(). Hence only update statistics for
> - * requests for which dd_insert_requests() has been called. See also
> - * blk_mq_request_bypass_insert().
> + * called dd_insert_requests(). Skip requests that bypassed I/O
> + * scheduling. See also blk_mq_request_bypass_insert().
> */
> - if (rq->elv.priv[0])
> - dd_count(dd, completed, prio);
> + if (!rq->elv.priv[0])
> + return;
> +
> + dd_count(dd, completed, prio);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
* Re: [PATCH 2/4] block/mq-deadline: Add an invariant check
From: Niklas Cassel @ 2021-09-27 15:53 UTC
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
Damien Le Moal, Hannes Reinecke
On Thu, Sep 23, 2021 at 04:26:53PM -0700, Bart Van Assche wrote:
> Check a statistics invariant at module unload time. When running
> blktests, the invariant is verified every time a request queue is
> removed and hence is verified at least once per test.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b1175e4560ad..6deb4306bcf3 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> deadline_remove_request(rq->q, per_prio, rq);
> }
>
> +/* Number of requests queued for a given priority level. */
> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> + return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +}
> +
> /*
> * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> @@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> + WARN_ONCE(dd_queued(dd, prio) != 0,
> + "statistics for priority %d: i %u m %u d %u c %u\n",
> + prio, dd_sum(dd, inserted, prio),
> + dd_sum(dd, merged, prio),
> + dd_sum(dd, dispatched, prio),
> + dd_sum(dd, completed, prio));
> }
>
> free_percpu(dd->stats);
> @@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
> return 0;
> }
>
> -/* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> -{
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> -}
> -
> static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
^ permalink raw reply [flat|nested] 17+ messages in thread
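The invariant the patch checks is simply that every inserted request has
completed by teardown time, i.e. inserted - completed == 0. A self-contained
sketch of the same check, with assert() standing in for WARN_ONCE() and
hypothetical mock types:

/*
 * Sketch of the exit-time invariant from the patch above: once the
 * queue is drained, inserted minus completed must be zero.
 */
#include <assert.h>
#include <stdint.h>

struct mock_stats {
	uint32_t inserted;
	uint32_t completed;
};

static uint32_t queued(const struct mock_stats *s)
{
	/*
	 * Unsigned subtraction stays correct even after both counters
	 * have wrapped, as long as the difference fits in 32 bits.
	 */
	return s->inserted - s->completed;
}

static void exit_sched(const struct mock_stats *s)
{
	/*
	 * The kernel uses WARN_ONCE() here; assert() is the closest
	 * userspace analogue for a sketch.
	 */
	assert(queued(s) == 0);
}

int main(void)
{
	struct mock_stats s = { .inserted = 7, .completed = 7 };

	exit_sched(&s);		/* passes: no requests leaked */
	return 0;
}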
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
2021-09-23 23:26 ` [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
2021-09-24 10:58 ` Damien Le Moal
@ 2021-09-27 15:53 ` Niklas Cassel
1 sibling, 0 replies; 17+ messages in thread
From: Niklas Cassel @ 2021-09-27 15:53 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
Damien Le Moal, Hannes Reinecke
On Thu, Sep 23, 2021 at 04:26:54PM -0700, Bart Van Assche wrote:
> Calculating the sum over all CPUs of per-CPU counters frequently is
> inefficient. Hence switch from per-CPU to individual counters. Three
> counters are protected by the mq-deadline spinlock since these are
> only accessed from contexts that already hold that spinlock. The fourth
> counter is atomic because protecting it with the mq-deadline spinlock
> would trigger lock contention.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
> 1 file changed, 56 insertions(+), 68 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 6deb4306bcf3..b0cfc84a4e6b 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -51,17 +51,16 @@ enum dd_prio {
>
> enum { DD_PRIO_COUNT = 3 };
>
> -/* I/O statistics per I/O priority. */
> +/*
> + * I/O statistics per I/O priority. It is fine if these counters overflow.
> + * What matters is that these counters are at least as wide as
> + * log2(max_outstanding_requests).
> + */
> struct io_stats_per_prio {
> - local_t inserted;
> - local_t merged;
> - local_t dispatched;
> - local_t completed;
> -};
> -
> -/* I/O statistics for all I/O priorities (enum dd_prio). */
> -struct io_stats {
> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
> + uint32_t inserted;
> + uint32_t merged;
> + uint32_t dispatched;
> + atomic_t completed;
> };
>
> /*
> @@ -74,6 +73,7 @@ struct dd_per_prio {
> struct list_head fifo_list[DD_DIR_COUNT];
> /* Next request in FIFO order. Read, write or both are NULL. */
> struct request *next_rq[DD_DIR_COUNT];
> + struct io_stats_per_prio stats;
> };
>
> struct deadline_data {
> @@ -88,8 +88,6 @@ struct deadline_data {
> unsigned int batching; /* number of sequential requests made */
> unsigned int starved; /* times reads have starved writes */
>
> - struct io_stats __percpu *stats;
> -
> /*
> * settings that change how the i/o scheduler behaves
> */
> @@ -103,33 +101,6 @@ struct deadline_data {
> spinlock_t zone_lock;
> };
>
> -/* Count one event of type 'event_type' and with I/O priority 'prio' */
> -#define dd_count(dd, event_type, prio) do { \
> - struct io_stats *io_stats = get_cpu_ptr((dd)->stats); \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - local_inc(&io_stats->stats[(prio)].event_type); \
> - put_cpu_ptr(io_stats); \
> -} while (0)
> -
> -/*
> - * Returns the total number of dd_count(dd, event_type, prio) calls across all
> - * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> - * outdated.
> - */
> -#define dd_sum(dd, event_type, prio) ({ \
> - unsigned int cpu; \
> - u32 sum = 0; \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - for_each_present_cpu(cpu) \
> - sum += local_read(&per_cpu_ptr((dd)->stats, cpu)-> \
> - stats[(prio)].event_type); \
> - sum; \
> -})
> -
> /* Maps an I/O priority class to a deadline scheduler priority. */
> static const enum dd_prio ioprio_class_to_prio[] = {
> [IOPRIO_CLASS_NONE] = DD_BE_PRIO,
> @@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
> const u8 ioprio_class = dd_rq_ioclass(next);
> const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>
> - dd_count(dd, merged, prio);
> + lockdep_assert_held(&dd->lock);
> +
> + dd->per_prio[prio].stats.merged++;
>
> /*
> * if next expires before rq, assign its expire time to rq
> @@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> /* Number of requests queued for a given priority level. */
> static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->inserted - atomic_read(&stats->completed);
> }
>
> /*
> @@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> done:
> ioprio_class = dd_rq_ioclass(rq);
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, dispatched, prio);
> + dd->per_prio[prio].stats.dispatched++;
> /*
> * If the request needs its target zone locked, do it.
> */
> @@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
> + const struct io_stats_per_prio *stats = &per_prio->stats;
> + uint32_t queued;
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> - WARN_ONCE(dd_queued(dd, prio) != 0,
> +
> + spin_lock(&dd->lock);
> + queued = dd_queued(dd, prio);
> + spin_unlock(&dd->lock);
> +
> + WARN_ONCE(queued,
> "statistics for priority %d: i %u m %u d %u c %u\n",
> - prio, dd_sum(dd, inserted, prio),
> - dd_sum(dd, merged, prio),
> - dd_sum(dd, dispatched, prio),
> - dd_sum(dd, completed, prio));
> + prio, stats->inserted, stats->merged,
> + stats->dispatched, atomic_read(&stats->completed));
> }
>
> - free_percpu(dd->stats);
> -
> kfree(dd);
> }
>
> @@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>
> eq->elevator_data = dd;
>
> - dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
> - GFP_KERNEL | __GFP_ZERO);
> - if (!dd->stats)
> - goto free_dd;
> -
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
>
> @@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
> q->elevator = eq;
> return 0;
>
> -free_dd:
> - kfree(dd);
> -
> put_eq:
> kobject_put(&eq->kobj);
> return ret;
> @@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> + per_prio = &dd->per_prio[prio];
> if (!rq->elv.priv[0]) {
> - dd_count(dd, inserted, prio);
> + per_prio->stats.inserted++;
> rq->elv.priv[0] = (void *)(uintptr_t)1;
> }
>
> @@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
> trace_block_rq_insert(rq);
>
> - per_prio = &dd->per_prio[prio];
> if (at_head) {
> list_add(&rq->queuelist, &per_prio->dispatch);
> } else {
> @@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
> if (!rq->elv.priv[0])
> return;
>
> - dd_count(dd, completed, prio);
> + atomic_inc(&per_prio->stats.completed);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
> @@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_queued(dd, DD_RT_PRIO);
> + be = dd_queued(dd, DD_BE_PRIO);
> + idle = dd_queued(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
> - dd_queued(dd, DD_BE_PRIO),
> - dd_queued(dd, DD_IDLE_PRIO));
> return 0;
> }
>
> /* Number of requests owned by the block driver for a given priority. */
> static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
> - - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->dispatched + stats->merged -
> + atomic_read(&stats->completed);
> }
>
> static int dd_owned_by_driver_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_owned_by_driver(dd, DD_RT_PRIO);
> + be = dd_owned_by_driver(dd, DD_BE_PRIO);
> + idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> - dd_owned_by_driver(dd, DD_BE_PRIO),
> - dd_owned_by_driver(dd, DD_IDLE_PRIO));
> return 0;
> }
>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
^ permalink raw reply [flat|nested] 17+ messages in thread
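The counter scheme in this patch reduces to three plain integers that are
only ever modified under the scheduler lock, plus one atomic counter for the
completion path that runs outside it. A userspace sketch using C11 mutexes
and atomics as stand-ins for spinlock_t and atomic_t (all names are
hypothetical; <threads.h> is assumed to be available):

/*
 * Sketch of the counter scheme from the patch above. Three counters
 * are only touched under 'lock'; 'completed' is atomic because the
 * completion path does not hold that lock.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <threads.h>

struct mock_stats {
	/* Updated only while holding 'lock'. */
	uint32_t inserted;
	uint32_t merged;
	uint32_t dispatched;
	/* Updated from completion context without the lock. */
	atomic_uint completed;
};

static mtx_t lock;
static struct mock_stats stats;

static void count_inserted(void)
{
	mtx_lock(&lock);
	stats.inserted++;
	mtx_unlock(&lock);
}

static void count_completed(void)
{
	/* No lock needed: a single atomic increment cannot race. */
	atomic_fetch_add_explicit(&stats.completed, 1,
				  memory_order_relaxed);
}

static uint32_t queued(void)
{
	uint32_t q;

	/*
	 * Reading the lock-protected counter requires the lock, just
	 * as dd_queued() asserts with lockdep_assert_held().
	 */
	mtx_lock(&lock);
	q = stats.inserted - atomic_load(&stats.completed);
	mtx_unlock(&lock);
	return q;
}

int main(void)
{
	mtx_init(&lock, mtx_plain);
	count_inserted();
	count_completed();
	return queued();	/* 0: all inserted requests completed */
}

The design point mirrors the commit message: summing per-CPU counters on
every read is what was expensive, so plain counters under an already-held
lock, plus one atomic where the lock is not held, avoid both the summation
cost and new lock contention.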
* Re: [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests
2021-09-24 11:08 ` Damien Le Moal
@ 2021-09-27 20:03 ` Bart Van Assche
0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2021-09-27 20:03 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 9/24/21 4:08 AM, Damien Le Moal wrote:
> On 2021/09/24 8:27, Bart Van Assche wrote:
>> +/*
>> + * Time after which to dispatch lower priority requests even if higher
>> + * priority requests are pending.
>> + */
>> +static const int aging_expire = 10 * HZ;
>
> What about calling this prio_starved, to be consistent with writes_starved ?
> Or at least let's call it prio_aging_expire to show that it is for priority
> levels, and not for requests within a priority queue.
The name 'prio_starved' probably would cause confusion since this parameter
represents a timeout while 'writes_starved' represents a count. I will rename
it to prio_aging_expire.
>> dispatch_request:
>> + if (started_after(dd, rq, latest_start))
>> + return NULL;
>
> Nit: add a blank line here ? (for aesthetic :))
Will do.
>> +/*
>> + * Check whether there are any requests with a deadline that expired more than
>> + * aging_expire jiffies ago.
>
> Hmm... requests do not have a "deadline", so this is a little hard to
> understand. What about something like this:
>
> * Check whether there are any requests at a low priority level inserted
> * more than aging_expire jiffies ago.
Agreed, I will clarify this comment.
>> + */
>> +static struct request *dd_dispatch_aged_requests(struct deadline_data *dd,
>> + unsigned long now)
>
> Same remark as for the aging_expire variable: it may be good to have prio in the
> name of this function. Something like:
>
> dd_dispatch_prio_starved_requests() ?
The name 'aging' has a well-defined meaning in scheduling, so I would like to keep it. See also
https://en.wikipedia.org/wiki/Aging_(scheduling).
>> spin_lock(&dd->lock);
>
> Nit: Add a blank line here to have the spin_lock() stand out ?
> Easier to notice it that way...
The convention in kernel code is to leave a blank line above statements that
grab a lock, but not below them.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 17+ messages in thread
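The aging mechanism discussed above reduces to a wrap-safe time comparison
against the tick count recorded when a request was inserted. A userspace
sketch of that check, with local stand-ins for the kernel's jiffies and
time_after() (the request layout and tick rate are hypothetical):

/*
 * Sketch of the aging check: a lower-priority request becomes eligible
 * for forced dispatch once it has been queued longer than
 * prio_aging_expire ticks, regardless of pending higher-priority work.
 */
#include <stdbool.h>

/* Wrap-safe "a is later than b" comparison, as in the kernel. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

#define HZ			100UL	/* hypothetical tick rate */

static const unsigned long prio_aging_expire = 10 * HZ;

struct mock_request {
	unsigned long queued_at;	/* tick count at insertion */
};

static bool aged_out(const struct mock_request *rq, unsigned long now)
{
	/* True once the request has waited more than 10 * HZ ticks. */
	return time_after(now, rq->queued_at + prio_aging_expire);
}

int main(void)
{
	struct mock_request rq = { .queued_at = 0 };

	return aged_out(&rq, 11 * HZ) ? 0 : 1;	/* aged: exit 0 */
}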
end of thread, other threads:[~2021-09-27 20:03 UTC | newest]
Thread overview: 17+ messages
2021-09-23 23:23 [PATCH 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
2021-09-23 23:26 ` [PATCH 1/4] block/mq-deadline: Improve request accounting further Bart Van Assche
2021-09-23 23:26 ` [PATCH 2/4] block/mq-deadline: Add an invariant check Bart Van Assche
2021-09-24 10:54 ` Damien Le Moal
2021-09-27 15:53 ` Niklas Cassel
2021-09-23 23:26 ` [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
2021-09-24 10:58 ` Damien Le Moal
2021-09-25 2:59 ` Bart Van Assche
2021-09-27 0:19 ` Damien Le Moal
2021-09-27 2:38 ` Bart Van Assche
2021-09-27 3:41 ` Damien Le Moal
2021-09-27 15:53 ` Niklas Cassel
2021-09-23 23:26 ` [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
2021-09-24 11:08 ` Damien Le Moal
2021-09-27 20:03 ` Bart Van Assche
2021-09-24 10:54 ` [PATCH 1/4] block/mq-deadline: Improve request accounting further Damien Le Moal
2021-09-27 15:53 ` Niklas Cassel