* [PATCH 0/4] Restore I/O priority support in the mq-deadline scheduler
From: Bart Van Assche @ 2021-09-23 23:23 UTC
To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche
Hi Jens,
This patch series includes the following changes:
- Fix request accounting by counting requeued requests once.
- Test correctness of the request accounting code by triggering a kernel
warning from inside dd_exit_sched() if an inconsistency has been detected.
- Switch from per-CPU counters to individual counters.
- Restore I/O priority support in the mq-deadline scheduler. The performance
measurements in the description of patch 4/4 show that the performance
regressions in the previous version of this patch have been fixed. This has
been achieved by using 'jiffies' instead of ktime_get() and also by skipping
the aging mechanism if all queued requests have the same I/O priority. A
sketch of the difference between the two clock sources follows below.
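As a rough illustration (a minimal sketch, not the patch code: a jiffies
deadline needs only a wrap-safe integer comparison, while a ktime_get()
based check pays for a clock read on every call):

#include <linux/jiffies.h>
#include <linux/ktime.h>

/* Aging check with jiffies: a single wrap-safe integer comparison. */
static bool aged_jiffies(unsigned long start, unsigned long expire)
{
	return time_after(jiffies, start + expire);
}

/* The same check with ktime_t: every call reads the clock. */
static bool aged_ktime(ktime_t start, s64 expire_ns)
{
	return ktime_after(ktime_get(), ktime_add_ns(start, expire_ns));
}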
Please consider this patch series for kernel v5.16.
Thanks,
Bart.
Bart Van Assche (4):
block/mq-deadline: Improve request accounting further
block/mq-deadline: Add an invariant check
block/mq-deadline: Stop using per-CPU counters
block/mq-deadline: Prioritize high-priority requests
block/mq-deadline.c | 221 ++++++++++++++++++++++++++++----------------
1 file changed, 143 insertions(+), 78 deletions(-)
* [PATCH 1/4] block/mq-deadline: Improve request accounting further
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
The scheduler .insert_requests() callback is called when a request is
queued for the first time and also when it is requeued. Only count a
request the first time it is queued. Additionally, since the mq-deadline
scheduler only performs zone locking for requests that have been
inserted, skip the zone unlock code for requests that have not been
inserted into the mq-deadline scheduler.
Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 7f3c3932b723..b1175e4560ad 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
blk_req_zone_write_unlock(rq);
prio = ioprio_class_to_prio[ioprio_class];
- dd_count(dd, inserted, prio);
- rq->elv.priv[0] = (void *)(uintptr_t)1;
+ if (!rq->elv.priv[0]) {
+ dd_count(dd, inserted, prio);
+ rq->elv.priv[0] = (void *)(uintptr_t)1;
+ }
if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
blk_mq_free_requests(&free);
@@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
/*
* The block layer core may call dd_finish_request() without having
- * called dd_insert_requests(). Hence only update statistics for
- * requests for which dd_insert_requests() has been called. See also
- * blk_mq_request_bypass_insert().
+ * called dd_insert_requests(). Skip requests that bypassed I/O
+ * scheduling. See also blk_mq_request_bypass_insert().
*/
- if (rq->elv.priv[0])
- dd_count(dd, completed, prio);
+ if (!rq->elv.priv[0])
+ return;
+
+ dd_count(dd, completed, prio);
if (blk_queue_is_zoned(q)) {
unsigned long flags;
* [PATCH 2/4] block/mq-deadline: Add an invariant check
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
Check a statistics invariant at module unload time. When running
blktests, the invariant is verified every time a request queue is
removed and hence is verified at least once per test.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b1175e4560ad..6deb4306bcf3 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
deadline_remove_request(rq->q, per_prio, rq);
}
+/* Number of requests queued for a given priority level. */
+static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
+{
+ return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+}
+
/*
* deadline_check_fifo returns 0 if there are no expired requests on the fifo,
* 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
@@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
+ WARN_ONCE(dd_queued(dd, prio) != 0,
+ "statistics for priority %d: i %u m %u d %u c %u\n",
+ prio, dd_sum(dd, inserted, prio),
+ dd_sum(dd, merged, prio),
+ dd_sum(dd, dispatched, prio),
+ dd_sum(dd, completed, prio));
}
free_percpu(dd->stats);
@@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
return 0;
}
-/* Number of requests queued for a given priority level. */
-static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
-{
- return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
-}
-
static int dd_queued_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
* [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
Frequently calculating the sum over all CPUs of per-CPU counters is
inefficient. Hence switch from per-CPU to individual counters. Three
counters are protected by the mq-deadline spinlock since these are
only accessed from contexts that already hold that spinlock. The fourth
counter is atomic because protecting it with the mq-deadline spinlock
would trigger lock contention.
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
1 file changed, 56 insertions(+), 68 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 6deb4306bcf3..b0cfc84a4e6b 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -51,17 +51,16 @@ enum dd_prio {
enum { DD_PRIO_COUNT = 3 };
-/* I/O statistics per I/O priority. */
+/*
+ * I/O statistics per I/O priority. It is fine if these counters overflow.
+ * What matters is that these counters are at least as wide as
+ * log2(max_outstanding_requests).
+ */
struct io_stats_per_prio {
- local_t inserted;
- local_t merged;
- local_t dispatched;
- local_t completed;
-};
-
-/* I/O statistics for all I/O priorities (enum dd_prio). */
-struct io_stats {
- struct io_stats_per_prio stats[DD_PRIO_COUNT];
+ uint32_t inserted;
+ uint32_t merged;
+ uint32_t dispatched;
+ atomic_t completed;
};
/*
@@ -74,6 +73,7 @@ struct dd_per_prio {
struct list_head fifo_list[DD_DIR_COUNT];
/* Next request in FIFO order. Read, write or both are NULL. */
struct request *next_rq[DD_DIR_COUNT];
+ struct io_stats_per_prio stats;
};
struct deadline_data {
@@ -88,8 +88,6 @@ struct deadline_data {
unsigned int batching; /* number of sequential requests made */
unsigned int starved; /* times reads have starved writes */
- struct io_stats __percpu *stats;
-
/*
* settings that change how the i/o scheduler behaves
*/
@@ -103,33 +101,6 @@ struct deadline_data {
spinlock_t zone_lock;
};
-/* Count one event of type 'event_type' and with I/O priority 'prio' */
-#define dd_count(dd, event_type, prio) do { \
- struct io_stats *io_stats = get_cpu_ptr((dd)->stats); \
- \
- BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
- BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
- local_inc(&io_stats->stats[(prio)].event_type); \
- put_cpu_ptr(io_stats); \
-} while (0)
-
-/*
- * Returns the total number of dd_count(dd, event_type, prio) calls across all
- * CPUs. No locking or barriers since it is fine if the returned sum is slightly
- * outdated.
- */
-#define dd_sum(dd, event_type, prio) ({ \
- unsigned int cpu; \
- u32 sum = 0; \
- \
- BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
- BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
- for_each_present_cpu(cpu) \
- sum += local_read(&per_cpu_ptr((dd)->stats, cpu)-> \
- stats[(prio)].event_type); \
- sum; \
-})
-
/* Maps an I/O priority class to a deadline scheduler priority. */
static const enum dd_prio ioprio_class_to_prio[] = {
[IOPRIO_CLASS_NONE] = DD_BE_PRIO,
@@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
const u8 ioprio_class = dd_rq_ioclass(next);
const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
- dd_count(dd, merged, prio);
+ lockdep_assert_held(&dd->lock);
+
+ dd->per_prio[prio].stats.merged++;
/*
* if next expires before rq, assign its expire time to rq
@@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
/* Number of requests queued for a given priority level. */
static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
{
- return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
+ const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
+
+ lockdep_assert_held(&dd->lock);
+
+ return stats->inserted - atomic_read(&stats->completed);
}
/*
@@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
done:
ioprio_class = dd_rq_ioclass(rq);
prio = ioprio_class_to_prio[ioprio_class];
- dd_count(dd, dispatched, prio);
+ dd->per_prio[prio].stats.dispatched++;
/*
* If the request needs its target zone locked, do it.
*/
@@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
+ const struct io_stats_per_prio *stats = &per_prio->stats;
+ uint32_t queued;
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
- WARN_ONCE(dd_queued(dd, prio) != 0,
+
+ spin_lock(&dd->lock);
+ queued = dd_queued(dd, prio);
+ spin_unlock(&dd->lock);
+
+ WARN_ONCE(queued,
"statistics for priority %d: i %u m %u d %u c %u\n",
- prio, dd_sum(dd, inserted, prio),
- dd_sum(dd, merged, prio),
- dd_sum(dd, dispatched, prio),
- dd_sum(dd, completed, prio));
+ prio, stats->inserted, stats->merged,
+ stats->dispatched, atomic_read(&stats->completed));
}
- free_percpu(dd->stats);
-
kfree(dd);
}
@@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
eq->elevator_data = dd;
- dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
- GFP_KERNEL | __GFP_ZERO);
- if (!dd->stats)
- goto free_dd;
-
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
q->elevator = eq;
return 0;
-free_dd:
- kfree(dd);
-
put_eq:
kobject_put(&eq->kobj);
return ret;
@@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
blk_req_zone_write_unlock(rq);
prio = ioprio_class_to_prio[ioprio_class];
+ per_prio = &dd->per_prio[prio];
if (!rq->elv.priv[0]) {
- dd_count(dd, inserted, prio);
+ per_prio->stats.inserted++;
rq->elv.priv[0] = (void *)(uintptr_t)1;
}
@@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
trace_block_rq_insert(rq);
- per_prio = &dd->per_prio[prio];
if (at_head) {
list_add(&rq->queuelist, &per_prio->dispatch);
} else {
@@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
if (!rq->elv.priv[0])
return;
- dd_count(dd, completed, prio);
+ atomic_inc(&per_prio->stats.completed);
if (blk_queue_is_zoned(q)) {
unsigned long flags;
@@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
struct deadline_data *dd = q->elevator->elevator_data;
+ u32 rt, be, idle;
+
+ spin_lock(&dd->lock);
+ rt = dd_queued(dd, DD_RT_PRIO);
+ be = dd_queued(dd, DD_BE_PRIO);
+ idle = dd_queued(dd, DD_IDLE_PRIO);
+ spin_unlock(&dd->lock);
+
+ seq_printf(m, "%u %u %u\n", rt, be, idle);
- seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
- dd_queued(dd, DD_BE_PRIO),
- dd_queued(dd, DD_IDLE_PRIO));
return 0;
}
/* Number of requests owned by the block driver for a given priority. */
static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
{
- return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
- - dd_sum(dd, completed, prio);
+ const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
+
+ lockdep_assert_held(&dd->lock);
+
+ return stats->dispatched + stats->merged -
+ atomic_read(&stats->completed);
}
static int dd_owned_by_driver_show(void *data, struct seq_file *m)
{
struct request_queue *q = data;
struct deadline_data *dd = q->elevator->elevator_data;
+ u32 rt, be, idle;
+
+ spin_lock(&dd->lock);
+ rt = dd_owned_by_driver(dd, DD_RT_PRIO);
+ be = dd_owned_by_driver(dd, DD_BE_PRIO);
+ idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
+ spin_unlock(&dd->lock);
+
+ seq_printf(m, "%u %u %u\n", rt, be, idle);
- seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
- dd_owned_by_driver(dd, DD_BE_PRIO),
- dd_owned_by_driver(dd, DD_IDLE_PRIO));
return 0;
}
* [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests
From: Bart Van Assche @ 2021-09-23 23:26 UTC
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Bart Van Assche,
Damien Le Moal, Niklas Cassel, Hannes Reinecke
In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
Prioritize high-priority requests""), this patch uses 'jiffies' instead
of ktime_get() in the code for aging lower priority requests.
This patch has been tested as follows:
Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: 555 K IOPS.
Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
Result without and with this patch: about 380 K IOPS.
Ran the following script:
set -e
scriptdir=$(dirname "$0")
if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
modprobe scsi_debug ndelay=1000000 max_queue=16
sd=''
while [ -z "$sd" ]; do
sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
done
echo $((100*1000)) > "/sys/block/$sd/queue/iosched/aging_expire"
if [ -e /sys/fs/cgroup/io.prio.class ]; then
cd /sys/fs/cgroup
echo restrict-to-be >io.prio.class
echo +io > cgroup.subtree_control
else
cd /sys/fs/cgroup/blkio/
echo restrict-to-be >blkio.prio.class
fi
echo $$ >cgroup.procs
mkdir -p hipri
cd hipri
if [ -e io.prio.class ]; then
echo none-to-rt >io.prio.class
else
echo none-to-rt >blkio.prio.class
fi
{ "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
echo $$ >cgroup.procs
"${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
Result:
* 11000 IOPS for the high-priority job
* 40 IOPS for the low-priority job
If the aging expiry time is changed from 100 s to 0, the IOPS results change
to 6712 and 6796 IOPS.
The max-iops script is a script that runs fio with the following arguments:
--bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
--norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
--iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
--iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
--filename=${positional_argument_1}
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/mq-deadline.c | 78 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 73 insertions(+), 5 deletions(-)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b0cfc84a4e6b..37440786fdbb 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -31,6 +31,11 @@
*/
static const int read_expire = HZ / 2; /* max time before a read is submitted. */
static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
+/*
+ * Time after which to dispatch lower priority requests even if higher
+ * priority requests are pending.
+ */
+static const int aging_expire = 10 * HZ;
static const int writes_starved = 2; /* max times reads can starve a write */
static const int fifo_batch = 16; /* # of sequential requests treated as one
by the above parameters. For throughput. */
@@ -96,6 +101,7 @@ struct deadline_data {
int writes_starved;
int front_merges;
u32 async_depth;
+ int aging_expire;
spinlock_t lock;
spinlock_t zone_lock;
@@ -338,12 +344,27 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
return rq;
}
+/*
+ * Returns true if and only if @rq started after @latest_start where
+ * @latest_start is in jiffies.
+ */
+static bool started_after(struct deadline_data *dd, struct request *rq,
+ unsigned long latest_start)
+{
+ unsigned long start_time = (unsigned long)rq->fifo_time;
+
+ start_time -= dd->fifo_expire[rq_data_dir(rq)];
+
+ return time_after(start_time, latest_start);
+}
+
/*
* deadline_dispatch_requests selects the best request according to
- * read/write expire, fifo_batch, etc
+ * read/write expire, fifo_batch, etc and with a start time <= @latest.
*/
static struct request *__dd_dispatch_request(struct deadline_data *dd,
- struct dd_per_prio *per_prio)
+ struct dd_per_prio *per_prio,
+ unsigned long latest_start)
{
struct request *rq, *next_rq;
enum dd_data_dir data_dir;
@@ -355,6 +376,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
if (!list_empty(&per_prio->dispatch)) {
rq = list_first_entry(&per_prio->dispatch, struct request,
queuelist);
+ if (started_after(dd, rq, latest_start))
+ return NULL;
list_del_init(&rq->queuelist);
goto done;
}
@@ -432,6 +455,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
dd->batching = 0;
dispatch_request:
+ if (started_after(dd, rq, latest_start))
+ return NULL;
/*
* rq is the selected appropriate request.
*/
@@ -449,6 +474,34 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
return rq;
}
+/*
+ * Check whether there are any requests with a deadline that expired more than
+ * aging_expire jiffies ago.
+ */
+static struct request *dd_dispatch_aged_requests(struct deadline_data *dd,
+ unsigned long now)
+{
+ struct request *rq;
+ enum dd_prio prio;
+ int prio_cnt;
+
+ lockdep_assert_held(&dd->lock);
+
+ prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
+ !!dd_queued(dd, DD_IDLE_PRIO);
+ if (prio_cnt < 2)
+ return NULL;
+
+ for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
+ rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
+ now - dd->aging_expire);
+ if (rq)
+ return rq;
+ }
+
+ return NULL;
+}
+
/*
* Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
*
@@ -460,15 +513,26 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
struct deadline_data *dd = hctx->queue->elevator->elevator_data;
- struct request *rq;
+ const unsigned long now = jiffies;
+ struct request *rq = NULL;
enum dd_prio prio;
spin_lock(&dd->lock);
+ rq = dd_dispatch_aged_requests(dd, now);
+ if (rq)
+ goto unlock;
+
+ /*
+ * Next, dispatch requests in priority order. Ignore lower priority
+ * requests if any higher priority requests are pending.
+ */
for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
- rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
- if (rq)
+ rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
+ if (rq || dd_queued(dd, prio))
break;
}
+
+unlock:
spin_unlock(&dd->lock);
return rq;
@@ -573,6 +637,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
dd->front_merges = 1;
dd->last_dir = DD_WRITE;
dd->fifo_batch = fifo_batch;
+ dd->aging_expire = aging_expire;
spin_lock_init(&dd->lock);
spin_lock_init(&dd->zone_lock);
@@ -796,6 +861,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page) \
#define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
+SHOW_JIFFIES(deadline_aging_expire_show, dd->aging_expire);
SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
SHOW_INT(deadline_front_merges_show, dd->front_merges);
SHOW_INT(deadline_async_depth_show, dd->front_merges);
@@ -825,6 +891,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
+STORE_JIFFIES(deadline_aging_expire_store, &dd->aging_expire, 0, INT_MAX);
STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
@@ -843,6 +910,7 @@ static struct elv_fs_entry deadline_attrs[] = {
DD_ATTR(front_merges),
DD_ATTR(async_depth),
DD_ATTR(fifo_batch),
+ DD_ATTR(aging_expire),
__ATTR_NULL
};
* Re: [PATCH 1/4] block/mq-deadline: Improve request accounting further
From: Damien Le Moal @ 2021-09-24 10:54 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> The scheduler .insert_requests() callback is called when a request is
> queued for the first time and also when it is requeued. Only count a
> request the first time it is queued. Additionally, since the mq-deadline
> scheduler only performs zone locking for requests that have been
> inserted, skip the zone unlock code for requests that have not been
> inserted into the mq-deadline scheduler.
>
> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 7f3c3932b723..b1175e4560ad 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, inserted, prio);
> - rq->elv.priv[0] = (void *)(uintptr_t)1;
> + if (!rq->elv.priv[0]) {
> + dd_count(dd, inserted, prio);
> + rq->elv.priv[0] = (void *)(uintptr_t)1;
> + }
>
> if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
> blk_mq_free_requests(&free);
> @@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
>
> /*
> * The block layer core may call dd_finish_request() without having
> - * called dd_insert_requests(). Hence only update statistics for
> - * requests for which dd_insert_requests() has been called. See also
> - * blk_mq_request_bypass_insert().
> + * called dd_insert_requests(). Skip requests that bypassed I/O
> + * scheduling. See also blk_mq_request_bypass_insert().
> */
> - if (rq->elv.priv[0])
> - dd_count(dd, completed, prio);
> + if (!rq->elv.priv[0])
> + return;
> +
> + dd_count(dd, completed, prio);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
>
Looks good to me.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 2/4] block/mq-deadline: Add an invariant check
From: Damien Le Moal @ 2021-09-24 10:54 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> Check a statistics invariant at module unload time. When running
> blktests, the invariant is verified every time a request queue is
> removed and hence is verified at least once per test.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b1175e4560ad..6deb4306bcf3 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> deadline_remove_request(rq->q, per_prio, rq);
> }
>
> +/* Number of requests queued for a given priority level. */
> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> + return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +}
> +
> /*
> * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> @@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> + WARN_ONCE(dd_queued(dd, prio) != 0,
> + "statistics for priority %d: i %u m %u d %u c %u\n",
> + prio, dd_sum(dd, inserted, prio),
> + dd_sum(dd, merged, prio),
> + dd_sum(dd, dispatched, prio),
> + dd_sum(dd, completed, prio));
> }
>
> free_percpu(dd->stats);
> @@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
> return 0;
> }
>
> -/* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> -{
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> -}
> -
> static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
>
Looks good.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Damien Le Moal @ 2021-09-24 10:58 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> Frequently calculating the sum over all CPUs of per-CPU counters is
> inefficient. Hence switch from per-CPU to individual counters. Three
> counters are protected by the mq-deadline spinlock since these are
> only accessed from contexts that already hold that spinlock. The fourth
> counter is atomic because protecting it with the mq-deadline spinlock
> would trigger lock contention.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
> 1 file changed, 56 insertions(+), 68 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 6deb4306bcf3..b0cfc84a4e6b 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -51,17 +51,16 @@ enum dd_prio {
>
> enum { DD_PRIO_COUNT = 3 };
>
> -/* I/O statistics per I/O priority. */
> +/*
> + * I/O statistics per I/O priority. It is fine if these counters overflow.
> + * What matters is that these counters are at least as wide as
> + * log2(max_outstanding_requests).
> + */
> struct io_stats_per_prio {
> - local_t inserted;
> - local_t merged;
> - local_t dispatched;
> - local_t completed;
> -};
> -
> -/* I/O statistics for all I/O priorities (enum dd_prio). */
> -struct io_stats {
> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
> + uint32_t inserted;
> + uint32_t merged;
> + uint32_t dispatched;
> + atomic_t completed;
Why not use 64-bits types (regular unsigned long long and atomic64_t) ?
> };
>
> /*
> @@ -74,6 +73,7 @@ struct dd_per_prio {
> struct list_head fifo_list[DD_DIR_COUNT];
> /* Next request in FIFO order. Read, write or both are NULL. */
> struct request *next_rq[DD_DIR_COUNT];
> + struct io_stats_per_prio stats;
> };
>
> struct deadline_data {
> @@ -88,8 +88,6 @@ struct deadline_data {
> unsigned int batching; /* number of sequential requests made */
> unsigned int starved; /* times reads have starved writes */
>
> - struct io_stats __percpu *stats;
> -
> /*
> * settings that change how the i/o scheduler behaves
> */
> @@ -103,33 +101,6 @@ struct deadline_data {
> spinlock_t zone_lock;
> };
>
> -/* Count one event of type 'event_type' and with I/O priority 'prio' */
> -#define dd_count(dd, event_type, prio) do { \
> - struct io_stats *io_stats = get_cpu_ptr((dd)->stats); \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - local_inc(&io_stats->stats[(prio)].event_type); \
> - put_cpu_ptr(io_stats); \
> -} while (0)
> -
> -/*
> - * Returns the total number of dd_count(dd, event_type, prio) calls across all
> - * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> - * outdated.
> - */
> -#define dd_sum(dd, event_type, prio) ({ \
> - unsigned int cpu; \
> - u32 sum = 0; \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - for_each_present_cpu(cpu) \
> - sum += local_read(&per_cpu_ptr((dd)->stats, cpu)-> \
> - stats[(prio)].event_type); \
> - sum; \
> -})
> -
> /* Maps an I/O priority class to a deadline scheduler priority. */
> static const enum dd_prio ioprio_class_to_prio[] = {
> [IOPRIO_CLASS_NONE] = DD_BE_PRIO,
> @@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
> const u8 ioprio_class = dd_rq_ioclass(next);
> const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>
> - dd_count(dd, merged, prio);
> + lockdep_assert_held(&dd->lock);
> +
> + dd->per_prio[prio].stats.merged++;
>
> /*
> * if next expires before rq, assign its expire time to rq
> @@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> /* Number of requests queued for a given priority level. */
> static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->inserted - atomic_read(&stats->completed);
> }
>
> /*
> @@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> done:
> ioprio_class = dd_rq_ioclass(rq);
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, dispatched, prio);
> + dd->per_prio[prio].stats.dispatched++;
> /*
> * If the request needs its target zone locked, do it.
> */
> @@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
> + const struct io_stats_per_prio *stats = &per_prio->stats;
> + uint32_t queued;
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> - WARN_ONCE(dd_queued(dd, prio) != 0,
> +
> + spin_lock(&dd->lock);
> + queued = dd_queued(dd, prio);
> + spin_unlock(&dd->lock);
> +
> + WARN_ONCE(queued,
> "statistics for priority %d: i %u m %u d %u c %u\n",
> - prio, dd_sum(dd, inserted, prio),
> - dd_sum(dd, merged, prio),
> - dd_sum(dd, dispatched, prio),
> - dd_sum(dd, completed, prio));
> + prio, stats->inserted, stats->merged,
> + stats->dispatched, atomic_read(&stats->completed));
> }
>
> - free_percpu(dd->stats);
> -
> kfree(dd);
> }
>
> @@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>
> eq->elevator_data = dd;
>
> - dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
> - GFP_KERNEL | __GFP_ZERO);
> - if (!dd->stats)
> - goto free_dd;
> -
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
>
> @@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
> q->elevator = eq;
> return 0;
>
> -free_dd:
> - kfree(dd);
> -
> put_eq:
> kobject_put(&eq->kobj);
> return ret;
> @@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> + per_prio = &dd->per_prio[prio];
> if (!rq->elv.priv[0]) {
> - dd_count(dd, inserted, prio);
> + per_prio->stats.inserted++;
> rq->elv.priv[0] = (void *)(uintptr_t)1;
> }
>
> @@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
> trace_block_rq_insert(rq);
>
> - per_prio = &dd->per_prio[prio];
> if (at_head) {
> list_add(&rq->queuelist, &per_prio->dispatch);
> } else {
> @@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
> if (!rq->elv.priv[0])
> return;
>
> - dd_count(dd, completed, prio);
> + atomic_inc(&per_prio->stats.completed);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
> @@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_queued(dd, DD_RT_PRIO);
> + be = dd_queued(dd, DD_BE_PRIO);
> + idle = dd_queued(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
> - dd_queued(dd, DD_BE_PRIO),
> - dd_queued(dd, DD_IDLE_PRIO));
> return 0;
> }
>
> /* Number of requests owned by the block driver for a given priority. */
> static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
> - - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->dispatched + stats->merged -
> + atomic_read(&stats->completed);
> }
>
> static int dd_owned_by_driver_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_owned_by_driver(dd, DD_RT_PRIO);
> + be = dd_owned_by_driver(dd, DD_BE_PRIO);
> + idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> - dd_owned_by_driver(dd, DD_BE_PRIO),
> - dd_owned_by_driver(dd, DD_IDLE_PRIO));
> return 0;
> }
>
>
Apart from the nit/comment above, looks OK to me.
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests
From: Damien Le Moal @ 2021-09-24 11:08 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/24 8:27, Bart Van Assche wrote:
> In addition to reverting commit 7b05bf771084 ("Revert "block/mq-deadline:
> Prioritize high-priority requests""), this patch uses 'jiffies' instead
> of ktime_get() in the code for aging lower priority requests.
>
> This patch has been tested as follows:
>
> Measured QD=1/jobs=1 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: 555 K IOPS.
>
> Measured QD=1/jobs=8 IOPS for nullb with the mq-deadline scheduler.
> Result without and with this patch: about 380 K IOPS.
>
> Ran the following script:
>
> set -e
> scriptdir=$(dirname "$0")
> if [ -e /sys/module/scsi_debug ]; then modprobe -r scsi_debug; fi
> modprobe scsi_debug ndelay=1000000 max_queue=16
> sd=''
> while [ -z "$sd" ]; do
> sd=$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
> done
> echo $((100*1000)) > "/sys/block/$sd/queue/iosched/aging_expire"
> if [ -e /sys/fs/cgroup/io.prio.class ]; then
> cd /sys/fs/cgroup
> echo restrict-to-be >io.prio.class
> echo +io > cgroup.subtree_control
> else
> cd /sys/fs/cgroup/blkio/
> echo restrict-to-be >blkio.prio.class
> fi
> echo $$ >cgroup.procs
> mkdir -p hipri
> cd hipri
> if [ -e io.prio.class ]; then
> echo none-to-rt >io.prio.class
> else
> echo none-to-rt >blkio.prio.class
> fi
> { "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/low-pri.txt & }
> echo $$ >cgroup.procs
> "${scriptdir}/max-iops" -a1 -d32 -j1 -e mq-deadline "/dev/$sd" >& ~/hi-pri.txt
>
> Result:
> * 11000 IOPS for the high-priority job
> * 40 IOPS for the low-priority job
>
> If the aging expiry time is changed from 100 s to 0, the IOPS results change
> to 6712 and 6796 IOPS.
>
> The max-iops script is a script that runs fio with the following arguments:
> --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e} --runtime=60
> --norandommap --rw=read --thread --buffered=0 --numjobs=${arg_j}
> --iodepth=${arg_d} --iodepth_batch_submit=${arg_a}
> --iodepth_batch_complete=$((arg_d / 2)) --name=${positional_argument_1}
> --filename=${positional_argument_1}
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 78 ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 73 insertions(+), 5 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b0cfc84a4e6b..37440786fdbb 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -31,6 +31,11 @@
> */
> static const int read_expire = HZ / 2; /* max time before a read is submitted. */
> static const int write_expire = 5 * HZ; /* ditto for writes, these limits are SOFT! */
> +/*
> + * Time after which to dispatch lower priority requests even if higher
> + * priority requests are pending.
> + */
> +static const int aging_expire = 10 * HZ;
What about calling this prio_starved, to be consistent with writes_starved?
Or at least let's call it prio_aging_expire to show that it is for priority
levels, and not for requests within a priority queue.
> static const int writes_starved = 2; /* max times reads can starve a write */
> static const int fifo_batch = 16; /* # of sequential requests treated as one
> by the above parameters. For throughput. */
> @@ -96,6 +101,7 @@ struct deadline_data {
> int writes_starved;
> int front_merges;
> u32 async_depth;
> + int aging_expire;
>
> spinlock_t lock;
> spinlock_t zone_lock;
> @@ -338,12 +344,27 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> return rq;
> }
>
> +/*
> + * Returns true if and only if @rq started after @latest_start where
> + * @latest_start is in jiffies.
> + */
> +static bool started_after(struct deadline_data *dd, struct request *rq,
> + unsigned long latest_start)
> +{
> + unsigned long start_time = (unsigned long)rq->fifo_time;
> +
> + start_time -= dd->fifo_expire[rq_data_dir(rq)];
> +
> + return time_after(start_time, latest_start);
> +}
> +
> /*
> * deadline_dispatch_requests selects the best request according to
> - * read/write expire, fifo_batch, etc
> + * read/write expire, fifo_batch, etc and with a start time <= @latest.
> */
> static struct request *__dd_dispatch_request(struct deadline_data *dd,
> - struct dd_per_prio *per_prio)
> + struct dd_per_prio *per_prio,
> + unsigned long latest_start)
> {
> struct request *rq, *next_rq;
> enum dd_data_dir data_dir;
> @@ -355,6 +376,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> if (!list_empty(&per_prio->dispatch)) {
> rq = list_first_entry(&per_prio->dispatch, struct request,
> queuelist);
> + if (started_after(dd, rq, latest_start))
> + return NULL;
> list_del_init(&rq->queuelist);
> goto done;
> }
> @@ -432,6 +455,8 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> dd->batching = 0;
>
> dispatch_request:
> + if (started_after(dd, rq, latest_start))
> + return NULL;
Nit: add a blank line here? (for aesthetics :))
> /*
> * rq is the selected appropriate request.
> */
> @@ -449,6 +474,34 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> return rq;
> }
>
> +/*
> + * Check whether there are any requests with a deadline that expired more than
> + * aging_expire jiffies ago.
Hmm... requests do not have a "deadline", so this is a little hard to
understand. What about something like this:
* Check whether there are any requests at a low priority level inserted
* more than aging_expire jiffies ago.
> + */
> +static struct request *dd_dispatch_aged_requests(struct deadline_data *dd,
> + unsigned long now)
Same remark as for the aging_expire variable: it may be good to have prio in the
name of this function. Something like:
dd_dispatch_prio_starved_requests()?
> +{
> + struct request *rq;
> + enum dd_prio prio;
> + int prio_cnt;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + prio_cnt = !!dd_queued(dd, DD_RT_PRIO) + !!dd_queued(dd, DD_BE_PRIO) +
> + !!dd_queued(dd, DD_IDLE_PRIO);
> + if (prio_cnt < 2)
> + return NULL;
> +
> + for (prio = DD_BE_PRIO; prio <= DD_PRIO_MAX; prio++) {
> + rq = __dd_dispatch_request(dd, &dd->per_prio[prio],
> + now - dd->aging_expire);
> + if (rq)
> + return rq;
> + }
> +
> + return NULL;
> +}
> +
> /*
> * Called from blk_mq_run_hw_queue() -> __blk_mq_sched_dispatch_requests().
> *
> @@ -460,15 +513,26 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> {
> struct deadline_data *dd = hctx->queue->elevator->elevator_data;
> - struct request *rq;
> + const unsigned long now = jiffies;
> + struct request *rq = NULL;
> enum dd_prio prio;
>
> spin_lock(&dd->lock);
Nit: add a blank line here to have the spin_lock() stand out?
Easier to notice it that way...
> + rq = dd_dispatch_aged_requests(dd, now);
> + if (rq)
> + goto unlock;
> +
> + /*
> + * Next, dispatch requests in priority order. Ignore lower priority
> + * requests if any higher priority requests are pending.
> + */
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> - rq = __dd_dispatch_request(dd, &dd->per_prio[prio]);
> - if (rq)
> + rq = __dd_dispatch_request(dd, &dd->per_prio[prio], now);
> + if (rq || dd_queued(dd, prio))
> break;
> }
> +
> +unlock:
> spin_unlock(&dd->lock);
>
> return rq;
> @@ -573,6 +637,7 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
> dd->front_merges = 1;
> dd->last_dir = DD_WRITE;
> dd->fifo_batch = fifo_batch;
> + dd->aging_expire = aging_expire;
> spin_lock_init(&dd->lock);
> spin_lock_init(&dd->zone_lock);
>
> @@ -796,6 +861,7 @@ static ssize_t __FUNC(struct elevator_queue *e, char *page) \
> #define SHOW_JIFFIES(__FUNC, __VAR) SHOW_INT(__FUNC, jiffies_to_msecs(__VAR))
> SHOW_JIFFIES(deadline_read_expire_show, dd->fifo_expire[DD_READ]);
> SHOW_JIFFIES(deadline_write_expire_show, dd->fifo_expire[DD_WRITE]);
> +SHOW_JIFFIES(deadline_aging_expire_show, dd->aging_expire);
> SHOW_INT(deadline_writes_starved_show, dd->writes_starved);
> SHOW_INT(deadline_front_merges_show, dd->front_merges);
> SHOW_INT(deadline_async_depth_show, dd->front_merges);
> @@ -825,6 +891,7 @@ static ssize_t __FUNC(struct elevator_queue *e, const char *page, size_t count)
> STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, msecs_to_jiffies)
> STORE_JIFFIES(deadline_read_expire_store, &dd->fifo_expire[DD_READ], 0, INT_MAX);
> STORE_JIFFIES(deadline_write_expire_store, &dd->fifo_expire[DD_WRITE], 0, INT_MAX);
> +STORE_JIFFIES(deadline_aging_expire_store, &dd->aging_expire, 0, INT_MAX);
> STORE_INT(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX);
> STORE_INT(deadline_front_merges_store, &dd->front_merges, 0, 1);
> STORE_INT(deadline_async_depth_store, &dd->front_merges, 1, INT_MAX);
> @@ -843,6 +910,7 @@ static struct elv_fs_entry deadline_attrs[] = {
> DD_ATTR(front_merges),
> DD_ATTR(async_depth),
> DD_ATTR(fifo_batch),
> + DD_ATTR(aging_expire),
> __ATTR_NULL
> };
>
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Bart Van Assche @ 2021-09-25 2:59 UTC
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 9/24/21 03:58, Damien Le Moal wrote:
> On 2021/09/24 8:27, Bart Van Assche wrote:
>> -/* I/O statistics for all I/O priorities (enum dd_prio). */
>> -struct io_stats {
>> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
>> + uint32_t inserted;
>> + uint32_t merged;
>> + uint32_t dispatched;
>> + atomic_t completed;
>
> Why not use 64-bits types (regular unsigned long long and atomic64_t) ?
Even 64-bit counters can overflow. Using 32-bit counters makes it easier to
trigger an overflow of these counters.
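As a back-of-the-envelope illustration (not part of the patch; the 555 K
IOPS figure is taken from the patch 4/4 description):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const double iops = 555000.0;	/* QD=1 nullb result from patch 4/4 */

	/* Time until a free-running counter of each width wraps around. */
	printf("u32 wraps after %.1f hours\n",
	       4294967296.0 / iops / 3600.0);
	printf("u64 wraps after %.2e years\n",
	       18446744073709551616.0 / iops / 3600.0 / 24.0 / 365.0);
	return 0;
}

At that rate a u32 counter wraps after about 2.1 hours, while a u64 counter
would take on the order of a million years, so u32 wraparound is testable in
practice.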
Thanks,
Bart.
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Damien Le Moal @ 2021-09-27 0:19 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/25 11:59, Bart Van Assche wrote:
> On 9/24/21 03:58, Damien Le Moal wrote:
>> On 2021/09/24 8:27, Bart Van Assche wrote:
>>> -/* I/O statistics for all I/O priorities (enum dd_prio). */
>>> -struct io_stats {
>>> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
>>> + uint32_t inserted;
>>> + uint32_t merged;
>>> + uint32_t dispatched;
>>> + atomic_t completed;
>>
>> Why not use 64-bits types (regular unsigned long long and atomic64_t) ?
>
> Even 64-bit counters can overflow. Using 32-bit counters makes it easier to
> trigger an overflow of these counters.
I was thinking more about the speed of additions/subtractions on 64-bit
arches, which are a large part (the majority?) of mq-deadline users. Not sure
whether there is a difference in speed between 32-bit and 64-bit simple math
on a 64-bit arch, though. Probably not.
Another thing: in patch 3, you are actually not handling the overflows. So
dd_queued() may return some very weird number (temporarily) when the inserted
count overflows before the completed count does. Since
dd_dispatch_aged_requests() does not care about the actual value of dd_queued(),
only whether it is 0, I am not 100% sure it is useful to fix. Except maybe
for the sysfs attributes?
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Bart Van Assche @ 2021-09-27 2:38 UTC
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 9/26/21 17:19, Damien Le Moal wrote:
> Another thing: in patch 3, you are actually not handling the overflows. So
> dd_queued() may return some very weird number (temporarily) when the inserted
> count overflows before the completed count does. Since
> dd_dispatch_aged_requests() does not care about the actual value of dd_queued(),
> only if it is 0 or not, I am not 100% sure if it is useful to fix. Except maybe
> for sysfs attributes ?
Hmm ... I'm not following. I think it can be proven mathematically that
dd_queued() returns a number in the range [0, max_queued_requests). Here is
an example:
* inserted = 1 (wrapped around).
* completed = UINT32_MAX - 1 (about to wrap but has not yet wrapped around).
* dd_queued() returns 2.
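A minimal userspace sketch of this modulo-2^32 arithmetic (illustrative
only; note that, as pointed out below, the difference in this example is
actually 3):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t inserted = 1;			/* wrapped around */
	uint32_t completed = UINT32_MAX - 1;	/* not yet wrapped */

	/* Unsigned subtraction is modulo 2^32, so the result is the true
	 * number of queued requests even across a wrap: this prints 3. */
	printf("queued = %u\n", inserted - completed);
	return 0;
}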
Bart.
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
From: Damien Le Moal @ 2021-09-27 3:41 UTC
To: Bart Van Assche, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 2021/09/27 11:38, Bart Van Assche wrote:
> On 9/26/21 17:19, Damien Le Moal wrote:
>> Another thing: in patch 3, you are actually not handling the overflows. So
>> dd_queued() may return some very weird number (temporarily) when the inserted
>> count overflows before the completed count does. Since
>> dd_dispatch_aged_requests() does not care about the actual value of dd_queued(),
>> only if it is 0 or not, I am not 100% sure if it is useful to fix. Except maybe
>> for sysfs attributes ?
>
> Hmm ... I'm not following. I think it can be proven mathematically that
> dd_queued() returns a number in the range [0, max_queued_requests). Here is
> an example:
> * inserted = 1 (wrapped around).
> * completed = UINT32_MAX - 1 (about to wrap but has not yet wrapped around).
> * dd_queued() returns 2.
You mean 3, right ?
But yes... Slow Monday, I need more coffee :)
Cheers.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/4] block/mq-deadline: Improve request accounting further
From: Niklas Cassel @ 2021-09-27 15:53 UTC
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
Damien Le Moal, Hannes Reinecke
On Thu, Sep 23, 2021 at 04:26:52PM -0700, Bart Van Assche wrote:
> The scheduler .insert_requests() callback is called when a request is
> queued for the first time and also when it is requeued. Only count a
> request the first time it is queued. Additionally, since the mq-deadline
> scheduler only performs zone locking for requests that have been
> inserted, skip the zone unlock code for requests that have not been
> inserted into the mq-deadline scheduler.
>
> Fixes: 38ba64d12d4c ("block/mq-deadline: Track I/O statistics")
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 7f3c3932b723..b1175e4560ad 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -677,8 +677,10 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, inserted, prio);
> - rq->elv.priv[0] = (void *)(uintptr_t)1;
> + if (!rq->elv.priv[0]) {
> + dd_count(dd, inserted, prio);
> + rq->elv.priv[0] = (void *)(uintptr_t)1;
> + }
>
> if (blk_mq_sched_try_insert_merge(q, rq, &free)) {
> blk_mq_free_requests(&free);
> @@ -759,12 +761,13 @@ static void dd_finish_request(struct request *rq)
>
> /*
> * The block layer core may call dd_finish_request() without having
> - * called dd_insert_requests(). Hence only update statistics for
> - * requests for which dd_insert_requests() has been called. See also
> - * blk_mq_request_bypass_insert().
> + * called dd_insert_requests(). Skip requests that bypassed I/O
> + * scheduling. See also blk_mq_request_bypass_insert().
> */
> - if (rq->elv.priv[0])
> - dd_count(dd, completed, prio);
> + if (!rq->elv.priv[0])
> + return;
> +
> + dd_count(dd, completed, prio);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
* Re: [PATCH 2/4] block/mq-deadline: Add an invariant check
From: Niklas Cassel @ 2021-09-27 15:53 UTC
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
Damien Le Moal, Hannes Reinecke
On Thu, Sep 23, 2021 at 04:26:53PM -0700, Bart Van Assche wrote:
> Check a statistics invariant at module unload time. When running
> blktests, the invariant is verified every time a request queue is
> removed and hence is verified at least once per test.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index b1175e4560ad..6deb4306bcf3 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -270,6 +270,12 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> deadline_remove_request(rq->q, per_prio, rq);
> }
>
> +/* Number of requests queued for a given priority level. */
> +static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> +{
> + return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> +}
> +
> /*
> * deadline_check_fifo returns 0 if there are no expired requests on the fifo,
> * 1 otherwise. Requires !list_empty(&dd->fifo_list[data_dir])
> @@ -539,6 +545,12 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> + WARN_ONCE(dd_queued(dd, prio) != 0,
> + "statistics for priority %d: i %u m %u d %u c %u\n",
> + prio, dd_sum(dd, inserted, prio),
> + dd_sum(dd, merged, prio),
> + dd_sum(dd, dispatched, prio),
> + dd_sum(dd, completed, prio));
> }
>
> free_percpu(dd->stats);
> @@ -950,12 +962,6 @@ static int dd_async_depth_show(void *data, struct seq_file *m)
> return 0;
> }
>
> -/* Number of requests queued for a given priority level. */
> -static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> -{
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> -}
> -
> static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
^ permalink raw reply [flat|nested] 17+ messages in thread
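The invariant the patch checks is simply that every inserted request has
completed by teardown time, i.e. inserted - completed == 0. A self-contained
sketch of the same check, with assert() standing in for WARN_ONCE() and
hypothetical mock types:

/*
 * Sketch of the exit-time invariant from the patch above: once the
 * queue is drained, inserted minus completed must be zero.
 */
#include <assert.h>
#include <stdint.h>

struct mock_stats {
	uint32_t inserted;
	uint32_t completed;
};

static uint32_t queued(const struct mock_stats *s)
{
	/*
	 * Unsigned subtraction stays correct even after both counters
	 * have wrapped, as long as the difference fits in 32 bits.
	 */
	return s->inserted - s->completed;
}

static void exit_sched(const struct mock_stats *s)
{
	/*
	 * The kernel uses WARN_ONCE() here; assert() is the closest
	 * userspace analogue for a sketch.
	 */
	assert(queued(s) == 0);
}

int main(void)
{
	struct mock_stats s = { .inserted = 7, .completed = 7 };

	exit_sched(&s);		/* passes: no requests leaked */
	return 0;
}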
* Re: [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters
2021-09-23 23:26 ` [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
2021-09-24 10:58 ` Damien Le Moal
@ 2021-09-27 15:53 ` Niklas Cassel
1 sibling, 0 replies; 17+ messages in thread
From: Niklas Cassel @ 2021-09-27 15:53 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Jaegeuk Kim,
Damien Le Moal, Hannes Reinecke
On Thu, Sep 23, 2021 at 04:26:54PM -0700, Bart Van Assche wrote:
> Calculating the sum over all CPUs of per-CPU counters frequently is
> inefficient. Hence switch from per-CPU to individual counters. Three
> counters are protected by the mq-deadline spinlock since these are
> only accessed from contexts that already hold that spinlock. The fourth
> counter is atomic because protecting it with the mq-deadline spinlock
> would trigger lock contention.
>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Niklas Cassel <Niklas.Cassel@wdc.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
> block/mq-deadline.c | 124 ++++++++++++++++++++------------------------
> 1 file changed, 56 insertions(+), 68 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 6deb4306bcf3..b0cfc84a4e6b 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -51,17 +51,16 @@ enum dd_prio {
>
> enum { DD_PRIO_COUNT = 3 };
>
> -/* I/O statistics per I/O priority. */
> +/*
> + * I/O statistics per I/O priority. It is fine if these counters overflow.
> + * What matters is that these counters are at least as wide as
> + * log2(max_outstanding_requests).
> + */
> struct io_stats_per_prio {
> - local_t inserted;
> - local_t merged;
> - local_t dispatched;
> - local_t completed;
> -};
> -
> -/* I/O statistics for all I/O priorities (enum dd_prio). */
> -struct io_stats {
> - struct io_stats_per_prio stats[DD_PRIO_COUNT];
> + uint32_t inserted;
> + uint32_t merged;
> + uint32_t dispatched;
> + atomic_t completed;
> };
>
> /*
> @@ -74,6 +73,7 @@ struct dd_per_prio {
> struct list_head fifo_list[DD_DIR_COUNT];
> /* Next request in FIFO order. Read, write or both are NULL. */
> struct request *next_rq[DD_DIR_COUNT];
> + struct io_stats_per_prio stats;
> };
>
> struct deadline_data {
> @@ -88,8 +88,6 @@ struct deadline_data {
> unsigned int batching; /* number of sequential requests made */
> unsigned int starved; /* times reads have starved writes */
>
> - struct io_stats __percpu *stats;
> -
> /*
> * settings that change how the i/o scheduler behaves
> */
> @@ -103,33 +101,6 @@ struct deadline_data {
> spinlock_t zone_lock;
> };
>
> -/* Count one event of type 'event_type' and with I/O priority 'prio' */
> -#define dd_count(dd, event_type, prio) do { \
> - struct io_stats *io_stats = get_cpu_ptr((dd)->stats); \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - local_inc(&io_stats->stats[(prio)].event_type); \
> - put_cpu_ptr(io_stats); \
> -} while (0)
> -
> -/*
> - * Returns the total number of dd_count(dd, event_type, prio) calls across all
> - * CPUs. No locking or barriers since it is fine if the returned sum is slightly
> - * outdated.
> - */
> -#define dd_sum(dd, event_type, prio) ({ \
> - unsigned int cpu; \
> - u32 sum = 0; \
> - \
> - BUILD_BUG_ON(!__same_type((dd), struct deadline_data *)); \
> - BUILD_BUG_ON(!__same_type((prio), enum dd_prio)); \
> - for_each_present_cpu(cpu) \
> - sum += local_read(&per_cpu_ptr((dd)->stats, cpu)-> \
> - stats[(prio)].event_type); \
> - sum; \
> -})
> -
> /* Maps an I/O priority class to a deadline scheduler priority. */
> static const enum dd_prio ioprio_class_to_prio[] = {
> [IOPRIO_CLASS_NONE] = DD_BE_PRIO,
> @@ -233,7 +204,9 @@ static void dd_merged_requests(struct request_queue *q, struct request *req,
> const u8 ioprio_class = dd_rq_ioclass(next);
> const enum dd_prio prio = ioprio_class_to_prio[ioprio_class];
>
> - dd_count(dd, merged, prio);
> + lockdep_assert_held(&dd->lock);
> +
> + dd->per_prio[prio].stats.merged++;
>
> /*
> * if next expires before rq, assign its expire time to rq
> @@ -273,7 +246,11 @@ deadline_move_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
> /* Number of requests queued for a given priority level. */
> static u32 dd_queued(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, inserted, prio) - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->inserted - atomic_read(&stats->completed);
> }
>
> /*
> @@ -463,7 +440,7 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
> done:
> ioprio_class = dd_rq_ioclass(rq);
> prio = ioprio_class_to_prio[ioprio_class];
> - dd_count(dd, dispatched, prio);
> + dd->per_prio[prio].stats.dispatched++;
> /*
> * If the request needs its target zone locked, do it.
> */
> @@ -542,19 +519,22 @@ static void dd_exit_sched(struct elevator_queue *e)
>
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
> + const struct io_stats_per_prio *stats = &per_prio->stats;
> + uint32_t queued;
>
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_READ]));
> WARN_ON_ONCE(!list_empty(&per_prio->fifo_list[DD_WRITE]));
> - WARN_ONCE(dd_queued(dd, prio) != 0,
> +
> + spin_lock(&dd->lock);
> + queued = dd_queued(dd, prio);
> + spin_unlock(&dd->lock);
> +
> + WARN_ONCE(queued,
> "statistics for priority %d: i %u m %u d %u c %u\n",
> - prio, dd_sum(dd, inserted, prio),
> - dd_sum(dd, merged, prio),
> - dd_sum(dd, dispatched, prio),
> - dd_sum(dd, completed, prio));
> + prio, stats->inserted, stats->merged,
> + stats->dispatched, atomic_read(&stats->completed));
> }
>
> - free_percpu(dd->stats);
> -
> kfree(dd);
> }
>
> @@ -578,11 +558,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
>
> eq->elevator_data = dd;
>
> - dd->stats = alloc_percpu_gfp(typeof(*dd->stats),
> - GFP_KERNEL | __GFP_ZERO);
> - if (!dd->stats)
> - goto free_dd;
> -
> for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
> struct dd_per_prio *per_prio = &dd->per_prio[prio];
>
> @@ -604,9 +579,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
> q->elevator = eq;
> return 0;
>
> -free_dd:
> - kfree(dd);
> -
> put_eq:
> kobject_put(&eq->kobj);
> return ret;
> @@ -689,8 +661,9 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
> blk_req_zone_write_unlock(rq);
>
> prio = ioprio_class_to_prio[ioprio_class];
> + per_prio = &dd->per_prio[prio];
> if (!rq->elv.priv[0]) {
> - dd_count(dd, inserted, prio);
> + per_prio->stats.inserted++;
> rq->elv.priv[0] = (void *)(uintptr_t)1;
> }
>
> @@ -701,7 +674,6 @@ static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
>
> trace_block_rq_insert(rq);
>
> - per_prio = &dd->per_prio[prio];
> if (at_head) {
> list_add(&rq->queuelist, &per_prio->dispatch);
> } else {
> @@ -779,7 +751,7 @@ static void dd_finish_request(struct request *rq)
> if (!rq->elv.priv[0])
> return;
>
> - dd_count(dd, completed, prio);
> + atomic_inc(&per_prio->stats.completed);
>
> if (blk_queue_is_zoned(q)) {
> unsigned long flags;
> @@ -966,28 +938,44 @@ static int dd_queued_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_queued(dd, DD_RT_PRIO);
> + be = dd_queued(dd, DD_BE_PRIO);
> + idle = dd_queued(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_queued(dd, DD_RT_PRIO),
> - dd_queued(dd, DD_BE_PRIO),
> - dd_queued(dd, DD_IDLE_PRIO));
> return 0;
> }
>
> /* Number of requests owned by the block driver for a given priority. */
> static u32 dd_owned_by_driver(struct deadline_data *dd, enum dd_prio prio)
> {
> - return dd_sum(dd, dispatched, prio) + dd_sum(dd, merged, prio)
> - - dd_sum(dd, completed, prio);
> + const struct io_stats_per_prio *stats = &dd->per_prio[prio].stats;
> +
> + lockdep_assert_held(&dd->lock);
> +
> + return stats->dispatched + stats->merged -
> + atomic_read(&stats->completed);
> }
>
> static int dd_owned_by_driver_show(void *data, struct seq_file *m)
> {
> struct request_queue *q = data;
> struct deadline_data *dd = q->elevator->elevator_data;
> + u32 rt, be, idle;
> +
> + spin_lock(&dd->lock);
> + rt = dd_owned_by_driver(dd, DD_RT_PRIO);
> + be = dd_owned_by_driver(dd, DD_BE_PRIO);
> + idle = dd_owned_by_driver(dd, DD_IDLE_PRIO);
> + spin_unlock(&dd->lock);
> +
> + seq_printf(m, "%u %u %u\n", rt, be, idle);
>
> - seq_printf(m, "%u %u %u\n", dd_owned_by_driver(dd, DD_RT_PRIO),
> - dd_owned_by_driver(dd, DD_BE_PRIO),
> - dd_owned_by_driver(dd, DD_IDLE_PRIO));
> return 0;
> }
>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
^ permalink raw reply [flat|nested] 17+ messages in thread
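The counter scheme in this patch reduces to three plain integers that are
only ever modified under the scheduler lock, plus one atomic counter for the
completion path that runs outside it. A userspace sketch using C11 mutexes
and atomics as stand-ins for spinlock_t and atomic_t (all names are
hypothetical; <threads.h> is assumed to be available):

/*
 * Sketch of the counter scheme from the patch above. Three counters
 * are only touched under 'lock'; 'completed' is atomic because the
 * completion path does not hold that lock.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <threads.h>

struct mock_stats {
	/* Updated only while holding 'lock'. */
	uint32_t inserted;
	uint32_t merged;
	uint32_t dispatched;
	/* Updated from completion context without the lock. */
	atomic_uint completed;
};

static mtx_t lock;
static struct mock_stats stats;

static void count_inserted(void)
{
	mtx_lock(&lock);
	stats.inserted++;
	mtx_unlock(&lock);
}

static void count_completed(void)
{
	/* No lock needed: a single atomic increment cannot race. */
	atomic_fetch_add_explicit(&stats.completed, 1,
				  memory_order_relaxed);
}

static uint32_t queued(void)
{
	uint32_t q;

	/*
	 * Reading the lock-protected counter requires the lock, just
	 * as dd_queued() asserts with lockdep_assert_held().
	 */
	mtx_lock(&lock);
	q = stats.inserted - atomic_load(&stats.completed);
	mtx_unlock(&lock);
	return q;
}

int main(void)
{
	mtx_init(&lock, mtx_plain);
	count_inserted();
	count_completed();
	return queued();	/* 0: all inserted requests completed */
}

The design point mirrors the commit message: summing per-CPU counters on
every read is what was expensive, so plain counters under an already-held
lock, plus one atomic where the lock is not held, avoid both the summation
cost and new lock contention.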
* Re: [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests
2021-09-24 11:08 ` Damien Le Moal
@ 2021-09-27 20:03 ` Bart Van Assche
0 siblings, 0 replies; 17+ messages in thread
From: Bart Van Assche @ 2021-09-27 20:03 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe
Cc: linux-block, Christoph Hellwig, Jaegeuk Kim, Niklas Cassel,
Hannes Reinecke
On 9/24/21 4:08 AM, Damien Le Moal wrote:
> On 2021/09/24 8:27, Bart Van Assche wrote:
>> +/*
>> + * Time after which to dispatch lower priority requests even if higher
>> + * priority requests are pending.
>> + */
>> +static const int aging_expire = 10 * HZ;
>
> What about calling this prio_starved, to be consistent with writes_starved ?
> Or at least let's call it prio_aging_expire to show that it is for priority
> levels, and not for requests within a priority queue.
The name 'prio_starved' probably would cause confusion since this parameter
represents a timeout while 'writes_starved' represents a count. I will rename
it to prio_aging_expire.
>> dispatch_request:
>> + if (started_after(dd, rq, latest_start))
>> + return NULL;
>
> Nit: add a blank line here ? (for aesthetic :))
Will do.
>> +/*
>> + * Check whether there are any requests with a deadline that expired more than
>> + * aging_expire jiffies ago.
>
> Hmm... requests do not have a "deadline", so this is a little hard to
> understand. What about something like this:
>
> * Check whether there are any requests at a low priority level inserted
> * more than aging_expire jiffies ago.
Agreed, I will clarify this comment.
>> + */
>> +static struct request *dd_dispatch_aged_requests(struct deadline_data *dd,
>> + unsigned long now)
>
> Same remark as for the aging_expire variable: it may be good to have prio in the
> name of this function. Something like:
>
> dd_dispatch_prio_starved_requests() ?
The name 'aging' has a well-defined meaning in scheduling, so I would like to keep it. See also
https://en.wikipedia.org/wiki/Aging_(scheduling).
>> spin_lock(&dd->lock);
>
> Nit: Add a blank line here to have the spin_lock() stand out ?
> Easier to notice it that way...
The convention in kernel code is to leave a blank line above statements that
grab a lock, but not below them.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 17+ messages in thread
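The aging mechanism discussed above reduces to a wrap-safe time comparison
against the tick count recorded when a request was inserted. A userspace
sketch of that check, with local stand-ins for the kernel's jiffies and
time_after() (the request layout and tick rate are hypothetical):

/*
 * Sketch of the aging check: a lower-priority request becomes eligible
 * for forced dispatch once it has been queued longer than
 * prio_aging_expire ticks, regardless of pending higher-priority work.
 */
#include <stdbool.h>

/* Wrap-safe "a is later than b" comparison, as in the kernel. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

#define HZ			100UL	/* hypothetical tick rate */

static const unsigned long prio_aging_expire = 10 * HZ;

struct mock_request {
	unsigned long queued_at;	/* tick count at insertion */
};

static bool aged_out(const struct mock_request *rq, unsigned long now)
{
	/* True once the request has waited more than 10 * HZ ticks. */
	return time_after(now, rq->queued_at + prio_aging_expire);
}

int main(void)
{
	struct mock_request rq = { .queued_at = 0 };

	return aged_out(&rq, 11 * HZ) ? 0 : 1;	/* aged: exit 0 */
}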
end of thread, other threads:[~2021-09-27 20:03 UTC | newest]
Thread overview: 17+ messages
2021-09-23 23:23 [PATCH 0/4] Restore I/O priority support in the mq-deadline scheduler Bart Van Assche
2021-09-23 23:26 ` [PATCH 1/4] block/mq-deadline: Improve request accounting further Bart Van Assche
2021-09-23 23:26 ` [PATCH 2/4] block/mq-deadline: Add an invariant check Bart Van Assche
2021-09-24 10:54 ` Damien Le Moal
2021-09-27 15:53 ` Niklas Cassel
2021-09-23 23:26 ` [PATCH 3/4] block/mq-deadline: Stop using per-CPU counters Bart Van Assche
2021-09-24 10:58 ` Damien Le Moal
2021-09-25 2:59 ` Bart Van Assche
2021-09-27 0:19 ` Damien Le Moal
2021-09-27 2:38 ` Bart Van Assche
2021-09-27 3:41 ` Damien Le Moal
2021-09-27 15:53 ` Niklas Cassel
2021-09-23 23:26 ` [PATCH 4/4] block/mq-deadline: Prioritize high-priority requests Bart Van Assche
2021-09-24 11:08 ` Damien Le Moal
2021-09-27 20:03 ` Bart Van Assche
2021-09-24 10:54 ` [PATCH 1/4] block/mq-deadline: Improve request accounting further Damien Le Moal
2021-09-27 15:53 ` Niklas Cassel