[RFC] blk-mq and I/O scheduling

* [RFC] blk-mq and I/O scheduling
@ 2015-11-19 12:02 Andreas Herrmann
  2015-11-24  8:19 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andreas Herrmann @ 2015-11-19 12:02 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: linux-kernel

Hi,

I've looked into blk-mq and possible support for I/O scheduling.

The reason for this is to minimize performance degradation with
rotational devices when scsi_mod.use_blk_mq=1 is switched on.

I think that the degradation is well reflected with fio measurements.
With an increasing number of jobs you'll encounter a significant
performance drop for sequential reads and writes with blk-mq in
contrast to CFQ. blk-mq ensures that requests from different processes
(CPUs) are "perfectly shuffled" in a hardware queue. This is no
problem for non-rotational devices for which blk-mq is aimed for but
not so nice for rotational disks.

  (i) I've done some tests with patch c2ed2f2dcf92 (blk-mq: first cut
      deadline scheduling) from branch mq-deadline of linux-block
      repository. I've not seen a significant performance impact when
      enabling it (neither for non-rotational nor for rotational
      disks).

 (ii) I've played with code to enable sorting/merging of requests. I
      did this in flush_busy_ctxs. This didn't have a performance
      impact either. On a closer look this was due to high frequency
      of calls to __blk_mq_run_hw_queue. There was almost nothing to
      sort (too few requests). I guess that's also the reason why (i)
      had not much impact.

(iii) With CFQ I've observed similar performance patterns to blk-mq if
      slice_idle was set to 0.

 (iv) I thought about introducing a per software queue time slice
      during which blk-mq will service only one software queue (one
      CPU) and not flush all software queues. This could help to
      enqueue multiple requests belonging to the same process (as long
      as it runs on same CPU) into a hardware queue.  A minimal patch
      to implement this is attached below.

The latter helped to improve performance for sequential reads and
writes. But it's not on a par with CFQ. Increasing the time slice is
suboptimal (as shown with the 2ms results, see below). It might be
possible to get better performance when further reducing the initial
time slice and adapting it up to a maximum value if there are
repeatedly pending requests for a CPU.

After these observations and assuming that non-rotational devices are
most likely fine using blk-mq without I/O scheduling support I wonder
whether

- it's really a good idea to re-implement scheduling support for
  blk-mq that eventually behaves like CFQ for rotational devices.

- it's technical possible to support both blk-mq and CFQ for different
  devices on the same host adapter. This would allow to use "good old"
  code for "good old" rotational devices. (But this might not be a
  choice if in the long run a goal is to get rid of non-blk-mq code --
  not sure what the plans are.)

What do you think about this?


Thanks,

Andreas

---

 Here are some results obtained with below patch on top of Linux
 4.4.0-rc1. to illustrate (iv).
 - Intel Core i7-3770S CPU @ 3.10GHz system, 4 cores, 8 threads/CPUs
 - fio version as of 2.2.9-37-g0e1c4
 - results were gathered iterating using rw and numjobs parameter, e.g.:

    fio --directory=$1 --rw=$j --name=fio-scaling --size=5G --group_reporting \
        --ioengine=libaio --direct=1 --iodepth=1 --runtime=$2 --numjobs=$i

 - rotational device was a 1500GB Samsung HD155UI

    ata3.00: ATA-8: SAMSUNG HD155UI, 1AQ10001, max UDMA/133
    scsi 2:0:0:0: Direct-Access     ATA      SAMSUNG HD155UI  0001 PQ: 0 ANSI: 5

 (1) cfq, slice_idle=8 (default)
 (2) cfq, slice_idle=0
 (3) blk-mq, use_time_slice=0 (behaviour matches unmodified mainline) 
 (4) blk-mq, use_time_slice=1 (1ms)
 (5) blk-mq, use_time_slice=2 (2ms)

 --------------------------------     -------------------------------
 n   cfq   cfq   blk   blk   blk      n   cfq   cfq   blk   blk   blk
 u               -mq   -mq   -mq      u               -mq   -mq   -mq
 m   (1)   (2)   (3)   (4)   (5)      m   (1)   (2)   (3)   (4)   (5)
 j                                    j                        
 o                     1ms   2ms      o                     1ms   2ms
 b                                    b                        
 s                                    s                                
 --------------------------------     -------------------------------
          read iops                         randread iops              
 --------------------------------     -------------------------------
 1  17594 17344 19892 19452 19505     1   102   103   100   102   102
 2  16939  8394  8201 13187 12141     2    97    97    97    97    97
 3  16972  7410  8759 12243 11311     3   103   102   101   102   103
 4  15125  8692  9418 11585 11388     4   109   108   109   109   108
 5  14721  6159  6050 10430 10539     5   119   119   118   118   118
 6  13833  5997  6087  8775  8812     6   125   125   126   125   125
 7  13247  6022  6187  8493  9354     7   128   132   130   130   129
 8  12979  5970  6169  8251  7688     8   137   136   135   137   136
 --------------------------------     -------------------------------
          write iops                        randwrite iops             
 --------------------------------     -------------------------------
 1  15841 15988 17239 17306 17274     1   182   183   184   181   179
 2  16679  9860 10203 13512 11073     2   169   174   172   170   170
 3  16579  9684  9928 11430 10737     3   162   163   160   161   160
 4  16795 10095 10078 10810 10331     4   161   162   162   163   163
 5  16586  9980 10051  9802  9567     5   162   160   162   162   161
 6  16621  9895  9857 10213  9920     6   161   163   159   159   160
 7  16387  9780  9783  9812  9484     7   165   164   164   159   160
 8  16513  9652  9658  9778  9572     8   165   165   165   164   165
 --------------------------------     -------------------------------
          rw (read) iops                    randrw (read) iops      
 --------------------------------     -------------------------------
 1   4061  3840  4301  4142  4172     1    65    65    65    64    64
 2   4028  4569  4807  3815  3792     2    63    63    63    64    63
 3   3615  3128  3502  3216  3202     3    64    63    63    64    63
 4   3611  2889  3207  2821  2862     4    66    67    66    66    67
 5   3419  3026  3033  2819  2871     5    71    70    71    70    71
 6   3450  2981  2969  2619  2656     6    74    74    73    73    73
 7   3448  2719  2709  2565  2594     7    76    75    74    75    76
 8   3285  2283  2184  2233  2408     8    76    77    78    77    77
 --------------------------------     -------------------------------
          rw (write) iops                   randrw (write) iops     
 --------------------------------     -------------------------------
 1   4057  3841  4295  4138  4168     1    63    63    62    62    62
 2   4014  4562  4802  3798  3773     2    60    60    60    60    60
 3   3615  3132  3503  3216  3201     3    55    55    55    55    54
 4   3611  2885  3205  2825  2865     4    59    59    58    58    59
 5   3418  3044  3049  2823  2877     5    61    61    62    61    62
 6   3474  2984  2970  2615  2653     6    64    64    63    63    63
 7   3451  2711  2702  2565  2593     7    65    65    65    65    65
 8   3280  2285  2188  2234  2403     8    68    68    68    67    67
 --------------------------------     -------------------------------
---
blk-mq: Introduce per sw queue time-slice

Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
---
 block/blk-mq-sysfs.c   |  27 +++++++++
 block/blk-mq.c         | 162 ++++++++++++++++++++++++++++++++++++++++---------
 include/linux/blk-mq.h |   6 ++
 3 files changed, 165 insertions(+), 30 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 1cf1878..400bf93 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -247,6 +247,26 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static ssize_t blk_mq_hw_sysfs_tslice_show(struct blk_mq_hw_ctx *hctx,
+					  char *page)
+{
+	return sprintf(page, "%u\n", hctx->use_time_slice);
+}
+
+static ssize_t blk_mq_hw_sysfs_tslice_store(struct blk_mq_hw_ctx *hctx,
+					const char *page, size_t length)
+{
+	unsigned long long store;
+	int err;
+
+	err = kstrtoull(page, 10, &store);
+	if (err)
+		return -EINVAL;
+
+	hctx->use_time_slice = (unsigned) store;
+	return length;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
 	.attr = {.name = "dispatched", .mode = S_IRUGO },
 	.show = blk_mq_sysfs_dispatched_show,
@@ -305,6 +325,12 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
 	.show = blk_mq_hw_sysfs_poll_show,
 };
 
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tslice = {
+	.attr = {.name = "use_time_slice", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_mq_hw_sysfs_tslice_show,
+	.store = blk_mq_hw_sysfs_tslice_store,
+};
+
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_queued.attr,
 	&blk_mq_hw_sysfs_run.attr,
@@ -314,6 +340,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_cpus.attr,
 	&blk_mq_hw_sysfs_active.attr,
 	&blk_mq_hw_sysfs_poll.attr,
+	&blk_mq_hw_sysfs_tslice.attr,
 	NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ae09de..b71d864 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -68,6 +68,8 @@ static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
 
 	if (!test_bit(CTX_TO_BIT(hctx, ctx), &bm->word))
 		set_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+
+	cpumask_set_cpu(ctx->cpu, hctx->cpu_pending_mask);
 }
 
 static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
@@ -76,6 +78,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	struct blk_align_bitmap *bm = get_bm(hctx, ctx);
 
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+	cpumask_clear_cpu(ctx->cpu, hctx->cpu_pending_mask);
 }
 
 void blk_mq_freeze_queue_start(struct request_queue *q)
@@ -685,10 +688,26 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
  * Process software queues that have been marked busy, splicing them
  * to the for-dispatch
  */
-static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
+			struct list_head *list,	int cpu)
 {
+	struct request_queue *q = hctx->queue;
 	struct blk_mq_ctx *ctx;
 	int i;
+	const int use_time_slice = hctx->use_time_slice;
+
+	if (use_time_slice && cpu != -1) {
+		if (cpu != -1) {
+			ctx = __blk_mq_get_ctx(q, cpu);
+			spin_lock(&ctx->lock);
+			list_splice_tail_init(&ctx->rq_list, list);
+			spin_lock(&hctx->lock);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+			spin_unlock(&hctx->lock);
+			spin_unlock(&ctx->lock);
+		}
+		return;
+	}
 
 	for (i = 0; i < hctx->ctx_map.size; i++) {
 		struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];
@@ -705,9 +724,11 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 				break;
 
 			ctx = hctx->ctxs[bit + off];
-			clear_bit(bit, &bm->word);
 			spin_lock(&ctx->lock);
 			list_splice_tail_init(&ctx->rq_list, list);
+			spin_lock(&hctx->lock);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+			spin_unlock(&hctx->lock);
 			spin_unlock(&ctx->lock);
 
 			bit++;
@@ -716,6 +737,71 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 
 /*
+ * It'd be great if the workqueue API had a way to pass
+ * in a mask and had some smarts for more clever placement.
+ * For now we just round-robin here, switching for every
+ * BLK_MQ_CPU_WORK_BATCH queued items.
+ */
+static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->queue->nr_hw_queues == 1)
+		return WORK_CPU_UNBOUND;
+
+	if (--hctx->next_cpu_batch <= 0) {
+		int cpu = hctx->next_cpu, next_cpu;
+
+		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
+		if (next_cpu >= nr_cpu_ids)
+			next_cpu = cpumask_first(hctx->cpumask);
+
+		hctx->next_cpu = next_cpu;
+		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
+
+		return cpu;
+	}
+
+	return hctx->next_cpu;
+}
+
+static int blk_mq_sched_slice_expired(struct blk_mq_hw_ctx *hctx)
+{
+	/*
+	 * Use time slice before requests of other CPUs are serviced.
+	 * Maybe change to serve n requests certain number of sectors
+	 * before switching to other CPU.
+	 */
+	if (time_after_eq(jiffies, hctx->sched_slice_exp))
+		return 1;
+
+	return 0;
+}
+
+static int blk_mq_sched_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	int c = hctx->sched_cpu;
+
+	if (c != -1)
+		cpumask_clear_cpu(c, hctx->cpu_service_mask);
+
+	if (cpumask_and(hctx->cpu_next_mask, hctx->cpu_pending_mask,
+				hctx->cpu_service_mask)) {
+		c = cpumask_first(hctx->cpu_next_mask);
+	} else {
+		/* we are idle, epoch ended, reset */
+		hctx->sched_slice_exp = 0;
+		cpumask_setall(hctx->cpu_service_mask);
+		c = cpumask_first(hctx->cpu_pending_mask);
+	}
+
+	if (c >= nr_cpu_ids)
+		return -1;
+
+	hctx->sched_slice_exp = jiffies +
+		msecs_to_jiffies(hctx->use_time_slice);
+	return c;
+}
+
+/*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
  * of IO. In particular, we'd like FIFO behaviour on handling existing
@@ -729,6 +815,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
 	int queued;
+	const int use_time_slice = hctx->use_time_slice;
+	int cpu = -1;
 
 	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
 
@@ -737,10 +825,19 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	hctx->run++;
 
+	if (use_time_slice) {
+		spin_lock(&hctx->lock);
+		if (blk_mq_sched_slice_expired(hctx))
+			hctx->sched_cpu = blk_mq_sched_next_cpu(hctx);
+		cpu = hctx->sched_cpu;
+		spin_unlock(&hctx->lock);
+		/* don't leave, but check dispatch queue if cpu == -1 */
+	}
+
 	/*
 	 * Touch any software queue that has pending entries.
 	 */
-	flush_busy_ctxs(hctx, &rq_list);
+	flush_busy_ctxs(hctx, &rq_list, cpu);
 
 	/*
 	 * If we have previous entries on our dispatch list, grab them
@@ -825,36 +922,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		 * blk_mq_run_hw_queue() already checks the STOPPED bit
 		 **/
 		blk_mq_run_hw_queue(hctx, true);
+	} else if (use_time_slice && !cpumask_empty(hctx->cpu_pending_mask)) {
+		long t = hctx->sched_slice_exp - jiffies;
+		t = (t < 0) ? 0 : t;
+		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+						&hctx->run_work,
+						(unsigned int)t);
 	}
 }
 
-/*
- * It'd be great if the workqueue API had a way to pass
- * in a mask and had some smarts for more clever placement.
- * For now we just round-robin here, switching for every
- * BLK_MQ_CPU_WORK_BATCH queued items.
- */
-static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
-{
-	if (hctx->queue->nr_hw_queues == 1)
-		return WORK_CPU_UNBOUND;
-
-	if (--hctx->next_cpu_batch <= 0) {
-		int cpu = hctx->next_cpu, next_cpu;
-
-		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
-		if (next_cpu >= nr_cpu_ids)
-			next_cpu = cpumask_first(hctx->cpumask);
-
-		hctx->next_cpu = next_cpu;
-		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
-
-		return cpu;
-	}
-
-	return hctx->next_cpu;
-}
-
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
@@ -991,7 +1067,10 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 
 	__blk_mq_insert_req_list(hctx, ctx, rq, at_head);
+	spin_lock(&hctx->lock);
 	blk_mq_hctx_mark_pending(hctx, ctx);
+	spin_unlock(&hctx->lock);
+
 }
 
 void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
@@ -1580,7 +1659,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_list)) {
 		list_splice_init(&ctx->rq_list, &tmp);
+		spin_lock(&hctx->lock);
 		blk_mq_hctx_clear_pending(hctx, ctx);
+		spin_unlock(&hctx->lock);
 	}
 	spin_unlock(&ctx->lock);
 
@@ -1599,7 +1680,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	}
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	spin_lock(&hctx->lock);
 	blk_mq_hctx_mark_pending(hctx, ctx);
+	spin_unlock(&hctx->lock);
 
 	spin_unlock(&ctx->lock);
 
@@ -1688,6 +1771,10 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue_num = hctx_idx;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	hctx->use_time_slice = 0;
+	hctx->sched_slice_exp = 0;
+	hctx->sched_cpu = -1;
+
 	blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
 					blk_mq_hctx_notify, hctx);
 	blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
@@ -2010,6 +2097,18 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						node))
 			goto err_hctxs;
 
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_pending_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_service_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_next_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
 		atomic_set(&hctxs[i]->nr_active, 0);
 		hctxs[i]->numa_node = node;
 		hctxs[i]->queue_num = i;
@@ -2073,6 +2172,9 @@ err_hctxs:
 		if (!hctxs[i])
 			break;
 		free_cpumask_var(hctxs[i]->cpumask);
+		free_cpumask_var(hctxs[i]->cpu_pending_mask);
+		free_cpumask_var(hctxs[i]->cpu_service_mask);
+		free_cpumask_var(hctxs[i]->cpu_next_mask);
 		kfree(hctxs[i]);
 	}
 err_map:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index daf17d7..6e192b5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -28,6 +28,9 @@ struct blk_mq_hw_ctx {
 	struct delayed_work	run_work;
 	struct delayed_work	delay_work;
 	cpumask_var_t		cpumask;
+	cpumask_var_t		cpu_service_mask;
+	cpumask_var_t		cpu_pending_mask;
+	cpumask_var_t		cpu_next_mask;
 	int			next_cpu;
 	int			next_cpu_batch;
 
@@ -57,6 +60,9 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	int			use_time_slice;
+	unsigned long		sched_slice_exp;
+	int			sched_cpu;
 	struct blk_mq_cpu_notifier	cpu_notifier;
 	struct kobject		kobj;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread