* [RFC] blk-mq and I/O scheduling
@ 2015-11-19 12:02 Andreas Herrmann
  2015-11-24  8:19 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andreas Herrmann @ 2015-11-19 12:02 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: linux-kernel

Hi,

I've looked into blk-mq and possible support for I/O scheduling.

The reason for this is to minimize performance degradation with
rotational devices when scsi_mod.use_blk_mq=1 is switched on.
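
For reference, switching this on is roughly as follows (the runtime path
below is given from memory and only affects hosts probed after the write):

    # enable scsi-mq on the kernel command line
    scsi_mod.use_blk_mq=1
    # or at runtime via the module parameter
    echo Y > /sys/module/scsi_mod/parameters/use_blk_mq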

I think that the degradation is well reflected with fio measurements.
With an increasing number of jobs you'll encounter a significant
performance drop for sequential reads and writes with blk-mq in
contrast to CFQ. blk-mq ensures that requests from different processes
(CPUs) are "perfectly shuffled" in a hardware queue. This is no
problem for non-rotational devices, at which blk-mq is aimed, but
not so nice for rotational disks.

  (i) I've done some tests with patch c2ed2f2dcf92 (blk-mq: first cut
      deadline scheduling) from branch mq-deadline of linux-block
      repository. I've not seen a significant performance impact when
      enabling it (neither for non-rotational nor for rotational
      disks).

 (ii) I've played with code to enable sorting/merging of requests. I
      did this in flush_busy_ctxs. This didn't have a performance
      impact either. On a closer look this was due to the high frequency
      of calls to __blk_mq_run_hw_queue. There was almost nothing to
      sort (too few requests). I guess that's also the reason why (i)
      didn't have much impact.

(iii) With CFQ I've observed similar performance patterns to blk-mq if
      slice_idle was set to 0.

 (iv) I thought about introducing a per software queue time slice
      during which blk-mq will service only one software queue (one
      CPU) and not flush all software queues. This could help to
      enqueue multiple requests belonging to the same process (as long
      as it runs on the same CPU) into a hardware queue.  A minimal patch
      to implement this is attached below.

The latter helped to improve performance for sequential reads and
writes. But it's not on a par with CFQ. Increasing the time slice is
suboptimal (as shown with the 2ms results, see below). It might be
possible to get better performance by further reducing the initial
time slice and adapting it up to a maximum value if there are
repeatedly pending requests for a CPU.

After these observations and assuming that non-rotational devices are
most likely fine using blk-mq without I/O scheduling support, I wonder
whether

- it's really a good idea to re-implement scheduling support for
  blk-mq that eventually behaves like CFQ for rotational devices.

- it's technically possible to support both blk-mq and CFQ for different
  devices on the same host adapter. This would allow using "good old"
  code for "good old" rotational devices. (But this might not be a
  choice if in the long run a goal is to get rid of non-blk-mq code --
  not sure what the plans are.)

What do you think about this?


Thanks,

Andreas

---

 Here are some results obtained with the patch below on top of Linux
 4.4.0-rc1 to illustrate (iv).
 - Intel Core i7-3770S CPU @ 3.10GHz system, 4 cores, 8 threads/CPUs
 - fio version as of 2.2.9-37-g0e1c4
 - results were gathered by iterating over the rw and numjobs parameters
   (see the loop sketch after the device details below), e.g.:

    fio --directory=$1 --rw=$j --name=fio-scaling --size=5G --group_reporting \
        --ioengine=libaio --direct=1 --iodepth=1 --runtime=$2 --numjobs=$i

 - rotational device was a 1500GB Samsung HD155UI

    ata3.00: ATA-8: SAMSUNG HD155UI, 1AQ10001, max UDMA/133
    scsi 2:0:0:0: Direct-Access     ATA      SAMSUNG HD155UI  0001 PQ: 0 ANSI: 5
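
 - the surrounding loop was roughly as follows (a sketch only; $1 is the
   target directory and $2 the runtime, as in the command above, and the
   rw types and job counts are those shown in the tables):

    for j in read write randread randwrite rw randrw; do
        for i in $(seq 1 8); do
            fio --directory=$1 --rw=$j --name=fio-scaling --size=5G --group_reporting \
                --ioengine=libaio --direct=1 --iodepth=1 --runtime=$2 --numjobs=$i
        done
    done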

 (1) cfq, slice_idle=8 (default)
 (2) cfq, slice_idle=0
 (3) blk-mq, use_time_slice=0 (behaviour matches unmodified mainline) 
 (4) blk-mq, use_time_slice=1 (1ms)
 (5) blk-mq, use_time_slice=2 (2ms)
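
 For runs (4) and (5) the time slice was set per hardware queue via the
 new sysfs attribute, roughly like this (the device name is only an
 example; run (5) uses 2 ms instead of 1):

    for f in /sys/block/sdb/mq/*/use_time_slice; do echo 1 > $f; done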

 --------------------------------     -------------------------------
 n   cfq   cfq   blk   blk   blk      n   cfq   cfq   blk   blk   blk
 u               -mq   -mq   -mq      u               -mq   -mq   -mq
 m   (1)   (2)   (3)   (4)   (5)      m   (1)   (2)   (3)   (4)   (5)
 j                                    j                        
 o                     1ms   2ms      o                     1ms   2ms
 b                                    b                        
 s                                    s                                
 --------------------------------     -------------------------------
          read iops                         randread iops              
 --------------------------------     -------------------------------
 1  17594 17344 19892 19452 19505     1   102   103   100   102   102
 2  16939  8394  8201 13187 12141     2    97    97    97    97    97
 3  16972  7410  8759 12243 11311     3   103   102   101   102   103
 4  15125  8692  9418 11585 11388     4   109   108   109   109   108
 5  14721  6159  6050 10430 10539     5   119   119   118   118   118
 6  13833  5997  6087  8775  8812     6   125   125   126   125   125
 7  13247  6022  6187  8493  9354     7   128   132   130   130   129
 8  12979  5970  6169  8251  7688     8   137   136   135   137   136
 --------------------------------     -------------------------------
          write iops                        randwrite iops             
 --------------------------------     -------------------------------
 1  15841 15988 17239 17306 17274     1   182   183   184   181   179
 2  16679  9860 10203 13512 11073     2   169   174   172   170   170
 3  16579  9684  9928 11430 10737     3   162   163   160   161   160
 4  16795 10095 10078 10810 10331     4   161   162   162   163   163
 5  16586  9980 10051  9802  9567     5   162   160   162   162   161
 6  16621  9895  9857 10213  9920     6   161   163   159   159   160
 7  16387  9780  9783  9812  9484     7   165   164   164   159   160
 8  16513  9652  9658  9778  9572     8   165   165   165   164   165
 --------------------------------     -------------------------------
          rw (read) iops                    randrw (read) iops      
 --------------------------------     -------------------------------
 1   4061  3840  4301  4142  4172     1    65    65    65    64    64
 2   4028  4569  4807  3815  3792     2    63    63    63    64    63
 3   3615  3128  3502  3216  3202     3    64    63    63    64    63
 4   3611  2889  3207  2821  2862     4    66    67    66    66    67
 5   3419  3026  3033  2819  2871     5    71    70    71    70    71
 6   3450  2981  2969  2619  2656     6    74    74    73    73    73
 7   3448  2719  2709  2565  2594     7    76    75    74    75    76
 8   3285  2283  2184  2233  2408     8    76    77    78    77    77
 --------------------------------     -------------------------------
          rw (write) iops                   randrw (write) iops     
 --------------------------------     -------------------------------
 1   4057  3841  4295  4138  4168     1    63    63    62    62    62
 2   4014  4562  4802  3798  3773     2    60    60    60    60    60
 3   3615  3132  3503  3216  3201     3    55    55    55    55    54
 4   3611  2885  3205  2825  2865     4    59    59    58    58    59
 5   3418  3044  3049  2823  2877     5    61    61    62    61    62
 6   3474  2984  2970  2615  2653     6    64    64    63    63    63
 7   3451  2711  2702  2565  2593     7    65    65    65    65    65
 8   3280  2285  2188  2234  2403     8    68    68    68    67    67
 --------------------------------     -------------------------------
---
blk-mq: Introduce per sw queue time-slice

Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
---
 block/blk-mq-sysfs.c   |  27 +++++++++
 block/blk-mq.c         | 162 ++++++++++++++++++++++++++++++++++++++++---------
 include/linux/blk-mq.h |   6 ++
 3 files changed, 165 insertions(+), 30 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 1cf1878..400bf93 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -247,6 +247,26 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static ssize_t blk_mq_hw_sysfs_tslice_show(struct blk_mq_hw_ctx *hctx,
+					  char *page)
+{
+	return sprintf(page, "%u\n", hctx->use_time_slice);
+}
+
+static ssize_t blk_mq_hw_sysfs_tslice_store(struct blk_mq_hw_ctx *hctx,
+					const char *page, size_t length)
+{
+	unsigned long long store;
+	int err;
+
+	err = kstrtoull(page, 10, &store);
+	if (err)
+		return -EINVAL;
+
+	hctx->use_time_slice = (unsigned) store;
+	return length;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
 	.attr = {.name = "dispatched", .mode = S_IRUGO },
 	.show = blk_mq_sysfs_dispatched_show,
@@ -305,6 +325,12 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
 	.show = blk_mq_hw_sysfs_poll_show,
 };
 
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tslice = {
+	.attr = {.name = "use_time_slice", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_mq_hw_sysfs_tslice_show,
+	.store = blk_mq_hw_sysfs_tslice_store,
+};
+
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_queued.attr,
 	&blk_mq_hw_sysfs_run.attr,
@@ -314,6 +340,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_cpus.attr,
 	&blk_mq_hw_sysfs_active.attr,
 	&blk_mq_hw_sysfs_poll.attr,
+	&blk_mq_hw_sysfs_tslice.attr,
 	NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ae09de..b71d864 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -68,6 +68,8 @@ static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
 
 	if (!test_bit(CTX_TO_BIT(hctx, ctx), &bm->word))
 		set_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+
+	cpumask_set_cpu(ctx->cpu, hctx->cpu_pending_mask);
 }
 
 static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
@@ -76,6 +78,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	struct blk_align_bitmap *bm = get_bm(hctx, ctx);
 
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+	cpumask_clear_cpu(ctx->cpu, hctx->cpu_pending_mask);
 }
 
 void blk_mq_freeze_queue_start(struct request_queue *q)
@@ -685,10 +688,26 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
  * Process software queues that have been marked busy, splicing them
  * to the for-dispatch
  */
-static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
+			struct list_head *list,	int cpu)
 {
+	struct request_queue *q = hctx->queue;
 	struct blk_mq_ctx *ctx;
 	int i;
+	const int use_time_slice = hctx->use_time_slice;
+
+	if (use_time_slice && cpu != -1) {
+		if (cpu != -1) {
+			ctx = __blk_mq_get_ctx(q, cpu);
+			spin_lock(&ctx->lock);
+			list_splice_tail_init(&ctx->rq_list, list);
+			spin_lock(&hctx->lock);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+			spin_unlock(&hctx->lock);
+			spin_unlock(&ctx->lock);
+		}
+		return;
+	}
 
 	for (i = 0; i < hctx->ctx_map.size; i++) {
 		struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];
@@ -705,9 +724,11 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 				break;
 
 			ctx = hctx->ctxs[bit + off];
-			clear_bit(bit, &bm->word);
 			spin_lock(&ctx->lock);
 			list_splice_tail_init(&ctx->rq_list, list);
+			spin_lock(&hctx->lock);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+			spin_unlock(&hctx->lock);
 			spin_unlock(&ctx->lock);
 
 			bit++;
@@ -716,6 +737,71 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 
 /*
+ * It'd be great if the workqueue API had a way to pass
+ * in a mask and had some smarts for more clever placement.
+ * For now we just round-robin here, switching for every
+ * BLK_MQ_CPU_WORK_BATCH queued items.
+ */
+static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->queue->nr_hw_queues == 1)
+		return WORK_CPU_UNBOUND;
+
+	if (--hctx->next_cpu_batch <= 0) {
+		int cpu = hctx->next_cpu, next_cpu;
+
+		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
+		if (next_cpu >= nr_cpu_ids)
+			next_cpu = cpumask_first(hctx->cpumask);
+
+		hctx->next_cpu = next_cpu;
+		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
+
+		return cpu;
+	}
+
+	return hctx->next_cpu;
+}
+
+static int blk_mq_sched_slice_expired(struct blk_mq_hw_ctx *hctx)
+{
+	/*
+	 * Use time slice before requests of other CPUs are serviced.
+	 * Maybe change to serve n requests certain number of sectors
+	 * before switching to other CPU.
+	 */
+	if (time_after_eq(jiffies, hctx->sched_slice_exp))
+		return 1;
+
+	return 0;
+}
+
+static int blk_mq_sched_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	int c = hctx->sched_cpu;
+
+	if (c != -1)
+		cpumask_clear_cpu(c, hctx->cpu_service_mask);
+
+	if (cpumask_and(hctx->cpu_next_mask, hctx->cpu_pending_mask,
+				hctx->cpu_service_mask)) {
+		c = cpumask_first(hctx->cpu_next_mask);
+	} else {
+		/* we are idle, epoch ended, reset */
+		hctx->sched_slice_exp = 0;
+		cpumask_setall(hctx->cpu_service_mask);
+		c = cpumask_first(hctx->cpu_pending_mask);
+	}
+
+	if (c >= nr_cpu_ids)
+		return -1;
+
+	hctx->sched_slice_exp = jiffies +
+		msecs_to_jiffies(hctx->use_time_slice);
+	return c;
+}
+
+/*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
  * of IO. In particular, we'd like FIFO behaviour on handling existing
@@ -729,6 +815,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
 	int queued;
+	const int use_time_slice = hctx->use_time_slice;
+	int cpu = -1;
 
 	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
 
@@ -737,10 +825,19 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 
 	hctx->run++;
 
+	if (use_time_slice) {
+		spin_lock(&hctx->lock);
+		if (blk_mq_sched_slice_expired(hctx))
+			hctx->sched_cpu = blk_mq_sched_next_cpu(hctx);
+		cpu = hctx->sched_cpu;
+		spin_unlock(&hctx->lock);
+		/* don't leave, but check dispatch queue if cpu == -1 */
+	}
+
 	/*
 	 * Touch any software queue that has pending entries.
 	 */
-	flush_busy_ctxs(hctx, &rq_list);
+	flush_busy_ctxs(hctx, &rq_list, cpu);
 
 	/*
 	 * If we have previous entries on our dispatch list, grab them
@@ -825,36 +922,15 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		 * blk_mq_run_hw_queue() already checks the STOPPED bit
 		 **/
 		blk_mq_run_hw_queue(hctx, true);
+	} else if (use_time_slice && !cpumask_empty(hctx->cpu_pending_mask)) {
+		long t = hctx->sched_slice_exp - jiffies;
+		t = (t < 0) ? 0 : t;
+		kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+						&hctx->run_work,
+						(unsigned int)t);
 	}
 }
 
-/*
- * It'd be great if the workqueue API had a way to pass
- * in a mask and had some smarts for more clever placement.
- * For now we just round-robin here, switching for every
- * BLK_MQ_CPU_WORK_BATCH queued items.
- */
-static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
-{
-	if (hctx->queue->nr_hw_queues == 1)
-		return WORK_CPU_UNBOUND;
-
-	if (--hctx->next_cpu_batch <= 0) {
-		int cpu = hctx->next_cpu, next_cpu;
-
-		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
-		if (next_cpu >= nr_cpu_ids)
-			next_cpu = cpumask_first(hctx->cpumask);
-
-		hctx->next_cpu = next_cpu;
-		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
-
-		return cpu;
-	}
-
-	return hctx->next_cpu;
-}
-
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
 {
 	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
@@ -991,7 +1067,10 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 
 	__blk_mq_insert_req_list(hctx, ctx, rq, at_head);
+	spin_lock(&hctx->lock);
 	blk_mq_hctx_mark_pending(hctx, ctx);
+	spin_unlock(&hctx->lock);
+
 }
 
 void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
@@ -1580,7 +1659,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_list)) {
 		list_splice_init(&ctx->rq_list, &tmp);
+		spin_lock(&hctx->lock);
 		blk_mq_hctx_clear_pending(hctx, ctx);
+		spin_unlock(&hctx->lock);
 	}
 	spin_unlock(&ctx->lock);
 
@@ -1599,7 +1680,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	}
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	spin_lock(&hctx->lock);
 	blk_mq_hctx_mark_pending(hctx, ctx);
+	spin_unlock(&hctx->lock);
 
 	spin_unlock(&ctx->lock);
 
@@ -1688,6 +1771,10 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue_num = hctx_idx;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	hctx->use_time_slice = 0;
+	hctx->sched_slice_exp = 0;
+	hctx->sched_cpu = -1;
+
 	blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
 					blk_mq_hctx_notify, hctx);
 	blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
@@ -2010,6 +2097,18 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						node))
 			goto err_hctxs;
 
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_pending_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_service_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_next_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
 		atomic_set(&hctxs[i]->nr_active, 0);
 		hctxs[i]->numa_node = node;
 		hctxs[i]->queue_num = i;
@@ -2073,6 +2172,9 @@ err_hctxs:
 		if (!hctxs[i])
 			break;
 		free_cpumask_var(hctxs[i]->cpumask);
+		free_cpumask_var(hctxs[i]->cpu_pending_mask);
+		free_cpumask_var(hctxs[i]->cpu_service_mask);
+		free_cpumask_var(hctxs[i]->cpu_next_mask);
 		kfree(hctxs[i]);
 	}
 err_map:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index daf17d7..6e192b5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -28,6 +28,9 @@ struct blk_mq_hw_ctx {
 	struct delayed_work	run_work;
 	struct delayed_work	delay_work;
 	cpumask_var_t		cpumask;
+	cpumask_var_t		cpu_service_mask;
+	cpumask_var_t		cpu_pending_mask;
+	cpumask_var_t		cpu_next_mask;
 	int			next_cpu;
 	int			next_cpu_batch;
 
@@ -57,6 +60,9 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	int			use_time_slice;
+	unsigned long		sched_slice_exp;
+	int			sched_cpu;
 	struct blk_mq_cpu_notifier	cpu_notifier;
 	struct kobject		kobj;
 
-- 
1.9.1



* Re: [RFC] blk-mq and I/O scheduling
  2015-11-19 12:02 [RFC] blk-mq and I/O scheduling Andreas Herrmann
@ 2015-11-24  8:19 ` Christoph Hellwig
  2015-12-01  7:37   ` Andreas Herrmann
  2015-11-25 19:47 ` Jens Axboe
  2016-02-01 22:43 ` [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice Andreas Herrmann
  2 siblings, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2015-11-24  8:19 UTC (permalink / raw)
  To: Andreas Herrmann; +Cc: Jens Axboe, linux-kernel, linux-block

Hi Andreas,

I don't understand the time slicing algorithm too well, but from the
blk-mq integration perspective this looks nice, and anything that
helps improve blk-mq for spinning rust is useful.

As a nitpick, some of the larger "if (use_time_slice)" blocks should
be moved into separate helper functions...



* Re: [RFC] blk-mq and I/O scheduling
  2015-11-19 12:02 [RFC] blk-mq and I/O scheduling Andreas Herrmann
  2015-11-24  8:19 ` Christoph Hellwig
@ 2015-11-25 19:47 ` Jens Axboe
  2015-12-01  7:43   ` Andreas Herrmann
  2016-02-01 22:43 ` [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice Andreas Herrmann
  2 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2015-11-25 19:47 UTC (permalink / raw)
  To: Andreas Herrmann, Christoph Hellwig; +Cc: linux-kernel

On 11/19/2015 05:02 AM, Andreas Herrmann wrote:
> Hi,
>
> I've looked into blk-mq and possible support for I/O scheduling.
>
> The reason for this is to minimize performance degradation with
> rotational devices when scsi_mod.use_blk_mq=1 is switched on.
>
> I think that the degradation is well reflected with fio measurements.
> With an increasing number of jobs you'll encounter a significant
> performance drop for sequential reads and writes with blk-mq in
> contrast to CFQ. blk-mq ensures that requests from different processes
> (CPUs) are "perfectly shuffled" in a hardware queue. This is no
> problem for non-rotational devices, at which blk-mq is aimed, but
> not so nice for rotational disks.
>
>    (i) I've done some tests with patch c2ed2f2dcf92 (blk-mq: first cut
>        deadline scheduling) from branch mq-deadline of linux-block
>        repository. I've not seen a significant performance impact when
>        enabling it (neither for non-rotational nor for rotational
>        disks).
>
>   (ii) I've played with code to enable sorting/merging of requests. I
>        did this in flush_busy_ctxs. This didn't have a performance
>        impact either. On a closer look this was due to high frequency
>        of calls to __blk_mq_run_hw_queue. There was almost nothing to
>        sort (too few requests). I guess that's also the reason why (i)
>        had not much impact.
>
> (iii) With CFQ I've observed similar performance patterns to blk-mq if
>        slice_idle was set to 0.
>
>   (iv) I thought about introducing a per software queue time slice
>        during which blk-mq will service only one software queue (one
>        CPU) and not flush all software queues. This could help to
>        enqueue multiple requests belonging to the same process (as long
>        as it runs on same CPU) into a hardware queue.  A minimal patch
>        to implement this is attached below.
>
> The latter helped to improve performance for sequential reads and
> writes. But it's not on a par with CFQ. Increasing the time slice is
> suboptimal (as shown with the 2ms results, see below). It might be
> possible to get better performance when further reducing the initial
> time slice and adapting it up to a maximum value if there are
> repeatedly pending requests for a CPU.
>
> After these observations and assuming that non-rotational devices are
> most likely fine using blk-mq without I/O scheduling support I wonder
> whether
>
> - it's really a good idea to re-implement scheduling support for
>    blk-mq that eventually behaves like CFQ for rotational devices.
>
> - it's technically possible to support both blk-mq and CFQ for different
>    devices on the same host adapter. This would allow using "good old"
>    code for "good old" rotational devices. (But this might not be a
>    choice if in the long run a goal is to get rid of non-blk-mq code --
>    not sure what the plans are.)
>
> What do you think about this?

Sorry I did not get around to properly looking at this this week, I'll 
tend to it next week. I think the concept of tying the idling to a 
specific CPU is likely fine, though I wonder if there are cases where we 
preempt more heavily and subsequently miss breaking the idling properly. 
I don't think we want/need cfq for blk-mq, but basic idling could 
potentially be enough. That's still a far cry from a full cfq 
implementation. The long term plans are still to move away from the 
legacy IO path, though with things like scheduling, that's sure to take 
some time.

That is actually where the mq-deadline work comes in. The mq-deadline 
work is missing a test patch to limit tag allocations, and a bunch of 
other little things to truly make it functional. There might be some 
options for folding it all together, with idling, as that would still be 
important on rotating storage going forward.

-- 
Jens Axboe



* Re: [RFC] blk-mq and I/O scheduling
  2015-11-24  8:19 ` Christoph Hellwig
@ 2015-12-01  7:37   ` Andreas Herrmann
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Herrmann @ 2015-12-01  7:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-kernel, linux-block

On Tue, Nov 24, 2015 at 09:19:32AM +0100, Christoph Hellwig wrote:
> Hi Andreas,

Hi Christoph,

> I don't understand the time slicing algorithm too well, but from the
> blk-mq integration perspective this looks nice, and anything that
> helps improve blk-mq for spinning rust is useful.

I'll put a description and comments in the next patch version that
hopefully explain it.

> As a nitpick, some of the larger "if (use_time_slice)" blocks should
> be moved into separate helper functions...

And I'll address this also.


Thanks,

Andreas


* Re: [RFC] blk-mq and I/O scheduling
  2015-11-25 19:47 ` Jens Axboe
@ 2015-12-01  7:43   ` Andreas Herrmann
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Herrmann @ 2015-12-01  7:43 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Christoph Hellwig, linux-kernel

On Wed, Nov 25, 2015 at 12:47:21PM -0700, Jens Axboe wrote:
> On 11/19/2015 05:02 AM, Andreas Herrmann wrote:

 --8<--

> >The latter helped to improve performance for sequential reads and
> >writes. But it's not on a par with CFQ. Increasing the time slice is
> >suboptimal (as shown with the 2ms results, see below). It might be
> >possible to get better performance when further reducing the initial
> >time slice and adapting it up to a maximum value if there are
> >repeatedly pending requests for a CPU.
> >
> >After these observations and assuming that non-rotational devices are
> >most likely fine using blk-mq without I/O scheduling support I wonder
> >whether
> >
> >- it's really a good idea to re-implement scheduling support for
> >   blk-mq that eventually behaves like CFQ for rotational devices.
> >
> >- it's technically possible to support both blk-mq and CFQ for different
> >   devices on the same host adapter. This would allow using "good old"
> >   code for "good old" rotational devices. (But this might not be a
> >   choice if in the long run a goal is to get rid of non-blk-mq code --
> >   not sure what the plans are.)
> >
> >What do you think about this?
> 
> Sorry I did not get around to properly looking at this this week,
> I'll tend to it next week. I think the concept of tying the idling
> to a specific CPU is likely fine, though I wonder if there are cases
> where we preempt more heavily and subsequently miss breaking the
> idling properly. I don't think we want/need cfq for blk-mq, but
> basic idling could potentially be enough. That's still a far cry
> from a full cfq implementation. The long term plans are still to
> move away from the legacy IO path, though with things like
> scheduling, that's sure to take some time.

FYI, I plan to send an updated patch later today.

I've slightly changed it to allow specification of a time slice in µs
(instead of ms) and to extend it for a software queue when requests
were actually put into the hardware queue for this specific software
queue. This improved performance a little bit.

> That is actually where the mq-deadline work comes in. The
> mq-deadline work is missing a test patch to limit tag allocations,
> and a bunch of other little things to truly make it functional.
> There might be some options for folding it all together, with
> idling, as that would still be important on rotating storage going
> forward.


Thanks for your comments,

Andreas


* [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice
  2015-11-19 12:02 [RFC] blk-mq and I/O scheduling Andreas Herrmann
  2015-11-24  8:19 ` Christoph Hellwig
  2015-11-25 19:47 ` Jens Axboe
@ 2016-02-01 22:43 ` Andreas Herrmann
  2016-02-01 22:46   ` Andreas Herrmann
  2016-02-09 17:12   ` Andreas Herrmann
  2 siblings, 2 replies; 12+ messages in thread
From: Andreas Herrmann @ 2016-02-01 22:43 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: linux-kernel, Johannes Thumshirn, Jan Kara

This introduces a new blk_mq hw attribute time_slice_us which allows
a time slice to be specified in usecs.

Default value is 0 and implies no modification to blk-mq behaviour.

A positive value changes blk-mq to service only one software queue
within this time slice until it expires or the software queue is
empty. Then the next software queue with pending requests is selected.

Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
---
 block/blk-mq-sysfs.c   |  27 +++++++
 block/blk-mq.c         | 208 +++++++++++++++++++++++++++++++++++++++++--------
 include/linux/blk-mq.h |   9 +++
 3 files changed, 211 insertions(+), 33 deletions(-)

Hi,

This update is long overdue (sorry for the delay).

Changes since v1:
- time slice is now specified in usecs instead of msecs.
- time slice is extended (up to 3 times the initial value) when there
  was actually a request to be serviced for the software queue

Fio test results are sent in a separate mail in reply to this one.

Results for fio improved to some extent with this patch. But in
reality the picture is quite mixed. Performance is highly dependent on
task scheduling. There is no guarantee that the requests originating
from one CPU belong to the same process.

I think for rotary devices CFQ is by far the best choice. A simple
illustration is:

  Copying two files (750MB in this case) in parallel on a rotary
  device. The elapsed wall clock time (seconds) for this is
                               mean    stdev
   cfq, slice_idle=8           16.18   4.95
   cfq, slice_idle=0           23.74   2.82
   blk-mq, time_slice_usec=0   24.37   2.05
   blk-mq, time_slice_usec=250 25.58   3.16
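
  The test itself was essentially (file names and target directory are
  placeholders):

    time sh -c 'cp big1 /mnt/test/c1 & cp big2 /mnt/test/c2 & wait'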


Regards,

Andreas

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 1cf1878..77c875c 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -247,6 +247,26 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
 	return ret;
 }
 
+static ssize_t blk_mq_hw_sysfs_tslice_show(struct blk_mq_hw_ctx *hctx,
+					  char *page)
+{
+	return sprintf(page, "%u\n", hctx->tslice_us);
+}
+
+static ssize_t blk_mq_hw_sysfs_tslice_store(struct blk_mq_hw_ctx *hctx,
+					const char *page, size_t length)
+{
+	unsigned long long store;
+	int err;
+
+	err = kstrtoull(page, 10, &store);
+	if (err)
+		return -EINVAL;
+
+	hctx->tslice_us = (unsigned)store;
+	return length;
+}
+
 static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
 	.attr = {.name = "dispatched", .mode = S_IRUGO },
 	.show = blk_mq_sysfs_dispatched_show,
@@ -305,6 +325,12 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
 	.show = blk_mq_hw_sysfs_poll_show,
 };
 
+static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tslice = {
+	.attr = {.name = "time_slice_us", .mode = S_IRUGO | S_IWUSR },
+	.show = blk_mq_hw_sysfs_tslice_show,
+	.store = blk_mq_hw_sysfs_tslice_store,
+};
+
 static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_queued.attr,
 	&blk_mq_hw_sysfs_run.attr,
@@ -314,6 +340,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
 	&blk_mq_hw_sysfs_cpus.attr,
 	&blk_mq_hw_sysfs_active.attr,
 	&blk_mq_hw_sysfs_poll.attr,
+	&blk_mq_hw_sysfs_tslice.attr,
 	NULL,
 };
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c0622f..97d32f2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -68,6 +68,7 @@ static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
 
 	if (!test_bit(CTX_TO_BIT(hctx, ctx), &bm->word))
 		set_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+	cpumask_set_cpu(ctx->cpu, hctx->cpu_pending_mask);
 }
 
 static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
@@ -76,6 +77,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	struct blk_align_bitmap *bm = get_bm(hctx, ctx);
 
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
+	cpumask_clear_cpu(ctx->cpu, hctx->cpu_pending_mask);
 }
 
 void blk_mq_freeze_queue_start(struct request_queue *q)
@@ -682,15 +684,41 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
 	return false;
 }
 
+static int tslice_flush_one_ctx(struct blk_mq_hw_ctx *hctx,
+				struct list_head *list,	int cpu)
+{
+	struct request_queue *q = hctx->queue;
+	struct blk_mq_ctx *ctx;
+
+	if (cpu == -1 || !hctx->tslice_us)
+		return 0;
+
+	ctx = __blk_mq_get_ctx(q, cpu);
+	spin_lock(&ctx->lock);
+	if (!list_empty(&ctx->rq_list)) {
+		list_splice_tail_init(&ctx->rq_list, list);
+		spin_lock(&hctx->lock);
+		hctx->tslice_inc = 1;
+		blk_mq_hctx_clear_pending(hctx, ctx);
+		spin_unlock(&hctx->lock);
+	}
+	spin_unlock(&ctx->lock);
+	return 1;
+}
+
 /*
  * Process software queues that have been marked busy, splicing them
  * to the for-dispatch
  */
-static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
+static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
+			struct list_head *list,	int cpu)
 {
 	struct blk_mq_ctx *ctx;
 	int i;
 
+	if (tslice_flush_one_ctx(hctx, list, cpu))
+		return;
+
 	for (i = 0; i < hctx->ctx_map.size; i++) {
 		struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];
 		unsigned int off, bit;
@@ -706,9 +734,11 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 				break;
 
 			ctx = hctx->ctxs[bit + off];
-			clear_bit(bit, &bm->word);
 			spin_lock(&ctx->lock);
 			list_splice_tail_init(&ctx->rq_list, list);
+			spin_lock(&hctx->lock);
+			blk_mq_hctx_clear_pending(hctx, ctx);
+			spin_unlock(&hctx->lock);
 			spin_unlock(&ctx->lock);
 
 			bit++;
@@ -717,6 +747,114 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
 }
 
 /*
+ * It'd be great if the workqueue API had a way to pass
+ * in a mask and had some smarts for more clever placement.
+ * For now we just round-robin here, switching for every
+ * BLK_MQ_CPU_WORK_BATCH queued items.
+ */
+static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	if (hctx->queue->nr_hw_queues == 1)
+		return WORK_CPU_UNBOUND;
+
+	if (--hctx->next_cpu_batch <= 0) {
+		int cpu = hctx->next_cpu, next_cpu;
+
+		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
+		if (next_cpu >= nr_cpu_ids)
+			next_cpu = cpumask_first(hctx->cpumask);
+
+		hctx->next_cpu = next_cpu;
+		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
+
+		return cpu;
+	}
+
+	return hctx->next_cpu;
+}
+
+static int blk_mq_tslice_expired(struct blk_mq_hw_ctx *hctx)
+{
+	if (time_after_eq(jiffies, hctx->tslice_expiration))
+		return 1;
+
+	return 0;
+}
+
+static int blk_mq_tslice_next_cpu(struct blk_mq_hw_ctx *hctx)
+{
+	int c = hctx->tslice_cpu;
+
+	/* allow extension of time slice */
+	if (hctx->tslice_inc && hctx->tslice_inc_count < 4) {
+		hctx->tslice_inc = 0;
+		hctx->tslice_inc_count++;
+		goto out;
+	}
+	hctx->tslice_inc = 0;
+	hctx->tslice_inc_count = 0;
+
+	/* clear CPU for which tslice has expired */
+	if (c != -1)
+		cpumask_clear_cpu(c, hctx->cpu_service_mask);
+
+	/* calculate mask of remaining CPUs with pending work */
+	if (cpumask_and(hctx->cpu_next_mask, hctx->cpu_pending_mask,
+				hctx->cpu_service_mask)) {
+		c = cpumask_first(hctx->cpu_next_mask);
+	} else {
+		/*
+		 * no remaining CPUs with pending work, reset epoch,
+		 * start with first CPU that has requests pending
+		 */
+		hctx->tslice_expiration = 0;
+		cpumask_setall(hctx->cpu_service_mask);
+		c = cpumask_first(hctx->cpu_pending_mask);
+	}
+
+	/* no CPU with pending requests */
+	if (c >= nr_cpu_ids)
+		return -1;
+
+out:
+	hctx->tslice_expiration = jiffies + usecs_to_jiffies(hctx->tslice_us);
+	return c;
+}
+
+static int tslice_get_busy_ctx(struct blk_mq_hw_ctx *hctx)
+{
+	int cpu = -1;
+
+	if (hctx->tslice_us) {
+		spin_lock(&hctx->lock);
+		if (blk_mq_tslice_expired(hctx))
+			hctx->tslice_cpu = blk_mq_tslice_next_cpu(hctx);
+		cpu = hctx->tslice_cpu;
+		spin_unlock(&hctx->lock);
+	}
+
+	return cpu;
+}
+
+/*
+ * Ensure that hardware queue is run if there are pending
+ * requests in any software queue.
+ */
+static void tslice_schedule_pending_work(struct blk_mq_hw_ctx *hctx)
+{
+	long t;
+
+	if (!hctx->tslice_us || cpumask_empty(hctx->cpu_pending_mask))
+		return;
+
+	t = hctx->tslice_expiration - jiffies;
+	if (t < 0)
+		t = 0;
+	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
+					&hctx->run_work, (unsigned int)t);
+}
+
+/*
  * Run this hardware queue, pulling any software queues mapped to it in.
  * Note that this function currently has various problems around ordering
  * of IO. In particular, we'd like FIFO behaviour on handling existing
@@ -729,7 +867,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	LIST_HEAD(rq_list);
 	LIST_HEAD(driver_list);
 	struct list_head *dptr;
-	int queued;
+	int queued, cpu;
 
 	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
 
@@ -739,9 +877,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 	hctx->run++;
 
 	/*
-	 * Touch any software queue that has pending entries.
+	 * Touch dedicated software queue if time slice is set or any
+	 * software queue that has pending entries (cpu == -1).
 	 */
-	flush_busy_ctxs(hctx, &rq_list);
+	cpu = tslice_get_busy_ctx(hctx);
+	flush_busy_ctxs(hctx, &rq_list, cpu);
 
 	/*
 	 * If we have previous entries on our dispatch list, grab them
@@ -826,34 +966,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
 		 * blk_mq_run_hw_queue() already checks the STOPPED bit
 		 **/
 		blk_mq_run_hw_queue(hctx, true);
-	}
-}
-
-/*
- * It'd be great if the workqueue API had a way to pass
- * in a mask and had some smarts for more clever placement.
- * For now we just round-robin here, switching for every
- * BLK_MQ_CPU_WORK_BATCH queued items.
- */
-static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
-{
-	if (hctx->queue->nr_hw_queues == 1)
-		return WORK_CPU_UNBOUND;
-
-	if (--hctx->next_cpu_batch <= 0) {
-		int cpu = hctx->next_cpu, next_cpu;
-
-		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
-		if (next_cpu >= nr_cpu_ids)
-			next_cpu = cpumask_first(hctx->cpumask);
-
-		hctx->next_cpu = next_cpu;
-		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
-
-		return cpu;
-	}
-
-	return hctx->next_cpu;
+	} else
+		tslice_schedule_pending_work(hctx);
 }
 
 void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
@@ -992,7 +1106,10 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
 	struct blk_mq_ctx *ctx = rq->mq_ctx;
 
 	__blk_mq_insert_req_list(hctx, ctx, rq, at_head);
+	spin_lock(&hctx->lock);
 	blk_mq_hctx_mark_pending(hctx, ctx);
+	spin_unlock(&hctx->lock);
+
 }
 
 void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
@@ -1583,7 +1700,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	spin_lock(&ctx->lock);
 	if (!list_empty(&ctx->rq_list)) {
 		list_splice_init(&ctx->rq_list, &tmp);
+		spin_lock(&hctx->lock);
 		blk_mq_hctx_clear_pending(hctx, ctx);
+		spin_unlock(&hctx->lock);
 	}
 	spin_unlock(&ctx->lock);
 
@@ -1602,7 +1721,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
 	}
 
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
+	spin_lock(&hctx->lock);
 	blk_mq_hctx_mark_pending(hctx, ctx);
+	spin_unlock(&hctx->lock);
 
 	spin_unlock(&ctx->lock);
 
@@ -1691,6 +1812,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	hctx->queue_num = hctx_idx;
 	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
 
+	hctx->tslice_us = 0;
+	hctx->tslice_expiration = 0;
+	hctx->tslice_cpu = -1;
+	hctx->tslice_inc = 0;
+	hctx->tslice_inc_count = 0;
+
 	blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
 					blk_mq_hctx_notify, hctx);
 	blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
@@ -2006,6 +2133,18 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						node))
 			goto err_hctxs;
 
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_pending_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_service_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
+		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_next_mask,
+						GFP_KERNEL, node))
+			goto err_hctxs;
+
 		atomic_set(&hctxs[i]->nr_active, 0);
 		hctxs[i]->numa_node = node;
 		hctxs[i]->queue_num = i;
@@ -2069,6 +2208,9 @@ err_hctxs:
 		if (!hctxs[i])
 			break;
 		free_cpumask_var(hctxs[i]->cpumask);
+		free_cpumask_var(hctxs[i]->cpu_pending_mask);
+		free_cpumask_var(hctxs[i]->cpu_service_mask);
+		free_cpumask_var(hctxs[i]->cpu_next_mask);
 		kfree(hctxs[i]);
 	}
 err_map:
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7fc9296..a8ca685 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -57,6 +57,15 @@ struct blk_mq_hw_ctx {
 
 	atomic_t		nr_active;
 
+	cpumask_var_t		cpu_service_mask; /* CPUs not yet serviced */
+	cpumask_var_t		cpu_pending_mask; /* CPUs with pending work */
+	cpumask_var_t		cpu_next_mask;	/* CPUs to be serviced */
+	int			tslice_us;	/* time slice in µs */
+	int			tslice_inc;	/* extend time slice */
+	int			tslice_inc_count;
+	unsigned long		tslice_expiration; /* in jiffies */
+	int			tslice_cpu;	/* cpu of time slice */
+
 	struct blk_mq_cpu_notifier	cpu_notifier;
 	struct kobject		kobj;
 
-- 
1.9.1


* Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice
  2016-02-01 22:43 ` [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice Andreas Herrmann
@ 2016-02-01 22:46   ` Andreas Herrmann
  2016-02-09 17:12   ` Andreas Herrmann
  1 sibling, 0 replies; 12+ messages in thread
From: Andreas Herrmann @ 2016-02-01 22:46 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe; +Cc: linux-kernel, Johannes Thumshirn, Jan Kara

Following benchmark results for the patch using fio.

 - kernel version 4.5.0-rc2-0001 (i.e. with time-slice patch)
 - Intel Core i7-3770S CPU @ 3.10GHz system, 4 cores, 8 threads/CPUs
 - fio version as of 2.2.9-37-g0e1c4
 - results were gathered by iterating over the rw and numjobs parameters, e.g.:

    fio --directory=$1 --rw=$j --name=fio-scaling --size=5G --group_reporting \
        --ioengine=libaio --direct=1 --iodepth=1 --runtime=$2 --numjobs=$i

 - rotational device was a 1500GB Samsung HD155UI

    ata3.00: ATA-8: SAMSUNG HD155UI, 1AQ10001, max UDMA/133
    scsi 2:0:0:0: Direct-Access     ATA      SAMSUNG HD155UI  0001 PQ: 0 ANSI: 5

 - no regression between blk-mq using vanilla v4.5-rc2 kernel and new
   blk-mq w/ time_slice_us=0
 - I omitted random read/write/rw tests as there was no significant
   difference
 - time_slice_us=250 seems to be the optimal choice
   (I compared 1000 µs, 500 µs, 250 µs and 50 µs)

 (1) cfq, slice_idle=8 (default)
 (2) cfq, slice_idle=0
 (3) blk-mq, time_slice_us=0 (behaviour matches unmodified mainline) 
 (4) blk-mq, time_slice_us=250  (250 µs) (time slice incremented)
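
 Run (4) was configured per hardware queue via the new sysfs attribute,
 roughly like this (the device name is only an example):

    for f in /sys/block/sdb/mq/*/time_slice_us; do echo 250 > $f; done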

 ---------------------------------------------------------------------
 n       cfq              cfq            blk-mq           blk-mq
 u       (1)              (2)              (3)              (4) 
 m
 j                                           0              250 
 o                                          µs               µs 
 b
 s
 ---------------------------------------------------------------------
                        read iops [mean (stdev)]
 ---------------------------------------------------------------------
 1 17394.6 (470.48) 17336.4 (382.88) 20324.5 ( 72.75) 18914.5 (324.23)
 2 15862.1 (416.38)  8427.8 (104.45)  8373.2 ( 71.97) 13216.6 (964.83)
 3 16597.1 (246.77)  7078.0 (107.29)  8671.3 (238.12) 12370.2 (276.76)
 4 15078.6 (132.34)  8589.8 (169.00)  9107.5 (328.00) 11582.0 (225.09)
 5 14392.2 ( 75.55)  5908.5 (120.76)  6052.0 (102.01) 10906.3 (435.44)
 6 13540.2 (142.83)  5902.9 (122.41)  5944.5 ( 90.23)  9758.9 (374.65)
 7 13112.4 (175.02)  5938.7 ( 99.70)  5982.2 (104.44)  9233.0 (616.16)
 8 12651.0 (187.11)  5955.9 (181.99)  6010.2 (105.62)  8908.5 (399.25)
 ---------------------------------------------------------------------
                        write iops [mean (stdev)]
 ---------------------------------------------------------------------
 1 15233.6 (236.28) 15062.4 (176.56) 17141.0 (107.48) 16236.8 (128.94)
 2 15333.5 (474.39)  9860.1 ( 58.42) 10020.0 ( 38.67) 15027.4 (288.07)
 3 15753.6 (459.48)  9552.5 (153.48)  9523.0 (111.82) 14021.5 (551.12)
 4 15932.2 (328.94)  9949.8 ( 97.03)  9946.0 (110.38) 13166.0 (289.70)
 5 15840.5 (407.32)  9886.9 ( 43.84)  9814.0 ( 60.92) 12401.2 (706.76)
 6 16071.2 (349.90)  9822.1 ( 88.81)  9796.0 ( 83.29) 11916.1 (904.20)
 7 15864.1 (353.14)  9684.6 ( 63.48)  9628.0 ( 35.54) 12194.0 (292.25)
 8 15901.2 (308.38)  9474.3 ( 86.85)  9447.0 ( 40.46) 11933.3 (633.21)
 ---------------------------------------------------------------------
                        rw (read) iops [mean (stdev)]
 ---------------------------------------------------------------------
 1  4034.5 ( 76.29)  4238.7 ( 65.71)  4305.9 (107.76)  4069.8 (115.45)
 2  3727.7 (108.46)  3860.5 (274.02)  4194.4 (132.19)  3909.3 (202.47)
 3  3837.6 ( 57.84)  3533.6 (206.67)  3560.5 (145.49)  3625.9 (170.29)
 4  3650.8 ( 53.90)  3201.1 ( 72.26)  3123.0 (154.18)  3506.5 (142.67)
 5  3638.8 (102.67)  3043.1 (122.89)  3210.6 ( 89.05)  3301.1 (194.63)
 6  3592.9 ( 93.41)  3002.3 (114.94)  3180.3 ( 42.36)  3297.5 (200.37)
 7  3666.1 ( 66.77)  3081.3 ( 92.76)  3120.7 (127.99)  3210.4 (139.00)                 
 8  3787.5 ( 46.90)  2859.6 (220.61)  2978.1 (119.55)  3215.4 (166.90)                 
 ---------------------------------------------------------------------
                        rw (write) iops [mean (stdev)]
 ---------------------------------------------------------------------
 1  4037.4 ( 74.32)  4241.1 ( 66.48)  4306.4 (105.00)  4073.3 (115.22)
 2  3740.1 (102.04)  3868.2 (265.28)  4193.9 (128.23)  3914.6 (195.13)
 3  3832.9 ( 58.52)  3526.3 (209.27)  3550.3 (148.23)  3622.4 (169.52)
 4  3647.1 ( 53.76)  3211.4 ( 68.84)  3133.7 (152.57)  3504.2 (138.19)
 5  3642.5 (102.65)  3051.0 (114.87)  3206.6 ( 84.58)  3303.5 (193.98)
 6  3610.2 ( 93.48)  3013.7 (124.22)  3208.8 ( 45.25)  3291.2 (197.91)
 7  3681.6 ( 66.38)  3077.4 ( 91.72)  3118.5 (130.27)  3184.9 (140.03)
 8  3792.6 ( 49.89)  2816.7 (235.14)  2943.7 (133.80)  3176.4 (164.24)
 ---------------------------------------------------------------------


* Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice
  2016-02-01 22:43 ` [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice Andreas Herrmann
  2016-02-01 22:46   ` Andreas Herrmann
@ 2016-02-09 17:12   ` Andreas Herrmann
  2016-02-09 17:41     ` Markus Trippelsdorf
  1 sibling, 1 reply; 12+ messages in thread
From: Andreas Herrmann @ 2016-02-09 17:12 UTC (permalink / raw)
  To: Christoph Hellwig, Jens Axboe
  Cc: linux-kernel, Johannes Thumshirn, Jan Kara, linux-block,
	linux-scsi, Hannes Reinecke

[CC-ing linux-block and linux-scsi and adding some comments]

On Mon, Feb 01, 2016 at 11:43:40PM +0100, Andreas Herrmann wrote:
> This introduces a new blk_mq hw attribute time_slice_us which allows
> a time slice to be specified in usecs.
> 
> Default value is 0 and implies no modification to blk-mq behaviour.
> 
> A positive value changes blk-mq to service only one software queue
> within this time slice until it expires or the software queue is
> empty. Then the next software queue with pending requests is selected.
> 
> Signed-off-by: Andreas Herrmann <aherrmann@suse.com>
> ---
>  block/blk-mq-sysfs.c   |  27 +++++++
>  block/blk-mq.c         | 208 +++++++++++++++++++++++++++++++++++++++++--------
>  include/linux/blk-mq.h |   9 +++
>  3 files changed, 211 insertions(+), 33 deletions(-)
> 
> Hi,
> 
> This update is long overdue (sorry for the delay).
> 
> Changes since v1:
> - time slice is now specified in usecs instead of msecs.
> - time slice is extended (up to 3 times the initial value) when there
>   was actually a request to be serviced for the software queue
> 
> Fio test results are sent in a separate mail in reply to this one.

See http://marc.info/?l=linux-kernel&m=145436682607949&w=2

In short it shows significant performance gains in some tests,
e.g. sequential read iops up by >40% with 8 jobs. But it's never on
par with CFQ when more than 1 job was used during the test.

> Results for fio improved to some extent with this patch. But in
> reality the picture is quite mixed. Performance is highly dependent on
> task scheduling. There is no guarantee that the requests originating
> from one CPU belong to the same process.
> 
> I think for rotary devices CFQ is by far the best choice. A simple
> illustration is:
> 
>   Copying two files (750MB in this case) in parallel on a rotary
>   device. The elapsed wall clock time (seconds) for this is
>                                mean    stdev
>    cfq, slice_idle=8           16.18   4.95
>    cfq, slice_idle=0           23.74   2.82
>    blk-mq, time_slice_usec=0   24.37   2.05
>    blk-mq, time_slice_usec=250 25.58   3.16

This illustrates that although there was a performance gain with the fio
tests, the patch can cause higher variance and lower performance in
comparison to unmodified blk-mq with other tests. And it underscores the
superiority of CFQ for rotary disks.

Meanwhile my opinion is that it's not really worth looking further
into the introduction of I/O scheduling support in blk-mq. I don't see the
need for scheduling support (deadline or something else) for fast
storage devices. And rotary devices should really avoid usage of blk-mq
and stick to CFQ.

Thus I think that introducing some coexistence of blk-mq and the
legacy block layer with CFQ is the best option.

Recently Johannes sent a patch to enable scsi-mq per driver, see
http://marc.info/?l=linux-scsi&m=145347009631192&w=2

Probably that is a good solution (at least in the short term) to allow
users to switch to blk-mq for some host adapters (with fast storage
attached) but to stick to legacy stuff on other host adapters with
rotary devices.

What do others think?


Thanks,

Andreas


> Regards,
> 
> Andreas
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 1cf1878..77c875c 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -247,6 +247,26 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>  	return ret;
>  }
>  
> +static ssize_t blk_mq_hw_sysfs_tslice_show(struct blk_mq_hw_ctx *hctx,
> +					  char *page)
> +{
> +	return sprintf(page, "%u\n", hctx->tslice_us);
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_tslice_store(struct blk_mq_hw_ctx *hctx,
> +					const char *page, size_t length)
> +{
> +	unsigned long long store;
> +	int err;
> +
> +	err = kstrtoull(page, 10, &store);
> +	if (err)
> +		return -EINVAL;
> +
> +	hctx->tslice_us = (unsigned)store;
> +	return length;
> +}
> +
>  static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
>  	.attr = {.name = "dispatched", .mode = S_IRUGO },
>  	.show = blk_mq_sysfs_dispatched_show,
> @@ -305,6 +325,12 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
>  	.show = blk_mq_hw_sysfs_poll_show,
>  };
>  
> +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tslice = {
> +	.attr = {.name = "time_slice_us", .mode = S_IRUGO | S_IWUSR },
> +	.show = blk_mq_hw_sysfs_tslice_show,
> +	.store = blk_mq_hw_sysfs_tslice_store,
> +};
> +
>  static struct attribute *default_hw_ctx_attrs[] = {
>  	&blk_mq_hw_sysfs_queued.attr,
>  	&blk_mq_hw_sysfs_run.attr,
> @@ -314,6 +340,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
>  	&blk_mq_hw_sysfs_cpus.attr,
>  	&blk_mq_hw_sysfs_active.attr,
>  	&blk_mq_hw_sysfs_poll.attr,
> +	&blk_mq_hw_sysfs_tslice.attr,
>  	NULL,
>  };
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c0622f..97d32f2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -68,6 +68,7 @@ static void blk_mq_hctx_mark_pending(struct blk_mq_hw_ctx *hctx,
>  
>  	if (!test_bit(CTX_TO_BIT(hctx, ctx), &bm->word))
>  		set_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
> +	cpumask_set_cpu(ctx->cpu, hctx->cpu_pending_mask);
>  }
>  
>  static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
> @@ -76,6 +77,7 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
>  	struct blk_align_bitmap *bm = get_bm(hctx, ctx);
>  
>  	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
> +	cpumask_clear_cpu(ctx->cpu, hctx->cpu_pending_mask);
>  }
>  
>  void blk_mq_freeze_queue_start(struct request_queue *q)
> @@ -682,15 +684,41 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
>  	return false;
>  }
>  
> +static int tslice_flush_one_ctx(struct blk_mq_hw_ctx *hctx,
> +				struct list_head *list,	int cpu)
> +{
> +	struct request_queue *q = hctx->queue;
> +	struct blk_mq_ctx *ctx;
> +
> +	if (cpu == -1 || !hctx->tslice_us)
> +		return 0;
> +
> +	ctx = __blk_mq_get_ctx(q, cpu);
> +	spin_lock(&ctx->lock);
> +	if (!list_empty(&ctx->rq_list)) {
> +		list_splice_tail_init(&ctx->rq_list, list);
> +		spin_lock(&hctx->lock);
> +		hctx->tslice_inc = 1;
> +		blk_mq_hctx_clear_pending(hctx, ctx);
> +		spin_unlock(&hctx->lock);
> +	}
> +	spin_unlock(&ctx->lock);
> +	return 1;
> +}
> +
>  /*
>   * Process software queues that have been marked busy, splicing them
>   * to the for-dispatch
>   */
> -static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
> +static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx,
> +			struct list_head *list,	int cpu)
>  {
>  	struct blk_mq_ctx *ctx;
>  	int i;
>  
> +	if (tslice_flush_one_ctx(hctx, list, cpu))
> +		return;
> +
>  	for (i = 0; i < hctx->ctx_map.size; i++) {
>  		struct blk_align_bitmap *bm = &hctx->ctx_map.map[i];
>  		unsigned int off, bit;
> @@ -706,9 +734,11 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  				break;
>  
>  			ctx = hctx->ctxs[bit + off];
> -			clear_bit(bit, &bm->word);
>  			spin_lock(&ctx->lock);
>  			list_splice_tail_init(&ctx->rq_list, list);
> +			spin_lock(&hctx->lock);
> +			blk_mq_hctx_clear_pending(hctx, ctx);
> +			spin_unlock(&hctx->lock);
>  			spin_unlock(&ctx->lock);
>  
>  			bit++;
> @@ -717,6 +747,114 @@ static void flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, struct list_head *list)
>  }
>  
>  /*
> + * It'd be great if the workqueue API had a way to pass
> + * in a mask and had some smarts for more clever placement.
> + * For now we just round-robin here, switching for every
> + * BLK_MQ_CPU_WORK_BATCH queued items.
> + */
> +static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> +{
> +	if (hctx->queue->nr_hw_queues == 1)
> +		return WORK_CPU_UNBOUND;
> +
> +	if (--hctx->next_cpu_batch <= 0) {
> +		int cpu = hctx->next_cpu, next_cpu;
> +
> +		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> +		if (next_cpu >= nr_cpu_ids)
> +			next_cpu = cpumask_first(hctx->cpumask);
> +
> +		hctx->next_cpu = next_cpu;
> +		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> +
> +		return cpu;
> +	}
> +
> +	return hctx->next_cpu;
> +}
> +
> +static int blk_mq_tslice_expired(struct blk_mq_hw_ctx *hctx)
> +{
> +	if (time_after_eq(jiffies, hctx->tslice_expiration))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static int blk_mq_tslice_next_cpu(struct blk_mq_hw_ctx *hctx)
> +{
> +	int c = hctx->tslice_cpu;
> +
> +	/* allow extension of time slice */
> +	if (hctx->tslice_inc && hctx->tslice_inc_count < 4) {
> +		hctx->tslice_inc = 0;
> +		hctx->tslice_inc_count++;
> +		goto out;
> +	}
> +	hctx->tslice_inc = 0;
> +	hctx->tslice_inc_count = 0;
> +
> +	/* clear CPU for which tslice has expired */
> +	if (c != -1)
> +		cpumask_clear_cpu(c, hctx->cpu_service_mask);
> +
> +	/* calculate mask of remaining CPUs with pending work */
> +	if (cpumask_and(hctx->cpu_next_mask, hctx->cpu_pending_mask,
> +				hctx->cpu_service_mask)) {
> +		c = cpumask_first(hctx->cpu_next_mask);
> +	} else {
> +		/*
> +		 * no remaining CPUs with pending work, reset epoch,
> +		 * start with first CPU that has requests pending
> +		 */
> +		hctx->tslice_expiration = 0;
> +		cpumask_setall(hctx->cpu_service_mask);
> +		c = cpumask_first(hctx->cpu_pending_mask);
> +	}
> +
> +	/* no CPU with pending requests */
> +	if (c >= nr_cpu_ids)
> +		return -1;
> +
> +out:
> +	hctx->tslice_expiration = jiffies + usecs_to_jiffies(hctx->tslice_us);
> +	return c;
> +}
> +
> +static int tslice_get_busy_ctx(struct blk_mq_hw_ctx *hctx)
> +{
> +	int cpu = -1;
> +
> +	if (hctx->tslice_us) {
> +		spin_lock(&hctx->lock);
> +		if (blk_mq_tslice_expired(hctx))
> +			hctx->tslice_cpu = blk_mq_tslice_next_cpu(hctx);
> +		cpu = hctx->tslice_cpu;
> +		spin_unlock(&hctx->lock);
> +	}
> +
> +	return cpu;
> +}
> +
> +/*
> + * Ensure that hardware queue is run if there are pending
> + * requests in any software queue.
> + */
> +static void tslice_schedule_pending_work(struct blk_mq_hw_ctx *hctx)
> +{
> +	long t;
> +
> +	if (!hctx->tslice_us || cpumask_empty(hctx->cpu_pending_mask))
> +		return;
> +
> +	t = hctx->tslice_expiration - jiffies;
> +	if (t < 0)
> +		t = 0;
> +	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
> +					&hctx->run_work, (unsigned int)t);
> +}
> +
> +/*
>   * Run this hardware queue, pulling any software queues mapped to it in.
>   * Note that this function currently has various problems around ordering
>   * of IO. In particular, we'd like FIFO behaviour on handling existing
> @@ -729,7 +867,7 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	LIST_HEAD(rq_list);
>  	LIST_HEAD(driver_list);
>  	struct list_head *dptr;
> -	int queued;
> +	int queued, cpu;
>  
>  	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
>  
> @@ -739,9 +877,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  	hctx->run++;
>  
>  	/*
> -	 * Touch any software queue that has pending entries.
> +	 * Touch dedicated software queue if time slice is set or any
> +	 * software queue that has pending entries (cpu == -1).
>  	 */
> -	flush_busy_ctxs(hctx, &rq_list);
> +	cpu = tslice_get_busy_ctx(hctx);
> +	flush_busy_ctxs(hctx, &rq_list, cpu);
>  
>  	/*
>  	 * If we have previous entries on our dispatch list, grab them
> @@ -826,34 +966,8 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
>  		 * blk_mq_run_hw_queue() already checks the STOPPED bit
>  		 **/
>  		blk_mq_run_hw_queue(hctx, true);
> -	}
> -}
> -
> -/*
> - * It'd be great if the workqueue API had a way to pass
> - * in a mask and had some smarts for more clever placement.
> - * For now we just round-robin here, switching for every
> - * BLK_MQ_CPU_WORK_BATCH queued items.
> - */
> -static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
> -{
> -	if (hctx->queue->nr_hw_queues == 1)
> -		return WORK_CPU_UNBOUND;
> -
> -	if (--hctx->next_cpu_batch <= 0) {
> -		int cpu = hctx->next_cpu, next_cpu;
> -
> -		next_cpu = cpumask_next(hctx->next_cpu, hctx->cpumask);
> -		if (next_cpu >= nr_cpu_ids)
> -			next_cpu = cpumask_first(hctx->cpumask);
> -
> -		hctx->next_cpu = next_cpu;
> -		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
> -
> -		return cpu;
> -	}
> -
> -	return hctx->next_cpu;
> +	} else
> +		tslice_schedule_pending_work(hctx);
>  }
>  
>  void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
> @@ -992,7 +1106,10 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
>  	struct blk_mq_ctx *ctx = rq->mq_ctx;
>  
>  	__blk_mq_insert_req_list(hctx, ctx, rq, at_head);
> +	spin_lock(&hctx->lock);
>  	blk_mq_hctx_mark_pending(hctx, ctx);
> +	spin_unlock(&hctx->lock);
> +
>  }
>  
>  void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
> @@ -1583,7 +1700,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
>  	spin_lock(&ctx->lock);
>  	if (!list_empty(&ctx->rq_list)) {
>  		list_splice_init(&ctx->rq_list, &tmp);
> +		spin_lock(&hctx->lock);
>  		blk_mq_hctx_clear_pending(hctx, ctx);
> +		spin_unlock(&hctx->lock);
>  	}
>  	spin_unlock(&ctx->lock);
>  
> @@ -1602,7 +1721,9 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
>  	}
>  
>  	hctx = q->mq_ops->map_queue(q, ctx->cpu);
> +	spin_lock(&hctx->lock);
>  	blk_mq_hctx_mark_pending(hctx, ctx);
> +	spin_unlock(&hctx->lock);
>  
>  	spin_unlock(&ctx->lock);
>  
> @@ -1691,6 +1812,12 @@ static int blk_mq_init_hctx(struct request_queue *q,
>  	hctx->queue_num = hctx_idx;
>  	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
>  
> +	hctx->tslice_us = 0;
> +	hctx->tslice_expiration = 0;
> +	hctx->tslice_cpu = -1;
> +	hctx->tslice_inc = 0;
> +	hctx->tslice_inc_count = 0;
> +
>  	blk_mq_init_cpu_notifier(&hctx->cpu_notifier,
>  					blk_mq_hctx_notify, hctx);
>  	blk_mq_register_cpu_notifier(&hctx->cpu_notifier);
> @@ -2006,6 +2133,18 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
>  						node))
>  			goto err_hctxs;
>  
> +		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_pending_mask,
> +						GFP_KERNEL, node))
> +			goto err_hctxs;
> +
> +		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_service_mask,
> +						GFP_KERNEL, node))
> +			goto err_hctxs;
> +
> +		if (!zalloc_cpumask_var_node(&hctxs[i]->cpu_next_mask,
> +						GFP_KERNEL, node))
> +			goto err_hctxs;
> +
>  		atomic_set(&hctxs[i]->nr_active, 0);
>  		hctxs[i]->numa_node = node;
>  		hctxs[i]->queue_num = i;
> @@ -2069,6 +2208,9 @@ err_hctxs:
>  		if (!hctxs[i])
>  			break;
>  		free_cpumask_var(hctxs[i]->cpumask);
> +		free_cpumask_var(hctxs[i]->cpu_pending_mask);
> +		free_cpumask_var(hctxs[i]->cpu_service_mask);
> +		free_cpumask_var(hctxs[i]->cpu_next_mask);
>  		kfree(hctxs[i]);
>  	}
>  err_map:
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 7fc9296..a8ca685 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -57,6 +57,15 @@ struct blk_mq_hw_ctx {
>  
>  	atomic_t		nr_active;
>  
> +	cpumask_var_t		cpu_service_mask; /* CPUs not yet serviced */
> +	cpumask_var_t		cpu_pending_mask; /* CPUs with pending work */
> +	cpumask_var_t		cpu_next_mask;	/* CPUs to be serviced */
> +	int			tslice_us;	/* time slice in µs */
> +	int			tslice_inc;	/* extend time slice */
> +	int			tslice_inc_count;
> +	unsigned long		tslice_expiration; /* in jiffies */
> +	int			tslice_cpu;	/* cpu of time slice */
> +
>  	struct blk_mq_cpu_notifier	cpu_notifier;
>  	struct kobject		kobj;
>  
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice
  2016-02-09 17:12   ` Andreas Herrmann
@ 2016-02-09 17:41     ` Markus Trippelsdorf
  2016-02-10 19:34       ` Andreas Herrmann
  0 siblings, 1 reply; 12+ messages in thread
From: Markus Trippelsdorf @ 2016-02-09 17:41 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Christoph Hellwig, Jens Axboe, linux-kernel, Johannes Thumshirn,
	Jan Kara, linux-block, linux-scsi, Hannes Reinecke

On 2016.02.09 at 18:12 +0100, Andreas Herrmann wrote:
> [CC-ing linux-block and linux-scsi and adding some comments]
> 
> On Mon, Feb 01, 2016 at 11:43:40PM +0100, Andreas Herrmann wrote:
> > This introduces a new blk_mq hw attribute time_slice_us which allows
> > to specify a time slice in usecs.
> > 
> > Fio test results are sent in a separate mail to this.
> 
> See http://marc.info/?l=linux-kernel&m=145436682607949&w=2
> 
> In short it shows significant performance gains in some tests,
> e.g. sequential read iops up by >40% with 8 jobs. But it's never on
> par with CFQ when more than 1 job was used during the test.
> 
> > Results for fio improved to some extent with this patch. But in
> > reality the picture is quite mixed. Performance is highly dependent
> > on task scheduling. There is no guarantee that the requests
> > originating from one CPU belong to the same process.
> > 
> > I think for rotary devices CFQ is by far the best choice. A simple
> > illustration is:
> > 
> >   Copying two files (750MB in this case) in parallel on a rotary
> >   device. The elapsed wall clock time (seconds) for this is
> >                                mean    stdev
> >    cfq, slice_idle=8           16.18   4.95
> >    cfq, slice_idle=0           23.74   2.82
> >    blk-mq, time_slice_usec=0   24.37   2.05
> >    blk-mq, time_slice_usec=250 25.58   3.16
> 
> This illustrates that although there was a performance gain with the
> fio tests, the patch can cause higher variance and lower performance
> compared to unmodified blk-mq in other tests. And it underscores the
> superiority of CFQ for rotary disks.
> 
> Meanwhile my opinion is that it's not really worth looking further
> into introducing I/O scheduling support in blk-mq. I don't see the
> need for scheduling support (deadline or something else) for fast
> storage devices. And rotary devices should really avoid using blk-mq
> and stick to CFQ.
> 
> Thus I think that introducing some coexistence of blk-mq and the
> legacy block layer with CFQ is the best option.
> 
> Recently Johannes sent a patch to enable scsi-mq per driver, see
> http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> 
> Probably that is a good solution (at least in the short term) to allow
> users to switch to blk-mq for some host adapters (with fast storage
> attached) but to stick to legacy stuff on other host adapters with
> rotary devices.

I don't think that Johannes' patch is a good solution.

The best solution for the user would be if blk-mq could be toggled per
drive (or even automatically enabled if queue/rotational == 0). Is there
a fundamental reason why this is not feasible?

Your solution is better than nothing, but it requires that the user
finds out the drive <=> host mapping by hand and then runs something
like: 
echo "250" > /sys/devices/pci0000:00/0000:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
during boot for spinning rust drives...

-- 
Markus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice
  2016-02-09 17:41     ` Markus Trippelsdorf
@ 2016-02-10 19:34       ` Andreas Herrmann
  2016-02-10 19:47         ` Markus Trippelsdorf
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Herrmann @ 2016-02-10 19:34 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Christoph Hellwig, Jens Axboe, linux-kernel, Johannes Thumshirn,
	Jan Kara, linux-block, linux-scsi, Hannes Reinecke

On Tue, Feb 09, 2016 at 06:41:56PM +0100, Markus Trippelsdorf wrote:
> On 2016.02.09 at 18:12 +0100, Andreas Herrmann wrote:
> > [CC-ing linux-block and linux-scsi and adding some comments]
> > 
> > On Mon, Feb 01, 2016 at 11:43:40PM +0100, Andreas Herrmann wrote:
> > > This introduces a new blk_mq hw attribute time_slice_us which allows
> > > to specify a time slice in usecs.
> > > 
> > > Fio test results are sent in a separate mail to this.
> > 
> > See http://marc.info/?l=linux-kernel&m=145436682607949&w=2
> > 
> > In short it shows significant performance gains in some tests,
> > e.g. sequential read iops up by >40% with 8 jobs. But it's never on
> > par with CFQ when more than 1 job was used during the test.
> > 
> > > Results for fio improved to some extent with this patch. But in
> > > reality the picture is quite mixed. Performance is highly
> > > dependent on task scheduling. There is no guarantee that the
> > > requests originating from one CPU belong to the same process.
> > > 
> > > I think for rotary devices CFQ is by far the best choice. A simple
> > > illustration is:
> > > 
> > >   Copying two files (750MB in this case) in parallel on a rotary
> > >   device. The elapsed wall clock time (seconds) for this is
> > >                                mean    stdev
> > >    cfq, slice_idle=8           16.18   4.95
> > >    cfq, slice_idle=0           23.74   2.82
> > >    blk-mq, time_slice_usec=0   24.37   2.05
> > >    blk-mq, time_slice_usec=250 25.58   3.16
> > 
> > This illustrates that although there was a performance gain with
> > the fio tests, the patch can cause higher variance and lower
> > performance compared to unmodified blk-mq in other tests. And it
> > underscores the superiority of CFQ for rotary disks.
> > 
> > Meanwhile my opinion is that it's not really worth looking further
> > into introducing I/O scheduling support in blk-mq. I don't see the
> > need for scheduling support (deadline or something else) for fast
> > storage devices. And rotary devices should really avoid using
> > blk-mq and stick to CFQ.
> > 
> > Thus I think that introducing some coexistence of blk-mq and the
> > legacy block layer with CFQ is the best option.
> > 
> > Recently Johannes sent a patch to enable scsi-mq per driver, see
> > http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> > 
> > Probably that is a good solution (at least in the short term) to allow
> > users to switch to blk-mq for some host adapters (with fast storage
> > attached) but to stick to legacy stuff on other host adapters with
> > rotary devices.
> 
> I don't think that Johannes' patch is a good solution.

Why? Because it's not per device?

> The best solution for the user would be if blk-mq could be toggled
> per drive (or even automatically enabled if queue/rotational == 0).

Yes, I agree, but ...

> Is there a fundamental reason why this is not feasible?

... it's not possible (*) with the current implementation.

Tag handling/command allocation differs. The respective functions are set
per host.

(*) Or maybe it's possible but just hard to achieve and I didn't look
long enough into the relevant code to get an idea of how to do it.
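
As an illustration of the per-host split: roughly, the 4.4 scsi
midlayer decides the queue type when a device's request_queue is
allocated, based on a host-wide flag, and the tag set used for command
allocation hangs off the Scsi_Host. A simplified sketch (from memory,
not a verbatim excerpt; the wrapper name is made up):

    static struct request_queue *alloc_sdev_queue(struct scsi_device *sdev)
    {
            struct Scsi_Host *shost = sdev->host;

            /* host-wide decision, all LUNs behind shost take the same path */
            if (shost_use_blk_mq(shost))
                    return scsi_mq_alloc_queue(sdev); /* uses shost->tag_set */
            return scsi_alloc_queue(sdev);            /* legacy path, CFQ etc. */
    }

Toggling blk-mq for a single device would mean untangling that
per-host state first.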

> Your solution is better than nothing, but it requires that the user
> finds out the drive <=> host mapping by hand and then runs something
> like: 
> echo "250" > /sys/devices/pci0000:00/0000:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
> during boot for spinning rust drives...

Or it could automatically be set in the case of a rotational device.
(Once we know for sure that it doesn't cause performance degradation.)
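
Roughly, something like the following minimal sketch (not tested; it
assumes the rotational flag is already known when the hw ctx is set
up, which for scsi disks it usually isn't until sd has probed the
device, so in practice it would have to be applied later, e.g. from sd
or a udev rule; 250 us is just the value from my tests, not a tuned
default):

    /* e.g. instead of unconditionally starting with tslice_us = 0 */
    hctx->tslice_us = blk_queue_nonrot(q) ? 0 : 250;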

> -- 
> Markus

Andreas

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice
  2016-02-10 19:34       ` Andreas Herrmann
@ 2016-02-10 19:47         ` Markus Trippelsdorf
  2016-02-10 22:09           ` Andreas Herrmann
  0 siblings, 1 reply; 12+ messages in thread
From: Markus Trippelsdorf @ 2016-02-10 19:47 UTC (permalink / raw)
  To: Andreas Herrmann
  Cc: Christoph Hellwig, Jens Axboe, linux-kernel, Johannes Thumshirn,
	Jan Kara, linux-block, linux-scsi, Hannes Reinecke

On 2016.02.10 at 20:34 +0100, Andreas Herrmann wrote:
> On Tue, Feb 09, 2016 at 06:41:56PM +0100, Markus Trippelsdorf wrote:
> > > Recently Johannes sent a patch to enable scsi-mq per driver, see
> > > http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> > > 
> > > Probably that is a good solution (at least in the short term) to allow
> > > users to switch to blk-mq for some host adapters (with fast storage
> > > attached) but to stick to legacy stuff on other host adapters with
> > > rotary devices.
> > 
> > I don't think that Johannes' patch is a good solution.
> 
> Why? Because it's not per device?

Yes. Like Christoph said in his reply to the patch: »The host is simply
the wrong place to decide these things.«

> > The best solution for the user would be if blk-mq could be toggled
> > per drive (or even automatically enabled if queue/rotational == 0).
> 
> Yes, I agree, but ...
> 
> > Is there a fundamental reason why this is not feasible?
> 
> ... it's not possible (*) with the current implementation.
> 
> Tag handling/command allocation differs. The respective functions are set
> per host.
> 
> (*) Or maybe it's possible but just hard to achieve and I didn't look
> long enough into the relevant code to get an idea of how to do it.
> 
> > Your solution is better than nothing, but it requires that the user
> > finds out the drive <=> host mapping by hand and then runs something
> > like: 
> > echo "250" > /sys/devices/pci0000:00/0000:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
> > during boot for spinning rust drives...
> 
> Or it could automatically be set in case of rotational device.
> (Once we know for sure that it doesn't cause performance degradation.)

Yes, this sounds like a good idea.

But, if I understand things correctly, your patch is only an interim
solution until proper I/O scheduler support gets implemented for blk-mq, no?

-- 
Markus

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice
  2016-02-10 19:47         ` Markus Trippelsdorf
@ 2016-02-10 22:09           ` Andreas Herrmann
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Herrmann @ 2016-02-10 22:09 UTC (permalink / raw)
  To: Markus Trippelsdorf
  Cc: Christoph Hellwig, Jens Axboe, linux-kernel, Johannes Thumshirn,
	Jan Kara, linux-block, linux-scsi, Hannes Reinecke

On Wed, Feb 10, 2016 at 08:47:15PM +0100, Markus Trippelsdorf wrote:
> On 2016.02.10 at 20:34 +0100, Andreas Herrmann wrote:
> > On Tue, Feb 09, 2016 at 06:41:56PM +0100, Markus Trippelsdorf wrote:
> > > > Recently Johannes sent a patch to enable scsi-mq per driver, see
> > > > http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> > > > 
> > > > Probably that is a good solution (at least in the short term) to allow
> > > > users to switch to blk-mq for some host adapters (with fast storage
> > > > attached) but to stick to legacy stuff on other host adapters with
> > > > rotary devices.
> > > 
> > > I don't think that Johannes' patch is a good solution.
> > 
> > Why? Because it's not per device?
> 
> Yes. Like Christoph said in his reply to the patch: »The host is simply
> the wrong place to decide these things.«
> 
> > > The best solution for the user would be if blk-mq could be toggled
> > > per drive (or even automatically enabled if queue/rotational == 0).
> > 
> > Yes, I agree, but ...
> > 
> > > Is there a fundamental reason why this is not feasible?
> > 
> > ... it's not possible (*) with the current implementation.
> > 
> > Tag handling/command allocation differs. The respective functions are set
> > per host.
> > 
> > (*) Or maybe it's possible but just hard to achieve and I didn't look
> > long enough into the relevant code to get an idea of how to do it.
> > 
> > > Your solution is better than nothing, but it requires that the user
> > > finds out the drive <=> host mapping by hand and then runs something
> > > like: 
> > > echo "250" > /sys/devices/pci0000:00/0000:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
> > > during boot for spinning rust drives...
> > 
> > Or it could automatically be set in the case of a rotational device.
> > (Once we know for sure that it doesn't cause performance degradation.)
> 
> Yes, this sounds like a good idea.
> 
> But, if I understand things correctly, your patch is only an interim
> solution until proper I/O scheduler support gets implemented for blk-mq, no?

That's to be discussed. (Hence the RFC)

My (potentially wrong) claims are

- I don't think that fast storage (e.g. SSDs) requires I/O scheduler
  support with blk-mq. blk-mq is very good at pushing a large number
  of requests from per-CPU sw queues to hw queue(s). Why then
  introduce any overhead for I/O scheduler support?

- Slow storage (e.g. spinning drives) is fine with the old code, which
  provides scheduler support, and I doubt that there is any benefit for
  those devices when switching to blk-mq.

- The big hammer (scsi_mod.use_blk_mq) for the entire scsi stack to
  decide what to use is suboptimal. You can't have optimal performance
  when you have both slow and fast storage devices in your system.

I doubt that it is possible to add I/O scheduling support to blk-mq
that is on par with what CFQ achieves for slow devices at the moment.

Requests are scattered among per-CPU software queues (and almost
instantly passed to hardware queue(s)). Due to CPU scheduling,
requests initiated by one process might come down via different
software queues. What is an efficient way to sort/merge requests from
all the software queues such that the result is comparable to what
CFQ does (assuming that CFQ provides optimal performance)? So far I
haven't found a solution to this problem. (I just have this patch,
which doesn't add much overhead and improves the situation a little
bit.)
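
To make the sort part of that question concrete, the obvious approach
would be to sort the flushed list by start sector before dispatching
it, similar in spirit to what blk_mq_flush_plug_list() does for
plugged requests. A minimal, untested sketch (assuming the 4.4
list_sort() interface):

    /* needs <linux/list_sort.h> and <linux/blkdev.h> */
    static int rq_pos_cmp(void *priv, struct list_head *a,
                          struct list_head *b)
    {
            struct request *rqa = list_entry(a, struct request, queuelist);
            struct request *rqb = list_entry(b, struct request, queuelist);

            return blk_rq_pos(rqa) < blk_rq_pos(rqb) ? -1 : 1;
    }

            /* in __blk_mq_run_hw_queue(), right after flush_busy_ctxs() */
            list_sort(NULL, &rq_list, rq_pos_cmp);

Whether this pays off depends on how many requests actually pile up in
the list per queue run.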

Maybe the solution is to avoid per-CPU queues for slow storage and
fall back to a set of queues comparable to what CFQ uses.

One way to do this is by falling back to non-blk-mq code and direct
use of CFQ.

Code that allows selecting blk-mq per host would help to some
extent. But when you have both device types connected to the same
host adapter it doesn't help either.


Andreas

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-02-10 22:09 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-19 12:02 [RFC] blk-mq and I/O scheduling Andreas Herrmann
2015-11-24  8:19 ` Christoph Hellwig
2015-12-01  7:37   ` Andreas Herrmann
2015-11-25 19:47 ` Jens Axboe
2015-12-01  7:43   ` Andreas Herrmann
2016-02-01 22:43 ` [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice Andreas Herrmann
2016-02-01 22:46   ` Andreas Herrmann
2016-02-09 17:12   ` Andreas Herrmann
2016-02-09 17:41     ` Markus Trippelsdorf
2016-02-10 19:34       ` Andreas Herrmann
2016-02-10 19:47         ` Markus Trippelsdorf
2016-02-10 22:09           ` Andreas Herrmann
