* [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4
@ 2010-08-11 22:44 Vivek Goyal
  2010-08-11 22:44 ` [PATCH 1/5] cfq-iosched: Do not idle if slice_idle=0 Vivek Goyal
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Vivek Goyal @ 2010-08-11 22:44 UTC (permalink / raw)
  To: linux-kernel, jaxboe; +Cc: vgoyal


Hi,

This is V4 of the patches for group_idle and CFQ group charge accounting in
terms of IOPS implementation. Since V3 not much has changed. Just more testing
and rebase on top of for-2.6.36 branch of block tree.

What's the problem
------------------
On high-end storage (I tested on an HP EVA storage array with 12 SATA disks in
RAID 5), CFQ's model of dispatching requests from a single queue at a
time (sequential readers, sync writers, etc.) becomes a bottleneck.
Often we don't drive enough request queue depth to keep all the disks busy
and suffer a lot in terms of overall throughput.

These problems primarily originate from two things: idling on a per-cfq-queue
basis, and the quantum (dispatching only a limited number of requests from a
single queue while other queues are not allowed to dispatch). Once you set
slice_idle=0 and raise the quantum, most of CFQ's problems on higher-end
storage disappear.
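
For reference, the tuning above is just a matter of writing the CFQ sysfs
files. A minimal illustrative sketch (not part of the patches; it assumes the
disk is sdb and trims error handling):

/*
 * Illustrative only: apply the tuning described above (slice_idle=0, a
 * larger quantum) to a hypothetical disk "sdb" via the CFQ sysfs files.
 */
#include <stdio.h>

static int set_cfq_tunable(const char *name, const char *val)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/sdb/queue/iosched/%s", name);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	set_cfq_tunable("slice_idle", "0");	/* no per-queue idling */
	set_cfq_tunable("quantum", "16");	/* dispatch more per queue */
	return 0;
}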

This problem also becomes visible with the IO controller, where one creates
multiple groups and gets fairness but overall throughput suffers. In the
following table, I am running an increasing number of sequential readers
(1, 2, 4, 8) in 8 groups of weight 100 to 800.

Kernel=2.6.35-blktree-group_idle+
GROUPMODE=1          NRGRP=8      DEV=/dev/dm-3                 
Workload=bsr      iosched=cfq     Filesz=512M bs=4K   
gi=1  slice_idle=8    group_idle=8    quantum=8
=========================================================================
AVERAGE[bsr]    [bw in KB/s]    
------- 
job     Set NR  cgrp1  cgrp2  cgrp3  cgrp4  cgrp5  cgrp6  cgrp7  cgrp8  total  
---     --- --  ---------------------------------------------------------------
bsr     1   1   6519   12742  16801  23109  28694  35988  43175  49272  216300 
bsr     1   2   5522   10922  17174  22554  24151  30488  36572  42021  189404 
bsr     1   4   4593   9620   13120  21405  25827  28097  33029  37335  173026 
bsr     1   8   3622   8277   12557  18296  21775  26022  30760  35713  157022 


Notice that overall throughput is just around 160MB/s with 8 sequential readers
in each group.

With this patch set, I set slice_idle=0 and re-ran the same test.

Kernel=2.6.35-blktree-group_idle+
GROUPMODE=1          NRGRP=8         DEV=/dev/dm-3                 
Workload=bsr      iosched=cfq     Filesz=512M bs=4K   
gi=1  slice_idle=0    group_idle=8    quantum=8
=========================================================================
AVERAGE[bsr]    [bw in KB/s]    
------- 
job     Set NR  cgrp1  cgrp2  cgrp3  cgrp4  cgrp5  cgrp6  cgrp7  cgrp8  total  
---     --- --  ---------------------------------------------------------------
bsr     1   1   6652   12341  17335  23856  28740  36059  42833  48487  216303 
bsr     1   2   10168  20292  29827  38363  45746  52842  60071  63957  321266 
bsr     1   4   11176  21763  32713  42970  53222  58613  63598  69296  353351 
bsr     1   8   11750  23718  34102  47144  56975  63613  69000  69666  375968 

Notice how overall throughput has shot up to 350-370MB/s while retaining the
ability to do IO control.

Note that this is not the default mode. The new tunable group_idle allows one
to set slice_idle=0 to disable some of the CFQ features and primarily use the
group service differentiation feature.

By default nothing should change for CFQ, and this patch set should be fairly
low risk.
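
For the curious, the groups in the tables above are plain blkio cgroups with
weights 100 to 800. A rough illustrative sketch of that setup (it assumes the
blkio controller is mounted at /cgroup/blkio, the disk is sdb, and the group
names test1..test8 are arbitrary):

/*
 * Illustrative only: create eight weighted blkio groups and disable
 * per-queue idling so CFQ charges groups in IOPS on NCQ storage.
 * group_idle is left at its default so group fairness is retained.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	char path[128], val[16];
	int i;

	for (i = 1; i <= 8; i++) {
		snprintf(path, sizeof(path), "/cgroup/blkio/test%d", i);
		mkdir(path, 0755);

		snprintf(path, sizeof(path),
			 "/cgroup/blkio/test%d/blkio.weight", i);
		snprintf(val, sizeof(val), "%d", i * 100);
		write_str(path, val);
	}

	write_str("/sys/block/sdb/queue/iosched/slice_idle", "0");
	return 0;
}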

Thanks
Vivek


* [PATCH 1/5] cfq-iosched: Do not idle if slice_idle=0
  2010-08-11 22:44 [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Vivek Goyal
@ 2010-08-11 22:44 ` Vivek Goyal
  2010-08-16 18:37   ` Jeff Moyer
  2010-08-11 22:44 ` [PATCH 2/5] cfq-iosched: Do group share accounting in IOPS when slice_idle=0 Vivek Goyal
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2010-08-11 22:44 UTC (permalink / raw)
  To: linux-kernel, jaxboe; +Cc: vgoyal

o do not idle either on cfq queue or service tree if slice_idle=0. User does
  not want any queue or service tree idling. Currently even if slice_idle=0,
  we were waiting for request to finish before expiring the queue and that
  can lead to lower queue depths.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index eb4086f..8830569 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1839,6 +1839,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	BUG_ON(!service_tree);
 	BUG_ON(!service_tree->count);
 
+	if (!cfqd->cfq_slice_idle)
+		return false;
+
 	/* We never do for idle class queues. */
 	if (prio == IDLE_WORKLOAD)
 		return false;
@@ -1879,7 +1882,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	/*
 	 * idle is disabled, either manually or by past process history
 	 */
-	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
+	if (!cfq_should_idle(cfqd, cfqq))
 		return;
 
 	/*
-- 
1.7.1.1



* [PATCH 2/5] cfq-iosched: Do group share accounting in IOPS when slice_idle=0
  2010-08-11 22:44 [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Vivek Goyal
  2010-08-11 22:44 ` [PATCH 1/5] cfq-iosched: Do not idle if slice_idle=0 Vivek Goyal
@ 2010-08-11 22:44 ` Vivek Goyal
  2010-08-16 18:45   ` Jeff Moyer
  2010-08-11 22:44 ` [PATCH 3/5] cfq-iosched: Implement tunable group_idle Vivek Goyal
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2010-08-11 22:44 UTC (permalink / raw)
  To: linux-kernel, jaxboe; +Cc: vgoyal

o Implement another CFQ mode where we charge group in terms of number
  of requests dispatched instead of measuring the time. Measuring in terms
  of time is not possible when we are driving deeper queue depths and there
  are requests from multiple cfq queues in the request queue.

o This mode currently gets activated if one sets slice_idle=0 and associated
  disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
  most of the queues will dispatch 1 or more requests and then cfq queue
  expiry happens and we don't have a way to measure time. So start providing
  fairness in terms of IOPS.

o Currently IOPS mode works only with cfq group scheduling. CFQ is following
  different scheduling algorithms for queue and group scheduling. These IOPS
  stats are used only for group scheduling, hence in non-cgroup mode nothing
  should change.

o For CFQ group scheduling one can disable slice idling so that we don't idle
  on queue and drive deeper request queue depths (achieving better throughput),
  at the same time group idle is enabled so one should get service
  differentiation among groups.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   30 ++++++++++++++++++++++++------
 1 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 8830569..3fc6be1 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -378,6 +378,21 @@ CFQ_CFQQ_FNS(wait_busy);
 			&cfqg->service_trees[i][j]: NULL) \
 
 
+static inline bool iops_mode(struct cfq_data *cfqd)
+{
+	/*
+	 * If we are not idling on queues and it is a NCQ drive, parallel
+	 * execution of requests is on and measuring time is not possible
+	 * in most of the cases until and unless we drive shallower queue
+	 * depths and that becomes a performance bottleneck. In such cases
+	 * switch to start providing fairness in terms of number of IOs.
+	 */
+	if (!cfqd->cfq_slice_idle && cfqd->hw_tag)
+		return true;
+	else
+		return false;
+}
+
 static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
 {
 	if (cfq_class_idle(cfqq))
@@ -906,7 +921,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
 			slice_used = cfqq->allocated_slice;
 	}
 
-	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
 	return slice_used;
 }
 
@@ -914,19 +928,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 				struct cfq_queue *cfqq)
 {
 	struct cfq_rb_root *st = &cfqd->grp_service_tree;
-	unsigned int used_sl, charge_sl;
+	unsigned int used_sl, charge;
 	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
 			- cfqg->service_tree_idle.count;
 
 	BUG_ON(nr_sync < 0);
-	used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
+	used_sl = charge = cfq_cfqq_slice_usage(cfqq);
 
-	if (!cfq_cfqq_sync(cfqq) && !nr_sync)
-		charge_sl = cfqq->allocated_slice;
+	if (iops_mode(cfqd))
+		charge = cfqq->slice_dispatch;
+	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
+		charge = cfqq->allocated_slice;
 
 	/* Can't update vdisktime while group is on service tree */
 	cfq_rb_erase(&cfqg->rb_node, st);
-	cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
+	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
 	__cfq_group_service_tree_add(st, cfqg);
 
 	/* This group is being expired. Save the context */
@@ -940,6 +956,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 
 	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
 					st->min_vdisktime);
+	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
+			used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
 	cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
 	cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
 }
-- 
1.7.1.1



* [PATCH 3/5] cfq-iosched: Implement tunable group_idle
  2010-08-11 22:44 [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Vivek Goyal
  2010-08-11 22:44 ` [PATCH 1/5] cfq-iosched: Do not idle if slice_idle=0 Vivek Goyal
  2010-08-11 22:44 ` [PATCH 2/5] cfq-iosched: Do group share accounting in IOPS when slice_idle=0 Vivek Goyal
@ 2010-08-11 22:44 ` Vivek Goyal
  2010-08-16 18:53   ` Jeff Moyer
  2010-08-11 22:44 ` [PATCH 4/5] cfq-iosched: blktrace print per slice sector stats Vivek Goyal
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2010-08-11 22:44 UTC (permalink / raw)
  To: linux-kernel, jaxboe; +Cc: vgoyal

o Implement a new tunable group_idle, which allows idling on the group
  instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
  on the individual queues but idle on the group. This way on fast storage
  we can get fairness between groups at the same time overall throughput
  improves.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |   65 +++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 56 insertions(+), 9 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3fc6be1..85e4819 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_group_idle = HZ / 125;
 static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;
 
@@ -198,6 +199,8 @@ struct cfq_group {
 	struct hlist_node cfqd_node;
 	atomic_t ref;
 #endif
+	/* number of requests that are on the dispatch list or inside driver */
+	int dispatched;
 };
 
 /*
@@ -271,6 +274,7 @@ struct cfq_data {
 	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
+	unsigned int cfq_group_idle;
 	unsigned int cfq_latency;
 	unsigned int cfq_group_isolation;
 
@@ -1884,7 +1888,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 {
 	struct cfq_queue *cfqq = cfqd->active_queue;
 	struct cfq_io_context *cic;
-	unsigned long sl;
+	unsigned long sl, group_idle = 0;
 
 	/*
 	 * SSD device without seek penalty, disable idling. But only do so
@@ -1900,8 +1904,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	/*
 	 * idle is disabled, either manually or by past process history
 	 */
-	if (!cfq_should_idle(cfqd, cfqq))
-		return;
+	if (!cfq_should_idle(cfqd, cfqq)) {
+		/* no queue idling. Check for group idling */
+		if (cfqd->cfq_group_idle)
+			group_idle = cfqd->cfq_group_idle;
+		else
+			return;
+	}
 
 	/*
 	 * still active requests from this queue, don't idle
@@ -1928,13 +1937,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 		return;
 	}
 
+	/* There are other queues in the group, don't do group idle */
+	if (group_idle && cfqq->cfqg->nr_cfqq > 1)
+		return;
+
 	cfq_mark_cfqq_wait_request(cfqq);
 
-	sl = cfqd->cfq_slice_idle;
+	if (group_idle)
+		sl = cfqd->cfq_group_idle;
+	else
+		sl = cfqd->cfq_slice_idle;
 
 	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
 	cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
-	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
+	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
+			group_idle ? 1 : 0);
 }
 
 /*
@@ -1950,6 +1967,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
+	(RQ_CFQG(rq))->dispatched++;
 	elv_dispatch_sort(q, rq);
 
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
@@ -2219,7 +2237,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 			cfqq = NULL;
 			goto keep_queue;
 		} else
-			goto expire;
+			goto check_group_idle;
 	}
 
 	/*
@@ -2247,8 +2265,23 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
-	    (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
+	if (timer_pending(&cfqd->idle_slice_timer)) {
+		cfqq = NULL;
+		goto keep_queue;
+	}
+
+	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
+		cfqq = NULL;
+		goto keep_queue;
+	}
+
+	/*
+	 * If group idle is enabled and there are requests dispatched from
+	 * this group, wait for requests to complete.
+	 */
+check_group_idle:
+	if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
+	    && cfqq->cfqg->dispatched) {
 		cfqq = NULL;
 		goto keep_queue;
 	}
@@ -3396,6 +3429,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 	WARN_ON(!cfqq->dispatched);
 	cfqd->rq_in_driver--;
 	cfqq->dispatched--;
+	(RQ_CFQG(rq))->dispatched--;
 	cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
 			rq_start_time_ns(rq), rq_io_start_time_ns(rq),
 			rq_data_dir(rq), rq_is_sync(rq));
@@ -3425,7 +3459,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 		 * the queue.
 		 */
 		if (cfq_should_wait_busy(cfqd, cfqq)) {
-			cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
+			unsigned long extend_sl = cfqd->cfq_slice_idle;
+			if (!cfqd->cfq_slice_idle)
+				extend_sl = cfqd->cfq_group_idle;
+			cfqq->slice_end = jiffies + extend_sl;
 			cfq_mark_cfqq_wait_busy(cfqq);
 			cfq_log_cfqq(cfqd, cfqq, "will busy wait");
 		}
@@ -3871,6 +3908,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_group_idle = cfq_group_idle;
 	cfqd->cfq_latency = 1;
 	cfqd->cfq_group_isolation = 0;
 	cfqd->hw_tag = -1;
@@ -3943,6 +3981,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
 SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
 SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
 SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
+SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
@@ -3975,6 +4014,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
 STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
+STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
@@ -3996,6 +4036,7 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(group_idle),
 	CFQ_ATTR(low_latency),
 	CFQ_ATTR(group_isolation),
 	__ATTR_NULL
@@ -4049,6 +4090,12 @@ static int __init cfq_init(void)
 	if (!cfq_slice_idle)
 		cfq_slice_idle = 1;
 
+#ifdef CONFIG_CFQ_GROUP_IOSCHED
+	if (!cfq_group_idle)
+		cfq_group_idle = 1;
+#else
+		cfq_group_idle = 0;
+#endif
 	if (cfq_slab_setup())
 		return -ENOMEM;
 
-- 
1.7.1.1



* [PATCH 4/5] cfq-iosched: blktrace print per slice sector stats
  2010-08-11 22:44 [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Vivek Goyal
                   ` (2 preceding siblings ...)
  2010-08-11 22:44 ` [PATCH 3/5] cfq-iosched: Implement tunable group_idle Vivek Goyal
@ 2010-08-11 22:44 ` Vivek Goyal
  2010-08-11 22:44 ` [PATCH 5/5] cfq-iosched: Documentation help for new tunables Vivek Goyal
  2010-08-23 10:26 ` [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Jens Axboe
  5 siblings, 0 replies; 13+ messages in thread
From: Vivek Goyal @ 2010-08-11 22:44 UTC (permalink / raw)
  To: linux-kernel, jaxboe; +Cc: vgoyal

o Divyesh had gotten rid of this code in the past. I want to re-introduce it
  as it helps me a lot during debugging.

Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Reviewed-by: Divyesh Shah <dpshah@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 85e4819..f65c6f0 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -148,6 +148,8 @@ struct cfq_queue {
 	struct cfq_queue *new_cfqq;
 	struct cfq_group *cfqg;
 	struct cfq_group *orig_cfqg;
+	/* Number of sectors dispatched from queue in single dispatch round */
+	unsigned long nr_sectors;
 };
 
 /*
@@ -960,8 +962,9 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
 
 	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
 					st->min_vdisktime);
-	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
-			used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
+	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u"
+			" sect=%u", used_sl, cfqq->slice_dispatch, charge,
+			iops_mode(cfqd), cfqq->nr_sectors);
 	cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
 	cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
 }
@@ -1609,6 +1612,7 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 		cfqq->allocated_slice = 0;
 		cfqq->slice_end = 0;
 		cfqq->slice_dispatch = 0;
+		cfqq->nr_sectors = 0;
 
 		cfq_clear_cfqq_wait_request(cfqq);
 		cfq_clear_cfqq_must_dispatch(cfqq);
@@ -1971,6 +1975,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 	elv_dispatch_sort(q, rq);
 
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
+	cfqq->nr_sectors += blk_rq_sectors(rq);
 	cfq_blkiocg_update_dispatch_stats(&cfqq->cfqg->blkg, blk_rq_bytes(rq),
 					rq_data_dir(rq), rq_is_sync(rq));
 }
-- 
1.7.1.1



* [PATCH 5/5] cfq-iosched: Documentation help for new tunables
  2010-08-11 22:44 [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Vivek Goyal
                   ` (3 preceding siblings ...)
  2010-08-11 22:44 ` [PATCH 4/5] cfq-iosched: blktrace print per slice sector stats Vivek Goyal
@ 2010-08-11 22:44 ` Vivek Goyal
  2010-08-16 19:00   ` Jeff Moyer
  2010-08-23 10:26 ` [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Jens Axboe
  5 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2010-08-11 22:44 UTC (permalink / raw)
  To: linux-kernel, jaxboe; +Cc: vgoyal

o Some documentation to provide help with tunables.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/block/cfq-iosched.txt        |   45 ++++++++++++++++++++++++++++
 Documentation/cgroups/blkio-controller.txt |   28 +++++++++++++++++
 2 files changed, 73 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/block/cfq-iosched.txt

diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
new file mode 100644
index 0000000..3ea9e79
--- /dev/null
+++ b/Documentation/block/cfq-iosched.txt
@@ -0,0 +1,45 @@
+CFQ ioscheduler tunables
+========================
+
+slice_idle
+----------
+This specifies how long CFQ should idle for next request on certain cfq queues
+(for sequential workloads) and service trees (for random workloads) before
+queue is expired and CFQ selects next queue to dispatch from.
+
+By default slice_idle is a non-zero value. That means by default we idle on
+queues/service trees. This can be very helpful on highly seeky media like
+single spindle SATA/SAS disks where we can cut down on overall number of
+seeks and see improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues/service tree
+level and one should see an overall improved throughput on faster storage
+devices like multiple SATA/SAS disks in hardware RAID configuration. The down
+side is that isolation provided from WRITES also goes down and notion of
+IO priority becomes weaker.
+
+So depending on storage and workload, it might be useful to set slice_idle=0.
+In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
+keeping slice_idle enabled should be useful. For any configurations where
+there are multiple spindles behind single LUN (Host based hardware RAID
+controller or for storage arrays), setting slice_idle=0 might end up in better
+throughput and acceptable latencies.
+
+CFQ IOPS Mode for group scheduling
+==================================
+Basic CFQ design is to provide priority based time slices. Higher priority
+process gets bigger time slice and lower priority process gets smaller time
+slice. Measuring time becomes harder if storage is fast and supports NCQ and
+it would be better to dispatch multiple requests from multiple cfq queues in
+request queue at a time. In such scenario, it is not possible to measure time
+consumed by single queue accurately.
+
+What is possible though is to measure number of requests dispatched from a
+single queue and also allow dispatch from multiple cfq queue at the same time.
+This effectively becomes the fairness in terms of IOPS (IO operations per
+second).
+
+If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
+to IOPS mode and starts providing fairness in terms of number of requests
+dispatched. Note that this mode switching takes effect only for group
+scheduling. For non-cgroup users nothing should change.
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
index 48e0b21..6919d62 100644
--- a/Documentation/cgroups/blkio-controller.txt
+++ b/Documentation/cgroups/blkio-controller.txt
@@ -217,6 +217,7 @@ Details of cgroup files
 CFQ sysfs tunable
 =================
 /sys/block/<disk>/queue/iosched/group_isolation
+-----------------------------------------------
 
 If group_isolation=1, it provides stronger isolation between groups at the
 expense of throughput. By default group_isolation is 0. In general that
@@ -243,6 +244,33 @@ By default one should run with group_isolation=0. If that is not sufficient
 and one wants stronger isolation between groups, then set group_isolation=1
 but this will come at cost of reduced throughput.
 
+/sys/block/<disk>/queue/iosched/slice_idle
+------------------------------------------
+On a faster hardware CFQ can be slow, especially with sequential workload.
+This happens because CFQ idles on a single queue and single queue might not
+drive deeper request queue depths to keep the storage busy. In such scenarios
+one can try setting slice_idle=0 and that would switch CFQ to IOPS
+(IO operations per second) mode on NCQ supporting hardware.
+
+That means CFQ will not idle between cfq queues of a cfq group and hence be
+able to driver higher queue depth and achieve better throughput. That also
+means that cfq provides fairness among groups in terms of IOPS and not in
+terms of disk time.
+
+/sys/block/<disk>/queue/iosched/group_idle
+------------------------------------------
+If one disables idling on individual cfq queues and cfq service trees by
+setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
+on the group in an attempt to provide fairness among groups.
+
+By default group_idle is same as slice_idle and does not do anything if
+slice_idle is enabled.
+
+One can experience an overall throughput drop if you have created multiple
+groups and put applications in that group which are not driving enough
+IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
+on individual groups and throughput should improve.
+
 What works
 ==========
 - Currently only sync IO queues are support. All the buffered writes are
-- 
1.7.1.1



* Re: [PATCH 1/5] cfq-iosched: Do not idle if slice_idle=0
  2010-08-11 22:44 ` [PATCH 1/5] cfq-iosched: Do not idle if slice_idle=0 Vivek Goyal
@ 2010-08-16 18:37   ` Jeff Moyer
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Moyer @ 2010-08-16 18:37 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, jaxboe

Vivek Goyal <vgoyal@redhat.com> writes:

> o do not idle either on cfq queue or service tree if slice_idle=0. User does
>   not want any queue or service tree idling. Currently even if slice_idle=0,
>   we were waiting for request to finish before expiring the queue and that
>   can lead to lower queue depths.

Hm, I submitted a similar patch that was supposed to make .35.  Oh well,

Acked-by: Jeff Moyer <jmoyer@redhat.com>

>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index eb4086f..8830569 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1839,6 +1839,9 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  	BUG_ON(!service_tree);
>  	BUG_ON(!service_tree->count);
>  
> +	if (!cfqd->cfq_slice_idle)
> +		return false;
> +
>  	/* We never do for idle class queues. */
>  	if (prio == IDLE_WORKLOAD)
>  		return false;
> @@ -1879,7 +1882,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  	/*
>  	 * idle is disabled, either manually or by past process history
>  	 */
> -	if (!cfqd->cfq_slice_idle || !cfq_should_idle(cfqd, cfqq))
> +	if (!cfq_should_idle(cfqd, cfqq))
>  		return;
>  
>  	/*


* Re: [PATCH 2/5] cfq-iosched: Do group share accounting in IOPS when slice_idle=0
  2010-08-11 22:44 ` [PATCH 2/5] cfq-iosched: Do group share accounting in IOPS when slice_idle=0 Vivek Goyal
@ 2010-08-16 18:45   ` Jeff Moyer
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Moyer @ 2010-08-16 18:45 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, jaxboe

Vivek Goyal <vgoyal@redhat.com> writes:

> o Implement another CFQ mode where we charge group in terms of number
>   of requests dispatched instead of measuring the time. Measuring in terms
>   of time is not possible when we are driving deeper queue depths and there
>   are requests from multiple cfq queues in the request queue.
>
> o This mode currently gets activated if one sets slice_idle=0 and associated
>   disk supports NCQ. Again the idea is that on an NCQ disk with idling disabled
>   most of the queues will dispatch 1 or more requests and then cfq queue
>   expiry happens and we don't have a way to measure time. So start providing
>   fairness in terms of IOPS.
>
> o Currently IOPS mode works only with cfq group scheduling. CFQ is following
>   different scheduling algorithms for queue and group scheduling. These IOPS
>   stats are used only for group scheduling, hence in non-cgroup mode nothing
>   should change.
>
> o For CFQ group scheduling one can disable slice idling so that we don't idle
>   on queue and drive deeper request queue depths (achieving better throughput),
>   at the same time group idle is enabled so one should get service
>   differentiation among groups.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>


> ---
>  block/cfq-iosched.c |   30 ++++++++++++++++++++++++------
>  1 files changed, 24 insertions(+), 6 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 8830569..3fc6be1 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -378,6 +378,21 @@ CFQ_CFQQ_FNS(wait_busy);
>  			&cfqg->service_trees[i][j]: NULL) \
>  
>  
> +static inline bool iops_mode(struct cfq_data *cfqd)
> +{
> +	/*
> +	 * If we are not idling on queues and it is a NCQ drive, parallel
> +	 * execution of requests is on and measuring time is not possible
> +	 * in most of the cases until and unless we drive shallower queue
> +	 * depths and that becomes a performance bottleneck. In such cases
> +	 * switch to start providing fairness in terms of number of IOs.
> +	 */
> +	if (!cfqd->cfq_slice_idle && cfqd->hw_tag)
> +		return true;
> +	else
> +		return false;
> +}
> +
>  static inline enum wl_prio_t cfqq_prio(struct cfq_queue *cfqq)
>  {
>  	if (cfq_class_idle(cfqq))
> @@ -906,7 +921,6 @@ static inline unsigned int cfq_cfqq_slice_usage(struct cfq_queue *cfqq)
>  			slice_used = cfqq->allocated_slice;
>  	}
>  
> -	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u", slice_used);
>  	return slice_used;
>  }
>  
> @@ -914,19 +928,21 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>  				struct cfq_queue *cfqq)
>  {
>  	struct cfq_rb_root *st = &cfqd->grp_service_tree;
> -	unsigned int used_sl, charge_sl;
> +	unsigned int used_sl, charge;
>  	int nr_sync = cfqg->nr_cfqq - cfqg_busy_async_queues(cfqd, cfqg)
>  			- cfqg->service_tree_idle.count;
>  
>  	BUG_ON(nr_sync < 0);
> -	used_sl = charge_sl = cfq_cfqq_slice_usage(cfqq);
> +	used_sl = charge = cfq_cfqq_slice_usage(cfqq);
>  
> -	if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> -		charge_sl = cfqq->allocated_slice;
> +	if (iops_mode(cfqd))
> +		charge = cfqq->slice_dispatch;
> +	else if (!cfq_cfqq_sync(cfqq) && !nr_sync)
> +		charge = cfqq->allocated_slice;
>  
>  	/* Can't update vdisktime while group is on service tree */
>  	cfq_rb_erase(&cfqg->rb_node, st);
> -	cfqg->vdisktime += cfq_scale_slice(charge_sl, cfqg);
> +	cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
>  	__cfq_group_service_tree_add(st, cfqg);
>  
>  	/* This group is being expired. Save the context */
> @@ -940,6 +956,8 @@ static void cfq_group_served(struct cfq_data *cfqd, struct cfq_group *cfqg,
>  
>  	cfq_log_cfqg(cfqd, cfqg, "served: vt=%llu min_vt=%llu", cfqg->vdisktime,
>  					st->min_vdisktime);
> +	cfq_log_cfqq(cfqq->cfqd, cfqq, "sl_used=%u disp=%u charge=%u iops=%u",
> +			used_sl, cfqq->slice_dispatch, charge, iops_mode(cfqd));
>  	cfq_blkiocg_update_timeslice_used(&cfqg->blkg, used_sl);
>  	cfq_blkiocg_set_start_empty_time(&cfqg->blkg);
>  }


* Re: [PATCH 3/5] cfq-iosched: Implement tunable group_idle
  2010-08-11 22:44 ` [PATCH 3/5] cfq-iosched: Implement tunable group_idle Vivek Goyal
@ 2010-08-16 18:53   ` Jeff Moyer
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Moyer @ 2010-08-16 18:53 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, jaxboe

Vivek Goyal <vgoyal@redhat.com> writes:

> o Implement a new tunable group_idle, which allows idling on the group
>   instead of a cfq queue. Hence one can set slice_idle = 0 and not idle
>   on the individual queues but idle on the group. This way on fast storage
>   we can get fairness between groups at the same time overall throughput
>   improves.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>

> ---
>  block/cfq-iosched.c |   65 +++++++++++++++++++++++++++++++++++++++++++-------
>  1 files changed, 56 insertions(+), 9 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 3fc6be1..85e4819 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -30,6 +30,7 @@ static const int cfq_slice_sync = HZ / 10;
>  static int cfq_slice_async = HZ / 25;
>  static const int cfq_slice_async_rq = 2;
>  static int cfq_slice_idle = HZ / 125;
> +static int cfq_group_idle = HZ / 125;
>  static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
>  static const int cfq_hist_divisor = 4;
>  
> @@ -198,6 +199,8 @@ struct cfq_group {
>  	struct hlist_node cfqd_node;
>  	atomic_t ref;
>  #endif
> +	/* number of requests that are on the dispatch list or inside driver */
> +	int dispatched;
>  };
>  
>  /*
> @@ -271,6 +274,7 @@ struct cfq_data {
>  	unsigned int cfq_slice[2];
>  	unsigned int cfq_slice_async_rq;
>  	unsigned int cfq_slice_idle;
> +	unsigned int cfq_group_idle;
>  	unsigned int cfq_latency;
>  	unsigned int cfq_group_isolation;
>  
> @@ -1884,7 +1888,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  {
>  	struct cfq_queue *cfqq = cfqd->active_queue;
>  	struct cfq_io_context *cic;
> -	unsigned long sl;
> +	unsigned long sl, group_idle = 0;
>  
>  	/*
>  	 * SSD device without seek penalty, disable idling. But only do so
> @@ -1900,8 +1904,13 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  	/*
>  	 * idle is disabled, either manually or by past process history
>  	 */
> -	if (!cfq_should_idle(cfqd, cfqq))
> -		return;
> +	if (!cfq_should_idle(cfqd, cfqq)) {
> +		/* no queue idling. Check for group idling */
> +		if (cfqd->cfq_group_idle)
> +			group_idle = cfqd->cfq_group_idle;
> +		else
> +			return;
> +	}
>  
>  	/*
>  	 * still active requests from this queue, don't idle
> @@ -1928,13 +1937,21 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>  		return;
>  	}
>  
> +	/* There are other queues in the group, don't do group idle */
> +	if (group_idle && cfqq->cfqg->nr_cfqq > 1)
> +		return;
> +
>  	cfq_mark_cfqq_wait_request(cfqq);
>  
> -	sl = cfqd->cfq_slice_idle;
> +	if (group_idle)
> +		sl = cfqd->cfq_group_idle;
> +	else
> +		sl = cfqd->cfq_slice_idle;
>  
>  	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
>  	cfq_blkiocg_update_set_idle_time_stats(&cfqq->cfqg->blkg);
> -	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
> +	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu group_idle: %d", sl,
> +			group_idle ? 1 : 0);
>  }
>  
>  /*
> @@ -1950,6 +1967,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
>  	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
>  	cfq_remove_request(rq);
>  	cfqq->dispatched++;
> +	(RQ_CFQG(rq))->dispatched++;
>  	elv_dispatch_sort(q, rq);
>  
>  	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++;
> @@ -2219,7 +2237,7 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>  			cfqq = NULL;
>  			goto keep_queue;
>  		} else
> -			goto expire;
> +			goto check_group_idle;
>  	}
>  
>  	/*
> @@ -2247,8 +2265,23 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
>  	 * flight or is idling for a new request, allow either of these
>  	 * conditions to happen (or time out) before selecting a new queue.
>  	 */
> -	if (timer_pending(&cfqd->idle_slice_timer) ||
> -	    (cfqq->dispatched && cfq_should_idle(cfqd, cfqq))) {
> +	if (timer_pending(&cfqd->idle_slice_timer)) {
> +		cfqq = NULL;
> +		goto keep_queue;
> +	}
> +
> +	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
> +		cfqq = NULL;
> +		goto keep_queue;
> +	}
> +
> +	/*
> +	 * If group idle is enabled and there are requests dispatched from
> +	 * this group, wait for requests to complete.
> +	 */
> +check_group_idle:
> +	if (cfqd->cfq_group_idle && cfqq->cfqg->nr_cfqq == 1
> +	    && cfqq->cfqg->dispatched) {
>  		cfqq = NULL;
>  		goto keep_queue;
>  	}
> @@ -3396,6 +3429,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  	WARN_ON(!cfqq->dispatched);
>  	cfqd->rq_in_driver--;
>  	cfqq->dispatched--;
> +	(RQ_CFQG(rq))->dispatched--;
>  	cfq_blkiocg_update_completion_stats(&cfqq->cfqg->blkg,
>  			rq_start_time_ns(rq), rq_io_start_time_ns(rq),
>  			rq_data_dir(rq), rq_is_sync(rq));
> @@ -3425,7 +3459,10 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
>  		 * the queue.
>  		 */
>  		if (cfq_should_wait_busy(cfqd, cfqq)) {
> -			cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> +			unsigned long extend_sl = cfqd->cfq_slice_idle;
> +			if (!cfqd->cfq_slice_idle)
> +				extend_sl = cfqd->cfq_group_idle;
> +			cfqq->slice_end = jiffies + extend_sl;
>  			cfq_mark_cfqq_wait_busy(cfqq);
>  			cfq_log_cfqq(cfqd, cfqq, "will busy wait");
>  		}
> @@ -3871,6 +3908,7 @@ static void *cfq_init_queue(struct request_queue *q)
>  	cfqd->cfq_slice[1] = cfq_slice_sync;
>  	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
>  	cfqd->cfq_slice_idle = cfq_slice_idle;
> +	cfqd->cfq_group_idle = cfq_group_idle;
>  	cfqd->cfq_latency = 1;
>  	cfqd->cfq_group_isolation = 0;
>  	cfqd->hw_tag = -1;
> @@ -3943,6 +3981,7 @@ SHOW_FUNCTION(cfq_fifo_expire_async_show, cfqd->cfq_fifo_expire[0], 1);
>  SHOW_FUNCTION(cfq_back_seek_max_show, cfqd->cfq_back_max, 0);
>  SHOW_FUNCTION(cfq_back_seek_penalty_show, cfqd->cfq_back_penalty, 0);
>  SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
> +SHOW_FUNCTION(cfq_group_idle_show, cfqd->cfq_group_idle, 1);
>  SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
>  SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
>  SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
> @@ -3975,6 +4014,7 @@ STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
>  STORE_FUNCTION(cfq_back_seek_penalty_store, &cfqd->cfq_back_penalty, 1,
>  		UINT_MAX, 0);
>  STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
> +STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
>  STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
>  STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
>  STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
> @@ -3996,6 +4036,7 @@ static struct elv_fs_entry cfq_attrs[] = {
>  	CFQ_ATTR(slice_async),
>  	CFQ_ATTR(slice_async_rq),
>  	CFQ_ATTR(slice_idle),
> +	CFQ_ATTR(group_idle),
>  	CFQ_ATTR(low_latency),
>  	CFQ_ATTR(group_isolation),
>  	__ATTR_NULL
> @@ -4049,6 +4090,12 @@ static int __init cfq_init(void)
>  	if (!cfq_slice_idle)
>  		cfq_slice_idle = 1;
>  
> +#ifdef CONFIG_CFQ_GROUP_IOSCHED
> +	if (!cfq_group_idle)
> +		cfq_group_idle = 1;
> +#else
> +		cfq_group_idle = 0;
> +#endif
>  	if (cfq_slab_setup())
>  		return -ENOMEM;


* Re: [PATCH 5/5] cfq-iosched: Documentation help for new tunables
  2010-08-11 22:44 ` [PATCH 5/5] cfq-iosched: Documentation help for new tunables Vivek Goyal
@ 2010-08-16 19:00   ` Jeff Moyer
  2010-08-17 12:50     ` Vivek Goyal
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Moyer @ 2010-08-16 19:00 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, jaxboe


In general, I've resisted the urge to correct grammar.  Comments below.

Vivek Goyal <vgoyal@redhat.com> writes:

> o Some documentation to provide help with tunables.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/block/cfq-iosched.txt        |   45 ++++++++++++++++++++++++++++
>  Documentation/cgroups/blkio-controller.txt |   28 +++++++++++++++++
>  2 files changed, 73 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/block/cfq-iosched.txt
>
> diff --git a/Documentation/block/cfq-iosched.txt b/Documentation/block/cfq-iosched.txt
> new file mode 100644
> index 0000000..3ea9e79
> --- /dev/null
> +++ b/Documentation/block/cfq-iosched.txt
> @@ -0,0 +1,45 @@
> +CFQ ioscheduler tunables
> +========================
> +
> +slice_idle
> +----------
> +This specifies how long CFQ should idle for next request on certain cfq queues
> +(for sequential workloads) and service trees (for random workloads) before
> +queue is expired and CFQ selects next queue to dispatch from.
> +
> +By default slice_idle is a non-zero value. That means by default we idle on
> +queues/service trees. This can be very helpful on highly seeky media like

'seeky' is not a  property of the media.  I think you meant on storage
devices with a high seek cost.

> +single spindle SATA/SAS disks where we can cut down on overall number of
> +seeks and see improved throughput.
> +
> +Setting slice_idle to 0 will remove all the idling on queues/service tree
> +level and one should see an overall improved throughput on faster storage
> +devices like multiple SATA/SAS disks in hardware RAID configuration. The down
> +side is that isolation provided from WRITES also goes down and notion of
> +IO priority becomes weaker.
> +
> +So depending on storage and workload, it might be useful to set slice_idle=0.
> +In general I think for SATA/SAS disks and software RAID of SATA/SAS disks

You think?  I'm pretty sure we've measured that.  ;-)

> +keeping slice_idle enabled should be useful. For any configurations where
> +there are multiple spindles behind single LUN (Host based hardware RAID
> +controller or for storage arrays), setting slice_idle=0 might end up in better
> +throughput and acceptable latencies.
> +
> +CFQ IOPS Mode for group scheduling
> +==================================
> +Basic CFQ design is to provide priority based time slices. Higher priority
> +process gets bigger time slice and lower priority process gets smaller time
> +slice. Measuring time becomes harder if storage is fast and supports NCQ and
> +it would be better to dispatch multiple requests from multiple cfq queues in
> +request queue at a time. In such scenario, it is not possible to measure time
> +consumed by single queue accurately.
> +
> +What is possible though is to measure number of requests dispatched from a
> +single queue and also allow dispatch from multiple cfq queue at the same time.
> +This effectively becomes the fairness in terms of IOPS (IO operations per
> +second).
> +
> +If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
> +to IOPS mode and starts providing fairness in terms of number of requests
> +dispatched. Note that this mode switching takes effect only for group
> +scheduling. For non-cgroup users nothing should change.

> diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt
> index 48e0b21..6919d62 100644
> --- a/Documentation/cgroups/blkio-controller.txt
> +++ b/Documentation/cgroups/blkio-controller.txt
> @@ -217,6 +217,7 @@ Details of cgroup files
>  CFQ sysfs tunable
>  =================
>  /sys/block/<disk>/queue/iosched/group_isolation
> +-----------------------------------------------
>  
>  If group_isolation=1, it provides stronger isolation between groups at the
>  expense of throughput. By default group_isolation is 0. In general that
> @@ -243,6 +244,33 @@ By default one should run with group_isolation=0. If that is not sufficient
>  and one wants stronger isolation between groups, then set group_isolation=1
>  but this will come at cost of reduced throughput.
>  
> +/sys/block/<disk>/queue/iosched/slice_idle
> +------------------------------------------
> +On a faster hardware CFQ can be slow, especially with sequential workload.
> +This happens because CFQ idles on a single queue and single queue might not
> +drive deeper request queue depths to keep the storage busy. In such scenarios
> +one can try setting slice_idle=0 and that would switch CFQ to IOPS
> +(IO operations per second) mode on NCQ supporting hardware.
> +
> +That means CFQ will not idle between cfq queues of a cfq group and hence be
> +able to driver higher queue depth and achieve better throughput. That also
> +means that cfq provides fairness among groups in terms of IOPS and not in
> +terms of disk time.

I'm not sure we need documentation of this tunable twice.  Why not just
give guidance on when it should be set to 0 in the next section
(group_idle) and refer to cfq-iosched.txt?

> +
> +/sys/block/<disk>/queue/iosched/group_idle
> +------------------------------------------
> +If one disables idling on individual cfq queues and cfq service trees by
> +setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
> +on the group in an attempt to provide fairness among groups.
> +
> +By default group_idle is same as slice_idle and does not do anything if
> +slice_idle is enabled.
> +
> +One can experience an overall throughput drop if you have created multiple
> +groups and put applications in that group which are not driving enough
> +IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
> +on individual groups and throughput should improve.
> +
>  What works
>  ==========
>  - Currently only sync IO queues are support. All the buffered writes are
                                       supported.

Looks like something is amiss.  Your text was truncated somewhere.

Cheers,
Jeff


* Re: [PATCH 5/5] cfq-iosched: Documentation help for new tunables
  2010-08-16 19:00   ` Jeff Moyer
@ 2010-08-17 12:50     ` Vivek Goyal
  2010-08-17 13:51       ` Jeff Moyer
  0 siblings, 1 reply; 13+ messages in thread
From: Vivek Goyal @ 2010-08-17 12:50 UTC (permalink / raw)
  To: Jeff Moyer; +Cc: linux-kernel, jaxboe

On Mon, Aug 16, 2010 at 03:00:53PM -0400, Jeff Moyer wrote:
> 
> In general, I've resisted the urge to correct grammar.  Comments below.

Thanks for the review Jeff.

[..]
> > +CFQ ioscheduler tunables
> > +========================
> > +
> > +slice_idle
> > +----------
> > +This specifies how long CFQ should idle for next request on certain cfq queues
> > +(for sequential workloads) and service trees (for random workloads) before
> > +queue is expired and CFQ selects next queue to dispatch from.
> > +
> > +By default slice_idle is a non-zero value. That means by default we idle on
> > +queues/service trees. This can be very helpful on highly seeky media like
> 
> 'seeky' is not a  property of the media.  I think you meant on storage
> devices with a high seek cost.

Fixed

> 
> > +single spindle SATA/SAS disks where we can cut down on overall number of
> > +seeks and see improved throughput.
> > +
> > +Setting slice_idle to 0 will remove all the idling on queues/service tree
> > +level and one should see an overall improved throughput on faster storage
> > +devices like multiple SATA/SAS disks in hardware RAID configuration. The down
> > +side is that isolation provided from WRITES also goes down and notion of
> > +IO priority becomes weaker.
> > +
> > +So depending on storage and workload, it might be useful to set slice_idle=0.
> > +In general I think for SATA/SAS disks and software RAID of SATA/SAS disks
> 
> You think?  I'm pretty sure we've measured that.  ;-)

Fixed :-)

[..]
> > +/sys/block/<disk>/queue/iosched/slice_idle
> > +------------------------------------------
> > +On a faster hardware CFQ can be slow, especially with sequential workload.
> > +This happens because CFQ idles on a single queue and single queue might not
> > +drive deeper request queue depths to keep the storage busy. In such scenarios
> > +one can try setting slice_idle=0 and that would switch CFQ to IOPS
> > +(IO operations per second) mode on NCQ supporting hardware.
> > +
> > +That means CFQ will not idle between cfq queues of a cfq group and hence be
> > +able to driver higher queue depth and achieve better throughput. That also
> > +means that cfq provides fairness among groups in terms of IOPS and not in
> > +terms of disk time.
> 
> I'm not sure we need documentation of this tunable twice.  Why not just
> give guidance on when it should be set to 0 in the next section
> (group_idle) and refer to cfq-iosched.txt?

I have put in one line saying that for more details one should look at
cfq-iosched.txt. I have still retained the slice_idle entry because, at the
end of the day, this is the tunable I expect people to modify, not group_idle.
Also note that these are two different files (cfq-iosched.txt and
blkio-controller.txt).

> 
> > +
> > +/sys/block/<disk>/queue/iosched/group_idle
> > +------------------------------------------
> > +If one disables idling on individual cfq queues and cfq service trees by
> > +setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
> > +on the group in an attempt to provide fairness among groups.
> > +
> > +By default group_idle is same as slice_idle and does not do anything if
> > +slice_idle is enabled.
> > +
> > +One can experience an overall throughput drop if you have created multiple
> > +groups and put applications in that group which are not driving enough
> > +IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
> > +on individual groups and throughput should improve.
> > +
> >  What works
> >  ==========
> >  - Currently only sync IO queues are support. All the buffered writes are
>                                        supported.
> 
> Looks like something is amiss.  Your text was truncated somewhere.

Actually the above text consists of context lines (3 lines of diff context),
so nothing is truncated.

Thanks
Vivek


o Some documentation to provide help with tunables.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/block/cfq-iosched.txt        |   45 +++++++++++++++++++++++++++++
 Documentation/cgroups/blkio-controller.txt |   32 +++++++++++++++++++-
 2 files changed, 76 insertions(+), 1 deletion(-)

Index: linux-2.6-block/Documentation/block/cfq-iosched.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6-block/Documentation/block/cfq-iosched.txt	2010-08-17 08:39:59.000000000 -0400
@@ -0,0 +1,45 @@
+CFQ ioscheduler tunables
+========================
+
+slice_idle
+----------
+This specifies how long CFQ should idle for next request on certain cfq queues
+(for sequential workloads) and service trees (for random workloads) before
+queue is expired and CFQ selects next queue to dispatch from.
+
+By default slice_idle is a non-zero value. That means by default we idle on
+queues/service trees. This can be very helpful on storage devices with high
+seek cost like single spindle SATA/SAS disks where we can cut down on overall
+number of seeks and see improved throughput.
+
+Setting slice_idle to 0 will remove all the idling on queues/service tree
+level and one should see an overall improved throughput on faster storage
+devices like multiple SATA/SAS disks in hardware RAID configuration. The down
+side is that isolation provided from WRITES also goes down and notion of
+IO priority becomes weaker.
+
+So depending on storage and workload, it might be useful to set slice_idle=0.
+In general for SATA/SAS disks and software RAID of SATA/SAS disks keeping
+slice_idle enabled should be useful. For any configurations where there are
+multiple spindles behind single LUN (Host based hardware RAID controller or
+for storage arrays), setting slice_idle=0 might end up in better throughput
+and acceptable latencies.
+
+CFQ IOPS Mode for group scheduling
+==================================
+Basic CFQ design is to provide priority based time slices. Higher priority
+process gets bigger time slice and lower priority process gets smaller time
+slice. Measuring time becomes harder if storage is fast and supports NCQ and
+it would be better to dispatch multiple requests from multiple cfq queues in
+request queue at a time. In such scenario, it is not possible to measure time
+consumed by single queue accurately.
+
+What is possible though is to measure number of requests dispatched from a
+single queue and also allow dispatch from multiple cfq queue at the same time.
+This effectively becomes the fairness in terms of IOPS (IO operations per
+second).
+
+If one sets slice_idle=0 and if storage supports NCQ, CFQ internally switches
+to IOPS mode and starts providing fairness in terms of number of requests
+dispatched. Note that this mode switching takes effect only for group
+scheduling. For non-cgroup users nothing should change.
Index: linux-2.6-block/Documentation/cgroups/blkio-controller.txt
===================================================================
--- linux-2.6-block.orig/Documentation/cgroups/blkio-controller.txt	2010-08-11 11:12:42.000000000 -0400
+++ linux-2.6-block/Documentation/cgroups/blkio-controller.txt	2010-08-17 08:44:42.000000000 -0400
@@ -217,6 +217,7 @@ Details of cgroup files
 CFQ sysfs tunable
 =================
 /sys/block/<disk>/queue/iosched/group_isolation
+-----------------------------------------------
 
 If group_isolation=1, it provides stronger isolation between groups at the
 expense of throughput. By default group_isolation is 0. In general that
@@ -243,8 +244,37 @@ By default one should run with group_iso
 and one wants stronger isolation between groups, then set group_isolation=1
 but this will come at cost of reduced throughput.
 
+/sys/block/<disk>/queue/iosched/slice_idle
+------------------------------------------
+On a faster hardware CFQ can be slow, especially with sequential workload.
+This happens because CFQ idles on a single queue and single queue might not
+drive deeper request queue depths to keep the storage busy. In such scenarios
+one can try setting slice_idle=0 and that would switch CFQ to IOPS
+(IO operations per second) mode on NCQ supporting hardware.
+
+That means CFQ will not idle between cfq queues of a cfq group and hence be
+able to driver higher queue depth and achieve better throughput. That also
+means that cfq provides fairness among groups in terms of IOPS and not in
+terms of disk time.
+
+For more details look at Documentation/block/cfq-iosched.txt
+
+/sys/block/<disk>/queue/iosched/group_idle
+------------------------------------------
+If one disables idling on individual cfq queues and cfq service trees by
+setting slice_idle=0, group_idle kicks in. That means CFQ will still idle
+on the group in an attempt to provide fairness among groups.
+
+By default group_idle is same as slice_idle and does not do anything if
+slice_idle is enabled.
+
+One can experience an overall throughput drop if you have created multiple
+groups and put applications in that group which are not driving enough
+IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle
+on individual groups and throughput should improve.
+
 What works
 ==========
-- Currently only sync IO queues are support. All the buffered writes are
+- Currently only sync IO queues are supported. All the buffered writes are
   still system wide and not per group. Hence we will not see service
   differentiation between buffered writes between groups.


* Re: [PATCH 5/5] cfq-iosched: Documentation help for new tunables
  2010-08-17 12:50     ` Vivek Goyal
@ 2010-08-17 13:51       ` Jeff Moyer
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff Moyer @ 2010-08-17 13:51 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel, jaxboe

Vivek Goyal <vgoyal@redhat.com> writes:

>> I'm not sure we need documentation of this tunable twice.  Why not just
>> give guidance on when it should be set to 0 in the next section
>> (group_idle) and refer to cfq-iosched.txt?
>
> I have put one line saying for more details look at cfq-iosched.txt. I
> have still retained the slice_idle entry because at the end of the day
> this is the tunable I expect people to modify and not group_idle. Also
> notice that these are two different files (cfq-iosched.txt and
> blkio-controller.txt).

Yeah, I know it's in two places, and that means updating two places with
changes.  However, I doubt we'll have to update this often, so I'm not
going to argue.

Acked-by: Jeff Moyer <jmoyer@redhat.com>

Cheers,
Jeff


* Re: [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS  accounting for groups V4
  2010-08-11 22:44 [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Vivek Goyal
                   ` (4 preceding siblings ...)
  2010-08-11 22:44 ` [PATCH 5/5] cfq-iosched: Documentation help for new tunables Vivek Goyal
@ 2010-08-23 10:26 ` Jens Axboe
  5 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2010-08-23 10:26 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: linux-kernel

On 2010-08-12 00:44, Vivek Goyal wrote:
> 
> Hi,
> 
> This is V4 of the patches for group_idle and CFQ group charge accounting in
> terms of IOPS implementation. Since V3 not much has changed. Just more testing
> and rebase on top of for-2.6.36 branch of block tree.

Thanks Vivek, I have applied the patches.

-- 
Jens Axboe




Thread overview: 13+ messages
2010-08-11 22:44 [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Vivek Goyal
2010-08-11 22:44 ` [PATCH 1/5] cfq-iosched: Do not idle if slice_idle=0 Vivek Goyal
2010-08-16 18:37   ` Jeff Moyer
2010-08-11 22:44 ` [PATCH 2/5] cfq-iosched: Do group share accounting in IOPS when slice_idle=0 Vivek Goyal
2010-08-16 18:45   ` Jeff Moyer
2010-08-11 22:44 ` [PATCH 3/5] cfq-iosched: Implement tunable group_idle Vivek Goyal
2010-08-16 18:53   ` Jeff Moyer
2010-08-11 22:44 ` [PATCH 4/5] cfq-iosched: blktrace print per slice sector stats Vivek Goyal
2010-08-11 22:44 ` [PATCH 5/5] cfq-iosched: Documentation help for new tunables Vivek Goyal
2010-08-16 19:00   ` Jeff Moyer
2010-08-17 12:50     ` Vivek Goyal
2010-08-17 13:51       ` Jeff Moyer
2010-08-23 10:26 ` [PATCH] cfq-iosched: cfq-iosched: Implement group idling and IOPS accounting for groups V4 Jens Axboe
