* [PATCH 0/1] CFQ: fixing performance issues
From: Maxim Patlasov @ 2011-08-19 15:38 UTC
  To: axboe; +Cc: linux-kernel

Hi,

While chasing cfq vs. noop performance degradation in a complex testing
environment (RHEL6-based kernel, Intel vconsolidate and Dell dvd-store tests
running in virtual environments on relatively powerful servers equipped with
fast h/w raids), I found a bunch of problems related to 'idling' in cases
where 'preempting' would be much more beneficial. Having now secured some
free time to fiddle with the mainline kernel (I used 3.1.0-rc2 in my tests),
I managed to reproduce one of the performance issues using aio-stress alone.
The problem that this patch-set concerns is idling on seeky cfqq-s marked as
'deep'.

Special handling of 'deep' cfqq-s was introduced a long time ago by commit
76280aff1c7e9ae761cac4b48591c43cd7d69159. The idea was that, if an application
is using a large I/O depth, it is already optimized to make full utilization
of the hardware, and therefore idling should be beneficial. The problem was
that seeing a large I/O depth just once was enough for a given cfqq to keep
the 'deep' flag for a long while. Obviously, this may hurt performance badly
if the h/w is able to process many I/O requests concurrently and effectively.
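
For reference, the logic that marks a cfqq 'deep' in cfq_update_idle_window()
(as of 3.1.0-rc2) is simply this, with no condition on how the disk actually
copes with the queue:

>        if (cfqq->queued[0] + cfqq->queued[1] >= 4)
>                cfq_mark_cfqq_deep(cfqq);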

Later, the problem was (partially) amended by patch
8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, which clears the 'deep' and
'idle_window' flags if "the device is much faster than the queue can deliver".
Unfortunately, the logic introduced by that patch suffers from two main
problems:
 - a cfqq may keep the 'deep' and 'idle_window' flags for a while until that
   logic clears them; preemption is effectively disabled within this time gap
 - even on commodity h/w with a single slow SATA hdd, that logic may produce
   a wrong estimate (claiming the device is fast when it's actually slow).
There are also a few more deficiencies in that logic. I describe them in some
detail in the patch description.
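
For reference, the logic added by that commit to cfq_select_queue() is:

>	/*
>	 * This is a deep seek queue, but the device is much faster than
>	 * the queue can deliver, don't idle
>	 **/
>	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
>	    (cfq_cfqq_slice_new(cfqq) ||
>	    (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
>		cfq_clear_cfqq_deep(cfqq);
>		cfq_clear_cfqq_idle_window(cfqq);
>	}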

Let's now look at the figures. Commodity server with a slow hdd, eight
aio-stress instances running concurrently, the cmd-line of each being:

# aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I f2_$I f3_$I f4_$I

Aggregate throughput:

Pristine 3.1.0-rc2 (CFQ): 3.59 MB/s
Pristine 3.1.0-rc2 (noop): 2.49 MB/s
3.1.0-rc2 w/o 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 (CFQ): 5.46 MB/s

So, that patch steals about 35% of the throughput on a single slow hdd!

Now let's look at the server with a fast h/w raid (LSI 1078 RAID-0 of 8 10K
RPM SAS disks). To make the "time gap" effect visible, I had to modify
aio-stress slightly:

> --- aio-stress-orig.c	2011-08-16 17:00:04.000000000 -0400
> +++ aio-stress.c	2011-08-18 14:49:31.000000000 -0400
> @@ -884,6 +884,7 @@ static int run_active_list(struct thread
>      }
>      if (num_built) {
>  	ret = run_built(t, num_built, t->iocbs);
> +	usleep(1000);
>  	if (ret < 0) {
>  	    fprintf(stderr, "error %d on run_built\n", ret);
>  	    exit(1);

(this change models an app with non-zero think-time). Aggregate throughput:

Pristine 3.1.0-rc2 (CFQ): 67.29 MB/s
Pristine 3.1.0-rc2 (noop): 99.76 MB/s

So, we can see about 30% performance degradation of CFQ as compared with noop.
Let's see how idling affects it:

Pristine 3.1.0-rc2 (CFQ, slice_idle=0): 106.28 MB/s

This proves that all of the degradation is due to idling. To be 100% sure that
idling on "deep" tasks is the culprit, let's re-run the test after commenting
out the lines that mark a cfqq as "deep":

>        //if (cfqq->queued[0] + cfqq->queued[1] >= 4)
>        //      cfq_mark_cfqq_deep(cfqq);

3.1.0-rc2 (CFQ, mark_cfqq_deep is commented, default slice_idle): 98.51 MB/s

The throughput here is essentially the same as in the case of the noop
scheduler. This proves that the 30% degradation resulted from idling on "deep"
tasks and that patch 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 doesn't fully
address such a test-case. As a last step, let's verify that that patch really
recognizes a fast h/w raid as "fast enough". To do so, let's revert the
changes in cfq_update_idle_window back to the state of pristine 3.1.0-rc2, but
make clearing the "deep" flag in cfq_select_queue unconditional (pretending
that the condition "the queue delivers all requests before half its slice is
used" is always met):

>        if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) /* &&
>            (cfq_cfqq_slice_new(cfqq) ||
>            (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start)) */ ) {
>                cfq_clear_cfqq_deep(cfqq);
>                cfq_clear_cfqq_idle_window(cfqq);
>        }

3.1.0-rc2 (CFQ, always clear "deep" flag, default slice_idle): 67.67 MB/s

The throughput here is the same as in the case of CFQ on pristine 3.1.0-rc2.
This supports the hypothesis that the degradation results from a lack of
preemption due to the time gap between marking a task as "deep" in
cfq_update_idle_window and clearing this flag in cfq_select_queue.

After applying the patch from this patch-set, the aggregate throughput on the
server with the fast h/w raid is 98.13 MB/s. On the commodity server with the
slow hdd: 5.45 MB/s.

Thanks,
Maxim


* [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Maxim Patlasov @ 2011-08-19 15:38 UTC
  To: axboe; +Cc: linux-kernel

The patch re-works the changes introduced by commit
8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 to make the decision about
idle/noidle behaviour on seeky & deep cfqq-s more precise.

The problems found in that former patch:

1. "the queue delivers all requests before half its slice is used" is too
pessimistic estimation. It can be demonstrated by aio-stress on a commodity
server with single slow SATA hdd that the estimate provides so many false
positives to degrade performance significantly (up to 35% as compared with
pristine 3.1.0-rc2 w/o that patch).

2. There is a time gap between marking a cfqq as 'deep' and 'idle_window' in
cfq_update_idle_window() and clearing these bits in cfq_select_queue(). Until
these bits are cleared there may be quite a few cfqq-s marked as 'deep' and
'idle_window'. This effectively disables preemption because CFQ usually
allows it only between SYNC_NOIDLE_WORKLOAD cfqq-s. A slightly modified
aio-stress (adding 1 msec of think-time) demonstrates that this problem may
chew up to 30% of performance on fast h/w raids (as compared with setting
slice_idle=0 or the noop scheduler).

3. It's possible that a task was preempted in the middle of a dispatch round.
Then the condition "the queue delivers all requests before half its slice is
used" will be checked against a shallow queue. In such cases CFQ will regard
a slow disk as fast.

4. It's possible that the queue, at the moment a task was marked "deep", was
much deeper than four requests. Then the fact that the queue hasn't delivered
all requests quickly doesn't imply that the disk is slow. In such cases CFQ
may regard a fast disk as slow.

The approach suggested here avoids the performance degradations mentioned
above. With this patch applied, the performance on a slow hdd is the same as
it used to be before 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, and, on fast
h/w raids, it's roughly the same as with the noop scheduler or CFQ with
slice_idle=0.

Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
---
 block/cfq-iosched.c |   90 +++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 77 insertions(+), 13 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1f96ad6..cf4b643 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -53,6 +53,11 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
+#define CFQQ_DEEP_THR		4
+
+#define CFQD_DISK_LOOKS_FAST(cfqd)				\
+	(cfqd->cfq_disk_looks_fast > cfqd->cfq_disk_looks_slow)
+
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private[0])
 #define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private[1])
@@ -147,6 +152,11 @@ struct cfq_queue {
 	struct cfq_group *cfqg;
 	/* Number of sectors dispatched from queue in single dispatch round */
 	unsigned long nr_sectors;
+
+	/* When the first dispatch in a dispatch round happened */
+	unsigned long first_dispatch;
+	/* Number of dispatches since the start of the dispatch round */
+	int n_dispatched;
 };
 
 /*
@@ -303,6 +313,17 @@ struct cfq_data {
 
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
+
+	/*
+	 * # of times the disk was claimed fast and slow, respectively
+	 */
+	int cfq_disk_looks_fast;
+	int cfq_disk_looks_slow;
+
+	/*
+	 * when the fast/slow fields were last updated
+	 */
+	unsigned long cfq_disk_last_updated;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -333,6 +354,10 @@ enum cfqq_state_flags {
 	CFQ_CFQQ_FLAG_coop,		/* cfqq is shared */
 	CFQ_CFQQ_FLAG_split_coop,	/* shared cfqq will be splitted */
 	CFQ_CFQQ_FLAG_deep,		/* sync cfqq experienced large depth */
+	CFQ_CFQQ_FLAG_deep_early,	/* sync cfqq experienced large depth at
+					 * the beginning of the time-slice;
+					 * check whether the driver can drain
+					 * the queue quickly */
 	CFQ_CFQQ_FLAG_wait_busy,	/* Waiting for next request */
 };
 
@@ -362,6 +387,7 @@ CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 CFQ_CFQQ_FNS(split_coop);
 CFQ_CFQQ_FNS(deep);
+CFQ_CFQQ_FNS(deep_early);
 CFQ_CFQQ_FNS(wait_busy);
 #undef CFQ_CFQQ_FNS
 
@@ -1704,6 +1730,11 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 		cfqq->slice_dispatch = 0;
 		cfqq->nr_sectors = 0;
 
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		if (cfqq->queued[0] + cfqq->queued[1] >= CFQQ_DEEP_THR)
+			cfq_mark_cfqq_deep_early(cfqq);
+
 		cfq_clear_cfqq_wait_request(cfqq);
 		cfq_clear_cfqq_must_dispatch(cfqq);
 		cfq_clear_cfqq_must_alloc_slice(cfqq);
@@ -1725,6 +1756,12 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
 
+	if (cfq_cfqq_deep_early(cfqq)) {
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		cfq_clear_cfqq_deep_early(cfqq);
+	}
+
 	if (cfq_cfqq_wait_request(cfqq))
 		cfq_del_timer(cfqd, cfqq);
 
@@ -2059,6 +2096,12 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	if (cfq_cfqq_deep_early(cfqq)) {
+		cfqq->n_dispatched++;
+		if (!cfqq->first_dispatch)
+			cfqq->first_dispatch = jiffies;
+	}
+
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
@@ -2333,6 +2376,27 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 			goto check_group_idle;
 	}
 
+	if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
+		if (cfqq->first_dispatch == jiffies)
+			cfqd->cfq_disk_looks_fast++;
+		else
+			cfqd->cfq_disk_looks_slow++;
+
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		cfq_clear_cfqq_deep_early(cfqq);
+		cfqd->cfq_disk_last_updated = jiffies;
+	}
+
+	if ((cfqd->cfq_disk_last_updated &&
+	     jiffies - cfqd->cfq_disk_last_updated > HZ * 10) ||
+	    cfqd->cfq_disk_looks_fast > 128 ||
+	    cfqd->cfq_disk_looks_slow > 128) {
+		cfqd->cfq_disk_looks_fast >>= 1;
+		cfqd->cfq_disk_looks_slow >>= 1;
+		cfqd->cfq_disk_last_updated = jiffies;
+	}
+
 	/*
 	 * The active queue has requests and isn't expired, allow it to
 	 * dispatch.
@@ -2340,6 +2404,12 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
 		goto keep_queue;
 
+	if (cfq_cfqq_deep_early(cfqq)) {
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		cfq_clear_cfqq_deep_early(cfqq);
+	}
+
 	/*
 	 * If another queue has a request waiting within our mean seek
 	 * distance, let it run.  The expire code will check for close
@@ -2363,17 +2433,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto keep_queue;
 	}
 
-	/*
-	 * This is a deep seek queue, but the device is much faster than
-	 * the queue can deliver, don't idle
-	 **/
-	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
-	    (cfq_cfqq_slice_new(cfqq) ||
-	    (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
-		cfq_clear_cfqq_deep(cfqq);
-		cfq_clear_cfqq_idle_window(cfqq);
-	}
-
 	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
 		cfqq = NULL;
 		goto keep_queue;
@@ -3286,13 +3345,18 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (cfqq->queued[0] + cfqq->queued[1] >= 4)
+	if (cfqq->queued[0] + cfqq->queued[1] >= CFQQ_DEEP_THR) {
 		cfq_mark_cfqq_deep(cfqq);
 
+		if (cfq_cfqq_slice_new(cfqq))
+			cfq_mark_cfqq_deep_early(cfqq);
+	}
+
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
 		enable_idle = 0;
 	else if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
+		 ((!cfq_cfqq_deep(cfqq) || CFQD_DISK_LOOKS_FAST(cfqd))
+		  && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime.ttime_samples)) {
 		if (cic->ttime.ttime_mean > cfqd->cfq_slice_idle)
-- 
1.7.4.4



* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Shaohua Li @ 2011-08-23  2:44 UTC
  To: Maxim Patlasov; +Cc: axboe, linux-kernel

2011/8/19 Maxim Patlasov <maxim.patlasov@gmail.com>:
> The patch re-works the changes introduced by commit
> 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 to make the decision about
> idle/noidle behaviour on seeky & deep cfqq-s more precise.
>
> The problems found in that former patch:
>
> 1. "the queue delivers all requests before half its slice is used" is too
> pessimistic an estimate. aio-stress on a commodity server with a single slow
> SATA hdd demonstrates that this estimate produces so many false positives that
> it degrades performance significantly (up to 35% as compared with pristine
> 3.1.0-rc2 w/o that patch).
>
> 2. There is a time gap between marking a cfqq as 'deep' and 'idle_window' in
> cfq_update_idle_window() and clearing these bits in cfq_select_queue(). Until
> these bits are cleared there may be quite a few cfqq-s marked as 'deep' and
> 'idle_window'. This effectively disables preemption because CFQ usually
> allows it only between SYNC_NOIDLE_WORKLOAD cfqq-s. A slightly modified
> aio-stress (adding 1 msec of think-time) demonstrates that this problem may
> chew up to 30% of performance on fast h/w raids (as compared with setting
> slice_idle=0 or the noop scheduler).
>
> 3. It's possible that a task was preempted in the middle of a dispatch round.
> Then the condition "the queue delivers all requests before half its slice is
> used" will be checked against a shallow queue. In such cases CFQ will regard
> a slow disk as fast.
>
> 4. It's possible that the queue, at the moment a task was marked "deep", was
> much deeper than four requests. Then the fact that the queue hasn't delivered
> all requests quickly doesn't imply that the disk is slow. In such cases CFQ
> may regard a fast disk as slow.
>
> The approach suggested here avoids the performance degradations mentioned
> above. With this patch applied, the performance on a slow hdd is the same as
> it used to be before 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, and, on fast
> h/w raids, it's roughly the same as with the noop scheduler or CFQ with
> slice_idle=0.
Idling is a usual cause of CFQ performance issues. Did you test on a disk
without NCQ?
And did you test whether this hurts the performance of Vivek's original problem?

snip
> +       if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
> +               if (cfqq->first_dispatch == jiffies)
> +                       cfqd->cfq_disk_looks_fast++;
> +               else
> +                       cfqd->cfq_disk_looks_slow++;
> +
jiffies is too coarse here. A disk with NCQ can dispatch several
requests within one jiffy.
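
For example, assuming HZ=1000 (one jiffy = 1 ms): handing a request to the
driver takes only microseconds, so even a slow NCQ disk can have all four
threshold requests dispatched within the same jiffy, while still needing tens
of milliseconds to actually complete them.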

Thanks,
Shaohua


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Maxim Patlasov @ 2011-08-23  8:56 UTC
  To: Shaohua Li; +Cc: axboe, linux-kernel

Hi,

>> The approach suggested here avoids the performance degradations mentioned
>> above. With this patch applied, the performance on a slow hdd is the same as
>> it used to be before 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, and, on fast
>> h/w raids, it's roughly the same as with the noop scheduler or CFQ with
>> slice_idle=0.
> Idling is a usual cause of CFQ performance issues. Did you test on a disk
> without NCQ?

Yes, on the node with the "slow hdd" CFQ detected it as hw_tag=0. I
explained what I tested in the cover message (subj: [PATCH 0/1] CFQ:
fixing performance issues).

> And did you test whether this hurts the performance of Vivek's original problem?

No. What's Vivek's original problem?

> snip
>> +       if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
>> +               if (cfqq->first_dispatch == jiffies)
>> +                       cfqd->cfq_disk_looks_fast++;
>> +               else
>> +                       cfqd->cfq_disk_looks_slow++;
>> +
> jiffies is too coarse here. A disk with NCQ can dispatch several
> requests within one jiffy.

If a disk with NCQ regularly dispatches four requests in a row within one
jiffy, the patch I suggested will claim it "fast enough". It should be
beneficial to disable idling for deep & seeky cfqq-s in this case, imho.
Anyway, the existing code:

>	/*
>	 * This is a deep seek queue, but the device is much faster than
>	 * the queue can deliver, don't idle
>	 **/
>	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
>	    (cfq_cfqq_slice_new(cfqq) ||
>	    (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
>		cfq_clear_cfqq_deep(cfqq);
>		cfq_clear_cfqq_idle_window(cfqq);
>	}
>

would surely disable idling in this case. So, the patch I suggested
doesn't make things worse (as compared with the existing implementation).

Thanks,
Maxim


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Shaohua Li @ 2011-08-24  2:06 UTC
  To: Maxim Patlasov; +Cc: axboe, linux-kernel

2011/8/23 Maxim Patlasov <maxim.patlasov@gmail.com>:
> Hi,
>
>>> The approach suggested here avoids the performance degradations mentioned
>>> above. With this patch applied, the performance on a slow hdd is the same as
>>> it used to be before 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, and, on fast
>>> h/w raids, it's roughly the same as with the noop scheduler or CFQ with
>>> slice_idle=0.
>> Idling is a usual cause of CFQ performance issues. Did you test on a disk
>> without NCQ?
>
> Yes, on the node with the "slow hdd" CFQ detected it as hw_tag=0. I
> explained what I tested in the cover message (subj: [PATCH 0/1] CFQ:
> fixing performance issues).
>
>> And did you test whether this hurts the performance of Vivek's original problem?
>
> No. What's Vivek's original problem?
>
>> snip
>>> +       if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
>>> +               if (cfqq->first_dispatch == jiffies)
>>> +                       cfqd->cfq_disk_looks_fast++;
>>> +               else
>>> +                       cfqd->cfq_disk_looks_slow++;
>>> +
>> jiffies is too coarse here. A disk with NCQ can dispatch several
>> requests within one jiffy.
>
> If a disk with NCQ dispatches four requests in raw within one jiffy
> regularly, the patch I suggested will claim it as "fast enough". It
> should be beneficial to disable idling for deep&seeky cfqq in this
> case, imho. Anyway, existing code:
>
>>       /*
>>        * This is a deep seek queue, but the device is much faster than
>>        * the queue can deliver, don't idle
>>        **/
>>       if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
>>           (cfq_cfqq_slice_new(cfqq) ||
>>           (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
>>               cfq_clear_cfqq_deep(cfqq);
>>               cfq_clear_cfqq_idle_window(cfqq);
>>       }
>>
>
> would surely disable idling in this case. So, the patch I suggested
> doesn't make things worse (as compared with existing implementation).
but the cfq_disk_looks_slow isn't updated if the queue doesn't have 4 requests
or doesn't dispatch > 4 requests, so you always have CFQD_DISK_LOOKS_FAST()
return true if the first slice gets it to true. And if the queue does
dispatch > 4 requests in one jiffy, only cfq_disk_looks_fast is updated,
CFQD_DISK_LOOKS_FAST returns true too. I don't understand when
CFQD_DISK_LOOKS_FAST can be false.

Thanks,
Shaohua


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Maxim Patlasov @ 2011-08-31  9:12 UTC
  To: Shaohua Li; +Cc: axboe, linux-kernel

Shaohua,

Sorry for not getting back to you for so long. Please see my inline comments below.

> but the cfq_disk_looks_slow isn't updated if the queue doesn't have 4 requests
> or doesn't dispatch > 4 requests, so you always have CFQD_DISK_LOOKS_FAST()
> return true if the first slice gets it to true. And if the queue does
> dispatch > 4 requests in one jiffy, only cfq_disk_looks_fast is updated,
> CFQD_DISK_LOOKS_FAST returns true too. I don't understand when
> CFQD_DISK_LOOKS_FAST can be false.

The patch essentially gathers events where a cfqq experiences a deep queue
early and dispatches 4 requests in one dispatch round. As soon as such
an event is detected, we drop inside the outer 'if' clause:

> +	if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
> +		if (cfqq->first_dispatch == jiffies)
> +			cfqd->cfq_disk_looks_fast++;
> +		else
> +			cfqd->cfq_disk_looks_slow++;
> +
> +		cfqq->first_dispatch = 0;
> +		cfqq->n_dispatched = 0;
> +		cfq_clear_cfqq_deep_early(cfqq);
> +		cfqd->cfq_disk_last_updated = jiffies;
> +	}

and either increment disk_looks_fast or disk_looks_slow.

If the queue doesn't have 4 requests, cfqq is not 'deep' - no events
to gather. Neither disk_looks_fast nor disk_looks_slow is updated.

If the queue doesn't dispatch 4 requests in a row, the event will be
discarded, and again, neither disk_looks_fast nor disk_looks_slow is
updated.

If the queue does dispatch > 4 requests in one jiffy, only
cfq_disk_looks_fast is updated - that's right. But if the queue
dispatches first 4 requests in *more* than one jiffy,
cfq_disk_looks_slow is updated.

BTW, note that CFQD_DISK_LOOKS_FAST returns true only if disk_looks_fast is
greater than disk_looks_slow. So, in the case of rare accidental events, the
disk_looks_fast/slow counters will expire soon and CFQD_DISK_LOOKS_FAST will
return false ("0 > 0").
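
(The expiry logic referred to is this hunk from the patch; it halves both
counters every 10 seconds, or sooner if either of them exceeds 128:)

> +	if ((cfqd->cfq_disk_last_updated &&
> +	     jiffies - cfqd->cfq_disk_last_updated > HZ * 10) ||
> +	    cfqd->cfq_disk_looks_fast > 128 ||
> +	    cfqd->cfq_disk_looks_slow > 128) {
> +		cfqd->cfq_disk_looks_fast >>= 1;
> +		cfqd->cfq_disk_looks_slow >>= 1;
> +		cfqd->cfq_disk_last_updated = jiffies;
> +	}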

Thanks,
Maxim


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Shaohua Li @ 2011-09-06  1:03 UTC
  To: Maxim Patlasov; +Cc: axboe, linux-kernel

2011/8/31 Maxim Patlasov <maxim.patlasov@gmail.com>:
> Shaohua,
>
> Sorry for not getting back to you for long. See please inline comments below.
>
>> but the cfq_disk_looks_slow isn't updated if the queue doesn't have 4 requests
>> or doesn't dispatch > 4 requests, so you always have CFQD_DISK_LOOKS_FAST()
>> return true if the first slice gets it to true. And if the queue does
>> dispatch > 4 requests in one jiffy, only cfq_disk_looks_fast is updated,
>> CFQD_DISK_LOOKS_FAST returns true too. I don't understand when
>> CFQD_DISK_LOOKS_FAST can be false.
>
> The patch essentially gathers events where a cfqq experiences a deep queue
> early and dispatches 4 requests in one dispatch round. As soon as such
> an event is detected, we drop inside the outer 'if' clause:
>
>> +     if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
>> +             if (cfqq->first_dispatch == jiffies)
>> +                     cfqd->cfq_disk_looks_fast++;
>> +             else
>> +                     cfqd->cfq_disk_looks_slow++;
>> +
>> +             cfqq->first_dispatch = 0;
>> +             cfqq->n_dispatched = 0;
>> +             cfq_clear_cfqq_deep_early(cfqq);
>> +             cfqd->cfq_disk_last_updated = jiffies;
>> +     }
>
> and either increment disk_looks_fast or disk_looks_slow.
>
> If the queue doesn't have 4 requests, cfqq is not 'deep' - no events
> to gather. Neither disk_looks_fast nor disk_looks_slow is updated.
>
> If the queue doesn't dispatch 4 requests in a row, the event will be
> discarded, and again, neither disk_looks_fast nor disk_looks_slow is
> updated.
>
> If the queue does dispatch > 4 requests in one jiffy, only
> cfq_disk_looks_fast is updated - that's right. But if the queue
> dispatches first 4 requests in *more* than one jiffy,
> cfq_disk_looks_slow is updated.
So consider this case: in the first round, the queue dispatches > 4 requests
in one jiffy, and looks_fast gets updated. Later, if the queue always
dispatches < 4 requests or has < 4 requests queued, neither looks_fast nor
looks_slow gets updated until 10*HZ later.
CFQD_DISK_LOOKS_FAST() always returns true in that period. Is this sane?

Thanks,
Shaohua


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Maxim Patlasov @ 2011-09-06 15:42 UTC
  To: Shaohua Li; +Cc: axboe, linux-kernel

Shaohua,

>> If the queue does dispatch > 4 requests in one jiffy, only
>> cfq_disk_looks_fast is updated - that's right. But if the queue
>> dispatches first 4 requests in *more* than one jiffy,
>> cfq_disk_looks_slow is updated.
> So consider this case: in the first round, the queue dispatches > 4 requests
> in one jiffy, and looks_fast gets updated. Later, if the queue always
> dispatches < 4 requests or has < 4 requests queued, neither looks_fast nor
> looks_slow gets updated until 10*HZ later.
> CFQD_DISK_LOOKS_FAST() always returns true in that period. Is this sane?

Thanks a lot for the feedback. I agree: a single accidental event should
not affect the whole system for so long. What do you think about
making a positive decision only if some number of events were gathered
recently:

#define CFQD_DISK_LOOKS_FAST(cfqd)				    \
	(cfqd->cfq_disk_looks_fast > cfqd->cfq_disk_looks_slow &&   \
	 cfqd->cfq_disk_looks_fast + cfqd->cfq_disk_looks_slow > 10)

Btw, CFQD_DISK_LOOKS_FAST() only affects the idle/noidle behaviour of
seeky & deep queues. The question is which queues should be regarded as
'deep'. I tend to think they are ones that experience a deep queue quite
often and regularly. Then the chances of updating looks_fast/looks_slow in
a timely manner are high. At least "always has < 4 requests queued"
is not the case for really 'deep' queues. As for "always only dispatches
< 4 requests", it's theoretically possible but should not happen often
for really 'deep' queues. And conversely, if a queue experienced a deep
queue only once, is it a good idea to regard it as 'deep' and grant an
idle-window to it?

Thanks,
Maxim


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Shaohua Li @ 2011-09-09  4:43 UTC
  To: Maxim Patlasov; +Cc: axboe, linux-kernel


2011/9/6 Maxim Patlasov <maxim.patlasov@gmail.com>:
> Shaohua,
>
>>> If the queue does dispatch > 4 requests in one jiffy, only
>>> cfq_disk_looks_fast is updated - that's right. But if the queue
>>> dispatches first 4 requests in *more* than one jiffy,
>>> cfq_disk_looks_slow is updated.
>> So consider this case: in the first round, the queue dispatches > 4 requests
>> in one jiffy, and looks_fast gets updated. Later, if the queue always
>> dispatches < 4 requests or has < 4 requests queued, neither looks_fast nor
>> looks_slow gets updated until 10*HZ later.
>> CFQD_DISK_LOOKS_FAST() always returns true in that period. Is this sane?
>
> Thanks a lot for the feedback. I agree: a single accidental event should
> not affect the whole system for so long. What do you think about
> making a positive decision only if some number of events were gathered
> recently:
>
> #define CFQD_DISK_LOOKS_FAST(cfqd)                                  \
>        (cfqd->cfq_disk_looks_fast > cfqd->cfq_disk_looks_slow &&   \
>         cfqd->cfq_disk_looks_fast + cfqd->cfq_disk_looks_slow > 10)
I'm sorry to reply a little late.
That's better, but I'm still unsatisfied with the detection of whether a device is fast.

So the key problem here is how to detect if a device is fast. Doing the
detection in the dispatch stage can't always give us a correct result. A fast
device really should be one where requests can be finished in a short time. So
I have something attached. In my environment, a hard disk is detected as slow
and an SSD is detected as fast, but I haven't run any benchmarks so far. What
do you think about it?

Thanks,
Shaohua

[-- Attachment #2: cfq-fast-device-detech.patch --]

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a33bd43..8c4031c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -53,6 +53,8 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
+#define CFQD_FAST(cfqd)		(hweight32(cfqd->slow_device) > 32/8)
+
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private[0])
 #define RQ_CFQQ(rq)		(struct cfq_queue *) ((rq)->elevator_private[1])
@@ -130,6 +132,8 @@ struct cfq_queue {
 	unsigned long slice_end;
 	long slice_resid;
 
+	unsigned long seeky_dispatch_start;
+
 	/* pending metadata requests */
 	int meta_pending;
 	/* number of requests that are on the dispatch list or inside driver */
@@ -305,6 +309,8 @@ struct cfq_data {
 
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
+
+	u32 slow_device;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -2073,6 +2079,8 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
+	if (cfqq->dispatched == 0 && CFQQ_SEEKY(cfqq))
+		cfqq->seeky_dispatch_start = jiffies;
 	cfqq->dispatched++;
 	(RQ_CFQG(rq))->dispatched++;
 	elv_dispatch_sort(q, rq);
@@ -2375,17 +2383,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto keep_queue;
 	}
 
-	/*
-	 * This is a deep seek queue, but the device is much faster than
-	 * the queue can deliver, don't idle
-	 **/
-	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
-	    (cfq_cfqq_slice_new(cfqq) ||
-	    (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
-		cfq_clear_cfqq_deep(cfqq);
-		cfq_clear_cfqq_idle_window(cfqq);
-	}
-
 	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
 		cfqq = NULL;
 		goto keep_queue;
@@ -3298,7 +3295,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (cfqq->queued[0] + cfqq->queued[1] >= 4)
+	if (cfqq->queued[0] + cfqq->queued[1] >= 4 && !CFQD_FAST(cfqd))
 		cfq_mark_cfqq_deep(cfqq);
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
@@ -3587,6 +3584,14 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
+	if (cfqq->seeky_dispatch_start) {
+		cfqd->slow_device <<= 1;
+		if (jiffies - cfqq->seeky_dispatch_start >=
+			cfqd->cfq_slice_idle / 2)
+			cfqd->slow_device |= 1;
+		cfqq->seeky_dispatch_start = 0;
+	}
+
 	WARN_ON(!cfqd->rq_in_driver);
 	WARN_ON(!cfqq->dispatched);
 	cfqd->rq_in_driver--;
@@ -4089,6 +4094,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_group_idle = cfq_group_idle;
 	cfqd->cfq_latency = 1;
 	cfqd->hw_tag = -1;
+	cfqd->slow_device = -1;
 	/*
 	 * we optimistically start assuming sync ops weren't delayed in last
 	 * second, in order to have larger depth for async operations.


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Maxim Patlasov @ 2011-09-09 17:41 UTC
  To: Shaohua Li; +Cc: axboe, linux-kernel

Shaohua,

> So the key problem here is how to detect if a device is fast. Doing the
> detection in the dispatch stage can't always give us a correct result. A fast
> device really should be one where requests can be finished in a short time. So
> I have something attached. In my environment, a hard disk is detected as slow
> and an SSD is detected as fast, but I haven't run any benchmarks so far. What
> do you think about it?

Thanks for the patch, I'll test it in several h/w configurations soon
and let you know the results.

As a first thought, this patch would hardly meet the needs of h/w raids.
Each individual hdd may be not very fast, but with the disks assembled in
RAID-0, it's still very beneficial to submit many requests in a row. That's
why I made the estimate in the dispatch stage: if the hdd (or h/w raid) is
regularly able to drain a deep queue quite fast, we should claim it 'fast'
and avoid idling if possible.
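
To illustrate with rough, hypothetical numbers: if each seeky 4KB read costs
~8 msec on an individual disk, a 4-disk RAID-0 services four such requests
concurrently in ~8 msec total, i.e. ~2 msec per request on average, whereas
serviced one at a time they would take ~32 msec. The per-request service time
looks 'slow', yet the raid drains a deep queue 'fast'.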

Thanks,
Maxim


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Maxim Patlasov @ 2011-09-12 13:09 UTC
  To: Shaohua Li; +Cc: shaohua.li, axboe, linux-kernel

Shaohua,

>> So the key problem here is how to detect if a device is fast. Doing the
>> detection in the dispatch stage can't always give us a correct result. A fast
>> device really should be one where requests can be finished in a short time. So
>> I have something attached. In my environment, a hard disk is detected as slow
>> and an SSD is detected as fast, but I haven't run any benchmarks so far. What
>> do you think about it?
>
> Thanks for the patch, I'll test it in several h/w configurations soon
> and let you know the results.

1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:

# aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
f2_$I f3_$I f4_$I

Aggregate throughput:

Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
Pristine 3.1.0-rc5 (noop): 2.63 MB/s
Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
3.1.0-rc5 + your patch (CFQ): 5.61 MB/s

2. Four modern disks (WD1003FBYX) assembled in RAID-0 (Adaptec
AAC-RAID (rev 09), 256MB RAM). Eight instances of aio-stress with a
1 msec think-time:

> --- aio-stress-orig.c	2011-08-16 17:00:04.000000000 -0400
> +++ aio-stress.c	2011-08-18 14:49:31.000000000 -0400
> @@ -884,6 +884,7 @@ static int run_active_list(struct thread
>      }
>      if (num_built) {
>  	ret = run_built(t, num_built, t->iocbs);
> +	usleep(1000);
>  	if (ret < 0) {
>  	    fprintf(stderr, "error %d on run_built\n", ret);
>  	    exit(1);

Cmd-line:

# aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 f1_$I f2_$I f3_$I f4_$I

Aggregate throughput:

Pristine 3.1.0-rc5 (CFQ): 63.67 MB/s
Pristine 3.1.0-rc5 (noop): 100.8 MB/s
Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 105.63 MB/s
3.1.0-rc5 + my patch (CFQ): 105.59 MB/s
3.1.0-rc5 + your patch (CFQ): 14.36 MB/s

So, to meet the needs of striped raids, it's not enough to measure the
service time of individual requests. We somehow need to measure whether a
given hdd/raid is able to service many requests simultaneously in an
effective way.

Thanks,
Maxim


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Shaohua Li @ 2011-09-13  3:30 UTC
  To: Maxim Patlasov; +Cc: axboe, linux-kernel

On Mon, 2011-09-12 at 21:09 +0800, Maxim Patlasov wrote:
Hi,
> 
> >> So the key problem here is how to detect if a device is fast. Doing the
> >> detection in the dispatch stage can't always give us a correct result. A fast
> >> device really should be one where requests can be finished in a short time. So
> >> I have something attached. In my environment, a hard disk is detected as slow
> >> and an SSD is detected as fast, but I haven't run any benchmarks so far. What
> >> do you think about it?
> >
> > Thanks for the patch, I'll test it in several h/w configurations soon
> > and let you know the results.
> 
> 1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:
> 
> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
> f2_$I f3_$I f4_$I
> 
> Aggregate throughput:
> 
> Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
> Pristine 3.1.0-rc5 (noop): 2.63 MB/s
> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
> 3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
> 3.1.0-rc5 + your patch (CFQ): 5.61 MB/s
> 
> 2. Four modern disks (WD1003FBYX) assembled in RAID-0 (Adaptec
> AAC-RAID (rev 09), 256MB RAM). Eight instances of aio-stress with a
> 1 msec think-time:
> 
> > --- aio-stress-orig.c	2011-08-16 17:00:04.000000000 -0400
> > +++ aio-stress.c	2011-08-18 14:49:31.000000000 -0400
> > @@ -884,6 +884,7 @@ static int run_active_list(struct thread
> >      }
> >      if (num_built) {
> >  	ret = run_built(t, num_built, t->iocbs);
> > +	usleep(1000);
> >  	if (ret < 0) {
> >  	    fprintf(stderr, "error %d on run_built\n", ret);
> >  	    exit(1);
> 
> Cmd-line:
> 
> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 f1_$I f2_$I f3_$I f4_$I
> 
> Aggregate throughput:
> 
> Pristine 3.1.0-rc5 (CFQ): 63.67 MB/s
> Pristine 3.1.0-rc5 (noop): 100.8 MB/s
> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 105.63 MB/s
> 3.1.0-rc5 + my patch (CFQ): 105.59 MB/s
> 3.1.0-rc5 + your patch (CFQ): 14.36 MB/s
> 
> So, to meet the needs of striped raids, it's not enough to measure the
> service time of individual requests. We somehow need to measure whether a
> given hdd/raid is able to service many requests simultaneously in an
> effective way.
Thanks for the testing. You are right, this method doesn't work for hard
raid. I missed that each request in a raid still has a long finish time. I
changed the patch to detect a fast device; the idea remains but the
algorithm is different. It detects my hard disk/ssd well, but I don't have a
raid setup, so please help test.
I'm still not satisfied with fast-device detection in the dispatch stage:
even a slow device with NCQ can dispatch several requests in a short time (so
my original implementation was wrong, as you pointed out)
---
 block/cfq-iosched.c |   70 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 12 deletions(-)

Index: linux/block/cfq-iosched.c
===================================================================
--- linux.orig/block/cfq-iosched.c	2011-09-09 16:50:19.000000000 +0800
+++ linux/block/cfq-iosched.c	2011-09-13 11:21:47.000000000 +0800
@@ -52,6 +52,7 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_CLOSE_THR		(sector_t)(8 * 1024)
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
+#define CFQQ_STRICT_SEEKY(cfqq) (cfqq->seek_history == (u32)-1)
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private[0])
@@ -75,6 +76,8 @@ static DEFINE_IDA(cic_index_ida);
 #define sample_valid(samples)	((samples) > 80)
 #define rb_entry_cfqg(node)	rb_entry((node), struct cfq_group, rb_node)
 
+#define CFQD_FAST(cfqd)		sample_valid((cfqd)->fast_device_samples)
+
 /*
  * Most of our rbtree usage is for sorting with min extraction, so
  * if we cache the leftmost node we don't have to walk down the tree
@@ -130,6 +133,9 @@ struct cfq_queue {
 	unsigned long slice_end;
 	long slice_resid;
 
+	unsigned long seeky_dispatch_start;
+	int seeky_dispatched;
+
 	/* pending metadata requests */
 	int meta_pending;
 	/* number of requests that are on the dispatch list or inside driver */
@@ -305,6 +311,8 @@ struct cfq_data {
 
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
+
+	int fast_device_samples;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -1723,6 +1731,9 @@ static void __cfq_set_active_queue(struc
 		cfq_mark_cfqq_slice_new(cfqq);
 
 		cfq_del_timer(cfqd, cfqq);
+
+		cfqq->seeky_dispatch_start = 0;
+		cfqq->seeky_dispatched = 0;
 	}
 
 	cfqd->active_queue = cfqq;
@@ -2062,6 +2073,48 @@ static void cfq_arm_slice_timer(struct c
 }
 
 /*
+ * Detect if a device is fast.
+ * We profile strictly seeky queues to check if a device is fast. CFQQ_SEEKY()
+ * isn't good here, because there might be sequential requests.
+ * In the non-raid case, a hard disk (slow device) usually finishes a request
+ * in > 4ms; an SSD (fast device) usually finishes a request in < 1ms.
+ * In the raid case, we consider the raid a fast device: if the queue
+ * dispatches > 2 requests to multiple disks, the average time per request
+ * will be < 4ms.
+ * (>= 4ms  ->  >= 1 jiffy;  < 4ms  ->  0 jiffies)
+ */
+static void cfq_fast_device_detect_start(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	if (CFQD_FAST(cfqd))
+		return;
+	if (cfqq->dispatched == 0 && CFQQ_STRICT_SEEKY(cfqq)) {
+		cfqq->seeky_dispatch_start = jiffies;
+		cfqq->seeky_dispatched = 1;
+		return;
+	}
+	if (cfqq->seeky_dispatch_start) {
+		if (!CFQQ_STRICT_SEEKY(cfqq)) {
+			cfqq->seeky_dispatch_start = 0;
+			cfqq->seeky_dispatched = 0;
+		} else
+			cfqq->seeky_dispatched++;
+	}
+}
+
+static void cfq_fast_device_detect_end(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	if (cfqq->seeky_dispatch_start && cfqq->dispatched == 0) {
+		if ((jiffies - cfqq->seeky_dispatch_start) /
+		   cfqq->seeky_dispatched <= 0)
+			cfqd->fast_device_samples ++;
+		cfqq->seeky_dispatch_start = 0;
+		cfqq->seeky_dispatched = 0;
+	}
+}
+
+/*
  * Move request from internal lists to the request queue dispatch list.
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
@@ -2071,6 +2124,8 @@ static void cfq_dispatch_insert(struct r
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	cfq_fast_device_detect_start(cfqd, cfqq);
+
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
@@ -2375,17 +2430,6 @@ static struct cfq_queue *cfq_select_queu
 		goto keep_queue;
 	}
 
-	/*
-	 * This is a deep seek queue, but the device is much faster than
-	 * the queue can deliver, don't idle
-	 **/
-	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
-	    (cfq_cfqq_slice_new(cfqq) ||
-	    (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
-		cfq_clear_cfqq_deep(cfqq);
-		cfq_clear_cfqq_idle_window(cfqq);
-	}
-
 	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
 		cfqq = NULL;
 		goto keep_queue;
@@ -3298,7 +3342,7 @@ cfq_update_idle_window(struct cfq_data *
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (cfqq->queued[0] + cfqq->queued[1] >= 4)
+	if (cfqq->queued[0] + cfqq->queued[1] >= 4 && !CFQD_FAST(cfqd))
 		cfq_mark_cfqq_deep(cfqq);
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
@@ -3598,6 +3642,8 @@ static void cfq_completed_request(struct
 
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
 
+	cfq_fast_device_detect_end(cfqd, cfqq);
+
 	if (sync) {
 		struct cfq_rb_root *service_tree;
 




* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Maxim Patlasov @ 2011-09-13 12:46 UTC
  To: Shaohua Li; +Cc: axboe, linux-kernel

Hi Shaohua,

Please see the test results below.

>> 1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:
>>
>> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
>> f2_$I f3_$I f4_$I
>>
>> Aggregate throughput:
>>
>> Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
>> Pristine 3.1.0-rc5 (noop): 2.63 MB/s
>> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
>> 3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
>> 3.1.0-rc5 + your patch (CFQ): 5.61 MB/s

3.1.0-rc5 + your patch-v2 (CFQ): 2.85 MB/s

I re-ran the test many times (including node reboots); the results varied from
2.79 to 2.9. That's quite close to pristine 3.1.0-rc5 with slice_idle=0.
Probably, in this case the hdd was mistakenly claimed 'fast' by the
patch.

>> 2. Four modern disks (WD1003FBYX) assembled in RAID-0 (Adaptec
>> AAC-RAID (rev 09), 256MB RAM). Eight instances of aio-stress with a
>> 1 msec think-time:
>>
>> > --- aio-stress-orig.c       2011-08-16 17:00:04.000000000 -0400
>> > +++ aio-stress.c    2011-08-18 14:49:31.000000000 -0400
>> > @@ -884,6 +884,7 @@ static int run_active_list(struct thread
>> >      }
>> >      if (num_built) {
>> >     ret = run_built(t, num_built, t->iocbs);
>> > +   usleep(1000);
>> >     if (ret < 0) {
>> >         fprintf(stderr, "error %d on run_built\n", ret);
>> >         exit(1);
>>
>> Cmd-line:
>>
>> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 f1_$I f2_$I f3_$I f4_$I
>>
>> Aggregate throughput:
>>
>> Pristine 3.1.0-rc5 (CFQ): 63.67 MB/s
>> Pristine 3.1.0-rc5 (noop): 100.8 MB/s
>> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 105.63 MB/s
>> 3.1.0-rc5 + my patch (CFQ): 105.59 MB/s
>> 3.1.0-rc5 + your patch (CFQ): 14.36 MB/s

3.1.0-rc5 + your patch-v2 (CFQ): 92.44 - 109.49 MB/s.

The first time (after reboot) I got 92.44. One of the aio-stress instances
took much longer than the others to complete: 597 secs, while the others took
approximately 335 secs. Maybe this happened because one instance got the
'deep' flag set before the disk was claimed 'fast'. Second run: 109.49.
Then, after a reboot, 101.44 in the first run and 109.41 in the second
run.

> Thanks for the testing. You are right, this method doesn't work for hard
> raid. I missed that each request in a raid still has a long finish time. I
> changed the patch to detect a fast device; the idea remains but the
> algorithm is different. It detects my hard disk/ssd well, but I don't have a
> raid setup, so please help test.

A few general concerns (strictly IMHO) about this version of the patch:
1. If some cfqq was marked as 'deep' in the past and we're now
claiming the disk 'fast', it would be nice either to clear the stale flag or
to ignore it (when making the decision about idle/noidle behaviour).
2. cfqd->fast_device_samples never expires. I think that's wrong. The
system might have experienced some peculiar workload a long time ago that
resulted in claiming the disk 'fast'. Why should we trust that now? Another
concern is noise: from time to time requests may hit the h/w disk
cache and so complete very quickly. W/o expiration logic, such
noise will eventually end up claiming the disk 'fast'.
3. CFQQ_STRICT_SEEKY() looks extremely strict. Theoretically, it's
possible that we have many SEEKY cfqq-s but the rate of STRICT_SEEKY
events is too low to make a reliable estimate. Any rationale for why
STRICT_SEEKY should be the typical case?

> I'm still not satisfied with fast-device detection in the dispatch stage:
> even a slow device with NCQ can dispatch several requests in a short time (so
> my original implementation was wrong, as you pointed out)

I'd like to understand this concern better. OK, suppose it's a slow device
with NCQ that can dispatch several requests in a short time. But if we
have one or more disk-bound apps keeping the device busy, is it possible
that the device dispatches several requests in a short time more often
than in a longer time? As soon as its internal queue is saturated, the
device will be able to snatch the next request from CFQ only when one of
those it is currently servicing is completed, won't it? If the device
drains a deep queue in a short time again and again and again, it's fast,
not slow, isn't it? What am I missing?

Thanks in advance,
Maxim


* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
From: Shaohua Li @ 2011-09-14  5:12 UTC
  To: Maxim Patlasov; +Cc: axboe, linux-kernel

On Tue, 2011-09-13 at 20:46 +0800, Maxim Patlasov wrote:
Hi,
> Please see the test results below.
Thanks for trying it.

> >> 1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:
> >>
> >> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
> >> f2_$I f3_$I f4_$I
> >>
> >> Aggregate throughput:
> >>
> >> Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
> >> Pristine 3.1.0-rc5 (noop): 2.63 MB/s
> >> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
> >> 3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
> >> 3.1.0-rc5 + your patch (CFQ): 5.61 MB/s
> 
> 3.1.0-rc5 + your patch-v2 (CFQ): 2.85 MB/s
> 
> I re-ran the test many times (including node reboots); the results varied from
> 2.79 to 2.9. That's quite close to pristine 3.1.0-rc5 with slice_idle=0.
> Probably, in this case the hdd was mistakenly claimed 'fast' by the
> patch.
Looks like the current seeky detection isn't good. I investigated your
workload: each task accesses 4 files, so the task accesses disk sectors A, B,
C, D, A+1, B+1, C+1, D+1, ... Accessing A, B, C, D is seeky, but since the
disk has a cache, when A is accessed, A+1 actually ends up in the disk cache,
so a later access to A+1 just fetches it from the cache. This task queue
really should be detected as sequential. The current CFQ seeky detection
maintains only one history entry; if it maintained 4, the task queue would be
detected as sequential (a rough sketch of the idea follows below).
But I'm not sure whether we should fix this, if no real workload has such an
access pattern.
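
A minimal user-space sketch of that idea (hypothetical code, not the actual
CFQ seek_history logic): keep the last 4 request positions and treat an
access as sequential if it directly follows any of them. With this, the
pattern A, B, C, D, A+1, B+1, ... above is classified as sequential:

#include <stdint.h>

typedef uint64_t sector_t;

#define SEEK_HIST	4
#define RQ_SECTORS	8	/* one 4KB request = 8 512-byte sectors */

static sector_t seek_hist[SEEK_HIST];
static int seek_idx;

/* 1 if 'pos' directly follows any of the last SEEK_HIST accesses */
static int follows_recent(sector_t pos)
{
	int i;

	for (i = 0; i < SEEK_HIST; i++)
		if (pos == seek_hist[i] + RQ_SECTORS)
			return 1;
	return 0;
}

/* remember the position of the most recent access */
static void record_access(sector_t pos)
{
	seek_hist[seek_idx] = pos;
	seek_idx = (seek_idx + 1) % SEEK_HIST;
}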

> 
> > Thanks for the testing. You are right, this method doesn't work for hard
> > raid. I missed that each request in a raid still has a long finish time. I
> > changed the patch to detect a fast device; the idea remains but the
> > algorithm is different. It detects my hard disk/ssd well, but I don't have a
> > raid setup, so please help test.
> 
> A few general concerns (strictly IMHO) about this version of the patch:
> 1. If some cfqq was marked as 'deep' in the past and we're now
> claiming the disk 'fast', it would be nice either to clear the stale flag or
> to ignore it (when making the decision about idle/noidle behaviour).
Yes, this is wrong; I've fixed it already.

> 2. cfqd->fast_device_samples never expires. I think that's wrong. The
> system might have experienced some peculiar workload a long time ago that
> resulted in claiming the disk 'fast'. Why should we trust that now? Another
> concern is noise: from time to time requests may hit the h/w disk
> cache and so complete very quickly. W/o expiration logic, such
> noise will eventually end up claiming the disk 'fast'.
This is the reason why I used STRICT_SEEKY(): to avoid noise. But when I
investigated your aiostress workload above, I found my algorithm can't
work reliably. It depends on measuring seeky requests, but how to
detect that a queue is really seeky is quite hard too.

> 3. CFQQ_STRICT_SEEKY() looks extremely strict. Theoretically, it's
> possible that we have many SEEKY cfqq-s but the rate of STRICT_SEEKY
> events is too low to make a reliable estimate. Any rationale for why
> STRICT_SEEKY should be the typical case?
So we really need to make fast_device_samples expirable; then we don't need
to be so strict. I need to rethink whether that's possible.

> > I'm still not satisfied with fast-device detection in the dispatch stage:
> > even a slow device with NCQ can dispatch several requests in a short time (so
> > my original implementation was wrong, as you pointed out)
> 
> I'd like to understand this concern better. OK, suppose it's a slow device
> with NCQ that can dispatch several requests in a short time. But if we
> have one or more disk-bound apps keeping the device busy, is it possible
> that the device dispatches several requests in a short time more often
> than in a longer time? As soon as its internal queue is saturated, the
> device will be able to snatch the next request from CFQ only when one of
> those it is currently servicing is completed, won't it?
Yes. If the queue is saturated and the drive can dispatch several requests
in a short time, we _might_ think the drive is fast. But how can you
know the drive is saturated? Usually the drive queue isn't saturated,
because CFQ limits how many requests a queue can dispatch for latency
reasons; please see cfq_may_dispatch().
The reason I said '_might_' above is that you will have the same issue I ran
into above - how to detect a seeky queue. The drive might dispatch several
requests in a short time for a queue that is detected as seeky but
actually isn't. In that case we will make a wrong judgment.

> If the device
> drains a deep queue in a short time again and again and again, it's fast,
> not slow, isn't it? What am I missing?
You are right here, but you don't know whether the device really drains the
deep queue at the dispatch stage. It might just dispatch several requests
without actually finishing them.

Thanks,
Shaohua


