* [PATCH 0/1] CFQ: fixing performance issues

From: Maxim Patlasov @ 2011-08-19 15:38 UTC
To: axboe; +Cc: linux-kernel

Hi,

While chasing cfq vs. noop performance degradation in a complex testing environment (RHEL6-based kernel, Intel vconsolidate and Dell dvd-store tests running in virtual environments on relatively powerful servers equipped with fast h/w raids), I found a bunch of problems relating to 'idling' in cases where 'preempting' would be much more beneficial. Having secured some free time to fiddle with the mainline kernel (I used 3.1.0-rc2 in my tests), I managed to reproduce one of the performance issues using aio-stress alone.

The problem that this patch-set concerns is idling on seeky cfqq-s marked as 'deep'. Special handling of 'deep' cfqq-s was introduced a long time ago by commit 76280aff1c7e9ae761cac4b48591c43cd7d69159. The idea was that, if an application is using a large I/O depth, it is already optimized to make full utilization of the hardware, and therefore idling should be beneficial. The problem was that it was enough to see a large I/O depth only once, and the given cfqq would then keep the 'deep' flag for a long while. Obviously, this may hurt performance badly if the h/w is able to process many concurrent I/O requests efficiently.

Later, the problem was (partially) amended by patch 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, clearing the 'deep' and 'idle_window' flags if "the device is much faster than the queue can deliver". Unfortunately, the logic introduced by that patch suffers from two main problems:

- cfqq may keep the 'deep' and 'idle_window' flags for a while until that logic clears these flags; preemption is effectively disabled within this time gap

- even on commodity h/w with a single slow SATA hdd, that logic may produce a wrong estimate (claim the device fast when it's actually slow).
There are also a few more deficiencies in that logic. I described them in some detail in the patch description. Let's now look at figures.

Commodity server with a slow hdd, eight instances of aio-stress running concurrently, cmd-line of each:

# aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I f2_$I f3_$I f4_$I

Aggregate throughput:

Pristine 3.1.0-rc2 (CFQ):  3.59 MB/s
Pristine 3.1.0-rc2 (noop): 2.49 MB/s
3.1.0-rc2 w/o 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 (CFQ): 5.46 MB/s

So, that patch steals about 35% of throughput on a single slow hdd!

Now let's look at the server with a fast h/w raid (LSI 1078 RAID-0 built from 8 10K RPM SAS disks). To make the "time gap" effect visible, I had to modify aio-stress slightly:

> --- aio-stress-orig.c	2011-08-16 17:00:04.000000000 -0400
> +++ aio-stress.c	2011-08-18 14:49:31.000000000 -0400
> @@ -884,6 +884,7 @@ static int run_active_list(struct thread
>  	}
>  	if (num_built) {
>  		ret = run_built(t, num_built, t->iocbs);
> +		usleep(1000);
>  		if (ret < 0) {
>  			fprintf(stderr, "error %d on run_built\n", ret);
>  			exit(1);

(this change models an app with non-zero think-time). Aggregate throughput:

Pristine 3.1.0-rc2 (CFQ):  67.29 MB/s
Pristine 3.1.0-rc2 (noop): 99.76 MB/s

So, we can see about 30% performance degradation of CFQ as compared with noop. Let's see how idling affects it:

Pristine 3.1.0-rc2 (CFQ, slice_idle=0): 106.28 MB/s

This proves that all the degradation is due to idling. To be 100% sure that idling on "deep" tasks is guilty, let's re-run the test after commenting out the lines marking cfqq as "deep":

> 	//if (cfqq->queued[0] + cfqq->queued[1] >= 4)
> 	//	cfq_mark_cfqq_deep(cfqq);

3.1.0-rc2 (CFQ, mark_cfqq_deep commented out, default slice_idle): 98.51 MB/s

The throughput here is essentially the same as in the case of the noop scheduler. This proves that the 30% degradation resulted from idling on "deep" tasks and that patch 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 doesn't fully address such a test-case.
As a last effort, let's verify that that patch really recognizes the fast h/w raid as "fast enough". To do it, let's revert the changes in cfq_update_idle_window back to the state of pristine 3.1.0-rc2, but make clearing the "deep" flag in cfq_select_queue unconditional (pretending that the condition "the queue delivers all requests before half its slice is used" is always met):

> 	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) /* &&
> 	    (cfq_cfqq_slice_new(cfqq) ||
> 	    (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start)) */ ) {
> 		cfq_clear_cfqq_deep(cfqq);
> 		cfq_clear_cfqq_idle_window(cfqq);
> 	}

3.1.0-rc2 (CFQ, always clear "deep" flag, default slice_idle): 67.67 MB/s

The throughput here is the same as in the case of CFQ on pristine 3.1.0-rc2. This supports the hypothesis that the degradation results from the lack of preemption due to the time gap between marking a task as "deep" in cfq_update_idle_window and clearing this flag in cfq_select_queue.

After applying the patch from this patch-set, the aggregate throughput on the server with the fast h/w raid is 98.13 MB/s. On the commodity server with the slow hdd: 5.45 MB/s.

Thanks,
Maxim
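To make the "sticky deep flag" failure mode above concrete, here is a minimal user-space model of the original heuristic (this is a sketch, not the kernel code; struct and function names are illustrative). Once depth >= 4 is observed, nothing in the model, like nothing in the pre-8e1ac665 code path, ever clears the flag, so a seeky queue keeps earning an idle window long after its depth has dropped:

```c
#include <assert.h>

/* Minimal model of the sticky 'deep' heuristic described above.
 * Names are illustrative; this is not the kernel code itself. */
struct cfqq_model {
	int queued;	/* requests currently queued */
	int deep;	/* sticky flag: set once depth >= 4 is seen */
};

/* mirrors the idea in cfq_update_idle_window(): the flag is set
 * when depth >= 4, but nothing here ever clears it */
static void observe_depth(struct cfqq_model *q, int queued)
{
	q->queued = queued;
	if (q->queued >= 4)
		q->deep = 1;
}

/* a seeky queue is normally not idled on, unless it is 'deep' */
static int gets_idle_window(const struct cfqq_model *q, int seeky)
{
	return !seeky || q->deep;
}
```

A queue that hits depth 5 once and then falls back to depth 1 still returns 1 from gets_idle_window(), which is exactly the behaviour the commits discussed in this thread try to correct.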
* [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Maxim Patlasov @ 2011-08-19 15:38 UTC
To: axboe; +Cc: linux-kernel

The patch re-works the changes introduced by commit 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 to make the decision about idle/noidle behaviour on seeky&deep cfqq-s more precise.

The problems found in that former patch:

1. "the queue delivers all requests before half its slice is used" is too pessimistic an estimate. It can be demonstrated by aio-stress on a commodity server with a single slow SATA hdd that the estimate produces so many false positives that it degrades performance significantly (up to 35% as compared with pristine 3.1.0-rc2 w/o that patch).

2. There is a time gap between marking cfqq as 'deep' and 'idle_window' in cfq_update_idle_window() and clearing these bits in cfq_select_queue(). Until these bits are cleared there may be quite a few cfqq-s marked as 'deep' and 'idle_window'. This effectively disables preemption because CFQ usually allows it only between SYNC_NOIDLE_WORKLOAD cfqq-s. It can be demonstrated by slightly modified aio-stress (adding 1 msec think-time) that this problem may chew up to 30% of performance on fast h/w raids (as compared with setting slice_idle=0 or the noop scheduler).

3. It's possible that a task was preempted in the middle of a dispatch round. Then, the condition "the queue delivers all requests before half its slice is used" will be checked for a shallow queue. In such cases CFQ will regard a slow disk as fast.

4. It's possible that the queue at the moment of marking a task as "deep" was much deeper than four requests. Then the fact that the queue hasn't delivered all requests quickly doesn't imply that the disk is slow. In such cases CFQ may regard a fast disk as slow.
An approach suggested here avoids the performance degradations mentioned above. With this patch applied, the performance on a slow hdd is the same as it used to be before 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, and, on fast h/w-raids, it's roughly the same as for the noop scheduler or CFQ with slice_idle=0.

Signed-off-by: Maxim Patlasov <maxim.patlasov@gmail.com>
---
 block/cfq-iosched.c |   90 +++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 77 insertions(+), 13 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 1f96ad6..cf4b643 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -53,6 +53,11 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
+#define CFQQ_DEEP_THR	4
+
+#define CFQD_DISK_LOOKS_FAST(cfqd) \
+	(cfqd->cfq_disk_looks_fast > cfqd->cfq_disk_looks_slow)
+
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private[0])
 #define RQ_CFQQ(rq)	(struct cfq_queue *) ((rq)->elevator_private[1])
@@ -147,6 +152,11 @@ struct cfq_queue {
 	struct cfq_group *cfqg;
 	/* Number of sectors dispatched from queue in single dispatch round */
 	unsigned long nr_sectors;
+
+	/* When the first dispatch in a dispatch round happened */
+	unsigned long first_dispatch;
+	/* Number of dispatches since the first dispatch, plus 1 */
+	int n_dispatched;
 };
 
 /*
@@ -303,6 +313,17 @@ struct cfq_data {
 
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
+
+	/*
+	 * # times disk claimed as fast and slow correspondingly
+	 */
+	int cfq_disk_looks_fast;
+	int cfq_disk_looks_slow;
+
+	/*
+	 * when fast/slow fields were updated last time
+	 */
+	unsigned long cfq_disk_last_updated;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -333,6 +354,10 @@ enum cfqq_state_flags {
 	CFQ_CFQQ_FLAG_coop,		/* cfqq is shared */
 	CFQ_CFQQ_FLAG_split_coop,	/* shared cfqq will be splitted */
 	CFQ_CFQQ_FLAG_deep,		/* sync cfqq experienced large depth */
+	CFQ_CFQQ_FLAG_deep_early,	/* sync cfqq experienced large depth in
+					 * the beginning of time-slice, check
+					 * whether driver can drain the queue
+					 * quickly */
 	CFQ_CFQQ_FLAG_wait_busy,	/* Waiting for next request */
 };
 
@@ -362,6 +387,7 @@ CFQ_CFQQ_FNS(sync);
 CFQ_CFQQ_FNS(coop);
 CFQ_CFQQ_FNS(split_coop);
 CFQ_CFQQ_FNS(deep);
+CFQ_CFQQ_FNS(deep_early);
 CFQ_CFQQ_FNS(wait_busy);
 
 #undef CFQ_CFQQ_FNS
@@ -1704,6 +1730,11 @@ static void __cfq_set_active_queue(struct cfq_data *cfqd,
 		cfqq->slice_dispatch = 0;
 		cfqq->nr_sectors = 0;
 
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		if (cfqq->queued[0] + cfqq->queued[1] >= CFQQ_DEEP_THR)
+			cfq_mark_cfqq_deep_early(cfqq);
+
 		cfq_clear_cfqq_wait_request(cfqq);
 		cfq_clear_cfqq_must_dispatch(cfqq);
 		cfq_clear_cfqq_must_alloc_slice(cfqq);
@@ -1725,6 +1756,12 @@ __cfq_slice_expired(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 {
 	cfq_log_cfqq(cfqd, cfqq, "slice expired t=%d", timed_out);
 
+	if (cfq_cfqq_deep_early(cfqq)) {
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		cfq_clear_cfqq_deep_early(cfqq);
+	}
+
 	if (cfq_cfqq_wait_request(cfqq))
 		cfq_del_timer(cfqd, cfqq);
 
@@ -2059,6 +2096,12 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	if (cfq_cfqq_deep_early(cfqq)) {
+		cfqq->n_dispatched++;
+		if (!cfqq->first_dispatch)
+			cfqq->first_dispatch = jiffies;
+	}
+
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
@@ -2333,6 +2376,27 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto check_group_idle;
 	}
 
+	if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
+		if (cfqq->first_dispatch == jiffies)
+			cfqd->cfq_disk_looks_fast++;
+		else
+			cfqd->cfq_disk_looks_slow++;
+
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		cfq_clear_cfqq_deep_early(cfqq);
+		cfqd->cfq_disk_last_updated = jiffies;
+	}
+
+	if ((cfqd->cfq_disk_last_updated &&
+	     jiffies - cfqd->cfq_disk_last_updated > HZ * 10) ||
+	    cfqd->cfq_disk_looks_fast > 128 ||
+	    cfqd->cfq_disk_looks_slow > 128) {
+		cfqd->cfq_disk_looks_fast >>= 1;
+		cfqd->cfq_disk_looks_slow >>= 1;
+		cfqd->cfq_disk_last_updated = jiffies;
+	}
+
 	/*
 	 * The active queue has requests and isn't expired, allow it to
 	 * dispatch.
@@ -2340,6 +2404,12 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	if (!RB_EMPTY_ROOT(&cfqq->sort_list))
 		goto keep_queue;
 
+	if (cfq_cfqq_deep_early(cfqq)) {
+		cfqq->first_dispatch = 0;
+		cfqq->n_dispatched = 0;
+		cfq_clear_cfqq_deep_early(cfqq);
+	}
+
 	/*
 	 * If another queue has a request waiting within our mean seek
 	 * distance, let it run. The expire code will check for close
@@ -2363,17 +2433,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto keep_queue;
 	}
 
-	/*
-	 * This is a deep seek queue, but the device is much faster than
-	 * the queue can deliver, don't idle
-	 **/
-	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
-	    (cfq_cfqq_slice_new(cfqq) ||
-	     (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
-		cfq_clear_cfqq_deep(cfqq);
-		cfq_clear_cfqq_idle_window(cfqq);
-	}
-
 	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
 		cfqq = NULL;
 		goto keep_queue;
@@ -3286,13 +3345,18 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (cfqq->queued[0] + cfqq->queued[1] >= 4)
+	if (cfqq->queued[0] + cfqq->queued[1] >= CFQQ_DEEP_THR) {
 		cfq_mark_cfqq_deep(cfqq);
+		if (cfq_cfqq_slice_new(cfqq))
+			cfq_mark_cfqq_deep_early(cfqq);
+	}
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
 		enable_idle = 0;
 	else if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq)))
+	    ((!cfq_cfqq_deep(cfqq) || CFQD_DISK_LOOKS_FAST(cfqd))
+	     && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime.ttime_samples)) {
 		if (cic->ttime.ttime_mean > cfqd->cfq_slice_idle)
-- 
1.7.4.4
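The fast/slow accounting the patch adds to cfq_select_queue() can be exercised in isolation. The following is a hedged user-space sketch of that logic (not the kernel code; 'now' stands in for jiffies, and MODEL_HZ is an assumed stand-in for HZ): each completed round of CFQQ_DEEP_THR dispatches votes "fast" if it fit within one jiffy and "slow" otherwise, and both counters are halved when they grow large or go stale:

```c
#include <assert.h>

/* User-space sketch of the fast/slow accounting added to
 * cfq_select_queue() in this patch. Illustrative only: 'now'
 * models jiffies, MODEL_HZ models HZ. */
#define MODEL_HZ 1000UL

struct disk_stats {
	int looks_fast;			/* rounds drained within one jiffy */
	int looks_slow;			/* rounds that took longer */
	unsigned long last_updated;	/* 'jiffies' of the last update */
};

/* one completed dispatch round of CFQQ_DEEP_THR requests */
static void account_round(struct disk_stats *s, unsigned long first_dispatch,
			  unsigned long now)
{
	if (first_dispatch == now)
		s->looks_fast++;
	else
		s->looks_slow++;
	s->last_updated = now;
}

/* ageing: halve both counters when they grow large or go stale */
static void maybe_decay(struct disk_stats *s, unsigned long now)
{
	if ((s->last_updated && now - s->last_updated > MODEL_HZ * 10) ||
	    s->looks_fast > 128 || s->looks_slow > 128) {
		s->looks_fast >>= 1;
		s->looks_slow >>= 1;
		s->last_updated = now;
	}
}

/* mirrors CFQD_DISK_LOOKS_FAST() */
static int disk_looks_fast(const struct disk_stats *s)
{
	return s->looks_fast > s->looks_slow;
}
```

A device that repeatedly drains four requests within one jiffy accumulates looks_fast votes and is classified fast; the decay keeps old votes from dominating forever, which is the point Shaohua probes in the replies below.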
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Shaohua Li @ 2011-08-23  2:44 UTC
To: Maxim Patlasov; +Cc: axboe, linux-kernel

2011/8/19 Maxim Patlasov <maxim.patlasov@gmail.com>:
> The patch re-works the changes introduced by commit
> 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07 to make the decision about
> idle/noidle behaviour on seeky&deep cfqq-s more precise.
>
> The problems found in that former patch:
>
> 1. "the queue delivers all requests before half its slice is used" is too
> pessimistic an estimate. It can be demonstrated by aio-stress on a commodity
> server with a single slow SATA hdd that the estimate produces so many false
> positives that it degrades performance significantly (up to 35% as compared
> with pristine 3.1.0-rc2 w/o that patch).
>
> 2. There is a time gap between marking cfqq as 'deep' and 'idle_window' in
> cfq_update_idle_window() and clearing these bits in cfq_select_queue(). Until
> these bits are cleared there may be quite a few cfqq-s marked as 'deep' and
> 'idle_window'. This effectively disables preemption because CFQ usually
> allows it only between SYNC_NOIDLE_WORKLOAD cfqq-s. It can be demonstrated by
> slightly modified aio-stress (adding 1 msec think-time) that this problem may
> chew up to 30% of performance on fast h/w raids (as compared with setting
> slice_idle=0 or the noop scheduler).
>
> 3. It's possible that a task was preempted in the middle of a dispatch round.
> Then, the condition "the queue delivers all requests before half its slice
> is used" will be checked for a shallow queue. In such cases CFQ will regard
> a slow disk as fast.
>
> 4. It's possible that the queue at the moment of marking a task as "deep" was
> much deeper than four requests. Then the fact that the queue hasn't delivered
> all requests quickly doesn't imply that the disk is slow. In such cases CFQ
> may regard a fast disk as slow.
>
> An approach suggested here avoids the performance degradations mentioned
> above. With this patch applied, the performance on a slow hdd is the same as
> it used to be before 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, and, on fast
> h/w-raids, it's roughly the same as for the noop scheduler or CFQ with
> slice_idle=0.

Idling is a usual cause of CFQ performance issues. Did you test on a disk without NCQ? And did you test whether this hurts the performance of Vivek's original problem?

[snip]

> +	if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
> +		if (cfqq->first_dispatch == jiffies)
> +			cfqd->cfq_disk_looks_fast++;
> +		else
> +			cfqd->cfq_disk_looks_slow++;
> +

jiffies is too coarse here. A disk with NCQ can dispatch several requests within one jiffy.

Thanks,
Shaohua
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Maxim Patlasov @ 2011-08-23  8:56 UTC
To: Shaohua Li; +Cc: axboe, linux-kernel

Hi,

>> An approach suggested here avoids the performance degradations mentioned
>> above. With this patch applied, the performance on a slow hdd is the same as
>> it used to be before 8e1ac6655104bc6e1e79d67e2df88cc8fa9b6e07, and, on fast
>> h/w-raids, it's roughly the same as for the noop scheduler or CFQ with
>> slice_idle=0.
> Idling is a usual cause of CFQ performance issues. Did you test on a disk
> without NCQ?

Yes, on the node with the "slow hdd" CFQ detected it as hw_tag=0. I explained what I tested in the cover message (subj: [PATCH 0/1] CFQ: fixing performance issues).

> And did you test whether this hurts the performance of Vivek's original
> problem?

No. What's Vivek's original problem?

> [snip]
>> +	if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
>> +		if (cfqq->first_dispatch == jiffies)
>> +			cfqd->cfq_disk_looks_fast++;
>> +		else
>> +			cfqd->cfq_disk_looks_slow++;
>> +
> jiffies is too coarse here. A disk with NCQ can dispatch several
> requests within one jiffy.

If a disk with NCQ regularly dispatches four requests in a row within one jiffy, the patch I suggested will claim it "fast enough". It should be beneficial to disable idling for a deep&seeky cfqq in this case, imho. Anyway, the existing code:

> 	/*
> 	 * This is a deep seek queue, but the device is much faster than
> 	 * the queue can deliver, don't idle
> 	 **/
> 	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
> 	    (cfq_cfqq_slice_new(cfqq) ||
> 	     (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
> 		cfq_clear_cfqq_deep(cfqq);
> 		cfq_clear_cfqq_idle_window(cfqq);
> 	}

would surely disable idling in this case. So, the patch I suggested doesn't make things worse (as compared with the existing implementation).
Thanks,
Maxim
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Shaohua Li @ 2011-08-24  2:06 UTC
To: Maxim Patlasov; +Cc: axboe, linux-kernel

2011/8/23 Maxim Patlasov <maxim.patlasov@gmail.com>:
>> jiffies is too coarse here. A disk with NCQ can dispatch several
>> requests within one jiffy.
>
> If a disk with NCQ regularly dispatches four requests in a row within
> one jiffy, the patch I suggested will claim it "fast enough". It
> should be beneficial to disable idling for a deep&seeky cfqq in this
> case, imho. Anyway, the existing code:
>
>> 	/*
>> 	 * This is a deep seek queue, but the device is much faster than
>> 	 * the queue can deliver, don't idle
>> 	 **/
>> 	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
>> 	    (cfq_cfqq_slice_new(cfqq) ||
>> 	     (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
>> 		cfq_clear_cfqq_deep(cfqq);
>> 		cfq_clear_cfqq_idle_window(cfqq);
>> 	}
>
> would surely disable idling in this case. So, the patch I suggested
> doesn't make things worse (as compared with the existing implementation).

But cfq_disk_looks_slow isn't updated if the queue doesn't have 4 requests or doesn't dispatch > 4 requests, so CFQD_DISK_LOOKS_FAST() always returns true once the first slice gets it to true. And if the queue does dispatch > 4 requests in one jiffy, only cfq_disk_looks_fast is updated, and CFQD_DISK_LOOKS_FAST() returns true too. I don't understand when CFQD_DISK_LOOKS_FAST() can be false.

Thanks,
Shaohua
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Maxim Patlasov @ 2011-08-31  9:12 UTC
To: Shaohua Li; +Cc: axboe, linux-kernel

Shaohua,

Sorry for not getting back to you for so long. Please see my inline comments below.

> But cfq_disk_looks_slow isn't updated if the queue doesn't have 4 requests
> or doesn't dispatch > 4 requests, so CFQD_DISK_LOOKS_FAST() always returns
> true once the first slice gets it to true. And if the queue does
> dispatch > 4 requests in one jiffy, only cfq_disk_looks_fast is updated,
> and CFQD_DISK_LOOKS_FAST() returns true too. I don't understand when
> CFQD_DISK_LOOKS_FAST() can be false.

The patch essentially gathers events where a cfqq experiences a deep queue early and dispatches 4 requests in one dispatch round. As soon as such an event is detected, we drop into the outer 'if' clause:

> +	if (cfq_cfqq_deep_early(cfqq) && cfqq->n_dispatched >= CFQQ_DEEP_THR) {
> +		if (cfqq->first_dispatch == jiffies)
> +			cfqd->cfq_disk_looks_fast++;
> +		else
> +			cfqd->cfq_disk_looks_slow++;
> +
> +		cfqq->first_dispatch = 0;
> +		cfqq->n_dispatched = 0;
> +		cfq_clear_cfqq_deep_early(cfqq);
> +		cfqd->cfq_disk_last_updated = jiffies;
> +	}

and either increment disk_looks_fast or disk_looks_slow.

If the queue doesn't have 4 requests, the cfqq is not 'deep' - there are no events to gather. Neither disk_looks_fast nor disk_looks_slow is updated.

If the queue doesn't dispatch 4 requests in a row, the event will be discarded, and again, neither disk_looks_fast nor disk_looks_slow is updated.

If the queue does dispatch > 4 requests in one jiffy, only cfq_disk_looks_fast is updated - that's right. But if the queue dispatches its first 4 requests in *more* than one jiffy, cfq_disk_looks_slow is updated.

BTW, note that CFQD_DISK_LOOKS_FAST returns true only if disk_looks_fast is greater than disk_looks_slow. So, in the case of rare accidental events, the disk_looks_fast/slow counters will expire soon and CFQD_DISK_LOOKS_FAST will return false ("0 > 0").

Thanks,
Maxim
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Shaohua Li @ 2011-09-06  1:03 UTC
To: Maxim Patlasov; +Cc: axboe, linux-kernel

2011/8/31 Maxim Patlasov <maxim.patlasov@gmail.com>:
> If the queue does dispatch > 4 requests in one jiffy, only
> cfq_disk_looks_fast is updated - that's right. But if the queue
> dispatches its first 4 requests in *more* than one jiffy,
> cfq_disk_looks_slow is updated.

So consider this case: in the first round, the queue dispatches > 4 requests in one jiffy, and looks_fast gets updated. Later, the queue always either dispatches < 4 requests or has < 4 requests queued, so neither looks_fast nor looks_slow gets updated until 10*HZ later. CFQD_DISK_LOOKS_FAST() always returns true during that period. Is this sane?

Thanks,
Shaohua
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Maxim Patlasov @ 2011-09-06 15:42 UTC
To: Shaohua Li; +Cc: axboe, linux-kernel

Shaohua,

>> If the queue does dispatch > 4 requests in one jiffy, only
>> cfq_disk_looks_fast is updated - that's right. But if the queue
>> dispatches its first 4 requests in *more* than one jiffy,
>> cfq_disk_looks_slow is updated.
> So consider this case: in the first round, the queue dispatches > 4 requests
> in one jiffy, and looks_fast gets updated. Later, the queue always either
> dispatches < 4 requests or has < 4 requests queued, so neither looks_fast
> nor looks_slow gets updated until 10*HZ later. CFQD_DISK_LOOKS_FAST()
> always returns true during that period. Is this sane?

Thanks a lot for the feedback. I agree, a single accidental event should not affect the whole system for so long. What do you think about making the positive decision only if some minimum number of events was gathered recently:

#define CFQD_DISK_LOOKS_FAST(cfqd) \
	(cfqd->cfq_disk_looks_fast > cfqd->cfq_disk_looks_slow && \
	 cfqd->cfq_disk_looks_fast + cfqd->cfq_disk_looks_slow > 10)

Btw, CFQD_DISK_LOOKS_FAST() affects only the idle/noidle behaviour of seeky&deep queues. The question is which queues should be regarded as 'deep'. I tend to think they are those that experience a deep queue quite often and regularly. Then the chances of updating looks_fast/looks_slow in a timely manner are high. At least "always ... has < 4 requests queued" is not the case for really 'deep' queues. As for "always only dispatches < 4 requests", it's theoretically possible but should not happen often for really 'deep' queues. And conversely, if a queue experienced a deep queue only once, is it a good idea to regard it as 'deep' and grant an idle window to it?

Thanks,
Maxim
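The stricter macro proposed in the message above is easy to check in isolation. This is a hedged user-space sketch of the idea (the function name is illustrative, not part of any posted patch): the fast/slow ratio is trusted only once more than 10 rounds have been observed, so a single accidental "fast" vote no longer flips the classification:

```c
#include <assert.h>

/* Sketch of the stricter test proposed above: trust the fast/slow
 * ratio only once enough rounds have been observed. Illustrative,
 * not the final kernel macro. */
static int disk_looks_fast_v2(int looks_fast, int looks_slow)
{
	return looks_fast > looks_slow && looks_fast + looks_slow > 10;
}
```

With this form, the scenario Shaohua raises (one fast round, then silence for 10*HZ) yields 1 + 0 = 1 sample, which is below the threshold, so the disk is not claimed fast.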
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq

From: Shaohua Li @ 2011-09-09  4:43 UTC
To: Maxim Patlasov; +Cc: axboe, linux-kernel

2011/9/6 Maxim Patlasov <maxim.patlasov@gmail.com>:
> Thanks a lot for the feedback. I agree, a single accidental event should
> not affect the whole system for so long. What do you think about making
> the positive decision only if some minimum number of events was gathered
> recently:
>
> #define CFQD_DISK_LOOKS_FAST(cfqd) \
> 	(cfqd->cfq_disk_looks_fast > cfqd->cfq_disk_looks_slow && \
> 	 cfqd->cfq_disk_looks_fast + cfqd->cfq_disk_looks_slow > 10)

I'm sorry for the late reply. That's better, but I'm still unsatisfied with the detection of whether a device is fast. So the key problem here is how to detect whether a device is fast. Doing the detection at the dispatch stage can't always give a correct result. A fast device really should be one whose requests can be finished in a short time. So I have something attached. In my environment, a hard disk is detected slow and an SSD is detected fast, but I haven't run any benchmark so far. What do you think about it?

Thanks,
Shaohua

[-- Attachment: cfq-fast-device-detech.patch --]

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index a33bd43..8c4031c 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -53,6 +53,8 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
 
+#define CFQD_FAST(cfqd)	(hweight32(cfqd->slow_device) > 32/8)
+
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private[0])
 #define RQ_CFQQ(rq)	(struct cfq_queue *) ((rq)->elevator_private[1])
@@ -130,6 +132,8 @@ struct cfq_queue {
 	unsigned long slice_end;
 	long slice_resid;
 
+	unsigned long seeky_dispatch_start;
+
 	/* pending metadata requests */
 	int meta_pending;
 	/* number of requests that are on the dispatch list or inside driver */
@@ -305,6 +309,8 @@ struct cfq_data {
 
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
+
+	u32 slow_device;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -2073,6 +2079,8 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
+	if (cfqq->dispatched == 0 && CFQQ_SEEKY(cfqq))
+		cfqq->seeky_dispatch_start = jiffies;
 	cfqq->dispatched++;
 	(RQ_CFQG(rq))->dispatched++;
 	elv_dispatch_sort(q, rq);
@@ -2375,17 +2383,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto keep_queue;
 	}
 
-	/*
-	 * This is a deep seek queue, but the device is much faster than
-	 * the queue can deliver, don't idle
-	 **/
-	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
-	    (cfq_cfqq_slice_new(cfqq) ||
-	     (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
-		cfq_clear_cfqq_deep(cfqq);
-		cfq_clear_cfqq_idle_window(cfqq);
-	}
-
 	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
 		cfqq = NULL;
 		goto keep_queue;
@@ -3298,7 +3295,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (cfqq->queued[0] + cfqq->queued[1] >= 4)
+	if (cfqq->queued[0] + cfqq->queued[1] >= 4 && !CFQD_FAST(cfqd))
 		cfq_mark_cfqq_deep(cfqq);
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
@@ -3587,6 +3584,14 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
+	if (cfqq->seeky_dispatch_start) {
+		cfqd->slow_device <<= 1;
+		if (jiffies - cfqq->seeky_dispatch_start >=
+		    cfqd->cfq_slice_idle / 2)
+			cfqd->slow_device |= 1;
+		cfqq->seeky_dispatch_start = 0;
+	}
+
 	WARN_ON(!cfqd->rq_in_driver);
 	WARN_ON(!cfqq->dispatched);
 	cfqd->rq_in_driver--;
@@ -4089,6 +4094,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_group_idle = cfq_group_idle;
 	cfqd->cfq_latency = 1;
 	cfqd->hw_tag = -1;
+	cfqd->slow_device = -1;
 	/*
 	 * we optimistically start assuming sync ops weren't delayed in last
 	 * second, in order to have larger depth for async operations.
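The attachment's core mechanism, a 32-bit sliding history of "slow" completion samples judged by population count, in the style of CFQQ_SEEKY, can be modeled in user space. This is a hedged sketch, not the attached patch itself: the names are illustrative, and the classifier below is written as device_slow (mostly-slow history means a slow device), which is an assumed reading of the draft's intent rather than a literal transcription of its CFQD_FAST macro:

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of the slow_device history in the attached draft:
 * each seeky dispatch round shifts in one bit (1 = the round's first
 * request took >= slice_idle/2 to complete). Like CFQQ_SEEKY, the
 * classifier looks at the population count of the last 32 samples.
 * Illustrative names; polarity is an assumed reading of the draft. */
static int popcount32(uint32_t x)
{
	int n = 0;
	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

static uint32_t record_round(uint32_t hist, int was_slow)
{
	return (hist << 1) | (was_slow ? 1u : 0u);
}

/* device counts as slow when more than 1/8 of recent rounds were slow */
static int device_slow(uint32_t hist)
{
	return popcount32(hist) > 32 / 8;
}
```

The appeal of this design is that old samples age out automatically as new rounds shift them off the high end, with no explicit decay timer, which is what distinguishes it from the counter-plus-decay scheme in Maxim's patch.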
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
  2011-09-09  4:43 ` Shaohua Li
@ 2011-09-09 17:41   ` Maxim Patlasov
  2011-09-12 13:09     ` Maxim Patlasov
  0 siblings, 1 reply; 14+ messages in thread
From: Maxim Patlasov @ 2011-09-09 17:41 UTC (permalink / raw)
To: Shaohua Li; +Cc: axboe, linux-kernel

Shaohua,

> So the key problem here is how to detect if a device is fast. Doing the
> detection in the dispatch stage can't always give us a correct result. On
> a fast device, requests should be finished in a short time. So I have
> something attached. In my environment, a hard disk is detected slow and an
> SSD is detected fast, but I haven't run any benchmark so far. What do you
> think about it?

Thanks for the patch, I'll test it in several h/w configurations soon
and let you know the results.

As a first thought, this patch would hardly meet the needs of h/w RAIDs.
Each individual hdd may be not very fast, but with the disks assembled in
RAID-0 it's still very beneficial to submit many requests in a row.
That's why I made the estimation in the dispatch stage: if an hdd (or h/w
RAID) is regularly able to drain a deep queue quickly, we should claim it
as 'fast' and avoid idling if possible.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
  2011-09-09 17:41 ` Maxim Patlasov
@ 2011-09-12 13:09   ` Maxim Patlasov
  2011-09-13  3:30     ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: Maxim Patlasov @ 2011-09-12 13:09 UTC (permalink / raw)
To: Shaohua Li; +Cc: shaohua.li, axboe, linux-kernel

Shaohua,

>> So the key problem here is how to detect if a device is fast. Doing the
>> detection in the dispatch stage can't always give us a correct result. On
>> a fast device, requests should be finished in a short time. So I have
>> something attached. In my environment, a hard disk is detected slow and an
>> SSD is detected fast, but I haven't run any benchmark so far. What do you
>> think about it?
>
> Thanks for the patch, I'll test it in several h/w configurations soon
> and let you know the results.

1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:

# aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
f2_$I f3_$I f4_$I

Aggregate throughput:

Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
Pristine 3.1.0-rc5 (noop): 2.63 MB/s
Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
3.1.0-rc5 + your patch (CFQ): 5.61 MB/s

2. Four modern disks (WD1003FBYX) assembled in RAID-0 (Adaptec AAC-RAID
(rev 09), 256MB RAM). Eight instances of aio-stress with think-time 1msec:

> --- aio-stress-orig.c	2011-08-16 17:00:04.000000000 -0400
> +++ aio-stress.c	2011-08-18 14:49:31.000000000 -0400
> @@ -884,6 +884,7 @@ static int run_active_list(struct thread
> 	}
> 	if (num_built) {
> 		ret = run_built(t, num_built, t->iocbs);
> +		usleep(1000);
> 		if (ret < 0) {
> 			fprintf(stderr, "error %d on run_built\n", ret);
> 			exit(1);

Cmd-line:

# aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 f1_$I f2_$I f3_$I f4_$I

Aggregate throughput:

Pristine 3.1.0-rc5 (CFQ): 63.67 MB/s
Pristine 3.1.0-rc5 (noop): 100.8 MB/s
Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 105.63 MB/s
3.1.0-rc5 + my patch (CFQ): 105.59 MB/s
3.1.0-rc5 + your patch (CFQ): 14.36 MB/s

So, to meet the needs of striped RAIDs, it's not enough to measure the
service time of individual requests. We need some way to measure whether a
given hdd/RAID is able to service many requests simultaneously in an
effective way.

Thanks,
Maxim

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
  2011-09-12 13:09 ` Maxim Patlasov
@ 2011-09-13  3:30   ` Shaohua Li
  2011-09-13 12:46     ` Maxim Patlasov
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2011-09-13  3:30 UTC (permalink / raw)
To: Maxim Patlasov; +Cc: axboe, linux-kernel

On Mon, 2011-09-12 at 21:09 +0800, Maxim Patlasov wrote:
Hi,

> >> So the key problem here is how to detect if a device is fast. Doing the
> >> detection in the dispatch stage can't always give us a correct result.
> >> On a fast device, requests should be finished in a short time. So I have
> >> something attached. In my environment, a hard disk is detected slow and
> >> an SSD is detected fast, but I haven't run any benchmark so far. What do
> >> you think about it?
> >
> > Thanks for the patch, I'll test it in several h/w configurations soon
> > and let you know the results.
>
> 1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:
>
> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
> f2_$I f3_$I f4_$I
>
> Aggregate throughput:
>
> Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
> Pristine 3.1.0-rc5 (noop): 2.63 MB/s
> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
> 3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
> 3.1.0-rc5 + your patch (CFQ): 5.61 MB/s
>
> 2. Four modern disks (WD1003FBYX) assembled in RAID-0 (Adaptec AAC-RAID
> (rev 09), 256MB RAM). Eight instances of aio-stress with think-time 1msec:
>
> > --- aio-stress-orig.c	2011-08-16 17:00:04.000000000 -0400
> > +++ aio-stress.c	2011-08-18 14:49:31.000000000 -0400
> > @@ -884,6 +884,7 @@ static int run_active_list(struct thread
> > 	}
> > 	if (num_built) {
> > 		ret = run_built(t, num_built, t->iocbs);
> > +		usleep(1000);
> > 		if (ret < 0) {
> > 			fprintf(stderr, "error %d on run_built\n", ret);
> > 			exit(1);
>
> Cmd-line:
>
> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 f1_$I f2_$I f3_$I f4_$I
>
> Aggregate throughput:
>
> Pristine 3.1.0-rc5 (CFQ): 63.67 MB/s
> Pristine 3.1.0-rc5 (noop): 100.8 MB/s
> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 105.63 MB/s
> 3.1.0-rc5 + my patch (CFQ): 105.59 MB/s
> 3.1.0-rc5 + your patch (CFQ): 14.36 MB/s
>
> So, to meet the needs of striped RAIDs, it's not enough to measure the
> service time of individual requests. We need some way to measure whether a
> given hdd/RAID is able to service many requests simultaneously in an
> effective way.

Thanks for the testing. You are right, this method doesn't work for
hardware RAID. I missed that each request in a RAID still has a long
finish time. I changed the patch to detect a fast device; the idea
remains but the algorithm is different. It detects my hard disk/SSD well,
but I don't have a RAID setup, so please help test.
I'm not satisfied with fast device detection in the dispatch stage; even a
slow device with NCQ can dispatch several requests in a short time (so my
original implementation is wrong, as you pointed out).

---
 block/cfq-iosched.c |   70 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 12 deletions(-)

Index: linux/block/cfq-iosched.c
===================================================================
--- linux.orig/block/cfq-iosched.c	2011-09-09 16:50:19.000000000 +0800
+++ linux/block/cfq-iosched.c	2011-09-13 11:21:47.000000000 +0800
@@ -52,6 +52,7 @@ static const int cfq_hist_divisor = 4;
 #define CFQQ_CLOSE_THR		(sector_t)(8 * 1024)
 #define CFQQ_SECT_THR_NONROT	(sector_t)(2 * 32)
 #define CFQQ_SEEKY(cfqq)	(hweight32(cfqq->seek_history) > 32/8)
+#define CFQQ_STRICT_SEEKY(cfqq)	(cfqq->seek_history == (u32)-1)
 
 #define RQ_CIC(rq)		\
 	((struct cfq_io_context *) (rq)->elevator_private[0])
@@ -75,6 +76,8 @@ static DEFINE_IDA(cic_index_ida);
 #define sample_valid(samples)	((samples) > 80)
 #define rb_entry_cfqg(node)	rb_entry((node), struct cfq_group, rb_node)
 
+#define CFQD_FAST(cfqd)		sample_valid((cfqd)->fast_device_samples)
+
 /*
  * Most of our rbtree usage is for sorting with min extraction, so
  * if we cache the leftmost node we don't have to walk down the tree
@@ -130,6 +133,9 @@ struct cfq_queue {
 	unsigned long slice_end;
 	long slice_resid;
 
+	unsigned long seeky_dispatch_start;
+	int seeky_dispatched;
+
 	/* pending metadata requests */
 	int meta_pending;
 	/* number of requests that are on the dispatch list or inside driver */
@@ -305,6 +311,8 @@ struct cfq_data {
 	/* Number of groups which are on blkcg->blkg_list */
 	unsigned int nr_blkcg_linked_grps;
+
+	int fast_device_samples;
 };
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
@@ -1723,6 +1731,9 @@ static void __cfq_set_active_queue(struc
 		cfq_mark_cfqq_slice_new(cfqq);
 
 		cfq_del_timer(cfqd, cfqq);
+
+		cfqq->seeky_dispatch_start = 0;
+		cfqq->seeky_dispatched = 0;
 	}
 
 	cfqd->active_queue = cfqq;
@@ -2062,6 +2073,48 @@ static void cfq_arm_slice_timer(struct c
 }
 
 /*
+ * Detect if a device is fast.
+ * We profile a strictly seeky queue to check if a device is fast.
+ * CFQQ_SEEKY() isn't good here, because there might be sequential requests.
+ * In the non-raid case, a hard disk (slow device) usually finishes a request
+ * in > 4ms; an SSD (fast device) usually finishes a request in < 1ms.
+ * In the raid case, we consider a raid a fast device: if the queue dispatches
+ * > 2 requests to multiple disks, the average time of each request is < 4ms.
+ * >= 4ms  ->  >= 1 jiffy
+ * < 4ms   ->  == 0 jiffies
+ */
+static void cfq_fast_device_detect_start(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	if (CFQD_FAST(cfqd))
+		return;
+	if (cfqq->dispatched == 0 && CFQQ_STRICT_SEEKY(cfqq)) {
+		cfqq->seeky_dispatch_start = jiffies;
+		cfqq->seeky_dispatched = 1;
+		return;
+	}
+	if (cfqq->seeky_dispatch_start) {
+		if (!CFQQ_STRICT_SEEKY(cfqq)) {
+			cfqq->seeky_dispatch_start = 0;
+			cfqq->seeky_dispatched = 0;
+		} else
+			cfqq->seeky_dispatched++;
+	}
+}
+
+static void cfq_fast_device_detect_end(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	if (cfqq->seeky_dispatch_start && cfqq->dispatched == 0) {
+		if ((jiffies - cfqq->seeky_dispatch_start) /
+		    cfqq->seeky_dispatched <= 0)
+			cfqd->fast_device_samples++;
+		cfqq->seeky_dispatch_start = 0;
+		cfqq->seeky_dispatched = 0;
+	}
+}
+
+/*
  * Move request from internal lists to the request queue dispatch list.
  */
 static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
@@ -2071,6 +2124,8 @@ static void cfq_dispatch_insert(struct r
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	cfq_fast_device_detect_start(cfqd, cfqq);
+
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
@@ -2375,17 +2430,6 @@ static struct cfq_queue *cfq_select_queu
 		goto keep_queue;
 	}
 
-	/*
-	 * This is a deep seek queue, but the device is much faster than
-	 * the queue can deliver, don't idle
-	 **/
-	if (CFQQ_SEEKY(cfqq) && cfq_cfqq_idle_window(cfqq) &&
-	    (cfq_cfqq_slice_new(cfqq) ||
-	     (cfqq->slice_end - jiffies > jiffies - cfqq->slice_start))) {
-		cfq_clear_cfqq_deep(cfqq);
-		cfq_clear_cfqq_idle_window(cfqq);
-	}
-
 	if (cfqq->dispatched && cfq_should_idle(cfqd, cfqq)) {
 		cfqq = NULL;
 		goto keep_queue;
@@ -3298,7 +3342,7 @@ cfq_update_idle_window(struct cfq_data *
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (cfqq->queued[0] + cfqq->queued[1] >= 4)
+	if (cfqq->queued[0] + cfqq->queued[1] >= 4 && !CFQD_FAST(cfqd))
 		cfq_mark_cfqq_deep(cfqq);
 
 	if (cfqq->next_rq && (cfqq->next_rq->cmd_flags & REQ_NOIDLE))
@@ -3598,6 +3642,8 @@ static void cfq_completed_request(struct
 
 	cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--;
 
+	cfq_fast_device_detect_end(cfqd, cfqq);
+
 	if (sync) {
 		struct cfq_rb_root *service_tree;

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
  2011-09-13  3:30 ` Shaohua Li
@ 2011-09-13 12:46   ` Maxim Patlasov
  2011-09-14  5:12     ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: Maxim Patlasov @ 2011-09-13 12:46 UTC (permalink / raw)
To: Shaohua Li; +Cc: axboe, linux-kernel

Hi Shaohua,

Please see the test results below.

>> 1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:
>>
>> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
>> f2_$I f3_$I f4_$I
>>
>> Aggregate throughput:
>>
>> Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
>> Pristine 3.1.0-rc5 (noop): 2.63 MB/s
>> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
>> 3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
>> 3.1.0-rc5 + your patch (CFQ): 5.61 MB/s

3.1.0-rc5 + your patch-v2 (CFQ): 2.85 MB/s

I re-ran the test many times (including a node reboot); results varied
from 2.79 to 2.9. That's quite close to pristine 3.1.0-rc5 with
slice_idle=0. Probably, in this case the hdd was mistakenly claimed as
'fast' by the patch.

>> 2. Four modern disks (WD1003FBYX) assembled in RAID-0 (Adaptec AAC-RAID
>> (rev 09), 256MB RAM). Eight instances of aio-stress with think-time 1msec:
>>
>> > --- aio-stress-orig.c	2011-08-16 17:00:04.000000000 -0400
>> > +++ aio-stress.c	2011-08-18 14:49:31.000000000 -0400
>> > @@ -884,6 +884,7 @@ static int run_active_list(struct thread
>> > 	}
>> > 	if (num_built) {
>> > 		ret = run_built(t, num_built, t->iocbs);
>> > +		usleep(1000);
>> > 		if (ret < 0) {
>> > 			fprintf(stderr, "error %d on run_built\n", ret);
>> > 			exit(1);
>>
>> Cmd-line:
>>
>> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 f1_$I f2_$I f3_$I f4_$I
>>
>> Aggregate throughput:
>>
>> Pristine 3.1.0-rc5 (CFQ): 63.67 MB/s
>> Pristine 3.1.0-rc5 (noop): 100.8 MB/s
>> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 105.63 MB/s
>> 3.1.0-rc5 + my patch (CFQ): 105.59 MB/s
>> 3.1.0-rc5 + your patch (CFQ): 14.36 MB/s

3.1.0-rc5 + your patch-v2 (CFQ): 92.44 - 109.49 MB/s.

The first time (after a reboot) I got 92.44. One of the aio-stress
instances took much longer than the others to complete: 597 secs, while
the others took approximately 335 secs. Maybe this happened because that
instance got the 'deep' flag set before the disk was claimed as 'fast'.
The second run gave 109.49. Then, after a reboot, 101.44 in the first run
and 109.41 in the second.

> Thanks for the testing. You are right, this method doesn't work for
> hardware RAID. I missed that each request in a RAID still has a long
> finish time. I changed the patch to detect a fast device; the idea
> remains but the algorithm is different. It detects my hard disk/SSD
> well, but I don't have a RAID setup, so please help test.

A few general concerns (strictly IMHO) about this version of the patch:

1. If some cfqq was marked as 'deep' in the past and now we're claiming
the disk 'fast', it would be nice either to clear the stale flag or to
ignore it (when deciding between idle/noidle behaviour).

2. cfqd->fast_device_samples is never expired. I think that's wrong. The
system might have experienced some peculiar workload a long time ago that
resulted in claiming the disk 'fast'. Why should we trust that now?
Another concern is noise: from time to time requests may hit the h/w disk
cache and so complete very quickly. W/o expiration logic, such noise will
eventually end up in claiming the disk 'fast'.

3. CFQQ_STRICT_SEEKY() looks extremely strict. Theoretically, it's
possible that we have many SEEKY cfqq-s but the rate of STRICT_SEEKY
events is too low to make a reliable estimation. Any rationale why
STRICT_SEEKY should be the typical case?

> I'm not satisfied with fast device detection in the dispatch stage; even
> a slow device with NCQ can dispatch several requests in a short time (so
> my original implementation is wrong, as you pointed out)

I'd like to understand this concern better. OK, let it be a slow device
with NCQ which can dispatch several requests in a short time. But if we
have one or more disk-bound apps keeping the device busy, is it possible
that the device dispatches several requests in a short time more often
than in a longer time? As soon as its internal queue is saturated, the
device will be able to snatch the next request from CFQ only when one of
those being serviced now is completed, won't it? If the device drains a
deep queue in a short time again and again and again, it's fast, not
slow, isn't it? What am I missing?

Thanks in advance,
Maxim

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [PATCH 1/1] CFQ: fix handling 'deep' cfqq
  2011-09-13 12:46 ` Maxim Patlasov
@ 2011-09-14  5:12   ` Shaohua Li
  0 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2011-09-14  5:12 UTC (permalink / raw)
To: Maxim Patlasov; +Cc: axboe, linux-kernel

On Tue, 2011-09-13 at 20:46 +0800, Maxim Patlasov wrote:
Hi,

> Please see the test results below.

Thanks for trying it.

> >> 1. Single slow disk (ST3200826AS). Eight instances of aio-stress, cmd-line:
> >>
> >> # aio-stress -a 4 -b 4 -c 1 -r 4 -O -o 0 -t 1 -d 1 -i 1 -s 16 f1_$I
> >> f2_$I f3_$I f4_$I
> >>
> >> Aggregate throughput:
> >>
> >> Pristine 3.1.0-rc5 (CFQ): 3.77 MB/s
> >> Pristine 3.1.0-rc5 (noop): 2.63 MB/s
> >> Pristine 3.1.0-rc5 (CFQ, slice_idle=0): 2.81 MB/s
> >> 3.1.0-rc5 + my patch (CFQ): 5.76 MB/s
> >> 3.1.0-rc5 + your patch (CFQ): 5.61 MB/s
>
> 3.1.0-rc5 + your patch-v2 (CFQ): 2.85 MB/s
>
> I re-ran the test many times (including a node reboot); results varied
> from 2.79 to 2.9. That's quite close to pristine 3.1.0-rc5 with
> slice_idle=0. Probably, in this case the hdd was mistakenly claimed as
> 'fast' by the patch.

It looks like the current seeky detection isn't good enough. I
investigated your workload: each task accesses 4 files, so the task
accesses disk sectors A, B, C, D, A+1, B+1, C+1, D+1, ... Accessing A, B,
C, D is seeky, but since the disk has a cache, when A is accessed, A+1 is
already in the disk cache, so accessing A+1 later just fetches from the
cache. This task queue really should be detected as sequential. Current
CFQ seeky detection maintains only one history entry; if it maintained 4,
the task queue would be detected as sequential. But I'm not sure we
should fix this if no real workload has such an access pattern.

> > Thanks for the testing. You are right, this method doesn't work for
> > hardware RAID. I missed that each request in a RAID still has a long
> > finish time. I changed the patch to detect a fast device; the idea
> > remains but the algorithm is different. It detects my hard disk/SSD
> > well, but I don't have a RAID setup, so please help test.
>
> A few general concerns (strictly IMHO) about this version of the patch:
>
> 1. If some cfqq was marked as 'deep' in the past and now we're claiming
> the disk 'fast', it would be nice either to clear the stale flag or to
> ignore it (when deciding between idle/noidle behaviour).

Yes, this is wrong; I've fixed it already.

> 2. cfqd->fast_device_samples is never expired. I think that's wrong. The
> system might have experienced some peculiar workload a long time ago that
> resulted in claiming the disk 'fast'. Why should we trust that now?
> Another concern is noise: from time to time requests may hit the h/w disk
> cache and so complete very quickly. W/o expiration logic, such noise will
> eventually end up in claiming the disk 'fast'.

This is the reason why I use STRICT_SEEKY() - to avoid noise. But when I
investigated your aio-stress workload above, I found my algorithm can't
work reliably. It depends on measuring seeky requests, and detecting
whether a queue is really seeky is quite hard too.

> 3. CFQQ_STRICT_SEEKY() looks extremely strict. Theoretically, it's
> possible that we have many SEEKY cfqq-s but the rate of STRICT_SEEKY
> events is too low to make a reliable estimation. Any rationale why
> STRICT_SEEKY should be the typical case?

So we really need to make fast_device_samples expirable; then we don't
need to be so strict. I need to rethink whether that's possible.

> > I'm not satisfied with fast device detection in the dispatch stage;
> > even a slow device with NCQ can dispatch several requests in a short
> > time (so my original implementation is wrong, as you pointed out)
>
> I'd like to understand this concern better. OK, let it be a slow device
> with NCQ which can dispatch several requests in a short time. But if we
> have one or more disk-bound apps keeping the device busy, is it possible
> that the device dispatches several requests in a short time more often
> than in a longer time? As soon as its internal queue is saturated, the
> device will be able to snatch the next request from CFQ only when one of
> those being serviced now is completed, won't it?

Yes. If the queue is saturated and the drive can dispatch several
requests in a short time, we _might_ think the drive is fast. But how can
you know the drive is saturated? Usually the drive queue isn't saturated,
because CFQ limits how many requests a queue can dispatch for latency
reasons; please see cfq_may_dispatch(). The reason I said '_might_' above
is that you'd hit the same issue I hit above - how to detect a seeky
queue. The drive might dispatch several requests in a short time so that
the queue is detected as seeky when it actually isn't. In that case we'd
make the wrong judgment.

> If the device drains a deep queue in a short time again and again and
> again, it's fast, not slow, isn't it? What am I missing?

You are right here, but you don't know whether the device really drains
the deep queue at the dispatch stage; it might just dispatch several
requests without actually finishing them.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 14+ messages in thread
end of thread, other threads:[~2011-09-14  5:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-19 15:38 [PATCH 0/1] CFQ: fixing performance issues Maxim Patlasov
2011-08-19 15:38 ` [PATCH 1/1] CFQ: fix handling 'deep' cfqq Maxim Patlasov
2011-08-23  2:44   ` Shaohua Li
2011-08-23  8:56     ` Maxim Patlasov
2011-08-24  2:06       ` Shaohua Li
2011-08-31  9:12         ` Maxim Patlasov
2011-09-06  1:03           ` Shaohua Li
2011-09-06 15:42             ` Maxim Patlasov
2011-09-09  4:43               ` Shaohua Li
2011-09-09 17:41                 ` Maxim Patlasov
2011-09-12 13:09                   ` Maxim Patlasov
2011-09-13  3:30                     ` Shaohua Li
2011-09-13 12:46                       ` Maxim Patlasov
2011-09-14  5:12                         ` Shaohua Li