* fio mmap randread 64k more than 40% regression with 2.6.33-rc1

From: Zhang, Yanmin @ 2009-12-31 9:16 UTC
To: czoccolo, Jens Axboe; +Cc: Shaohua Li, jmoyer, LKML

Comparing with kernel 2.6.32, fio mmap randread 64k has a more than 40% regression
with 2.6.33-rc1.

The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
8 1-GB files per partition and start 8 processes doing random reads on the 8 files
of each partition, i.e. 8*24 = 192 processes in total. The randread block size is 64K.

We found the regression on 2 machines. One machine has 8GB memory and the other
has 6GB.

The bisect is very unstable; several patches are involved rather than just one.

1) commit 8e550632cccae34e265cb066691945515eaa7fb5
   Author: Corrado Zoccolo <czoccolo@gmail.com>
   Date:   Thu Nov 26 10:02:58 2009 +0100

       cfq-iosched: fix corner cases in idling logic

This patch introduces a bit less than 20% of the regression. I just reverted the
section below and that part of the regression disappears. This shows that this part
of the regression is stable and not impacted by other patches.

@@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 		return;

 	/*
-	 * still requests with the driver, don't idle
+	 * still active requests from this queue, don't idle
 	 */
-	if (rq_in_driver(cfqd))
+	if (cfqq->dispatched)
 		return;

2) How about the other 20%~30% of the regression? It's complicated. My bisect plus
Li Shaohua's investigation located 3 patches:
df5fe3e8e13883f58dc97489076bbcc150789a21,
b3b6d0408c953524f979468562e7e210d8634150, and
5db5d64277bf390056b1a87d0bb288c8b8553f96.

tiobench also has a regression, and Li Shaohua located the same patches. See
http://lkml.indiana.edu/hypermail/linux/kernel/0912.2/03355.html. Shaohua worked
out patches to fix the tiobench regression. However, his patches don't help the
fio randread 64k regression.
I retried the bisect manually and eventually located the patch below:

commit 718eee0579b802aabe3bafacf09d0a9b0830f1dd
Author: Corrado Zoccolo <czoccolo@gmail.com>
Date:   Mon Oct 26 22:45:29 2009 +0100

    cfq-iosched: fairness for sync no-idle queues

The patch is fairly big. After many tries, I found the section below is the key:

@@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
+	    (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;

This section deletes the check on !cfqd->cfq_latency, so enable_idle=0 is set more
often. I wrote a testing patch that simply works around the original 3 patches
related to the tiobench regression, plus a patch that adds back the check on
!cfqd->cfq_latency. With both, all of the fio randread 64k regression disappears.

Then, instead of working around the original 3 patches, I applied Shaohua's 2
patches and added the check on !cfqd->cfq_latency, while also reverting the patch
mentioned in 1). The result still shows more than 20% regression, so Shaohua's
patches don't improve the fio randread 64k regression.

fio_mmap_randread_4k has about a 10% improvement instead of a regression. I checked
that my patch plus the debugging patch have no impact on this improvement.
randwrite 64k has about a 25% regression, and my method also restores its
performance.

I worked out a patch to add the check on !cfqd->cfq_latency back in function
cfq_update_idle_window. In addition, as for item 1), could we just revert that
section in cfq_arm_slice_timer?

As Shaohua's patches don't work for this regression, we might need to continue
looking for better methods. I will check it next week.

---

With kernel 2.6.33-rc1, fio rand read 64k has more than 40% regression. Located
the patch below.
commit 718eee0579b802aabe3bafacf09d0a9b0830f1dd
Author: Corrado Zoccolo <czoccolo@gmail.com>
Date:   Mon Oct 26 22:45:29 2009 +0100

    cfq-iosched: fairness for sync no-idle queues

It introduces more than 20% of the regression. The reason is that function
cfq_update_idle_window forgets to check cfqd->cfq_latency, so enable_idle=0 is
set more often. The patch below, against 2.6.33-rc1, adds the check back.

Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>

---

diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_rand64k/block/cfq-iosched.c
--- linux-2.6.33_rc1/block/cfq-iosched.c	2009-12-23 14:12:03.000000000 +0800
+++ linux-2.6.33_rc1_rand64k/block/cfq-iosched.c	2009-12-31 16:26:32.000000000 +0800
@@ -3064,8 +3064,8 @@ cfq_update_idle_window(struct cfq_data *
 		cfq_mark_cfqq_deep(cfqq);

 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
-	     && CFQQ_SEEKY(cfqq)))
+	    (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
+	     sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

From: Corrado Zoccolo @ 2009-12-31 10:34 UTC
To: Zhang, Yanmin; +Cc: Jens Axboe, Shaohua Li, jmoyer, LKML

Hi Yanmin,

On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
<yanmin_zhang@linux.intel.com> wrote:
> Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
> 2.6.33-rc1.

Can you compare the performance also with 2.6.31?

I think I understand what causes your problem. 2.6.32, with default settings,
handled even random readers as sequential ones to provide fairness. This has
benefits on single disks and JBODs, but causes harm on RAIDs. For 2.6.33, we
changed the way in which this is handled, restoring enable_idle = 0 for seeky
queues as it was in 2.6.31:

@@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
+	    (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
 		enable_idle = 0;

(Compare with 2.6.31:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;

Excluding the sample_valid check, it should be equivalent for you (I assume you
have NCQ disks).)

We provide fairness for those queues by servicing all seeky queues together, and
then idling before switching to other ones.

The mmap 64k randreader will have a large seek_mean, resulting in being marked
seeky, but will send 16 * 4k sequential requests one after the other, so
alternating between those seeky queues will cause harm.
I'm working on a new way to compute the seekiness of queues that should fix your
issue, correctly identifying those queues as non-seeky (for me, a queue should be
considered seeky only if it submits more than 1 seeky request per 8 sequential
ones).

> The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
> 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
> per partitions. There are 8*24 processes totally. randread block size is 64K.
>
> We found the regression on 2 machines. One machine has 8GB memory and the other has
> 6GB.
>
> Bisect is very unstable. The related patches are many instead of just one.
>
> 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
> Author: Corrado Zoccolo <czoccolo@gmail.com>
> Date: Thu Nov 26 10:02:58 2009 +0100
>
> cfq-iosched: fix corner cases in idling logic
>
> This patch introduces about less than 20% regression. I just reverted below section
> and this part regression disappear. It shows this regression is stable and not impacted
> by other patches.
>
> @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> 		return;
>
> 	/*
> -	 * still requests with the driver, don't idle
> +	 * still active requests from this queue, don't idle
> 	 */
> -	if (rq_in_driver(cfqd))
> +	if (cfqq->dispatched)
> 		return;

This shouldn't affect you if all queues are marked as idle. Does just
your patch:

> -	    (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> -	     && CFQQ_SEEKY(cfqq)))
> +	    (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> +	     sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))

fix most of the regression without touching arm_slice_timer?

I guess

> 5db5d64277bf390056b1a87d0bb288c8b8553f96.

will still introduce a 10% regression, but this is needed to improve
latency, and you can just disable low_latency to avoid it.

Thanks,
Corrado
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1

From: Zhang, Yanmin @ 2010-01-01 10:12 UTC
To: Corrado Zoccolo; +Cc: Jens Axboe, Shaohua Li, jmoyer, LKML

[-- Attachment #1: Type: text/plain, Size: 6387 bytes --]

On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote:
> Hi Yanmin,
> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
> <yanmin_zhang@linux.intel.com> wrote:
> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
> > 2.6.33-rc1.

Thanks for your timely reply. Some comments inlined below.

> Can you compare the performance also with 2.6.31?

We did. We run the Linux Kernel Performance tracking project and run many
benchmarks whenever an RC kernel is released. The result for 2.6.31 is quite
similar to the one for 2.6.32, but 2.6.30 is about 8% better than 2.6.31.

> I think I understand what causes your problem.
> 2.6.32, with default settings, handled even random readers as
> sequential ones to provide fairness. This has benefits on single disks
> and JBODs, but causes harm on raids.

I didn't test RAID, as the machine with the hardware RAID HBA has crashed. But
when we turn on hardware RAID in the HBA, we mostly use the noop I/O scheduler.
> For 2.6.33, we changed the way in which this is handled, restoring the
> enable_idle = 0 for seeky queues as it was in 2.6.31:
> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
> struct cfq_queue *cfqq,
> 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> -	    (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
> +	    (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> 		enable_idle = 0;
> (compare with 2.6.31:
> 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> 	    (cfqd->hw_tag && CIC_SEEKY(cic)))
> 		enable_idle = 0;
> excluding the sample_valid check, it should be equivalent for you (I
> assume you have NCQ disks))
> and we provide fairness for them by servicing all seeky queues
> together, and then idling before switching to other ones.

As for function cfq_update_idle_window, you are right. But since 2.6.32, CFQ has
merged many patches, and those patches have an impact on each other.

> The mmap 64k randreader will have a large seek_mean, resulting in
> being marked seeky, but will send 16 * 4k sequential requests one
> after the other, so alternating between those seeky queues will cause
> harm.
>
> I'm working on a new way to compute seekiness of queues, that should
> fix your issue, correctly identifying those queues as non-seeky (for
> me, a queue should be considered seeky only if it submits more than 1
> seeky requests for 8 sequential ones).
>
> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
> > per partitions. There are 8*24 processes totally. randread block size is 64K.
> >
> > We found the regression on 2 machines. One machine has 8GB memory and the other has
> > 6GB.
> >
> > Bisect is very unstable. The related patches are many instead of just one.
> >
> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
> > Author: Corrado Zoccolo <czoccolo@gmail.com>
> > Date: Thu Nov 26 10:02:58 2009 +0100
> >
> > cfq-iosched: fix corner cases in idling logic
> >
> > This patch introduces about less than 20% regression. I just reverted below section
> > and this part regression disappear. It shows this regression is stable and not impacted
> > by other patches.
> >
> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> > 		return;
> >
> > 	/*
> > -	 * still requests with the driver, don't idle
> > +	 * still active requests from this queue, don't idle
> > 	 */
> > -	if (rq_in_driver(cfqd))
> > +	if (cfqq->dispatched)
> > 		return;

Although 5 patches are related to the regression, the line above is quite
independent. Reverting it always improves the result by about 20%.

> This shouldn't affect you if all queues are marked as idle.

Do you mean to mark them as the idle class with the ionice command? I didn't
try it.

> Does just
> your patch:
> > -	    (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> > -	     && CFQQ_SEEKY(cfqq)))
> > +	    (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> > +	     sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> fix most of the regression without touching arm_slice_timer?

No. To fix the regression completely, I need to apply the above patch plus a
debug patch. The debug patch just works around the 3 patches reported in
Shaohua's tiobench regression report. Without the debug patch, the regression
isn't resolved. Below is the debug patch.
diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c
--- linux-2.6.33_rc1/block/cfq-iosched.c	2009-12-23 14:12:03.000000000 +0800
+++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c	2009-12-30 17:12:28.000000000 +0800
@@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd
 	cfqq->slice_start = jiffies;
 	cfqq->slice_end = jiffies + slice;
 	cfqq->allocated_slice = slice;
+/*YMZHANG*/
+	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+
 	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }

@@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c
 	/*
 	 * still active requests from this queue, don't idle
 	 */
-	if (cfqq->dispatched)
+	//if (cfqq->dispatched)
+	if (rq_in_driver(cfqd))
 		return;

@@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q
 		new_cfqq = __cfqq;
 	}

+	/* YMZHANG debug */
+	return;
+
 	process_refs = cfqq_process_refs(cfqq);
 	/*
 	 * If the process for the cfqq has gone away, there is no

> I guess
> 5db5d64277bf390056b1a87d0bb288c8b8553f96.
> will still introduce a 10% regression, but this is needed to improve
> latency, and you can just disable low_latency to avoid it.

You are right. I did a quick test. With my patch plus reverting the 2 patches
while keeping 5db5d64, the regression is about 20%. But low_latency=0 doesn't
work as we imagined. With the same combination plus low_latency=0, the
regression is still there. One reason is that my patch has no effect when
low_latency=0.

> Thanks,
> Corrado

I attach the fio job file for your reference. I caught a cold and will continue
working on it next week.
Yanmin

[-- Attachment #2: fio_randread_job_file --]
[-- Type: text/plain, Size: 24674 bytes --]

[global]
direct=0
ioengine=mmap
size=8G
bs=64k
numjobs=1
loops=5
runtime=300
group_reporting
invalidate=0
directory=/mnt/stp/fiodata
file_service_type=random:36

[job_sdb1_sub0]
startdelay=0
rw=randread
filename=data0/f1:data0/f2:data0/f3:data0/f4:data0/f5:data0/f6:data0/f7:data0/f8

[job_sdb1_sub1]
startdelay=0
rw=randread
filename=data0/f2:data0/f3:data0/f4:data0/f5:data0/f6:data0/f7:data0/f8:data0/f1

[job_sdb1_sub2]
startdelay=0
rw=randread
filename=data0/f3:data0/f4:data0/f5:data0/f6:data0/f7:data0/f8:data0/f1:data0/f2

[job_sdb1_sub3]
startdelay=0
rw=randread
filename=data0/f4:data0/f5:data0/f6:data0/f7:data0/f8:data0/f1:data0/f2:data0/f3

[job_sdb1_sub4]
startdelay=0
rw=randread
filename=data0/f5:data0/f6:data0/f7:data0/f8:data0/f1:data0/f2:data0/f3:data0/f4

[job_sdb1_sub5]
startdelay=0
rw=randread
filename=data0/f6:data0/f7:data0/f8:data0/f1:data0/f2:data0/f3:data0/f4:data0/f5

[job_sdb1_sub6]
startdelay=0
rw=randread
filename=data0/f7:data0/f8:data0/f1:data0/f2:data0/f3:data0/f4:data0/f5:data0/f6

[job_sdb1_sub7]
startdelay=0
rw=randread
filename=data0/f8:data0/f1:data0/f2:data0/f3:data0/f4:data0/f5:data0/f6:data0/f7

[job_sdb2_sub0]
startdelay=0
rw=randread
filename=data1/f1:data1/f2:data1/f3:data1/f4:data1/f5:data1/f6:data1/f7:data1/f8

[job_sdb2_sub1]
startdelay=0
rw=randread
filename=data1/f2:data1/f3:data1/f4:data1/f5:data1/f6:data1/f7:data1/f8:data1/f1

[job_sdb2_sub2]
startdelay=0
rw=randread
filename=data1/f3:data1/f4:data1/f5:data1/f6:data1/f7:data1/f8:data1/f1:data1/f2

[job_sdb2_sub3]
startdelay=0
rw=randread
filename=data1/f4:data1/f5:data1/f6:data1/f7:data1/f8:data1/f1:data1/f2:data1/f3

[job_sdb2_sub4]
startdelay=0
rw=randread
filename=data1/f5:data1/f6:data1/f7:data1/f8:data1/f1:data1/f2:data1/f3:data1/f4

[job_sdb2_sub5]
startdelay=0
rw=randread
filename=data1/f6:data1/f7:data1/f8:data1/f1:data1/f2:data1/f3:data1/f4:data1/f5

[job_sdb2_sub6]
startdelay=0
rw=randread
filename=data1/f7:data1/f8:data1/f1:data1/f2:data1/f3:data1/f4:data1/f5:data1/f6

[job_sdb2_sub7]
startdelay=0
rw=randread
filename=data1/f8:data1/f1:data1/f2:data1/f3:data1/f4:data1/f5:data1/f6:data1/f7

[job_sdc1_sub0]
startdelay=0
rw=randread
filename=data2/f1:data2/f2:data2/f3:data2/f4:data2/f5:data2/f6:data2/f7:data2/f8

[job_sdc1_sub1]
startdelay=0
rw=randread
filename=data2/f2:data2/f3:data2/f4:data2/f5:data2/f6:data2/f7:data2/f8:data2/f1

[job_sdc1_sub2]
startdelay=0
rw=randread
filename=data2/f3:data2/f4:data2/f5:data2/f6:data2/f7:data2/f8:data2/f1:data2/f2

[job_sdc1_sub3]
startdelay=0
rw=randread
filename=data2/f4:data2/f5:data2/f6:data2/f7:data2/f8:data2/f1:data2/f2:data2/f3

[job_sdc1_sub4]
startdelay=0
rw=randread
filename=data2/f5:data2/f6:data2/f7:data2/f8:data2/f1:data2/f2:data2/f3:data2/f4

[job_sdc1_sub5]
startdelay=0
rw=randread
filename=data2/f6:data2/f7:data2/f8:data2/f1:data2/f2:data2/f3:data2/f4:data2/f5

[job_sdc1_sub6]
startdelay=0
rw=randread
filename=data2/f7:data2/f8:data2/f1:data2/f2:data2/f3:data2/f4:data2/f5:data2/f6

[job_sdc1_sub7]
startdelay=0
rw=randread
filename=data2/f8:data2/f1:data2/f2:data2/f3:data2/f4:data2/f5:data2/f6:data2/f7

[job_sdc2_sub0]
startdelay=0
rw=randread
filename=data3/f1:data3/f2:data3/f3:data3/f4:data3/f5:data3/f6:data3/f7:data3/f8

[job_sdc2_sub1]
startdelay=0
rw=randread
filename=data3/f2:data3/f3:data3/f4:data3/f5:data3/f6:data3/f7:data3/f8:data3/f1

[job_sdc2_sub2]
startdelay=0
rw=randread
filename=data3/f3:data3/f4:data3/f5:data3/f6:data3/f7:data3/f8:data3/f1:data3/f2

[job_sdc2_sub3]
startdelay=0
rw=randread
filename=data3/f4:data3/f5:data3/f6:data3/f7:data3/f8:data3/f1:data3/f2:data3/f3

[job_sdc2_sub4]
startdelay=0
rw=randread
filename=data3/f5:data3/f6:data3/f7:data3/f8:data3/f1:data3/f2:data3/f3:data3/f4

[job_sdc2_sub5]
startdelay=0
rw=randread
filename=data3/f6:data3/f7:data3/f8:data3/f1:data3/f2:data3/f3:data3/f4:data3/f5

[job_sdc2_sub6]
startdelay=0
rw=randread
filename=data3/f7:data3/f8:data3/f1:data3/f2:data3/f3:data3/f4:data3/f5:data3/f6

[job_sdc2_sub7]
startdelay=0
rw=randread
filename=data3/f8:data3/f1:data3/f2:data3/f3:data3/f4:data3/f5:data3/f6:data3/f7

[job_sdd1_sub0]
startdelay=0
rw=randread
filename=data4/f1:data4/f2:data4/f3:data4/f4:data4/f5:data4/f6:data4/f7:data4/f8

[job_sdd1_sub1]
startdelay=0
rw=randread
filename=data4/f2:data4/f3:data4/f4:data4/f5:data4/f6:data4/f7:data4/f8:data4/f1

[job_sdd1_sub2]
startdelay=0
rw=randread
filename=data4/f3:data4/f4:data4/f5:data4/f6:data4/f7:data4/f8:data4/f1:data4/f2

[job_sdd1_sub3]
startdelay=0
rw=randread
filename=data4/f4:data4/f5:data4/f6:data4/f7:data4/f8:data4/f1:data4/f2:data4/f3

[job_sdd1_sub4]
startdelay=0
rw=randread
filename=data4/f5:data4/f6:data4/f7:data4/f8:data4/f1:data4/f2:data4/f3:data4/f4

[job_sdd1_sub5]
startdelay=0
rw=randread
filename=data4/f6:data4/f7:data4/f8:data4/f1:data4/f2:data4/f3:data4/f4:data4/f5

[job_sdd1_sub6]
startdelay=0
rw=randread
filename=data4/f7:data4/f8:data4/f1:data4/f2:data4/f3:data4/f4:data4/f5:data4/f6

[job_sdd1_sub7]
startdelay=0
rw=randread
filename=data4/f8:data4/f1:data4/f2:data4/f3:data4/f4:data4/f5:data4/f6:data4/f7

[job_sdd2_sub0]
startdelay=0
rw=randread
filename=data5/f1:data5/f2:data5/f3:data5/f4:data5/f5:data5/f6:data5/f7:data5/f8

[job_sdd2_sub1]
startdelay=0
rw=randread
filename=data5/f2:data5/f3:data5/f4:data5/f5:data5/f6:data5/f7:data5/f8:data5/f1

[job_sdd2_sub2]
startdelay=0
rw=randread
filename=data5/f3:data5/f4:data5/f5:data5/f6:data5/f7:data5/f8:data5/f1:data5/f2

[job_sdd2_sub3]
startdelay=0
rw=randread
filename=data5/f4:data5/f5:data5/f6:data5/f7:data5/f8:data5/f1:data5/f2:data5/f3

[job_sdd2_sub4]
startdelay=0
rw=randread
filename=data5/f5:data5/f6:data5/f7:data5/f8:data5/f1:data5/f2:data5/f3:data5/f4

[job_sdd2_sub5]
startdelay=0
rw=randread
filename=data5/f6:data5/f7:data5/f8:data5/f1:data5/f2:data5/f3:data5/f4:data5/f5

[job_sdd2_sub6]
startdelay=0
rw=randread
filename=data5/f7:data5/f8:data5/f1:data5/f2:data5/f3:data5/f4:data5/f5:data5/f6

[job_sdd2_sub7]
startdelay=0
rw=randread
filename=data5/f8:data5/f1:data5/f2:data5/f3:data5/f4:data5/f5:data5/f6:data5/f7

[job_sde1_sub0]
startdelay=0
rw=randread
filename=data6/f1:data6/f2:data6/f3:data6/f4:data6/f5:data6/f6:data6/f7:data6/f8

[job_sde1_sub1]
startdelay=0
rw=randread
filename=data6/f2:data6/f3:data6/f4:data6/f5:data6/f6:data6/f7:data6/f8:data6/f1

[job_sde1_sub2]
startdelay=0
rw=randread
filename=data6/f3:data6/f4:data6/f5:data6/f6:data6/f7:data6/f8:data6/f1:data6/f2

[job_sde1_sub3]
startdelay=0
rw=randread
filename=data6/f4:data6/f5:data6/f6:data6/f7:data6/f8:data6/f1:data6/f2:data6/f3

[job_sde1_sub4]
startdelay=0
rw=randread
filename=data6/f5:data6/f6:data6/f7:data6/f8:data6/f1:data6/f2:data6/f3:data6/f4

[job_sde1_sub5]
startdelay=0
rw=randread
filename=data6/f6:data6/f7:data6/f8:data6/f1:data6/f2:data6/f3:data6/f4:data6/f5

[job_sde1_sub6]
startdelay=0
rw=randread
filename=data6/f7:data6/f8:data6/f1:data6/f2:data6/f3:data6/f4:data6/f5:data6/f6

[job_sde1_sub7]
startdelay=0
rw=randread
filename=data6/f8:data6/f1:data6/f2:data6/f3:data6/f4:data6/f5:data6/f6:data6/f7

[job_sde2_sub0]
startdelay=0
rw=randread
filename=data7/f1:data7/f2:data7/f3:data7/f4:data7/f5:data7/f6:data7/f7:data7/f8

[job_sde2_sub1]
startdelay=0
rw=randread
filename=data7/f2:data7/f3:data7/f4:data7/f5:data7/f6:data7/f7:data7/f8:data7/f1

[job_sde2_sub2]
startdelay=0
rw=randread
filename=data7/f3:data7/f4:data7/f5:data7/f6:data7/f7:data7/f8:data7/f1:data7/f2

[job_sde2_sub3]
startdelay=0
rw=randread
filename=data7/f4:data7/f5:data7/f6:data7/f7:data7/f8:data7/f1:data7/f2:data7/f3

[job_sde2_sub4]
startdelay=0
rw=randread
filename=data7/f5:data7/f6:data7/f7:data7/f8:data7/f1:data7/f2:data7/f3:data7/f4

[job_sde2_sub5]
startdelay=0
rw=randread
filename=data7/f6:data7/f7:data7/f8:data7/f1:data7/f2:data7/f3:data7/f4:data7/f5

[job_sde2_sub6]
startdelay=0
rw=randread
filename=data7/f7:data7/f8:data7/f1:data7/f2:data7/f3:data7/f4:data7/f5:data7/f6

[job_sde2_sub7]
startdelay=0
rw=randread
filename=data7/f8:data7/f1:data7/f2:data7/f3:data7/f4:data7/f5:data7/f6:data7/f7

[job_sdf1_sub0]
startdelay=0
rw=randread
filename=data8/f1:data8/f2:data8/f3:data8/f4:data8/f5:data8/f6:data8/f7:data8/f8

[job_sdf1_sub1]
startdelay=0
rw=randread
filename=data8/f2:data8/f3:data8/f4:data8/f5:data8/f6:data8/f7:data8/f8:data8/f1

[job_sdf1_sub2]
startdelay=0
rw=randread
filename=data8/f3:data8/f4:data8/f5:data8/f6:data8/f7:data8/f8:data8/f1:data8/f2

[job_sdf1_sub3]
startdelay=0
rw=randread
filename=data8/f4:data8/f5:data8/f6:data8/f7:data8/f8:data8/f1:data8/f2:data8/f3

[job_sdf1_sub4]
startdelay=0
rw=randread
filename=data8/f5:data8/f6:data8/f7:data8/f8:data8/f1:data8/f2:data8/f3:data8/f4

[job_sdf1_sub5]
startdelay=0
rw=randread
filename=data8/f6:data8/f7:data8/f8:data8/f1:data8/f2:data8/f3:data8/f4:data8/f5

[job_sdf1_sub6]
startdelay=0
rw=randread
filename=data8/f7:data8/f8:data8/f1:data8/f2:data8/f3:data8/f4:data8/f5:data8/f6

[job_sdf1_sub7]
startdelay=0
rw=randread
filename=data8/f8:data8/f1:data8/f2:data8/f3:data8/f4:data8/f5:data8/f6:data8/f7

[job_sdf2_sub0]
startdelay=0
rw=randread
filename=data9/f1:data9/f2:data9/f3:data9/f4:data9/f5:data9/f6:data9/f7:data9/f8

[job_sdf2_sub1]
startdelay=0
rw=randread
filename=data9/f2:data9/f3:data9/f4:data9/f5:data9/f6:data9/f7:data9/f8:data9/f1

[job_sdf2_sub2]
startdelay=0
rw=randread
filename=data9/f3:data9/f4:data9/f5:data9/f6:data9/f7:data9/f8:data9/f1:data9/f2

[job_sdf2_sub3]
startdelay=0
rw=randread
filename=data9/f4:data9/f5:data9/f6:data9/f7:data9/f8:data9/f1:data9/f2:data9/f3

[job_sdf2_sub4]
startdelay=0
rw=randread
filename=data9/f5:data9/f6:data9/f7:data9/f8:data9/f1:data9/f2:data9/f3:data9/f4

[job_sdf2_sub5]
startdelay=0
rw=randread
filename=data9/f6:data9/f7:data9/f8:data9/f1:data9/f2:data9/f3:data9/f4:data9/f5

[job_sdf2_sub6]
startdelay=0
rw=randread
filename=data9/f7:data9/f8:data9/f1:data9/f2:data9/f3:data9/f4:data9/f5:data9/f6

[job_sdf2_sub7]
startdelay=0
rw=randread
filename=data9/f8:data9/f1:data9/f2:data9/f3:data9/f4:data9/f5:data9/f6:data9/f7

[job_sdg1_sub0]
startdelay=0
rw=randread
filename=data10/f1:data10/f2:data10/f3:data10/f4:data10/f5:data10/f6:data10/f7:data10/f8

[job_sdg1_sub1]
startdelay=0
rw=randread
filename=data10/f2:data10/f3:data10/f4:data10/f5:data10/f6:data10/f7:data10/f8:data10/f1

[job_sdg1_sub2]
startdelay=0
rw=randread
filename=data10/f3:data10/f4:data10/f5:data10/f6:data10/f7:data10/f8:data10/f1:data10/f2

[job_sdg1_sub3]
startdelay=0
rw=randread
filename=data10/f4:data10/f5:data10/f6:data10/f7:data10/f8:data10/f1:data10/f2:data10/f3

[job_sdg1_sub4]
startdelay=0
rw=randread
filename=data10/f5:data10/f6:data10/f7:data10/f8:data10/f1:data10/f2:data10/f3:data10/f4

[job_sdg1_sub5]
startdelay=0
rw=randread
filename=data10/f6:data10/f7:data10/f8:data10/f1:data10/f2:data10/f3:data10/f4:data10/f5

[job_sdg1_sub6]
startdelay=0
rw=randread
filename=data10/f7:data10/f8:data10/f1:data10/f2:data10/f3:data10/f4:data10/f5:data10/f6

[job_sdg1_sub7]
startdelay=0
rw=randread
filename=data10/f8:data10/f1:data10/f2:data10/f3:data10/f4:data10/f5:data10/f6:data10/f7

[job_sdg2_sub0]
startdelay=0
rw=randread
filename=data11/f1:data11/f2:data11/f3:data11/f4:data11/f5:data11/f6:data11/f7:data11/f8

[job_sdg2_sub1]
startdelay=0
rw=randread
filename=data11/f2:data11/f3:data11/f4:data11/f5:data11/f6:data11/f7:data11/f8:data11/f1

[job_sdg2_sub2]
startdelay=0
rw=randread
filename=data11/f3:data11/f4:data11/f5:data11/f6:data11/f7:data11/f8:data11/f1:data11/f2

[job_sdg2_sub3]
startdelay=0
rw=randread
filename=data11/f4:data11/f5:data11/f6:data11/f7:data11/f8:data11/f1:data11/f2:data11/f3

[job_sdg2_sub4]
startdelay=0
rw=randread
filename=data11/f5:data11/f6:data11/f7:data11/f8:data11/f1:data11/f2:data11/f3:data11/f4

[job_sdg2_sub5]
startdelay=0
rw=randread
filename=data11/f6:data11/f7:data11/f8:data11/f1:data11/f2:data11/f3:data11/f4:data11/f5

[job_sdg2_sub6]
startdelay=0
rw=randread
filename=data11/f7:data11/f8:data11/f1:data11/f2:data11/f3:data11/f4:data11/f5:data11/f6

[job_sdg2_sub7]
startdelay=0
rw=randread
filename=data11/f8:data11/f1:data11/f2:data11/f3:data11/f4:data11/f5:data11/f6:data11/f7

[job_sdh1_sub0]
startdelay=0
rw=randread
filename=data12/f1:data12/f2:data12/f3:data12/f4:data12/f5:data12/f6:data12/f7:data12/f8

[job_sdh1_sub1]
startdelay=0
rw=randread
filename=data12/f2:data12/f3:data12/f4:data12/f5:data12/f6:data12/f7:data12/f8:data12/f1

[job_sdh1_sub2]
startdelay=0
rw=randread
filename=data12/f3:data12/f4:data12/f5:data12/f6:data12/f7:data12/f8:data12/f1:data12/f2

[job_sdh1_sub3]
startdelay=0
rw=randread
filename=data12/f4:data12/f5:data12/f6:data12/f7:data12/f8:data12/f1:data12/f2:data12/f3

[job_sdh1_sub4]
startdelay=0
rw=randread
filename=data12/f5:data12/f6:data12/f7:data12/f8:data12/f1:data12/f2:data12/f3:data12/f4

[job_sdh1_sub5]
startdelay=0
rw=randread
filename=data12/f6:data12/f7:data12/f8:data12/f1:data12/f2:data12/f3:data12/f4:data12/f5

[job_sdh1_sub6]
startdelay=0
rw=randread
filename=data12/f7:data12/f8:data12/f1:data12/f2:data12/f3:data12/f4:data12/f5:data12/f6

[job_sdh1_sub7]
startdelay=0
rw=randread
filename=data12/f8:data12/f1:data12/f2:data12/f3:data12/f4:data12/f5:data12/f6:data12/f7

[job_sdh2_sub0]
startdelay=0
rw=randread
filename=data13/f1:data13/f2:data13/f3:data13/f4:data13/f5:data13/f6:data13/f7:data13/f8

[job_sdh2_sub1]
startdelay=0
rw=randread
filename=data13/f2:data13/f3:data13/f4:data13/f5:data13/f6:data13/f7:data13/f8:data13/f1

[job_sdh2_sub2]
startdelay=0
rw=randread
filename=data13/f3:data13/f4:data13/f5:data13/f6:data13/f7:data13/f8:data13/f1:data13/f2

[job_sdh2_sub3]
startdelay=0
rw=randread
filename=data13/f4:data13/f5:data13/f6:data13/f7:data13/f8:data13/f1:data13/f2:data13/f3

[job_sdh2_sub4]
startdelay=0
rw=randread
filename=data13/f5:data13/f6:data13/f7:data13/f8:data13/f1:data13/f2:data13/f3:data13/f4

[job_sdh2_sub5]
startdelay=0
rw=randread
filename=data13/f6:data13/f7:data13/f8:data13/f1:data13/f2:data13/f3:data13/f4:data13/f5

[job_sdh2_sub6]
startdelay=0
rw=randread
filename=data13/f7:data13/f8:data13/f1:data13/f2:data13/f3:data13/f4:data13/f5:data13/f6

[job_sdh2_sub7]
startdelay=0
rw=randread
filename=data13/f8:data13/f1:data13/f2:data13/f3:data13/f4:data13/f5:data13/f6:data13/f7

[job_sdi1_sub0]
startdelay=0
rw=randread
filename=data14/f1:data14/f2:data14/f3:data14/f4:data14/f5:data14/f6:data14/f7:data14/f8

[job_sdi1_sub1]
startdelay=0
rw=randread
filename=data14/f2:data14/f3:data14/f4:data14/f5:data14/f6:data14/f7:data14/f8:data14/f1

[job_sdi1_sub2]
startdelay=0
rw=randread
filename=data14/f3:data14/f4:data14/f5:data14/f6:data14/f7:data14/f8:data14/f1:data14/f2

[job_sdi1_sub3]
startdelay=0
rw=randread
filename=data14/f4:data14/f5:data14/f6:data14/f7:data14/f8:data14/f1:data14/f2:data14/f3

[job_sdi1_sub4]
startdelay=0
rw=randread
filename=data14/f5:data14/f6:data14/f7:data14/f8:data14/f1:data14/f2:data14/f3:data14/f4

[job_sdi1_sub5]
startdelay=0
rw=randread
filename=data14/f6:data14/f7:data14/f8:data14/f1:data14/f2:data14/f3:data14/f4:data14/f5

[job_sdi1_sub6]
startdelay=0
rw=randread
filename=data14/f7:data14/f8:data14/f1:data14/f2:data14/f3:data14/f4:data14/f5:data14/f6

[job_sdi1_sub7]
startdelay=0
rw=randread
filename=data14/f8:data14/f1:data14/f2:data14/f3:data14/f4:data14/f5:data14/f6:data14/f7

[job_sdi2_sub0]
startdelay=0
rw=randread
filename=data15/f1:data15/f2:data15/f3:data15/f4:data15/f5:data15/f6:data15/f7:data15/f8

[job_sdi2_sub1]
startdelay=0
rw=randread
filename=data15/f2:data15/f3:data15/f4:data15/f5:data15/f6:data15/f7:data15/f8:data15/f1

[job_sdi2_sub2]
startdelay=0
rw=randread
filename=data15/f3:data15/f4:data15/f5:data15/f6:data15/f7:data15/f8:data15/f1:data15/f2

[job_sdi2_sub3]
startdelay=0
rw=randread
filename=data15/f4:data15/f5:data15/f6:data15/f7:data15/f8:data15/f1:data15/f2:data15/f3

[job_sdi2_sub4]
startdelay=0
rw=randread
filename=data15/f5:data15/f6:data15/f7:data15/f8:data15/f1:data15/f2:data15/f3:data15/f4

[job_sdi2_sub5]
startdelay=0
rw=randread
filename=data15/f6:data15/f7:data15/f8:data15/f1:data15/f2:data15/f3:data15/f4:data15/f5

[job_sdi2_sub6]
startdelay=0
rw=randread
filename=data15/f7:data15/f8:data15/f1:data15/f2:data15/f3:data15/f4:data15/f5:data15/f6

[job_sdi2_sub7]
startdelay=0
rw=randread
filename=data15/f8:data15/f1:data15/f2:data15/f3:data15/f4:data15/f5:data15/f6:data15/f7

[job_sdj1_sub0]
startdelay=0
rw=randread
filename=data16/f1:data16/f2:data16/f3:data16/f4:data16/f5:data16/f6:data16/f7:data16/f8

[job_sdj1_sub1]
startdelay=0
rw=randread
filename=data16/f2:data16/f3:data16/f4:data16/f5:data16/f6:data16/f7:data16/f8:data16/f1

[job_sdj1_sub2]
startdelay=0
rw=randread
filename=data16/f3:data16/f4:data16/f5:data16/f6:data16/f7:data16/f8:data16/f1:data16/f2

[job_sdj1_sub3]
startdelay=0
rw=randread
filename=data16/f4:data16/f5:data16/f6:data16/f7:data16/f8:data16/f1:data16/f2:data16/f3

[job_sdj1_sub4]
startdelay=0
rw=randread
filename=data16/f5:data16/f6:data16/f7:data16/f8:data16/f1:data16/f2:data16/f3:data16/f4

[job_sdj1_sub5]
startdelay=0
rw=randread
filename=data16/f6:data16/f7:data16/f8:data16/f1:data16/f2:data16/f3:data16/f4:data16/f5

[job_sdj1_sub6]
startdelay=0
rw=randread
filename=data16/f7:data16/f8:data16/f1:data16/f2:data16/f3:data16/f4:data16/f5:data16/f6

[job_sdj1_sub7]
startdelay=0
rw=randread
filename=data16/f8:data16/f1:data16/f2:data16/f3:data16/f4:data16/f5:data16/f6:data16/f7

[job_sdj2_sub0]
startdelay=0
rw=randread
filename=data17/f1:data17/f2:data17/f3:data17/f4:data17/f5:data17/f6:data17/f7:data17/f8

[job_sdj2_sub1]
startdelay=0
rw=randread
filename=data17/f2:data17/f3:data17/f4:data17/f5:data17/f6:data17/f7:data17/f8:data17/f1

[job_sdj2_sub2]
startdelay=0
rw=randread
filename=data17/f3:data17/f4:data17/f5:data17/f6:data17/f7:data17/f8:data17/f1:data17/f2 [job_sdj2_sub3] startdelay=0 rw=randread filename=data17/f4:data17/f5:data17/f6:data17/f7:data17/f8:data17/f1:data17/f2:data17/f3 [job_sdj2_sub4] startdelay=0 rw=randread filename=data17/f5:data17/f6:data17/f7:data17/f8:data17/f1:data17/f2:data17/f3:data17/f4 [job_sdj2_sub5] startdelay=0 rw=randread filename=data17/f6:data17/f7:data17/f8:data17/f1:data17/f2:data17/f3:data17/f4:data17/f5 [job_sdj2_sub6] startdelay=0 rw=randread filename=data17/f7:data17/f8:data17/f1:data17/f2:data17/f3:data17/f4:data17/f5:data17/f6 [job_sdj2_sub7] startdelay=0 rw=randread filename=data17/f8:data17/f1:data17/f2:data17/f3:data17/f4:data17/f5:data17/f6:data17/f7 [job_sdk1_sub0] startdelay=0 rw=randread filename=data18/f1:data18/f2:data18/f3:data18/f4:data18/f5:data18/f6:data18/f7:data18/f8 [job_sdk1_sub1] startdelay=0 rw=randread filename=data18/f2:data18/f3:data18/f4:data18/f5:data18/f6:data18/f7:data18/f8:data18/f1 [job_sdk1_sub2] startdelay=0 rw=randread filename=data18/f3:data18/f4:data18/f5:data18/f6:data18/f7:data18/f8:data18/f1:data18/f2 [job_sdk1_sub3] startdelay=0 rw=randread filename=data18/f4:data18/f5:data18/f6:data18/f7:data18/f8:data18/f1:data18/f2:data18/f3 [job_sdk1_sub4] startdelay=0 rw=randread filename=data18/f5:data18/f6:data18/f7:data18/f8:data18/f1:data18/f2:data18/f3:data18/f4 [job_sdk1_sub5] startdelay=0 rw=randread filename=data18/f6:data18/f7:data18/f8:data18/f1:data18/f2:data18/f3:data18/f4:data18/f5 [job_sdk1_sub6] startdelay=0 rw=randread filename=data18/f7:data18/f8:data18/f1:data18/f2:data18/f3:data18/f4:data18/f5:data18/f6 [job_sdk1_sub7] startdelay=0 rw=randread filename=data18/f8:data18/f1:data18/f2:data18/f3:data18/f4:data18/f5:data18/f6:data18/f7 [job_sdk2_sub0] startdelay=0 rw=randread filename=data19/f1:data19/f2:data19/f3:data19/f4:data19/f5:data19/f6:data19/f7:data19/f8 [job_sdk2_sub1] startdelay=0 rw=randread 
filename=data19/f2:data19/f3:data19/f4:data19/f5:data19/f6:data19/f7:data19/f8:data19/f1 [job_sdk2_sub2] startdelay=0 rw=randread filename=data19/f3:data19/f4:data19/f5:data19/f6:data19/f7:data19/f8:data19/f1:data19/f2 [job_sdk2_sub3] startdelay=0 rw=randread filename=data19/f4:data19/f5:data19/f6:data19/f7:data19/f8:data19/f1:data19/f2:data19/f3 [job_sdk2_sub4] startdelay=0 rw=randread filename=data19/f5:data19/f6:data19/f7:data19/f8:data19/f1:data19/f2:data19/f3:data19/f4 [job_sdk2_sub5] startdelay=0 rw=randread filename=data19/f6:data19/f7:data19/f8:data19/f1:data19/f2:data19/f3:data19/f4:data19/f5 [job_sdk2_sub6] startdelay=0 rw=randread filename=data19/f7:data19/f8:data19/f1:data19/f2:data19/f3:data19/f4:data19/f5:data19/f6 [job_sdk2_sub7] startdelay=0 rw=randread filename=data19/f8:data19/f1:data19/f2:data19/f3:data19/f4:data19/f5:data19/f6:data19/f7 [job_sdl1_sub0] startdelay=0 rw=randread filename=data20/f1:data20/f2:data20/f3:data20/f4:data20/f5:data20/f6:data20/f7:data20/f8 [job_sdl1_sub1] startdelay=0 rw=randread filename=data20/f2:data20/f3:data20/f4:data20/f5:data20/f6:data20/f7:data20/f8:data20/f1 [job_sdl1_sub2] startdelay=0 rw=randread filename=data20/f3:data20/f4:data20/f5:data20/f6:data20/f7:data20/f8:data20/f1:data20/f2 [job_sdl1_sub3] startdelay=0 rw=randread filename=data20/f4:data20/f5:data20/f6:data20/f7:data20/f8:data20/f1:data20/f2:data20/f3 [job_sdl1_sub4] startdelay=0 rw=randread filename=data20/f5:data20/f6:data20/f7:data20/f8:data20/f1:data20/f2:data20/f3:data20/f4 [job_sdl1_sub5] startdelay=0 rw=randread filename=data20/f6:data20/f7:data20/f8:data20/f1:data20/f2:data20/f3:data20/f4:data20/f5 [job_sdl1_sub6] startdelay=0 rw=randread filename=data20/f7:data20/f8:data20/f1:data20/f2:data20/f3:data20/f4:data20/f5:data20/f6 [job_sdl1_sub7] startdelay=0 rw=randread filename=data20/f8:data20/f1:data20/f2:data20/f3:data20/f4:data20/f5:data20/f6:data20/f7 [job_sdl2_sub0] startdelay=0 rw=randread 
filename=data21/f1:data21/f2:data21/f3:data21/f4:data21/f5:data21/f6:data21/f7:data21/f8 [job_sdl2_sub1] startdelay=0 rw=randread filename=data21/f2:data21/f3:data21/f4:data21/f5:data21/f6:data21/f7:data21/f8:data21/f1 [job_sdl2_sub2] startdelay=0 rw=randread filename=data21/f3:data21/f4:data21/f5:data21/f6:data21/f7:data21/f8:data21/f1:data21/f2 [job_sdl2_sub3] startdelay=0 rw=randread filename=data21/f4:data21/f5:data21/f6:data21/f7:data21/f8:data21/f1:data21/f2:data21/f3 [job_sdl2_sub4] startdelay=0 rw=randread filename=data21/f5:data21/f6:data21/f7:data21/f8:data21/f1:data21/f2:data21/f3:data21/f4 [job_sdl2_sub5] startdelay=0 rw=randread filename=data21/f6:data21/f7:data21/f8:data21/f1:data21/f2:data21/f3:data21/f4:data21/f5 [job_sdl2_sub6] startdelay=0 rw=randread filename=data21/f7:data21/f8:data21/f1:data21/f2:data21/f3:data21/f4:data21/f5:data21/f6 [job_sdl2_sub7] startdelay=0 rw=randread filename=data21/f8:data21/f1:data21/f2:data21/f3:data21/f4:data21/f5:data21/f6:data21/f7 [job_sdm1_sub0] startdelay=0 rw=randread filename=data22/f1:data22/f2:data22/f3:data22/f4:data22/f5:data22/f6:data22/f7:data22/f8 [job_sdm1_sub1] startdelay=0 rw=randread filename=data22/f2:data22/f3:data22/f4:data22/f5:data22/f6:data22/f7:data22/f8:data22/f1 [job_sdm1_sub2] startdelay=0 rw=randread filename=data22/f3:data22/f4:data22/f5:data22/f6:data22/f7:data22/f8:data22/f1:data22/f2 [job_sdm1_sub3] startdelay=0 rw=randread filename=data22/f4:data22/f5:data22/f6:data22/f7:data22/f8:data22/f1:data22/f2:data22/f3 [job_sdm1_sub4] startdelay=0 rw=randread filename=data22/f5:data22/f6:data22/f7:data22/f8:data22/f1:data22/f2:data22/f3:data22/f4 [job_sdm1_sub5] startdelay=0 rw=randread filename=data22/f6:data22/f7:data22/f8:data22/f1:data22/f2:data22/f3:data22/f4:data22/f5 [job_sdm1_sub6] startdelay=0 rw=randread filename=data22/f7:data22/f8:data22/f1:data22/f2:data22/f3:data22/f4:data22/f5:data22/f6 [job_sdm1_sub7] startdelay=0 rw=randread 
filename=data22/f8:data22/f1:data22/f2:data22/f3:data22/f4:data22/f5:data22/f6:data22/f7 [job_sdm2_sub0] startdelay=0 rw=randread filename=data23/f1:data23/f2:data23/f3:data23/f4:data23/f5:data23/f6:data23/f7:data23/f8 [job_sdm2_sub1] startdelay=0 rw=randread filename=data23/f2:data23/f3:data23/f4:data23/f5:data23/f6:data23/f7:data23/f8:data23/f1 [job_sdm2_sub2] startdelay=0 rw=randread filename=data23/f3:data23/f4:data23/f5:data23/f6:data23/f7:data23/f8:data23/f1:data23/f2 [job_sdm2_sub3] startdelay=0 rw=randread filename=data23/f4:data23/f5:data23/f6:data23/f7:data23/f8:data23/f1:data23/f2:data23/f3 [job_sdm2_sub4] startdelay=0 rw=randread filename=data23/f5:data23/f6:data23/f7:data23/f8:data23/f1:data23/f2:data23/f3:data23/f4 [job_sdm2_sub5] startdelay=0 rw=randread filename=data23/f6:data23/f7:data23/f8:data23/f1:data23/f2:data23/f3:data23/f4:data23/f5 [job_sdm2_sub6] startdelay=0 rw=randread filename=data23/f7:data23/f8:data23/f1:data23/f2:data23/f3:data23/f4:data23/f5:data23/f6 [job_sdm2_sub7] startdelay=0 rw=randread filename=data23/f8:data23/f1:data23/f2:data23/f3:data23/f4:data23/f5:data23/f6:data23/f7 ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-01 10:12 ` Zhang, Yanmin @ 2010-01-01 16:32 ` Corrado Zoccolo 2010-01-02 12:33 ` Zhang, Yanmin 0 siblings, 1 reply; 17+ messages in thread From: Corrado Zoccolo @ 2010-01-01 16:32 UTC (permalink / raw) To: Zhang, Yanmin; +Cc: Jens Axboe, Shaohua Li, jmoyer, LKML [-- Attachment #1: Type: text/plain, Size: 9516 bytes --] Hi Yanmin, On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote: >> Hi Yanmin, >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin >> <yanmin_zhang@linux.intel.com> wrote: >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with >> > 2.6.33-rc1. >> > Thanks for your timely reply. Some comments inlined below. > >> Can you compare the performance also with 2.6.31? > We did. We run Linux kernel Performance Tracking project and run many benchmarks when a RC kernel > is released. > > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about > 8% better than the one of 2.6.31. > >> I think I understand what causes your problem. >> 2.6.32, with default settings, handled even random readers as >> sequential ones to provide fairness. This has benefits on single disks >> and JBODs, but causes harm on raids. > I didn't test RAID as that machine with hardware RAID HBA is crashed now. But if we turn on > hardware RAID in HBA, mostly we use noop io scheduler. I think you should start testing cfq with them, too. From 2.6.33, we have some big improvements in this area. 
> >> For 2.6.33, we changed the way in which this is handled, restoring the >> enable_idle = 0 for seeky queues as it was in 2.6.31: >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, >> struct cfq_queue *cfqq, >> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq))) >> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) >> enable_idle = 0; >> (compare with 2.6.31: >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> (cfqd->hw_tag && CIC_SEEKY(cic))) >> enable_idle = 0; >> excluding the sample_valid check, it should be equivalent for you (I >> assume you have NCQ disks)) >> and we provide fairness for them by servicing all seeky queues >> together, and then idling before switching to other ones. > As for function cfq_update_idle_window, you is right. But since > 2.6.32, CFQ merges many patches and the patches have impact on each other. > >> >> The mmap 64k randreader will have a large seek_mean, resulting in >> being marked seeky, but will send 16 * 4k sequential requests one >> after the other, so alternating between those seeky queues will cause >> harm. >> >> I'm working on a new way to compute seekiness of queues, that should >> fix your issue, correctly identifying those queues as non-seeky (for >> me, a queue should be considered seeky only if it submits more than 1 >> seeky requests for 8 sequential ones). >> >> > >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files >> > per partitions. There are 8*24 processes totally. randread block size is 64K. >> > >> > We found the regression on 2 machines. One machine has 8GB memory and the other has >> > 6GB. >> > >> > Bisect is very unstable. The related patches are many instead of just one. 
>> >
>> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
>> > Author: Corrado Zoccolo <czoccolo@gmail.com>
>> > Date: Thu Nov 26 10:02:58 2009 +0100
>> >
>> > cfq-iosched: fix corner cases in idling logic
>> >
>> > This patch introduces about less than 20% regression. I just reverted below section
>> > and this part regression disappear. It shows this regression is stable and not impacted
>> > by other patches.
>> >
>> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>> >                 return;
>> >
>> >         /*
>> > -        * still requests with the driver, don't idle
>> > +        * still active requests from this queue, don't idle
>> >          */
>> > -       if (rq_in_driver(cfqd))
>> > +       if (cfqq->dispatched)
>> >                 return;
> Although 5 patches are related to the regression, above line is quite
> independent. Reverting above line could always improve the result for about
> 20%.
I've looked at your fio script, and it is quite complex, with a lot of things going on.
Let's keep this for last.
I've created a smaller test that already shows some regression:

[global]
direct=0
ioengine=mmap
size=8G
bs=64k
numjobs=1
loops=5
runtime=60
#group_reporting
invalidate=0
directory=/media/hd/cfq-tests

[job0]
startdelay=0
rw=randread
filename=testfile1

[job1]
startdelay=0
rw=randread
filename=testfile2

[job2]
startdelay=0
rw=randread
filename=testfile3

[job3]
startdelay=0
rw=randread
filename=testfile4

The attached patches, in particular 0005 (which applies on top of the for-linus branch of Jens' tree, git://git.kernel.dk/linux-2.6-block.git), fix the regression on this simplified workload.
>
>> >
>> This shouldn't affect you if all queues are marked as idle.
> Do you mean to use command ionice to mark it as idle class? I didn't try it.
No. I meant forcing enable_idle = 1, as you were almost doing with your patch, when cfq_latency was set.
With my above patch, this should not be needed any more, since the queues should be seen as sequential.
> >> Does just >> your patch: >> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) >> > - && CFQQ_SEEKY(cfqq))) >> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) && >> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) >> fix most of the regression without touching arm_slice_timer? > No. If to fix the regression completely, I need apply above patch plus > a debug patch. The debug patch is to just work around the 3 patches report by > Shaohua's tiobench regression report. Without the debug patch, the regression > isn't resolved. Jens already merged one of Shaohua's patches, that may fix the problem with queue combining. > Below is the debug patch. > diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c > --- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800 > +++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c 2009-12-30 17:12:28.000000000 +0800 > @@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd > cfqq->slice_start = jiffies; > cfqq->slice_end = jiffies + slice; > cfqq->allocated_slice = slice; > +/*YMZHANG*/ > + cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies; > + This is disabled, on a vanilla 2.6.33 kernel, by setting low_latency = 0 > cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies); > } > > @@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c > /* > * still active requests from this queue, don't idle > */ > - if (cfqq->dispatched) > + //if (cfqq->dispatched) > + if (rq_in_driver(cfqd)) > return; > > /* > @@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q > new_cfqq = __cfqq; > } > > + /* YMZHANG debug */ > + return; > + This should be partially addressed by Shaohua's patch merged in Jens' tree. But note that your 8 processes, can randomly start doing I/O on the same file, so merging those queues is sometimes reasonable. 
The patch to split them quickly was still not merged, though, so you
will still see some regression due to this. In my simplified job file,
I removed the randomness to make sure this cannot happen.
>               process_refs = cfqq_process_refs(cfqq);
>               /*
>                * If the process for the cfqq has gone away, there is no
>
>
>>
>> I guess
>> > 5db5d64277bf390056b1a87d0bb288c8b8553f96.
>> will still introduce a 10% regression, but this is needed to improve
>> latency, and you can just disable low_latency to avoid it.
> You are right. I did a quick testing. If my patch + revert 2 patches and keep
> 5db5d64, the regression is about 20%.
>
> But low_latency=0 doesn't work like what we imagined. If patch + revert 2 patches
> and keep 5db5d64 while set low_latency=0, the regression is still there. One
> reason is my patch doesn't work when low_latency=0.
Right. You can try with my patch instead, which doesn't depend on low_latency, and set it
to 0 to remove this performance degradation.
My results:
2.6.32.2:
   READ: io=146688KB, aggrb=2442KB/s, minb=602KB/s, maxb=639KB/s, mint=60019msec, maxt=60067msec
2.6.33 - jens:
   READ: io=128512KB, aggrb=2140KB/s, minb=526KB/s, maxb=569KB/s, mint=60004msec, maxt=60032msec
2.6.33 - jens + my patches:
   READ: io=143232KB, aggrb=2384KB/s, minb=595KB/s, maxb=624KB/s, mint=60003msec, maxt=60072msec
2.6.33 - jens + my patches + low_lat = 0:
   READ: io=145216KB, aggrb=2416KB/s, minb=596KB/s, maxb=632KB/s, mint=60027msec, maxt=60087msec
>>
>> Thanks,
>> Corrado
> I attach the fio job file for your reference.
>
> I got a cold and will continue to work on it next week.
> > Yanmin > Thanks, Corrado [-- Attachment #2: 0003-cfq-iosched-non-rot-devices-do-not-need-read-queue-m.patch --] [-- Type: application/octet-stream, Size: 2962 bytes --] From f7bf4db76818d6a0ce83c179cb6ae2a305af3082 Mon Sep 17 00:00:00 2001 From: Corrado Zoccolo <czoccolo@gmail.com> Date: Wed, 30 Dec 2009 11:58:34 +0100 Subject: [PATCH 3/5] cfq-iosched: non-rot devices do not need read queue merging Non rotational devices' performances are not affected by distance of read requests, so there is no point in having overhead to merge such queues. This doesn't apply to writes, so this patch changes the queued[] field, to be indexed by READ/WRITE instead of SYNC/ASYNC, and only compute proximity for queues with WRITE requests. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> --- block/cfq-iosched.c | 20 +++++++++++--------- 1 files changed, 11 insertions(+), 9 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 918c7fd..7da9391 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -108,9 +108,9 @@ struct cfq_queue { struct rb_root sort_list; /* if fifo isn't expired, next request to serve */ struct request *next_rq; - /* requests queued in sort_list */ + /* requests queued in sort_list, indexed by READ/WRITE */ int queued[2]; - /* currently allocated requests */ + /* currently allocated requests, indexed by READ/WRITE */ int allocated[2]; /* fifo list of requests in sort_list */ struct list_head fifo; @@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq) return; if (!cfqq->next_rq) return; - + if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE]) + return; cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio]; __cfqq = cfq_prio_tree_lookup(cfqd, cfqq->p_root, blk_rq_pos(cfqq->next_rq), &parent, &p); @@ -1337,10 +1338,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) static void cfq_del_rq_rb(struct request *rq) { struct cfq_queue *cfqq = RQ_CFQQ(rq); - const 
int sync = rq_is_sync(rq); + const int rw = rq_data_dir(rq); - BUG_ON(!cfqq->queued[sync]); - cfqq->queued[sync]--; + BUG_ON(!cfqq->queued[rw]); + cfqq->queued[rw]--; elv_rb_del(&cfqq->sort_list, rq); @@ -1363,7 +1364,7 @@ static void cfq_add_rq_rb(struct request *rq) struct cfq_data *cfqd = cfqq->cfqd; struct request *__alias, *prev; - cfqq->queued[rq_is_sync(rq)]++; + cfqq->queued[rq_data_dir(rq)]++; /* * looks a little odd, but the first insert might return an alias. @@ -1393,7 +1394,7 @@ static void cfq_add_rq_rb(struct request *rq) static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq) { elv_rb_del(&cfqq->sort_list, rq); - cfqq->queued[rq_is_sync(rq)]--; + cfqq->queued[rq_data_dir(rq)]--; cfq_add_rq_rb(rq); } @@ -1689,7 +1690,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, struct cfq_queue *__cfqq; sector_t sector = cfqd->last_position; - if (RB_EMPTY_ROOT(root)) + if (RB_EMPTY_ROOT(root) || + (blk_queue_nonrot(cfqd->queue) && !cur_cfqq->queued[WRITE])) return NULL; /* -- 1.6.4.4 [-- Attachment #3: 0004-cfq-iosched-requests-in-flight-vs-in-driver-clarific.patch --] [-- Type: application/octet-stream, Size: 4945 bytes --] From bd47454a4381f584e79b9bb57eb0329e4b385ee5 Mon Sep 17 00:00:00 2001 From: Corrado Zoccolo <czoccolo@gmail.com> Date: Wed, 30 Dec 2009 22:49:42 +0100 Subject: [PATCH 4/5] cfq-iosched: requests "in flight" vs "in driver" clarification Counters for requests "in flight" and "in driver" are used asymmetrically in cfq_may_dispatch, and have slightly different meaning. We split the rq_in_flight counter (was sync_flight) to count both sync and async requests, in order to use this one, which is more accurate in some corner cases. The rq_in_driver counter is coalesced, since individual sync/async counts are not used any more. 
Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> --- block/cfq-iosched.c | 44 ++++++++++++++++++-------------------------- 1 files changed, 18 insertions(+), 26 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 7da9391..c6d5678 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -227,8 +227,8 @@ struct cfq_data { unsigned int busy_queues; - int rq_in_driver[2]; - int sync_flight; + int rq_in_driver; + int rq_in_flight[2]; /* * queue-depth detection @@ -419,11 +419,6 @@ static struct cfq_queue *cfq_get_queue(struct cfq_data *, bool, static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *, struct io_context *); -static inline int rq_in_driver(struct cfq_data *cfqd) -{ - return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1]; -} - static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic, bool is_sync) { @@ -1423,9 +1418,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq) { struct cfq_data *cfqd = q->elevator->elevator_data; - cfqd->rq_in_driver[rq_is_sync(rq)]++; + cfqd->rq_in_driver++; cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d", - rq_in_driver(cfqd)); + cfqd->rq_in_driver); cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq); } @@ -1433,12 +1428,11 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq) static void cfq_deactivate_request(struct request_queue *q, struct request *rq) { struct cfq_data *cfqd = q->elevator->elevator_data; - const int sync = rq_is_sync(rq); - WARN_ON(!cfqd->rq_in_driver[sync]); - cfqd->rq_in_driver[sync]--; + WARN_ON(!cfqd->rq_in_driver); + cfqd->rq_in_driver--; cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d", - rq_in_driver(cfqd)); + cfqd->rq_in_driver); } static void cfq_remove_request(struct request *rq) @@ -1876,8 +1870,7 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq) cfqq->dispatched++; elv_dispatch_sort(q, rq); - if (cfq_cfqq_sync(cfqq)) - cfqd->sync_flight++; + 
cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]++; cfqq->nr_sectors += blk_rq_sectors(rq); } @@ -2224,13 +2217,13 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) /* * Drain async requests before we start sync IO */ - if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) + if (cfq_should_idle(cfqd, cfqq) && cfqd->rq_in_flight[BLK_RW_ASYNC]) return false; /* * If this is an async queue and we have sync IO in flight, let it wait */ - if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) + if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq)) return false; max_dispatch = cfqd->cfq_quantum; @@ -3220,14 +3213,14 @@ static void cfq_update_hw_tag(struct cfq_data *cfqd) { struct cfq_queue *cfqq = cfqd->active_queue; - if (rq_in_driver(cfqd) > cfqd->hw_tag_est_depth) - cfqd->hw_tag_est_depth = rq_in_driver(cfqd); + if (cfqd->rq_in_driver > cfqd->hw_tag_est_depth) + cfqd->hw_tag_est_depth = cfqd->rq_in_driver; if (cfqd->hw_tag == 1) return; if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN && - rq_in_driver(cfqd) <= CFQ_HW_QUEUE_MIN) + cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN) return; /* @@ -3237,7 +3230,7 @@ static void cfq_update_hw_tag(struct cfq_data *cfqd) */ if (cfqq && cfq_cfqq_idle_window(cfqq) && cfqq->dispatched + cfqq->queued[0] + cfqq->queued[1] < - CFQ_HW_QUEUE_MIN && rq_in_driver(cfqd) < CFQ_HW_QUEUE_MIN) + CFQ_HW_QUEUE_MIN && cfqd->rq_in_driver < CFQ_HW_QUEUE_MIN) return; if (cfqd->hw_tag_samples++ < 50) @@ -3290,13 +3283,12 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq) cfq_update_hw_tag(cfqd); - WARN_ON(!cfqd->rq_in_driver[sync]); + WARN_ON(!cfqd->rq_in_driver); WARN_ON(!cfqq->dispatched); - cfqd->rq_in_driver[sync]--; + cfqd->rq_in_driver--; cfqq->dispatched--; - if (cfq_cfqq_sync(cfqq)) - cfqd->sync_flight--; + cfqd->rq_in_flight[cfq_cfqq_sync(cfqq)]--; if (sync) { RQ_CIC(rq)->last_end_request = now; @@ -3350,7 +3342,7 @@ static void cfq_completed_request(struct request_queue *q, struct request 
*rq) } } - if (!rq_in_driver(cfqd)) + if (!cfqd->rq_in_driver) cfq_schedule_dispatch(cfqd); } -- 1.6.4.4 [-- Attachment #4: 0005-cfq-iosched-rework-seeky-detection.patch --] [-- Type: application/octet-stream, Size: 3405 bytes --] From c6eb136205c0b6ebe2e9732de249ddefba26d41d Mon Sep 17 00:00:00 2001 From: Corrado Zoccolo <czoccolo@gmail.com> Date: Thu, 31 Dec 2009 13:41:59 +0100 Subject: [PATCH 5/5] cfq-iosched: rework seeky detection --- block/cfq-iosched.c | 54 +++++++++++++------------------------------------- 1 files changed, 14 insertions(+), 40 deletions(-) diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index c6d5678..4e203c4 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -133,9 +133,7 @@ struct cfq_queue { unsigned short ioprio, org_ioprio; unsigned short ioprio_class, org_ioprio_class; - unsigned int seek_samples; - u64 seek_total; - sector_t seek_mean; + u32 seek_history; sector_t last_request_pos; unsigned long seeky_start; @@ -1658,22 +1656,13 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, return cfqd->last_position - blk_rq_pos(rq); } -#define CFQQ_SEEK_THR 8 * 1024 -#define CFQQ_SEEKY(cfqq) ((cfqq)->seek_mean > CFQQ_SEEK_THR) +#define CFQQ_SEEK_THR (sector_t)(8 * 100) +#define CFQQ_SEEKY(cfqq) (hweight32(cfqq->seek_history) > 32/8) static inline int cfq_rq_close(struct cfq_data *cfqd, struct cfq_queue *cfqq, struct request *rq, bool for_preempt) { - sector_t sdist = cfqq->seek_mean; - - if (!sample_valid(cfqq->seek_samples)) - sdist = CFQQ_SEEK_THR; - - /* if seek_mean is big, using it as close criteria is meaningless */ - if (sdist > CFQQ_SEEK_THR && !for_preempt) - sdist = CFQQ_SEEK_THR; - - return cfq_dist_from_last(cfqd, rq) <= sdist; + return cfq_dist_from_last(cfqd, rq) <= CFQQ_SEEK_THR; } static struct cfq_queue *cfqq_close(struct cfq_data *cfqd, @@ -2971,30 +2960,16 @@ static void cfq_update_io_seektime(struct cfq_data *cfqd, struct cfq_queue *cfqq, struct request *rq) { - sector_t sdist; - u64 
total; - - if (!cfqq->last_request_pos) - sdist = 0; - else if (cfqq->last_request_pos < blk_rq_pos(rq)) - sdist = blk_rq_pos(rq) - cfqq->last_request_pos; - else - sdist = cfqq->last_request_pos - blk_rq_pos(rq); - - /* - * Don't allow the seek distance to get too large from the - * odd fragment, pagein, etc - */ - if (cfqq->seek_samples <= 60) /* second&third seek */ - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*1024); - else - sdist = min(sdist, (cfqq->seek_mean * 4) + 2*1024*64); + sector_t sdist = 0; + if (cfqq->last_request_pos) { + if (cfqq->last_request_pos < blk_rq_pos(rq)) + sdist = blk_rq_pos(rq) - cfqq->last_request_pos; + else + sdist = cfqq->last_request_pos - blk_rq_pos(rq); + } - cfqq->seek_samples = (7*cfqq->seek_samples + 256) / 8; - cfqq->seek_total = (7*cfqq->seek_total + (u64)256*sdist) / 8; - total = cfqq->seek_total + (cfqq->seek_samples/2); - do_div(total, cfqq->seek_samples); - cfqq->seek_mean = (sector_t)total; + cfqq->seek_history <<= 1; + cfqq->seek_history |= (sdist > CFQQ_SEEK_THR); /* * If this cfqq is shared between multiple processes, check to @@ -3032,8 +3007,7 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, cfq_mark_cfqq_deep(cfqq); if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) - && CFQQ_SEEKY(cfqq))) + (!cfq_cfqq_deep(cfqq) && CFQQ_SEEKY(cfqq))) enable_idle = 0; else if (sample_valid(cic->ttime_samples)) { if (cic->ttime_mean > cfqd->cfq_slice_idle) -- 1.6.4.4 ^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1
  2010-01-01 16:32 ` Corrado Zoccolo
@ 2010-01-02 12:33 ` Zhang, Yanmin
  2010-01-02 18:52 ` Corrado Zoccolo
  0 siblings, 1 reply; 17+ messages in thread
From: Zhang, Yanmin @ 2010-01-02 12:33 UTC (permalink / raw)
To: Corrado Zoccolo; +Cc: Jens Axboe, Shaohua Li, jmoyer, LKML

On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote:
> Hi Yanmin,
> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin
> <yanmin_zhang@linux.intel.com> wrote:
> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote:
> >> Hi Yanmin,
> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin
> >> <yanmin_zhang@linux.intel.com> wrote:
> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with
> >> > 2.6.33-rc1.
> >> >
> > Thanks for your timely reply. Some comments inlined below.
> >
> >> Can you compare the performance also with 2.6.31?
> > We did. We run Linux kernel Performance Tracking project and run many benchmarks when a RC kernel
> > is released.
> >
> > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about
> > 8% better than the one of 2.6.31.
> >
> >> I think I understand what causes your problem.
> >> 2.6.32, with default settings, handled even random readers as
> >> sequential ones to provide fairness. This has benefits on single disks
> >> and JBODs, but causes harm on raids.
> > I didn't test RAID as that machine with hardware RAID HBA is crashed now. But if we turn on
> > hardware RAID in HBA, mostly we use noop io scheduler.
> I think you should start testing cfq with them, too. From 2.6.33, we
> have some big improvements in this area.
Great! I once compared cfq and noop against non-raid and raid0. One interesting finding
from sequential read testing is that with fewer processes reading files on the raid0 JBOD,
noop on raid0 is pretty good, but with lots of processes doing so on a non-raid JBOD,
cfq is clearly better. I planned to investigate it, but have been too busy with other issues.
> >
> >> For 2.6.33, we changed the way in which this is handled, restoring the
> >> enable_idle = 0 for seeky queues as it was in 2.6.31:
> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
> >> struct cfq_queue *cfqq,
> >>      enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> >>
> >>      if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> >> -        (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
> >> +        (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> >>          enable_idle = 0;
> >> (compare with 2.6.31:
> >>      if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> >>          (cfqd->hw_tag && CIC_SEEKY(cic)))
> >>          enable_idle = 0;
> >> excluding the sample_valid check, it should be equivalent for you (I
> >> assume you have NCQ disks))
> >> and we provide fairness for them by servicing all seeky queues
> >> together, and then idling before switching to other ones.
> > As for function cfq_update_idle_window, you is right. But since
> > 2.6.32, CFQ merges many patches and the patches have impact on each other.
> >
> >>
> >> The mmap 64k randreader will have a large seek_mean, resulting in
> >> being marked seeky, but will send 16 * 4k sequential requests one
> >> after the other, so alternating between those seeky queues will cause
> >> harm.
> >>
> >> I'm working on a new way to compute seekiness of queues, that should
> >> fix your issue, correctly identifying those queues as non-seeky (for
> >> me, a queue should be considered seeky only if it submits more than 1
> >> seeky requests for 8 sequential ones).
> >>
> >> >
> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
> >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
> >> > per partitions. There are 8*24 processes totally. randread block size is 64K.
> >> >
> >> > We found the regression on 2 machines.
> >> > One machine has 8GB memory and the other has
> >> > 6GB.
> >> >
> >> > Bisect is very unstable. The related patches are many instead of just one.
> >> >
> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
> >> > Author: Corrado Zoccolo <czoccolo@gmail.com>
> >> > Date: Thu Nov 26 10:02:58 2009 +0100
> >> >
> >> > cfq-iosched: fix corner cases in idling logic
> >> >
> >> > This patch introduces about less than 20% regression. I just reverted below section
> >> > and this part regression disappear. It shows this regression is stable and not impacted
> >> > by other patches.
> >> >
> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> >> >                 return;
> >> >
> >> >         /*
> >> > -        * still requests with the driver, don't idle
> >> > +        * still active requests from this queue, don't idle
> >> >          */
> >> > -       if (rq_in_driver(cfqd))
> >> > +       if (cfqq->dispatched)
> >> >                 return;
> > Although 5 patches are related to the regression, above line is quite
> > independent. Reverting above line could always improve the result for about
> > 20%.
> I've looked at your fio script, and it is quite complex,
As we have about 40 fio sub cases, we have a script to create fio job files from a specific
parameter list. So there are some superfluous parameters. Another point is that we need
stable results.
> with lot of
> things going on.
> Let's keep this for last.
Ok. But changes like what you did mostly reduce the regression.
> I've created a smaller test, that already shows some regression: > [global] > direct=0 > ioengine=mmap > size=8G > bs=64k > numjobs=1 > loops=5 > runtime=60 > #group_reporting > invalidate=0 > directory=/media/hd/cfq-tests > > [job0] > startdelay=0 > rw=randread > filename=testfile1 > > [job1] > startdelay=0 > rw=randread > filename=testfile2 > > [job2] > startdelay=0 > rw=randread > filename=testfile3 > > [job3] > startdelay=0 > rw=randread > filename=testfile4 > > The attached patches, in particular 0005 (that apply on top of > for-linus branch of Jen's tree > git://git.kernel.dk/linux-2.6-block.git) fix the regression on this > simplified workload. I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The regression isn't resolved. > > > > >> > > >> This shouldn't affect you if all queues are marked as idle. > > Do you mean to use command ionice to mark it as idle class? I didn't try it. > No. I meant forcing enable_idle = 1, as you were almost doing with > your patch, when cfq_latency was set. > With my above patch, this should not be needed any more, since the > queues should be seen as sequential. > > > > >> Does just > >> your patch: > >> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) > >> > - && CFQQ_SEEKY(cfqq))) > >> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) && > >> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) > >> fix most of the regression without touching arm_slice_timer? > > No. If to fix the regression completely, I need apply above patch plus > > a debug patch. The debug patch is to just work around the 3 patches report by > > Shaohua's tiobench regression report. Without the debug patch, the regression > > isn't resolved. > > Jens already merged one of Shaohua's patches, that may fix the problem > with queue combining. I did another test: applying my debug patch + the low_latency patch together with Shaohua's 2 patches (improve merge and split), the regression disappears. 
> > > Below is the debug patch. > > diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c > > --- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800 > > +++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c 2009-12-30 17:12:28.000000000 +0800 > > @@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd > > cfqq->slice_start = jiffies; > > cfqq->slice_end = jiffies + slice; > > cfqq->allocated_slice = slice; > > +/*YMZHANG*/ > > + cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies; > > + > This is disabled, on a vanilla 2.6.33 kernel, by setting low_latency = 0 > > cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies); > > } > > > > @@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c > > /* > > * still active requests from this queue, don't idle > > */ > > - if (cfqq->dispatched) > > + //if (cfqq->dispatched) > > + if (rq_in_driver(cfqd)) > > return; > > > > /* > > @@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q > > new_cfqq = __cfqq; > > } > > > > + /* YMZHANG debug */ > > + return; > > + > This should be partially addressed by Shaohua's patch merged in Jens' tree. > But note that your 8 processes, can randomly start doing I/O on the > same file, so merging those queues is sometimes reasonable. Another reason is I start 8 processes per partition and every disk has 2 partitions, so there are 16 processes per disk. With another JBOD, I use one partition per disk, and the regression is only 8%. From this point of view, can CFQ avoid merging request queues that access different partitions? As you know, it's unusual for a process to access files across partitions. The I/O scheduler sits at a low layer which doesn't know about partitions. > The patch to split them quickly was still not merged, though, so you > will still see some regression due to this. In my simplified job file, > I removed the randomness to make sure this cannot happen. 
> > > process_refs = cfqq_process_refs(cfqq); > > /* > > * If the process for the cfqq has gone away, there is no > > > > > >> > >> I guess > >> > 5db5d64277bf390056b1a87d0bb288c8b8553f96. > >> will still introduce a 10% regression, but this is needed to improve > >> latency, and you can just disable low_latency to avoid it. > > You are right. I did a quick testing. If my patch + revert 2 patches and keep > > 5db5d64, the regression is about 20%. > > > > But low_latency=0 doesn't work like what we imagined. If patch + revert 2 patches > > and keep 5db5d64 while set low_latency=0, the regression is still there. One > > reason is my patch doesn't work when low_latency=0. > Right. You can try with my patch, instead, that doesn't depend on > low_latency, and set it to 0 to remove this performance degradation. > My results: > 2.6.32.2: > READ: io=146688KB, aggrb=2442KB/s, minb=602KB/s, maxb=639KB/s, > mint=60019msec, maxt=60067msec > > 2.6.33 - jens: > READ: io=128512KB, aggrb=2140KB/s, minb=526KB/s, maxb=569KB/s, > mint=60004msec, maxt=60032msec > > 2.6.33 - jens + my patches : > READ: io=143232KB, aggrb=2384KB/s, minb=595KB/s, maxb=624KB/s, > mint=60003msec, maxt=60072msec > > 2.6.33 - jens + my patches + low_lat = 0: > READ: io=145216KB, aggrb=2416KB/s, minb=596KB/s, maxb=632KB/s, > mint=60027msec, maxt=60087msec > > > >> > >> Thanks, > >> Corrado > > I attach the fio job file for your reference. > > > > I got a cold and will continue to work on it next week. > > > > Yanmin > > > > Thanks, > Corrado ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-02 12:33 ` Zhang, Yanmin @ 2010-01-02 18:52 ` Corrado Zoccolo 2010-01-04 8:18 ` Zhang, Yanmin 0 siblings, 1 reply; 17+ messages in thread From: Corrado Zoccolo @ 2010-01-02 18:52 UTC (permalink / raw) To: Zhang, Yanmin; +Cc: Jens Axboe, Shaohua Li, jmoyer, LKML Hi On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote: >> Hi Yanmin, >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin >> <yanmin_zhang@linux.intel.com> wrote: >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote: >> >> Hi Yanmin, >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin >> >> <yanmin_zhang@linux.intel.com> wrote: >> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with >> >> > 2.6.33-rc1. >> >> >> > Thanks for your timely reply. Some comments inlined below. >> > >> >> Can you compare the performance also with 2.6.31? >> > We did. We run Linux kernel Performance Tracking project and run many benchmarks when a RC kernel >> > is released. >> > >> > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about >> > 8% better than the one of 2.6.31. >> > >> >> I think I understand what causes your problem. >> >> 2.6.32, with default settings, handled even random readers as >> >> sequential ones to provide fairness. This has benefits on single disks >> >> and JBODs, but causes harm on raids. >> > I didn't test RAID as that machine with hardware RAID HBA is crashed now. But if we turn on >> > hardware RAID in HBA, mostly we use noop io scheduler. >> I think you should start testing cfq with them, too. From 2.6.33, we >> have some big improvements in this area. > Great! I once compared cfq and noop against non-raid and raid0. 
One interesting finding > about sequential read testing is when there are fewer processes to read files on the raid0 > JBOD, noop on raid0 is pretty good, but when there are lots of processes to do so on a non-raid > JBOD, cfq is pretty better. I planed to investigate it, but too busy in other issues. > >> > >> >> For 2.6.33, we changed the way in which this is handled, restoring the >> >> enable_idle = 0 for seeky queues as it was in 2.6.31: >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, >> >> struct cfq_queue *cfqq, >> >> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); >> >> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> >> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq))) >> >> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) >> >> enable_idle = 0; >> >> (compare with 2.6.31: >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> >> (cfqd->hw_tag && CIC_SEEKY(cic))) >> >> enable_idle = 0; >> >> excluding the sample_valid check, it should be equivalent for you (I >> >> assume you have NCQ disks)) >> >> and we provide fairness for them by servicing all seeky queues >> >> together, and then idling before switching to other ones. >> > As for function cfq_update_idle_window, you is right. But since >> > 2.6.32, CFQ merges many patches and the patches have impact on each other. >> > >> >> >> >> The mmap 64k randreader will have a large seek_mean, resulting in >> >> being marked seeky, but will send 16 * 4k sequential requests one >> >> after the other, so alternating between those seeky queues will cause >> >> harm. >> >> >> >> I'm working on a new way to compute seekiness of queues, that should >> >> fix your issue, correctly identifying those queues as non-seeky (for >> >> me, a queue should be considered seeky only if it submits more than 1 >> >> seeky requests for 8 sequential ones). 
>> >> >> >> > >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create >> >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files >> >> > per partitions. There are 8*24 processes totally. randread block size is 64K. >> >> > >> >> > We found the regression on 2 machines. One machine has 8GB memory and the other has >> >> > 6GB. >> >> > >> >> > Bisect is very unstable. The related patches are many instead of just one. >> >> > >> >> > >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5 >> >> > Author: Corrado Zoccolo <czoccolo@gmail.com> >> >> > Date: Thu Nov 26 10:02:58 2009 +0100 >> >> > >> >> > cfq-iosched: fix corner cases in idling logic >> >> > >> >> > >> >> > This patch introduces about less than 20% regression. I just reverted below section >> >> > and this part regression disappear. It shows this regression is stable and not impacted >> >> > by other patches. >> >> > >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) >> >> > return; >> >> > >> >> > /* >> >> > - * still requests with the driver, don't idle >> >> > + * still active requests from this queue, don't idle >> >> > */ >> >> > - if (rq_in_driver(cfqd)) >> >> > + if (cfqq->dispatched) >> >> > return; >> > Although 5 patches are related to the regression, above line is quite >> > independent. Reverting above line could always improve the result for about >> > 20%. >> I've looked at your fio script, and it is quite complex, > As we have about 40 fio sub cases, we have a script to create fio job file from > a specific parameter list. So there are some superfluous parameters. > My point is that there are so many things going on, that is more difficult to analyse the issues. I prefer looking at one problem at a time, so (initially) removing the possibility of queue merging, that Shaohua already investigated, can help in spotting the still not-well-understood problem. 
Could you generate the same script, but with each process accessing only one of the files, instead of chosing it at random? > Another point is we need stable result. > >> with lot of >> things going on. >> Let's keep this for last. > Ok. But the change like what you do mostly reduces regresion. > >> I've created a smaller test, that already shows some regression: >> [global] >> direct=0 >> ioengine=mmap >> size=8G >> bs=64k >> numjobs=1 >> loops=5 >> runtime=60 >> #group_reporting >> invalidate=0 >> directory=/media/hd/cfq-tests >> >> [job0] >> startdelay=0 >> rw=randread >> filename=testfile1 >> >> [job1] >> startdelay=0 >> rw=randread >> filename=testfile2 >> >> [job2] >> startdelay=0 >> rw=randread >> filename=testfile3 >> >> [job3] >> startdelay=0 >> rw=randread >> filename=testfile4 >> >> The attached patches, in particular 0005 (that apply on top of >> for-linus branch of Jen's tree >> git://git.kernel.dk/linux-2.6-block.git) fix the regression on this >> simplified workload. > I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The > result isn't resolved. Can you quantify if there is an improvement, though? Please, also include Shahoua's patches. I'd like to see the comparison between (always with low_latency set to 0): plain 2.6.33 plain 2.6.33 + shahoua's plain 2.6.33 + shahoua's + my patch plain 2.6.33 + shahoua's + my patch + rq_in_driver vs dispatched patch. >> >> > >> >> > >> >> This shouldn't affect you if all queues are marked as idle. >> > Do you mean to use command ionice to mark it as idle class? I didn't try it. >> No. I meant forcing enable_idle = 1, as you were almost doing with >> your patch, when cfq_latency was set. >> With my above patch, this should not be needed any more, since the >> queues should be seen as sequential. 
>> >> > >> >> Does just >> >> your patch: >> >> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) >> >> > - && CFQQ_SEEKY(cfqq))) >> >> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) && >> >> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) >> >> fix most of the regression without touching arm_slice_timer? >> > No. If to fix the regression completely, I need apply above patch plus >> > a debug patch. The debug patch is to just work around the 3 patches report by >> > Shaohua's tiobench regression report. Without the debug patch, the regression >> > isn't resolved. >> >> Jens already merged one of Shaohua's patches, that may fix the problem >> with queue combining. > I did another testing. Apply my debug patch+ the low_latency patch, but use > Shaohua's 2 patches (improve merge and split), the regression disappears. > >> >> > Below is the debug patch. >> > diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c >> > --- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800 >> > +++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c 2009-12-30 17:12:28.000000000 +0800 >> > @@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd >> > cfqq->slice_start = jiffies; >> > cfqq->slice_end = jiffies + slice; >> > cfqq->allocated_slice = slice; >> > +/*YMZHANG*/ >> > + cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies; >> > + >> This is disabled, on a vanilla 2.6.33 kernel, by setting low_latency = 0 >> > cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies); >> > } >> > >> > @@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c >> > /* >> > * still active requests from this queue, don't idle >> > */ >> > - if (cfqq->dispatched) >> > + //if (cfqq->dispatched) >> > + if (rq_in_driver(cfqd)) >> > return; >> > >> > /* >> > @@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q >> > new_cfqq = __cfqq; >> > } >> > >> > + /* YMZHANG debug */ >> > + return; 
>> > + >> This should be partially addressed by Shaohua's patch merged in Jens' tree. >> But note that your 8 processes, can randomly start doing I/O on the >> same file, so merging those queues is sometimes reasonable. > Another reason is I start 8 processes per partition and every disk has 2 partitions, > so there are 16 processes per disk. With another JBOD, I use one partition per disk, > and the regression is only 8%. With half of the processes, time slices are higher, and the disk cache can do a better job when servicing interleaved sequential requests. > > >From this point, can CFQ do not merge request queues which access different partitions? (puzzled: I didn't write this, and can't find a message in the thread with this question.) > As you know, it's unusual that a process accesses files across partitions. io scheduler > is at low layer which doesn't know partition. CFQ bases decision on distance between requests, and requests going to different partitions will have much higher distance. So the associated queues will be more likely marked as seeky. > > >> The patch to split them quickly was still not merged, though, so you >> will still see some regression due to this. In my simplified job file, >> I removed the randomness to make sure this cannot happen. >> >> > process_refs = cfqq_process_refs(cfqq); >> > /* >> > * If the process for the cfqq has gone away, there is no >> > >> > >> >> >> >> I guess >> >> > 5db5d64277bf390056b1a87d0bb288c8b8553f96. >> >> will still introduce a 10% regression, but this is needed to improve >> >> latency, and you can just disable low_latency to avoid it. >> > You are right. I did a quick testing. If my patch + revert 2 patches and keep >> > 5db5d64, the regression is about 20%. >> > >> > But low_latency=0 doesn't work like what we imagined. If patch + revert 2 patches >> > and keep 5db5d64 while set low_latency=0, the regression is still there. One >> > reason is my patch doesn't work when low_latency=0. >> Right. 
You can try with my patch, instead, that doesn't depend on >> low_latency, and set it to 0 to remove this performance degradation. >> My results: >> 2.6.32.2: >> READ: io=146688KB, aggrb=2442KB/s, minb=602KB/s, maxb=639KB/s, >> mint=60019msec, maxt=60067msec >> >> 2.6.33 - jens: >> READ: io=128512KB, aggrb=2140KB/s, minb=526KB/s, maxb=569KB/s, >> mint=60004msec, maxt=60032msec >> >> 2.6.33 - jens + my patches : >> READ: io=143232KB, aggrb=2384KB/s, minb=595KB/s, maxb=624KB/s, >> mint=60003msec, maxt=60072msec >> >> 2.6.33 - jens + my patches + low_lat = 0: >> READ: io=145216KB, aggrb=2416KB/s, minb=596KB/s, maxb=632KB/s, >> mint=60027msec, maxt=60087msec >> >> >> >> >> >> Thanks, >> >> Corrado >> > I attach the fio job file for your reference. >> > >> > I got a cold and will continue to work on it next week. >> > >> > Yanmin >> > >> >> Thanks, >> Corrado > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-02 18:52 ` Corrado Zoccolo @ 2010-01-04 8:18 ` Zhang, Yanmin 2010-01-04 18:28 ` Corrado Zoccolo 0 siblings, 1 reply; 17+ messages in thread From: Zhang, Yanmin @ 2010-01-04 8:18 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Jens Axboe, Shaohua Li, jmoyer, LKML On Sat, 2010-01-02 at 19:52 +0100, Corrado Zoccolo wrote: > Hi > On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin > <yanmin_zhang@linux.intel.com> wrote: > > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote: > >> Hi Yanmin, > >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin > >> <yanmin_zhang@linux.intel.com> wrote: > >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote: > >> >> Hi Yanmin, > >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin > >> >> <yanmin_zhang@linux.intel.com> wrote: > >> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with > >> >> > 2.6.33-rc1. > >> >> > >> > Thanks for your timely reply. Some comments inlined below. > >> > > >> >> Can you compare the performance also with 2.6.31? > >> > We did. We run Linux kernel Performance Tracking project and run many benchmarks when a RC kernel > >> > is released. > >> > > >> > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about > >> > 8% better than the one of 2.6.31. > >> > > >> >> I think I understand what causes your problem. > >> >> 2.6.32, with default settings, handled even random readers as > >> >> sequential ones to provide fairness. This has benefits on single disks > >> >> and JBODs, but causes harm on raids. > >> > I didn't test RAID as that machine with hardware RAID HBA is crashed now. But if we turn on > >> > hardware RAID in HBA, mostly we use noop io scheduler. > >> I think you should start testing cfq with them, too. From 2.6.33, we > >> have some big improvements in this area. > > Great! I once compared cfq and noop against non-raid and raid0. 
One interesting finding > > about sequential read testing is when there are fewer processes to read files on the raid0 > > JBOD, noop on raid0 is pretty good, but when there are lots of processes to do so on a non-raid > > JBOD, cfq is pretty better. I planed to investigate it, but too busy in other issues. > > > >> > > >> >> For 2.6.33, we changed the way in which this is handled, restoring the > >> >> enable_idle = 0 for seeky queues as it was in 2.6.31: > >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, > >> >> struct cfq_queue *cfqq, > >> >> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); > >> >> > >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > >> >> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq))) > >> >> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) > >> >> enable_idle = 0; > >> >> (compare with 2.6.31: > >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > >> >> (cfqd->hw_tag && CIC_SEEKY(cic))) > >> >> enable_idle = 0; > >> >> excluding the sample_valid check, it should be equivalent for you (I > >> >> assume you have NCQ disks)) > >> >> and we provide fairness for them by servicing all seeky queues > >> >> together, and then idling before switching to other ones. > >> > As for function cfq_update_idle_window, you is right. But since > >> > 2.6.32, CFQ merges many patches and the patches have impact on each other. > >> > > >> >> > >> >> The mmap 64k randreader will have a large seek_mean, resulting in > >> >> being marked seeky, but will send 16 * 4k sequential requests one > >> >> after the other, so alternating between those seeky queues will cause > >> >> harm. > >> >> > >> >> I'm working on a new way to compute seekiness of queues, that should > >> >> fix your issue, correctly identifying those queues as non-seeky (for > >> >> me, a queue should be considered seeky only if it submits more than 1 > >> >> seeky requests for 8 sequential ones). 
> >> >> > >> >> > > >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create > >> >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files > >> >> > per partitions. There are 8*24 processes totally. randread block size is 64K. > >> >> > > >> >> > We found the regression on 2 machines. One machine has 8GB memory and the other has > >> >> > 6GB. > >> >> > > >> >> > Bisect is very unstable. The related patches are many instead of just one. > >> >> > > >> >> > > >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5 > >> >> > Author: Corrado Zoccolo <czoccolo@gmail.com> > >> >> > Date: Thu Nov 26 10:02:58 2009 +0100 > >> >> > > >> >> > cfq-iosched: fix corner cases in idling logic > >> >> > > >> >> > > >> >> > This patch introduces about less than 20% regression. I just reverted below section > >> >> > and this part regression disappear. It shows this regression is stable and not impacted > >> >> > by other patches. > >> >> > > >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) > >> >> > return; > >> >> > > >> >> > /* > >> >> > - * still requests with the driver, don't idle > >> >> > + * still active requests from this queue, don't idle > >> >> > */ > >> >> > - if (rq_in_driver(cfqd)) > >> >> > + if (cfqq->dispatched) > >> >> > return; > >> > Although 5 patches are related to the regression, above line is quite > >> > independent. Reverting above line could always improve the result for about > >> > 20%. > >> I've looked at your fio script, and it is quite complex, > > As we have about 40 fio sub cases, we have a script to create fio job file from > > a specific parameter list. So there are some superfluous parameters. > > > My point is that there are so many things going on, that is more > difficult to analyse the issues. 
> I prefer looking at one problem at a time, so (initially) removing the > possibility of queue merging, that Shaohua already investigated, can > help in spotting the still not-well-understood problem. Sounds reasonable. > Could you generate the same script, but with each process accessing > only one of the files, instead of chosing it at random? Ok. New testing starts 8 processes per partition and every process just works on one file. > > > Another point is we need stable result. > > > >> with lot of > >> things going on. > >> Let's keep this for last. > > Ok. But the change like what you do mostly reduces regresion. > > > >> I've created a smaller test, that already shows some regression: > >> [global] > >> direct=0 > >> ioengine=mmap > >> size=8G > >> bs=64k > >> numjobs=1 > >> loops=5 > >> runtime=60 > >> #group_reporting > >> invalidate=0 > >> directory=/media/hd/cfq-tests > >> > >> [job0] > >> startdelay=0 > >> rw=randread > >> filename=testfile1 > >> > >> [job1] > >> startdelay=0 > >> rw=randread > >> filename=testfile2 > >> > >> [job2] > >> startdelay=0 > >> rw=randread > >> filename=testfile3 > >> > >> [job3] > >> startdelay=0 > >> rw=randread > >> filename=testfile4 > >> > >> The attached patches, in particular 0005 (that apply on top of > >> for-linus branch of Jen's tree > >> git://git.kernel.dk/linux-2.6-block.git) fix the regression on this > >> simplified workload. > > I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The > > result isn't resolved. > Can you quantify if there is an improvement, though? Ok. Because of company policy, I could only post percent instead of real number. > Please, also include Shahoua's patches. > I'd like to see the comparison between (always with low_latency set to 0): > plain 2.6.33 > plain 2.6.33 + shahoua's > plain 2.6.33 + shahoua's + my patch > plain 2.6.33 + shahoua's + my patch + rq_in_driver vs dispatched patch. 
1) low_latency=0

   kernel                                     vs. 2.6.32
   2.6.32                                      0
   2.6.33-rc1                                 -0.33
   2.6.33-rc1_shaohua                         -0.33
   2.6.33-rc1+corrado                          0.03
   2.6.33-rc1_corrado+shaohua                  0.02
   2.6.33-rc1_corrado+shaohua+rq_in_driver     0.01

2) low_latency=1

   kernel                                     vs. 2.6.32
   2.6.32                                      0
   2.6.33-rc1                                 -0.45
   2.6.33-rc1+corrado                         -0.24
   2.6.33-rc1_corrado+shaohua                 -0.23
   2.6.33-rc1_corrado+shaohua+rq_in_driver    -0.23

When low_latency=1, we get the biggest number with kernel 2.6.32. Comparing with the low_latency=0 result, the low_latency=1 one is about 4% better. > > >> > >> > > >> >> > > >> >> This shouldn't affect you if all queues are marked as idle. > >> > Do you mean to use command ionice to mark it as idle class? I didn't try it. > >> No. I meant forcing enable_idle = 1, as you were almost doing with > >> your patch, when cfq_latency was set. > >> With my above patch, this should not be needed any more, since the > >> queues should be seen as sequential. > >> > >> > > >> >> Does just > >> >> your patch: > >> >> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples) > >> >> > - && CFQQ_SEEKY(cfqq))) > >> >> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) && > >> >> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) > >> >> fix most of the regression without touching arm_slice_timer? > >> > No. If to fix the regression completely, I need apply above patch plus > >> > a debug patch. The debug patch is to just work around the 3 patches report by > >> > Shaohua's tiobench regression report. Without the debug patch, the regression > >> > isn't resolved. > >> > >> Jens already merged one of Shaohua's patches, that may fix the problem > >> with queue combining. > > I did another testing. Apply my debug patch+ the low_latency patch, but use > > Shaohua's 2 patches (improve merge and split), the regression disappears. > > > >> > >> > Below is the debug patch. 
> >> > diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c > >> > --- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800 > >> > +++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c 2009-12-30 17:12:28.000000000 +0800 > >> > @@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd > >> > cfqq->slice_start = jiffies; > >> > cfqq->slice_end = jiffies + slice; > >> > cfqq->allocated_slice = slice; > >> > +/*YMZHANG*/ > >> > + cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies; > >> > + > >> This is disabled, on a vanilla 2.6.33 kernel, by setting low_latency = 0 > >> > cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies); > >> > } > >> > > >> > @@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c > >> > /* > >> > * still active requests from this queue, don't idle > >> > */ > >> > - if (cfqq->dispatched) > >> > + //if (cfqq->dispatched) > >> > + if (rq_in_driver(cfqd)) > >> > return; > >> > > >> > /* > >> > @@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q > >> > new_cfqq = __cfqq; > >> > } > >> > > >> > + /* YMZHANG debug */ > >> > + return; > >> > + > >> This should be partially addressed by Shaohua's patch merged in Jens' tree. > >> But note that your 8 processes, can randomly start doing I/O on the > >> same file, so merging those queues is sometimes reasonable. > > Another reason is I start 8 processes per partition and every disk has 2 partitions, > > so there are 16 processes per disk. With another JBOD, I use one partition per disk, > > and the regression is only 8%. > With half of the processes, time slices are higher, and the disk cache > can do a better job when servicing interleaved sequential requests. > > > > >From this point, can CFQ do not merge request queues which access different partitions? > (puzzled: I didn't write this, and can't find a message in the thread > with this question.) 
My email client is evolution and sometimes it adds > unexpectedly. > > As you know, it's unusual that a process accesses files across partitions. io scheduler > > is at low layer which doesn't know partition. > CFQ bases decision on distance between requests, and requests going to > different partitions will have much higher distance. So the associated > queues will be more likely marked as seeky. Right. Thanks for your explanation. > > > > > >> The patch to split them quickly was still not merged, though, so you > >> will still see some regression due to this. In my simplified job file, > >> I removed the randomness to make sure this cannot happen. > >> > >> > process_refs = cfqq_process_refs(cfqq); > >> > /* > >> > * If the process for the cfqq has gone away, there is no > >> > > >> > > >> >> > >> >> I guess > >> >> > 5db5d64277bf390056b1a87d0bb288c8b8553f96. > >> >> will still introduce a 10% regression, but this is needed to improve > >> >> latency, and you can just disable low_latency to avoid it. > >> > You are right. I did a quick testing. If my patch + revert 2 patches and keep > >> > 5db5d64, the regression is about 20%. > >> > > >> > But low_latency=0 doesn't work like what we imagined. If patch + revert 2 patches > >> > and keep 5db5d64 while set low_latency=0, the regression is still there. One > >> > reason is my patch doesn't work when low_latency=0. > >> Right. You can try with my patch, instead, that doesn't depend on > >> low_latency, and set it to 0 to remove this performance degradation. 
> >> My results: > >> 2.6.32.2: > >> READ: io=146688KB, aggrb=2442KB/s, minb=602KB/s, maxb=639KB/s, > >> mint=60019msec, maxt=60067msec > >> > >> 2.6.33 - jens: > >> READ: io=128512KB, aggrb=2140KB/s, minb=526KB/s, maxb=569KB/s, > >> mint=60004msec, maxt=60032msec > >> > >> 2.6.33 - jens + my patches : > >> READ: io=143232KB, aggrb=2384KB/s, minb=595KB/s, maxb=624KB/s, > >> mint=60003msec, maxt=60072msec > >> > >> 2.6.33 - jens + my patches + low_lat = 0: > >> READ: io=145216KB, aggrb=2416KB/s, minb=596KB/s, maxb=632KB/s, > >> mint=60027msec, maxt=60087msec ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-04 8:18 ` Zhang, Yanmin @ 2010-01-04 18:28 ` Corrado Zoccolo 2010-01-16 16:27 ` Corrado Zoccolo 0 siblings, 1 reply; 17+ messages in thread From: Corrado Zoccolo @ 2010-01-04 18:28 UTC (permalink / raw) To: Zhang, Yanmin, jmoyer; +Cc: Jens Axboe, Shaohua Li, LKML Hi Yanmin, On Mon, Jan 4, 2010 at 9:18 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > On Sat, 2010-01-02 at 19:52 +0100, Corrado Zoccolo wrote: >> Hi >> On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin >> <yanmin_zhang@linux.intel.com> wrote: >> > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote: >> >> Hi Yanmin, >> >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin >> >> <yanmin_zhang@linux.intel.com> wrote: >> >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote: >> >> >> Hi Yanmin, >> >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin >> >> >> <yanmin_zhang@linux.intel.com> wrote: >> >> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with >> >> >> > 2.6.33-rc1. >> >> >> >> >> > Thanks for your timely reply. Some comments inlined below. >> >> > >> >> >> Can you compare the performance also with 2.6.31? >> >> > We did. We run Linux kernel Performance Tracking project and run many benchmarks when a RC kernel >> >> > is released. >> >> > >> >> > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about >> >> > 8% better than the one of 2.6.31. >> >> > >> >> >> I think I understand what causes your problem. >> >> >> 2.6.32, with default settings, handled even random readers as >> >> >> sequential ones to provide fairness. This has benefits on single disks >> >> >> and JBODs, but causes harm on raids. >> >> > I didn't test RAID as that machine with hardware RAID HBA is crashed now. But if we turn on >> >> > hardware RAID in HBA, mostly we use noop io scheduler. >> >> I think you should start testing cfq with them, too. 
From 2.6.33, we >> >> have some big improvements in this area. >> > Great! I once compared cfq and noop against non-raid and raid0. One interesting finding >> > about sequential read testing is when there are fewer processes to read files on the raid0 >> > JBOD, noop on raid0 is pretty good, but when there are lots of processes to do so on a non-raid >> > JBOD, cfq is pretty better. I planed to investigate it, but too busy in other issues. >> > >> >> > >> >> >> For 2.6.33, we changed the way in which this is handled, restoring the >> >> >> enable_idle = 0 for seeky queues as it was in 2.6.31: >> >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, >> >> >> struct cfq_queue *cfqq, >> >> >> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); >> >> >> >> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> >> >> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq))) >> >> >> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) >> >> >> enable_idle = 0; >> >> >> (compare with 2.6.31: >> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >> >> >> (cfqd->hw_tag && CIC_SEEKY(cic))) >> >> >> enable_idle = 0; >> >> >> excluding the sample_valid check, it should be equivalent for you (I >> >> >> assume you have NCQ disks)) >> >> >> and we provide fairness for them by servicing all seeky queues >> >> >> together, and then idling before switching to other ones. >> >> > As for function cfq_update_idle_window, you is right. But since >> >> > 2.6.32, CFQ merges many patches and the patches have impact on each other. >> >> > >> >> >> >> >> >> The mmap 64k randreader will have a large seek_mean, resulting in >> >> >> being marked seeky, but will send 16 * 4k sequential requests one >> >> >> after the other, so alternating between those seeky queues will cause >> >> >> harm. 
>> >> >> >> >> >> I'm working on a new way to compute seekiness of queues, that should >> >> >> fix your issue, correctly identifying those queues as non-seeky (for >> >> >> me, a queue should be considered seeky only if it submits more than 1 >> >> >> seeky requests for 8 sequential ones). >> >> >> >> >> >> > >> >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create >> >> >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files >> >> >> > per partitions. There are 8*24 processes totally. randread block size is 64K. >> >> >> > >> >> >> > We found the regression on 2 machines. One machine has 8GB memory and the other has >> >> >> > 6GB. >> >> >> > >> >> >> > Bisect is very unstable. The related patches are many instead of just one. >> >> >> > >> >> >> > >> >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5 >> >> >> > Author: Corrado Zoccolo <czoccolo@gmail.com> >> >> >> > Date: Thu Nov 26 10:02:58 2009 +0100 >> >> >> > >> >> >> > cfq-iosched: fix corner cases in idling logic >> >> >> > >> >> >> > >> >> >> > This patch introduces about less than 20% regression. I just reverted below section >> >> >> > and this part regression disappear. It shows this regression is stable and not impacted >> >> >> > by other patches. >> >> >> > >> >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) >> >> >> > return; >> >> >> > >> >> >> > /* >> >> >> > - * still requests with the driver, don't idle >> >> >> > + * still active requests from this queue, don't idle >> >> >> > */ >> >> >> > - if (rq_in_driver(cfqd)) >> >> >> > + if (cfqq->dispatched) >> >> >> > return; >> >> > Although 5 patches are related to the regression, above line is quite >> >> > independent. Reverting above line could always improve the result for about >> >> > 20%. 
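Corrado's proposed criterion above — a queue is seeky only if it issues more than one seeky request per eight — can be sketched as a standalone classifier. This is a hypothetical model for illustration; the real CFQ code tracks a running seek_mean per queue, and the threshold constant here is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

#define SEEK_THR (8 * 1024)  /* bytes: jump distance that counts as a seek (assumed) */

/* Classify a queue from the positions of its recent requests: it is
 * "seeky" only if more than 1 request in 8 jumped far from the
 * previous one. */
static int queue_is_seeky(const long long *pos, int n)
{
    int seeky = 0, i;
    for (i = 1; i < n; i++)
        if (llabs(pos[i] - pos[i - 1]) > SEEK_THR)
            seeky++;
    /* more than 1 seeky request per 8 total marks the queue seeky */
    return seeky * 8 > n;
}
```

Under this rule the mmap 64k randreader — which issues runs of 4k sequential faults with one large jump per run — stays classified as non-seeky, which is the fix being proposed.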
>> >> I've looked at your fio script, and it is quite complex, >> > As we have about 40 fio sub cases, we have a script to create fio job file from >> > a specific parameter list. So there are some superfluous parameters. >> > >> My point is that there are so many things going on, that is more >> difficult to analyse the issues. >> I prefer looking at one problem at a time, so (initially) removing the >> possibility of queue merging, that Shaohua already investigated, can >> help in spotting the still not-well-understood problem. > Sounds reasonable. > >> Could you generate the same script, but with each process accessing >> only one of the files, instead of chosing it at random? > Ok. New testing starts 8 processes per partition and every process just works > on one file. Great, thanks. > >> >> > Another point is we need stable result. >> > >> >> with lot of >> >> things going on. >> >> Let's keep this for last. >> > Ok. But the change like what you do mostly reduces regresion. >> > >> >> I've created a smaller test, that already shows some regression: >> >> [global] >> >> direct=0 >> >> ioengine=mmap >> >> size=8G >> >> bs=64k >> >> numjobs=1 >> >> loops=5 >> >> runtime=60 >> >> #group_reporting >> >> invalidate=0 >> >> directory=/media/hd/cfq-tests >> >> >> >> [job0] >> >> startdelay=0 >> >> rw=randread >> >> filename=testfile1 >> >> >> >> [job1] >> >> startdelay=0 >> >> rw=randread >> >> filename=testfile2 >> >> >> >> [job2] >> >> startdelay=0 >> >> rw=randread >> >> filename=testfile3 >> >> >> >> [job3] >> >> startdelay=0 >> >> rw=randread >> >> filename=testfile4 >> >> >> >> The attached patches, in particular 0005 (that apply on top of >> >> for-linus branch of Jen's tree >> >> git://git.kernel.dk/linux-2.6-block.git) fix the regression on this >> >> simplified workload. >> > I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The >> > result isn't resolved. >> Can you quantify if there is an improvement, though? > > Ok. 
Because of company policy, I could only post percent instead of real number Sure, it is fine. > >> Please, also include Shahoua's patches. >> I'd like to see the comparison between (always with low_latency set to 0): >> plain 2.6.33 >> plain 2.6.33 + shahoua's >> plain 2.6.33 + shahoua's + my patch >> plain 2.6.33 + shahoua's + my patch + rq_in_driver vs dispatched patch. > > 1) low_latency=0 > 2.6.32 kernel 0 > 2.6.33-rc1 -0.33 > 2.6.33-rc1_shaohua -0.33 > 2.6.33-rc1+corrado 0.03 > 2.6.33-rc1_corrado+shaohua 0.02 > 2.6.33-rc1_corrado+shaohua+rq_in_driver 0.01 > So my patch fixes the situation for low_latency = 0, as I expected. I'll send it to Jens with a proper changelog. > 2) low_latency=1 > 2.6.32 kernel 0 > 2.6.33-rc1 -0.45 > 2.6.33-rc1+corrado -0.24 > 2.6.33-rc1_corrado+shaohua -0.23 > 2.6.33-rc1_corrado+shaohua+rq_in_driver -0.23 The results are as expected. With each process working on a separate file, Shahoua's patches do not influence the result sensibly. Interestingly, even rq_in_driver doesn't improve in this case, so maybe its effect is somewhat connected to queue merging. The remaining -23% is due to timeslice shrinking, that is done to reduce max latency when there are too many processes doing I/O, at the expense of throughput. It is a documented change, and the suggested way if you favor throughput over latency is to set low_latency = 0. > > > When low_latency=1, we get the biggest number with kernel 2.6.32. > Comparing with low_latency=0's result, the prior one is about 4% better. Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with fastest 2.6.32, so we can consider the first part of the problem solved. For the queue merging issue, maybe Jeff has some improvements w.r.t shaohua's approach. Thanks, Corrado ^ permalink raw reply [flat|nested] 17+ messages in thread
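The simplified fio job file Corrado quoted earlier in the thread is easier to read in its on-disk form (content exactly as quoted; the directory path is from his test machine):

```ini
[global]
direct=0
ioengine=mmap
size=8G
bs=64k
numjobs=1
loops=5
runtime=60
#group_reporting
invalidate=0
directory=/media/hd/cfq-tests

[job0]
startdelay=0
rw=randread
filename=testfile1

[job1]
startdelay=0
rw=randread
filename=testfile2

[job2]
startdelay=0
rw=randread
filename=testfile3

[job3]
startdelay=0
rw=randread
filename=testfile4
```

Each job reads its own file, which keeps CFQ queue merging out of the picture — the point of the simplification.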
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-04 18:28 ` Corrado Zoccolo @ 2010-01-16 16:27 ` Corrado Zoccolo 2010-01-18 3:06 ` Zhang, Yanmin 0 siblings, 1 reply; 17+ messages in thread From: Corrado Zoccolo @ 2010-01-16 16:27 UTC (permalink / raw) To: Zhang, Yanmin, jmoyer; +Cc: Jens Axboe, Shaohua Li, LKML Hi Yanmin On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote: > Hi Yanmin, > On Mon, Jan 4, 2010 at 9:18 AM, Zhang, Yanmin > <yanmin_zhang@linux.intel.com> wrote: >> On Sat, 2010-01-02 at 19:52 +0100, Corrado Zoccolo wrote: >>> Hi >>> On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin >>> <yanmin_zhang@linux.intel.com> wrote: >>> > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote: >>> >> Hi Yanmin, >>> >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin >>> >> <yanmin_zhang@linux.intel.com> wrote: >>> >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote: >>> >> >> Hi Yanmin, >>> >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin >>> >> >> <yanmin_zhang@linux.intel.com> wrote: >>> >> >> > Comparing with kernel 2.6.32, fio mmap randread 64k has more than 40% regression with >>> >> >> > 2.6.33-rc1. >>> >> >> >>> >> > Thanks for your timely reply. Some comments inlined below. >>> >> > >>> >> >> Can you compare the performance also with 2.6.31? >>> >> > We did. We run Linux kernel Performance Tracking project and run many benchmarks when a RC kernel >>> >> > is released. >>> >> > >>> >> > The result of 2.6.31 is quite similar to the one of 2.6.32. But the one of 2.6.30 is about >>> >> > 8% better than the one of 2.6.31. >>> >> > >>> >> >> I think I understand what causes your problem. >>> >> >> 2.6.32, with default settings, handled even random readers as >>> >> >> sequential ones to provide fairness. This has benefits on single disks >>> >> >> and JBODs, but causes harm on raids. >>> >> > I didn't test RAID as that machine with hardware RAID HBA is crashed now. 
But if we turn on >>> >> > hardware RAID in HBA, mostly we use noop io scheduler. >>> >> I think you should start testing cfq with them, too. From 2.6.33, we >>> >> have some big improvements in this area. >>> > Great! I once compared cfq and noop against non-raid and raid0. One interesting finding >>> > about sequential read testing is when there are fewer processes to read files on the raid0 >>> > JBOD, noop on raid0 is pretty good, but when there are lots of processes to do so on a non-raid >>> > JBOD, cfq is pretty better. I planed to investigate it, but too busy in other issues. >>> > >>> >> > >>> >> >> For 2.6.33, we changed the way in which this is handled, restoring the >>> >> >> enable_idle = 0 for seeky queues as it was in 2.6.31: >>> >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, >>> >> >> struct cfq_queue *cfqq, >>> >> >> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); >>> >> >> >>> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >>> >> >> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq))) >>> >> >> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq))) >>> >> >> enable_idle = 0; >>> >> >> (compare with 2.6.31: >>> >> >> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || >>> >> >> (cfqd->hw_tag && CIC_SEEKY(cic))) >>> >> >> enable_idle = 0; >>> >> >> excluding the sample_valid check, it should be equivalent for you (I >>> >> >> assume you have NCQ disks)) >>> >> >> and we provide fairness for them by servicing all seeky queues >>> >> >> together, and then idling before switching to other ones. >>> >> > As for function cfq_update_idle_window, you is right. But since >>> >> > 2.6.32, CFQ merges many patches and the patches have impact on each other. 
>>> >> > >>> >> >> >>> >> >> The mmap 64k randreader will have a large seek_mean, resulting in >>> >> >> being marked seeky, but will send 16 * 4k sequential requests one >>> >> >> after the other, so alternating between those seeky queues will cause >>> >> >> harm. >>> >> >> >>> >> >> I'm working on a new way to compute seekiness of queues, that should >>> >> >> fix your issue, correctly identifying those queues as non-seeky (for >>> >> >> me, a queue should be considered seeky only if it submits more than 1 >>> >> >> seeky requests for 8 sequential ones). >>> >> >> >>> >> >> > >>> >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create >>> >> >> > 8 1-GB files per partition and start 8 processes to do rand read on the 8 files >>> >> >> > per partitions. There are 8*24 processes totally. randread block size is 64K. >>> >> >> > >>> >> >> > We found the regression on 2 machines. One machine has 8GB memory and the other has >>> >> >> > 6GB. >>> >> >> > >>> >> >> > Bisect is very unstable. The related patches are many instead of just one. >>> >> >> > >>> >> >> > >>> >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5 >>> >> >> > Author: Corrado Zoccolo <czoccolo@gmail.com> >>> >> >> > Date: Thu Nov 26 10:02:58 2009 +0100 >>> >> >> > >>> >> >> > cfq-iosched: fix corner cases in idling logic >>> >> >> > >>> >> >> > >>> >> >> > This patch introduces about less than 20% regression. I just reverted below section >>> >> >> > and this part regression disappear. It shows this regression is stable and not impacted >>> >> >> > by other patches. 
>>> >> >> > >>> >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) >>> >> >> > return; >>> >> >> > >>> >> >> > /* >>> >> >> > - * still requests with the driver, don't idle >>> >> >> > + * still active requests from this queue, don't idle >>> >> >> > */ >>> >> >> > - if (rq_in_driver(cfqd)) >>> >> >> > + if (cfqq->dispatched) >>> >> >> > return; >>> >> > Although 5 patches are related to the regression, above line is quite >>> >> > independent. Reverting above line could always improve the result for about >>> >> > 20%. >>> >> I've looked at your fio script, and it is quite complex, >>> > As we have about 40 fio sub cases, we have a script to create fio job file from >>> > a specific parameter list. So there are some superfluous parameters. >>> > >>> My point is that there are so many things going on, that is more >>> difficult to analyse the issues. >>> I prefer looking at one problem at a time, so (initially) removing the >>> possibility of queue merging, that Shaohua already investigated, can >>> help in spotting the still not-well-understood problem. >> Sounds reasonable. >> >>> Could you generate the same script, but with each process accessing >>> only one of the files, instead of chosing it at random? >> Ok. New testing starts 8 processes per partition and every process just works >> on one file. > Great, thanks. >> >>> >>> > Another point is we need stable result. >>> > >>> >> with lot of >>> >> things going on. >>> >> Let's keep this for last. >>> > Ok. But the change like what you do mostly reduces regresion. 
>>> > >>> >> I've created a smaller test, that already shows some regression: >>> >> [global] >>> >> direct=0 >>> >> ioengine=mmap >>> >> size=8G >>> >> bs=64k >>> >> numjobs=1 >>> >> loops=5 >>> >> runtime=60 >>> >> #group_reporting >>> >> invalidate=0 >>> >> directory=/media/hd/cfq-tests >>> >> >>> >> [job0] >>> >> startdelay=0 >>> >> rw=randread >>> >> filename=testfile1 >>> >> >>> >> [job1] >>> >> startdelay=0 >>> >> rw=randread >>> >> filename=testfile2 >>> >> >>> >> [job2] >>> >> startdelay=0 >>> >> rw=randread >>> >> filename=testfile3 >>> >> >>> >> [job3] >>> >> startdelay=0 >>> >> rw=randread >>> >> filename=testfile4 >>> >> >>> >> The attached patches, in particular 0005 (that apply on top of >>> >> for-linus branch of Jen's tree >>> >> git://git.kernel.dk/linux-2.6-block.git) fix the regression on this >>> >> simplified workload. >>> > I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1. The >>> > result isn't resolved. >>> Can you quantify if there is an improvement, though? >> >> Ok. Because of company policy, I could only post percent instead of real number > Sure, it is fine. >> >>> Please, also include Shahoua's patches. >>> I'd like to see the comparison between (always with low_latency set to 0): >>> plain 2.6.33 >>> plain 2.6.33 + shahoua's >>> plain 2.6.33 + shahoua's + my patch >>> plain 2.6.33 + shahoua's + my patch + rq_in_driver vs dispatched patch. >> >> 1) low_latency=0 >> 2.6.32 kernel 0 >> 2.6.33-rc1 -0.33 >> 2.6.33-rc1_shaohua -0.33 >> 2.6.33-rc1+corrado 0.03 >> 2.6.33-rc1_corrado+shaohua 0.02 >> 2.6.33-rc1_corrado+shaohua+rq_in_driver 0.01 >> > So my patch fixes the situation for low_latency = 0, as I expected. > I'll send it to Jens with a proper changelog. > >> 2) low_latency=1 >> 2.6.32 kernel 0 >> 2.6.33-rc1 -0.45 >> 2.6.33-rc1+corrado -0.24 >> 2.6.33-rc1_corrado+shaohua -0.23 >> 2.6.33-rc1_corrado+shaohua+rq_in_driver -0.23 > The results are as expected. 
With each process working on a separate > file, Shahoua's patches do not influence the result sensibly. > Interestingly, even rq_in_driver doesn't improve in this case, so > maybe its effect is somewhat connected to queue merging. > The remaining -23% is due to timeslice shrinking, that is done to > reduce max latency when there are too many processes doing I/O, at the > expense of throughput. It is a documented change, and the suggested > way if you favor throughput over latency is to set low_latency = 0. > >> >> >> When low_latency=1, we get the biggest number with kernel 2.6.32. >> Comparing with low_latency=0's result, the prior one is about 4% better. > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with > fastest 2.6.32, so we can consider the first part of the problem > solved. > I think we can return now to your full script with queue merging. I'm wondering if (in arm_slice_timer): - if (cfqq->dispatched) + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd))) return; gives the same improvement you were experiencing just reverting to rq_in_driver. We saw that cfqq->dispatched worked fine when there was no queue merging happening, so it must be something concerning merging, probably dispatched is not accurate when we set up for a merging, but the merging was not yet done. Thanks, Corrado ^ permalink raw reply [flat|nested] 17+ messages in thread
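The three variants of the arm_slice_timer check compared in this thread differ only in which in-flight counter suppresses idling. A standalone sketch of the decision logic (a simplified illustrative model; the real code operates on struct cfq_queue / struct cfq_data inside the kernel):

```c
#include <assert.h>

/* Simplified view of the state consulted by cfq_arm_slice_timer(). */
struct state {
    int cfqq_dispatched;   /* in-flight requests from this queue */
    int rq_in_driver;      /* in-flight requests from any queue */
    int new_cfqq;          /* queue is set up for merging */
};

/* 2.6.32 behaviour: any in-flight request suppresses idling. */
static int skip_idle_2632(const struct state *s)
{
    return s->rq_in_driver > 0;
}

/* 2.6.33-rc1 behaviour: only this queue's own requests count. */
static int skip_idle_2633(const struct state *s)
{
    return s->cfqq_dispatched > 0;
}

/* Corrado's proposed hybrid: also skip idling when the queue is
 * about to be merged and something is already in the driver. */
static int skip_idle_hybrid(const struct state *s)
{
    return s->cfqq_dispatched > 0 ||
           (s->new_cfqq && s->rq_in_driver > 0);
}
```

The interesting case is another queue's request in flight while this queue is marked for merging: 2.6.33-rc1 idles there, while both 2.6.32 and the hybrid do not — which matches the hypothesis that the regression shows up only when queue merging is happening.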
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1
  2010-01-16 16:27 ` Corrado Zoccolo
@ 2010-01-18  3:06 ` Zhang, Yanmin
  2010-01-19 20:10 ` Corrado Zoccolo
  0 siblings, 1 reply; 17+ messages in thread
From: Zhang, Yanmin @ 2010-01-18 3:06 UTC (permalink / raw)
To: Corrado Zoccolo; +Cc: jmoyer, Jens Axboe, Shaohua Li, LKML

On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote:
> Hi Yanmin
> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote:
> > Hi Yanmin,
> >> When low_latency=1, we get the biggest number with kernel 2.6.32.
> >> Comparing with low_latency=0's result, the prior one is about 4% better.
> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with
> > fastest 2.6.32, so we can consider the first part of the problem
> > solved.
> >
> I think we can return now to your full script with queue merging.
> I'm wondering if (in arm_slice_timer):
> -       if (cfqq->dispatched)
> +       if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd)))
>                 return;
> gives the same improvement you were experiencing just reverting to rq_in_driver.
I did a quick test against 2.6.33-rc1. With the new method, fio mmap randread 64k
has about 20% improvement. With just checking rq_in_driver(cfqd), it has
about 33% improvement.
>
> We saw that cfqq->dispatched worked fine when there was no queue
> merging happening, so it must be something concerning merging,
> probably dispatched is not accurate when we set up for a merging, but
> the merging was not yet done.

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-18 3:06 ` Zhang, Yanmin @ 2010-01-19 20:10 ` Corrado Zoccolo 2010-01-19 20:42 ` Jeff Moyer 2010-01-19 21:40 ` Vivek Goyal 0 siblings, 2 replies; 17+ messages in thread From: Corrado Zoccolo @ 2010-01-19 20:10 UTC (permalink / raw) To: jmoyer, Vivek Goyal; +Cc: Zhang, Yanmin, Jens Axboe, Shaohua Li, LKML On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote: >> Hi Yanmin >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote: >> > Hi Yanmin, >> >> When low_latency=1, we get the biggest number with kernel 2.6.32. >> >> Comparing with low_latency=0's result, the prior one is about 4% better. >> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with >> > fastest 2.6.32, so we can consider the first part of the problem >> > solved. >> > >> I think we can return now to your full script with queue merging. >> I'm wondering if (in arm_slice_timer): >> - if (cfqq->dispatched) >> + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd))) >> return; >> gives the same improvement you were experiencing just reverting to rq_in_driver. > I did a quick testing against 2.6.33-rc1. With the new method, fio mmap randread 46k > has about 20% improvement. With just checking rq_in_driver(cfqd), it has > about 33% improvement. > Jeff, do you have an idea why in arm_slice_timer, checking rq_in_driver instead of cfqq->dispatched gives so much improvement in presence of queue merging, while it doesn't have noticeable effect when there are no merges? Thanks, Corrado > >> >> We saw that cfqq->dispatched worked fine when there was no queue >> merging happening, so it must be something concerning merging, >> probably dispatched is not accurate when we set up for a merging, but >> the merging was not yet done. > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-19 20:10 ` Corrado Zoccolo @ 2010-01-19 20:42 ` Jeff Moyer 2010-01-19 21:40 ` Vivek Goyal 1 sibling, 0 replies; 17+ messages in thread From: Jeff Moyer @ 2010-01-19 20:42 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Vivek Goyal, Zhang, Yanmin, Jens Axboe, Shaohua Li, LKML Corrado Zoccolo <czoccolo@gmail.com> writes: > On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin > <yanmin_zhang@linux.intel.com> wrote: >> On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote: >>> Hi Yanmin >>> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote: >>> > Hi Yanmin, >>> >> When low_latency=1, we get the biggest number with kernel 2.6.32. >>> >> Comparing with low_latency=0's result, the prior one is about 4% better. >>> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with >>> > fastest 2.6.32, so we can consider the first part of the problem >>> > solved. >>> > >>> I think we can return now to your full script with queue merging. >>> I'm wondering if (in arm_slice_timer): >>> - if (cfqq->dispatched) >>> + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd))) >>> return; >>> gives the same improvement you were experiencing just reverting to rq_in_driver. >> I did a quick testing against 2.6.33-rc1. With the new method, fio mmap randread 46k >> has about 20% improvement. With just checking rq_in_driver(cfqd), it has >> about 33% improvement. >> > Jeff, do you have an idea why in arm_slice_timer, checking > rq_in_driver instead of cfqq->dispatched gives so much improvement in > presence of queue merging, while it doesn't have noticeable effect > when there are no merges? It's tough to say. Is there any chance I could get some blktrace data for the run? Cheers, Jeff ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-19 20:10 ` Corrado Zoccolo 2010-01-19 20:42 ` Jeff Moyer @ 2010-01-19 21:40 ` Vivek Goyal 2010-01-19 21:58 ` Corrado Zoccolo 2010-01-20 1:29 ` Shaohua Li 1 sibling, 2 replies; 17+ messages in thread From: Vivek Goyal @ 2010-01-19 21:40 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: jmoyer, Zhang, Yanmin, Jens Axboe, Shaohua Li, LKML On Tue, Jan 19, 2010 at 09:10:33PM +0100, Corrado Zoccolo wrote: > On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin > <yanmin_zhang@linux.intel.com> wrote: > > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote: > >> Hi Yanmin > >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote: > >> > Hi Yanmin, > >> >> When low_latency=1, we get the biggest number with kernel 2.6.32. > >> >> Comparing with low_latency=0's result, the prior one is about 4% better. > >> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with > >> > fastest 2.6.32, so we can consider the first part of the problem > >> > solved. > >> > > >> I think we can return now to your full script with queue merging. > >> I'm wondering if (in arm_slice_timer): > >> - if (cfqq->dispatched) > >> + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd))) > >> return; > >> gives the same improvement you were experiencing just reverting to rq_in_driver. > > I did a quick testing against 2.6.33-rc1. With the new method, fio mmap randread 46k > > has about 20% improvement. With just checking rq_in_driver(cfqd), it has > > about 33% improvement. > > > Jeff, do you have an idea why in arm_slice_timer, checking > rq_in_driver instead of cfqq->dispatched gives so much improvement in > presence of queue merging, while it doesn't have noticeable effect > when there are no merges? Performance improvement because of replacing cfqq->dispatched with rq_in_driver() is really strange. This will mean we will do even lesser idling on the cfqq. 
That means faster cfqq switching, and that should mean more seeks (for this
test case) and reduced throughput. This is just the opposite of your approach
of treating a random read mmap queue as sync, where we will idle on
the queue.

Thanks
Vivek

>
> Thanks,
> Corrado
>
> >
> >>
> >> We saw that cfqq->dispatched worked fine when there was no queue
> >> merging happening, so it must be something concerning merging,
> >> probably dispatched is not accurate when we set up for a merging, but
> >> the merging was not yet done.
> >
> >
>

^ permalink raw reply	[flat|nested] 17+ messages in thread
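Vivek's argument — less idling means faster queue switching, more seeks, and on a seek-bound JBOD lower throughput — can be illustrated with a toy service-time model. The numbers below are purely hypothetical, not measurements from this thread.

```c
#include <assert.h>

/*
 * Toy model of a seek-bound disk: each request pays the transfer time,
 * plus a full seek whenever the scheduler switched to another queue
 * since this queue's last request. switch_frac is the fraction of
 * requests preceded by such a switch. Returns throughput in KB/s.
 */
static double model_kbps(double seek_ms, double xfer_ms, double io_kb,
                         double switch_frac)
{
    double avg_ms = xfer_ms + switch_frac * seek_ms;
    return io_kb * 1000.0 / avg_ms;
}
```

In this model, switching queues on nearly every request costs several times the throughput of mostly staying put — which is why an improvement from *less* idling looks paradoxical and suggests the merging path is the real variable.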
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-19 21:40 ` Vivek Goyal @ 2010-01-19 21:58 ` Corrado Zoccolo 2010-01-20 19:18 ` Vivek Goyal 2010-01-20 1:29 ` Shaohua Li 1 sibling, 1 reply; 17+ messages in thread From: Corrado Zoccolo @ 2010-01-19 21:58 UTC (permalink / raw) To: Vivek Goyal; +Cc: jmoyer, Zhang, Yanmin, Jens Axboe, Shaohua Li, LKML On Tue, Jan 19, 2010 at 10:40 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Jan 19, 2010 at 09:10:33PM +0100, Corrado Zoccolo wrote: >> On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin >> <yanmin_zhang@linux.intel.com> wrote: >> > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote: >> >> Hi Yanmin >> >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote: >> >> > Hi Yanmin, >> >> >> When low_latency=1, we get the biggest number with kernel 2.6.32. >> >> >> Comparing with low_latency=0's result, the prior one is about 4% better. >> >> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with >> >> > fastest 2.6.32, so we can consider the first part of the problem >> >> > solved. >> >> > >> >> I think we can return now to your full script with queue merging. >> >> I'm wondering if (in arm_slice_timer): >> >> - if (cfqq->dispatched) >> >> + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd))) >> >> return; >> >> gives the same improvement you were experiencing just reverting to rq_in_driver. >> > I did a quick testing against 2.6.33-rc1. With the new method, fio mmap randread 46k >> > has about 20% improvement. With just checking rq_in_driver(cfqd), it has >> > about 33% improvement. >> > >> Jeff, do you have an idea why in arm_slice_timer, checking >> rq_in_driver instead of cfqq->dispatched gives so much improvement in >> presence of queue merging, while it doesn't have noticeable effect >> when there are no merges? > > Performance improvement because of replacing cfqq->dispatched with > rq_in_driver() is really strange. 
This will mean we will do even lesser > idling on the cfqq. That means faster cfqq switching and that should mean more > seeks (for this test case) and reduce throughput. This is just opposite to your approach of treating a random read mmap queue as sync where we will idle on > the queue. The tests (previous mails in this thread) show that, if no queue merging is happening, handling the queue as sync_idle, and setting low_latency = 0 to have bigger slices completely recovers the regression. If, though, we have queue merges, current arm_slice_timer shows regression w.r.t. the rq_in_driver version (2.6.32). I think a possible explanation is that we are idling instead of switching to an other queue that would be merged with this one. In fact, my half-backed try to have the rq_in_driver check conditional on queue merging fixed part of the regression (not all, because queue merges are not symmetrical, and I could be seeing the queue that is 'new_cfqq' for an other). Thanks, Corrado > > Thanks > Vivek > >> >> Thanks, >> Corrado >> >> > >> >> >> >> We saw that cfqq->dispatched worked fine when there was no queue >> >> merging happening, so it must be something concerning merging, >> >> probably dispatched is not accurate when we set up for a merging, but >> >> the merging was not yet done. >> > >> > >> > > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1 2010-01-19 21:58 ` Corrado Zoccolo @ 2010-01-20 19:18 ` Vivek Goyal 0 siblings, 0 replies; 17+ messages in thread From: Vivek Goyal @ 2010-01-20 19:18 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: jmoyer, Zhang, Yanmin, Jens Axboe, Shaohua Li, LKML On Tue, Jan 19, 2010 at 10:58:26PM +0100, Corrado Zoccolo wrote: > On Tue, Jan 19, 2010 at 10:40 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Tue, Jan 19, 2010 at 09:10:33PM +0100, Corrado Zoccolo wrote: > >> On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin > >> <yanmin_zhang@linux.intel.com> wrote: > >> > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote: > >> >> Hi Yanmin > >> >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote: > >> >> > Hi Yanmin, > >> >> >> When low_latency=1, we get the biggest number with kernel 2.6.32. > >> >> >> Comparing with low_latency=0's result, the prior one is about 4% better. > >> >> > Ok, so 2.6.33 + corrado (with low_latency =0) is comparable with > >> >> > fastest 2.6.32, so we can consider the first part of the problem > >> >> > solved. > >> >> > > >> >> I think we can return now to your full script with queue merging. > >> >> I'm wondering if (in arm_slice_timer): > >> >> - if (cfqq->dispatched) > >> >> + if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd))) > >> >> return; > >> >> gives the same improvement you were experiencing just reverting to rq_in_driver. > >> > I did a quick testing against 2.6.33-rc1. With the new method, fio mmap randread 46k > >> > has about 20% improvement. With just checking rq_in_driver(cfqd), it has > >> > about 33% improvement. > >> > > >> Jeff, do you have an idea why in arm_slice_timer, checking > >> rq_in_driver instead of cfqq->dispatched gives so much improvement in > >> presence of queue merging, while it doesn't have noticeable effect > >> when there are no merges? 
> > > > Performance improvement because of replacing cfqq->dispatched with > > rq_in_driver() is really strange. This will mean we will do even lesser > > idling on the cfqq. That means faster cfqq switching and that should mean more > > seeks (for this test case) and reduce throughput. This is just opposite to your approach of treating a random read mmap queue as sync where we will idle on > > the queue. > The tests (previous mails in this thread) show that, if no queue > merging is happening, handling the queue as sync_idle, and setting > low_latency = 0 to have bigger slices completely recovers the > regression. > If, though, we have queue merges, current arm_slice_timer shows > regression w.r.t. the rq_in_driver version (2.6.32). > I think a possible explanation is that we are idling instead of > switching to an other queue that would be merged with this one. In > fact, my half-backed try to have the rq_in_driver check conditional on > queue merging fixed part of the regression (not all, because queue > merges are not symmetrical, and I could be seeing the queue that is > 'new_cfqq' for an other). > Just a data point. I ran 8 fio mmap jobs, bs=64K, direct=1, size=2G runtime=30 with vanilla kernel (2.6.33-rc4) and with modified kernel which replaced cfqq->dispatched with rq_in_driver(cfqd). I did not see any significant throughput improvement but I did see max_clat halfed in modified kernel. Vanilla kernel ============== read bw: 3701KB/s max clat: 401050 us Number of times idle timer was armed: 20980 Number of cfqq expired/switched: 6377 cfqq merge operations: 0 Modified kernel (rq_in_driver(cfqd)) =================================== read bw: 3645KB/s max clat: 800515 us Number of times idle timer was armed: 2875 Number of cfqq expired/switched: 17750 cfqq merge operations: 0 This kind of confirms that rq_in_driver(cfqd) will reduce the number of times we idle on queues and will make queue switching faster. That also explains the reduce max clat. 
If that's the case, then it should also have increased the number of seeks
(at least on Yanmin's JBOD setup) and reduced throughput. But instead the
reverse seems to be happening in his setup.

Yanmin, as Jeff mentioned, if you can capture blktraces of the vanilla and
modified kernels and upload them somewhere for us to look at, it might help.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1
  2010-01-19 21:40 ` Vivek Goyal
  2010-01-19 21:58 ` Corrado Zoccolo
@ 2010-01-20  1:29 ` Shaohua Li
  2010-01-20 14:00 ` Jeff Moyer
  1 sibling, 1 reply; 17+ messages in thread

From: Shaohua Li @ 2010-01-20 1:29 UTC (permalink / raw)
To: Vivek Goyal; +Cc: Corrado Zoccolo, jmoyer, Zhang, Yanmin, Jens Axboe, LKML

On Tue, 2010-01-19 at 13:40 -0800, Vivek Goyal wrote:
> On Tue, Jan 19, 2010 at 09:10:33PM +0100, Corrado Zoccolo wrote:
> > On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin
> > <yanmin_zhang@linux.intel.com> wrote:
> > > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote:
> > >> Hi Yanmin
> > >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote:
> > >> > Hi Yanmin,
> > >> >> When low_latency=1, we get the biggest number with kernel 2.6.32.
> > >> >> Comparing with low_latency=0's result, the prior one is about 4% better.
> > >> > Ok, so 2.6.33 + corrado (with low_latency=0) is comparable with the
> > >> > fastest 2.6.32, so we can consider the first part of the problem
> > >> > solved.
> > >> >
> > >> I think we can return now to your full script with queue merging.
> > >> I'm wondering if (in arm_slice_timer):
> > >> -       if (cfqq->dispatched)
> > >> +       if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd)))
> > >>                 return;
> > >> gives the same improvement you were experiencing when just reverting to rq_in_driver.
> > > I did a quick test against 2.6.33-rc1. With the new method, fio mmap randread 64k
> > > has about 20% improvement. With just checking rq_in_driver(cfqd), it has
> > > about 33% improvement.
> > >
> > Jeff, do you have an idea why, in arm_slice_timer, checking
> > rq_in_driver instead of cfqq->dispatched gives so much improvement in
> > the presence of queue merging, while it doesn't have a noticeable effect
> > when there are no merges?
>
> Performance improvement because of replacing cfqq->dispatched with
> rq_in_driver() is really strange.
> This will mean we will do even less
> idling on the cfqq. That means faster cfqq switching, and that should mean
> more seeks (for this test case) and reduced throughput. This is just the
> opposite of your approach of treating a random read mmap queue as sync,
> where we will idle on the queue.
I looked at the issue before, but I don't fully understand it yet. One
interesting finding: cfqq->dispatched causes cfq_select_queue to switch
queues frequently, and it appears frequent switching lets us quickly get
to the sequential requests in the workload. Without the cfqq->dispatched
check, we dispatch a queue1 request, M requests from other queues, then
another queue1 request. With it, we dispatch a queue1 request, N requests
from other queues, then another queue1 request. It appears from blktrace
that M < N, which means we are less seeky. I don't see any other obvious
difference between the two cases in blktrace.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1
  2010-01-20  1:29 ` Shaohua Li
@ 2010-01-20 14:00 ` Jeff Moyer
  0 siblings, 0 replies; 17+ messages in thread

From: Jeff Moyer @ 2010-01-20 14:00 UTC (permalink / raw)
To: Shaohua Li; +Cc: Vivek Goyal, Corrado Zoccolo, Zhang, Yanmin, Jens Axboe, LKML

Shaohua Li <shaohua.li@intel.com> writes:

> On Tue, 2010-01-19 at 13:40 -0800, Vivek Goyal wrote:
>> On Tue, Jan 19, 2010 at 09:10:33PM +0100, Corrado Zoccolo wrote:
>> > On Mon, Jan 18, 2010 at 4:06 AM, Zhang, Yanmin
>> > <yanmin_zhang@linux.intel.com> wrote:
>> > > On Sat, 2010-01-16 at 17:27 +0100, Corrado Zoccolo wrote:
>> > >> Hi Yanmin
>> > >> On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo <czoccolo@gmail.com> wrote:
>> > >> > Hi Yanmin,
>> > >> >> When low_latency=1, we get the biggest number with kernel 2.6.32.
>> > >> >> Comparing with low_latency=0's result, the prior one is about 4% better.
>> > >> > Ok, so 2.6.33 + corrado (with low_latency=0) is comparable with the
>> > >> > fastest 2.6.32, so we can consider the first part of the problem
>> > >> > solved.
>> > >> >
>> > >> I think we can return now to your full script with queue merging.
>> > >> I'm wondering if (in arm_slice_timer):
>> > >> -       if (cfqq->dispatched)
>> > >> +       if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd)))
>> > >>                 return;
>> > >> gives the same improvement you were experiencing when just reverting to rq_in_driver.
>> > > I did a quick test against 2.6.33-rc1. With the new method, fio mmap randread 64k
>> > > has about 20% improvement. With just checking rq_in_driver(cfqd), it has
>> > > about 33% improvement.
>> > >
>> > Jeff, do you have an idea why, in arm_slice_timer, checking
>> > rq_in_driver instead of cfqq->dispatched gives so much improvement in
>> > the presence of queue merging, while it doesn't have a noticeable effect
>> > when there are no merges?
>>
>> Performance improvement because of replacing cfqq->dispatched with
>> rq_in_driver() is really strange.
>> This will mean we will do even less
>> idling on the cfqq. That means faster cfqq switching, and that should mean
>> more seeks (for this test case) and reduced throughput. This is just the
>> opposite of your approach of treating a random read mmap queue as sync,
>> where we will idle on the queue.
> I looked at the issue before, but I don't fully understand it yet. One
> interesting finding: cfqq->dispatched causes cfq_select_queue to switch
> queues frequently, and it appears frequent switching lets us quickly get
> to the sequential requests in the workload. Without the cfqq->dispatched
> check, we dispatch a queue1 request, M requests from other queues, then
> another queue1 request. With it, we dispatch a queue1 request, N requests
> from other queues, then another queue1 request. It appears from blktrace
> that M < N, which means we are less seeky. I don't see any other obvious
> difference between the two cases in blktrace.

I thought there was merging and/or unmerging activity; you don't mention
that here. I'll see if I can reproduce it.

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 17+ messages in thread
end of thread, other threads: [~2010-01-20 19:18 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-31  9:16 fio mmap randread 64k more than 40% regression with 2.6.33-rc1 Zhang, Yanmin
2009-12-31 10:34 ` Corrado Zoccolo
2010-01-01 10:12 ` Zhang, Yanmin
2010-01-01 16:32 ` Corrado Zoccolo
2010-01-02 12:33 ` Zhang, Yanmin
2010-01-02 18:52 ` Corrado Zoccolo
2010-01-04  8:18 ` Zhang, Yanmin
2010-01-04 18:28 ` Corrado Zoccolo
2010-01-16 16:27 ` Corrado Zoccolo
2010-01-18  3:06 ` Zhang, Yanmin
2010-01-19 20:10 ` Corrado Zoccolo
2010-01-19 20:42 ` Jeff Moyer
2010-01-19 21:40 ` Vivek Goyal
2010-01-19 21:58 ` Corrado Zoccolo
2010-01-20 19:18 ` Vivek Goyal
2010-01-20  1:29 ` Shaohua Li
2010-01-20 14:00 ` Jeff Moyer