* [RFC]cfq-iosched: quantum check tweak
@ 2009-12-25  9:10 Shaohua Li
  2009-12-25  9:44 ` Corrado Zoccolo
  0 siblings, 1 reply; 24+ messages in thread

From: Shaohua Li @ 2009-12-25  9:10 UTC (permalink / raw)
To: linux-kernel; +Cc: jens.axboe, czoccolo, yanmin.zhang

Currently a queue can only dispatch up to 4 requests if there are other queues.
This isn't optimal; the device can handle more requests, for example AHCI can
handle 31 requests. I can understand the limit is for fairness, but we could
do some tweaks:
1. if the queue still has a lot of slice left, it sounds like we could ignore
   the limit
2. we could keep the check only when cfq_latency is on. Users who don't care
   about latency should be happy to have the device fully pipelined.

I have a test of random direct IO with two threads, each issuing 32 requests
at a time:
without patch: 78m/s
with tweak 1: 138m/s
with both tweaks and low_latency disabled: 156m/s

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/cfq-iosched.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -2242,6 +2242,18 @@ static int cfq_forced_dispatch(struct cf
 	return dispatched;
 }
 
+static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
+				       struct cfq_queue *cfqq)
+{
+	/* the queue hasn't finished any request, can't estimate */
+	if (cfq_cfqq_slice_new(cfqq))
+		return 1;
+	if (time_after(jiffies + cfqd->cfq_slice_idle, cfqq->slice_end))
+		return 1;
+
+	return 0;
+}
+
 static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
@@ -2275,7 +2287,8 @@ static bool cfq_may_dispatch(struct cfq_
 	/*
 	 * We have other queues, don't allow more IO from this one
 	 */
-	if (cfqd->busy_queues > 1)
+	if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)
+	    && cfqd->cfq_latency)
 		return false;
 
 	/*

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-25  9:10 [RFC]cfq-iosched: quantum check tweak Shaohua Li
@ 2009-12-25  9:44 ` Corrado Zoccolo
  2009-12-28  3:35   ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread

From: Corrado Zoccolo @ 2009-12-25  9:44 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, jens.axboe, yanmin.zhang

On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> Currently a queue can only dispatch up to 4 requests if there are other queues.
> This isn't optimal, device can handle more requests, for example, AHCI can
> handle 31 requests. I can understand the limit is for fairness, but we could
> do some tweaks:
> 1. if the queue still has a lot of slice left, sounds we could ignore the limit

ok. You can even scale the limit proportionally to the remaining slice
(see below).

> 2. we could keep the check only when cfq_latency is on. For uses who don't care
> about latency should be happy to have device fully piped on.

I wouldn't overload low_latency with this meaning. You can obtain the
same by setting the quantum to 32.

> I have a test of random direct io of two threads, each has 32 requests one time
> without patch: 78m/s
> with tweak 1: 138m/s
> with two tweaks and disable latency: 156m/s

Please, test also with competing seq/random(depth1)/async workloads,
and also measure the introduced latencies.

> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/cfq-iosched.c |   15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -2242,6 +2242,18 @@ static int cfq_forced_dispatch(struct cf
>  	return dispatched;
>  }
>
> +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> +				       struct cfq_queue *cfqq)
> +{
> +	/* the queue hasn't finished any request, can't estimate */
> +	if (cfq_cfqq_slice_new(cfqq))
> +		return 1;
> +	if (time_after(jiffies + cfqd->cfq_slice_idle, cfqq->slice_end))

here I would use
    jiffies + cfqd->cfq_slice_idle * (cfqq->queued[0] + cfqq->queued[1])
to obtain the proper scaling.

> +		return 1;
> +
> +	return 0;
> +}
> +
>  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>  	unsigned int max_dispatch;

Thanks,
Corrado

--
__________________________________________________________________________
dott. Corrado Zoccolo                    mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-25  9:44 ` Corrado Zoccolo
@ 2009-12-28  3:35   ` Shaohua Li
  2009-12-28  9:02     ` Corrado Zoccolo
  0 siblings, 1 reply; 24+ messages in thread

From: Shaohua Li @ 2009-12-28  3:35 UTC (permalink / raw)
To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > This isn't optimal, device can handle more requests, for example, AHCI can
> > handle 31 requests. I can understand the limit is for fairness, but we could
> > do some tweaks:
> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> ok. You can even scale the limit proportionally to the remaining slice
> (see below).

I can't understand the meaning of the scaling below. cfq_slice_used_soon()
means the dispatched requests can finish before the slice is used up, so other
queues will not be impacted. I thought/hoped one cfq_slice_idle window is
enough to finish the dispatched requests.

> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> > about latency should be happy to have device fully piped on.
> I wouldn't overload low_latency with this meaning. You can obtain the
> same by setting the quantum to 32.

As this impacts fairness, I naturally thought we could use low_latency. I'll
remove the check in the next post.

> > I have a test of random direct io of two threads, each has 32 requests one time
> > without patch: 78m/s
> > with tweak 1: 138m/s
> > with two tweaks and disable latency: 156m/s
>
> Please, test also with competing seq/random(depth1)/async workloads,
> and measure also introduced latencies.

depth1 should be OK: if the device can only take one request at a time, it
should not require more requests from the I/O scheduler. I'll do more checks.

The time is hard to choose (I chose cfq_slice_idle here) to balance throughput
and latency. Do we have criteria to measure this? If the patch passes some
tests, is it then OK for latency?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-28  3:35   ` Shaohua Li
@ 2009-12-28  9:02     ` Corrado Zoccolo
  2010-01-07  2:04       ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread

From: Corrado Zoccolo @ 2009-12-28  9:02 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

Hi Shaohua,
On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
>> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > Currently a queue can only dispatch up to 4 requests if there are other queues.
>> > This isn't optimal, device can handle more requests, for example, AHCI can
>> > handle 31 requests. I can understand the limit is for fairness, but we could
>> > do some tweaks:
>> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
>> ok. You can even scale the limit proportionally to the remaining slice
>> (see below).
> I can't understand the meaning of below scale. cfq_slice_used_soon() means
> dispatched requests can finish before slice is used, so other queues will not be
> impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> dispatched requests.

cfq_slice_idle is 8ms, which is roughly the average time to complete one
request on most disks. If you have more requests dispatched on an
NCQ-rotational disk (non-RAID), it will take more time. A linear formula is
probably not the most accurate, but it is still more accurate than assuming
just one cfq_slice_idle. If you can experiment a bit, you could also try:
    cfq_slice_idle * ilog2(nr_dispatched+1)
    cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))

>> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
>> > about latency should be happy to have device fully piped on.
>> I wouldn't overload low_latency with this meaning. You can obtain the
>> same by setting the quantum to 32.
> As this impact fairness, so natually thought we could use low_latency. I'll remove
> the check in next post.

Great.

>> > I have a test of random direct io of two threads, each has 32 requests one time
>> > without patch: 78m/s
>> > with tweak 1: 138m/s
>> > with two tweaks and disable latency: 156m/s
>>
>> Please, test also with competing seq/random(depth1)/async workloads,
>> and measure also introduced latencies.
> depth1 should be ok, as if device can only send one request, it should not require
> more requests from ioscheduler.

I mean have a run with, at the same time:
* one seq reader,
* h random readers with depth 1 (non-aio)
* one async seq writer
* k random readers with large depth.
In this way, you can see if the changes you introduce to boost your workload
affect more realistic scenarios, in which various workloads are mixed. I
explicitly add the depth-1 random readers, since they are scheduled
differently than the large (>4) depth ones.

> I'll do more checks. The time is hard to choose (I choose cfq_slice-idle here) to
> balance thoughput and latency. Do we have creteria to measure this? See the patch
> passes some tests, so it's ok for latency.

Max latency should be near 300ms (compare with and without the patch), for a
reasonable number of concurrent processes (300ms/#proc < 2*idle_slice).

Thanks,
Corrado

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-28  9:02     ` Corrado Zoccolo
@ 2010-01-07  2:04       ` Shaohua Li
  2010-01-07 21:44         ` Corrado Zoccolo
  0 siblings, 1 reply; 24+ messages in thread

From: Shaohua Li @ 2010-01-07  2:04 UTC (permalink / raw)
To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

[-- Attachment #1: Type: text/plain, Size: 4977 bytes --]

On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> Hi Shaohua,
> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> > do some tweaks:
> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> >> ok. You can even scale the limit proportionally to the remaining slice
> >> (see below).
> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> > dispatched requests can finish before slice is used, so other queues will not be
> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> > dispatched requests.
> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> on most disks. If you have more requests dispatched on a
> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> linear formula is not the most accurate, but still more accurate than
> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> also try:
> cfq_slice_idle * ilog2(nr_dispatched+1)
> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
>
> >
> >> > 2. we could keep the check only when cfq_latency is on.
> >> > For uses who don't care
> >> > about latency should be happy to have device fully piped on.
> >> I wouldn't overload low_latency with this meaning. You can obtain the
> >> same by setting the quantum to 32.
> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> > the check in next post.
> Great.
> >> > I have a test of random direct io of two threads, each has 32 requests one time
> >> > without patch: 78m/s
> >> > with tweak 1: 138m/s
> >> > with two tweaks and disable latency: 156m/s
> >>
> >> Please, test also with competing seq/random(depth1)/async workloads,
> >> and measure also introduced latencies.
> > depth1 should be ok, as if device can only send one request, it should not require
> > more requests from ioscheduler.
> I mean have a run with, at the same time:
> * one seq reader,
> * h random readers with depth 1 (non-aio)
> * one async seq writer
> * k random readers with large depth.
> In this way, you can see if the changes you introduce to boost your
> workload affect more realistic scenarios, in which various workloads
> are mixed.
> I explicitly add the depth1 random readers, since they are sceduled
> differently than the large (>4) depth ones.

I tried a fio script that does what you describe, but the data isn't stable,
especially the write speed; the other kinds of IO are stable. Applying the
patch below doesn't make things worse (the write speed still isn't stable,
other IO is stable), so I can't say whether the patch passes the test, but the
latency reported by fio appears unchanged. I adopted the
slice_idle * dispatched approach, which I thought should be safe.

Currently a queue can only dispatch up to 4 requests if there are other queues.
This isn't optimal; the device can handle more requests, for example AHCI can
handle 31 requests. I can understand the limit is for fairness, but we could
do a tweak: if the queue still has a lot of slice left, it sounds like we
could ignore the limit.
For async IO, 40ms/8ms = 5; minus the quantum (4), we send at most 1 extra
request.
For sync IO, 100ms/8ms = 12; minus the quantum (4), we might send up to 8
extra requests. This might cause a latency issue if the queue is preempted at
the very beginning.

This patch boosts my workload from 78m/s to 102m/s, which isn't as big a gain
as in my last post, but is still a big improvement.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/cfq-iosched.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
 	return dispatched;
 }
 
+static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
+				       struct cfq_queue *cfqq)
+{
+	/* the queue hasn't finished any request, can't estimate */
+	if (cfq_cfqq_slice_new(cfqq))
+		return true;
+	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
+		       cfqq->slice_end))
+		return true;
+
+	return false;
+}
+
 static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
@@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
 	/*
 	 * We have other queues, don't allow more IO from this one
 	 */
-	if (cfqd->busy_queues > 1)
+	if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
 		return false;
 
 	/*

[-- Attachment #2: aiorandread --]
[-- Type: text/plain, Size: 396 bytes --]

[global]
runtime=120
ioscheduler=cfq
overwrite=1
loops=10
group_reporting

[seq-read]
directory=/mnt/b1
bs=4k
rw=read
ioengine=sync
size=1G

[seq-write]
directory=/mnt/b1
bs=4k
rw=write
ioengine=sync
size=1G

[rand-read]
directory=/mnt/b1
bs=4k
rw=randread
ioengine=sync
size=1G
numjobs=4

[rand-read-aio]
directory=/mnt/b1
bs=4k
iodepth=32
rw=randread
ioengine=libaio
direct=1
size=1G
numjobs=4

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-07  2:04       ` Shaohua Li
@ 2010-01-07 21:44         ` Corrado Zoccolo
  2010-01-08  0:57           ` Shaohua Li
  2010-01-08 17:15           ` Vivek Goyal
  0 siblings, 2 replies; 24+ messages in thread

From: Corrado Zoccolo @ 2010-01-07 21:44 UTC (permalink / raw)
To: Shaohua Li; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

Hi Shaohua,

On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
>> Hi Shaohua,
>> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
>> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
>> >> > This isn't optimal, device can handle more requests, for example, AHCI can
>> >> > handle 31 requests. I can understand the limit is for fairness, but we could
>> >> > do some tweaks:
>> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
>> >> ok. You can even scale the limit proportionally to the remaining slice
>> >> (see below).
>> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
>> > dispatched requests can finish before slice is used, so other queues will not be
>> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
>> > dispatched requests.
>> cfq_slice_idle is 8ms, that is the average time to complete 1 request
>> on most disks. If you have more requests dispatched on a
>> NCQ-rotational disk (non-RAID), it will take more time. Probably a
>> linear formula is not the most accurate, but still more accurate than
>> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
>> also try:
>> cfq_slice_idle * ilog2(nr_dispatched+1)
>> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
>>
>> >
>> >> > 2. we could keep the check only when cfq_latency is on.
>> >> > For uses who don't care
>> >> > about latency should be happy to have device fully piped on.
>> >> I wouldn't overload low_latency with this meaning. You can obtain the
>> >> same by setting the quantum to 32.
>> > As this impact fairness, so natually thought we could use low_latency. I'll remove
>> > the check in next post.
>> Great.
>> >> > I have a test of random direct io of two threads, each has 32 requests one time
>> >> > without patch: 78m/s
>> >> > with tweak 1: 138m/s
>> >> > with two tweaks and disable latency: 156m/s
>> >>
>> >> Please, test also with competing seq/random(depth1)/async workloads,
>> >> and measure also introduced latencies.
>> > depth1 should be ok, as if device can only send one request, it should not require
>> > more requests from ioscheduler.
>> I mean have a run with, at the same time:
>> * one seq reader,
>> * h random readers with depth 1 (non-aio)
>> * one async seq writer
>> * k random readers with large depth.
>> In this way, you can see if the changes you introduce to boost your
>> workload affect more realistic scenarios, in which various workloads
>> are mixed.
>> I explicitly add the depth1 random readers, since they are sceduled
>> differently than the large (>4) depth ones.
> I tried a fio script which does like your description, but the data
> isn't stable, especially the write speed, other kind of io speed is
> stable. Apply below patch doesn't make things worse (still write speed
> isn't stable, other io is stable), so I can't say if the patch passes
> the test, but it appears latency reported by fio hasn't change. I adopt
> the slice_idle * dispatched approach, which I thought should be safe.

I'm doing some tests right now on a single NCQ rotational disk, and the
average service time when submitting with a high depth is halved w.r.t.
depth 1, so I think you could also test with the formula:
    slice_idle * dispatched / 2
It should give a performance boost, without noticeable impact on latency.

> Currently a queue can only dispatch up to 4 requests if there are other queues.
> This isn't optimal, device can handle more requests, for example, AHCI can
> handle 31 requests. I can understand the limit is for fairness, but we could
> do a tweak: if the queue still has a lot of slice left, sounds we could ignore
> the limit.
> For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium.
> For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium.
> This might cause latency issue if the queue is preempted at the very beginning.
>
> This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last
> post, but also is a big improvement.

Acked-by: Corrado Zoccolo <czoccolo@gmail.com>

> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/cfq-iosched.c |   15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
>  	return dispatched;
>  }
>
> +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> +				       struct cfq_queue *cfqq)
> +{
> +	/* the queue hasn't finished any request, can't estimate */
> +	if (cfq_cfqq_slice_new(cfqq))
> +		return true;
> +	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> +		       cfqq->slice_end))
> +		return true;
> +
> +	return false;
> +}
> +
>  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>  	unsigned int max_dispatch;
> @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
>  	/*
>  	 * We have other queues, don't allow more IO from this one
>  	 */
> -	if (cfqd->busy_queues > 1)
> +	if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
>  		return false;
>
>  	/*

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-07 21:44 ` Corrado Zoccolo @ 2010-01-08 0:57 ` Shaohua Li 2010-01-08 20:22 ` Corrado Zoccolo 2010-01-08 17:15 ` Vivek Goyal 1 sibling, 1 reply; 24+ messages in thread From: Shaohua Li @ 2010-01-08 0:57 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote: > Hi Shahoua, > > On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > >> Hi Shaohua, > >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > >> >> > do some tweaks: > >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > >> >> ok. You can even scale the limit proportionally to the remaining slice > >> >> (see below). > >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > >> > dispatched requests can finish before slice is used, so other queues will not be > >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > >> > dispatched requests. > >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > >> on most disks. If you have more requests dispatched on a > >> NCQ-rotational disk (non-RAID), it will take more time. Probably a > >> linear formula is not the most accurate, but still more accurate than > >> taking just 1 cfq_slice_idle. 
If you can experiment a bit, you could > >> also try: > >> cfq_slice_idle * ilog2(nr_dispatched+1) > >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > >> > >> > > >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > >> >> > about latency should be happy to have device fully piped on. > >> >> I wouldn't overload low_latency with this meaning. You can obtain the > >> >> same by setting the quantum to 32. > >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > >> > the check in next post. > >> Great. > >> >> > I have a test of random direct io of two threads, each has 32 requests one time > >> >> > without patch: 78m/s > >> >> > with tweak 1: 138m/s > >> >> > with two tweaks and disable latency: 156m/s > >> >> > >> >> Please, test also with competing seq/random(depth1)/async workloads, > >> >> and measure also introduced latencies. > >> > depth1 should be ok, as if device can only send one request, it should not require > >> > more requests from ioscheduler. > >> I mean have a run with, at the same time: > >> * one seq reader, > >> * h random readers with depth 1 (non-aio) > >> * one async seq writer > >> * k random readers with large depth. > >> In this way, you can see if the changes you introduce to boost your > >> workload affect more realistic scenarios, in which various workloads > >> are mixed. > >> I explicitly add the depth1 random readers, since they are sceduled > >> differently than the large (>4) depth ones. > > I tried a fio script which does like your description, but the data > > isn't stable, especially the write speed, other kind of io speed is > > stable. Apply below patch doesn't make things worse (still write speed > > isn't stable, other io is stable), so I can't say if the patch passes > > the test, but it appears latency reported by fio hasn't change. I adopt > > the slice_idle * dispatched approach, which I thought should be safe. 
> I'm doing some tests right now on a single ncq rotational disk, and
> the average service time when submitting with a high depth is halved
> w.r.t. depth 1, so I think you could test also with the formula :
> slice_idle * dispatched / 2. It should give a performance boost,
> without noticeable impact on latency

Thanks for looking at it. Can you forward your tests to me so I can check
here? I'll try a more aggressive formula next. It shouldn't impact my SSD,
which is very fast; each request takes less than 1ms. We'd better have a
mechanism to measure device speed, but jiffies isn't good for that. I'm
thinking of using sched_clock(), which has its own issues too, like drift
between CPUs.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-08 0:57 ` Shaohua Li @ 2010-01-08 20:22 ` Corrado Zoccolo 2010-01-11 1:49 ` Shaohua Li 2010-01-11 2:01 ` Shaohua Li 0 siblings, 2 replies; 24+ messages in thread From: Corrado Zoccolo @ 2010-01-08 20:22 UTC (permalink / raw) To: Shaohua Li; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin On Fri, Jan 8, 2010 at 1:57 AM, Shaohua Li <shaohua.li@intel.com> wrote: > On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote: >> Hi Shahoua, >> >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: >> >> Hi Shaohua, >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could >> >> >> > do some tweaks: >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit >> >> >> ok. You can even scale the limit proportionally to the remaining slice >> >> >> (see below). >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means >> >> > dispatched requests can finish before slice is used, so other queues will not be >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the >> >> > dispatched requests. >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request >> >> on most disks. If you have more requests dispatched on a >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a >> >> linear formula is not the most accurate, but still more accurate than >> >> taking just 1 cfq_slice_idle. 
If you can experiment a bit, you could >> >> also try: >> >> cfq_slice_idle * ilog2(nr_dispatched+1) >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) >> >> >> >> > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care >> >> >> > about latency should be happy to have device fully piped on. >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the >> >> >> same by setting the quantum to 32. >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove >> >> > the check in next post. >> >> Great. >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time >> >> >> > without patch: 78m/s >> >> >> > with tweak 1: 138m/s >> >> >> > with two tweaks and disable latency: 156m/s >> >> >> >> >> >> Please, test also with competing seq/random(depth1)/async workloads, >> >> >> and measure also introduced latencies. >> >> > depth1 should be ok, as if device can only send one request, it should not require >> >> > more requests from ioscheduler. >> >> I mean have a run with, at the same time: >> >> * one seq reader, >> >> * h random readers with depth 1 (non-aio) >> >> * one async seq writer >> >> * k random readers with large depth. >> >> In this way, you can see if the changes you introduce to boost your >> >> workload affect more realistic scenarios, in which various workloads >> >> are mixed. >> >> I explicitly add the depth1 random readers, since they are sceduled >> >> differently than the large (>4) depth ones. >> > I tried a fio script which does like your description, but the data >> > isn't stable, especially the write speed, other kind of io speed is >> > stable. Apply below patch doesn't make things worse (still write speed >> > isn't stable, other io is stable), so I can't say if the patch passes >> > the test, but it appears latency reported by fio hasn't change. I adopt >> > the slice_idle * dispatched approach, which I thought should be safe. 
>> I'm doing some tests right now on a single ncq rotational disk, and
>> the average service time when submitting with a high depth is halved
>> w.r.t. depth 1, so I think you could test also with the formula :
>> slice_idle * dispatched / 2. It should give a performance boost,
>> without noticeable impact on latency
> Thanks for looking at it. can you forward your tests to me so I can
> check here?

Sure, you can find them at:
* simple seek time: http://dl.dropbox.com/u/3525644/stride.c
* avg seek time with NCQ: http://dl.dropbox.com/u/3525644/ncq.c

> I'll do more aggressive formula next. It shouldn't impact
> my SSD which is very fast, each request takes less than 1ms. We'd better
> have a mechanism to measure device speed, but jiffies isn't good. I'm thinking
> using sched_clock() which has its issue too like having drift between CPUs.

I'm using ktime_get() successfully. Measuring is the easy part; the difficult
part is deciding what to do with the value :)

Thanks,
Corrado

> Thanks,
> Shaohua

--
__________________________________________________________________________
dott. Corrado Zoccolo                    mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-08 20:22 ` Corrado Zoccolo @ 2010-01-11 1:49 ` Shaohua Li 2010-01-11 2:01 ` Shaohua Li 1 sibling, 0 replies; 24+ messages in thread From: Shaohua Li @ 2010-01-11 1:49 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin On Sat, Jan 09, 2010 at 04:22:47AM +0800, Corrado Zoccolo wrote: > On Fri, Jan 8, 2010 at 1:57 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote: > >> Hi Shahoua, > >> > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > >> >> Hi Shaohua, > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > >> >> >> > do some tweaks: > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > >> >> >> ok. You can even scale the limit proportionally to the remaining slice > >> >> >> (see below). > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > >> >> > dispatched requests can finish before slice is used, so other queues will not be > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > >> >> > dispatched requests. > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > >> >> on most disks. If you have more requests dispatched on a > >> >> NCQ-rotational disk (non-RAID), it will take more time. 
Probably a > >> >> linear formula is not the most accurate, but still more accurate than > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could > >> >> also try: > >> >> cfq_slice_idle * ilog2(nr_dispatched+1) > >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > >> >> > >> >> > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > >> >> >> > about latency should be happy to have device fully piped on. > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the > >> >> >> same by setting the quantum to 32. > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > >> >> > the check in next post. > >> >> Great. > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time > >> >> >> > without patch: 78m/s > >> >> >> > with tweak 1: 138m/s > >> >> >> > with two tweaks and disable latency: 156m/s > >> >> >> > >> >> >> Please, test also with competing seq/random(depth1)/async workloads, > >> >> >> and measure also introduced latencies. > >> >> > depth1 should be ok, as if device can only send one request, it should not require > >> >> > more requests from ioscheduler. > >> >> I mean have a run with, at the same time: > >> >> * one seq reader, > >> >> * h random readers with depth 1 (non-aio) > >> >> * one async seq writer > >> >> * k random readers with large depth. > >> >> In this way, you can see if the changes you introduce to boost your > >> >> workload affect more realistic scenarios, in which various workloads > >> >> are mixed. > >> >> I explicitly add the depth1 random readers, since they are sceduled > >> >> differently than the large (>4) depth ones. > >> > I tried a fio script which does like your description, but the data > >> > isn't stable, especially the write speed, other kind of io speed is > >> > stable. 
Apply below patch doesn't make things worse (still write speed > >> > isn't stable, other io is stable), so I can't say if the patch passes > >> > the test, but it appears latency reported by fio hasn't change. I adopt > >> > the slice_idle * dispatched approach, which I thought should be safe. > >> > >> I'm doing some tests right now on a single ncq rotational disk, and > >> the average service time when submitting with a high depth is halved > >> w.r.t. depth 1, so I think you could test also with the formula : > >> slice_idle * dispatched / 2. It should give a performance boost, > >> without noticeable impact on latency > > Thanks for looking at it. can you forward your tests to me so I can > > check here? > Sure, you can find them at: > * simple seek time: http://dl.dropbox.com/u/3525644/stride.c > * avg seek time with NCQ: http://dl.dropbox.com/u/3525644/ncq.c Great, I'll do some tests. > > I'll do more aggressive formula next. It shouldn't impact > > my SSD which is very fast, each request takes less than 1ms. We'd better > > have a mechanism to measure device speed, but jiffies isn't good. I'm thinking > > using sched_clock() which has its issue too like having drift between CPUs. > I'm using ktime_get() successfully. Measuring is the easy part. The > difficult one is decide what to do with the value :) If we can get the average time a request takes, we can further tweak the quantum check. On a fast device, a single slice_idle can be enough to finish 100 requests. Thanks, Shaohua ^ permalink raw reply [flat|nested] 24+ messages in thread
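The slice_idle * dispatched estimate adopted in the patch above can be sketched outside the kernel. The following is a hypothetical userspace model, not the patch itself: the jiffies-wrap-safe time_after() macro is replaced by a plain integer comparison, and all times are in milliseconds.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of the cfq_slice_used_soon() heuristic: assume each
 * already-dispatched request costs roughly one cfq_slice_idle (8 ms by
 * default), and report the slice as "used soon" once the estimated
 * completion time of the outstanding requests runs past slice_end.
 * The kernel version uses jiffies and time_after(); here plain integers
 * stand in for both. */
static bool slice_used_soon(unsigned long now_ms, unsigned long slice_end_ms,
                            unsigned long slice_idle_ms,
                            unsigned int dispatched)
{
    return now_ms + slice_idle_ms * dispatched > slice_end_ms;
}
```

With the default 100 ms sync slice and 8 ms slice_idle, a fresh slice tolerates up to 12 outstanding requests before the estimate overruns, matching the "100ms/8ms = 12" arithmetic elsewhere in this thread.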
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-08 20:22 ` Corrado Zoccolo 2010-01-11 1:49 ` Shaohua Li @ 2010-01-11 2:01 ` Shaohua Li 1 sibling, 0 replies; 24+ messages in thread From: Shaohua Li @ 2010-01-11 2:01 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin On Sat, Jan 09, 2010 at 04:22:47AM +0800, Corrado Zoccolo wrote: > On Fri, Jan 8, 2010 at 1:57 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote: > >> Hi Shahoua, > >> > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > >> >> Hi Shaohua, > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > >> >> >> > do some tweaks: > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > >> >> >> ok. You can even scale the limit proportionally to the remaining slice > >> >> >> (see below). > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > >> >> > dispatched requests can finish before slice is used, so other queues will not be > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > >> >> > dispatched requests. > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > >> >> on most disks. If you have more requests dispatched on a > >> >> NCQ-rotational disk (non-RAID), it will take more time. 
Probably a > >> >> linear formula is not the most accurate, but still more accurate than > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could > >> >> also try: > >> >> cfq_slice_idle * ilog2(nr_dispatched+1) > >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > >> >> > >> >> > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > >> >> >> > about latency should be happy to have device fully piped on. > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the > >> >> >> same by setting the quantum to 32. > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > >> >> > the check in next post. > >> >> Great. > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time > >> >> >> > without patch: 78m/s > >> >> >> > with tweak 1: 138m/s > >> >> >> > with two tweaks and disable latency: 156m/s > >> >> >> > >> >> >> Please, test also with competing seq/random(depth1)/async workloads, > >> >> >> and measure also introduced latencies. > >> >> > depth1 should be ok, as if device can only send one request, it should not require > >> >> > more requests from ioscheduler. > >> >> I mean have a run with, at the same time: > >> >> * one seq reader, > >> >> * h random readers with depth 1 (non-aio) > >> >> * one async seq writer > >> >> * k random readers with large depth. > >> >> In this way, you can see if the changes you introduce to boost your > >> >> workload affect more realistic scenarios, in which various workloads > >> >> are mixed. > >> >> I explicitly add the depth1 random readers, since they are sceduled > >> >> differently than the large (>4) depth ones. > >> > I tried a fio script which does like your description, but the data > >> > isn't stable, especially the write speed, other kind of io speed is > >> > stable. 
Apply below patch doesn't make things worse (still write speed > >> > isn't stable, other io is stable), so I can't say if the patch passes > >> > the test, but it appears latency reported by fio hasn't change. I adopt > >> > the slice_idle * dispatched approach, which I thought should be safe. > >> > >> I'm doing some tests right now on a single ncq rotational disk, and > >> the average service time when submitting with a high depth is halved > >> w.r.t. depth 1, so I think you could test also with the formula : > >> slice_idle * dispatched / 2. It should give a performance boost, > >> without noticeable impact on latency > > Thanks for looking at it. can you forward your tests to me so I can > > check here? > Sure, you can find them at: > * simple seek time: http://dl.dropbox.com/u/3525644/stride.c > * avg seek time with NCQ: http://dl.dropbox.com/u/3525644/ncq.c > > I'll do more aggressive formula next. It shouldn't impact > > my SSD which is very fast, each request takes less than 1ms. We'd better > > have a mechanism to measure device speed, but jiffies isn't good. I'm thinking > > using sched_clock() which has its issue too like having drift between CPUs. > I'm using ktime_get() successfully. Measuring is the easy part. The > difficult one is decide what to do with the value :) BTW, is ktime_get() light enough to be called 40k times per second? I thought we need a very lightweight clock for this kind of accounting. Thanks, Shaohua ^ permalink raw reply [flat|nested] 24+ messages in thread
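The per-request accounting discussed above (timestamp at issue and at completion, then average the difference) could be kept cheap with an integer moving average. This is a hypothetical sketch, not code from any kernel release; the 1/8 weighting mirrors the common kernel idiom avg = (7*avg + sample)/8, so a single outlier cannot swing the estimate.

```c
#include <assert.h>

/* Hypothetical integer EWMA of per-request service time in microseconds.
 * The caller would feed it the delta between two ktime_get() readings
 * taken at request issue and completion.  New samples get 1/8 weight,
 * so the estimate adapts gradually instead of tracking every request. */
static unsigned long ewma_update(unsigned long avg_us, unsigned long sample_us)
{
    return (avg_us * 7 + sample_us) / 8;
}
```

Starting from a rotational-disk estimate of 8000 us, a run of sub-millisecond SSD completions walks the average down smoothly, which is the kind of "device speed" signal the thread suggests could drive the quantum check.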
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-07 21:44 ` Corrado Zoccolo 2010-01-08 0:57 ` Shaohua Li @ 2010-01-08 17:15 ` Vivek Goyal 2010-01-08 17:40 ` Vivek Goyal 2010-01-08 20:35 ` Corrado Zoccolo 1 sibling, 2 replies; 24+ messages in thread From: Vivek Goyal @ 2010-01-08 17:15 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote: > Hi Shahoua, > > On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > >> Hi Shaohua, > >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > >> >> > do some tweaks: > >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > >> >> ok. You can even scale the limit proportionally to the remaining slice > >> >> (see below). > >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > >> > dispatched requests can finish before slice is used, so other queues will not be > >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > >> > dispatched requests. > >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > >> on most disks. If you have more requests dispatched on a > >> NCQ-rotational disk (non-RAID), it will take more time. Probably a > >> linear formula is not the most accurate, but still more accurate than > >> taking just 1 cfq_slice_idle. 
If you can experiment a bit, you could > >> also try: > >> cfq_slice_idle * ilog2(nr_dispatched+1) > >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > >> > >> > > >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > >> >> > about latency should be happy to have device fully piped on. > >> >> I wouldn't overload low_latency with this meaning. You can obtain the > >> >> same by setting the quantum to 32. > >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > >> > the check in next post. > >> Great. > >> >> > I have a test of random direct io of two threads, each has 32 requests one time > >> >> > without patch: 78m/s > >> >> > with tweak 1: 138m/s > >> >> > with two tweaks and disable latency: 156m/s > >> >> > >> >> Please, test also with competing seq/random(depth1)/async workloads, > >> >> and measure also introduced latencies. > >> > depth1 should be ok, as if device can only send one request, it should not require > >> > more requests from ioscheduler. > >> I mean have a run with, at the same time: > >> * one seq reader, > >> * h random readers with depth 1 (non-aio) > >> * one async seq writer > >> * k random readers with large depth. > >> In this way, you can see if the changes you introduce to boost your > >> workload affect more realistic scenarios, in which various workloads > >> are mixed. > >> I explicitly add the depth1 random readers, since they are sceduled > >> differently than the large (>4) depth ones. > > I tried a fio script which does like your description, but the data > > isn't stable, especially the write speed, other kind of io speed is > > stable. Apply below patch doesn't make things worse (still write speed > > isn't stable, other io is stable), so I can't say if the patch passes > > the test, but it appears latency reported by fio hasn't change. I adopt > > the slice_idle * dispatched approach, which I thought should be safe. 
> > I'm doing some tests right now on a single ncq rotational disk, and > the average service time when submitting with a high depth is halved > w.r.t. depth 1, so I think you could test also with the formula : > slice_idle * dispatched / 2. It should give a performance boost, > without noticeable impact on latency. > But I guess the right comparison here would be how service times vary when we push queue depths from 4 to higher (as done by this patch). Were you running deep seeky queues or sequential queues? Curious to know whether service times reduced even in case of deep seeky queues on this single disk. I think this patch breaks the meaning of cfq_quantum? Now we can allow dispatch of more requests from the same queue. I had kind of liked the idea of respecting cfq_quantum. Especially it can help in testing. With this patch cfq_quantum will more or less lose its meaning. Vivek > > Currently a queue can only dispatch up to 4 requests if there are other queues. > > This isn't optimal, device can handle more requests, for example, AHCI can > > handle 31 requests. I can understand the limit is for fairness, but we could > > do a tweak: if the queue still has a lot of slice left, sounds we could ignore > > the limit. > > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium. > > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium. > > This might cause latency issue if the queue is preempted at the very beginning. > > > > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last > > post, but also is a big improvement. 
> > Acked-by: Corrado Zoccolo <czoccolo@gmail.com> > > > > > Signed-off-by: Shaohua Li <shaohua.li@intel.com> > > --- > > block/cfq-iosched.c | 15 ++++++++++++++- > > 1 file changed, 14 insertions(+), 1 deletion(-) > > > > Index: linux-2.6/block/cfq-iosched.c > > =================================================================== > > --- linux-2.6.orig/block/cfq-iosched.c > > +++ linux-2.6/block/cfq-iosched.c > > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf > > return dispatched; > > } > > > > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, > > + struct cfq_queue *cfqq) > > +{ > > + /* the queue hasn't finished any request, can't estimate */ > > + if (cfq_cfqq_slice_new(cfqq)) > > + return true; > > + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, > > + cfqq->slice_end)) > > + return true; > > + > > + return false; > > +} > > + > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > { > > unsigned int max_dispatch; > > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_ > > /* > > * We have other queues, don't allow more IO from this one > > */ > > - if (cfqd->busy_queues > 1) > > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > > return false; > > > > /* > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 24+ messages in thread
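The arithmetic in the quoted patch description ("40ms/8ms = 5 - quantum = 1" for async, "100ms/8ms = 12 - quantum = 8" for sync) can be checked with a tiny helper. This is purely a worked example of those numbers, not scheduler code: with slice_idle = 8 ms, a slice of S ms lets roughly S/8 requests complete, so at most S/8 - quantum extra requests beyond the old quantum of 4 are in flight at the start of a slice.

```c
#include <assert.h>

/* How many requests beyond `quantum` the slice-based check can admit at
 * the very start of a slice, assuming each request costs one slice_idle. */
static unsigned int extra_requests(unsigned int slice_ms,
                                   unsigned int slice_idle_ms,
                                   unsigned int quantum)
{
    unsigned int fit = slice_ms / slice_idle_ms; /* requests that fit in slice */
    return fit > quantum ? fit - quantum : 0;
}
```

This also makes the stated latency risk concrete: the 8 extra sync requests are only a problem if the queue is preempted right after dispatching them.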
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-08 17:15 ` Vivek Goyal @ 2010-01-08 17:40 ` Vivek Goyal 2010-01-08 20:35 ` Corrado Zoccolo 1 sibling, 0 replies; 24+ messages in thread From: Vivek Goyal @ 2010-01-08 17:40 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin On Fri, Jan 08, 2010 at 12:15:35PM -0500, Vivek Goyal wrote: > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote: > > Hi Shahoua, > > > > On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > > >> Hi Shaohua, > > >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > > >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > > >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > > >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > > >> >> > do some tweaks: > > >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > > >> >> ok. You can even scale the limit proportionally to the remaining slice > > >> >> (see below). > > >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > > >> > dispatched requests can finish before slice is used, so other queues will not be > > >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > > >> > dispatched requests. > > >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > > >> on most disks. If you have more requests dispatched on a > > >> NCQ-rotational disk (non-RAID), it will take more time. Probably a > > >> linear formula is not the most accurate, but still more accurate than > > >> taking just 1 cfq_slice_idle. 
If you can experiment a bit, you could > > >> also try: > > >> cfq_slice_idle * ilog2(nr_dispatched+1) > > >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > > >> > > >> > > > >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > > >> >> > about latency should be happy to have device fully piped on. > > >> >> I wouldn't overload low_latency with this meaning. You can obtain the > > >> >> same by setting the quantum to 32. > > >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > > >> > the check in next post. > > >> Great. > > >> >> > I have a test of random direct io of two threads, each has 32 requests one time > > >> >> > without patch: 78m/s > > >> >> > with tweak 1: 138m/s > > >> >> > with two tweaks and disable latency: 156m/s > > >> >> > > >> >> Please, test also with competing seq/random(depth1)/async workloads, > > >> >> and measure also introduced latencies. > > >> > depth1 should be ok, as if device can only send one request, it should not require > > >> > more requests from ioscheduler. > > >> I mean have a run with, at the same time: > > >> * one seq reader, > > >> * h random readers with depth 1 (non-aio) > > >> * one async seq writer > > >> * k random readers with large depth. > > >> In this way, you can see if the changes you introduce to boost your > > >> workload affect more realistic scenarios, in which various workloads > > >> are mixed. > > >> I explicitly add the depth1 random readers, since they are sceduled > > >> differently than the large (>4) depth ones. > > > I tried a fio script which does like your description, but the data > > > isn't stable, especially the write speed, other kind of io speed is > > > stable. Apply below patch doesn't make things worse (still write speed > > > isn't stable, other io is stable), so I can't say if the patch passes > > > the test, but it appears latency reported by fio hasn't change. 
I adopt > > > the slice_idle * dispatched approach, which I thought should be safe. > > > > I'm doing some tests right now on a single ncq rotational disk, and > > the average service time when submitting with a high depth is halved > > w.r.t. depth 1, so I think you could test also with the formula : > > slice_idle * dispatched / 2. It should give a performance boost, > > without noticeable impact on latency. > > > > But I guess the right comparison here would service times vary when we > push queue depths from 4 to higher (as done by this patch). Were you > running deep seeky queues or sequential queues. Curious to know whether > service times reduced even in case of deep seeky queues on this single > disk. > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > dispatch of more requests from the same queue. I had kind of liked the > idea of respecting cfq_quantum. Especially it can help in testing. With > this patch cfq_quantum will more or less loose its meaning. > I guess this is a question of soft limit and hard limit. Maybe we can bump up the default cfq_quantum to 8 and internally define a soft limit of 50% of cfq_quantum. So we will start throttling the number of requests from a queue when cfqq->dispatched reaches 4, but we will allow more dispatches up to cfq_quantum based on how much slice is left and what's the possibility that the already dispatched requests will finish within the slice. That way we will maintain the existing behavior and the meaning of cfq_quantum, and possibly still get the performance improvement in the said case. Vivek > > > Currently a queue can only dispatch up to 4 requests if there are other queues. > > > This isn't optimal, device can handle more requests, for example, AHCI can > > > handle 31 requests. I can understand the limit is for fairness, but we could > > > do a tweak: if the queue still has a lot of slice left, sounds we could ignore > > > the limit. 
> > > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium. > > > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium. > > > This might cause latency issue if the queue is preempted at the very beginning. > > > > > > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last > > > post, but also is a big improvement. > > > > Acked-by: Corrado Zoccolo <czoccolo@gmail.com> > > > > > > > > Signed-off-by: Shaohua Li <shaohua.li@intel.com> > > > --- > > > block/cfq-iosched.c | 15 ++++++++++++++- > > > 1 file changed, 14 insertions(+), 1 deletion(-) > > > > > > Index: linux-2.6/block/cfq-iosched.c > > > =================================================================== > > > --- linux-2.6.orig/block/cfq-iosched.c > > > +++ linux-2.6/block/cfq-iosched.c > > > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf > > > return dispatched; > > > } > > > > > > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, > > > + struct cfq_queue *cfqq) > > > +{ > > > + /* the queue hasn't finished any request, can't estimate */ > > > + if (cfq_cfqq_slice_new(cfqq)) > > > + return true; > > > + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, > > > + cfqq->slice_end)) > > > + return true; > > > + > > > + return false; > > > +} > > > + > > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > > { > > > unsigned int max_dispatch; > > > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_ > > > /* > > > * We have other queues, don't allow more IO from this one > > > */ > > > - if (cfqd->busy_queues > 1) > > > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > > > return false; > > > > > > /* > > > > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > 
Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 24+ messages in thread
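Vivek's soft/hard-limit proposal above can be sketched as follows. This is a hypothetical model of the idea in the mail, not an actual kernel patch: the hard cap is cfq_quantum (bumped to 8), the soft limit is half of it, and dispatches between the two limits are allowed only while the slice_idle-based estimate says the outstanding requests can still complete within the slice. Times are plain millisecond integers in place of jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the soft/hard quantum split: below quantum/2 dispatch freely,
 * above quantum never dispatch, and in between dispatch only while the
 * estimated drain time (one slice_idle per outstanding request) still
 * fits inside the remaining slice. */
static bool may_dispatch(unsigned int dispatched, unsigned int quantum,
                         unsigned long now_ms, unsigned long slice_end_ms,
                         unsigned long slice_idle_ms)
{
    if (dispatched >= quantum)        /* hard limit always holds */
        return false;
    if (dispatched < quantum / 2)     /* under the soft limit: free */
        return true;
    /* between soft and hard limit: only if completions fit in the slice */
    return now_ms + slice_idle_ms * dispatched <= slice_end_ms;
}
```

The appeal of this shape is exactly what the mail argues: cfq_quantum keeps its meaning as a hard cap (useful for testing), while deep devices still get driven harder early in the slice.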
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-08 17:15 ` Vivek Goyal 2010-01-08 17:40 ` Vivek Goyal @ 2010-01-08 20:35 ` Corrado Zoccolo 2010-01-08 20:59 ` Vivek Goyal 1 sibling, 1 reply; 24+ messages in thread From: Corrado Zoccolo @ 2010-01-08 20:35 UTC (permalink / raw) To: Vivek Goyal; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote: >> Hi Shahoua, >> >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: >> >> Hi Shaohua, >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could >> >> >> > do some tweaks: >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit >> >> >> ok. You can even scale the limit proportionally to the remaining slice >> >> >> (see below). >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means >> >> > dispatched requests can finish before slice is used, so other queues will not be >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the >> >> > dispatched requests. >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request >> >> on most disks. If you have more requests dispatched on a >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a >> >> linear formula is not the most accurate, but still more accurate than >> >> taking just 1 cfq_slice_idle. 
If you can experiment a bit, you could >> >> also try: >> >> cfq_slice_idle * ilog2(nr_dispatched+1) >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) >> >> >> >> > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care >> >> >> > about latency should be happy to have device fully piped on. >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the >> >> >> same by setting the quantum to 32. >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove >> >> > the check in next post. >> >> Great. >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time >> >> >> > without patch: 78m/s >> >> >> > with tweak 1: 138m/s >> >> >> > with two tweaks and disable latency: 156m/s >> >> >> >> >> >> Please, test also with competing seq/random(depth1)/async workloads, >> >> >> and measure also introduced latencies. >> >> > depth1 should be ok, as if device can only send one request, it should not require >> >> > more requests from ioscheduler. >> >> I mean have a run with, at the same time: >> >> * one seq reader, >> >> * h random readers with depth 1 (non-aio) >> >> * one async seq writer >> >> * k random readers with large depth. >> >> In this way, you can see if the changes you introduce to boost your >> >> workload affect more realistic scenarios, in which various workloads >> >> are mixed. >> >> I explicitly add the depth1 random readers, since they are sceduled >> >> differently than the large (>4) depth ones. >> > I tried a fio script which does like your description, but the data >> > isn't stable, especially the write speed, other kind of io speed is >> > stable. Apply below patch doesn't make things worse (still write speed >> > isn't stable, other io is stable), so I can't say if the patch passes >> > the test, but it appears latency reported by fio hasn't change. I adopt >> > the slice_idle * dispatched approach, which I thought should be safe. 
>> >> I'm doing some tests right now on a single ncq rotational disk, and >> the average service time when submitting with a high depth is halved >> w.r.t. depth 1, so I think you could test also with the formula : >> slice_idle * dispatched / 2. It should give a performance boost, >> without noticeable impact on latency. >> > > But I guess the right comparison here would service times vary when we > push queue depths from 4 to higher (as done by this patch). I think here we want to determine the average cost of a request when there are many submitted. > Were you > running deep seeky queues or sequential queues. Curious to know whether > service times reduced even in case of deep seeky queues on this single > disk. Seeky queues. Seeks were rather small (not more than 1/64 of the whole disk), but already meaningful for comparison. > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > dispatch of more requests from the same queue. I had kind of liked the > idea of respecting cfq_quantum. Especially it can help in testing. With > this patch cfq_quantum will more or less loose its meaning. cfq_quantum will still be enforced at the end of the slice, so its meaning of how many requests can still be pending when you finish your slice is preserved. One can argue, instead, that this reduces a bit the effectiveness of preemption on ncq disks. However, I don't think preemption is the solution for low latency, while cfq_quantum reduction is. With this change in place, we could change the default cfq_quantum to a smaller number (ideally 1), to have the lowest number of leftovers when the slice finishes, while still driving deep queues at the beginning of the slice. This needs thorough testing, though. Maybe it is better to delay those changes to 2.6.34... Thanks, Corrado > > Vivek > >> > Currently a queue can only dispatch up to 4 requests if there are other queues. 
>> > This isn't optimal, device can handle more requests, for example, AHCI can >> > handle 31 requests. I can understand the limit is for fairness, but we could >> > do a tweak: if the queue still has a lot of slice left, sounds we could ignore >> > the limit. >> > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium. >> > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium. >> > This might cause latency issue if the queue is preempted at the very beginning. >> > >> > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last >> > post, but also is a big improvement. >> >> Acked-by: Corrado Zoccolo <czoccolo@gmail.com> >> >> > >> > Signed-off-by: Shaohua Li <shaohua.li@intel.com> >> > --- >> > block/cfq-iosched.c | 15 ++++++++++++++- >> > 1 file changed, 14 insertions(+), 1 deletion(-) >> > >> > Index: linux-2.6/block/cfq-iosched.c >> > =================================================================== >> > --- linux-2.6.orig/block/cfq-iosched.c >> > +++ linux-2.6/block/cfq-iosched.c >> > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf >> > return dispatched; >> > } >> > >> > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, >> > + struct cfq_queue *cfqq) >> > +{ >> > + /* the queue hasn't finished any request, can't estimate */ >> > + if (cfq_cfqq_slice_new(cfqq)) >> > + return true; >> > + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, >> > + cfqq->slice_end)) >> > + return true; >> > + >> > + return false; >> > +} >> > + >> > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) >> > { >> > unsigned int max_dispatch; >> > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_ >> > /* >> > * We have other queues, don't allow more IO from this one >> > */ >> > - if (cfqd->busy_queues > 1) >> > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) >> > return false; >> > >> > /* ^ permalink raw 
reply [flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-08 20:35 ` Corrado Zoccolo @ 2010-01-08 20:59 ` Vivek Goyal 2010-01-11 2:34 ` Shaohua Li 0 siblings, 1 reply; 24+ messages in thread From: Vivek Goyal @ 2010-01-08 20:59 UTC (permalink / raw) To: Corrado Zoccolo; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote: > On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote: > >> Hi Shahoua, > >> > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > >> >> Hi Shaohua, > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > >> >> >> > do some tweaks: > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > >> >> >> ok. You can even scale the limit proportionally to the remaining slice > >> >> >> (see below). > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > >> >> > dispatched requests can finish before slice is used, so other queues will not be > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > >> >> > dispatched requests. > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > >> >> on most disks. If you have more requests dispatched on a > >> >> NCQ-rotational disk (non-RAID), it will take more time. 
Probably a > >> >> linear formula is not the most accurate, but still more accurate than > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could > >> >> also try: > >> >> cfq_slice_idle * ilog2(nr_dispatched+1) > >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > >> >> > >> >> > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > >> >> >> > about latency should be happy to have device fully piped on. > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the > >> >> >> same by setting the quantum to 32. > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > >> >> > the check in next post. > >> >> Great. > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time > >> >> >> > without patch: 78m/s > >> >> >> > with tweak 1: 138m/s > >> >> >> > with two tweaks and disable latency: 156m/s > >> >> >> > >> >> >> Please, test also with competing seq/random(depth1)/async workloads, > >> >> >> and measure also introduced latencies. > >> >> > depth1 should be ok, as if device can only send one request, it should not require > >> >> > more requests from ioscheduler. > >> >> I mean have a run with, at the same time: > >> >> * one seq reader, > >> >> * h random readers with depth 1 (non-aio) > >> >> * one async seq writer > >> >> * k random readers with large depth. > >> >> In this way, you can see if the changes you introduce to boost your > >> >> workload affect more realistic scenarios, in which various workloads > >> >> are mixed. > >> >> I explicitly add the depth1 random readers, since they are sceduled > >> >> differently than the large (>4) depth ones. > >> > I tried a fio script which does like your description, but the data > >> > isn't stable, especially the write speed, other kind of io speed is > >> > stable. 
Apply below patch doesn't make things worse (still write speed > >> > isn't stable, other io is stable), so I can't say if the patch passes > >> > the test, but it appears latency reported by fio hasn't change. I adopt > >> > the slice_idle * dispatched approach, which I thought should be safe. > >> > >> I'm doing some tests right now on a single ncq rotational disk, and > >> the average service time when submitting with a high depth is halved > >> w.r.t. depth 1, so I think you could test also with the formula : > >> slice_idle * dispatched / 2. It should give a performance boost, > >> without noticeable impact on latency. > >> > > > > But I guess the right comparison here would service times vary when we > > push queue depths from 4 to higher (as done by this patch). > > I think here we want to determine the average cost of a request, when > there are many submitted. > > > Were you > > running deep seeky queues or sequential queues. Curious to know whether > > service times reduced even in case of deep seeky queues on this single > > disk. > > Seeky queues. Seeks where rather small (not more than 1/64 of the > whole disk), but already meaningful for comparison. > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > > dispatch of more requests from the same queue. I had kind of liked the > > idea of respecting cfq_quantum. Especially it can help in testing. With > > this patch cfq_quantum will more or less loose its meaning. > cfq_quantum will still be enforced at the end of the slice, so its > meaning of how many requests can be still pending when you finish your > slice is preserved. Not always, and it will depend on how accurate your approximation of service time is. If per-request completion time is more than the approximation (in this case slice_idle), then you will end up with more requests in the dispatch queue from one cfqq at the time of slice expiry.
> > One can argue, instead, that this reduces a bit the effectiveness of > preemption on ncq disks. > However, I don't think preemption is the solution for low latency, > while cfq_quantum reduction is. > With this change in place, we could change the default cfq_quantum to > a smaller number (ideally 1), to have lowest number of leftovers when > the slice finishes, while still driving deep queues at the beginning > of the slice. I think using cfq_quantum as a hard limit might be a better idea, as it gives more predictable control, instead of treating it as a soft limit and trying to meet it at the end of slice expiry based on our approximation of predicted completion time. > > This needs thorough testing, though. Maybe it is better to delay those > changes to 2.6.34... Agreed. This should be tested more thoroughly and should be a candidate for 2.6.34. Thanks Vivek > > Thanks, > Corrado > > > > > Vivek > > > >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > >> > This isn't optimal, device can handle more requests, for example, AHCI can > >> > handle 31 requests. I can understand the limit is for fairness, but we could > >> > do a tweak: if the queue still has a lot of slice left, sounds we could ignore > >> > the limit. > >> > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium. > >> > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium. > >> > This might cause latency issue if the queue is preempted at the very beginning. > >> > > >> > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last > >> > post, but also is a big improvement.
> >> > >> Acked-by: Corrado Zoccolo <czoccolo@gmail.com> > >> > >> > > >> > Signed-off-by: Shaohua Li <shaohua.li@intel.com> > >> > --- > >> > block/cfq-iosched.c | 15 ++++++++++++++- > >> > 1 file changed, 14 insertions(+), 1 deletion(-) > >> > > >> > Index: linux-2.6/block/cfq-iosched.c > >> > =================================================================== > >> > --- linux-2.6.orig/block/cfq-iosched.c > >> > +++ linux-2.6/block/cfq-iosched.c > >> > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf > >> > return dispatched; > >> > } > >> > > >> > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, > >> > + struct cfq_queue *cfqq) > >> > +{ > >> > + /* the queue hasn't finished any request, can't estimate */ > >> > + if (cfq_cfqq_slice_new(cfqq)) > >> > + return true; > >> > + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, > >> > + cfqq->slice_end)) > >> > + return true; > >> > + > >> > + return false; > >> > +} > >> > + > >> > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > >> > { > >> > unsigned int max_dispatch; > >> > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_ > >> > /* > >> > * We have other queues, don't allow more IO from this one > >> > */ > >> > - if (cfqd->busy_queues > 1) > >> > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > >> > return false; > >> > > >> > /* ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-08 20:59 ` Vivek Goyal @ 2010-01-11 2:34 ` Shaohua Li 2010-01-11 17:03 ` Vivek Goyal 0 siblings, 1 reply; 24+ messages in thread From: Shaohua Li @ 2010-01-11 2:34 UTC (permalink / raw) To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Sat, Jan 09, 2010 at 04:59:48AM +0800, Vivek Goyal wrote: > On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote: > > On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote: > > >> Hi Shahoua, > > >> > > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > > >> >> Hi Shaohua, > > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > > >> >> >> > do some tweaks: > > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > > >> >> >> ok. You can even scale the limit proportionally to the remaining slice > > >> >> >> (see below). > > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > > >> >> > dispatched requests can finish before slice is used, so other queues will not be > > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > > >> >> > dispatched requests. > > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > > >> >> on most disks. 
If you have more requests dispatched on a > > >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a > > >> >> linear formula is not the most accurate, but still more accurate than > > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could > > >> >> also try: > > >> >> cfq_slice_idle * ilog2(nr_dispatched+1) > > >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > > >> >> > > >> >> > > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > > >> >> >> > about latency should be happy to have device fully piped on. > > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the > > >> >> >> same by setting the quantum to 32. > > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > > >> >> > the check in next post. > > >> >> Great. > > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time > > >> >> >> > without patch: 78m/s > > >> >> >> > with tweak 1: 138m/s > > >> >> >> > with two tweaks and disable latency: 156m/s > > >> >> >> > > >> >> >> Please, test also with competing seq/random(depth1)/async workloads, > > >> >> >> and measure also introduced latencies. > > >> >> > depth1 should be ok, as if device can only send one request, it should not require > > >> >> > more requests from ioscheduler. > > >> >> I mean have a run with, at the same time: > > >> >> * one seq reader, > > >> >> * h random readers with depth 1 (non-aio) > > >> >> * one async seq writer > > >> >> * k random readers with large depth. > > >> >> In this way, you can see if the changes you introduce to boost your > > >> >> workload affect more realistic scenarios, in which various workloads > > >> >> are mixed. > > >> >> I explicitly add the depth1 random readers, since they are sceduled > > >> >> differently than the large (>4) depth ones. 
> > >> > I tried a fio script which does like your description, but the data > > >> > isn't stable, especially the write speed, other kind of io speed is > > >> > stable. Apply below patch doesn't make things worse (still write speed > > >> > isn't stable, other io is stable), so I can't say if the patch passes > > >> > the test, but it appears latency reported by fio hasn't change. I adopt > > >> > the slice_idle * dispatched approach, which I thought should be safe. > > >> > > >> I'm doing some tests right now on a single ncq rotational disk, and > > >> the average service time when submitting with a high depth is halved > > >> w.r.t. depth 1, so I think you could test also with the formula : > > >> slice_idle * dispatched / 2. It should give a performance boost, > > >> without noticeable impact on latency. > > >> > > > > > > But I guess the right comparison here would service times vary when we > > > push queue depths from 4 to higher (as done by this patch). > > > > I think here we want to determine the average cost of a request, when > > there are many submitted. > > > > > Were you > > > running deep seeky queues or sequential queues. Curious to know whether > > > service times reduced even in case of deep seeky queues on this single > > > disk. > > > > Seeky queues. Seeks where rather small (not more than 1/64 of the > > whole disk), but already meaningful for comparison. > > > > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > > > dispatch of more requests from the same queue. I had kind of liked the > > > idea of respecting cfq_quantum. Especially it can help in testing. With > > > this patch cfq_quantum will more or less loose its meaning. > > cfq_quantum will still be enforced at the end of the slice, so its > > meaning of how many requests can be still pending when you finish your > > slice is preserved. > > Not always and it will depend how accurate your approximation of service > time is. 
If per request completion time is more than approximation (in > this case slice_idle), than you will end up with more requests in dispatch > queue from one cfqq at the time of slice expiry. We have used slice_idle for a long time with no complaints, so assume the approximation of service time is good. > > > > One can argue, instead, that this reduces a bit the effectiveness of > > preemption on ncq disks. > > However, I don't think preemption is the solution for low latency, > > while cfq_quantum reduction is. > > With this change in place, we could change the default cfq_quantum to > > a smaller number (ideally 1), to have lowest number of leftovers when > > the slice finishes, while still driving deep queues at the beginning > > of the slice. > > I think using cfq_quantum as hard limit might be a better idea as it gives > more predictable control. Instead of treating it as soft limit and trying > to meet it at the end of slice expiry based on our approximation of > predicted completion time. The current patch has such a hard limit too (100ms/8ms = 12 for sync io and 40ms/8ms = 5 for async io). > > This needs thorough testing, though. Maybe it is better to delay those > changes to 2.6.34... > > Agreed. This should be tested more throughly and should be candidate for > 2.6.34. Sure, this needs a lot of testing. Thanks, Shaohua ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-11 2:34 ` Shaohua Li @ 2010-01-11 17:03 ` Vivek Goyal 2010-01-12 3:07 ` Shaohua Li 0 siblings, 1 reply; 24+ messages in thread From: Vivek Goyal @ 2010-01-11 17:03 UTC (permalink / raw) To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Mon, Jan 11, 2010 at 10:34:09AM +0800, Shaohua Li wrote: > On Sat, Jan 09, 2010 at 04:59:48AM +0800, Vivek Goyal wrote: > > On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote: > > > On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote: > > > >> Hi Shahoua, > > > >> > > > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > > > >> >> Hi Shaohua, > > > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > > > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > > > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > > > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > > > >> >> >> > do some tweaks: > > > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > > > >> >> >> ok. You can even scale the limit proportionally to the remaining slice > > > >> >> >> (see below). > > > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > > > >> >> > dispatched requests can finish before slice is used, so other queues will not be > > > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the > > > >> >> > dispatched requests. 
> > > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > > > >> >> on most disks. If you have more requests dispatched on a > > > >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a > > > >> >> linear formula is not the most accurate, but still more accurate than > > > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could > > > >> >> also try: > > > >> >> cfq_slice_idle * ilog2(nr_dispatched+1) > > > >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > > > >> >> > > > >> >> > > > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > > > >> >> >> > about latency should be happy to have device fully piped on. > > > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the > > > >> >> >> same by setting the quantum to 32. > > > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > > > >> >> > the check in next post. > > > >> >> Great. > > > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time > > > >> >> >> > without patch: 78m/s > > > >> >> >> > with tweak 1: 138m/s > > > >> >> >> > with two tweaks and disable latency: 156m/s > > > >> >> >> > > > >> >> >> Please, test also with competing seq/random(depth1)/async workloads, > > > >> >> >> and measure also introduced latencies. > > > >> >> > depth1 should be ok, as if device can only send one request, it should not require > > > >> >> > more requests from ioscheduler. > > > >> >> I mean have a run with, at the same time: > > > >> >> * one seq reader, > > > >> >> * h random readers with depth 1 (non-aio) > > > >> >> * one async seq writer > > > >> >> * k random readers with large depth. > > > >> >> In this way, you can see if the changes you introduce to boost your > > > >> >> workload affect more realistic scenarios, in which various workloads > > > >> >> are mixed. 
> > > >> >> I explicitly add the depth1 random readers, since they are sceduled > > > >> >> differently than the large (>4) depth ones. > > > >> > I tried a fio script which does like your description, but the data > > > >> > isn't stable, especially the write speed, other kind of io speed is > > > >> > stable. Apply below patch doesn't make things worse (still write speed > > > >> > isn't stable, other io is stable), so I can't say if the patch passes > > > >> > the test, but it appears latency reported by fio hasn't change. I adopt > > > >> > the slice_idle * dispatched approach, which I thought should be safe. > > > >> > > > >> I'm doing some tests right now on a single ncq rotational disk, and > > > >> the average service time when submitting with a high depth is halved > > > >> w.r.t. depth 1, so I think you could test also with the formula : > > > >> slice_idle * dispatched / 2. It should give a performance boost, > > > >> without noticeable impact on latency. > > > >> > > > > > > > > But I guess the right comparison here would service times vary when we > > > > push queue depths from 4 to higher (as done by this patch). > > > > > > I think here we want to determine the average cost of a request, when > > > there are many submitted. > > > > > > > Were you > > > > running deep seeky queues or sequential queues. Curious to know whether > > > > service times reduced even in case of deep seeky queues on this single > > > > disk. > > > > > > Seeky queues. Seeks where rather small (not more than 1/64 of the > > > whole disk), but already meaningful for comparison. > > > > > > > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > > > > dispatch of more requests from the same queue. I had kind of liked the > > > > idea of respecting cfq_quantum. Especially it can help in testing. With > > > > this patch cfq_quantum will more or less loose its meaning. 
> > > cfq_quantum will still be enforced at the end of the slice, so its > > > meaning of how many requests can be still pending when you finish your > > > slice is preserved. > > > > Not always and it will depend how accurate your approximation of service > > time is. If per request completion time is more than approximation (in > > this case slice_idle), than you will end up with more requests in dispatch > > queue from one cfqq at the time of slice expiry. > we use slice_idle for a long time and no complain. So assume the approximation > of service time is good. slice_idle is a variable and user can easily change it to 1ms and even 0. In that case you will be theoritically be ready to dispatch 100/1 requests from the cfqq? > > > > > > > One can argue, instead, that this reduces a bit the effectiveness of > > > preemption on ncq disks. > > > However, I don't think preemption is the solution for low latency, > > > while cfq_quantum reduction is. > > > With this change in place, we could change the default cfq_quantum to > > > a smaller number (ideally 1), to have lowest number of leftovers when > > > the slice finishes, while still driving deep queues at the beginning > > > of the slice. > > > > I think using cfq_quantum as hard limit might be a better idea as it gives > > more predictable control. Instead of treating it as soft limit and trying > > to meet it at the end of slice expiry based on our approximation of > > predicted completion time. > Current patch has such hard limit too (100ms/8m = 12 for sync io and 40ms/8 > = 5 for async io). This is software logic driven and not cfq_quantum driven. We can always keep on changing how to approximate service time completions. So a user first needs to read the code, derive internal limits and then do testing? I think than tunable looses its significance. 
That's why I am advocating treating cfq_quantum as a hard limit, deriving an internal soft limit based on a certain % of the hard limit, and using that as the default max queue depth for a cfqq. In this case the user knows that, no matter what, you are not dispatching more than cfq_quantum requests from a queue at a time. > > > > This needs thorough testing, though. Maybe it is better to delay those > > > changes to 2.6.34... > > > > Agreed. This should be tested more throughly and should be candidate for > > 2.6.34. > Sure, this needs a lot of test. > > Thanks, > Shaohua ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-11 17:03 ` Vivek Goyal @ 2010-01-12 3:07 ` Shaohua Li 2010-01-12 15:48 ` Vivek Goyal 0 siblings, 1 reply; 24+ messages in thread From: Shaohua Li @ 2010-01-12 3:07 UTC (permalink / raw) To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Tue, Jan 12, 2010 at 01:03:39AM +0800, Vivek Goyal wrote: > On Mon, Jan 11, 2010 at 10:34:09AM +0800, Shaohua Li wrote: > > On Sat, Jan 09, 2010 at 04:59:48AM +0800, Vivek Goyal wrote: > > > On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote: > > > > On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote: > > > > >> Hi Shahoua, > > > > >> > > > > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote: > > > > >> >> Hi Shaohua, > > > > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote: > > > > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote: > > > > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues. > > > > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can > > > > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could > > > > >> >> >> > do some tweaks: > > > > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit > > > > >> >> >> ok. You can even scale the limit proportionally to the remaining slice > > > > >> >> >> (see below). > > > > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means > > > > >> >> > dispatched requests can finish before slice is used, so other queues will not be > > > > >> >> > impacted. 
I thought/hope a cfq_slice_idle time is enough to finish the > > > > >> >> > dispatched requests. > > > > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request > > > > >> >> on most disks. If you have more requests dispatched on a > > > > >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a > > > > >> >> linear formula is not the most accurate, but still more accurate than > > > > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could > > > > >> >> also try: > > > > >> >> cfq_slice_idle * ilog2(nr_dispatched+1) > > > > >> >> cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1)) > > > > >> >> > > > > >> >> > > > > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care > > > > >> >> >> > about latency should be happy to have device fully piped on. > > > > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the > > > > >> >> >> same by setting the quantum to 32. > > > > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove > > > > >> >> > the check in next post. > > > > >> >> Great. > > > > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time > > > > >> >> >> > without patch: 78m/s > > > > >> >> >> > with tweak 1: 138m/s > > > > >> >> >> > with two tweaks and disable latency: 156m/s > > > > >> >> >> > > > > >> >> >> Please, test also with competing seq/random(depth1)/async workloads, > > > > >> >> >> and measure also introduced latencies. > > > > >> >> > depth1 should be ok, as if device can only send one request, it should not require > > > > >> >> > more requests from ioscheduler. > > > > >> >> I mean have a run with, at the same time: > > > > >> >> * one seq reader, > > > > >> >> * h random readers with depth 1 (non-aio) > > > > >> >> * one async seq writer > > > > >> >> * k random readers with large depth. 
> > > > >> >> In this way, you can see if the changes you introduce to boost your > > > > >> >> workload affect more realistic scenarios, in which various workloads > > > > >> >> are mixed. > > > > >> >> I explicitly add the depth1 random readers, since they are sceduled > > > > >> >> differently than the large (>4) depth ones. > > > > >> > I tried a fio script which does like your description, but the data > > > > >> > isn't stable, especially the write speed, other kind of io speed is > > > > >> > stable. Apply below patch doesn't make things worse (still write speed > > > > >> > isn't stable, other io is stable), so I can't say if the patch passes > > > > >> > the test, but it appears latency reported by fio hasn't change. I adopt > > > > >> > the slice_idle * dispatched approach, which I thought should be safe. > > > > >> > > > > >> I'm doing some tests right now on a single ncq rotational disk, and > > > > >> the average service time when submitting with a high depth is halved > > > > >> w.r.t. depth 1, so I think you could test also with the formula : > > > > >> slice_idle * dispatched / 2. It should give a performance boost, > > > > >> without noticeable impact on latency. > > > > >> > > > > > > > > > > But I guess the right comparison here would service times vary when we > > > > > push queue depths from 4 to higher (as done by this patch). > > > > > > > > I think here we want to determine the average cost of a request, when > > > > there are many submitted. > > > > > > > > > Were you > > > > > running deep seeky queues or sequential queues. Curious to know whether > > > > > service times reduced even in case of deep seeky queues on this single > > > > > disk. > > > > > > > > Seeky queues. Seeks where rather small (not more than 1/64 of the > > > > whole disk), but already meaningful for comparison. > > > > > > > > > > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > > > > > dispatch of more requests from the same queue. 
I had kind of liked the > > > > > idea of respecting cfq_quantum. Especially it can help in testing. With > > > > > this patch cfq_quantum will more or less loose its meaning. > > > > cfq_quantum will still be enforced at the end of the slice, so its > > > > meaning of how many requests can be still pending when you finish your > > > > slice is preserved. > > > > > > Not always and it will depend how accurate your approximation of service > > > time is. If per request completion time is more than approximation (in > > > this case slice_idle), than you will end up with more requests in dispatch > > > queue from one cfqq at the time of slice expiry. > > we use slice_idle for a long time and no complain. So assume the approximation > > of service time is good. > > slice_idle is a variable and user can easily change it to 1ms and even 0. > In that case you will be theoritically be ready to dispatch 100/1 requests > from the cfqq? A user changing it should know what he is doing. A less-experienced user can mess up a lot of things, but that is not something we need to care about. > > > > One can argue, instead, that this reduces a bit the effectiveness of > > > > preemption on ncq disks. > > > > However, I don't think preemption is the solution for low latency, > > > > while cfq_quantum reduction is. > > > > With this change in place, we could change the default cfq_quantum to > > > > a smaller number (ideally 1), to have lowest number of leftovers when > > > > the slice finishes, while still driving deep queues at the beginning > > > > of the slice. > > > > > > I think using cfq_quantum as hard limit might be a better idea as it gives > > > more predictable control. Instead of treating it as soft limit and trying > > > to meet it at the end of slice expiry based on our approximation of > > > predicted completion time. > > Current patch has such hard limit too (100ms/8m = 12 for sync io and 40ms/8 > > = 5 for async io). > > This is software logic driven and not cfq_quantum driven.
We can always > keep on changing how to approximate service time completions. So a user > first needs to read the code, derive internal limits and then do testing? > > I think than tunable looses its significance. That's why I am advocating > of treating cfq_quantum as hard limit and derive an internal soft limit > based on certain % of hard limit and use that as default max queue depth > for cfqq. > > In this case user knows no matter what, you are not dispatching more than > cfq_quantum requests from a queue at a time. ok, then the question is which value should cfq_quantum have. I have a test with below patch. Its performance still is good. With hard limit 8, speed is 100m/s. without hard limit, speed is 102m/s. Currently a queue can only dispatch up to 4 requests if there are other queues. This isn't optimal, device can handle more requests, for example, AHCI can handle 31 requests. I can understand the limit is for fairness, but we could do a tweak: if the queue still has a lot of slice left, sounds we could ignore the limit. Test shows this boost my workload (two thread randread of a SSD) from 78m/s to 100m/s. 
Signed-off-by: Shaohua Li <shaohua.li@intel.com> --- block/cfq-iosched.c | 24 +++++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) Index: linux-2.6/block/cfq-iosched.c =================================================================== --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -19,7 +19,7 @@ * tunables */ /* max queue in one round of service */ -static const int cfq_quantum = 4; +static const int cfq_quantum = 8; static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 }; /* maximum backwards seek, in KiB */ static const int cfq_back_max = 16 * 1024; @@ -32,6 +32,8 @@ static int cfq_slice_idle = HZ / 125; static const int cfq_target_latency = HZ * 3/10; /* 300 ms */ static const int cfq_hist_divisor = 4; +#define CFQ_SOFT_QUANTUM (4) + /* * offset from end of service tree */ @@ -2242,6 +2244,19 @@ static int cfq_forced_dispatch(struct cf return dispatched; } +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, + struct cfq_queue *cfqq) +{ + /* the queue hasn't finished any request, can't estimate */ + if (cfq_cfqq_slice_new(cfqq) || cfqq->dispatched >= cfqd->cfq_quantum) + return 1; + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, + cfqq->slice_end)) + return 1; + + return 0; +} + static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) { unsigned int max_dispatch; @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) return false; - max_dispatch = cfqd->cfq_quantum; + max_dispatch = cfqd->cfq_quantum / 2; + if (max_dispatch < CFQ_SOFT_QUANTUM) + max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM, + cfqd->cfq_quantum); if (cfq_class_idle(cfqq)) max_dispatch = 1; @@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_ /* * We have other queues, don't allow more IO from this one */ - if (cfqd->busy_queues > 1) + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) return false; /* ^ permalink 
raw reply [flat|nested] 24+ messages in thread
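The slice-budget estimate in the patch above can be sketched as a standalone model; plain integers stand in for jiffies and for the cfq_data/cfq_queue fields, and the kernel's wrap-safe time_after() is replaced by a plain comparison (this is an illustration of the check, not the kernel code itself):

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of the patch's cfq_slice_used_soon() estimate:
 * each request already dispatched is assumed to cost roughly
 * slice_idle jiffies to complete, so the queue counts as "slice
 * used soon" once the estimated completion time would run past
 * slice_end, or once cfq_quantum requests are already in flight. */
struct cfqq_model {
    bool slice_new;           /* no request completed yet this slice */
    unsigned int dispatched;  /* requests currently in flight */
    unsigned long slice_end;  /* jiffies value at which the slice expires */
};

static bool slice_used_soon(unsigned long jiffies, unsigned long slice_idle,
                            unsigned int quantum, const struct cfqq_model *q)
{
    /* the queue hasn't finished any request, can't estimate */
    if (q->slice_new || q->dispatched >= quantum)
        return true;
    /* time_after() in the kernel; a plain comparison suffices here */
    return jiffies + slice_idle * q->dispatched > q->slice_end;
}
```

With example numbers: a queue 4 requests deep with slice_idle of 8 jiffies and 100 jiffies of slice left keeps dispatching (estimated 32 jiffies of backlog), while the same queue with only 10 jiffies of slice left is throttled.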
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-12 3:07 ` Shaohua Li @ 2010-01-12 15:48 ` Vivek Goyal 2010-01-13 8:17 ` Shaohua Li 0 siblings, 1 reply; 24+ messages in thread From: Vivek Goyal @ 2010-01-12 15:48 UTC (permalink / raw) To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Tue, Jan 12, 2010 at 11:07:56AM +0800, Shaohua Li wrote: [..] > > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > > > > > > dispatch of more requests from the same queue. I had kind of liked the > > > > > > idea of respecting cfq_quantum. Especially it can help in testing. With > > > > > > this patch cfq_quantum will more or less loose its meaning. > > > > > cfq_quantum will still be enforced at the end of the slice, so its > > > > > meaning of how many requests can be still pending when you finish your > > > > > slice is preserved. > > > > > > > > Not always and it will depend how accurate your approximation of service > > > > time is. If per request completion time is more than approximation (in > > > > this case slice_idle), than you will end up with more requests in dispatch > > > > queue from one cfqq at the time of slice expiry. > > > we use slice_idle for a long time and no complain. So assume the approximation > > > of service time is good. > > > > slice_idle is a variable and user can easily change it to 1ms and even 0. > > In that case you will be theoritically be ready to dispatch 100/1 requests > > from the cfqq? > User changing it should know what he does. A less-experienced user can mess a lot > of things, which we don't care. > The point is that there is no obivious co-relation between slice_idle and cfq_quantum. Even an experienced user would not expect that changing slice_idle silently will enable dispatching more requests from each cfqq. > > > > > One can argue, instead, that this reduces a bit the effectiveness of > > > > > preemption on ncq disks. 
> > > > > However, I don't think preemption is the solution for low latency, > > > > > while cfq_quantum reduction is. > > > > > With this change in place, we could change the default cfq_quantum to > > > > > a smaller number (ideally 1), to have lowest number of leftovers when > > > > > the slice finishes, while still driving deep queues at the beginning > > > > > of the slice. > > > > > > > > I think using cfq_quantum as hard limit might be a better idea as it gives > > > > more predictable control. Instead of treating it as soft limit and trying > > > > to meet it at the end of slice expiry based on our approximation of > > > > predicted completion time. > > > Current patch has such hard limit too (100ms/8m = 12 for sync io and 40ms/8 > > > = 5 for async io). > > > > This is software logic driven and not cfq_quantum driven. We can always > > keep on changing how to approximate service time completions. So a user > > first needs to read the code, derive internal limits and then do testing? > > > > I think than tunable looses its significance. That's why I am advocating > > of treating cfq_quantum as hard limit and derive an internal soft limit > > based on certain % of hard limit and use that as default max queue depth > > for cfqq. > > > > In this case user knows no matter what, you are not dispatching more than > > cfq_quantum requests from a queue at a time. > ok, then the question is which value should cfq_quantum have. I have a test with > below patch. Its performance still is good. With hard limit 8, speed is 100m/s. > without hard limit, speed is 102m/s. > > > Currently a queue can only dispatch up to 4 requests if there are other queues. > This isn't optimal, device can handle more requests, for example, AHCI can > handle 31 requests. I can understand the limit is for fairness, but we could > do a tweak: if the queue still has a lot of slice left, sounds we could > ignore the limit. 
> Test shows this boost my workload (two thread randread of a SSD) from 78m/s > to 100m/s. > > Signed-off-by: Shaohua Li <shaohua.li@intel.com> > --- > block/cfq-iosched.c | 24 +++++++++++++++++++++--- > 1 file changed, 21 insertions(+), 3 deletions(-) > > Index: linux-2.6/block/cfq-iosched.c > =================================================================== > --- linux-2.6.orig/block/cfq-iosched.c > +++ linux-2.6/block/cfq-iosched.c > @@ -19,7 +19,7 @@ > * tunables > */ > /* max queue in one round of service */ > -static const int cfq_quantum = 4; > +static const int cfq_quantum = 8; > static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 }; > /* maximum backwards seek, in KiB */ > static const int cfq_back_max = 16 * 1024; > @@ -32,6 +32,8 @@ static int cfq_slice_idle = HZ / 125; > static const int cfq_target_latency = HZ * 3/10; /* 300 ms */ > static const int cfq_hist_divisor = 4; > > +#define CFQ_SOFT_QUANTUM (4) > + > /* > * offset from end of service tree > */ > @@ -2242,6 +2244,19 @@ static int cfq_forced_dispatch(struct cf > return dispatched; > } > > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, > + struct cfq_queue *cfqq) > +{ > + /* the queue hasn't finished any request, can't estimate */ > + if (cfq_cfqq_slice_new(cfqq) || cfqq->dispatched >= cfqd->cfq_quantum) > + return 1; > + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, > + cfqq->slice_end)) > + return 1; > + > + return 0; > +} > + > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > { > unsigned int max_dispatch; > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > return false; > > - max_dispatch = cfqd->cfq_quantum; > + max_dispatch = cfqd->cfq_quantum / 2; > + if (max_dispatch < CFQ_SOFT_QUANTUM) We don't have to hardcode CFQ_SOFT_QUANTUM or in fact we don't need it. We can derive the soft limit from hard limit (cfq_quantum). 
Say soft limit will be 50% of cfq_quantum value. > + max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM, > + cfqd->cfq_quantum); > if (cfq_class_idle(cfqq)) > max_dispatch = 1; > > @@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_ > /* > * We have other queues, don't allow more IO from this one > */ > - if (cfqd->busy_queues > 1) > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > return false; So I guess here we can write something as follows. if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) return false; if (cfqd->busy_queues == 1) max_dispatch = -1; else /* * Normally we start throttling cfqq when cfq_quantum/2 * requests have been dispatched. But we can drive * deeper queue depths at the beginning of slice * subjected to upper limit of cfq_quantum. */ max_dispatch = cfqd->cfq_quantum; Thanks Vivek ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-12 15:48 ` Vivek Goyal @ 2010-01-13 8:17 ` Shaohua Li 2010-01-13 11:18 ` Vivek Goyal 0 siblings, 1 reply; 24+ messages in thread From: Shaohua Li @ 2010-01-13 8:17 UTC (permalink / raw) To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Tue, Jan 12, 2010 at 11:48:20PM +0800, Vivek Goyal wrote: > On Tue, Jan 12, 2010 at 11:07:56AM +0800, Shaohua Li wrote: > > [..] > > > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow > > > > > > > dispatch of more requests from the same queue. I had kind of liked the > > > > > > > idea of respecting cfq_quantum. Especially it can help in testing. With > > > > > > > this patch cfq_quantum will more or less loose its meaning. > > > > > > cfq_quantum will still be enforced at the end of the slice, so its > > > > > > meaning of how many requests can be still pending when you finish your > > > > > > slice is preserved. > > > > > > > > > > Not always and it will depend how accurate your approximation of service > > > > > time is. If per request completion time is more than approximation (in > > > > > this case slice_idle), than you will end up with more requests in dispatch > > > > > queue from one cfqq at the time of slice expiry. > > > > we use slice_idle for a long time and no complain. So assume the approximation > > > > of service time is good. > > > > > > slice_idle is a variable and user can easily change it to 1ms and even 0. > > > In that case you will be theoritically be ready to dispatch 100/1 requests > > > from the cfqq? > > User changing it should know what he does. A less-experienced user can mess a lot > > of things, which we don't care. > > > > The point is that there is no obivious co-relation between slice_idle and > cfq_quantum. Even an experienced user would not expect that changing > slice_idle silently will enable dispatching more requests from each cfqq. 
Agreed, slice_idle has no direct relationship with cfq_quantum. Yes, there are more requests dispatched, but it shouldn't impact user experience. If it does, then the patch fails. > > 1 file changed, 21 insertions(+), 3 deletions(-) > > > > Index: linux-2.6/block/cfq-iosched.c > > =================================================================== > > --- linux-2.6.orig/block/cfq-iosched.c > > +++ linux-2.6/block/cfq-iosched.c > > @@ -19,7 +19,7 @@ > > * tunables > > */ > > /* max queue in one round of service */ > > -static const int cfq_quantum = 4; > > +static const int cfq_quantum = 8; > > static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 }; > > /* maximum backwards seek, in KiB */ > > static const int cfq_back_max = 16 * 1024; > > @@ -32,6 +32,8 @@ static int cfq_slice_idle = HZ / 125; > > static const int cfq_target_latency = HZ * 3/10; /* 300 ms */ > > static const int cfq_hist_divisor = 4; > > > > +#define CFQ_SOFT_QUANTUM (4) > > + > > /* > > * offset from end of service tree > > */ > > @@ -2242,6 +2244,19 @@ static int cfq_forced_dispatch(struct cf > > return dispatched; > > } > > > > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, > > + struct cfq_queue *cfqq) > > +{ > > + /* the queue hasn't finished any request, can't estimate */ > > + if (cfq_cfqq_slice_new(cfqq) || cfqq->dispatched >= cfqd->cfq_quantum) > > + return 1; > > + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, > > + cfqq->slice_end)) > > + return 1; > > + > > + return 0; > > +} > > + > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > { > > unsigned int max_dispatch; > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ > > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > > return false; > > > > - max_dispatch = cfqd->cfq_quantum; > > + max_dispatch = cfqd->cfq_quantum / 2; > > + if (max_dispatch < CFQ_SOFT_QUANTUM) > > We don't have to hardcode CFQ_SOFT_QUANTUM or in fact we don't need it.
We can > derive the soft limit from hard limit (cfq_quantum). Say soft limit will be > 50% of cfq_quantum value. I'm hoping this doesn't give user a surprise. Say cfq_quantum sets to 7, then we start doing throttling from 3 requests. Adding the CFQ_SOFT_QUANTUM gives a compatibility against old behavior at least. Am I over thinking? > > + max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM, > > + cfqd->cfq_quantum); > > if (cfq_class_idle(cfqq)) > > max_dispatch = 1; > > > > @@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_ > > /* > > * We have other queues, don't allow more IO from this one > > */ > > - if (cfqd->busy_queues > 1) > > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > > return false; > > So I guess here we can write something as follows. > > if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > return false; > > if (cfqd->busy_queues == 1) > max_dispatch = -1; > else > /* > * Normally we start throttling cfqq when cfq_quantum/2 > * requests have been dispatched. But we can drive > * deeper queue depths at the beginning of slice > * subjected to upper limit of cfq_quantum. > */ > max_dispatch = cfqd->cfq_quantum; ok. Thanks, Shaohua ^ permalink raw reply [flat|nested] 24+ messages in thread
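The two soft-limit derivations being debated can be made concrete with a small standalone calculation; CFQ_SOFT_QUANTUM and the /2 rule are taken from the patch text, and nothing here is kernel code:

```c
#include <assert.h>

#define CFQ_SOFT_QUANTUM 4

/* The patch's scheme: throttle at cfq_quantum/2, but keep the
 * historical depth of 4 as a floor, never exceeding cfq_quantum
 * itself (the min_t(CFQ_SOFT_QUANTUM, cfq_quantum) in the diff). */
static unsigned int soft_limit_patch(unsigned int quantum)
{
    unsigned int max_dispatch = quantum / 2;
    if (max_dispatch < CFQ_SOFT_QUANTUM)
        max_dispatch = quantum < CFQ_SOFT_QUANTUM ? quantum : CFQ_SOFT_QUANTUM;
    return max_dispatch;
}

/* Vivek's suggestion: derive the soft limit purely as 50% of the
 * hard limit, with no hardcoded constant. */
static unsigned int soft_limit_derived(unsigned int quantum)
{
    unsigned int max_dispatch = quantum / 2;
    return max_dispatch ? max_dispatch : 1;
}
```

For cfq_quantum=7 the patch's scheme still starts throttling at depth 4 (the old behavior), while the purely derived limit starts throttling at depth 3, which is exactly the surprise under discussion.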
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-13 8:17 ` Shaohua Li @ 2010-01-13 11:18 ` Vivek Goyal 2010-01-14 4:16 ` Shaohua Li 0 siblings, 1 reply; 24+ messages in thread From: Vivek Goyal @ 2010-01-13 11:18 UTC (permalink / raw) To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote: [..] > > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > > { > > > unsigned int max_dispatch; > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ > > > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > > > return false; > > > > > > - max_dispatch = cfqd->cfq_quantum; > > > + max_dispatch = cfqd->cfq_quantum / 2; > > > + if (max_dispatch < CFQ_SOFT_QUANTUM) > > > > We don't have to hardcode CFQ_SOFT_QUANTUM or in fact we don't need it. We can > > derive the soft limit from hard limit (cfq_quantum). Say soft limit will be > > 50% of cfq_quantum value. > I'm hoping this doesn't give user a surprise. Say cfq_quantum sets to 7, then we > start doing throttling from 3 requests. Adding the CFQ_SOFT_QUANTUM gives a compatibility > against old behavior at least. Am I over thinking? > I would not worry too much about that. If you are really worried about that, then create one Documentation/block/cfq-iosched.txt and document how cfq_quantum works so that users know that cfq_quantum is upper hard limit and internal soft limit is cfq_quantum/2. Thanks Vivek > > > + max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM, > > > + cfqd->cfq_quantum); > > > if (cfq_class_idle(cfqq)) > > > max_dispatch = 1; > > > > > > @@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_ > > > /* > > > * We have other queues, don't allow more IO from this one > > > */ > > > - if (cfqd->busy_queues > 1) > > > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > > > return false; > > > > So I guess here we can write something as follows. 
> > > > if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > > return false; > > > > if (cfqd->busy_queues == 1) > > max_dispatch = -1; > > else > > /* > > * Normally we start throttling cfqq when cfq_quantum/2 > > * requests have been dispatched. But we can drive > > * deeper queue depths at the beginning of slice > > * subjected to upper limit of cfq_quantum. > > */ > > max_dispatch = cfqd->cfq_quantum; > ok. > > Thanks, > Shaohua ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-13 11:18 ` Vivek Goyal @ 2010-01-14 4:16 ` Shaohua Li 2010-01-14 11:31 ` Vivek Goyal 0 siblings, 1 reply; 24+ messages in thread From: Shaohua Li @ 2010-01-14 4:16 UTC (permalink / raw) To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote: > On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote: > [..] > > > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > > > { > > > > unsigned int max_dispatch; > > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ > > > > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > > > > return false; > > > > > > > > - max_dispatch = cfqd->cfq_quantum; > > > > + max_dispatch = cfqd->cfq_quantum / 2; > > > > + if (max_dispatch < CFQ_SOFT_QUANTUM) > > > > > > We don't have to hardcode CFQ_SOFT_QUANTUM or in fact we don't need it. We can > > > derive the soft limit from hard limit (cfq_quantum). Say soft limit will be > > > 50% of cfq_quantum value. > > I'm hoping this doesn't give user a surprise. Say cfq_quantum sets to 7, then we > > start doing throttling from 3 requests. Adding the CFQ_SOFT_QUANTUM gives a compatibility > > against old behavior at least. Am I over thinking? > > > > I would not worry too much about that. If you are really worried about > that, then create one Documentation/block/cfq-iosched.txt and document > how cfq_quantum works so that users know that cfq_quantum is upper hard > limit and internal soft limit is cfq_quantum/2. Good idea. Looks we don't document cfq tunnables, I'll try to do it later. Currently a queue can only dispatch up to 4 requests if there are other queues. This isn't optimal, device can handle more requests, for example, AHCI can handle 31 requests. I can understand the limit is for fairness, but we could do a tweak: if the queue still has a lot of slice left, sounds we could ignore the limit. 
Test shows this boost my workload (two thread randread of a SSD) from 78m/s to 100m/s. Signed-off-by: Shaohua Li <shaohua.li@intel.com> --- block/cfq-iosched.c | 30 ++++++++++++++++++++++++++---- 1 file changed, 26 insertions(+), 4 deletions(-) Index: linux-2.6/block/cfq-iosched.c =================================================================== --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -19,7 +19,7 @@ * tunables */ /* max queue in one round of service */ -static const int cfq_quantum = 4; +static const int cfq_quantum = 8; static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 }; /* maximum backwards seek, in KiB */ static const int cfq_back_max = 16 * 1024; @@ -2215,6 +2215,19 @@ static int cfq_forced_dispatch(struct cf return dispatched; } +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, + struct cfq_queue *cfqq) +{ + /* the queue hasn't finished any request, can't estimate */ + if (cfq_cfqq_slice_new(cfqq)) + return 1; + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, + cfqq->slice_end)) + return 1; + + return 0; +} + static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) { unsigned int max_dispatch; @@ -2231,7 +2244,7 @@ static bool cfq_may_dispatch(struct cfq_ if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) return false; - max_dispatch = cfqd->cfq_quantum; + max_dispatch = max_t(unsigned int, cfqd->cfq_quantum / 2, 1); if (cfq_class_idle(cfqq)) max_dispatch = 1; @@ -2248,13 +2261,22 @@ static bool cfq_may_dispatch(struct cfq_ /* * We have other queues, don't allow more IO from this one */ - if (cfqd->busy_queues > 1) + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) return false; /* * Sole queue user, no limit */ - max_dispatch = -1; + if (cfqd->busy_queues == 1) + max_dispatch = -1; + else + /* + * Normally we start throttling cfqq when cfq_quantum/2 + * requests have been dispatched. 
But we can drive + * deeper queue depths at the beginning of slice + * subjected to upper limit of cfq_quantum. + * */ + max_dispatch = cfqd->cfq_quantum; } /* ^ permalink raw reply [flat|nested] 24+ messages in thread
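Putting the pieces of this final version together, the dispatch-depth policy can be modeled as one small function; the idle-class and sync-flight special cases are omitted, and -1 stands for "no limit" as in the original code (a sketch of the decision flow, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Rough model of the depth decision in cfq_may_dispatch() after the
 * final patch: start throttling at cfq_quantum/2, allow the full
 * cfq_quantum while plenty of slice remains, and drop the limit
 * entirely when this is the only busy queue. */
static int max_dispatch_model(unsigned int quantum, int busy_queues,
                              bool slice_used_soon)
{
    int max_dispatch = quantum / 2 > 1 ? (int)(quantum / 2) : 1;

    if (busy_queues > 1 && slice_used_soon)
        return max_dispatch;  /* slice nearly over: keep the soft limit */
    if (busy_queues == 1)
        return -1;            /* sole queue user, no limit */
    return (int)quantum;      /* early in slice: drive a deeper queue */
}
```

So with the default cfq_quantum of 8, a competing queue ramps up to 8 in-flight requests early in its slice and falls back to 4 as the slice runs out.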
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-14 4:16 ` Shaohua Li @ 2010-01-14 11:31 ` Vivek Goyal 2010-01-14 13:49 ` Jens Axboe 2010-01-15 3:20 ` Li, Shaohua 0 siblings, 2 replies; 24+ messages in thread From: Vivek Goyal @ 2010-01-14 11:31 UTC (permalink / raw) To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin On Thu, Jan 14, 2010 at 12:16:24PM +0800, Shaohua Li wrote: > On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote: > > On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote: > > [..] > > > > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > > > > { > > > > > unsigned int max_dispatch; > > > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ > > > > > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > > > > > return false; > > > > > > > > > > - max_dispatch = cfqd->cfq_quantum; > > > > > + max_dispatch = cfqd->cfq_quantum / 2; > > > > > + if (max_dispatch < CFQ_SOFT_QUANTUM) > > > > > > > > We don't have to hardcode CFQ_SOFT_QUANTUM or in fact we don't need it. We can > > > > derive the soft limit from hard limit (cfq_quantum). Say soft limit will be > > > > 50% of cfq_quantum value. > > > I'm hoping this doesn't give user a surprise. Say cfq_quantum sets to 7, then we > > > start doing throttling from 3 requests. Adding the CFQ_SOFT_QUANTUM gives a compatibility > > > against old behavior at least. Am I over thinking? > > > > > > > I would not worry too much about that. If you are really worried about > > that, then create one Documentation/block/cfq-iosched.txt and document > > how cfq_quantum works so that users know that cfq_quantum is upper hard > > limit and internal soft limit is cfq_quantum/2. > Good idea. Looks we don't document cfq tunnables, I'll try to do it later. > > Currently a queue can only dispatch up to 4 requests if there are other queues. > This isn't optimal, device can handle more requests, for example, AHCI can > handle 31 requests. 
I can understand the limit is for fairness, but we could > do a tweak: if the queue still has a lot of slice left, sounds we could > ignore the limit. Hi Shaohua, This looks much better. Though I find the usage of "slice_idle" as a measure of service time a little unintuitive. Especially, I did some testing with slice_idle=0; in that case, we will be allowing dispatch of 8 requests from each queue even if the slice is about to expire. But I guess that's fine for the time being as the upper limit is still controlled by cfq_quantum. > Test shows this boost my workload (two thread randread of a SSD) from 78m/s > to 100m/s. Are these deep queue random reads (with higher iodepths, using libaio)? Have you done a similar test on some slower NCQ rotational hardware also and seen the impact on throughput and *max latency* of readers, especially in the presence of buffered writers. Thanks Vivek > > Signed-off-by: Shaohua Li <shaohua.li@intel.com> > --- > block/cfq-iosched.c | 30 ++++++++++++++++++++++++++---- > 1 file changed, 26 insertions(+), 4 deletions(-) > > Index: linux-2.6/block/cfq-iosched.c > =================================================================== > --- linux-2.6.orig/block/cfq-iosched.c > +++ linux-2.6/block/cfq-iosched.c > @@ -19,7 +19,7 @@ > * tunables > */ > /* max queue in one round of service */ > -static const int cfq_quantum = 4; > +static const int cfq_quantum = 8; > static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 }; > /* maximum backwards seek, in KiB */ > static const int cfq_back_max = 16 * 1024; > @@ -2215,6 +2215,19 @@ static int cfq_forced_dispatch(struct cf > return dispatched; > } > > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd, > + struct cfq_queue *cfqq) > +{ > + /* the queue hasn't finished any request, can't estimate */ > + if (cfq_cfqq_slice_new(cfqq)) > + return 1; > + if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched, > + cfqq->slice_end)) > + return 1; > + > + return 0; > +} > + > static bool
cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > { > unsigned int max_dispatch; > @@ -2231,7 +2244,7 @@ static bool cfq_may_dispatch(struct cfq_ > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > return false; > > - max_dispatch = cfqd->cfq_quantum; > + max_dispatch = max_t(unsigned int, cfqd->cfq_quantum / 2, 1); > if (cfq_class_idle(cfqq)) > max_dispatch = 1; > > @@ -2248,13 +2261,22 @@ static bool cfq_may_dispatch(struct cfq_ > /* > * We have other queues, don't allow more IO from this one > */ > - if (cfqd->busy_queues > 1) > + if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)) > return false; > > /* > * Sole queue user, no limit > */ > - max_dispatch = -1; > + if (cfqd->busy_queues == 1) > + max_dispatch = -1; > + else > + /* > + * Normally we start throttling cfqq when cfq_quantum/2 > + * requests have been dispatched. But we can drive > + * deeper queue depths at the beginning of slice > + * subjected to upper limit of cfq_quantum. > + * */ > + max_dispatch = cfqd->cfq_quantum; > } > > /* ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [RFC]cfq-iosched: quantum check tweak 2010-01-14 11:31 ` Vivek Goyal @ 2010-01-14 13:49 ` Jens Axboe 2010-01-15 3:20 ` Li, Shaohua 1 sibling, 0 replies; 24+ messages in thread From: Jens Axboe @ 2010-01-14 13:49 UTC (permalink / raw) To: Vivek Goyal; +Cc: Shaohua Li, Corrado Zoccolo, linux-kernel, Zhang, Yanmin On Thu, Jan 14 2010, Vivek Goyal wrote: > On Thu, Jan 14, 2010 at 12:16:24PM +0800, Shaohua Li wrote: > > On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote: > > > On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote: > > > [..] > > > > > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > > > > > > { > > > > > > unsigned int max_dispatch; > > > > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ > > > > > > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > > > > > > return false; > > > > > > > > > > > > - max_dispatch = cfqd->cfq_quantum; > > > > > > + max_dispatch = cfqd->cfq_quantum / 2; > > > > > > + if (max_dispatch < CFQ_SOFT_QUANTUM) > > > > > > > > > > We don't have to hardcode CFQ_SOFT_QUANTUM or in fact we don't need it. We can > > > > > derive the soft limit from hard limit (cfq_quantum). Say soft limit will be > > > > > 50% of cfq_quantum value. > > > > I'm hoping this doesn't give user a surprise. Say cfq_quantum sets to 7, then we > > > > start doing throttling from 3 requests. Adding the CFQ_SOFT_QUANTUM gives a compatibility > > > > against old behavior at least. Am I over thinking? > > > > > > > > > > I would not worry too much about that. If you are really worried about > > > that, then create one Documentation/block/cfq-iosched.txt and document > > > how cfq_quantum works so that users know that cfq_quantum is upper hard > > > limit and internal soft limit is cfq_quantum/2. > > Good idea. Looks we don't document cfq tunnables, I'll try to do it later. > > > > Currently a queue can only dispatch up to 4 requests if there are other queues. 
> > This isn't optimal, device can handle more requests, for example, AHCI can > > handle 31 requests. I can understand the limit is for fairness, but we could > > do a tweak: if the queue still has a lot of slice left, sounds we could > > ignore the limit. > > Hi Shaohua, > > This looks much better. Though usage of "slice_idle" as measure of service > times, I find little un-intutive. Especially, I do some testing with > slice_idle=0, in that case, we will be allowing dispatch of 8 requests > from each queue even if slice is about to expire. I agree this is problematic, but I also think we need to do something about the control of queuing depth. For most users, keeping it low is what they want - performance doesn't change much with higher depths, you only pay a latency cost when switching to a new queue. And they don't want that. But for other hardware, driving up the queue depth to what the hardware supports (potentially) can be a big win, and CFQ definitely needs to be able to do that. Write caches are again problematic in this area... For reads and writes on write-through caching, just looking at what this cfqq has already dispatched and completed in this slice would be sufficient. It could even be carried over to the next slice as a seed value, so you could dispatch more, earlier. What we want to avoid is stuffing the device queue with tons of writes that complete immediately, only to move the penalty of those requests into the slices of other queues. -- Jens Axboe ^ permalink raw reply [flat|nested] 24+ messages in thread
* RE: [RFC]cfq-iosched: quantum check tweak 2010-01-14 11:31 ` Vivek Goyal 2010-01-14 13:49 ` Jens Axboe @ 2010-01-15 3:20 ` Li, Shaohua 1 sibling, 0 replies; 24+ messages in thread From: Li, Shaohua @ 2010-01-15 3:20 UTC (permalink / raw) To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin >-----Original Message----- >From: Vivek Goyal [mailto:vgoyal@redhat.com] >Sent: Thursday, January 14, 2010 7:31 PM >To: Li, Shaohua >Cc: Corrado Zoccolo; linux-kernel@vger.kernel.org; jens.axboe@oracle.com; >Zhang, Yanmin >Subject: Re: [RFC]cfq-iosched: quantum check tweak > >On Thu, Jan 14, 2010 at 12:16:24PM +0800, Shaohua Li wrote: >> On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote: >> > On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote: >> > [..] >> > > > > static bool cfq_may_dispatch(struct cfq_data *cfqd, struct >cfq_queue *cfqq) >> > > > > { >> > > > > unsigned int max_dispatch; >> > > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_ >> > > > > if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) >> > > > > return false; >> > > > > >> > > > > - max_dispatch = cfqd->cfq_quantum; >> > > > > + max_dispatch = cfqd->cfq_quantum / 2; >> > > > > + if (max_dispatch < CFQ_SOFT_QUANTUM) >> > > > >> > > > We don't have to hardcode CFQ_SOFT_QUANTUM or in fact we don't >need it. We can >> > > > derive the soft limit from hard limit (cfq_quantum). Say soft >limit will be >> > > > 50% of cfq_quantum value. >> > > I'm hoping this doesn't give user a surprise. Say cfq_quantum sets >to 7, then we >> > > start doing throttling from 3 requests. Adding the CFQ_SOFT_QUANTUM >gives a compatibility >> > > against old behavior at least. Am I over thinking? >> > > >> > >> > I would not worry too much about that. 
>> > If you are really worried about
>> > that, then create one Documentation/block/cfq-iosched.txt and document
>> > how cfq_quantum works so that users know that cfq_quantum is the upper
>> > hard limit and the internal soft limit is cfq_quantum/2.
>> Good idea. It looks like we don't document the cfq tunables; I'll try to
>> do it later.
>>
>> Currently a queue can only dispatch up to 4 requests if there are other queues.
>> This isn't optimal, device can handle more requests, for example, AHCI can
>> handle 31 requests. I can understand the limit is for fairness, but we could
>> do a tweak: if the queue still has a lot of slice left, sounds we could
>> ignore the limit.
>
>Hi Shaohua,
>
>This looks much better. Though usage of "slice_idle" as measure of service
>times, I find little un-intutive. Especially, I do some testing with
>slice_idle=0, in that case, we will be allowing dispatch of 8 requests
>from each queue even if slice is about to expire.
>
>But I guess that's fine for the time being as the upper limit is still
>controlled by cfq_quantum.
>
>> Test shows this boost my workload (two thread randread of a SSD) from
>> 78m/s to 100m/s.
>
>Are these deep queue random reads (with higher iodepths, using libaio)?
>
>Have you done similar test on some slower NCQ rotational hardware also and
>seen the impact on throughput and *max latency* of readers, especially in
>the presence of buffered writers.

Tested on a 320 GB hard disk (ST3320620AS). Throughput improves about 6% and
average latency drops about 6% as well. Below is the fio output; I did 3 runs
for each case and the results were similar.
No patch case:

sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
Starting 2 processes

sdb: (groupid=0, jobs=2): err= 0: pid=3389
  read : io=90,900KiB, bw=755KiB/s, iops=188, runt=120336msec
    slat (usec): min=8, max=527K, avg=679.01, stdev=6101.05
    clat (usec): min=0, max=0, avg= 0.00, stdev= 0.00
    bw (KiB/s) : min= 0, max= 837, per=47.35%, avg=357.50, stdev=78.71
  cpu : usr=0.02%, sys=0.13%, ctx=22661, majf=0, minf=169
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.7%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued r/w: total=22725/0, short=0/0
     lat (msec): 10=0.04%, 20=1.34%, 50=7.42%, 100=8.05%, 250=31.58%
     lat (msec): 500=30.38%, 750=13.79%, 1000=5.27%, 2000=2.14%

Run status group 0 (all jobs):
   READ: io=90,900KiB, aggrb=755KiB/s, minb=755KiB/s, maxb=755KiB/s,
         mint=120336msec, maxt=120336msec

--------------------------------------------------------------------------
Patched case:

sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
Starting 2 processes

sdb: (groupid=0, jobs=2): err= 0: pid=4776
  read : io=98,140KiB, bw=815KiB/s, iops=203, runt=120323msec
    slat (usec): min=9, max=68, avg=11.23, stdev= 1.03
    clat (usec): min=0, max=0, avg= 0.00, stdev= 0.00
    bw (KiB/s) : min= 0, max= 534, per=47.28%, avg=385.32, stdev=74.37
  cpu : usr=0.04%, sys=0.13%, ctx=24523, majf=0, minf=188
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.7%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued r/w: total=24535/0, short=0/0
     lat (msec): 10=0.01%, 20=0.93%, 50=6.50%, 100=7.65%, 250=36.40%
     lat (msec): 500=31.81%, 750=11.24%, 1000=4.08%, 2000=1.38%

Run status group 0 (all jobs):
   READ: io=98,140KiB, aggrb=815KiB/s, minb=815KiB/s, maxb=815KiB/s,
         mint=120323msec, maxt=120323msec
end of thread, other threads: [~2010-01-15 3:20 UTC | newest]

Thread overview: 24+ messages
2009-12-25  9:10 [RFC]cfq-iosched: quantum check tweak Shaohua Li
2009-12-25  9:44 ` Corrado Zoccolo
2009-12-28  3:35 ` Shaohua Li
2009-12-28  9:02 ` Corrado Zoccolo
2010-01-07  2:04 ` Shaohua Li
2010-01-07 21:44 ` Corrado Zoccolo
2010-01-08  0:57 ` Shaohua Li
2010-01-08 20:22 ` Corrado Zoccolo
2010-01-11  1:49 ` Shaohua Li
2010-01-11  2:01 ` Shaohua Li
2010-01-08 17:15 ` Vivek Goyal
2010-01-08 17:40 ` Vivek Goyal
2010-01-08 20:35 ` Corrado Zoccolo
2010-01-08 20:59 ` Vivek Goyal
2010-01-11  2:34 ` Shaohua Li
2010-01-11 17:03 ` Vivek Goyal
2010-01-12  3:07 ` Shaohua Li
2010-01-12 15:48 ` Vivek Goyal
2010-01-13  8:17 ` Shaohua Li
2010-01-13 11:18 ` Vivek Goyal
2010-01-14  4:16 ` Shaohua Li
2010-01-14 11:31 ` Vivek Goyal
2010-01-14 13:49 ` Jens Axboe
2010-01-15  3:20 ` Li, Shaohua