* [RFC]cfq-iosched: quantum check tweak
@ 2009-12-25  9:10 Shaohua Li
  2009-12-25  9:44 ` Corrado Zoccolo
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2009-12-25  9:10 UTC (permalink / raw)
  To: linux-kernel; +Cc: jens.axboe, czoccolo, yanmin.zhang

Currently a queue can only dispatch up to 4 requests if there are other queues.
This isn't optimal: the device can handle more requests, for example AHCI can
handle 31 requests. I can understand the limit is for fairness, but we could
do some tweaks:
1. if the queue still has a lot of slice left, it sounds like we could ignore the limit
2. we could keep the check only when cfq_latency is on. Users who don't care
about latency should be happy to have the device fully piped on.

I ran a test of random direct I/O from two threads, each issuing 32 requests at a time:
without patch: 78 MB/s
with tweak 1: 138 MB/s
with both tweaks and low_latency disabled: 156 MB/s

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/cfq-iosched.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -2242,6 +2242,18 @@ static int cfq_forced_dispatch(struct cf
 	return dispatched;
 }
 
+static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	/* the queue hasn't finished any request, can't estimate */
+	if (cfq_cfqq_slice_new(cfqq))
+		return 1;
+	if (time_after(jiffies + cfqd->cfq_slice_idle, cfqq->slice_end))
+		return 1;
+
+	return 0;
+}
+
 static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
@@ -2275,7 +2287,8 @@ static bool cfq_may_dispatch(struct cfq_
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq)
+			&& cfqd->cfq_latency)
 			return false;
 
 		/*

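To make the check concrete: with the default slice_idle of 8ms, a queue keeps bypassing
the quantum until less than one idle period of its slice remains. A userspace model of
the v1 test (a sketch; the jiffies wrap-around handling of time_after() is simplified
to a plain comparison):

	#include <stdbool.h>

	/* v1 estimate: dispatch beyond the quantum is allowed only while
	 * at least one slice_idle period (default 8ms) of the slice is left */
	static bool slice_used_soon(unsigned long now_ms,
				    unsigned long slice_end_ms,
				    unsigned long slice_idle_ms)
	{
		return now_ms + slice_idle_ms > slice_end_ms;
	}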

* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-25  9:10 [RFC]cfq-iosched: quantum check tweak Shaohua Li
@ 2009-12-25  9:44 ` Corrado Zoccolo
  2009-12-28  3:35   ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Corrado Zoccolo @ 2009-12-25  9:44 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, jens.axboe, yanmin.zhang

On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> Currently a queue can only dispatch up to 4 requests if there are other queues.
> This isn't optimal, device can handle more requests, for example, AHCI can
> handle 31 requests. I can understand the limit is for fairness, but we could
> do some tweaks:
> 1. if the queue still has a lot of slice left, sounds we could ignore the limit
ok. You can even scale the limit proportionally to the remaining slice
(see below).

> 2. we could keep the check only when cfq_latency is on. For uses who don't care
> about latency should be happy to have device fully piped on.
I wouldn't overload low_latency with this meaning. You can obtain the
same by setting the quantum to 32.
>
> I have a test of random direct io of two threads, each has 32 requests one time
> without patch: 78m/s
> with tweak 1: 138m/s
> with two tweaks and disable latency: 156m/s

Please also test with competing seq/random(depth1)/async workloads,
and measure the introduced latencies as well.

>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/cfq-iosched.c |   15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -2242,6 +2242,18 @@ static int cfq_forced_dispatch(struct cf
>        return dispatched;
>  }
>
> +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> +       struct cfq_queue *cfqq)
> +{
> +       /* the queue hasn't finished any request, can't estimate */
> +       if (cfq_cfqq_slice_new(cfqq))
> +               return 1;
> +       if (time_after(jiffies + cfqd->cfq_slice_idle, cfqq->slice_end))
here I would use
jiffies + cfqd->cfq_slice_idle * (cfqq->queued[0] + cfqq->queued[1])
to obtain the proper scaling
> +               return 1;
> +
> +       return 0;
> +}
> +
>  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>        unsigned int max_dispatch;
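Folding the suggestion in, the function would read roughly as follows (a sketch, not a
posted patch; it assumes cfqq->queued[] holds the per-direction counts of requests still
queued in CFQ):

	static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
		struct cfq_queue *cfqq)
	{
		/* the queue hasn't finished any request, can't estimate */
		if (cfq_cfqq_slice_new(cfqq))
			return true;
		/* scale the lookahead by the number of queued requests */
		if (time_after(jiffies + cfqd->cfq_slice_idle *
				(cfqq->queued[0] + cfqq->queued[1]),
			       cfqq->slice_end))
			return true;

		return false;
	}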

Thanks,
Corrado

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------


* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-25  9:44 ` Corrado Zoccolo
@ 2009-12-28  3:35   ` Shaohua Li
  2009-12-28  9:02     ` Corrado Zoccolo
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2009-12-28  3:35 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > This isn't optimal, device can handle more requests, for example, AHCI can
> > handle 31 requests. I can understand the limit is for fairness, but we could
> > do some tweaks:
> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> ok. You can even scale the limit proportionally to the remaining slice
> (see below).
I can't understand the meaning of the scaling below. cfq_slice_used_soon() means
the dispatched requests can finish before the slice is used up, so other queues will
not be impacted. I thought/hoped one cfq_slice_idle is enough to finish the
dispatched requests.
 
> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> > about latency should be happy to have device fully piped on.
> I wouldn't overload low_latency with this meaning. You can obtain the
> same by setting the quantum to 32.
As this impacts fairness, I naturally thought we could use low_latency. I'll remove
the check in the next post.

> > I have a test of random direct io of two threads, each has 32 requests one time
> > without patch: 78m/s
> > with tweak 1: 138m/s
> > with two tweaks and disable latency: 156m/s
> 
> Please, test also with competing seq/random(depth1)/async workloads,
> and measure also introduced latencies.
depth1 should be ok: if a task can only send one request at a time, it will not require
more requests from the ioscheduler.
I'll do more checks. The time is hard to choose (I chose cfq_slice_idle here) to
balance throughput and latency. Do we have criteria to measure this? Say the patch
passes some tests; then it's ok for latency.

Thanks,
Shaohua


* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-28  3:35   ` Shaohua Li
@ 2009-12-28  9:02     ` Corrado Zoccolo
  2010-01-07  2:04       ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Corrado Zoccolo @ 2009-12-28  9:02 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

Hi Shaohua,
On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
>> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > Currently a queue can only dispatch up to 4 requests if there are other queues.
>> > This isn't optimal, device can handle more requests, for example, AHCI can
>> > handle 31 requests. I can understand the limit is for fairness, but we could
>> > do some tweaks:
>> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
>> ok. You can even scale the limit proportionally to the remaining slice
>> (see below).
> I can't understand the meaning of below scale. cfq_slice_used_soon() means
> dispatched requests can finish before slice is used, so other queues will not be
> impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> dispatched requests.
cfq_slice_idle is 8ms, which is the average time to complete 1 request
on most disks. If you have more requests dispatched on a
NCQ-rotational disk (non-RAID), it will take more time. Probably a
linear formula is not the most accurate, but still more accurate than
taking just 1 cfq_slice_idle. If you can experiment a bit, you could
also try:
 cfq_slice_idle * ilog2(nr_dispatched+1)
 cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
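To see how the proposed estimators grow, a small userspace comparison (a sketch; it
assumes slice_idle = 8ms and reimplements ilog2() as floor(log2)):

	#include <stdio.h>

	static unsigned my_ilog2(unsigned n)	/* floor(log2(n)), like kernel ilog2() */
	{
		unsigned l = 0;
		while (n >>= 1)
			l++;
		return l;
	}

	int main(void)
	{
		const unsigned slice_idle = 8;	/* ms */
		unsigned d;

		for (d = 1; d <= 31; d *= 2)
			printf("dispatched=%2u linear=%3ums log=%2ums half-log=%2ums\n",
			       d,
			       slice_idle * d,
			       slice_idle * my_ilog2(d + 1),
			       slice_idle * (1u << (my_ilog2(d + 1) >> 1)));
		return 0;
	}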

>
>> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
>> > about latency should be happy to have device fully piped on.
>> I wouldn't overload low_latency with this meaning. You can obtain the
>> same by setting the quantum to 32.
> As this impact fairness, so natually thought we could use low_latency. I'll remove
> the check in next post.
Great.
>> > I have a test of random direct io of two threads, each has 32 requests one time
>> > without patch: 78m/s
>> > with tweak 1: 138m/s
>> > with two tweaks and disable latency: 156m/s
>>
>> Please, test also with competing seq/random(depth1)/async workloads,
>> and measure also introduced latencies.
> depth1 should be ok, as if device can only send one request, it should not require
> more requests from ioscheduler.
I mean have a run with, at the same time:
* one seq reader,
* h random readers with depth 1 (non-aio)
* one async seq writer
* k random readers with large depth.
In this way, you can see if the changes you introduce to boost your
workload affect more realistic scenarios, in which various workloads
are mixed.
I explicitly add the depth1 random readers, since they are scheduled
differently than the large (>4) depth ones.

> I'll do more checks. The time is hard to choose (I choose cfq_slice-idle here) to
> balance thoughput and latency. Do we have creteria to measure this? See the patch
> passes some tests, so it's ok for latency.

Max latency should be near 300ms (compare with and without the patch),
for a reasonable number of concurrent processes (300ms/#proc <
2*idle_slice).

Thanks,
Corrado

>
> Thanks,
> Shaohua
>


* Re: [RFC]cfq-iosched: quantum check tweak
  2009-12-28  9:02     ` Corrado Zoccolo
@ 2010-01-07  2:04       ` Shaohua Li
  2010-01-07 21:44         ` Corrado Zoccolo
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2010-01-07  2:04 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

[-- Attachment #1: Type: text/plain, Size: 4977 bytes --]

On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> Hi Shaohua,
> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> > do some tweaks:
> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> >> ok. You can even scale the limit proportionally to the remaining slice
> >> (see below).
> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> > dispatched requests can finish before slice is used, so other queues will not be
> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> > dispatched requests.
> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> on most disks. If you have more requests dispatched on a
> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> linear formula is not the most accurate, but still more accurate than
> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> also try:
>  cfq_slice_idle * ilog2(nr_dispatched+1)
>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> 
> >
> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> >> > about latency should be happy to have device fully piped on.
> >> I wouldn't overload low_latency with this meaning. You can obtain the
> >> same by setting the quantum to 32.
> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> > the check in next post.
> Great.
> >> > I have a test of random direct io of two threads, each has 32 requests one time
> >> > without patch: 78m/s
> >> > with tweak 1: 138m/s
> >> > with two tweaks and disable latency: 156m/s
> >>
> >> Please, test also with competing seq/random(depth1)/async workloads,
> >> and measure also introduced latencies.
> > depth1 should be ok, as if device can only send one request, it should not require
> > more requests from ioscheduler.
> I mean have a run with, at the same time:
> * one seq reader,
> * h random readers with depth 1 (non-aio)
> * one async seq writer
> * k random readers with large depth.
> In this way, you can see if the changes you introduce to boost your
> workload affect more realistic scenarios, in which various workloads
> are mixed.
> I explicitly add the depth1 random readers, since they are sceduled
> differently than the large (>4) depth ones.
I tried a fio script that matches your description, but the data
isn't stable, especially the write speed; the other kinds of io speed are
stable. Applying the patch below doesn't make things worse (write speed still
isn't stable, other io is stable), so I can't say whether the patch passes
the test, but the latency reported by fio appears unchanged. I adopted
the slice_idle * dispatched approach, which I thought should be safe.

Currently a queue can only dispatch up to 4 requests if there are other queues.
This isn't optimal: the device can handle more requests, for example AHCI can
handle 31 requests. I can understand the limit is for fairness, but we could
do a tweak: if the queue still has a lot of slice left, it sounds like we could
ignore the limit.
For async io (40ms slice), 40ms/8ms = 5; minus the quantum (4), we send at most 1 extra request.
For sync io (100ms slice), 100ms/8ms = 12; minus the quantum, we might send at most 8 extra requests.
This might cause a latency issue if the queue is preempted at the very beginning.

This patch boosts my workload from 78 MB/s to 102 MB/s, which isn't as big a gain as in
my last post, but is still a big improvement.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/cfq-iosched.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
 	return dispatched;
 }
 
+static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	/* the queue hasn't finished any request, can't estimate */
+	if (cfq_cfqq_slice_new(cfqq))
+		return true;
+	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
+		cfqq->slice_end))
+		return true;
+
+	return false;
+}
+
 static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
@@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
 			return false;
 
 		/*


[-- Attachment #2: aiorandread --]
[-- Type: text/plain, Size: 396 bytes --]

[global]
runtime=120
ioscheduler=cfq
overwrite=1
loops=10
group_reporting

[seq-read]
directory=/mnt/b1
bs=4k
rw=read
ioengine=sync
size=1G

[seq-write]
directory=/mnt/b1
bs=4k
rw=write
ioengine=sync
size=1G

[rand-read]
directory=/mnt/b1
bs=4k
rw=randread
ioengine=sync
size=1G
numjobs=4

[rand-read-aio]
directory=/mnt/b1
bs=4k
iodepth=32
rw=randread
ioengine=libaio
direct=1
size=1G
numjobs=4
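
Assuming the attached job file is saved as "aiorandread" and /mnt/b1 is a filesystem
on the disk under test, it can be run as:

	$ fio aiorandread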


* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-07  2:04       ` Shaohua Li
@ 2010-01-07 21:44         ` Corrado Zoccolo
  2010-01-08  0:57           ` Shaohua Li
  2010-01-08 17:15           ` Vivek Goyal
  0 siblings, 2 replies; 24+ messages in thread
From: Corrado Zoccolo @ 2010-01-07 21:44 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

Hi Shaohua,

On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
>> Hi Shaohua,
>> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
>> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
>> >> > This isn't optimal, device can handle more requests, for example, AHCI can
>> >> > handle 31 requests. I can understand the limit is for fairness, but we could
>> >> > do some tweaks:
>> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
>> >> ok. You can even scale the limit proportionally to the remaining slice
>> >> (see below).
>> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
>> > dispatched requests can finish before slice is used, so other queues will not be
>> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
>> > dispatched requests.
>> cfq_slice_idle is 8ms, that is the average time to complete 1 request
>> on most disks. If you have more requests dispatched on a
>> NCQ-rotational disk (non-RAID), it will take more time. Probably a
>> linear formula is not the most accurate, but still more accurate than
>> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
>> also try:
>>  cfq_slice_idle * ilog2(nr_dispatched+1)
>>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
>>
>> >
>> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
>> >> > about latency should be happy to have device fully piped on.
>> >> I wouldn't overload low_latency with this meaning. You can obtain the
>> >> same by setting the quantum to 32.
>> > As this impact fairness, so natually thought we could use low_latency. I'll remove
>> > the check in next post.
>> Great.
>> >> > I have a test of random direct io of two threads, each has 32 requests one time
>> >> > without patch: 78m/s
>> >> > with tweak 1: 138m/s
>> >> > with two tweaks and disable latency: 156m/s
>> >>
>> >> Please, test also with competing seq/random(depth1)/async workloads,
>> >> and measure also introduced latencies.
>> > depth1 should be ok, as if device can only send one request, it should not require
>> > more requests from ioscheduler.
>> I mean have a run with, at the same time:
>> * one seq reader,
>> * h random readers with depth 1 (non-aio)
>> * one async seq writer
>> * k random readers with large depth.
>> In this way, you can see if the changes you introduce to boost your
>> workload affect more realistic scenarios, in which various workloads
>> are mixed.
>> I explicitly add the depth1 random readers, since they are sceduled
>> differently than the large (>4) depth ones.
> I tried a fio script which does like your description, but the data
> isn't stable, especially the write speed, other kind of io speed is
> stable. Apply below patch doesn't make things worse (still write speed
> isn't stable, other io is stable), so I can't say if the patch passes
> the test, but it appears latency reported by fio hasn't change. I adopt
> the slice_idle * dispatched approach, which I thought should be safe.

I'm doing some tests right now on a single ncq rotational disk, and
the average service time when submitting with a high depth is halved
w.r.t. depth 1, so I think you could also test with the formula:
slice_idle * dispatched / 2. It should give a performance boost,
without noticeable impact on latency.
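In code, that variant would change only the estimate in cfq_slice_used_soon()
(a sketch):

	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched / 2,
		       cfqq->slice_end))
		return true;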

> Currently a queue can only dispatch up to 4 requests if there are other queues.
> This isn't optimal, device can handle more requests, for example, AHCI can
> handle 31 requests. I can understand the limit is for fairness, but we could
> do a tweak: if the queue still has a lot of slice left, sounds we could ignore
> the limit.
> For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium.
> For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium.
> This might cause latency issue if the queue is preempted at the very beginning.
>
> This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last
> post, but also is a big improvement.

Acked-by: Corrado Zoccolo <czoccolo@gmail.com>

>
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/cfq-iosched.c |   15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
>        return dispatched;
>  }
>
> +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> +       struct cfq_queue *cfqq)
> +{
> +       /* the queue hasn't finished any request, can't estimate */
> +       if (cfq_cfqq_slice_new(cfqq))
> +               return true;
> +       if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> +               cfqq->slice_end))
> +               return true;
> +
> +       return false;
> +}
> +
>  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>        unsigned int max_dispatch;
> @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
>                /*
>                 * We have other queues, don't allow more IO from this one
>                 */
> -               if (cfqd->busy_queues > 1)
> +               if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
>                        return false;
>
>                /*
>
>


* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-07 21:44         ` Corrado Zoccolo
@ 2010-01-08  0:57           ` Shaohua Li
  2010-01-08 20:22             ` Corrado Zoccolo
  2010-01-08 17:15           ` Vivek Goyal
  1 sibling, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2010-01-08  0:57 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote:
> Hi Shahoua,
> 
> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> >> Hi Shaohua,
> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> >> > do some tweaks:
> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> >> >> ok. You can even scale the limit proportionally to the remaining slice
> >> >> (see below).
> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> >> > dispatched requests can finish before slice is used, so other queues will not be
> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> >> > dispatched requests.
> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> >> on most disks. If you have more requests dispatched on a
> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> >> linear formula is not the most accurate, but still more accurate than
> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> >> also try:
> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> >>
> >> >
> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> >> >> > about latency should be happy to have device fully piped on.
> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> >> >> same by setting the quantum to 32.
> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> >> > the check in next post.
> >> Great.
> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> >> >> > without patch: 78m/s
> >> >> > with tweak 1: 138m/s
> >> >> > with two tweaks and disable latency: 156m/s
> >> >>
> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> >> >> and measure also introduced latencies.
> >> > depth1 should be ok, as if device can only send one request, it should not require
> >> > more requests from ioscheduler.
> >> I mean have a run with, at the same time:
> >> * one seq reader,
> >> * h random readers with depth 1 (non-aio)
> >> * one async seq writer
> >> * k random readers with large depth.
> >> In this way, you can see if the changes you introduce to boost your
> >> workload affect more realistic scenarios, in which various workloads
> >> are mixed.
> >> I explicitly add the depth1 random readers, since they are sceduled
> >> differently than the large (>4) depth ones.
> > I tried a fio script which does like your description, but the data
> > isn't stable, especially the write speed, other kind of io speed is
> > stable. Apply below patch doesn't make things worse (still write speed
> > isn't stable, other io is stable), so I can't say if the patch passes
> > the test, but it appears latency reported by fio hasn't change. I adopt
> > the slice_idle * dispatched approach, which I thought should be safe.
> 
> I'm doing some tests right now on a single ncq rotational disk, and
> the average service time when submitting with a high depth is halved
> w.r.t. depth 1, so I think you could test also with the formula :
> slice_idle * dispatched / 2. It should give a performance boost,
> without noticeable impact on latency
Thanks for looking at it. Can you forward your tests to me so I can
check here? I'll try a more aggressive formula next. It shouldn't impact
my SSD, which is very fast; each request takes less than 1ms. We'd better
have a mechanism to measure device speed, but jiffies isn't good. I'm thinking
of using sched_clock(), which has its own issues too, like drift between CPUs.

Thanks,
Shaohua


* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-07 21:44         ` Corrado Zoccolo
  2010-01-08  0:57           ` Shaohua Li
@ 2010-01-08 17:15           ` Vivek Goyal
  2010-01-08 17:40             ` Vivek Goyal
  2010-01-08 20:35             ` Corrado Zoccolo
  1 sibling, 2 replies; 24+ messages in thread
From: Vivek Goyal @ 2010-01-08 17:15 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin

On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote:
> Hi Shahoua,
> 
> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> >> Hi Shaohua,
> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> >> > do some tweaks:
> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> >> >> ok. You can even scale the limit proportionally to the remaining slice
> >> >> (see below).
> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> >> > dispatched requests can finish before slice is used, so other queues will not be
> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> >> > dispatched requests.
> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> >> on most disks. If you have more requests dispatched on a
> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> >> linear formula is not the most accurate, but still more accurate than
> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> >> also try:
> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> >>
> >> >
> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> >> >> > about latency should be happy to have device fully piped on.
> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> >> >> same by setting the quantum to 32.
> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> >> > the check in next post.
> >> Great.
> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> >> >> > without patch: 78m/s
> >> >> > with tweak 1: 138m/s
> >> >> > with two tweaks and disable latency: 156m/s
> >> >>
> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> >> >> and measure also introduced latencies.
> >> > depth1 should be ok, as if device can only send one request, it should not require
> >> > more requests from ioscheduler.
> >> I mean have a run with, at the same time:
> >> * one seq reader,
> >> * h random readers with depth 1 (non-aio)
> >> * one async seq writer
> >> * k random readers with large depth.
> >> In this way, you can see if the changes you introduce to boost your
> >> workload affect more realistic scenarios, in which various workloads
> >> are mixed.
> >> I explicitly add the depth1 random readers, since they are sceduled
> >> differently than the large (>4) depth ones.
> > I tried a fio script which does like your description, but the data
> > isn't stable, especially the write speed, other kind of io speed is
> > stable. Apply below patch doesn't make things worse (still write speed
> > isn't stable, other io is stable), so I can't say if the patch passes
> > the test, but it appears latency reported by fio hasn't change. I adopt
> > the slice_idle * dispatched approach, which I thought should be safe.
> 
> I'm doing some tests right now on a single ncq rotational disk, and
> the average service time when submitting with a high depth is halved
> w.r.t. depth 1, so I think you could test also with the formula :
> slice_idle * dispatched / 2. It should give a performance boost,
> without noticeable impact on latency.
> 

But I guess the right comparison here would be how service times vary when we
push queue depths from 4 to higher (as done by this patch). Were you
running deep seeky queues or sequential queues? Curious to know whether
service times reduced even in the case of deep seeky queues on this single
disk.

I think this patch breaks the meaning of cfq_quantum? Now we can allow
dispatch of more requests from the same queue. I had kind of liked the
idea of respecting cfq_quantum; especially, it can help in testing. With
this patch cfq_quantum will more or less lose its meaning.

Vivek

> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > This isn't optimal, device can handle more requests, for example, AHCI can
> > handle 31 requests. I can understand the limit is for fairness, but we could
> > do a tweak: if the queue still has a lot of slice left, sounds we could ignore
> > the limit.
> > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium.
> > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium.
> > This might cause latency issue if the queue is preempted at the very beginning.
> >
> > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last
> > post, but also is a big improvement.
> 
> Acked-by: Corrado Zoccolo <czoccolo@gmail.com>
> 
> >
> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > ---
> >  block/cfq-iosched.c |   15 ++++++++++++++-
> >  1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6/block/cfq-iosched.c
> > ===================================================================
> > --- linux-2.6.orig/block/cfq-iosched.c
> > +++ linux-2.6/block/cfq-iosched.c
> > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
> >        return dispatched;
> >  }
> >
> > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> > +       struct cfq_queue *cfqq)
> > +{
> > +       /* the queue hasn't finished any request, can't estimate */
> > +       if (cfq_cfqq_slice_new(cfqq))
> > +               return true;
> > +       if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> > +               cfqq->slice_end))
> > +               return true;
> > +
> > +       return false;
> > +}
> > +
> >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >  {
> >        unsigned int max_dispatch;
> > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
> >                /*
> >                 * We have other queues, don't allow more IO from this one
> >                 */
> > -               if (cfqd->busy_queues > 1)
> > +               if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
> >                        return false;
> >
> >                /*
> >
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-08 17:15           ` Vivek Goyal
@ 2010-01-08 17:40             ` Vivek Goyal
  2010-01-08 20:35             ` Corrado Zoccolo
  1 sibling, 0 replies; 24+ messages in thread
From: Vivek Goyal @ 2010-01-08 17:40 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin

On Fri, Jan 08, 2010 at 12:15:35PM -0500, Vivek Goyal wrote:
> On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote:
> > Hi Shahoua,
> > 
> > On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> > >> Hi Shaohua,
> > >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> > >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> > >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> > >> >> > do some tweaks:
> > >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> > >> >> ok. You can even scale the limit proportionally to the remaining slice
> > >> >> (see below).
> > >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> > >> > dispatched requests can finish before slice is used, so other queues will not be
> > >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> > >> > dispatched requests.
> > >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> > >> on most disks. If you have more requests dispatched on a
> > >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> > >> linear formula is not the most accurate, but still more accurate than
> > >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> > >> also try:
> > >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> > >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> > >>
> > >> >
> > >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> > >> >> > about latency should be happy to have device fully piped on.
> > >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> > >> >> same by setting the quantum to 32.
> > >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> > >> > the check in next post.
> > >> Great.
> > >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> > >> >> > without patch: 78m/s
> > >> >> > with tweak 1: 138m/s
> > >> >> > with two tweaks and disable latency: 156m/s
> > >> >>
> > >> >> Please, test also with competing seq/random(depth1)/async workloads,
> > >> >> and measure also introduced latencies.
> > >> > depth1 should be ok, as if device can only send one request, it should not require
> > >> > more requests from ioscheduler.
> > >> I mean have a run with, at the same time:
> > >> * one seq reader,
> > >> * h random readers with depth 1 (non-aio)
> > >> * one async seq writer
> > >> * k random readers with large depth.
> > >> In this way, you can see if the changes you introduce to boost your
> > >> workload affect more realistic scenarios, in which various workloads
> > >> are mixed.
> > >> I explicitly add the depth1 random readers, since they are sceduled
> > >> differently than the large (>4) depth ones.
> > > I tried a fio script which does like your description, but the data
> > > isn't stable, especially the write speed, other kind of io speed is
> > > stable. Apply below patch doesn't make things worse (still write speed
> > > isn't stable, other io is stable), so I can't say if the patch passes
> > > the test, but it appears latency reported by fio hasn't change. I adopt
> > > the slice_idle * dispatched approach, which I thought should be safe.
> > 
> > I'm doing some tests right now on a single ncq rotational disk, and
> > the average service time when submitting with a high depth is halved
> > w.r.t. depth 1, so I think you could test also with the formula :
> > slice_idle * dispatched / 2. It should give a performance boost,
> > without noticeable impact on latency.
> > 
> 
> But I guess the right comparison here would service times vary when we
> push queue depths from 4 to higher (as done by this patch). Were you
> running deep seeky queues or sequential queues. Curious to know whether
> service times reduced even in case of deep seeky queues on this single
> disk.
> 
> I think this patch breaks the meaning of cfq_quantum? Now we can allow
> dispatch of more requests from the same queue. I had kind of liked the
> idea of respecting cfq_quantum. Especially it can help in testing. With
> this patch cfq_quantum will more or less loose its meaning.
> 

I guess this is a question of a soft limit and a hard limit. Maybe we can
bump up the default cfq_quantum to 8 and internally define a soft limit of
50% of cfq_quantum. So we will start throttling the number of requests
from a queue when cfqq->dispatched reaches 4, but will allow more
dispatches up to cfq_quantum based on how much slice is left and what the
possibility is that already-dispatched requests will finish within the
slice.

That way we will maintain the existing behavior and meaning of cfq_quantum,
as well as possibly get a performance improvement in the said case.
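
A rough sketch of that soft/hard split inside cfq_may_dispatch() (hypothetical and
untested; the 50% soft limit and its combination with cfq_slice_used_soon() are
assumptions, not a posted patch):

	/* with a default cfq_quantum of 8, throttle at half of it */
	unsigned int max_dispatch = max_t(unsigned int, cfqd->cfq_quantum / 2, 1);

	if (cfqd->busy_queues > 1 && cfqq->dispatched >= max_dispatch) {
		/* between the soft and hard limits, dispatch more only if
		 * the remaining slice can absorb the extra requests */
		if (cfqq->dispatched >= cfqd->cfq_quantum ||
		    cfq_slice_used_soon(cfqd, cfqq))
			return false;
	}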

Vivek

> > > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > > This isn't optimal, device can handle more requests, for example, AHCI can
> > > handle 31 requests. I can understand the limit is for fairness, but we could
> > > do a tweak: if the queue still has a lot of slice left, sounds we could ignore
> > > the limit.
> > > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium.
> > > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium.
> > > This might cause latency issue if the queue is preempted at the very beginning.
> > >
> > > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last
> > > post, but also is a big improvement.
> > 
> > Acked-by: Corrado Zoccolo <czoccolo@gmail.com>
> > 
> > >
> > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > > ---
> > >  block/cfq-iosched.c |   15 ++++++++++++++-
> > >  1 file changed, 14 insertions(+), 1 deletion(-)
> > >
> > > Index: linux-2.6/block/cfq-iosched.c
> > > ===================================================================
> > > --- linux-2.6.orig/block/cfq-iosched.c
> > > +++ linux-2.6/block/cfq-iosched.c
> > > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
> > >        return dispatched;
> > >  }
> > >
> > > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> > > +       struct cfq_queue *cfqq)
> > > +{
> > > +       /* the queue hasn't finished any request, can't estimate */
> > > +       if (cfq_cfqq_slice_new(cfqq))
> > > +               return true;
> > > +       if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> > > +               cfqq->slice_end))
> > > +               return true;
> > > +
> > > +       return false;
> > > +}
> > > +
> > >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > >  {
> > >        unsigned int max_dispatch;
> > > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
> > >                /*
> > >                 * We have other queues, don't allow more IO from this one
> > >                 */
> > > -               if (cfqd->busy_queues > 1)
> > > +               if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
> > >                        return false;
> > >
> > >                /*
> > >
> > >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/


* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-08  0:57           ` Shaohua Li
@ 2010-01-08 20:22             ` Corrado Zoccolo
  2010-01-11  1:49               ` Shaohua Li
  2010-01-11  2:01               ` Shaohua Li
  0 siblings, 2 replies; 24+ messages in thread
From: Corrado Zoccolo @ 2010-01-08 20:22 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

On Fri, Jan 8, 2010 at 1:57 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote:
>> Hi Shahoua,
>>
>> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
>> >> Hi Shaohua,
>> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
>> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
>> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
>> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
>> >> >> > do some tweaks:
>> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
>> >> >> ok. You can even scale the limit proportionally to the remaining slice
>> >> >> (see below).
>> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
>> >> > dispatched requests can finish before slice is used, so other queues will not be
>> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
>> >> > dispatched requests.
>> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
>> >> on most disks. If you have more requests dispatched on a
>> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
>> >> linear formula is not the most accurate, but still more accurate than
>> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
>> >> also try:
>> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
>> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
>> >>
>> >> >
>> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
>> >> >> > about latency should be happy to have device fully piped on.
>> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
>> >> >> same by setting the quantum to 32.
>> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
>> >> > the check in next post.
>> >> Great.
>> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
>> >> >> > without patch: 78m/s
>> >> >> > with tweak 1: 138m/s
>> >> >> > with two tweaks and disable latency: 156m/s
>> >> >>
>> >> >> Please, test also with competing seq/random(depth1)/async workloads,
>> >> >> and measure also introduced latencies.
>> >> > depth1 should be ok, as if device can only send one request, it should not require
>> >> > more requests from ioscheduler.
>> >> I mean have a run with, at the same time:
>> >> * one seq reader,
>> >> * h random readers with depth 1 (non-aio)
>> >> * one async seq writer
>> >> * k random readers with large depth.
>> >> In this way, you can see if the changes you introduce to boost your
>> >> workload affect more realistic scenarios, in which various workloads
>> >> are mixed.
>> >> I explicitly add the depth1 random readers, since they are sceduled
>> >> differently than the large (>4) depth ones.
>> > I tried a fio script which does like your description, but the data
>> > isn't stable, especially the write speed, other kind of io speed is
>> > stable. Apply below patch doesn't make things worse (still write speed
>> > isn't stable, other io is stable), so I can't say if the patch passes
>> > the test, but it appears latency reported by fio hasn't change. I adopt
>> > the slice_idle * dispatched approach, which I thought should be safe.
>>
>> I'm doing some tests right now on a single ncq rotational disk, and
>> the average service time when submitting with a high depth is halved
>> w.r.t. depth 1, so I think you could test also with the formula :
>> slice_idle * dispatched / 2. It should give a performance boost,
>> without noticeable impact on latency
> Thanks for looking at it. can you forward your tests to me so I can
> check here?
Sure, you can find them at:
* simple seek time: http://dl.dropbox.com/u/3525644/stride.c
* avg seek time with NCQ: http://dl.dropbox.com/u/3525644/ncq.c
> I'll do more aggressive formula next. It shouldn't impact
> my SSD which is very fast, each request takes less than 1ms. We'd better
> have a mechanism to measure device speed, but jiffies isn't good. I'm thinking
> using sched_clock() which has its issue too like having drift between CPUs.
I'm using ktime_get() successfully. Measuring is the easy part. The
difficult one is deciding what to do with the value :)
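For reference, the basic pattern is (a sketch; where the start stamp is stored,
e.g. in the request, is left open):

	ktime_t start = ktime_get();
	/* ... request is serviced ... */
	u64 service_ns = ktime_to_ns(ktime_sub(ktime_get(), start));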

Thanks
Corrado

>
> Thanks,
> Shaohua
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda


* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-08 17:15           ` Vivek Goyal
  2010-01-08 17:40             ` Vivek Goyal
@ 2010-01-08 20:35             ` Corrado Zoccolo
  2010-01-08 20:59               ` Vivek Goyal
  1 sibling, 1 reply; 24+ messages in thread
From: Corrado Zoccolo @ 2010-01-08 20:35 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin

On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote:
>> Hi Shahoua,
>>
>> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
>> >> Hi Shaohua,
>> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
>> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
>> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
>> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
>> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
>> >> >> > do some tweaks:
>> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
>> >> >> ok. You can even scale the limit proportionally to the remaining slice
>> >> >> (see below).
>> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
>> >> > dispatched requests can finish before slice is used, so other queues will not be
>> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
>> >> > dispatched requests.
>> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
>> >> on most disks. If you have more requests dispatched on a
>> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
>> >> linear formula is not the most accurate, but still more accurate than
>> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
>> >> also try:
>> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
>> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
>> >>
>> >> >
>> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
>> >> >> > about latency should be happy to have device fully piped on.
>> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
>> >> >> same by setting the quantum to 32.
>> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
>> >> > the check in next post.
>> >> Great.
>> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
>> >> >> > without patch: 78m/s
>> >> >> > with tweak 1: 138m/s
>> >> >> > with two tweaks and disable latency: 156m/s
>> >> >>
>> >> >> Please, test also with competing seq/random(depth1)/async workloads,
>> >> >> and measure also introduced latencies.
>> >> > depth1 should be ok, as if device can only send one request, it should not require
>> >> > more requests from ioscheduler.
>> >> I mean have a run with, at the same time:
>> >> * one seq reader,
>> >> * h random readers with depth 1 (non-aio)
>> >> * one async seq writer
>> >> * k random readers with large depth.
>> >> In this way, you can see if the changes you introduce to boost your
>> >> workload affect more realistic scenarios, in which various workloads
>> >> are mixed.
>> >> I explicitly add the depth1 random readers, since they are sceduled
>> >> differently than the large (>4) depth ones.
>> > I tried a fio script which does like your description, but the data
>> > isn't stable, especially the write speed, other kind of io speed is
>> > stable. Apply below patch doesn't make things worse (still write speed
>> > isn't stable, other io is stable), so I can't say if the patch passes
>> > the test, but it appears latency reported by fio hasn't change. I adopt
>> > the slice_idle * dispatched approach, which I thought should be safe.
>>
>> I'm doing some tests right now on a single ncq rotational disk, and
>> the average service time when submitting with a high depth is halved
>> w.r.t. depth 1, so I think you could test also with the formula :
>> slice_idle * dispatched / 2. It should give a performance boost,
>> without noticeable impact on latency.
>>
>
> But I guess the right comparison here would service times vary when we
> push queue depths from 4 to higher (as done by this patch).

I think here we want to determine the average cost of a request when
many are submitted.

> Were you
> running deep seeky queues or sequential queues. Curious to know whether
> service times reduced even in case of deep seeky queues on this single
> disk.

Seeky queues. The seeks were rather small (no more than 1/64 of the
whole disk), but already meaningful for comparison.

>
> I think this patch breaks the meaning of cfq_quantum? Now we can allow
> dispatch of more requests from the same queue. I had kind of liked the
> idea of respecting cfq_quantum. Especially it can help in testing. With
> this patch cfq_quantum will more or less loose its meaning.
cfq_quantum will still be enforced at the end of the slice, so its
meaning (how many requests can still be pending when the slice
finishes) is preserved.

One can argue, instead, that this reduces a bit the effectiveness of
preemption on ncq disks.
However, I don't think preemption is the solution for low latency,
while cfq_quantum reduction is.
With this change in place, we could change the default cfq_quantum to
a smaller number (ideally 1), to have the lowest number of leftovers when
the slice finishes, while still driving deep queues at the beginning
of the slice.

This needs thorough testing, though. Maybe it is better to delay those
changes to 2.6.34...

Thanks,
Corrado

>
> Vivek
>
>> > Currently a queue can only dispatch up to 4 requests if there are other queues.
>> > This isn't optimal, device can handle more requests, for example, AHCI can
>> > handle 31 requests. I can understand the limit is for fairness, but we could
>> > do a tweak: if the queue still has a lot of slice left, sounds we could ignore
>> > the limit.
>> > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium.
>> > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium.
>> > This might cause latency issue if the queue is preempted at the very beginning.
>> >
>> > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last
>> > post, but also is a big improvement.
>>
>> Acked-by: Corrado Zoccolo <czoccolo@gmail.com>
>>
>> >
>> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
>> > ---
>> >  block/cfq-iosched.c |   15 ++++++++++++++-
>> >  1 file changed, 14 insertions(+), 1 deletion(-)
>> >
>> > Index: linux-2.6/block/cfq-iosched.c
>> > ===================================================================
>> > --- linux-2.6.orig/block/cfq-iosched.c
>> > +++ linux-2.6/block/cfq-iosched.c
>> > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
>> >        return dispatched;
>> >  }
>> >
>> > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
>> > +       struct cfq_queue *cfqq)
>> > +{
>> > +       /* the queue hasn't finished any request, can't estimate */
>> > +       if (cfq_cfqq_slice_new(cfqq))
>> > +               return true;
>> > +       if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
>> > +               cfqq->slice_end))
>> > +               return true;
>> > +
>> > +       return false;
>> > +}
>> > +
>> >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> >  {
>> >        unsigned int max_dispatch;
>> > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
>> >                /*
>> >                 * We have other queues, don't allow more IO from this one
>> >                 */
>> > -               if (cfqd->busy_queues > 1)
>> > +               if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
>> >                        return false;
>> >
>> >                /*


* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-08 20:35             ` Corrado Zoccolo
@ 2010-01-08 20:59               ` Vivek Goyal
  2010-01-11  2:34                 ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Vivek Goyal @ 2010-01-08 20:59 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: Shaohua Li, linux-kernel, jens.axboe, Zhang, Yanmin

On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote:
> On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote:
> >> Hi Shahoua,
> >>
> >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> >> >> Hi Shaohua,
> >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> >> >> > do some tweaks:
> >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> >> >> >> ok. You can even scale the limit proportionally to the remaining slice
> >> >> >> (see below).
> >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> >> >> > dispatched requests can finish before slice is used, so other queues will not be
> >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> >> >> > dispatched requests.
> >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> >> >> on most disks. If you have more requests dispatched on a
> >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> >> >> linear formula is not the most accurate, but still more accurate than
> >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> >> >> also try:
> >> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> >> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> >> >>
> >> >> >
> >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> >> >> >> > about latency should be happy to have device fully piped on.
> >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> >> >> >> same by setting the quantum to 32.
> >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> >> >> > the check in next post.
> >> >> Great.
> >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> >> >> >> > without patch: 78m/s
> >> >> >> > with tweak 1: 138m/s
> >> >> >> > with two tweaks and disable latency: 156m/s
> >> >> >>
> >> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> >> >> >> and measure also introduced latencies.
> >> >> > depth1 should be ok, as if device can only send one request, it should not require
> >> >> > more requests from ioscheduler.
> >> >> I mean have a run with, at the same time:
> >> >> * one seq reader,
> >> >> * h random readers with depth 1 (non-aio)
> >> >> * one async seq writer
> >> >> * k random readers with large depth.
> >> >> In this way, you can see if the changes you introduce to boost your
> >> >> workload affect more realistic scenarios, in which various workloads
> >> >> are mixed.
> >> >> I explicitly add the depth1 random readers, since they are sceduled
> >> >> differently than the large (>4) depth ones.
> >> > I tried a fio script which does like your description, but the data
> >> > isn't stable, especially the write speed, other kind of io speed is
> >> > stable. Apply below patch doesn't make things worse (still write speed
> >> > isn't stable, other io is stable), so I can't say if the patch passes
> >> > the test, but it appears latency reported by fio hasn't change. I adopt
> >> > the slice_idle * dispatched approach, which I thought should be safe.
> >>
> >> I'm doing some tests right now on a single ncq rotational disk, and
> >> the average service time when submitting with a high depth is halved
> >> w.r.t. depth 1, so I think you could test also with the formula :
> >> slice_idle * dispatched / 2. It should give a performance boost,
> >> without noticeable impact on latency.
> >>
> >
> > But I guess the right comparison here would service times vary when we
> > push queue depths from 4 to higher (as done by this patch).
> 
> I think here we want to determine the average cost of a request, when
> there are many submitted.
> 
> > Were you
> > running deep seeky queues or sequential queues. Curious to know whether
> > service times reduced even in case of deep seeky queues on this single
> > disk.
> 
> Seeky queues. Seeks where rather small (not more than 1/64 of the
> whole disk), but already meaningful for comparison.
> 
> >
> > I think this patch breaks the meaning of cfq_quantum? Now we can allow
> > dispatch of more requests from the same queue. I had kind of liked the
> > idea of respecting cfq_quantum. Especially it can help in testing. With
> > this patch cfq_quantum will more or less loose its meaning.
> cfq_quantum will still be enforced at the end of the slice, so its
> meaning of how many requests can be still pending when you finish your
> slice is preserved.

Not always; it will depend on how accurate your approximation of service
time is. If the per-request completion time is higher than the approximation
(in this case slice_idle), then you will end up with more requests in the
dispatch queue from one cfqq at the time of slice expiry.
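
For example (rough numbers, not a measurement): with slice_idle = 8ms, a
queue with 5 requests in flight and 40ms of slice left passes the check,
since 5 * 8 = 40ms. But if each request actually takes 12ms, those 5
requests need 60ms, so the queue overruns its slice by about 20ms and can
expire with requests still sitting in the dispatch queue.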

> 
> One can argue, instead, that this reduces a bit the effectiveness of
> preemption on ncq disks.
> However, I don't think preemption is the solution for low latency,
> while cfq_quantum reduction is.
> With this change in place, we could change the default cfq_quantum to
> a smaller number (ideally 1), to have lowest number of leftovers when
> the slice finishes, while still driving deep queues at the beginning
> of the slice.

I think using cfq_quantum as a hard limit might be a better idea, as it
gives more predictable control, instead of treating it as a soft limit and
trying to meet it at slice expiry based on our approximation of the
predicted completion time.

> 
> This needs thorough testing, though. Maybe it is better to delay those
> changes to 2.6.34...

Agreed. This should be tested more thoroughly and should be a candidate for
2.6.34.

Thanks
Vivek

> 
> Thanks,
> Corrado
> 
> >
> > Vivek
> >
> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> > do a tweak: if the queue still has a lot of slice left, sounds we could ignore
> >> > the limit.
> >> > For async io, 40ms/8ms = 5 - quantum = 1, we only send extra 1 request in maxium.
> >> > For sync io, 100ms/8ms = 12 - quantum = 8, we might send extra 8 requests in maxium.
> >> > This might cause latency issue if the queue is preempted at the very beginning.
> >> >
> >> > This patch boost my workload from 78m/s to 102m/s, which isn't that big as my last
> >> > post, but also is a big improvement.
> >>
> >> Acked-by: Corrado Zoccolo <czoccolo@gmail.com>
> >>
> >> >
> >> > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> >> > ---
> >> >  block/cfq-iosched.c |   15 ++++++++++++++-
> >> >  1 file changed, 14 insertions(+), 1 deletion(-)
> >> >
> >> > Index: linux-2.6/block/cfq-iosched.c
> >> > ===================================================================
> >> > --- linux-2.6.orig/block/cfq-iosched.c
> >> > +++ linux-2.6/block/cfq-iosched.c
> >> > @@ -2242,6 +2242,19 @@ static int cfq_forced_dispatch(struct cf
> >> >        return dispatched;
> >> >  }
> >> >
> >> > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> >> > +       struct cfq_queue *cfqq)
> >> > +{
> >> > +       /* the queue hasn't finished any request, can't estimate */
> >> > +       if (cfq_cfqq_slice_new(cfqq))
> >> > +               return true;
> >> > +       if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> >> > +               cfqq->slice_end))
> >> > +               return true;
> >> > +
> >> > +       return false;
> >> > +}
> >> > +
> >> >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >> >  {
> >> >        unsigned int max_dispatch;
> >> > @@ -2275,7 +2288,7 @@ static bool cfq_may_dispatch(struct cfq_
> >> >                /*
> >> >                 * We have other queues, don't allow more IO from this one
> >> >                 */
> >> > -               if (cfqd->busy_queues > 1)
> >> > +               if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
> >> >                        return false;
> >> >
> >> >                /*

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-08 20:22             ` Corrado Zoccolo
@ 2010-01-11  1:49               ` Shaohua Li
  2010-01-11  2:01               ` Shaohua Li
  1 sibling, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2010-01-11  1:49 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

On Sat, Jan 09, 2010 at 04:22:47AM +0800, Corrado Zoccolo wrote:
> On Fri, Jan 8, 2010 at 1:57 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote:
> >> Hi Shahoua,
> >>
> >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> >> >> Hi Shaohua,
> >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> >> >> > do some tweaks:
> >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> >> >> >> ok. You can even scale the limit proportionally to the remaining slice
> >> >> >> (see below).
> >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> >> >> > dispatched requests can finish before slice is used, so other queues will not be
> >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> >> >> > dispatched requests.
> >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> >> >> on most disks. If you have more requests dispatched on a
> >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> >> >> linear formula is not the most accurate, but still more accurate than
> >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> >> >> also try:
> >> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> >> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> >> >>
> >> >> >
> >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> >> >> >> > about latency should be happy to have device fully piped on.
> >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> >> >> >> same by setting the quantum to 32.
> >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> >> >> > the check in next post.
> >> >> Great.
> >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> >> >> >> > without patch: 78m/s
> >> >> >> > with tweak 1: 138m/s
> >> >> >> > with two tweaks and disable latency: 156m/s
> >> >> >>
> >> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> >> >> >> and measure also introduced latencies.
> >> >> > depth1 should be ok, as if device can only send one request, it should not require
> >> >> > more requests from ioscheduler.
> >> >> I mean have a run with, at the same time:
> >> >> * one seq reader,
> >> >> * h random readers with depth 1 (non-aio)
> >> >> * one async seq writer
> >> >> * k random readers with large depth.
> >> >> In this way, you can see if the changes you introduce to boost your
> >> >> workload affect more realistic scenarios, in which various workloads
> >> >> are mixed.
> >> >> I explicitly add the depth1 random readers, since they are sceduled
> >> >> differently than the large (>4) depth ones.
> >> > I tried a fio script which does like your description, but the data
> >> > isn't stable, especially the write speed, other kind of io speed is
> >> > stable. Apply below patch doesn't make things worse (still write speed
> >> > isn't stable, other io is stable), so I can't say if the patch passes
> >> > the test, but it appears latency reported by fio hasn't change. I adopt
> >> > the slice_idle * dispatched approach, which I thought should be safe.
> >>
> >> I'm doing some tests right now on a single ncq rotational disk, and
> >> the average service time when submitting with a high depth is halved
> >> w.r.t. depth 1, so I think you could test also with the formula :
> >> slice_idle * dispatched / 2. It should give a performance boost,
> >> without noticeable impact on latency
> > Thanks for looking at it. can you forward your tests to me so I can
> > check here?
> Sure, you can find them at:
> * simple seek time: http://dl.dropbox.com/u/3525644/stride.c
> * avg seek time with NCQ: http://dl.dropbox.com/u/3525644/ncq.c
Great, I'll do some tests.

> > I'll do more aggressive formula next. It shouldn't impact
> > my SSD which is very fast, each request takes less than 1ms. We'd better
> > have a mechanism to measure device speed, but jiffies isn't good. I'm thinking
> > using sched_clock() which has its issue too like having drift between CPUs.
> I'm using ktime_get() successfully. Measuring is the easy part. The
> difficult one is decide what to do with the value :)
If we can get the average time a request takes, we can further tweak the quantum check.
On a fast device, 100 requests can complete within a single slice_idle.
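
A sketch of what that measurement could look like (hypothetical; it
assumes a new avg_service_ns field in cfq_data and is not from any posted
patch):

/* call with ktime_get() timestamps taken at dispatch and at completion */
static void cfq_update_avg_service(struct cfq_data *cfqd,
				   ktime_t issue, ktime_t complete)
{
	u64 delta = ktime_to_ns(ktime_sub(complete, issue));

	/* exponentially weighted moving average, 7/8 weight on history */
	cfqd->avg_service_ns = (cfqd->avg_service_ns * 7 + delta) >> 3;
}

The quantum check could then use avg_service_ns instead of the fixed
cfq_slice_idle when estimating whether the dispatched requests will finish
before the slice ends.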

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-08 20:22             ` Corrado Zoccolo
  2010-01-11  1:49               ` Shaohua Li
@ 2010-01-11  2:01               ` Shaohua Li
  1 sibling, 0 replies; 24+ messages in thread
From: Shaohua Li @ 2010-01-11  2:01 UTC (permalink / raw)
  To: Corrado Zoccolo; +Cc: linux-kernel, jens.axboe, Zhang, Yanmin

On Sat, Jan 09, 2010 at 04:22:47AM +0800, Corrado Zoccolo wrote:
> On Fri, Jan 8, 2010 at 1:57 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > On Fri, Jan 08, 2010 at 05:44:27AM +0800, Corrado Zoccolo wrote:
> >> Hi Shahoua,
> >>
> >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> >> >> Hi Shaohua,
> >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> >> >> >> > do some tweaks:
> >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> >> >> >> ok. You can even scale the limit proportionally to the remaining slice
> >> >> >> (see below).
> >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> >> >> > dispatched requests can finish before slice is used, so other queues will not be
> >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> >> >> > dispatched requests.
> >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> >> >> on most disks. If you have more requests dispatched on a
> >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> >> >> linear formula is not the most accurate, but still more accurate than
> >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> >> >> also try:
> >> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> >> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> >> >>
> >> >> >
> >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> >> >> >> > about latency should be happy to have device fully piped on.
> >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> >> >> >> same by setting the quantum to 32.
> >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> >> >> > the check in next post.
> >> >> Great.
> >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> >> >> >> > without patch: 78m/s
> >> >> >> > with tweak 1: 138m/s
> >> >> >> > with two tweaks and disable latency: 156m/s
> >> >> >>
> >> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> >> >> >> and measure also introduced latencies.
> >> >> > depth1 should be ok, as if device can only send one request, it should not require
> >> >> > more requests from ioscheduler.
> >> >> I mean have a run with, at the same time:
> >> >> * one seq reader,
> >> >> * h random readers with depth 1 (non-aio)
> >> >> * one async seq writer
> >> >> * k random readers with large depth.
> >> >> In this way, you can see if the changes you introduce to boost your
> >> >> workload affect more realistic scenarios, in which various workloads
> >> >> are mixed.
> >> >> I explicitly add the depth1 random readers, since they are sceduled
> >> >> differently than the large (>4) depth ones.
> >> > I tried a fio script which does like your description, but the data
> >> > isn't stable, especially the write speed, other kind of io speed is
> >> > stable. Apply below patch doesn't make things worse (still write speed
> >> > isn't stable, other io is stable), so I can't say if the patch passes
> >> > the test, but it appears latency reported by fio hasn't change. I adopt
> >> > the slice_idle * dispatched approach, which I thought should be safe.
> >>
> >> I'm doing some tests right now on a single ncq rotational disk, and
> >> the average service time when submitting with a high depth is halved
> >> w.r.t. depth 1, so I think you could test also with the formula :
> >> slice_idle * dispatched / 2. It should give a performance boost,
> >> without noticeable impact on latency
> > Thanks for looking at it. can you forward your tests to me so I can
> > check here?
> Sure, you can find them at:
> * simple seek time: http://dl.dropbox.com/u/3525644/stride.c
> * avg seek time with NCQ: http://dl.dropbox.com/u/3525644/ncq.c
> > I'll do more aggressive formula next. It shouldn't impact
> > my SSD which is very fast, each request takes less than 1ms. We'd better
> > have a mechanism to measure device speed, but jiffies isn't good. I'm thinking
> > using sched_clock() which has its issue too like having drift between CPUs.
> I'm using ktime_get() successfully. Measuring is the easy part. The
> difficult one is decide what to do with the value :)
BTW, is ktime_get() light enough to be called 40k times per second? I thought
we would need a very lightweight clock for this kind of accounting.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-08 20:59               ` Vivek Goyal
@ 2010-01-11  2:34                 ` Shaohua Li
  2010-01-11 17:03                   ` Vivek Goyal
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2010-01-11  2:34 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Sat, Jan 09, 2010 at 04:59:48AM +0800, Vivek Goyal wrote:
> On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote:
> > On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote:
> > >> Hi Shahoua,
> > >>
> > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> > >> >> Hi Shaohua,
> > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> > >> >> >> > do some tweaks:
> > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> > >> >> >> ok. You can even scale the limit proportionally to the remaining slice
> > >> >> >> (see below).
> > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> > >> >> > dispatched requests can finish before slice is used, so other queues will not be
> > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> > >> >> > dispatched requests.
> > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> > >> >> on most disks. If you have more requests dispatched on a
> > >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> > >> >> linear formula is not the most accurate, but still more accurate than
> > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> > >> >> also try:
> > >> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> > >> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> > >> >>
> > >> >> >
> > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> > >> >> >> > about latency should be happy to have device fully piped on.
> > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> > >> >> >> same by setting the quantum to 32.
> > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> > >> >> > the check in next post.
> > >> >> Great.
> > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> > >> >> >> > without patch: 78m/s
> > >> >> >> > with tweak 1: 138m/s
> > >> >> >> > with two tweaks and disable latency: 156m/s
> > >> >> >>
> > >> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> > >> >> >> and measure also introduced latencies.
> > >> >> > depth1 should be ok, as if device can only send one request, it should not require
> > >> >> > more requests from ioscheduler.
> > >> >> I mean have a run with, at the same time:
> > >> >> * one seq reader,
> > >> >> * h random readers with depth 1 (non-aio)
> > >> >> * one async seq writer
> > >> >> * k random readers with large depth.
> > >> >> In this way, you can see if the changes you introduce to boost your
> > >> >> workload affect more realistic scenarios, in which various workloads
> > >> >> are mixed.
> > >> >> I explicitly add the depth1 random readers, since they are sceduled
> > >> >> differently than the large (>4) depth ones.
> > >> > I tried a fio script which does like your description, but the data
> > >> > isn't stable, especially the write speed, other kind of io speed is
> > >> > stable. Apply below patch doesn't make things worse (still write speed
> > >> > isn't stable, other io is stable), so I can't say if the patch passes
> > >> > the test, but it appears latency reported by fio hasn't change. I adopt
> > >> > the slice_idle * dispatched approach, which I thought should be safe.
> > >>
> > >> I'm doing some tests right now on a single ncq rotational disk, and
> > >> the average service time when submitting with a high depth is halved
> > >> w.r.t. depth 1, so I think you could test also with the formula :
> > >> slice_idle * dispatched / 2. It should give a performance boost,
> > >> without noticeable impact on latency.
> > >>
> > >
> > > But I guess the right comparison here would service times vary when we
> > > push queue depths from 4 to higher (as done by this patch).
> > 
> > I think here we want to determine the average cost of a request, when
> > there are many submitted.
> > 
> > > Were you
> > > running deep seeky queues or sequential queues. Curious to know whether
> > > service times reduced even in case of deep seeky queues on this single
> > > disk.
> > 
> > Seeky queues. Seeks where rather small (not more than 1/64 of the
> > whole disk), but already meaningful for comparison.
> > 
> > >
> > > I think this patch breaks the meaning of cfq_quantum? Now we can allow
> > > dispatch of more requests from the same queue. I had kind of liked the
> > > idea of respecting cfq_quantum. Especially it can help in testing. With
> > > this patch cfq_quantum will more or less loose its meaning.
> > cfq_quantum will still be enforced at the end of the slice, so its
> > meaning of how many requests can be still pending when you finish your
> > slice is preserved.
> 
> Not always and it will depend how accurate your approximation of service
> time is. If per request completion time is more than approximation (in
> this case slice_idle), than you will end up with more requests in dispatch
> queue from one cfqq at the time of slice expiry.
We have used slice_idle for a long time with no complaints, so assume the
approximation of service time is good.

> > 
> > One can argue, instead, that this reduces a bit the effectiveness of
> > preemption on ncq disks.
> > However, I don't think preemption is the solution for low latency,
> > while cfq_quantum reduction is.
> > With this change in place, we could change the default cfq_quantum to
> > a smaller number (ideally 1), to have lowest number of leftovers when
> > the slice finishes, while still driving deep queues at the beginning
> > of the slice.
> 
> I think using cfq_quantum as hard limit might be a better idea as it gives
> more predictable control. Instead of treating it as soft limit and trying
> to meet it at the end of slice expiry based on our approximation of
> predicted completion time.
The current patch has such a hard limit too (100ms/8ms = 12 for sync io and
40ms/8ms = 5 for async io).
 
> > This needs thorough testing, though. Maybe it is better to delay those
> > changes to 2.6.34...
> 
> Agreed. This should be tested more throughly and should be candidate for
> 2.6.34.
Sure, this needs a lot of testing.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-11  2:34                 ` Shaohua Li
@ 2010-01-11 17:03                   ` Vivek Goyal
  2010-01-12  3:07                     ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Vivek Goyal @ 2010-01-11 17:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Mon, Jan 11, 2010 at 10:34:09AM +0800, Shaohua Li wrote:
> On Sat, Jan 09, 2010 at 04:59:48AM +0800, Vivek Goyal wrote:
> > On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote:
> > > On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote:
> > > >> Hi Shahoua,
> > > >>
> > > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> > > >> >> Hi Shaohua,
> > > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> > > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> > > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> > > >> >> >> > do some tweaks:
> > > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> > > >> >> >> ok. You can even scale the limit proportionally to the remaining slice
> > > >> >> >> (see below).
> > > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> > > >> >> > dispatched requests can finish before slice is used, so other queues will not be
> > > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> > > >> >> > dispatched requests.
> > > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> > > >> >> on most disks. If you have more requests dispatched on a
> > > >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> > > >> >> linear formula is not the most accurate, but still more accurate than
> > > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> > > >> >> also try:
> > > >> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> > > >> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> > > >> >>
> > > >> >> >
> > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> > > >> >> >> > about latency should be happy to have device fully piped on.
> > > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> > > >> >> >> same by setting the quantum to 32.
> > > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> > > >> >> > the check in next post.
> > > >> >> Great.
> > > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> > > >> >> >> > without patch: 78m/s
> > > >> >> >> > with tweak 1: 138m/s
> > > >> >> >> > with two tweaks and disable latency: 156m/s
> > > >> >> >>
> > > >> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> > > >> >> >> and measure also introduced latencies.
> > > >> >> > depth1 should be ok, as if device can only send one request, it should not require
> > > >> >> > more requests from ioscheduler.
> > > >> >> I mean have a run with, at the same time:
> > > >> >> * one seq reader,
> > > >> >> * h random readers with depth 1 (non-aio)
> > > >> >> * one async seq writer
> > > >> >> * k random readers with large depth.
> > > >> >> In this way, you can see if the changes you introduce to boost your
> > > >> >> workload affect more realistic scenarios, in which various workloads
> > > >> >> are mixed.
> > > >> >> I explicitly add the depth1 random readers, since they are sceduled
> > > >> >> differently than the large (>4) depth ones.
> > > >> > I tried a fio script which does like your description, but the data
> > > >> > isn't stable, especially the write speed, other kind of io speed is
> > > >> > stable. Apply below patch doesn't make things worse (still write speed
> > > >> > isn't stable, other io is stable), so I can't say if the patch passes
> > > >> > the test, but it appears latency reported by fio hasn't change. I adopt
> > > >> > the slice_idle * dispatched approach, which I thought should be safe.
> > > >>
> > > >> I'm doing some tests right now on a single ncq rotational disk, and
> > > >> the average service time when submitting with a high depth is halved
> > > >> w.r.t. depth 1, so I think you could test also with the formula :
> > > >> slice_idle * dispatched / 2. It should give a performance boost,
> > > >> without noticeable impact on latency.
> > > >>
> > > >
> > > > But I guess the right comparison here would service times vary when we
> > > > push queue depths from 4 to higher (as done by this patch).
> > > 
> > > I think here we want to determine the average cost of a request, when
> > > there are many submitted.
> > > 
> > > > Were you
> > > > running deep seeky queues or sequential queues. Curious to know whether
> > > > service times reduced even in case of deep seeky queues on this single
> > > > disk.
> > > 
> > > Seeky queues. Seeks where rather small (not more than 1/64 of the
> > > whole disk), but already meaningful for comparison.
> > > 
> > > >
> > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow
> > > > dispatch of more requests from the same queue. I had kind of liked the
> > > > idea of respecting cfq_quantum. Especially it can help in testing. With
> > > > this patch cfq_quantum will more or less loose its meaning.
> > > cfq_quantum will still be enforced at the end of the slice, so its
> > > meaning of how many requests can be still pending when you finish your
> > > slice is preserved.
> > 
> > Not always and it will depend how accurate your approximation of service
> > time is. If per request completion time is more than approximation (in
> > this case slice_idle), than you will end up with more requests in dispatch
> > queue from one cfqq at the time of slice expiry.
> we use slice_idle for a long time and no complain. So assume the approximation
> of service time is good.

slice_idle is a tunable and a user can easily change it to 1ms or even 0.
In that case you would theoretically be ready to dispatch 100ms/1ms = 100
requests from the cfqq?

> 
> > > 
> > > One can argue, instead, that this reduces a bit the effectiveness of
> > > preemption on ncq disks.
> > > However, I don't think preemption is the solution for low latency,
> > > while cfq_quantum reduction is.
> > > With this change in place, we could change the default cfq_quantum to
> > > a smaller number (ideally 1), to have lowest number of leftovers when
> > > the slice finishes, while still driving deep queues at the beginning
> > > of the slice.
> > 
> > I think using cfq_quantum as hard limit might be a better idea as it gives
> > more predictable control. Instead of treating it as soft limit and trying
> > to meet it at the end of slice expiry based on our approximation of
> > predicted completion time.
> Current patch has such hard limit too (100ms/8m = 12 for sync io and 40ms/8
>  = 5 for async io).

This limit is driven by software logic, not by cfq_quantum. We can always
keep changing how we approximate service-time completion. So a user
first needs to read the code, derive the internal limits and then do testing?

I think the tunable then loses its significance. That's why I am advocating
treating cfq_quantum as a hard limit and deriving an internal soft limit,
based on a certain % of the hard limit, to use as the default max queue depth
for a cfqq.

In this case the user knows that, no matter what, you are not dispatching
more than cfq_quantum requests from a queue at a time.
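
A minimal sketch of that derivation (one possible reading of the proposal,
not code from any posted patch):

/* soft limit derived from the user-visible hard limit */
static inline unsigned int cfq_soft_quantum(struct cfq_data *cfqd)
{
	/* e.g. 50% of cfq_quantum, but always at least one request */
	return max_t(unsigned int, cfqd->cfq_quantum / 2, 1);
}

The dispatcher would start throttling at cfq_soft_quantum() requests and
would never let a queue exceed cfq_quantum itself.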

>  
> > > This needs thorough testing, though. Maybe it is better to delay those
> > > changes to 2.6.34...
> > 
> > Agreed. This should be tested more throughly and should be candidate for
> > 2.6.34.
> Sure, this needs a lot of test.
> 
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-11 17:03                   ` Vivek Goyal
@ 2010-01-12  3:07                     ` Shaohua Li
  2010-01-12 15:48                       ` Vivek Goyal
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2010-01-12  3:07 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Tue, Jan 12, 2010 at 01:03:39AM +0800, Vivek Goyal wrote:
> On Mon, Jan 11, 2010 at 10:34:09AM +0800, Shaohua Li wrote:
> > On Sat, Jan 09, 2010 at 04:59:48AM +0800, Vivek Goyal wrote:
> > > On Fri, Jan 08, 2010 at 09:35:33PM +0100, Corrado Zoccolo wrote:
> > > > On Fri, Jan 8, 2010 at 6:15 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > On Thu, Jan 07, 2010 at 10:44:27PM +0100, Corrado Zoccolo wrote:
> > > > >> Hi Shahoua,
> > > > >>
> > > > >> On Thu, Jan 7, 2010 at 3:04 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > > >> > On Mon, 2009-12-28 at 17:02 +0800, Corrado Zoccolo wrote:
> > > > >> >> Hi Shaohua,
> > > > >> >> On Mon, Dec 28, 2009 at 4:35 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > > >> >> > On Fri, Dec 25, 2009 at 05:44:40PM +0800, Corrado Zoccolo wrote:
> > > > >> >> >> On Fri, Dec 25, 2009 at 10:10 AM, Shaohua Li <shaohua.li@intel.com> wrote:
> > > > >> >> >> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > > > >> >> >> > This isn't optimal, device can handle more requests, for example, AHCI can
> > > > >> >> >> > handle 31 requests. I can understand the limit is for fairness, but we could
> > > > >> >> >> > do some tweaks:
> > > > >> >> >> > 1. if the queue still has a lot of slice left, sounds we could ignore the limit
> > > > >> >> >> ok. You can even scale the limit proportionally to the remaining slice
> > > > >> >> >> (see below).
> > > > >> >> > I can't understand the meaning of below scale. cfq_slice_used_soon() means
> > > > >> >> > dispatched requests can finish before slice is used, so other queues will not be
> > > > >> >> > impacted. I thought/hope a cfq_slice_idle time is enough to finish the
> > > > >> >> > dispatched requests.
> > > > >> >> cfq_slice_idle is 8ms, that is the average time to complete 1 request
> > > > >> >> on most disks. If you have more requests dispatched on a
> > > > >> >> NCQ-rotational disk (non-RAID), it will take more time. Probably a
> > > > >> >> linear formula is not the most accurate, but still more accurate than
> > > > >> >> taking just 1 cfq_slice_idle. If you can experiment a bit, you could
> > > > >> >> also try:
> > > > >> >>  cfq_slice_idle * ilog2(nr_dispatched+1)
> > > > >> >>  cfq_slice_idle * (1<<(ilog2(nr_dispatched+1)>>1))
> > > > >> >>
> > > > >> >> >
> > > > >> >> >> > 2. we could keep the check only when cfq_latency is on. For uses who don't care
> > > > >> >> >> > about latency should be happy to have device fully piped on.
> > > > >> >> >> I wouldn't overload low_latency with this meaning. You can obtain the
> > > > >> >> >> same by setting the quantum to 32.
> > > > >> >> > As this impact fairness, so natually thought we could use low_latency. I'll remove
> > > > >> >> > the check in next post.
> > > > >> >> Great.
> > > > >> >> >> > I have a test of random direct io of two threads, each has 32 requests one time
> > > > >> >> >> > without patch: 78m/s
> > > > >> >> >> > with tweak 1: 138m/s
> > > > >> >> >> > with two tweaks and disable latency: 156m/s
> > > > >> >> >>
> > > > >> >> >> Please, test also with competing seq/random(depth1)/async workloads,
> > > > >> >> >> and measure also introduced latencies.
> > > > >> >> > depth1 should be ok, as if device can only send one request, it should not require
> > > > >> >> > more requests from ioscheduler.
> > > > >> >> I mean have a run with, at the same time:
> > > > >> >> * one seq reader,
> > > > >> >> * h random readers with depth 1 (non-aio)
> > > > >> >> * one async seq writer
> > > > >> >> * k random readers with large depth.
> > > > >> >> In this way, you can see if the changes you introduce to boost your
> > > > >> >> workload affect more realistic scenarios, in which various workloads
> > > > >> >> are mixed.
> > > > >> >> I explicitly add the depth1 random readers, since they are sceduled
> > > > >> >> differently than the large (>4) depth ones.
> > > > >> > I tried a fio script which does like your description, but the data
> > > > >> > isn't stable, especially the write speed, other kind of io speed is
> > > > >> > stable. Apply below patch doesn't make things worse (still write speed
> > > > >> > isn't stable, other io is stable), so I can't say if the patch passes
> > > > >> > the test, but it appears latency reported by fio hasn't change. I adopt
> > > > >> > the slice_idle * dispatched approach, which I thought should be safe.
> > > > >>
> > > > >> I'm doing some tests right now on a single ncq rotational disk, and
> > > > >> the average service time when submitting with a high depth is halved
> > > > >> w.r.t. depth 1, so I think you could test also with the formula :
> > > > >> slice_idle * dispatched / 2. It should give a performance boost,
> > > > >> without noticeable impact on latency.
> > > > >>
> > > > >
> > > > > But I guess the right comparison here would service times vary when we
> > > > > push queue depths from 4 to higher (as done by this patch).
> > > > 
> > > > I think here we want to determine the average cost of a request, when
> > > > there are many submitted.
> > > > 
> > > > > Were you
> > > > > running deep seeky queues or sequential queues. Curious to know whether
> > > > > service times reduced even in case of deep seeky queues on this single
> > > > > disk.
> > > > 
> > > > Seeky queues. Seeks where rather small (not more than 1/64 of the
> > > > whole disk), but already meaningful for comparison.
> > > > 
> > > > >
> > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow
> > > > > dispatch of more requests from the same queue. I had kind of liked the
> > > > > idea of respecting cfq_quantum. Especially it can help in testing. With
> > > > > this patch cfq_quantum will more or less loose its meaning.
> > > > cfq_quantum will still be enforced at the end of the slice, so its
> > > > meaning of how many requests can be still pending when you finish your
> > > > slice is preserved.
> > > 
> > > Not always and it will depend how accurate your approximation of service
> > > time is. If per request completion time is more than approximation (in
> > > this case slice_idle), than you will end up with more requests in dispatch
> > > queue from one cfqq at the time of slice expiry.
> > we use slice_idle for a long time and no complain. So assume the approximation
> > of service time is good.
> 
> slice_idle is a variable and user can easily change it to 1ms and even 0.
> In that case you will be theoritically be ready to dispatch 100/1 requests
> from the cfqq?
A user changing it should know what he is doing. A less-experienced user can
mess up a lot of things, and we don't try to protect against that.

> > > > One can argue, instead, that this reduces a bit the effectiveness of
> > > > preemption on ncq disks.
> > > > However, I don't think preemption is the solution for low latency,
> > > > while cfq_quantum reduction is.
> > > > With this change in place, we could change the default cfq_quantum to
> > > > a smaller number (ideally 1), to have lowest number of leftovers when
> > > > the slice finishes, while still driving deep queues at the beginning
> > > > of the slice.
> > > 
> > > I think using cfq_quantum as hard limit might be a better idea as it gives
> > > more predictable control. Instead of treating it as soft limit and trying
> > > to meet it at the end of slice expiry based on our approximation of
> > > predicted completion time.
> > Current patch has such hard limit too (100ms/8m = 12 for sync io and 40ms/8
> >  = 5 for async io).
> 
> This is software logic driven and not cfq_quantum driven. We can always
> keep on changing how to approximate service time completions. So a user
> first needs to read the code, derive internal limits and then do testing?
> 
> I think than tunable looses its significance. That's why I am advocating
> of treating cfq_quantum as hard limit and derive an internal soft limit
> based on certain % of hard limit and use that as default max queue depth
> for cfqq.
> 
> In this case user knows no matter what, you are not dispatching more than
> cfq_quantum requests from a queue at a time.
OK, then the question is which value cfq_quantum should have. I ran a test
with the patch below, and performance is still good: with a hard limit of 8,
speed is 100m/s; without the hard limit, it is 102m/s.


Currently a queue can only dispatch up to 4 requests if there are other queues.
This isn't optimal; the device can handle more requests, for example, AHCI can
handle 31 requests. I can understand that the limit is for fairness, but we
could do a tweak: if the queue still has a lot of slice left, it sounds like
we could ignore the limit.
Tests show this boosts my workload (two-thread randread of an SSD) from 78m/s
to 100m/s.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/cfq-iosched.c |   24 +++++++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -19,7 +19,7 @@
  * tunables
  */
 /* max queue in one round of service */
-static const int cfq_quantum = 4;
+static const int cfq_quantum = 8;
 static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 /* maximum backwards seek, in KiB */
 static const int cfq_back_max = 16 * 1024;
@@ -32,6 +32,8 @@ static int cfq_slice_idle = HZ / 125;
 static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
 static const int cfq_hist_divisor = 4;
 
+#define CFQ_SOFT_QUANTUM (4)
+
 /*
  * offset from end of service tree
  */
@@ -2242,6 +2244,19 @@ static int cfq_forced_dispatch(struct cf
 	return dispatched;
 }
 
+static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	/* the queue hasn't finished any request, can't estimate */
+	if (cfq_cfqq_slice_new(cfqq) || cfqq->dispatched >= cfqd->cfq_quantum)
+		return 1;
+	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
+		cfqq->slice_end))
+		return 1;
+
+	return 0;
+}
+
 static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
@@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
 		return false;
 
-	max_dispatch = cfqd->cfq_quantum;
+	max_dispatch = cfqd->cfq_quantum / 2;
+	if (max_dispatch < CFQ_SOFT_QUANTUM)
+		max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM,
+			cfqd->cfq_quantum);
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
@@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
 			return false;
 
 		/*
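
To spell out the defaults above (my arithmetic from the patch): with
cfq_quantum = 8 the soft limit is 8 / 2 = 4, i.e. the old behavior, and a
queue with enough slice left may drive up to the cfq_quantum hard cap of 8
enforced in cfq_slice_used_soon(). With cfq_quantum = 7, 7 / 2 = 3 falls
below CFQ_SOFT_QUANTUM, so the soft limit is raised back to min(4, 7) = 4.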

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-12  3:07                     ` Shaohua Li
@ 2010-01-12 15:48                       ` Vivek Goyal
  2010-01-13  8:17                         ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Vivek Goyal @ 2010-01-12 15:48 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Tue, Jan 12, 2010 at 11:07:56AM +0800, Shaohua Li wrote:

[..]
> > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow
> > > > > > dispatch of more requests from the same queue. I had kind of liked the
> > > > > > idea of respecting cfq_quantum. Especially it can help in testing. With
> > > > > > this patch cfq_quantum will more or less loose its meaning.
> > > > > cfq_quantum will still be enforced at the end of the slice, so its
> > > > > meaning of how many requests can be still pending when you finish your
> > > > > slice is preserved.
> > > > 
> > > > Not always and it will depend how accurate your approximation of service
> > > > time is. If per request completion time is more than approximation (in
> > > > this case slice_idle), than you will end up with more requests in dispatch
> > > > queue from one cfqq at the time of slice expiry.
> > > we use slice_idle for a long time and no complain. So assume the approximation
> > > of service time is good.
> > 
> > slice_idle is a variable and user can easily change it to 1ms and even 0.
> > In that case you will be theoritically be ready to dispatch 100/1 requests
> > from the cfqq?
> User changing it should know what he does. A less-experienced user can mess a lot
> of things, which we don't care. 
> 

The point is that there is no obvious correlation between slice_idle and
cfq_quantum. Even an experienced user would not expect that changing
slice_idle would silently enable dispatching more requests from each cfqq.

> > > > > One can argue, instead, that this reduces a bit the effectiveness of
> > > > > preemption on ncq disks.
> > > > > However, I don't think preemption is the solution for low latency,
> > > > > while cfq_quantum reduction is.
> > > > > With this change in place, we could change the default cfq_quantum to
> > > > > a smaller number (ideally 1), to have lowest number of leftovers when
> > > > > the slice finishes, while still driving deep queues at the beginning
> > > > > of the slice.
> > > > 
> > > > I think using cfq_quantum as hard limit might be a better idea as it gives
> > > > more predictable control. Instead of treating it as soft limit and trying
> > > > to meet it at the end of slice expiry based on our approximation of
> > > > predicted completion time.
> > > Current patch has such hard limit too (100ms/8m = 12 for sync io and 40ms/8
> > >  = 5 for async io).
> > 
> > This is software logic driven and not cfq_quantum driven. We can always
> > keep on changing how to approximate service time completions. So a user
> > first needs to read the code, derive internal limits and then do testing?
> > 
> > I think than tunable looses its significance. That's why I am advocating
> > of treating cfq_quantum as hard limit and derive an internal soft limit
> > based on certain % of hard limit and use that as default max queue depth
> > for cfqq.
> > 
> > In this case user knows no matter what, you are not dispatching more than
> > cfq_quantum requests from a queue at a time.
> ok, then the question is which value should cfq_quantum have. I have a test with
> below patch. Its performance still is good. With hard limit 8, speed is 100m/s.
> without hard limit, speed is 102m/s.
> 
> 
> Currently a queue can only dispatch up to 4 requests if there are other queues.
> This isn't optimal, device can handle more requests, for example, AHCI can
> handle 31 requests. I can understand the limit is for fairness, but we could
> do a tweak: if the queue still has a lot of slice left, sounds we could
> ignore the limit.
> Test shows this boost my workload (two thread randread of a SSD) from 78m/s
> to 100m/s.
> 
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/cfq-iosched.c |   24 +++++++++++++++++++++---
>  1 file changed, 21 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -19,7 +19,7 @@
>   * tunables
>   */
>  /* max queue in one round of service */
> -static const int cfq_quantum = 4;
> +static const int cfq_quantum = 8;
>  static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
>  /* maximum backwards seek, in KiB */
>  static const int cfq_back_max = 16 * 1024;
> @@ -32,6 +32,8 @@ static int cfq_slice_idle = HZ / 125;
>  static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
>  static const int cfq_hist_divisor = 4;
>  
> +#define CFQ_SOFT_QUANTUM (4)
> +
>  /*
>   * offset from end of service tree
>   */
> @@ -2242,6 +2244,19 @@ static int cfq_forced_dispatch(struct cf
>  	return dispatched;
>  }
>  
> +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> +	struct cfq_queue *cfqq)
> +{
> +	/* the queue hasn't finished any request, can't estimate */
> +	if (cfq_cfqq_slice_new(cfqq) || cfqq->dispatched >= cfqd->cfq_quantum)
> +		return 1;
> +	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> +		cfqq->slice_end))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>  	unsigned int max_dispatch;
> @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
>  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
>  		return false;
>  
> -	max_dispatch = cfqd->cfq_quantum;
> +	max_dispatch = cfqd->cfq_quantum / 2;
> +	if (max_dispatch < CFQ_SOFT_QUANTUM)

We don't have to hardcode CFQ_SOFT_QUANTUM; in fact we don't need it at all.
We can derive the soft limit from the hard limit (cfq_quantum), say 50% of
the cfq_quantum value.

> +		max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM,
> +			cfqd->cfq_quantum);
>  	if (cfq_class_idle(cfqq))
>  		max_dispatch = 1;
>  
> @@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_
>  		/*
>  		 * We have other queues, don't allow more IO from this one
>  		 */
> -		if (cfqd->busy_queues > 1)
> +		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
>  			return false;

So I guess here we can write something as follows.

		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
			return false;

		if (cfqd->busy_queues == 1)
			max_dispatch = -1;
		else
			/*
			 * Normally we start throttling cfqq when cfq_quantum/2
			 * requests have been dispatched. But we can drive
			 * deeper queue depths at the beginning of slice
			 * subjected to upper limit of cfq_quantum.
			 */
			max_dispatch = cfqd->cfq_quantum;

Thanks
Vivek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-12 15:48                       ` Vivek Goyal
@ 2010-01-13  8:17                         ` Shaohua Li
  2010-01-13 11:18                           ` Vivek Goyal
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2010-01-13  8:17 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Tue, Jan 12, 2010 at 11:48:20PM +0800, Vivek Goyal wrote:
> On Tue, Jan 12, 2010 at 11:07:56AM +0800, Shaohua Li wrote:
> 
> [..]
> > > > > > > I think this patch breaks the meaning of cfq_quantum? Now we can allow
> > > > > > > dispatch of more requests from the same queue. I had kind of liked the
> > > > > > > idea of respecting cfq_quantum. Especially it can help in testing. With
> > > > > > > this patch cfq_quantum will more or less loose its meaning.
> > > > > > cfq_quantum will still be enforced at the end of the slice, so its
> > > > > > meaning of how many requests can be still pending when you finish your
> > > > > > slice is preserved.
> > > > > 
> > > > > Not always and it will depend how accurate your approximation of service
> > > > > time is. If per request completion time is more than approximation (in
> > > > > this case slice_idle), than you will end up with more requests in dispatch
> > > > > queue from one cfqq at the time of slice expiry.
> > > > we use slice_idle for a long time and no complain. So assume the approximation
> > > > of service time is good.
> > > 
> > > slice_idle is a variable and user can easily change it to 1ms and even 0.
> > > In that case you will be theoritically be ready to dispatch 100/1 requests
> > > from the cfqq?
> > User changing it should know what he does. A less-experienced user can mess a lot
> > of things, which we don't care. 
> > 
> 
> The point is that there is no obivious co-relation between slice_idle and
> cfq_quantum. Even an experienced user would not expect that changing
> slice_idle silently will enable dispatching more requests from each cfqq.
Agreed, slice_idle has no direct relationship with cfq_quantum. Yes, more
requests are dispatched, but it shouldn't impact the user experience; if it
does, then the patch fails.
 
> >  1 file changed, 21 insertions(+), 3 deletions(-)
> > 
> > Index: linux-2.6/block/cfq-iosched.c
> > ===================================================================
> > --- linux-2.6.orig/block/cfq-iosched.c
> > +++ linux-2.6/block/cfq-iosched.c
> > @@ -19,7 +19,7 @@
> >   * tunables
> >   */
> >  /* max queue in one round of service */
> > -static const int cfq_quantum = 4;
> > +static const int cfq_quantum = 8;
> >  static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
> >  /* maximum backwards seek, in KiB */
> >  static const int cfq_back_max = 16 * 1024;
> > @@ -32,6 +32,8 @@ static int cfq_slice_idle = HZ / 125;
> >  static const int cfq_target_latency = HZ * 3/10; /* 300 ms */
> >  static const int cfq_hist_divisor = 4;
> >  
> > +#define CFQ_SOFT_QUANTUM (4)
> > +
> >  /*
> >   * offset from end of service tree
> >   */
> > @@ -2242,6 +2244,19 @@ static int cfq_forced_dispatch(struct cf
> >  	return dispatched;
> >  }
> >  
> > +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> > +	struct cfq_queue *cfqq)
> > +{
> > +	/* the queue hasn't finished any request, can't estimate */
> > +	if (cfq_cfqq_slice_new(cfqq) || cfqq->dispatched >= cfqd->cfq_quantum)
> > +		return 1;
> > +	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> > +		cfqq->slice_end))
> > +		return 1;
> > +
> > +	return 0;
> > +}
> > +
> >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> >  {
> >  	unsigned int max_dispatch;
> > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
> >  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> >  		return false;
> >  
> > -	max_dispatch = cfqd->cfq_quantum;
> > +	max_dispatch = cfqd->cfq_quantum / 2;
> > +	if (max_dispatch < CFQ_SOFT_QUANTUM)
> 
> We don't have to hardcode CFQ_SOFT_QUANTUM; in fact we don't need it. We can
> derive the soft limit from the hard limit (cfq_quantum). Say the soft limit
> will be 50% of the cfq_quantum value.
I'm hoping this doesn't give users a surprise. Say cfq_quantum is set to 7; then we
start doing throttling from 3 requests. Adding CFQ_SOFT_QUANTUM gives compatibility
with the old behavior at least. Am I overthinking?
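
(Worked numbers for that concern, purely illustrative: with cfq_quantum = 7,
cfq_quantum / 2 = 3, while min_t(unsigned int, CFQ_SOFT_QUANTUM, 7) = 4, which
preserves the old fixed start-of-throttling point of 4.)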
 
> > +		max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM,
> > +			cfqd->cfq_quantum);
> >  	if (cfq_class_idle(cfqq))
> >  		max_dispatch = 1;
> >  
> > @@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_
> >  		/*
> >  		 * We have other queues, don't allow more IO from this one
> >  		 */
> > -		if (cfqd->busy_queues > 1)
> > +		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
> >  			return false;
> 
> So I guess here we can write something as follows.
> 
> 		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
> 			return false;
> 
> 		if (cfqd->busy_queues == 1)
> 			max_dispatch = -1;
> 		else
> 			/*
> 			 * Normally we start throttling cfqq when cfq_quantum/2
> 			 * requests have been dispatched. But we can drive
> 			 * deeper queue depths at the beginning of slice
> 			 * subjected to upper limit of cfq_quantum.
> 			 */
> 			max_dispatch = cfqd->cfq_quantum;
ok.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-13  8:17                         ` Shaohua Li
@ 2010-01-13 11:18                           ` Vivek Goyal
  2010-01-14  4:16                             ` Shaohua Li
  0 siblings, 1 reply; 24+ messages in thread
From: Vivek Goyal @ 2010-01-13 11:18 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote:
[..]
> > >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > >  {
> > >  	unsigned int max_dispatch;
> > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
> > >  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > >  		return false;
> > >  
> > > -	max_dispatch = cfqd->cfq_quantum;
> > > +	max_dispatch = cfqd->cfq_quantum / 2;
> > > +	if (max_dispatch < CFQ_SOFT_QUANTUM)
> > 
> > We don't have to hardcode CFQ_SOFT_QUANTUM; in fact we don't need it. We can
> > derive the soft limit from the hard limit (cfq_quantum). Say the soft limit
> > will be 50% of the cfq_quantum value.
> I'm hoping this doesn't give users a surprise. Say cfq_quantum is set to 7; then we
> start doing throttling from 3 requests. Adding CFQ_SOFT_QUANTUM gives compatibility
> with the old behavior at least. Am I overthinking?
>  

I would not worry too much about that. If you are really worried about
that, then create a Documentation/block/cfq-iosched.txt and document
how cfq_quantum works so that users know that cfq_quantum is the upper hard
limit and the internal soft limit is cfq_quantum/2.
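
A first cut of that documentation entry might read something like this (the
wording is only a suggestion; no such file exists yet):

	cfq_quantum
	-----------
	Hard upper limit on the number of requests a single queue may have
	dispatched to the device at any time. Throttling normally starts at
	a soft limit of cfq_quantum/2; the remaining half is only used early
	in a queue's slice, while there is still time for the extra requests
	to complete before the slice ends.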

Thanks
Vivek

> > > +		max_dispatch = min_t(unsigned int, CFQ_SOFT_QUANTUM,
> > > +			cfqd->cfq_quantum);
> > >  	if (cfq_class_idle(cfqq))
> > >  		max_dispatch = 1;
> > >  
> > > @@ -2275,7 +2293,7 @@ static bool cfq_may_dispatch(struct cfq_
> > >  		/*
> > >  		 * We have other queues, don't allow more IO from this one
> > >  		 */
> > > -		if (cfqd->busy_queues > 1)
> > > +		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
> > >  			return false;
> > 
> > So I guess here we can write something as follows.
> > 
> > 		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
> > 			return false;
> > 
> > 		if (cfqd->busy_queues == 1)
> > 			max_dispatch = -1;
> > 		else
> > 			/*
> > 			 * Normally we start throttling cfqq when cfq_quantum/2
> > 			 * requests have been dispatched. But we can drive
> > 			 * deeper queue depths at the beginning of slice
> > 			 * subjected to upper limit of cfq_quantum.
> > 			 */
> > 			max_dispatch = cfqd->cfq_quantum;
> ok.
> 
> Thanks,
> Shaohua

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-13 11:18                           ` Vivek Goyal
@ 2010-01-14  4:16                             ` Shaohua Li
  2010-01-14 11:31                               ` Vivek Goyal
  0 siblings, 1 reply; 24+ messages in thread
From: Shaohua Li @ 2010-01-14  4:16 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote:
> On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote:
> [..]
> > > >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > > >  {
> > > >  	unsigned int max_dispatch;
> > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
> > > >  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > >  		return false;
> > > >  
> > > > -	max_dispatch = cfqd->cfq_quantum;
> > > > +	max_dispatch = cfqd->cfq_quantum / 2;
> > > > +	if (max_dispatch < CFQ_SOFT_QUANTUM)
> > > 
> > > We don't have to hardcode CFQ_SOFT_QUANTUM; in fact we don't need it. We can
> > > derive the soft limit from the hard limit (cfq_quantum). Say the soft limit
> > > will be 50% of the cfq_quantum value.
> > I'm hoping this doesn't give users a surprise. Say cfq_quantum is set to 7; then we
> > start doing throttling from 3 requests. Adding CFQ_SOFT_QUANTUM gives compatibility
> > with the old behavior at least. Am I overthinking?
> >  
> 
> I would not worry too much about that. If you are really worried about
> that, then create a Documentation/block/cfq-iosched.txt and document
> how cfq_quantum works so that users know that cfq_quantum is the upper hard
> limit and the internal soft limit is cfq_quantum/2.
Good idea. It looks like we don't document the cfq tunables; I'll try to do it later.

Currently a queue can only dispatch up to 4 requests if there are other queues.
This isn't optimal; the device can handle more requests. For example, AHCI can
handle 31 requests. I can understand that the limit is for fairness, but we could
do a tweak: if the queue still has a lot of slice left, it sounds like we could
ignore the limit.
Test shows this boosts my workload (two-thread randread of an SSD) from 78m/s
to 100m/s.

Signed-off-by: Shaohua Li <shaohua.li@intel.com>
---
 block/cfq-iosched.c |   30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -19,7 +19,7 @@
  * tunables
  */
 /* max queue in one round of service */
-static const int cfq_quantum = 4;
+static const int cfq_quantum = 8;
 static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
 /* maximum backwards seek, in KiB */
 static const int cfq_back_max = 16 * 1024;
@@ -2215,6 +2215,19 @@ static int cfq_forced_dispatch(struct cf
 	return dispatched;
 }
 
+static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
+	struct cfq_queue *cfqq)
+{
+	/* the queue hasn't finished any request, can't estimate */
+	if (cfq_cfqq_slice_new(cfqq))
+		return 1;
+	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
+		cfqq->slice_end))
+		return 1;
+
+	return 0;
+}
+
 static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	unsigned int max_dispatch;
@@ -2231,7 +2244,7 @@ static bool cfq_may_dispatch(struct cfq_
 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
 		return false;
 
-	max_dispatch = cfqd->cfq_quantum;
+	max_dispatch = max_t(unsigned int, cfqd->cfq_quantum / 2, 1);
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
@@ -2248,13 +2261,22 @@ static bool cfq_may_dispatch(struct cfq_
 		/*
 		 * We have other queues, don't allow more IO from this one
 		 */
-		if (cfqd->busy_queues > 1)
+		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
 			return false;
 
 		/*
 		 * Sole queue user, no limit
 		 */
-		max_dispatch = -1;
+		if (cfqd->busy_queues == 1)
+			max_dispatch = -1;
+		else
+			/*
+			 * Normally we start throttling cfqq when cfq_quantum/2
+			 * requests have been dispatched. But we can drive
+			 * deeper queue depths at the beginning of slice
+			 * subjected to upper limit of cfq_quantum.
+			 */
+			max_dispatch = cfqd->cfq_quantum;
 	}
 
 	/*
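
To put concrete numbers on the estimate in cfq_slice_used_soon() (illustrative
arithmetic only, assuming HZ = 1000):

	/*
	 * cfq_slice_idle = HZ / 125 = 8 jiffies (~8ms expected per request)
	 * cfqq->dispatched = 5 in-flight requests
	 * estimated drain time = 8 * 5 = 40 jiffies
	 *
	 * If jiffies + 40 is past cfqq->slice_end, the slice is treated as
	 * "used soon" and no further requests are dispatched from this cfqq.
	 */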

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-14  4:16                             ` Shaohua Li
@ 2010-01-14 11:31                               ` Vivek Goyal
  2010-01-14 13:49                                 ` Jens Axboe
  2010-01-15  3:20                                 ` Li, Shaohua
  0 siblings, 2 replies; 24+ messages in thread
From: Vivek Goyal @ 2010-01-14 11:31 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

On Thu, Jan 14, 2010 at 12:16:24PM +0800, Shaohua Li wrote:
> On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote:
> > On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote:
> > [..]
> > > > >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > > > >  {
> > > > >  	unsigned int max_dispatch;
> > > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
> > > > >  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > >  		return false;
> > > > >  
> > > > > -	max_dispatch = cfqd->cfq_quantum;
> > > > > +	max_dispatch = cfqd->cfq_quantum / 2;
> > > > > +	if (max_dispatch < CFQ_SOFT_QUANTUM)
> > > > 
> > > > We don't have to hardcode CFQ_SOFT_QUANTUM; in fact we don't need it. We can
> > > > derive the soft limit from the hard limit (cfq_quantum). Say the soft limit
> > > > will be 50% of the cfq_quantum value.
> > > I'm hoping this doesn't give users a surprise. Say cfq_quantum is set to 7; then we
> > > start doing throttling from 3 requests. Adding CFQ_SOFT_QUANTUM gives compatibility
> > > with the old behavior at least. Am I overthinking?
> > >  
> > 
> > I would not worry too much about that. If you are really worried about
> > that, then create a Documentation/block/cfq-iosched.txt and document
> > how cfq_quantum works so that users know that cfq_quantum is the upper hard
> > limit and the internal soft limit is cfq_quantum/2.
> Good idea. It looks like we don't document the cfq tunables; I'll try to do it later.
> 
> Currently a queue can only dispatch up to 4 requests if there are other queues.
> This isn't optimal; the device can handle more requests. For example, AHCI can
> handle 31 requests. I can understand that the limit is for fairness, but we could
> do a tweak: if the queue still has a lot of slice left, it sounds like we could
> ignore the limit.

Hi Shaohua,

This looks much better, though I find the usage of "slice_idle" as a measure
of service time a little unintuitive. Especially since I did some testing with
slice_idle=0; in that case, we will be allowing dispatch of 8 requests
from each queue even if the slice is about to expire.
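
Concretely (illustrative only), with slice_idle = 0 the estimate collapses to

	time_after(jiffies + 0 * cfqq->dispatched, cfqq->slice_end)

i.e. plain slice expiry, so the full quantum can be dispatched right up until
the slice actually ends.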

But I guess that's fine for the time being as the upper limit is still
controlled by cfq_quantum.

> Test shows this boosts my workload (two-thread randread of an SSD) from 78m/s
> to 100m/s.

Are these deep queue random reads (with higher iodepths, using libaio)?

Have you done a similar test on some slower NCQ rotational hardware as well,
and seen the impact on throughput and *max latency* of readers, especially in
the presence of buffered writers?

Thanks
Vivek

> 
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> ---
>  block/cfq-iosched.c |   30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -19,7 +19,7 @@
>   * tunables
>   */
>  /* max queue in one round of service */
> -static const int cfq_quantum = 4;
> +static const int cfq_quantum = 8;
>  static const int cfq_fifo_expire[2] = { HZ / 4, HZ / 8 };
>  /* maximum backwards seek, in KiB */
>  static const int cfq_back_max = 16 * 1024;
> @@ -2215,6 +2215,19 @@ static int cfq_forced_dispatch(struct cf
>  	return dispatched;
>  }
>  
> +static inline bool cfq_slice_used_soon(struct cfq_data *cfqd,
> +	struct cfq_queue *cfqq)
> +{
> +	/* the queue hasn't finished any request, can't estimate */
> +	if (cfq_cfqq_slice_new(cfqq))
> +		return 1;
> +	if (time_after(jiffies + cfqd->cfq_slice_idle * cfqq->dispatched,
> +		cfqq->slice_end))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
>  	unsigned int max_dispatch;
> @@ -2231,7 +2244,7 @@ static bool cfq_may_dispatch(struct cfq_
>  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
>  		return false;
>  
> -	max_dispatch = cfqd->cfq_quantum;
> +	max_dispatch = max_t(unsigned int, cfqd->cfq_quantum / 2, 1);
>  	if (cfq_class_idle(cfqq))
>  		max_dispatch = 1;
>  
> @@ -2248,13 +2261,22 @@ static bool cfq_may_dispatch(struct cfq_
>  		/*
>  		 * We have other queues, don't allow more IO from this one
>  		 */
> -		if (cfqd->busy_queues > 1)
> +		if (cfqd->busy_queues > 1 && cfq_slice_used_soon(cfqd, cfqq))
>  			return false;
>  
>  		/*
>  		 * Sole queue user, no limit
>  		 */
> -		max_dispatch = -1;
> +		if (cfqd->busy_queues == 1)
> +			max_dispatch = -1;
> +		else
> +			/*
> +			 * Normally we start throttling cfqq when cfq_quantum/2
> +			 * requests have been dispatched. But we can drive
> +			 * deeper queue depths at the beginning of slice
> +			 * subjected to upper limit of cfq_quantum.
> +			 */
> +			max_dispatch = cfqd->cfq_quantum;
>  	}
>  
>  	/*

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC]cfq-iosched: quantum check tweak
  2010-01-14 11:31                               ` Vivek Goyal
@ 2010-01-14 13:49                                 ` Jens Axboe
  2010-01-15  3:20                                 ` Li, Shaohua
  1 sibling, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2010-01-14 13:49 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Shaohua Li, Corrado Zoccolo, linux-kernel, Zhang, Yanmin

On Thu, Jan 14 2010, Vivek Goyal wrote:
> On Thu, Jan 14, 2010 at 12:16:24PM +0800, Shaohua Li wrote:
> > On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote:
> > > On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote:
> > > [..]
> > > > > >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> > > > > >  {
> > > > > >  	unsigned int max_dispatch;
> > > > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
> > > > > >  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> > > > > >  		return false;
> > > > > >  
> > > > > > -	max_dispatch = cfqd->cfq_quantum;
> > > > > > +	max_dispatch = cfqd->cfq_quantum / 2;
> > > > > > +	if (max_dispatch < CFQ_SOFT_QUANTUM)
> > > > > 
> > > > > We don't have to hardcode CFQ_SOFT_QUANTUM; in fact we don't need it. We can
> > > > > derive the soft limit from the hard limit (cfq_quantum). Say the soft limit
> > > > > will be 50% of the cfq_quantum value.
> > > > I'm hoping this doesn't give users a surprise. Say cfq_quantum is set to 7; then we
> > > > start doing throttling from 3 requests. Adding CFQ_SOFT_QUANTUM gives compatibility
> > > > with the old behavior at least. Am I overthinking?
> > > >  
> > > 
> > > I would not worry too much about that. If you are really worried about
> > > that, then create a Documentation/block/cfq-iosched.txt and document
> > > how cfq_quantum works so that users know that cfq_quantum is the upper hard
> > > limit and the internal soft limit is cfq_quantum/2.
> > Good idea. It looks like we don't document the cfq tunables; I'll try to do it later.
> > 
> > Currently a queue can only dispatch up to 4 requests if there are other queues.
> > This isn't optimal; the device can handle more requests. For example, AHCI can
> > handle 31 requests. I can understand that the limit is for fairness, but we could
> > do a tweak: if the queue still has a lot of slice left, it sounds like we could
> > ignore the limit.
> 
> Hi Shaohua,
> 
> This looks much better, though I find the usage of "slice_idle" as a measure
> of service time a little unintuitive. Especially since I did some testing with
> slice_idle=0; in that case, we will be allowing dispatch of 8 requests
> from each queue even if the slice is about to expire.

I agree this is problematic, but I also think we need to do something
about the control of queuing depth. For most users, keeping it low is
what they want - performance doesn't change much with higher depths, you
only pay a latency cost when switching to a new queue. And they don't
want that.

But for other hardware, driving up the queue depth to what the hardware
supports (potentially) can be a big win, and CFQ definitely needs to be
able to do that.

Write caches are again problematic in this area... For reads, and for writes
with write-through caching, just looking at what this cfqq has already
dispatched and completed in this slice would be sufficient. It could
even be carried over to the next slice as a seed value, so you could
dispatch more, earlier. What we want to avoid is stuffing the device
queue with tons of writes that complete immediately, only to move the
penalty of those requests into the slices of other queues.
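
A minimal sketch of that per-slice bookkeeping might look like this (the
fields and helper are hypothetical, not existing CFQ code):

static void cfq_seed_next_slice(struct cfq_queue *cfqq)
{
	/*
	 * Hypothetical counters: slice_dispatched/slice_completed would be
	 * reset at slice start and bumped on dispatch/completion. Only
	 * requests that actually completed within the slice count toward
	 * the seed, so a write cache absorbing dispatches cannot inflate
	 * it and push the penalty into other queues' slices.
	 */
	cfqq->depth_seed = min(cfqq->slice_completed, cfqq->slice_dispatched);
	cfqq->slice_dispatched = 0;
	cfqq->slice_completed = 0;
}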

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [RFC]cfq-iosched: quantum check tweak
  2010-01-14 11:31                               ` Vivek Goyal
  2010-01-14 13:49                                 ` Jens Axboe
@ 2010-01-15  3:20                                 ` Li, Shaohua
  1 sibling, 0 replies; 24+ messages in thread
From: Li, Shaohua @ 2010-01-15  3:20 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: Corrado Zoccolo, linux-kernel, jens.axboe, Zhang, Yanmin

>-----Original Message-----
>From: Vivek Goyal [mailto:vgoyal@redhat.com]
>Sent: Thursday, January 14, 2010 7:31 PM
>To: Li, Shaohua
>Cc: Corrado Zoccolo; linux-kernel@vger.kernel.org; jens.axboe@oracle.com;
>Zhang, Yanmin
>Subject: Re: [RFC]cfq-iosched: quantum check tweak
>
>On Thu, Jan 14, 2010 at 12:16:24PM +0800, Shaohua Li wrote:
>> On Wed, Jan 13, 2010 at 07:18:07PM +0800, Vivek Goyal wrote:
>> > On Wed, Jan 13, 2010 at 04:17:35PM +0800, Shaohua Li wrote:
>> > [..]
>> > > > >  static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> > > > >  {
>> > > > >  	unsigned int max_dispatch;
>> > > > > @@ -2258,7 +2273,10 @@ static bool cfq_may_dispatch(struct cfq_
>> > > > >  	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
>> > > > >  		return false;
>> > > > >
>> > > > > -	max_dispatch = cfqd->cfq_quantum;
>> > > > > +	max_dispatch = cfqd->cfq_quantum / 2;
>> > > > > +	if (max_dispatch < CFQ_SOFT_QUANTUM)
>> > > >
>> > > > We don't have to hardcode CFQ_SOFT_QUANTUM; in fact we don't need it. We can
>> > > > derive the soft limit from the hard limit (cfq_quantum). Say the soft limit
>> > > > will be 50% of the cfq_quantum value.
>> > > I'm hoping this doesn't give users a surprise. Say cfq_quantum is set to 7;
>> > > then we start doing throttling from 3 requests. Adding CFQ_SOFT_QUANTUM gives
>> > > compatibility with the old behavior at least. Am I overthinking?
>> > >
>> >
>> > I would not worry too much about that. If you are really worried about
>> > that, then create a Documentation/block/cfq-iosched.txt and document
>> > how cfq_quantum works so that users know that cfq_quantum is the upper hard
>> > limit and the internal soft limit is cfq_quantum/2.
>> Good idea. It looks like we don't document the cfq tunables; I'll try to do it
>> later.
>>
>> Currently a queue can only dispatch up to 4 requests if there are other queues.
>> This isn't optimal; the device can handle more requests. For example, AHCI can
>> handle 31 requests. I can understand that the limit is for fairness, but we
>> could do a tweak: if the queue still has a lot of slice left, it sounds like we
>> could ignore the limit.
>
>Hi Shaohua,
>
>This looks much better, though I find the usage of "slice_idle" as a measure
>of service time a little unintuitive. Especially since I did some testing with
>slice_idle=0; in that case, we will be allowing dispatch of 8 requests
>from each queue even if the slice is about to expire.
>
>But I guess that's fine for the time being as the upper limit is still
>controlled by cfq_quantum.
>
>> Test shows this boosts my workload (two-thread randread of an SSD) from 78m/s
>> to 100m/s.
>
>Are these deep queue random reads (with higher iodepths, using libaio)?
>
>Have you done a similar test on some slower NCQ rotational hardware as well,
>and seen the impact on throughput and *max latency* of readers, especially in
>the presence of buffered writers?
Tested on a 320GB hard disk (ST3320620AS). The throughput improves by about
6% and average latency drops by about 6% too. Below is the fio output; I did
3 runs for each case, and the results are similar.

No patch case:
sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
Starting 2 processes

sdb: (groupid=0, jobs=2): err= 0: pid=3389
  read : io=90,900KiB, bw=755KiB/s, iops=188, runt=120336msec
    slat (usec): min=8, max=527K, avg=679.01, stdev=6101.05
    clat (usec): min=0, max=0, avg= 0.00, stdev= 0.00
    bw (KiB/s) : min=    0, max=  837, per=47.35%, avg=357.50, stdev=78.71
  cpu          : usr=0.02%, sys=0.13%, ctx=22661, majf=0, minf=169
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.7%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued r/w: total=22725/0, short=0/0

     lat (msec): 10=0.04%, 20=1.34%, 50=7.42%, 100=8.05%, 250=31.58%
     lat (msec): 500=30.38%, 750=13.79%, 1000=5.27%, 2000=2.14%

Run status group 0 (all jobs):
   READ: io=90,900KiB, aggrb=755KiB/s, minb=755KiB/s, maxb=755KiB/s, mint=120336msec, maxt=120336msec
--------------------------------------------------------------------------
Patched case:
sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=32
Starting 2 processes

sdb: (groupid=0, jobs=2): err= 0: pid=4776
  read : io=98,140KiB, bw=815KiB/s, iops=203, runt=120323msec
    slat (usec): min=9, max=68, avg=11.23, stdev= 1.03
    clat (usec): min=0, max=0, avg= 0.00, stdev= 0.00
    bw (KiB/s) : min=    0, max=  534, per=47.28%, avg=385.32, stdev=74.37
  cpu          : usr=0.04%, sys=0.13%, ctx=24523, majf=0, minf=188
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=99.7%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued r/w: total=24535/0, short=0/0

     lat (msec): 10=0.01%, 20=0.93%, 50=6.50%, 100=7.65%, 250=36.40%
     lat (msec): 500=31.81%, 750=11.24%, 1000=4.08%, 2000=1.38%

Run status group 0 (all jobs):
   READ: io=98,140KiB, aggrb=815KiB/s, minb=815KiB/s, maxb=815KiB/s, mint=120323msec, maxt=120323msec

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2009-12-25  9:10 [RFC]cfq-iosched: quantum check tweak Shaohua Li
2009-12-25  9:44 ` Corrado Zoccolo
2009-12-28  3:35   ` Shaohua Li
2009-12-28  9:02     ` Corrado Zoccolo
2010-01-07  2:04       ` Shaohua Li
2010-01-07 21:44         ` Corrado Zoccolo
2010-01-08  0:57           ` Shaohua Li
2010-01-08 20:22             ` Corrado Zoccolo
2010-01-11  1:49               ` Shaohua Li
2010-01-11  2:01               ` Shaohua Li
2010-01-08 17:15           ` Vivek Goyal
2010-01-08 17:40             ` Vivek Goyal
2010-01-08 20:35             ` Corrado Zoccolo
2010-01-08 20:59               ` Vivek Goyal
2010-01-11  2:34                 ` Shaohua Li
2010-01-11 17:03                   ` Vivek Goyal
2010-01-12  3:07                     ` Shaohua Li
2010-01-12 15:48                       ` Vivek Goyal
2010-01-13  8:17                         ` Shaohua Li
2010-01-13 11:18                           ` Vivek Goyal
2010-01-14  4:16                             ` Shaohua Li
2010-01-14 11:31                               ` Vivek Goyal
2010-01-14 13:49                                 ` Jens Axboe
2010-01-15  3:20                                 ` Li, Shaohua
